
Conversation

@dayo09
Contributor

@dayo09 dayo09 commented Sep 12, 2025

Let's add a convert-matmul-to-linear pass.
This commit...
- refactors mm serialization logic and adds a convert_matmul_to_linear pass
- introduces new CompileConfig attributes convert_lhs/rhs_const_mm_to_fc

TICO-DCO-1.0-Signed-off-by: Dayoung Lee dayoung.lee@samsung.com


For #339

@dayo09 dayo09 force-pushed the 0912-mm-to-linear branch 2 times, most recently from fddc18a to 72a3617 on September 12, 2025 at 11:55
Comment on lines +92 to +93
def get_compile_config(self):
return CompileConfigV1(convert_lhs_const_mm_to_fc=True)
Contributor Author

@glistening @seockho-kim Using this compile config enables conversion of matmul ops whose lhs is a const node.

Contributor

@dayo09 Could you improve it to handle bmm, too?

# Assumes the repo's test helpers (tag, TestModuleBase, CompileConfigV1) and torch.
@tag.use_onert
class BmmTest(TestModuleBase):
    def __init__(self):
        super().__init__()
        # Rank-3 constant weight on the lhs, so `@` lowers to bmm, not mm.
        self.weight = torch.randn(2, 3, 4)

    def forward(self, rhs):
        out = self.weight @ rhs
        return out

    def get_example_inputs(self):
        return (torch.randn(2, 4, 5),), {}

    def get_compile_config(self):
        return CompileConfigV1(convert_lhs_const_mm_to_fc=True)

Contributor Author

@seockho-kim The above case is not supported because matmul-to-FC conversion is possible only when the weight is 2-D; the Circle FullyConnected operation assumes a rank-2 weight.
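
For illustration, a minimal sketch of the constraint (assuming Circle FullyConnected follows the TFLite convention y = x @ W.T with a rank-2 weight W; shapes below are hypothetical):

import torch

W2 = torch.randn(3, 4)   # rank-2 constant: convertible to an FC weight
x = torch.randn(5, 4)
assert torch.allclose(torch.nn.functional.linear(x, W2), x @ W2.t())

W3 = torch.randn(2, 3, 4)  # rank-3 (bmm) constant: no single rank-2 weight
                           # reproduces it, so it stays a BatchMatMul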

Contributor

I'm sorry, I gave you the wrong example.
I meant the bmm (batch=1) case.

Contributor Author

Let me add it in the next PR!

* Linear has better quantization accuracy (NPU backend)
  Due to the ONE compiler's quantization policy:
  FullyConnected (=Linear) uses per-channel quantization for the weight and per-tensor for the input.
  BatchMatMul (=matmul) uses per-tensor quantization for both lhs and rhs.
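
For intuition, a minimal sketch of the difference (generic symmetric int8 quantization, not the ONE compiler's exact scheme; all names are illustrative):

import torch

W = torch.randn(8, 16)  # (out_channels, in_channels) weight

# Per-tensor: one scale for the whole tensor (the BatchMatMul policy).
scale_pt = W.abs().max() / 127.0
q_pt = (W / scale_pt).round().clamp(-127, 127)

# Per-channel: one scale per output channel (the FullyConnected weight
# policy); each row keeps its own range, so less precision is lost.
scale_pc = W.abs().amax(dim=1) / 127.0
q_pc = (W / scale_pc[:, None]).round().clamp(-127, 127)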
Contributor

FYI, a new generation of the NPU would support channel-wise quantization (cwq) for matmul.

Contributor Author

@jinevening Do you mean 3rd generation?

Contributor

Yes.

inputs = [input, other]
outputs = [node]

if not is_const(other) and prior_latency:
Contributor

prior_latency is not used anymore?

Contributor Author

@jinevening Yes, the old feature is basically part of the new one.

# BEFORE: prior_latency == False (default)
# AFTER:  (default)
if rhs is const: conversion ON
else:            conversion OFF

# BEFORE: prior_latency == True
# AFTER:  convert_rhs_const_mm_to_fc == False
always: conversion OFF
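
A hypothetical sketch of the resulting decision logic (illustrative only, not the actual TICO source; defaults mirror CompileConfigV1):

def should_convert_mm_to_fc(lhs_const: bool, rhs_const: bool, config) -> bool:
    # rhs-const conversion subsumes the old prior_latency==False behavior.
    if rhs_const:
        return config.convert_rhs_const_mm_to_fc  # default: True
    if lhs_const:
        return config.convert_lhs_const_mm_to_fc  # default: False
    return False  # no const operand: keep BatchMatMul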

Contributor

I see. Then it would be possible to remove that arg and the related code.

Contributor Author

Removed ;-D

return fc_node


class ConvertLhsConstMatmulToLinear(Converter):
Contributor

Why should it consider lhs and rhs separately? Do the left and right matter when converting?

Contributor Author

@mhs4670go onert doesn't run a matmul whose lhs is const. So this PR converts matmul to FullyConnected for onert, conditionally via the config.

@mhs4670go
Contributor

@glistening I saw this comment. Is this pass still needed?

@dayo09
Contributor Author

dayo09 commented Sep 17, 2025

> I saw Samsung/ONE#16064 (comment). Is this pass still needed?

@glistening @seockho-kim What do you think, is this pass still needed? If it is, I plan to add a pass for lowering bmm to mm in another PR. Please share your opinions.

@glistening
Contributor

glistening commented Sep 18, 2025

> I saw Samsung/ONE#16064 (comment). Is this pass still needed?

@dayo09 Sure. I need your PR :). The maintainer of ONERT will keep the constraint (not allowing lhs const). My PR (#356) and the commenting-out are a workaround to unblock the next steps in onert. If you're in a hurry or it takes much work, please feel free to let me know.

@dayo09
Contributor Author

dayo09 commented Sep 18, 2025

@jinevening @seockho-kim @mhs4670go PTAL :-D

Comment on lines +65 to +67
""" """

def __init__(self):
Contributor

Suggested change
-""" """
-def __init__(self):
+def __init__(self):

Contributor

I think it would be good to describe what error is expected. NNFW_STATUS_ERROR is a bit ambiguous.

Contributor Author

That is how onert throws. It should match.

Contributor

Ah, I think using a docstring or comments is also enough.

jinevening previously approved these changes Sep 18, 2025
Contributor

@jinevening jinevening left a comment

LGTM

Comment on lines +23 to +25
convert_lhs_const_mm_to_fc: bool = False
convert_rhs_const_mm_to_fc: bool = True
Contributor

On second thought, a single convert_const_mm_to_fc could be a simpler choice. Do you have any reason for choosing this design?

Contributor Author

convert_rhs_const_mm_to_fc has no trade-off because the transpose is foldable into the const, but convert_lhs_const_mm_to_fc carries a potential latency trade-off. Therefore, the user needs to decide each case separately.
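
For illustration, a minimal torch sketch of the asymmetry (hypothetical shapes, not the actual pass):

import torch

x = torch.randn(5, 4)

# rhs const: x @ W equals linear(x, W.T); transposing a constant folds
# away at compile time, so there is no runtime cost.
W_rhs = torch.randn(4, 3)
assert torch.allclose(x @ W_rhs, torch.nn.functional.linear(x, W_rhs.t()))

# lhs const: W @ x = (x.T @ W.T).T, so the converted graph needs runtime
# transposes around the FullyConnected, hence the latency trade-off.
W_lhs = torch.randn(3, 5)
assert torch.allclose(W_lhs @ x, torch.nn.functional.linear(x.t(), W_lhs).t())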

@glistening
Contributor

glistening commented Sep 18, 2025

If I understand correctly, TICO will convert matmul(const lhs, rhs) into linear(...) + transpose1() + transpose2(), where transpose1() + transpose2() cancel out. I don't fully understand why this happens. Who generates batchmatmul + transpose for a matmul, and why? Does the initial aten graph have batchmatmul + transpose, or does TICO introduce it?

Removing the redundant pair of transposes requires adding a pass in circle2circle that fuses batchmatmul + transpose into one fullyconnected. It would be nice if TICO could do this.

@mhs4670go
Contributor

> It would be nice if TICO could do this.

We've decided to delegate graph optimization to one-optimize in order to avoid code duplication. Is it hard to use one-optimize?

mhs4670go previously approved these changes Sep 18, 2025
Contributor

@mhs4670go mhs4670go left a comment

LGTM

@glistening
Contributor

glistening commented Sep 18, 2025

Yes, I agree that circle2circle is better for circle-level optimization.
It is also better in that batchmatmul is converted to fullyconnected based on config.

(I would like to check that it works. However, as usual, I cannot access GitHub.
I will check later once I manage to check out this PR.)

I was just curious why TICO or torch.export chooses the inefficient operation sequence for a simple matmul.

@glistening
Contributor

glistening commented Sep 18, 2025

Hmm. I succeeded with gh pr checkout 341.
I converted my model, but there is no change.
It still emits batchmatmul.

@seockho-kim Does this PR solve the same issue in gemma3?

@mhs4670go
Contributor

mhs4670go commented Sep 18, 2025

@dayo09 Conflicts should be resolved.

@glistening

After you apply this PR, you need to pass the configuration to use the introduced feature. Did you do it like below? It seems that bmm-to-mm conversion is also needed.

import tico

config = tico.CompileConfigV1()
config.convert_lhs_const_mm_to_fc = True  # enable lhs-const matmul -> FC conversion
circle_model = tico.convert(torch_module, example_inputs, config=config)

dayo09 and others added 3 commits September 18, 2025 16:45
Let's add a convert-matmul-to-linear pass.
This commit...
 - refactors mm serialization logic and adds a convert_matmul_to_linear pass
 - introduces new CompileConfig attributes convert_lhs/rhs_const_mm_to_fc

TICO-DCO-1.0-Signed-off-by: Dayoung Lee <dayoung.lee@samsung.com>
Co-authored-by: Hyukjin Jeong <hj1.jeong@samsung.com>
@dayo09 dayo09 dismissed stale reviews from mhs4670go and jinevening via 4588a1b September 18, 2025 07:45
@dayo09
Contributor Author

dayo09 commented Sep 18, 2025

@glistening Sorry for the ambiguity in my comment.

> If it is, I plan to add a pass for lowering bmm to mm in another PR. Please share your opinions.

I mean, this PR doesn't support bmm to mm YET; it needs a further PR. Batch-1 bmm to mm is a separate feature, so I planned to do it in the next PR.

@dayo09
Contributor Author

dayo09 commented Sep 18, 2025

@jinevening @seockho-kim @mhs4670go It's rebased. PTAL again 😅

@seockho-kim
Contributor

> Hmm. I succeeded with gh pr checkout 341 and converted my model, but there is no change. It still emits batchmatmul.
>
> @seockho-kim Does this PR solve the same issue in gemma3?

This PR does not solve the issue in gemma3, because bmm is not changed.
I'm waiting for another PR.

Contributor

@seockho-kim seockho-kim left a comment

LGTM

@dayo09
Contributor Author

dayo09 commented Sep 18, 2025

On second thought, simply converting 1-batch bmm to mm may introduce a redundant reshape, so it's not a general pass.

@jinevening @mhs4670go Do you think conversion of 1-batch bmm to mm should be unconditionally allowed, or should I introduce the pass optionally for the matmul-to-linear case?
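
For context, a minimal torch sketch of the 1-batch lowering in question (hypothetical, not the pass itself); the squeeze/unsqueeze pair is exactly the redundant reshape:

import torch

a = torch.randn(1, 3, 4)
b = torch.randn(1, 4, 5)

# bmm with batch 1 rewritten as mm plus two reshapes:
out = torch.mm(a.squeeze(0), b.squeeze(0)).unsqueeze(0)
assert torch.allclose(out, torch.bmm(a, b))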

@dayo09 dayo09 merged commit 16276d7 into Samsung:main Sep 18, 2025
6 checks passed
@mhs4670go
Contributor

> Do you think conversion of 1-batch bmm to mm should be unconditionally allowed, or should I introduce the pass optionally for the matmul-to-linear case?

I think it should be optional because it's not a kind of circle legalization.

@jinevening
Contributor

> Do you think conversion of 1-batch bmm to mm should be unconditionally allowed, or should I introduce the pass optionally for the matmul-to-linear case?

+1 for the latter

@glistening
Contributor

glistening commented Sep 19, 2025

If TICO cannot convert my case on its own and needs to run a circle-level optimizer:

  • It would be good to write a single pass in the circle-level optimizer, rather than introducing two passes (one in TICO, another in the circle optimizer).
  • I don't expect circle2circle to have such a pass.
  • I don't like circle2circle, a C++ solution. Maybe it would be better to introduce my own circle-level graph optimizer written in Python, using the flatbuffers object API.
