
benvanik (Collaborator) commented Sep 10, 2025

A bug in the fixed-point iteration that applies cleanup during the stream phase, combined with IREE::Stream::AsyncTransferOp not being marked pure, was preventing us from CSEing entire DAGs of work in hoisted initializers.

Example before (same parameter encoded twice and then stored in memory as two distinct buffers):

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    util.global.store %cst, @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    util.global.store %0, @"__auto.blk.0.attn_q.weight:qs__size" : index
    %__auto.blk.0.attn_q.weight3Aqs__size = util.global.load @"__auto.blk.0.attn_q.weight:qs__size" : index
    %__auto.blk.0.attn_q.weight3Aqs = util.global.load @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    %1 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    %__auto.blk.0.attn_q.weight3Aqs__size_0 = util.global.load @"__auto.blk.0.attn_q.weight:qs__size" : index
    %__auto.blk.0.attn_q.weight3Aqs_1 = util.global.load @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    %5 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs_1 : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size_0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size_0}
    %6 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %7 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size_0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%6}
    %8 = stream.async.transfer %7 : !stream.resource<*>{%6} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%6}
    util.global.store %8, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
    util.global.store %6, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
    util.return
  }

Example after (parameter encoded once and stored in memory once):

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    util.return
  }

The fixed-point change should be safe, but the AsyncTransferOp change may have some unforeseen impacts given how long the op has gone without being marked pure. There are several places where we already treat AsyncTransferOp as pure (removing duplicates, folding chains of transfers, etc.), so my hope is that this is more of an inconsistency we never noticed.
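As a loose illustration of why the purity marking matters (a hypothetical value-numbering sketch in Python, not IREE's actual CSE implementation), CSE can only merge two textually identical ops when they are known side-effect free; once the transfers dedupe, the ops downstream of them become identical and dedupe too, collapsing the whole DAG:

```python
# Minimal sketch of why purity gates CSE (hypothetical IR tuples, not IREE code):
# each op is (result, name, operands, is_pure); identical pure ops collapse.

def cse(ops):
    seen = {}          # (name, operands) -> canonical result id
    replacements = {}  # duplicate result id -> canonical result id
    kept = []
    for result, name, operands, is_pure in ops:
        # Rewrite operands through earlier replacements first.
        operands = tuple(replacements.get(o, o) for o in operands)
        key = (name, operands)
        if is_pure and key in seen:
            replacements[result] = seen[key]  # drop the duplicate op
        else:
            seen[key] = result
            kept.append((result, name, operands))
    return kept, replacements

# Two identical transfer+encode chains rooted at the same constant: they only
# collapse if the transfer op is marked pure.
ops = [
    ("%1", "transfer", ("%cst",), True),
    ("%5", "transfer", ("%cst",), True),
    ("%3", "encode", ("%1",), True),
    ("%6", "encode", ("%5",), True),
]
kept, repl = cse(ops)
# With transfer pure, %5 folds into %1 and %6 then folds into %3.
```

If `is_pure` is False for the transfers, the second chain survives untouched, which matches the duplicated-DAG "before" IR above.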

Fixes part of #21659 (duplication of the same encoded parameter).

benvanik added the compiler/dialects label (Relating to the IREE compiler dialects (flow, hal, vm)) Sep 10, 2025
benvanik marked this pull request as ready for review September 10, 2025 21:34
benvanik requested a review from hanhanW September 10, 2025 21:34
hanhanW (Contributor) left a comment

I ran the command with the repro again, and I found that the duplicates are not folded away. My input IR is slightly different from yours: https://gist.github.com/hanhanW/1508cc24a8725d69e69f2d063b8317d8

Before:

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    %5 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %6 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %7 = stream.async.transfer %6 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %cst, @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    util.global.store %0, @"__auto.blk.0.attn_q.weight:qs__size" : index
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %7, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    util.return
  }

After running iree-opt --iree-util-fold-globals ~/repro.mlir:

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    %5 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %6 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %7 = stream.async.transfer %6 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %7, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    util.return
  }

The difference is that mine has an additional transfer from the global to the encoding op.

    %1 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size}
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}

versus yours, which transfers directly from the constant:

    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}

hanhanW (Contributor) commented Sep 10, 2025

Here is the full dump until encoding specialization: https://gist.github.com/hanhanW/932bfac88b53ef295ec7a0fe2f670f56

The intent was that the fixed-point iterator would run the nested passes until things stopped changing. This is critical for cleanup after the fold/fuse globals passes, but those passes were not telling the fixed-point iterator that they had changed things, and thus we only ever ran a single iteration.

The op always should have been marked pure, as we definitely treat it that way, but for some reason it wasn't. I can't quite see what would break with this behavior, but after 4 years it's likely something bears load on this op not being CSE-able.
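The commit message above describes the change-signaling mechanism; as a rough generic sketch (hypothetical names, not IREE's actual pass infrastructure), a fixed-point driver only reruns its pipeline if at least one nested pass reports that it modified something, so a pass that forgets to report stops the loop after one sweep:

```python
# Hypothetical sketch of a fixed-point pass driver (not IREE's real API):
# each pass maps state -> (new_state, changed); the driver reruns the whole
# pipeline until a full sweep reports no changes (or an iteration cap hits).

def run_fixed_point(state, passes, max_iterations=10):
    for _ in range(max_iterations):
        changed = False
        for p in passes:
            state, pass_changed = p(state)
            changed = changed or pass_changed
        if not changed:
            return state  # converged
    return state  # gave up at the iteration cap

# Toy "pass" that removes one adjacent duplicate pair per run; if it returned
# changed=False unconditionally, the driver would stop after a single sweep
# with duplicates still present.
def dedupe_once(items):
    for i in range(len(items) - 1):
        if items[i] == items[i + 1]:
            return items[:i] + items[i + 1:], True
    return items, False

result = run_fixed_point(["a", "a", "a", "b"], [dedupe_once])
# Converges to ["a", "b"] only because dedupe_once reports that it changed.
```

This mirrors the bug: the fold/fuse globals passes did their work but never signaled the iterator, so the cleanup that depended on repeated sweeps never ran.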
benvanik force-pushed the users/benvanik/cleanup-combine branch from f841b31 to 4d999bb September 11, 2025 14:45
benvanik (Collaborator, Author) commented Sep 11, 2025

ah yeah - yours has the extra transfer because you're just running one pass in isolation and not the CSE pass that removes the extra transfer :) running --compile-from=flow on the IR you produced gives the results I put above.

benvanik requested a review from hanhanW September 11, 2025 14:46
hanhanW (Contributor) left a comment

Thanks, I confirmed that it fixes the dup issue. The IR looks slightly cleaner after the rebase.

benvanik (Collaborator, Author) commented:

thanks for testing it!

benvanik merged commit 5274550 into main Sep 11, 2025 (71 of 73 checks passed)
benvanik deleted the users/benvanik/cleanup-combine branch September 11, 2025 23:05