
benvanik (Collaborator) commented Sep 10, 2025

A bug in the fixed-point iteration that applies cleanup during the stream phase, combined with IREE::Stream::AsyncTransferOp not being marked pure, was preventing us from CSEing entire DAGs of work in hoisted initializers.

Example before (same parameter encoded twice and then stored in memory as two distinct buffers):

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    util.global.store %cst, @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    util.global.store %0, @"__auto.blk.0.attn_q.weight:qs__size" : index
    %__auto.blk.0.attn_q.weight3Aqs__size = util.global.load @"__auto.blk.0.attn_q.weight:qs__size" : index
    %__auto.blk.0.attn_q.weight3Aqs = util.global.load @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    %1 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    %__auto.blk.0.attn_q.weight3Aqs__size_0 = util.global.load @"__auto.blk.0.attn_q.weight:qs__size" : index
    %__auto.blk.0.attn_q.weight3Aqs_1 = util.global.load @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    %5 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs_1 : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size_0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size_0}
    %6 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %7 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size_0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%6}
    %8 = stream.async.transfer %7 : !stream.resource<*>{%6} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%6}
    util.global.store %8, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
    util.global.store %6, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
    util.return
  }

Example after (parameter encoded once and stored in memory once):

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    util.return
  }

The fixed-point change should be safe, but the AsyncTransferOp change may have some unforeseen impacts given how long the op has gone without being marked pure. There are several places where we already treat AsyncTransferOp as pure (removing duplicates, folding chains of transfers, etc.), so my hope is that this is more of an inconsistency we never noticed.
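As a loose illustration of why the purity marking matters (a hypothetical value-numbering sketch in Python, not IREE's actual CSE implementation), CSE can only merge two textually identical ops when they are known side-effect free; once the transfers dedupe, the ops downstream of them become identical and dedupe too, collapsing the whole DAG:

```python
# Minimal sketch of why purity gates CSE (hypothetical IR tuples, not IREE code):
# each op is (result, name, operands, is_pure); identical pure ops collapse.

def cse(ops):
    seen = {}          # (name, operands) -> canonical result id
    replacements = {}  # duplicate result id -> canonical result id
    kept = []
    for result, name, operands, is_pure in ops:
        # Rewrite operands through earlier replacements first.
        operands = tuple(replacements.get(o, o) for o in operands)
        key = (name, operands)
        if is_pure and key in seen:
            replacements[result] = seen[key]  # drop the duplicate op
        else:
            seen[key] = result
            kept.append((result, name, operands))
    return kept, replacements

# Two identical transfer+encode chains rooted at the same constant: they only
# collapse if the transfer op is marked pure.
ops = [
    ("%1", "transfer", ("%cst",), True),
    ("%5", "transfer", ("%cst",), True),
    ("%3", "encode", ("%1",), True),
    ("%6", "encode", ("%5",), True),
]
kept, repl = cse(ops)
# With transfer pure, %5 folds into %1 and %6 then folds into %3.
```

If `is_pure` is False for the transfers, the second chain survives untouched, which matches the duplicated-DAG "before" IR above.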

Fixes part of #21659 (duplication of the same encoded parameter).

benvanik added the compiler/dialects label (Relating to the IREE compiler dialects (flow, hal, vm)) Sep 10, 2025
benvanik marked this pull request as ready for review September 10, 2025 21:34
benvanik requested a review from hanhanW September 10, 2025 21:34
hanhanW (Contributor) left a comment

I ran the command with the repro again, and I found that the duplicates are not folded away. My input IR is slightly different from yours: https://gist.github.com/hanhanW/1508cc24a8725d69e69f2d063b8317d8

Before:

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    %5 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %6 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %7 = stream.async.transfer %6 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %cst, @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
    util.global.store %0, @"__auto.blk.0.attn_q.weight:qs__size" : index
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %7, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    util.return
  }

After running iree-opt --iree-util-fold-globals ~/repro.mlir:

  util.initializer {
    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %0 = stream.resource.size %cst : !stream.resource<constant>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    %5 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %6 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
    %7 = stream.async.transfer %6 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
    util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
    util.global.store %7, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
    util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
    util.return
  }

The difference is that mine has an additional transfer from the global to the encoding op.

    %1 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size}
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}

versus yours, which transfers directly from the constant:

    %cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
    %1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}

hanhanW (Contributor) commented Sep 10, 2025

Here is the full dump until encoding specialization: https://gist.github.com/hanhanW/932bfac88b53ef295ec7a0fe2f670f56

The intent was that the fixed-point iterator would run the nested passes until things stopped changing. This is critical for cleanup after the fold/fuse globals passes, but those passes were not telling the fixed-point iterator that they had changed things, and thus we only ever ran a single iteration.

The op always should have been marked pure, as we definitely treat it that way, but for some reason it wasn't. I can't quite see what would break with this behavior, but after 4 years it's likely something bears load on this op not being CSE-able.
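The commit message above describes the change-signaling mechanism; as a rough generic sketch (hypothetical names, not IREE's actual pass infrastructure), a fixed-point driver only reruns its pipeline if at least one nested pass reports that it modified something, so a pass that forgets to report stops the loop after one sweep:

```python
# Hypothetical sketch of a fixed-point pass driver (not IREE's real API):
# each pass maps state -> (new_state, changed); the driver reruns the whole
# pipeline until a full sweep reports no changes (or an iteration cap hits).

def run_fixed_point(state, passes, max_iterations=10):
    for _ in range(max_iterations):
        changed = False
        for p in passes:
            state, pass_changed = p(state)
            changed = changed or pass_changed
        if not changed:
            return state  # converged
    return state  # gave up at the iteration cap

# Toy "pass" that removes one adjacent duplicate pair per run; if it returned
# changed=False unconditionally, the driver would stop after a single sweep
# with duplicates still present.
def dedupe_once(items):
    for i in range(len(items) - 1):
        if items[i] == items[i + 1]:
            return items[:i] + items[i + 1:], True
    return items, False

result = run_fixed_point(["a", "a", "a", "b"], [dedupe_once])
# Converges to ["a", "b"] only because dedupe_once reports that it changed.
```

This mirrors the bug: the fold/fuse globals passes did their work but never signaled the iterator, so the cleanup that depended on repeated sweeps never ran.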
benvanik force-pushed the users/benvanik/cleanup-combine branch from f841b31 to 4d999bb September 11, 2025 14:45
benvanik (Collaborator, Author) commented Sep 11, 2025

ah yeah - yours has the extra transfer because you're just running one pass in isolation and not the CSE pass that removes the extra transfer :) running --compile-from=flow on the IR you produced gives the results I put above.

benvanik requested a review from hanhanW September 11, 2025 14:46
hanhanW (Contributor) left a comment

Thanks, I confirmed that it fixes the dup issue. The IR looks slightly cleaner after the rebase.

benvanik (Collaborator, Author) commented:

thanks for testing it!

benvanik merged commit 5274550 into main Sep 11, 2025 (71 of 73 checks passed)
benvanik deleted the users/benvanik/cleanup-combine branch September 11, 2025 23:05