Fixing CSE of hoisted encoding ops. #21921
Conversation
I ran the command with the repro again, and I found that they are not folded away. My input IR is slightly different from yours: https://gist.github.com/hanhanW/1508cc24a8725d69e69f2d063b8317d8
Before:
util.initializer {
%cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
%0 = stream.resource.size %cst : !stream.resource<constant>
%1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
%2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
%3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
%4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
%5 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
%6 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
%7 = stream.async.transfer %6 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
util.global.store %cst, @"__auto.blk.0.attn_q.weight:qs" : !stream.resource<constant>
util.global.store %0, @"__auto.blk.0.attn_q.weight:qs__size" : index
util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
util.global.store %7, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
util.return
}
After running iree-opt --iree-util-fold-globals ~/repro.mlir:
util.initializer {
%cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
%0 = stream.resource.size %cst : !stream.resource<constant>
%1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
%2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<4096x4096xf8E4M3FNUZ, #encoding> : index
%3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
%4 = stream.async.transfer %3 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
%5 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
%6 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %5 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
%7 = stream.async.transfer %6 : !stream.resource<*>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<constant>{%2}
util.global.store %4, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded : !stream.resource<constant>
util.global.store %7, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0 : !stream.resource<constant>
util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded_0__size : index
util.global.store %2, @__hoisted_tensor_4096x4096xf8E4M3FNUZ__encoded__size : index
util.return
}
The difference is that mine has an additional transfer from the global to the encoding op.
%1 = stream.async.transfer %__auto.blk.0.attn_q.weight3Aqs : !stream.resource<constant>{%__auto.blk.0.attn_q.weight3Aqs__size} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size}
%3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%__auto.blk.0.attn_q.weight3Aqs__size} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
whereas yours has:
%cst = stream.tensor.constant on(#hal.device.affinity<@__device_0>) : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<constant> = #stream.parameter.named<"model"::"blk.0.attn_q.weight:qs"> : tensor<4096x4096xf8E4M3FNUZ>
%1 = stream.async.transfer %cst : !stream.resource<constant>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
%3 = stream.tensor.encode on(#hal.device.affinity<@__device_0>) %1 : tensor<4096x4096xf8E4M3FNUZ> in !stream.resource<*>{%0} -> tensor<4096x4096xf8E4M3FNUZ, #encoding> in !stream.resource<*>{%2}
Review comment on compiler/src/iree/compiler/Dialect/Util/Transforms/ApplyPatterns.cpp (resolved).
Here is the full dump until encoding specialization: https://gist.github.com/hanhanW/932bfac88b53ef295ec7a0fe2f670f56
The intent was that the fixed-point iterator would run the nested passes until things stopped changing. This is critical for cleanup after the fold/fuse globals passes, but those passes were not telling the fixed-point iterator pass that they had changed anything, and thus we only ever ran a single iteration.
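To make that mechanism concrete, here is a minimal C++ sketch of a fixed-point pass driver built on standard MLIR APIs. This is not IREE's actual fixed-point iterator pass; the marker attribute name and the helper function are hypothetical and only illustrate how nested passes report changes back to the driver.

// Conceptual sketch of a fixed-point pass driver (not IREE's actual
// implementation): nested passes set a marker attribute on the module when
// they mutate IR, and the driver re-runs the nested pipeline until no pass
// sets the marker or a safety limit is reached.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"

static constexpr llvm::StringLiteral kModifiedMarker = "fixedpoint.modified";

// Called by a nested pass (e.g. a fold-globals style pass) whenever it changes
// something. The bug described above was, in effect, that the fold/fuse
// globals passes never signaled this, so the driver saw a "converged" module
// after a single iteration.
void signalModified(mlir::ModuleOp module) {
  module->setAttr(kModifiedMarker, mlir::UnitAttr::get(module.getContext()));
}

mlir::LogicalResult runToFixedPoint(mlir::ModuleOp module,
                                    mlir::PassManager &nestedPipeline,
                                    int maxIterations = 10) {
  for (int i = 0; i < maxIterations; ++i) {
    module->removeAttr(kModifiedMarker);      // clear the flag for this round
    if (mlir::failed(nestedPipeline.run(module)))
      return mlir::failure();
    if (!module->hasAttr(kModifiedMarker))    // nothing changed: converged
      return mlir::success();
  }
  return mlir::failure();  // did not converge within the iteration budget
}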
It always should have been marked pure, as we definitely treat it as pure, but for some reason it wasn't. I can't quite see what would break with this behavior, but after 4 years it's likely something relies on this op not being CSE-able.
Force-pushed from f841b31 to 4d999bb.
ah yeah - yours has the extra transfer because you're just running one pass in isolation and not the CSE pass that removes the extra transfer :) Running --compile-from=flow on the IR you provided produces the results I put above.
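As a generic illustration of the cleanup that kicks in when compiling through the full pipeline rather than running a single pass in isolation, the sketch below wires up upstream MLIR canonicalization and CSE. This is not IREE's actual stream-phase pipeline; per the comment above, it is the CSE step that removes the extra transfer once the ops involved can be recognized as pure duplicates.

// Generic cleanup sketch using stock MLIR passes (not IREE's real pipeline):
// canonicalization simplifies the IR and CSE then deduplicates identical
// side-effect-free ops such as the redundant transfer/encode chain.
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

mlir::LogicalResult runCleanup(mlir::ModuleOp module) {
  mlir::PassManager pm(module.getContext());
  pm.addPass(mlir::createCanonicalizerPass());
  pm.addPass(mlir::createCSEPass());
  return pm.run(module);
}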
Thanks, I confirmed that it fixes the dup issue. The IR looks slightly cleaner after you rebase.
thanks for testing it!
A bug in the fixed-point iteration applying cleanup during the stream phase and IREE::Stream::AsyncTransferOp not being marked pure were preventing us from CSEing entire DAGs of work in hoisted initializers.
Example before (same parameter encoded twice and then stored in memory as two distinct buffers):
Example after (parameter encoded once and stored in memory once):
The fixed-point change should be safe, but the AsyncTransferOp change may have some unforeseen impacts given how long it has gone without being marked pure. There are several places where we already treat AsyncTransferOp as pure (removing duplicates, folding chains of transfers, etc.), so my hope is that this is more of an inconsistency we never noticed.
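To show why the missing purity marking blocked deduplication, here is a minimal C++ sketch of the gate a CSE-style pass applies before merging two ops. This is not MLIR's or IREE's actual CSE implementation and the helper name is made up; it only illustrates that structural equivalence alone is not enough.

// Minimal sketch of the purity gate in a CSE-style dedup (not the real CSE
// pass): two ops may only be merged when they are structurally equivalent AND
// the later one is free of memory effects. An op that rewrites treat as pure
// but that is not actually marked pure fails the second check, which is why
// both copies of the hoisted transfer/encode DAG survived.
#include "mlir/IR/Operation.h"
#include "mlir/IR/OperationSupport.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

bool isCseCandidate(mlir::Operation *earlier, mlir::Operation *later) {
  // Structural equivalence: same op name, operands, attributes, result types.
  if (!mlir::OperationEquivalence::isEquivalentTo(
          later, earlier, mlir::OperationEquivalence::IgnoreLocations))
    return false;
  // The gate that mattered here: ops with (or with unknown) memory effects are
  // never deduplicated, so a missing purity marking keeps both duplicates
  // alive.
  return mlir::isMemoryEffectFree(later);
}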
Fixes part of #21659 (duplication of the same encoded parameter).