mingMelody
Contributor

PR types

Performance optimization

PR changes

Models

Description

On the Kunlunxin P800, the float16 tile operation runs inefficiently. This PR replaces it with an expand operation to resolve the problem.
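For context, the tile and expand variants are meant to produce the same layout: each key/value group is repeated `multiplier` times along the head axis, so head `h` reads from group `h // multiplier`. Below is a minimal pure-Python sketch of that index mapping, using nested lists as a stand-in for Paddle tensors. The helper name `repeat_kv_groups` and the toy values are hypothetical and not part of the PR.

```python
def repeat_kv_groups(groups, multiplier):
    """Repeat each key/value group `multiplier` times, in order.

    `groups` is a list of G per-group vectors. The result has
    G * multiplier entries, mirroring what
    unsqueeze(-2) -> tile/expand -> reshape does on the head axis.
    """
    return [group for group in groups for _ in range(multiplier)]


num_heads = 32   # num_attention_heads_per_partition in the PR
num_groups = 2   # num_multi_query_groups_per_partition in the PR
multiplier = num_heads // num_groups

# Two toy "groups"; each short vector stands in for the D-sized head dimension.
kv_groups = [[0.0, 0.1], [1.0, 1.1]]
heads = repeat_kv_groups(kv_groups, multiplier)

# Head h reads from group h // multiplier.
for h in range(num_heads):
    assert heads[h] == kv_groups[h // multiplier]
print(len(heads))
```

Because the mapping is the same either way, the choice between tile and expand should affect only how the repeated data is produced on the device, not the values seen by attention.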

Screenshot 2025-09-25 23 55 01

Efficiency comparison of the operation in isolation

import paddle
import time

paddle.set_device('xpu')
num_attention_heads_per_partition = 32
num_multi_query_groups_per_partition = 2
hidden_size_per_attention_head = 128
multiplier = num_attention_heads_per_partition // num_multi_query_groups_per_partition

B, S, G, D = 27, 1, num_multi_query_groups_per_partition, hidden_size_per_attention_head
# Pass the shape as a list, which paddle.randn accepts across Paddle versions.
key_layer_in = paddle.randn([B, S, G, D], dtype='float16')
value_layer_in = paddle.randn([B, S, G, D], dtype='float16')
# -------- type1 (tile+reshape) --------
key_layer = key_layer_in.clone()
value_layer = value_layer_in.clone()

start = time.time()
key_layer = key_layer.unsqueeze(-2).tile([1, 1, 1, multiplier, 1])
key_layer = key_layer.reshape(
    key_layer.shape[:2] + [num_attention_heads_per_partition, hidden_size_per_attention_head]
)
value_layer = value_layer.unsqueeze(-2).tile([1, 1, 1, multiplier, 1])
value_layer = value_layer.reshape(
    value_layer.shape[:2] + [num_attention_heads_per_partition, hidden_size_per_attention_head]
)
end = time.time()
print(f"type1 cost time: {end - start:.6f}s")
print("type1 key_layer shape:", key_layer.shape)

# -------- type2 (expand+reshape) --------
key_layer = key_layer_in.clone()
value_layer = value_layer_in.clone()

start = time.time()
# Shapes are passed as lists so that expand/reshape work across Paddle versions.
key_layer = key_layer.unsqueeze(-2).expand([B, S, G, multiplier, D])
key_layer = key_layer.reshape([B, S, num_attention_heads_per_partition, hidden_size_per_attention_head])
value_layer = value_layer.unsqueeze(-2).expand([B, S, G, multiplier, D])
value_layer = value_layer.reshape([B, S, num_attention_heads_per_partition, hidden_size_per_attention_head])
end = time.time()
print(f"type2 cost time: {end - start:.6f}s")
Screenshot 2025-09-25 23 58 45
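The script above times a single call of each variant, so the result can include first-call overhead and, on an asynchronous device, may not wait for the kernels to finish. A more robust measurement could look like the sketch below. It is not the benchmark used in this PR. The helper `benchmark` and its `sync`, `warmup`, and `iters` parameters are hypothetical. On an XPU one would pass a device-synchronization function as `sync` (in Paddle, possibly `paddle.device.synchronize`, assuming it is available in the installed version).

```python
import time


def benchmark(fn, sync=lambda: None, warmup=5, iters=100):
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` should block until the device has finished all queued work.
    The default is a no-op, which is only appropriate for CPU-only code.
    """
    # Warm-up calls absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        fn()
    sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for queued work before reading the clock
    return (time.perf_counter() - start) / iters


# Toy CPU workload so the sketch runs without an accelerator.
mean_seconds = benchmark(lambda: sum(range(1000)))
print(f"mean cost time: {mean_seconds:.6f}s")
```

To compare the two variants, each block (tile+reshape and expand+reshape) would be wrapped in its own function and passed as `fn`.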

paddle-bot bot commented Sep 25, 2025

Thanks for your contribution!
