Skip to content

Conversation

ikawrakow
Copy link
Owner

The PR improves AVX2 performance for the trellis quants IQ3_KT and IQ4_KT recently added in PR #441.
The results below are for LLaMA-3.1-8B on a Ryzen-5975WX CPU.

IQ3_KT

N_KV S_PP t/s (main) S_PP t/s (PR) PP speedup S_TG t/s (main) S_TG t/s (PR) TG speedup
0 61.98 71.59 1.155 11.17 13.30 1.191
512 61.27 70.79 1.155 11.10 13.19 1.188
1024 60.48 69.93 1.156 11.04 13.10 1.187
1536 59.94 69.15 1.154 10.95 12.96 1.184
2048 59.48 68.55 1.152 10.87 12.85 1.182

IQ4_KT

N_KV S_PP t/s (main) S_PP t/s (PR) PP speedup S_TG t/s (main) S_TG t/s (PR) TG speedup
0 44.32 64.91 1.465 9.36 11.69 1.249
512 43.90 64.12 1.461 9.26 11.56 1.248
1024 43.60 63.39 1.454 9.19 11.47 1.248
1536 43.32 62.86 1.451 9.12 11.37 1.247
2048 43.07 62.37 1.448 9.06 11.28 1.245

CPU performance is still much lower than other quantization types. But memory bandwidth is far from saturated, so PP and TG will be better on a faster CPU with more cores.

@ikawrakow ikawrakow merged commit a2c42f9 into main May 24, 2025
@ikawrakow ikawrakow mentioned this pull request Jun 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant