Better TG performance for GQA models (CPU) #332
Conversation
Here is another comparison to mainline, this time for Gemma3-12B-Instruct. Only runs with FA enabled.
Nice.
Do you still have the raw markdown results? I know PP wasn't affected by this PR but I'm curious where it stands vs mainline.
I wonder if they cross over at higher contexts; the gap does seem to be closing here.
Mainline PP performance with FA is embarrassing. I also picked the fastest mainline quant, the one that receives an extraordinary amount of attention.

Gemma3-12B-Instruct: At 16k tokens mainline TG performance is indeed slightly better than ik_llama.

LLaMA-3.1-8B-Instruct: Here mainline does not do well for PP or TG. Mainline TG is 55.5% of ik_llama's.
Btw, my surprise at the 6X drop in PP performance for DeepSeek-V3/R1 that I expressed elsewhere was based on results such as these.
It is really nice being able to use FA here and benefit.
For Gemma this also makes the most sense, as they released QAT versions of Q4_0 (this being the best quant for 12B; some measurements here).
Thanks for doing that.
Ya, surprisingly the newer run with higher KV performed better, looking at both.
Here's the visual generated with the Python script in the sweep-bench example folder, in order to see the crossover point.
Yes, for both models the PP graphs just show ik_llama clearly above mainline.
The PP graph is again not very interesting, but the TG graph is, showing the different curves.
Ya, that architecture's performance surprises me too, like when I saw peak batched TG performance for DeepSeek being higher than PP performance instead of just approaching it, as I normally observe.
This PR improves TG performance on the CPU for GQA models (LLaMA-2+, Gemma, etc.).

We see performance gains with and without FA. The gains without FA are fairly minor and come from a different way of distributing the work between the threads for the `K*Q` and `V*softmax(K*Q)` matrix multiplications. The performance gains with FA enabled are very significant, and FA now outperforms no-FA for TG as well.
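To make the GQA angle easier to picture, here is a minimal, purely illustrative C++ sketch. It is not the code in this PR, and all names and layout choices in it are assumptions; it only shows why GQA leaves room for a better split: every group of query heads shares one KV head, so the `K*Q` work can be divided over threads by KV-cache rows, with each thread reusing the K row it just loaded for the whole head group. The same argument applies to `V*softmax(K*Q)`, where the V rows are shared across the group in the same way.

```cpp
// Illustrative sketch only (not this PR's code): splitting K*Q over threads
// by KV-cache rows so that query heads sharing a KV head reuse the same K row.
#include <algorithm>
#include <cstdint>

struct AttnShape {
    int n_head;    // number of query heads
    int n_head_kv; // number of KV heads; n_head % n_head_kv == 0 for GQA
    int head_dim;  // per-head dimension
    int n_kv;      // tokens currently in the KV cache
};

// Compute K*Q scores for the KV rows assigned to thread `ith` of `nth`,
// for a single query token.
//   K      : [n_head_kv][n_kv][head_dim]
//   Q      : [n_head][head_dim]
//   scores : [n_head][n_kv]
static void kq_for_thread(const AttnShape & s, const float * K, const float * Q,
                          float * scores, int ith, int nth) {
    const int group   = s.n_head / s.n_head_kv;    // query heads per KV head
    const int per_thr = (s.n_kv + nth - 1) / nth;  // KV rows per thread
    const int row_beg = ith * per_thr;
    const int row_end = std::min(s.n_kv, row_beg + per_thr);

    for (int hk = 0; hk < s.n_head_kv; ++hk) {
        const float * Kh = K + (int64_t)hk * s.n_kv * s.head_dim;
        for (int r = row_beg; r < row_end; ++r) {
            const float * krow = Kh + (int64_t)r * s.head_dim;
            // One pass over this K row serves all query heads in the group.
            for (int g = 0; g < group; ++g) {
                const int h = hk * group + g;
                const float * qh = Q + (int64_t)h * s.head_dim;
                float sum = 0.0f;
                for (int d = 0; d < s.head_dim; ++d) {
                    sum += krow[d] * qh[d];
                }
                scores[(int64_t)h * s.n_kv + r] = sum;
            }
        }
    }
}
```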
Here is an example for LLaMA-3.1-8B-Instruct. The model is quantized with `Q4_0`, and the KV cache is `Q8_0` (the V cache is `f16` when FA is not enabled). Results are for a Ryzen-5975WX CPU (vanilla `AVX2`). Also included for comparison are mainline `llama.cpp` results (build 5139) with FA enabled, shown with orange symbols. Results were obtained with `llama-sweep-bench`. The x-axis is `N_KV`, the number of tokens in the KV cache.
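The exact benchmark command is not reproduced above. A representative `llama-sweep-bench` invocation for this kind of CPU run might look like `./bin/llama-sweep-bench -m Llama-3.1-8B-Instruct-Q4_0.gguf -c 16384 -t 32 -fa -ctk q8_0 -ctv q8_0`; the model path, context size, and thread count here are placeholders rather than the values behind the plots, and the no-FA runs would drop `-fa` and leave the V cache at `f16`.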