Improve DeepSeek batched processing speed #282
Conversation
It does have some benefit at long contexts.
Running sweep-bench and will post full results with a graph when they finish, but early results look promising; table with early values below.
I see you pushed another commit; should I stop this test, recompile, and run the new commit?
This will only affect results for
What would be very interesting is to run PP benchmarks with DeepSeek-V3/R1 with
This will help us understand whether the crossover between "TG optimized" and "PP optimized" somehow depends on the number of heads, or whether it is just a (perhaps somewhat computer-dependent) constant. I can see arguments for both options, so the only way to find out is to test.
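To make the question concrete, here is a minimal sketch of the two hypotheses being compared. This is illustrative code only, not anything from the repository; the function names are made up, and only the head counts and the 128-token candidate come from this discussion.

```cpp
#include <cstdio>

// Hypothesis A: the "TG optimized" -> "PP optimized" crossover scales with
// the number of attention heads (16 for DeepSeek-Lite, 128 for DeepSeek-V3/R1).
static int crossover_head_dependent(int n_head) { return n_head; }

// Hypothesis B: the crossover is a (perhaps somewhat computer-dependent)
// constant; 128 tokens is the candidate suggested by the DeepSeek-Lite results.
static int crossover_constant(int /*n_head*/) { return 128; }

int main() {
    for (int n_head : {16, 128}) {
        std::printf("n_head=%3d  head-dependent=%3d  constant=%3d\n",
                    n_head, crossover_head_dependent(n_head), crossover_constant(n_head));
    }
    return 0;
}
```

For DeepSeek-Lite the two rules predict very different crossover points (16 vs 128 tokens); whether DeepSeek-V3/R1, with 128 heads, shows its crossover near 128 tokens or much higher is what the requested benchmark is meant to reveal.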
Running now. Each config is going to take at least 50 minutes (based on my estimate from the beginning of the first run), so I may not be around to post the results until later.
@ikawrakow Here's the benchmark you asked for: On d12f4a1 with
On ec4bc75 with
On ec4bc75 with
I'm going to reboot my machine now to enable 1GB hugepages and mitigations=off and run a sweep-bench to see if TG performance increases.
Thanks, this is great! It looks like a threshold of 128 tokens is not a bad choice for DeepSeek-R1 as well.
I was looking into the batched processing performance dips observed by @saood06 here and I saw this for DeepSeek-Lite:
Commandline was
It took me a while to figure out the reason for the dramatic drop in performance between a batch size of 16 and a batch size of 20. I suspected that something was going wrong in how the work is distributed between the threads, but in the end it turned out to be due to the way the compute graph is built: when `n_token > n_head`, we switch to "PP optimized" processing, which means we go from FA with `Dk = 576, Dv = 512` to `Dk = 192, Dv = 128`, and this requires two additional matrix multiplications. For DeepSeek-Lite `n_head = 16`, so with steps of 4 for the batch size, 20 is exactly where the switch is made. I'm not sure what the rationale was for selecting this specific transition point (the optimization came from the mainline llama.cpp PR), but it clearly kills performance.

If we look at prompt processing performance using the "PP optimized" vs the "TG optimized" DeepSeek compute graph, we see this picture:

I.e., "TG optimized" is better than "PP optimized" for prompt lengths up to 64 tokens, and is not too far behind at 128 tokens. So we can easily solve the performance drop by using "TG optimized" up to `n_prompt = 128`. By doing that, we get this result:

The calculations take quite some time, so I didn't have the patience to run beyond a batch size of 100 to see the exact crossover point. But eyeballing the graph, 128 looks like a good choice for DeepSeek-Lite. DeepSeek-V3/R1 have 128 heads, so this PR will not change the behavior for these models. But it isn't clear to me whether one shouldn't use a larger threshold for the "TG optimized" -> "PP optimized" transition.
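For illustration, here is a minimal, self-contained sketch of the transition rule described above. It is not the actual ik_llama.cpp graph-building code; the function and variable names are assumptions, and only the numbers (`n_head = 16` for DeepSeek-Lite, the 128-token threshold) come from this PR.

```cpp
#include <algorithm>
#include <cstdio>

// Old rule: switch to the "PP optimized" graph as soon as the batch
// exceeds the number of attention heads (for DeepSeek-Lite n_head = 16,
// so with batch-size steps of 4 the switch happens at 20 tokens).
static bool use_pp_optimized_old(int n_tokens, int n_head) {
    return n_tokens > n_head;
}

// Rule sketched in this PR: stay on the "TG optimized" graph
// (FA with Dk = 576, Dv = 512) up to 128 tokens, since it is faster
// below that point and only slightly slower at 128.
static bool use_pp_optimized_new(int n_tokens, int n_head) {
    constexpr int tg_threshold = 128;
    return n_tokens > std::max(n_head, tg_threshold);
}

int main() {
    const int n_head = 16; // DeepSeek-Lite
    for (int n_tokens = 4; n_tokens <= 132; n_tokens += 4) {
        std::printf("n_tokens=%3d  old=%s  new=%s\n", n_tokens,
                    use_pp_optimized_old(n_tokens, n_head) ? "PP" : "TG",
                    use_pp_optimized_new(n_tokens, n_head) ? "PP" : "TG");
    }
    return 0;
}
```

With `n_head = 128` (DeepSeek-V3/R1) the two rules coincide, which matches the statement that this PR does not change the behavior for those models.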
Concerning DeepSeek-R1, there is a small change in this PR that I hope will reduce the performance dips observed by @saood06.