Skip to content

Conversation

ikawrakow
Copy link
Owner

Adding IQ4_KS with 4 interleaved rows.

We get very signifiant performance gains on ARM_NEON and good gains on AVX2/Zen4.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads IQ4_KS IQ4_KS_R4 Speedup
ARM_NEON 8 67.29 ± 1.02 124.91 ± 0.62 1.856
Zen4 16 180.42 ± 0.68 266.05 ± 0.45 1.475
AVX2 32 201.79 ± 0.48 245.37 ± 0.52 1.216

We get decent performance gains for TG as well.
Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads IQ4_KS IQ4_KS_R4 Speedup
ARM_NEON 2 10.84 ± 0.01 12.55 ± 0.00 1.158
4 19.81 ± 0.12 22.06 ± 0.06 1.114
8 25.74 ± 0.47 26.47 ± 0.21 1.039
Zen4 1 6.18 ± 0.00 7.97 ± 0.11 1.290
2 11.73 ± 0.02 13.43 ± 0.00 1.145
4 13.09 ± 1.13 14.46 ± 0.00 1.105
AVX2 2 4.74 ± 0.00 7.30 ± 0.00 1.540
4 8.75 ± 0.00 11.39 ± 0.00 1.302
8 12.38 ± 0.01 12.73 ± 0.00 1.028

@ikawrakow ikawrakow merged commit dfa12b7 into main Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant