**ikawrakow** (Owner) commented:

The table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B¹ (L1-7B in the table), LLaMA-v2-7B¹ (L2-7B), Mistral-7B¹ (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time for short in the table) in seconds on a Ryzen-7950X CPU. The test uses "pure" quantization (i.e., the `--pure` option of `llama-quantize`) with token embeddings and the output tensor set to `Q8_0`. The quantization command line is

```
./bin/llama-quantize --imatrix $imatrix --token-embedding-type q8_0 --output-tensor-type q8_0 --pure $model $output iq2_xs
```
| Model | Quantization | PPL (main) | PPL (this PR) | Q-time (main, s) | Q-time (this PR, s) |
|-------|--------------|------------|---------------|------------------|---------------------|
| L1-7B | IQ2_XS | 8.2767  | 8.2773  | N/A²  | N/A²  |
| L2-7B | IQ2_XS | 8.0856  | 8.1669  | 156.4 | 132.6 |
| M-7B  | IQ2_XS | 7.3882  | 7.3447  | 169.1 | 143.3 |
| L3-8B | IQ2_XS | 13.4294 | 13.0922 | 171.3 | 145.8 |
| DSL   | IQ2_XS | 9.8273  | 9.4692  | 302.7 | 257.0 |

All models are improved except LLaMA-v2 (but I may have given it too much weight when fine-tuning the hyper-parameters in the original IQ2_XS PR). Quantization time is reduced by about 18%.
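
For reference, a minimal sketch of how such a run can be reproduced. The file names below (imatrix, input model, output, test text) are placeholders and not the actual files used for the table; the perplexity step assumes the standard `llama-perplexity` tool with a 512-token context.

```bash
# Placeholder paths -- substitute your own imatrix, input model, output file, and test text.
imatrix=imatrix-l3-8b.dat
model=Llama-3.1-8B-Instruct-f16.gguf
output=Llama-3.1-8B-Instruct-iq2_xs.gguf

# "Pure" IQ2_XS quantization with Q8_0 token embeddings and output tensor,
# as in the command line above.
./bin/llama-quantize --imatrix $imatrix --token-embedding-type q8_0 \
    --output-tensor-type q8_0 --pure $model $output iq2_xs

# PPL measurement with a context of 512 tokens on an assumed test text file.
./bin/llama-perplexity -m $output -f wiki.test.raw -c 512
```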


¹ Why use such ancient models? The LLaMA-v1 models were the basis for k-quants development. I-quants were developed using LLaMA-v1, LLaMA-v2, and Mistral-7B. In my experience, if a quantization technique does well on all three of these, it is (almost) guaranteed to do well on any other model out there.

² I have this model on an old HDD, so quantization time is dominated by the time needed to read the data from the disk. I could have copied the model to an SSD, but the timings for the other models give enough indication of the relative performance.

@ikawrakow merged commit 2309ecd into main on Apr 7, 2025.