
Conversation

DamonFool
Contributor

llama.cpp fails to quantize T5 models whose encoder and decoder have different numbers of blocks.
The failure is triggered by the verification of the attention-layer count:

GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected");

The original verification assumed that the encoder and decoder have the same number of blocks, so it fails for models with unequal encoder-decoder block counts.

Testing: flan-t5-small, t5-small, and an unequal encoder-decoder T5 model
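
For context, a minimal sketch of what an encoder/decoder-aware count looks like; the field names (such as dec_n_layer) and surrounding quantization state are assumptions based on llama.cpp conventions, not a verbatim copy of the merged patch:

// Sketch of the corrected expectation (assumed field names).
// An encoder block contributes 1 attention layer (self-attention);
// a decoder block contributes 2 (self-attention + cross-attention),
// so the expected total must use the decoder's own block count
// instead of assuming it equals the encoder's.
int32_t n_attn_layer = hparams.n_layer;              // encoder self-attention layers
if (llama_model_has_encoder(&model)) {
    n_attn_layer += 2 * hparams.dec_n_layer;         // decoder self- and cross-attention
}
GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected");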

@DamonFool
Contributor Author

Hi @CISC, there is another encoder-decoder PR here: #16002.

The simple example is a very good starting point for helping people integrate llama.cpp into their apps.
It would be helpful to also support encoder-decoder models in that example.
Hope you are fine with it.
Thanks.
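
As a rough illustration of what encoder-decoder support in the simple example would involve, here is a hedged sketch built on public llama.h entry points (llama_model_has_encoder, llama_encode, llama_model_decoder_start_token); the batch setup, vocab handle, and surrounding loop are assumed to match the existing simple example and are elided:

// Sketch: the extra encoder pass an encoder-decoder model needs before
// the usual decode loop. Assumes `model`, `ctx`, `vocab`, and a prompt
// `batch` have already been set up as in the existing simple example.
if (llama_model_has_encoder(model)) {
    // Run the encoder once over the full prompt batch.
    if (llama_encode(ctx, batch) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
        return 1;
    }
    // Decoding then starts from the decoder-start token,
    // not from the encoded prompt tokens.
    llama_token dec_start = llama_model_decoder_start_token(model);
    if (dec_start == LLAMA_TOKEN_NULL) {
        dec_start = llama_vocab_bos(vocab); // fall back to BOS
    }
    batch = llama_batch_get_one(&dec_start, 1);
}
// ...followed by the normal llama_decode() + sampling loop.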

@CISC CISC merged commit 745cbcf into ggml-org:master Sep 17, 2025
47 of 48 checks passed
@DamonFool
Contributor Author

Thanks @CISC.

@DamonFool DamonFool deleted the llama-quant-t5 branch September 17, 2025 07:59
angt pushed a commit to angt/llama.cpp that referenced this pull request Sep 17, 2025