
Conversation

@DePasqualeOrg (Contributor) commented Jun 13, 2025

I've separated the KV cache part out of my Gemma 3 PR. I'm testing the quantized KV cache on Qwen 3. If we go with this approach, we'll need to update all the models to use attentionWithCacheUpdate as Qwen 3 does here.

The Python API is more elegant because it allows more dynamic typing. In Swift we need this wrapper function due to the requirements of the KVCache protocol.
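Roughly, the wrapper dispatches on the concrete cache type. A minimal sketch (the names QuantizedKVCache, updateQuantized, and quantizedScaledDotProductAttention are assumptions here, not necessarily the exact merged API):

```swift
import MLX
import MLXFast

// Sketch: route attention through the quantized or standard path depending
// on the concrete cache type. Assumed names, not the exact merged API.
func attentionWithCacheUpdate(
    queries: MLXArray,
    keys: MLXArray,
    values: MLXArray,
    cache: KVCache?,
    scale: Float,
    mask: MLXArray? = nil
) -> MLXArray {
    if let quantized = cache as? QuantizedKVCache {
        // Quantized path: keys/values stay in their packed representation.
        let (quantizedKeys, quantizedValues) = quantized.updateQuantized(
            keys: keys, values: values)
        return quantizedScaledDotProductAttention(
            queries: queries,
            quantizedKeys: quantizedKeys,
            quantizedValues: quantizedValues,
            scale: scale,
            mask: mask,
            groupSize: quantized.groupSize,
            bits: quantized.bits)
    } else if let cache {
        // Standard path: the cache concatenates and returns full-precision tensors.
        let (cachedKeys, cachedValues) = cache.update(keys: keys, values: values)
        return MLXFast.scaledDotProductAttention(
            queries: queries, keys: cachedKeys, values: cachedValues,
            scale: scale, mask: mask)
    } else {
        return MLXFast.scaledDotProductAttention(
            queries: queries, keys: keys, values: values, scale: scale, mask: mask)
    }
}
```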

You can test inference with and without the quantized KV cache with these arguments in llm-tool:

--model mlx-community/Qwen3-1.7B-4bit --prompt "Explain quantum computing in simple terms" --max-tokens 100 --kv-bits 4
--model mlx-community/Qwen3-1.7B-4bit --prompt "Explain quantum computing in simple terms" --max-tokens 100

For short sequences like this, using the quantized KV cache is actually slower, but it should be more efficient for much longer sequences.


Some preliminary test results: with Qwen3-1.7B-4bit, using 4-bit quantization for the KV cache gets the model stuck in repetitive loops. The same model with 8-bit KV cache quantization works well, but even at 1200 tokens it's still about 8% slower than the non-quantized KV cache. Maybe someone who's interested in using a quantized KV cache can do some more testing on other models with much longer sequence lengths (cc @mzbac).

@mzbac (Contributor) commented Jun 13, 2025

Thanks for the great work. I also noticed that the 4-bit KV cache can cause some performance degradation, especially for thinking models. In earlier tests of my implementation, I found that memory savings become noticeable during token generation at around 2,000 to 4,000 tokens. However, I didn't track the speed; I'll try to test your implementation once I get a chance.

/// - cache: The model cache state
/// - metadata: Optional metadata to save along with cache state
public func savePromptCache(
fileName: String,
Collaborator:

I wonder if this should be a URL? E.g., if you are constructing a path to the caches directory, this would normally be a URL.
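Something like this, for instance (sketch; the parameter names are assumptions):

```swift
import Foundation

// Sketch: take a URL rather than a String file name (assumed signature).
public func savePromptCache(
    url: URL,
    cache: [KVCache],
    metadata: [String: String] = [:]
) throws {
    // ... same implementation, using url instead of a string path ...
}

// A caller building a path into the caches directory can then do:
let cachesDir = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask)[0]
let cacheURL = cachesDir.appendingPathComponent("prompt-cache.safetensors")
```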

Contributor Author:

Yes, that makes sense.

flattenedData["__metadata_user_value_\(i)"] = MLXArray(valueBytes.map { Int32($0) })
}
}
flattenedData["__metadata_user_count"] = MLXArray([metadata.count])
Collaborator:

Since the metadata comes back as [String: String] below, why not use public func saveToData(arrays: [String: MLXArray], metadata: [String: String] = [:])? It supports a metadata dictionary directly.
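For instance (sketch; flattenedCacheArrays is a hypothetical placeholder for the tensors being saved):

```swift
// Sketch: pass user metadata through directly instead of encoding each
// entry into specially named MLXArrays. Uses the saveToData function
// mentioned above, per its stated signature.
let arrays: [String: MLXArray] = flattenedCacheArrays  // hypothetical: the KV cache tensors
let metadata: [String: String] = ["model": "mlx-community/Qwen3-1.7B-4bit"]
let data = try saveToData(arrays: arrays, metadata: metadata)
```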

Contributor Author:

I think I've improved this in the commit that I'll add in a moment. Please check it, since I'm not familiar with the usage.

/// - Returns: The prompt cache and optionally the metadata
public func loadPromptCache(
fileName: String,
returnMetadata: Bool = false
Collaborator:

Why would you not want to return this? The return value always includes it (though it might be empty).

Contributor Author:

This mirrored the behavior in Python, but I think you're right that it makes sense to consistently return it.
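So the signature could become something like this (sketch; names assumed):

```swift
import Foundation

// Sketch: always return metadata alongside the cache; callers can ignore it.
public func loadPromptCache(url: URL) throws -> (cache: [KVCache], metadata: [String: String]) {
    // ... load arrays, reconstruct the caches, read the metadata ...
    let caches: [KVCache] = []            // placeholder for the reconstructed caches
    let metadata: [String: String] = [:]  // may be empty, but never absent
    return (caches, metadata)
}
```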

/// - cache: Array of KV caches to potentially quantize
/// - kvBits: Number of bits for quantization (nil = no quantization)
/// - kvGroupSize: Group size for quantization
/// - quantizedKVStart: Step to begin quantizing
Collaborator:

I am not familiar with the typical use of this -- the step is the token offset? So if you have a long prompt you will switch to quantized before you start evaluating the response -- that is the intent?

Contributor Author (@DePasqualeOrg, Jun 13, 2025):

This mirrors maybe_quantize_kv_cache in mlx-lm. The comment was misleading, and in fact quantizedKVStart refers to the token count.
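So the check is against the number of cached tokens. A sketch of the mlx-lm logic in Swift (names follow the snippet below; KVCacheSimple and the offset property are assumptions):

```swift
// Sketch of mlx-lm's maybe_quantize_kv_cache: once a cache holds more than
// quantizedKVStart tokens, convert it in place to a quantized cache.
func maybeQuantizeKVCache(
    cache: inout [KVCache],
    kvBits: Int?,
    kvGroupSize: Int,
    quantizedKVStart: Int
) {
    guard let kvBits else { return }  // nil = no quantization
    for i in 0 ..< cache.count {
        if let simpleCache = cache[i] as? KVCacheSimple,
            simpleCache.offset > quantizedKVStart
        {
            cache[i] = simpleCache.toQuantized(groupSize: kvGroupSize, bits: kvBits)
        }
    }
}
```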

cache[i] = simpleCache.toQuantized(groupSize: kvGroupSize, bits: kvBits)
}
// Note: RotatingKVCache.toQuantized() is not implemented yet (the Python version has it)
// When implemented, add: else if let rotatingCache = cache[i] as? RotatingKVCache { ... }
Collaborator:

Since this may support rotating caches in the future, I wonder if the name of the function should be more generic? Less about quantization -- maybe something about a step having completed?

Is this something someone might want to customize? Should this be a protocol or closure passed in to the iterator? I guess we can decide that later if needed.

Contributor Author:

This mirrors the implementation in mlx-lm.

private var keep: Int
private var keys: MLXArray?
private var values: MLXArray?
// TODO: `offset` from the Python implementation is not implemented here. Do we need it?
Collaborator:

meaning offset is always 0 from the base?

Collaborator:

but I see offset += S below so maybe I am misunderstanding?

Contributor Author:

That's right. I was confused.
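To spell out the bookkeeping (simplified sketch, [B, H, S, D] layout assumed):

```swift
// Simplified sketch: the rotating cache keeps at most maxSize tokens in its
// buffer, but offset keeps counting every token it has ever seen.
func update(keys: MLXArray, values: MLXArray) -> (MLXArray, MLXArray) {
    let s = keys.dim(2)  // tokens added in this step
    // ... append, rotating/trimming so the buffer never exceeds maxSize ...
    offset += s          // so offset is not always 0; it grows past maxSize
    return (self.keys!, self.values!)
}
```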

@DePasqualeOrg (Contributor Author) commented:

maxSize in RotatingKVCache is optional in the Python implementation, but I'm not sure this makes sense, since a rotating KV cache should always have a maximum size. I've made it required in the Swift implementation. Is there a good reason why this is optional in Python?
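Concretely, the Swift initializer looks like this (sketch; exact names may differ):

```swift
// Sketch: maxSize is required in Swift. A rotating cache without a bound
// would behave like an ordinary growing cache, so Optional buys little.
public class RotatingKVCache /* : KVCache -- conformance elided in this sketch */ {
    private let maxSize: Int  // required, unlike Python's Optional max_size
    private let keep: Int     // leading tokens that are never rotated out
    private var offset = 0    // total tokens seen; may exceed maxSize

    public init(maxSize: Int, keep: Int = 0) {
        self.maxSize = maxSize
        self.keep = keep
    }
}
```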

@davidkoski (Collaborator) left a comment:

Looks good, ran integration tests and all was well. Thank you!

@davidkoski davidkoski merged commit f7da396 into ml-explore:main Jun 24, 2025
4 checks passed
@DePasqualeOrg (Contributor Author) commented Jun 25, 2025

> If we go with this approach, we'll need to update all the models to use attentionWithCacheUpdate as Qwen 3 does here.

@davidkoski, I think we still need to update the other models so that they can use the cache routing.
