Conversation

@Alessan-git
Contributor

Summary

The LLMEval and VLMEval apps now use AsyncStream-based token generation.

Changes

  • LLMEval: generation logic in LLMEvaluator.
  • VLMEval: generation logic in VLMEvaluator.
  • GenerationCompletionInfo: made Sendable.

Evaluators now include a generationTask property to manage the generation process. They also track the token count within the loop, though this functionality could potentially be integrated into the generation methods in Evaluate. This could be achieved either by passing maxTokens as an argument or by incorporating it into GenerateParameters.
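A rough sketch of that shape (generationTask, output, and stat match the snippets below; the class body is otherwise illustrative):

import SwiftUI

@MainActor
class LLMEvaluator: ObservableObject {
    @Published var output = ""
    @Published var stat = ""

    // Handle to the in-flight generation so the UI (or a token budget) can cancel it.
    var generationTask: Task<Void, Error>?

    func cancel() {
        generationTask?.cancel()
        generationTask = nil
    }
}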

Note

I removed the displayEveryNTokens part as I didn’t notice any improvements from it; in fact, performance was sometimes better without it. In any case, for a sample application, I believe prioritizing clarity over performance is the better approach.

@ronaldmannak
Contributor

The displayEveryNTokens prevents faster models from overwhelming SwiftUI, which unfortunately happens quite easily. It may not be a bad idea to leave it in to alert developers to this issue.
However, if we want to do it "right," it might make more sense to throttle by time, limiting tokens per second with a TimeInterval so it throttles only when needed.

@davidkoski
Collaborator

> The displayEveryNTokens prevents faster models from overwhelming SwiftUI, which unfortunately happens quite easily. It may not be a bad idea to leave it in to alert developers to this issue. However, if we want to do it "right," it might make more sense to throttle by time, limiting tokens per second with a TimeInterval so it throttles only when needed.

I agree that doing it on a time basis makes the most sense.

@davidkoski
Collaborator

As for doing the periodic display cleanly, I wonder if a small throttling combinator might work: it could combine tokens for e.g. 0.25 seconds and then emit a small chunk that we would display. I think it would be nice to use composition like this to describe the effect, and it would fit in nicely with an async sequence.
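A minimal sketch of what that composition could look like, assuming a hand-rolled AsyncSequence extension (the name, return type, and timing strategy are all assumptions, not an existing API):

import Foundation

extension AsyncSequence where Element: Sendable, Self: Sendable {
    // Buffers upstream elements and emits them as batches at most
    // once per `interval` seconds.
    func throttle(for interval: TimeInterval) -> AsyncStream<[Element]> {
        AsyncStream { continuation in
            let task = Task {
                var buffer: [Element] = []
                var lastFlush = Date.distantPast
                do {
                    for try await element in self {
                        buffer.append(element)
                        // Flush the buffer at most once per interval.
                        if Date().timeIntervalSince(lastFlush) >= interval {
                            continuation.yield(buffer)
                            buffer.removeAll()
                            lastFlush = Date()
                        }
                    }
                } catch {}
                // Flush whatever remains when the upstream finishes.
                if !buffer.isEmpty { continuation.yield(buffer) }
                continuation.finish()
            }
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}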

@Alessan-git
Contributor Author

I implemented a simple throttle method. I think it works fine:

// `stream` yields Generation values; throttled, it delivers them in batches.
for await batch in stream.throttle(for: 0.25) {
    for result in batch {
        switch result {
        case .token(let token):
            tokenCount += 1
            // Cancel once the token budget is reached.
            if tokenCount >= maxTokens { await generationTask?.cancel() }
            let text = context.tokenizer.decode(tokens: [token])
            // Hop to the main actor to update the UI.
            Task { @MainActor in
                self.output += text
            }
        case .info(let info):
            Task { @MainActor in
                self.stat = "\(info.tokensPerSecond) tokens/s"
            }
        }
    }
}

@ronaldmannak
Contributor (commented Mar 27, 2025)

Correct me if I'm wrong, but doesn't this set the minimum throughput to 1 token every 0.25 seconds? What you want is to still receive tokens as quickly as possible, buffer them, and then flush the buffer to the view every 0.25 seconds (or whatever interval you set).

Update: it actually does seem to batch automatically; I missed that initially, apologies.

@Alessan-git
Contributor Author

Yes, when throttled it returns [Generation] instead of Generation.

Even so, I’m not sure if adding too many of these "details" is appropriate for this library, as it seems more like something that each developer should adapt to their needs.

I prefer a thin layer, without many abstractions, that each developer can then adapt easily. But that's just me :)
Ideally, I think most people should write their own TokenIterator (or something similar) and even their own samplers and generation parameters. I find it too restrictive right now. That's why I decided to make some changes in previous PRs to make some models public.
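For instance, a custom greedy sampler could be as small as this (a sketch only; the protocol here is hypothetical and stands in for whatever sampling hook the library exposes):

import MLX

// Hypothetical protocol standing in for the library's sampling hook.
protocol Sampler {
    func sample(logits: MLXArray) -> MLXArray
}

// Greedy decoding: always take the highest-probability token.
struct GreedySampler: Sampler {
    func sample(logits: MLXArray) -> MLXArray {
        argMax(logits, axis: -1)
    }
}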

That being said, the work done so far by everyone in this library establishes a solid foundation, and it’s exciting to see how this can empower developers!

Let me know if the changes feel right or need further polishing!

@ronaldmannak
Contributor

I believe the main issue lies in the blurred lines between the libraries and the example apps in the current repo. The eval app serves as a demo to show developers how to use the libraries. I think throttling should be part of the example app so it runs smoothly on all devices and models, while also demonstrating to developers how to use the library.

Although it's a different discussion, separating the examples from the libraries could prevent a lot of confusion. We could keep them in a single repo, but with a clear distinction between the two. For instance, have a top-level source directory for the Swift package sources and an example directory for the Xcode project, which would contain only the demo targets. I'm not sure whether the CI or other internal tools at Apple would prevent that approach, but I'd be happy to work on it.

@davidkoski
Collaborator

> I prefer a thin layer, without many abstractions, that each developer can then adapt easily. But that's just me :)
> Ideally, I think most people should write their own TokenIterator (or something similar) and even their own samplers and generation parameters. I find it too restrictive right now. That's why I decided to make some changes in previous PRs to make some models public.

From issues & comments it seems like we have a mix of desires -- some people want a single method that does everything and others want a toolkit with everything exposed. The good news is I think we can accommodate both, to a certain extent (the latter more than the former I think).

I agree that the chunking piece is really a UI concern so it belongs in the integration in the example app. Making sure the right pieces are open or public will help a lot with the lower level pieces. Ideally you could implement your own samplers, etc. (right now) and use the pre-built TokenIterator if you wanted, or build the whole thing yourself.

Anyway, thank you for your efforts in opening things up. We need the community both to say what it wants and to help build it.

@davidkoski
Collaborator

> Although it's a different discussion, separating the examples from the libraries could prevent a lot of confusion. We could keep them in a single repo, but with a clear distinction between the two. For instance, have a top-level source directory for the Swift package sources and an example directory for the Xcode project, which would contain only the demo targets. I'm not sure whether the CI or other internal tools at Apple would prevent that approach, but I'd be happy to work on it.

I think the only real limitation is that Package.swift be at the top level -- this is required for swiftpm to use it as an importable library. We can reorganize the examples however we would like. I don't know how most people consume this: as a swiftpm library or as an xcodeproj with examples. Or maybe both.
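For illustration, a split along those lines might look like this (directory names hypothetical, apart from Package.swift):

Package.swift            // must stay at the repo root for swiftpm
Sources/                 // library targets (MLXLMCommon, etc.)
Examples/                // Xcode project and demo targets only
    LLMEval/
    VLMEval/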

@Alessan-git
Contributor Author

Should I reintroduce the displayEveryNTokens property so everyone can tailor it to their needs? Or would it be better to incorporate a simple timer into the Evaluator?

The stream functionality works seamlessly in the example apps, which was the primary goal of the PR.

I really appreciate the engaging discussion we're having about the library. In my view, it should be a flexible toolkit—something lightweight yet robust, designed to make it easy to explore new ideas and support research.

Collaborator

Inline review comment on these lines:

case .token(let token):
    tokenCount += 1
    if tokenCount >= maxTokens { await generationTask?.cancel() }
    let text = context.tokenizer.decode(tokens: [token])

With #260 this can be simplified a bit:

for await item in try MLXLMCommon.generate(input: input, parameters: generateParameters, context: context) {
    switch item {
    case .chunk(let string):
        // Chunks now arrive already decoded as String.
        print(string, terminator: "")
        fflush(stdout)
    case .info(let generateCompletionInfo):
        // Completion metrics (e.g. tokens/s) are available here if needed.
        break
    }
}

The conversion to String now happens inside the generator.

Do you want to update these call sites? I am happy to as well!

@davidkoski
Collaborator
Great, thank you for the contribution!

@davidkoski merged commit 289bb67 into ml-explore:main on Apr 9, 2025
3 checks passed