Update document capabilities for LLModel #543
Merged
While attempting to use Koog with Gemini models to extract structured data from PDFs, I discovered that the Google models aren't listed for `LLMCapability.Document`.

It may be that `LLMCapability.Document` is intended to be a broader category than just PDFs. Since there isn't a specific capability for PDFs, though, I propose being more permissive and listing the `Document` capability for APIs that support PDFs.

For instance, the Gemini docs specifically note that formats like markdown, etc., are supported, but that "document vision only meaningfully understands PDFs", suggesting PDFs are considered a separate processing category.
Anthropic model support for PDFs is documented here. Currently the models listed are:
Gemini model support for PDFs appears to be universal (the documentation doesn't name any models for which it doesn't work). The model support documentation is misleading, as I've confirmed that PDFs are supported across generations (specifically 1.5 Flash, 2.0 Flash, 2.5 Flash, and 2.5 Pro).
The Llama documentation on vision capabilities assumes that document processing is part of any multimodal model that supports vision (and my own experience confirms this).
The OpenAI documentation states explicitly that "OpenAI models with vision capabilities can also accept PDF files as input".
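To illustrate the effect of the proposal, here is a minimal sketch of capability gating. The `Capability` and `Model` types, the `supportsPdf` helper, and the model id string are hypothetical stand-ins written for this example; they are not Koog's actual `LLMCapability` or `LLModel` declarations, which may differ in shape.

```kotlin
// Hypothetical stand-ins for illustration only; NOT Koog's real declarations.
enum class Capability { Completion, Vision, Document }

data class Model(
    val provider: String,
    val id: String,
    val capabilities: Set<Capability>,
)

// Returns true when a PDF attachment may be sent to the given model.
fun supportsPdf(model: Model): Boolean = Capability.Document in model.capabilities

// Before the change: vision-capable, but Document is not listed.
val geminiBefore = Model(
    provider = "google",
    id = "gemini-2.0-flash", // assumed id string, for illustration
    capabilities = setOf(Capability.Completion, Capability.Vision),
)

// After the change: Document is listed alongside the existing capabilities.
val geminiAfter = geminiBefore.copy(
    capabilities = geminiBefore.capabilities + Capability.Document,
)

fun main() {
    println(supportsPdf(geminiBefore)) // false
    println(supportsPdf(geminiAfter))  // true
}
```

Under this kind of gating, any caller that checks for the `Document` capability before attaching a file would reject PDFs for a model that can in fact process them, which is why listing the capability permissively seems preferable.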
Type of the change
Checklist for all pull requests
`develop` as the base branch
Additional steps for pull requests adding a new feature