@devcrocod devcrocod commented May 30, 2025

  • added a new mediaContent property to Message
  • created a new MediaContent class
  • added a new DSL for user messages in prompt
  • added OpenAI audio models
  • added new Claude models
  • refactored the LLM client code
  • updated KDocs
  • updated documentation in Module.md
  • added model tables to the KDoc for *Models

Relates to


Type of the change

  • New feature
  • Bug fix
  • Documentation fix

Checklist for all pull requests

  • The pull request has a description of the proposed change
  • I read the Contributing Guidelines before opening the pull request
  • The pull request uses develop as the base branch
  • Tests for the changes have been added
  • All new and existing tests passed
Additional steps for pull requests adding a new feature
  • An issue describing the proposed change exists
  • The pull request includes a link to the issue
  • The change was discussed and approved in the issue
  • Docs have been added / updated

Comment on lines 443 to 448
user {
-    definition.definition(this)
+    text { definition.definition(this) }
}
Member

Consider making plain text the default input type, with other media types (e.g., images, audio) explicitly defined. This way, users won't need to wrap text inputs in text { "some string" }.

Contributor Author

@Rizzen
Yep, I agree that the DSL needs improvement. What do you suggest?
Right now it looks like this:

prompt() {
    user("string")
}
prompt() {
    user {
        text("string")
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
prompt() {
    user {
        text {
            ...
        }
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
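For illustration, the shape of that builder could be implemented with a minimal sketch like the following (hypothetical names and types, not the actual koog implementation):

```kotlin
// Minimal hypothetical sketch of the user { } builder shape (not the actual koog API).
// Each call appends a typed part, so text and media keep their declaration order.
sealed interface MessagePart
data class TextPart(val text: String) : MessagePart
data class MediaPart(val kind: String, val source: String) : MessagePart

class UserMessageBuilder {
    val parts = mutableListOf<MessagePart>()

    fun text(value: String) { parts += TextPart(value) }
    fun text(block: StringBuilder.() -> Unit) {
        parts += TextPart(StringBuilder().apply(block).toString())
    }
    fun image(source: String) { parts += MediaPart("image", source) }
    fun audio(source: String) { parts += MediaPart("audio", source) }
}

fun user(block: UserMessageBuilder.() -> Unit): List<MessagePart> =
    UserMessageBuilder().apply(block).parts
```

With this sketch, `user { text("string"); image("photo.jpg") }` yields the parts in declaration order.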

Member

@Rizzen Rizzen Jun 2, 2025

I was thinking about an option like the following:

prompt() {
    user {
        +"Hello"
        +"There"
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}

Just to simplify the simplest cases, like the following:

prompt() {
    system {
        +"Hello"
        +"There"
    }
    user {
        +"General"
        +"Kenobi"
    }
}
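In Kotlin this idea amounts to a small operator overload inside the builder; a hypothetical sketch (illustrative names, not the actual API):

```kotlin
// Hypothetical sketch of the unaryPlus proposal (not the actual koog API):
// inside the builder, +"..." appends a line of text to the message.
class TextMessageBuilder {
    private val lines = mutableListOf<String>()

    // Member extension: +"text" is only available inside this builder's scope.
    operator fun String.unaryPlus() { lines += this }

    fun build(): String = lines.joinToString("\n")
}

fun userMessage(block: TextMessageBuilder.() -> Unit): String =
    TextMessageBuilder().apply(block).build()
```

Usage: `userMessage { +"General"; +"Kenobi" }` joins the lines into a single text message.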

Contributor Author

@Rizzen In my opinion, this is not a good design.

I'm not a big fan of overriding unaryPlus for String, because such an API is not very obvious.
That's my personal preference, but in this case such an API creates confusion.

How should the message to the LLM look if we write it like this:

prompt() {
    user {
        +"General"
        document(...)
        +"Kenobi"
    }
}

What do you think about explicitly separating the methods? Keep the old user method and create a new one like userWithMedia?

Member

@Rizzen Rizzen Jun 2, 2025

Keep the old user method and create a new one like userWithMedia?

I think we can introduce a userAttachment builder for media types. @Ololoshechkin WDYT?

Collaborator

I'm not the biggest fan of the unary plus either; it's quite error-prone, imo. But having attachments actually sounds good -- it will keep us from mixing up text/image/text/audio/text in arbitrary order, and it will be much more like a chat UI -- you click "attach" and attach :)

If it's just text -- user("aaa") and text("...") cover this case.
If it's something that needs a builder:

user {
    text("aaa")
    markdown { ... }

    attachments {
        image(...)
        image(...)
        document(...) // Note: maybe if we only support PDFs, let's call it `pdf(...)` or `pdfDocument()` ???
    }
}
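A hypothetical sketch of this attachments sub-builder (names are illustrative, not the actual API): text stays at the top level of user { }, while media is grouped under attachments { }, so the two cannot be interleaved in arbitrary order.

```kotlin
// Hypothetical sketch of the proposed attachments sub-builder (not the actual koog API).
data class Attachment(val kind: String, val source: String)

class AttachmentsBuilder {
    val items = mutableListOf<Attachment>()
    fun image(source: String) { items += Attachment("image", source) }
    fun document(source: String) { items += Attachment("document", source) }
}

class UserContent(val text: List<String>, val attachments: List<Attachment>)

class UserContentBuilder {
    private val text = mutableListOf<String>()
    private val attachments = mutableListOf<Attachment>()

    fun text(value: String) { text += value }
    fun attachments(block: AttachmentsBuilder.() -> Unit) {
        attachments += AttachmentsBuilder().apply(block).items
    }
    fun build() = UserContent(text, attachments)
}

fun user(block: UserContentBuilder.() -> Unit): UserContent =
    UserContentBuilder().apply(block).build()
```

The design mirrors a chat UI: the message body is text, and media arrives as an ordered list of attachments alongside it.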

Though, voice doesn't really fit into the attachments category to me, as you can just dictate it.

Contributor Author

Note: maybe if we only support PDFs, let's call it pdf(...) or pdfDocument()

Not just PDF:
Anthropic also supports other types by sending them as text,
and Google supports many types too.

Though, voice doesn't really fit into attachments category to me, as you can just dictate it.

But audio is also like an attachment.
Later, we can add a separate userVoice when we support the Realtime API.

@Ololoshechkin
Collaborator

Will it work correctly if the image is on the user's machine and the agent is on the server?
The user would like to stream/send the file to the server, and then we would want to stream/send it to OpenAI/Anthropic/Ollama/etc.

The current implementation seems to require a URL or a local file. Can we support sending documents as byte arrays from memory (the user sends to a server, the server doesn't create a file physically, and sends it over to OpenAI)?

Maybe some FileSystemProvider-like abstraction with a default would be helpful? (cc: @sproshev )

But maybe that's too much and worth a separate change. WDYT?
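One possible shape for such an abstraction, purely a sketch under the assumption that a source can be a URL, a local file, or in-memory bytes (all names are illustrative, not the actual koog API):

```kotlin
import java.io.File

// Hypothetical sketch of a media-source abstraction (illustrative names):
// content can come from a URL, a local file, or an in-memory byte array,
// so a server can forward user uploads without writing them to disk.
sealed interface MediaSource {
    data class Url(val url: String) : MediaSource
    data class LocalFile(val path: String) : MediaSource
    data class Bytes(val bytes: ByteArray, val mimeType: String) : MediaSource
}

// A client resolves any source to raw bytes before sending it to the provider.
fun resolveBytes(source: MediaSource): ByteArray = when (source) {
    is MediaSource.Bytes -> source.bytes
    is MediaSource.LocalFile -> File(source.path).readBytes()
    is MediaSource.Url -> TODO("stream from the network")
}
```

With this shape, the byte-array case needs no filesystem at all, and the URL/file cases can be added behind the same interface later.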

@Ololoshechkin
Collaborator

Also, let's add some analog of / extension to onAssistantMessage?
Currently it would transform the response to text, but since we have mediaContent now, let's also add either a separate on... handler (onMediaReceived) or extend the existing onAssistantMessage.

Otherwise, it's impossible to use media content with agents without writing custom nodes/edges.

@Ololoshechkin
Collaborator

Ololoshechkin commented Jun 2, 2025

Also, please consider adding some examples with media types:

  1. A simple example with images and a prompt executor (ex: here are my 3 Instagram photos that I want to share in one post, please invent a good description for the post)

  2. An example with an agent (ex: a chat-support conversation that includes image attachments and image responses from the agent) -- building this example will also help us understand which nodes/edges are missing :)

Collaborator

@Ololoshechkin Ololoshechkin left a comment

Please add an agent example with media types, and add the required edges/nodes to the DSL.

@devcrocod devcrocod force-pushed the devcrocod/support-media-types branch from c03ca97 to b6e40ad Compare June 2, 2025 11:34
@devcrocod
Contributor Author

@Ololoshechkin

Will it work correctly if the image is on the user's machine and the agent is on the server?

Nope.

But maybe that's too much and worth a separate change. WDYT?

Yes, I think this part should be improved. It’s better to do it in a separate PR

Also, let's add some analog/extension to the onAssistantMessage?
Currently it would transform the response to text, but as we have mediaContent now, let's also add either a separate on... (onMediaReceived) or extend the existing onAssistantMessage.

Otherwise, it's impossible to use media content with agents without writing custom nodes/edges.

I added an edge, but note that very few models support media output. Right now, of all the models we have, only OpenAI 4o-audio can return audio.
Anthropic only returns text.
Google only returns text, and only special models can return audio or images.
Also, for OpenAI, the media features are very limited because we need to support the Realtime API and the Response API.
I also faced a problem when creating a node for media. For example, if we want to create a strategy that sends a specific image or file, it’s straightforward. But what if we have a lot of images or files?

We define the strategy ahead of time:

val strategy = strategy("test") {
    val processWithMedia by nodeImageProcess("/path/to/image.jpg")
    ....

    edge(nodeStart forwardTo processWithMedia)
    ....
}

But what if we need to pass many images?
Or choose them during execution?
How should the API look in that case?

@devcrocod devcrocod changed the title Devcrocod/support media types support media types Jun 2, 2025
@devcrocod devcrocod marked this pull request as ready for review June 2, 2025 18:18
@Ololoshechkin
Collaborator

noted that very few models support media output. Right now, from all the models we have, only OpenAI 4o-audio can return audio
Anthropic only returns text.
Google only returns text, and only special models can return audio or images.

Looks like we actually have to invest some time in an IDE plugin with a few inspections for all these model differences...
And mention that in the docs. I'm thinking of having some badges for models/providers, like the [jvm/native/js] badges on the Kotlin libraries website (cc: @innateteniuk WDYT?)

I also faced a problem when creating a node for media. For example, if we want to create a strategy that sends a specific image or file, it’s straightforward. But what if we have a lot of images or files?

Tried to check out your branch to take a look at how the node is defined, but didn't find it there.
But in general, would something like nodeLLMRequestMultiple / nodeLLMSendMultipleToolResults work for your case?

@Ololoshechkin Ololoshechkin force-pushed the devcrocod/support-media-types branch from de44b71 to df0102f Compare June 4, 2025 22:09
@Ololoshechkin Ololoshechkin merged commit 921ba3d into develop Jun 4, 2025
3 checks passed
@Ololoshechkin Ololoshechkin deleted the devcrocod/support-media-types branch June 4, 2025 22:48
Ololoshechkin added a commit that referenced this pull request Jun 5, 2025
EugeneTheDev added a commit that referenced this pull request Jun 5, 2025
EugeneTheDev added a commit that referenced this pull request Jun 5, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 6, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 6, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 10, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 10, 2025