@devcrocod devcrocod commented May 30, 2025

  • added a new mediaContent property to Message
  • created a new MediaContent class
  • added a new DSL for user messages in prompt
  • added OpenAI audio models
  • added new Claude models
  • refactored the LLM client code
  • updated KDocs
  • updated documentation in Module.md
  • added model tables to the KDoc for *Models

Relates to


Type of the change

  • New feature
  • Bug fix
  • Documentation fix

Checklist for all pull requests

  • The pull request has a description of the proposed change
  • I read the Contributing Guidelines before opening the pull request
  • The pull request uses develop as the base branch
  • Tests for the changes have been added
  • All new and existing tests passed
Additional steps for pull requests adding a new feature
  • An issue describing the proposed change exists
  • The pull request includes a link to the issue
  • The change was discussed and approved in the issue
  • Docs have been added / updated

Comment on lines 443 to 448
user {
-    definition.definition(this)
+    text { definition.definition(this) }
}
Member

Consider making plain text the default input type, with other media types (e.g., images, audio) explicitly defined. This way, users won't need to wrap text inputs in text { "some string" }.

Contributor Author

@Rizzen
Yep, I agree that the DSL needs improvement. What do you suggest?
Right now it looks like this:

prompt() {
    user("string")
}
prompt() {
    user {
        text("string")
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
prompt() {
    user {
        text {
            ...
        }
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
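For illustration, the shape of that builder could be implemented with a minimal sketch like the following (hypothetical names and types, not the actual koog implementation):

```kotlin
// Minimal hypothetical sketch of the user { } builder shape (not the actual koog API).
// Each call appends a typed part, so text and media keep their declaration order.
sealed interface MessagePart
data class TextPart(val text: String) : MessagePart
data class MediaPart(val kind: String, val source: String) : MessagePart

class UserMessageBuilder {
    val parts = mutableListOf<MessagePart>()

    fun text(value: String) { parts += TextPart(value) }
    fun text(block: StringBuilder.() -> Unit) {
        parts += TextPart(StringBuilder().apply(block).toString())
    }
    fun image(source: String) { parts += MediaPart("image", source) }
    fun audio(source: String) { parts += MediaPart("audio", source) }
}

fun user(block: UserMessageBuilder.() -> Unit): List<MessagePart> =
    UserMessageBuilder().apply(block).parts
```

With this sketch, `user { text("string"); image("photo.jpg") }` yields the parts in declaration order.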

Member

@Rizzen Rizzen Jun 2, 2025

I was thinking about an option like the following:

prompt() {
    user {
        +"Hello"
        +"There"
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}

Just to simplify the simplest cases, like the following:

prompt() {
    system {
        +"Hello"
        +"There"
    }
    user {
        +"General"
        +"Kenobi"
    }
}
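In Kotlin this idea amounts to a small operator overload inside the builder; a hypothetical sketch (illustrative names, not the actual API):

```kotlin
// Hypothetical sketch of the unaryPlus proposal (not the actual koog API):
// inside the builder, +"..." appends a line of text to the message.
class TextMessageBuilder {
    private val lines = mutableListOf<String>()

    // Member extension: +"text" is only available inside this builder's scope.
    operator fun String.unaryPlus() { lines += this }

    fun build(): String = lines.joinToString("\n")
}

fun userMessage(block: TextMessageBuilder.() -> Unit): String =
    TextMessageBuilder().apply(block).build()
```

Usage: `userMessage { +"General"; +"Kenobi" }` joins the lines into a single text message.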

Contributor Author

@Rizzen In my opinion, this is not a good design.

I'm not a big fan of overriding unaryPlus for String, because such an API is not very obvious.
That's my personal preference, but in this case such an API creates confusion.

How should the message to the LLM look if we write it like this:

prompt() {
    user {
        +"General"
        document(...)
        +"Kenobi"
    }
}

What do you think about explicitly separating the methods? Keep the old user method and create a new one like userWithMedia?

Member

@Rizzen Rizzen Jun 2, 2025

Keep the old user method and create a new one like userWithMedia?

I think we can introduce a userAttachment builder for media types. @Ololoshechkin WDYT?

Collaborator

I'm not the biggest fan of the unary plus either; it's quite error-prone, imo. But having attachments actually sounds good -- it will keep us from mixing up text/image/text/audio/text in arbitrary order, and it will be much more like a chat UI -- you click "attach" and attach :)

If it's just text -- user("aaa") and text("...") cover this case.
If it's something that needs a builder:

user {
    text("aaa")
    markdown { ... }

    attachments {
        image(...)
        image(...)
        document(...) // Note: maybe if we only support PDFs, let's call it `pdf(...)` or `pdfDocument()` ???
    }
}
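A hypothetical sketch of this attachments sub-builder (names are illustrative, not the actual API): text stays at the top level of user { }, while media is grouped under attachments { }, so the two cannot be interleaved in arbitrary order.

```kotlin
// Hypothetical sketch of the proposed attachments sub-builder (not the actual koog API).
data class Attachment(val kind: String, val source: String)

class AttachmentsBuilder {
    val items = mutableListOf<Attachment>()
    fun image(source: String) { items += Attachment("image", source) }
    fun document(source: String) { items += Attachment("document", source) }
}

class UserContent(val text: List<String>, val attachments: List<Attachment>)

class UserContentBuilder {
    private val text = mutableListOf<String>()
    private val attachments = mutableListOf<Attachment>()

    fun text(value: String) { text += value }
    fun attachments(block: AttachmentsBuilder.() -> Unit) {
        attachments += AttachmentsBuilder().apply(block).items
    }
    fun build() = UserContent(text, attachments)
}

fun user(block: UserContentBuilder.() -> Unit): UserContent =
    UserContentBuilder().apply(block).build()
```

The design mirrors a chat UI: the message body is text, and media arrives as an ordered list of attachments alongside it.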

Though, voice doesn't really fit into the attachments category to me, as you can just dictate it.

Contributor Author

Note: maybe if we only support PDFs, let's call it pdf(...) or pdfDocument()

Not just PDF:
Anthropic also supports other types by sending them as text,
and Google supports many types too.

Though, voice doesn't really fit into attachments category to me, as you can just dictate it.

But audio is also like an attachment.
Later, we can add a separate userVoice when we support the Realtime API.

@Ololoshechkin
Collaborator

Will it work correctly if the image is on the user's machine and the agent is on the server?
The user would like to stream/send the file to the server, and then we would want to stream/send it to OpenAI/Anthropic/Ollama/etc.

The current implementation seems to require a URL or a local file. Can we support sending documents as byte arrays from memory (the user sends to a server, the server doesn't create a file physically, and sends it over to OpenAI)?

Maybe some FileSystemProvider-like abstraction with a default would be helpful? (cc: @sproshev )

But maybe that's too much and worth a separate change. WDYT?
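One possible shape for such an abstraction, purely a sketch under the assumption that a source can be a URL, a local file, or in-memory bytes (all names are illustrative, not the actual koog API):

```kotlin
import java.io.File

// Hypothetical sketch of a media-source abstraction (illustrative names):
// content can come from a URL, a local file, or an in-memory byte array,
// so a server can forward user uploads without writing them to disk.
sealed interface MediaSource {
    data class Url(val url: String) : MediaSource
    data class LocalFile(val path: String) : MediaSource
    data class Bytes(val bytes: ByteArray, val mimeType: String) : MediaSource
}

// A client resolves any source to raw bytes before sending it to the provider.
fun resolveBytes(source: MediaSource): ByteArray = when (source) {
    is MediaSource.Bytes -> source.bytes
    is MediaSource.LocalFile -> File(source.path).readBytes()
    is MediaSource.Url -> TODO("stream from the network")
}
```

With this shape, the byte-array case needs no filesystem at all, and the URL/file cases can be added behind the same interface later.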

@Ololoshechkin
Collaborator

Also, let's add some analog of / extension to onAssistantMessage?
Currently it would transform the response to text, but since we have mediaContent now, let's also add either a separate on... handler (onMediaReceived) or extend the existing onAssistantMessage.

Otherwise, it's impossible to use media content with agents without writing custom nodes/edges.

@Ololoshechkin
Collaborator

Ololoshechkin commented Jun 2, 2025

Also, please consider adding some examples with media types:

  1. A simple example with images and a prompt executor (ex: here are my 3 Instagram photos that I want to share in one post, please invent a good description for the post)

  2. An example with an agent (ex: a chat-support conversation that includes image attachments and image responses from the agent) -- building this example will also help us understand which nodes/edges are missing :)

Collaborator

@Ololoshechkin Ololoshechkin left a comment

Please add an agent example with media types, and add the required edges/nodes to the DSL.

@devcrocod devcrocod force-pushed the devcrocod/support-media-types branch from c03ca97 to b6e40ad Compare June 2, 2025 11:34
@devcrocod
Contributor Author

@Ololoshechkin

Will it work correctly if the image is on the user's machine and the agent is on the server?

Nope.

But maybe that's too much and worth a separate change. WDYT?

Yes, I think this part should be improved. It’s better to do it in a separate PR

Also, let's add some analog/extension to the onAssistantMessage?
Currently it would transform the response to text, but as we have mediaContent now, let's also add either a separate on... (onMediaReceived) or extend the existing onAssistantMessage.

Otherwise, it's impossible to use media content with agents without writing custom nodes/edges.

I added an edge, but note that very few models support media output. Right now, of all the models we have, only OpenAI 4o-audio can return audio.
Anthropic only returns text.
Google only returns text, and only special models can return audio or images.
Also, for OpenAI, the media features are very limited because we need to support the Realtime API and the Response API.
I also faced a problem when creating a node for media. For example, if we want to create a strategy that sends a specific image or file, it’s straightforward. But what if we have a lot of images or files?

We define the strategy ahead of time:

val strategy = strategy("test") {
    val processWithMedia by nodeImageProcess("/path/to/image.jpg")
    ....

    edge(nodeStart forwardTo processWithMedia)
    ....
}

But what if we need to pass many images?
Or choose them during execution?
How should the API look in that case?

@devcrocod devcrocod changed the title Devcrocod/support media types support media types Jun 2, 2025
@devcrocod devcrocod marked this pull request as ready for review June 2, 2025 18:18
@Ololoshechkin
Collaborator

noted that very few models support media output. Right now, from all the models we have, only OpenAI 4o-audio can return audio
Anthropic only returns text.
Google only returns text, and only special models can return audio or images.

Looks like we actually have to invest some time in an IDE plugin with a few inspections for all these model differences...
And mention that in the docs. I'm thinking of having some badges for models/providers, like the [jvm/native/js] badges on the Kotlin libraries website (cc: @innateteniuk WDYT?)

I also faced a problem when creating a node for media. For example, if we want to create a strategy that sends a specific image or file, it’s straightforward. But what if we have a lot of images or files?

Tried to check out your branch to take a look at how the node is defined, but didn't find it there.
But in general, would something like nodeLLMRequestMultiple / nodeLLMSendMultipleToolResults work for your case?

@Ololoshechkin Ololoshechkin force-pushed the devcrocod/support-media-types branch from de44b71 to df0102f Compare June 4, 2025 22:09
@Ololoshechkin Ololoshechkin merged commit 921ba3d into develop Jun 4, 2025
3 checks passed
@Ololoshechkin Ololoshechkin deleted the devcrocod/support-media-types branch June 4, 2025 22:48
Ololoshechkin added a commit that referenced this pull request Jun 5, 2025
EugeneTheDev added a commit that referenced this pull request Jun 5, 2025
EugeneTheDev added a commit that referenced this pull request Jun 5, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 6, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 6, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 10, 2025
karloti pushed a commit to karloti/koog that referenced this pull request Jun 10, 2025