support media types #195

Conversation
```diff
 user {
-    definition.definition(this)
+    text { definition.definition(this) }
 }
```
Consider making plain text the default input type, with other media types (e.g., images, audio) explicitly defined. This way, users won't need to wrap text inputs in `text { "some string" }`.
@Rizzen Yep, I agree that we need to improve the DSL. What do you suggest? Right now it looks like this:

```kotlin
prompt() {
    user("string")
}
```

```kotlin
prompt() {
    user {
        text("string")
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
```

```kotlin
prompt() {
    user {
        text {
            ...
        }
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
```
I was thinking about an option like the following:

```kotlin
prompt() {
    user {
        +"Hello"
        +"There"
        audio(...)
        image(...)
        video(...)
        document(...)
    }
}
```

Just to simplify the simplest cases, like this:

```kotlin
prompt() {
    system {
        +"Hello"
        +"There"
    }
    user {
        +"General"
        +"Kenobi"
    }
}
```
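For reference, a minimal sketch of how such a unary-plus DSL could be wired. `TextMessageBuilder` and this standalone `user` function are hypothetical names for illustration only, not the actual Koog API:

```kotlin
// Hypothetical builder that collects text parts added via unary plus.
class TextMessageBuilder {
    private val parts = mutableListOf<String>()

    // `+"Hello"` appends the string as one line of the message.
    operator fun String.unaryPlus() {
        parts += this
    }

    fun build(): String = parts.joinToString("\n")
}

// Hypothetical entry point standing in for the real `user { ... }` builder.
fun user(block: TextMessageBuilder.() -> Unit): String =
    TextMessageBuilder().apply(block).build()

fun main() {
    val message = user {
        +"General"
        +"Kenobi"
    }
    println(message) // General\nKenobi
}
```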
@Rizzen In my opinion, this is not a good design. I'm not a big fan of overriding unaryPlus for String, because such an API is not very obvious. That's my personal preference, but in this case such an API creates confusion. How should the message to the LLM look if we write it like this:

```kotlin
prompt() {
    user {
        +"General"
        document(...)
        +"Kenobi"
    }
}
```

What do you think about explicitly separating the methods? Keep the old user method and create a new one like userWithMedia?
> Keep the old user method and create a new one like userWithMedia?

I think we can introduce a userAttachment builder for media types. @Ololoshechkin WDYT?
I'm not the biggest fan of the unary plus either. It's quite error-prone, imo. But having attachments actually sounds good -- it will keep us from mixing up text/image/text/audio/text in no particular order, and it will be much more like a chat UI -- you click "attach" and attach :)

If it's just text, user("aaa") and text("...") cover this case.

If it's something that needs a builder:

```kotlin
user {
    text("aaa")
    markdown { ... }
    attachments {
        image(...)
        image(...)
        document(...) // Note: maybe if we only support PDFs, let's call it `pdf(...)` or `pdfDocument()`?
    }
}
```

Though voice doesn't really fit into the attachments category to me, as you can just dictate it.
> Note: maybe if we only support PDFs, let's call it `pdf(...)` or `pdfDocument()`

Not just PDF: Anthropic also supports other types by sending text, and Google supports many types too.

> Though voice doesn't really fit into the attachments category to me, as you can just dictate it.

But audio is also like an attachment. Later, we can make a separate userVoice when we support the Realtime API.
Will it work correctly if the image is on the user's machine and the agent is on the server? The current implementation seems to require having a URL or a local file. Can we support sending documents as byte arrays from memory (the user sends to a server, the server doesn't physically create a file, and sends it over to OpenAI)? Maybe some FileSystemProvider-like abstraction with a default would be helpful? (cc: @sproshev) But maybe that's too much and worth a separate change. WDYT?
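A minimal sketch of what such a provider-style abstraction could look like, assuming hypothetical names (MediaSource, FileMediaSource, ByteArrayMediaSource) rather than anything in the actual Koog API:

```kotlin
import java.io.File

// Hypothetical abstraction over where media bytes come from; names are illustrative only.
interface MediaSource {
    val mimeType: String

    // Raw bytes to be encoded/uploaded to the LLM provider.
    fun readBytes(): ByteArray
}

// Default implementation: content backed by a local file on the machine running the client.
class FileMediaSource(private val path: String, override val mimeType: String) : MediaSource {
    override fun readBytes(): ByteArray = File(path).readBytes()
}

// In-memory implementation: bytes received over the network (e.g. a user upload on a server),
// never written to disk.
class ByteArrayMediaSource(private val bytes: ByteArray, override val mimeType: String) : MediaSource {
    override fun readBytes(): ByteArray = bytes
}
```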
Also, let's add some analog/extension to the … Otherwise, it's impossible to use media content with agents without writing custom nodes/edges.
Also, please consider adding an example with media types.
Please add an agent example with media types, and add the required edges/nodes to the DSL.
Nope.
Yes, I think this part should be improved. It's better to do it in a separate PR.
I added an edge, but note that very few models support media output. Right now, of all the models we have, only OpenAI 4o-audio can return audio.

I also faced a problem when creating a node for media. For example, if we want to create a strategy that sends a specific image or file, it's straightforward. But what if we have a lot of images or files? We define the strategy ahead of time:

```kotlin
val strategy = strategy("test") {
    val processWithMedia by nodeImageProcess("/path/to/image.jpg")
    ....
    edge(nodeStart forwardTo processWithMedia)
    ....
}
```

But what if we need to pass many images?
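A rough sketch of one way around this, using plain Kotlin stand-ins (MediaBatch and processImages are hypothetical, not the real strategy/node API): instead of hard-coding a path when the strategy is defined, the processing step takes the collection of images as its runtime input.

```kotlin
// Hypothetical input type for a media-processing step.
data class MediaBatch(val imagePaths: List<String>)

// The processing step receives the batch as runtime input instead of a fixed path.
fun processImages(batch: MediaBatch): List<String> =
    batch.imagePaths.map { path -> "processed: $path" }

fun main() {
    // The set of images is only known at run time, e.g. collected from user input.
    val batch = MediaBatch(listOf("/images/a.jpg", "/images/b.jpg"))
    processImages(batch).forEach(::println)
}
```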
Looks like we have to actually invest some time in an IDE plugin with a few inspections for all these model differences...

Tried to check out your branch to take a look at how the node is defined, but didn't find it there.