glorious-beard
Contributor

@glorious-beard glorious-beard commented May 6, 2025

Motivation and Context

Why is this change required?

This change lets template parsers, like the YAML parser, embed content types other than text and images for LLMs that support additional content types, such as PDFs for OpenAI and DOCX files for Claude. Without this capability, functions whose prompts have attachments would have to build their chat history manually in code.

What problem does it solve?

See above

What scenario does it contribute to?

Using additional content types beyond visuals and audio in user messages.

Open Issues Addressed

Description

Chat Prompt Parser

To preserve backward compatibility, I chose to add new content types rather than consolidate the binary content types, so that LLM chat service providers can opt in to the new types. This also reduces the chance of breaking existing code.

3 new content types are created:

  • PdfContent for PDF files. Uses the tag "<pdf>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
  • DocContent for MS Word .doc files. Uses the tag "<doc>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
  • DocxContent for MS Word .docx files. Uses the tag "<docx>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.

(NOTE: DocContent and DocxContent are separate mainly because they have different MIME types and content formats, though they could easily be consolidated into a single tag, leaving the LLM provider to distinguish between "doc" and "docx" files. Alternatively, I could see a case for dropping ".doc" support and requiring callers to use ".docx" only.)

In addition, the following two content types are now parsed from the XML:

  • AudioContent - Parses the tag "<audio>" with either Base64 data URIs or standard URIs, similar to ImageContent.
  • BinaryContent - Parses the tag "<file>" with either Base64 data URIs or standard URIs, similar to ImageContent.

Here is a sample:

            
<message role='user'>
  This part will be discarded upon parsing
  <text>Make sense of this random assortment of stuff.</text>
  <image>https://fake-link-to-image/</image>
  <audio>data:audio/wav;base64,UklGRiQAAABXQVZFZm10IBAAAAABAAEAIlYAAACABAAZGF0YVgAAAAA</audio>
  <pdf>data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9UeXBlL1hSZWYvUGFnZXMgNiAwIFIKL1R5cGUvUGFnZS9NZWRpYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9G</pdf>
  <pdf>https://fake-link-to-pdf/</pdf>
  <doc>data:application/msword;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</doc>
  <doc>https://fake-link-to-doc/</doc>
  <docx>data:application/vnd.openxmlformats-officedocument.wordprocessingml.document;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</docx>
  <docx>https://fake-link-to-docx/</docx>
  <file>data:application/octet-stream;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</file>
  <file>https://fake-link-to-binary/</file>
  This part will also be discarded upon parsing
</message>
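
Each attachment tag in the sample carries either a Base64 data URI or a standard URI. As a rough illustration of how the two forms can be told apart, here is a minimal Python sketch. It is not the parser added in this PR: `ParsedReference` and `parse_reference` are hypothetical names, and the sketch assumes that anything not starting with `data:` must be an absolute URI.

```python
import base64
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class ParsedReference:
    """Hypothetical holder: inline bytes for a data URI, or a remote URI."""
    mime_type: Optional[str]
    data: Optional[bytes] = None
    uri: Optional[str] = None


def parse_reference(value: str) -> ParsedReference:
    """Classify a tag body as a Base64 data URI or a standard URI."""
    value = value.strip()
    if value.startswith("data:"):
        # Layout: data:<mime type>;base64,<payload>
        header, sep, payload = value[len("data:"):].partition(",")
        if not sep:
            raise ValueError("data URI is missing the ',' separator")
        parts = header.split(";")
        if "base64" not in parts[1:]:
            raise ValueError("only Base64 data URIs are handled in this sketch")
        # The MIME type comes from the data URI header itself.
        return ParsedReference(
            mime_type=parts[0] or None,
            data=base64.b64decode(payload, validate=True),
        )
    parsed = urlparse(value)
    if parsed.scheme and parsed.netloc:
        # Assumption: a standard URI carries no MIME type of its own.
        return ParsedReference(mime_type=None, uri=value)
    raise ValueError(f"not a data URI or an absolute URI: {value!r}")


if __name__ == "__main__":
    print(parse_reference("data:audio/wav;base64,UklGRg=="))
    print(parse_reference("https://fake-link-to-pdf/"))
```

Keeping `data` and `uri` as separate optional fields mirrors the two ways a caller can supply an attachment; a connector can then decide whether it needs to download the URI or can send the bytes directly.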

Amazon Bedrock

Modified the Converse API request generator to handle the subset of binary content supported by Amazon Bedrock (PDF, DOC, DOCX, and Image), as documented here.
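
The "subset of binary content" check can be pictured as a lookup from MIME type to a content category. The Python sketch below is only an illustration of that idea under stated assumptions: `classify_attachment` and `DOCUMENT_FORMATS` are hypothetical names, the MIME types are the ones used in the sample prompt, and the returned labels are not claimed to match the Bedrock request schema.

```python
from typing import Optional, Tuple

# MIME types taken from the sample prompt; the labels on the right are
# illustrative only, not Bedrock's actual field values.
DOCUMENT_FORMATS = {
    "application/pdf": "pdf",
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def classify_attachment(mime_type: str) -> Optional[Tuple[str, str]]:
    """Return (category, format) for a supported MIME type, else None."""
    mime = mime_type.strip().lower()
    if mime in DOCUMENT_FORMATS:
        return ("document", DOCUMENT_FORMATS[mime])
    if mime.startswith("image/") and len(mime) > len("image/"):
        # Assumption: the image format is the MIME subtype.
        return ("image", mime.split("/", 1)[1])
    # Anything else (for example audio) is outside the supported subset.
    return None


if __name__ == "__main__":
    for mime in ("application/pdf", "image/png", "audio/wav"):
        print(mime, "->", classify_attachment(mime))
```

Returning `None` for unsupported types leaves it to the caller to decide whether to skip the item or raise an error.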

OpenAI

Modified the client to handle PDF content, audio content, and file references when generating a request to an OpenAI (or OpenAI compatible) client.

Contribution Checklist

@glorious-beard glorious-beard requested a review from a team as a code owner May 6, 2025 20:38
@glorious-beard glorious-beard changed the title Glorious-beard/11044-expand-chat-prompt-parser .Net: Add support for audio, pdf, doc, and docx to chat prompt parser May 6, 2025
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel kernel.core labels May 8, 2025
@rogerbarreto rogerbarreto removed the needs discussion Issues that require discussion by the internal Semantic Kernel team before proceeding label May 30, 2025
@rogerbarreto
Member

@glorious-beard I updated the proposal to be more abstract, since it applies to the SemanticKernel.Abstraction package.

Since we will have many different types of documents and binary files, it is better to stay broad rather than specific: avoid introducing special content types and use the existing ones that already work.

Given that, I updated the logic to accept a mimetype attribute on the tag, as in <binary mimetype="type/subtype"/>, to cover the scenarios where you provide a URI.

For data URI content, the MIME type is picked up automatically from the data:mimeType scheme.

@rogerbarreto
Member

Updated PR Description

Motivation and Context

Enhance the Chat Prompt XML parsing capability to also support audio and documents.

Description

The following two content types are now supported in the Chat Prompt XML:

  • AudioContent - Parses the tag <audio mimetype="type/subtype"> with either Base64 data URIs or standard URIs, similar to ImageContent.
  • BinaryContent - Parses the tag <binary mimetype="type/subtype"> with either Base64 data URIs or standard URIs, similar to ImageContent.

The mimetype attribute is optional, and can be omitted for Base64 data URIs.

Here is a sample:

<message role='user'>
  This part will be discarded upon parsing
  <text>Summarize all the contents I provided in this message.</text>
  <image mimetype="image/png">https://fake-link-to-image/</image>
  <audio>data:audio/wav;base64,UklGRiQAAAB...</audio>
  <binary>data:application/pdf;base64,UklGRiQAAAB...</binary>
  <binary mimetype="application/pdf">https://fake-link-to-pdf/</binary>  
  <binary>data:application/msword;base64,UklGRiQAAAB...</binary>
  <binary mimetype="application/octet-stream">https://fake-link-to-binary/</binary>
</message>
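
To make the tag and attribute rules concrete, here is a minimal Python sketch that walks a message like the sample and resolves a MIME type for each item. It is not the Semantic Kernel implementation: `parse_message` and `mime_from_data_uri` are hypothetical names, and letting the data URI take precedence when both it and a mimetype attribute are present is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Tags named in the description above.
CONTENT_TAGS = {"text", "image", "audio", "binary"}


def mime_from_data_uri(value: str) -> Optional[str]:
    """Return the MIME type embedded in a data URI, or None for other URIs."""
    if not value.startswith("data:"):
        return None
    header = value[len("data:"):].split(",", 1)[0]
    return header.split(";", 1)[0] or None


def parse_message(xml_text: str) -> List[Tuple[str, Optional[str], str]]:
    """Return (tag, mime_type, value) for each recognized child element."""
    root = ET.fromstring(xml_text)
    items: List[Tuple[str, Optional[str], str]] = []
    # Iterating child elements skips loose text between tags, which matches
    # the "discarded upon parsing" behavior shown in the sample.
    for child in root:
        if child.tag not in CONTENT_TAGS:
            continue
        value = (child.text or "").strip()
        if child.tag == "text":
            items.append(("text", None, value))
            continue
        # Assumption: the data URI wins; the attribute covers plain URIs.
        mime = mime_from_data_uri(value) or child.get("mimetype")
        items.append((child.tag, mime, value))
    return items


if __name__ == "__main__":
    sample = """<message role='user'>
      This part will be discarded upon parsing
      <text>Summarize all the contents I provided in this message.</text>
      <image mimetype="image/png">https://fake-link-to-image/</image>
      <binary mimetype="application/pdf">https://fake-link-to-pdf/</binary>
    </message>"""
    for item in parse_message(sample):
        print(item)
```

A plain URI without a mimetype attribute yields `None` here, leaving the connector to decide how to treat content of unknown type.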

Contribution Checklist

@rogerbarreto rogerbarreto changed the title .Net: Add support for audio, pdf, doc, and docx to chat prompt parser .Net: Add support for audio and binary tags to chat prompt parser Jun 4, 2025
@rogerbarreto rogerbarreto added this pull request to the merge queue Jun 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 5, 2025
@rogerbarreto rogerbarreto added this pull request to the merge queue Jun 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 5, 2025
@rogerbarreto rogerbarreto added this pull request to the merge queue Jun 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 5, 2025
@rogerbarreto rogerbarreto enabled auto-merge June 6, 2025 15:20
@rogerbarreto rogerbarreto added this pull request to the merge queue Jun 6, 2025
Merged via the queue into microsoft:main with commit 5c04bbe Jun 6, 2025
19 checks passed
Labels
ai connector Anything related to AI connectors documentation kernel.core kernel Issues or pull requests impacting the core kernel .NET Issue or Pull requests regarding .NET code
Development

Successfully merging this pull request may close these issues.

Expanding ChatPromptParser to handle other content types
4 participants