feat: Enable multimodal tool calling #4232

indrajit96 · 2025-11-11T07:17:19Z

Overview:

Extends multimodal content array handling in the Rust preprocessor to support tool calling with multimodal requests (images, video, audio).

Details:

Extended may_be_fix_msg_content() (Lines 76-150)
Previously only handled text-only content arrays (PR feat: Convert message[content] from list to string. #3067)
Now flattens multimodal arrays by replacing image_url, video_url, audio_url with ,

Where should the reviewer start?

may_be_fix_msg_content() function - Core logic for flattening multimodal content arrays

Testing

Verified Unit tests for may_be_fix_tool_schema and tested multimodal flow with

python3 -m dynamo.vllm --model Qwen/Qwen2.5-VL-7B-Instruct --max-model-len 1000 --custom-jinja-template tool_chat_template_hermes.jinja --dyn-tool-call-parser hermes

Summary by CodeRabbit

New Features
- Added multimodal content support in prompts with proper handling of images, videos, and audio via placeholder representation
Improvements
- Enhanced prompt template selection logic, prioritizing appropriate templates when tools are integrated into prompts
- Optimized message content preprocessing for better multimodal content handling

Signed-off-by: Indrajit Bhosale <[email protected]>

coderabbitai · 2025-11-11T20:14:24Z

Walkthrough

The change reworks message content preprocessing in the OpenAI prompt template formatter to flatten content arrays into single strings with multimodal placeholders, and adjusts template selection logic to prioritize tool-use templates when tools are present.

Changes

Cohort / File(s)	Summary
OpenAI prompt template content preprocessing `lib/llm/src/preprocessor/prompt/template/oai.rs`	Reworked `may_be_fix_msg_content` to flatten content arrays into single strings; text-only arrays concatenated with newlines, multimodal arrays interleaved with placeholders (`<image>`, `<video>`, `<audio>`). Modified `OAIPromptFormatter.render` template selection to prioritize `tool_use` template when tools present, falling back to default. Updated test expectations and comments to reflect flattened-content behavior and placeholder usage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

Content flattening logic requires verification of string formatting and placeholder insertion semantics
Template selection fallback logic modification needs careful review for error handling paths
Test expectation changes should be validated against new multimodal placeholder behavior
Single-file scope reduces overall complexity, but logic density is non-trivial

Poem

🐰 Arrays dissolve to strings so neat,
With placeholders dancing, oh what a treat!
Images, videos in angle-bracketed grace,
Tools get first pick in their special place,
"Flatten and flow," the rabbit did say,
"Simpler content in every way!" ✨

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: Enable multimodal tool calling' directly summarizes the main change—extending multimodal content handling for tool calling—which aligns with the core purpose described in the PR objectives.
Description check	✅ Passed	The PR description covers the required template sections: Overview explains the purpose, Details describe specific changes with line numbers and examples, and it identifies where reviewers should start. However, the 'Related Issues' section is missing.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f86898 and 77bf213.

📒 Files selected for processing (1)

lib/llm/src/preprocessor/prompt/template/oai.rs (9 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.

📚 Learning: 2025-09-16T19:47:30.312Z

Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.

Applied to files:

lib/llm/src/preprocessor/prompt/template/oai.rs

🧬 Code graph analysis (1)

lib/llm/src/preprocessor/prompt/template/oai.rs (2)

lib/llm/src/preprocessor/prompt/template/formatters.rs (1)

new (30-107)

lib/llm/src/preprocessor/prompt.rs (1)

messages (51-51)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)

GitHub Check: trtllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: operator (amd64)
GitHub Check: trtllm (amd64)
GitHub Check: vllm (arm64)
GitHub Check: vllm (amd64)
GitHub Check: operator (arm64)
GitHub Check: Build and Test - dynamo
GitHub Check: Mirror Repository to GitLab
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: clippy (.)
GitHub Check: tests (lib/runtime/examples)
GitHub Check: tests (launch/dynamo-run)
GitHub Check: tests (lib/bindings/python)
GitHub Check: tests (.)

🔇 Additional comments (1)

lib/llm/src/preprocessor/prompt/template/oai.rs (1)
140-144: Fix normalize_tool_arguments_with_multimodal_content expectations

Line 141 rewrites any non-empty content array into a String, so request.messages() now returns "Listen to this: <audio> …" instead of the original array. However, test_normalize_tool_arguments_with_multimodal_content (Lines 924-934) still asserts that messages[0]["content"] is an array of length 3, so the test will now fail. Please update that test to expect the flattened placeholder string (and similarly adjust the second scenario to match "Check this: <video> Interesting?").
⛔ Skipped due to learnings
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Learnt from: zhongdaor-nv
Repo: ai-dynamo/dynamo PR: 2999
File: lib/parsers/src/tool_calling/harmony/harmony_parser.rs:250-256
Timestamp: 2025-09-10T22:32:12.978Z
Learning: In lib/parsers/src/tool_calling/harmony/harmony_parser.rs, the team prefers to maintain identical code patterns between parse_tool_calls_harmony and parse_tool_calls_harmony_complete functions, including message.content[0] indexing, to ensure consistency between streaming and complete parser implementations.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

nnshah1 · 2025-11-11T20:41:13Z

lib/llm/src/preprocessor/prompt/template/oai.rs

+                        for (idx, part) in content_array.iter().enumerate() {
+                            if idx > 0 {
+                                // Use newline for text-only, space for multimodal
+                                content_string.push(if is_text_only { '\n' } else { ' ' });


question: if you have two text portions back to back - would you have a newline in between those?

Update may_be_fix_tool_schema to fix tool calling

27fefec

Signed-off-by: Indrajit Bhosale <[email protected]>

pull-request-size bot added the size/M label Nov 11, 2025

Revert Text changes

a6e2af3

Signed-off-by: Indrajit Bhosale <[email protected]>

pull-request-size bot added size/L and removed size/M labels Nov 11, 2025

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 20:00 Inactive

indrajit96 changed the title ~~Update may_be_fix_tool_schema to fix tool calling~~ feat: Update may_be_fix_tool_schema to enable multimodal tool calling Nov 11, 2025

github-actions bot added the feat label Nov 11, 2025

Revert Text changes

233f0aa

Signed-off-by: Indrajit Bhosale <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 20:04 Inactive

indrajit96 requested review from KrishnanPrash, rmccorm4 and ryan-lempka November 11, 2025 20:05

Rust FMT

77bf213

Signed-off-by: Indrajit Bhosale <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 20:09 Inactive

indrajit96 marked this pull request as ready for review November 11, 2025 20:10

indrajit96 requested a review from a team as a code owner November 11, 2025 20:10

indrajit96 changed the title ~~feat: Update may_be_fix_tool_schema to enable multimodal tool calling~~ feat: Enable multimodal tool calling Nov 11, 2025

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 20:14 Inactive

nnshah1 reviewed Nov 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Enable multimodal tool calling #4232

feat: Enable multimodal tool calling #4232

Uh oh!

indrajit96 commented Nov 11, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 11, 2025

Uh oh!

nnshah1 Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: Enable multimodal tool calling #4232

Are you sure you want to change the base?

feat: Enable multimodal tool calling #4232

Uh oh!

Conversation

indrajit96 commented Nov 11, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 11, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

nnshah1 Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

indrajit96 commented Nov 11, 2025 •

edited by coderabbitai bot

Loading