Skip to content

Conversation

@indrajit96
Copy link
Contributor

@indrajit96 indrajit96 commented Nov 11, 2025

Overview:

Extends multimodal content array handling in the Rust preprocessor to support tool calling with multimodal requests (images, video, audio).

Details:

  1. Extended may_be_fix_msg_content() (Lines 76-150)
    Previously only handled text-only content arrays (PR feat: Convert message[content] from list to string. #3067)
    Now flattens multimodal arrays by replacing image_url, video_url, audio_url with ,

Where should the reviewer start?

may_be_fix_msg_content() function - Core logic for flattening multimodal content arrays

Testing

Verified Unit tests for may_be_fix_tool_schema and tested multimodal flow with

python3 -m dynamo.vllm --model Qwen/Qwen2.5-VL-7B-Instruct --max-model-len 1000 --custom-jinja-template tool_chat_template_hermes.jinja --dyn-tool-call-parser hermes

Summary by CodeRabbit

  • New Features

    • Added multimodal content support in prompts with proper handling of images, videos, and audio via placeholder representation
  • Improvements

    • Enhanced prompt template selection logic, prioritizing appropriate templates when tools are integrated into prompts
    • Optimized message content preprocessing for better multimodal content handling

Signed-off-by: Indrajit Bhosale <[email protected]>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 11, 2025
@indrajit96 indrajit96 changed the title Update may_be_fix_tool_schema to fix tool calling feat: Update may_be_fix_tool_schema to enable multimodal tool calling Nov 11, 2025
@github-actions github-actions bot added the feat label Nov 11, 2025
Signed-off-by: Indrajit Bhosale <[email protected]>
Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 indrajit96 marked this pull request as ready for review November 11, 2025 20:10
@indrajit96 indrajit96 requested a review from a team as a code owner November 11, 2025 20:10
@indrajit96 indrajit96 changed the title feat: Update may_be_fix_tool_schema to enable multimodal tool calling feat: Enable multimodal tool calling Nov 11, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 11, 2025

Walkthrough

The change reworks message content preprocessing in the OpenAI prompt template formatter to flatten content arrays into single strings with multimodal placeholders, and adjusts template selection logic to prioritize tool-use templates when tools are present.

Changes

Cohort / File(s) Summary
OpenAI prompt template content preprocessing
lib/llm/src/preprocessor/prompt/template/oai.rs
Reworked may_be_fix_msg_content to flatten content arrays into single strings; text-only arrays concatenated with newlines, multimodal arrays interleaved with placeholders (<image>, <video>, <audio>). Modified OAIPromptFormatter.render template selection to prioritize tool_use template when tools present, falling back to default. Updated test expectations and comments to reflect flattened-content behavior and placeholder usage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Content flattening logic requires verification of string formatting and placeholder insertion semantics
  • Template selection fallback logic modification needs careful review for error handling paths
  • Test expectation changes should be validated against new multimodal placeholder behavior
  • Single-file scope reduces overall complexity, but logic density is non-trivial

Poem

🐰 Arrays dissolve to strings so neat,
With placeholders dancing, oh what a treat!
Images, videos in angle-bracketed grace,
Tools get first pick in their special place,
"Flatten and flow," the rabbit did say,
"Simpler content in every way!"

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: Enable multimodal tool calling' directly summarizes the main change—extending multimodal content handling for tool calling—which aligns with the core purpose described in the PR objectives.
Description check ✅ Passed The PR description covers the required template sections: Overview explains the purpose, Details describe specific changes with line numbers and examples, and it identifies where reviewers should start. However, the 'Related Issues' section is missing.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f86898 and 77bf213.

📒 Files selected for processing (1)
  • lib/llm/src/preprocessor/prompt/template/oai.rs (9 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.

Applied to files:

  • lib/llm/src/preprocessor/prompt/template/oai.rs
🧬 Code graph analysis (1)
lib/llm/src/preprocessor/prompt/template/oai.rs (2)
lib/llm/src/preprocessor/prompt/template/formatters.rs (1)
  • new (30-107)
lib/llm/src/preprocessor/prompt.rs (1)
  • messages (51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (.)
🔇 Additional comments (1)
lib/llm/src/preprocessor/prompt/template/oai.rs (1)

140-144: Fix normalize_tool_arguments_with_multimodal_content expectations

Line 141 rewrites any non-empty content array into a String, so request.messages() now returns "Listen to this: <audio> …" instead of the original array. However, test_normalize_tool_arguments_with_multimodal_content (Lines 924-934) still asserts that messages[0]["content"] is an array of length 3, so the test will now fail. Please update that test to expect the flattened placeholder string (and similarly adjust the second scenario to match "Check this: <video> Interesting?").

⛔ Skipped due to learnings
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Learnt from: zhongdaor-nv
Repo: ai-dynamo/dynamo PR: 2999
File: lib/parsers/src/tool_calling/harmony/harmony_parser.rs:250-256
Timestamp: 2025-09-10T22:32:12.978Z
Learning: In lib/parsers/src/tool_calling/harmony/harmony_parser.rs, the team prefers to maintain identical code patterns between parse_tool_calls_harmony and parse_tool_calls_harmony_complete functions, including message.content[0] indexing, to ensure consistency between streaming and complete parser implementations.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

for (idx, part) in content_array.iter().enumerate() {
if idx > 0 {
// Use newline for text-only, space for multimodal
content_string.push(if is_text_only { '\n' } else { ' ' });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: if you have two text portions back to back - would you have a newline in between those?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants