Dynamo 0.6.0 Release Notes

Dynamo v0.6.0 strengthens production readiness with comprehensive fault tolerance and observability capabilities, advanced Kubernetes deployment infrastructure, and a much improved developer experience, with better documentation and a more unified experience across the supported LLM inference engines (see Support Matrix for details):

NVIDIA TRT-LLM
vLLM
SGLang

Fault Tolerance & Observability: Request cancellation across all backends ensures clean resource cleanup and prevents resource leaks. Coordinated shutdown processes eliminate VRAM leaks, while automatic worker inhibition prevents cascading failures. Unified metrics collection, OTEL/Tempo distributed tracing, and audit logging provide complete request visibility. Troubleshoot issues faster with end-to-end tracking across all processes and real-time system monitoring.

Developer Experience & Deployment Infrastructure: Dev containers for all frameworks (vLLM, SGLang, TensorRT-LLM) streamline local development. An overhaul of our documentation makes the user path more consistent across the different frameworks. Custom chat templates and comprehensive documentation guides accelerate time to production. Multi-node Kubernetes examples demonstrate proper startup ordering, and ARM64 support enables Dynamo deployment across an even larger set of hardware configurations. Automated planner integration and cluster-wide operator installation simplify deployments at scale. The SLA-aware planner with automated profiling optimizes resource allocation, while prefill-aware routing across all backends improves efficiency. The enhanced KV Block Manager adds prefill/decode disaggregation, disk offloading with access pattern filtering, and comprehensive metrics for fine-grained control.

Major Features and Improvements

1. Performance and Framework Support

Published recipes, performance sweeps, and benchmarks on InferenceMax, showcasing performance gains and TCO benefits from using Dynamo
Improved performance with a Rayon compute pool for CPU-intensive operations (#2969), faster snapshots through reverse lookup (#3370), and event-driven metrics updates in request processing (#3207).
Added ability to run without etcd (#2281) for simplified deployments in controlled environments.
Added custom chat templates (#3165, #3332, #3362).
Parsers library with JSON-based parsers, parallel tool calling (#3188), reasoning transformation (#3295), and GPT-OSS reasoning integration (#3321)
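
To make the idea of JSON-based parsing with parallel tool calls concrete, here is a minimal sketch in plain Python. It is not the Dynamo parsers library: the function name `parse_tool_calls` and the `name`/`arguments` field layout are assumptions chosen for illustration, and real model output formats vary by model.

```python
import json

def parse_tool_calls(text):
    """Hypothetical helper: extract (name, arguments) pairs from model output.

    Accepts either a single JSON object (one call) or a JSON array
    (several calls issued in parallel). Returns [] for anything else.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return []                      # plain text, not a tool call
    if isinstance(payload, dict):
        payload = [payload]            # normalize a single call to a list
    if not isinstance(payload, list):
        return []
    calls = []
    for item in payload:
        # Keep only well-formed entries that carry a string tool name.
        if isinstance(item, dict) and isinstance(item.get("name"), str):
            calls.append((item["name"], item.get("arguments", {})))
    return calls

# Two calls emitted in one response (assumed format).
raw = '[{"name": "get_weather", "arguments": {"city": "Paris"}}, {"name": "get_time", "arguments": {"tz": "CET"}}]'
calls = parse_tool_calls(raw)
```

Normalizing a single object into a one-element list lets the caller treat single and parallel calls the same way.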

2. Fault Tolerance & Observability

Implemented request cancellation across all backends (vLLM #3465, TensorRT-LLM #3193, SGLang #3465) enabling clean resource cleanup and preventing resource leaks
Implemented coordinated shutdown processes (#3481, #3513) with SIGINT/SIGTERM handling and vLLM engine cleanup (#2898) to prevent VRAM leaks and ensure clean service restarts
Unified metrics collection with cross-process instrumentation (#2243), Python Metrics Registry (#3341), and OTEL/Tempo visualization (#3307, #3160)
Added distributed tracing context support to Python bindings (#3160) and OTEL exporter with Tempo visualization (#3307) for end-to-end request tracking
Improved error messaging (#3587, #3210, #3549) to quickly identify and resolve deployment issues.
Standardized Prometheus naming conventions (#3035), added SGLang/vLLM passthrough metrics (#3539), custom NIM backend metrics (#3266), and configurable histogram buckets (#3562).
Implemented audit logging for chat completions (#3062) and comprehensive system status tracking with uptime monitoring (#2354, #3411)
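
The link between request cancellation and leak-free cleanup can be illustrated with a small asyncio sketch. This is a generic pattern, not Dynamo's implementation: `FakeEngine` and its `active` set are hypothetical stand-ins for an engine and the resources it holds per request.

```python
import asyncio

class FakeEngine:
    """Hypothetical engine stand-in that tracks requests holding resources."""

    def __init__(self):
        self.active = set()

    async def generate(self, request_id, num_tokens, out):
        self.active.add(request_id)            # resources acquired
        try:
            for i in range(num_tokens):
                await asyncio.sleep(0.01)      # simulate one decode step
                out.append(f"tok{i}")
        finally:
            # Runs on normal completion and on cancellation alike,
            # so the request never stays registered after it stops.
            self.active.discard(request_id)

async def main():
    engine = FakeEngine()
    out = []
    task = asyncio.create_task(engine.generate("req-1", 1000, out))
    await asyncio.sleep(0.05)                  # let a few tokens stream
    task.cancel()                              # e.g. client disconnected
    try:
        await task
    except asyncio.CancelledError:
        pass                                   # expected after cancel()
    return engine, out

engine, out = asyncio.run(main())
```

A shutdown path can reuse the same mechanism: a SIGINT/SIGTERM handler that cancels all in-flight tasks and awaits them gives every request a chance to run its `finally` block before the process exits.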

3. Kubernetes Deployment

Deployed multi-node examples for TensorRT-LLM, vLLM, and SGLang (#3100, #3462) with startup ordering, resource coordination, and multinode operator behavior documentation (#3506).
Enabled ARM64 build support (#3146) for broader deployment compatibility
Added tolerations support (#2445), custom annotations, vLLM compilation cache (#3257), and improved Grove integration.
Enhanced conditional backend workflows (#3141), operator build per-commit (#3712), and improved container build metrics (#3461).
Implemented cluster-wide operator installation (#3199) and improved CRD documentation (#3504) for better security and management.
Automated planner deployment in Kubernetes operator with cluster-wide service account setup (#3520).
Switched the K8s fault tolerance (FT) tests from genai-perf to AIPerf (#3289) and added a legacy client alongside AIPerf so the FT tests support both testing modes (#3415).

4. KV Block Manager

Enabled prefill/decode (PD) disaggregation in vLLM (#3352).
Added disk offloading filtering (#3532) to selectively offload blocks based on access patterns and extend SSD lifespan.
Added KVBM metrics (#3561) for cache hit analysis to optimize memory utilization, and FullyContiguous layouts (#3090) for efficient data transfer.
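
One way to picture access-pattern filtering for disk offload is a frequency gate: a block is written to disk only after it has been seen often enough to suggest reuse, which avoids spending SSD writes on single-use blocks. The sketch below assumes that policy; `OffloadFilter` and `min_hits` are hypothetical names, and the actual KVBM filter may use different criteria.

```python
from collections import Counter

class OffloadFilter:
    """Hypothetical frequency filter for deciding which blocks go to disk."""

    def __init__(self, min_hits=2):
        self.min_hits = min_hits       # accesses required before offloading
        self.hits = Counter()          # block hash -> access count

    def record_access(self, block_hash):
        self.hits[block_hash] += 1

    def should_offload(self, block_hash):
        # Blocks seen fewer than min_hits times are treated as one-off
        # traffic and are dropped instead of being written to SSD.
        return self.hits[block_hash] >= self.min_hits

f = OffloadFilter(min_hits=2)
for h in ["a", "b", "a", "c", "a"]:
    f.record_access(h)

to_disk = [h for h in ["a", "b", "c"] if f.should_offload(h)]
```

Here only block "a" passes the gate, so the two single-use blocks never cost an SSD write.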

5. Scheduling

Planner

Deployed SLA planner with automated parallelization mapping, pre-deployment profiling (#3441), and performance prediction to optimize resource allocation.
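
As a rough illustration of how profiling data can drive SLA-based planning, the sketch below picks the parallelization setting that meets latency targets with the best per-GPU throughput, then sizes the replica count for a given load. All names, fields, and numbers are made up for the example, and this is not the planner's actual algorithm.

```python
import math

def pick_config(profiles, ttft_sla_ms, itl_sla_ms):
    """Return the profiled config that meets both SLAs with the highest
    per-GPU throughput, or None if no config qualifies."""
    ok = [p for p in profiles
          if p["ttft_ms"] <= ttft_sla_ms and p["itl_ms"] <= itl_sla_ms]
    if not ok:
        return None
    return max(ok, key=lambda p: p["tokens_per_s_per_gpu"])

def replicas_for(load_tokens_per_s, cfg):
    """Replicas needed to serve the load, each replica using cfg["tp"] GPUs."""
    per_replica = cfg["tokens_per_s_per_gpu"] * cfg["tp"]
    return max(1, math.ceil(load_tokens_per_s / per_replica))

# Illustrative pre-deployment profiling results (invented values).
profiles = [
    {"tp": 1, "ttft_ms": 400, "itl_ms": 30, "tokens_per_s_per_gpu": 900},
    {"tp": 2, "ttft_ms": 220, "itl_ms": 18, "tokens_per_s_per_gpu": 700},
    {"tp": 4, "ttft_ms": 130, "itl_ms": 12, "tokens_per_s_per_gpu": 450},
]

cfg = pick_config(profiles, ttft_sla_ms=250, itl_sla_ms=20)
replicas = replicas_for(5000, cfg)
```

With these numbers, tp=1 is rejected for missing the latency targets, tp=2 wins on per-GPU throughput, and a 5000 tokens/s load needs 4 replicas.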

Router

Implemented approximate KV routing and prefill-aware routing; enabled prefill router support across all backends (#3401, #3329, #3471, #3498).
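
A common way to reason about KV-aware routing is as a cost trade-off between cache reuse and worker load. The sketch below assumes a simple cost of "blocks that still need prefill plus weighted active requests"; the function names, the worker dictionary layout, and `load_weight` are hypothetical, and the Dynamo router's actual scoring may differ.

```python
def prefix_overlap(request_blocks, cached_blocks):
    """Count the leading request blocks already cached on a worker."""
    n = 0
    for b in request_blocks:
        if b not in cached_blocks:
            break                      # reuse only applies to a contiguous prefix
        n += 1
    return n

def pick_worker(request_blocks, workers, load_weight=1.0):
    """workers: {name: {"cached": set of block hashes, "active": int}}"""
    def cost(name):
        w = workers[name]
        new_blocks = len(request_blocks) - prefix_overlap(request_blocks, w["cached"])
        return new_blocks + load_weight * w["active"]
    # Sort names first so ties resolve deterministically.
    return min(sorted(workers), key=cost)

workers = {
    "w0": {"cached": {"h1", "h2", "h3"}, "active": 2},
    "w1": {"cached": set(), "active": 0},
    "w2": {"cached": {"h1"}, "active": 1},
}
request = ["h1", "h2", "h3", "h4"]

choice = pick_worker(request, workers)                       # cache reuse wins
busy_choice = pick_worker(request, workers, load_weight=3.0) # load dominates
```

With the default weight, the worker holding most of the prefix is chosen despite its load; raising `load_weight` shifts the decision toward the idle worker.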

6. User Experience and Documentation

Added comprehensive guides and documentation: tool calling (#2866), KVBM (#3578, #3759), vLLM KVBM 2P2D example (#3526), KV Router A/B testing guide (#3742), standalone routing Python bindings (#3308), multinode operator behavior (#3506), planner quickstart guide (#3358), deployment examples, and API references with OpenAPI routes (#3480). Added GPU details for model recipes (#3707).
Reorganized the documentation (#3756, #3440) and added a version switcher (#3711). Restructured source code for better packaging (#3201), added component auto-discovery (#3348), fixed non-editable installs (#3478), and improved cross-platform compatibility (#3044).
Added dev containers for vLLM, SGLang, and TensorRT-LLM to streamline local development (#3228, #3576).
Added framework-specific test markers (#2561) and unit tests for custom Jinja templates for vLLM, TRT-LLM, and SGLang (#3165, #3332, #3362, #3472).

7. Others:

Tool Calling & Reasoning: Added reasoning parser transformation to extract reasoning tokens (#3295) and an end-to-end test for reasoning_effort on gpt-oss to validate reasoning modes (#3421).
Multimodal: Added multimodal EPD for SGLang to support image inputs (#3230).

8. Bug Fixes

Fix TensorRT-LLM multinode command #3311, #3373
Remove VSWA user prompts #3404
Fixed circular rust dependencies (#3609), corrected commit info copying (#3670), resolved CUDA lock issues (#3704), and improved container security (#3367).
Update distroless go container for openssl #3486
Attach dynamo namespace label to Grove PodCliques #3359
Router registration to etcd #3302

Framework Updates

Upgrade NIXL to 0.6.0 #3550
Upgrade vLLM to 0.11.0 #3422
Upgrade SGLang container and version #3647

Known Issues

Building the Dynamo custom end-point picker (EPP) fails because of an incorrect file path for the patch
DS-R1 recipe fails due to the SLA profiling process crashing
GPT-OSS recipe does not work out-of-the-box

What's Next

Following this 0.6.0 release and a minor 0.6.1 release, our next major release (0.7.0) will continue to prioritize improving performance across top models, ensuring robust fault tolerance and observability in production scenarios, adding more backend integrations, supporting production-grade cache management, developing smarter routing and scheduling strategies, and providing a developer-friendly UX.

We'd love to hear your feedback and comments - please open up an issue for any feature requests and chat with us on Discord!

Release Assets

Python Wheels:

Rust Crates:

Containers:

TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0 NGC
vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 NGC
SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0 NGC
Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.6.0 NGC

Helm Charts:

Contributors

Big thanks to everyone who contributed to this release, including new contributors to the repo:

Full Changelog: v0.5.1...v0.6.0
