v0.6.0

@saturley-hall saturley-hall released this 28 Oct 12:57
e02605b

Dynamo 0.6.0 Release Notes

Dynamo v0.6.0 strengthens Dynamo's production readiness with comprehensive fault tolerance and observability capabilities, advanced Kubernetes deployment infrastructure, and a much-improved developer experience, with better documentation and a more unified experience across the supported LLM inference engines (see Support Matrix for details):

  • NVIDIA TRT-LLM
  • vLLM
  • SGLang

Fault Tolerance & Observability: Request cancellation across all backends ensures clean resource cleanup and prevents resource leaks. Coordinated shutdown processes eliminate VRAM leaks, while automatic worker inhibition prevents cascading failures. Unified metrics collection, OTEL/Tempo distributed tracing, and audit logging provide complete request visibility. Troubleshoot issues faster with end-to-end tracking across all processes and real-time system monitoring.

Developer Experience & Deployment Infrastructure: Dev containers for all frameworks (vLLM, SGLang, TensorRT-LLM) streamline local development. An overhaul of our documentation makes the user path more consistent across the different frameworks. Custom chat templates and comprehensive documentation guides accelerate time to production. Multi-node Kubernetes examples demonstrate proper startup ordering, and ARM64 support enables Dynamo deployment across an even larger set of hardware configurations. Automated planner integration and cluster-wide operator installation simplify deployments at scale. The SLA-aware planner with automated profiling optimizes resource allocation, while prefill-aware routing across all backends improves efficiency. The enhanced KV Block Manager adds prefill/decode disaggregation, disk offloading with access pattern filtering, and comprehensive metrics for fine-grained control.


Major Features and Improvements

1. Performance and Framework Support

  • Published recipes, performance sweeps, and benchmarks on InferenceMax, showcasing performance gains and TCO benefits from using Dynamo
  • Added Rayon compute pool for CPU-intensive operations (#2969), improved snapshot performance with reverse lookup (#3370), and optimized request processing with event-driven metrics updates (#3207).
  • Added ability to run without etcd (#2281) for simplified deployments in controlled environments.
  • Added custom chat templates (#3165, #3332, #3362).
  • Added a parsers library with JSON-based parsers, parallel tool calling (#3188), reasoning transformation (#3295), and GPT-OSS reasoning integration (#3321).

2. Fault Tolerance & Observability

  • Implemented request cancellation across all backends (vLLM #3465, TensorRT-LLM #3193, SGLang #3465) enabling clean resource cleanup and preventing resource leaks
  • Implemented coordinated shutdown processes (#3481, #3513) with SIGINT/SIGTERM handling and vLLM engine cleanup (#2898) to prevent VRAM leaks and ensure clean service restarts
  • Unified metrics collection with cross-process instrumentation (#2243), Python Metrics Registry (#3341), and OTEL/Tempo visualization (#3307, #3160)
  • Added distributed tracing context support to Python bindings (#3160) and OTEL exporter with Tempo visualization (#3307) for end-to-end request tracking
  • Improved error messaging (#3587, #3210, #3549) to quickly identify and resolve deployment issues.
  • Standardized Prometheus naming conventions (#3035), added SGLang/vLLM passthrough metrics (#3539), custom NIM backend metrics (#3266), and configurable histogram buckets (#3562).
  • Implemented audit logging for chat completions (#3062) and comprehensive system status tracking with uptime monitoring (#2354, #3411)
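The cancellation and coordinated-shutdown items above can be pictured with a small, generic sketch. This is not Dynamo's implementation: the names `handle_request`, `serve`, and `main` are hypothetical, and the example only assumes the general pattern the notes describe, where SIGINT/SIGTERM triggers cancellation of in-flight requests and each request releases its resources on the way out.

```python
import asyncio
import signal


async def handle_request(request_id: int, released: list) -> None:
    """Hypothetical in-flight request that holds a resource until it ends."""
    try:
        await asyncio.sleep(3600)  # stand-in for long-running generation
    finally:
        # Runs on normal completion and on cancellation alike,
        # so the resource is released in both cases.
        released.append(request_id)


async def serve(stop: asyncio.Event, released: list, n_requests: int = 3) -> None:
    """Run requests until `stop` is set, then cancel and drain them."""
    tasks = [
        asyncio.create_task(handle_request(i, released))
        for i in range(n_requests)
    ]
    await stop.wait()
    for task in tasks:
        task.cancel()  # propagate cancellation to every in-flight request
    # Wait for all cleanup handlers to finish before returning.
    await asyncio.gather(*tasks, return_exceptions=True)


async def main() -> list:
    """Wire SIGINT/SIGTERM to the stop event (Unix event loops only)."""
    stop = asyncio.Event()
    released: list = []
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)
    await serve(stop, released)
    return released
```

Running `asyncio.run(main())` and pressing Ctrl-C would cancel all three simulated requests and return only after each has run its cleanup step. The point of the design is that shutdown waits for cleanup rather than killing the process while resources are still held.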

3. Kubernetes Deployment

  • Deployed multi-node examples for TensorRT-LLM, vLLM, and SGLang (#3100, #3462) with startup ordering, resource coordination, and multinode operator behavior documentation (#3506).
  • Enabled ARM64 build support (#3146) for broader deployment compatibility
  • Added tolerations support (#2445), custom annotations, vLLM compilation cache (#3257), and improved Grove integration.
  • Enhanced conditional backend workflows (#3141), operator build per-commit (#3712), and improved container build metrics (#3461).
  • Implemented cluster-wide operator installation (#3199), and improved CRD documentation (#3504) for improved security and management
  • Automated planner deployment in Kubernetes operator with cluster-wide service account setup (#3520).
  • Replaced genai-perf with AIPerf for K8s fault tolerance (FT) tests (#3289); added legacy client with AIPerf for FT tests to support both testing modes (#3415).

4. KV Block Manager

  • Enabled prefill/decode (PD) disaggregation in vLLM (#3352).
  • Added disk offloading filtering (#3532), which selectively offloads blocks based on access patterns to extend SSD lifespan.
  • Added KVBM metrics (#3561) for cache hit analysis to optimize memory utilization; added FullyContiguous layouts (#3090) for optimal data transfer efficiency.
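To illustrate the idea behind access-pattern filtering for disk offload, here is a minimal sketch. It assumes a simple frequency threshold, which is one possible policy and not necessarily the one KVBM uses; the class `OffloadFilter` and its parameter `min_hits` are hypothetical names introduced for this example.

```python
from collections import Counter


class OffloadFilter:
    """Admit a block to disk only after it has been accessed `min_hits` times.

    Assumption for illustration: blocks touched only once are unlikely to be
    reused, so writing them to SSD costs write endurance for little benefit.
    """

    def __init__(self, min_hits: int = 2) -> None:
        self.min_hits = min_hits
        self.hits: Counter = Counter()

    def record_access(self, block_hash: str) -> None:
        """Count one access to the block identified by `block_hash`."""
        self.hits[block_hash] += 1

    def should_offload(self, block_hash: str) -> bool:
        """Return True once the block has been seen often enough."""
        return self.hits[block_hash] >= self.min_hits
```

With `min_hits=2`, a block is skipped the first time it is seen and becomes eligible for offload on its second access, so one-off blocks never reach the disk.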

5. Scheduling

Planner

  • Deployed SLA planner with automated parallelization mapping, pre-deployment profiling (#3441), and performance prediction to optimize resource allocation.

Router

  • Implemented approximate KV routing and prefill-aware routing; enabled prefill router support across all backends (#3401, #3329, #3471, #3498).
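As a rough illustration of what KV-aware routing means, the sketch below picks the worker with the lowest estimated cost, where the cost combines the prefill work that is not already cached with the worker's current load. The `Worker` type, the `pick_worker` function, the `prefill_weight` parameter, and the cost formula are all assumptions made for this example; they are not Dynamo's router API or its actual scoring function.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of a worker as seen by a router."""
    name: str
    cached_blocks: set = field(default_factory=set)  # block hashes in KV cache
    active_blocks: int = 0                           # proxy for current load


def prefix_overlap(request_blocks: list, cached: set) -> int:
    """Count leading request blocks that are already cached on a worker."""
    n = 0
    for block in request_blocks:
        if block not in cached:
            break
        n += 1
    return n


def pick_worker(workers: list, request_blocks: list,
                prefill_weight: float = 1.0) -> Worker:
    """Choose the worker with the lowest (new prefill work + load) cost."""
    def cost(worker: Worker) -> float:
        overlap = prefix_overlap(request_blocks, worker.cached_blocks)
        new_prefill = len(request_blocks) - overlap
        return prefill_weight * new_prefill + worker.active_blocks

    return min(workers, key=cost)
```

For a request with blocks `["h1", "h2", "h3"]`, a lightly loaded worker that already caches `h1` and `h2` wins over an idle worker with an empty cache, because it only needs to prefill one block. If the caching worker becomes heavily loaded, the idle worker wins instead.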

6. User Experience and Documentation

  • Added comprehensive guides and documentation: tool calling (#2866), KVBM (#3578, #3759), vLLM KVBM 2P2D example (#3526), KV Router A/B testing guide (#3742), standalone routing Python bindings (#3308), multinode operator behavior (#3506), planner quickstart guide (#3358), deployment examples, and API references with OpenAPI routes (#3480); added GPU details for model recipes (#3707).
  • Major documentation reorganization (#3756, #3440), added version switcher (#3711); Restructured source code for better packaging (#3201), added component auto-discovery (#3348), fixed non-editable installs (#3478), and improved cross-platform compatibility (#3044).
  • Dev container for vLLM, SGLang, and TensorRT-LLM to streamline local development [#3228, #3576]
  • Added framework-specific test markers (#2561), and unit tests for custom jinja templates for vLLM, TRT-LLM and SGLang [#3165, #3332, #3362, #3472]

7. Others

  • Tool Calling & Reasoning: Reasoning parser transformation to extract reasoning tokens (#3295); E2E test for reasoning_effort on gpt-oss to validate reasoning modes (#3421)
  • Multimodal: Multimodal EPD for SGLang to support image inputs (#3230)

8. Bug Fixes

  • Fix TensorRT-LLM multinode command #3311, #3373
  • Remove VSWA user prompts #3404
  • Fixed circular rust dependencies (#3609), corrected commit info copying (#3670), resolved CUDA lock issues (#3704), and improved container security (#3367).
  • Update distroless go container for openssl #3486
  • Attach dynamo namespace label to Grove PodCliques #3359
  • Router registration to etcd #3302

Framework Updates

  • Upgrade NIXL to 0.6.0 #3550
  • Upgrade vLLM to 0.11.0 #3422
  • Upgrade SGLang container and version #3647

--

Known Issues

  • Building the Dynamo custom end-point picker (EPP) fails due to the incorrect filepath for patch
  • DS-R1 recipe fails due to the SLA profiling process crashing
  • GPT-OSS recipe does not work out-of-the-box

What's Next

Following this 0.6.0 release and a minor 0.6.1 release, our next major release (0.7.0) will continue to prioritize improving performance across top models, ensuring robust fault tolerance and observability in production scenarios, more backend integrations, production-grade cache management support, smarter routing/scheduling strategies, and a developer-friendly UX.

We'd love to hear your feedback and comments - please open up an issue for any feature requests and chat with us on Discord!


Release Assets

Python Wheels:

Rust Crates:

Containers:

  • TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0 NGC
  • vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 NGC
  • SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0 NGC
  • Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.6.0 NGC

Helm Charts:


Contributors

Big thanks to everyone who contributed to this release, including new contributors to the repo:

Full Changelog: v0.5.1...v0.6.0