
Conversation

@ishandhanani
Contributor

@ishandhanani ishandhanani commented Nov 10, 2025

  1. Updates to 0.5.4.post3, which contains GB200 FP8 stability fixes. H100 still needs to be tested, and commands might need to change.
  2. CU13 support in the Dockerfile.

TODO - work with @nv-tusharma to reconcile. My goal is to

  1. get this PR accepted by NV
  2. contribute this dev + runtime stage PR back upstream

Summary by CodeRabbit

  • New Features

    • Added flexible build arguments for container customization, including kernel versions, package mirrors, and pip indices
    • Enhanced support for multiple CUDA versions (12.6.1, 12.8.1, 12.9.1, 13.0.1)
    • Introduced development and debugging tools integration to the container environment
  • Chores

    • Updated component versions and improved build infrastructure configuration

@ishandhanani ishandhanani requested review from a team as code owners November 10, 2025 00:29
@github-actions github-actions bot added the feat label Nov 10, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 10, 2025

Walkthrough

The Dockerfile receives a major refactoring that introduces configurable build arguments (TRITON_LANG_COMMIT, SGL_VERSION, UBUNTU_MIRROR, etc.), restructures the environment setup, adds dynamic installation workflows for SGLang and its dependencies (NVSHMEM, DeepEP, GDRCopy) with CUDA-version-aware handling, integrates enhanced developer tooling, and reorganizes build-stage orchestration, including secret mounts.

Changes

Cohort / File(s) | Summary

Build Arguments & Configuration (container/Dockerfile.sglang)
Added new public ARGs: TRITON_LANG_COMMIT, BUILD_AND_DOWNLOAD_PARALLEL, PIP_DEFAULT_INDEX, UBUNTU_MIRROR, GITHUB_ARTIFACTORY, SGL_KERNEL_VERSION, SGL_VERSION, USE_LATEST_SGLANG, GDRCOPY_VERSION, NVSHMEM_VERSION, GRACE_BLACKWELL, INSTALL_FLASHINFER_JIT_CACHE, GRACE_BLACKWELL_DEEPEP_BRANCH, DEEPEP_COMMIT. Updated SGL_KERNEL_VERSION to 0.3.16.post5. Removed legacy ARGs (DEEPEP_COMMIT, FLASHMLA_COMMIT, etc.).

Environment & Framework Setup (container/Dockerfile.sglang)
Removed fixed TZ setting. Added Ubuntu mirror handling, locale/language configuration (LANG, LANGUAGE, LC_ALL), and pip/cache behavior environment blocks. Introduced conditional source replacement logic driven by build arguments.

Dependency Installation Workflows (container/Dockerfile.sglang)
Restructured SGLang, NVSHMEM, DeepEP, and GDRCopy installation flows with CUDA-version-aware kernel selection (12.6.1, 12.8.1, 12.9.1, 13.0.1), versioned ARG usage, dynamic artifact retrieval with secret mounts for AWS credentials, and updated build options. Replaced static commits with flexible sourcing.

Developer Tooling & Utilities (container/Dockerfile.sglang)
Added installation of development tools (diff-so-fancy, clang-format, clangd, CMake, oh-my-zsh, zsh plugins, nsight-systems-cli) and Python utilities (pytest, black, isort). Included configuration steps for vim, tmux, gdb, Git/Zsh customization, and deployment scripts.

Build Stage Orchestration (container/Dockerfile.sglang)
Consolidated and reorganized RUN blocks to integrate new ARGs and conditional logic. Refactored GDRCopy/NVSHMEM/DeepEP build steps with secret mounts. Unified SGLang installation with CUDA-aware indexing. Separated production-oriented runtime stage deployment.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Critical areas requiring careful attention:
    • CUDA version-specific kernel and dependency selection logic—verify conditional handling for all targeted versions (12.6.1, 12.8.1, 12.9.1, 13.0.1) and compatibility with PyTorch/Torch versions
    • Secret mount usage for AWS credentials—ensure security implications are understood and that credential handling follows best practices
    • SGLang installation logic branching (USE_LATEST_SGLANG flag)—confirm branch selection, version pinning, and index handling work as intended
    • DeepEP/NVSHMEM/GDRCopy dynamic sourcing and build orchestration—verify version resolution, artifact retrieval paths, and build-time option correctness
    • Build stage dependencies and layer optimization—ensure new tooling additions don't bloat image size unexpectedly
    • Legacy ARG removal—confirm no breaking changes for existing build workflows or downstream consumers

Poem

🐰 A Dockerfile so grand, now deeply reconfigured,
With ARGs that dance and CUDA-aware vigor,
SGLang flows dynamic, tools arrive with grace,
Dev tooling blooms where secrets hide their face—
This container births order from chaos's race! ✨

Pre-merge checks

❌ Failed checks (1 warning)
Description check: ⚠️ Warning
Explanation: The pull request description is incomplete and lacks required sections from the template. It provides minimal details about changes and does not follow the structured format with Overview, Details, Where to start, and Related Issues sections.
Resolution: Expand the description to include all template sections: add a comprehensive Overview, detailed explanation of CU13 support and version bump changes, specify which files to review first (container/Dockerfile.sglang), and link to any related GitHub issues.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The title accurately summarizes the main changes: updating SGLang to 0.5.4.post3 and adding CU13 support in the Dockerfile, which are the primary objectives reflected in both the raw summary and PR objectives.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
container/Dockerfile.sglang (2)

398-421: External tool downloads lack checksums and retry logic; supply chain risk.

Lines 398-421 download multiple tools directly from GitHub/external sources without checksums or retry logic:

  • diff-so-fancy (line 399)
  • clang-format (line 403)
  • clangd (line 407)
  • CMake (line 417)
  • oh-my-zsh (line 437)
  • just (line 453)

If any download fails transiently or a malicious actor compromises the source, the build may silently succeed with missing or corrupted tools. Best practices:

  1. Add checksums (SHA256) for each downloaded binary
  2. Add --retry flags to curl/wget commands
  3. Verify URLs are canonical and use HTTPS
  4. Consider pinning release tags/versions more explicitly

Example remediation:

- RUN curl -LSso /usr/local/bin/diff-so-fancy https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy
+ RUN curl --retry 3 -LSso /usr/local/bin/diff-so-fancy https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy && \
+     echo "expected_sha256_here  /usr/local/bin/diff-so-fancy" | sha256sum -c -

Add checksums and retry logic to external tool downloads for supply chain safety.

Also applies to: 437-439, 453-455
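
For illustration, a fuller sketch of the suggested pattern; the --retry flags are standard curl options, but the version and SHA256 value below are placeholders to be replaced with the real release digest:

# Sketch only: the digest below is a placeholder, not diff-so-fancy's real checksum.
ARG DIFF_SO_FANCY_VERSION=1.4.4
ARG DIFF_SO_FANCY_SHA256=0000000000000000000000000000000000000000000000000000000000000000
RUN curl --retry 3 --retry-delay 2 -fLSso /usr/local/bin/diff-so-fancy \
        "https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v${DIFF_SO_FANCY_VERSION}/diff-so-fancy" \
    && echo "${DIFF_SO_FANCY_SHA256}  /usr/local/bin/diff-so-fancy" | sha256sum -c - \
    && chmod +x /usr/local/bin/diff-so-fancy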


291-301: Consider using the tag name (v1.2.1) instead of commit hash for the deepseek-ai/DeepEP source.

Both DeepEP sources are accessible and valid:

  • fzyzcjy/DeepEP gb200_blog_part_2 branch exists (commit 1fd57b0276311d035d16176bb0076426166e52f3)
  • deepseek-ai/DeepEP commit 9af0e0d0e74f3577af1979c9b9e1ac2cad0104ee exists as a tagged release (v1.2.1)

However, the Dockerfile uses the commit hash directly. Consider these improvements:

  • Replace the bare commit hash with the tag reference v1.2.1 for clarity and maintainability
  • Add error handling for git clone and wget/unzip failures (set -e or explicit error checks)
  • Document why dual sources are needed (GB200 optimizations in fzyzcjy fork vs. standard upstream)
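
A minimal sketch of the tag-pinned, fail-fast variant; the destination path under /sgl-workspace is an assumption borrowed from the sglang clone step elsewhere in this review, not the PR's actual layout:

# Sketch: clone upstream DeepEP by its v1.2.1 tag instead of a bare commit hash;
# set -eux makes any clone failure fail the layer immediately.
ARG DEEPEP_TAG=v1.2.1
RUN set -eux; \
    git clone --depth 1 --branch "${DEEPEP_TAG}" \
        "https://${GITHUB_ARTIFACTORY}/deepseek-ai/DeepEP.git" /sgl-workspace/DeepEP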
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09b26bf and 7d393ce.

📒 Files selected for processing (1)
  • container/Dockerfile.sglang (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: nv-anants
Repo: ai-dynamo/dynamo PR: 3290
File: docs/_includes/quick_start_local.rst:13-14
Timestamp: 2025-09-29T19:11:14.161Z
Learning: ai-dynamo package requires --prerelease=allow flag for installation due to sglang dependency requiring RC versions (e.g., sglang[all]==0.5.0rc2).
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
container/Dockerfile.sglang (9)

221-231: Secret mounts for AWS credentials lack visibility and error handling.

Lines 221-222 and 308-309 mount AWS secrets for GDRCopy and NVSHMEM builds, but there's no validation that the secrets are provided or error handling if they're missing. This can lead to silent failures during builds in CI environments where secrets may not be configured.

Considerations:

  1. If secrets are not mounted, the build proceeds but may fail later with cryptic errors
  2. No logging or diagnostics to help troubleshoot secret-related failures
  3. Unclear whether secrets are actually needed for GitHub artifact downloads (via GITHUB_ARTIFACTORY)

Recommend:

  • Document when secrets are required (only for S3 sccache, or for artifact downloads too?)
  • Add a pre-check or informative error message if AWS credentials are missing when USE_SCCACHE=true
  • Consider making secret mounts optional with clear fallback behavior

Are AWS secrets required for GITHUB_ARTIFACTORY downloads, or only for sccache? If only sccache, move the secret mounts to the sccache-specific build steps.

Also applies to: 308-309
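
One possible shape for a fail-fast pre-check; the secret ids and the USE_SCCACHE ARG here are assumptions, not the PR's actual names:

# Sketch: abort early if sccache is requested but the credential secret was never mounted.
# BuildKit exposes optional secrets under /run/secrets/<id>; a missing secret leaves the file absent.
ARG USE_SCCACHE=false
RUN --mount=type=secret,id=aws_access_key_id \
    --mount=type=secret,id=aws_secret_access_key \
    if [ "$USE_SCCACHE" = "true" ] && [ ! -s /run/secrets/aws_access_key_id ]; then \
        echo "ERROR: USE_SCCACHE=true but the aws_access_key_id secret is not mounted" >&2; \
        exit 1; \
    fi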


46-59: ARG declarations use reasonable defaults for flexibility.

The new ARGs (GRACE_BLACKWELL_DEEPEP_BRANCH, TRITON_LANG_COMMIT, BUILD_AND_DOWNLOAD_PARALLEL, etc.) provide good build-time flexibility without breaking existing builds. Defaults are sensible (e.g., GRACE_BLACKWELL=0 defaults to standard DeepEP, USE_LATEST_SGLANG=0 defaults to versioned release).

One consideration: ARGs like PIP_DEFAULT_INDEX and UBUNTU_MIRROR default to empty strings, so they're only applied if explicitly provided. This is safe and consistent with the conditional logic at lines 78-81 and 199-201.


77-81: Ubuntu mirror and PIP index customization handles empty defaults safely.

Lines 78-81 conditionally replace Ubuntu mirrors only if UBUNTU_MIRROR is non-empty, and lines 199-201 set pip index only if PIP_DEFAULT_INDEX is non-empty. This design gracefully handles the default (empty) case without breaking the build.

Minor: Line 78 uses if [ -n "$UBUNTU_MIRROR" ], which is correct, and the same pattern is used for PIP_DEFAULT_INDEX (line 199), keeping the logic consistent.

Also applies to: 195-201
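
For reference, a hedged sketch of the guarded pattern described here; the sed target and the pip config command are illustrative, not the PR's exact lines:

# Sketch: only rewrite apt sources or the pip index when the corresponding ARG is non-empty.
ARG UBUNTU_MIRROR=""
ARG PIP_DEFAULT_INDEX=""
RUN if [ -n "$UBUNTU_MIRROR" ]; then \
        sed -i "s|http://archive.ubuntu.com/ubuntu|${UBUNTU_MIRROR}|g" /etc/apt/sources.list; \
    fi \
    && if [ -n "$PIP_DEFAULT_INDEX" ]; then \
        pip config set global.index-url "$PIP_DEFAULT_INDEX"; \
    fi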


284-289: No issues found—both NVSHMEM sources verified as accessible and working.

Verification confirms NVSHMEM v3.4.5 tarballs exist and are accessible on both sources:

  • GitHub (nvshmem_src_cuda-all-all-3.4.5.tar.gz): exists with 112+ downloads
  • NVIDIA developer.download (nvshmem_src_cuda12-all-all-3.4.5.tar.gz): HTTP/2 200 response

The use of different sources for CUDA 13 vs CUDA 12 is intentional design, not a reliability risk. Both URLs resolve correctly with the defined variables (GITHUB_ARTIFACTORY=github.com, NVSHMEM_VERSION=3.4.5) and are production-stable.


238-242: Verify CI coverage for both SGLang clone paths.

Tag v0.5.4.post3 exists in the sgl-project/sglang repository, confirming the versioned clone path will work. However, no evidence was found in CI configurations that both conditional branches (USE_LATEST_SGLANG=1 and USE_LATEST_SGLANG=0) are tested before merge. Confirm that CI runs build tests with both flag values or document the testing strategy.
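
For context, the branching under discussion looks roughly like the sketch below; the else branch is reconstructed from the v0.5.4.post3 tag mentioned above and may not match the PR's exact code:

# Sketch: USE_LATEST_SGLANG=1 tracks the default branch, otherwise clone the pinned release tag.
RUN if [ "$USE_LATEST_SGLANG" = "1" ]; then \
        git clone --depth=1 https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
    else \
        git clone --depth=1 --branch "v${SGL_VERSION}" \
            https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
    fi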


46-59: ARG declarations verified—version compatibility confirmed.

SGLang 0.5.4 and sgl-kernel 0.3.16.post3 are packaged with CUDA 13 (cu130) images using Torch 2.9.0+cu130, validating the configuration across lines 46–59. SGL_KERNEL_VERSION 0.3.16.post5 (post5) is a compatible upgrade from the documented post3. No changes required.


253-264: All verification checks passed — wheel availability and CUDA index mapping confirmed.

Verification results:

  • ✓ Wheels exist at sgl-project/whl releases: cu124 (8 versions), cu130 (7 versions)
  • ✓ PyPI index has CUDA 12.8 and 12.9 wheels: cu128 (21 versions) and cu129 (14 versions)
  • ✓ CUDA version mapping is correct: 12.6.1→cu124, 12.8.1→PyPI (cu128), 12.9.1→PyPI (cu129), 13.0.1→cu130
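
Expressed as a case statement, the verified mapping would look roughly like this; the variable name and index identifiers are illustrative, and the install command that consumes them is omitted:

# Sketch: map the container CUDA version to the sgl-kernel wheel index verified above.
RUN case "$CUDA_VERSION" in \
        12.6.1) WHEEL_INDEX=cu124 ;; \
        12.8.1) WHEEL_INDEX=cu128 ;; \
        12.9.1) WHEEL_INDEX=cu129 ;; \
        13.0.1) WHEEL_INDEX=cu130 ;; \
        *) echo "Unsupported CUDA_VERSION=${CUDA_VERSION}" >&2; exit 1 ;; \
    esac \
    && echo "Selected wheel index: ${WHEEL_INDEX}"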

353-361: Revisit the Triton source build scope and fix the outdated comment reference.

The original concern about PR#8536 status is now answered: the PR was merged Oct 26, 2025 but reverted Oct 28, 2025 and is not present in Triton 3.5.0. This means the source build workaround is still necessary and the comment's assumption about reverting after a future release is outdated.

However, a scope inconsistency remains: while the Dockerfile handles broader CUDA 13.x variants (lines 272, 284, 343 all use CUDA_VERSION%%.* = "13"), the Triton source build is narrowly scoped to exactly CUDA_VERSION = "13.0.1". The comment says "For cuda 13" but the condition is only for 13.0.1.

Clarify whether:

  1. The 13.0.1-specific scope is intentional (sm103 issue only affects that version), or
  2. The workaround should apply to all CUDA 13.x variants in the build matrix.

Update the comment to remove the outdated reference to waiting for a future Triton release.
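
If the workaround should cover all 13.x variants, the gate could reuse the same major-version test as the other steps; a sketch, with the actual Triton build commands omitted:

# Sketch: gate the Triton source build on the CUDA major version rather than exactly 13.0.1.
RUN if [ "${CUDA_VERSION%%.*}" = "13" ]; then \
        echo "Building Triton from source (commit ${TRITON_LANG_COMMIT}) for CUDA 13.x"; \
        : "source-build steps would go here"; \
    fi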


332-346: CUDA 12.8.1 incorrectly includes Blackwell CC 10.3 architecture which is not supported until CUDA 12.9.

The verification reveals a critical bug:

  • CUDA 12.8 adds Blackwell (SM100) support for compute capability 10.0, but compute-capability 10.3 is fully supported starting with CUDA 13.x.
  • Line 337 incorrectly groups CUDA 12.8.1 with 12.9.1/13.0.1 using the same TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3'.
  • GB200 devices (CC 10.3) will fail compilation or produce incompatible binaries when built with CUDA 12.8.1.

Fix required: Line 337 must be split into two cases:

  • CUDA 12.8.1: use CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0' (without 10.3)
  • CUDA 12.9.1/13.0.1: keep CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3'

The setup.py sed patch (lines 343-345) is functionally correct and will execute as intended.
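
Concretely, the suggested split could look like the sketch below; the fallback branch for other CUDA versions is an assumption, and the build steps that consume the variable are omitted:

# Sketch: drop CC 10.3 for CUDA 12.8.1 but keep it for 12.9.1/13.0.1.
RUN case "$CUDA_VERSION" in \
        12.8.1) CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0' ;; \
        12.9.1|13.0.1) CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3' ;; \
        *) CHOSEN_TORCH_CUDA_ARCH_LIST='9.0' ;; \
    esac \
    && export TORCH_CUDA_ARCH_LIST="${CHOSEN_TORCH_CUDA_ARCH_LIST}" \
    && echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}"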

⛔ Skipped due to learnings
Learnt from: ptarasiewiczNV
Repo: ai-dynamo/dynamo PR: 2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.

check \
libsubunit0 \
libsubunit-dev \
# Development tools and utilities
Contributor

The majority of these packages are installed in the dev stage of the Dockerfile:
https://github.com/ai-dynamo/dynamo/blob/main/container/Dockerfile.sglang#L432-L444
Do you know if we need any of these dev dependencies?

bear \
less \
gnupg \
nvidia-dkms-550 \
Contributor

Not needed; gdrcopy uses the dkms driver installed on the host.

# Install SGLang
# Until torch 2.9 and cu13 are stable we manually update torch if you are on CUDA 13
RUN if [ "$USE_LATEST_SGLANG" = "1" ]; then \
git clone --depth=1 https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
Contributor

Do we need this? I see us only cloning the release branch, not directly using latest sglang

Contributor Author

I think the ability would be helpful to have for dev purposes

LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"

# Replace Ubuntu sources if specified
RUN if [ -n "$UBUNTU_MIRROR" ]; then \
Contributor

I think this is a private ubuntu index used by the sglang team

RUN --mount=type=cache,target=/root/.cache/pip python3 -m pip install --upgrade pip setuptools wheel html5lib six \
&& cd sglang \
&& case "$CUDA_VERSION" in \
12.6.1) CUINDEX=126 ;; \
Contributor

Do we want to support CUDA 12.6.1?

@nv-tusharma
Contributor

Also, the gating sglang check is failing. We want to make sure this runs and completes before we merge. I can also check what the container size looks like after we get it to pass.

@ishandhanani ishandhanani marked this pull request as draft November 10, 2025 18:33
py-spy

# Install nsight-systems
RUN echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu2004/$(if [ "$(uname -m)" = "aarch64" ]; then echo "arm64"; else echo "amd64"; fi) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list \
Contributor

We should move this down into the dev stage as well.
