
Conversation

@ishandhanani
Contributor

@ishandhanani ishandhanani commented Nov 10, 2025

  1. Updates to 0.5.4.post3, which contains GB200 FP8 stability fixes. H100 still needs to be tested, and commands might need to change.
  2. CU13 support in the Dockerfile.

TODO - work with @nv-tusharma to reconcile. My goal is to

  1. get this PR accepted by NV
  2. contribute this dev + runtime stage PR back upstream

Summary by CodeRabbit

  • New Features

    • Added flexible build arguments for container customization, including kernel versions, package mirrors, and pip indices
    • Enhanced support for multiple CUDA versions (12.6.1, 12.8.1, 12.9.1, 13.0.1)
    • Introduced development and debugging tools integration to the container environment
  • Chores

    • Updated component versions and improved build infrastructure configuration

@ishandhanani ishandhanani requested review from a team as code owners November 10, 2025 00:29
@github-actions github-actions bot added the feat label Nov 10, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 10, 2025

Walkthrough

The Dockerfile receives a major refactoring that introduces configurable build arguments (TRITON_LANG_COMMIT, SGL_VERSION, UBUNTU_MIRROR, etc.), restructures the environment setup, adds dynamic installation workflows for SGLang and its dependencies (NVSHMEM, DeepEP, GDRCopy) with CUDA-version-aware handling, integrates enhanced developer tooling, and reorganizes build-stage orchestration, including secret mounts.

Changes

Cohort / File(s) | Summary

Build Arguments & Configuration (container/Dockerfile.sglang)
Added new public ARGs: TRITON_LANG_COMMIT, BUILD_AND_DOWNLOAD_PARALLEL, PIP_DEFAULT_INDEX, UBUNTU_MIRROR, GITHUB_ARTIFACTORY, SGL_KERNEL_VERSION, SGL_VERSION, USE_LATEST_SGLANG, GDRCOPY_VERSION, NVSHMEM_VERSION, GRACE_BLACKWELL, INSTALL_FLASHINFER_JIT_CACHE, GRACE_BLACKWELL_DEEPEP_BRANCH, DEEPEP_COMMIT. Updated SGL_KERNEL_VERSION to 0.3.16.post5. Removed legacy ARGs (DEEPEP_COMMIT, FLASHMLA_COMMIT, etc.).

Environment & Framework Setup (container/Dockerfile.sglang)
Removed fixed TZ setting. Added Ubuntu mirror handling, locale/language configuration (LANG, LANGUAGE, LC_ALL), and pip/cache behavior environment blocks. Introduced conditional source replacement logic driven by build arguments.

Dependency Installation Workflows (container/Dockerfile.sglang)
Restructured SGLang, NVSHMEM, DeepEP, and GDRCopy installation flows with CUDA-version-aware kernel selection (12.6.1, 12.8.1, 12.9.1, 13.0.1), versioned ARG usage, dynamic artifact retrieval with secret mounts for AWS credentials, and updated build options. Replaced static commits with flexible sourcing.

Developer Tooling & Utilities (container/Dockerfile.sglang)
Added installation of development tools (diff-so-fancy, clang-format, clangd, CMake, oh-my-zsh, zsh plugins, nsight-systems-cli) and Python utilities (pytest, black, isort). Included configuration steps for vim, tmux, gdb, Git/Zsh customization, and deployment scripts.

Build Stage Orchestration (container/Dockerfile.sglang)
Consolidated and reorganized RUN blocks to integrate new ARGs and conditional logic. Refactored GDRCopy/NVSHMEM/DeepEP build steps with secret mounts. Unified SGLang installation with CUDA-aware indexing. Separated production-oriented runtime stage deployment.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Critical areas requiring careful attention:
    • CUDA version-specific kernel and dependency selection logic—verify conditional handling for all targeted versions (12.6.1, 12.8.1, 12.9.1, 13.0.1) and compatibility with PyTorch/Torch versions
    • Secret mount usage for AWS credentials—ensure security implications are understood and that credential handling follows best practices
    • SGLang installation logic branching (USE_LATEST_SGLANG flag)—confirm branch selection, version pinning, and index handling work as intended
    • DeepEP/NVSHMEM/GDRCopy dynamic sourcing and build orchestration—verify version resolution, artifact retrieval paths, and build-time option correctness
    • Build stage dependencies and layer optimization—ensure new tooling additions don't bloat image size unexpectedly
    • Legacy ARG removal—confirm no breaking changes for existing build workflows or downstream consumers

Poem

🐰 A Dockerfile so grand, now deeply reconfigured,
With ARGs that dance and CUDA-aware vigor,
SGLang flows dynamic, tools arrive with grace,
Dev tooling blooms where secrets hide their face—
This container births order from chaos's race! ✨

Pre-merge checks

❌ Failed checks (1 warning)
Description check: ⚠️ Warning
Explanation: The pull request description is incomplete and lacks required sections from the template. It provides minimal details about changes and does not follow the structured format with Overview, Details, Where to start, and Related Issues sections.
Resolution: Expand the description to include all template sections: add a comprehensive Overview, detailed explanation of CU13 support and version bump changes, specify which files to review first (container/Dockerfile.sglang), and link to any related GitHub issues.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The title accurately summarizes the main changes: updating SGLang to 0.5.4.post3 and adding CU13 support in the Dockerfile, which are the primary objectives reflected in both the raw summary and PR objectives.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
container/Dockerfile.sglang (2)

398-421: External tool downloads lack checksums and retry logic; supply chain risk.

Lines 398-421 download multiple tools directly from GitHub/external sources without checksums or retry logic:

  • diff-so-fancy (line 399)
  • clang-format (line 403)
  • clangd (line 407)
  • CMake (line 417)
  • oh-my-zsh (line 437)
  • just (line 453)

If any download fails transiently or a malicious actor compromises the source, the build may silently succeed with missing or corrupted tools. Best practices:

  1. Add checksums (SHA256) for each downloaded binary
  2. Add --retry flags to curl/wget commands
  3. Verify URLs are canonical and use HTTPS
  4. Consider pinning release tags/versions more explicitly

Example remediation:

- RUN curl -LSso /usr/local/bin/diff-so-fancy https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy
+ RUN curl --retry 3 -LSso /usr/local/bin/diff-so-fancy https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy && \
+     echo "expected_sha256_here  /usr/local/bin/diff-so-fancy" | sha256sum -c -

Add checksums and retry logic to external tool downloads for supply chain safety.

Also applies to: 437-439, 453-455
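
For illustration, a fuller sketch of the suggested pattern; the --retry flags are standard curl options, but the version and SHA256 value below are placeholders to be replaced with the real release digest:

# Sketch only: the digest below is a placeholder, not diff-so-fancy's real checksum.
ARG DIFF_SO_FANCY_VERSION=1.4.4
ARG DIFF_SO_FANCY_SHA256=0000000000000000000000000000000000000000000000000000000000000000
RUN curl --retry 3 --retry-delay 2 -fLSso /usr/local/bin/diff-so-fancy \
        "https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v${DIFF_SO_FANCY_VERSION}/diff-so-fancy" \
    && echo "${DIFF_SO_FANCY_SHA256}  /usr/local/bin/diff-so-fancy" | sha256sum -c - \
    && chmod +x /usr/local/bin/diff-so-fancy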


291-301: Consider using the tag name (v1.2.1) instead of commit hash for the deepseek-ai/DeepEP source.

Both DeepEP sources are accessible and valid:

  • fzyzcjy/DeepEP gb200_blog_part_2 branch exists (commit 1fd57b0276311d035d16176bb0076426166e52f3)
  • deepseek-ai/DeepEP commit 9af0e0d0e74f3577af1979c9b9e1ac2cad0104ee exists as a tagged release (v1.2.1)

However, the Dockerfile uses the commit hash directly. Consider these improvements:

  • Replace the bare commit hash with the tag reference v1.2.1 for clarity and maintainability
  • Add error handling for git clone and wget/unzip failures (set -e or explicit error checks)
  • Document why dual sources are needed (GB200 optimizations in fzyzcjy fork vs. standard upstream)
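
A minimal sketch of the tag-pinned, fail-fast variant; the destination path under /sgl-workspace is an assumption borrowed from the sglang clone step elsewhere in this review, not the PR's actual layout:

# Sketch: clone upstream DeepEP by its v1.2.1 tag instead of a bare commit hash;
# set -eux makes any clone failure fail the layer immediately.
ARG DEEPEP_TAG=v1.2.1
RUN set -eux; \
    git clone --depth 1 --branch "${DEEPEP_TAG}" \
        "https://${GITHUB_ARTIFACTORY}/deepseek-ai/DeepEP.git" /sgl-workspace/DeepEP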
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09b26bf and 7d393ce.

📒 Files selected for processing (1)
  • container/Dockerfile.sglang (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: nv-anants
Repo: ai-dynamo/dynamo PR: 3290
File: docs/_includes/quick_start_local.rst:13-14
Timestamp: 2025-09-29T19:11:14.161Z
Learning: ai-dynamo package requires --prerelease=allow flag for installation due to sglang dependency requiring RC versions (e.g., sglang[all]==0.5.0rc2).
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
container/Dockerfile.sglang (9)

221-231: Secret mounts for AWS credentials lack visibility and error handling.

Lines 221-222 and 308-309 mount AWS secrets for GDRCopy and NVSHMEM builds, but there's no validation that the secrets are provided or error handling if they're missing. This can lead to silent failures during builds in CI environments where secrets may not be configured.

Considerations:

  1. If secrets are not mounted, the build proceeds but may fail later with cryptic errors
  2. No logging or diagnostics to help troubleshoot secret-related failures
  3. Unclear whether secrets are actually needed for GitHub artifact downloads (via GITHUB_ARTIFACTORY)

Recommend:

  • Document when secrets are required (only for S3 sccache, or for artifact downloads too?)
  • Add a pre-check or informative error message if AWS credentials are missing when USE_SCCACHE=true
  • Consider making secret mounts optional with clear fallback behavior

Are AWS secrets required for GITHUB_ARTIFACTORY downloads, or only for sccache? If only sccache, move the secret mounts to the sccache-specific build steps.

Also applies to: 308-309
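
One possible shape for a fail-fast pre-check; the secret ids and the USE_SCCACHE ARG here are assumptions, not the PR's actual names:

# Sketch: abort early if sccache is requested but the credential secret was never mounted.
# BuildKit exposes optional secrets under /run/secrets/<id>; a missing secret leaves the file absent.
ARG USE_SCCACHE=false
RUN --mount=type=secret,id=aws_access_key_id \
    --mount=type=secret,id=aws_secret_access_key \
    if [ "$USE_SCCACHE" = "true" ] && [ ! -s /run/secrets/aws_access_key_id ]; then \
        echo "ERROR: USE_SCCACHE=true but the aws_access_key_id secret is not mounted" >&2; \
        exit 1; \
    fi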


46-59: ARG declarations use reasonable defaults for flexibility.

The new ARGs (GRACE_BLACKWELL_DEEPEP_BRANCH, TRITON_LANG_COMMIT, BUILD_AND_DOWNLOAD_PARALLEL, etc.) provide good build-time flexibility without breaking existing builds. Defaults are sensible (e.g., GRACE_BLACKWELL=0 defaults to standard DeepEP, USE_LATEST_SGLANG=0 defaults to versioned release).

One consideration: ARGs like PIP_DEFAULT_INDEX and UBUNTU_MIRROR default to empty strings, so they're only applied if explicitly provided. This is safe and consistent with the conditional logic at lines 78-81 and 199-201.


77-81: Ubuntu mirror and PIP index customization handles empty defaults safely.

Lines 78-81 conditionally replace Ubuntu mirrors only if UBUNTU_MIRROR is non-empty, and lines 199-201 set pip index only if PIP_DEFAULT_INDEX is non-empty. This design gracefully handles the default (empty) case without breaking the build.

Minor: Line 78 uses if [ -n "$UBUNTU_MIRROR" ], which is correct, and the same pattern is used for PIP_DEFAULT_INDEX (line 199), keeping the logic consistent.

Also applies to: 195-201
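
For reference, a hedged sketch of the guarded pattern described here; the sed target and the pip config command are illustrative, not the PR's exact lines:

# Sketch: only rewrite apt sources or the pip index when the corresponding ARG is non-empty.
ARG UBUNTU_MIRROR=""
ARG PIP_DEFAULT_INDEX=""
RUN if [ -n "$UBUNTU_MIRROR" ]; then \
        sed -i "s|http://archive.ubuntu.com/ubuntu|${UBUNTU_MIRROR}|g" /etc/apt/sources.list; \
    fi \
    && if [ -n "$PIP_DEFAULT_INDEX" ]; then \
        pip config set global.index-url "$PIP_DEFAULT_INDEX"; \
    fi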


284-289: No issues found—both NVSHMEM sources verified as accessible and working.

Verification confirms NVSHMEM v3.4.5 tarballs exist and are accessible on both sources:

  • GitHub (nvshmem_src_cuda-all-all-3.4.5.tar.gz): exists with 112+ downloads
  • NVIDIA developer.download (nvshmem_src_cuda12-all-all-3.4.5.tar.gz): HTTP/2 200 response

The use of different sources for CUDA 13 vs CUDA 12 is intentional design, not a reliability risk. Both URLs resolve correctly with the defined variables (GITHUB_ARTIFACTORY=github.com, NVSHMEM_VERSION=3.4.5) and are production-stable.


238-242: Verify CI coverage for both SGLang clone paths.

Tag v0.5.4.post3 exists in the sgl-project/sglang repository, confirming the versioned clone path will work. However, no evidence was found in CI configurations that both conditional branches (USE_LATEST_SGLANG=1 and USE_LATEST_SGLANG=0) are tested before merge. Confirm that CI runs build tests with both flag values or document the testing strategy.
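
For context, the branching under discussion looks roughly like the sketch below; the else branch is reconstructed from the v0.5.4.post3 tag mentioned above and may not match the PR's exact code:

# Sketch: USE_LATEST_SGLANG=1 tracks the default branch, otherwise clone the pinned release tag.
RUN if [ "$USE_LATEST_SGLANG" = "1" ]; then \
        git clone --depth=1 https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
    else \
        git clone --depth=1 --branch "v${SGL_VERSION}" \
            https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
    fi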


46-59: ARG declarations verified—version compatibility confirmed.

SGLang 0.5.4 and sgl-kernel 0.3.16.post3 are packaged with CUDA 13 (cu130) images using Torch 2.9.0+cu130, validating the configuration across lines 46–59. SGL_KERNEL_VERSION 0.3.16.post5 (post5) is a compatible upgrade from the documented post3. No changes required.


253-264: All verification checks passed — wheel availability and CUDA index mapping confirmed.

Verification results:

  • ✓ Wheels exist at sgl-project/whl releases: cu124 (8 versions), cu130 (7 versions)
  • ✓ PyPI index has CUDA 12.8 and 12.9 wheels: cu128 (21 versions) and cu129 (14 versions)
  • ✓ CUDA version mapping is correct: 12.6.1→cu124, 12.8.1→PyPI (cu128), 12.9.1→PyPI (cu129), 13.0.1→cu130
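
Expressed as a case statement, the verified mapping would look roughly like this; the variable name and index identifiers are illustrative, and the install command that consumes them is omitted:

# Sketch: map the container CUDA version to the sgl-kernel wheel index verified above.
RUN case "$CUDA_VERSION" in \
        12.6.1) WHEEL_INDEX=cu124 ;; \
        12.8.1) WHEEL_INDEX=cu128 ;; \
        12.9.1) WHEEL_INDEX=cu129 ;; \
        13.0.1) WHEEL_INDEX=cu130 ;; \
        *) echo "Unsupported CUDA_VERSION=${CUDA_VERSION}" >&2; exit 1 ;; \
    esac \
    && echo "Selected wheel index: ${WHEEL_INDEX}"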

353-361: Revisit the Triton source build scope and fix the outdated comment reference.

The original concern about PR#8536 status is now answered: the PR was merged Oct 26, 2025 but reverted Oct 28, 2025 and is not present in Triton 3.5.0. This means the source build workaround is still necessary and the comment's assumption about reverting after a future release is outdated.

However, a scope inconsistency remains: while the Dockerfile handles broader CUDA 13.x variants (lines 272, 284, 343 all use CUDA_VERSION%%.* = "13"), the Triton source build is narrowly scoped to exactly CUDA_VERSION = "13.0.1". The comment says "For cuda 13" but the condition is only for 13.0.1.

Clarify whether:

  1. The 13.0.1-specific scope is intentional (sm103 issue only affects that version), or
  2. The workaround should apply to all CUDA 13.x variants in the build matrix.

Update the comment to remove the outdated reference to waiting for a future Triton release.
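
If the workaround should cover all 13.x variants, the gate could reuse the same major-version test as the other steps; a sketch, with the actual Triton build commands omitted:

# Sketch: gate the Triton source build on the CUDA major version rather than exactly 13.0.1.
RUN if [ "${CUDA_VERSION%%.*}" = "13" ]; then \
        echo "Building Triton from source (commit ${TRITON_LANG_COMMIT}) for CUDA 13.x"; \
        : "source-build steps would go here"; \
    fi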


332-346: CUDA 12.8.1 incorrectly includes Blackwell CC 10.3 architecture which is not supported until CUDA 12.9.

The verification reveals a critical bug:

  • CUDA 12.8 adds Blackwell (SM100) support for compute capability 10.0, but compute-capability 10.3 is fully supported starting with CUDA 13.x.
  • Line 337 incorrectly groups CUDA 12.8.1 with 12.9.1/13.0.1 using the same TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3'.
  • GB200 devices (CC 10.3) will fail compilation or produce incompatible binaries when built with CUDA 12.8.1.

Fix required: Line 337 must be split into two cases:

  • CUDA 12.8.1: use CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0' (without 10.3)
  • CUDA 12.9.1/13.0.1: keep CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3'

The setup.py sed patch (lines 343-345) is functionally correct and will execute as intended.
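
Concretely, the suggested split could look like the sketch below; the fallback branch for other CUDA versions is an assumption, and the build steps that consume the variable are omitted:

# Sketch: drop CC 10.3 for CUDA 12.8.1 but keep it for 12.9.1/13.0.1.
RUN case "$CUDA_VERSION" in \
        12.8.1) CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0' ;; \
        12.9.1|13.0.1) CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3' ;; \
        *) CHOSEN_TORCH_CUDA_ARCH_LIST='9.0' ;; \
    esac \
    && export TORCH_CUDA_ARCH_LIST="${CHOSEN_TORCH_CUDA_ARCH_LIST}" \
    && echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}"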

⛔ Skipped due to learnings
Learnt from: ptarasiewiczNV
Repo: ai-dynamo/dynamo PR: 2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.

check \
libsubunit0 \
libsubunit-dev \
# Development tools and utilities
Contributor

The majority of these packages are installed in the dev stage of the Dockerfile:
https://github.com/ai-dynamo/dynamo/blob/main/container/Dockerfile.sglang#L432-L444
Do you know if we need any of these dev dependencies?

bear \
less \
gnupg \
nvidia-dkms-550 \
Contributor

Not needed; gdrcopy uses the dkms driver installed on the host.

# Install SGLang
# Until torch 2.9 and cu13 are stable we manually update torch if you are on CUDA 13
RUN if [ "$USE_LATEST_SGLANG" = "1" ]; then \
git clone --depth=1 https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
Contributor

Do we need this? I see us only cloning the release branch, not directly using latest sglang

Contributor Author

I think the ability would be helpful to have for dev purposes

LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"

# Replace Ubuntu sources if specified
RUN if [ -n "$UBUNTU_MIRROR" ]; then \
Contributor

I think this is a private ubuntu index used by the sglang team

RUN --mount=type=cache,target=/root/.cache/pip python3 -m pip install --upgrade pip setuptools wheel html5lib six \
&& cd sglang \
&& case "$CUDA_VERSION" in \
12.6.1) CUINDEX=126 ;; \
Contributor

Do we want to support CUDA 12.6.1?

@nv-tusharma
Contributor

Also, the gating sglang check is failing. We want to make sure this runs and completes before we merge. I can also check what the container size looks like after we get it to pass.

@ishandhanani ishandhanani marked this pull request as draft November 10, 2025 18:33
py-spy

# Install nsight-systems
RUN echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu2004/$(if [ "$(uname -m)" = "aarch64" ]; then echo "arm64"; else echo "amd64"; fi) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list \
Contributor

We should move this down into the dev stage as well.
