feat: sglang to 0.5.4.post3 and cu13 support in dockerfile
#4207
base: main
Conversation
Walkthrough

The Dockerfile receives a major refactoring introducing configurable build arguments (TRITON_LANG_COMMIT, SGL_VERSION, UBUNTU_MIRROR, etc.), restructured environment setup, dynamic installation workflows for SGLang and dependencies (NVSHMEM, DeepEP, GDRCopy) with CUDA-version-aware handling, enhanced developer tooling integration, and reorganized build stage orchestration including secret mounts.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
container/Dockerfile.sglang (2)
398-421: External tool downloads lack checksums and retry logic; supply chain risk.

Lines 398-421 download multiple tools directly from GitHub/external sources without checksums or retry logic:
- diff-so-fancy (line 399)
- clang-format (line 403)
- clangd (line 407)
- CMake (line 417)
- oh-my-zsh (line 437)
- just (line 453)
If any download fails transiently or a malicious actor compromises the source, the build silently succeeds with missing/corrupted tools. Best practices:
- Add checksums (SHA256) for each downloaded binary
- Add `--retry` flags to `curl`/`wget` commands
- Verify URLs are canonical and use HTTPS
- Consider pinning release tags/versions more explicitly
Example remediation:

```diff
-RUN curl -LSso /usr/local/bin/diff-so-fancy https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy
+RUN curl --retry 3 -LSso /usr/local/bin/diff-so-fancy https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy && \
+    echo "expected_sha256_here  /usr/local/bin/diff-so-fancy" | sha256sum -c -
```

Add checksums and retry logic to external tool downloads for supply chain safety.
Also applies to: 437-439, 453-455
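The retry-plus-checksum remediation above can be factored into a reusable helper. A minimal sketch, assuming a generic POSIX shell in the image; the function name and the placeholder checksum are illustrative, not taken from the Dockerfile:

```shell
# Hypothetical helper for the suggested pattern: retry the download, then
# fail the step unless the file's SHA256 matches a pinned value.
fetch_verified() {
  url="$1"; dest="$2"; sha="$3"
  curl --retry 3 --retry-delay 2 -fLSso "$dest" "$url" || return 1
  # sha256sum -c expects "<hash>  <path>" (two spaces between fields)
  echo "$sha  $dest" | sha256sum -c - >/dev/null || { rm -f "$dest"; return 1; }
}
```

Removing the partially written file on checksum failure keeps a later `RUN` step from silently using a corrupted binary.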
291-301: Consider using the tag name (v1.2.1) instead of commit hash for the deepseek-ai/DeepEP source.

Both DeepEP sources are accessible and valid:
- fzyzcjy/DeepEP gb200_blog_part_2 branch exists (commit 1fd57b0276311d035d16176bb0076426166e52f3)
- deepseek-ai/DeepEP commit 9af0e0d0e74f3577af1979c9b9e1ac2cad0104ee exists as a tagged release (v1.2.1)
However, the Dockerfile uses the commit hash directly. Consider these improvements:
- Replace the bare commit hash with the tag reference `v1.2.1` for clarity and maintainability
- Add error handling for git clone and wget/unzip failures (set -e or explicit error checks)
- Document why dual sources are needed (GB200 optimizations in fzyzcjy fork vs. standard upstream)
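The tag-pinning and error-handling suggestions above might look like the following sketch; the function name and destination path are invented for illustration, and the real Dockerfile step would substitute its own repo URL and workspace:

```shell
# Illustrative only: clone a pinned release tag instead of a bare commit
# hash, and fail loudly instead of letting the build continue on error.
clone_tag() {
  repo="$1"; tag="$2"; dest="$3"
  git clone --depth=1 --branch "$tag" "$repo" "$dest" \
    || { echo "ERROR: git clone of $repo at tag $tag failed" >&2; return 1; }
}
```

With `set -e` active in the `RUN` step, the explicit `return 1` turns a bad tag or a network failure into an immediate build failure rather than a later cryptic one.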
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
container/Dockerfile.sglang (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: nv-anants
Repo: ai-dynamo/dynamo PR: 3290
File: docs/_includes/quick_start_local.rst:13-14
Timestamp: 2025-09-29T19:11:14.161Z
Learning: ai-dynamo package requires --prerelease=allow flag for installation due to sglang dependency requiring RC versions (e.g., sglang[all]==0.5.0rc2).
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: operator (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
container/Dockerfile.sglang (9)
221-231: Secret mounts for AWS credentials lack visibility and error handling.

Lines 221-222 and 308-309 mount AWS secrets for GDRCopy and NVSHMEM builds, but there's no validation that the secrets are provided or error handling if they're missing. This can lead to silent failures during builds in CI environments where secrets may not be configured.
Considerations:
- If secrets are not mounted, the build proceeds but may fail later with cryptic errors
- No logging or diagnostics to help troubleshoot secret-related failures
- Unclear whether secrets are actually needed for GitHub artifact downloads (via GITHUB_ARTIFACTORY)
Recommend:
- Document when secrets are required (only for S3 sccache, or for artifact downloads too?)
- Add a pre-check or informative error message if AWS credentials are missing when `USE_SCCACHE=true`
- Consider making secret mounts optional with clear fallback behavior

Are AWS secrets required for GITHUB_ARTIFACTORY downloads, or only for sccache? If only sccache, move the secret mounts to the sccache-specific build steps.
Also applies to: 308-309
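The suggested pre-check could be a small fail-fast function like this sketch; the `USE_SCCACHE` flag comes from the comment above, while the secret path and function name are assumptions about how the secrets are mounted:

```shell
# Hypothetical fail-fast check: if sccache is requested but the mounted
# AWS credential secret is absent or empty, stop with a clear message
# instead of failing later with a cryptic sccache/S3 error.
check_sccache_creds() {
  if [ "${USE_SCCACHE:-false}" = "true" ] && [ ! -s "${AWS_SECRET_PATH:-/run/secrets/aws-key-id}" ]; then
    echo "ERROR: USE_SCCACHE=true but no AWS credential secret was mounted" >&2
    return 1
  fi
}
```

Called at the top of the sccache-enabled `RUN` step, this makes the missing-secret case an explicit, diagnosable failure while leaving secret-free builds untouched.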
46-59: ARG declarations use reasonable defaults for flexibility.

The new ARGs (GRACE_BLACKWELL_DEEPEP_BRANCH, TRITON_LANG_COMMIT, BUILD_AND_DOWNLOAD_PARALLEL, etc.) provide good build-time flexibility without breaking existing builds. Defaults are sensible (e.g., GRACE_BLACKWELL=0 defaults to standard DeepEP, USE_LATEST_SGLANG=0 defaults to versioned release).
One consideration: ARGs like PIP_DEFAULT_INDEX and UBUNTU_MIRROR default to empty strings, so they're only applied if explicitly provided. This is safe and consistent with the conditional logic at lines 78-81 and 199-201.
77-81: Ubuntu mirror and PIP index customization handles empty defaults safely.

Lines 78-81 conditionally replace Ubuntu mirrors only if `UBUNTU_MIRROR` is non-empty, and lines 199-201 set the pip index only if `PIP_DEFAULT_INDEX` is non-empty. This design gracefully handles the default (empty) case without breaking the build.

Minor: Line 78 uses `if [ -n "$UBUNTU_MIRROR" ]`, which is correct; the same pattern is used for PIP_DEFAULT_INDEX (line 199), making the logic consistent.

Also applies to: 195-201
284-289: No issues found—both NVSHMEM sources verified as accessible and working.

Verification confirms NVSHMEM v3.4.5 tarballs exist and are accessible on both sources:
- GitHub (`nvshmem_src_cuda-all-all-3.4.5.tar.gz`): exists with 112+ downloads
- NVIDIA developer.download (`nvshmem_src_cuda12-all-all-3.4.5.tar.gz`): HTTP/2 200 response

The use of different sources for CUDA 13 vs CUDA 12 is intentional design, not a reliability risk. Both URLs resolve correctly with the defined variables (GITHUB_ARTIFACTORY=github.com, NVSHMEM_VERSION=3.4.5) and are production-stable.
238-242: Verify CI coverage for both SGLang clone paths.

Tag `v0.5.4.post3` exists in the sgl-project/sglang repository, confirming the versioned clone path will work. However, no evidence was found in CI configurations that both conditional branches (USE_LATEST_SGLANG=1 and USE_LATEST_SGLANG=0) are tested before merge. Confirm that CI runs build tests with both flag values or document the testing strategy.
46-59: ARG declarations verified—version compatibility confirmed.

SGLang 0.5.4 and sgl-kernel 0.3.16.post3 are packaged with CUDA 13 (cu130) images using Torch 2.9.0+cu130, validating the configuration across lines 46–59. SGL_KERNEL_VERSION 0.3.16.post5 (post5) is a compatible upgrade from the documented post3. No changes required.
253-264: All verification checks passed — wheel availability and CUDA index mapping confirmed.

Verification results:
- ✓ Wheels exist at sgl-project/whl releases: cu124 (8 versions), cu130 (7 versions)
- ✓ PyPI index has CUDA 12.8 and 12.9 wheels: cu128 (21 versions) and cu129 (14 versions)
- ✓ CUDA version mapping is correct: 12.6.1→cu124, 12.8.1→PyPI (cu128), 12.9.1→PyPI (cu129), 13.0.1→cu130
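The verified mapping can be sketched as a shell case statement; the function name `cuda_index` and the `pypi` sentinel for the default index are illustrative, not the Dockerfile's exact variable names:

```shell
# Sketch of the CUDA-version-to-wheel-index mapping described above.
# "pypi" stands in for the default index serving cu128/cu129 wheels.
cuda_index() {
  case "$1" in
    12.6.1)        echo cu124 ;;
    12.8.1|12.9.1) echo pypi  ;;
    13.0.1)        echo cu130 ;;
    *) echo "unsupported CUDA version: $1" >&2; return 1 ;;
  esac
}
```

An explicit failing default branch, as sketched here, also addresses the open question below about whether CUDA 12.6.1 should stay supported: dropping its case arm would make such builds fail immediately.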
353-361: Update the Triton source build scope and the outdated comment reference.

The original concern about PR#8536 status is now answered: the PR was merged Oct 26, 2025 but reverted Oct 28, 2025 and is not present in Triton 3.5.0. This means the source build workaround is still necessary and the comment's assumption about reverting after a future release is outdated.
However, a scope inconsistency remains: while the Dockerfile handles broader CUDA 13.x variants (lines 272, 284, 343 all use `${CUDA_VERSION%%.*}` = "13"), the Triton source build is narrowly scoped to exactly `CUDA_VERSION = "13.0.1"`. The comment says "For cuda 13" but the condition is only for 13.0.1.

Clarify whether:
- The 13.0.1-specific scope is intentional (sm103 issue only affects that version), or
- The workaround should apply to all CUDA 13.x variants in the build matrix.
Update the comment to remove the outdated reference to waiting for a future Triton release.
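If the workaround should cover all 13.x variants, the gate could reuse the prefix-strip idiom the Dockerfile already applies elsewhere; a sketch with an invented function name:

```shell
# Sketch of widening the gate from exactly "13.0.1" to any CUDA 13.x,
# using the same ${CUDA_VERSION%%.*} major-version extraction idiom.
needs_triton_source_build() {
  [ "${1%%.*}" = "13" ]
}
```

This keeps the CUDA-13 checks consistent across the file, so adding 13.1.x to the build matrix would not silently skip the Triton source build.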
332-346: CUDA 12.8.1 incorrectly includes Blackwell CC 10.3 architecture which is not supported until CUDA 12.9.

The verification reveals a critical bug:
- CUDA 12.8 adds Blackwell (SM100) support for compute capability 10.0, but compute capability 10.3 is fully supported starting with CUDA 13.x.
- Line 337 incorrectly groups CUDA 12.8.1 with 12.9.1/13.0.1 using the same `TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3'`.
- GB200 devices (CC 10.3) will fail compilation or produce incompatible binaries when built with CUDA 12.8.1.
Fix required: Line 337 must be split into two cases:
- CUDA 12.8.1: use `CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0'` (without 10.3)
- CUDA 12.9.1/13.0.1: keep `CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3'`

The setup.py sed patch (lines 343-345) is functionally correct and will execute as intended.
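The proposed split can be sketched as a case statement; the arch-list values come from the comment above, while the function wrapper and the `9.0` fallback branch are illustrative:

```shell
# Sketch of the suggested fix: request CC 10.3 only for toolkits that
# support it, splitting 12.8.1 out of the shared case arm.
arch_list() {
  case "$1" in
    12.8.1)        echo '9.0;10.0' ;;       # no 10.3 on CUDA 12.8
    12.9.1|13.0.1) echo '9.0;10.0;10.3' ;;
    *)             echo '9.0' ;;
  esac
}
```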
⛔ Skipped due to learnings
Learnt from: ptarasiewiczNV
Repo: ai-dynamo/dynamo PR: 2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
```dockerfile
    check \
    libsubunit0 \
    libsubunit-dev \
    # Development tools and utilities
```
Majority of these packages are installed in the dev stage of the dockerfile:
https://github.com/ai-dynamo/dynamo/blob/main/container/Dockerfile.sglang#L432-L444
Do you know if we need any of these dev dependencies?
```dockerfile
    bear \
    less \
    gnupg \
    nvidia-dkms-550 \
```
Not needed, gdrcopy uses the host dkms driver installed.
```dockerfile
# Install SGLang
# Until torch 2.9 and cu13 are stable we manually update torch if you are on CUDA 13
RUN if [ "$USE_LATEST_SGLANG" = "1" ]; then \
        git clone --depth=1 https://${GITHUB_ARTIFACTORY}/sgl-project/sglang.git /sgl-workspace/sglang; \
```
Do we need this? I see us only cloning the release branch, not directly using latest sglang
I think the ability would be helpful to have for dev purposes
```dockerfile
    LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"

# Replace Ubuntu sources if specified
RUN if [ -n "$UBUNTU_MIRROR" ]; then \
```
I think this is a private ubuntu index used by the sglang team
```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip python3 -m pip install --upgrade pip setuptools wheel html5lib six \
    && cd sglang \
    && case "$CUDA_VERSION" in \
        12.6.1) CUINDEX=126 ;; \
```
Do we want to support CUDA 12.6.1?
|
Also, the gating sglang check is failing. We want to make sure this runs and completes before we merge. I can also check what the container size looks like after we get it to pass.
```dockerfile
    py-spy

# Install nsight-systems
RUN echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu2004/$(if [ "$(uname -m)" = "aarch64" ]; then echo "arm64"; else echo "amd64"; fi) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list \
```
We should move this down as well in the dev stage
`0.5.4.post3`, which contains GB200 FP8 stability fixes. H100 needs to be tested and commands might need to change. TODO - work with @nv-tusharma to reconcile. My goal is to
Summary by CodeRabbit
New Features
Chores