feat: add experimental support for checkpoint vLLM Dynamo Workers #4189

michaelshin · 2025-11-07T21:41:49Z

Overview:

This feature enables fast startup of vLLM inference engines by checkpointing GPU-initialized processes and restoring them on demand. This dramatically reduces vLLM worker initialization time, which is especially beneficial in autoscaling scenarios.

Important Caveats

Intra-Pod Only: This feature currently works only for checkpoint/restore within the same pod, because of a known issue where CRIU fails to restore when process IDs collide. A fix is actively being worked on and is crucial to solving the cold-start problem.
Requires upstream vLLM changes: CRIU support needs small changes to vLLM, so a vLLM patch is required before this PR is merged.
Requires broad permissions: The pod must run as root, expose host PIDs, and set a securityContext.
Requires NVIDIA 580-series drivers

Details:

The checkpoint/restore functionality uses:

cuda-checkpoint: Checkpoints GPU memory and CUDA state
CRIU (Checkpoint/Restore In Userspace): Captures the complete process state including memory, file descriptors, and process tree

The implementation uses CheckpointableAsyncLLM, a wrapper around vLLM's AsyncLLM that:

Runs the vLLM engine in a subprocess
Communicates via ZMQ (REQ-REP pattern) for client-server communication
Supports CRIU checkpoint/restore operations
Handles GPU memory checkpointing via NVIDIA's cuda-checkpoint API

Three Initialization Paths

1. Checkpointing Disabled (Standard Path)

Creates a standard AsyncLLM instance directly
No subprocess, no ZMQ overhead
Normal vLLM startup time

2. Checkpoint Exists (Fast Restore Path)

Detects existing checkpoint directory (with matching config hash)
Creates CheckpointableAsyncLLM with auto_start=False
Runs CRIU with the cuda-checkpoint plugin to restore the checkpointed process
Reconnects ZMQ socket to restored subprocess
Result: Engine ready in seconds instead of minutes

3. Checkpoint Doesn't Exist (Initial Checkpoint Creation)

Creates CheckpointableAsyncLLM with auto_start=True
Starts vLLM engine in subprocess and waits for readiness
Dumps process state with CRIU and the cuda-checkpoint plugin
Atomically moves checkpoint to final directory (handles race conditions)
Restores from the checkpoint (whether created by this worker or another)
Result: Checkpoint created for future use, engine ready after restore
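
The three paths above can be sketched as a small decision function. This is a minimal illustration, not the code in engine.py: the names `InitPath` and `choose_init_path` are hypothetical, and the check for an existing checkpoint is assumed to be a simple directory test on `checkpoint_dir/{hash}`.

```python
from enum import Enum
from pathlib import Path


class InitPath(Enum):
    STANDARD = "standard"                    # path 1: checkpointing disabled
    RESTORE = "restore"                      # path 2: checkpoint already exists
    CREATE_THEN_RESTORE = "create_restore"   # path 3: no checkpoint yet


def choose_init_path(
    enable_checkpointing: bool, checkpoint_dir: Path, config_hash: str
) -> InitPath:
    """Pick an initialization path (hypothetical helper, for illustration)."""
    if not enable_checkpointing:
        return InitPath.STANDARD
    # Assumption: a checkpoint "exists" when its hash-named subdirectory does.
    if (checkpoint_dir / config_hash).is_dir():
        return InitPath.RESTORE
    return InitPath.CREATE_THEN_RESTORE
```

In this sketch, RESTORE would correspond to constructing the wrapper with auto_start=False and CREATE_THEN_RESTORE to auto_start=True, as described in the lists above.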

Where should the reviewer start?

CheckpointableAsyncLLM

Found in components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
Wraps AsyncLLM in subprocess with ZMQ server, exposes same async API
Manages CRIU checkpoint/restore lifecycle and PTY/TTY handling

vLLM Engine Management

Found in components/src/dynamo/vllm/engine.py
Added a new function to choose between AsyncLLM and CheckpointableAsyncLLM

Dockerfile.vllm-checkpoint

A new Docker image that adds CRIU, cuda-checkpoint, and additional dependencies while this feature is experimental

Config Hash-Based Isolation

Uses VllmConfig.compute_hash()[:8] for subdirectories: checkpoint_dir/{hash}/
Prevents mismatches when using same checkpoint_dir with different configs
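
A rough sketch of the directory layout this implies is shown below. It does not call vLLM: `short_config_hash` is a hypothetical stand-in that hashes a plain dict with SHA-256, whereas the PR uses `VllmConfig.compute_hash()[:8]`, whose inputs and algorithm may differ.

```python
import hashlib
import json
from pathlib import Path


def short_config_hash(config: dict, length: int = 8) -> str:
    """Stand-in for VllmConfig.compute_hash()[:8] (assumption: SHA-256 of JSON)."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:length]


def checkpoint_subdir(checkpoint_dir: str, config: dict) -> Path:
    """Return checkpoint_dir/{hash}/ so different configs never share a dump."""
    return Path(checkpoint_dir) / short_config_hash(config)
```

Two workers with identical configs resolve to the same subdirectory, while any config difference that feeds the hash yields a separate one.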

GPU Memory Checkpointing (`checkpoint/utils.py`)

NVIDIA cuda-checkpoint API: Lock → Checkpoint → Restore/Unlock
Logs GPU memory usage for debugging

ZMQ Communication

REQ-REP over TCP, cloudpickle serialization
Subprocess runs ZMQAsyncLLMServer, parent sends requests

Metadata (`checkpoint/metadata.py`)

Stores vllm_checkpoint_metadata.json with TTY info, PIDs, ZMQ port
Used for TTY remapping during restore
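
For orientation, a JSON round trip of this kind could look like the sketch below. The class name and the `zmq_port` and `tty_rdev` fields are illustrative assumptions; only the filename and the general contents (TTY info, PIDs, ZMQ port) come from this description, and the real `CheckpointMetadata` schema may differ.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

METADATA_FILENAME = "vllm_checkpoint_metadata.json"  # name from the description


@dataclass
class CheckpointMetadataSketch:
    # Hypothetical fields; the actual schema lives in checkpoint/metadata.py.
    tree_pid: Optional[int] = None
    zmq_port: Optional[int] = None
    tty_rdev: Optional[str] = None

    def save(self, checkpoint_dir: Path) -> Path:
        """Write the metadata next to the checkpoint images."""
        path = checkpoint_dir / METADATA_FILENAME
        path.write_text(json.dumps(asdict(self), indent=2))
        return path

    @classmethod
    def load(cls, checkpoint_dir: Path) -> "CheckpointMetadataSketch":
        """Read the metadata back before a restore."""
        data = json.loads((checkpoint_dir / METADATA_FILENAME).read_text())
        return cls(**data)
```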

Race Condition Handling

Atomic os.rename() ensures only one checkpoint survives
First worker wins, others clean up and restore from the shared checkpoint
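
The first-worker-wins behavior can be sketched as follows. This is an assumption-laden illustration rather than the code in initialize_engine(): `publish_checkpoint` is a hypothetical name, each worker is assumed to dump into its own staging directory on the same filesystem as the final one, and the final directory is assumed to be non-empty once published (on POSIX, renaming onto an existing empty directory succeeds).

```python
import errno
import os
import shutil
from pathlib import Path


def publish_checkpoint(staging_dir: Path, final_dir: Path) -> bool:
    """Atomically move staging_dir to final_dir; return True if this worker won."""
    try:
        os.rename(staging_dir, final_dir)
        return True
    except OSError as exc:
        # Another worker already published a (non-empty) checkpoint directory.
        if exc.errno not in (errno.EEXIST, errno.ENOTEMPTY):
            raise
        # Losing worker: discard its own dump and use the shared checkpoint.
        shutil.rmtree(staging_dir, ignore_errors=True)
        return False
```

Whether the call returns True or False, the caller would then restore from `final_dir`, which matches the "restore from the shared checkpoint" step above.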

Summary by CodeRabbit

New Features
- Added model checkpoint and restore capabilities for LLM inference engines, controllable via ENABLE_CHECKPOINTING and CHECKPOINT_DIR environment variables
- Supports CUDA-aware checkpointing with automatic metadata persistence and recovery
Chores
- Added container image for deploying checkpoint-enabled inference environments

Signed-off-by: Michael Shin <[email protected]>

copy-pr-bot · 2025-11-07T21:41:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-11-07T21:51:55Z

Walkthrough

This pull request introduces a complete checkpoint/restore system for vLLM engines using CRIU. It adds configuration flags, a subprocess-based AsyncLLM wrapper communicating via ZMQ RPC, GPU process utilities, metadata persistence, and integrates checkpointing into the engine factory with optional resume-on-startup behavior for distributed workers.

Changes

Cohort / File(s)	Summary
Configuration `components/src/dynamo/vllm/args.py`	Adds `ENABLE_CHECKPOINTING` boolean flag and `CHECKPOINT_DIR` path variable, both sourced from environment variables during module initialization.
Checkpoint Package Exports `components/src/dynamo/vllm/checkpoint/__init__.py`	Defines package-level exports for `CheckpointableAsyncLLM` and `CheckpointMetadata` via `__all__`.
Metadata Storage `components/src/dynamo/vllm/checkpoint/metadata.py`	Introduces `CheckpointMetadata` class for serializing/deserializing checkpoint state (TTY info, PIDs, ZMQ port) to/from JSON files.
Checkpointable Engine `components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py`	Implements `CheckpointableAsyncLLM` wrapper that runs `AsyncLLM` in an isolated subprocess with ZMQ RPC communication, CRIU checkpoint/restore lifecycle methods, PTY log forwarding, and generator support.
ZMQ RPC Server `components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py`	Defines `ZMQAsyncLLMServer` and related RPC message types (RPCMessageType, RPCResponse, PropertyRequest, PropertyResponse) for request-reply communication with subprocess-based engine.
Checkpoint Utilities `components/src/dynamo/vllm/checkpoint/utils.py`	Provides GPU process discovery, NVIDIA file descriptor tracking, CUDA checkpointing integration, TTY information retrieval, and error handling utilities.
Integration & Testing `components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py`	End-to-end test module exercising full checkpoint/restore cycles with generation validation before and after restoration.
Engine Factory `components/src/dynamo/vllm/engine.py`	Adds `create_vllm_engine()` factory returning either standard `AsyncLLM` or `CheckpointableAsyncLLM` based on `enable_checkpointing` flag; includes `initialize_engine()` for post-creation setup, checkpoint discovery, and worker coordination.
Main Integration `components/src/dynamo/vllm/main.py`	Replaces direct `AsyncLLM` creation with `create_vllm_engine()` call; integrates `initialize_engine()` for syncing configs across subprocess boundaries and applying recovered state.
Container Setup `container/Dockerfile.vllm-checkpoint`	Multi-stage Dockerfile for building vLLM with CUDA, DeepGEMM, FlashInfer, CRIU support, and runtime environment with GPU utilities and checkpoint tooling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Dense subprocess/RPC orchestration: checkpointable_async_llm.py combines ZMQ communication, subprocess lifecycle, CRIU coordination, and generator state management requiring careful verification of synchronization logic and error paths.
GPU/CUDA integration: utils.py and zmq_async_llm_server.py require understanding of process file descriptors, CUDA contexts, and checkpoint sequencing.
Integration surface: Changes span configuration, engine factory, and main initialization paths, requiring trace-through of config propagation and worker synchronization (especially atomic checkpoint placement under contention).
Heterogeneous concerns: Multiple subsystems (subprocesses, ZMQ sockets, PTY forwarding, CRIU operations) interact in complex ways; checkpoint/resume flows need validation for state consistency.
Container complexity: Multi-stage build with numerous build arguments and dependency chain.

Areas requiring extra attention:

Socket cleanup and subprocess teardown edge cases in CheckpointableAsyncLLM.__del__() and shutdown()
RPC timeout and error propagation paths, especially during engine initialization
Atomic checkpoint directory race condition handling in initialize_engine()
CRIU metadata persistence and recovery correctness
Generator state management across subprocess boundary

Poem

🐇 Hop, hop, checkpoint and restore—
Subprocesses and sockets, a rabbit's folklore,
CRIU guards state in forests of PIDs,
ZMQ whispers RPC to what hides,
Now engines awake where checkpoints reside. 🎉

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.44% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: add experimental support for checkpoint vLLM Dynamo Workers' clearly and concisely summarizes the main change: adding experimental checkpoint support for vLLM workers.
Description check	✅ Passed	The PR description comprehensively covers overview, technical details, initialization paths, file locations, and important caveats; it aligns well with the template structure despite not strictly following every section heading.


coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

components/src/dynamo/vllm/main.py (1)
655-673: Initialize checkpointable engine for multimodal workers

After introducing checkpointing, create_vllm_engine() can return a CheckpointableAsyncLLM with auto_start=False (when a checkpoint exists). Unlike the prefill/decode paths, this branch never calls initialize_engine, so we neither restore nor wait for readiness and the subprocess never starts. Please invoke initialize_engine(...) here and update vllm_config if it returns a synced config, mirroring the other worker initializers.

Apply this diff:
     engine_client, vllm_config, default_sampling_params = setup_vllm_engine(config)
 
+    synced_config = await initialize_engine(
+        engine_client,
+        ENABLE_CHECKPOINTING,
+        CHECKPOINT_DIR,
+    )
+    if synced_config is not None:
+        vllm_config = synced_config
+
     # For aggregated mode, no downstream client is needed
     # TODO: Implement disaggregated mode with proper decode worker client
     downstream_client = None

🧹 Nitpick comments (1)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1)

750-775: Path validation for checkpoint_dir is a good practice but lower priority given current mitigations.

The code passes checkpoint_dir directly to the CRIU command without explicit path validation (no absolute path check or symlink resolution). While this is a valid concern for defense-in-depth, the risk is mitigated by:

subprocess.run(cmd) without shell=True prevents shell injection attacks

No sudo or privilege escalation found in the code (command runs with process permissions only)

Test usage shows safe temporary paths via tempfile.mktemp()

Consider adding optional validation: verify checkpoint_dir is an absolute path and resolve any symlinks before passing to CRIU, but this is not a critical blocker given the existing protections.
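
One possible shape for that optional validation is sketched below. The function name `validate_checkpoint_dir` is hypothetical, and the policy (reject relative paths, then resolve symlinks) is only the one suggested in this comment, not something the PR currently implements.

```python
import os
from pathlib import Path


def validate_checkpoint_dir(checkpoint_dir: str) -> Path:
    """Reject relative paths and resolve symlinks before handing the path to CRIU."""
    if not os.path.isabs(checkpoint_dir):
        raise ValueError(f"checkpoint_dir must be an absolute path: {checkpoint_dir!r}")
    # resolve() follows symlinks and collapses any ".." components.
    return Path(checkpoint_dir).resolve()
```

The resolved path could then be logged and passed to the CRIU command in place of the raw string.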

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f8fd86 and daf302f.

📒 Files selected for processing (10)

components/src/dynamo/vllm/args.py (1 hunks)
components/src/dynamo/vllm/checkpoint/__init__.py (1 hunks)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1 hunks)
components/src/dynamo/vllm/checkpoint/metadata.py (1 hunks)
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py (1 hunks)
components/src/dynamo/vllm/checkpoint/utils.py (1 hunks)
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1 hunks)
components/src/dynamo/vllm/engine.py (1 hunks)
components/src/dynamo/vllm/main.py (5 hunks)
container/Dockerfile.vllm-checkpoint (1 hunks)

🧰 Additional context used

🧠 Learnings (6)

📚 Learning: 2025-08-30T20:43:49.632Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

container/Dockerfile.vllm-checkpoint

📚 Learning: 2025-08-30T20:43:10.091Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

container/Dockerfile.vllm-checkpoint

📚 Learning: 2025-09-16T17:16:07.820Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.vllm.j2:221-233
Timestamp: 2025-09-16T17:16:07.820Z
Learning: In the dynamo project, when converting Dockerfiles to Jinja2 templates, the primary goal is backward compatibility - generated Dockerfiles must be identical to the originals. Security improvements or other enhancements that would change the generated output are out of scope for templating PRs and should be addressed separately.

Applied to files:

container/Dockerfile.vllm-checkpoint

📚 Learning: 2025-08-30T20:43:10.091Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

container/Dockerfile.vllm-checkpoint

📚 Learning: 2025-09-03T01:10:12.599Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.

Applied to files:

container/Dockerfile.vllm-checkpoint

📚 Learning: 2025-06-05T01:10:51.865Z

Learnt from: tanmayv25
Repo: ai-dynamo/dynamo PR: 1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.

Applied to files:

components/src/dynamo/vllm/main.py

🧬 Code graph analysis (6)

components/src/dynamo/vllm/engine.py (1)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (6)

CheckpointableAsyncLLM (143-1166)

from_vllm_config (1082-1122)

get_vllm_config (597-598)

criu_resume (804-960)

wait_until_ready (327-384)

criu_checkpoint (693-802)

components/src/dynamo/vllm/main.py (1)

components/src/dynamo/vllm/engine.py (2)

create_vllm_engine (27-106)

initialize_engine (130-228)

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py (2)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (7)

CheckpointableAsyncLLM (143-1166)

from_engine_args (1125-1166)

criu_resume (804-960)

generate (548-568)

shutdown (962-1030)

wait_until_ready (327-384)

criu_checkpoint (693-802)

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1)

run (167-301)

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (5)

get_model_config (600-601)

errored (681-684)

sleep (631-632)

encode (573-595)

shutdown (962-1030)

components/src/dynamo/vllm/checkpoint/__init__.py (2)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1)

CheckpointableAsyncLLM (143-1166)

components/src/dynamo/vllm/checkpoint/metadata.py (1)

CheckpointMetadata (13-74)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (3)

components/src/dynamo/vllm/checkpoint/utils.py (4)

find_gpu_worker_pids (326-345)

collect_process_tree_pids (46-75)

verify_processes_exited (97-136)

get_tty_info (139-160)

components/src/dynamo/vllm/checkpoint/metadata.py (4)

CheckpointMetadata (13-74)

tty_external (70-74)

save (44-52)

load (55-67)

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (6)

RPCMessageType (27-32)

RPCResponse (34-39)

PropertyRequest (41-43)

PropertyResponse (45-48)

run_async_llm_server (304-367)

run (167-301)

🪛 ast-grep (0.39.7)

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

[warning] 120-121: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control

(avoid-mktemp-python)
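
Since the test appears to need a directory rather than a file, `tempfile.mkdtemp()` may be a closer fit than `NamedTemporaryFile()`. A minimal sketch, assuming the test can work with a directory that already exists (unlike `mktemp()`, `mkdtemp()` creates it); the helper name is hypothetical:

```python
import os
import shutil
import tempfile


def make_test_checkpoint_dir() -> str:
    """Create a private, uniquely named checkpoint directory for a test run."""
    # mkdtemp creates the directory securely and returns its absolute path.
    return tempfile.mkdtemp(prefix="vllm_test_checkpoint_")


if __name__ == "__main__":
    checkpoint_dir = make_test_checkpoint_dir()
    try:
        print("checkpoint dir exists:", os.path.isdir(checkpoint_dir))
    finally:
        # The caller is responsible for cleanup.
        shutil.rmtree(checkpoint_dir, ignore_errors=True)
```

If the code under test expects to create the directory itself, the test would need to be adjusted accordingly.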

🪛 GitHub Actions: Copyright Checks

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

components/src/dynamo/vllm/checkpoint/utils.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/utils.py

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

components/src/dynamo/vllm/checkpoint/metadata.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/metadata.py

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4189/merge) by michaelshin.

container/Dockerfile.vllm-checkpoint

[error] 206-206: Docker build failed during CRIU installation step. Process completed with exit code 1.

🪛 Ruff (0.14.3)

components/src/dynamo/vllm/engine.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)

166-166: Avoid specifying long messages outside the exception class

(TRY003)

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

1-1: Shebang is present but file is not executable

(EXE001)

52-52: Avoid specifying long messages outside the exception class

(TRY003)

99-99: Abstract raise to an inner function

(TRY301)

99-99: Avoid specifying long messages outside the exception class

(TRY003)

112-113: Remove exception handler; error is immediately re-raised

(TRY203)

121-122: Use of insecure and deprecated function (mktemp)

(S306)

163-163: Abstract raise to an inner function

(TRY301)

163-163: Avoid specifying long messages outside the exception class

(TRY003)

224-224: Abstract raise to an inner function

(TRY301)

224-224: Avoid specifying long messages outside the exception class

(TRY003)

240-241: Remove exception handler; error is immediately re-raised

(TRY203)

components/src/dynamo/vllm/checkpoint/utils.py

15-15: Do not catch blind exception: Exception

(BLE001)

29-29: Do not catch blind exception: Exception

(BLE001)

42-42: Do not catch blind exception: Exception

(BLE001)

65-66: try-except-continue detected, consider logging the exception

(S112)

65-65: Do not catch blind exception: Exception

(BLE001)

91-91: Consider moving this statement to an else block

(TRY300)

92-92: Do not catch blind exception: Exception

(BLE001)

131-131: Do not catch blind exception: Exception

(BLE001)

132-132: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

157-157: Consider moving this statement to an else block

(TRY300)

158-158: Do not catch blind exception: Exception

(BLE001)

179-179: Do not catch blind exception: Exception

(BLE001)

208-209: try-except-continue detected, consider logging the exception

(S112)

208-208: Do not catch blind exception: Exception

(BLE001)

210-211: try-except-pass detected, consider logging the exception

(S110)

210-210: Do not catch blind exception: Exception

(BLE001)

249-249: Do not catch blind exception: Exception

(BLE001)

258-258: Avoid specifying long messages outside the exception class

(TRY003)

276-279: Avoid specifying long messages outside the exception class

(TRY003)

289-292: Avoid specifying long messages outside the exception class

(TRY003)

319-320: try-except-continue detected, consider logging the exception

(S112)

319-319: Do not catch blind exception: Exception

(BLE001)

321-322: try-except-pass detected, consider logging the exception

(S110)

321-321: Do not catch blind exception: Exception

(BLE001)

381-381: Do not catch blind exception: Exception

(BLE001)

445-447: Avoid specifying long messages outside the exception class

(TRY003)

456-456: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

91-91: Consider moving this statement to an else block

(TRY300)

97-99: Avoid specifying long messages outside the exception class

(TRY003)

107-109: Avoid specifying long messages outside the exception class

(TRY003)

118-118: Avoid specifying long messages outside the exception class

(TRY003)

149-150: Avoid specifying long messages outside the exception class

(TRY003)

155-155: Consider moving this statement to an else block

(TRY300)

163-163: Avoid specifying long messages outside the exception class

(TRY003)

252-252: Do not catch blind exception: Exception

(BLE001)

255-255: Use explicit conversion flag

Replace with conversion flag

(RUF010)

270-270: Do not catch blind exception: Exception

(BLE001)

273-273: Use explicit conversion flag

Replace with conversion flag

(RUF010)

330-330: Do not catch blind exception: Exception

(BLE001)

364-364: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/vllm/checkpoint/__init__.py

9-9: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

components/src/dynamo/vllm/checkpoint/metadata.py

51-51: Do not catch blind exception: Exception

(BLE001)

65-65: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

75-75: Do not catch blind exception: Exception

(BLE001)

78-78: Use explicit conversion flag

Replace with conversion flag

(RUF010)

125-125: Abstract raise to an inner function

(TRY301)

125-125: Avoid specifying long messages outside the exception class

(TRY003)

134-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

134-134: Avoid specifying long messages outside the exception class

(TRY003)

140-140: Use raise without specifying exception name

Remove exception name

(TRY201)

178-178: Unused method argument: stat_loggers

(ARG002)

259-259: Do not catch blind exception: Exception

(BLE001)

278-278: Avoid specifying long messages outside the exception class

(TRY003)

300-300: Avoid specifying long messages outside the exception class

(TRY003)

305-307: Avoid specifying long messages outside the exception class

(TRY003)

308-310: Avoid specifying long messages outside the exception class

(TRY003)

347-348: Avoid specifying long messages outside the exception class

(TRY003)

367-370: Avoid specifying long messages outside the exception class

(TRY003)

381-384: Avoid specifying long messages outside the exception class

(TRY003)

389-391: Avoid specifying long messages outside the exception class

(TRY003)

393-394: Avoid specifying long messages outside the exception class

(TRY003)

413-413: Avoid specifying long messages outside the exception class

(TRY003)

416-416: Avoid specifying long messages outside the exception class

(TRY003)

420-420: Avoid specifying long messages outside the exception class

(TRY003)

428-430: Avoid specifying long messages outside the exception class

(TRY003)

432-433: Avoid specifying long messages outside the exception class

(TRY003)

452-452: Avoid specifying long messages outside the exception class

(TRY003)

455-455: Avoid specifying long messages outside the exception class

(TRY003)

459-459: Avoid specifying long messages outside the exception class

(TRY003)

475-475: Avoid specifying long messages outside the exception class

(TRY003)

478-478: Avoid specifying long messages outside the exception class

(TRY003)

482-482: Avoid specifying long messages outside the exception class

(TRY003)

492-494: Avoid specifying long messages outside the exception class

(TRY003)

496-497: Avoid specifying long messages outside the exception class

(TRY003)

508-508: Avoid specifying long messages outside the exception class

(TRY003)

511-511: Avoid specifying long messages outside the exception class

(TRY003)

515-515: Avoid specifying long messages outside the exception class

(TRY003)

695-695: Unused method argument: cuda_checkpoint_path

(ARG002)

703-705: Avoid specifying long messages outside the exception class

(TRY003)

707-707: Avoid specifying long messages outside the exception class

(TRY003)

713-717: Avoid specifying long messages outside the exception class

(TRY003)

773-773: subprocess call: check for execution of untrusted input

(S603)

775-775: Avoid specifying long messages outside the exception class

(TRY003)

789-789: Avoid specifying long messages outside the exception class

(TRY003)

805-805: Unused method argument: cuda_checkpoint_path

(ARG002)

817-818: Avoid specifying long messages outside the exception class

(TRY003)

823-824: Avoid specifying long messages outside the exception class

(TRY003)

838-838: Starting a process with a shell, possible injection detected

(S605)

852-852: Do not catch blind exception: Exception

(BLE001)

900-900: Do not catch blind exception: Exception

(BLE001)

907-907: subprocess call: check for execution of untrusted input

(S603)

910-910: subprocess call: check for execution of untrusted input

(S603)

912-912: Avoid specifying long messages outside the exception class

(TRY003)

942-942: Avoid specifying long messages outside the exception class

(TRY003)

944-944: Avoid specifying long messages outside the exception class

(TRY003)

955-955: Do not catch blind exception: Exception

(BLE001)

1017-1017: Do not catch blind exception: Exception

(BLE001)

1018-1018: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1066-1066: Do not catch blind exception: Exception

(BLE001)

1098-1101: Avoid specifying long messages outside the exception class

(TRY003)

1105-1108: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and Test - dynamo

🔇 Additional comments (1)

components/src/dynamo/vllm/engine.py (1)

183-228: Well-designed atomic checkpoint creation with race handling.

The checkpoint creation logic correctly uses os.rename() for atomic operations and handles EEXIST race conditions when multiple workers compete to create the checkpoint. The fallback to restore from the winner's checkpoint is a clean solution for distributed worker initialization.

coderabbitai · 2025-11-07T21:51:59Z

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


⚠️ Potential issue | 🔴 Critical

Pipeline failure: Fix copyright header format.

The copyright header format is not matching the expected pattern, causing the pipeline to fail.

The current header includes NVIDIA copyright, but the pipeline expects a different format. Check with the team or examine other accepted files in components/src/dynamo/vllm/ for the correct pattern that satisfies the copyright check.

🧰 Tools

🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

🤖 Prompt for AI Agents

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py lines 1-2: the file header does not match the repository's expected SPDX/copyright header pattern and is causing CI to fail; replace the two header lines with the exact header used in other accepted files in components/src/dynamo/vllm/ (copy-paste the SPDX-License-Identifier and SPDX-FileCopyrightText lines from a matching file in that folder so formatting and wording match exactly), then run the repo's license/header check locally to confirm it passes.

coderabbitai · 2025-11-07T21:51:59Z

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

+        if metadata.tree_pid:
+            start = time.time()
+            # Wait up to a short grace period since dump should have killed it
+            while time.time() - start < 5.0:
+                if os.system(f"kill -0 {metadata.tree_pid} >/dev/null 2>&1") != 0:
+                    break
+                await asyncio.sleep(0.05)
+
+            # Verify root PID is not taken now (PID could be reused by others)
+            pid_path = f"/proc/{metadata.tree_pid}"
+            if os.path.exists(pid_path):
+                # Try to read the cmdline of the holder for diagnostics
+                holder = ""
+                try:
+                    cmd_path = os.path.join(pid_path, "cmdline")
+                    with open(cmd_path, "rb") as f:
+                        raw = f.read().replace(b"\x00", b" ")
+                        holder = raw.decode("utf-8", "ignore").strip()
+                except Exception:
+                    holder = ""
+                msg = (
+                    "CRIU restore pre-check failed: root PID "
+                    f"{metadata.tree_pid} is in use in current PID namespace. "
+                    "Restore will fail with EEXIST. Consider restoring in a "
+                    "new PID namespace or wait until the PID is free."
+                )
+                if holder:
+                    logger.error(
+                        "%s Holder cmdline: %s",
+                        msg,
+                        holder,
+                    )
+                else:
+                    logger.error("%s", msg)
+


⚠️ Potential issue | 🟠 Major

Security: Replace os.system() with subprocess API to prevent shell injection.

Using os.system() with formatted strings can be vulnerable to command injection (CWE-78) if the PID value were ever controlled by an attacker or derived from untrusted input.

Apply this diff to avoid spawning a shell. The /proc lookup that follows the wait loop stays as it is:

             # Verify root PID is not taken now (PID could be reused by others)
             pid_path = f"/proc/{metadata.tree_pid}"
             if os.path.exists(pid_path):

For the PID check at line 838, replace:

-            while time.time() - start < 5.0:
-                if os.system(f"kill -0 {metadata.tree_pid} >/dev/null 2>&1") != 0:
-                    break
-                await asyncio.sleep(0.05)
+            while time.time() - start < 5.0:
+                try:
+                    os.kill(metadata.tree_pid, 0)  # Signal 0 checks existence
+                    # Process exists, keep waiting
+                    await asyncio.sleep(0.05)
+                except (ProcessLookupError, PermissionError):
+                    # Process doesn't exist or we can't check it
+                    break

Using os.kill(pid, 0) is the standard Python way to check if a process exists without sending a real signal.
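
The two exceptions carry different meanings, which matters for a restore pre-check: ProcessLookupError means the PID is free, while PermissionError means a process with that PID exists but belongs to another user. A minimal sketch that keeps them apart, assuming a POSIX host; pid_exists and wait_for_exit are hypothetical helper names, not functions from the PR:

```python
import asyncio
import os
import time


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID is visible to the caller."""
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but we may not signal it
    return True


async def wait_for_exit(pid: int, timeout: float = 5.0, poll: float = 0.05) -> bool:
    """Poll until the PID disappears; return False if it is still present."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_exists(pid):
            return True
        await asyncio.sleep(poll)
    return not pid_exists(pid)
```

With this split, a PID held by a foreign process is reported as still in use instead of ending the wait early.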

🧰 Tools

🪛 Ruff (0.14.3)

838-838: Starting a process with a shell, possible injection detected

(S605)

852-852: Do not catch blind exception: Exception

(BLE001)
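
The BLE001 finding points at the except Exception around the /proc/<pid>/cmdline read. A minimal sketch of the same diagnostic with the handler narrowed to OSError, which covers a process that has already exited as well as a permission failure; read_cmdline is a hypothetical helper name, not part of the PR:

```python
def read_cmdline(pid: int) -> str:
    """Return the command line of a PID, or "" if it cannot be read."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except OSError:
        # FileNotFoundError, ProcessLookupError and PermissionError
        # are all subclasses of OSError.
        return ""
    # Arguments are NUL-separated in /proc/<pid>/cmdline.
    return raw.replace(b"\x00", b" ").decode("utf-8", "ignore").strip()
```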

coderabbitai · 2025-11-07T21:51:59Z

components/src/dynamo/vllm/checkpoint/metadata.py

+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project


⚠️ Potential issue | 🔴 Critical

Pipeline failure: Fix copyright header format.

The copyright header does not match the expected pattern, which causes the pipeline to fail.

Similar to zmq_async_llm_server.py, verify the expected header format with the team or check other accepted files in the components/src/dynamo/vllm/ directory.

🧰 Tools

🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/metadata.py

coderabbitai · 2025-11-07T21:51:59Z

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

@@ -0,0 +1,293 @@
+#!/usr/bin/env python3


⚠️ Potential issue | 🔴 Critical

Pipeline failure: Add required copyright header.

The copyright header is missing or invalid, causing the pipeline to fail.

Add the required header format at the top of the file. Based on other files in the PR, it should follow this pattern:

 #!/usr/bin/env python3
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 Test checkpoint/restore functionality of CheckpointableAsyncLLM.

🧰 Tools

🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

🪛 Ruff (0.14.3)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents

In components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py at line 1, the required copyright/header is missing which causes the pipeline to fail; add the standard project header (matching other files in this PR) at the very top of the file, replacing or preceding the current shebang if needed, and ensure it includes the copyright owner, year, and license text exactly as used in other files so the linter/pipeline accepts it.

coderabbitai · 2025-11-07T21:52:00Z

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

+    # Generate a unique checkpoint directory path (but don't create it)
+    checkpoint_dir = tempfile.mktemp(
+        prefix="vllm_test_checkpoint_", dir="/tmp")
+    print(f"Will use checkpoint directory: {checkpoint_dir}")


⚠️ Potential issue | 🔴 Critical

Security: Replace deprecated tempfile.mktemp() with secure alternative.

tempfile.mktemp() is deprecated and vulnerable to race conditions (CWE-377) where an attacker could create a file at the returned path before your code does.

Apply this diff to use a secure alternative:

-    checkpoint_dir = tempfile.mktemp(
-        prefix="vllm_test_checkpoint_", dir="/tmp")
-    print(f"Will use checkpoint directory: {checkpoint_dir}")
+    with tempfile.TemporaryDirectory(
+        prefix="vllm_test_checkpoint_", dir="/tmp") as checkpoint_dir:
+        print(f"Will use checkpoint directory: {checkpoint_dir}")

Note: You'll need to adjust the indentation of the code that follows to be inside the with block, or use tempfile.mkdtemp() if you need to preserve the directory after the test completes (which appears to be the intent based on line 248).

If you want to preserve the directory for debugging:

-    checkpoint_dir = tempfile.mktemp(
-        prefix="vllm_test_checkpoint_", dir="/tmp")
+    checkpoint_dir = tempfile.mkdtemp(
+        prefix="vllm_test_checkpoint_", dir="/tmp")

Committable suggestion skipped: line range outside the PR's diff.
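
The comment in the test says the path should be generated but not created, whereas both mkdtemp() and TemporaryDirectory() create the directory immediately. One way to reconcile the two is to create a private parent directory and hand out a child path that does not exist yet. A minimal sketch under that assumption; make_checkpoint_path and the "checkpoint" child name are hypothetical:

```python
import os
import shutil
import tempfile
from typing import Optional


def make_checkpoint_path(base: Optional[str] = None) -> tuple[str, str]:
    """Create a private parent directory and return (parent, child_path).

    mkdtemp() creates the parent atomically with owner-only permissions,
    so the child path inside it cannot be pre-created by another user.
    """
    parent = tempfile.mkdtemp(prefix="vllm_test_checkpoint_", dir=base)
    return parent, os.path.join(parent, "checkpoint")


parent, checkpoint_dir = make_checkpoint_path()
try:
    print(f"Will use checkpoint directory: {checkpoint_dir}")
    # ... run the checkpoint/restore steps against checkpoint_dir ...
finally:
    # Remove the parent once the directory is no longer needed.
    shutil.rmtree(parent, ignore_errors=True)
```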

🧰 Tools

🪛 ast-grep (0.39.7)

[warning] 120-121: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control

(avoid-mktemp-python)

🪛 Ruff (0.14.3)

121-122: Use of insecure and deprecated function (mktemp)

(S306)

🤖 Prompt for AI Agents

In components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py around lines 120-123, replace the deprecated and insecure tempfile.mktemp() call with a secure alternative: use tempfile.mkdtemp() if the test needs the directory to persist after the test (preferred here), or wrap code in a with tempfile.TemporaryDirectory() as checkpoint_dir: block and indent subsequent code if you want automatic cleanup; update any teardown/cleanup logic accordingly to remove the directory when no longer needed.

coderabbitai · 2025-11-07T21:52:00Z

components/src/dynamo/vllm/checkpoint/utils.py

+    pending: list[int] = [root_pid]
+    seen: set[int] = set[int]()
+    while pending:
+        pid = pending.pop()
+        if pid in seen:
+            continue
+        seen.add(pid)
+        children_path = f"/proc/{pid}/task/{pid}/children"
+        try:
+            with open(children_path, encoding="utf-8") as f:
+                content = f.read().strip()
+        except FileNotFoundError:
+            continue
+        except Exception:
+            continue
+        if not content:
+            continue
+        for token in content.split():
+            try:
+                child = int(token)
+            except ValueError:
+                continue
+            pending.append(child)
+    return seen


⚠️ Potential issue | 🔴 Critical

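Use a plain set() in process tree traversal

set[int]() does not raise on Python 3.9+: calling a subscripted built-in constructs an instance of the origin type, so this line returns an ordinary empty set and the helper reaches /proc as intended. The element type is not enforced at runtime, though, and the spelling is unusual enough to read like a mistake. Prefer a plain set() and keep the type in the annotation.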

Apply this diff:

-    pending: list[int] = [root_pid]
-    seen: set[int] = set[int]()
+    pending: list[int] = [root_pid]
+    seen: set[int] = set()
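
How the two spellings behave at runtime can be checked directly. A minimal sketch, assuming CPython 3.9 or newer, where built-in containers can be subscripted:

```python
import sys

# Assumption: Python 3.9+, where set[int] is a types.GenericAlias.
assert sys.version_info >= (3, 9)

seen_alias = set[int]()       # calling the alias builds a plain, empty set
seen_plain: set[int] = set()  # conventional spelling

print(type(seen_alias) is set, seen_alias == seen_plain)  # prints "True True"
```

Both lines produce an empty set; the annotated set() form is the conventional one and avoids the question altogether.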

🧰 Tools

🪛 Ruff (0.14.3)

65-66: try-except-continue detected, consider logging the exception

(S112)

65-65: Do not catch blind exception: Exception

(BLE001)
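
Both Ruff findings concern the broad except Exception: continue in the traversal. A minimal sketch of the same walk with the handlers narrowed to OSError. It also reads the children file of every thread rather than only the main thread, because the kernel lists a child under the task that created it; that extension is an assumption about what the helper should cover, and collect_descendants is a hypothetical name:

```python
import os


def collect_descendants(root_pid: int) -> set[int]:
    """Return root_pid plus every descendant PID found under /proc."""
    pending: list[int] = [root_pid]
    seen: set[int] = set()
    while pending:
        pid = pending.pop()
        if pid in seen:
            continue
        seen.add(pid)
        task_dir = f"/proc/{pid}/task"
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited, or /proc is unavailable
        for tid in tids:
            try:
                with open(f"{task_dir}/{tid}/children", encoding="utf-8") as f:
                    tokens = f.read().split()
            except OSError:
                continue  # thread exited or the file is not readable
            # Ignore anything that is not a decimal PID.
            pending.extend(int(t) for t in tokens if t.isdigit())
    return seen
```

The children files are only a snapshot while the tree is running, so the result is most reliable once the processes have been stopped or frozen.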

🤖 Prompt for AI Agents

In components/src/dynamo/vllm/checkpoint/utils.py around lines 52 to 75, the code initializes the visited set with set[int](); this runs on Python 3.9+ but is an unconventional way to build an empty set. Change the initialization to a plain set() while keeping the typing annotation (e.g., seen: set[int] = set()) so the static type remains accurate; no other logic changes required.

coderabbitai · 2025-11-07T21:52:00Z

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project


⚠️ Potential issue | 🔴 Critical

Pipeline failure: Fix copyright header format.

The copyright header does not match the expected pattern, which causes the pipeline to fail.

The existing headers appear correct for the vLLM project. This might be a mismatch with the Dynamo project requirements. Verify the expected header format with the team or check other accepted files in components/src/dynamo/vllm/ for the correct pattern.

🧰 Tools

🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

🤖 Prompt for AI Agents

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py lines 1-2: the file header SPDX/license lines do not match the Dynamo project expected header pattern and cause CI failures; update the top-of-file header to the repository's canonical header (match other accepted files under components/src/dynamo/vllm/)—replace the current two SPDX lines with the exact header format used across the Dynamo codebase (including the correct SPDX-License-Identifier line and the correct SPDX-FileCopyrightText content/owner string), commit the change, and run the pipeline to confirm the header now passes validation.

coderabbitai · 2025-11-07T21:52:00Z

container/Dockerfile.vllm-checkpoint

+RUN add-apt-repository ppa:criu/ppa && \
+    apt-get update && \
+    apt-get install -y --no-install-recommends criu && \
+    rm -rf /var/lib/apt/lists/*


⚠️ Potential issue | 🔴 Critical

Make the CRIU PPA step non-interactive

add-apt-repository prompts for confirmation unless -y/--yes is passed, so the Docker build hangs and then fails (see the pre-merge CI failure). Please add the non-interactive flag so the stage can run unattended.

Apply this diff:

-RUN add-apt-repository ppa:criu/ppa && \
+RUN add-apt-repository -y ppa:criu/ppa && \
     apt-get update && \
     apt-get install -y --no-install-recommends criu && \
     rm -rf /var/lib/apt/lists/*

🤖 Prompt for AI Agents

In container/Dockerfile.vllm-checkpoint around lines 211 to 214, the add-apt-repository call is interactive and blocks the Docker build; update the command to run non-interactively by adding the -y (or --yes) flag to add-apt-repository (e.g., add-apt-repository -y ppa:criu/ppa) so the step proceeds unattended, leaving the apt-get update, apt-get install -y --no-install-recommends criu, and cleanup of /var/lib/apt/lists/* unchanged.

experimental: add support for checkpoint vLLM Dynamo Workers

daf302f

Signed-off-by: Michael Shin <[email protected]>

michaelshin requested review from a team as code owners November 7, 2025 21:41

pull-request-size bot added the size/XXL label Nov 7, 2025

michaelshin changed the title ~~experimental: add support for checkpoint vLLM Dynamo Workers~~ feat: add experimental support for checkpoint vLLM Dynamo Workers Nov 7, 2025

github-actions bot added the feat label Nov 7, 2025

coderabbitai bot reviewed Nov 7, 2025

fix copyright

8eecd6f

		# SPDX-License-Identifier: Apache-2.0
		# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

		# SPDX-License-Identifier: Apache-2.0
		# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
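
The two headers above differ only in the owner named on the SPDX-FileCopyrightText line. A minimal sketch of a local pre-check, assuming the first (NVIDIA) form is the one the Copyright Checks workflow accepts; the regular expressions and has_expected_header are hypothetical and are not the repository's actual checker:

```python
import re

# Assumed patterns, modeled on the NVIDIA header quoted above; the real
# workflow may accept other years or wording.
LICENSE_RE = re.compile(r"^# SPDX-License-Identifier: Apache-2\.0$")
COPYRIGHT_RE = re.compile(
    r"^# SPDX-FileCopyrightText: Copyright \(c\) \d{4}(-\d{4})? "
    r"NVIDIA CORPORATION & AFFILIATES\. All rights reserved\.$"
)


def has_expected_header(text: str) -> bool:
    """Check the first two lines of a file, allowing a leading shebang."""
    lines = text.splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    head = lines[:2]
    return (
        any(LICENSE_RE.match(line) for line in head)
        and any(COPYRIGHT_RE.match(line) for line in head)
    )
```

Running a check like this over the new files under components/src/dynamo/vllm/checkpoint/ before pushing would surface a mismatched owner line early, although only the CI job is authoritative.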
