Skip to content

Conversation

@michaelshin
Copy link
Contributor

@michaelshin michaelshin commented Nov 7, 2025

Overview:

Co-written by @galletas1712

This feature enables fast startup of vLLM inference engines by checkpointing GPU-initialized processes and restoring them on-demand. This dramatically reduces the time to initialize vLLM workers, especially beneficial in autoscaling scenarios.

Important Caveats
  • Intra-Pod Only: This feature currently only works for checkpoint/restore within the same pod. This is due to a known issue where CRIU cannot restore due to process ID collisions. This is actively being worked on and is crucial to fix the cold start problem.
  • Requires upstream vLLM changes: CRIU requires small changes to vLLM to enable this feature and we require a vLLM patch before merging this
  • Requires a lot of permissions: The pod must have root, host PIDs exposed, and securityContext
  • Requires 580 drivers

Details:

The checkpoint/restore functionality uses

  1. cuda-checkpoint: Checkpoints GPU memory and CUDA state
  2. CRIU (Checkpoint/Restore In Userspace): Captures the complete process state including memory, file descriptors, and process tree

The implementation uses CheckpointableAsyncLLM, a wrapper around vLLM's AsyncLLM that:

  • Runs the vLLM engine in a subprocess
  • Communicates via ZMQ (REQ-REP pattern) for client-server communication
  • Supports CRIU checkpoint/restore operations
  • Handles GPU memory checkpointing via NVIDIA's cuda-checkpoint API

Mermaid Chart - Create complex, visual diagrams with text -2025-11-07-211945

Three Initialization Paths

1. Checkpointing Disabled (Standard Path)
  • Creates a standard AsyncLLM instance directly
  • No subprocess, no ZMQ overhead
  • Normal vLLM startup time
2. Checkpoint Exists (Fast Restore Path)
  • Detects existing checkpoint directory (with matching config hash)
  • Creates CheckpointableAsyncLLM with auto_start=False
  • Runs CRIU and cuda checkpoint plugin to restore to resurrect the checkpointed process
  • Reconnects ZMQ socket to restored subprocess
  • Result: Engine ready in seconds instead of minutes
3. Checkpoint Doesn't Exist (Initial Checkpoint Creation)
  • Creates CheckpointableAsyncLLM with auto_start=True
  • Starts vLLM engine in subprocess and waits for readiness
  • Dumps process state with CRIU and cuda checkpoint plugin
  • Atomically moves checkpoint to final directory (handles race conditions)
  • Restores from the checkpoint (whether created by this worker or another)
  • Result: Checkpoint created for future use, engine ready after restore

Where should the reviewer start?

CheckpointableAsyncLLM
  • Found in components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
  • Wraps AsyncLLM in subprocess with ZMQ server, exposes same async API
  • Manages CRIU checkpoint/restore lifecycle and PTY/TTY handling
vLLM Engine Management
  • Found in components/src/dynamo/vllm/engine.py
  • Added a new function to choose between AsyncLLM and CheckpointableAsyncLLM
Dockerfile.vllm-checkpoint
  • A new Docker image to add CRIU, cuda-checkpoint and additional dependencies while this is an experimental feature
Config Hash-Based Isolation
  • Uses VllmConfig.compute_hash()[:8] for subdirectories: checkpoint_dir/{hash}/
  • Prevents mismatches when using same checkpoint_dir with different configs
GPU Memory Checkpointing (checkpoint/utils.py)
  • NVIDIA cuda-checkpoint API: Lock → Checkpoint → Restore/Unlock
  • Logs GPU memory usage for debugging
ZMQ Communication
  • REQ-REP over TCP, cloudpickle serialization
  • Subprocess runs ZMQAsyncLLMServer, parent sends requests
Metadata (checkpoint/metadata.py)
  • Stores vllm_checkpoint_metadata.json with TTY info, PIDs, ZMQ port
  • Used for TTY remapping during restore
Race Condition Handling
  • Atomic os.rename() ensures only one checkpoint survives
  • First worker wins, others clean up and restore from the shared checkpoint

Summary by CodeRabbit

  • New Features

    • Added model checkpoint and restore capabilities for LLM inference engines, controllable via ENABLE_CHECKPOINTING and CHECKPOINT_DIR environment variables
    • Supports CUDA-aware checkpointing with automatic metadata persistence and recovery
  • Chores

    • Added container image for deploying checkpoint-enabled inference environments

@michaelshin michaelshin requested review from a team as code owners November 7, 2025 21:41
@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@michaelshin michaelshin changed the title experimental: add support for checkpoint vLLM Dynamo Workers feat: add experimental support for checkpoint vLLM Dynamo Workers Nov 7, 2025
@github-actions github-actions bot added the feat label Nov 7, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

This pull request introduces a complete checkpoint/restore system for vLLM engines using CRIU. It adds configuration flags, a subprocess-based AsyncLLM wrapper communicating via ZMQ RPC, GPU process utilities, metadata persistence, and integrates checkpointing into the engine factory with optional resume-on-startup behavior for distributed workers.

Changes

Cohort / File(s) Summary
Configuration
components/src/dynamo/vllm/args.py
Adds ENABLE_CHECKPOINTING boolean flag and CHECKPOINT_DIR path variable, both sourced from environment variables during module initialization.
Checkpoint Package Exports
components/src/dynamo/vllm/checkpoint/__init__.py
Defines package-level exports for CheckpointableAsyncLLM and CheckpointMetadata via __all__.
Metadata Storage
components/src/dynamo/vllm/checkpoint/metadata.py
Introduces CheckpointMetadata class for serializing/deserializing checkpoint state (TTY info, PIDs, ZMQ port) to/from JSON files.
Checkpointable Engine
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
Implements CheckpointableAsyncLLM wrapper that runs AsyncLLM in an isolated subprocess with ZMQ RPC communication, CRIU checkpoint/restore lifecycle methods, PTY log forwarding, and generator support.
ZMQ RPC Server
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py
Defines ZMQAsyncLLMServer and related RPC message types (RPCMessageType, RPCResponse, PropertyRequest, PropertyResponse) for request-reply communication with subprocess-based engine.
Checkpoint Utilities
components/src/dynamo/vllm/checkpoint/utils.py
Provides GPU process discovery, NVIDIA file descriptor tracking, CUDA checkpointing integration, TTY information retrieval, and error handling utilities.
Integration & Testing
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py
End-to-end test module exercising full checkpoint/restore cycles with generation validation before and after restoration.
Engine Factory
components/src/dynamo/vllm/engine.py
Adds create_vllm_engine() factory returning either standard AsyncLLM or CheckpointableAsyncLLM based on enable_checkpointing flag; includes initialize_engine() for post-creation setup, checkpoint discovery, and worker coordination.
Main Integration
components/src/dynamo/vllm/main.py
Replaces direct AsyncLLM creation with create_vllm_engine() call; integrates initialize_engine() for syncing configs across subprocess boundaries and applying recovered state.
Container Setup
container/Dockerfile.vllm-checkpoint
Multi-stage Dockerfile for building vLLM with CUDA, DeepGEMM, FlashInfer, CRIU support, and runtime environment with GPU utilities and checkpoint tooling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • Dense subprocess/RPC orchestration: checkpointable_async_llm.py combines ZMQ communication, subprocess lifecycle, CRIU coordination, and generator state management requiring careful verification of synchronization logic and error paths.
  • GPU/CUDA integration: utils.py and zmq_async_llm_server.py require understanding of process file descriptors, CUDA contexts, and checkpoint sequencing.
  • Integration surface: Changes span configuration, engine factory, and main initialization paths, requiring trace-through of config propagation and worker synchronization (especially atomic checkpoint placement under contention).
  • Heterogeneous concerns: Multiple subsystems (subprocesses, ZMQ sockets, PTY forwarding, CRIU operations) interact in complex ways; checkpoint/resume flows need validation for state consistency.
  • Container complexity: Multi-stage build with numerous build arguments and dependency chain.

Areas requiring extra attention:

  • Socket cleanup and subprocess teardown edge cases in CheckpointableAsyncLLM.__del__() and shutdown()
  • RPC timeout and error propagation paths, especially during engine initialization
  • Atomic checkpoint directory race condition handling in initialize_engine()
  • CRIU metadata persistence and recovery correctness
  • Generator state management across subprocess boundary

Poem

🐇 Hop, hop, checkpoint and restore—
Subprocesses and sockets, a rabbit's folklore,
CRIU guards state in forests of PIDs,
ZMQ whispers RPC to what hides,
Now engines awake where checkpoints reside. 🎉

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.44% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add experimental support for checkpoint vLLM Dynamo Workers' clearly and concisely summarizes the main change: adding experimental checkpoint support for vLLM workers.
Description check ✅ Passed The PR description comprehensively covers overview, technical details, initialization paths, file locations, and important caveats; it aligns well with the template structure despite not strictly following every section heading.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/main.py (1)

655-673: Initialize checkpointable engine for multimodal workers

After introducing checkpointing, create_vllm_engine() can return a CheckpointableAsyncLLM with auto_start=False (when a checkpoint exists). Unlike the prefill/decode paths, this branch never calls initialize_engine, so we neither restore nor wait for readiness and the subprocess never starts. Please invoke initialize_engine(...) here and update vllm_config if it returns a synced config, mirroring the other worker initializers.

Apply this diff:

     engine_client, vllm_config, default_sampling_params = setup_vllm_engine(config)
 
+    synced_config = await initialize_engine(
+        engine_client,
+        ENABLE_CHECKPOINTING,
+        CHECKPOINT_DIR,
+    )
+    if synced_config is not None:
+        vllm_config = synced_config
+
     # For aggregated mode, no downstream client is needed
     # TODO: Implement disaggregated mode with proper decode worker client
     downstream_client = None
🧹 Nitpick comments (1)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1)

750-775: Path validation for checkpoint_dir is a good practice but lower priority given current mitigations.

The code passes checkpoint_dir directly to the CRIU command without explicit path validation (no absolute path check or symlink resolution). While this is a valid concern for defense-in-depth, the risk is mitigated by:

  • subprocess.run(cmd) without shell=True prevents shell injection attacks
  • No sudo or privilege escalation found in the code (command runs with process permissions only)
  • Test usage shows safe temporary paths via tempfile.mktemp()

Consider adding optional validation: verify checkpoint_dir is an absolute path and resolve any symlinks before passing to CRIU, but this is not a critical blocker given the existing protections.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f8fd86 and daf302f.

📒 Files selected for processing (10)
  • components/src/dynamo/vllm/args.py (1 hunks)
  • components/src/dynamo/vllm/checkpoint/__init__.py (1 hunks)
  • components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1 hunks)
  • components/src/dynamo/vllm/checkpoint/metadata.py (1 hunks)
  • components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py (1 hunks)
  • components/src/dynamo/vllm/checkpoint/utils.py (1 hunks)
  • components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1 hunks)
  • components/src/dynamo/vllm/engine.py (1 hunks)
  • components/src/dynamo/vllm/main.py (5 hunks)
  • container/Dockerfile.vllm-checkpoint (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-09-16T17:16:07.820Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.vllm.j2:221-233
Timestamp: 2025-09-16T17:16:07.820Z
Learning: In the dynamo project, when converting Dockerfiles to Jinja2 templates, the primary goal is backward compatibility - generated Dockerfiles must be identical to the originals. Security improvements or other enhancements that would change the generated output are out of scope for templating PRs and should be addressed separately.

Applied to files:

  • container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.

Applied to files:

  • container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-06-05T01:10:51.865Z
Learnt from: tanmayv25
Repo: ai-dynamo/dynamo PR: 1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.

Applied to files:

  • components/src/dynamo/vllm/main.py
🧬 Code graph analysis (6)
components/src/dynamo/vllm/engine.py (1)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (6)
  • CheckpointableAsyncLLM (143-1166)
  • from_vllm_config (1082-1122)
  • get_vllm_config (597-598)
  • criu_resume (804-960)
  • wait_until_ready (327-384)
  • criu_checkpoint (693-802)
components/src/dynamo/vllm/main.py (1)
components/src/dynamo/vllm/engine.py (2)
  • create_vllm_engine (27-106)
  • initialize_engine (130-228)
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py (2)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (7)
  • CheckpointableAsyncLLM (143-1166)
  • from_engine_args (1125-1166)
  • criu_resume (804-960)
  • generate (548-568)
  • shutdown (962-1030)
  • wait_until_ready (327-384)
  • criu_checkpoint (693-802)
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1)
  • run (167-301)
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (5)
  • get_model_config (600-601)
  • errored (681-684)
  • sleep (631-632)
  • encode (573-595)
  • shutdown (962-1030)
components/src/dynamo/vllm/checkpoint/__init__.py (2)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1)
  • CheckpointableAsyncLLM (143-1166)
components/src/dynamo/vllm/checkpoint/metadata.py (1)
  • CheckpointMetadata (13-74)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (3)
components/src/dynamo/vllm/checkpoint/utils.py (4)
  • find_gpu_worker_pids (326-345)
  • collect_process_tree_pids (46-75)
  • verify_processes_exited (97-136)
  • get_tty_info (139-160)
components/src/dynamo/vllm/checkpoint/metadata.py (4)
  • CheckpointMetadata (13-74)
  • tty_external (70-74)
  • save (44-52)
  • load (55-67)
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (6)
  • RPCMessageType (27-32)
  • RPCResponse (34-39)
  • PropertyRequest (41-43)
  • PropertyResponse (45-48)
  • run_async_llm_server (304-367)
  • run (167-301)
🪛 ast-grep (0.39.7)
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

[warning] 120-121: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control

(avoid-mktemp-python)

🪛 GitHub Actions: Copyright Checks
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

components/src/dynamo/vllm/checkpoint/utils.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/utils.py

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

components/src/dynamo/vllm/checkpoint/metadata.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/metadata.py

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4189/merge) by michaelshin.
container/Dockerfile.vllm-checkpoint

[error] 206-206: Docker build failed during CRIU installation step. Process completed with exit code 1.

🪛 Ruff (0.14.3)
components/src/dynamo/vllm/engine.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)


166-166: Avoid specifying long messages outside the exception class

(TRY003)

components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

1-1: Shebang is present but file is not executable

(EXE001)


52-52: Avoid specifying long messages outside the exception class

(TRY003)


99-99: Abstract raise to an inner function

(TRY301)


99-99: Avoid specifying long messages outside the exception class

(TRY003)


112-113: Remove exception handler; error is immediately re-raised

(TRY203)


121-122: Use of insecure and deprecated function (mktemp)

(S306)


163-163: Abstract raise to an inner function

(TRY301)


163-163: Avoid specifying long messages outside the exception class

(TRY003)


224-224: Abstract raise to an inner function

(TRY301)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


240-241: Remove exception handler; error is immediately re-raised

(TRY203)

components/src/dynamo/vllm/checkpoint/utils.py

15-15: Do not catch blind exception: Exception

(BLE001)


29-29: Do not catch blind exception: Exception

(BLE001)


42-42: Do not catch blind exception: Exception

(BLE001)


65-66: try-except-continue detected, consider logging the exception

(S112)


65-65: Do not catch blind exception: Exception

(BLE001)


91-91: Consider moving this statement to an else block

(TRY300)


92-92: Do not catch blind exception: Exception

(BLE001)


131-131: Do not catch blind exception: Exception

(BLE001)


132-132: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


157-157: Consider moving this statement to an else block

(TRY300)


158-158: Do not catch blind exception: Exception

(BLE001)


179-179: Do not catch blind exception: Exception

(BLE001)


208-209: try-except-continue detected, consider logging the exception

(S112)


208-208: Do not catch blind exception: Exception

(BLE001)


210-211: try-except-pass detected, consider logging the exception

(S110)


210-210: Do not catch blind exception: Exception

(BLE001)


249-249: Do not catch blind exception: Exception

(BLE001)


258-258: Avoid specifying long messages outside the exception class

(TRY003)


276-279: Avoid specifying long messages outside the exception class

(TRY003)


289-292: Avoid specifying long messages outside the exception class

(TRY003)


319-320: try-except-continue detected, consider logging the exception

(S112)


319-319: Do not catch blind exception: Exception

(BLE001)


321-322: try-except-pass detected, consider logging the exception

(S110)


321-321: Do not catch blind exception: Exception

(BLE001)


381-381: Do not catch blind exception: Exception

(BLE001)


445-447: Avoid specifying long messages outside the exception class

(TRY003)


456-456: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

91-91: Consider moving this statement to an else block

(TRY300)


97-99: Avoid specifying long messages outside the exception class

(TRY003)


107-109: Avoid specifying long messages outside the exception class

(TRY003)


118-118: Avoid specifying long messages outside the exception class

(TRY003)


149-150: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Consider moving this statement to an else block

(TRY300)


163-163: Avoid specifying long messages outside the exception class

(TRY003)


252-252: Do not catch blind exception: Exception

(BLE001)


255-255: Use explicit conversion flag

Replace with conversion flag

(RUF010)


270-270: Do not catch blind exception: Exception

(BLE001)


273-273: Use explicit conversion flag

Replace with conversion flag

(RUF010)


330-330: Do not catch blind exception: Exception

(BLE001)


364-364: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/vllm/checkpoint/__init__.py

9-9: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

components/src/dynamo/vllm/checkpoint/metadata.py

51-51: Do not catch blind exception: Exception

(BLE001)


65-65: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

75-75: Do not catch blind exception: Exception

(BLE001)


78-78: Use explicit conversion flag

Replace with conversion flag

(RUF010)


125-125: Abstract raise to an inner function

(TRY301)


125-125: Avoid specifying long messages outside the exception class

(TRY003)


134-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


134-134: Avoid specifying long messages outside the exception class

(TRY003)


140-140: Use raise without specifying exception name

Remove exception name

(TRY201)


178-178: Unused method argument: stat_loggers

(ARG002)


259-259: Do not catch blind exception: Exception

(BLE001)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


300-300: Avoid specifying long messages outside the exception class

(TRY003)


305-307: Avoid specifying long messages outside the exception class

(TRY003)


308-310: Avoid specifying long messages outside the exception class

(TRY003)


347-348: Avoid specifying long messages outside the exception class

(TRY003)


367-370: Avoid specifying long messages outside the exception class

(TRY003)


381-384: Avoid specifying long messages outside the exception class

(TRY003)


389-391: Avoid specifying long messages outside the exception class

(TRY003)


393-394: Avoid specifying long messages outside the exception class

(TRY003)


413-413: Avoid specifying long messages outside the exception class

(TRY003)


416-416: Avoid specifying long messages outside the exception class

(TRY003)


420-420: Avoid specifying long messages outside the exception class

(TRY003)


428-430: Avoid specifying long messages outside the exception class

(TRY003)


432-433: Avoid specifying long messages outside the exception class

(TRY003)


452-452: Avoid specifying long messages outside the exception class

(TRY003)


455-455: Avoid specifying long messages outside the exception class

(TRY003)


459-459: Avoid specifying long messages outside the exception class

(TRY003)


475-475: Avoid specifying long messages outside the exception class

(TRY003)


478-478: Avoid specifying long messages outside the exception class

(TRY003)


482-482: Avoid specifying long messages outside the exception class

(TRY003)


492-494: Avoid specifying long messages outside the exception class

(TRY003)


496-497: Avoid specifying long messages outside the exception class

(TRY003)


508-508: Avoid specifying long messages outside the exception class

(TRY003)


511-511: Avoid specifying long messages outside the exception class

(TRY003)


515-515: Avoid specifying long messages outside the exception class

(TRY003)


695-695: Unused method argument: cuda_checkpoint_path

(ARG002)


703-705: Avoid specifying long messages outside the exception class

(TRY003)


707-707: Avoid specifying long messages outside the exception class

(TRY003)


713-717: Avoid specifying long messages outside the exception class

(TRY003)


773-773: subprocess call: check for execution of untrusted input

(S603)


775-775: Avoid specifying long messages outside the exception class

(TRY003)


789-789: Avoid specifying long messages outside the exception class

(TRY003)


805-805: Unused method argument: cuda_checkpoint_path

(ARG002)


817-818: Avoid specifying long messages outside the exception class

(TRY003)


823-824: Avoid specifying long messages outside the exception class

(TRY003)


838-838: Starting a process with a shell, possible injection detected

(S605)


852-852: Do not catch blind exception: Exception

(BLE001)


900-900: Do not catch blind exception: Exception

(BLE001)


907-907: subprocess call: check for execution of untrusted input

(S603)


910-910: subprocess call: check for execution of untrusted input

(S603)


912-912: Avoid specifying long messages outside the exception class

(TRY003)


942-942: Avoid specifying long messages outside the exception class

(TRY003)


944-944: Avoid specifying long messages outside the exception class

(TRY003)


955-955: Do not catch blind exception: Exception

(BLE001)


1017-1017: Do not catch blind exception: Exception

(BLE001)


1018-1018: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1066-1066: Do not catch blind exception: Exception

(BLE001)


1098-1101: Avoid specifying long messages outside the exception class

(TRY003)


1105-1108: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
components/src/dynamo/vllm/engine.py (1)

183-228: Well-designed atomic checkpoint creation with race handling.

The checkpoint creation logic correctly uses os.rename() for atomic operations and handles EEXIST race conditions when multiple workers compete to create the checkpoint. The fallback to restore from the winner's checkpoint is a clean solution for distributed worker initialization.

Comment on lines +1 to +2
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Pipeline failure: Fix copyright header format.

The copyright header format is not matching the expected pattern, causing the pipeline to fail.

The current header includes NVIDIA copyright, but the pipeline expects a different format. Check with the team or examine other accepted files in components/src/dynamo/vllm/ for the correct pattern that satisfies the copyright check.

🧰 Tools
🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py

🤖 Prompt for AI Agents
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py lines 1-2: the
file header does not match the repository's expected SPDX/copyright header
pattern and is causing CI to fail; replace the two header lines with the exact
header used in other accepted files in components/src/dynamo/vllm/ (copy-paste
the SPDX-License-Identifier and SPDX-FileCopyrightText lines from a matching
file in that folder so formatting and wording match exactly), then run the
repo's license/header check locally to confirm it passes.

Comment on lines +834 to +868
if metadata.tree_pid:
start = time.time()
# Wait up to a short grace period since dump should have killed it
while time.time() - start < 5.0:
if os.system(f"kill -0 {metadata.tree_pid} >/dev/null 2>&1") != 0:
break
await asyncio.sleep(0.05)

# Verify root PID is not taken now (PID could be reused by others)
pid_path = f"/proc/{metadata.tree_pid}"
if os.path.exists(pid_path):
# Try to read the cmdline of the holder for diagnostics
holder = ""
try:
cmd_path = os.path.join(pid_path, "cmdline")
with open(cmd_path, "rb") as f:
raw = f.read().replace(b"\x00", b" ")
holder = raw.decode("utf-8", "ignore").strip()
except Exception:
holder = ""
msg = (
"CRIU restore pre-check failed: root PID "
f"{metadata.tree_pid} is in use in current PID namespace. "
"Restore will fail with EEXIST. Consider restoring in a "
"new PID namespace or wait until the PID is free."
)
if holder:
logger.error(
"%s Holder cmdline: %s",
msg,
holder,
)
else:
logger.error("%s", msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Security: Replace os.system() with subprocess API to prevent shell injection.

Using os.system() with formatted strings can be vulnerable to command injection (CWE-78) if the PID value were ever controlled by an attacker or derived from untrusted input.

Apply this diff to use the safer subprocess API:

-            # Verify root PID is not taken now (PID could be reused by others)
-            pid_path = f"/proc/{metadata.tree_pid}"
-            if os.path.exists(pid_path):
+            # Verify root PID is not taken now (PID could be reused by others)
+            pid_path = f"/proc/{metadata.tree_pid}"
+            if os.path.exists(pid_path):

For the PID check at line 838, replace:

-            while time.time() - start < 5.0:
-                if os.system(f"kill -0 {metadata.tree_pid} >/dev/null 2>&1") != 0:
-                    break
-                await asyncio.sleep(0.05)
+            while time.time() - start < 5.0:
+                try:
+                    os.kill(metadata.tree_pid, 0)  # Signal 0 checks existence
+                    # Process exists, keep waiting
+                    await asyncio.sleep(0.05)
+                except (ProcessLookupError, PermissionError):
+                    # Process doesn't exist or we can't check it
+                    break

Using os.kill(pid, 0) is the standard Python way to check if a process exists without sending a real signal.

🧰 Tools
🪛 Ruff (0.14.3)

838-838: Starting a process with a shell, possible injection detected

(S605)


852-852: Do not catch blind exception: Exception

(BLE001)

Comment on lines 1 to 2
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Pipeline failure: Fix copyright header format.

The copyright header format is not matching the expected pattern, causing the pipeline to fail.

Similar to zmq_async_llm_server.py, verify the expected header format with the team or check other accepted files in the components/src/dynamo/vllm/ directory.

🧰 Tools
🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/metadata.py

@@ -0,0 +1,293 @@
#!/usr/bin/env python3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Pipeline failure: Add required copyright header.

The copyright header is missing or invalid, causing the pipeline to fail.

Add the required header format at the top of the file. Based on other files in the PR, it should follow this pattern:

 #!/usr/bin/env python3
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 Test checkpoint/restore functionality of CheckpointableAsyncLLM.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/usr/bin/env python3
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
🧰 Tools
🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py

🪛 Ruff (0.14.3)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents
In components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py at line 1,
the required copyright/header is missing which causes the pipeline to fail; add
the standard project header (matching other files in this PR) at the very top of
the file, replacing or preceding the current shebang if needed, and ensure it
includes the copyright owner, year, and license text exactly as used in other
files so the linter/pipeline accepts it.

Comment on lines +120 to +123
# Generate a unique checkpoint directory path (but don't create it)
checkpoint_dir = tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
print(f"Will use checkpoint directory: {checkpoint_dir}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Security: Replace deprecated tempfile.mktemp() with secure alternative.

tempfile.mktemp() is deprecated and vulnerable to race conditions (CWE-377) where an attacker could create a file at the returned path before your code does.

Apply this diff to use a secure alternative:

-    checkpoint_dir = tempfile.mktemp(
-        prefix="vllm_test_checkpoint_", dir="/tmp")
-    print(f"Will use checkpoint directory: {checkpoint_dir}")
+    with tempfile.TemporaryDirectory(
+        prefix="vllm_test_checkpoint_", dir="/tmp") as checkpoint_dir:
+        print(f"Will use checkpoint directory: {checkpoint_dir}")

Note: You'll need to adjust the indentation of the code that follows to be inside the with block, or use tempfile.mkdtemp() if you need to preserve the directory after the test completes (which appears to be the intent based on line 248).

If you want to preserve the directory for debugging:

-    checkpoint_dir = tempfile.mktemp(
-        prefix="vllm_test_checkpoint_", dir="/tmp")
+    checkpoint_dir = tempfile.mkdtemp(
+        prefix="vllm_test_checkpoint_", dir="/tmp")

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 ast-grep (0.39.7)

[warning] 120-121: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control

(avoid-mktemp-python)

🪛 Ruff (0.14.3)

121-122: Use of insecure and deprecated function (mktemp)

(S306)

🤖 Prompt for AI Agents
In components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py around lines
120-123, replace the deprecated and insecure tempfile.mktemp() call with a
secure alternative: use tempfile.mkdtemp() if the test needs the directory to
persist after the test (preferred here), or wrap code in a with
tempfile.TemporaryDirectory() as checkpoint_dir: block and indent subsequent
code if you want automatic cleanup; update any teardown/cleanup logic
accordingly to remove the directory when no longer needed.

Comment on lines +52 to +75
pending: list[int] = [root_pid]
seen: set[int] = set[int]()
while pending:
pid = pending.pop()
if pid in seen:
continue
seen.add(pid)
children_path = f"/proc/{pid}/task/{pid}/children"
try:
with open(children_path, encoding="utf-8") as f:
content = f.read().strip()
except FileNotFoundError:
continue
except Exception:
continue
if not content:
continue
for token in content.split():
try:
child = int(token)
except ValueError:
continue
pending.append(child)
return seen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix runtime TypeError in process tree traversal

set[int]() raises TypeError: 'types.GenericAlias' object is not callable on every call (Python 3.9+), so this helper currently breaks before reading /proc. Use a plain set() and keep the type annotation separate.

Apply this diff:

-    pending: list[int] = [root_pid]
-    seen: set[int] = set[int]()
+    pending: list[int] = [root_pid]
+    seen: set[int] = set()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pending: list[int] = [root_pid]
seen: set[int] = set[int]()
while pending:
pid = pending.pop()
if pid in seen:
continue
seen.add(pid)
children_path = f"/proc/{pid}/task/{pid}/children"
try:
with open(children_path, encoding="utf-8") as f:
content = f.read().strip()
except FileNotFoundError:
continue
except Exception:
continue
if not content:
continue
for token in content.split():
try:
child = int(token)
except ValueError:
continue
pending.append(child)
return seen
pending: list[int] = [root_pid]
seen: set[int] = set()
while pending:
pid = pending.pop()
if pid in seen:
continue
seen.add(pid)
children_path = f"/proc/{pid}/task/{pid}/children"
try:
with open(children_path, encoding="utf-8") as f:
content = f.read().strip()
except FileNotFoundError:
continue
except Exception:
continue
if not content:
continue
for token in content.split():
try:
child = int(token)
except ValueError:
continue
pending.append(child)
return seen
🧰 Tools
🪛 Ruff (0.14.3)

65-66: try-except-continue detected, consider logging the exception

(S112)


65-65: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In components/src/dynamo/vllm/checkpoint/utils.py around lines 52 to 75, the
code uses set[int]() which raises TypeError because types.GenericAlias is not
callable; change the initialization to use a plain set() while keeping the
typing annotation (e.g., seen: set[int] = set()) so the runtime constructs a set
correctly and the static type remains accurate; no other logic changes required.

Comment on lines 1 to 2
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Pipeline failure: Fix copyright header format.

The copyright header format is not matching the expected pattern, causing the pipeline to fail.

The existing headers appear correct for the vLLM project. This might be a mismatch with the Dynamo project requirements. Verify the expected header format with the team or check other accepted files in components/src/dynamo/vllm/ for the correct pattern.

🧰 Tools
🪛 GitHub Actions: Copyright Checks

[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py

🤖 Prompt for AI Agents
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py lines 1-2: the
file header SPDX/license lines do not match the Dynamo project expected header
pattern and cause CI failures; update the top-of-file header to the repository's
canonical header (match other accepted files under
components/src/dynamo/vllm/)—replace the current two SPDX lines with the exact
header format used across the Dynamo codebase (including the correct
SPDX-License-Identifier line and the correct SPDX-FileCopyrightText
content/owner string), commit the change, and run the pipeline to confirm the
header now passes validation.

Comment on lines +211 to +214
RUN add-apt-repository ppa:criu/ppa && \
apt-get update && \
apt-get install -y --no-install-recommends criu && \
rm -rf /var/lib/apt/lists/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Make the CRIU PPA step non-interactive

add-apt-repository prompts for confirmation unless -y/--yes is passed, so the Docker build hangs and then fails (see the pre-merge CI failure). Please add the non-interactive flag so the stage can run unattended.

Apply this diff:

-RUN add-apt-repository ppa:criu/ppa && \
+RUN add-apt-repository -y ppa:criu/ppa && \
     apt-get update && \
     apt-get install -y --no-install-recommends criu && \
     rm -rf /var/lib/apt/lists/*
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
RUN add-apt-repository ppa:criu/ppa && \
apt-get update && \
apt-get install -y --no-install-recommends criu && \
rm -rf /var/lib/apt/lists/*
RUN add-apt-repository -y ppa:criu/ppa && \
apt-get update && \
apt-get install -y --no-install-recommends criu && \
rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
In container/Dockerfile.vllm-checkpoint around lines 211 to 214, the
add-apt-repository call is interactive and blocks the Docker build; update the
command to run non-interactively by adding the -y (or --yes) flag to
add-apt-repository (e.g., add-apt-repository -y ppa:criu/ppa) so the step
proceeds unattended, leaving the apt-get update, apt-get install -y
--no-install-recommends criu, and cleanup of /var/lib/apt/lists/* unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants