-
Notifications
You must be signed in to change notification settings - Fork 679
feat: add experimental support for checkpoint vLLM Dynamo Workers #4189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Michael Shin <[email protected]>
WalkthroughThis pull request introduces a complete checkpoint/restore system for vLLM engines using CRIU. It adds configuration flags, a subprocess-based AsyncLLM wrapper communicating via ZMQ RPC, GPU process utilities, metadata persistence, and integrates checkpointing into the engine factory with optional resume-on-startup behavior for distributed workers. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes
Areas requiring extra attention:
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/main.py (1)
655-673: Initialize checkpointable engine for multimodal workersAfter introducing checkpointing,
create_vllm_engine()can return aCheckpointableAsyncLLMwithauto_start=False(when a checkpoint exists). Unlike the prefill/decode paths, this branch never callsinitialize_engine, so we neither restore nor wait for readiness and the subprocess never starts. Please invokeinitialize_engine(...)here and updatevllm_configif it returns a synced config, mirroring the other worker initializers.Apply this diff:
engine_client, vllm_config, default_sampling_params = setup_vllm_engine(config) + synced_config = await initialize_engine( + engine_client, + ENABLE_CHECKPOINTING, + CHECKPOINT_DIR, + ) + if synced_config is not None: + vllm_config = synced_config + # For aggregated mode, no downstream client is needed # TODO: Implement disaggregated mode with proper decode worker client downstream_client = None
🧹 Nitpick comments (1)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1)
750-775: Path validation for checkpoint_dir is a good practice but lower priority given current mitigations.The code passes
checkpoint_dirdirectly to the CRIU command without explicit path validation (no absolute path check or symlink resolution). While this is a valid concern for defense-in-depth, the risk is mitigated by:
subprocess.run(cmd)withoutshell=Trueprevents shell injection attacks- No
sudoor privilege escalation found in the code (command runs with process permissions only)- Test usage shows safe temporary paths via
tempfile.mktemp()Consider adding optional validation: verify
checkpoint_diris an absolute path and resolve any symlinks before passing to CRIU, but this is not a critical blocker given the existing protections.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
components/src/dynamo/vllm/args.py(1 hunks)components/src/dynamo/vllm/checkpoint/__init__.py(1 hunks)components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py(1 hunks)components/src/dynamo/vllm/checkpoint/metadata.py(1 hunks)components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py(1 hunks)components/src/dynamo/vllm/checkpoint/utils.py(1 hunks)components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py(1 hunks)components/src/dynamo/vllm/engine.py(1 hunks)components/src/dynamo/vllm/main.py(5 hunks)container/Dockerfile.vllm-checkpoint(1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-09-16T17:16:07.820Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.vllm.j2:221-233
Timestamp: 2025-09-16T17:16:07.820Z
Learning: In the dynamo project, when converting Dockerfiles to Jinja2 templates, the primary goal is backward compatibility - generated Dockerfiles must be identical to the originals. Security improvements or other enhancements that would change the generated output are out of scope for templating PRs and should be addressed separately.
Applied to files:
container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
Applied to files:
container/Dockerfile.vllm-checkpoint
📚 Learning: 2025-06-05T01:10:51.865Z
Learnt from: tanmayv25
Repo: ai-dynamo/dynamo PR: 1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
Applied to files:
components/src/dynamo/vllm/main.py
🧬 Code graph analysis (6)
components/src/dynamo/vllm/engine.py (1)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (6)
CheckpointableAsyncLLM(143-1166)from_vllm_config(1082-1122)get_vllm_config(597-598)criu_resume(804-960)wait_until_ready(327-384)criu_checkpoint(693-802)
components/src/dynamo/vllm/main.py (1)
components/src/dynamo/vllm/engine.py (2)
create_vllm_engine(27-106)initialize_engine(130-228)
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py (2)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (7)
CheckpointableAsyncLLM(143-1166)from_engine_args(1125-1166)criu_resume(804-960)generate(548-568)shutdown(962-1030)wait_until_ready(327-384)criu_checkpoint(693-802)components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1)
run(167-301)
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (1)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (5)
get_model_config(600-601)errored(681-684)sleep(631-632)encode(573-595)shutdown(962-1030)
components/src/dynamo/vllm/checkpoint/__init__.py (2)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (1)
CheckpointableAsyncLLM(143-1166)components/src/dynamo/vllm/checkpoint/metadata.py (1)
CheckpointMetadata(13-74)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py (3)
components/src/dynamo/vllm/checkpoint/utils.py (4)
find_gpu_worker_pids(326-345)collect_process_tree_pids(46-75)verify_processes_exited(97-136)get_tty_info(139-160)components/src/dynamo/vllm/checkpoint/metadata.py (4)
CheckpointMetadata(13-74)tty_external(70-74)save(44-52)load(55-67)components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py (6)
RPCMessageType(27-32)RPCResponse(34-39)PropertyRequest(41-43)PropertyResponse(45-48)run_async_llm_server(304-367)run(167-301)
🪛 ast-grep (0.39.7)
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py
[warning] 120-121: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-mktemp-python)
🪛 GitHub Actions: Copyright Checks
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py
components/src/dynamo/vllm/checkpoint/utils.py
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/utils.py
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py
components/src/dynamo/vllm/checkpoint/metadata.py
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/metadata.py
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4189/merge) by michaelshin.
container/Dockerfile.vllm-checkpoint
[error] 206-206: Docker build failed during CRIU installation step. Process completed with exit code 1.
🪛 Ruff (0.14.3)
components/src/dynamo/vllm/engine.py
127-127: Avoid specifying long messages outside the exception class
(TRY003)
166-166: Avoid specifying long messages outside the exception class
(TRY003)
components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py
1-1: Shebang is present but file is not executable
(EXE001)
52-52: Avoid specifying long messages outside the exception class
(TRY003)
99-99: Abstract raise to an inner function
(TRY301)
99-99: Avoid specifying long messages outside the exception class
(TRY003)
112-113: Remove exception handler; error is immediately re-raised
(TRY203)
121-122: Use of insecure and deprecated function (mktemp)
(S306)
163-163: Abstract raise to an inner function
(TRY301)
163-163: Avoid specifying long messages outside the exception class
(TRY003)
224-224: Abstract raise to an inner function
(TRY301)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
240-241: Remove exception handler; error is immediately re-raised
(TRY203)
components/src/dynamo/vllm/checkpoint/utils.py
15-15: Do not catch blind exception: Exception
(BLE001)
29-29: Do not catch blind exception: Exception
(BLE001)
42-42: Do not catch blind exception: Exception
(BLE001)
65-66: try-except-continue detected, consider logging the exception
(S112)
65-65: Do not catch blind exception: Exception
(BLE001)
91-91: Consider moving this statement to an else block
(TRY300)
92-92: Do not catch blind exception: Exception
(BLE001)
131-131: Do not catch blind exception: Exception
(BLE001)
132-132: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
157-157: Consider moving this statement to an else block
(TRY300)
158-158: Do not catch blind exception: Exception
(BLE001)
179-179: Do not catch blind exception: Exception
(BLE001)
208-209: try-except-continue detected, consider logging the exception
(S112)
208-208: Do not catch blind exception: Exception
(BLE001)
210-211: try-except-pass detected, consider logging the exception
(S110)
210-210: Do not catch blind exception: Exception
(BLE001)
249-249: Do not catch blind exception: Exception
(BLE001)
258-258: Avoid specifying long messages outside the exception class
(TRY003)
276-279: Avoid specifying long messages outside the exception class
(TRY003)
289-292: Avoid specifying long messages outside the exception class
(TRY003)
319-320: try-except-continue detected, consider logging the exception
(S112)
319-319: Do not catch blind exception: Exception
(BLE001)
321-322: try-except-pass detected, consider logging the exception
(S110)
321-321: Do not catch blind exception: Exception
(BLE001)
381-381: Do not catch blind exception: Exception
(BLE001)
445-447: Avoid specifying long messages outside the exception class
(TRY003)
456-456: Do not catch blind exception: Exception
(BLE001)
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py
91-91: Consider moving this statement to an else block
(TRY300)
97-99: Avoid specifying long messages outside the exception class
(TRY003)
107-109: Avoid specifying long messages outside the exception class
(TRY003)
118-118: Avoid specifying long messages outside the exception class
(TRY003)
149-150: Avoid specifying long messages outside the exception class
(TRY003)
155-155: Consider moving this statement to an else block
(TRY300)
163-163: Avoid specifying long messages outside the exception class
(TRY003)
252-252: Do not catch blind exception: Exception
(BLE001)
255-255: Use explicit conversion flag
Replace with conversion flag
(RUF010)
270-270: Do not catch blind exception: Exception
(BLE001)
273-273: Use explicit conversion flag
Replace with conversion flag
(RUF010)
330-330: Do not catch blind exception: Exception
(BLE001)
364-364: Do not catch blind exception: Exception
(BLE001)
components/src/dynamo/vllm/checkpoint/__init__.py
9-9: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
components/src/dynamo/vllm/checkpoint/metadata.py
51-51: Do not catch blind exception: Exception
(BLE001)
65-65: Do not catch blind exception: Exception
(BLE001)
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
75-75: Do not catch blind exception: Exception
(BLE001)
78-78: Use explicit conversion flag
Replace with conversion flag
(RUF010)
125-125: Abstract raise to an inner function
(TRY301)
125-125: Avoid specifying long messages outside the exception class
(TRY003)
134-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
134-134: Avoid specifying long messages outside the exception class
(TRY003)
140-140: Use raise without specifying exception name
Remove exception name
(TRY201)
178-178: Unused method argument: stat_loggers
(ARG002)
259-259: Do not catch blind exception: Exception
(BLE001)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
300-300: Avoid specifying long messages outside the exception class
(TRY003)
305-307: Avoid specifying long messages outside the exception class
(TRY003)
308-310: Avoid specifying long messages outside the exception class
(TRY003)
347-348: Avoid specifying long messages outside the exception class
(TRY003)
367-370: Avoid specifying long messages outside the exception class
(TRY003)
381-384: Avoid specifying long messages outside the exception class
(TRY003)
389-391: Avoid specifying long messages outside the exception class
(TRY003)
393-394: Avoid specifying long messages outside the exception class
(TRY003)
413-413: Avoid specifying long messages outside the exception class
(TRY003)
416-416: Avoid specifying long messages outside the exception class
(TRY003)
420-420: Avoid specifying long messages outside the exception class
(TRY003)
428-430: Avoid specifying long messages outside the exception class
(TRY003)
432-433: Avoid specifying long messages outside the exception class
(TRY003)
452-452: Avoid specifying long messages outside the exception class
(TRY003)
455-455: Avoid specifying long messages outside the exception class
(TRY003)
459-459: Avoid specifying long messages outside the exception class
(TRY003)
475-475: Avoid specifying long messages outside the exception class
(TRY003)
478-478: Avoid specifying long messages outside the exception class
(TRY003)
482-482: Avoid specifying long messages outside the exception class
(TRY003)
492-494: Avoid specifying long messages outside the exception class
(TRY003)
496-497: Avoid specifying long messages outside the exception class
(TRY003)
508-508: Avoid specifying long messages outside the exception class
(TRY003)
511-511: Avoid specifying long messages outside the exception class
(TRY003)
515-515: Avoid specifying long messages outside the exception class
(TRY003)
695-695: Unused method argument: cuda_checkpoint_path
(ARG002)
703-705: Avoid specifying long messages outside the exception class
(TRY003)
707-707: Avoid specifying long messages outside the exception class
(TRY003)
713-717: Avoid specifying long messages outside the exception class
(TRY003)
773-773: subprocess call: check for execution of untrusted input
(S603)
775-775: Avoid specifying long messages outside the exception class
(TRY003)
789-789: Avoid specifying long messages outside the exception class
(TRY003)
805-805: Unused method argument: cuda_checkpoint_path
(ARG002)
817-818: Avoid specifying long messages outside the exception class
(TRY003)
823-824: Avoid specifying long messages outside the exception class
(TRY003)
838-838: Starting a process with a shell, possible injection detected
(S605)
852-852: Do not catch blind exception: Exception
(BLE001)
900-900: Do not catch blind exception: Exception
(BLE001)
907-907: subprocess call: check for execution of untrusted input
(S603)
910-910: subprocess call: check for execution of untrusted input
(S603)
912-912: Avoid specifying long messages outside the exception class
(TRY003)
942-942: Avoid specifying long messages outside the exception class
(TRY003)
944-944: Avoid specifying long messages outside the exception class
(TRY003)
955-955: Do not catch blind exception: Exception
(BLE001)
1017-1017: Do not catch blind exception: Exception
(BLE001)
1018-1018: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1066-1066: Do not catch blind exception: Exception
(BLE001)
1098-1101: Avoid specifying long messages outside the exception class
(TRY003)
1105-1108: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
components/src/dynamo/vllm/engine.py (1)
183-228: Well-designed atomic checkpoint creation with race handling.The checkpoint creation logic correctly uses
os.rename()for atomic operations and handlesEEXISTrace conditions when multiple workers compete to create the checkpoint. The fallback to restore from the winner's checkpoint is a clean solution for distributed worker initialization.
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pipeline failure: Fix copyright header format.
The copyright header format is not matching the expected pattern, causing the pipeline to fail.
The current header includes NVIDIA copyright, but the pipeline expects a different format. Check with the team or examine other accepted files in components/src/dynamo/vllm/ for the correct pattern that satisfies the copyright check.
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py
🤖 Prompt for AI Agents
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.py lines 1-2: the
file header does not match the repository's expected SPDX/copyright header
pattern and is causing CI to fail; replace the two header lines with the exact
header used in other accepted files in components/src/dynamo/vllm/ (copy-paste
the SPDX-License-Identifier and SPDX-FileCopyrightText lines from a matching
file in that folder so formatting and wording match exactly), then run the
repo's license/header check locally to confirm it passes.
| if metadata.tree_pid: | ||
| start = time.time() | ||
| # Wait up to a short grace period since dump should have killed it | ||
| while time.time() - start < 5.0: | ||
| if os.system(f"kill -0 {metadata.tree_pid} >/dev/null 2>&1") != 0: | ||
| break | ||
| await asyncio.sleep(0.05) | ||
|
|
||
| # Verify root PID is not taken now (PID could be reused by others) | ||
| pid_path = f"/proc/{metadata.tree_pid}" | ||
| if os.path.exists(pid_path): | ||
| # Try to read the cmdline of the holder for diagnostics | ||
| holder = "" | ||
| try: | ||
| cmd_path = os.path.join(pid_path, "cmdline") | ||
| with open(cmd_path, "rb") as f: | ||
| raw = f.read().replace(b"\x00", b" ") | ||
| holder = raw.decode("utf-8", "ignore").strip() | ||
| except Exception: | ||
| holder = "" | ||
| msg = ( | ||
| "CRIU restore pre-check failed: root PID " | ||
| f"{metadata.tree_pid} is in use in current PID namespace. " | ||
| "Restore will fail with EEXIST. Consider restoring in a " | ||
| "new PID namespace or wait until the PID is free." | ||
| ) | ||
| if holder: | ||
| logger.error( | ||
| "%s Holder cmdline: %s", | ||
| msg, | ||
| holder, | ||
| ) | ||
| else: | ||
| logger.error("%s", msg) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security: Replace os.system() with subprocess API to prevent shell injection.
Using os.system() with formatted strings can be vulnerable to command injection (CWE-78) if the PID value were ever controlled by an attacker or derived from untrusted input.
Apply this diff to use the safer subprocess API:
- # Verify root PID is not taken now (PID could be reused by others)
- pid_path = f"/proc/{metadata.tree_pid}"
- if os.path.exists(pid_path):
+ # Verify root PID is not taken now (PID could be reused by others)
+ pid_path = f"/proc/{metadata.tree_pid}"
+ if os.path.exists(pid_path):For the PID check at line 838, replace:
- while time.time() - start < 5.0:
- if os.system(f"kill -0 {metadata.tree_pid} >/dev/null 2>&1") != 0:
- break
- await asyncio.sleep(0.05)
+ while time.time() - start < 5.0:
+ try:
+ os.kill(metadata.tree_pid, 0) # Signal 0 checks existence
+ # Process exists, keep waiting
+ await asyncio.sleep(0.05)
+ except (ProcessLookupError, PermissionError):
+ # Process doesn't exist or we can't check it
+ breakUsing os.kill(pid, 0) is the standard Python way to check if a process exists without sending a real signal.
🧰 Tools
🪛 Ruff (0.14.3)
838-838: Starting a process with a shell, possible injection detected
(S605)
852-852: Do not catch blind exception: Exception
(BLE001)
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pipeline failure: Fix copyright header format.
The copyright header format is not matching the expected pattern, causing the pipeline to fail.
Similar to zmq_async_llm_server.py, verify the expected header format with the team or check other accepted files in the components/src/dynamo/vllm/ directory.
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/metadata.py
| @@ -0,0 +1,293 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pipeline failure: Add required copyright header.
The copyright header is missing or invalid, causing the pipeline to fail.
Add the required header format at the top of the file. Based on other files in the PR, it should follow this pattern:
#!/usr/bin/env python3
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Test checkpoint/restore functionality of CheckpointableAsyncLLM.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/usr/bin/env python3 | |
| #!/usr/bin/env python3 | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project |
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py
🪛 Ruff (0.14.3)
1-1: Shebang is present but file is not executable
(EXE001)
🤖 Prompt for AI Agents
In components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py at line 1,
the required copyright/header is missing which causes the pipeline to fail; add
the standard project header (matching other files in this PR) at the very top of
the file, replacing or preceding the current shebang if needed, and ensure it
includes the copyright owner, year, and license text exactly as used in other
files so the linter/pipeline accepts it.
| # Generate a unique checkpoint directory path (but don't create it) | ||
| checkpoint_dir = tempfile.mktemp( | ||
| prefix="vllm_test_checkpoint_", dir="/tmp") | ||
| print(f"Will use checkpoint directory: {checkpoint_dir}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security: Replace deprecated tempfile.mktemp() with secure alternative.
tempfile.mktemp() is deprecated and vulnerable to race conditions (CWE-377) where an attacker could create a file at the returned path before your code does.
Apply this diff to use a secure alternative:
- checkpoint_dir = tempfile.mktemp(
- prefix="vllm_test_checkpoint_", dir="/tmp")
- print(f"Will use checkpoint directory: {checkpoint_dir}")
+ with tempfile.TemporaryDirectory(
+ prefix="vllm_test_checkpoint_", dir="/tmp") as checkpoint_dir:
+ print(f"Will use checkpoint directory: {checkpoint_dir}")Note: You'll need to adjust the indentation of the code that follows to be inside the with block, or use tempfile.mkdtemp() if you need to preserve the directory after the test completes (which appears to be the intent based on line 248).
If you want to preserve the directory for debugging:
- checkpoint_dir = tempfile.mktemp(
- prefix="vllm_test_checkpoint_", dir="/tmp")
+ checkpoint_dir = tempfile.mkdtemp(
+ prefix="vllm_test_checkpoint_", dir="/tmp")Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 ast-grep (0.39.7)
[warning] 120-121: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(
prefix="vllm_test_checkpoint_", dir="/tmp")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-mktemp-python)
🪛 Ruff (0.14.3)
121-122: Use of insecure and deprecated function (mktemp)
(S306)
🤖 Prompt for AI Agents
In components/src/dynamo/vllm/checkpoint/test_checkpoint_restore.py around lines
120-123, replace the deprecated and insecure tempfile.mktemp() call with a
secure alternative: use tempfile.mkdtemp() if the test needs the directory to
persist after the test (preferred here), or wrap code in a with
tempfile.TemporaryDirectory() as checkpoint_dir: block and indent subsequent
code if you want automatic cleanup; update any teardown/cleanup logic
accordingly to remove the directory when no longer needed.
| pending: list[int] = [root_pid] | ||
| seen: set[int] = set[int]() | ||
| while pending: | ||
| pid = pending.pop() | ||
| if pid in seen: | ||
| continue | ||
| seen.add(pid) | ||
| children_path = f"/proc/{pid}/task/{pid}/children" | ||
| try: | ||
| with open(children_path, encoding="utf-8") as f: | ||
| content = f.read().strip() | ||
| except FileNotFoundError: | ||
| continue | ||
| except Exception: | ||
| continue | ||
| if not content: | ||
| continue | ||
| for token in content.split(): | ||
| try: | ||
| child = int(token) | ||
| except ValueError: | ||
| continue | ||
| pending.append(child) | ||
| return seen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix runtime TypeError in process tree traversal
set[int]() raises TypeError: 'types.GenericAlias' object is not callable on every call (Python 3.9+), so this helper currently breaks before reading /proc. Use a plain set() and keep the type annotation separate.
Apply this diff:
- pending: list[int] = [root_pid]
- seen: set[int] = set[int]()
+ pending: list[int] = [root_pid]
+ seen: set[int] = set()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pending: list[int] = [root_pid] | |
| seen: set[int] = set[int]() | |
| while pending: | |
| pid = pending.pop() | |
| if pid in seen: | |
| continue | |
| seen.add(pid) | |
| children_path = f"/proc/{pid}/task/{pid}/children" | |
| try: | |
| with open(children_path, encoding="utf-8") as f: | |
| content = f.read().strip() | |
| except FileNotFoundError: | |
| continue | |
| except Exception: | |
| continue | |
| if not content: | |
| continue | |
| for token in content.split(): | |
| try: | |
| child = int(token) | |
| except ValueError: | |
| continue | |
| pending.append(child) | |
| return seen | |
| pending: list[int] = [root_pid] | |
| seen: set[int] = set() | |
| while pending: | |
| pid = pending.pop() | |
| if pid in seen: | |
| continue | |
| seen.add(pid) | |
| children_path = f"/proc/{pid}/task/{pid}/children" | |
| try: | |
| with open(children_path, encoding="utf-8") as f: | |
| content = f.read().strip() | |
| except FileNotFoundError: | |
| continue | |
| except Exception: | |
| continue | |
| if not content: | |
| continue | |
| for token in content.split(): | |
| try: | |
| child = int(token) | |
| except ValueError: | |
| continue | |
| pending.append(child) | |
| return seen |
🧰 Tools
🪛 Ruff (0.14.3)
65-66: try-except-continue detected, consider logging the exception
(S112)
65-65: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In components/src/dynamo/vllm/checkpoint/utils.py around lines 52 to 75, the
code uses set[int]() which raises TypeError because types.GenericAlias is not
callable; change the initialization to use a plain set() while keeping the
typing annotation (e.g., seen: set[int] = set()) so the runtime constructs a set
correctly and the static type remains accurate; no other logic changes required.
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pipeline failure: Fix copyright header format.
The copyright header format is not matching the expected pattern, causing the pipeline to fail.
The existing headers appear correct for the vLLM project. This might be a mismatch with the Dynamo project requirements. Verify the expected header format with the team or check other accepted files in components/src/dynamo/vllm/ for the correct pattern.
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: invalid/missing header: components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py
🤖 Prompt for AI Agents
components/src/dynamo/vllm/checkpoint/zmq_async_llm_server.py lines 1-2: the
file header SPDX/license lines do not match the Dynamo project expected header
pattern and cause CI failures; update the top-of-file header to the repository's
canonical header (match other accepted files under
components/src/dynamo/vllm/)—replace the current two SPDX lines with the exact
header format used across the Dynamo codebase (including the correct
SPDX-License-Identifier line and the correct SPDX-FileCopyrightText
content/owner string), commit the change, and run the pipeline to confirm the
header now passes validation.
| RUN add-apt-repository ppa:criu/ppa && \ | ||
| apt-get update && \ | ||
| apt-get install -y --no-install-recommends criu && \ | ||
| rm -rf /var/lib/apt/lists/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the CRIU PPA step non-interactive
add-apt-repository prompts for confirmation unless -y/--yes is passed, so the Docker build hangs and then fails (see the pre-merge CI failure). Please add the non-interactive flag so the stage can run unattended.
Apply this diff:
-RUN add-apt-repository ppa:criu/ppa && \
+RUN add-apt-repository -y ppa:criu/ppa && \
apt-get update && \
apt-get install -y --no-install-recommends criu && \
rm -rf /var/lib/apt/lists/*📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| RUN add-apt-repository ppa:criu/ppa && \ | |
| apt-get update && \ | |
| apt-get install -y --no-install-recommends criu && \ | |
| rm -rf /var/lib/apt/lists/* | |
| RUN add-apt-repository -y ppa:criu/ppa && \ | |
| apt-get update && \ | |
| apt-get install -y --no-install-recommends criu && \ | |
| rm -rf /var/lib/apt/lists/* |
🤖 Prompt for AI Agents
In container/Dockerfile.vllm-checkpoint around lines 211 to 214, the
add-apt-repository call is interactive and blocks the Docker build; update the
command to run non-interactively by adding the -y (or --yes) flag to
add-apt-repository (e.g., add-apt-repository -y ppa:criu/ppa) so the step
proceeds unattended, leaving the apt-get update, apt-get install -y
--no-install-recommends criu, and cleanup of /var/lib/apt/lists/* unchanged.
Overview:
Co-written by @galletas1712
This feature enables fast startup of vLLM inference engines by checkpointing GPU-initialized processes and restoring them on-demand. This dramatically reduces the time to initialize vLLM workers, especially beneficial in autoscaling scenarios.
Important Caveats
Details:
The checkpoint/restore functionality uses
The implementation uses
CheckpointableAsyncLLM, a wrapper around vLLM'sAsyncLLMthat:Three Initialization Paths
1. Checkpointing Disabled (Standard Path)
AsyncLLMinstance directly2. Checkpoint Exists (Fast Restore Path)
CheckpointableAsyncLLMwithauto_start=False3. Checkpoint Doesn't Exist (Initial Checkpoint Creation)
CheckpointableAsyncLLMwithauto_start=TrueWhere should the reviewer start?
CheckpointableAsyncLLM
components/src/dynamo/vllm/checkpoint/checkpointable_async_llm.pyvLLM Engine Management
components/src/dynamo/vllm/engine.pyDockerfile.vllm-checkpoint
Config Hash-Based Isolation
VllmConfig.compute_hash()[:8]for subdirectories:checkpoint_dir/{hash}/GPU Memory Checkpointing (
checkpoint/utils.py)ZMQ Communication
ZMQAsyncLLMServer, parent sends requestsMetadata (
checkpoint/metadata.py)vllm_checkpoint_metadata.jsonwith TTY info, PIDs, ZMQ portRace Condition Handling
os.rename()ensures only one checkpoint survivesSummary by CodeRabbit
New Features
ENABLE_CHECKPOINTINGandCHECKPOINT_DIRenvironment variablesChores