Description
Describe the Bug
[generated with ai based on slack convo + debugging session]
🧭 Summary
When running multi-node setups with dynamo.sglang, only the leader (rank 0) node exposes both Dynamo + SGLang Prometheus metrics.
Non-leader nodes expose only Dynamo metrics on their DYN_SYSTEM_PORT, and their SGLang metrics are only available through the dummy server (default port 30000).
This creates an inconsistent UX compared to sgl.launch_server, where every node serves merged metrics from a single /metrics endpoint.
⚙️ Environment
| Component | Version / Config |
|---|---|
| Launcher | python -m dynamo.sglang |
| Cluster | SLURM (non-K8s) |
| Setup | nnodes=2, tp=8, dp=1 |
| Env Vars | DYN_SYSTEM_ENABLED=true, DYN_SYSTEM_PORT=8081, SGLANG_BLOCK_NONZERO_RANK_CHILDREN unset |
| Args | --enable-metrics |
🔁 Repro Steps
# Node 0
python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr $ADDR \
--nnodes 2 \
--node-rank 0 \
--enable-metrics &
# Node 1
python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr $ADDR \
--nnodes 2 \
--node-rank 1 \
--enable-metrics &
Poll metrics:
curl http://$NODE0_IP:8081/metrics # ✅ Dynamo + SGLang metrics
curl http://$NODE1_IP:8081/metrics # ❌ Dynamo only
curl http://$NODE1_IP:30000/metrics # ✅ SGLang dummy metrics
🔍 Root Cause
When using the sgl.Engine API (which dynamo.sglang relies on), non-zero-rank nodes never start an HTTP server due to this logic in engine.py:
if os.getenv("SGLANG_BLOCK_NONZERO_RANK_CHILDREN") == "0":
    # When using Engine as a Python API, we don't want to block here.
    return None, None, None, port_args
launch_dummy_health_check_server(...)
Since SGLANG_BLOCK_NONZERO_RANK_CHILDREN is unset by default, the check effectively disables the dummy health/metrics server for nodes ≥ 1.
As a result:
| Node | Behavior |
|---|---|
| Rank 0 | Full Dynamo + SGLang metrics exposed |
| Rank ≥ 1 | Dynamo metrics only — SGLang metrics accessible only from dummy port 30000 |
✅ Partial Fix (Upstream SGLang)
A fix for the missing dummy metrics server was introduced in
[sgl-project/sglang#12297](https://github.com/sgl-project/sglang/pull/12297)
After this PR:
- curl http://$NODE1_IP:30000/metrics → ✅ SGLang metrics available
- Still, Dynamo’s /metrics endpoint doesn’t merge these SGLang metrics
🧩 Remaining Issue (Dynamo Side)
Even with the SGLang fix, Dynamo’s metrics publisher:
- exposes its own Prometheus registry on DYN_SYSTEM_PORT
- does not import/merge the SGLang metrics registry from local ranks
Hence, multi-node runs still require hitting two endpoints per node.
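To make the gap concrete, here is a minimal sketch of what a merge could look like at the level of Prometheus exposition text. It is not Dynamo's or SGLang's implementation: `family_names`, `merge_expositions`, and the sample metric names are hypothetical, and the sketch assumes the two sources never define the same metric family (it raises if they do, since a family's lines must stay grouped).

```python
from __future__ import annotations

# Made-up sample payloads; real metric names will differ.
DYNAMO_SAMPLE = (
    "# TYPE dynamo_example_requests_total counter\n"
    "dynamo_example_requests_total 3\n"
)
SGLANG_SAMPLE = (
    "# TYPE sglang:example_running_reqs gauge\n"
    "sglang:example_running_reqs 1\n"
)


def family_names(payload: str) -> set[str]:
    """Return the metric family names declared by '# TYPE' lines."""
    names = set()
    for line in payload.splitlines():
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) >= 4:  # '#', 'TYPE', <name>, <type>
                names.add(parts[2])
    return names


def merge_expositions(first: str, second: str) -> str:
    """Concatenate two payloads, refusing to merge overlapping families."""
    overlap = family_names(first) & family_names(second)
    if overlap:
        raise ValueError(f"metric families defined twice: {sorted(overlap)}")
    chunks = [p.strip("\n") for p in (first, second) if p.strip()]
    return "\n".join(chunks) + "\n"


if __name__ == "__main__":
    print(merge_expositions(DYNAMO_SAMPLE, SGLANG_SAMPLE), end="")
```

A registry-level merge inside the publisher would avoid re-parsing text; the text-level version is shown only because it needs nothing beyond the standard library.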
💡 Desired Behavior
| Node | Expected /metrics output |
|---|---|
| Rank 0 | Dynamo + SGLang metrics |
| Rank ≥ 1 | Dynamo + SGLang metrics (same port) |
In other words, each node should expose a unified metrics endpoint on its DYN_SYSTEM_PORT, consistent with sgl.launch_server.
🧪 Proposed Fix
- Merge registries
  Update Dynamo’s metrics publisher to collect from both:
  - Its own Prometheus registry
  - SGLang’s dummy metrics registry (via PROMETHEUS_MULTIPROC_DIR or direct merge)
- Set sane default for child ranks
  When server_args.node_rank != 0, internally set SGLANG_BLOCK_NONZERO_RANK_CHILDREN=1 so that non-leader ranks can start the dummy metrics publisher without blocking generation endpoints.
- Integration test
  Add a regression test (e.g. tests/integration/metrics_multi_node.py) verifying:
  curl $NODE0:8081/metrics # Dynamo + SGLang
  curl $NODE1:8081/metrics # Dynamo + SGLang
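The check behind the proposed regression test could be sketched roughly as follows. The helper names (`sample_names`, `missing_sources`, `fetch_metrics`) are hypothetical, and the prefixes `dynamo_` and `sglang:` are assumptions about how each source names its metrics; adjust them to whatever the endpoints actually emit.

```python
import urllib.request

# Assumed name prefixes for each metrics source (not confirmed by this issue).
REQUIRED_PREFIXES = {"dynamo": "dynamo_", "sglang": "sglang:"}


def sample_names(payload):
    """Collect metric names from sample lines, skipping comments and blanks."""
    names = set()
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


def missing_sources(payload, prefixes=REQUIRED_PREFIXES):
    """Return the sources that have no sample in the payload, sorted by name."""
    names = sample_names(payload)
    return sorted(
        source
        for source, prefix in prefixes.items()
        if not any(name.startswith(prefix) for name in names)
    )


def fetch_metrics(url, timeout=5.0):
    """Fetch a /metrics payload as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

In a test, each node's endpoint on DYN_SYSTEM_PORT (8081 in the setup above) would be fetched with `fetch_metrics` and the test would assert that `missing_sources` returns an empty list for every node.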
🧠 Additional Context
This surfaced during SA benchmarking (SLURM environment, TP16 across two GB200 nodes).
The gap was identified after debugging differences between sgl.Engine and launch_server code paths and confirming that non-zero ranks never reach launch_dummy_health_check_server() initialization.