
[BUG]: SGL non-leader workers in multinode setups don't expose metrics via dynamo collector endpoint #4236

@ishandhanani


Describe the Bug

[generated with ai based on slack convo + debugging session]


🧭 Summary

When running multi-node setups with dynamo.sglang, only the leader (rank 0) node exposes both Dynamo + SGLang Prometheus metrics.
Non-leader nodes expose only Dynamo metrics on their DYN_SYSTEM_PORT, and their SGLang metrics are only available through the dummy server (default port 30000).

This creates an inconsistent UX compared to sgl.launch_server, where every node serves merged metrics from a single /metrics endpoint.


⚙️ Environment

| Component | Version / Config |
|---|---|
| Launcher | python -m dynamo.sglang |
| Cluster | SLURM (non-K8s) |
| Setup | nnodes=2, tp=8, dp=1 |
| Env Vars | DYN_SYSTEM_ENABLED=true, DYN_SYSTEM_PORT=8081, SGLANG_BLOCK_NONZERO_RANK_CHILDREN unset |
| Args | --enable-metrics |

🔁 Repro Steps

# Node 0
python3 -m dynamo.sglang \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --dist-init-addr $ADDR \
  --nnodes 2 \
  --node-rank 0 \
  --enable-metrics &

# Node 1
python3 -m dynamo.sglang \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --dist-init-addr $ADDR \
  --nnodes 2 \
  --node-rank 1 \
  --enable-metrics &

Poll metrics:

curl http://$NODE0_IP:8081/metrics   # ✅ Dynamo + SGLang metrics
curl http://$NODE1_IP:8081/metrics   # ❌ Dynamo only
curl http://$NODE1_IP:30000/metrics  # ✅ SGLang dummy metrics
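
The manual check above can also be scripted. The following is a minimal sketch, not part of Dynamo or SGLang: `classify_metrics` and `fetch_metrics` are hypothetical helpers, and the `sglang:` and `dynamo_` metric-name prefixes are assumptions about how the two metric families are named, so adjust them to whatever your endpoints actually emit.

```python
import urllib.request


def classify_metrics(text, sglang_prefix="sglang:", dynamo_prefix="dynamo_"):
    """Report which metric families appear in a Prometheus text payload.

    Both prefixes are assumptions, not something the issue confirms.
    """
    has_sglang = has_dynamo = False
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(sglang_prefix):
            has_sglang = True
        elif line.startswith(dynamo_prefix):
            has_dynamo = True
    return {"dynamo": has_dynamo, "sglang": has_sglang}


def fetch_metrics(host, port, timeout=5.0):
    """Fetch the raw /metrics payload from one node."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Under the behavior described in this issue, `classify_metrics(fetch_metrics(node1_ip, 8081))` would be expected to report Dynamo metrics but no SGLang metrics on a non-leader node, where `node1_ip` is whatever address you use for `$NODE1_IP`.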

🔍 Root Cause

When using the sgl.Engine API (which dynamo.sglang relies on), non-zero-rank nodes never start an HTTP server due to this logic in engine.py:

if os.getenv("SGLANG_BLOCK_NONZERO_RANK_CHILDREN") == "0":
    # When using Engine as a Python API, we don't want to block here.
    return None, None, None, port_args
launch_dummy_health_check_server(...)

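Note that os.getenv() returns None when the variable is unset, so this early return is taken only when SGLANG_BLOCK_NONZERO_RANK_CHILDREN is explicitly set to "0". On that path launch_dummy_health_check_server() is never reached, and nodes ≥ 1 start no dummy health/metrics server.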
As a result:

| Node | Behavior |
|---|---|
| Rank 0 | Full Dynamo + SGLang metrics exposed |
| Rank ≥ 1 | Dynamo metrics only; SGLang metrics accessible only from dummy port 30000 |
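
For reference, the environment check quoted from engine.py can be exercised in isolation. This is only a sketch of the comparison itself: `takes_early_return` is a hypothetical helper, not SGLang code, and it accepts a plain dict so the process environment is left untouched.

```python
import os

ENV_VAR = "SGLANG_BLOCK_NONZERO_RANK_CHILDREN"


def takes_early_return(env=None):
    """Mirror the quoted condition: True only when the variable equals "0"."""
    env = os.environ if env is None else env
    # os.getenv(name) is equivalent to os.environ.get(name): None when unset.
    return env.get(ENV_VAR) == "0"


for label, env in [("unset", {}), ('"0"', {ENV_VAR: "0"}), ('"1"', {ENV_VAR: "1"})]:
    print(f"{label}: early return = {takes_early_return(env)}")
# unset: early return = False
# "0": early return = True
# "1": early return = False
```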

✅ Partial Fix (Upstream SGLang)

A fix for the missing dummy metrics server was introduced in
[sgl-project/sglang#12297](https://github.com/sgl-project/sglang/pull/12297)

After this PR:

  • curl http://$NODE1_IP:30000/metrics → ✅ SGLang metrics available
  • Still, Dynamo’s /metrics endpoint doesn’t merge these SGLang metrics

🧩 Remaining Issue (Dynamo Side)

Even with the SGLang fix, Dynamo’s metrics publisher:

  • exposes its own Prometheus registry on DYN_SYSTEM_PORT
  • does not import/merge the SGLang metrics registry from local ranks

Hence, multi-node runs still require hitting two endpoints per node.
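
One way to picture the missing merge step is at the level of the scraped text. The sketch below assumes both sources emit the Prometheus text exposition format and that their metric names do not collide; `merge_expositions` is a hypothetical helper, not an existing Dynamo or SGLang API, and a real fix might instead merge registries in-process.

```python
def merge_expositions(*payloads):
    """Concatenate Prometheus text payloads into a single payload.

    Repeated "# HELP" / "# TYPE" lines for the same metric name are dropped.
    Sample lines are passed through unchanged, so this is only safe when the
    sources do not share metric names (an assumption of this sketch).
    """
    seen_meta = set()
    merged = []
    for payload in payloads:
        for line in payload.splitlines():
            line = line.rstrip()
            if not line:
                continue
            if line.startswith("# HELP ") or line.startswith("# TYPE "):
                # Key on ("#", "HELP" or "TYPE", metric name).
                key = tuple(line.split(" ", 3)[:3])
                if key in seen_meta:
                    continue
                seen_meta.add(key)
            merged.append(line)
    return "\n".join(merged) + "\n"
```

With something like this, the payload served on DYN_SYSTEM_PORT could be the merge of Dynamo's own output and the text scraped from the local dummy server, so a collector would need only one endpoint per node.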


💡 Desired Behavior

| Node | Expected /metrics output |
|---|---|
| Rank 0 | Dynamo + SGLang metrics |
| Rank ≥ 1 | Dynamo + SGLang metrics (same port) |

In other words, each node should expose a unified metrics endpoint on its DYN_SYSTEM_PORT, consistent with sgl.launch_server.


🧪 Proposed Fix

  1. Merge registries
    Update Dynamo’s metrics publisher to collect from both:
    • Its own Prometheus registry
    • SGLang’s dummy metrics registry (via PROMETHEUS_MULTIPROC_DIR or direct merge)

  2. Set a sane default for child ranks
    When server_args.node_rank != 0, internally set SGLANG_BLOCK_NONZERO_RANK_CHILDREN=1 so that non-leader ranks can start the dummy metrics publisher without blocking generation endpoints.

  3. Integration test
    Add a regression test (e.g. tests/integration/metrics_multi_node.py) verifying:

    curl $NODE0:8081/metrics  # Dynamo + SGLang
    curl $NODE1:8081/metrics  # Dynamo + SGLang
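
Step 2 of the list above could look roughly like the following. This is a sketch under assumptions: `apply_child_rank_default` is a hypothetical helper, `node_rank` stands in for `server_args.node_rank`, and where dynamo.sglang would call it is not specified here beyond the fact that it would have to run before the engine is constructed.

```python
import os

ENV_VAR = "SGLANG_BLOCK_NONZERO_RANK_CHILDREN"


def apply_child_rank_default(node_rank, env=None):
    """Default the variable to "1" on non-leader ranks; return its final value."""
    env = os.environ if env is None else env
    if node_rank != 0:
        # setdefault keeps any value the operator already exported.
        env.setdefault(ENV_VAR, "1")
    return env.get(ENV_VAR)
```

Using `setdefault` rather than a plain assignment is a deliberate choice in this sketch: it keeps an explicit operator setting (including "0") intact and only fills in the default when nothing was set.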

🧠 Additional Context

This surfaced during SA benchmarking (SLURM environment, TP16 across two GB200 nodes).
The gap was identified after debugging differences between sgl.Engine and launch_server code paths and confirming that non-zero ranks never reach launch_dummy_health_check_server() initialization.

