Describe the Bug
I'm attempting to run disaggregated serving on AWS EKS with EFA-enabled instances. The two nodes are p5.48xlarge instances with EFA enabled, and I verified the fabric using this AWS NCCL test, which showed 400 GB/s of bandwidth. Currently, I have one decode worker and one prefill worker running an FP8 Llama 3.3 70B model. To test KV transfer, I'm running one concurrent request at a time with Guidellm and observing the prefill and decode worker performance.
The ITL is as expected (<20 ms), but the TTFT is far higher than it should be. Running this same configuration on other clouds with RDMA enabled, our TTFT was ~600-800 ms, whereas on EKS with EFA-enabled instances we are getting >2.5 s.
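For scale, a back-of-envelope estimate (my arithmetic, not measured): Llama 3.3 70B has 80 layers with 8 KV heads of head dim 128, so the FP8 KV cache is about 80 × 2 × 8 × 128 ≈ 160 KiB per token, or roughly 1.1 GiB for a 7,000-token prefill. If the extra ~1.8 s of TTFT versus the RDMA baseline were all KV transfer, that would be well under 1 GB/s effective, far below what the fabric showed in the NCCL test.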
Based on this thread, my guess is that NIXL needs to be enabled with LIBFABRIC backend support, but I can't set that flag to test whether it fixes the issue. See #4186
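For triage, a sketch of how one could check whether a libfabric-capable NIXL plugin even exists in the runtime image (the plugin file naming and paths below are my assumptions, not confirmed against the NIXL source):

```bash
# <decode-worker-pod> is the decode worker pod name; adjust the namespace as needed.
# The libplugin_*.so pattern is a guess at how NIXL names its backend plugins.
kubectl -n dynamo-system exec <decode-worker-pod> -- sh -c \
  'find / -name "libplugin_*.so" 2>/dev/null; ldconfig -p | grep -iE "fabric|ucx"'
```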
Steps to Reproduce
- Set up Dynamo Cloud with the Kubernetes quickstart guide: https://docs.nvidia.com/dynamo/latest/kubernetes/README.html
- Deploy the following disaggregated setup:
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-disagg
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
    VllmDecodeWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
        requests:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/Llama-3.3-70B-Instruct-FP8
            - --no-enable-prefix-caching
            - --tensor-parallel-size
            - "2"
            - --quantization
            - modelopt
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "128000"
            # - "--kv-transfer-config"
            # - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"
          volumeMounts:
            - name: model-volume
              mountPath: /root/.cache/huggingface
        volumes:
          - name: model-volume
            persistentVolumeClaim:
              claimName: vllm-model-claim
    VllmPrefillWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        limits:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
        requests:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/Llama-3.3-70B-Instruct-FP8
            - --no-enable-prefix-caching
            - --tensor-parallel-size
            - "2"
            - --quantization
            - modelopt
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "128000"
            # - "--kv-transfer-config"
            # - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"
            - --is-prefill-worker
          volumeMounts:
            - name: model-volume
              mountPath: /root/.cache/huggingface
        volumes:
          - name: model-volume
            persistentVolumeClaim:
              claimName: vllm-model-claim
```
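The deployment is applied with a standard kubectl apply (the dynamo-system namespace here is an assumption; it just needs to be wherever the Dynamo operator is installed, and it matches the frontend service URL used in the benchmark below):

```bash
# Assumes the manifest above is saved as vllm-disagg.yaml.
kubectl apply -n dynamo-system -f vllm-disagg.yaml
```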
- Run a guidellm benchmark with the following Job:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-pod
spec:
  template:
    spec:
      containers:
        - name: benchmark-container
          image: ghcr.io/vllm-project/guidellm:latest
          env:
            - name: GUIDELLM_TARGET
              value: http://vllm-disagg-frontend.dynamo-system.svc.cluster.local:8000
            - name: GUIDELLM_RATE_TYPE
              value: concurrent
            - name: GUIDELLM_RATE
              value: "1"
            - name: GUIDELLM_MAX_SECONDS
              value: "120"
            - name: GUIDELLM_DATA
              value: "prompt_tokens=7000,output_tokens=128"
      restartPolicy: Never
  backoffLimit: 0
```
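For reference, the Job should be roughly equivalent to running guidellm directly with flags matching the environment variables above (flag names taken from the guidellm docs; worth double-checking against the pinned image version):

```bash
guidellm benchmark \
  --target "http://vllm-disagg-frontend.dynamo-system.svc.cluster.local:8000" \
  --rate-type concurrent \
  --rate 1 \
  --max-seconds 120 \
  --data "prompt_tokens=7000,output_tokens=128"
```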
Expected Behavior
Mean/P90 TTFT should be less than 1 second.
Actual Behavior
Mean TTFT is ~2.65 seconds:
| Benchmark | Req/Second | Concurrency | Out Tok/sec (mean) | Tot Tok/sec (mean) | Latency (mean, s) | Latency (median, s) | Latency (p99, s) | TTFT (mean, ms) | TTFT (median, ms) | TTFT (p99, ms) | ITL (mean, ms) | ITL (median, ms) | ITL (p99, ms) | TPOT (mean, ms) | TPOT (median, ms) | TPOT (p99, ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| concurrent@1 | 0.21 | 1.00 | 26.6 | 1481.5 | 4.81 | 4.81 | 4.84 | 2651.2 | 2647.8 | 2678.6 | 17.0 | 17.0 | 17.1 | 16.9 | 16.8 | 16.9 |
Environment
- AWS EKS, Kubernetes v1.34
- Two p5.48xlarge instances with EFA enabled
- Dynamo vLLM Runtime v0.6.1
Additional Context
My initialization of the cluster was very similar to this doc: https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/EKS/Create_EKS_EFS.md
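If it helps anyone reproducing this, the EFA interfaces can be confirmed from inside a worker pod with libfabric's fi_info utility (assuming the binary is present in the vllm-runtime image):

```bash
# Lists the fabric interfaces exposed by the EFA provider inside the container.
kubectl -n dynamo-system exec <decode-worker-pod> -- fi_info -p efa
```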
Screenshots
No response