
[BUG]: AWS EFA Performance #4185

@chandlj


Describe the Bug

I'm attempting to run disaggregated serving on AWS EKS with EFA-enabled instances. The two nodes are p5.48xlarge instances with EFA enabled, and I verified the fabric using this AWS NCCL test, which showed ~400 GB/s of bandwidth. Currently, I have one decode worker and one prefill worker running an FP8 Llama 3.3 70B model. To test KV transfer, I'm running one concurrent request at a time with guidellm and observing the prefill and decode worker performance.

The ITL is as expected (<20 ms), but the TTFT is far higher than it should be. Running this same configuration on other clouds with RDMA enabled, our TTFT was ~600-800 ms, whereas on EKS with EFA-enabled instances we are seeing >2.5 s.

My guess, based on this thread, is that NIXL needs to be enabled with LIBFABRIC backend support, but I can't set that flag to test whether it fixes the issue. See #4186
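
For reference, this is the exact --kv-transfer-config argument I'd like to pass to dynamo.vllm on both workers (it is shown commented out in the worker manifests below). The backend names are an assumption on my part, taken from the linked thread, and are what I want to test:

            - "--kv-transfer-config"
            - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"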

Steps to Reproduce

  1. Set up Dynamo Cloud with the k8s quickstart guide: https://docs.nvidia.com/dynamo/latest/kubernetes/README.html
  2. Deploy the following disaggregated setup
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-disagg
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
    VllmDecodeWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
        requests:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/Llama-3.3-70B-Instruct-FP8
            - --no-enable-prefix-caching
            - --tensor-parallel-size
            - "2"
            - --quantization
            - modelopt
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "128000"
            # - "--kv-transfer-config"
            # - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"
          volumeMounts:
            - name: model-volume
              mountPath: /root/.cache/huggingface
        volumes:
          - name: model-volume
            persistentVolumeClaim:
              claimName: vllm-model-claim
    VllmPrefillWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        limits:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
        requests:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/Llama-3.3-70B-Instruct-FP8
            - --no-enable-prefix-caching
            - --tensor-parallel-size
            - "2"
            - --quantization
            - modelopt
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "128000"
            # - "--kv-transfer-config"
            # - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"
            - --is-prefill-worker
          volumeMounts:
            - name: model-volume
              mountPath: /root/.cache/huggingface
        volumes:
          - name: model-volume
            persistentVolumeClaim:
              claimName: vllm-model-claim
  3. Run a guidellm benchmark with the following Job
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-pod
spec:
  template:
    spec:
      containers:
        - name: benchmark-container
          image: ghcr.io/vllm-project/guidellm:latest
          env:
            - name: GUIDELLM_TARGET
              value: http://vllm-disagg-frontend.dynamo-system.svc.cluster.local:8000
            - name: GUIDELLM_RATE_TYPE
              value: concurrent
            - name: GUIDELLM_RATE
              value: "1"
            - name: GUIDELLM_MAX_SECONDS
              value: "120"
            - name: GUIDELLM_DATA
              value: "prompt_tokens=7000,output_tokens=128"
      restartPolicy: Never
  backoffLimit: 0

Expected Behavior

Mean/P90 TTFT should be under 1 second, in line with the ~600-800 ms we see on other RDMA-enabled clouds.

Actual Behavior

Mean TTFT is ~2.65 seconds:

Benchmark: concurrent@1 · Requests/sec (mean): 0.21 · Concurrency: 1.00 · Output tok/sec (mean): 26.6 · Total tok/sec (mean): 1481.5

| Metric      | Mean   | Median | P99    |
|-------------|--------|--------|--------|
| Latency (s) | 4.81   | 4.81   | 4.84   |
| TTFT (ms)   | 2651.2 | 2647.8 | 2678.6 |
| ITL (ms)    | 17.0   | 17.0   | 17.1   |
| TPOT (ms)   | 16.9   | 16.8   | 16.9   |
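
Back-of-envelope, the transfer volume itself shouldn't explain this. Assuming I have the model geometry right (80 layers, 8 KV heads, head dim 128, FP8 KV cache = 1 byte per element), the KV cache for a 7000-token prompt is about 2 × 80 × 8 × 128 B ≈ 160 KB per token, or roughly 1.1 GB total. At anything close to the ~400 GB/s the NCCL test measured, that transfer should take single-digit milliseconds, so the ~1.8 s of extra TTFT over our 600-800 ms baseline looks like the KV transfer falling back to a much slower path rather than a bandwidth limit.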

Environment

AWS EKS, Kubernetes v1.34
Two p5.48xlarge instances with EFA enabled
Dynamo vLLM runtime v0.6.1

Additional Context

My cluster setup closely followed this doc: https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/EKS/Create_EKS_EFS.md

Screenshots

No response
