Describe the Bug
I'm attempting to run disaggregated serving on AWS EKS with EFA-enabled instances. The two nodes are p5.48xlarge instances with EFA enabled, and I verified the fabric using this AWS NCCL test, which showed 400 GB/s of bandwidth. Currently, I have one decode worker and one prefill worker running an FP8 Llama 3.3 70B model. To test KV transfer, I'm running one concurrent request at a time with Guidellm and observing the prefill and decode worker performance.
The ITL is as expected (<20 ms), but the TTFT is far higher than it should be. Running this same configuration on other clouds with RDMA enabled, our TTFT was ~600-800 ms, whereas on EKS with EFA-enabled instances we are getting >2.5 s.
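For scale, a back-of-envelope estimate (my arithmetic, not measured): Llama 3.3 70B has 80 layers with 8 KV heads of head dim 128, so the FP8 KV cache is about 80 × 2 × 8 × 128 ≈ 160 KiB per token, or roughly 1.1 GiB for a 7,000-token prefill. If the extra ~1.8 s of TTFT versus the RDMA baseline were all KV transfer, that would be well under 1 GB/s effective, far below what the fabric showed in the NCCL test.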
Based on this thread, my guess is that NIXL needs to be enabled with LIBFABRIC backend support, but I can't set that flag to test whether it fixes the issue. See #4186
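For triage, a sketch of how one could check whether a libfabric-capable NIXL plugin even exists in the runtime image (the plugin file naming and paths below are my assumptions, not confirmed against the NIXL source):

```bash
# <decode-worker-pod> is the decode worker pod name; adjust the namespace as needed.
# The libplugin_*.so pattern is a guess at how NIXL names its backend plugins.
kubectl -n dynamo-system exec <decode-worker-pod> -- sh -c \
  'find / -name "libplugin_*.so" 2>/dev/null; ldconfig -p | grep -iE "fabric|ucx"'
```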
Steps to Reproduce
- Set up Dynamo Cloud with the Kubernetes quickstart guide: https://docs.nvidia.com/dynamo/latest/kubernetes/README.html
- Deploy the following disaggregated setup:
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-disagg
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
    VllmDecodeWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
        requests:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/Llama-3.3-70B-Instruct-FP8
            - --no-enable-prefix-caching
            - --tensor-parallel-size
            - "2"
            - --quantization
            - modelopt
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "128000"
            # - "--kv-transfer-config"
            # - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"
          volumeMounts:
            - name: model-volume
              mountPath: /root/.cache/huggingface
        volumes:
          - name: model-volume
            persistentVolumeClaim:
              claimName: vllm-model-claim
    VllmPrefillWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        limits:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
        requests:
          gpu: "2"
          custom: { "vpc.amazonaws.com/efa": "32" }
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - nvidia/Llama-3.3-70B-Instruct-FP8
            - --no-enable-prefix-caching
            - --tensor-parallel-size
            - "2"
            - --quantization
            - modelopt
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "128000"
            # - "--kv-transfer-config"
            # - "{\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\", \"kv_connector_extra_config\": {\"backends\": [\"UCX\", \"LIBFABRIC\"]}}"
            - --is-prefill-worker
          volumeMounts:
            - name: model-volume
              mountPath: /root/.cache/huggingface
        volumes:
          - name: model-volume
            persistentVolumeClaim:
              claimName: vllm-model-claim
```
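The deployment is applied with a standard kubectl apply (the dynamo-system namespace here is an assumption; it just needs to be wherever the Dynamo operator is installed, and it matches the frontend service URL used in the benchmark below):

```bash
# Assumes the manifest above is saved as vllm-disagg.yaml.
kubectl apply -n dynamo-system -f vllm-disagg.yaml
```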
- Run a guidellm benchmark with the following Job:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-pod
spec:
  template:
    spec:
      containers:
        - name: benchmark-container
          image: ghcr.io/vllm-project/guidellm:latest
          env:
            - name: GUIDELLM_TARGET
              value: http://vllm-disagg-frontend.dynamo-system.svc.cluster.local:8000
            - name: GUIDELLM_RATE_TYPE
              value: concurrent
            - name: GUIDELLM_RATE
              value: "1"
            - name: GUIDELLM_MAX_SECONDS
              value: "120"
            - name: GUIDELLM_DATA
              value: "prompt_tokens=7000,output_tokens=128"
      restartPolicy: Never
  backoffLimit: 0
```
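For reference, the Job should be roughly equivalent to running guidellm directly with flags matching the environment variables above (flag names taken from the guidellm docs; worth double-checking against the pinned image version):

```bash
guidellm benchmark \
  --target "http://vllm-disagg-frontend.dynamo-system.svc.cluster.local:8000" \
  --rate-type concurrent \
  --rate 1 \
  --max-seconds 120 \
  --data "prompt_tokens=7000,output_tokens=128"
```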
Expected Behavior
Mean/P90 TTFT should be less than 1 second.
Actual Behavior
Mean TTFT is ~2.65 seconds:
| Benchmark | Req/Second | Concurrency | Out Tok/sec (mean) | Tot Tok/sec (mean) | Latency (mean, s) | Latency (median, s) | Latency (p99, s) | TTFT (mean, ms) | TTFT (median, ms) | TTFT (p99, ms) | ITL (mean, ms) | ITL (median, ms) | ITL (p99, ms) | TPOT (mean, ms) | TPOT (median, ms) | TPOT (p99, ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| concurrent@1 | 0.21 | 1.00 | 26.6 | 1481.5 | 4.81 | 4.81 | 4.84 | 2651.2 | 2647.8 | 2678.6 | 17.0 | 17.0 | 17.1 | 16.9 | 16.8 | 16.9 |
Environment
- AWS EKS, Kubernetes v1.34
- Two p5.48xlarge instances with EFA enabled
- Dynamo vLLM Runtime v0.6.1
Additional Context
My initialization of the cluster was very similar to this doc: https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/EKS/Create_EKS_EFS.md
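If it helps anyone reproducing this, the EFA interfaces can be confirmed from inside a worker pod with libfabric's fi_info utility (assuming the binary is present in the vllm-runtime image):

```bash
# Lists the fabric interfaces exposed by the EFA provider inside the container.
kubectl -n dynamo-system exec <decode-worker-pod> -- fi_info -p efa
```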
Screenshots
No response