# Prefill-decode disaggregation

During LLM inference, computation occurs in two stages: prefill and decode. In the prefill phase, the model processes the entire input prompt to generate the first token — a highly parallel, compute-bound process. The decode phase then predicts one token at a time, reusing the growing KV cache, and is memory-bound.
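
To make the contrast concrete, the toy loop below sketches the two phases in plain NumPy. It is a minimal illustration with made-up dimensions and projection matrices, not how an actual inference engine is implemented: prefill processes every prompt position in one large batched matrix multiplication, while decode produces one token per step and re-reads the entire accumulated KV cache each time while doing relatively little arithmetic.

```python
import numpy as np

D = 64                                   # toy hidden size (illustrative)
rng = np.random.default_rng(0)
Wk, Wv = rng.normal(size=(D, D)), rng.normal(size=(D, D))
prompt = rng.normal(size=(3000, D))      # ~3000 prompt "tokens", as in the scenario below

def attend(q, keys, values):
    """Single-head attention of one query over everything cached so far."""
    scores = keys @ q / np.sqrt(D)       # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values              # (D,)

# Prefill: one big batched matmul over all prompt positions builds the KV cache.
# The whole prompt is processed in parallel, so the GPU stays busy with compute.
kv_cache = {"k": prompt @ Wk, "v": prompt @ Wv}

# Decode: one token per step. Each step reads the entire (growing) KV cache but
# performs little arithmetic per byte read, which is why decode is memory-bound.
token = prompt[-1]
for _ in range(200):                     # ~200 output tokens, as in the scenario below
    token = attend(token, kv_cache["k"], kv_cache["v"])
    kv_cache["k"] = np.vstack([kv_cache["k"], token @ Wk])
    kv_cache["v"] = np.vstack([kv_cache["v"], token @ Wv])
```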

Because these phases have fundamentally different characteristics, prefill-decode (PD) disaggregation executes them on separate GPU resources: prefill runs first on compute-optimized machines, then the KV cache is transferred to memory-optimized machines for decoding. This separation lets each phase use its own optimal parallelization strategy, batch size, and other configuration, and prevents the compute-heavy prefill of one request from interfering with the decode of others.

PD disaggregation can improve key metrics such as time to first token (TTFT) and time per output token (TPOT): since TTFT is determined mainly by prefill and TPOT by decode, optimizing each phase separately improves overall performance. However, it also introduces communication overhead for transferring the KV cache, which can hurt TTFT, so PD disaggregation should be applied judiciously to ensure a net efficiency gain.

## Key features

- The Heimdall scheduler runs prefill-only and decode-only instances separately, allows each to scale independently, and manages request routing between them.
- The framework can automatically decide whether to apply PD disaggregation and how to scale each phase according to defined service level objectives (SLOs).
- Moreh vLLM is optimized to efficiently execute both the prefill and decode phases of various models on AMD MI200 and MI300 series GPUs, applying distinct parallelization and optimization strategies tailored to prefill-only and decode-only instances.

## Example: PD disaggregation on Llama 3.3 70B

### Benchmarking environment and configuration

| Item | Description |
| --- | --- |
| Servers | 4x servers, each equipped with 4x AMD MI250 GPUs |
| Networking | InfiniBand HDR |
| Inference Engine | vLLM (0.10.1rc2.dev59+g0167efe20) |
| Model | meta-llama/Llama-3.3-70B-Instruct |
| Benchmarking tool | genai-bench |
| Benchmarking scenario | Input sequence length ~ N(3000, 300), output sequence length ~ N(200, 20), concurrency = 64 (see the sketch below) |
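
The scenario row can be read as follows: each request's input and output token counts are drawn from normal distributions with the given mean and standard deviation. The snippet below only illustrates those distributions with NumPy; it is not genai-bench's internal sampling code, and the request count of 3,200 simply mirrors the run limit used in the benchmark command later in this document.

```python
import numpy as np

rng = np.random.default_rng(0)

# "N(3000, 300)" and "N(200, 20)": per-request input/output token counts drawn
# from normal distributions (mean, std). 3,200 requests mirrors the run limit
# used in the benchmark command below; the sampling here is only illustrative.
input_lens = rng.normal(3000, 300, size=3200).round().astype(int)
output_lens = rng.normal(200, 20, size=3200).round().astype(int)

print(f"input:  mean={input_lens.mean():.0f}, std={input_lens.std():.0f}")
print(f"output: mean={output_lens.mean():.0f}, std={output_lens.std():.0f}")
```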

### Deployment

The following configuration files show how to set up PD disaggregation on the Heimdall scheduler and the Odin inference service. Each prefill-only and decode-only vLLM instance uses two AMD MI250 GPUs; because each MI250 exposes two GPU dies (GCDs), this corresponds to the `amd.com/gpu: "4"` resource request and `--tensor-parallel-size 4` below. As a result, a total of eight instances can run across the four servers. The ratio between prefill and decode instances is adjusted dynamically by the Heimdall scheduler.

heimdall-values.yaml

```yaml
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    - type: pd-profile-handler
    - type: prefill-filter
    - type: decode-filter
    - type: queue-scorer
    - type: max-score-picker
      parameters:
        maxNumOfEndpoints: 2
  schedulingProfiles:
    - name: prefill
      plugins:
        - pluginRef: prefill-filter
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: max-score-picker
    - name: decode
      plugins:
        - pluginRef: decode-filter
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: max-score-picker
```

inference-service-values.yaml

```yaml
inferenceModel:
  modelName: meta-llama/Llama-3.3-70B-Instruct
  poolRef:
    name: heimdall

extraArgs:
  - "{{ .Values.inferenceModel.modelName }}"
  - --quantization
  - "None"
  - --tensor-parallel-size
  - "4"
  - --max-num-batched-tokens
  - "8192"
  - --no-enable-prefix-caching
  - --kv-transfer-config
  - '{"kv_connector":"NixlConnector", "kv_role":"kv_both"}'
  - --no-enable-log-requests
  - --disable-uvicorn-access-log

commonResources: &commonResources
  limits: &commonLimits
    amd.com/gpu: "4"
    mellanox/hca: "1"
  requests: *commonLimits
```

Run the following commands to deploy the services.

```bash
helm install heimdall moreh/heimdall
helm install inference-service moreh/inference-service
```

### Benchmarking

Use the genai-bench tool as follows to measure performance for the benchmarking scenario described above. Note that the `--api-base` option must be set to your actual endpoint URL.

```bash
genai-bench benchmark \
  --api-backend vLLM \
  --api-key anything \
  --api-base http://heimdall-istio.{NAMESPACE}.svc.cluster.local \
  --api-model-name meta-llama/Llama-3.3-70B-Instruct \
  --model-tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --task text-to-text \
  --max-time-per-run 1000 \
  --max-requests-per-run 3200 \
  --server-engine vLLM \
  --traffic-scenario "N(3000,300)/(200,20)" \
  --num-concurrency 64 \
  --warmup-ratio 0.05 \
  --cooldown-ratio 0.05
```

### Experimental results

We compared the performance of our PD disaggregation setup with that of a baseline configuration using a plain Kubernetes Service, where requests were simply distributed in a round-robin manner across eight vLLM instances without disaggregation. Time per output token (TPOT) was reduced by approximately 28% (133 → 96 ms), and as a result the total benchmark runtime decreased by about 18% (1428.6 → 1165.8 s), even though time to first token (TTFT) increased.

End-to-end latency:

| Router | PD disaggregation | Total duration (s) | Mean (s) | P50 (s) | P90 (s) | P95 (s) | P99 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Heimdall | Applied | 1165.8 | 22.826 | 22.620 | 26.734 | 28.103 | 30.582 |
| K8s Service | Not applied | 1428.6 | 28.107 | 27.858 | 31.723 | 33.231 | 35.352 |

TTFT (time to first token):

| Router | PD disaggregation | Mean (s) | P50 (s) | P90 (s) | P95 (s) | P99 (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Heimdall | Applied | 3.7633 | 3.1132 | 6.9578 | 8.3785 | 10.023 |
| K8s Service | Not applied | 1.6022 | 1.5994 | 1.8094 | 1.9000 | 2.0133 |

TPOT (time per output token):

| Router | PD disaggregation | Mean (ms) | P50 (ms) | P90 (ms) | P95 (ms) | P99 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Heimdall | Applied | 96.029 | 96.166 | 101.29 | 103.34 | 105.94 |
| K8s Service | Not applied | 133.34 | 133.14 | 141.30 | 143.36 | 147.86 |
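
As a quick consistency check, the mean end-to-end latencies above decompose approximately as TTFT + (n − 1) × TPOT with n ≈ 200 output tokens per request, the mean output length of the benchmark scenario:

```python
# Mean end-to-end latency decomposes approximately as TTFT + (n - 1) * TPOT,
# with n ~ 200 output tokens per request (mean output length of the scenario).
n = 200

heimdall = 3.7633 + (n - 1) * 0.096029   # ~22.9 s, measured mean: 22.826 s
k8s_svc  = 1.6022 + (n - 1) * 0.13334    # ~28.1 s, measured mean: 28.107 s

print(f"Heimdall (PD): {heimdall:.1f} s, K8s Service: {k8s_svc:.1f} s")
```

Both reconstructed values land within about 0.1 s of the measured means, which also makes clear why the roughly 37 ms per-token saving outweighs the roughly 2 s TTFT regression at this output length.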

However, the TTFT regression also means that PD disaggregation should be applied carefully, depending on the target SLOs. How this scheduling decision is automated in an SLO-driven manner is described in a separate document.