# Prefill-decode disaggregation
During LLM inference, computation occurs in two stages: prefill and decode. In the prefill phase, the model processes the entire input prompt to generate the first token — a highly parallel, compute-bound process. The decode phase then predicts one token at a time, reusing the growing KV cache, and is memory-bound.
Because these phases have fundamentally different characteristics, prefill-decode (PD) disaggregation executes them on separate GPU resources: prefill runs first on compute-optimized machines, and the KV cache is then transferred to memory-optimized machines for decoding. This separation lets each phase use its own optimal parallelization, batch size, and configuration, and prevents the prefill and decode of concurrent requests from interfering with each other.
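To make the two phases concrete, below is a minimal sketch of autoregressive generation using the Hugging Face `transformers` API. It is an illustration only: a small stand-in model is used instead of Llama 3.3 70B, and greedy decoding is assumed. The prefill pass processes the whole prompt at once and produces the KV cache, which is exactly the state a disaggregated setup would hand off to the decode instance.

```python
# Illustrative sketch of prefill vs. decode (not the Moreh vLLM implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model so the sketch stays runnable
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Prefill-decode disaggregation separates"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Prefill: one compute-bound forward pass over the entire prompt.
    out = model(**inputs, use_cache=True)
    past_key_values = out.past_key_values          # the KV cache
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # In a PD-disaggregated deployment, this KV cache is what would be
    # transferred from the prefill instance to the decode instance.

    generated = [next_token]
    for _ in range(16):
        # Decode: one token per step, memory-bound, reusing the growing KV cache.
        out = model(input_ids=next_token,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```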
PD disaggregation can improve key metrics such as time to first token (TTFT) and time per output token (TPOT): TTFT depends mainly on prefill and TPOT on decode, so optimizing each phase separately improves overall performance. However, transferring the KV cache between instances introduces communication overhead that can hurt TTFT, so PD disaggregation should be applied judiciously to ensure a net efficiency gain.
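For reference, the sketch below shows how these metrics are typically derived from per-token timestamps. This is a generic illustration of the standard definitions, not the exact computation performed by any particular benchmarking tool.

```python
# Generic latency metrics from per-token arrival times (illustrative only).
def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    """token_times[i] is the wall-clock time at which output token i arrived."""
    ttft = token_times[0] - request_sent                  # dominated by prefill
    # TPOT: average gap between consecutive output tokens (dominated by decode).
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    e2e = token_times[-1] - request_sent                  # ~ TTFT + TPOT * (n_tokens - 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}

# Example: 1 s to first token, then ~50 ms per subsequent token.
print(latency_metrics(0.0, [1.0, 1.05, 1.10, 1.15, 1.20]))
```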
## Key features
- The Heimdall scheduler runs prefill-only and decode-only instances separately, scales each pool independently, and routes requests between them.
- The framework can automatically decide whether to apply PD disaggregation and how to scale each phase according to defined service level objectives (SLOs); see the sketch after this list.
- Moreh vLLM is optimized to execute both the prefill and decode phases of various models efficiently on AMD MI200 and MI300 series GPUs, applying distinct parallelization and optimization strategies to prefill-only and decode-only instances.
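The following is a deliberately simplified, hypothetical sketch of such an SLO-driven decision. It is not Heimdall's actual algorithm; it only illustrates how TTFT and TPOT violations map naturally onto the prefill and decode pools.

```python
# Hypothetical illustration of an SLO-driven scaling hint (not Heimdall's algorithm).
from dataclasses import dataclass

@dataclass
class Slo:
    ttft_ms: float   # target time to first token
    tpot_ms: float   # target time per output token

def scaling_hint(measured_ttft_ms: float, measured_tpot_ms: float, slo: Slo) -> str:
    # TTFT is governed mostly by the prefill pool and TPOT by the decode pool,
    # so each SLO violation points at the corresponding pool.
    if measured_ttft_ms > slo.ttft_ms and measured_tpot_ms > slo.tpot_ms:
        return "scale out both prefill and decode instances"
    if measured_ttft_ms > slo.ttft_ms:
        return "scale out prefill instances (or shift the prefill:decode ratio)"
    if measured_tpot_ms > slo.tpot_ms:
        return "scale out decode instances (or shift the prefill:decode ratio)"
    return "SLOs met; consider scaling in"

print(scaling_hint(measured_ttft_ms=450, measured_tpot_ms=96,
                   slo=Slo(ttft_ms=400, tpot_ms=100)))
```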
## Example: PD disaggregation on Llama 3.3 70B
### Benchmarking environment and configuration
### Deployment
The following configuration files show how to set up PD disaggregation with the Heimdall scheduler and the Odin inference service. Prefill-only and decode-only vLLM instances each use two AMD MI250 GPUs, so a total of eight instances can run across four servers. The ratio between prefill and decode instances is adjusted dynamically by the Heimdall scheduler.

In the `inference-service-values.yaml` file, `amd.com/gpu` is set to 4 because each MI250 GPU is exposed as two logical devices at the device driver level; four logical devices therefore correspond to two physical GPUs. This behavior is specific to the MI250.
Heimdall scheduler configuration:

```yaml
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    - type: pd-profile-handler
    - type: prefill-filter
    - type: decode-filter
    - type: queue-scorer
    - type: max-score-picker
      parameters:
        maxNumOfEndpoints: 2
  schedulingProfiles:
    - name: prefill
      plugins:
        - pluginRef: prefill-filter
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: max-score-picker
    - name: decode
      plugins:
        - pluginRef: decode-filter
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: max-score-picker
```
Inference service configuration (`inference-service-values.yaml`):

```yaml
inferenceModel:
  modelName: meta-llama/Llama-3.3-70B-Instruct
  poolRef:
    name: heimdall

extraArgs:
  - "{{ .Values.inferenceModel.modelName }}"
  - --quantization
  - "None"
  - --tensor-parallel-size
  - "4"
  - --max-num-batched-tokens
  - "8192"
  - --no-enable-prefix-caching
  - --kv-transfer-config
  - '{"kv_connector":"NixlConnector", "kv_role":"kv_both"}'
  - --no-enable-log-requests
  - --disable-uvicorn-access-log

commonResources: &commonResources
  limits: &commonLimits
    amd.com/gpu: "4"
    mellanox/hca: "1"
  requests: *commonLimits
```
Run the following commands to deploy the services.

```bash
helm install heimdall moreh/heimdall
helm install inference-service moreh/inference-service
```
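Before benchmarking, you may want to send a quick test request through the deployed endpoint. The sketch below assumes the OpenAI-compatible API that vLLM exposes and reuses the in-cluster URL from the benchmark command in the next section; the `/v1` path and the `openai` client usage are assumptions, and `{NAMESPACE}` must be replaced with your actual namespace.

```python
# Minimal smoke test against the deployed endpoint (assumes vLLM's
# OpenAI-compatible API; replace {NAMESPACE} with your namespace).
from openai import OpenAI

client = OpenAI(
    base_url="http://heimdall-istio.{NAMESPACE}.svc.cluster.local/v1",
    api_key="anything",  # the endpoint does not require a real key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Briefly explain prefill-decode disaggregation."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```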
### Benchmarking
Use the genai-bench tool as follows to measure performance for the benchmarking scenario described above. The traffic scenario `N(3000,300)/(200,20)` draws the number of input and output tokens per request from normal distributions with means of 3000 and 200 tokens (standard deviations of 300 and 20), respectively. Note that the `--api-base` option must be set to your actual endpoint URL.
```bash
genai-bench benchmark \
  --api-backend vLLM \
  --api-key anything \
  --api-base http://heimdall-istio.{NAMESPACE}.svc.cluster.local \
  --api-model-name meta-llama/Llama-3.3-70B-Instruct \
  --model-tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --task text-to-text \
  --max-time-per-run 1000 \
  --max-requests-per-run 3200 \
  --server-engine vLLM \
  --traffic-scenario "N(3000,300)/(200,20)" \
  --num-concurrency 64 \
  --warmup-ratio 0.05 \
  --cooldown-ratio 0.05
```
### Experimental results
We compared our PD disaggregation setup against a baseline configuration using a plain Kubernetes Service, in which requests were distributed round-robin across eight vLLM instances without disaggregation. Time per output token (TPOT) dropped by approximately 30% (133 ms → 96 ms), and as a result the total benchmark runtime decreased by about 20% (1428.6 s → 1165.8 s), at the cost of a higher time to first token (TTFT).
*(Figures: end-to-end latency, TTFT (time to first token), and TPOT (time per output token) for the baseline and PD-disaggregated configurations.)*
However, the TTFT degradation also means that PD disaggregation should be applied carefully depending on the SLO. How to automate this scheduling decision in an SLO-driven manner is described in a separate document.