# Context length-aware routing

Context length-aware routing filters and scores pods based on the context length (token count) of LLM requests. This enables routing requests to pods optimized for specific context length ranges, improving resource utilization and performance.
## Key features
- The Heimdall scheduler supports context length-aware filtering and scoring to route requests to appropriate pods.
- Pods can be labeled with supported context length ranges, allowing flexible resource allocation based on request characteristics.
- The framework supports both filtering mode (strict exclusion) and scoring mode (preference-based selection).
## Use cases
- Resource optimization: Route short context requests to smaller pods and long context requests to larger pods.
- Performance improvement: Utilize pods with specialized model configurations for specific context lengths.
- Cost reduction: Allocate appropriate resources based on request size.
## Scorer

Context length-aware routing is applied by enabling and configuring the `context-length-aware` plugin in the Heimdall scheduler. The following configuration file shows an example of setting up the scorer.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: pd-profile-handler
- type: prefill-filter
- type: decode-filter
- type: context-length-aware
  parameters:
    enableFiltering: false      # Set to true to enable filtering, false for scoring only
    charToTokenMultiplier: 0.25 # 1 token ≈ 4 characters
- type: max-score-picker
schedulingProfiles:
- name: prefill
  plugins:
  - pluginRef: prefill-filter
  - pluginRef: context-length-aware
  - pluginRef: max-score-picker
- name: decode
  plugins:
  - pluginRef: decode-filter
  - pluginRef: context-length-aware
  - pluginRef: max-score-picker
```
## Parameters

- `enableFiltering`: set to `true` to exclude pods whose ranges do not match the request (filtering mode); set to `false` to apply scoring only.
- `charToTokenMultiplier`: multiplier used to estimate token count from character count (for example, `0.25` means 1 token ≈ 4 characters).
## Pod label setup
Pods must be labeled with their supported context length ranges for routing to work correctly.
### Label format

```
<label-name>: "<min>-<max>"
```

- `min`: minimum token count (0 or greater)
- `max`: maximum token count (must be >= `min`)
### Single range example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-small
  labels:
    mif.moreh.io/context-length-range: "0-2048"
```
### Multiple ranges example

Multiple ranges can be specified using comma separation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-mixed
  labels:
    mif.moreh.io/context-length-range: "0-2048,8192-16384"
```
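To make the label syntax concrete, here is a minimal Python sketch of a parser for the documented `<min>-<max>[,<min>-<max>...]` format. This is illustrative only; it is not the scheduler's actual implementation, and the function name is hypothetical.

```python
def parse_ranges(label: str) -> list[tuple[int, int]]:
    """Parse a context-length-range label value such as
    "0-2048,8192-16384" into a list of (min, max) pairs."""
    ranges = []
    for part in label.split(","):
        lo_str, hi_str = part.split("-")
        lo, hi = int(lo_str), int(hi_str)
        if lo < 0 or hi < lo:  # min must be >= 0 and max must be >= min
            raise ValueError(f"invalid range: {part!r}")
        ranges.append((lo, hi))
    return ranges

print(parse_ranges("0-2048,8192-16384"))  # [(0, 2048), (8192, 16384)]
```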
## How it works
### Context length estimation

The context length of a request is estimated as follows:

```
Estimated tokens = Total characters × 0.25
```

- Chat Completions: sum of the character counts of all message `content` fields
- Completions: character count of the `prompt` field
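The estimation above can be sketched in a few lines of Python. This is an illustration of the documented formula, not the scheduler's code; truncating to an integer is an assumption.

```python
CHAR_TO_TOKEN = 0.25  # charToTokenMultiplier: 1 token ≈ 4 characters

def estimate_tokens_chat(messages: list[dict]) -> int:
    """Chat Completions: sum the character counts of all message contents."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return int(total_chars * CHAR_TO_TOKEN)

def estimate_tokens_completion(prompt: str) -> int:
    """Completions: use the prompt's character count."""
    return int(len(prompt) * CHAR_TO_TOKEN)
```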
### Filter mode (`enableFiltering: true`)

When filtering is enabled:

- Pods without a label accept all context lengths → pass
- Pods with a label pass if the request's context length falls within one of their ranges
- Pods whose ranges do not match the request are excluded
### Score mode (always active)
All pods are assigned scores based on context length matching:
#### Scoring algorithm

When the request's context length falls within a pod's range:

```
score = 0.7 × widthScore + 0.3 × positionScore
widthScore = 1.0 / (1.0 + rangeWidth / 10000)
positionScore = (max - contextLength) / rangeWidth
```
- widthScore: Narrower ranges receive higher scores (specialized pods are preferred)
- positionScore: More headroom to max results in higher scores (stable processing)
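A direct Python transcription of the formula makes it easy to check concrete values; this is a sketch of the documented math, not the plugin's source.

```python
def score(lo: int, hi: int, context_length: int) -> float:
    """Score a pod whose matching range is [lo, hi] for a request
    of context_length tokens, per the formula above."""
    range_width = hi - lo
    width_score = 1.0 / (1.0 + range_width / 10000)
    position_score = (hi - context_length) / range_width
    return 0.7 * width_score + 0.3 * position_score

# A 500-token request against a 0-2048 pod:
print(round(score(0, 2048, 500), 4))  # 0.8078
```

Note how the same pod scores lower as the request approaches its `max` (less headroom), which is the positionScore term at work.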
## Examples
The following examples demonstrate scoring mode (enableFiltering: false), which assigns scores to all pods without excluding any.
### Scenario 1: Short context request (500 tokens)

Request context length: 500 tokens

Pod configuration:

- Pod A: `0-2048` (small)
- Pod B: `2048-8192` (medium)
- Pod C: no label

Result: Pod A selected.
### Scenario 2: Overflow (50,000 tokens)

Request context length: 50,000 tokens (exceeds all pod ranges)

Pod configuration:

- Pod A: `0-8192`
- Pod B: `0-32768`

Result: Pod B selected (larger maximum).
## Example: Deployment with context length-aware routing
This example shows how to deploy inference services with context length-aware routing. Different pods are configured to handle different context length ranges, routing short context requests to smaller pods and long context requests to larger pods.
### Deployment
The following configuration files show how to set up the context-length-aware plugin on the Heimdall scheduler and configure inference services with appropriate context length labels.
```yaml
global:
  imagePullSecrets:
  - name: moreh-registry
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
  - type: single-profile-handler
  - type: queue-scorer
  - type: context-length-aware
  - type: max-score-picker
    parameters:
      maxNumOfEndpoints: 2
  schedulingProfiles:
  - name: default
    plugins:
    - pluginRef: queue-scorer
      weight: 1
    - pluginRef: context-length-aware
      weight: 2
    - pluginRef: max-score-picker
gateway:
  name: mif
  gatewayClassName: istio
```
```yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: inference-small
spec:
  replicas: 1
  inferencePoolRefs:
  - name: heimdall
  templateRefs:
  - name: vllm
  - name: quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2
  template:
    metadata:
      labels:
        mif.moreh.io/context-length-range: "0-2048"
---
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: inference-large
spec:
  replicas: 1
  inferencePoolRefs:
  - name: heimdall
  templateRefs:
  - name: vllm
  - name: quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2
  template:
    metadata:
      labels:
        mif.moreh.io/context-length-range: "2048-32768"
```
With this configuration, requests with context lengths up to 2,048 tokens are routed to the small inference pod, while requests with context lengths between 2,048 and 32,768 tokens are routed to the large inference pod. This ensures efficient resource utilization by matching request sizes to appropriately sized pods.
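To see the routing outcome of this deployment at a glance, here is a small Python sketch that estimates a request's token count and picks the pod whose labeled range contains it. The actual decision is made by the Heimdall scheduler's filtering and scoring; this merely mirrors the two ranges configured above, and the `route` function is hypothetical.

```python
# Ranges mirroring the example deployment above.
PODS = {
    "inference-small": (0, 2048),
    "inference-large": (2048, 32768),
}

def route(prompt: str, char_to_token: float = 0.25) -> str:
    """Pick the pod whose labeled range contains the estimated token count."""
    est = int(len(prompt) * char_to_token)
    for name, (lo, hi) in PODS.items():
        if lo <= est <= hi:
            return name
    raise ValueError(f"no pod covers {est} tokens")

print(route("x" * 4000))   # ~1000 tokens -> inference-small
print(route("x" * 40000))  # ~10000 tokens -> inference-large
```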