# Load-aware routing
Load-aware routing monitors the number of assigned requests and real-time utilization metrics of each inference instance (Pod) to determine where the next request should be routed. Since individual requests have different workload characteristics and processing times, applying load-aware routing can achieve higher system-level efficiency than round-robin routing and especially help reduce latency variance across requests. Similar to other routing strategies such as prefix cache-aware routing, load-aware routing cannot serve as the sole routing criterion and should be combined with other metrics for optimal decision-making.
## Key features
- The Heimdall scheduler supports various scoring methods for load-aware routing.
- The framework can dynamically adjust the importance of load-aware routing based on defined service level objectives (SLOs) and the current traffic volume.
## Scorer
The Heimdall scheduler currently supports four scoring methods that can be manually enabled, disabled, or weighted to adjust their influence. All scores are normalized to values between 0 and 1. A higher score indicates a lighter load, so Pods with higher scores are preferred for routing.
- queue-scorer: Returns a score based on the number of queued requests. The Pod with the fewest queued requests receives a score of 1.0, the one with the most receives 0.0, and the others are assigned proportionally based on their relative queue lengths.
- load-aware-scorer: Returns a score based on the total number of requests.
- active-request-scorer: Returns a score based on the number of active requests.
- session-affinity-scorer: Returns a higher score if a Pod has previously handled a request from the same session (with the same `x-session-token` value in the HTTP header). This indirectly produces a similar effect to prefix cache-aware routing.
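The proportional scoring described for queue-scorer can be read as min-max normalization over the queue lengths of all candidate Pods. The following is a minimal sketch of that reading, not Heimdall's actual implementation. The function name `queue_scores`, the Pod names, and the handling of the case where all queues are equal are assumptions made for illustration.

```python
def queue_scores(queue_lengths: dict[str, int]) -> dict[str, float]:
    """Min-max normalize queue lengths so the shortest queue scores 1.0
    and the longest scores 0.0 (hypothetical helper, for illustration)."""
    if not queue_lengths:
        return {}
    shortest = min(queue_lengths.values())
    longest = max(queue_lengths.values())
    if longest == shortest:
        # Assumption: when every Pod has the same queue length,
        # treat them all as equally preferred.
        return {pod: 1.0 for pod in queue_lengths}
    span = longest - shortest
    return {pod: (longest - n) / span for pod, n in queue_lengths.items()}


# Three Pods with 2, 6, and 10 queued requests.
scores = queue_scores({"pod-a": 2, "pod-b": 6, "pod-c": 10})
print(scores)  # {'pod-a': 1.0, 'pod-b': 0.5, 'pod-c': 0.0}
```

Here pod-b sits halfway between the shortest and longest queue, so it receives a score of 0.5.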
The following configuration file shows an example of manually enabling all scorers and assigning them equal weights.
```yaml
...
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    ...
    - type: single-profile-handler
    - type: queue-scorer
    - type: load-aware-scorer
      parameters:
        threshold: 128
    - type: active-request-scorer
      parameters:
        requestTimeout: "2m"
    - type: session-affinity-scorer
    - type: max-score-picker
      parameters:
        maxNumOfEndpoints: 2
  schedulingProfiles:
    - name: default
      plugins:
        ...
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: load-aware-scorer
          weight: 1
        - pluginRef: active-request-scorer
          weight: 1
        - pluginRef: session-affinity-scorer
          weight: 1
        - pluginRef: max-score-picker
...
extraArgs:
  - -enablePprof=true
  - -modelServerMetricsPath=/metrics
  - -modelServerMetricsScheme=http
  - -modelServerMetricsHttpsInsecureSkipVerify=true
  - -zap-encoder=json
...
```
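To illustrate how equal weights might interact with the picker, the sketch below combines per-scorer scores as a weighted sum and keeps the highest-scoring Pods. This is an assumption about how the weights and `maxNumOfEndpoints` are applied, not a description of Heimdall's code. The functions `combine_scores` and `pick_top`, the Pod names, and all score values are hypothetical.

```python
import random


def combine_scores(scores_by_scorer, weights):
    """Return the weighted sum of scorer outputs for each Pod.
    Scorers without a configured weight contribute nothing (assumption)."""
    totals = {}
    for scorer, pod_scores in scores_by_scorer.items():
        weight = weights.get(scorer, 0)
        for pod, score in pod_scores.items():
            totals[pod] = totals.get(pod, 0.0) + weight * score
    return totals


def pick_top(totals, max_num_of_endpoints=2, rng=None):
    """Return up to max_num_of_endpoints Pods with the highest totals.
    Ties are broken randomly, which is an assumption of this sketch."""
    rng = rng or random.Random()
    pods = list(totals)
    rng.shuffle(pods)  # randomize order so the stable sort breaks ties randomly
    pods.sort(key=lambda pod: totals[pod], reverse=True)
    return pods[:max_num_of_endpoints]


# Made-up scores, each already normalized to the range 0 to 1.
scores_by_scorer = {
    "queue-scorer": {"pod-a": 1.0, "pod-b": 0.5, "pod-c": 0.0},
    "load-aware-scorer": {"pod-a": 0.8, "pod-b": 0.6, "pod-c": 0.2},
    "active-request-scorer": {"pod-a": 0.5, "pod-b": 1.0, "pod-c": 0.0},
    "session-affinity-scorer": {"pod-a": 0.0, "pod-b": 0.0, "pod-c": 1.0},
}
weights = {name: 1 for name in scores_by_scorer}  # equal weights, as in the config

totals = combine_scores(scores_by_scorer, weights)
selected = pick_top(totals, max_num_of_endpoints=2)
print(selected)  # ['pod-a', 'pod-b']
```

In this example pod-c wins on session affinity alone, but its low load-based scores keep it out of the top two. Raising the weight of session-affinity-scorer would shift that balance.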