# Load-aware routing

Load-aware routing monitors the number of assigned requests and the real-time utilization metrics of each inference instance (Pod) to decide where the next request should be routed. Because individual requests differ in workload characteristics and processing time, load-aware routing can achieve higher system-level efficiency than round-robin routing and, in particular, helps reduce latency variance across requests. Like other routing strategies such as prefix cache-aware routing, load-aware routing cannot serve as the sole routing criterion and should be combined with other metrics for the best routing decisions.
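The difference from round-robin routing can be illustrated with a toy model. The sketch below is not Heimdall code: it assumes each request has a known cost (a stand-in for processing time) and that the only load signal is the work already assigned to each Pod. The function names and the cost values are hypothetical.

```python
from itertools import cycle


def round_robin(costs, num_pods):
    """Assign requests to Pods in fixed rotation, ignoring load."""
    load = [0] * num_pods
    for pod, cost in zip(cycle(range(num_pods)), costs):
        load[pod] += cost
    return load


def least_loaded(costs, num_pods):
    """Assign each request to the Pod with the least accumulated work."""
    load = [0] * num_pods
    for cost in costs:
        # Ties go to the Pod with the lowest index.
        pod = min(range(num_pods), key=lambda i: load[i])
        load[pod] += cost
    return load


# Hypothetical request costs: two heavy requests mixed with light ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
rr = round_robin(costs, 2)    # [18, 4]: both heavy requests land on Pod 0
ll = least_loaded(costs, 2)   # [11, 11]: work is spread evenly
print(rr, ll)
```

In this toy case round-robin sends both heavy requests to the same Pod, while the load-aware choice evens out the work, which is the kind of imbalance that drives latency variance.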

# Key features

  • The Heimdall scheduler supports various scoring methods for load-aware routing.
  • The framework can dynamically adjust the importance of load-aware routing based on defined service level objectives (SLOs) and the current traffic volume.

# Scorer

The Heimdall scheduler currently supports four scoring methods that can be manually enabled, disabled, or weighted to adjust their influence. All scores are normalized to values between 0 and 1, and a higher score indicates a lighter load, meaning the Pod is preferred for routing.

  • queue-scorer: Returns a score based on the number of queued requests. The Pod with the fewest queued requests receives a score of 1.0, the one with the most receives 0.0, and the others are assigned proportionally based on their relative queue lengths.
  • load-aware-scorer: Returns a score based on the total number of requests.
  • active-request-scorer: Returns a score based on the number of active requests.
  • session-affinity-scorer: Returns a higher score if a Pod has previously handled a request from the same session (identified by the same x-session-token value in the HTTP header). This indirectly produces an effect similar to prefix cache-aware routing.
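The proportional scoring described for queue-scorer can be read as a min-max normalization over queue lengths. The following is a minimal sketch under that reading, not the scheduler's actual implementation; the function name `queue_scores`, the Pod names, and the handling of the case where all queues are equal are assumptions.

```python
def queue_scores(queue_lengths):
    """Map each Pod's queue length to a score in [0, 1].

    The shortest queue gets 1.0, the longest gets 0.0, and the rest are
    placed linearly in between.
    """
    lo = min(queue_lengths.values())
    hi = max(queue_lengths.values())
    if hi == lo:
        # Assumption: with identical queues no Pod is preferred,
        # so every Pod receives the top score.
        return {pod: 1.0 for pod in queue_lengths}
    return {pod: (hi - q) / (hi - lo) for pod, q in queue_lengths.items()}


# Hypothetical queue lengths for three Pods.
scores = queue_scores({"pod-a": 0, "pod-b": 5, "pod-c": 20})
print(scores)  # {'pod-a': 1.0, 'pod-b': 0.75, 'pod-c': 0.0}
```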

The following configuration file shows an example of manually enabling all scorers and assigning them equal weights.

heimdall-values.yaml
```yaml
...
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    ...
    - type: single-profile-handler
    - type: queue-scorer
    - type: load-aware-scorer
      parameters:
        threshold: 128
    - type: active-request-scorer
      parameters:
        requestTimeout: "2m"
    - type: session-affinity-scorer
    - type: max-score-picker
      parameters:
        maxNumOfEndpoints: 2
  schedulingProfiles:
    - name: default
      plugins:
        ...
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: load-aware-scorer
          weight: 1
        - pluginRef: active-request-scorer
          weight: 1
        - pluginRef: session-affinity-scorer
          weight: 1
        - pluginRef: max-score-picker
        ...

extraArgs:
  - -enablePprof=true
  - -modelServerMetricsPath=/metrics
  - -modelServerMetricsScheme=http
  - -modelServerMetricsHttpsInsecureSkipVerify=true
  - -zap-encoder=json
...
```
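To make the effect of the weights in the configuration above concrete, the sketch below combines per-scorer scores into one total per Pod and keeps the highest-scoring Pods. This section does not specify how Heimdall combines scores or how max-score-picker breaks ties, so the weighted sum, the helper names `combine` and `pick_top`, and all score values are assumptions made for illustration.

```python
def combine(scores_by_scorer, weights):
    """Sum each Pod's normalized scores, scaled by the scorer's weight."""
    total = {}
    for scorer, pod_scores in scores_by_scorer.items():
        w = weights[scorer]
        for pod, score in pod_scores.items():
            total[pod] = total.get(pod, 0.0) + w * score
    return total


def pick_top(total, max_endpoints):
    """Return up to max_endpoints Pods, highest total score first."""
    ranked = sorted(total, key=lambda pod: total[pod], reverse=True)
    return ranked[:max_endpoints]


# Hypothetical normalized scores (0 to 1, higher means lighter load).
scores_by_scorer = {
    "queue-scorer": {"pod-a": 1.0, "pod-b": 0.5, "pod-c": 0.0},
    "load-aware-scorer": {"pod-a": 0.5, "pod-b": 0.5, "pod-c": 0.25},
    "active-request-scorer": {"pod-a": 0.75, "pod-b": 0.5, "pod-c": 1.0},
    "session-affinity-scorer": {"pod-a": 0.0, "pod-b": 1.0, "pod-c": 0.0},
}
# Equal weights, mirroring `weight: 1` for every scorer in the profile.
weights = {name: 1 for name in scores_by_scorer}

total = combine(scores_by_scorer, weights)
picked = pick_top(total, max_endpoints=2)  # mirrors maxNumOfEndpoints: 2
print(total)   # {'pod-a': 2.25, 'pod-b': 2.5, 'pod-c': 1.25}
print(picked)  # ['pod-b', 'pod-a']
```

With equal weights, the session-affinity score is enough to move pod-b ahead of pod-a even though pod-a has the shortest queue. Raising the weight of one scorer relative to the others shifts this balance.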