# Prefix cache-aware routing
Prefix caching is a technique that stores the KV cache from previous queries so that subsequent queries with an identical prefix can reuse it, eliminating redundant computation and improving performance. Since multiple queries often share common prefixes, such as system prompts, conversation history, or contextual documents, recomputing the KV cache for every request would be highly inefficient.
In a system composed of multiple inference instances (Pods), each instance maintains its own (L1) prefix cache in GPU memory. As a result, the cache hit rate (the length of the cached prefix) can vary depending on which instance a request is routed to. Prefix cache-aware routing calculates the cache hit rate of the given request for each Pod and prioritizes routing to the Pod with the highest cache coverage. This reduces redundant KV computation and improves both time to first token (TTFT) and overall throughput.
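The routing idea above can be sketched as follows. This is a minimal illustration, not the Heimdall implementation: the function names and the pod-cache representation (each pod exposing the token sequences it has cached) are assumptions for the example.

```python
# Hypothetical sketch of prefix cache-aware routing: score each pod by
# how much of the prompt's prefix its cache covers, then route to the
# highest-scoring pod.

def common_prefix_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cache_hit_ratio(prompt_tokens, cached_sequences):
    """Normalized score in [0, 1]: longest cached prefix / prompt length."""
    if not prompt_tokens:
        return 0.0
    best = max(
        (common_prefix_len(prompt_tokens, s) for s in cached_sequences),
        default=0,
    )
    return best / len(prompt_tokens)

def route(prompt_tokens, pod_caches):
    """pod_caches maps pod name -> token sequences cached on that pod."""
    return max(
        pod_caches,
        key=lambda pod: cache_hit_ratio(prompt_tokens, pod_caches[pod]),
    )
```

For example, a prompt whose first half matches one pod's cache and whose entirety matches another's would score 0.5 and 1.0 respectively, and the request would be routed to the latter.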
However, in real-world inference systems, the cache hit rate cannot serve as the sole routing criterion. It must be weighed against other factors, such as the workload characteristics of the requests and the current state of each Pod, to make optimal routing decisions.
## Key features
- The Heimdall scheduler tokenizes the request prompt, calculates the cache hit rate for each Pod, and assigns a normalized score to each Pod so that it can be used as a routing decision criterion. It continuously receives updates on each Pod's cache status through ZMQ events.
- The framework can determine how much weight to assign to prefix cache-aware routing based on the given service level objectives (SLOs) and the compute characteristics of the GPUs (i.e., the penalty of recomputing the KV cache).
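A weighted blend of scorer outputs captures this trade-off. The sketch below is illustrative only; the function name and default weights are assumptions, not Heimdall's actual scoring formula.

```python
def combined_score(prefix_score, load_score,
                   prefix_weight=2.0, load_weight=1.0):
    """Blend a normalized prefix-cache score with a normalized load score.

    A service with tight TTFT SLOs, or GPUs where KV-cache recomputation
    is expensive, would raise prefix_weight; a throughput-oriented
    service that must balance queue depth would raise load_weight.
    Both inputs and the result are in [0, 1].
    """
    total = prefix_weight + load_weight
    return (prefix_weight * prefix_score + load_weight * load_score) / total
```

With the example weights, a pod with a perfect cache hit but zero load score still receives a combined score of 2/3, so a heavily loaded pod with a full cache can lose to an idle pod with partial coverage.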
## Scoring configuration
The following configuration file shows an example of setting up each Pod's prefix cache information and the model tokenizer for the Heimdall scheduler's prefix cache-aware scoring.
```yaml
plugins:
  - type: precise-prefix-cache-scorer
    parameters:
      indexerConfig:
        prefixStoreConfig:
          cacheSize: 500000
          blockSize: 256
        tokenProcessorConfig:
          blockSize: 16
          hashSeed: "12345"
        kvBlockIndexConfig:
          inMemoryConfig:
            size: 100000000
            podCacheSize: 10
          enableMetrics: true
        tokenizersPoolConfig:
          workersCount: 8
          minPrefixOverlapRatio: 0.8
          huggingFaceToken: "<your-huggingface-token>"
          tokenizersCacheDir: "/tmp"
      kvEventsConfig:
        zmqEndpoint: "tcp://*:5557"
        topicFilter: "kv@"
        concurrency: 8
```
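The `tokenProcessorConfig` above chunks the tokenized prompt into fixed-size blocks (`blockSize: 16`) and hashes each block with a seed (`hashSeed`), so that cache lookups compare compact block hashes rather than raw token sequences. The sketch below illustrates one way such seeded, chained block hashing could work; the exact hashing scheme (SHA-256 chaining here) is an assumption for illustration, not the indexer's actual algorithm.

```python
import hashlib

def chunk_and_hash(tokens, block_size=16, seed="12345"):
    """Split tokens into fixed-size blocks and produce one hash per block.

    Each block hash is chained to its predecessor, so a single hash
    identifies the entire prefix up to and including that block, and
    two prompts share a block hash only if they share the whole prefix.
    Incomplete trailing blocks are ignored.
    """
    parent = seed.encode()
    hashes = []
    for i in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[i:i + block_size]
        digest = hashlib.sha256(parent + b"|" + repr(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()  # chain: commit to the whole prefix so far
    return hashes
```

Changing the seed changes every block hash, which is why `hashSeed` must match the value the serving engine uses when it publishes its cached block hashes over ZMQ.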