# Home
MoAI Inference Framework is a distributed inference framework that optimizes LLM inference at data center scale.
- Support for diverse accelerators: Supports AMD GPUs and Tenstorrent AI accelerators in addition to NVIDIA GPUs, giving AI data centers a broader choice of chips. The entire software stack, from GPU kernels and libraries to model implementation and distributed inference, is highly optimized for these non-NVIDIA accelerators to deliver performance comparable to, or even surpassing, that of NVIDIA GPUs.
- Model disaggregation and parallelization: Applies techniques such as prefill-decode disaggregation and expert parallelism to maximize the overall throughput of the cluster.
- Optimal routing and scheduling: Distributes incoming requests to the most suitable inference instances by considering various factors such as prefix cache locality and performance characteristics, resulting in better latency and throughput compared to using a simple load balancer.
- Auto scaling: Dynamically adjusts both the total number of GPUs and the number of GPUs assigned to each disaggregated model, depending on the amount and pattern of incoming requests. This ensures efficient resource utilization at data center scale.
- Heterogeneous accelerator utilization: Distributes different workloads (e.g., prefill and decode) across different types of accelerators to improve the overall efficiency of the system. For example, it can mix older and newer GPUs, combine NVIDIA and AMD GPUs, or pair GPUs with Rubin CPX or Tenstorrent chips.
- SLO-based automated distributed inference: Automatically combines all the aforementioned techniques to maximize system throughput while satisfying defined service level objectives (SLOs).
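
To make the routing idea in the list above more concrete, here is a minimal Python sketch of prefix-cache-aware routing. It is not the framework's actual algorithm or API: the `Instance` class, its fields, and the cost formula are hypothetical, chosen only to show how prefix cache locality and per-instance performance could be weighed together instead of balancing on load alone.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical inference instance; all fields are illustrative assumptions."""
    name: str
    queued_tokens: int = 0             # proxy for current load
    tokens_per_second: float = 1000.0  # proxy for accelerator prefill speed
    cached_prefixes: list = field(default_factory=list)  # token lists already cached


def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_cost(inst, prompt):
    """Estimate seconds until the prompt is prefilled on this instance.

    Tokens covered by a cached prefix are assumed to cost nothing.
    """
    hit = max((shared_prefix_len(p, prompt) for p in inst.cached_prefixes), default=0)
    uncached = len(prompt) - hit
    return (inst.queued_tokens + uncached) / inst.tokens_per_second


def route(instances, prompt):
    """Return the instance with the lowest estimated cost for this prompt."""
    return min(instances, key=lambda inst: estimated_cost(inst, prompt))


if __name__ == "__main__":
    idle = Instance("idle")
    warm = Instance("warm", queued_tokens=200, cached_prefixes=[list(range(900))])
    prompt = list(range(1000))  # shares its first 900 tokens with warm's cache
    print(route([idle, warm], prompt).name)
```

In this sketch the busier instance still wins when it already holds most of the prompt in its prefix cache, which is a choice that a load balancer looking only at queue length would not make.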
# Materials
- Distributed Inference on Heterogeneous Accelerators Including GPUs, Rubin CPX, and AI Accelerators (blog article)
- Moreh vLLM Performance Evaluation: DeepSeek V3/R1 671B on AMD Instinct MI300X GPUs (technical report)
- Moreh vLLM Performance Evaluation: Llama 3.3 70B on AMD Instinct MI300X GPUs (technical report)
- Moreh-Tenstorrent AI Data Center Solution System Architecture (technical report)