# Expert parallelism

Expert parallelism (EP) is a parallelization method that distributes the experts of a Mixture-of-Experts (MoE) model across different GPUs and executes them in parallel. While EP is not mandatory for MoE models, it generally helps increase overall inference throughput (total output tokens per second).
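As a concrete illustration, the sketch below shows the simplest form of expert placement: a hypothetical MoE layer whose experts are assigned round-robin to GPUs, so each GPU stores and executes only its own subset. The expert and GPU counts and the round-robin rule are illustrative assumptions, not the framework's actual placement policy.

```python
# Hypothetical MoE layer: 64 experts sharded round-robin across 8 GPUs.
num_experts = 64
num_gpus = 8

# Expert e lives on GPU e % num_gpus, so each GPU stores only
# num_experts // num_gpus expert weight sets instead of all 64.
placement = {e: e % num_gpus for e in range(num_experts)}

# Experts resident on GPU 0.
local_experts = [e for e, g in placement.items() if g == 0]
print(local_experts)  # [0, 8, 16, 24, 32, 40, 48, 56]
```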

However, implementing EP efficiently is highly challenging because routing between experts, and ultimately the workload distribution across GPUs, is determined dynamically at runtime. GPU kernels and libraries must remain efficient across this wide range of routing patterns. EP also requires a complex all-to-all communication pattern known as dispatch and combine, and minimizing its overhead is critical. Finally, routing frequency is inherently imbalanced across experts, making it crucial to maintain an even workload distribution across GPUs despite these differences.
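The toy simulation below illustrates the dispatch/combine pattern in a single process, assuming the round-robin placement above and a top-1 router. In a real implementation the bucketing is an all-to-all exchange between GPUs; here the tensor sizes and the placeholder expert computation are assumptions chosen only to make the data flow visible.

```python
import numpy as np

num_gpus, num_experts, num_tokens, hidden_dim = 4, 8, 16, 8
rng = np.random.default_rng(0)

# Router output: top-1 expert per token (real routers are typically top-k).
routed_expert = rng.integers(0, num_experts, size=num_tokens)
dest_gpu = routed_expert % num_gpus  # owner under round-robin placement

# Dispatch: bucket token indices by destination GPU (the all-to-all send).
buckets = {g: np.flatnonzero(dest_gpu == g) for g in range(num_gpus)}
# Bucket sizes differ because routing is data-dependent; this is the
# load imbalance that EP implementations must cope with.
print({g: len(idx) for g, idx in buckets.items()})

# Each GPU applies its local expert FFNs; a scaling stands in for them here.
hidden = rng.standard_normal((num_tokens, hidden_dim))
out = np.empty_like(hidden)
for g, idx in buckets.items():
    out[idx] = hidden[idx] * 2.0  # placeholder for expert compute on GPU g

# Combine: writing results back at the original token positions mirrors
# the all-to-all receive that restores the original token order.
```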

In addition, while EP is effective for improving throughput, it can lead to higher latency compared to other parallelization methods such as tensor parallelism (TP). Therefore, it should be applied carefully according to the given service level objectives (SLOs) to achieve optimal results.

# Key features

  • For various MoE models, the framework can distribute experts across multiple GPUs within a single server or across GPUs in multiple servers. In the multi-server case, the Odin inference service launches and coordinates multiple instances simultaneously.
  • The framework can automatically determine whether to apply EP and how to allocate experts to GPUs according to defined SLOs.
  • Moreh vLLM is optimized to execute inter-GPU communication efficiently and to minimize load imbalance across GPUs on AMD MI200 and MI300 series GPUs; one possible balancing heuristic is sketched after this list.
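As one example of how routing imbalance can be mitigated, the following sketch greedily assigns experts to GPUs based on observed routing frequency (a longest-processing-time bin-packing heuristic). This is an illustrative approach, not Moreh vLLM's actual algorithm; the `balance_experts` helper and the routing counts are hypothetical.

```python
import heapq

def balance_experts(routing_counts, num_gpus):
    """Greedily assign each expert to the currently least-loaded GPU.

    routing_counts[e] is how often expert e was routed to (hypothetical
    profiling data). Returns a dict mapping expert id -> GPU rank.
    """
    heap = [(0, g) for g in range(num_gpus)]  # (accumulated load, gpu rank)
    heapq.heapify(heap)
    placement = {}
    # Place the hottest experts first so they spread across GPUs.
    for e in sorted(range(len(routing_counts)),
                    key=lambda e: -routing_counts[e]):
        load, gpu = heapq.heappop(heap)
        placement[e] = gpu
        heapq.heappush(heap, (load + routing_counts[e], gpu))
    return placement

# Hypothetical per-expert routing counts for 8 experts on 4 GPUs.
print(balance_experts([90, 5, 40, 40, 10, 15, 70, 30], num_gpus=4))
```

With these counts the hottest experts (90 and 70) land on separate GPUs while the colder ones are packed together, keeping per-GPU load roughly even (90/70/70/70 in this run).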