
Mixture of Experts

Definition

An architecture where multiple specialized sub-networks (experts) are combined with a gating mechanism that routes each input to only a subset of experts, enabling massive model capacity with efficient computation.

Mixture of Experts (MoE) allows models to scale to enormous parameter counts while keeping computational cost manageable. A router network learns to select the most relevant experts (typically 1–2 out of many) for each input token, so only a fraction of the total parameters are active for any given computation. This means a model with trillions of total parameters might only use billions per forward pass. Google's Switch Transformer demonstrated the approach at scale, and models like Mixtral by Mistral AI have popularized sparse MoE for open-source LLMs. GPT-4 is also widely rumored to use an MoE architecture. The approach trades a larger total parameter count, and thus a larger memory footprint, for lower computational cost per token during training and inference.
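The routing described above can be sketched in a few lines. This is a minimal illustration using NumPy, not any specific model's implementation: the "experts" are simple linear layers, the router is a single weight matrix, and the top-k selection with renormalized gate weights follows the common sparse-MoE recipe.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """Route input x to the top_k experts and combine their outputs.

    expert_weights: list of (d, d) matrices, one per expert
    router_weights: (d, num_experts) matrix producing routing logits
    """
    logits = x @ router_weights            # one routing logit per expert
    probs = softmax(logits)
    top = np.argsort(probs)[-top_k:]       # indices of the top_k experts
    gate = probs[top] / probs[top].sum()   # renormalize gate weights to sum to 1
    # Only the selected experts run; the rest are skipped entirely,
    # so compute scales with top_k, not with the total expert count.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, num_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
router = rng.standard_normal((d, num_experts))
x = rng.standard_normal(d)
y = moe_forward(x, experts, router, top_k=2)
```

With `top_k=2` and four experts, only half of the expert parameters participate in this forward pass, which is the source of MoE's compute savings at scale.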
