Mixture of Experts
Definition
An architecture where multiple specialized sub-networks (experts) are combined with a gating mechanism that routes each input to only a subset of experts, enabling massive model capacity with efficient computation.
Mixture of Experts (MoE) allows models to scale to enormous parameter counts while keeping computational cost manageable. A router network learns to select the most relevant experts (typically 1-2 out of many) for each input token, so only a fraction of the total parameters are active in any given forward pass. A model with trillions of total parameters might therefore use only billions of them per token. Google's Switch Transformer demonstrated the approach at scale, and models like Mixtral by Mistral AI have popularized sparse MoE for open-source LLMs; GPT-4 is also rumored to use an MoE architecture. The approach trades memory footprint (all experts must be stored) for computational efficiency during inference, since unselected experts are never evaluated.
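The routing described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: each expert is reduced to a single weight matrix, the router is a linear layer producing one logit per expert, and the gate weights are renormalized over only the top-k selected experts (all names and dimensions here are made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # toy sizes, chosen for illustration

# Each "expert" is stood in for by a single weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
# The router maps a token vector to one logit per expert.
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route one token vector x through its top-k experts only."""
    logits = x @ router_w                      # shape (n_experts,)
    top = np.argsort(logits)[-top_k:]          # indices of the k largest logits
    # Softmax over the selected logits only, so the gates sum to 1.
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()
    # Weighted sum of the chosen experts' outputs; the remaining
    # n_experts - top_k experts are never evaluated for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

With top_k = 2 of 4 experts, only half of the expert parameters participate in each token's computation, which is the source of the compute savings; the full set must still reside in memory so the router can pick any of them.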
Related Terms
Inference
The process of using a trained AI model to generate predictions or outputs on new data, as opposed to training.
Large Language Model
A neural network with billions of parameters trained on massive text datasets, capable of understanding and generating natural language.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequences in parallel.