Distributed Training
Definition
The practice of training an AI model across many GPUs or machines simultaneously, either to speed up training or to fit models too large for a single device. Common techniques include data parallelism (replicating the model and splitting the data across devices), model parallelism (splitting a model's layers or tensors across devices), and pipeline parallelism (splitting the model into sequential stages). Training frontier models requires thousands of GPUs coordinated through frameworks such as DeepSpeed and Megatron-LM.
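The data-parallel pattern can be illustrated with a minimal, framework-free sketch (all names here are hypothetical; real systems perform the gradient all-reduce over NCCL via frameworks like DeepSpeed or PyTorch DDP). Each simulated worker holds a full model replica, computes gradients on its own shard of the batch, and the gradients are averaged so every replica applies the same update:

```python
import numpy as np

def grad_mse(w, X, y):
    # Gradient of mean-squared error for a linear model X @ w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, X, y, num_workers, lr=0.1):
    # Shard the batch across workers (assumes equal shard sizes).
    X_shards = np.array_split(X, num_workers)
    y_shards = np.array_split(y, num_workers)
    # Each worker computes a local gradient on its own shard.
    local_grads = [grad_mse(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # All-reduce step: average the gradients so all replicas stay in sync.
    avg_grad = np.mean(local_grads, axis=0)
    return w - lr * avg_grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)

w_parallel = data_parallel_step(w0, X, y, num_workers=4)
w_single = w0 - 0.1 * grad_mse(w0, X, y)
```

With equal shard sizes, the average of per-shard gradients equals the full-batch gradient, so the data-parallel update matches single-device training exactly; this equivalence is what makes the technique scale without changing the optimization.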