Batch Size
Definition
The number of training examples processed together in one forward and backward pass before the model's weights are updated.
Batch size is a key hyperparameter that affects training speed, memory usage, and model quality. Larger batch sizes allow better GPU utilization and more stable gradient estimates, but require more memory and can lead to sharper minima that generalize less well. Smaller batch sizes provide noisier gradient estimates that can act as a form of regularization, but are less computationally efficient.

For large language model training, batch sizes are often very large (millions of tokens per batch) and may be gradually increased during training. Gradient accumulation allows simulating larger batch sizes on hardware with limited memory by accumulating gradients over multiple mini-batches before performing a weight update.
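A minimal sketch of the gradient-accumulation idea, using a toy linear model with mean-squared-error loss (the model, data, and mini-batch size here are illustrative assumptions, not from any particular framework): summing appropriately weighted mini-batch gradients before a single weight update reproduces the full-batch gradient.

```python
import numpy as np

# Toy setup (assumed for illustration): linear model y_hat = X @ w
# with MSE loss 0.5 * mean((X @ w - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 examples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)

def grad(Xb, yb, w):
    """Gradient of the MSE loss over mini-batch (Xb, yb) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient: effective batch size 8.
g_full = grad(X, y, w)

# Same effective batch size via accumulation: 4 mini-batches of 2,
# each weighted by its share of the examples (2/8).
g_accum = np.zeros_like(w)
for i in range(0, 8, 2):
    g_accum += grad(X[i:i + 2], y[i:i + 2], w) * (2 / 8)

# The accumulated gradient matches the full-batch gradient,
# so one update after accumulation simulates the larger batch.
assert np.allclose(g_full, g_accum)
w -= 0.1 * g_accum  # single weight update, learning rate 0.1
```

In practice frameworks accumulate gradients in place across backward passes and call the optimizer step once per effective batch; the weighting above is what keeps the accumulated gradient equal to the mean over the full effective batch.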
Related Terms
Epoch
One complete pass through the entire training dataset during model training.
GPU
Graphics Processing Unit — a specialized processor originally designed for rendering graphics but now widely used for the parallel computation required in machine learning.
Gradient Descent
An optimization algorithm that iteratively adjusts model parameters in the direction that most reduces the loss function.
Hyperparameter
A configuration value set before training begins that controls the training process itself, as opposed to model parameters, which are learned from the data.