
Batch Size

Definition

The number of training examples processed together in one forward and backward pass before the model's weights are updated.

Batch size is a key hyperparameter that affects training speed, memory usage, and model quality. Larger batch sizes improve GPU utilization and yield more stable gradient estimates, but they require more memory and can converge to sharper minima that generalize less well. Smaller batch sizes produce noisier gradient estimates that can act as implicit regularization, at the cost of computational efficiency.

For large language model training, batch sizes are often very large (millions of tokens per batch) and may be gradually increased over the course of training. Gradient accumulation simulates a larger effective batch size on memory-limited hardware by accumulating gradients over several mini-batches before performing a single weight update.
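The gradient-accumulation idea can be sketched in a few lines. The example below is a minimal illustration using NumPy and a hand-written squared-error gradient (all names are illustrative, not from any particular framework): the gradient over one full batch equals the average of gradients over equally sized micro-batches, so accumulating and then updating once reproduces the large-batch step.

```python
import numpy as np

# Illustrative setup: linear model, squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch: 8 examples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)               # model weights
lr = 0.1

def grad(w, Xb, yb):
    """Mean squared-error gradient over one (mini-)batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one pass (assumes the whole batch fits in memory).
g_full = grad(w, X, y)

# The same effective batch as 4 micro-batches of 2 examples each:
# accumulate gradients, average, then apply a single weight update.
accum = np.zeros_like(w)
for i in range(0, len(y), 2):
    accum += grad(w, X[i:i + 2], y[i:i + 2])
g_accum = accum / 4

w = w - lr * g_accum
print(np.allclose(g_full, g_accum))  # True: the two gradients match
```

The equivalence holds exactly here because the micro-batches are equal-sized and the loss is a plain mean; in practice, framework details such as batch-normalization statistics can make the accumulated and true large-batch runs differ slightly.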
