Multi-Head Attention
Definition
An extension of the attention mechanism that runs multiple attention operations in parallel, allowing the model to jointly attend to information from different representation subspaces.
Multi-Head Attention is a core component of the transformer architecture. Instead of computing a single attention function, the model projects queries, keys, and values into multiple lower-dimensional subspaces (heads), computes attention independently in each head, and then concatenates the results and applies a final linear projection. Each head can learn to focus on a different type of relationship — one might capture syntactic dependencies, another semantic similarities, and another positional patterns. The original transformer used 8 heads, while larger models use many more (GPT-3, for example, uses 96). This parallel structure is efficient on GPU hardware and gives the model richer capacity to capture diverse patterns in the data.
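The project-split-attend-concatenate pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it omits batching, masking, and bias terms, and the weight matrices `Wq`, `Wk`, `Wv`, `Wo` are randomly initialized placeholders for learned parameters.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Minimal multi-head self-attention sketch (no batching or masking)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # each head works in a lower-dimensional subspace

    # Reshape a (seq_len, d_model) projection into (num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ Wq)
    k = split_heads(x @ Wk)
    v = split_heads(x @ Wv)

    # Scaled dot-product attention, computed independently in every head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 4
Wq, Wk, Wv, Wo = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
x = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)
```

Note that the total cost is comparable to single-head attention over the full `d_model` dimension: the heads split the model dimension rather than multiply it, which is what makes the parallel structure cheap.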
Related Terms
Attention Mechanism
A technique that allows neural networks to focus on the most relevant parts of the input when produc...
Positional Encoding
A technique that injects information about the position of each token in a sequence into the model, ...
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequ...