
Multi-Head Attention

Definition

An extension of the attention mechanism that runs multiple attention operations in parallel, allowing the model to jointly attend to information from different representation subspaces.

Multi-Head Attention is a core component of the transformer architecture. Instead of computing a single attention function, the model projects queries, keys, and values into multiple lower-dimensional subspaces (heads), computes scaled dot-product attention independently in each head, and then concatenates the results and applies a final linear projection. Each head can learn to focus on a different type of relationship — one might capture syntactic dependencies, another semantic similarities, and another positional patterns. The original transformer used 8 heads (with a 512-dimensional model split into 64-dimensional heads), while large models such as GPT-3 use 96. Because the heads are independent, this structure parallelizes well on GPU hardware and gives the model a richer capacity to capture diverse patterns in the data.
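The project–split–attend–concatenate pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it omits masking, biases, batching, and dropout, and the function and weight names (`multi_head_attention`, `w_q`, etc.) are chosen for this example.

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Minimal multi-head self-attention sketch (no masking, no biases)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input, then split the projection into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def project_and_split(w):
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = project_and_split(w_q)
    k = project_and_split(w_k)
    v = project_and_split(w_v)

    # Scaled dot-product attention, computed independently in each head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    heads = weights @ v                              # (num_heads, seq_len, d_head)

    # Concatenate the heads back together and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy usage: 4 heads over a 16-dimensional model, sequence length 5
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 16, 4, 5
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (5, 16)
```

Note that splitting `d_model` across heads keeps the total computation roughly equal to single-head attention of the same width — the extra expressiveness comes from the independent projections, not extra cost.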
