Transformer
Definition
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel, becoming the dominant architecture for language and increasingly for vision and other modalities.
The Transformer was introduced in the landmark paper "Attention Is All You Need" by Vaswani et al. at Google in 2017. Unlike RNNs, which process sequences one element at a time, transformers use self-attention to relate all positions in a sequence simultaneously, enabling massive parallelization during training. The original architecture consists of encoder and decoder stacks, each built from multi-head attention layers and position-wise feed-forward networks. Transformer variants power virtually all modern LLMs (GPT, Claude, Gemini, Llama), most of which use a decoder-only design, and the architecture has been adapted for vision (ViT), audio, and multimodal tasks. It scales remarkably well with data and compute, following empirical scaling laws that predict performance improvements.
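The core operation described above, relating every position to every other position in one step, is scaled dot-product attention. The following is a minimal NumPy sketch of a single attention head with no masking; the variable names and the toy projection matrices are illustrative, not from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, the core transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of every position pair
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V                                # weighted mix of value vectors

# Toy example: 4 token embeddings of dimension 8, with illustrative
# learned projections Wq, Wk, Wv (here just random matrices).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): every position attends to all positions in parallel
```

Because the whole sequence is processed as one matrix product rather than step by step, this is what allows transformers to parallelize over sequence length during training, in contrast to RNNs.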
Related Terms
Attention Mechanism
A technique that allows neural networks to focus on the most relevant parts of the input when producing each element of the output.
Encoder-Decoder
A neural network architecture with two components: an encoder that compresses input data into a compact representation and a decoder that generates output from that representation.
Multi-Head Attention
An extension of the attention mechanism that runs multiple attention operations in parallel, allowing the model to attend to information from different representation subspaces simultaneously.
Positional Encoding
A technique that injects information about the position of each token in a sequence into the model, which is necessary because self-attention alone is order-invariant.