Flash Attention
Definition
An IO-aware attention algorithm that reduces attention's memory footprint from quadratic to linear in sequence length and speeds up transformer training and inference. Flash Attention achieves this by tiling the computation so that blocks of queries, keys, and values fit in fast on-chip SRAM, computing softmax incrementally with a running normalization, and never materializing the full attention matrix in GPU high-bandwidth memory. It is now standard in most LLM training pipelines.
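The tiling and running-normalization idea can be sketched in NumPy. This is an illustrative single-head, non-causal sketch of the online-softmax recurrence, not the real fused CUDA kernel; the function names and block size are made up for the example:

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference implementation: materializes the full (n, n) score matrix.
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def flash_attention_sketch(q, k, v, block_size=4):
    # Process K/V in blocks, keeping only per-row running statistics:
    #   m: running max of scores, l: running softmax normalizer, o: running output.
    # Only (n, block_size) score tiles ever exist, never the full (n, n) matrix.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    o = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row-wise max (for numerical stability)
    l = np.zeros(n)           # running denominator of the softmax
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        corr = np.exp(m - m_new)               # rescale old stats to the new max
        l = l * corr + p.sum(axis=-1)
        o = o * corr[:, None] + p @ vb
        m = m_new
    return o / l[:, None]
```

In the real kernel the tiles live in SRAM and the loop is fused into one GPU pass, but the recurrence above is why the full attention matrix never needs to be stored.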