Flash Attention Mechanics: How Tiled Attention Fits in SRAM

Flash Attention computes self-attention in tiles small enough to fit in on-chip SRAM, so the full N×N attention matrix is never written out to slower GPU memory (HBM). This reduces memory reads and writes and speeds up self-attention in transformers without changing the result.
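
To make the tiling idea concrete, here is a minimal pure-Python sketch, not the actual GPU kernel. It assumes a single attention head, no masking, and plain lists of floats; the function names `naive_attention` and `tiled_attention` and the `block` parameter are illustrative and do not come from any library. A real implementation also tiles the queries and runs on the GPU; this version only walks over key/value blocks one query row at a time, keeping a running maximum, a running softmax denominator, and a running output so that a full row of scores is never stored.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: builds a full row of N scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Illustrative tiled version: visits K and V in blocks of size `block`."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf      # running max of the scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output row
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m, max(s))
            # Rescale earlier partial sums to the new max (0.0 on the first block).
            correction = math.exp(m - m_new)
            p = [math.exp(x - m_new) for x in s]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
    V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(naive_attention(Q, K, V))
    print(tiled_attention(Q, K, V, block=2))
```

Both functions should agree up to floating-point rounding. The difference is in working memory: the tiled version only ever holds `block` scores per query at once, which is the property that lets the real kernel keep its working set in SRAM.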

Self-attention is the operation that lets every token in a sequence influence every other token. The cost is an N×N matrix of pairwise…