The Flow of Attention

A new geometric framework explains how transformer attention reconfigures token embeddings layer by layer, driving semantically related tokens into clusters. The process reduces to two operators acting in a single vector space, revealing the structure behind contextualized representations.

Picture an input prompt to a large language model as a cloud of points in a high-dimensional vector space E, one point for each token. As the model processes the prompt layer by layer, the cloud reconfigures itself (points shift position, some drift together, others pull apart) so that each token’s location better reflects how it relates to the others in context. By the final layer, the cloud has settled into a configuration that encodes the contextualized meaning of the prompt. This final configuration is not a random scatter: semantically related tokens have drifted into clusters, a structure we will see emerge as the natural endpoint of the dynamics described in this article. The next token is then read off from a single point in this configuration: the last token’s final vector determines a set of alignments with the rows of the unembedding matrix, and softmax over these alignments produces a probability distribution over the vocabulary, from which the next token is sampled. This article is about that reconfiguration: what drives it, what shape it takes, and what it reveals about transformer attention.

Let $X = \{x_1, x_2, \dots, x_N\}$ be a sequence of N token embeddings in the d-dimensional space E, and let $q = x_i$ be the embedding of the i-th token, which we call the query. Attention seeks a vector $\Delta(q, X)$ in E such that $q' = q + \Delta(q, X)$ is a contextualized version of q, that is, the i-th token’s representation updated by content drawn from X. The vector $\Delta(q, X)$ is called the contextual update of q and is computed by an attention head. By letting i run from 1 to N in $q = x_i$, the attention head generates a contextual update for every token of X. Arranging these updates as the columns of a d × N matrix gives the sequence $\Delta(X)$, which transforms the original cloud X into a new cloud $X'$ by

$$X' = X + \Delta(X)$$

where each point of X is displaced by its contextual update. This is one layer’s worth of reconfiguration.
The two previous articles in this series developed the geometry underlying how a single attention head computes $\Delta(q, X)$, step by step and at a more leisurely pace: Article 1 (https://medium.com/towards-artificial-intelligence/from-gauss-to-transformers-a-surprising-link-between-weighted-least-squares-and-self-attention-8a5e5d3feca3) recast standard attention as a problem of estimation, and Article 2 (https://medium.com/towards-artificial-intelligence/the-geometry-of-attention-one-space-two-operators-a17f757c72aa) reduced its machinery to two operators acting in a single space. What follows is the compressed version. Note that both articles focused on the cross-token interaction that defines attention and set aside other layer components such as layer normalization and feed-forward networks, which do not move information between tokens. Positional encoding does shape the underlying geometry of attention, but how relevance scores are determined from that geometry is unchanged; a short note at the end of this article explains why.

Article 1 showed that $\Delta(q, X)$ is computed in two steps. First, attention weights are inferred from relevance scores given by scaled dot products of learned query (Q) and key (K) projections into low-dimensional subspaces of E. The inference is a softmax normalization of these scores: a computationally efficient, closed-form solution to a free-energy minimization problem that fits the relevance signal while preventing the weights from collapsing onto a single token. Second, the contextual update $\Delta(q, X)$ is computed as an attention-weighted average of N learned value (V) projections, the closed-form solution to a classical weighted least squares problem. This two-step formulation, built around four projection matrices and three intermediate spaces, is the standard QKV model of attention.
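The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article series’ own code: the matrix names `W_q`, `W_k`, `W_v`, `W_o`, their shapes, and the random toy values are assumptions, and the fourth projection is taken here to be an output map that carries the averaged values back into E.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def qkv_update(X, W_q, W_k, W_v, W_o):
    """Single-head contextual updates for every token.

    X is (d, N) with one token embedding per column.
    W_q, W_k, W_v are (d_k, d); W_o is (d, d_k) (assumed shapes).
    """
    d_k = W_q.shape[0]
    Q, K, V = W_q @ X, W_k @ X, W_v @ X
    # Step 1: relevance scores and attention weights.
    # scores[i, j] is the scaled dot product of query i with key j.
    scores = (Q.T @ K) / np.sqrt(d_k)
    A = softmax(scores)  # row i holds the weights used by query i
    # Step 2: attention-weighted average of value projections,
    # mapped back into E by the (assumed) output projection.
    delta = W_o @ (V @ A.T)  # column i is the update for token i
    return delta, A

rng = np.random.default_rng(0)
d, d_k, N = 8, 4, 5
X = rng.normal(size=(d, N))
W_q, W_k, W_v = (rng.normal(size=(d_k, d)) for _ in range(3))
W_o = rng.normal(size=(d, d_k))

delta, A = qkv_update(X, W_q, W_k, W_v, W_o)
X_new = X + delta  # one layer's worth of reconfiguration
```

The sketch applies no causal mask, so every token attends to every other; adding a mask would only change which entries of `scores` take part in the softmax.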
Article 2 showed that the QKV machinery (multiple projections, dot products, softmax normalizations, weighted sums) reduces to two operators acting in the single space E:

• A bilinear form M that assigns a relevance score to an ordered pair of embeddings from E.
• A linear operator F that extracts content from a token embedding.

For a fixed query q, the function $t \mapsto M(q, t)$ is a linear functional on E (a linear map from E to the real numbers) that scores the relevance of any token t to q, that is, the degree to which q should attend to t. Geometrically, it stratifies E into parallel hyperplanes of constant relevance, all normal to the relevance gradient $r_q = M^\top q$, where M also denotes the matrix of the bilinear form such that $M(x_1, x_2) = x_1^\top M x_2$. Mathematicians call this stack of hyperplanes a foliation (https://en.wikipedia.org/wiki/Foliation); the individual hyperplanes are its leaves. Every token x sits on exactly one leaf, with score $M(q, x) = \langle r_q, x \rangle$, the component of x along $r_q$ scaled by the length of $r_q$. Tokens on the same leaf score identically, and since $r_q$ points toward increasing relevance, the leaves are ordered by score along $r_q$. Change q and $r_q$ changes, reorienting the foliation and rescoring every token in X. Relevance, then, is determined by leaf membership.

Solving the free-energy minimization problem on these scores yields the attention weights via softmax, and the contextual update $\Delta(q, X)$ is the attention-weighted average of the content vectors $F(x_j)$. This update is a vector in the range of F (the query-agnostic content subspace shared by all queries) and is identical to the update produced by the QKV model. This single-space formulation is the EMF model of attention, and it is what we use in the rest of this article. Its advantage for what follows is geometric: because everything happens in E, the contextual updates $\Delta(q, X)$ are vectors in the same space as the tokens themselves.
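The equivalence between the two formulations can be checked numerically. The sketch below assumes one particular identification of the operators with the QKV matrices, namely `M = W_q.T @ W_k / sqrt(d_k)` and `F = W_o @ W_v`; that identification, the variable names, and the toy dimensions are illustrative assumptions rather than something stated in this article.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d, d_k, N = 8, 4, 5
X = rng.normal(size=(d, N))          # columns are token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_k, d)) for _ in range(3))
W_o = rng.normal(size=(d, d_k))

# Assumed identification of the two operators with the QKV matrices.
M = W_q.T @ W_k / np.sqrt(d_k)       # bilinear form: M(x1, x2) = x1^T M x2
F = W_o @ W_v                        # content operator, rank at most d_k

# EMF: score every ordered pair in E, then average content vectors F x_j.
scores_emf = X.T @ M @ X             # scores_emf[i, j] = M(x_i, x_j)
A_emf = softmax(scores_emf)
delta_emf = (F @ X) @ A_emf.T        # column i = sum_j A[i, j] * F x_j

# QKV: the same computation routed through the intermediate spaces.
Q, K, V = W_q @ X, W_k @ X, W_v @ X
A_qkv = softmax(Q.T @ K / np.sqrt(d_k))
delta_qkv = W_o @ (V @ A_qkv.T)

# Relevance gradient for one query: scores are inner products with r_q.
i = 2
r_q = M.T @ X[:, i]
scores_from_gradient = r_q @ X       # entry j is <r_q, x_j>
```

Under these assumptions `delta_emf` and `delta_qkv` agree up to rounding, and `scores_from_gradient` reproduces row `i` of `scores_emf`, which is the leaf-membership reading of relevance: a token’s score depends only on its inner product with `r_q`.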
Because the updates live in E, the reconfiguration $X \to X + \Delta(X)$ represents a motion of the cloud through E, governed by two operators per layer: M shaping where each query looks, F shaping what it finds. That motion, layer after layer, is what the next sections describe.

A transformer isn’t just a single attention head; it is a stack of L layers, each containing multiple heads, with each layer feeding the next through a structure called the residual stream. To keep the geometry in focus, we treat each layer for now as a single attention head characterized by a learned pair of operators (M, F), and return to the multi-head structure shortly. Understanding what this stack of L layers does as it sequentially processes a prompt requires extending the EMF model from one layer to the full depth of the model.

The previous section showed how a single layer takes a cloud X and produces a contextualized version $X' = X + \Delta(X)$. A transformer with L layers repeats this operation, each layer reading the previous layer’s output and producing a new one. We call $X_\ell$ the configuration of the prompt at layer ℓ, meaning the arrangement of all N token embeddings in E after ℓ layers of processing. $X_0$ is the initial configuration: the uncontextualized embeddings of the prompt. The update rule is

$$X_{\ell+1} = X_\ell + \Delta(X_\ell), \qquad \ell = 0, 1, \dots, L-1$$

where each layer ℓ has its own operators (M, F) that determine its contextual update. The output of the final layer, $X_L$, is the fully contextualized configuration from which the next-token prediction is read.

The fact that each layer carries its own (M, F) hints at a division of labor in what the layers write. The updates are not interchangeable.
Empirical probing of trained transformers suggests that early layers tend to write short-range structure (local syntax, the relations between nearby tokens) while later layers, reading a configuration that earlier layers have already partially organized, contribute longer-range structure: coreference, discourse role, relations that span the whole prompt. Depth, in other words, is not repetition; each layer refines the configuration at a progressively wider contextual range.

Now focus on a single token, say the i-th token of the prompt. At layer 0 it sits at some initial position in E. At layer 1 it has been displaced by a contextual update that depends on the entire configuration $X_0$. At layer 2 it is displaced again, this time by an update that depends on the already-reconfigured $X_1$. Layer after layer, the token traces a trajectory through E, each step informed by the current positions of all the other tokens. This trajectory is the token’s residual stream, and its per-layer update rule is

$$x_{i,\ell+1} = x_{i,\ell} + \Delta(x_{i,\ell}, X_\ell), \qquad \ell = 0, 1, \dots, L-1$$

The word “residual” points to a specific structural fact: the update at each layer is added to the token’s current position rather than replacing it. Unrolling the layer-by-layer updates, the i-th token’s position after L layers is simply its initial embedding plus the sum of every contextual update it has ever received:

$$x_{i,L} = x_{i,0} + \Delta(x_{i,0}, X_0) + \Delta(x_{i,1}, X_1) + \dots + \Delta(x_{i,L-1}, X_{L-1})$$

The initial embedding $x_{i,0}$ is never overwritten. It persists as a baseline, and each layer’s contribution is a correction layered on top of every previous correction. This additive structure is not an incidental implementation choice; it is what makes the geometric picture of the introduction possible.
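The unrolled sum can be checked on a toy stack of layers. The sketch below uses one EMF head per layer with small random operators; the helper names `emf_update` and `run_stream`, the scaling factor 0.1, and the absence of causal masking, layer normalization, and feed-forward blocks are all simplifying assumptions.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def emf_update(X, M, F):
    """Contextual updates for all tokens: column i = sum_j A[i, j] * F x_j."""
    A = softmax(X.T @ M @ X)
    return (F @ X) @ A.T

def run_stream(X0, layers):
    """Apply X_{l+1} = X_l + Delta(X_l); return all configurations and updates."""
    configs, updates = [X0], []
    for M, F in layers:
        delta = emf_update(configs[-1], M, F)
        updates.append(delta)
        configs.append(configs[-1] + delta)
    return configs, updates

rng = np.random.default_rng(2)
d, N, L = 6, 4, 3
X0 = rng.normal(size=(d, N))
layers = [(0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d)))
          for _ in range(L)]

configs, updates = run_stream(X0, layers)
unrolled = X0 + sum(updates)   # initial embedding plus every update received

# Coupling: move token 0 only, and the last token's final position changes too.
X0_moved = X0.copy()
X0_moved[:, 0] += 1.0
configs_moved, _ = run_stream(X0_moved, layers)
last_token_shift = np.linalg.norm(configs_moved[-1][:, -1] - configs[-1][:, -1])
```

Here `unrolled` matches `configs[-1]` up to rounding, which is the additive structure in action. The nonzero `last_token_shift` shows that a token’s trajectory depends on where the other tokens start, not only on its own embedding.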
Because updates are added rather than composed through nonlinearities, the token’s final position is a sum of displacements in E, and the trajectory connecting them is a path through E that we can visualize, measure, and eventually interpret as a flow.

Why addition? The strongest justification is empirical. Directions in embedding space have been observed to encode semantic relationships (gender, tense, syntactic role), so that vector arithmetic in E corresponds, at least approximately, to compositional changes in meaning. To that extent the vector space structure of E aligns with semantic structure: addition in E is addition in meaning. When a layer writes a contextual update to the residual stream, it is nudging the token’s representation along a semantically meaningful direction, and additive accumulation preserves this property across layers. Each layer’s correction shifts the token further along directions that refine its contextual meaning. This is only possible because the EMF model keeps all computation in one space: if corrections had to be translated from some external representation, the semantic geometry of E could not be exploited directly.

Each of the N tokens traces its own residual stream through E. Taken together, the N streams define the evolution of the configuration: $X_0, X_1, \dots, X_L$ is a sequence of configurations, each derived from the last by a collective reconfiguration step in which every token moves simultaneously, each one’s displacement depending on the current positions of all the others. This is the layer-wise view of the residual stream, and it is the view a later section will formalize as a flow. Note the coupling: the trajectory of any single token cannot be understood in isolation, because its update at each layer depends on the positions of all N tokens at that layer. The residual stream is not N independent paths through E; it is a coupled dynamical system in which the configuration moves as a whole.

In practice, each layer does not have a single (M, F) pair.
It has H of them: H attention heads, each with its own operators $M_h$ and $F_h$ that score relevance and extract content differently. The layer’s contextual update is the sum of the H individual head updates:

$$\Delta(X_\ell) = \Delta_1(X_\ell) + \Delta_2(X_\ell) + \dots + \Delta_H(X_\ell)$$

where $\Delta_h(X_\ell)$ is the contextual update computed by the h-th head. This does not change the residual stream’s structure; it is still additive. Each layer still contributes a single combined displacement to every token. That displacement just happens to be assembled from H components, each one a rank-limited update confined to the range of its head’s content operator $F_h$. The flow picture developed below treats each layer’s combined update as one step, which is all it needs.

Attention is not the only operation that contributes to the residual stream. Each layer also applies layer normalization and a feed-forward network (FFN). The FFN’s output is likewise added to the stream, while layer normalization acts on the copy of the stream that each component reads. The key distinction is that these components act on each token independently: layer normalization rescales a token’s embedding based on its own statistics, and the FFN transforms a token’s embedding through a learned nonlinearity. Neither one consults the rest of the configuration. They modify each token’s trajectory without moving information between tokens.

Layer normalization does deserve one geometric remark. By recentering each token’s embedding and rescaling it to a fixed length, it effectively places the representation each layer reads on a sphere in E. The cloud that attention actually sees is therefore not scattered through all of E but arranged on a spherical stage, and the dynamics of reconfiguration play out on that stage. Idealized mathematical treatments of attention take this constraint literally, modeling the tokens as particles evolving on the unit sphere; we will meet this idealization again when the flow picture is made precise.
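The spherical stage is easy to see numerically. The following sketch implements a stripped-down layer normalization, with no learned gain or bias and with the stabilizing constant `eps` set to zero; both simplifications are assumptions made to expose the geometry.

```python
import numpy as np

def layer_norm(x, eps=0.0):
    """Layer normalization without learned gain or bias (a simplification)."""
    centered = x - x.mean()
    return centered / np.sqrt(np.mean(centered ** 2) + eps)

rng = np.random.default_rng(3)
d = 16
# Three embeddings with very different scales and offsets.
tokens = [rng.normal(size=d), 50.0 * rng.normal(size=d), rng.normal(size=d) + 7.0]
normalized = [layer_norm(x) for x in tokens]

norms = [np.linalg.norm(z) for z in normalized]   # each equals sqrt(d) when eps = 0
means = [z.mean() for z in normalized]            # each is zero up to rounding
```

Whatever their original scale, the normalized vectors all land on the sphere of radius sqrt(d) inside the hyperplane of zero-mean vectors. Practical implementations typically use a small positive `eps` and a learned gain and bias, which perturb this picture without removing it.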
In a real transformer the residual stream itself accumulates updates in ambient E unnormalized (each layer reads a normalized copy of the stream), so the sphere describes where each layer’s computation happens, not where the stream lives.

For the purposes of this article, what matters is that attention carries the entire burden of cross-token interaction. It is the only layer component that makes one token’s update depend on the positions of the others. The feed-forward network and layer normalization shape each trajectory individually but do not couple them. The configuration’s collective dynamics are governed by attention alone.

The residual stream gives us a concrete object to study: a sequence of configurations $X_0, X_1, \dots, X_L$ in E, each derived from the last by an additive update driven by cross-token attention. The next section asks: what kind of dynamical system is this, and what, if anything, does it optimize?

The previous section described the residual stream as a coupled dynamical system in which the configuration moves as a whole. That description is geometrically accurate but mathematically informal. To say something more precise about what kind of system this is, we need a shift in viewpoint. So far we have tracked the cloud by following its individual points: N tokens, each tracing its own trajectory through E. An equivalent picture, more useful for what comes next, is to view the cloud as a distribution on E. Place a point mass of weight 1/N at each token’s current location; the resulting probability measure (https://en.wikipedia.org/wiki/Probability_measure) on E carries exactly the information that defines the configuration, up to the order in which the tokens are listed. The dynamics are unchanged. Only the bookkeeping is different. What was a list of N moving points becomes a single object. This shift brings into focus a question that was awkward to ask before: what kind of motion is the cloud undergoing?
Consider the contrast with ordinary gradient descent (https://en.wikipedia.org/wiki/Gradient_descent). Gradient descent moves a single point downhill on an energy landscape. At each step, you compute the gradient of the energy at the point’s current location and take a step opposite to it. The point descends; the landscape is fixed. This is the workhorse of optimization, and its geometry is the geometry of E itself: straight-line steps in a flat space.

What the residual stream does is not this. The cloud is not a single point, and its motion is not a sequence of straight-line steps in E. It is the simultaneous motion of every point in a distribution, with each point’s displacement depending on where all the others currently are. To describe this kind of motion, we need a geometry not on E but on the space of distributions over E: a notion of distance between configurations that measures how much the cloud as a whole would have to rearrange itself to turn one configuration into another.

That notion exists, and it has a name: the Wasserstein distance (https://en.wikipedia.org/wiki/Wasserstein_metric). Informally, the Wasserstein distance between two distributions is the minimum total cost of moving mass from the first to the second, where cost is measured by how far each unit of mass has to travel. It is the natural way to measure how different two clouds are when the clouds themselves are made of movable mass, and the token cloud is exactly that: each token is a point mass, and attention is what moves it. With this distance in hand, the space of distributions on E acquires its own geometry: a curved geometry, very different from the flat Euclidean geometry of E itself. Mathematicians call this space of distributions Wasserstein space, and it is the arena in which the cloud’s motion can be described as a path.
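For two clouds of the same size with uniform weights 1/N, an optimal transport plan can always be chosen as a one-to-one matching of points, so the 2-Wasserstein distance can be computed for tiny clouds by brute force. The sketch below does exactly that; the function name `wasserstein2` and the choice of squared Euclidean cost are assumptions of this illustration, and the N! search is only practical for a handful of points.

```python
import itertools
import numpy as np

def wasserstein2(X, Y):
    """2-Wasserstein distance between two uniform clouds of equal size.

    X and Y are (d, N), one point per column, each carrying mass 1/N.
    Brute force over all N! matchings, so only suitable for tiny N.
    """
    N = X.shape[1]
    diffs = X.T[:, None, :] - Y.T[None, :, :]        # (N, N, d)
    cost = (diffs ** 2).sum(axis=-1)                 # cost[i, j] = |x_i - y_j|^2
    best = min(
        sum(cost[i, perm[i]] for i in range(N))
        for perm in itertools.permutations(range(N))
    )
    return np.sqrt(best / N)

rng = np.random.default_rng(4)
d, N = 3, 5
X = rng.normal(size=(d, N))
shift = np.array([0.3, -0.4, 1.2])
shuffled = X[:, rng.permutation(N)]   # same cloud, tokens listed in another order
translated = X + shift[:, None]       # every point moved by the same vector

dist_shuffled = wasserstein2(X, shuffled)
dist_translated = wasserstein2(X, translated)
```

Two sanity checks follow from the definition: relabeling the points costs nothing, so `dist_shuffled` is zero, and translating the whole cloud by a vector costs exactly that vector’s length, so `dist_translated` equals `np.linalg.norm(shift)` up to rounding.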
A Wasserstein gradient flow (WGF) is then exactly what the name suggests: gradient descent, but on a distribution rather than a point, with the geometry given by the Wasserstein distance. At each instant, the flow chooses the direction in distribution-space that decreases an energy functional fastest under the Wasserstein cost of moving mass. The cloud descends an energy landscape, with the cost of motion built into the geometry. This is the right mathematical object for “gradient descent on a cloud.”

How much of this picture applies to the transformer? The unconditional part is the arena. The residual stream moves a distribution through E layer by layer, each token’s displacement coupled to all the others through attention. The evolution of the cloud is therefore a flow in Wasserstein space, a path traced one layer at a time. The question is whether this flow is a gradient flow: whether there exists an energy functional whose gradient the motion follows.

In an idealized model, the answer is yes, and the result is a theorem. Geshkovski, Letrouit, Polyanskiy, and Rigollet (NeurIPS 2023, https://arxiv.org/abs/2305.05465) showed that in a continuous-time idealization of attention (fixed operators shared across all layers, tokens constrained to the unit sphere as layer normalization suggests, and a relevance form M that is symmetric) the token distribution evolves as a WGF of an explicit interaction energy. In that setting the flow is not merely flow-like; the energy exists, the gradient is computable, and the motion provably follows it. Strictly, the cleanest form of the theorem applies to attention with the softmax normalization removed. Keeping the softmax does not destroy the gradient structure so much as relocate it: the normalizing factor reappears inside the transport metric as a mobility, rescaling how readily different regions of the cloud move.

The actual transformer is several steps away from this idealization, and these steps matter.
First, transformer flow is discrete (it takes one finite step per layer) while the idealized motion is continuous. Second, the geometric landscape changes underfoot: each layer of the transformer carries its own pair of operators (M, F), so even if every step descended some energy landscape, it would be a different landscape at every step, and nothing is steadily minimized from the first layer to the last. Third, gradient forces are mutual: when an energy couples two particles, each pulls on the other, action and reaction equally in both directions. Attention is not mutual. A token can attend to another that ignores it. Causal masking guarantees this, but even in the absence of masking, a learned relevance form M is in general not symmetric. Such one-way influence cannot come from the gradient of an interaction energy. No energy functional governing the token dynamics of a trained transformer has been identified, and these three obstructions suggest the flow of attention is not a WGF.

What survives for the transformer is not the theorem of Geshkovski et al. itself, but its structure. Every ingredient of a WGF has a counterpart in the residual stream, and the next section names the counterparts one by one. The honest statement, and the one this article defends, is that the transformer is a transport flow in Wasserstein space bearing the fingerprints of a gradient flow: literally in the idealized model of Geshkovski et al., structurally in the actual transformer.

The claim that the residual stream is a flow in Wasserstein space is not a metaphor. The stronger claim, that it is a WGF, is the one the previous section ruled out for the transformer. What survives is a structural correspondence. A WGF needs four things: a distribution to evolve, an energy functional to minimize, an entropy term to prevent collapse, and an update rule that moves the distribution by descending an energy landscape under the geometry of mass transport.
The residual update of a transformer layer has a structural counterpart for each, summarized in the table below and described one by one in what follows.

The distribution. The cloud at layer ℓ is the empirical distribution of the N token embeddings in E: a point mass of weight 1/N at each token’s position. The weights stay uniform at 1/N throughout; layers move the mass and never reweight it. Each layer takes this distribution as input and produces another. The residual update $X_{\ell+1} = X_\ell + \Delta(X_\ell)$ is, in distributional language, a map from the cloud at layer ℓ to the cloud at layer ℓ+1. This is the object the flow acts on.

The energy. Article 1 showed that each query’s attention weights are the closed-form solution to a free-energy minimization problem: a tradeoff between fitting the relevance scores $M(q, x_j)$ and spreading the weights to prevent collapse onto a single token. The relevance scores entering this free energy are read off the foliations of Article 2: each token slices E into leaves of constant relevance, and every other token’s score is determined by the leaf it occupies. Because the relevance gradient moves with the query, the cloud’s motion reorients all N foliations at every layer. The configuration carries its own measuring instruments, and moving the particles moves the geometry that scores them. Summing the per-query free energies across all queries gives a functional on the cloud as a whole, depending on the layer’s bilinear form M and the current configuration. This is the counterpart of the WGF’s energy functional, but with a crucial difference of scope. The transformer’s free energy is minimized over each token’s attention weights, with token positions held fixed; a WGF’s energy functional, by contrast, is descended by the token positions themselves. So free energy shapes how each step is taken (it selects the attention weights at every layer), but this isn’t the same thing as having an energy functional that the cloud of tokens descends.
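The per-query variational problem can be made concrete. The sketch below assumes the free energy takes the standard entropy-regularized form at temperature 1, that is, negative expected relevance minus the entropy of the weights; that specific form, the function names, and the random scores are assumptions of this illustration.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

def free_energy(w, s):
    """Negative expected relevance plus negative entropy (temperature 1, assumed)."""
    return float(-(w @ s) + np.sum(w * np.log(w)))

rng = np.random.default_rng(5)
N = 6
s = rng.normal(size=N)            # relevance scores M(q, x_j) for one query
w_star = softmax(s)
f_min = free_energy(w_star, s)
log_partition = float(np.log(np.sum(np.exp(s))))

# Compare against other strictly positive weight vectors on the simplex.
candidates = [0.9 * rng.dirichlet(np.ones(N)) + 0.1 / N for _ in range(200)]
gaps = [free_energy(w, s) - f_min for w in candidates]
kls = [float(np.sum(w * np.log(w / w_star))) for w in candidates]
```

Under this assumed form, the softmax weights attain the value `-log_partition`, and every other candidate pays an excess equal to its Kullback-Leibler divergence from the softmax weights, which is why `gaps` and `kls` coincide. Note that the minimization runs over the weights only; the scores `s`, and hence the token positions, stay fixed throughout.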
Free energy in a transformer steers each step, but no energy functional has been shown to drive this motion across layers in a WGF sense.

The entropy. Inside the per-query free energy, the entropy term is the same one Article 1 identified: it forces each query’s attention weights to spread across the token cloud rather than collapsing onto a single token. Here the correspondence needs the most care, because the two entropies are entropies of different objects. In a WGF, the entropy term is the entropy of the cloud itself: a functional of the particle positions whose gradient pushes particles apart, keeping the distribution from collapsing to a point. The transformer’s entropy is the entropy of each query’s attention weights: a distribution over which tokens to draw content from, not over where the tokens sit. It spreads influence, not positions: every displacement draws on content from across the whole cloud, but nothing pushes the particles apart, and in the idealized model they do drift together, clustering as depth increases. Softmax, the closed-form solution to the entropy-regularized problem, enforces the constraint at every head of every layer. The WGF counterpart holds, but with its scope shifted: a WGF’s entropy keeps the cloud from collapsing onto a single point, while the transformer’s keeps each query’s attention weights from collapsing onto a single token.

The update rule. The residual step $x_i' = x_i + \sum_j \alpha_{ij} F(x_j)$ moves each token by the weighted average of the content vectors it has attended to. In measure-theoretic language, the displacement is the mean of the attention measure pushed forward through F. This is the counterpart of the WGF’s continuous-time velocity: a configuration-dependent displacement field that moves the mass through E, though, as the previous section established, not one derived from any gradient.
The fact that updates are added rather than composed through a nonlinearity is what makes the flow picture available — additive updates are paths in E; composed nonlinear updates would not be. The four ingredients line up — as structure, not as theorem. The residual update of a transformer layer is a transport step in Wasserstein space assembled from gradient-flow anatomy: M shaping the relevance geometry that defines each step’s variational problem, F shaping the displacement, and softmax enforcing the entropy constraint that keeps the weights from collapsing. What does this flow actually do, configuration by configuration, as the layers stack up? The continuous-time analysis of Geshkovski et al. provides a partial answer: under the simplified dynamics they study, the cloud does not diffuse and does not stay uniform — it clusters. Tokens organize into groups, and the groups themselves can drift toward common attractors as depth increases. The flow is structured organization, not random walk. In trained transformers, empirical analyses of the residual stream https://arxiv.org/abs/2302.00294 show qualitatively similar behavior: representations that begin as raw embeddings are reorganized by depth into geometries where related tokens sit closer together and unrelated ones separate. The flow picture explains why this is the natural outcome rather than an empirical surprise. An interacting-particle flow whose idealization provably follows an energy gradient, with attractors determined by the learned geometry of M and F, is exactly the kind of dynamics that produces clustering. The transformer’s depth, viewed through this lens, is not a stack of repeated computations. It is the integration of a flow. After L layers, the flow has run its course. Each token now sits at a position in E that reflects not just its own initial embedding but the cumulative pull of every other token, mediated through L rounds of attention. 
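The clustering behavior of the idealized dynamics can be explored with a small simulation. The sketch below is a crude explicit discretization, not the scheme analyzed by Geshkovski et al.: it assumes M is a multiple of the identity (so it is symmetric), F is the identity, the same operators are reused at every step, and tokens are re-projected onto the unit sphere after each step. The step size, inverse temperature `beta`, and number of steps are arbitrary choices.

```python
import numpy as np

def normalize(X):
    """Project each column back onto the unit sphere."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)

def mean_pairwise_cosine(X):
    """Average cosine similarity over distinct pairs of unit-norm columns."""
    N = X.shape[1]
    G = X.T @ X                       # Gram matrix of unit vectors
    return (G.sum() - np.trace(G)) / (N * (N - 1))

def step(X, beta, dt):
    """One explicit step of idealized attention with M = beta * I and F = I."""
    scores = beta * (X.T @ X)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)          # softmax over each row
    return normalize(X + dt * (X @ A.T))       # move toward attended mean, re-project

rng = np.random.default_rng(6)
d, N, beta, dt = 3, 16, 1.0, 0.1
X = normalize(rng.normal(size=(d, N)))
cos_before = mean_pairwise_cosine(X)
for _ in range(500):
    X = step(X, beta, dt)
cos_after = mean_pairwise_cosine(X)
```

Comparing `cos_before` with `cos_after` gives a rough measure of how much the cloud has drawn together. A trained transformer differs from this toy in every way the text lists: finite depth, different operators at each layer, and an asymmetric M.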
The geometry that began as N raw embeddings — a configuration shaped by the training data’s vocabulary but innocent of the prompt’s semantics — has been reorganized into something specific: a final form that encodes how this particular sequence of tokens stands in relation to itself. To predict the next token, the transformer does not consult the whole cloud. It consults a single point in it. The chosen point is the final-layer position of the last token in the prompt. By construction, this token has had L layers’ worth of attention to draw on over the entire sequence; its residual stream carries the most complete contextual record available anywhere in the configuration. Its position in E is the article’s answer to the question the introduction posed: the cloud has reorganized itself, and this point is the one the prediction is read off from. The reading itself is a single linear operation followed by a probability assignment. The unembedding matrix has one row per vocabulary entry, each row a vector in E learned during training. Multiplying the last token’s final vector by this matrix produces a list of alignments — one number for each vocabulary token that measures how well it aligns with the contextualized direction that the last token’s residual stream has settled on. Softmax converts these alignments into a probability distribution over the vocabulary, and the next token is sampled from that distribution. The same softmax that drove the flow inside the network, preventing attention weights from collapsing onto a single token at every head of every layer, appears one final time at the readout, now preventing the prediction from collapsing onto a single vocabulary entry. The flow that organized the cloud and the readout that interprets its final state are governed by the same closed-form solution to the same kind of entropy-regularized variational problem. The mechanism is consistent from input to output. 
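The readout described above amounts to a matrix-vector product followed by a softmax. The sketch below uses a random toy unembedding matrix `W_U` and a random final vector `x_last`; both names and all values are hypothetical, and any final normalization a real model might apply before unembedding is ignored.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(7)
d, vocab_size = 8, 20
W_U = rng.normal(size=(vocab_size, d))   # one row per vocabulary entry (toy values)
x_last = rng.normal(size=d)              # final-layer vector of the last token

alignments = W_U @ x_last                # one inner product per vocabulary entry
probs = softmax(alignments)              # distribution over the vocabulary
most_likely = int(np.argmax(probs))      # entry with the largest alignment
sampled = int(rng.choice(vocab_size, p=probs))
```

Because softmax is monotone, the most likely entry is always the row of `W_U` with the largest inner product with `x_last`. Alignment here means inner product rather than cosine, so a row’s length matters as well as its direction.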
This is what a transformer is, viewed through the lens of the EMF formulation of attention. A cloud of token embeddings enters the network. Each layer reshapes the cloud with a transport step whose geometry is set by that layer’s learned operators: M steering where each token looks, F shaping what it receives. The cloud’s motion is a transport flow in Wasserstein space bearing the fingerprints of a gradient flow, steered at every step by an entropy term that keeps each token’s attention weights spread rather than collapsed. After L layers of flow, the last token’s position determines a direction in E. The unembedding most aligned with that direction is the model’s most likely prediction of the next token. The cloud reconfigures, the flow runs, the prediction is read.

This note addresses how positional encoding, specifically the rotary positional encoding (RoPE) used in many modern models, affects the geometric account of attention developed in this article. The simplest case (no positional encoding, or NoPE) gives a single foliation of E generated by $M(q, \cdot)$, where each token is scored for relevance to q by leaf membership. RoPE enriches this picture: position-dependent rotations applied to the query and key projections at every layer turn a single foliation under NoPE into a family of foliations, one per token, with leaves tilting smoothly as distance from q varies. Nearby tokens see nearly aligned geometries; distant tokens see geometries rotated further apart. The geometry becomes more intricate, but the way it is used for relevance scoring is unchanged: a token’s score is still based on the leaf it occupies. The flow picture and the WGF correspondence carry over without modification.

About the author: Gordon is a mathematician, bioinformatician, AI researcher, and singer-songwriter who explores hidden mathematical patterns at the interface of machine learning, modern AI, biology, and music.
Acknowledgment: This article was developed in collaboration with AI, disclosed here in full in the belief that transparency about this way of working serves both readers and the practice itself. Claude (Anthropic) served as the primary editorial partner, contributing structural revisions, mathematical vetting, and expository refinement throughout the drafting process. ChatGPT (OpenAI) and Gemini (Google) were consulted periodically for alternative perspectives on framing and direction. The mathematics, research direction, and final text are the author’s responsibility.

The Flow of Attention (https://pub.towardsai.net/the-flow-of-attention-1795b1d6aaf9) was originally published in Towards AI (https://pub.towardsai.net) on Medium.