{"slug": "how-self-attention-works-qkv-softmax-and-matrix-computation", "title": "How Self-Attention Works — QKV, Softmax, and Matrix Computation", "summary": "A developer explains that self-attention is a matrix operation, not just a mechanism for looking at important words. The post breaks down the QKV projection, scaling, softmax, and weighted sum that enable Transformers to scale effectively by turning isolated token vectors into relationship-aware representations.", "body_md": "Self-Attention is not just “looking at important words.”\n\nIt is a matrix operation.\n\nAnd that is exactly why Transformers scale.\n\nSelf-Attention lets each token compare itself with every other token in the same sequence.\n\nEach token asks:\n\nWhich other tokens are useful for updating my representation?\n\nThis matters because meaning is contextual.\n\nA token should not stay as a static embedding.\n\nIt should become a representation shaped by the sentence around it.\n\nSelf-Attention follows this pipeline:\n\nInput Embeddings\n\n→ Query, Key, Value Projection\n\n→ Similarity Scores\n\n→ Scaling\n\n→ Softmax Weights\n\n→ Weighted Sum of Values\n\n→ Contextual Token Output\n\nMore compactly:\n\nSelf-Attention = matching + weighting + information mixing\n\nThe full formula is:\n\nAttention(Q, K, V) = softmax(QKᵀ / √dₖ) V\n\nThis equation looks dense.\n\nBut the idea is simple:\n\nCompare tokens.\n\nConvert scores into weights.\n\nUse weights to mix information.\n\nAt a high level, Self-Attention works like this:\n\n```\nimport numpy as np\n\n# Toy sizes and random weights are illustrative assumptions.\nn, d_model, d_k = 3, 8, 4\nrng = np.random.default_rng(0)\nX = rng.normal(size=(n, d_model))  # token_embeddings, one row per token\nW_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))\n\nQ = X @ W_Q\nK = X @ W_K\nV = X @ W_V\n\nscores = Q @ K.T  # (n, n) pairwise scores\n\nscaled_scores = scores / np.sqrt(d_k)\n\n# Row-wise softmax; subtracting the row max keeps exp() stable.\nexp_scores = np.exp(scaled_scores - scaled_scores.max(axis=-1, keepdims=True))\nweights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)\n\noutput = weights @ V  # (n, d_k) contextual token outputs\n```\n\nThat is the core computation.\n\nIn real Transformer implementations, this is done for all tokens at once.\n\nNot token by token.\n\nThat is why the matrix form matters.\n\nTake this sentence:\n\nI love you\n\nWhen updating the token “love”, Self-Attention compares it 
with:\n\nI\n\nlove\n\nyou\n\nThe token “love” may strongly attend to “I” and “you”.\n\nSo its representation becomes more contextual.\n\nIt no longer means only the word “love.”\n\nIt becomes something closer to:\n\nlove as an action between I and you\n\nThat is why Self-Attention is powerful.\n\nIt turns isolated token vectors into relationship-aware vectors.\n\nEach token is projected into three roles:\n\nQuery, Key, and Value.\n\nQuery:\n\nWhat am I looking for?\n\nKey:\n\nWhat do I contain that others can match against?\n\nValue:\n\nWhat information do I pass forward if selected?\n\nSearch analogy:\n\nQuery = search request\n\nKey = searchable index\n\nValue = retrieved content\n\nThis separation is important.\n\nThe model can learn different spaces for matching and information transfer.\n\nGiven input embeddings X:\n\nQ = XW_Q\n\nK = XW_K\n\nV = XW_V\n\nW_Q, W_K, and W_V are learned matrices.\n\nThey are trained with the model.\n\nThis means QKV is not manually designed.\n\nThe model learns how to project tokens into attention roles.\n\nImplementation-wise, this is just matrix multiplication.\n\nConceptually, it creates three different views of the same token.\n\nThe model compares Query and Key vectors.\n\nFor one Query-Key pair:\n\nscore = q · k\n\nA larger dot product means a stronger match.\n\nExample:\n\nq₁ · k₁ = 112\n\nq₁ · k₂ = 96\n\nThe first key matches more strongly.\n\nBut these are still raw scores.\n\nThey are not probabilities yet.\n\nDot products tend to grow in magnitude as the vector dimension dₖ grows.\n\nLarge scores push Softmax toward a near one-hot output.\n\nIn that saturated regime, gradients become very small, which makes training harder.\n\nSo Self-Attention scales the scores:\n\nscore = (q · k) / √dₖ\n\nThen Softmax converts scores into weights.\n\nExample, assuming dₖ = 64 so that √dₖ = 8 and the raw scores 112 and 96 become 14 and 12:\n\nscores = [14, 12]\n\nsoftmax(scores) ≈ [0.88, 0.12]\n\nNow the model has attention weights.\n\nThese weights say how much each token should contribute.\n\nThis matters in practice.\n\nWithout scaling, attention can collapse too aggressively onto one 
token.\n\nThe final output is a weighted sum of Value vectors.\n\nz = Σᵢ αᵢvᵢ\n\nHere αᵢ is the Softmax weight for token i and vᵢ is its Value vector.\n\nExample:\n\nvalues = [10, 20]\n\nweights = [0.88, 0.12]\n\noutput = 0.88 × 10 + 0.12 × 20 = 11.2\n\nThe first value contributes more.\n\nThe second value contributes less.\n\nThat is the basic meaning of attention output.\n\nIt is not a simple average.\n\nIt is selective information mixing.\n\nSelf-Attention:\n\nQ, K, and V are all projected from the same sequence.\n\nCross-Attention:\n\nQ is projected from one sequence, while K and V are projected from another.\n\nIn short:\n\nSelf-Attention = inside the same sequence\n\nCross-Attention = between different sequences\n\nThis difference matters when reading Transformer code.\n\nIf Q, K, and V come from the same tensor, it is Self-Attention.\n\nIf Q comes from one tensor and K/V come from another, it is Cross-Attention.\n\nNaive view:\n\nEach token compares with every other token one by one.\n\nMatrix view:\n\nAll token relationships are computed at once.\n\nNaive logic:\n\n```\nfor token_i in tokens:\n    for token_j in tokens:\n        compute_similarity(token_i, token_j)\n```\n\nMatrix logic:\n\n```\nscores = Q @ K.T\n```\n\nThat single matrix multiplication computes all pairwise token scores.\n\nThis is why Transformers are GPU-friendly.\n\nThey replace sequential loops with dense linear algebra.\n\nThe attention matrix contains token-to-token relationships.\n\nIf the sequence length is n, the score matrix is n × n.\n\nEach row means:\n\nHow much one token attends to every token.\n\nEach column means:\n\nHow much that token is attended to by others.\n\nThis structure is powerful.\n\nBut it also creates a cost problem.\n\nThe cost of full Self-Attention grows roughly as O(n²) in the sequence length n.\n\nLonger context means more computation and memory.\n\nSo the same design that makes attention expressive also makes it expensive.\n\nThat is why efficient attention methods exist.\n\nSelf-Attention needs positional information.\n\nBy itself, attention compares token content.\n\nIt does not automatically know token order.\n\nSelf-Attention also gets expensive as sequence length grows.\n\nFor short and 
medium sequences, full attention is powerful.\n\nFor very long sequences, memory and compute become major constraints.\n\nAnother important point:\n\nAttention weights are not always perfect explanations.\n\nThey show how information is mixed.\n\nBut they should not always be treated as human-level reasoning traces.\n\nIn real models, QKV projection is often implemented as one combined linear layer.\n\nInstead of computing three separate matrix multiplications:\n\nQ = XW_Q\n\nK = XW_K\n\nV = XW_V\n\nImplementations often compute:\n\nQKV = XW_QKV\n\nHere W_QKV is W_Q, W_K, and W_V concatenated along the output dimension.\n\nThen split the result into Q, K, and V.\n\nThis is usually faster, because one larger matrix multiplication replaces three smaller ones, and the code is simpler.\n\nThe math stays the same.\n\nThe implementation is optimized.\n\nThat is the developer mindset:\n\nUnderstand the formula.\n\nThen recognize the optimized tensor layout in code.\n\nSelf-Attention is the core operation behind Transformers.\n\nIt works by projecting tokens into Q, K, and V.\n\nQ and K compute relevance.\n\nSoftmax turns relevance into weights.\n\nWeights mix V into contextual outputs.\n\nThe shortest version is:\n\nSelf-Attention = compare tokens → weight information → update representations\n\nIf you understand QKᵀ and weighted Values, you understand the heart of Transformer computation.\n\nWhen reading Transformer code, which part feels most confusing?\n\nQKV projection, Softmax attention weights, or the final matrix multiplication with V?\n\nOriginally published at zeromathai.com.\n\nOriginal article: [https://zeromathai.com/en/self-attention-qkv-matrix-en/](https://zeromathai.com/en/self-attention-qkv-matrix-en/)\n\nGitHub Resources\n\nAI diagrams, study notes, and visual guides:\n\n[https://github.com/zeromathai/zeromathai-ai](https://github.com/zeromathai/zeromathai-ai)", "url": "https://wpnews.pro/news/how-self-attention-works-qkv-softmax-and-matrix-computation", "canonical_source": "https://dev.to/zeromathai/how-self-attention-works-qkv-softmax-and-matrix-computation-514j", "published_at": "2026-06-18 14:19:01+00:00", 
"updated_at": "2026-06-18 14:51:57.547735+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "neural-networks", "natural-language-processing"], "entities": ["Transformer"], "alternates": {"html": "https://wpnews.pro/news/how-self-attention-works-qkv-softmax-and-matrix-computation", "markdown": "https://wpnews.pro/news/how-self-attention-works-qkv-softmax-and-matrix-computation.md", "text": "https://wpnews.pro/news/how-self-attention-works-qkv-softmax-and-matrix-computation.txt", "jsonld": "https://wpnews.pro/news/how-self-attention-works-qkv-softmax-and-matrix-computation.jsonld"}}