How Self-Attention Works — QKV, Softmax, and Matrix Computation

Self-Attention is not just “looking at important words.”

It is a matrix operation.

And that is exactly why Transformers scale.

Self-Attention lets each token compare itself with every other token in the same sequence.

Each token asks:

Which other tokens are useful for updating my representation?

This matters because meaning is contextual.

A token should not stay as a static embedding.

It should become a representation shaped by the sentence around it.

Self-Attention follows this pipeline:

Input Embeddings

→ Query, Key, Value Projection

→ Similarity Scores

→ Scaling

→ Softmax Weights

→ Weighted Sum of Values

→ Contextual Token Output

More compactly:

Self-Attention = matching + weighting + information mixing

The full formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

This equation looks dense.

But the idea is simple:

Compare tokens.

Convert scores into weights.

Use weights to mix information.

At a high level, Self-Attention works like this:

X = token_embeddings

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

scores = Q @ K.T

scaled_scores = scores / sqrt(d_k)

weights = softmax(scaled_scores)

output = weights @ V

That is the core computation.

In real Transformer implementations, this is done for all tokens at once.

Not token by token.

That is why the matrix form matters.
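The pseudocode above can be turned into a small runnable sketch. This version assumes NumPy, a single attention head, no masking, and random matrices standing in for learned weights. The sizes (3 tokens, 8-dimensional embeddings, 4-dimensional projections) are illustrative choices, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Project every token into its Query, Key, and Value role.
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = K.shape[-1]
    # All pairwise Query-Key scores in one matrix multiplication.
    scores = Q @ K.T
    # Scale, then normalize each row into attention weights.
    weights = softmax(scores / np.sqrt(d_k))
    # Each output row is a weighted sum of Value rows.
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                 # assumed sizes, e.g. 3 tokens for "I love you"
X = rng.normal(size=(n, d_model))         # stand-in for token embeddings
W_Q = rng.normal(size=(d_model, d_k))     # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape)    # (3, 4)
print(weights.shape)   # (3, 3)
```

Every token is processed in the same call: the output has one contextual row per input token, and the weight matrix has one row per token as well.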

Take this sentence:

I love you

When updating the token “love”, Self-Attention compares it with:

I

love

you

The token “love” may strongly attend to “I” and “you”.

So its representation becomes more contextual.

It no longer means only the word “love.”

It becomes something closer to:

love as an action between I and you

That is why Self-Attention is powerful.

It turns isolated token vectors into relationship-aware vectors.

Each token is projected into three roles:

Query, Key, and Value.

Query:

What am I looking for?

Key:

What do I contain that others can match against?

Value:

What information do I pass forward if selected?

Search analogy:

Query = search request

Key = searchable index

Value = retrieved content

This separation is important.

The model can learn different spaces for matching and information transfer.

Given input embeddings X:

Q = XW_Q

K = XW_K

V = XW_V

W_Q, W_K, and W_V are learned matrices.

They are trained with the model.

This means QKV is not manually designed.

The model learns how to project tokens into attention roles.

Implementation-wise, this is just matrix multiplication.

Conceptually, it creates three different views of the same token.
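As a rough illustration of the three views, the sketch below projects the same embeddings with three different matrices. It assumes NumPy, and the sizes and random weights are hypothetical stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4               # hypothetical sizes
X = rng.normal(size=(n, d_model))       # one row per token

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)        # (3, 4) (3, 4) (3, 4)

# Row i of Q depends only on token i: projection does not mix tokens yet.
print(np.allclose(Q[1], X[1] @ W_Q))    # True
```

The projection step is applied to each token independently. Tokens only start to interact when Queries are compared with Keys.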

The model compares Query and Key vectors.

For one Query-Key pair:

score = q · k

A larger dot product means stronger similarity.

Example:

q₁ · k₁ = 112

q₁ · k₂ = 96

The first key matches more strongly.

But these are still raw scores.

They are not probabilities yet.

Dot products can become large when vector dimensions grow.

Large scores push Softmax toward a near one-hot output.

In that regime gradients become very small, which can slow or destabilize training.

So Self-Attention scales the scores:

score = (q · k) / √dₖ
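A quick simulation can suggest why √dₖ is the divisor. The sketch below assumes NumPy and assumes the components of q and k are independent with zero mean and unit variance, which is an idealization and not a property guaranteed in trained models. Under that assumption the dot product has standard deviation √dₖ, so dividing by √dₖ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, samples=10_000):
    # Draw random query/key pairs with unit-variance components.
    q = rng.normal(size=(samples, d_k))
    k = rng.normal(size=(samples, d_k))
    scores = (q * k).sum(axis=1)          # one dot product per pair
    return scores.std(), (scores / np.sqrt(d_k)).std()

for d_k in (4, 64, 512):
    raw, scaled = score_std(d_k)
    # The raw spread grows roughly like sqrt(d_k); the scaled spread stays near 1.
    print(d_k, round(float(raw), 2), round(float(scaled), 2))
```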

Then Softmax converts scores into weights.

Example (the raw scores 112 and 96 from above, assuming dₖ = 64 so that √dₖ = 8):

scores = [14, 12]

softmax(scores) ≈ [0.88, 0.12]

Now the model has attention weights.

These weights say how much each token should contribute.

This matters in practice.

Without scaling, attention can collapse too aggressively onto one token.
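The effect of scaling can be checked with a few lines of plain Python. This sketch reuses the raw scores 112 and 96 from the earlier example and assumes dₖ = 64, which is the value that turns them into 14 and 12.

```python
import math

def softmax(scores):
    # Subtract the maximum first so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [112.0, 96.0]
d_k = 64                                   # assumed key dimension
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]   # [14.0, 12.0]

scaled_weights = softmax(scaled_scores)    # roughly [0.88, 0.12]
raw_weights = softmax(raw_scores)          # nearly all weight on the first token

print([round(w, 2) for w in scaled_weights])
print(raw_weights)
```

With scaling, the second token still contributes about 12 percent. Without it, the gap of 16 between the raw scores leaves the second token with a weight on the order of one in ten million.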

The final output is a weighted sum of Value vectors.

z = Σ αᵢvᵢ

Example:

values = [10, 20]

weights = [0.88, 0.12]

output = 0.88 × 10 + 0.12 × 20 = 11.2

The first value contributes more.

The second value contributes less.

That is the basic meaning of attention output.

It is not a simple average.

It is selective information mixing.
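The arithmetic above, written as a small Python check. The values are scalars here only to keep the example readable. In a real model each value is a vector, and the same weighted sum is applied to every component.

```python
values = [10.0, 20.0]
weights = [0.88, 0.12]

# Weighted sum: each value contributes in proportion to its attention weight.
output = sum(w * v for w, v in zip(weights, values))

# A simple average would ignore the weights entirely.
average = sum(values) / len(values)

print(round(output, 2))   # 11.2
print(average)            # 15.0
```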

Self-Attention and Cross-Attention differ in where Q, K, and V come from.

In short:

Self-Attention = inside the same sequence

Cross-Attention = between different sequences

This difference matters when reading Transformer code.

If Q, K, and V come from the same tensor, it is Self-Attention.

If Q comes from one tensor and K/V come from another, it is Cross-Attention.
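In code, the difference comes down to which tensors feed the projections. The sketch below assumes NumPy. The function name `attention` and the names `decoder_states` and `encoder_states` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x_q, x_kv, W_Q, W_K, W_V):
    # Queries come from x_q; Keys and Values come from x_kv.
    Q = x_q @ W_Q
    K = x_kv @ W_K
    V = x_kv @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
decoder_states = rng.normal(size=(2, 8))   # hypothetical sequence of 2 tokens
encoder_states = rng.normal(size=(5, 8))   # hypothetical sequence of 5 tokens

# Self-Attention: the same tensor supplies Q, K, and V.
self_out, self_w = attention(decoder_states, decoder_states, W_Q, W_K, W_V)
# Cross-Attention: Q from one tensor, K and V from another.
cross_out, cross_w = attention(decoder_states, encoder_states, W_Q, W_K, W_V)

print(self_w.shape)    # (2, 2)
print(cross_w.shape)   # (2, 5)
```

The weight matrix is square only in the Self-Attention case. In Cross-Attention it has one row per Query token and one column per Key token.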

Naive view:

Each token compares with every other token one by one.

Matrix view:

All token relationships are computed at once.

Naive logic:

for token_i in tokens:
    for token_j in tokens:
        compute_similarity(token_i, token_j)

Matrix logic:

scores = Q @ K.T

That single matrix multiplication computes all pairwise token scores.

This is why Transformers are GPU-friendly.

They replace sequential loops with dense linear algebra.
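A small check that the two views agree, assuming NumPy and random Q and K matrices of a hypothetical size (4 tokens, 8 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8                     # hypothetical sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

# Naive view: one dot product per (i, j) pair.
loop_scores = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        loop_scores[i, j] = np.dot(Q[i], K[j])

# Matrix view: every pair at once.
matrix_scores = Q @ K.T

print(np.allclose(loop_scores, matrix_scores))   # True
```

Both versions perform the same arithmetic. The matrix form simply hands the whole computation to an optimized routine in a single call.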

The attention matrix contains token-to-token relationships.

If the sequence length is n, the score matrix is n × n.

Each row means:

How much one token attends to every token.

Each column means:

How much that token is attended to by others.
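To make the row and column readings concrete, here is a hand-written 3 × 3 weight matrix. The numbers are hypothetical and do not come from a real model, and the sketch assumes NumPy.

```python
import numpy as np

# Hypothetical attention weights for 3 tokens (rows = attending token).
weights = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
])

row_sums = weights.sum(axis=1)   # each row is a Softmax output, so it sums to 1
col_sums = weights.sum(axis=0)   # total attention each token receives

print(row_sums)
print(col_sums)
```

Rows always sum to 1 because Softmax is applied per row. Columns have no such constraint, so a token can receive more or less than 1 in total.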

This structure is powerful.

But it also creates a cost problem.

The compute and memory cost of full Self-Attention grows as O(n²) in the sequence length n.

Longer context means more computation and memory.

So the same design that makes attention expressive also makes it expensive.

That is why efficient attention methods exist.
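A back-of-the-envelope calculation shows how quickly the score matrix grows. The sketch assumes the full n × n matrix is stored in 32-bit floats for a single head and a single sequence. Real implementations differ in precision, head count, and batching.

```python
def score_matrix_bytes(n, bytes_per_element=4):
    # One n x n matrix of scores; 4 bytes per element assumes float32.
    return n * n * bytes_per_element

for n in (1_024, 8_192, 65_536):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB")
```

Doubling the sequence length quadruples the size of the matrix: 1,024 tokens need 4 MiB under these assumptions, while 65,536 tokens need 16,384 MiB.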

Self-Attention needs positional information.

By itself, attention compares token content.

It does not automatically know token order.
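One way to see this is to shuffle the input tokens and compare outputs. The sketch below assumes NumPy, a single head, no positional encoding, and random stand-in weights. Under those assumptions, reordering the tokens only reorders the output rows.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                       # 3 tokens, no position signal
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))

perm = [2, 0, 1]                                  # shuffle the token order
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same output vector regardless of where it sits.
same_up_to_order = np.allclose(out[perm], out_shuffled)
print(same_up_to_order)   # True
```

Since the output for each token is unchanged by shuffling, word order has to be injected separately through positional information.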

Self-Attention also gets expensive as sequence length grows.

For short and medium sequences, full attention is powerful.

For very long sequences, memory and compute become major constraints.

Another important point:

Attention weights are not a reliable explanation of model behavior.

They show how information is mixed.

They should not be read as a faithful trace of the model's reasoning.

In real models, QKV projection is often implemented as one combined linear layer.

Instead of computing three separate matrix multiplications:

Q = XW_Q

K = XW_K

V = XW_V

Implementations often compute:

QKV = XW_QKV

Then split the result into Q, K, and V.

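This is usually faster, because one larger matrix multiplication tends to use the hardware better than three smaller ones, and it keeps the code simpler.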

The math stays the same.

The implementation is optimized.
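The equivalence can be checked directly. This sketch assumes NumPy and builds a hypothetical W_QKV by concatenating the three weight matrices along the output dimension. Actual libraries may order or lay out the fused weights differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                        # hypothetical sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Fuse the three projections into one weight matrix of shape (8, 12).
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)

# One matrix multiplication, then split the columns back into Q, K, V.
QKV = X @ W_QKV                                  # shape (3, 12)
Q, K, V = np.split(QKV, 3, axis=1)

print(np.allclose(Q, X @ W_Q))   # True
print(np.allclose(K, X @ W_K))   # True
print(np.allclose(V, X @ W_V))   # True
```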

That is the developer mindset:

Understand the formula.

Then recognize the optimized tensor layout in code.

Self-Attention is the core operation behind Transformers.

It works by projecting tokens into Q, K, and V.

Q and K compute relevance.

Softmax turns relevance into weights.

Weights mix V into contextual outputs.

The shortest version is:

Self-Attention = compare tokens → weight information → update representations

If you understand QKᵀ and weighted Values, you understand the heart of Transformer computation.

When reading Transformer code, which part feels most confusing?

QKV projection, Softmax attention weights, or the final matrix multiplication with V?

Originally published at zeromathai.com.

Original article: https://zeromathai.com/en/self-attention-qkv-matrix-en/

GitHub Resources

AI diagrams, study notes, and visual guides:

https://github.com/zeromathai/zeromathai-ai
