Query, Key, Values

Source: anup.io · Published: 2026-06-01T05:36Z · Topic: large-language-models

The transformer attention mechanism uses three learned projections—Query, Key, and Value—to enable each token to selectively gather information from other tokens. The Query determines what the token is looking for, the Key advertises what each token contains, and the Value carries the content to be passed on if selected. This design allows the model to perform soft content-based routing, where attention weights computed from Query-Key similarity determine which Values are blended into the output.

[As part of my TIL series, building an intuition about Q, K, V]

A good way to understand QKV is this:

Attention is a soft lookup operation.

Given a token, the model asks:

“What information should I pull from the other tokens?”

Q, K and V are just three different projections of the same input token embeddings.

The simplest mental model

For each token, the model creates three vectors:

Query -> "What am I looking for?"
Key -> "What do I contain/advertise?"
Value -> "What information should I pass on if selected?"

So attention works like this:

Compare a token’s Query against every token’s Key (its own included).
Turn those similarities into weights that sum to 1.
Use those weights to take a weighted average of the Values.

The formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Meaning:

similarity scores = QKᵀ
attention weights = softmax(similarity scores / √dₖ)
output = attention weights × V
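
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real framework implements it: it uses lists instead of tensors, handles a single attention head, and ignores masking. The function names `softmax` and `attention` and all the numbers are assumptions made for this example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q: query vectors, K: key vectors (same width d_k as the queries),
    V: value vectors (one per key). Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # similarity scores: dot product of this query with every key, scaled
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # attention weights: softmax over the scores
        weights = softmax(scores)
        # output: weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up inputs: one query that resembles the first key more than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
print(w[0])    # the first weight is larger than the second
print(out[0])  # so the output leans towards the first value vector
```

Note that the second token still receives a non-zero weight. That is what makes the lookup "soft": nothing is excluded outright, only down-weighted.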

Concrete example

Take the sentence:

The dog chased the ball because it was excited.

When processing the token “it”, the model needs to decide what “it” refers to.

For the token “it”:

Q_it = “I am looking for the thing this pronoun refers to”

Other tokens expose keys:

K_dog  = “I am an animal / possible subject”
K_ball = “I am an object / possible noun”

The model compares:

Q_it · K_dog
Q_it · K_ball

If Q_it · K_dog is higher, then “it” attends more strongly to “dog”.

Then the output for “it” becomes a weighted mixture of the value vectors, especially:

V_dog

So the model enriches the representation of “it” with information from “dog”.
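
To make the “it” example concrete, here is a toy calculation. Everything numeric in it is an assumption: the vectors are hand-picked so that the query for “it” lines up with the key for “dog”, whereas a real model would learn them. For brevity only two candidate tokens are compared, while real attention would score every token in the sentence.

```python
import math

# Hypothetical vectors, chosen by hand purely for illustration.
q_it = [1.0, 0.5, 0.0]
keys = {"dog": [0.9, 0.8, 0.1], "ball": [0.2, 0.1, 0.9]}
values = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Q_it · K_dog and Q_it · K_ball, scaled by sqrt(d_k)
d_k = len(q_it)
scores = {tok: dot(q_it, k) / math.sqrt(d_k) for tok, k in keys.items()}

# Softmax over the two candidates
exps = {tok: math.exp(s) for tok, s in scores.items()}
total = sum(exps.values())
weights = {tok: e / total for tok, e in exps.items()}

# Output for "it": weighted mixture of V_dog and V_ball
output = [sum(weights[tok] * values[tok][j] for tok in values) for j in range(2)]

print(weights)  # "dog" receives the larger weight
print(output)   # so the mixture is dominated by V_dog
```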

Why separate Q, K and V?

This is the key bit.

The model does not use the raw token embedding directly. It learns three different views of each token:

Q = XW_Q
K = XW_K
V = XW_V

Same input X, different learned matrices.
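
As a rough sketch of those three projections, the snippet below multiplies one small input matrix X by three different weight matrices. The helper `matmul` and every number are assumptions for illustration. In a trained model W_Q, W_K and W_V are learned parameters, and the dimensions are far larger.

```python
def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# X: 2 tokens, each with a 3-dimensional embedding (made-up numbers).
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Hypothetical 3 x 2 projection matrices; a real model learns these.
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_V = [[1.0, 1.0], [0.0, 2.0], [0.0, 0.0]]

Q = matmul(X, W_Q)  # what each token is looking for
K = matmul(X, W_K)  # how each token can be matched
V = matmul(X, W_V)  # what each token contributes if selected

print(Q)
print(K)
print(V)
```

The same two rows of X produce three different matrices, one per job.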

Why?

Because “what I am looking for”, “how I should be matched”, and “what information I should contribute” are different jobs.

For example, the word “bank” might need to:

Q: look for context that disambiguates meaning
K: advertise that it is a noun, place, institution, river edge, etc.
V: contribute semantic content once selected

One embedding cannot do all of that cleanly. QKV gives the model specialised subspaces for matching and information transfer.

The database analogy

This is probably the most useful analogy:

Query  = search query
Key    = index / searchable metadata
Value  = retrieved content

Attention is like searching a database where every token is a record.

Token = record
Key   = searchable field
Value = payload
Query = search request from current token

The attention score says:

How relevant is this token’s key to my query?

The output says:

Give me the values from the most relevant tokens.
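
One place the database analogy bends is that a database lookup is usually exact, while attention is soft. The sketch below contrasts the two under made-up data. The dictionary `records` and the key, value and query vectors are all hypothetical.

```python
import math

# Hard lookup: a dict returns exactly one record, or raises KeyError.
records = {"dog": "animal", "ball": "toy"}
hit = records["dog"]

# Soft lookup: every record contributes, weighted by query-key similarity.
keys = [[1.0, 0.0], [0.0, 1.0]]    # one searchable key per record (made-up)
values = [[5.0, 0.0], [0.0, 5.0]]  # one payload per record (made-up)
query = [2.0, 0.0]                 # resembles the first key

scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
          for key in keys]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# The result is a blend of payloads, not a single retrieved record.
blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(2)]

print(hit)
print(weights)
print(blended)
```

The most relevant record dominates the blend, but the other record still leaks in with a small weight.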

The important correction

People often say:

“Q asks a question, K answers it, V stores the answer.”

That is okay as a beginner analogy, but slightly misleading.

More accurately:

Q and K decide routing.
V carries content.

Q and K determine where to attend.

V determines what information gets copied/mixed into the output.
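
A quick way to check the routing/content split is to hold Q and K fixed while swapping V, then do the reverse. This is a sketch under assumed toy numbers, and `attend` is a hypothetical helper for a single query, not a library function.

```python
import math

def attend(q, K, V):
    """Single-query scaled dot-product attention. Returns (weights, output)."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    return weights, out

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V_a = [[1.0, 0.0], [0.0, 1.0]]
V_b = [[7.0, 7.0], [-3.0, 2.0]]

# Same Q and K, different V: the routing is identical, the content differs.
w_a, out_a = attend(q, K, V_a)
w_b, out_b = attend(q, K, V_b)
print(w_a == w_b)      # True
print(out_a == out_b)  # False

# Different Q, same K and V: the routing itself changes.
w_c, out_c = attend([0.0, 1.0], K, V_a)
print(w_c == w_a)      # False
```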

One-line understanding

QKV attention is learned content-based routing: each token forms a query, matches it against other tokens’ keys, then pulls back a weighted blend of their values.
