Self-Attention can compare every token with every other token.
But there is a catch.
By itself, it does not know the order of tokens.
That is a serious problem because “dog bites man” and “man bites dog” use the same words but mean completely different things.
A Transformer needs two kinds of information:
what the token is
where the token is
Token embeddings provide the “what.”
Positional embeddings provide the “where.”
This matters because attention without position is order-blind.
It can compare tokens, but it does not naturally know which token came first.
A simple positional embedding flow looks like this:
Token Embedding + Positional Information → Input Representation
For Absolute Positional Embedding:
E = X + P
Where:
X = token embedding
P = positional embedding
E = final input representation
More compactly:
Transformer input = meaning vector + position signal
Different positional methods change how the position signal is injected.
Basic positional injection:
tokens = tokenize(text)
x = embedding(tokens)
position = positional_embedding(token_positions)
input_representation = x + position
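The flow above can be made concrete with a toy, runnable sketch. The vocabulary, the 2-dimensional tables, and the helper names `tokenize` and `embed` are all made up for illustration; a real model would learn these tables and use far more dimensions.

```python
# Toy sketch of absolute positional injection. All numbers are illustrative.

# Hypothetical vocabulary and 2-dimensional embedding tables.
VOCAB = {"dog": 0, "bites": 1, "man": 2}
TOKEN_EMBEDDING = [
    [0.2, 0.5],   # dog
    [0.7, -0.1],  # bites
    [-0.3, 0.4],  # man
]
POSITION_EMBEDDING = [
    [0.1, -0.2],  # position 0
    [0.0, 0.3],   # position 1
    [-0.1, 0.1],  # position 2
]


def tokenize(text):
    """Split on whitespace and map each word to its vocabulary id."""
    return [VOCAB[word] for word in text.split()]


def embed(text):
    """Return one vector per token: token embedding + position embedding."""
    result = []
    for position, token_id in enumerate(tokenize(text)):
        x = TOKEN_EMBEDDING[token_id]
        p = POSITION_EMBEDDING[position]
        result.append([xi + pi for xi, pi in zip(x, p)])
    return result


first = embed("dog bites man")
second = embed("man bites dog")

# "dog" gets a different input vector depending on where it appears.
print(first[0])   # dog at position 0
print(second[2])  # dog at position 2
```

The same token embedding for "dog" produces two different input vectors because a different position vector is added in each sentence.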
For attention-based position methods:
q = project_query(x)
k = project_key(x)
q = apply_position(q)
k = apply_position(k)
attention_scores = q @ k.T
APE usually modifies the input embedding.
RPE usually modifies the attention score.
RoPE usually modifies Query and Key.
That difference is the whole story.
Compare these two sentences:
dog bites man
man bites dog
The token set is the same:
dog, bites, man
But the order changes the meaning.
Without positional information, Self-Attention sees token relationships but has no built-in sequence order.
With positional information, each token representation includes location.
So “dog” at position 1 is different from “dog” at position 3.
This is why positional information is not a cosmetic extra.
Without some order signal, attention alone cannot tell these two sentences apart.
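The order-blindness can be checked directly. The sketch below runs a single attention head with no positional signal on both sentences. The embeddings are hypothetical, and the identity projections (Q = K = V = X) and the absence of a causal mask are simplifying assumptions, not how a production model is configured.

```python
import math

# Hypothetical 2-dimensional token embeddings with no positional signal.
EMB = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(words):
    """Single-head attention with identity projections (Q = K = V = X)."""
    xs = [EMB[w] for w in words]
    d = len(xs[0])
    outputs = {}
    for word, q in zip(words, xs):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        outputs[word] = [
            sum(w * v[c] for w, v in zip(weights, xs)) for c in range(d)
        ]
    return outputs


a = self_attention(["dog", "bites", "man"])
b = self_attention(["man", "bites", "dog"])

# Each word gets the same output vector in both sentences (up to rounding).
for word in a:
    print(word, a[word], b[word])
```

Reordering the input only reorders the outputs. Nothing in the per-token result reflects which word came first.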
Absolute Positional Embedding assigns a vector to each position index.
Position 1 has one vector.
Position 2 has another vector.
Position 3 has another vector.
Then the model adds that position vector to the token embedding.
Example:
Token embedding:
X = [0.2, 0.5]
Position embedding:
P = [0.1, -0.2]
Final representation:
E = X + P = [0.2 + 0.1, 0.5 + (-0.2)] = [0.3, 0.3]
APE is easy to understand.
It says:
this token is at this exact position
APE is simple and easy to implement.
It works well when sequence lengths stay close to what the model saw during training.
Implementation-wise, it is just:
x = token_embedding + position_embedding
That makes it cheap and clean.
But the simplicity has a cost.
APE treats position as a fixed index.
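If the model sees inputs much longer than those it was trained on, the positions past the training range have vectors that were never trained (or, for a fixed-size learned table, do not exist at all), so behavior there becomes unreliable.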
That makes APE weaker for long-context extrapolation.
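A minimal sketch of that limitation, assuming a learned position table of fixed size. `MAX_TRAINED_LENGTH`, `POSITION_TABLE`, and `position_vector` are hypothetical names, and the table values are placeholders rather than trained weights.

```python
# Hypothetical learned position table covering positions 0..3 only.
MAX_TRAINED_LENGTH = 4
POSITION_TABLE = [[0.1 * i, -0.1 * i] for i in range(MAX_TRAINED_LENGTH)]


def position_vector(index):
    """Look up a position vector; fail loudly outside the trained range."""
    if not 0 <= index < MAX_TRAINED_LENGTH:
        raise IndexError(
            f"position {index} is outside the trained range "
            f"0..{MAX_TRAINED_LENGTH - 1}"
        )
    return POSITION_TABLE[index]


in_range = position_vector(3)  # fine: position 3 has a vector

try:
    position_vector(10)        # position 10 has no vector at all
    out_of_range_failed = False
except IndexError as error:
    out_of_range_failed = True
    print(error)
```

Extending the table after training does not solve the problem by itself, because the new rows would hold values the model never learned to interpret.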
Relative Positional Embedding focuses on distance.
Instead of asking:
What position is this token at?
It asks:
How far apart are these two tokens?
This is often more natural for language.
A subject and verb may appear at different absolute positions.
But their relative distance and direction still matter.
A simplified RPE attention score looks like this:
Aᵢⱼ = (QᵢKⱼᵀ + Rᵢ₋ⱼ) / √d
Rᵢ₋ⱼ is a term that depends only on the offset i - j between token i and token j; in many implementations it is a learned bias.
This means position directly affects attention.
Suppose:
QᵢKⱼᵀ = 12
Rᵢ₋ⱼ = 4
√d = 4
Then:
Aᵢⱼ = (12 + 4) / 4 = 4
Without the relative term:
Aᵢⱼ = 12 / 4 = 3
So the relative-position term raised the attention score from 3 to 4.
That is the intuition.
RPE lets the model say:
This token is more relevant because of where it is relative to me.
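The worked numbers above can be reproduced with a short sketch of the simplified score. The bias table `RELATIVE_BIAS` and the function names are hypothetical, and the vectors are chosen only so that the dot product is 12 and √d is 4.

```python
import math

# Hypothetical bias for each offset i - j. Real models typically learn these.
RELATIVE_BIAS = {-2: 0.0, -1: 2.0, 0: 1.0, 1: 4.0, 2: 0.5}


def dot(q, k):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(q, k))


def rpe_score(q, k, i, j):
    """Simplified score with a relative term: (q . k + R[i - j]) / sqrt(d)."""
    return (dot(q, k) + RELATIVE_BIAS[i - j]) / math.sqrt(len(q))


def plain_score(q, k):
    """The same score without any positional term."""
    return dot(q, k) / math.sqrt(len(q))


# d = 16, so sqrt(d) = 4, and the dot product is 16 * 0.75 = 12.
q = [1.0] * 16
k = [0.75] * 16

with_position = rpe_score(q, k, i=3, j=2)  # offset 1 has bias 4.0
without_position = plain_score(q, k)

print(with_position, without_position)
```

Only the offset i - j selects the bias, so the same pair of tokens shifted to other absolute positions would receive the same boost.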
Rotary Positional Embedding takes a different path.
It does not add a position vector to the input.
It rotates Query and Key vectors based on position.
The core idea:
position becomes rotation
A 2D rotation matrix looks like this:
Rθ = [[cosθ, -sinθ], [sinθ, cosθ]]
If you rotate [1, 0] by 90 degrees:
[1, 0] → [0, 1]
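The rotation can be verified numerically. This sketch applies the 2D rotation matrix by hand; `rotate` is a hypothetical helper name.

```python
import math


def rotate(vector, theta):
    """Apply the 2D rotation matrix [[cos, -sin], [sin, cos]] to a vector."""
    x, y = vector
    c, s = math.cos(theta), math.sin(theta)
    return [c * x - s * y, s * x + c * y]


rotated = rotate([1.0, 0.0], math.pi / 2)  # 90 degrees in radians

# Floating-point cos(pi/2) is tiny but not exactly zero, so round for display.
print([round(value, 6) for value in rotated])
```

The length of the vector is unchanged; only its direction moves, which is why rotation can carry position without distorting the content of Q or K.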
RoPE applies this idea across Query and Key dimensions.
Different positions get different rotations.
Then attention scores naturally include relative position.
RoPE uses absolute position to rotate Q and K.
But when Q and K are compared, the score depends on their relative position difference.
The key relationship is:
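(RθⁱQ)ᵀ(RθʲK) = Qᵀ(Rθⁱ)ᵀRθʲK = QᵀRθʲ⁻ⁱK
Here Rθⁱ is the rotation by i·θ, and the middle step uses the fact that the transpose of a rotation is its inverse.
This means the attention score depends on the two positions only through j - i.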
That is the relative distance.
So RoPE gives you a useful combination:
absolute-position injection + relative-position behavior
This is why RoPE became popular in modern LLMs.
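The relative-position property can be checked on a single 2D pair. This is a simplified sketch: the base angle `THETA`, the vectors, and the function names are made up, and real RoPE implementations rotate many 2D pairs per head, each with its own frequency.

```python
import math

THETA = 0.3  # hypothetical rotation angle per position step


def rotate(vector, angle):
    """Rotate a 2D vector counterclockwise by the given angle."""
    x, y = vector
    c, s = math.cos(angle), math.sin(angle)
    return [c * x - s * y, s * x + c * y]


def rope_score(q, k, i, j):
    """Rotate q by i * THETA and k by j * THETA, then take the dot product."""
    rq = rotate(q, i * THETA)
    rk = rotate(k, j * THETA)
    return rq[0] * rk[0] + rq[1] * rk[1]


q = [1.0, 2.0]
k = [0.5, -1.0]

s1 = rope_score(q, k, i=2, j=5)    # offset j - i = 3
s2 = rope_score(q, k, i=10, j=13)  # offset j - i = 3, shifted by 8
s3 = rope_score(q, k, i=2, j=6)    # offset j - i = 4

print(s1, s2, s3)
```

The first two scores agree up to floating-point rounding even though the absolute positions differ, while changing the offset changes the score.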
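APE: adds a position vector to the input embedding.
RPE: adds a relative-position term to the attention score.
RoPE: rotates Query and Key before the score is computed.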
The key difference:
APE = where am I?
RPE = how far are we?
RoPE = rotate Q/K so distance appears in attention
If you are reading Transformer code, look at where position enters the model.
APE usually appears near the embedding layer:
x = token_embedding + position_embedding
RPE usually appears inside attention score computation:
scores = q @ k.T + relative_position_bias
RoPE usually appears after Q and K projection:
q = apply_rope(q, positions)
k = apply_rope(k, positions)
scores = q @ k.T
This is the developer shortcut.
Find the injection point.
Then you know which positional method the model uses.
Naive view:
Positional embedding just tells the model token order.
Practical view:
Positional design affects long-context behavior, caching, memory, and attention quality.
Naive mindset:
add positions
run attention
Practical mindset:
choose how position enters attention
consider context length
consider extrapolation
consider KV Cache compatibility
consider implementation complexity
This matters because positional encoding is not a small detail.
It changes how the model behaves when the context becomes long.
Short inputs can hide positional weaknesses.
Long-context models expose them.
If positional information does not extrapolate well, the model may become unstable outside its training length.
This is why modern LLMs care so much about RoPE variants and long-context scaling.
The position method affects whether a model can reliably handle long prompts, code files, documents, and conversations.
APE is easy but tied to absolute indices.
RPE is expressive but can complicate attention computation.
RoPE is efficient and practical, but still needs careful scaling for very long contexts.
Also:
Positional embeddings do not create reasoning by themselves.
They only give attention a way to use order.
The model still needs training to learn useful patterns.
Self-Attention needs positional information because it is order-blind by default.
APE adds absolute position to embeddings.
RPE adds relative distance to attention scores.
RoPE rotates Query and Key vectors so relative position appears naturally.
The shortest version:
Positional Embedding = the order signal that makes attention understand sequence structure
If you understand where position enters the model, you understand the difference between APE, RPE, and RoPE.
When learning Transformer internals, which positional method feels most intuitive to you?
APE, RPE, or RoPE?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/advanced-positional-embeddings-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai