# How Transformers Work — From Self-Attention to Modern LLM Architecture

> Source: <https://dev.to/zeromathai/how-transformers-work-from-self-attention-to-modern-llm-architecture-4j1o>
> Published: 2026-06-15 15:12:47+00:00

Transformers changed AI because they stopped reading sequences one token at a time.

Instead of moving step by step like an RNN, a Transformer compares tokens directly.

That one design shift made modern LLMs possible.

A Transformer is a neural network architecture built around attention.

It looks at a sequence of tokens and learns how those tokens relate to each other.

This matters because language is contextual.

A word is not understood alone.

It is understood through its relationship with surrounding words.

That is why Self-Attention became the core mechanism.

A simplified Transformer flow looks like this:

Tokens → Embeddings → Positional Information → Self-Attention → Feed-Forward Network → Output

More compactly:

Transformer = token representations + attention + position + stacked blocks

The model first converts text into token vectors.

Then it injects position information.

Then each Transformer block updates the token representations using attention and feed-forward layers.

At a high level, a Transformer processes text like this:

```
split text into tokens
convert tokens into embeddings
add positional information

for each Transformer block:
    compute Self-Attention
    mix token information
    apply feed-forward transformation
    keep stable flow with residual connections and normalization

produce contextual token representations
```
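The first three steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how a production model does it: the whitespace `tokenize` function and the tiny `table` of embeddings are hypothetical, and the sinusoidal encoding is just one way to add positional information.

```python
import math
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def sinusoidal_position(pos: int, d_model: int) -> List[float]:
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    encoding = []
    for j in range(d_model):
        pair_start = j - (j % 2)  # both dimensions of a pair share one frequency
        angle = pos / (10000 ** (pair_start / d_model))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding

def embed(tokens: List[str], table: Dict[str, List[float]]) -> List[List[float]]:
    """Look up each token vector and add the encoding for its position."""
    d_model = len(next(iter(table.values())))
    return [
        [e + p for e, p in zip(table[tok], sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(tokens)
    ]

# Hypothetical 4-dimensional embedding table; a trained model learns these values.
table = {"the": [0.1, 0.2, 0.3, 0.4], "cat": [0.5, 0.1, 0.0, 0.2]}
vectors = embed(tokenize("The cat the"), table)

# The same token at positions 0 and 2 now has two different vectors.
print(vectors[0] != vectors[2])
```

The point of the sketch is the last line: without the positional term, both occurrences of "the" would enter the first block as identical vectors.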

For decoder-based LLMs, generation continues like this:

```
predict next token
append generated token
reuse cached keys and values
repeat until stopping condition
```
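The loop above can be sketched as a small Python function. It is a simplified illustration: `toy_next_token_logits` is a hypothetical stand-in for a real model, the token ids and `eos_id` are made up, the next token is picked greedily, and the cache step is left out to keep the control flow visible.

```python
from typing import Callable, List

def generate(
    next_token_logits: Callable[[List[int]], List[float]],
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Autoregressive loop: predict, append, repeat until a stop condition."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_id)  # the new token becomes part of the context
        if next_id == eos_id:   # stopping condition
            break
    return tokens

def toy_next_token_logits(tokens: List[int]) -> List[float]:
    """Hypothetical model: always prefers (last token + 1) modulo 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

print(generate(toy_next_token_logits, prompt=[0], eos_id=4))
```

Both stopping conditions matter in practice: the end-of-sequence token ends a natural completion, and `max_new_tokens` bounds the loop when that token never appears.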

This is why Transformers are practical for large-scale generation.

They can learn relationships across many tokens.

And with caching, they can generate efficiently.

Take this sentence:

The animal did not cross the street because it was tired.

What does “it” refer to?

A strictly sequential model has to carry "animal" forward through every intermediate step, so the link can weaken as the distance grows.

Self-Attention lets the token “it” compare itself with other tokens like “animal” and “street.”

The model can assign stronger attention to the token that best explains the meaning.

That is the intuition.

Attention lets tokens ask:

Which other tokens matter for understanding me?

This comparison explains why Transformers became so important.

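RNN: reads tokens one at a time and carries context forward in a hidden state.

Transformer: compares all tokens in the sequence directly through attention, so they can be processed in parallel.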

So the Transformer was not just faster.

It changed how sequence relationships are represented.

RNNs remember through recurrence.

Transformers relate through attention.

Self-Attention computes relationships between tokens in the same sequence.

Each token creates three vectors: a Query, a Key, and a Value.

The intuition is simple:

Query = what this token is looking for

Key = what each token offers for matching

Value = information to retrieve if the match is strong

The core formula is:

Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k))V

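This means:

QK^T compares every query with every key.

Dividing by sqrt(d_k) keeps the scores from growing with the vector size.

Softmax turns each token's scores into weights that sum to 1.

Multiplying by V gives each token a weighted mix of the value vectors.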

That is how each token becomes context-aware.
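The formula can be traced with a few lines of plain Python. This is a minimal sketch for small list-of-lists matrices; the example values for `Q`, `K`, and `V` are invented for illustration, and a real model would use learned projections and a tensor library.

```python
import math
from typing import List

Matrix = List[List[float]]

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V row by row."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output

# One query that matches the first key more closely than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

Because the query aligns with the first key, the output leans toward the first value vector while still mixing in some of the second.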

One attention calculation is useful.

But one view is not enough.

Multi-Head Attention runs several attention heads in parallel.

Each head can focus on a different type of relationship.

One head may track syntax.

Another may track semantic similarity.

Another may track long-distance references.

Then the outputs are combined into one representation.

This makes attention richer than a single similarity calculation.
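A rough sketch of the split-attend-combine pattern follows. It makes one big simplification that should be read as an assumption: the learned query, key, value, and output projections are omitted, so each head simply attends over its own slice of the input vectors.

```python
import math
from typing import List

Matrix = List[List[float]]

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def split_heads(X: Matrix, num_heads: int) -> List[Matrix]:
    """Give each head its own contiguous slice of every token vector."""
    head_dim = len(X[0]) // num_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in X]
        for h in range(num_heads)
    ]

def multi_head_attention(X: Matrix, num_heads: int) -> Matrix:
    """Run attention per head, then concatenate head outputs per token."""
    head_outputs = [attention(H, H, H) for H in split_heads(X, num_heads)]
    return [
        [value for head in head_outputs for value in head[t]]
        for t in range(len(X))
    ]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(X, num_heads=2)
print(len(mixed), len(mixed[0]))
```

Each head sees a lower-dimensional view and produces its own attention weights, which is what lets different heads specialize once the projections are learned.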

Self-Attention does not automatically know token order.

If you only give it a bag of token embeddings, the model needs another signal to know which token came first.

That is why positional information is added.

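Common positional methods include Absolute Positional Encoding (APE), Relative Positional Encoding (RPE), and Rotary Positional Embedding (RoPE).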

APE gives each position its own vector.

RPE focuses on relative distance between tokens.

RoPE rotates query and key vectors based on position, so the attention score between two tokens depends on their relative distance.

This is why RoPE became common in modern LLMs.
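The relative-position property of RoPE can be checked numerically. The sketch below is a simplified version that rotates consecutive pairs of dimensions; the `base` of 10000 and the pairing scheme are assumptions, and real implementations differ in how they pair dimensions. The vectors `q` and `k` are arbitrary example values.

```python
import math
from typing import List

def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate pairs (x0, x1), (x2, x3), ... by angles that grow with position."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # each pair has its own frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset of 3 positions, very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree up to floating-point error because rotating both vectors by the same extra angle leaves their dot product unchanged, so only the offset between positions matters.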

The original Transformer used an Encoder-Decoder structure.

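Encoder: reads the whole input at once and builds a contextual representation of every token.

Decoder: generates the output one token at a time, attending only to the tokens already produced.

Encoder-Decoder: the decoder also attends to the encoder's output, which suits input-to-output tasks such as translation.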

Modern GPT-style LLMs are mostly decoder-based.

They generate text one token at a time.

The decoder predicts the next token, appends it, and repeats.

Once the model produces logits, it needs to choose the next token.

Different decoding strategies create different behavior.

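Greedy decoding: always picks the highest-probability token. It is deterministic but can become repetitive.

Beam search: keeps several candidate sequences in parallel and returns the highest-scoring one.

Top-k sampling: samples from the k most likely tokens.

Top-p sampling: samples from the smallest set of tokens whose cumulative probability reaches p.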

So generation quality is not only about the model.

It also depends on decoding.
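Three of these strategies fit in a short sketch that operates on a list of logits. The function names and the example logits are illustrative assumptions, and beam search is left out because it tracks whole sequences rather than a single step.

```python
import math
import random
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits: List[float]) -> int:
    """Pick the single highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)

def top_k_sample(logits: List[float], k: int, rng: random.Random) -> int:
    """Sample among the k highest-scoring tokens, renormalized."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]

def top_p_sample(logits: List[float], p: float, rng: random.Random) -> int:
    """Sample from the smallest set whose cumulative probability reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

rng = random.Random(0)           # fixed seed so runs are repeatable
logits = [2.0, 1.0, 0.1, -3.0]   # made-up scores for a 4-token vocabulary
print(greedy(logits), top_k_sample(logits, 2, rng), top_p_sample(logits, 0.9, rng))
```

With these logits, top-k with `k=2` can only return token 0 or 1, while top-p with `p=0.9` keeps the three most likely tokens and never returns the last one.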

Full Attention is powerful but expensive.

If the sequence length is n, attention has roughly O(n^2) cost, because every token is compared with every other token.

That means longer context becomes expensive quickly.
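A quick count makes the growth concrete. The helper below is a hypothetical illustration that counts the entries in one attention score matrix for a single head, ignoring constant factors.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of query-key scores in one full attention matrix."""
    return seq_len * seq_len

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_entries(n):,} scores")
```

Making the context 10 times longer multiplies the number of scores by 100, which is why long contexts get expensive so fast.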

This is why efficient attention matters.

Local Attention reduces the view to nearby tokens.

Sparse Attention computes only selected attention links.

FlashAttention keeps the formula but computes it in tiles, so the full n × n score matrix never has to be written to GPU memory.

The key idea:

Do less unnecessary work, or move data more efficiently.

Both make longer context more practical.
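The "do less work" side can be illustrated with a local attention mask. This sketch assumes a causal sliding window, where each token sees itself and a fixed number of earlier tokens; symmetric windows are also possible, and the `window` size here is arbitrary.

```python
from typing import List

def local_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Token j must not be in the future and must lie within the last
    `window` positions, counting token i itself.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

def count_links(mask: List[List[bool]]) -> int:
    """Count how many query-key pairs are actually computed."""
    return sum(sum(row) for row in mask)

mask = local_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(count_links(mask), "links instead of", 6 * 6)
```

The number of links grows roughly as `seq_len * window` rather than `seq_len * seq_len`, so the cost becomes linear in sequence length for a fixed window.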

Autoregressive generation has another problem.

When generating one token at a time, the model repeatedly needs past key and value tensors.

KV Cache stores those tensors.

So the model does not recompute them from scratch at every step.

The flow looks like this:

Generated tokens → cached keys and values → new query attends to cache → next token

This makes inference faster.
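The sketch below compares one attention step with and without a cache. It is a toy under stated assumptions: `project` is a hypothetical stand-in for the learned key and value projections, the query is the raw embedding, and there is a single head and a single layer.

```python
import math
from typing import List

Vector = List[float]

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    d_k = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def project(x: Vector, scale: float) -> Vector:
    """Hypothetical stand-in for a learned key or value projection."""
    return [scale * xi for xi in x]

class KVCache:
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)

def step_with_cache(cache: KVCache, x: Vector) -> Vector:
    cache.append(project(x, 0.5), project(x, 2.0))  # project only the new token
    return attend(x, cache.keys, cache.values)

def step_without_cache(history: List[Vector]) -> Vector:
    keys = [project(h, 0.5) for h in history]        # recompute every past key
    values = [project(h, 2.0) for h in history]      # recompute every past value
    return attend(history[-1], keys, values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached = [step_with_cache(cache, x) for x in embeddings]
recomputed = [step_without_cache(embeddings[: t + 1]) for t in range(len(embeddings))]
print(cached == recomputed)
```

Both paths give the same outputs, but the cached path projects each token once, while the uncached path redoes all earlier projections at every step.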

But it creates a memory problem.

Longer context means a larger KV Cache.

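That is why modern LLMs use techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which let several query heads share the same key and value heads.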

These methods reduce the memory cost of storing key-value information.
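A back-of-the-envelope estimate shows where the savings come from. The formula counts one key tensor and one value tensor per layer; the model dimensions below are hypothetical round numbers chosen for easy arithmetic, not the configuration of any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes assumes 16-bit storage
) -> int:
    """Approximate KV Cache size: keys plus values for every layer and position."""
    keys_and_values = 2
    return (
        keys_and_values * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )

GIB = 1024 ** 3

# Hypothetical model: 32 layers, head dimension 128, 4096-token context.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"32 key/value heads: {full / GIB:.2f} GiB")
print(f" 8 key/value heads: {grouped / GIB:.2f} GiB")
```

The cache size scales linearly with both the context length and the number of key/value heads, so cutting the heads from 32 to 8 cuts the cache to a quarter.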

Modern LLMs still use the Transformer idea.

But the block has evolved.

A typical modern block looks like this:

Input

→ RMSNorm or Pre-Layer Normalization

→ Self-Attention with GQA and RoPE

→ Residual Connection

→ RMSNorm or Pre-Layer Normalization

→ Feed-Forward Network with SwiGLU or Mixture of Experts

→ Residual Connection
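The non-attention pieces of this block are small enough to sketch in plain Python. The function names, the `eps` value, and the list-of-lists weight layout are assumptions for illustration; attention, GQA, RoPE, and Mixture of Experts are left out.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]

def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by its root mean square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x: Vector, W_gate: Matrix, W_up: Matrix, W_down: Matrix) -> Vector:
    """Feed-forward layer where a SiLU gate multiplies a linear 'up' branch."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

def pre_norm_residual(x: Vector, sublayer: Callable[[Vector], Vector], gain: Vector) -> Vector:
    """Pre-normalization pattern: x + sublayer(norm(x))."""
    update = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, update)]

x = [3.0, 4.0]
gain = [1.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
y = pre_norm_residual(x, lambda h: swiglu_ffn(h, identity, identity, identity), gain)
print(rms_norm(x, gain), y)
```

Normalizing before each sub-layer and adding the result back to the untouched input keeps a direct path through the stack, which is the role of the two residual steps in the diagram.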

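Important upgrades include RoPE for position, RMSNorm with pre-normalization for stable training, GQA for a smaller KV Cache, and SwiGLU or Mixture of Experts (MoE) in the feed-forward layer.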

So today’s Transformer is not a direct copy of the 2017 original.

It is an evolved architecture family.

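Original Transformer: Encoder-Decoder structure, sinusoidal absolute positional encoding, Layer Normalization applied after each sub-layer, standard Multi-Head Attention, and a ReLU feed-forward network.

Modern LLM architecture: mostly decoder-only, RoPE, RMSNorm applied before each sub-layer, GQA, SwiGLU or MoE feed-forward layers, and a KV Cache at inference time.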

The core idea stayed the same.

The engineering changed dramatically.

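If Transformer architecture feels too large, learn it in this order:

Start with Self-Attention and the Query, Key, Value intuition.

Next, add Multi-Head Attention and positional information.

Then study the decoder and its decoding strategies.

After that, look at efficient attention and KV Cache.

Finish with the modern LLM block.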

This order works because you first understand the relationship mechanism.

Then you understand generation.

Then you understand why modern LLMs needed efficiency upgrades.

The Transformer is the shared architectural foundation of modern LLMs.

The shortest version is:

Transformer = attention + position + stacked blocks + efficient generation

Self-Attention computes token relationships.

Positional encoding injects order.

The decoder generates tokens.

KV Cache makes autoregressive inference practical.

Modern upgrades like RoPE, RMSNorm, GQA, SwiGLU, and MoE make the architecture scalable.

If you remember one idea, remember this:

Transformers work by turning a sequence into a set of contextual relationships, then refining those relationships through stacked attention-based blocks.

When learning Transformers, do you find it easier to start from the attention formula, the decoder generation loop, or the modern LLM block structure?

Originally published at zeromathai.com.

Original article: [https://zeromathai.com/en/transformer-architecture-overview-en/](https://zeromathai.com/en/transformer-architecture-overview-en/)

