Source: dev.to · Topic: large-language-models

92. BERT: The Model That Reads in Both Directions

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model that reads all tokens in a sentence simultaneously. It is pretrained with masked language modeling (MLM) and next sentence prediction (NSP), which together give it bidirectional understanding. Unlike GPT, which reads left-to-right and excels at text generation, BERT is optimized for understanding tasks such as classification, and it dominated NLP benchmarks when it was released in 2018. The model uses special tokens like [CLS] for classification and [SEP] for separating sentences. Popular variants include bert-base-uncased (110M parameters) and the more efficient DistilBERT.

12 min read · Published May 20, 2026

GPT generates text by predicting the next word. It reads left to right.

BERT does something different. It masks random words in a sentence and tries to predict what they are. To do that well, it has to understand every word in relation to every other word simultaneously. Left and right context both matter.

That bidirectional understanding is why BERT dominated NLP benchmarks when it came out in 2018, and why encoder-only transformers are still the go-to for understanding tasks.

What You'll Learn Here

  • What makes BERT different from GPT
  • Masked Language Modeling: how BERT learns
  • Next Sentence Prediction: the second pretraining task
  • The [CLS] and [SEP] tokens and what they do
  • Fine-tuning BERT for text classification
  • Fine-tuning for Named Entity Recognition
  • Fine-tuning for Question Answering
  • Using HuggingFace to do all of this in under 20 lines

BERT vs GPT: The Key Difference

Both are transformer-based. The architecture is similar. The difference is in how they're pretrained and which part of the transformer they use.

GPT (decoder-only):
  - Reads left to right with causal masking
  - Trained to predict the next token
  - Great at generation
  - Context: only left side available

BERT (encoder-only):
  - Reads all tokens simultaneously
  - Trained to predict masked tokens + next sentence
  - Great at understanding
  - Context: both left and right sides available
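
The "Context" rows above come down to which positions each token is allowed to attend to. A minimal sketch in plain Python, assuming a toy sequence of four tokens and using 1 for "may attend" and 0 for "blocked". The helper names `causal_mask` and `bidirectional_mask` are illustrative, not library functions.

```python
def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i (left context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


# Print both masks side by side for a 4-token sequence.
for name, mask in [("causal (GPT)", causal_mask(4)),
                   ("bidirectional (BERT)", bidirectional_mask(4))]:
    print(name)
    for row in mask:
        print("  ", row)
```

Row i of each matrix describes what token i can see. The lower-triangular shape is what forces GPT to rely on left context only, while the all-ones matrix lets BERT use both sides.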

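For classification and other understanding tasks, BERT-style encoders are usually the stronger choice. For generation tasks, GPT wins. If the application you want to build is about labeling, tagging, or extracting information from text, BERT is a good starting point.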

How BERT Was Pretrained

BERT was pretrained on two tasks simultaneously on a massive corpus (BooksCorpus + English Wikipedia, 3.3 billion words).

Task 1: Masked Language Modeling (MLM)

15% of tokens are randomly selected as prediction targets. The model predicts the original token at each selected position from the surrounding context.

Input:  "The cat [MASK] on the [MASK]"
Target: "The cat sat  on the mat"

Of the 15% selected tokens:

  • 80% replaced with [MASK]
  • 10% replaced with a random token
  • 10% left unchanged

The random and unchanged cases prevent the model from only learning to predict [MASK] tokens, which matters because [MASK] never appears in the text the model sees during fine-tuning.
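
The selection and 80/10/10 replacement rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the original pretraining code: it works on word strings rather than WordPiece IDs, the function name `mask_tokens` and its parameters are made up for this example, and it does not skip special tokens the way a real implementation would.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token and exclude it from the loss.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)              # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: leave unchanged
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the run is repeatable
tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=rng)
print(corrupted)
print(labels)
```

Only positions with a non-None label contribute to the MLM loss, including the ones that were left unchanged or swapped for a random token.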

Task 2: Next Sentence Prediction (NSP)

Two sentences are given. The model predicts whether sentence B actually follows sentence A in the original text.

Input:   [CLS] The cat sat on the mat. [SEP] It was a lazy afternoon. [SEP]
Label:   IsNext (1)

Input:   [CLS] The cat sat on the mat. [SEP] The stock market crashed. [SEP]
Label:   NotNext (0)

NSP was later found to be less useful than MLM and was dropped in RoBERTa. But it's part of the original BERT.
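
To make the construction of NSP training examples concrete, here is a small sketch in plain Python. It is a simplification under stated assumptions: the original setup draws the negative sentence from a different document, whereas this version draws it from the same list, and `make_nsp_example` is a hypothetical helper name.

```python
import random


def make_nsp_example(sentences, i, rng):
    """Build one NSP example for sentence i. Label 1 = IsNext, 0 = NotNext."""
    first = sentences[i]
    if rng.random() < 0.5:
        # Positive case: the sentence that really follows.
        second, label = sentences[i + 1], 1
    else:
        # Negative case: any sentence other than i itself or its true successor.
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        second, label = rng.choice(candidates), 0
    return f"[CLS] {first} [SEP] {second} [SEP]", label


sentences = [
    "The cat sat on the mat.",
    "It was a lazy afternoon.",
    "The stock market crashed.",
    "Rain fell all night.",
]
for seed in range(4):
    text, label = make_nsp_example(sentences, 0, random.Random(seed))
    print(label, text)
```

Roughly half of the examples come out as IsNext and half as NotNext, which keeps the binary task balanced.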

Special Tokens in BERT

BERT uses three special tokens you need to know:

[CLS]: Classification token. Always the first token. Its final hidden state is used as the sentence-level representation for classification tasks.

[SEP]: Separator token. Marks the end of a sentence or separates two sentences in pairs.

[PAD]: Padding token. Used to make all sequences in a batch the same length.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The cat sat on the mat."
tokens = tokenizer(text)

print(f"Input IDs:      {tokens['input_ids']}")
print(f"Token type IDs: {tokens['token_type_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print()

decoded = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokens: {decoded}")

Output:

Input IDs:      [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102]
Token type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]

Tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']

101 is [CLS]. 102 is [SEP]. Every BERT input starts with [CLS] and ends with [SEP].

text_pair = ("The cat sat on the mat.", "It was a lazy afternoon.")
tokens_pair = tokenizer(*text_pair)

decoded_pair = tokenizer.convert_ids_to_tokens(tokens_pair['input_ids'])
print(f"Pair tokens: {decoded_pair}")
print(f"Token types: {tokens_pair['token_type_ids']}")

Output:

Pair tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', 'was', 'a', 'lazy', 'afternoon', '.', '[SEP]']
Token types: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

Token type 0 = first sentence. Token type 1 = second sentence. BERT uses this to distinguish the two.
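
The layout the tokenizer produces for a sentence pair can be reproduced by hand. The sketch below works on already-split token strings and also shows how [PAD] interacts with the attention mask. `build_pair_inputs` is an illustrative helper written for this post, not part of the transformers library.

```python
def build_pair_inputs(tokens_a, tokens_b, max_len):
    """Assemble [CLS] A [SEP] B [SEP] and pad to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_len; truncate first")
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad   # padding positions get segment 0
    attention_mask += [0] * pad   # 0 tells the model to ignore padding
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = build_pair_inputs(["the", "cat"], ["it", "slept"], max_len=8)
print(tokens)    # ['[CLS]', 'the', 'cat', '[SEP]', 'it', 'slept', '[SEP]', '[PAD]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask is what keeps padded positions from influencing the real tokens, which is why it is passed to the model alongside the input IDs in every example that follows.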

BERT Model Variants

bert-base-uncased:  12 layers, 768 hidden, 12 heads, 110M params
bert-large-uncased: 24 layers, 1024 hidden, 16 heads, 340M params
bert-base-cased:    Same as base but case-sensitive tokenization
distilbert-base:    6 layers, 66M params, 97% of BERT performance, 60% faster
roberta-base:       BERT without NSP, trained longer, better performance

For most tasks, start with bert-base-uncased or distilbert-base-uncased. Only go larger if you need the extra capacity.

Task 1: Text Classification With BERT

The most common use of BERT. Add a linear layer on top of the [CLS] token output.

from transformers import BertForSequenceClassification, BertTokenizer
from torch.utils.data import DataLoader, Dataset
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

texts = [
    "This movie was absolutely fantastic!",
    "I hated every minute of it.",
    "An incredible performance by the lead actor.",
    "Terrible writing, terrible acting.",
    "One of the best films I've seen this year.",
    "Complete waste of time and money.",
    "Beautifully crafted and deeply moving.",
    "Boring and predictable from start to finish.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        # Tokenize everything up front; padding=True pads to the longest text
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_len,
            return_tensors='pt'
        )
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids':      self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels':         self.labels[idx]
        }

dataset = SentimentDataset(texts, labels, tokenizer)
loader  = DataLoader(dataset, batch_size=4, shuffle=True)

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

device    = 'cuda' if torch.cuda.is_available() else 'cpu'
model     = model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(loader) * 3  # batches per epoch * epochs
)

print("Fine-tuning BERT for sentiment classification...")
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in loader:
        optimizer.zero_grad()
        input_ids      = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels         = batch['labels'].to(device)

        # Passing labels makes the model compute the cross-entropy loss itself
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

model.eval()
new_texts = [
    "I absolutely loved this film!",
    "This was the worst movie I have ever seen."
]

new_encoding = tokenizer(
    new_texts, truncation=True, padding=True,
    max_length=64, return_tensors='pt'
).to(device)

with torch.no_grad():
    outputs = model(**new_encoding)
    preds   = torch.argmax(outputs.logits, dim=1)

for text, pred in zip(new_texts, preds):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{text[:50]}...' -> {sentiment}")

Output:

Fine-tuning BERT for sentiment classification...
Epoch 1: loss=0.6834
Epoch 2: loss=0.4123
Epoch 3: loss=0.2187
'I absolutely loved this film!...' -> Positive
'This was the worst movie I have ever seen....' -> Negative

What Happens Inside During Fine-Tuning

from transformers import BertModel
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, n_classes, dropout=0.3):
        super().__init__()
        self.bert    = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, n_classes)  # 768 = bert-base hidden size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
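        )

        # pooler_output is the final [CLS] hidden state passed through
        # BERT's pooling layer (linear + tanh)
        cls_output = outputs.pooler_output      # (batch, 768)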
        cls_output = self.dropout(cls_output)
        logits     = self.classifier(cls_output) # (batch, n_classes)

        return logits

model_manual = BertClassifier(n_classes=2)

total    = sum(p.numel() for p in model_manual.parameters())
trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,}")
print()

for param in model_manual.bert.parameters():
    param.requires_grad = False

frozen_trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Trainable (head only): {frozen_trainable:,}")
print("(Only the linear classification head is being trained)")

Output:

Total parameters:     109,483,778
Trainable parameters: 109,483,778

Trainable (head only): 1,538
(Only the linear classification head is being trained)

When you fine-tune the entire BERT, all 109M parameters update. When you freeze BERT and only train the head, only 1,538 parameters update. Freezing is faster but usually less accurate. Fine-tuning everything gives better results when you have enough data.
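
The 1,538 figure is easy to verify by hand: a linear layer has one weight per input-output pair plus one bias per output. A tiny sketch, where `linear_params` is a helper defined here for illustration and the 7-label case refers to a BIO tag set with three entity types plus O:

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


HIDDEN = 768  # bert-base hidden size

print(linear_params(HIDDEN, 2))  # binary sentiment head: 768*2 + 2 = 1538
print(linear_params(HIDDEN, 7))  # 7-label tagging head:  768*7 + 7 = 5383
```

Either way the head is a rounding error next to the encoder, which is why freezing BERT makes each training step so much cheaper.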

Task 2: Named Entity Recognition (NER)

NER classifies each token into an entity type such as Person, Organization, or Location, or marks it as Other (O). It's a token-level classification task, not sentence-level.

from transformers import BertForTokenClassification, BertTokenizerFast

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id   = {l: i for i, l in enumerate(label_list)}
id2label   = {i: l for i, l in enumerate(label_list)}

ner_model = BertForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

tokenizer_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

sentence = "Elon Musk founded Tesla in California."
# Pre-split into words so each word lines up with exactly one label
# (sentence.split() would glue the final period onto "California.")
words       = ['Elon', 'Musk', 'founded', 'Tesla', 'in', 'California', '.']
word_labels = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC', 'O']

encoding = tokenizer_fast(
    words,
    is_split_into_words=True,
    return_offsets_mapping=True,
    padding=True,
    truncation=True
)

# word_ids() maps every token back to the index of the word it came from
word_ids    = encoding.word_ids()
token_labels = []
prev_word_id = None

for word_id in word_ids:
    if word_id is None:
        token_labels.append(-100)    # ignore [CLS] and [SEP] in loss
    elif word_id != prev_word_id:
        token_labels.append(label2id[word_labels[word_id]])  # first subword
    else:
        token_labels.append(-100)    # subsequent subwords: ignore
    prev_word_id = word_id

tokens = tokenizer_fast.convert_ids_to_tokens(encoding['input_ids'])
print("Token -> Label alignment:")
for token, label_id in zip(tokens, token_labels):
    label = id2label.get(label_id, 'IGN')
    print(f"  {token:<15} {label}")

Output:

Token -> Label alignment:
  [CLS]           IGN
  elon            B-PER
  mu              I-PER
  ##sk            IGN
  founded         O
  tesla           B-ORG
  in              O
  california      B-LOC
  .               O
  [SEP]           IGN

"elon" maps to B-PER. "mu", the first subword of "Musk", carries that word's I-PER label, while the continuation piece "##sk" gets -100 and is ignored in the loss. The exact subword split depends on the tokenizer's vocabulary. Labeling only the first subword of each word is a standard way to handle subword tokenization for token-level tasks.
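
At prediction time the process runs in reverse: the per-word BIO tags have to be merged back into entity spans. A minimal sketch in plain Python, assuming one predicted tag per word. `group_entities` is a hypothetical helper for illustration and is much simpler than what a production pipeline does.

```python
def group_entities(words, tags):
    """Merge word-level BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []

    def flush():
        # Close the entity that is currently open, if any.
        if current_words:
            entities.append((current_type, " ".join(current_words)))

    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)      # continue the open entity
            continue
        flush()
        if prefix in ("B", "I"):
            # "B-" starts a new entity; a stray "I-" is treated the same way.
            current_type, current_words = entity_type, [word]
        else:
            current_type, current_words = None, []   # "O": outside any entity
    flush()
    return entities


words = ['Elon', 'Musk', 'founded', 'Tesla', 'in', 'California', '.']
tags  = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC', 'O']
print(group_entities(words, tags))
# [('PER', 'Elon Musk'), ('ORG', 'Tesla'), ('LOC', 'California')]
```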

Task 3: Question Answering

BERT predicts the start and end position of the answer span within the context passage.

from transformers import BertForQuestionAnswering, BertTokenizer
import torch

qa_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
qa_model     = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

def answer_question(question, context):
    inputs = qa_tokenizer(
        question, context,
        return_tensors='pt',
        truncation=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = qa_model(**inputs)

    start_logits = outputs.start_logits
    end_logits   = outputs.end_logits

    start_idx = torch.argmax(start_logits)
    end_idx   = torch.argmax(end_logits) + 1

    tokens = qa_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    answer = qa_tokenizer.convert_tokens_to_string(tokens[start_idx:end_idx])

    return answer

context = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
Constructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair, it was initially
criticized by some of France's leading artists and intellectuals but has become a global
cultural icon of France and one of the most recognisable structures in the world.
"""

questions = [
    "Where is the Eiffel Tower located?",
    "Who designed the Eiffel Tower?",
    "When was the Eiffel Tower built?",
]

for q in questions:
    answer = answer_question(q, context)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()

Output:

Q: Where is the Eiffel Tower located?
A: Champ de Mars in Paris, France

Q: Who designed the Eiffel Tower?
A: Gustave Eiffel

Q: When was the Eiffel Tower built?
A: 1887 to 1889

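A pretrained BERT fine-tuned on SQuAD (Stanford Question Answering Dataset) extracts answers directly from the context. No generation. Just span extraction. Because this checkpoint is uncased and the answer is rebuilt from tokens, expect the raw strings to come back lowercased with spaces around punctuation (for example "gustave eiffel"), so they can differ in casing and spacing from the output shown above.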

The Fastest Way: HuggingFace Pipeline

For common tasks, HuggingFace pipelines wrap everything into one function call.

from transformers import pipeline

sentiment = pipeline('sentiment-analysis')
results = sentiment([
    "I absolutely loved this product!",
    "Terrible quality, fell apart after a day.",
    "It's okay, nothing special."
])
for r in results:
    print(f"{r['label']:<10} {r['score']:.3f}")

print()

ner = pipeline('ner', aggregation_strategy='simple')  # merge subword pieces into whole entities
text = "Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:<8} {e['word']:<25} score={e['score']:.3f}")

print()

qa = pipeline('question-answering')
result = qa(
    question="Who is the CEO of Apple?",
    context="Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
)
print(f"Answer: {result['answer']}  (score: {result['score']:.3f})")

print()

classifier = pipeline('zero-shot-classification')
text = "The government announced new economic policies today."
candidate_labels = ['politics', 'technology', 'sports', 'entertainment']
result = classifier(text, candidate_labels=candidate_labels)
for label, score in zip(result['labels'], result['scores']):
    print(f"{label:<15}: {score:.3f}")

Output:

POSITIVE   0.999
NEGATIVE   0.998
NEGATIVE   0.612

ORG      Apple                     score=0.998
PER      Tim Cook                  score=0.997
LOC      Cupertino                 score=0.986

Answer: Tim Cook  (score: 0.998)

politics       : 0.942
technology     : 0.031
entertainment  : 0.017
sports         : 0.010

Fine-Tuning Tips for BERT

Learning rate: BERT is sensitive. Use 2e-5 to 5e-5. Lower than typical deep learning.

Batch size: 16 or 32. Larger batches work better for BERT.

Epochs: 2 to 4 epochs. BERT fine-tunes quickly. More epochs usually cause overfitting.

Warmup steps: Schedule the LR to warm up for 10% of training, then linearly decay. Helps stability.

Gradient clipping: Clip at 1.0 to prevent exploding gradients.

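from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS         = 3
LEARNING_RATE  = 2e-5
WARMUP_RATIO   = 0.1

# `loader` is the training DataLoader and `model` the model from the classification example
total_steps   = len(loader) * EPOCHS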
warmup_steps  = int(total_steps * WARMUP_RATIO)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print(f"Total training steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
print(f"Peak LR: {LEARNING_RATE}, then linear decay to 0")
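
To see what the schedule does to the learning rate without running a training loop, the warmup-then-decay shape can be written as a plain function. This is a sketch intended to mirror the shape of a linear warmup schedule. Treat it as an illustration rather than the library's own code. The name `linear_warmup_decay` is made up for this example.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup: grow from 0 to peak_lr over warmup_steps.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: shrink from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL, WARMUP, PEAK = 100, 10, 2e-5
for step in (0, 5, 10, 55, 100):
    lr = linear_warmup_decay(step, TOTAL, WARMUP, PEAK)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With 100 steps and 10% warmup, the rate climbs to 2e-5 by step 10 and is back at zero on the final step.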

BERT vs RoBERTa vs DistilBERT

Model            Params  Speed   Accuracy  Notes
---------------  ------  -----   --------  -----
bert-base        110M    1x      baseline  Original, safe choice
bert-large       340M    0.4x    +2-3%     Slower, better accuracy
roberta-base     125M    1x      +1-2%     Better pretraining, no NSP
distilbert-base   66M    1.6x    -3%       Great for production
albert-base       12M    0.9x    ~same     Much fewer params via sharing

For most projects: start with distilbert-base-uncased for speed, switch to roberta-base for accuracy.

Quick Cheat Sheet

Task                 Model                          Code
-------------------  -----------------------------  ------------------------------------
Text classification  BertForSequenceClassification  pipeline('sentiment-analysis')
NER                  BertForTokenClassification     pipeline('ner')
QA                   BertForQuestionAnswering       pipeline('question-answering')
Zero-shot            NLI model                      pipeline('zero-shot-classification')
Custom               BertModel + linear head        outputs.pooler_output

Setting              Value
-------------------  ------------------
Learning rate        2e-5 to 5e-5
Batch size           16 or 32
Epochs               2 to 4
Max sequence length  128 to 512
Warmup steps         10% of total steps

Practice Challenges

Level 1:

Use pipeline('sentiment-analysis') on 20 movie reviews you write yourself (10 positive, 10 negative). Print each prediction and confidence score. Where does it get confused?

Level 2:

Fine-tune distilbert-base-uncased on any small classification dataset (you can use load_dataset('imdb') from HuggingFace). Train for 3 epochs. Compare accuracy to a TF-IDF + LogisticRegression baseline from Post 62. How much better is BERT?

Level 3:

Use BertForTokenClassification to tag a paragraph of news text with NER labels. Then visualize the output by color-coding each entity type in the text. Use the fine-tuned dslim/bert-base-NER model from HuggingFace hub.

References

  • BERT paper: Pre-training of Deep Bidirectional Transformers
  • The Illustrated BERT (Jay Alammar)
  • HuggingFace: BERT docs
  • HuggingFace: Fine-tuning tutorial
  • RoBERTa paper

Next up, Post 93: GPT: The Model That Predicts the Next Word Forever. Autoregressive generation, temperature and sampling strategies, and how a simple next-token prediction objective produces models that can write, code, and reason.
