{"slug": "92-bert-the-model-that-reads-in-both-directions", "title": "92. BERT: The Model That Reads in Both Directions", "summary": "BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model that reads all tokens in a sentence simultaneously, using masked language modeling (MLM) and next sentence prediction (NSP) to achieve bidirectional understanding. Unlike GPT, which reads left-to-right and excels at text generation, BERT is optimized for understanding tasks such as classification and has dominated NLP benchmarks since its 2018 release. The model uses special tokens like [CLS] for classification and [SEP] for separating sentences, with popular variants including bert-base-uncased (110M parameters) and the more efficient DistilBERT.", "body_md": "GPT generates text by predicting the next word. It reads left to right.\n\nBERT does something different. It masks random words in a sentence and tries to predict what they are. To do that well, it has to understand every word in relation to every other word simultaneously. Left and right context both matter.\n\nThat bidirectional understanding is why BERT dominated NLP benchmarks when it came out in 2018, and why encoder-only transformers are still the go-to for understanding tasks.\n\n### What You'll Learn Here\n\n- What makes BERT different from GPT\n- Masked Language Modeling: how BERT learns\n- Next Sentence Prediction: the second pretraining task\n- The [CLS] and [SEP] tokens and what they do\n- Fine-tuning BERT for text classification\n- Fine-tuning for Named Entity Recognition\n- Fine-tuning for Question Answering\n- Using HuggingFace to do all of this in under 20 lines\n\n### BERT vs GPT: The Key Difference\n\nBoth are transformer-based. The architecture is similar. 
The difference is in how they're pretrained and which part of the transformer they use.\n\n```\nGPT (decoder-only):\n  - Reads left to right with causal masking\n  - Trained to predict the next token\n  - Great at generation\n  - Context: only left side available\n\nBERT (encoder-only):\n  - Reads all tokens simultaneously\n  - Trained to predict masked tokens + next sentence\n  - Great at understanding\n  - Context: both left and right sides available\n```\n\nFor classification tasks, BERT wins. For generation tasks, GPT wins. For most NLP applications you actually want to build, BERT is the starting point.\n\n### How BERT Was Pretrained\n\nBERT was pretrained on two tasks simultaneously on a massive corpus (BooksCorpus + English Wikipedia, 3.3 billion words).\n\n**Task 1: Masked Language Modeling (MLM)**\n\n15% of tokens are randomly masked. The model predicts the original token from context.\n\n```\nInput:  \"The cat [MASK] on the [MASK]\"\nTarget: \"The cat sat  on the mat\"\n```\n\nOf the 15% selected tokens:\n\n- 80% replaced with [MASK]\n- 10% replaced with a random token\n- 10% left unchanged\n\nThe random and unchanged cases prevent the model from only learning to predict [MASK] tokens.\n\n**Task 2: Next Sentence Prediction (NSP)**\n\nTwo sentences are given. The model predicts whether sentence B actually follows sentence A in the original text.\n\n```\nInput:   [CLS] The cat sat on the mat. [SEP] It was a lazy afternoon. [SEP]\nLabel:   IsNext (1)\n\nInput:   [CLS] The cat sat on the mat. [SEP] The stock market crashed. [SEP]\nLabel:   NotNext (0)\n```\n\nNSP was later found to be less useful than MLM and was dropped in RoBERTa. But it's part of the original BERT.\n\n### Special Tokens in BERT\n\nBERT uses three special tokens you need to know:\n\n**[CLS]:** Classification token. Always the first token. Its final hidden state is used as the sentence-level representation for classification tasks.\n\n**[SEP]:** Separator token. 
Marks the end of a sentence or separates two sentences in pairs.\n\n**[PAD]:** Padding token. Used to make all sequences in a batch the same length.\n\n``` python\nfrom transformers import BertTokenizer\n\ntokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\ntext = \"The cat sat on the mat.\"\ntokens = tokenizer(text)\n\nprint(f\"Input IDs:      {tokens['input_ids']}\")\nprint(f\"Token type IDs: {tokens['token_type_ids']}\")\nprint(f\"Attention mask: {tokens['attention_mask']}\")\nprint()\n\n# Decode back to see what they are\ndecoded = tokenizer.convert_ids_to_tokens(tokens['input_ids'])\nprint(f\"Tokens: {decoded}\")\n```\n\nOutput:\n\n```\nInput IDs:      [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102]\nToken type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0]\nAttention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]\n\nTokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']\n```\n\n101 is [CLS]. 102 is [SEP]. Every BERT input starts with [CLS] and ends with [SEP].\n\n``` python\n# Two sentences\ntext_pair = (\"The cat sat on the mat.\", \"It was a lazy afternoon.\")\ntokens_pair = tokenizer(*text_pair)\n\ndecoded_pair = tokenizer.convert_ids_to_tokens(tokens_pair['input_ids'])\nprint(f\"Pair tokens: {decoded_pair}\")\nprint(f\"Token types: {tokens_pair['token_type_ids']}\")\n```\n\nOutput:\n\n```\nPair tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', 'was', 'a', 'lazy', 'afternoon', '.', '[SEP]']\nToken types: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]\n```\n\nToken type 0 = first sentence. Token type 1 = second sentence. 
BERT uses this to distinguish the two.\n\n### BERT Model Variants\n\n```\nbert-base-uncased:  12 layers, 768 hidden, 12 heads, 110M params\nbert-large-uncased: 24 layers, 1024 hidden, 16 heads, 340M params\nbert-base-cased:    Same as base but case-sensitive tokenization\ndistilbert-base:    6 layers, 66M params, 97% of BERT performance, 60% faster\nroberta-base:       BERT without NSP, trained longer, better performance\n```\n\nFor most tasks, start with `bert-base-uncased` or `distilbert-base-uncased`. Only go larger if you need the extra capacity.\n\n### Task 1: Text Classification With BERT\n\nThe most common use of BERT. Add a linear layer on top of the [CLS] token output.\n\n``` python\nfrom transformers import BertForSequenceClassification, BertTokenizer\nfrom torch.utils.data import DataLoader, Dataset\nimport torch\nimport torch.nn as nn\nfrom torch.optim import AdamW\nfrom transformers import get_linear_schedule_with_warmup\n\n# Simple sentiment dataset\ntexts = [\n    \"This movie was absolutely fantastic!\",\n    \"I hated every minute of it.\",\n    \"An incredible performance by the lead actor.\",\n    \"Terrible writing, terrible acting.\",\n    \"One of the best films I've seen this year.\",\n    \"Complete waste of time and money.\",\n    \"Beautifully crafted and deeply moving.\",\n    \"Boring and predictable from start to finish.\",\n]\nlabels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative\n\n# Tokenize\ntokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\nclass SentimentDataset(Dataset):\n    def __init__(self, texts, labels, tokenizer, max_len=64):\n        self.encodings = tokenizer(\n            texts,\n            truncation=True,\n            padding=True,\n            max_length=max_len,\n            return_tensors='pt'\n        )\n        self.labels = torch.tensor(labels)\n\n    def __len__(self):\n        return len(self.labels)\n\n    def __getitem__(self, idx):\n        return {\n            'input_ids':     
 self.encodings['input_ids'][idx],\n            'attention_mask': self.encodings['attention_mask'][idx],\n            'labels':         self.labels[idx]\n        }\n\ndataset = SentimentDataset(texts, labels, tokenizer)\nloader  = DataLoader(dataset, batch_size=4, shuffle=True)\n\n# Load pretrained BERT with classification head\nmodel = BertForSequenceClassification.from_pretrained(\n    'bert-base-uncased',\n    num_labels=2\n)\n\ndevice    = 'cuda' if torch.cuda.is_available() else 'cpu'\nmodel     = model.to(device)\noptimizer = AdamW(model.parameters(), lr=2e-5)\nscheduler = get_linear_schedule_with_warmup(\n    optimizer,\n    num_warmup_steps=0,\n    num_training_steps=len(loader) * 3\n)\n\n# Fine-tune\nprint(\"Fine-tuning BERT for sentiment classification...\")\nfor epoch in range(3):\n    model.train()\n    total_loss = 0\n    for batch in loader:\n        optimizer.zero_grad()\n        input_ids      = batch['input_ids'].to(device)\n        attention_mask = batch['attention_mask'].to(device)\n        labels         = batch['labels'].to(device)\n\n        outputs = model(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            labels=labels\n        )\n        loss = outputs.loss\n        loss.backward()\n        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n        optimizer.step()\n        scheduler.step()\n        total_loss += loss.item()\n\n    print(f\"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}\")\n\n# Predict on new examples\nmodel.eval()\nnew_texts = [\n    \"I absolutely loved this film!\",\n    \"This was the worst movie I have ever seen.\"\n]\n\nnew_encoding = tokenizer(\n    new_texts, truncation=True, padding=True,\n    max_length=64, return_tensors='pt'\n).to(device)\n\nwith torch.no_grad():\n    outputs = model(**new_encoding)\n    preds   = torch.argmax(outputs.logits, dim=1)\n\nfor text, pred in zip(new_texts, preds):\n    sentiment = \"Positive\" if pred == 1 else \"Negative\"\n    
print(f\"'{text[:50]}...' -> {sentiment}\")\n```\n\nOutput:\n\n```\nFine-tuning BERT for sentiment classification...\nEpoch 1: loss=0.6834\nEpoch 2: loss=0.4123\nEpoch 3: loss=0.2187\n'I absolutely loved this film!...' -> Positive\n'This was the worst movie I have ever seen....' -> Negative\n```\n\n### What Happens Inside During Fine-Tuning\n\n``` python\n# Look at what BertForSequenceClassification adds\nfrom transformers import BertModel\nimport torch.nn as nn\n\nclass BertClassifier(nn.Module):\n    def __init__(self, n_classes, dropout=0.3):\n        super().__init__()\n        self.bert    = BertModel.from_pretrained('bert-base-uncased')\n        self.dropout = nn.Dropout(dropout)\n        self.classifier = nn.Linear(768, n_classes)  # 768 = bert-base hidden size\n\n    def forward(self, input_ids, attention_mask):\n        outputs = self.bert(\n            input_ids=input_ids,\n            attention_mask=attention_mask\n        )\n\n        # outputs.last_hidden_state: (batch, seq_len, 768)\n        # outputs.pooler_output: (batch, 768) - the [CLS] token, passed through a linear+tanh\n\n        cls_output = outputs.pooler_output      # (batch, 768)\n        cls_output = self.dropout(cls_output)\n        logits     = self.classifier(cls_output) # (batch, n_classes)\n\n        return logits\n\nmodel_manual = BertClassifier(n_classes=2)\n\n# Check what's trainable vs frozen\ntotal    = sum(p.numel() for p in model_manual.parameters())\ntrainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)\nprint(f\"Total parameters:     {total:,}\")\nprint(f\"Trainable parameters: {trainable:,}\")\nprint()\n\n# Often you freeze BERT layers and only train the head\nfor param in model_manual.bert.parameters():\n    param.requires_grad = False\n\nfrozen_trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)\nprint(f\"Trainable (head only): {frozen_trainable:,}\")\nprint(\"(Only the linear classification head is being 
trained)\")\n```\n\nOutput:\n\n```\nTotal parameters:     109,483,778\nTrainable parameters: 109,483,778\n\nTrainable (head only): 1,538\n(Only the linear classification head is being trained)\n```\n\nWhen you fine-tune the entire BERT, all 109M parameters update. When you freeze BERT and only train the head, only 1,538 parameters update (768 weights per class for 2 classes, plus 2 biases). Freezing is faster but usually less accurate. Fine-tuning everything gives better results when you have enough data.\n\n### Task 2: Named Entity Recognition (NER)\n\nNER classifies each token. Person, Organization, Location, Date, Other. It's a token-level classification task, not sentence-level.\n\n``` python\nfrom transformers import BertForTokenClassification, BertTokenizerFast\n\n# NER labels\nlabel_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']\nlabel2id   = {l: i for i, l in enumerate(label_list)}\nid2label   = {i: l for i, l in enumerate(label_list)}\n\n# Load NER model\nner_model = BertForTokenClassification.from_pretrained(\n    'bert-base-uncased',\n    num_labels=len(label_list),\n    id2label=id2label,\n    label2id=label2id\n)\n\ntokenizer_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')\n\n# Example: align word labels to subword tokens\nsentence = \"Elon Musk founded Tesla in California.\"\n# Pre-split so the final period is its own word; sentence.split() would\n# glue it to \"California.\" and leave 6 words for 7 labels\nwords    = ['Elon', 'Musk', 'founded', 'Tesla', 'in', 'California', '.']\nword_labels = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC', 'O']\n\n# Tokenize with word_ids to handle subwords\nencoding = tokenizer_fast(\n    words,\n    is_split_into_words=True,\n    return_offsets_mapping=True,\n    padding=True,\n    truncation=True\n)\n\n# Map word-level labels to subword-level\nword_ids    = encoding.word_ids()\ntoken_labels = []\nprev_word_id = None\n\nfor word_id in word_ids:\n    if word_id is None:\n        token_labels.append(-100)    # ignore [CLS] and [SEP] in loss\n    elif word_id != prev_word_id:\n        token_labels.append(label2id[word_labels[word_id]])  # first subword\n    else:\n        token_labels.append(-100)    # subsequent 
subwords: ignore\n    prev_word_id = word_id\n\ntokens = tokenizer_fast.convert_ids_to_tokens(encoding['input_ids'])\nprint(\"Token -> Label alignment:\")\nfor token, label_id in zip(tokens, token_labels):\n    label = id2label.get(label_id, 'IGN')\n    print(f\"  {token:<15} {label}\")\n```\n\nOutput:\n\n```\nToken -> Label alignment:\n  [CLS]           IGN\n  elon            B-PER\n  mu              I-PER\n  ##sk            IGN\n  founded         O\n  tesla           B-ORG\n  in              O\n  california      B-LOC\n  .               O\n  [SEP]           IGN\n```\n\n\"Elon\" maps to B-PER. \"mu\", the first subword of \"Musk\", carries the word's I-PER label, while the continuation subword \"##sk\" is ignored in the loss. Labeling only the first subword of each word is the standard way to handle subword tokenization for token-level tasks.\n\n### Task 3: Question Answering\n\nBERT predicts the start and end position of the answer span within the context passage.\n\n``` python\nfrom transformers import BertForQuestionAnswering, BertTokenizer\nimport torch\n\n# Load pretrained QA model (already fine-tuned on SQuAD)\nqa_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\nqa_model     = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\ndef answer_question(question, context):\n    inputs = qa_tokenizer(\n        question, context,\n        return_tensors='pt',\n        truncation=True,\n        max_length=512\n    )\n\n    with torch.no_grad():\n        outputs = qa_model(**inputs)\n\n    start_logits = outputs.start_logits\n    end_logits   = outputs.end_logits\n\n    # Find best start and end positions\n    start_idx = torch.argmax(start_logits)\n    end_idx   = torch.argmax(end_logits) + 1\n\n    tokens = qa_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])\n    answer = qa_tokenizer.convert_tokens_to_string(tokens[start_idx:end_idx])\n\n    return answer\n\n# Test it\ncontext = \"\"\"\nThe Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in 
Paris, France.\nIt is named after the engineer Gustave Eiffel, whose company designed and built the tower.\nConstructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair, it was initially\ncriticized by some of France's leading artists and intellectuals but has become a global\ncultural icon of France and one of the most recognisable structures in the world.\n\"\"\"\n\nquestions = [\n    \"Where is the Eiffel Tower located?\",\n    \"Who designed the Eiffel Tower?\",\n    \"When was the Eiffel Tower built?\",\n]\n\nfor q in questions:\n    answer = answer_question(q, context)\n    print(f\"Q: {q}\")\n    print(f\"A: {answer}\")\n    print()\n```\n\nOutput:\n\n```\nQ: Where is the Eiffel Tower located?\nA: Champ de Mars in Paris, France\n\nQ: Who designed the Eiffel Tower?\nA: Gustave Eiffel\n\nQ: When was the Eiffel Tower built?\nA: 1887 to 1889\n```\n\nA pretrained BERT fine-tuned on SQuAD (Stanford Question Answering Dataset) extracts answers directly from context. No generation. 
Just span extraction.\n\n### The Fastest Way: HuggingFace Pipeline\n\nFor common tasks, HuggingFace pipelines wrap everything into one function call.\n\n``` python\nfrom transformers import pipeline\n\n# Sentiment analysis (the default checkpoint is a DistilBERT fine-tuned on SST-2)\nsentiment = pipeline('sentiment-analysis')\nresults = sentiment([\n    \"I absolutely loved this product!\",\n    \"Terrible quality, fell apart after a day.\",\n    \"It's okay, nothing special.\"\n])\nfor r in results:\n    print(f\"{r['label']:<10} {r['score']:.3f}\")\n\nprint()\n\n# Named Entity Recognition\n# aggregation_strategy='simple' merges subword pieces into whole entities\n# (it replaces the deprecated grouped_entities=True)\nner = pipeline('ner', aggregation_strategy='simple')\ntext = \"Apple CEO Tim Cook announced a new product at their Cupertino headquarters.\"\nentities = ner(text)\nfor e in entities:\n    print(f\"{e['entity_group']:<8} {e['word']:<25} score={e['score']:.3f}\")\n\nprint()\n\n# Question Answering\nqa = pipeline('question-answering')\nresult = qa(\n    question=\"Who is the CEO of Apple?\",\n    context=\"Apple CEO Tim Cook announced a new product at their Cupertino headquarters.\"\n)\nprint(f\"Answer: {result['answer']}  (score: {result['score']:.3f})\")\n\nprint()\n\n# Zero-shot classification (no fine-tuning needed)\nclassifier = pipeline('zero-shot-classification')\ntext = \"The government announced new economic policies today.\"\ncandidate_labels = ['politics', 'technology', 'sports', 'entertainment']\nresult = classifier(text, candidate_labels=candidate_labels)\nfor label, score in zip(result['labels'], result['scores']):\n    print(f\"{label:<15}: {score:.3f}\")\n```\n\nOutput:\n\n```\nPOSITIVE   0.999\nNEGATIVE   0.998\nNEGATIVE   0.612\n\nORG      Apple                     score=0.998\nPER      Tim Cook                  score=0.997\nLOC      Cupertino                 score=0.986\n\nAnswer: Tim Cook  (score: 0.998)\n\npolitics       : 0.942\ntechnology     : 0.031\nentertainment  : 0.017\nsports         : 0.010\n```\n\n### Fine-Tuning Tips for BERT\n\n**Learning rate:** BERT is sensitive. Use 2e-5 to 5e-5. 
That is lower than the learning rates typical elsewhere in deep learning.\n\n**Batch size:** 16 or 32. Larger batches work better for BERT.\n\n**Epochs:** 2 to 4 epochs. BERT fine-tunes quickly. More epochs usually cause overfitting.\n\n**Warmup steps:** Schedule the LR to warm up for 10% of training, then linearly decay. Helps stability.\n\n**Gradient clipping:** Clip at 1.0 to prevent exploding gradients.\n\n``` python\n# Standard fine-tuning setup\n# (reuses `model` and `loader` from the classification example above)\nfrom torch.optim import AdamW\nfrom transformers import get_linear_schedule_with_warmup\n\nEPOCHS         = 3\nLEARNING_RATE  = 2e-5\nWARMUP_RATIO   = 0.1\n\ntotal_steps   = len(loader) * EPOCHS\nwarmup_steps  = int(total_steps * WARMUP_RATIO)\n\noptimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)\nscheduler = get_linear_schedule_with_warmup(\n    optimizer,\n    num_warmup_steps=warmup_steps,\n    num_training_steps=total_steps\n)\n\nprint(f\"Total training steps: {total_steps}\")\nprint(f\"Warmup steps: {warmup_steps}\")\nprint(f\"Peak LR: {LEARNING_RATE}, then linear decay to 0\")\n```\n\n### BERT vs RoBERTa vs DistilBERT\n\n```\nModel            Params  Speed   Accuracy  Notes\n-----------      ------  -----   --------  -----\nbert-base        110M    1x      baseline  Original, safe choice\nbert-large       340M    0.4x    +2-3%     Slower, better accuracy\nroberta-base     125M    1x      +1-2%     Better pretraining, no NSP\ndistilbert-base   66M    1.6x    -3%       Great for production\nalbert-base       12M    0.9x    ~same     Much fewer params via sharing\n```\n\nFor most projects: start with `distilbert-base-uncased` for speed, switch to `roberta-base` for accuracy.\n\n### Quick Cheat Sheet\n\n| Task | Model | Code |\n|---|---|---|\n| Text classification | BertForSequenceClassification | `pipeline('sentiment-analysis')` |\n| NER | BertForTokenClassification | `pipeline('ner')` |\n| QA | BertForQuestionAnswering | `pipeline('question-answering')` |\n| Zero-shot | NLI model | `pipeline('zero-shot-classification')` |\n| Custom | BertModel + linear head | 
`outputs.pooler_output` |\n\n| Setting | Value |\n|---|---|\n| Learning rate | 2e-5 to 5e-5 |\n| Batch size | 16 or 32 |\n| Epochs | 2 to 4 |\n| Max sequence length | 128 to 512 |\n| Warmup steps | 10% of total steps |\n\n### Practice Challenges\n\n**Level 1:**\n\nUse `pipeline('sentiment-analysis')` on 20 movie reviews you write yourself (10 positive, 10 negative). Print each prediction and confidence score. Where does it get confused?\n\n**Level 2:**\n\nFine-tune `distilbert-base-uncased` on any small classification dataset (you can use `load_dataset('imdb')` from HuggingFace). Train for 3 epochs. Compare accuracy to a TF-IDF + LogisticRegression baseline from Post 62. How much better is BERT?\n\n**Level 3:**\n\nUse `BertForTokenClassification` to tag a paragraph of news text with NER labels. Then visualize the output by color-coding each entity type in the text. Use the fine-tuned `dslim/bert-base-NER` model from the HuggingFace Hub.\n\n### References\n\n- [BERT paper: Pre-training of Deep Bidirectional Transformers](https://arxiv.org/abs/1810.04805)\n- [The Illustrated BERT (Jay Alammar)](https://jalammar.github.io/illustrated-bert/)\n- [HuggingFace: BERT docs](https://huggingface.co/docs/transformers/model_doc/bert)\n- [HuggingFace: Fine-tuning tutorial](https://huggingface.co/docs/transformers/training)\n- [RoBERTa paper](https://arxiv.org/abs/1907.11692)\n\nNext up, Post 93: GPT: The Model That Predicts the Next Word Forever. 
Autoregressive generation, temperature and sampling strategies, and how a simple next-token prediction objective produces models that can write, code, and reason.", "url": "https://wpnews.pro/news/92-bert-the-model-that-reads-in-both-directions", "canonical_source": "https://dev.to/yakhilesh/92-bert-the-model-that-reads-in-both-directions-2cga", "published_at": "2026-05-20 22:26:02+00:00", "updated_at": "2026-05-20 22:32:38.856354+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "research"], "entities": ["BERT", "GPT", "BooksCorpus", "English Wikipedia"], "alternates": {"html": "https://wpnews.pro/news/92-bert-the-model-that-reads-in-both-directions", "markdown": "https://wpnews.pro/news/92-bert-the-model-that-reads-in-both-directions.md", "text": "https://wpnews.pro/news/92-bert-the-model-that-reads-in-both-directions.txt", "jsonld": "https://wpnews.pro/news/92-bert-the-model-that-reads-in-both-directions.jsonld"}}