92. BERT: The Model That Reads in Both Directions

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer that attends to all tokens in its input at once. It is pretrained with masked language modeling (MLM) and next sentence prediction (NSP), which is what gives it bidirectional context. Unlike GPT, which reads left to right and excels at text generation, BERT is optimized for understanding tasks such as classification, and it dominated NLP benchmarks when it was released in 2018. The model uses special tokens such as [CLS] for classification and [SEP] for separating sentences. Popular variants include bert-base-uncased (110M parameters) and the smaller, faster DistilBERT.

GPT generates text by predicting the next word. It reads left to right.

BERT does something different. It masks random words in a sentence and tries to predict what they are. To do that well, it has to understand every word in relation to every other word simultaneously. Left and right context both matter.

That bidirectional understanding is why BERT dominated NLP benchmarks when it came out in 2018, and why encoder-only transformers are still the go-to for understanding tasks.

What You'll Learn Here

- What makes BERT different from GPT
- Masked Language Modeling: how BERT learns
- Next Sentence Prediction: the second pretraining task
- The [CLS] and [SEP] tokens and what they do
- Fine-tuning BERT for text classification
- Fine-tuning for Named Entity Recognition
- Fine-tuning for Question Answering
- Using HuggingFace to do all of this in under 20 lines

BERT vs GPT: The Key Difference

Both are transformer-based. The architecture is similar. The difference is in how they're pretrained and which part of the transformer they use.

GPT (decoder-only):

- Reads left to right with causal masking
- Trained to predict the next token
- Great at generation
- Context: only left side available

BERT (encoder-only):

- Reads all tokens simultaneously
- Trained to predict masked tokens + next sentence
- Great at understanding
- Context: both left and right sides available

For classification tasks, BERT wins. For generation tasks, GPT wins. For most NLP applications you actually want to build, BERT is the starting point.

How BERT Was Pretrained

BERT was pretrained on two tasks simultaneously on a massive corpus (BooksCorpus + English Wikipedia, 3.3 billion words).

Task 1: Masked Language Modeling (MLM)

15% of tokens are randomly masked. The model predicts the original token from context.
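The selection-and-corruption step can be sketched in plain Python. This is an illustrative toy, not BERT's actual preprocessing code: the function name `mask_tokens`, the tiny `VOCAB` list, and the fixed seed are assumptions made for the example, and it selects each token independently with probability 15% rather than picking an exact count. For each selected token it applies the replacement rule from the BERT paper: 80% become `[MASK]`, 10% become a random token, 10% stay unchanged.

```python
import random

# Toy vocabulary, used only for the "random token" case (an assumption for this sketch).
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "tree"]


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) using BERT-style MLM corruption.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a loss would only be computed where a label is present.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position must be predicted
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)  # not selected: ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
print(corrupted)
print(labels)
```

With a sentence this short, many runs select no token at all; raising `mask_prob` makes the three cases easier to see.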
```
Input:  "The cat [MASK] on the [MASK]"
Target: "The cat sat on the mat"
```

Of the 15% selected tokens:

- 80% replaced with [MASK]
- 10% replaced with a random token
- 10% left unchanged

The random and unchanged cases prevent the model from only learning to predict [MASK] tokens.

Task 2: Next Sentence Prediction (NSP)

Two sentences are given. The model predicts whether sentence B actually follows sentence A in the original text.

```
Input: [CLS] The cat sat on the mat. [SEP] It was a lazy afternoon. [SEP]
Label: IsNext (1)

Input: [CLS] The cat sat on the mat. [SEP] The stock market crashed. [SEP]
Label: NotNext (0)
```

NSP was later found to be less useful than MLM and was dropped in RoBERTa. But it's part of the original BERT.

Special Tokens in BERT

BERT uses three special tokens you need to know:

- [CLS]: Classification token. Always the first token. Its final hidden state is used as the sentence-level representation for classification tasks.
- [SEP]: Separator token. Marks the end of a sentence or separates two sentences in pairs.
- [PAD]: Padding token. Used to make all sequences in a batch the same length.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The cat sat on the mat."
tokens = tokenizer(text)

print(f"Input IDs:      {tokens['input_ids']}")
print(f"Token type IDs: {tokens['token_type_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print()

# Decode back to see what they are
decoded = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokens: {decoded}")
```

Output:

```
Input IDs:      [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102]
Token type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]

Tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']
```

101 is [CLS]. 102 is [SEP]. Every BERT input starts with [CLS] and ends with [SEP].

```python
# Two sentences (reuses the tokenizer loaded above)
text_pair = ("The cat sat on the mat.", "It was a lazy afternoon.")
tokens_pair = tokenizer(*text_pair)  # unpack so they are encoded as one pair, not a batch
decoded_pair = tokenizer.convert_ids_to_tokens(tokens_pair['input_ids'])

print(f"Pair tokens: {decoded_pair}")
print(f"Token types: {tokens_pair['token_type_ids']}")
```

Output:

```
Pair tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', 'was', 'a', 'lazy', 'afternoon', '.', '[SEP]']
Token types: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

Token type 0 = first sentence. Token type 1 = second sentence. BERT uses this to distinguish the two.

BERT Model Variants

- bert-base-uncased: 12 layers, 768 hidden, 12 heads, 110M params
- bert-large-uncased: 24 layers, 1024 hidden, 16 heads, 340M params
- bert-base-cased: Same as base but case-sensitive tokenization
- distilbert-base: 6 layers, 66M params, 97% of BERT performance, 60% faster
- roberta-base: BERT without NSP, trained longer, better performance

For most tasks, start with bert-base-uncased or distilbert-base-uncased. Only go larger if you need the extra capacity.

Task 1: Text Classification With BERT

The most common use of BERT. Add a linear layer on top of the [CLS] token output.
```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    get_linear_schedule_with_warmup,
)

# Simple sentiment dataset
texts = [
    "This movie was absolutely fantastic!",
    "I hated every minute of it.",
    "An incredible performance by the lead actor.",
    "Terrible writing, terrible acting.",
    "One of the best films I've seen this year.",
    "Complete waste of time and money.",
    "Beautifully crafted and deeply moving.",
    "Boring and predictable from start to finish.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Tokenize
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.encodings = tokenizer(
            texts, truncation=True, padding=True,
            max_length=max_len, return_tensors='pt'
        )
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels': self.labels[idx],
        }


dataset = SentimentDataset(texts, labels, tokenizer)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Load pretrained BERT with classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(loader) * 3  # batches per epoch * epochs
)

# Fine-tune
print("Fine-tuning BERT for sentiment classification...")
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels_batch = batch['labels'].to(device)

        # Passing labels makes the model compute the cross-entropy loss itself
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels_batch)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

# Predict on new examples
model.eval()
new_texts = [
    "I absolutely loved this film!",
    "This was the worst movie I have ever seen."
]
new_encoding = tokenizer(
    new_texts, truncation=True, padding=True,
    max_length=64, return_tensors='pt'
).to(device)

with torch.no_grad():
    outputs = model(**new_encoding)
    preds = torch.argmax(outputs.logits, dim=1)

for text, pred in zip(new_texts, preds):
    sentiment = "Positive" if pred.item() == 1 else "Negative"
    print(f"'{text[:50]}...' - {sentiment}")
```

Output (exact losses and predictions vary from run to run; with only eight training sentences this is a smoke test of the pipeline, not a usable classifier):

```
Fine-tuning BERT for sentiment classification...
Epoch 1: loss=0.6834
Epoch 2: loss=0.4123
Epoch 3: loss=0.2187
'I absolutely loved this film!...' - Positive
'This was the worst movie I have ever seen....' - Negative
```

What Happens Inside During Fine-Tuning

```python
# Look at what BertForSequenceClassification adds
from transformers import BertModel
import torch.nn as nn


class BertClassifier(nn.Module):
    def __init__(self, n_classes, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, n_classes)  # 768 = bert-base hidden size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # outputs.last_hidden_state: (batch, seq_len, 768)
        # outputs.pooler_output: (batch, 768), the [CLS] token passed through a linear+tanh
        cls_output = outputs.pooler_output     # (batch, 768)
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)   # (batch, n_classes)
        return logits


model_manual = BertClassifier(n_classes=2)

# Check what's trainable vs frozen
total = sum(p.numel() for p in model_manual.parameters())
trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,}")
print()

# Often you freeze BERT layers and only train the head
for param in model_manual.bert.parameters():
    param.requires_grad = False

frozen_trainable = sum(p.numel() for p in model_manual.parameters() if p.requires_grad)
print(f"Trainable (head only): {frozen_trainable:,}")
print("(Only the linear classification head is being trained)")
```

Output:

```
Total parameters:     109,483,778
Trainable parameters: 109,483,778

Trainable (head only): 1,538
(Only the linear classification head is being trained)
```

When you fine-tune the entire BERT, all 109M parameters update. When you freeze BERT and only train the head, only 1,538 parameters update: 768 × 2 weights plus 2 biases in the single linear layer (dropout has no parameters). Freezing is faster but usually less accurate. Fine-tuning everything gives better results when you have enough data.

Task 2: Named Entity Recognition (NER)

NER classifies each token: Person, Organization, Location, Date, Other. It's a token-level classification task, not sentence-level.
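To make "token-level" concrete, here is a small sketch of how per-word tags in the common BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity) are grouped back into entity spans. The helper name `bio_to_spans` is hypothetical and the tags are hand-written for illustration, not model predictions.

```python
def bio_to_spans(words, tags):
    """Group BIO tags into a list of (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any entity that is still open.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_words.append(word)
        else:
            # "O", or an I- tag that does not match the open entity: close it.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ["Elon", "Musk", "founded", "Tesla", "in", "California", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]
print(bio_to_spans(words, tags))
# [('PER', 'Elon Musk'), ('ORG', 'Tesla'), ('LOC', 'California')]
```

The `B-`/`I-` distinction is what lets two adjacent entities of the same type be told apart, which a plain per-token class label could not do.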
With HuggingFace, the main practical step is aligning word-level labels to BERT's subword tokens:

```python
from transformers import BertForTokenClassification, BertTokenizerFast

# NER labels
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}

# Load NER model (the token-classification head starts untrained
# and needs fine-tuning before its predictions mean anything)
ner_model = BertForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)
tokenizer_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Example: align word labels to subword tokens.
# Words are listed explicitly so the final period is its own word;
# "Elon Musk founded Tesla in California.".split() would leave it glued
# to "California." and give 6 words for 7 labels.
words = ["Elon", "Musk", "founded", "Tesla", "in", "California", "."]
word_labels = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC', 'O']

# Tokenize pre-split words so word_ids() can map subwords back to words
encoding = tokenizer_fast(
    words,
    is_split_into_words=True,
    return_offsets_mapping=True,
    padding=True,
    truncation=True
)

# Map word-level labels to subword-level
word_ids = encoding.word_ids()
token_labels = []
prev_word_id = None

for word_id in word_ids:
    if word_id is None:
        token_labels.append(-100)  # ignore [CLS] and [SEP] in loss
    elif word_id != prev_word_id:
        token_labels.append(label2id[word_labels[word_id]])  # first subword
    else:
        token_labels.append(-100)  # subsequent subwords: ignore
    prev_word_id = word_id

tokens = tokenizer_fast.convert_ids_to_tokens(encoding['input_ids'])
print("Token - Label alignment:")
for token, label_id in zip(tokens, token_labels):
    label = id2label.get(label_id, 'IGN')
    print(f"  {token:<15} {label}")
```

Output:

```
Token - Label alignment:
  [CLS]           IGN
  elon            B-PER
  mu              I-PER
  ##sk            IGN
  founded         O
  tesla           B-ORG
  in              O
  california      B-LOC
  .               O
  [SEP]           IGN
```

"Elon" maps to B-PER. "Musk" is split into "mu" and "##sk": the first subword carries the word's label (I-PER), and the continuation "##sk" is ignored in the loss. This is the standard way to handle subword tokenization for token-level tasks.

Task 3: Question Answering

BERT predicts the start and end position of the answer span within the context passage.
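Taking the argmax of the start scores and the argmax of the end scores independently is the simplest way to decode a span, but it can return an end that comes before the start. A common refinement is to pick the pair with the highest combined score subject to start <= end and a maximum answer length. The sketch below shows that idea in plain Python; the helper name `best_span`, the `max_answer_len` default, and the logit values are made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end), end inclusive, maximizing
    start_logits[start] + end_logits[end] with start <= end
    and at most max_answer_len tokens in the span."""
    best_score, best = float("-inf"), (0, 0)
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best


tokens = ["[CLS]", "who", "built", "it", "?", "[SEP]",
          "gustave", "eiffel", "built", "it", "[SEP]"]
# Hand-made scores: the best end on its own (index 5) lies before the best start (index 6).
start_logits = [0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 4.0, 1.0, 0.3, 0.1, 0.0]
end_logits = [0.1, 0.0, 0.1, 0.0, 0.0, 5.0, 0.5, 4.5, 0.2, 0.1, 0.0]

start, end = best_span(start_logits, end_logits)
print((start, end))                       # (6, 7)
print(" ".join(tokens[start:end + 1]))    # gustave eiffel
```

The double loop is quadratic in sequence length, which is fine for a 512-token input; production code usually restricts the search to the top few start and end candidates.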
The example below loads a checkpoint that has already been fine-tuned on SQuAD:

```python
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

# Load pretrained QA model (already fine-tuned on SQuAD)
qa_tokenizer = BertTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad'
)
qa_model = BertForQuestionAnswering.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad'
)


def answer_question(question, context):
    inputs = qa_tokenizer(
        question, context,
        return_tensors='pt',
        truncation=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = qa_model(**inputs)

    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Find best start and end positions (end is made exclusive for slicing)
    start_idx = int(torch.argmax(start_logits))
    end_idx = int(torch.argmax(end_logits)) + 1

    tokens = qa_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    answer = qa_tokenizer.convert_tokens_to_string(tokens[start_idx:end_idx])
    return answer


# Test it
context = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
Constructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair,
it was initially criticized by some of France's leading artists and intellectuals
but has become a global cultural icon of France and one of the most recognisable
structures in the world.
"""

questions = [
    "Where is the Eiffel Tower located?",
    "Who designed the Eiffel Tower?",
    "When was the Eiffel Tower built?",
]

for q in questions:
    answer = answer_question(q, context)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()
```

Output (shown with original casing for readability; because this checkpoint uses an uncased tokenizer, the strings it returns are lowercased, and the exact spans may differ):

```
Q: Where is the Eiffel Tower located?
A: Champ de Mars in Paris, France

Q: Who designed the Eiffel Tower?
A: Gustave Eiffel

Q: When was the Eiffel Tower built?
A: 1887 to 1889
```

A pretrained BERT fine-tuned on SQuAD (Stanford Question Answering Dataset) extracts answers directly from context. No generation. Just span extraction.

The Fastest Way: HuggingFace Pipeline

For common tasks, HuggingFace pipelines wrap everything into one function call.
```python
from transformers import pipeline

# When no model is named, each pipeline downloads a default checkpoint.
# Defaults can change between library versions; pass model=... to pin one.

# Sentiment analysis (the default checkpoint is a DistilBERT fine-tuned on SST-2)
sentiment = pipeline('sentiment-analysis')
results = sentiment([
    "I absolutely loved this product!",
    "Terrible quality, fell apart after a day.",
    "It's okay, nothing special."
])
for r in results:
    print(f"{r['label']:<10} {r['score']:.3f}")

print()

# Named Entity Recognition
# aggregation_strategy='simple' merges subword pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline('ner', aggregation_strategy='simple')
text = "Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:<8} {e['word']:<25} score={e['score']:.3f}")

print()

# Question Answering
qa = pipeline('question-answering')
result = qa(
    question="Who is the CEO of Apple?",
    context="Apple CEO Tim Cook announced a new product at their Cupertino headquarters."
)
print(f"Answer: {result['answer']} (score: {result['score']:.3f})")

print()

# Zero-shot classification (no fine-tuning needed; uses an NLI model, not BERT itself)
classifier = pipeline('zero-shot-classification')
text = "The government announced new economic policies today."
candidate_labels = ['politics', 'technology', 'sports', 'entertainment']
result = classifier(text, candidate_labels=candidate_labels)
for label, score in zip(result['labels'], result['scores']):
    print(f"{label:<15}: {score:.3f}")
```

Output (scores depend on the checkpoint and library version):

```
POSITIVE   0.999
NEGATIVE   0.998
NEGATIVE   0.612

ORG      Apple                     score=0.998
PER      Tim Cook                  score=0.997
LOC      Cupertino                 score=0.986

Answer: Tim Cook (score: 0.998)

politics       : 0.942
technology     : 0.031
entertainment  : 0.017
sports         : 0.010
```

Fine-Tuning Tips for BERT

- Learning rate: BERT is sensitive. Use 2e-5 to 5e-5, much lower than the learning rates typically used when training a network from scratch.
- Batch size: 16 or 32, the values recommended in the BERT paper.
- Epochs: 2 to 4 epochs. BERT fine-tunes quickly. More epochs usually cause overfitting.
- Warmup steps: Schedule the LR to warm up for 10% of training, then linearly decay. Helps stability.
- Gradient clipping: Clip the gradient norm at 1.0 to prevent exploding gradients.
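The warmup-then-decay shape from the tips above can be written out as a plain function. This is a sketch of the schedule's shape only, meant to resemble what a linear warmup scheduler computes; the function name `linear_warmup_decay` and the step counts are assumptions for illustration.

```python
def linear_warmup_decay(step, warmup_steps, total_steps, peak_lr=2e-5):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to peak_lr over warmup_steps,
    then falls linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 100
warmup_steps = total_steps // 10  # 10% warmup

for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {linear_warmup_decay(step, warmup_steps, total_steps):.2e}")
```

The rate is tiny for the first few updates, while the freshly initialized classification head is still producing large, noisy gradients, and it peaks only once warmup ends.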
A standard setup that applies these settings (it reuses `model` and `loader` from the classification example):

```python
# Standard fine-tuning setup
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS = 3
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1

total_steps = len(loader) * EPOCHS             # one scheduler step per batch
warmup_steps = int(total_steps * WARMUP_RATIO)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print(f"Total training steps: {total_steps}")
print(f"Warmup steps:         {warmup_steps}")
print(f"Peak LR: {LEARNING_RATE}, then linear decay to 0")
```

BERT vs RoBERTa vs DistilBERT

Speed and accuracy are rough figures relative to bert-base.

| Model | Params | Speed | Accuracy | Notes |
|---|---|---|---|---|
| bert-base | 110M | 1x | baseline | Original, safe choice |
| bert-large | 340M | 0.4x | +2-3% | Slower, better accuracy |
| roberta-base | 125M | 1x | +1-2% | Better pretraining, no NSP |
| distilbert-base | 66M | 1.6x | -3% | Great for production |
| albert-base | 12M | 0.9x | ~same | Much fewer params via sharing |

For most projects: start with distilbert-base-uncased for speed, switch to roberta-base for accuracy.

Quick Cheat Sheet

| Task | Model | Code |
|---|---|---|
| Text classification | BertForSequenceClassification | `pipeline('sentiment-analysis')` |
| NER | BertForTokenClassification | `pipeline('ner')` |
| QA | BertForQuestionAnswering | `pipeline('question-answering')` |
| Zero-shot | NLI model | `pipeline('zero-shot-classification')` |
| Custom | BertModel + linear head | `outputs.pooler_output` |

| Setting | Value |
|---|---|
| Learning rate | 2e-5 to 5e-5 |
| Batch size | 16 or 32 |
| Epochs | 2 to 4 |
| Max sequence length | 128 to 512 |
| Warmup steps | 10% of total steps |

Practice Challenges

Level 1: Use `pipeline('sentiment-analysis')` on 20 movie reviews you write yourself (10 positive, 10 negative). Print each prediction and confidence score. Where does it get confused?

Level 2: Fine-tune distilbert-base-uncased on any small classification dataset (you can use `load_dataset('imdb')` from HuggingFace). Train for 3 epochs. Compare accuracy to a TF-IDF + LogisticRegression baseline from Post 62. How much better is BERT?

Level 3: Use BertForTokenClassification to tag a paragraph of news text with NER labels. Then visualize the output by color-coding each entity type in the text. Use the fine-tuned dslim/bert-base-NER model from HuggingFace hub.

References

- BERT paper: Pre-training of Deep Bidirectional Transformers: https://arxiv.org/abs/1810.04805
- The Illustrated BERT (Jay Alammar): https://jalammar.github.io/illustrated-bert/
- HuggingFace: BERT docs: https://huggingface.co/docs/transformers/model_doc/bert
- HuggingFace: Fine-tuning tutorial: https://huggingface.co/docs/transformers/training
- RoBERTa paper: https://arxiv.org/abs/1907.11692

Next up, Post 93: GPT: The Model That Predicts the Next Word Forever. Autoregressive generation, temperature and sampling strategies, and how a simple next-token prediction objective produces models that can write, code, and reason.