A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics

A tutorial demonstrates streaming, filtering, deduplication, tokenization, and analytics on the FineWeb dataset using Python, reproducing quality-filtering pipelines and MinHash-based near-duplicate detection on a sample of 3,000 documents.

In this tutorial (code: https://github.com/MARKTECHPOST-AI-MEDIA-INC/AI-Agents-Projects-Tutorials/blob/main/Data%20Analysis/fineweb streaming filtering dedup tokenization tutorial marktechpost.py), we explore the FineWeb dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb) through an advanced hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate useful analytics on domains, language scores, document lengths, and tokenizer efficiency.

```python
import subprocess, sys

def pip(*pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")

import re, math, random, collections
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from datasets import load_dataset

random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 90)
```

We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.
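The idea of taking a bounded sample from a stream, rather than materializing the whole corpus, can be sketched with the standard library alone. The generator `fake_stream` below is a hypothetical stand-in for a streamed dataset and is not part of the tutorial; it only illustrates that records are produced lazily and that consumption stops once the sample is full.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily, without end."""
    i = 0
    while True:
        yield {"id": i, "text": f"document {i}"}
        i += 1

N_DOCS = 5  # small value for illustration only

# islice pulls exactly N_DOCS records and then stops consuming the generator,
# so the (infinite) stream is never exhausted or held in memory.
sample = list(islice(fake_stream(), N_DOCS))

print(len(sample), sample[-1])
```

The tutorial's own loop reaches the same result with `enumerate` and an explicit `break`; either way, only the sampled records are ever held in memory.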
```python
N_DOCS = 3000
print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")
stream = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

docs = []
for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
    docs.append(doc)
    if i + 1 >= N_DOCS:
        break

df = pd.DataFrame(docs)
print("\nColumns:", list(df.columns))
print(df[["url", "language", "language_score", "token_count"]].head(5))

ex = docs[0]
print("\n--- Example record fields ---")
for k, v in ex.items():
    preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
    print(f"{k:>16}: {preview}")
```

We stream a fixed number of documents from the FineWeb sample-10BT subset without downloading the full dataset. We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print a complete example record to better understand the dataset’s structure.

```python
WORD = re.compile(r"\b\w+\b")

def gopher_quality(text):
    words = WORD.findall(text)
    n = len(words)
    if n < 50 or n > 100_000:
        return False, "word count out of range"
    mean_len = sum(len(w) for w in words) / n
    if mean_len < 3 or mean_len > 10:
        return False, "bad mean word length"
    if (text.count("#") + text.count("...")) / n > 0.1:
        return False, "too many symbols"
    lines = text.split("\n")
    if lines and sum(l.lstrip().startswith(("•", "-", "*")) for l in lines) / len(lines) > 0.9:
        return False, "mostly bullets"
    stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
    if len(stops & {w.lower() for w in words}) < 2:
        return False, "too few stopwords"
    return True, "ok"

def c4_quality(text):
    lines = [l for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    low = text.lower()
    for bad in ("lorem ipsum", "javascript is disabled"):
        if bad in low:
            return False, f"boilerplate:{bad}"
    if text.count("{") > 0 and text.count("{") / max(len(lines), 1) > 0.5:
        return False, "too many braces"
    return True, "ok"

def fineweb_custom(text):
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > 0.3:
        return False, "duplicated lines"
    short_frac = sum(len(l) < 30 for l in lines) / len(lines)
    if short_frac > 0.67 and len(lines) > 5:
        return False, "list like"
    return True, "ok"

results = []
for d in docs:
    t = d["text"]
    g_ok, g_r = gopher_quality(t)
    c_ok, c_r = c4_quality(t)
    f_ok, f_r = fineweb_custom(t)
    # Record "kept", or the reason from the first filter that rejected the document.
    reason = "kept" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
    results.append(reason)

filter_summary = pd.Series(results).value_counts()
print("\n--- Quality-filter outcomes on already-clean FineWeb data ---")
print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
print(filter_summary)
```

We recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.
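To make the repeated-line and list-like checks concrete, here is a minimal standalone sketch that computes the two line-level fractions on synthetic text. The helper name `line_stats` and both sample documents are hypothetical; the thresholds in the comments (0.3 for duplicated lines, 0.67 for short lines) are the ones used by the tutorial's custom filter.

```python
def line_stats(text):
    """Return (duplicate-line fraction, short-line fraction) over non-empty lines."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return 0.0, 0.0
    dup_frac = 1 - len(set(lines)) / len(lines)                 # share of repeated lines
    short_frac = sum(len(l) < 30 for l in lines) / len(lines)   # share of lines under 30 chars
    return dup_frac, short_frac

# A navigation-menu-like page: few distinct lines, all of them short.
menu_page = "\n".join(["Home", "About", "Contact", "Home", "About", "Contact"])

# An article-like page: every line is distinct and long.
article = "\n".join([
    "FineWeb is a large English web corpus built from Common Crawl snapshots.",
    "Each record carries the page text together with metadata such as the URL.",
    "Quality heuristics remove pages that look like menus or boilerplate.",
])

menu_dup, menu_short = line_stats(menu_page)
art_dup, art_short = line_stats(article)

print(f"menu:    dup={menu_dup:.2f} short={menu_short:.2f}")  # dup > 0.3 would be rejected
print(f"article: dup={art_dup:.2f} short={art_short:.2f}")    # both fractions stay low
```

The menu-like page exceeds the duplicated-line threshold, while the article-like page passes both checks, which is the behavior these heuristics are meant to capture.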
```python
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}

NUM_PERM = 128
THRESHOLD = 0.7
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
minhashes = {}
for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(d["text"]):
        m.update(s.encode("utf8"))
    minhashes[idx] = m
    lsh.insert(str(idx), m)

# LSH returns candidate matches; each document also matches itself, so skip c == idx.
dup_pairs = set()
for idx, m in minhashes.items():
    for cand in lsh.query(m):
        c = int(cand)
        if c != idx:
            dup_pairs.add(tuple(sorted((idx, c))))

print(f"\nFound {len(dup_pairs)} near-duplicate pairs (Jaccard ≥ {THRESHOLD}).")
if dup_pairs:
    a, b = next(iter(dup_pairs))
    j = minhashes[a].jaccard(minhashes[b])
    print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
    print("  DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
    print("  DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")
else:
    print("No near-dupes in this slice; expected, since FineWeb is dedup'd per crawl.")
```

We implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into word shingles, generate MinHash signatures, and index them with Locality Sensitive Hashing. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.
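To illustrate what a MinHash signature estimates, the following standard-library sketch compares the exact Jaccard similarity of two shingle sets with a MinHash estimate. It does not reproduce datasketch's implementation: the salted BLAKE2b hashes, the helper names, the 3-word shingles, and the sample sentences are all assumptions chosen for illustration.

```python
import hashlib
import re

WORD = re.compile(r"\b\w+\b")

def shingles(text, k=3):
    """Set of k-word shingles (k=3 here so short sentences still overlap)."""
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash over all items."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots where the two minimums agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "streaming a web corpus avoids downloading every shard before analysis begins"

sh_a, sh_b, sh_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (sh_a, sh_b, sh_c))

exact_ab, est_ab = jaccard(sh_a, sh_b), estimate_jaccard(sig_a, sig_b)
exact_ac, est_ac = jaccard(sh_a, sh_c), estimate_jaccard(sig_a, sig_c)

print(f"A vs B: exact={exact_ab:.3f} estimate={est_ab:.3f}")
print(f"A vs C: exact={exact_ac:.3f} estimate={est_ac:.3f}")
```

With 128 hash functions the estimate for the near-duplicate pair should land close to the exact value, while the unrelated pair stays near zero. LSH then buckets similar signatures together so that not every pair of documents has to be compared.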
```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
check = docs[:200]
# disallowed_special=() encodes strings such as "<|endoftext|>" as ordinary text;
# with the default setting, tiktoken raises an error if a document contains one.
recomputed = [len(enc.encode(d["text"], disallowed_special=())) for d in tqdm(check, desc="Tokenizing")]
stored = [d["token_count"] for d in check]
diffs = np.array(recomputed) - np.array(stored)
print("\n--- Verifying token_count field (gpt2) on 200 docs ---")
print(f"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens")
print(f"Exact matches: {(diffs == 0).mean()*100:.0f}%   (small drift = tokenizer version)")

df["chars_per_token"] = df["text"].str.len() / df["token_count"].clip(lower=1)
print(f"Avg characters per token: {df['chars_per_token'].mean():.2f}")
```

We verify the dataset’s token count field by recomputing GPT-2 token counts with the tiktoken tokenizer. We compare the recomputed token counts with the stored values and measure the average difference between them. We also calculate characters per token to understand tokenizer efficiency across the sampled documents.

```python
df["domain"] = df["url"].apply(
    lambda u: urlparse(u).netloc.replace("www.", "") if isinstance(u, str) else "?"
)
top_domains = df["domain"].value_counts().head(15)
print("\n--- Top 15 domains in sample ---")
print(top_domains)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
axes[0, 0].set_title("Token count per document (gpt2)")
axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")
axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff (0.65)")
axes[0, 1].set_title("fastText English language score")
axes[0, 1].set_xlabel("score"); axes[0, 1].legend()
axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
axes[1, 0].set_title("Characters per token (compression)")
axes[1, 0].set_xlabel("chars / token")
top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
axes[1, 1].set_title("Top domains")
plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"Docs streamed           : {len(df):,}")
print(f"Total gpt2 tokens       : {df['token_count'].sum():,}")
print(f"Median tokens/doc       : {int(df['token_count'].median())}")
print(f"Unique domains          : {df['domain'].nunique():,}")
print(f"Mean language score     : {df['language_score'].mean():.3f}")
print(f"Near-duplicate pairs    : {len(dup_pairs)}")
print(f"Docs flagged by filters : {(pd.Series(results) != 'kept').sum()} / {len(results)}")
print("\nNext steps:")
print("  • Swap name='sample-10BT' for a real crawl, e.g. name='CC-MAIN-2024-10'")
print("  • Raise N_DOCS for stronger statistics")
print("  • Use the full datatrove pipeline to reproduce FineWeb end-to-end")
```

We extract domain names from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.
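The domain-extraction step can be tried in isolation on a handful of URLs. The sketch below is a hedged variant rather than the tutorial's exact code: the tutorial removes the substring "www." wherever it occurs in the host, whereas the hypothetical helper `domain_of` strips it only as a leading prefix and also lowercases the host. The example URLs are made up.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Return the host of a URL without a leading 'www.', or '?' for non-string input."""
    if not isinstance(url, str):
        return "?"
    host = urlparse(url).netloc.lower()
    # Strip only a leading "www." so hosts that merely contain that substring stay intact.
    return host[4:] if host.startswith("www.") else host

# Hypothetical sample URLs, including a missing value.
urls = [
    "https://www.example.com/a",
    "http://example.com/b",
    "https://blog.example.org/post",
    None,
]

counts = Counter(domain_of(u) for u in urls)
print(counts.most_common())
```

Here the two example.com URLs collapse into one domain, the subdomain is kept as its own entry, and the missing URL is counted under "?", mirroring how the tutorial's `value_counts()` call tallies domains.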
In conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, identified near-duplicate text patterns, and validated token-level metadata using a production-style tokenizer. This workflow can be scaled to larger FineWeb crawls, used for deeper corpus analysis, and adapted to design high-quality preprocessing pipelines for LLM dataset preparation.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.