{"slug": "a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization", "title": "A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics", "summary": "A tutorial demonstrates streaming, filtering, deduplication, tokenization, and analytics on the FineWeb dataset using Python, reproducing quality-filtering pipelines and MinHash-based near-duplicate detection on a sample of 3,000 documents.", "body_md": "In this [tutorial](https://github.com/MARKTECHPOST-AI-MEDIA-INC/AI-Agents-Projects-Tutorials/blob/main/Data%20Analysis/fineweb_streaming_filtering_dedup_tokenization_tutorial_marktechpost.py), we explore the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset through an advanced hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. 
We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate useful analytics on domains, language scores, document lengths, and tokenizer efficiency.\n\n``` python\nimport subprocess, sys\ndef pip(*pkgs):\n   subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\npip(\"datasets>=2.19\", \"datasketch\", \"tiktoken\", \"pandas\", \"matplotlib\", \"tqdm\")\nimport re, math, random, collections\nfrom urllib.parse import urlparse\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom tqdm.auto import tqdm\nfrom datasets import load_dataset\nrandom.seed(0); np.random.seed(0)\npd.set_option(\"display.max_colwidth\", 90)\n```\n\nWe begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.\n\n``` python\nN_DOCS = 3000\nprint(f\"Streaming {N_DOCS} docs from FineWeb sample-10BT ...\")\nstream = load_dataset(\n   \"HuggingFaceFW/fineweb\",\n   name=\"sample-10BT\",\n   split=\"train\",\n   streaming=True,\n)\ndocs = []\nfor i, doc in enumerate(tqdm(stream, total=N_DOCS)):\n   docs.append(doc)\n   if i + 1 >= N_DOCS:\n       break\ndf = pd.DataFrame(docs)\nprint(\"\\nColumns:\", list(df.columns))\nprint(df[[\"url\", \"language\", \"language_score\", \"token_count\"]].head(5))\nex = docs[0]\nprint(\"\\n--- Example record (fields) ---\")\nfor k, v in ex.items():\n   preview = (v[:120] + \"…\") if isinstance(v, str) and len(v) > 120 else v\n   print(f\"{k:>16}: {preview}\")\n```\n\nWe stream a fixed number of documents from the FineWeb sample-10BT subset without downloading the full dataset. 
We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print a complete example record to better understand the dataset’s structure.\n\n``` python\nWORD = re.compile(r\"\\b\\w+\\b\")\ndef gopher_quality(text):\n   words = WORD.findall(text)\n   n = len(words)\n   if n < 50 or n > 100_000:\n       return False, \"word_count_out_of_range\"\n   mean_len = sum(len(w) for w in words) / n\n   if mean_len < 3 or mean_len > 10:\n       return False, \"bad_mean_word_length\"\n   if (text.count(\"#\") + text.count(\"...\")) / n > 0.1:\n       return False, \"too_many_symbols\"\n   lines = text.split(\"\\n\")\n   if lines and sum(l.lstrip().startswith((\"•\", \"-\", \"*\")) for l in lines) / len(lines) > 0.9:\n       return False, \"mostly_bullets\"\n   stops = {\"the\", \"be\", \"to\", \"of\", \"and\", \"that\", \"have\", \"with\"}\n   if len(stops & {w.lower() for w in words}) < 2:\n       return False, \"too_few_stopwords\"\n   return True, \"ok\"\ndef c4_quality(text):\n   lines = [l for l in text.split(\"\\n\") if l.strip()]\n   if not lines:\n       return False, \"empty\"\n   low = text.lower()\n   for bad in (\"lorem ipsum\", \"javascript is disabled\"):\n       if bad in low:\n           return False, f\"boilerplate:{bad}\"\n   if text.count(\"{\") > 0 and text.count(\"{\") / max(len(lines), 1) > 0.5:\n       return False, \"too_many_braces\"\n   return True, \"ok\"\ndef fineweb_custom(text):\n   lines = [l.strip() for l in text.split(\"\\n\") if l.strip()]\n   if not lines:\n       return False, \"empty\"\n   dup_frac = 1 - len(set(lines)) / len(lines)\n   if dup_frac > 0.3:\n       return False, \"duplicated_lines\"\n   short_frac = sum(len(l) < 30 for l in lines) / len(lines)\n   if short_frac > 0.67 and len(lines) > 5:\n       return False, \"list_like\"\n   return True, \"ok\"\nresults = []\nfor d in docs:\n   t = d[\"text\"]\n   g_ok, g_r = gopher_quality(t)\n   c_ok, 
c_r = c4_quality(t)\n   f_ok, f_r = fineweb_custom(t)\n   reason = \"kept\" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)\n   results.append(reason)\nfilter_summary = pd.Series(results).value_counts()\nprint(\"\\n--- Quality-filter outcomes on already-clean FineWeb data ---\")\nprint(\"(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)\")\nprint(filter_summary)\n```\n\nWe recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.\n\n``` python\nfrom datasketch import MinHash, MinHashLSH\ndef shingles(text, k=5):\n   toks = WORD.findall(text.lower())\n   return {\" \".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}\nNUM_PERM = 128\nTHRESHOLD = 0.7\nlsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)\nminhashes = {}\nfor idx, d in enumerate(tqdm(docs, desc=\"MinHashing\")):\n   m = MinHash(num_perm=NUM_PERM)\n   for s in shingles(d[\"text\"]):\n       m.update(s.encode(\"utf8\"))\n   minhashes[idx] = m\n   lsh.insert(str(idx), m)\ndup_pairs = set()\nfor idx, m in minhashes.items():\n   for cand in lsh.query(m):\n       c = int(cand)\n       if c != idx:\n           dup_pairs.add(tuple(sorted((idx, c))))\nprint(f\"\\nFound {len(dup_pairs)} near-duplicate pairs (Jaccard ≥ {THRESHOLD}).\")\nif dup_pairs:\n   a, b = next(iter(dup_pairs))\n   j = minhashes[a].jaccard(minhashes[b])\n   print(f\"Example pair (estimated Jaccard ≈ {j:.2f}):\")\n   print(\"  DOC A:\", docs[a][\"text\"][:160].replace(\"\\n\", \" \"), \"…\")\n   print(\"  DOC B:\", docs[b][\"text\"][:160].replace(\"\\n\", \" \"), \"…\")\nelse:\n   print(\"No near-dupes in this slice — expected, since 
FineWeb is dedup'd per crawl.\")\n```\n\nWe implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into word shingles, generate MinHash signatures, and index them with Locality Sensitive Hashing. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.\n\n``` python\nimport tiktoken\nenc = tiktoken.get_encoding(\"gpt2\")\ncheck = docs[:200]\n# encode_ordinary treats special-token strings such as <|endoftext|> as plain text,\n# so web pages that happen to contain them do not raise an error\nrecomputed = [len(enc.encode_ordinary(d[\"text\"])) for d in tqdm(check, desc=\"Tokenizing\")]\nstored = [d[\"token_count\"] for d in check]\ndiffs = np.array(recomputed) - np.array(stored)\nprint(f\"\\n--- Verifying token_count field (gpt2) on 200 docs ---\")\nprint(f\"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens\")\nprint(f\"Exact matches: {(diffs == 0).mean()*100:.0f}%   (small drift = tokenizer version)\")\ndf[\"chars_per_token\"] = df[\"text\"].str.len() / df[\"token_count\"].clip(lower=1)\nprint(f\"Avg characters per token: {df['chars_per_token'].mean():.2f}\")\n```\n\nWe verify the dataset’s `token_count` field by recomputing GPT-2 token counts on the first 200 documents with the tiktoken tokenizer. We compare the recomputed token counts with the stored values and measure the average difference between them. 
We also calculate characters per token to understand tokenizer efficiency across the sampled documents.\n\n``` python\n# Strip only a leading \"www.\" so hostnames that contain \"www.\" elsewhere stay intact\ndf[\"domain\"] = df[\"url\"].apply(lambda u: re.sub(r\"^www\\.\", \"\", urlparse(u).netloc) if isinstance(u, str) else \"?\")\ntop_domains = df[\"domain\"].value_counts().head(15)\nprint(\"\\n--- Top 15 domains in sample ---\")\nprint(top_domains)\nfig, axes = plt.subplots(2, 2, figsize=(14, 10))\naxes[0, 0].hist(df[\"token_count\"].clip(upper=4000), bins=50, color=\"#7b2d26\")\naxes[0, 0].set_title(\"Token count per document (gpt2)\")\naxes[0, 0].set_xlabel(\"tokens\"); axes[0, 0].set_ylabel(\"docs\")\naxes[0, 1].hist(df[\"language_score\"], bins=40, color=\"#2d5d7b\")\naxes[0, 1].axvline(0.65, color=\"red\", ls=\"--\", label=\"FineWeb cutoff 0.65\")\naxes[0, 1].set_title(\"fastText English language score\")\naxes[0, 1].set_xlabel(\"score\"); axes[0, 1].legend()\naxes[1, 0].hist(df[\"chars_per_token\"].clip(upper=8), bins=40, color=\"#3f7b2d\")\naxes[1, 0].set_title(\"Characters per token (compression)\")\naxes[1, 0].set_xlabel(\"chars / token\")\ntop_domains.iloc[::-1].plot(kind=\"barh\", ax=axes[1, 1], color=\"#7b5d2d\")\naxes[1, 1].set_title(\"Top domains\")\nplt.tight_layout()\nplt.show()\nprint(\"\\n\" + \"=\" * 70)\nprint(\"SUMMARY\")\nprint(\"=\" * 70)\nprint(f\"Docs streamed          : {len(df):,}\")\nprint(f\"Total gpt2 tokens       : {df['token_count'].sum():,}\")\nprint(f\"Median tokens/doc       : {int(df['token_count'].median())}\")\nprint(f\"Unique domains          : {df['domain'].nunique():,}\")\nprint(f\"Mean language_score     : {df['language_score'].mean():.3f}\")\nprint(f\"Near-duplicate pairs    : {len(dup_pairs)}\")\nprint(f\"Docs flagged by filters : {(pd.Series(results) != 'kept').sum()} / {len(results)}\")\nprint(\"\\nNext steps:\")\nprint(\"  • Swap name='sample-10BT' for a real crawl, e.g. 
name='CC-MAIN-2024-10'\")\nprint(\"  • Raise N_DOCS for stronger statistics\")\nprint(\"  • Use the full datatrove pipeline to reproduce FineWeb end-to-end\")\n```\n\nWe extract domain names from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.\n\nIn conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, searched for near-duplicate text, and validated token-level metadata using a production-style tokenizer. The same workflow can be scaled to larger FineWeb crawls, extended for deeper corpus analysis, and used as a starting point for designing preprocessing pipelines for LLM dataset preparation.\n\nSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. 
With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.", "url": "https://wpnews.pro/news/a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization", "canonical_source": "https://www.marktechpost.com/2026/06/14/a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization-and-large-scale-web-corpus-analytics/", "published_at": "2026-06-14 20:45:32+00:00", "updated_at": "2026-06-14 20:51:27.343717+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-tools", "developer-tools"], "entities": ["FineWeb", "HuggingFaceFW", "GPT-2", "MinHash", "Python", "pandas", "matplotlib", "tiktoken"], "alternates": {"html": "https://wpnews.pro/news/a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization", "markdown": "https://wpnews.pro/news/a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization.md", "text": "https://wpnews.pro/news/a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization.txt", "jsonld": "https://wpnews.pro/news/a-coding-hands-on-on-fineweb-for-streaming-filtering-deduplication-tokenization.jsonld"}}