Training a Twitch chat toxicity classifier on real VOD data at scale

A developer built a Twitch chat toxicity classifier by scraping VOD chat replay data at scale using the platform's internal `VideoCommentsByOffsetOrCursor` GraphQL endpoint, which is not part of Twitch's public API. The project required getting past Twitch's TLS fingerprint inspection and rate limiting through browser-emulating HTTP libraries, residential proxies, and offset-based pagination to collect structured message data including text, emotes, badges, and subscriber status. The resulting dataset, costing approximately $0.001 per message, enables training of TF-IDF and logistic regression classifiers with features that distinguish between moderators, subscribers, and regular users.

Quick answer: Twitch has no public API for VOD chat replay. To build a Twitch toxicity classifier dataset you walk the internal `VideoCommentsByOffsetOrCursor` GraphQL endpoint at scale, the same one the web player uses. The Devil Scrapes Twitch VOD Chat Archive Actor does that for $0.001 per message (~$1.05 per 1,000), returning the structured fields (`message_fragments`, `badges`, `is_subscriber`) that make classifier features actually useful.

If you maintain a mod-bot (StreamElements, Nightbot, Streamlabs, or custom), or if you are an ML engineer building a Twitch-native toxicity model, your training data problem is the same: you need labelable chat messages at scale from real VODs, with enough context per row to build signal-rich features. This post walks the full pipeline: pulling the data, loading it into pandas, training a baseline TF-IDF + logistic-regression classifier, and sketching the upgrade path to a transformer.

Twitch offers no official API for VOD chat replay in any useful sense. The [Twitch Helix API](https://dev.twitch.tv/docs/api/) exposes live IRC chat via EventSub and the Chat & Messaging endpoints, but it has no endpoint for VOD chat replay, the historical timestamped record of a past broadcast.
That data exists (you can watch it in the VOD player), but the only programmatic surface for it is the internal `VideoCommentsByOffsetOrCursor` persisted GraphQL query.

Walking that endpoint reliably is a job in itself. Twitch inspects TLS fingerprints from incoming requests: Python's `requests` or `httpx` produce a ClientHello that no real browser sends, and the server responds with a 403 before it reads the body. Past roughly 10,000 messages on a single IP, Twitch's rate limiting kicks in hard. The cursor-based pagination mode triggers an integrity-check challenge that needs a live browser to solve. Offset-based pagination avoids it, but only if you know to use it before you start coding.

We absorb all of that. The Actor rotates through Chrome, Firefox, and Safari TLS fingerprints via `curl-cffi`, threads residential proxies with fresh session IDs on each block, retries with exponential backoff on `408` / `429` / `5xx`, and pages exclusively by content offset to sidestep the integrity check. The result is a clean dataset of typed rows you can load straight into pandas.

Not all chat APIs return the same structure. The fields the Actor returns were chosen with feature engineering in mind:

- `message_text`: the plain-text body of the message with emote shortcodes preserved as literal text (e.g. `"PogChamp PogChamp OMEGALUL"`). This is the text you label and your primary text feature.
- `message_fragments`: a structured array of `{type, text, emote_id}` objects. Type is either `"text"` or `"emote"`. This matters because emotes carry semantic weight a TF-IDF tokenizer cannot capture from their shortcode text alone. An `"emote"` fragment with `emote_id` lets you treat emotes as a distinct token type, deduplicate their representation, or embed them separately. Spam runs often consist almost entirely of emote fragments; that ratio is a cheap feature.
- `badges`: an array of `{set_id, version}` objects representing the user's active chat badges.
  A user carrying a `moderator` badge, a `broadcaster` badge, or a `vip` badge is structurally different from a first-time chatter, and their messages should be weighted differently in your training set. A model that does not distinguish a moderator warning from a random user saying the same thing is a weaker model.
- `is_subscriber`: a boolean convenience flag derived from the badges array. Subscribers are users who have paid for channel membership; their base rate of toxic behavior differs from non-subscribers. This is a fast binary feature your model can use without parsing the full badges array.
- `message_offset_seconds`: the message's position in the VOD timeline in seconds. Toxic spikes correlate with in-stream events: a bad play, a controversial opinion, a raid. Including offset in your labeling pass lets you sample across the full timeline rather than front-loading training data from the first ten minutes.
- `commenter_id` and `commenter_login`

You need `apify-client` installed (`pip install apify-client pandas scikit-learn`). Get a free Apify API token at [apify.com](https://apify.com); no card is required, and every account starts with $5 of credit.

The call below targets three VODs by ID and caps at 5,000 messages per VOD. At $0.001 per message plus the $0.05 actor-start fee, 15,000 messages cost $15.05.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/twitch-vod-chat-archive").call(
    run_input={
        "vodIds": ["2773625679", "2756421083", "2741897234"],
        "maxMessagesPerVod": 5000,
        "startOffsetSeconds": 0,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }
)

# Materialize every row of the run's default dataset
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Pulled {len(items)} messages")
```

For a larger training corpus (say 100 VODs from a mix of channels), set `maxRecentVods` in `channelLogin` mode instead of listing IDs:

```python
run = client.actor("DevilScrapes/twitch-vod-chat-archive").call(
    run_input={
        "channelLogin": "shroud",
        "maxRecentVods": 50,
        "maxMessagesPerVod": 10000,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }
)
```

That gives you up to 500,000 messages per channel in a single run. At $0.001/message that is ~$500.05 for the full 500k, but the free $5 trial credit covers 4,950 messages, enough to validate your pipeline before committing.
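The cost figures above follow from simple arithmetic, and a small helper makes it easy to check a planned pull before launching it. This is a minimal sketch: the two rates are the ones quoted in this post, while the function names `run_cost` and `affordable_messages` are hypothetical and not part of any Apify SDK.

```python
from decimal import Decimal

# Rates as quoted in this post; adjust them if the Actor's pricing changes.
PRICE_PER_MESSAGE = Decimal("0.001")
ACTOR_START_FEE = Decimal("0.05")


def run_cost(messages: int, runs: int = 1) -> Decimal:
    """Estimated USD cost of pulling `messages` rows across `runs` Actor runs."""
    return messages * PRICE_PER_MESSAGE + runs * ACTOR_START_FEE


def affordable_messages(credit_usd: str, runs: int = 1) -> int:
    """Largest message count a given credit covers once start fees are paid."""
    remaining = Decimal(credit_usd) - runs * ACTOR_START_FEE
    return max(0, int(remaining / PRICE_PER_MESSAGE))


# The figures quoted in this post
assert run_cost(15_000) == Decimal("15.05")    # three VODs, 5,000 messages each
assert run_cost(500_000) == Decimal("500.05")  # 50 VODs, 10,000 messages each
assert affordable_messages("5") == 4950        # free trial credit
```

`Decimal` is used instead of floats so that the per-message rate multiplies out exactly. Note that these are worst-case numbers: a VOD with fewer messages than `maxMessagesPerVod` costs less.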
```python
import pandas as pd

df = pd.DataFrame(items)

# Compute emote ratio: useful spam feature
def emote_ratio(fragments):
    if not fragments:
        return 0.0
    emote_count = sum(1 for f in fragments if f.get("type") == "emote")
    return emote_count / len(fragments)

df["emote_ratio"] = df["message_fragments"].apply(emote_ratio)

# Extract badge sets as a frozenset for grouping
def badge_set(badges):
    return frozenset(b["set_id"] for b in badges) if badges else frozenset()

df["badge_set"] = df["badges"].apply(badge_set)

# is_moderator / is_broadcaster convenience columns
df["is_moderator"] = df["badge_set"].apply(lambda s: "moderator" in s)
df["is_broadcaster"] = df["badge_set"].apply(lambda s: "broadcaster" in s)

# Messages per user: frequency signal
msg_counts = df.groupby("commenter_id")["message_id"].count().rename("user_msg_count")
df = df.merge(msg_counts, left_on="commenter_id", right_index=True, how="left")

print(df[["message_text", "is_subscriber", "is_moderator", "emote_ratio", "user_msg_count"]].head())
```

Sample output row from a real VOD scrape (channel: shroud, toxic content masked):

```json
{
  "vod_id": "2773625679",
  "vod_title": "never played forza but i definitely have a drivers license so it should be easy",
  "channel_login": "shroud",
  "message_id": "1292e052-0561-4db5-86c7-adfc4556d628",
  "message_offset_seconds": 12,
  "posted_at": "2026-05-16T18:42:35.297Z",
  "commenter_id": "142680597",
  "commenter_login": "tabrexs",
  "commenter_display_name": "tabrexs",
  "message_text": "PewPewPew",
  "message_fragments": [
    {
      "type": "emote",
      "text": "PewPewPew",
      "emote_id": "emotesv2_587405136a8147148c77df74baaa1bf4"
    }
  ],
  "user_color": "#DAA520",
  "badges": [],
  "is_subscriber": false,
  "scraped_at": "2026-05-16T19:00:00Z"
}
```

For a first iteration, label toxic/benign manually on a sample and train a TF-IDF + logistic-regression baseline. This is fast to iterate on and gives you a performance floor to beat with transformer fine-tuning later.
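One way to draw that manual-labeling sample is to stratify it over the VOD timeline, so the annotation batch is not dominated by the opening minutes of each broadcast. The sketch below is an illustration rather than part of the Actor: the helper name `sample_for_labeling`, the ten-minute bucket width, and the per-bucket cap are all assumptions, and the demo frame is synthetic data that only mimics the `vod_id`, `message_id`, and `message_offset_seconds` columns.

```python
import pandas as pd


def sample_for_labeling(df, per_bucket=50, bucket_seconds=600, seed=42):
    """Pick up to `per_bucket` random messages from each time slice of each VOD."""
    # Shuffle once, then keep the first `per_bucket` rows seen in every group.
    shuffled = df.sample(frac=1.0, random_state=seed)
    bucket = shuffled["message_offset_seconds"] // bucket_seconds
    rank = shuffled.groupby([shuffled["vod_id"], bucket]).cumcount()
    picked = shuffled[rank < per_bucket]
    return picked.sort_values(["vod_id", "message_offset_seconds"])


# Synthetic stand-in for the scraped frame: VOD "A" is busy, VOD "B" is tiny.
demo = pd.DataFrame({
    "vod_id": ["A"] * 100 + ["B"] * 3,
    "message_id": [f"m{i}" for i in range(103)],
    "message_offset_seconds": [i * 30 for i in range(100)] + [10, 20, 30],
})

to_label = sample_for_labeling(demo, per_bucket=5, bucket_seconds=600)
print(to_label.groupby("vod_id").size())
```

Buckets that hold fewer messages than the cap are kept whole, so quiet stretches of a stream are still represented. On the real frame, `to_label[["message_id", "message_text"]]` can be exported for annotators, and the resulting labels saved as the `{message_id: label}` mapping used in the training step.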
Important framing note for the labeling pass: toxic labels in mod-tool training are typically defined by the channel's own moderation rules, not a universal taxonomy. What a family-friendly channel flags as toxic differs from what a gaming-focused one flags. Build your label schema per channel, or use a community standard like [Perspective API categories](https://perspectiveapi.com/) for initial seeding.

Do not include known-slur text in your labeled examples file in plaintext. Store it masked (e.g. `masked_slur`) and apply transformations at load time. The mod community, and any team reviewing your training data, will thank you.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load your labeled subset (human annotations: {message_id: 0 or 1})
# 0 = benign, 1 = toxic / spam
with open("labels.json") as f:
    labels = json.load(f)  # {"message_id_1": 0, "message_id_2": 1, ...}

labeled_df = df[df["message_id"].isin(list(labels))].copy()
labeled_df["label"] = labeled_df["message_id"].map(labels)

# Text feature: message_text is the primary signal
X_text = labeled_df["message_text"].fillna("")
y = labeled_df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: TF-IDF unigrams + bigrams, logistic regression
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=20000, sublinear_tf=True)),
    ("clf", LogisticRegression(
        C=1.0,
        class_weight="balanced",  # important: toxic is a minority class
        max_iter=1000,
    )),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["benign", "toxic"]))
```

Adding structural features alongside TF-IDF:

The text pipeline above ignores `emote_ratio`, `is_subscriber`, and `user_msg_count`.
To include them in the same model, combine sparse TF-IDF with a dense feature matrix:

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Dense features; booleans are cast to float so they can be scaled and stacked
dense_cols = ["emote_ratio", "is_subscriber", "is_moderator", "user_msg_count"]
dense = labeled_df[dense_cols].fillna(0).astype(float)

# Select rows by index label so the dense rows stay aligned with the
# shuffled text split (a boolean mask would keep the original row order)
scaler = StandardScaler()
X_train_dense = scaler.fit_transform(dense.loc[X_train.index])
X_test_dense = scaler.transform(dense.loc[X_test.index])

# Fit TF-IDF on train split only
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=20000, sublinear_tf=True)
X_train_sparse = tfidf.fit_transform(X_train)
X_test_sparse = tfidf.transform(X_test)

# Combine
X_train_combined = hstack([X_train_sparse, csr_matrix(X_train_dense)]).tocsr()
X_test_combined = hstack([X_test_sparse, csr_matrix(X_test_dense)]).tocsr()

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X_train_combined, y_train)
print(classification_report(y_test, clf.predict(X_test_combined), target_names=["benign", "toxic"]))
```

In practice the `emote_ratio` column tends to lift spam precision noticeably: pure-emote spam messages produce a ratio near 1.0 and a short `message_text` length, a combination TF-IDF alone does not capture well.

The baseline above typically plateaus around 75–82% F1 on a well-balanced Twitch dataset. The main failure modes are:

The upgrade path is to fine-tune a pre-trained model on your labeled data. `cardiffnlp/twitter-roberta-base-offensive` is a strong starting checkpoint for chat-style text: it was pre-trained on tweets and fine-tuned for offensive-language detection, so it transfers better to Twitch than a generic BERT.
```python
# Pseudocode: the full fine-tuning loop depends on your GPU setup
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

model_name = "cardiffnlp/twitter-roberta-base-offensive"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# preserve_index=False keeps the filtered frame's pandas index out of the dataset
hf_dataset = Dataset.from_pandas(
    labeled_df[["message_text", "label"]].rename(columns={"message_text": "text"}),
    preserve_index=False,
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = hf_dataset.map(tokenize, batched=True)
# ... standard Trainer setup with TrainingArguments, compute_metrics, etc.
```

The `message_fragments` field opens a further avenue: treat emote tokens as special tokens added to the tokenizer vocabulary (one token per `emote_id`), then let the model learn emote embeddings jointly with text. This is not a weekend project, but it is the difference between a model that sees `OMEGALUL` as a string of arbitrary subword pieces and one that learns it signals laughter.

The plan answers the pricing question directly. At $0.001/message plus the $0.05 actor-start fee:

| Pull size | Cost | Labeled examples (assuming 10% manual label rate) |
|---|---|---|
| 10,000 messages | $10.05 | ~1,000 labeled rows |
| 50,000 messages | $50.05 | ~5,000 labeled rows |
| 100,000 messages | $100.05 | ~10,000 labeled rows |

For a TF-IDF baseline, 1,000–5,000 labeled examples is workable if your class balance is reasonable. For transformer fine-tuning, 5,000+ labeled examples per class is the typical floor for stable results.

The free trial credit covers the first 4,950 messages before you spend a cent. That is enough to validate your feature extraction pipeline end-to-end before scaling up.
The full Twitch chat scraper guide covers the broader use-case landscape (esports analytics, post-broadcast review, channel back-catalog mode) if you want context beyond classifier training: [Twitch Chat Scraper: export any VOD's full chat replay for $1.05/1K](https://dev.to/devil_scrapes/twitch-chat-scraper-export-any-vods-full-chat-replay-for-1051k-1jea).

Can I use this for StreamElements / Nightbot rule testing?

Yes. Pull historical chat from VODs where you know toxic events occurred, then replay the `message_text` values through your bot's filter rules in a test harness. The `badges` and `is_subscriber` fields let you simulate the trust-level rules most bots implement (moderators and subscribers often get different thresholds).

Does the Actor return deleted or banned messages?

No. The chat-replay endpoint does not expose moderator actions: bans, timeouts, or the content of deleted messages. Deleted messages may appear as a