How to Build a Clean Academic Dataset Without Losing Your Mind (or Your Weekend)

A developer built a clean, reproducible academic dataset pipeline using ScholarAPI, an API that provides access to 30 million open-access papers with pre-extracted full text in structured JSON format. The pipeline uses four stable endpoints for searching, listing, and retrieving paper texts, eliminating common problems like PDF extraction errors, fragmented sources, and legal ambiguity that typically consume months of engineering time. The approach enables machine learning engineers to build production-ready datasets for fine-tuning or RAG in hours rather than weeks.

Everyone has an opinion on which model to fine-tune. Nobody talks about where the training data actually comes from.

Ask any ML engineer who has built something on scientific literature and you'll hear the same story: the model took two weeks. The dataset took two months. The dataset was the hard part.

I've been there. Cobbling together CSVs from PubMed exports, writing scrapers that broke every time a journal sneezed, hand-cleaning PDF extractions that looked like someone ran a blender through a research paper. It's unglamorous, it's slow, and it's the reason a lot of genuinely good AI projects never ship.

This article is about doing it the right way: building clean, structured, reproducible academic datasets using ScholarAPI (https://scholarapi.net/?via=-asig3). We'll go from zero to a production-ready dataset pipeline, with real code you can run today.

Most dataset-building tutorials assume you're scraping Reddit or pulling from a nice REST API with a consistent schema. Academic literature is neither of those things. Here's what you're actually dealing with:

Fragmentation. Research is spread across 20,000+ journals, repositories, preprint servers, and institutional databases. There is no single place to query all of it. PubMed covers medicine. arXiv covers physics and CS. Neither covers materials science, economics, or law particularly well.

Format chaos. The canonical format for academic publishing is PDF, a format designed for print, not machines. Extracting clean text from a PDF is a non-trivial engineering problem. Do it wrong and you get scrambled column layouts, broken equations, and reference lists fused into body text.

No stable programmatic access. Google Scholar indexes an estimated 389 million papers. It also has no official API. The moment your scraper gets reliable, Google changes something and you're back to zero.

Legal ambiguity at scale. Using copyrighted content to train models is genuinely complicated.
Open-access literature, where authors have explicitly licensed reuse, is the safe zone. But you have to know what you're pulling.

ScholarAPI (https://scholarapi.net/?via=-asig3) is built around exactly these constraints: 30M+ open-access papers, pre-extracted full text, structured JSON, stable endpoints. It doesn't solve every problem, but it eliminates the ones that kill most projects before they start.

By the end of this article you'll have three working pieces: a topic corpus builder, a daily update pipeline, and a Hugging Face export. All three use the same four endpoints:

- GET /api/v1/search: find papers by keyword
- GET /api/v1/list: paginate by date, monitor new content
- GET /api/v1/text/{id}: clean extracted full text
- GET /api/v1/texts/{ids}: bulk, up to 100 texts in one call

Auth is one header everywhere: `X-API-Key: sch_xxxxxxxxx`

Get your key at scholarapi.net (https://scholarapi.net). Signup comes with 1,000 free credits, enough to pull a few hundred full texts and genuinely evaluate whether this works for your use case.

This is the most common use case: you need N papers on a topic, with full text, for fine-tuning, RAG, or evaluation.

```python
import requests
import json
import time
from pathlib import Path

API_KEY = "sch_xxxxxxxxx"
BASE = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def search_papers(query: str, limit: int = 100) -> list[dict]:
    """Search for papers matching a query. Returns metadata list."""
    resp = requests.get(
        f"{BASE}/search",
        headers=HEADERS,
        params={"q": query, "limit": limit},
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


def fetch_texts_bulk(paper_ids: list[str]) -> dict[str, str]:
    """
    Fetch full text for up to 100 papers in one API call.
    Returns {paper_id: full_text} dict.
    """
    # API accepts comma-separated IDs
    ids_str = ",".join(paper_ids[:100])
    resp = requests.get(f"{BASE}/texts/{ids_str}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()  # {id: text, id: text, ...}


def build_corpus(
    query: str,
    target_size: int = 500,
    output_path: str = "corpus.jsonl",
) -> int:
    """
    Build a full-text corpus for a given query topic.
    Saves to JSONL, one JSON object per line, easy to stream later.
    """
    print(f"Searching for papers: '{query}'")
    # A single search call, so at most 100 results come back
    # even when target_size is larger.
    papers = search_papers(query, limit=min(target_size, 100))
    print(f"Found {len(papers)} papers in search results")

    # Filter to papers that have full text available
    with_text = [p for p in papers if p.get("has_text")]
    print(f"{len(with_text)} have full text available")

    written = 0

    # Batch into groups of 100 for the bulk endpoint
    with open(output_path, "w", encoding="utf-8") as f:
        for i in range(0, len(with_text), 100):
            batch = with_text[i:i + 100]
            ids = [p["id"] for p in batch]
            texts = fetch_texts_bulk(ids)

            for paper in batch:
                pid = paper["id"]
                text = texts.get(pid)
                if not text:
                    continue  # extraction can fail even when has_text is set

                record = {
                    "id": pid,
                    "title": paper.get("title"),
                    "authors": paper.get("authors", []),
                    "published_date": paper.get("published_date"),
                    "journal": paper.get("journal"),
                    "abstract": paper.get("abstract"),
                    "full_text": text,
                    "source_url": paper.get("url"),  # auditable backlink
                    "query": query,
                }
                f.write(json.dumps(record) + "\n")
                written += 1

            # Be a good API citizen
            time.sleep(0.5)
            print(f"  Batch {i // 100 + 1} done: {written} records so far")

    print(f"\nCorpus saved to {output_path}: {written} papers with full text")
    return written


# Run it
build_corpus(
    query="transformer attention mechanism natural language processing",
    target_size=200,
    output_path="nlp_corpus.jsonl",
)
```

Why JSONL? Because it streams. You can process a 10GB JSONL file line-by-line without loading it into memory. It's also a format Hugging Face `datasets` loads directly. Start with JSONL, you'll thank yourself later.

Why `source_url` in every record? ScholarAPI includes a backlink to the original paper on every result. Keep it. When someone asks "where did this training data come from," you have a per-record answer. That's the difference between an auditable dataset and a liability.

Static datasets go stale.
If you're building a system that needs to stay current with new research (a literature monitoring agent, a continuously updated RAG knowledge base, an LLM that gets fine-tuned monthly), you need a pipeline, not a one-time dump.

The `/list` endpoint with `indexed_after` is what makes this possible. ScholarAPI indexes new papers within 24–48 hours of publication. Here's a pipeline that runs daily and appends only new content:

```python
import requests
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

API_KEY = "sch_xxxxxxxxx"
BASE = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# State file: tracks when we last ran
STATE_FILE = Path(".pipeline_state.json")


def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    # First run, go back 7 days
    default_since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
    return {"last_run": default_since, "total_records": 0}


def save_state(state: dict):
    STATE_FILE.write_text(json.dumps(state, indent=2))


def fetch_new_papers(keyword: str, since: str) -> list[dict]:
    """Pull all new papers matching keyword since a given timestamp."""
    all_results = []
    cursor = None

    while True:
        params = {
            "q": keyword,
            "indexed_after": since,
            "has_text": "true",
            "limit": 100,
        }
        if cursor:
            params["cursor"] = cursor

        resp = requests.get(f"{BASE}/list", headers=HEADERS, params=params)
        resp.raise_for_status()
        data = resp.json()

        results = data.get("results", [])
        all_results.extend(results)

        # Paginate until exhausted
        cursor = data.get("next_cursor")
        if not cursor or not results:
            break

    return all_results


def run_pipeline(keyword: str, output_file: str = "stream.jsonl"):
    state = load_state()
    since = state["last_run"]
    now = datetime.now(timezone.utc).isoformat()

    print(f"Pipeline run: {since} → {now}")
    print(f"Keyword: '{keyword}'")

    papers = fetch_new_papers(keyword, since)
    print(f"New papers found: {len(papers)}")

    if not papers:
        print("Nothing new. Updating state and exiting.")
        state["last_run"] = now
        save_state(state)
        return

    # Bulk-fetch full texts in batches of 100
    added = 0
    with open(output_file, "a", encoding="utf-8") as f:  # append mode
        for i in range(0, len(papers), 100):
            batch = papers[i:i + 100]
            ids = [p["id"] for p in batch]
            ids_str = ",".join(ids)

            texts_resp = requests.get(f"{BASE}/texts/{ids_str}", headers=HEADERS)
            texts_resp.raise_for_status()  # fail loudly instead of parsing an error body
            texts = texts_resp.json()

            for paper in batch:
                pid = paper["id"]
                text = texts.get(pid)
                if not text:
                    continue

                f.write(json.dumps({
                    "id": pid,
                    "title": paper.get("title"),
                    "published_date": paper.get("published_date"),
                    "indexed_at": paper.get("indexed_at"),
                    "full_text": text,
                    "source_url": paper.get("url"),
                    "pipeline_run": now,
                }) + "\n")
                added += 1

    state["last_run"] = now
    state["total_records"] = state.get("total_records", 0) + added
    save_state(state)

    print(f"Added {added} new records. Total dataset size: {state['total_records']}")


# Run it, or stick this in a cron job / Airflow DAG
run_pipeline(
    keyword="CRISPR gene editing therapy",
    output_file="crispr_stream.jsonl",
)
```

Cron it at 6am daily:

```
0 6 * * * /usr/bin/python3 /path/to/pipeline.py >> /var/log/pipeline.log 2>&1
```

Your dataset grows automatically. Every morning it's slightly smarter than yesterday.

You have a JSONL file. Now make it useful to everyone, including your future self.

```python
from datasets import Dataset, DatasetDict
import json
from pathlib import Path


def jsonl_to_hf_dataset(jsonl_path: str, train_split: float = 0.9) -> DatasetDict:
    """
    Load a JSONL corpus and split into train/test.
    Pushing to the Hugging Face Hub is handled by push_to_hub below.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    print(f"Loaded {len(records)} records from {jsonl_path}")

    # Build HF Dataset
    ds = Dataset.from_list(records)

    # Train/test split
    split = ds.train_test_split(test_size=1 - train_split, seed=42)
    ds_dict = DatasetDict({"train": split["train"], "test": split["test"]})

    print(f"Train: {len(ds_dict['train'])} | Test: {len(ds_dict['test'])}")
    return ds_dict


def push_to_hub(ds_dict: DatasetDict, repo_id: str, hf_token: str):
    """Push dataset to Hugging Face Hub."""
    ds_dict.push_to_hub(
        repo_id,
        token=hf_token,
        commit_message="Dataset built with ScholarAPI, open-access full text",
    )
    print(f"Dataset live at: https://huggingface.co/datasets/{repo_id}")


# Full flow
ds = jsonl_to_hf_dataset("nlp_corpus.jsonl")
push_to_hub(
    ds_dict=ds,
    repo_id="your-org/nlp-academic-corpus",
    hf_token="hf_xxxxxxxxxx",
)
```

That's it. Your dataset is on the Hub, versioned, citable, and searchable.

Credits aren't opaque. Here's exactly what a dataset build costs:

| Action | Credits |
|---|---|
| /search per call | 10 + 2 per result |
| /text/{id} single | 3 credits (promo, normally 5) |
| /texts/{ids} bulk | 3 per paper (promo, normally 5) |
| /pdf/{id} | 5 credits (promo, normally 10) |

Real example: Building a 500-paper full-text corpus.

A 5,000-paper corpus at promo rates sits comfortably inside the $149 pack (10K credits). Promo pricing on text and PDF endpoints runs until the end of June 2026, so it's worth building sooner rather than later.

`has_text` is not guaranteed. Even with `has_text=true` in your list query, a small percentage of papers will return empty text. The PDF exists but extraction failed: a corrupted file, a scanned image-only PDF, an unusual encoding. Build your pipeline to handle `None` text gracefully. We do this above with the `if not text: continue` guard.

Deduplication matters. If you run multiple queries on overlapping topics, you'll get duplicate papers with different query labels. Deduplicate by `id` before training.
Don't skip this: duplicates in training data are a quiet way to skew your model.

```python
import json

# Deduplicate a JSONL by paper ID
seen = set()
with open("corpus.jsonl") as f_in, open("corpus_deduped.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        if record["id"] not in seen:
            seen.add(record["id"])
            f_out.write(line)

print(f"Unique papers: {len(seen)}")
```

Open-access only. Subscription-paywalled content from Elsevier, Wiley, and Taylor & Francis isn't here. Open-access publications from Springer and Nature are. Check your target domain's open-access rate before committing to a corpus size: CS and medicine have excellent OA coverage; some law and humanities journals less so.

Rate limits exist but are generous. Don't hammer the API with 1,000 parallel requests. The `time.sleep(0.5)` in the corpus builder above is intentional. You'll get cleaner results and avoid any throttling.

Here's the schema I actually use in production. Opinionated, tested, ready to go:

```python
RECORD_SCHEMA = {
    # Identity
    "id": str,              # ScholarAPI paper ID, stable, use as primary key
    "source_url": str,      # Original journal/repo URL, auditable

    # Content
    "title": str,
    "abstract": str,
    "full_text": str,       # Pre-extracted, clean

    # Metadata
    "authors": list,        # ["Last, First", ...]
    "published_date": str,  # ISO 8601: "2024-03-15"
    "journal": str,
    "doi": str,             # When available

    # Pipeline bookkeeping
    "query": str,           # Which search query surfaced this paper
    "indexed_at": str,      # When ScholarAPI indexed it
    "pipeline_run": str,    # When YOUR pipeline ran, for debugging
}
```

The `pipeline_run` field sounds like overkill until you have three months of streaming data and need to diagnose why a batch from February looks different from March. Add it from day one.

Biomedical QA fine-tuning corpus. Pull 10K papers from oncology, cardiology, and neurology. Split into (context, question, answer) triples using an LLM. Fine-tune a small model. You now have a domain-specific medical QA system trained on peer-reviewed literature.

Cross-disciplinary embedding benchmark.
Collect 1K papers across 10 domains. Embed them. Measure how well different embedding models separate domains in latent space. Publish the benchmark. People will cite it.

Hallucination evaluation dataset. Take paper abstracts. Generate LLM summaries. Compare against the actual conclusion sections. You have a grounded hallucination benchmark that's hard to game because the ground truth is published literature.

Temporal drift dataset. Pull papers from 2018 and 2024 on the same topic. Fine-tune on 2018 data, evaluate on 2024. You now have a dataset that measures how much a field has moved, which is exactly what you need to understand model knowledge cutoffs.

None of these existed as clean, reproducible pipelines before tools like ScholarAPI made the data layer boring. That's the point.

Dataset quality is a multiplier on everything downstream. A mediocre model trained on clean, well-structured, domain-specific data will outperform a great model trained on garbage. This is not a controversial opinion. It's something the entire ML community knows and somehow keeps forgetting when it comes time to actually collect the data.

The data layer deserves real engineering. Reproducibility. Versioning. Auditable sources. Deduplication. Clean text, not HTML artifacts.

ScholarAPI makes this tractable for academic literature. The endpoints are simple, the text extraction is real, and every record links back to where it came from. That's the baseline you need to build anything you'd actually trust.

The rest is your problem to solve. But at least it's the interesting part.

Get your API key at scholarapi.net (1,000 free credits).

Your first corpus query:

```
curl "https://scholarapi.net/api/v1/search?q=your+topic+here&limit=10" \
  -H "X-API-Key: sch_xxxxxxxxx"
```

Full API reference: https://scholarapi.net/docs/api. No fluff, just endpoints.

If you build something with this (a dataset, a benchmark, a pipeline), drop it in the comments.
I'm genuinely curious what people are using academic literature for that I haven't thought of yet.

Tags: python machinelearning datascience api
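
A postscript on cost: the credit table in the pricing section can be turned into a back-of-the-envelope estimator. This is a minimal sketch, not ScholarAPI's billing logic. The prices are copied from that table at promo rates. The function name `estimate_credits`, the cap of 100 results per `/search` call, and the idea that every search result yields a usable full text are my assumptions.

```python
import math

# Credit prices copied from the pricing table in this article (promo rates).
SEARCH_BASE = 10        # credits per /search call
SEARCH_PER_RESULT = 2   # credits per result returned by /search
TEXT_PER_PAPER = 3      # credits per full text via /texts/{ids}

# Assumption: one /search call returns at most 100 results.
RESULTS_PER_SEARCH = 100


def estimate_credits(n_papers: int) -> dict:
    """Rough credit estimate for finding and downloading n_papers full texts.

    Assumes every search result has usable full text, so no result is wasted.
    """
    search_calls = math.ceil(n_papers / RESULTS_PER_SEARCH)
    search = search_calls * SEARCH_BASE + n_papers * SEARCH_PER_RESULT
    texts = n_papers * TEXT_PER_PAPER
    return {"search": search, "texts": texts, "total": search + texts}


print(estimate_credits(500))
# {'search': 1050, 'texts': 1500, 'total': 2550}
```

Treat the number as a lower bound: results without extractable text still cost search credits, so a real build needs some headroom on top.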