{"slug": "how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend", "title": "How to Build a Clean Academic Dataset Without Losing Your Mind (or Your Weekend)", "summary": "A developer built a clean, reproducible academic dataset pipeline using ScholarAPI, an API that provides access to 30 million open-access papers with pre-extracted full text in structured JSON format. The pipeline uses four stable endpoints for searching, listing, and retrieving paper texts, eliminating common problems like PDF extraction errors, fragmented sources, and legal ambiguity that typically consume months of engineering time. The approach enables machine learning engineers to build production-ready datasets for fine-tuning or RAG in hours rather than weeks.", "body_md": "Everyone has an opinion on which model to fine-tune.\n\nNobody talks about where the training data actually comes from.\n\nAsk any ML engineer who has built something on scientific literature and you'll hear the same story: the model took two weeks. The dataset took two months. The dataset was the hard part.\n\nI've been there. Cobbling together CSVs from PubMed exports, writing scrapers that broke every time a journal sneezed, hand-cleaning PDF extractions that looked like someone ran a blender through a research paper. It's unglamorous, it's slow, and it's the reason a lot of genuinely good AI projects never ship.\n\nThis article is about doing it the right way, building clean, structured, reproducible academic datasets using [ScholarAPI](https://scholarapi.net/?via=-asig3). We'll go from zero to a production-ready dataset pipeline, with real code you can run today.\n\nMost dataset-building tutorials assume you're scraping Reddit or pulling from a nice REST API with a consistent schema. 
Academic literature is neither of those things.\n\nHere's what you're actually dealing with:\n\n**Fragmentation.** Research is spread across 20,000+ journals, repositories, preprint servers, and institutional databases. There is no single place to query all of it. PubMed covers medicine. arXiv covers physics and CS. Neither covers materials science, economics, or law particularly well.\n\n**Format chaos.** The canonical format for academic publishing is PDF, a format designed for print, not machines. Extracting clean text from a PDF is a non-trivial engineering problem. Do it wrong and you get scrambled column layouts, broken equations, and reference lists fused into body text.\n\n**No stable programmatic access.** Google Scholar has 389 million papers. It also has no API. The moment your scraper gets reliable, Google changes something and you're back to zero.\n\n**Legal ambiguity at scale.** Using copyrighted content to train models is genuinely complicated. Open-access literature, where authors have explicitly licensed reuse, is the safe zone. But you have to know what you're pulling.\n\n[ScholarAPI](https://scholarapi.net/?via=-asig3) is built around exactly these constraints: 30M+ open-access papers, pre-extracted full text, structured JSON, stable endpoints. 
It doesn't solve every problem, but it eliminates the ones that kill most projects before they start.\n\nBy the end of this article you'll have three things: a topic-specific full-text corpus in JSONL, a daily pipeline that appends newly indexed papers, and a train/test dataset pushed to the Hugging Face Hub.\n\nAll three use the same four endpoints:\n\n```\nGET /api/v1/search          # find papers by keyword\nGET /api/v1/list            # paginate by date, monitor new content\nGET /api/v1/text/{id}       # clean extracted full text\nGET /api/v1/texts/{ids}     # bulk, up to 100 texts in one call\n```\n\nAuth is one header everywhere: `X-API-Key: sch_xxxxxxxxx`\n\nGet your key at [scholarapi.net](https://scholarapi.net). Signup comes with 1,000 free credits, enough to pull a few hundred full texts and genuinely evaluate whether this works for your use case.\n\nThis is the most common use case: you need N papers on a topic, with full text, for fine-tuning, RAG, or evaluation.\n\n``` python\nimport requests\nimport json\nimport time\nfrom pathlib import Path\n\nAPI_KEY = \"sch_xxxxxxxxx\"\nBASE    = \"https://scholarapi.net/api/v1\"\nHEADERS = {\"X-API-Key\": API_KEY}\n\ndef search_papers(query: str, limit: int = 100) -> list[dict]:\n    \"\"\"Search for papers matching a query. 
Returns metadata list.\"\"\"\n    resp = requests.get(\n        f\"{BASE}/search\",\n        headers=HEADERS,\n        params={\"q\": query, \"limit\": limit}\n    )\n    resp.raise_for_status()\n    return resp.json().get(\"results\", [])\n\ndef fetch_texts_bulk(paper_ids: list[str]) -> dict[str, str]:\n    \"\"\"\n    Fetch full text for up to 100 papers in one API call.\n    Returns {paper_id: full_text} dict.\n    \"\"\"\n    # API accepts comma-separated IDs\n    ids_str = \",\".join(paper_ids[:100])\n    resp = requests.get(\n        f\"{BASE}/texts/{ids_str}\",\n        headers=HEADERS\n    )\n    resp.raise_for_status()\n    return resp.json()  # {id: text, id: text, ...}\n\ndef build_corpus(query: str, target_size: int = 500, output_path: str = \"corpus.jsonl\") -> int:\n    \"\"\"\n    Build a full-text corpus for a given query topic.\n    Saves to JSONL, one JSON object per line, easy to stream later.\n    \"\"\"\n    print(f\"Searching for papers: '{query}'\")\n    # Note: this makes a single search call capped at 100 results,\n    # so any target_size above 100 is truncated to 100 here\n    papers = search_papers(query, limit=min(target_size, 100))\n    print(f\"Found {len(papers)} papers in search results\")\n\n    # Filter to papers that have full text available\n    with_text = [p for p in papers if p.get(\"has_text\")]\n    print(f\"{len(with_text)} have full text available\")\n\n    written = 0\n    # Batch into groups of 100 for the bulk endpoint\n    with open(output_path, \"w\") as f:\n        for i in range(0, len(with_text), 100):\n            batch = with_text[i:i+100]\n            ids   = [p[\"id\"] for p in batch]\n\n            texts = fetch_texts_bulk(ids)\n\n            for paper in batch:\n                pid  = paper[\"id\"]\n                text = texts.get(pid)\n                if not text:\n                    continue\n\n                record = {\n                    \"id\":             pid,\n                    \"title\":          paper.get(\"title\"),\n                    \"authors\":        paper.get(\"authors\", []),\n                    
\"published_date\": paper.get(\"published_date\"),\n                    \"journal\":        paper.get(\"journal\"),\n                    \"abstract\":       paper.get(\"abstract\"),\n                    \"full_text\":      text,\n                    \"source_url\":     paper.get(\"url\"),   # auditable backlink\n                    \"query\":          query,\n                }\n                f.write(json.dumps(record) + \"\\n\")\n                written += 1\n\n            # Be a good API citizen\n            time.sleep(0.5)\n            print(f\"  Batch {i//100 + 1} done, {written} records so far\")\n\n    print(f\"\\nCorpus saved to {output_path}: {written} papers with full text\")\n    return written\n\n# Run it\nbuild_corpus(\n    query=\"transformer attention mechanism natural language processing\",\n    target_size=200,\n    output_path=\"nlp_corpus.jsonl\"\n)\n```\n\n**Why JSONL?** Because it streams. You can process a 10GB JSONL file line-by-line without loading it into memory. Hugging Face `datasets` also loads it directly. Start with JSONL; you'll thank yourself later.\n\n**Why source_url in every record?** ScholarAPI includes a backlink to the original paper on every result. Keep it. When someone asks \"where did this training data come from,\" you have a per-record answer. That's the difference between an auditable dataset and a liability.\n\nStatic datasets go stale. If you're building a system that needs to stay current with new research (a literature monitoring agent, a continuously updated RAG knowledge base, an LLM that gets fine-tuned monthly), you need a pipeline, not a one-time dump.\n\nThe `/list` endpoint with `indexed_after` is what makes this possible. ScholarAPI indexes new papers within 24–48 hours of publication. 
Here's a pipeline that runs daily and appends only new content:\n\n``` python\nimport requests\nimport json\nfrom datetime import datetime, timedelta, timezone\nfrom pathlib import Path\n\nAPI_KEY = \"sch_xxxxxxxxx\"\nBASE    = \"https://scholarapi.net/api/v1\"\nHEADERS = {\"X-API-Key\": API_KEY}\n\n# State file — tracks when we last ran\nSTATE_FILE = Path(\".pipeline_state.json\")\n\ndef load_state() -> dict:\n    if STATE_FILE.exists():\n        return json.loads(STATE_FILE.read_text())\n    # First run, go back 7 days\n    default_since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()\n    return {\"last_run\": default_since, \"total_records\": 0}\n\ndef save_state(state: dict):\n    STATE_FILE.write_text(json.dumps(state, indent=2))\n\ndef fetch_new_papers(keyword: str, since: str) -> list[dict]:\n    \"\"\"Pull all new papers matching keyword since a given timestamp.\"\"\"\n    all_results = []\n    cursor      = None\n\n    while True:\n        params = {\n            \"q\":             keyword,\n            \"indexed_after\": since,\n            \"has_text\":      \"true\",\n            \"limit\":         100,\n        }\n        if cursor:\n            params[\"cursor\"] = cursor\n\n        resp = requests.get(f\"{BASE}/list\", headers=HEADERS, params=params)\n        resp.raise_for_status()\n        data = resp.json()\n\n        results = data.get(\"results\", [])\n        all_results.extend(results)\n\n        # Paginate until exhausted\n        cursor = data.get(\"next_cursor\")\n        if not cursor or not results:\n            break\n\n    return all_results\n\ndef run_pipeline(keyword: str, output_file: str = \"stream.jsonl\"):\n    state = load_state()\n    since = state[\"last_run\"]\n    now   = datetime.now(timezone.utc).isoformat()\n\n    print(f\"Pipeline run: {since} → {now}\")\n    print(f\"Keyword: '{keyword}'\")\n\n    papers = fetch_new_papers(keyword, since)\n    print(f\"New papers found: {len(papers)}\")\n\n    if not 
papers:\n        print(\"Nothing new. Updating state and exiting.\")\n        state[\"last_run\"] = now\n        save_state(state)\n        return\n\n    # Bulk-fetch full texts in batches of 100\n    added = 0\n    with open(output_file, \"a\") as f:  # append mode\n        for i in range(0, len(papers), 100):\n            batch = papers[i:i+100]\n            ids   = [p[\"id\"] for p in batch]\n            ids_str = \",\".join(ids)\n\n            texts_resp = requests.get(\n                f\"{BASE}/texts/{ids_str}\",\n                headers=HEADERS\n            )\n            # Fail loudly on HTTP errors instead of parsing an error body as texts\n            texts_resp.raise_for_status()\n            texts = texts_resp.json()\n\n            for paper in batch:\n                pid  = paper[\"id\"]\n                text = texts.get(pid)\n                if not text:\n                    continue\n\n                f.write(json.dumps({\n                    \"id\":             pid,\n                    \"title\":          paper.get(\"title\"),\n                    \"published_date\": paper.get(\"published_date\"),\n                    \"indexed_at\":     paper.get(\"indexed_at\"),\n                    \"full_text\":      text,\n                    \"source_url\":     paper.get(\"url\"),\n                    \"pipeline_run\":   now,\n                }) + \"\\n\")\n                added += 1\n\n    # State is only advanced after every batch has been written\n    state[\"last_run\"]      = now\n    state[\"total_records\"] = state.get(\"total_records\", 0) + added\n    save_state(state)\n\n    print(f\"Added {added} new records. Total dataset size: {state['total_records']}\")\n\n# Run it, or stick this in a cron job / Airflow DAG\nrun_pipeline(\n    keyword=\"CRISPR gene editing therapy\",\n    output_file=\"crispr_stream.jsonl\"\n)\n```\n\nCron it at 6am daily:\n\n```\n0 6 * * * /usr/bin/python3 /path/to/pipeline.py >> /var/log/pipeline.log 2>&1\n```\n\nYour dataset grows automatically. Every morning it's slightly more current than yesterday.\n\nYou have a JSONL file. 
Now make it useful to everyone, including your future self.\n\n``` python\nfrom datasets import Dataset, DatasetDict\nimport json\nfrom pathlib import Path\n\ndef jsonl_to_hf_dataset(jsonl_path: str, train_split: float = 0.9) -> DatasetDict:\n    \"\"\"\n    Load a JSONL corpus and split into train/test.\n    Pushes to Hugging Face Hub.\n    \"\"\"\n    records = []\n    with open(jsonl_path) as f:\n        for line in f:\n            line = line.strip()\n            if line:\n                records.append(json.loads(line))\n\n    print(f\"Loaded {len(records)} records from {jsonl_path}\")\n\n    # Build HF Dataset\n    ds = Dataset.from_list(records)\n\n    # Train/test split\n    split     = ds.train_test_split(test_size=1 - train_split, seed=42)\n    ds_dict   = DatasetDict({\"train\": split[\"train\"], \"test\": split[\"test\"]})\n\n    print(f\"Train: {len(ds_dict['train'])} | Test: {len(ds_dict['test'])}\")\n    return ds_dict\n\ndef push_to_hub(ds_dict: DatasetDict, repo_id: str, hf_token: str):\n    \"\"\"Push dataset to Hugging Face Hub.\"\"\"\n    ds_dict.push_to_hub(\n        repo_id,\n        token=hf_token,\n        commit_message=\"Dataset built with ScholarAPI, open-access full text\"\n    )\n    print(f\"Dataset live at: https://huggingface.co/datasets/{repo_id}\")\n\n# Full flow\nds = jsonl_to_hf_dataset(\"nlp_corpus.jsonl\")\npush_to_hub(\n    ds_dict=ds,\n    repo_id=\"your-org/nlp-academic-corpus\",\n    hf_token=\"hf_xxxxxxxxxx\"\n)\n```\n\nThat's it. Your dataset is on the Hub, versioned, citable, and searchable.\n\nCredits aren't opaque. 
Here's exactly what a dataset build costs:\n\n| Action | Credits |\n|---|---|\n| `/search` (per call) | 10 + 2 per result |\n| `/text/{id}` (single) | 3 credits (promo, normally 5) |\n| `/texts/{ids}` (bulk) | 3 per paper (promo, normally 5) |\n| `/pdf/{id}` | 5 credits (promo, normally 10) |\n\n**Real example:** Building a 500-paper full-text corpus.\n\nA 5,000-paper corpus at promo rates sits comfortably inside the $149 pack (10K credits).\n\nPromo pricing on text and PDF endpoints runs until the end of June 2026, so it's worth building sooner rather than later.\n\n**`has_text` is not guaranteed.** Even with `has_text=true` in your list query, a small percentage of papers will return empty text. The PDF exists but extraction failed: a corrupted file, a scanned image-only PDF, or an unusual encoding. Build your pipeline to handle `None` text gracefully. We do this above with the `if not text: continue` guard.\n\n**Deduplication matters.** If you run multiple queries on overlapping topics, you'll get duplicate papers with different query labels. Deduplicate by `id` before training. Don't skip this: duplicates in training data are a quiet way to skew your model.\n\n```\n# Deduplicate a JSONL by paper ID\nimport json\n\nseen = set()\nwith open(\"corpus.jsonl\") as f_in, open(\"corpus_deduped.jsonl\", \"w\") as f_out:\n    for line in f_in:\n        record = json.loads(line)\n        if record[\"id\"] not in seen:\n            seen.add(record[\"id\"])\n            f_out.write(line)\n\nprint(f\"Unique papers: {len(seen)}\")\n```\n\n**Open-access only.** Subscription-paywalled content from Elsevier, Wiley, and Taylor & Francis isn't here. Open-access publications from Springer and Nature are. Check your target domain's open-access rate before committing to a corpus size: CS and medicine have excellent OA coverage; some law and humanities journals less so.\n\n**Rate limits exist but are generous.** Don't hammer the API with 1,000 parallel requests. 
The `time.sleep(0.5)` in the corpus builder above is intentional. You'll get cleaner results and avoid any throttling.\n\nHere's the schema I actually use in production. Opinionated, tested, ready to go:\n\n```\nRECORD_SCHEMA = {\n    # Identity\n    \"id\":             str,   # ScholarAPI paper ID, stable, use as primary key\n    \"source_url\":     str,   # Original journal/repo URL, auditable\n\n    # Content\n    \"title\":          str,\n    \"abstract\":       str,\n    \"full_text\":      str,   # Pre-extracted, clean\n\n    # Metadata\n    \"authors\":        list,  # [\"Last, First\", ...]\n    \"published_date\": str,   # ISO 8601: \"2024-03-15\"\n    \"journal\":        str,\n    \"doi\":            str,   # When available\n\n    # Pipeline bookkeeping\n    \"query\":          str,   # Which search query surfaced this paper\n    \"indexed_at\":     str,   # When ScholarAPI indexed it\n    \"pipeline_run\":   str,   # When YOUR pipeline ran, for debugging\n}\n```\n\nThe `pipeline_run` field sounds like overkill until you have three months of streaming data and need to diagnose why a batch from February looks different from March. Add it from day one.\n\n**Biomedical QA fine-tuning corpus.** Pull 10K papers from oncology, cardiology, and neurology. Split into (context, question, answer) triples using an LLM. Fine-tune a small model. You now have a domain-specific medical QA system trained on peer-reviewed literature.\n\n**Cross-disciplinary embedding benchmark.** Build 1K papers across 10 domains. Embed them. Measure how well different embedding models separate domains in latent space. Publish the benchmark. People will cite it.\n\n**Hallucination evaluation dataset.** Take paper abstracts. Generate LLM summaries. Compare against the actual conclusion sections. 
You have a grounded hallucination benchmark that's impossible to game because the ground truth is published literature.\n\n**Temporal drift dataset.** Pull papers from 2018 and 2024 on the same topic. Fine-tune on 2018 data, evaluate on 2024. You now have a dataset that measures how much a field has moved, which is exactly what you need to understand model knowledge cutoffs.\n\nNone of these existed as clean, reproducible pipelines before tools like ScholarAPI made the data layer boring. That's the point.\n\nDataset quality is a multiplier on everything downstream.\n\nA mediocre model trained on clean, well-structured, domain-specific data will outperform a great model trained on garbage. This is not a controversial opinion. It's something the entire ML community knows and somehow keeps forgetting when it comes time to actually collect the data.\n\nThe data layer deserves real engineering. Reproducibility. Versioning. Auditable sources. Deduplication. Clean text, not HTML artifacts.\n\nScholarAPI makes this tractable for academic literature. The endpoints are simple, the text extraction is real, and every record links back to where it came from. That's the baseline you need to build anything you'd actually trust.\n\nThe rest is your problem to solve. But at least it's the interesting part.\n\n```\n# Get your API key at scholarapi.net, 1,000 free credits\n\n# Your first corpus query:\ncurl \"https://scholarapi.net/api/v1/search?q=your+topic+here&limit=10\" \\\n  -H \"X-API-Key: sch_xxxxxxxxx\"\n```\n\n[Full API reference](https://scholarapi.net/docs/api). No fluff, just endpoints.\n\nIf you build something with this, a dataset, a benchmark, a pipeline, drop it in the comments. 
I'm genuinely curious what people are using academic literature for that I haven't thought of yet.\n\n*Tags: python machinelearning datascience api*", "url": "https://wpnews.pro/news/how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend", "canonical_source": "https://dev.to/reel_crave/how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend-1oa3", "published_at": "2026-05-28 11:17:51+00:00", "updated_at": "2026-05-28 11:53:27.050894+00:00", "lang": "en", "topics": ["machine-learning", "artificial-intelligence", "ai-research", "ai-tools", "natural-language-processing"], "entities": ["ScholarAPI", "PubMed", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend", "markdown": "https://wpnews.pro/news/how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend.md", "text": "https://wpnews.pro/news/how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend.txt", "jsonld": "https://wpnews.pro/news/how-to-build-a-clean-academic-dataset-without-losing-your-mind-or-your-weekend.jsonld"}}