How to Build a Clean Academic Dataset Without Losing Your Mind (or Your Weekend)

wpnews.pro

Everyone has an opinion on which model to fine-tune.

Nobody talks about where the training data actually comes from.

Ask any ML engineer who has built something on scientific literature and you'll hear the same story: the model took two weeks. The dataset took two months. The dataset was the hard part.

I've been there. Cobbling together CSVs from PubMed exports, writing scrapers that broke every time a journal sneezed, hand-cleaning PDF extractions that looked like someone ran a blender through a research paper. It's unglamorous, it's slow, and it's the reason a lot of genuinely good AI projects never ship.

This article is about doing it the right way: building clean, structured, reproducible academic datasets using ScholarAPI. We'll go from zero to a production-ready dataset pipeline, with real code you can run today.

Most dataset-building tutorials assume you're scraping Reddit or pulling from a nice REST API with a consistent schema. Academic literature is neither of those things.

Here's what you're actually dealing with:

Fragmentation. Research is spread across 20,000+ journals, repositories, preprint servers, and institutional databases. There is no single place to query all of it. PubMed covers medicine. arXiv covers physics and CS. Neither covers materials science, economics, or law particularly well.

Format chaos. The canonical format for academic publishing is PDF, a format designed for print, not machines. Extracting clean text from a PDF is a non-trivial engineering problem. Do it wrong and you get scrambled column layouts, broken equations, and reference lists fused into body text.

No stable programmatic access. Google Scholar indexes an estimated 389 million documents. It also has no official API. The moment your scraper gets reliable, Google changes something and you're back to zero.

Legal ambiguity at scale. Using copyrighted content to train models is genuinely complicated. Open-access literature, where authors have explicitly licensed reuse, is the safe zone. But you have to know what you're pulling.

ScholarAPI is built around exactly these constraints: 30M+ open-access papers, pre-extracted full text, structured JSON, stable endpoints. It doesn't solve every problem but it eliminates the ones that kill most projects before they start.

By the end of this article you'll have:

All three use the same four endpoints:

GET /api/v1/search          # find papers by keyword
GET /api/v1/list            # paginate by date, monitor new content  
GET /api/v1/text/{id}       # clean extracted full text
GET /api/v1/texts/{ids}     # bulk, up to 100 texts in one call

Auth is one header everywhere: X-API-Key: sch_xxxxxxxxx

Get your key at scholarapi.net. Signup includes 1,000 free credits, enough to pull a few hundred full texts and genuinely evaluate whether this works for your use case.
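Hard-coding the key in a script makes it easy to leak through version control. Here is a minimal sketch of one alternative, assuming you export the key as an environment variable. The variable name SCHOLARAPI_KEY and both helper names are my own choices, not part of the API:

```python
import os
from urllib.parse import urlencode

BASE = "https://scholarapi.net/api/v1"

def auth_headers(env_var: str = "SCHOLARAPI_KEY") -> dict[str, str]:
    """Read the API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before running the pipeline")
    return {"X-API-Key": key}

def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    query = urlencode(params)
    return f"{BASE}{path}?{query}" if query else f"{BASE}{path}"
```

With these in place, a search call becomes requests.get(build_url("/search", q="your topic here", limit=10), headers=auth_headers()), and the key never appears in the source file.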

This is the most common use case: you need N papers on a topic, with full text, for fine-tuning, RAG, or evaluation.

import requests
import json
import time
from pathlib import Path

API_KEY = "sch_xxxxxxxxx"
BASE    = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}

def search_papers(query: str, limit: int = 100) -> list[dict]:
    """Search for papers matching a query. Returns metadata list."""
    resp = requests.get(
        f"{BASE}/search",
        headers=HEADERS,
        params={"q": query, "limit": limit}
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

def fetch_texts_bulk(paper_ids: list[str]) -> dict[str, str]:
    """
    Fetch full text for up to 100 papers in one API call.
    Returns {paper_id: full_text} dict.
    """
    ids_str = ",".join(paper_ids[:100])
    resp = requests.get(
        f"{BASE}/texts/{ids_str}",
        headers=HEADERS
    )
    resp.raise_for_status()
    return resp.json()  # {id: text, id: text, ...}

def build_corpus(query: str, target_size: int = 500, output_path: str = "corpus.jsonl") -> int:
    """
    Build a full-text corpus for a given query topic.
    Saves to JSONL, one JSON object per line, easy to stream later.
    """
    print(f"Searching for papers: '{query}'")
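    # One search call, capped at 100 results; a target_size above 100 needs
    # several queries or /list pagination (see the streaming pipeline).
    papers = search_papers(query, limit=min(target_size, 100))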
    print(f"Found {len(papers)} papers in search results")

    with_text = [p for p in papers if p.get("has_text")]
    print(f"{len(with_text)} have full text available")

    written = 0
    with open(output_path, "w") as f:
        for i in range(0, len(with_text), 100):
            batch = with_text[i:i+100]
            ids   = [p["id"] for p in batch]

            texts = fetch_texts_bulk(ids)

            for paper in batch:
                pid  = paper["id"]
                text = texts.get(pid)
                if not text:
                    continue

                record = {
                    "id":             pid,
                    "title":          paper.get("title"),
                    "authors":        paper.get("authors", []),
                    "published_date": paper.get("published_date"),
                    "journal":        paper.get("journal"),
                    "abstract":       paper.get("abstract"),
                    "full_text":      text,
                    "source_url":     paper.get("url"),   # auditable backlink
                    "query":          query,
                }
                f.write(json.dumps(record) + "\n")
                written += 1

            time.sleep(0.5)
            print(f"  Batch {i//100 + 1} done — {written} records so far")

    print(f"\nCorpus saved to {output_path} — {written} papers with full text")
    return written

build_corpus(
    query="transformer attention mechanism natural language processing",
    target_size=200,
    output_path="nlp_corpus.jsonl"
)

Why JSONL? Because it streams. You can process a 10GB JSONL file line by line without loading it into memory. The Hugging Face datasets library also reads it natively. Start with JSONL; you'll thank yourself later.
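To make the streaming point concrete, here is a small sketch of a line-by-line reader. The function names are illustrative, and it assumes the record layout written by build_corpus above:

```python
import json
from typing import Iterator

def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one record at a time; only the current line is held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

def count_words(path: str) -> int:
    """Example consumer: total whitespace-separated tokens across full_text fields."""
    return sum(len((rec.get("full_text") or "").split()) for rec in iter_jsonl(path))
```

Because iter_jsonl is a generator, a consumer like count_words never materializes the whole corpus, whatever the file size.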

Why source_url in every record? ScholarAPI includes a backlink to the original paper on every result. Keep it. When someone asks "where did this training data come from," you have a per-record answer. That's the difference between an auditable dataset and a liability.

Static datasets go stale. If you're building a system that needs to stay current with new research (a literature monitoring agent, a continuously updated RAG knowledge base, an LLM that gets fine-tuned monthly), you need a pipeline, not a one-time dump.

The /list endpoint with indexed_after is what makes this possible. ScholarAPI indexes new papers within 24–48 hours of publication. Here's a pipeline that runs daily and appends only new content:

import requests
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

API_KEY = "sch_xxxxxxxxx"
BASE    = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}

STATE_FILE = Path(".pipeline_state.json")

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    default_since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
    return {"last_run": default_since, "total_records": 0}

def save_state(state: dict):
    STATE_FILE.write_text(json.dumps(state, indent=2))

def fetch_new_papers(keyword: str, since: str) -> list[dict]:
    """Pull all new papers matching keyword since a given timestamp."""
    all_results = []
    cursor      = None

    while True:
        params = {
            "q":             keyword,
            "indexed_after": since,
            "has_text":      "true",
            "limit":         100,
        }
        if cursor:
            params["cursor"] = cursor

        resp = requests.get(f"{BASE}/list", headers=HEADERS, params=params)
        resp.raise_for_status()
        data = resp.json()

        results = data.get("results", [])
        all_results.extend(results)

        cursor = data.get("next_cursor")
        if not cursor or not results:
            break

    return all_results

def run_pipeline(keyword: str, output_file: str = "stream.jsonl"):
    state = load_state()
    since = state["last_run"]
    now   = datetime.now(timezone.utc).isoformat()

    print(f"Pipeline run: {since} → {now}")
    print(f"Keyword: '{keyword}'")

    papers = fetch_new_papers(keyword, since)
    print(f"New papers found: {len(papers)}")

    if not papers:
        print("Nothing new. Updating state and exiting.")
        state["last_run"] = now
        save_state(state)
        return

    added = 0
    with open(output_file, "a") as f:  # append mode
        for i in range(0, len(papers), 100):
            batch = papers[i:i+100]
            ids   = [p["id"] for p in batch]
            ids_str = ",".join(ids)

            texts_resp = requests.get(
                f"{BASE}/texts/{ids_str}",
                headers=HEADERS
            )
            texts_resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
            texts = texts_resp.json()

            for paper in batch:
                pid  = paper["id"]
                text = texts.get(pid)
                if not text:
                    continue

                f.write(json.dumps({
                    "id":             pid,
                    "title":          paper.get("title"),
                    "published_date": paper.get("published_date"),
                    "indexed_at":     paper.get("indexed_at"),
                    "full_text":      text,
                    "source_url":     paper.get("url"),
                    "pipeline_run":   now,
                }) + "\n")
                added += 1

    state["last_run"]      = now
    state["total_records"] = state.get("total_records", 0) + added
    save_state(state)

    print(f"Added {added} new records. Total dataset size: {state['total_records']}")

run_pipeline(
    keyword="CRISPR gene editing therapy",
    output_file="crispr_stream.jsonl"
)

Cron it at 6am daily:

0 6 * * * /usr/bin/python3 /path/to/pipeline.py >> /var/log/pipeline.log 2>&1

Your dataset grows automatically. Every morning it's slightly smarter than yesterday.
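A cron job can be killed mid-write, and a half-written state file would break the next run. Here is a hedged sketch of a drop-in alternative to save_state, assuming the same JSON state layout as above. It writes to a temporary file in the same directory and then swaps it into place with os.replace, so readers see either the old file or the new one:

```python
import json
import os
import tempfile
from pathlib import Path

def save_state_atomic(state: dict, state_file: Path) -> None:
    """Write state to a temp file next to state_file, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=str(state_file.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, state_file)  # replaces the old file in one step
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave stray temp files behind
        raise
```

The temp file has to live on the same filesystem as the target for the swap to be a single rename, which is why it is created in state_file.parent rather than the system temp directory.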

You have a JSONL file. Now make it useful to everyone, including your future self.

from datasets import Dataset, DatasetDict
import json
from pathlib import Path

def jsonl_to_hf_dataset(jsonl_path: str, train_split: float = 0.9) -> DatasetDict:
    """
    Load a JSONL corpus and split into train/test.
    Returns a DatasetDict; pushing to the Hub is handled by push_to_hub().
    """
    records = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    print(f"Loaded {len(records)} records from {jsonl_path}")

    ds = Dataset.from_list(records)

    split     = ds.train_test_split(test_size=1 - train_split, seed=42)
    ds_dict   = DatasetDict({"train": split["train"], "test": split["test"]})

    print(f"Train: {len(ds_dict['train'])} | Test: {len(ds_dict['test'])}")
    return ds_dict

def push_to_hub(ds_dict: DatasetDict, repo_id: str, hf_token: str):
    """Push dataset to Hugging Face Hub."""
    ds_dict.push_to_hub(
        repo_id,
        token=hf_token,
        commit_message="Dataset built with ScholarAPI, open-access full text"
    )
    print(f"Dataset live at: https://huggingface.co/datasets/{repo_id}")

ds = jsonl_to_hf_dataset("nlp_corpus.jsonl")
push_to_hub(
    ds_dict=ds,
    repo_id="your-org/nlp-academic-corpus",
    hf_token="hf_xxxxxxxxxx"
)

That's it. Your dataset is on the Hub, versioned, citable, and searchable.
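One caveat if you rebuild the split from a growing file: a seeded random split is reproducible for a fixed input, but a paper can move between train and test once new records are appended. A possible alternative, which is my suggestion and not something the article's pipeline or the datasets library prescribes, is to derive the split from a hash of the paper id:

```python
import hashlib

def assign_split(paper_id: str, test_fraction: float = 0.1) -> str:
    """Map an id to 'train' or 'test' deterministically via SHA-256."""
    digest = hashlib.sha256(paper_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_records(records: list[dict], test_fraction: float = 0.1) -> dict[str, list[dict]]:
    """Partition records by the hash of their 'id' field."""
    out: dict[str, list[dict]] = {"train": [], "test": []}
    for rec in records:
        out[assign_split(rec["id"], test_fraction)].append(rec)
    return out
```

Each part can then go through Dataset.from_list as before. The split sizes are only approximately 90/10, but a given paper stays on the same side across every rebuild.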

Credits aren't opaque. Here's exactly what a dataset build costs:

Action	Credits
`/search` (per call)	10 + 2 per result
`/text/{id}` (single)	3 (promo, normally 5)
`/texts/{ids}` (bulk)	3 per paper (promo, normally 5)
`/pdf/{id}`	5 (promo, normally 10)

Real example: Building a 500-paper full-text corpus.

A 5,000-paper corpus at promo rates sits comfortably inside the $149 pack (10K credits).

Promo pricing on text and PDF endpoints runs until end of June 2026, worth building sooner rather than later.
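The arithmetic is simple enough to script. This is a rough estimator built only from the rates in the table, under assumptions of my own: each search call returns up to 100 results (the limit used in the corpus builder), every result has text and is fetched exactly once, and /list calls are ignored because the table gives no cost for them.

```python
import math

SEARCH_BASE = 10        # credits per /search call
SEARCH_PER_RESULT = 2   # credits per returned result
TEXT_PROMO = 3          # credits per full text at promo rates
TEXT_NORMAL = 5         # credits per full text at normal rates

def estimate_credits(n_papers: int, results_per_search: int = 100,
                     promo: bool = True) -> dict[str, int]:
    """Estimate credits for finding and fetching n_papers full texts."""
    calls = math.ceil(n_papers / results_per_search)
    search = calls * SEARCH_BASE + n_papers * SEARCH_PER_RESULT
    text = n_papers * (TEXT_PROMO if promo else TEXT_NORMAL)
    return {"search": search, "text": text, "total": search + text}

print(estimate_credits(500))
# {'search': 1050, 'text': 1500, 'total': 2550}
```

Under those assumptions a 500-paper build costs 1,050 search credits plus 1,500 text credits, 2,550 in total at promo rates. The same arithmetic puts text fetches alone for 5,000 papers at 15,000 credits, so it is worth running your own numbers against the current pricing page before choosing a pack.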

has_text is not guaranteed. Even with has_text=true in your list query, a small percentage of papers will return empty text. The PDF exists but extraction failed: corrupted file, scanned image-only PDF, unusual encoding. Build your pipeline to handle None text gracefully. We do this above with the if not text: continue guard.

Deduplication matters. If you run multiple queries on overlapping topics, you'll get duplicate papers with different query labels. Deduplicate by id before training. Don't skip this: duplicates in training data are a quiet way to skew your model.

import json

seen = set()  # paper IDs already written to the deduplicated file
with open("corpus.jsonl") as f_in, open("corpus_deduped.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        if record["id"] not in seen:
            seen.add(record["id"])
            f_out.write(line)

print(f"Unique papers: {len(seen)}")

Open-access only. Subscription-paywalled content from Elsevier, Wiley, and Taylor & Francis isn't here. Open-access publications from Springer and Nature are. Check your target domain's open-access rate before committing to a corpus size: CS and medicine have excellent OA coverage; some law and humanities journals less so.
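A cheap way to check coverage is to run one search for your topic and look at the share of results flagged with full text. This sketch relies on the has_text field that the corpus builder already filters on; the function name is illustrative:

```python
def full_text_rate(results: list[dict]) -> float:
    """Share of search results whose has_text flag is truthy (0.0 for no results)."""
    if not results:
        return 0.0
    return sum(1 for paper in results if paper.get("has_text")) / len(results)
```

For example, full_text_rate(search_papers("contract law arbitration")) on a single 100-result call gives a rough signal of how many searches a target corpus size will need in that domain. It is a sample from one query, not a measurement of the field's open-access rate.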

Rate limits exist but are generous. Don't hammer the API with 1,000 parallel requests. The time.sleep(0.5) in the corpus builder above is intentional. You'll get cleaner results and avoid any throttling.
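A fixed sleep covers the common case. For the occasional throttled request, retrying with exponential backoff is a standard pattern. The article does not say how ScholarAPI signals throttling, so this sketch stays independent of any HTTP library: Throttled is a hypothetical exception that your own fetch function would raise, for instance on an HTTP 429 response if that is what the API returns.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class Throttled(Exception):
    """Hypothetical marker: raise this when the API reports a rate limit."""

def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on Throttled with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller see the failure
            # 0.5s, 1s, 2s, ... plus jitter so parallel workers do not retry in lockstep
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

Usage would look like with_backoff(lambda: fetch_texts_bulk(ids)), with fetch_texts_bulk adapted to raise Throttled when it sees a rate-limit response. The sleep parameter exists so the delay can be stubbed out in tests.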

Here's the schema I actually use in production. Opinionated, tested, ready to go:

RECORD_SCHEMA = {
    "id":             str,   # ScholarAPI paper ID, stable, use as primary key
    "source_url":     str,   # Original journal/repo URL, auditable

    "title":          str,
    "abstract":       str,
    "full_text":      str,   # Pre-extracted, clean

    "authors":        list,  # ["Last, First", ...]
    "published_date": str,   # ISO 8601: "2024-03-15"
    "journal":        str,
    "doi":            str,   # When available

    "query":          str,   # Which search query surfaced this paper
    "indexed_at":     str,   # When ScholarAPI indexed it
    "pipeline_run":   str,   # When YOUR pipeline ran, for debugging
}

The pipeline_run field sounds like overkill until you have three months of streaming data and need to diagnose why a batch from February looks different from March. Add it from day one.
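A schema only helps if something checks it. Here is a minimal validation sketch for records in this shape. Which fields count as required is my assumption (the article does not say), so adjust REQUIRED and OPTIONAL to your own needs:

```python
# Assumption: these four fields must be present and non-empty.
REQUIRED = {"id": str, "source_url": str, "title": str, "full_text": str}
# Everything else may be absent or None, but must have the right type if set.
OPTIONAL = {
    "abstract": str, "authors": list, "published_date": str, "journal": str,
    "doi": str, "query": str, "indexed_at": str, "pipeline_run": str,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, typ in REQUIRED.items():
        value = record.get(field)
        if not isinstance(value, typ) or not value:
            problems.append(f"{field}: missing, empty, or not {typ.__name__}")
    for field, typ in OPTIONAL.items():
        value = record.get(field)
        if value is not None and not isinstance(value, typ):
            problems.append(f"{field}: expected {typ.__name__}, got {type(value).__name__}")
    return problems
```

Running this over the JSONL file before training or publishing turns silent schema drift into an explicit list of offending records.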

Biomedical QA fine-tuning corpus. Pull 10K papers from oncology, cardiology, and neurology. Split into (context, question, answer) triples using an LLM. Fine-tune a small model. You now have a domain-specific medical QA system trained on peer-reviewed literature.

Cross-disciplinary embedding benchmark. Build 1K papers across 10 domains. Embed them. Measure how well different embedding models separate domains in latent space. Publish the benchmark. People will cite it.

Hallucination evaluation dataset. Take paper abstracts. Generate LLM summaries. Compare against the actual conclusion sections. You have a grounded hallucination benchmark that's impossible to game because the ground truth is published literature.

Temporal drift dataset. Pull papers from 2018 and 2024 on the same topic. Fine-tune on 2018 data, evaluate on 2024. You now have a dataset that measures how much a field has moved, which is exactly what you need to understand model knowledge cutoffs.

None of these existed as clean, reproducible pipelines before tools like ScholarAPI made the data layer boring. That's the point.
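As one concrete starting point, the temporal drift idea mostly comes down to partitioning records by publication year. This sketch assumes published_date is an ISO 8601 string such as "2024-03-15", as in the schema above. The function name and default years are illustrative:

```python
from collections import defaultdict

def split_by_year(records: list[dict], train_year: int = 2018,
                  eval_year: int = 2024) -> tuple[list[dict], list[dict]]:
    """Return (train, eval) lists of records published in the two given years."""
    buckets: dict[int, list[dict]] = defaultdict(list)
    for rec in records:
        year = (rec.get("published_date") or "")[:4]  # "2024-03-15" -> "2024"
        if year.isdigit():
            buckets[int(year)].append(rec)
    return buckets.get(train_year, []), buckets.get(eval_year, [])
```

Records with a missing or malformed date are dropped rather than guessed at, which keeps the two sides of the comparison clean.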

Dataset quality is a multiplier on everything downstream.

A mediocre model trained on clean, well-structured, domain-specific data will outperform a great model trained on garbage. This is not a controversial opinion. It's something the entire ML community knows and somehow keeps forgetting when it comes time to actually collect the data.

The data layer deserves real engineering. Reproducibility. Versioning. Auditable sources. Deduplication. Clean text, not HTML artifacts.

ScholarAPI makes this tractable for academic literature. The endpoints are simple, the text extraction is real, and every record links back to where it came from. That's the baseline you need to build anything you'd actually trust.

The rest is your problem to solve. But at least it's the interesting part.


curl "https://scholarapi.net/api/v1/search?q=your+topic+here&limit=10" \
  -H "X-API-Key: sch_xxxxxxxxx"

Full API reference. No fluff, just endpoints.

If you build something with this (a dataset, a benchmark, a pipeline), drop it in the comments. I'm genuinely curious what people are using academic literature for that I haven't thought of yet.

Tags: python machinelearning datascience api

Source: original article on dev.to