ML research datasets from arXiv and Semantic Scholar (JSONL, quality-scored)

FineSet has released four quality-scored ML research datasets on Hugging Face, covering papers on synthetic data, efficient LLMs, LLM agents, and mechanistic interpretability drawn from arXiv and Semantic Scholar. The datasets are distributed as JSONL, are continuously updated, and are designed for fine-tuning. The organization page lists two figures per dataset, ranging from 738 to 1.73k and from 13 to 68.
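Since the datasets are described as quality-scored JSONL, a typical first step before fine-tuning is to keep only records above a score threshold. The following is a minimal sketch of that step, not FineSet's own tooling: the field names `title` and `quality_score` and the sample rows are hypothetical, since the actual schema is not given here and should be checked on each dataset card.

```python
import io
import json


def filter_by_quality(lines, min_score, score_field="quality_score"):
    """Yield parsed JSONL records whose score is at least min_score.

    score_field is a hypothetical field name; check the dataset card for
    the real schema before relying on it.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        score = record.get(score_field)
        # Records without a numeric score are dropped rather than guessed at.
        if isinstance(score, (int, float)) and score >= min_score:
            yield record


# Made-up sample rows standing in for a downloaded JSONL file.
sample = io.StringIO(
    '{"title": "Paper A", "quality_score": 0.91}\n'
    '{"title": "Paper B", "quality_score": 0.42}\n'
    '\n'
    '{"title": "Paper C"}\n'
)

kept = list(filter_by_quality(sample, min_score=0.8))
print([r["title"] for r in kept])
```

For a real file, an open text-mode file handle can be passed in place of `sample`, because the function only needs an iterable of lines.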

FineSet (fineset-io) on Hugging Face
Website: https://fineset.io

AI & ML interests: Export-ready, continuously-updated training datasets from arXiv, GitHub & more. Describe what you want to fine-tune on → get a dataset.

Recent activity: fineset-io/synthetic-data-papers was published and updated 1 day ago; fineset-io/efficient-llm-papers was updated 1 day ago.

Models: none public yet.

Datasets (4), sorted by recently updated:
fineset-io/synthetic-data-papers • Updated 1 day ago • 738 • 13
fineset-io/efficient-llm-papers • Updated 1 day ago • 1.73k • 13
fineset-io/llm-agent-papers • Updated 4 days ago • 1.66k • 49
fineset-io/mechanistic-interpretability-papers • Updated 4 days ago • 748 • 68 • 1