{"slug": "ml-research-datasets-from-arxiv-and-semantic-scholar-jsonl-quality-scored", "title": "ML research datasets from ArXiv and Semantic Scholar (JSONL, quality-scored)", "summary": "FineSet released four quality-scored ML research datasets on Hugging Face, covering papers on synthetic data, efficient LLMs, LLM agents, and mechanistic interpretability drawn from ArXiv and Semantic Scholar. The datasets are continuously updated and intended for fine-tuning.", "body_md": "Hiring 💼\nFineSet\nfineset-io\n1 follower · 3 following\nhttps://fineset.io\nfineset_io\n\nAI & ML interests\nExport-ready, continuously updated training datasets from arXiv, GitHub & more. Describe what you want to fine-tune on → get a dataset\n\nRecent Activity\n- updated a dataset 1 day ago: fineset-io/synthetic-data-papers\n- published a dataset 1 day ago: fineset-io/synthetic-data-papers\n- updated a dataset 1 day ago: fineset-io/efficient-llm-papers\n\nOrganizations\nNone yet\n\nmodels 0\nNone public yet\n\ndatasets 4\nSort: Recently updated\n- fineset-io/synthetic-data-papers • Viewer • Updated 1 day ago • 738 • 13\n- fineset-io/efficient-llm-papers • Viewer • Updated 1 day ago • 1.73k • 13\n- fineset-io/llm-agent-papers • Viewer • Updated 4 days ago • 1.66k • 49\n- fineset-io/mechanistic-interpretability-papers • Viewer • Updated 4 days ago • 748 • 68 • 1", "url": "https://wpnews.pro/news/ml-research-datasets-from-arxiv-and-semantic-scholar-jsonl-quality-scored", "canonical_source": "https://huggingface.co/fineset-io", "published_at": "2026-06-16 09:31:18+00:00", "updated_at": "2026-06-16 09:48:44.829611+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-research"], "entities": ["FineSet", "Hugging Face", "ArXiv", "Semantic Scholar"], "alternates": {"html": "https://wpnews.pro/news/ml-research-datasets-from-arxiv-and-semantic-scholar-jsonl-quality-scored", "markdown": "https://wpnews.pro/news/ml-research-datasets-from-arxiv-and-semantic-scholar-jsonl-quality-scored.md", "text": "https://wpnews.pro/news/ml-research-datasets-from-arxiv-and-semantic-scholar-jsonl-quality-scored.txt", "jsonld": "https://wpnews.pro/news/ml-research-datasets-from-arxiv-and-semantic-scholar-jsonl-quality-scored.jsonld"}}