Recursive AI Research Skill for Claude Code / OpenClaw / Codex

A new open-source `SKILL.md` file enables coding agents like Claude Code, OpenClaw, and Codex to autonomously execute the full ML/AI research loop, from hypothesis to paper, while logging mistakes as lessons to prevent repetition. The recursive design helps researchers avoid common pitfalls such as leaked labels, single-seed wins, and hallucinated citations, with the skill improving over time via a `LESSONS.md` file.

One `SKILL.md` that makes any coding agent move fast through the full ML/AI research loop, and get better every time it makes a mistake.

Hypothesis → literature review → reproduce baseline → leak-free experiments → honest analysis → paper.

Works with [Claude](https://claude.ai) / Claude Code, [OpenAI Codex](https://openai.com), and [OpenClaw](https://github.com/openclaw/openclaw). Recursive by design: it logs its own mistakes as rules so it never repeats them.

⭐ If this saves you from one leaked-label result, please star the repo. It helps other researchers find it.

Most AI-agent "research" help is a chatbot that sounds confident and cites papers that don't exist. Real research fails in specific, boring, expensive ways: a baseline you quoted instead of ran, a metric that's secretly leaking the label, a "gain" that's really one lucky seed, a citation invented from memory. This skill encodes the discipline that catches those failures before they cost you a month, as a portable `SKILL.md` your agent reads automatically.

And it's recursive: when the agent makes a mistake and fixes it, it writes the lesson to `LESSONS.md`, reads that file at the start of every future task, and stops repeating itself. The skill you use in month three is sharper than the one you installed.
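The write-then-reread loop around `LESSONS.md` can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the one-line entry format and the helper names `append_lesson` and `load_rules` are hypothetical and are not taken from the skill itself, which may structure its lessons file differently.

```python
import tempfile
from datetime import date
from pathlib import Path


def append_lesson(path: Path, mistake: str, rule: str) -> None:
    """Append one lesson as a markdown bullet, creating the file if needed.

    The "date | mistake | rule" layout is an assumed format for this sketch.
    """
    entry = f"- {date.today().isoformat()} | mistake: {mistake} | rule: {rule}\n"
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)


def load_rules(path: Path) -> list[str]:
    """Return the recorded rules, oldest first; empty list if no file yet."""
    if not path.exists():
        return []
    rules = []
    for line in path.read_text(encoding="utf-8").splitlines():
        # Only parse bullets written by append_lesson; skip anything else.
        if line.startswith("- ") and "| rule: " in line:
            rules.append(line.split("| rule: ", 1)[1].strip())
    return rules


if __name__ == "__main__":
    # Demo in a throwaway directory so no real LESSONS.md is touched.
    with tempfile.TemporaryDirectory() as tmp:
        lessons = Path(tmp) / "LESSONS.md"
        append_lesson(lessons, "quoted the baseline from the paper",
                      "run the baseline locally before comparing")
        # At the start of the next task, the rules are read back first.
        for rule in load_rules(lessons):
            print("RULE:", rule)
```

Appending instead of rewriting keeps the file an ordered history, so each new task sees every rule recorded so far, which is the property the recursive design relies on.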
| Stage | What the agent does | Guardrail it enforces |
|---|---|---|
| Frame | Turns a vague idea into a testable hypothesis + a stated delta vs prior work | No experiment until the claim is one sentence |
| Review | Finds the 5–15 papers that matter, builds a comparison matrix | Cite only papers actually read, never from memory |
| Reproduce | Runs the strongest baseline on your setup first | You need a ruler before you measure a gain |
| Design | Sets seeds, fixes splits, runs a full leakage audit | Suspiciously good ≠ breakthrough: prove it's not leakage |
| Run | Scaffolds configs so every number is reproducible | The config is the single source of truth |
| Analyze | Compares vs baseline with mean ± std over ≥3 seeds | A single-seed win is a story, not a finding |
| Write | Backs every claim with a number; ships a reproducibility checklist | Never drop the seed/dataset that hurt the story |

Clone it, then drop it where your agent looks for skills:

git clone https://github.com/