Donate coding sessions to train open models

Developer launches Trace Commons, a community effort to collect coding-agent sessions from tools like Claude Code and Codex under a Creative Commons Attribution 4.0 license, aiming to create an open dataset for training AI coding models. The initiative addresses the lack of open trajectory data, which Meta's CWM model showed is critical for multi-step software engineering tasks. Trace Commons argues that open-weight models need transparent training data to achieve true open-source AI.

A developer is launching Trace Commons today, a community-led effort to collect coding-agent sessions, meaning the prompts, file edits and tool calls from Claude Code, Codex and similar tools, and release everything under a Creative Commons Attribution 4.0 licence. These tools are AI coding assistants that read your repo and edit files in response to natural-language prompts.

The pitch is blunt: Anthropic and OpenAI are sitting on the largest corpus of high-quality training data for AI coding agents ever assembled, simply because their own products generate it. Every time a developer asks Claude Code to refactor a TypeScript file, or Codex to chase a failing test, that interaction is a fresh labelled training example, and it never leaves the vendor’s pipeline.

If an open-weight model is to plan, use tools and debug across many turns, somebody has to collect a comparable corpus in the open. Trace Commons is asking developers to donate anonymised session logs on a “you keep the copyright, the world keeps the data” basis.

Why this data moves the needle

The most concrete evidence that this kind of data is worth collecting comes from Meta’s CWM, an open-weights research model released in September 2025 to study code generation with world models. The team’s central move was to add a new training stage partway through the model’s development, in which it ingested a large number of observation-action trajectories harvested from a Python interpreter and from Docker-based agent environments. In plain terms: they taught the model to simulate the consequences of running code, not just to read it.

The result, on Meta’s evidence, is a model that punches above its weight for its size on real, multi-step software-engineering work. Full numbers are in the box below. If that result holds, the binding constraint on open coding models looks more like trajectory data than parameter count.

Why open weights need more than weights

Beneath the data-grab story sits a longer-running point.
The Open Source Initiative published a clear-eyed explainer arguing that “open weights”, the term vendors use, falls well short of open-source AI. Weights are the final, frozen parameters of a trained network; they tell you nothing about the training data, the cleaning pipeline or the intermediate checkpoints used to build the model. OSI’s comparison is stark: open weights release the model’s state; open-source AI releases the process, meaning the training code, the dataset where legally possible, and the data composition details. As the explainer puts it: “Open Weights might seem revolutionary at first glance, but they’re merely a starting point. While they do move the needle closer to transparency than strictly closed, proprietary models, they lack the detailed insights found in Open Source AI.”

If Trace Commons succeeds, it shifts that balance. An open-weights coding model trained on a CC-BY-4.0 trajectory dataset is, for the first time, something a third party can audit, replicate and improve, not just fine-tune. That is the difference between transparency as marketing and transparency as engineering.

How to contribute this afternoon

You don’t need to be a researcher. If you use Claude Code, Codex, Cursor’s agent mode or any coding agent that exports a session log, you can donate a day’s work in an hour. A workflow that works:

1. Run a couple of real sessions this morning: bug fixes, refactors, the work you’d do anyway. Use a coding agent that lets you export a transcript.
2. Scrub before you export. Strip secrets, internal hostnames, customer names and API keys. The licence is CC-BY-4.0, so anyone, including your competitors, can read what you submit.
3. Convert to JSONL, a plain-text format where each line is one session record. Use one session per line, with a schema roughly: {"task": ..., "messages": ..., "tools used": ..., "outcome": "pass|fail"}. The repo README pins the exact schema.
4. Submit via the contribution flow, typically a pull request or a CLI upload. You retain copyright; the dataset just gets a permissive licence.
5. Nudge one colleague to do the same. A dataset is only as useful as its diversity, and one contributor in a team of ten is a sample size of one.

The whole exercise is an afternoon’s work, and the marginal value compounds: a large corpus of donated sessions is materially more useful to a small open-weights team than a thin one. The CWM paper shows what’s possible when the trajectories exist. The OSI explainer shows why the open part matters as much as the weights. Trace Commons is the part in between: a practical way to make sure the next generation of coding models isn’t trained on data nobody else is allowed to see.
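
The scrub and convert steps in the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not Trace Commons tooling: the redaction patterns, the function names and the {"role", "content"} message shape are all assumptions, and the key names simply mirror the rough schema quoted in the article. The repo README remains the authority on the exact schema, and a regex pass is no substitute for reading the transcript yourself before you submit it.

```python
import json
import re

# Hypothetical redaction patterns; extend them for your own environment.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),   # API-key-like tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),        # AWS access key IDs
    re.compile(r"\b[\w.-]+\.internal\b"),   # internal hostnames (assumed suffix)
]


def scrub(text):
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def to_record(task, messages, tools_used, outcome):
    """Build one session record using the rough schema quoted in the article."""
    if outcome not in ("pass", "fail"):
        raise ValueError("outcome must be 'pass' or 'fail'")
    return {
        "task": scrub(task),
        # Assumed message shape: a list of {"role": ..., "content": ...} dicts.
        "messages": [
            {"role": m["role"], "content": scrub(m["content"])} for m in messages
        ],
        "tools used": list(tools_used),  # key name as quoted; check the README
        "outcome": outcome,
    }


def to_jsonl_line(record):
    """Serialise a record to a single line; json.dumps escapes embedded newlines."""
    return json.dumps(record)


def write_jsonl(records, path):
    """Write one session record per line, as JSONL requires."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(to_jsonl_line(record) + "\n")
```

A typical use would be to call to_record once per exported session and pass the resulting list to write_jsonl. Scrubbing inside to_record, rather than as a separate pass over the finished file, means a record never reaches disk in unredacted form.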