{"slug": "clawmark-open-source-claude-md-a-b-testing-cli-tool", "title": "clawmark: open-source CLAUDE.md A/B Testing CLI tool", "summary": "Clawmark, an open-source Rust CLI tool, enables A/B testing of CLAUDE.md files by evaluating two variants against five SWE-bench Lite tasks using Claude and Docker. The tool generates a comparison report to help developers optimize their CLAUDE.md configurations for better performance on software engineering benchmarks.", "body_md": "`clawmark` is a local Rust CLI for answering one focused question:\n\nWhich of these two `CLAUDE.md` files performs better on a small SWE-bench Lite smoke set?\n\nv1 compares exactly two local variant files against five bundled SWE-bench Lite tasks. It runs Claude locally, evaluates the generated patches with the official SWE-bench harness in Docker, and writes a simple A/B report.\n\n- `clawmark doctor` checks local prerequisites.\n- `clawmark run` evaluates variant A and variant B on the same five tasks.\n- `clawmark report` reads an existing output directory and prints the A/B summary again.\n\nThere is no config file, web UI, remote execution, retries, resume, progress UI, repeated trials, or full 300-task SWE-bench run in v1.\n\nInstall these yourself before running `clawmark`:\n\n| Dependency | Required version | Notes |\n|---|---|---|\n| Rust | stable, MSRV 1.79 | Used to build this CLI |\n| Claude CLI | >= 1.0.0 | Must be installed, on `PATH`, and authenticated |\n| Docker | >= 24.0 | Required by the SWE-bench harness |\n| Python | 3.11+ | Required by `swebench` |\n| swebench | latest | Install into the `python3` on your `PATH` with `python3 -m pip install --upgrade swebench` |\n| git | >= 2.39 | Used to clone task repos and collect diffs |\n\nCheck your machine:\n\n```\ncargo run -- doctor\n```\n\n`doctor` prints a status table and exits non-zero if a required check fails. 
A missing SWE-bench Docker image is only a warning; the first evaluation may pull it.\n\nCreate two variant files somewhere inside the current working directory:\n\n```\nmkdir -p variants\n$EDITOR variants/a.md\n$EDITOR variants/b.md\n```\n\nRun the A/B smoke benchmark:\n\n```\ncargo run -- run \\\n  --a variants/a.md \\\n  --b variants/b.md \\\n  --model sonnet \\\n  --timeout-secs 300 \\\n  --out out\n```\n\nThis performs:\n\n```\n2 variants x 5 tasks x 1 trial = 10 Claude invocations\n```\n\n`run` creates a fresh output directory. It fails if `--out` already exists, so use a new directory for each run.\n\nPrint the report from existing output:\n\n```\ncargo run -- report --out out\n```\n\nAfter building a release binary, the same commands can be run as:\n\n```\ncargo build --release\n./target/release/clawmark doctor\n./target/release/clawmark run --a variants/a.md --b variants/b.md --model sonnet --timeout-secs 300 --out out\n./target/release/clawmark report --out out\n```\n\nv1 is intentionally minimal and does not enforce a turn limit, token budget, retry policy, or per-task cost cap. `--timeout-secs` is only a wall-clock timeout around each Claude invocation. With a broad `CLAUDE.md`, Claude can spend the full timeout exploring the repo, installing dependencies, or running tests without producing a patch.\n\nFor your first end-to-end runs, use strict benchmark-oriented variants:\n\n```\nYou are running inside an automated benchmark. Make the smallest code change that addresses the issue.\n\nRules:\n- Do not run the full test suite.\n- Only inspect files needed for the issue.\n- If you run tests, run at most one targeted test command.\n- Do not spend time on formatting, docs, or unrelated cleanup.\n- When a plausible minimal patch is made, stop.\n```\n\nRecommended starting settings:\n\n- Use `--timeout-secs 600` for a bounded smoke test.\n- Use `--timeout-secs 1800` only when you want to give Claude enough time to solve harder tasks.\n
- Use a fresh `--out` directory for every attempt.\n- Run `cargo run -- doctor` first so failures happen before any Claude calls.\n- Watch the first task before walking away; if it reaches the timeout with an empty patch, tighten your variant instructions before running all 10 invocations.\n\nTime and usage costs vary heavily with model behavior. The v1 smoke run performs 10 Claude invocations, so open-ended variants can consume materially more time and usage quota than short, patch-focused variants.\n\nFor each task and variant, `clawmark`:\n\n- Clones the SWE-bench target repository into a temporary workspace.\n- Checks out the task's base commit.\n- Writes the selected variant file as `CLAUDE.md` at the repo root.\n- Invokes Claude with the task problem statement.\n- Captures `git diff HEAD` as the model patch.\n\nClaude is invoked locally with:\n\n```\nclaude -p --output-format json --dangerously-skip-permissions --model <model> --add-dir <workspace> -- <problem_statement>\n```\n\nAfter all predictions for a variant are written, `clawmark` invokes the SWE-bench harness for that variant, so the harness runs once for A and once for B. The report treats the harness `resolved_ids` arrays as the source of truth.\n\n```\nclawmark doctor\n```\n\nChecks Docker, Claude CLI, Claude authentication, Python, `swebench`, the SWE-bench harness CLI, git, Docker Hub registry reachability, and whether the SWE-bench Docker image is already present.\n\n```\nclawmark run --a <path> --b <path> --model <model> --timeout-secs <seconds> --out <dir>\n```\n\nRuns the fixed five-task A/B benchmark. 
`--timeout-secs` must be between `1` and `86400`; it applies to each Claude invocation and is also passed to the SWE-bench harness.\n\n```\nclawmark report --out <dir>\n```\n\nReads existing harness output, prints resolved counts, A wins, B wins, both-resolved ties, and both-failed ties, then writes `report.json`.\n\n- `--a` and `--b` must exist and be regular files after symlink resolution.\n- Both variant paths must be inside the process's current working directory.\n- A and B must resolve to different canonical files.\n- `--model` must be a non-empty string and is passed as one argument to `claude --model`.\n- `run --out` requires an existing parent directory and a destination that does not already exist.\n- `report --out` requires an existing v1 output directory with harness results.\n\nVariant filenames do not need to be `CLAUDE.md`; their contents are injected as `CLAUDE.md` inside each temporary task workspace.\n\n```\nout/\n  run_records.jsonl\n  predictions/\n    a.jsonl\n    b.jsonl\n  harness/\n    a.json\n    b.json\n  report.json\n```\n\n`run_records.jsonl` stores one record per variant/task attempt. `predictions/a.jsonl` and `predictions/b.jsonl` are the SWE-bench harness inputs. `harness/a.json` and `harness/b.json` are stable copies of the raw SWE-bench summary files. `report.json` stores the final A/B aggregate report.\n\nAll clawmark-owned records include `schema_version: 1`.\n\nMost per-task failures are recorded, and the run continues with an empty patch for that task:\n\n- git clone or checkout failure\n- Claude non-auth failure\n- Claude timeout\n- model unavailable or rate limit errors\n- empty `git diff HEAD`\n\nClaude authentication failures abort the whole run. 
Harness failures abort the run before `report.json` is written, but already-written predictions remain in the output directory for inspection.\n\n`clawmark` is a local developer tool for user-controlled inputs.\n\nClaude runs on the host, not in a container. The command uses `--dangerously-skip-permissions`, and variant instructions are not OS-sandboxed. Do not run untrusted `CLAUDE.md` variants, untrusted benchmark data, or untrusted prompts through v1.\n\nSWE-bench test execution runs inside Docker through the official harness. The model-generated patch is evaluated by the harness; `clawmark` does not execute the patch directly on the host.\n\n`clawmark` mitigates shell injection by passing subprocess arguments as argv arrays, variant path traversal by canonicalizing paths and checking them against the current working directory, and partial-write corruption by writing files atomically. It does not prevent malicious model behavior from accessing host files, environment variables, network resources, or other local credentials available to the process.\n\n- **`No instances to run.`**: The harness filters out empty predictions. This almost always means Claude produced no patch (often from too small a `--timeout-secs`). Use a generous timeout (e.g. `--timeout-secs 600`) and a fresh `--out` directory, then inspect `run_records.jsonl` to confirm patches are non-empty.\n- **`lookup registry-1.docker.io: no such host` / Docker image pull errors**: Docker cannot resolve Docker Hub, so the harness cannot pull SWE-bench images. `clawmark doctor` flags this with the \"Docker registry reachable\" check. Fix your network/VPN and Docker Desktop DNS settings, then retry. This is an environment issue, not a clawmark bug.\n- **Hugging Face `404` lines during dataset load**: These are normal probes the `datasets` library makes for optional files (e.g. `SWE-bench_Lite.py`, `dataset_infos.json`). 
They are not errors and do not affect the run.\n\n`clawmark`\n\ndoes not send telemetry or usage data. Claude and SWE-bench may perform their own network activity as part of their normal operation.\n\nPull requests are welcome. For major changes or new features, open an issue\nfirst to discuss the change — see [CONTRIBUTING.md](/emiliolugo/clawmark/blob/main/CONTRIBUTING.md).\n\nLicensed under the [MIT License](/emiliolugo/clawmark/blob/main/LICENSE).", "url": "https://wpnews.pro/news/clawmark-open-source-claude-md-a-b-testing-cli-tool", "canonical_source": "https://github.com/emiliolugo/clawmark", "published_at": "2026-06-18 14:00:00+00:00", "updated_at": "2026-06-18 14:23:14.239160+00:00", "lang": "en", "topics": ["developer-tools", "ai-tools", "large-language-models"], "entities": ["Clawmark", "Claude", "SWE-bench", "Docker", "Rust"], "alternates": {"html": "https://wpnews.pro/news/clawmark-open-source-claude-md-a-b-testing-cli-tool", "markdown": "https://wpnews.pro/news/clawmark-open-source-claude-md-a-b-testing-cli-tool.md", "text": "https://wpnews.pro/news/clawmark-open-source-claude-md-a-b-testing-cli-tool.txt", "jsonld": "https://wpnews.pro/news/clawmark-open-source-claude-md-a-b-testing-cli-tool.jsonld"}}