clawmark: open-source CLAUDE.md A/B testing CLI tool

Clawmark, an open-source Rust CLI tool, enables A/B testing of CLAUDE.md files by evaluating two variants against five SWE-bench Lite tasks using Claude and Docker. The tool generates a comparison report to help developers optimize their CLAUDE.md configurations for better performance on software engineering benchmarks.

`clawmark` is a local Rust CLI for answering one focused question: which of these two CLAUDE.md files performs better on a small SWE-bench Lite smoke set?

v1 compares exactly two local variant files against five bundled SWE-bench Lite tasks. It runs Claude locally, evaluates the generated patches with the official SWE-bench harness in Docker, and writes a simple A/B report.

- `clawmark doctor` checks local prerequisites.
- `clawmark run` evaluates variant A and variant B on the same five tasks.
- `clawmark report` reads an existing output directory and prints the A/B summary again.

There is no config file, web UI, remote execution, retries, resume, progress UI, repeated trials, or full 300-task SWE-bench run in v1.

Install these yourself before running `clawmark`:

| Dependency | Required version | Notes |
|---|---|---|
| Rust | stable, MSRV 1.79 | Used to build this CLI |
| Claude CLI | >= 1.0.0 | Must be installed, on `PATH`, and authenticated |
| Docker | >= 24.0 | Required by the SWE-bench harness |
| Python | 3.11+ | Required by `swebench` |
| `swebench` | latest | Install into the `python3` on your `PATH` with `python3 -m pip install --upgrade swebench` |
| git | >= 2.39 | Used to clone task repos and collect diffs |

Check your machine:

    cargo run -- doctor

`doctor` prints a status table and exits non-zero if a required check fails. A missing SWE-bench Docker image is only a warning; the first evaluation may pull it.
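To make the minimum-version requirements in the table concrete, here is a minimal sketch of a dotted-version comparison. It is not taken from the `clawmark` source: the function names `parse_version` and `meets_minimum` and the sample version strings are hypothetical, and the real `doctor` command may check versions differently.

```rust
/// Parse a dotted version such as "24.0.7" into numeric components.
/// Returns None if any component is not a plain non-negative integer.
/// (Hypothetical helper, not part of clawmark.)
fn parse_version(text: &str) -> Option<Vec<u64>> {
    text.trim()
        .split('.')
        .map(|part| part.parse::<u64>().ok())
        .collect()
}

/// Some(true) when `found` is at least `minimum`, comparing component by
/// component. Missing trailing components count as zero, so "24" equals "24.0".
/// Returns None when either string cannot be parsed.
fn meets_minimum(found: &str, minimum: &str) -> Option<bool> {
    let found = parse_version(found)?;
    let minimum = parse_version(minimum)?;
    let len = found.len().max(minimum.len());
    for i in 0..len {
        let f = found.get(i).copied().unwrap_or(0);
        let m = minimum.get(i).copied().unwrap_or(0);
        if f != m {
            return Some(f > m);
        }
    }
    Some(true)
}

fn main() {
    // Sample inputs are made up for illustration; the minimums come from the table.
    for (name, found, minimum) in [("docker", "24.0.7", "24.0"), ("git", "2.34.1", "2.39")] {
        match meets_minimum(found, minimum) {
            Some(true) => println!("{name}: ok ({found} >= {minimum})"),
            Some(false) => println!("{name}: FAIL ({found} < {minimum})"),
            None => println!("{name}: could not parse version"),
        }
    }
}
```

Comparing numeric components rather than raw strings matters here: a plain string comparison would rank `2.4` above `2.39`.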
Create two variant files somewhere inside the current working directory:

    mkdir -p variants
    $EDITOR variants/a.md
    $EDITOR variants/b.md

Run the A/B smoke benchmark:

    cargo run -- run \
      --a variants/a.md \
      --b variants/b.md \
      --model sonnet \
      --timeout-secs 300 \
      --out out

This performs:

    2 variants x 5 tasks x 1 trial = 10 Claude invocations

`run` creates a fresh output directory. It fails if `--out` already exists, so use a new directory for each run.

Print the report from existing output:

    cargo run -- report --out out

After building a release binary, the same commands can be run as:

    cargo build --release
    ./target/release/clawmark doctor
    ./target/release/clawmark run --a variants/a.md --b variants/b.md --model sonnet --timeout-secs 300 --out out
    ./target/release/clawmark report --out out

v1 is intentionally minimal and does not enforce a turn limit, token budget, retry policy, or per-task cost cap. `--timeout-secs` is only a wall-clock timeout around each Claude invocation. A broad CLAUDE.md can spend the full timeout exploring the repo, installing dependencies, or running tests without producing a patch.

For first e2e runs, use strict benchmark-oriented variants:

    You are running inside an automated benchmark.
    Make the smallest code change that addresses the issue.

    Rules:
    - Do not run the full test suite.
    - Only inspect files needed for the issue.
    - If you run tests, run at most one targeted test command.
    - Do not spend time on formatting, docs, or unrelated cleanup.
    - When a plausible minimal patch is made, stop.

Recommended starting settings:

- Use `--timeout-secs 600` for a bounded smoke test.
- Use `--timeout-secs 1800` only when you want to give Claude enough time to solve harder tasks.
- Use a fresh `--out` directory for every attempt.
- Run `cargo run -- doctor` first so failures happen before any Claude calls.
- Watch the first task before walking away; if it reaches the timeout with an empty patch, tighten your variant instructions before running all 10 invocations.
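Because `--timeout-secs` bounds each invocation separately, the worst-case time spent inside Claude scales with the number of invocations. The sketch below works through that arithmetic for the timeout values mentioned above. It assumes the 10 invocations run one at a time, which this README does not state, and it ignores the time taken by the SWE-bench harness in Docker; the function names are hypothetical.

```rust
/// Number of Claude invocations in one run.
fn invocations(variants: u64, tasks: u64, trials: u64) -> u64 {
    variants * tasks * trials
}

/// Upper bound, in seconds, on time spent inside Claude.
/// Assumption: invocations run sequentially and every one hits the timeout.
fn worst_case_claude_secs(invocations: u64, timeout_secs: u64) -> u64 {
    invocations * timeout_secs
}

fn main() {
    // v1 shape: 2 variants x 5 tasks x 1 trial.
    let n = invocations(2, 5, 1);
    for timeout_secs in [300, 600, 1800] {
        let total = worst_case_claude_secs(n, timeout_secs);
        println!(
            "--timeout-secs {timeout_secs}: up to {total} s (~{} min) across {n} invocations",
            total / 60
        );
    }
}
```

Under these assumptions, a 600-second timeout caps Claude time at 100 minutes, while 1800 seconds raises the cap to 5 hours.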
Expected budget varies heavily with model behavior. The v1 smoke run performs 10 Claude invocations, so open-ended variants can consume materially more time and usage quota than short, patch-focused variants.

For each task and variant, `clawmark`:

- Clones the SWE-bench target repository into a temporary workspace.
- Checks out the task's base commit.
- Writes the selected variant file as `CLAUDE.md` at the repo root.
- Invokes Claude with the task problem statement.
- Captures `git diff HEAD` as the model patch.

Claude is invoked locally with: claude -p --output-format json --dangerously-skip-permissions --model