clawmark: open-source CLAUDE.md A/B testing CLI tool

Clawmark, an open-source Rust CLI tool, enables A/B testing of CLAUDE.md files by evaluating two variants against five SWE-bench Lite tasks using Claude and Docker. The tool generates a comparison report to help developers optimize their CLAUDE.md configurations for better performance on software engineering benchmarks.

clawmark is a local Rust CLI for answering one focused question:

Which of these two `CLAUDE.md` files performs better on a small SWE-bench Lite smoke set?

v1 compares exactly two local variant files against five bundled SWE-bench Lite tasks. It runs Claude locally, evaluates the generated patches with the official SWE-bench harness in Docker, and writes a simple A/B report.

- `clawmark doctor` checks local prerequisites.
- `clawmark run` evaluates variant A and variant B on the same five tasks.
- `clawmark report` reads an existing output directory and prints the A/B summary again.

There is no config file, web UI, remote execution, retries, resume, progress UI, repeated trials, or full 300-task SWE-bench run in v1.

Install these yourself before running `clawmark`:

| Dependency | Required version | Notes |
|---|---|---|
| Rust | stable, MSRV 1.79 | Used to build this CLI |
| Claude CLI | >= 1.0.0 | Must be installed, on `PATH`, and authenticated |
| Docker | >= 24.0 | Required by the SWE-bench harness |
| Python | 3.11+ | Required by `swebench` |
| swebench | latest | Install into the `python3` on your `PATH` with `python3 -m pip install --upgrade swebench` |
| git | >= 2.39 | Used to clone task repos and collect diffs |

Check your machine:

    cargo run -- doctor

`doctor` prints a status table and exits non-zero if a required check fails. A missing SWE-bench Docker image is only a warning; the first evaluation may pull it.

Create two variant files somewhere inside the current working directory:

    mkdir -p variants
    $EDITOR variants/a.md
    $EDITOR variants/b.md

Run the A/B smoke benchmark:

    cargo run -- run \
      --a variants/a.md \
      --b variants/b.md \
      --model sonnet \
      --timeout-secs 300 \
      --out out

This performs:

    2 variants x 5 tasks x 1 trial = 10 Claude invocations

`run` creates a fresh output directory. It fails if `--out` already exists, so use a new directory for each run.
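The fresh-directory rule for `--out` (the destination must not already exist, and its parent directory must) lines up with how `std::fs::create_dir` behaves in the Rust standard library. The following is only a minimal sketch of that idea; `create_fresh_out_dir` is a hypothetical name chosen for illustration and is not taken from clawmark's source, which may implement the check differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (an assumption, not clawmark's actual code):
/// create a fresh output directory for one run.
///
/// `fs::create_dir` creates exactly one directory level. It returns an
/// error if `out` already exists or if the parent of `out` is missing,
/// so no separate existence check is needed before creating it.
fn create_fresh_out_dir(out: &Path) -> io::Result<()> {
    fs::create_dir(out)
}
```

Calling the helper twice with the same path succeeds the first time and returns an error the second time, which is the behavior you hit when reusing an `--out` directory. `fs::create_dir_all` would not work for this purpose, because it succeeds silently when the directory already exists.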
Print the report from existing output:

    cargo run -- report --out out

After building a release binary, the same commands can be run as:

    cargo build --release
    ./target/release/clawmark doctor
    ./target/release/clawmark run --a variants/a.md --b variants/b.md --model sonnet --timeout-secs 300 --out out
    ./target/release/clawmark report --out out

v1 is intentionally minimal and does not enforce a turn limit, token budget, retry policy, or per-task cost cap. `--timeout-secs` is only a wall-clock timeout around each Claude invocation. A broad `CLAUDE.md` can spend the full timeout exploring the repo, installing dependencies, or running tests without producing a patch.

For first e2e runs, use strict benchmark-oriented variants:

    You are running inside an automated benchmark.
    Make the smallest code change that addresses the issue.

    Rules:
    - Do not run the full test suite.
    - Only inspect files needed for the issue.
    - If you run tests, run at most one targeted test command.
    - Do not spend time on formatting, docs, or unrelated cleanup.
    - When a plausible minimal patch is made, stop.

Recommended starting settings:

- Use `--timeout-secs 600` for a bounded smoke test.
- Use `--timeout-secs 1800` only when you want to give Claude enough time to solve harder tasks.
- Use a fresh `--out` directory for every attempt.
- Run `cargo run -- doctor` first so failures happen before any Claude calls.
- Watch the first task before walking away; if it reaches the timeout with an empty patch, tighten your variant instructions before running all 10 invocations.

Budget expectation varies heavily by model behavior. The v1 smoke run performs 10 Claude invocations, so open-ended variants can consume materially more time and usage quota than short, patch-focused variants.

For each task and variant, `clawmark`:

- Clones the SWE-bench target repository into a temporary workspace.
- Checks out the task's base commit.
- Writes the selected variant file as `CLAUDE.md` at the repo root.
- Invokes Claude with the task problem statement.
- Captures `git diff HEAD` as the model patch.

Claude is invoked locally with:

    claude -p --output-format json --dangerously-skip-permissions --model <model> --add-dir <workspace> -- <problem statement>

After all predictions are written for a variant, clawmark invokes the SWE-bench harness once for A and once for B. The report treats the harness `resolved_ids` arrays as the source of truth.

    clawmark doctor

Checks Docker, Claude CLI, Claude authentication, Python, `swebench`, the SWE-bench harness CLI, git, Docker Hub registry reachability, and whether the SWE-bench Docker image is already present.

    clawmark run --a <path> --b <path> --model <model> --timeout-secs <seconds> --out <dir>

Runs the fixed five-task A/B benchmark. `--timeout-secs` must be between `1` and `86400`; it applies to each Claude invocation and is also passed to the SWE-bench harness.

    clawmark report --out <dir>

Reads existing harness output, prints resolved counts, A wins, B wins, both-resolved ties, and both-failed ties, then writes `report.json`.

- `--a` and `--b` must exist and be regular files after symlink resolution.
- Both variant paths must be inside the process's current working directory.
- A and B must resolve to different canonical files.
- `--model` must be a non-empty string and is passed as one argument to `claude --model`.
- `run --out` requires an existing parent directory and a destination that does not already exist.
- `report --out` requires an existing v1 output directory with harness results.

Variant filenames do not need to be `CLAUDE.md`; their contents are injected as `CLAUDE.md` inside each temporary task workspace.

    out/
      run_records.jsonl
      predictions/
        a.jsonl
        b.jsonl
      harness/
        a.json
        b.json
      report.json

`run_records.jsonl` stores one record per variant/task attempt. `predictions/a.jsonl` and `predictions/b.jsonl` are the SWE-bench harness inputs. `harness/a.json` and `harness/b.json` are stable copies of the raw SWE-bench summary files.
`report.json` stores the final A/B aggregate report. All clawmark-owned records include `schema_version: 1`.

Most per-task failures are recorded and the run continues with an empty patch for that task:

- git clone or checkout failure
- Claude non-auth failure
- Claude timeout
- model unavailable or rate limit errors
- empty `git diff HEAD`

Claude authentication failures abort the whole run. Harness failures abort before `report.json` is written, but already-written predictions remain in the output directory for inspection.

clawmark is a local developer tool for user-controlled inputs. Claude runs on the host, not in a container. The command uses `--dangerously-skip-permissions`, and variant instructions are not OS-sandboxed. Do not run untrusted `CLAUDE.md` variants, untrusted benchmark data, or untrusted prompts through v1.

SWE-bench test execution runs inside Docker through the official harness. The model-generated patch is evaluated by the harness; clawmark does not execute the patch directly on the host.

clawmark mitigates shell injection by using subprocess argv arrays, variant path traversal by canonicalizing and checking paths against the current working directory, and partial write corruption with atomic file writes. It does not prevent malicious model behavior from accessing host files, environment variables, network resources, or other local credentials available to the process.

`No instances to run.`: The harness filters out empty predictions. This almost always means Claude produced no patch, often from too small a `--timeout-secs`. Use a generous timeout (e.g. `--timeout-secs 600`) and a fresh `--out` directory, then inspect `run_records.jsonl` to confirm patches are non-empty.

`lookup registry-1.docker.io: no such host` / Docker image pull errors: Docker cannot resolve Docker Hub, so the harness cannot pull SWE-bench images. `clawmark doctor` flags this with the "Docker registry reachable" check. Fix your network/VPN and Docker Desktop DNS settings, then retry.
This is an environment issue, not a clawmark bug.

Hugging Face 404 lines during dataset load: These are normal probes the `datasets` library makes for optional files (e.g. `SWE-bench_Lite.py`, `dataset_infos.json`). They are not errors and do not affect the run.

clawmark does not send telemetry or usage data. Claude and SWE-bench may perform their own network activity as part of their normal operation.

Pull requests are welcome. For major changes or new features, open an issue first to discuss the change; see [CONTRIBUTING.md](/emiliolugo/clawmark/blob/main/CONTRIBUTING.md).

Licensed under the [MIT License](/emiliolugo/clawmark/blob/main/LICENSE).