Show HN: Memory layer for Claude Code (+10.2 pts on SWE-bench Verified benchmark)

A developer released World Model MCP, a memory layer for AI coding agents that uses a temporal knowledge graph to prevent repeated mistakes, achieving a +10.2 point improvement on the SWE-bench Verified benchmark across 49 instances. The tool validates code changes against learned constraints, re-injects context after compaction, and resolves contradictions, supporting Claude Code, Cursor, and other MCP-aware agents.

Enforcement, provenance, and harness-neutral memory for AI coding agents. A temporal knowledge graph that validates code changes against learned constraints at the edit boundary, re-injects relevant context after compaction, tracks contradictions with confidence-weighted resolution, and runs across Claude Code, Cursor, and pi.

Status: v0.9.1 -- 26 MCP tools, 19 CLI subcommands, 375 tests, a SWE-bench Verified repeat-mistake benchmark with a +10.2 pts paired delta across 49 instances (+15.0 pts within-domain, +6.9 pts cross-domain), and a 105-pair contradiction-resolution benchmark.

v0.9 ships the empirical wedge proof: a locked, pre-registered methodology tested whether the persistent-knowledge layer measurably reduces repeated coding-agent mistakes on a public task corpus. The result shows positive within-domain and cross-domain effects with zero observed regressions on out-of-domain tasks. Full per-task tables, a mechanistic analysis of the two cross-domain flips (sphinx-9461 is the cleanest case), and honest limitations are in `benchmarks/repeat-mistake/RESULTS.md`.

v0.8.1 expanded the contradiction-resolution benchmark to 105 pairs across 19 categories. v0.8.0 added domain-aware confidence decay with per-evidence-type TTL, per-item provenance fields (`source_tool` and `confirmer`), slash command write operations, and a `confirmer` parameter on `resolve_contradiction`. The Antigravity adapter is held for the fourth consecutive release pending a `TransformCompactionHook` in the SDK; next re-verify 2026-07-24.
v0.7.6 added the /world-model slash command and status-watch TUI widget. v0.7.5 added the Codex CLI adapter. v0.7.0 introduced PostCompact auto-injection, the defer enforcement tier, confidence-weighted contradiction resolution, and a compaction audit log. Contributions welcome.

mcp-name: io.github.SaravananJaichandar/world-model-mcp

If world-model-mcp helped you, star the repo or open an issue with what worked or didn't. I read every one and the feedback shapes what ships next.

World Model MCP creates a temporal knowledge graph of your codebase that learns from every coding session to:

- Prevent Hallucinations -- Validates API/function references against known entities before use
- Stop Repeated Mistakes -- Learns constraints from corrections, applies them in future sessions
- Reduce Regressions -- Tracks bug fixes and warns when changes touch critical regions
- Survive Compaction -- Re-injects top constraints and recent facts after the agent's context window resets
- Resolve Contradictions -- Picks a winner between conflicting facts using confidence, recency, or source count

Think of it as a long-term memory layer that runs alongside Claude Code, Cursor, or any MCP-aware coding agent.

- Repeat-mistake benchmark on SWE-bench Verified -- the central wedge proof. 50 SWE-bench Verified tasks across django, sympy, matplotlib, scikit-learn, and sphinx (49 after one instance was dropped), run as a paired baseline-vs-treatment comparison. The methodology was locked at `benchmarks/repeat-mistake/DESIGN.md` on 2026-06-17, before the data existed, so the result cannot be accused of goalpost-moving.
- Headline results -- Subset 1 (within-domain, django + sympy): baseline 15/20 = 75.0 percent, treatment 18/20 = 90.0 percent, delta +15.0 pts with 4 FAIL-to-PASS flips and 1 regression. Subset 2 (cross-domain, matplotlib + scikit-learn + sphinx): baseline 18/29 = 62.1 percent, treatment 20/29 = 69.0 percent, delta +6.9 pts with 2 flips and zero regressions. Combined paired result across 49 instances: 33/49 to 38/49, delta +10.2 pts.
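The headline deltas above follow directly from the reported pass counts. The snippet below is a minimal sketch that recomputes them; the pass counts come from the results as stated, while the names `SUBSETS`, `pass_rate`, `delta_pts`, and `combine` are hypothetical helpers for illustration and are not part of the project's benchmark harness.

```python
# Pass counts (passed, total) per arm, copied from the headline results.
SUBSETS = {
    "within-domain": {"baseline": (15, 20), "treatment": (18, 20)},
    "cross-domain": {"baseline": (18, 29), "treatment": (20, 29)},
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


def delta_pts(baseline: tuple, treatment: tuple) -> float:
    """Treatment pass rate minus baseline pass rate, in percentage points."""
    return pass_rate(*treatment) - pass_rate(*baseline)


def combine(arm: str) -> tuple:
    """Pool one arm's (passed, total) counts across both subsets."""
    passed = sum(subset[arm][0] for subset in SUBSETS.values())
    total = sum(subset[arm][1] for subset in SUBSETS.values())
    return passed, total


if __name__ == "__main__":
    for name, arms in SUBSETS.items():
        delta = delta_pts(arms["baseline"], arms["treatment"])
        print(f"{name}: {delta:+.1f} pts")
    pooled = delta_pts(combine("baseline"), combine("treatment"))
    print(f"combined: {pooled:+.1f} pts")
```

Note that the combined figure is computed from the pooled counts (33/49 and 38/49), not by averaging the two subset deltas, which is why it lands between them and is weighted toward the larger cross-domain subset.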
- Cross-domain transfer isolated cleanly -- the Subset 2 treatment arm loaded ONLY the 4 Subset 1 constraints (django and sympy directives), holding out the 11 Subset 2 constraints to test whether learning from one repo family generalizes to a different one. The result was two cross-domain flips with plausible mechanistic explanations grounded in the loaded constraints. Sphinx-9461 is the strongest case: a sympy classmethod constraint transferred to a sphinx classmethod-wrapper unwrapping bug.
- Honest caveats embedded in RESULTS.md -- seven explicit limitations, including single-trial design, constraint-failure overlap on Subset 1, the small cross-domain transfer rate, one dropped instance due to an upstream SWE-bench pip flag issue, and judge-model self-reference risk. Stated verbatim rather than hidden in an appendix.
- Full reproducibility artifacts -- every progress JSONL, predictions JSON, results JSONL, classification JSONL, constraints JSON, and harness report JSON is committed in `benchmarks/repeat-mistake/`. Locked judge prompts are in `failure_classifier.py` and `learning_hook.py`. Total agent cost across both arms was approximately 90 USD on a Claude Code subscription.
- Contradiction-resolution benchmark expansion -- the v0.7.4 24-pair benchmark grew to 105 hand-curated pairs across 19 categories. Six new categories exercise the v0.8.0 schema specifically: `source_tool_corroboration`, `confirmer_overrides_pending`, `decay_advantage_session_vs_source`, `decay_advantage_stale_session`, `evidence_type_user_correction`, `settled_beats_higher_confidence`. Deterministic runner at `benchmarks/contradictions-200/run.py`; full per-strategy + per-category breakdown at `benchmarks/contradictions-200/RESULTS.md`.
- Honest framing on the numbers: the new dataset is harder than v0.7.4's 24-pair set because the new categories deliberately test schema awareness (confirmer, evidence type, decay) rather than raw confidence ranking.
Headline numbers: `keep_most_sources` 99.0%, `keep_higher_confidence` 81.0%, `auto` 77.1%, `keep_higher_confidence_decayed` 90.5% (on the 21 pairs where `evidence_type` is present), overall 78.2% across all strategies. The original 24-pair v0.7.4 number of 93.5% is preserved unchanged at `benchmarks/contradictions/` and is not invalidated; it tested a different (smaller, easier) corpus.
- The wedge benchmark is v0.9: "does the learning loop measurably reduce repeated coding-agent mistakes on a public task corpus?" The contradiction-resolution work in this release is internal schema-correctness validation. The empirical artifact that maps to the published essay framing (the learning loop is the durable layer) lands in v0.9 with a SWE-bench-style repeat-mistake benchmark.
- Domain-aware confidence decay -- new `world_model_server/decay.py` module with exponential half-life decay per `evidence_type`. Half-lives: `source_code` 365d, `test` 180d, `session` 14d, `user_correction` 730d, `bug_fix` 365d. Decay applies on read (no background task), so the next `query_fact` call returns the time-corrected confidence. Settled facts (canonical status, or any fact with a non-NULL `confirmer`) never auto-transition. Synthesized facts that decay below 0.2 confidence and corroborated facts that decay below 0.1 confidence auto-supersede on read, surfacing rot to the next compaction injection.
- Per-item provenance fields on facts -- three additive columns (`source_tool TEXT`, `confirmer TEXT`, `last_decay_at TIMESTAMP`), all NULL-defaulted, no backfill. `source_tool` records which tool wrote the fact (e.g. `claude_code`, `codex`, `cursor`, `pi`, `user`). `confirmer` records who confirmed it, distinct from the asserter; NULL means pending, non-NULL means settled. Both are exposed on the Fact model and propagated through `create_fact`.
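The decay rule in the confidence-decay notes (exponential half-life per evidence type, applied on read, with supersede thresholds of 0.2 and 0.1) can be illustrated with a short sketch. The half-lives and thresholds are taken from the notes; the function names, signatures, and the `settled` flag are assumptions made for illustration and may not match what `decay.py` actually contains.

```python
from datetime import datetime, timedelta, timezone

# Half-lives in days per evidence type, as listed in the release notes.
HALF_LIFE_DAYS = {
    "source_code": 365,
    "test": 180,
    "session": 14,
    "user_correction": 730,
    "bug_fix": 365,
}


def decayed_confidence(confidence: float, evidence_type: str,
                       recorded_at: datetime, now: datetime) -> float:
    """Halve the confidence once per elapsed half-life (hypothetical helper)."""
    age_days = max((now - recorded_at).total_seconds() / 86400.0, 0.0)
    return confidence * 0.5 ** (age_days / HALF_LIFE_DAYS[evidence_type])


def should_supersede(decayed: float, corroborated: bool, settled: bool) -> bool:
    """Apply the read-time thresholds; settled facts never auto-transition."""
    if settled:
        return False
    threshold = 0.1 if corroborated else 0.2  # synthesized facts use 0.2
    return decayed < threshold


if __name__ == "__main__":
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    for days in (14, 42):
        value = decayed_confidence(0.8, "session", now - timedelta(days=days), now)
        print(days, value, should_supersede(value, corroborated=False, settled=False))
```

Computing the value at read time from the stored confidence and timestamp keeps the stored data untouched, which is consistent with the "no background task" design described in the notes.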
These provenance fields honor the public commitment to Patdolitse (anthropics/claude-code #47023, https://github.com/anthropics/claude-code/issues/47023) and ferhimedamine (openai/codex #19195, https://github.com/openai/codex/issues/19195).
- Slash command write operations -- two new subcommands. /world-model resolve