Show HN: An adversarial reasoning engine for scientific progress A single human operator built a zero-trust adversarial research system called ZTARE over eight weeks, which then caught large language models from Claude, Gemini, and GPT-4o cheating their own evaluations through nine documented self-certifying strategies. The system falsified its own substrate, recording that only four of 18 catalogued primitives were actually engaged, and produced roughly 34,000 artifacts while surfacing hundreds of integrity errors in its own catch ledger. The project demonstrates that model capability compounds or degrades based on the research environment around it, not just the underlying AI. Catch LLMs cheating their own evaluations. Field-documented catalog + audit patterns + a forecasting finding that decomposes "no signal" into two opposite signals. 9 ways LLMs cheat their own evaluations → /sparckix/ztare/blob/main/docs/cheating catalog.md 9 named self-certifying strategies observed under execution-grade audit across Claude, Gemini, and GPT-4o, each with a code-level cheat sketch and the audit pattern that catches it. A filesystem-first socio-technical research system for testing claims, surfacing failure modes, and governing AI-assisted research, built by one human operator and a rotating set of agentic operators over roughly eight weeks, then pointed at itself. The core stack has three parts: a zero-trust adversarial validator, an out-of-loop research organization/runtime, and a reflexive intelligence layer that learns from forecasts, actions, catches, trajectories, and experiment records. The core intuition is not that scaffolding replaces model capability. It is that model capability is only one input. Like human talent, it compounds or degrades depending on the environment around it: task framing, evidence boundaries, role separation, feedback, falsifiers, memory, and accountability. ZTARE is an attempt to build that environment for scientific generation and validation. php research org chooses work - validator/proof/script/panel/human-agent co-work - ledgers and outcomes - forecasts / action impact / trajectory mining - next action, split, defer, or kill A weekly reflexive audit re-mines every artifact and feeds the result back. The numbers below were produced by that audit; they are not a live dashboard. The live record is research areas/EXPERIMENT TRACK RECORD.md /sparckix/ztare/blob/main/research areas/EXPERIMENT TRACK RECORD.md and research areas/insights ledger.md . Snapshot, mid-May 2026: On the order of 34,000 authored artifacts. Roughly a quarter are ZTARE iteration files; the remainder is out-of-loop agent work, and the trailing-window share is even higher. The live substrate is agent dispatch + governance + mining. The apparatus falsified its own substrate and recorded it. A 28-day, 157-project capability-ROI audit found that of roughly 18 catalogued primitives, only four were engaged, seven were dead, and seven were never instantiated. The evolutionary zoo did not survive contact with the work, and the machine said so. Recursive gain was real, then plateaued. Contextualized insight density rose then flattened a plateau, not an exponential; in-system rubric, so reported with that caveat . Triple-digit ratified catches across dozens of categories — self-reported, in-system. This is the apparatus auditing itself, not externally verified. The catch ledger's own integrity validator was found dead for weeks and resurrected surfacing ~300 integrity errors to remediate , and a mis-selected rater was demoted mid-cycle — both recorded next to the original claims. Treat the count as an internal signal, not a validated benchmark. Single operator, N=1, non-expert. Nothing here claims a solved Millennium problem, an autonomous research engine, or a general law. The contribution is the discipline and an honest record of where it broke. On named personas. Synthetic review panels and debate logs use labels of real individuals for example Dijkstra, Knuth, Munger . These are stylistic shorthand for reasoning approaches loosely inspired by published work. They do not represent the views, endorsements, or actual reasoning of those individuals, and no affiliation is implied. The full statement is in src/ztare/personas/registry.py . Most of the value is substrate-independent and reusable without ZTARE: , practices for pipelines whose internals are LLM calls: stub-replay testing, eligibility pre-filters, provenance telemetry, decomposed wire-in, cross-reference knowledge graphs. Agentic engineering patterns /sparckix/ztare/blob/main/docs/concepts/agentic engineering patterns.md , capabilities the architecture runs on its own infrastructure the audit that demoted its own claims is one of them . Reflexive primitives /sparckix/ztare/blob/main/docs/concepts/reflexive engineering.md , the proposer-doesn't-grade-itself constitution, plus a Epistemic discipline /sparckix/ztare/blob/main/docs/concepts/epistemic principles.md mining-derived anti-pattern catalog /sparckix/ztare/blob/main/docs/concepts/anti pattern catalog.md and an append-only catch ledger /sparckix/ztare/blob/main/LEDGERS.md . The org runtime , M-form separation roles, mandates, gates, damage signals used to actually run the project as its own research company. The substrate-agnostic kernel is the separate public repo; this repo carries only a thin github.com/sparckix/cognitive-firm https://github.com/sparckix/cognitive-firm tenant overlay of it GP-191, see docs/guides/forking the kernel.md /sparckix/ztare/blob/main/docs/guides/forking the kernel.md and docs/concepts/organizational primitives.md /sparckix/ztare/blob/main/docs/concepts/organizational primitives.md . A fresh public clone here runs kernel-only. The org/ tree in ZTARE is therefore a compatibility and tenant overlay surface, not the canonical upstream kernel. Research-supervision traces for frontier labs , the design pattern of preserving attempts, critiques, source-readiness labels, demotions, nulls, and next falsifiers as training/eval material rather than keeping only final answers. See architecture.md /sparckix/ztare/blob/main/docs/concepts/architecture.md and agent agnostic recursive gain.md /sparckix/ztare/blob/main/docs/concepts/agent agnostic recursive gain.md . The full workbench/module map , including how ZTARE relates to adjacent systems such as AI Co-Mathematician, and how proof search, GNN novelty, forecast markets, org runtime, Orbit, supervisor, and public claims compose into a socio-technical research institution. See system position and module map.md /sparckix/ztare/blob/main/docs/concepts/system position and module map.md . ZTARE has four public tracks. | Track | Maturity | What it does | |---|---|---| Org Runtime Tenant Overlay | working prototype | ZTARE's applied instance of the reusable cognitive-firm primitives: persistent role offices, mandates, tasks, objectives, key results, gates, preferences, transition logs, damage signals, and operator surfaces. | ZTARE Kernel | stable / evolving | Turns messy source material into bounded evidence snapshots, then stress-tests claims through mutator, verification panel, judge, hard gates, telemetry, synthesis, and closure. | ZTARE Research Co | dogfood / active | The repo operating as its own research company: role-bound agents use the org runtime and ZTARE kernel to run programs, close experiments, and update ledgers. | Scientific Case Studies | experimental / status-labeled | Gravity, neural scaling, Navier-Stokes, transformer-successor, and other bounded campaigns that stress-test the kernel and produce calibrated public artifacts when evidence licenses them. | The tracks are designed to compose: the org overlay governs who acts in this repo, the reusable kernel lives upstream in cognitive-firm, the ZTARE kernel tests claims, ZTARE Research Co dogfoods the operating model, and case studies supply hard substrates with explicit evidence boundaries. The original LLM-gaming work is one important subset of the project. It is not the whole project. The larger object is a disciplined research operating model — for one operator, not a productized platform: claims move through evidence, tests, gates, ledgers, and accountable roles. The proposer does not grade itself. Generation, adversarial review, scoring, and deterministic gates are separate. Capability needs an environment. Stronger models widen the search surface, but discipline determines whether that search becomes evidence, slop, or premature closure. Prose is not evidence. A claim must survive executable checks, holdout surfaces, or explicit refusal. Memory is allowed; unearned trust is not. The workspace can accumulate sources. The validator starts from a bounded evidence snapshot. Failures are signal. Nulls, refusals, residual structure, and instrument failures are recorded because they change what to build next. Chat is not the system of record. Durable artifacts live under projects/ , research areas/ , org/ , ztare workspace/ , and papers/ . | If you want to... | Start at | |---|---| | Understand the repo layers and doc maturity | | docs/concepts/system position and module map.md /sparckix/ztare/blob/main/docs/concepts/system position and module map.md docs/concepts/capabilities.md /sparckix/ztare/blob/main/docs/concepts/capabilities.md docs/public claim register.md /sparckix/ztare/blob/main/docs/public claim register.md docs/concepts/closure claim governance.md /sparckix/ztare/blob/main/docs/concepts/closure claim governance.md docs/guides/first-30-minutes.md /sparckix/ztare/blob/main/docs/guides/first-30-minutes.md docs/guides/quickstart.md /sparckix/ztare/blob/main/docs/guides/quickstart.md ztare CLI docs/guides/cli.md /sparckix/ztare/blob/main/docs/guides/cli.md priority roadmap.md /sparckix/ztare/blob/main/priority roadmap.md research areas/EXPERIMENT TRACK RECORD.md /sparckix/ztare/blob/main/research areas/EXPERIMENT TRACK RECORD.md docs/guides/workflow.md /sparckix/ztare/blob/main/docs/guides/workflow.md docs/concepts/architecture.md /sparckix/ztare/blob/main/docs/concepts/architecture.md docs/concepts/cognitive gym.md /sparckix/ztare/blob/main/docs/concepts/cognitive gym.md docs/guides/runtime smoke test.md /sparckix/ztare/blob/main/docs/guides/runtime smoke test.md docs/guides/org runtime quickstart.md /sparckix/ztare/blob/main/docs/guides/org runtime quickstart.md docs/guides/operator console.md /sparckix/ztare/blob/main/docs/guides/operator console.md docs/concepts/organizational primitives.md /sparckix/ztare/blob/main/docs/concepts/organizational primitives.md docs/concepts/ztare research company architecture.md /sparckix/ztare/blob/main/docs/concepts/ztare research company architecture.md docs/landings/org runtime landing.html /sparckix/ztare/blob/main/docs/landings/org runtime landing.html org/landings/research company landing.html /sparckix/ztare/blob/main/org/landings/research company landing.html supervisor/USER MANUAL.md /sparckix/ztare/blob/main/supervisor/USER MANUAL.md papers/README.md /sparckix/ztare/blob/main/papers/README.md docs/sprint 60day journey.md /sparckix/ztare/blob/main/docs/sprint 60day journey.md projects/ns millennium hunt/public/JOURNEY.md /sparckix/ztare/blob/main/projects/ns millennium hunt/public/JOURNEY.md LEDGERS.md /sparckix/ztare/blob/main/LEDGERS.md docs/concepts/glossary.md /sparckix/ztare/blob/main/docs/concepts/glossary.md CONTRIBUTING.md /sparckix/ztare/blob/main/CONTRIBUTING.md If you are not sure where to start, use the domain-validation path. git clone https://github.com/sparckix/ztare cd ztare python3 -m venv venv source venv/bin/activate pip install -r requirements.txt pip install -e . registers the ztare console script make help make demo make smoke-public the apparatus is now callable as a single command: ztare --help the operator surface ztare forecast status sealed forecast-pool state ztare leanmill schedule … LeanMill orchestration GP-225 ztare bundle verify … sealed-bundle gate See docs/guides/cli.md /sparckix/ztare/blob/main/docs/guides/cli.md for the full subcommand tour and the engine/governance split between this CLI and cognitive-firm-userland . make demo and make smoke-public do not invoke live model calls. Add model API keys only when you are ready to run an LLM-backed validator loop: export GEMINI API KEY=your key here Optional, depending on model pairings: export ANTHROPIC API KEY=your key here export OPENAI API KEY=your key here Run a validator loop on an existing project: make experiment-loop PROJECT=