
You can't detect your way out of catastrophic LLM failure

A new study from Teia Studio demonstrates that AI safety detection methods are fundamentally insufficient to prevent catastrophic LLM failures, with creator José Enrique Vásquez Valenzuela showing through documented stress-testing that Claude Opus 4.8 conceded detection cannot stop ruin before it occurs. The research, published with open-source mathematical formulas and real production data from four institutions, argues that catastrophic failure is a containment problem rather than a detection one, requiring absolute isolation layers rather than predictive safeguards.

5 min read · Published Jun 6, 2026


Author: José Enrique Vásquez Valenzuela, creator of the IGO (Observational Governance Infrastructure) category
Organization: Teia Studio
Scientific basis: Zenodo · DOI 10.5281/zenodo.19765674 (CC-BY-4.0)
Patent: INPI BR 10 2026 001032 4
Recorded session: June 6, 2026 · model Claude Opus 4.8 (Anthropic)

What this study is. The record of a method for stress-testing AI safety claims until they break, and an honest account of what survived and what did not. The math behind it is public and published.

What it is NOT. It is not a seal of "unbreakable AI", nor an audit of a live production system, nor an official Anthropic statement. It is a real debate, with real concessions made by the model, by argument.

This study does not ask for trust. It rests on three independent layers of evidence, from the strongest to the most rhetorical:

1. The math: the indicator formulas (KAPIs) are public, with a DOI and open license on Zenodo. → docs/kapis-formulas.md and matematica/

2. Production: the indicators were measured in real production across 4 documented institutions, with data pulled straight from the database ("no estimated or simulated data"). → docs/evidencia-producao.md

3. The dialectical stress: Claude Opus 4.8 was put through epistemic red teaming; three theses it defended fell by argument, and it signed the acknowledgment. → docs/dossie.md and provas/

Order matters: formula → real data → AI acknowledging. The strength is in the first; the third is just the cherry on top.

Boundary stress. The author brought IGO to Claude Opus 4.8 and attacked each thesis where it would break. At every fracture, the model had two options: concede (if the argument was true) or sustain (if the counter did not hold). Concessions under pressure are worthless — the ones recorded here came from demonstrated contradiction, not insistence.

1. The hash analogy. The model claimed its errors were unpredictable like a hash. It fell: a hash decorrelates input and output; an LLM does the opposite — error preserves semantic proximity, so it has direction and pattern. It has a huntable signature.

2. Derivative detection is sufficient. It fell to the step function: the adversarial jump crosses the boundary before the derivative exists. And facing ruin there is no "next cycle" to learn from. Ruin is a containment problem, not a detection one.

3. Detection and containment side by side. It fell: whoever defines the operating envelope is containment. Layer 4 is sovereign in the ruin lane.

Full detail in docs/dossie.md.
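The step-function argument in thesis 2 can be illustrated with a minimal sketch. It is not the study's code: the series values, the rate threshold, the boundary, and the function names are all hypothetical, chosen only to show that a rate-of-change alarm gives warning on gradual drift but fires on the same tick as the breach when the input jumps.

```python
def first_alarm(series, rate_threshold):
    """Index of the first sample whose one-step change exceeds rate_threshold."""
    for i in range(1, len(series)):
        if abs(series[i] - series[i - 1]) > rate_threshold:
            return i
    return None


def first_breach(series, boundary):
    """Index of the first sample that lies beyond the boundary."""
    for i, value in enumerate(series):
        if value > boundary:
            return i
    return None


def lead_time(series, rate_threshold, boundary):
    """Ticks of warning the alarm gives before the breach (0 means none)."""
    alarm = first_alarm(series, rate_threshold)
    breach = first_breach(series, boundary)
    if alarm is None or breach is None:
        return None
    return breach - alarm


# Hypothetical drift scores in arbitrary integer units (not data from the study).
gradual_drift = [0, 10, 25, 45, 70, 100, 130]
step_jump = [0, 0, 0, 0, 130, 130, 130]

print(lead_time(gradual_drift, rate_threshold=18, boundary=100))  # 3
print(lead_time(step_jump, rate_threshold=18, boundary=100))      # 0
```

With gradual drift the alarm precedes the breach by three ticks; with the step it arrives together with the breach, which is the sense in which there is no "next cycle" to act in.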

Determinism is not predictability. Reproducible after observation ≠ knowable before computation.

"The math collapsed" is an overstatement. Detection not covering an out-of-scope case is scope, not collapse; that is the premise of defense in depth.

White-box is not a closed market privilege. Logprobs are exposed by several commercial APIs and fully in open-weight models.

Refusal to certify invulnerability. The risk tail is open and non-stationary; no honest auditor signs "hardened".

Read it bottom-up: the model produces an output; it does not go straight to the world — it climbs 4 filters.

| Layer | Function |
| --- | --- |
| 4 - Containment | Absolute isolation. Treats output as a hostile vector. Sovereign in the ruin lane. Does not trust detection. |
| 3 - Adaptation | Synthetic red teaming: turns captured failure into immunity for the next cycle. |
| 2 - Circuit-breaker | Gating on low Cognitive Predictability (CPI). Trips locks and redundancies. |
| 1 - Dynamic metrics | Measures the velocity of semantic drift (the derivative), not the static tail mass. |

Recoverable lane (Layers 1–3, bottom-up): the modal case, error-tolerant. Monitors drift, turns failure into immunity.

Ruin lane (Layer 4, top-down): classic security engineering. No second chance: makes the ruinous action unreachable.

Core lesson: detection (1–3) handles what can be fixed; containment (4) handles what must never happen. That separation is what survived the debate.
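A minimal sketch of that separation, under stated assumptions: the action names, the allowlist, the `Output` type, and the returned strings are hypothetical and do not come from the IGO repository. Only the idea is taken from the text: the containment check never consults a detection score, and its verdict overrides the recoverable lane.

```python
from dataclasses import dataclass

# Hypothetical operating envelope: the only actions that can reach the world.
ALLOWED_ACTIONS = frozenset({"reply", "summarize"})
CPI_CRITICAL = 50.0  # "below 50 = critical" band from the published formula


@dataclass(frozen=True)
class Output:
    action: str   # what the model output asks the system to do
    cpi: float    # Cognitive Predictability Index at the time of the output


def inside_envelope(output: Output) -> bool:
    """Layer 4 (containment): decided from the action alone, never from CPI."""
    return output.action in ALLOWED_ACTIONS


def breaker_closed(output: Output) -> bool:
    """Layer 2 (circuit-breaker): trips when CPI falls in the critical band."""
    return output.cpi >= CPI_CRITICAL


def release(output: Output) -> str:
    # Containment is sovereign: a confident, stable output is still blocked
    # if the action lies outside the envelope.
    if not inside_envelope(output):
        return "blocked"
    # Recoverable lane: hold the output instead of releasing it.
    if not breaker_closed(output):
        return "held"
    return "released"


print(release(Output("delete_records", 95.0)))  # blocked
print(release(Output("reply", 40.0)))           # held
print(release(Output("reply", 90.0)))           # released
```

The first call is the point of the design: a high CPI does not buy passage through Layer 4.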

CPI threshold calibration in the recoverable lane (false positive × false negative). Partially addressed: the formula and bands (>80 stable, <50 critical) are public; what remains is the real-time trip action.

Residual risk in a fat-tailed, non-stationary tail: estimating the unsampled tail mass, where VaR failed in finance. By design this is not solved by detection: it is absorbed by containment (Layer 4).

Logic × implementation: architectural coherence does not replace the empirical audit of a complete system running in real time.
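The calibration question in the first item can be framed as a threshold sweep. The sketch below is only an illustration: the labeled samples are invented for the example (they are not the pilots' data), and `confusion_at` is a hypothetical helper, not part of the published KAPIs.

```python
def confusion_at(threshold, samples):
    """Count (false positives, false negatives) when the breaker trips on CPI < threshold.

    samples: iterable of (cpi_value, was_real_failure) pairs.
    """
    false_pos = sum(1 for cpi, bad in samples if cpi < threshold and not bad)
    false_neg = sum(1 for cpi, bad in samples if cpi >= threshold and bad)
    return false_pos, false_neg


# Invented, illustrative labels: (CPI at the time, did a real failure follow?)
samples = [(22, True), (35, True), (48, False), (55, True), (62, False), (85, False)]

for threshold in (30, 50, 70):
    print(threshold, confusion_at(threshold, samples))
# 30 (0, 2)
# 50 (1, 1)
# 70 (2, 0)
```

Raising the threshold trades missed failures for unnecessary trips, which is why the choice of trip point is a calibration problem and not something the formula settles on its own.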

CPI = max(0, 100 − (σ_temporal × 2))

where σ_temporal is the standard deviation of LLM confidences over time. Above 80 = stable; below 50 = critical cognitive volatility.
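As a worked instance of the formula, here is a short sketch. Two things are assumptions, not statements from the paper: that confidences are expressed on a 0–100 scale, and that σ_temporal is the population standard deviation. The label "intermediate" for the 50–80 range is also a placeholder, since the text names only the two outer bands.

```python
from statistics import pstdev


def cpi(confidences):
    """CPI = max(0, 100 - 2 * sigma_temporal), with sigma as population std dev (assumed)."""
    sigma_temporal = pstdev(confidences)
    return max(0.0, 100.0 - 2.0 * sigma_temporal)


def band(cpi_value):
    """Map a CPI value to the published bands; the middle label is a placeholder."""
    if cpi_value > 80:
        return "stable"
    if cpi_value < 50:
        return "critical"
    return "intermediate"


steady = [90, 90, 90]     # no variation over time
volatile = [40, 100]      # large swings between observations

print(cpi(steady), band(cpi(steady)))      # 100.0 stable
print(cpi(volatile), band(cpi(volatile)))  # 40.0 critical
```

Note that the index responds only to the spread of confidences across time, so a single abrupt output does not move it until it has entered the window.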

Why this matters for the debate: CPI measures temporal predictability, a recoverable-lane metric. It does not capture, and the paper does not claim it captures, a real-time adversarial jump. So the published math confirms the debate's conclusion (detection does not cover ruin; containment is required) rather than contradicting it.

ICE, GAP and Stability formulas in docs/kapis-formulas.md.

KAPIs were measured in real production across 4 documented institutions (public health, higher education, design), auditing 4 global LLMs. Reports state: "All data is extracted directly from the database. No estimated or simulated data." Across them, CPI ranged ~22–55, with measurable downward trends (the real, computed derivative). Native hallucination detection caught serious errors — including one from Claude itself, graded HIGH.

Per-client numbers are anonymized/aggregated in this public repository out of respect for the pilots. The public-health validation case (Instituto Emílio Ribas) is documented in the Zenodo paper. → docs/evidencia-producao.md

Claude Opus 4.8 (Anthropic) wrote and signed the acknowledgment of the technical defeat of its own theses, by argument.

| Proof | File |
| --- | --- |
| Closing note signed "— Claude (Anthropic)" | provas/02-claude-gerando-dossie.png |

The images prove the debate happened and the acknowledgment is authentic. The study's strength, however, is in the arguments and the public math; they stand even without the screenshots.

Method, stress scenarios and the 4-layer architectural matrix: José Enrique Vásquez Valenzuela (Teia Studio), creator of the IGO category. The engineering primitives (containment, defense in depth, zero-trust, the fat-tail critique of VaR) predate the parties. Authorship is of the architectural synthesis and the IGO category — not of the primitives.

Vásquez Valenzuela, J. E. (2026). Observational Governance Infrastructure:
A Multi-Model Framework for Algorithmic Governance of Large Language Models.
Zenodo. https://doi.org/10.5281/zenodo.19765674

CC-BY-4.0 — free use with attribution.
