{"slug": "show-hn-i-applied-lyapunov-stability-theory-to-detect-when-llm-agents-spiral", "title": "Show HN: I applied Lyapunov stability theory to detect when LLM agents spiral", "summary": "A developer released state-harness, an open-source Python library that uses Lyapunov stability theory to detect and classify failure patterns in multi-turn LLM agents without extra LLM calls. The tool monitors token consumption relative to a baseline, identifying spirals, retry storms, and policy drift while providing actionable fix suggestions. Validated across 3,175 runs with zero false positives, it aims to improve debugging and compute efficiency for production agent systems.", "body_md": "Lyapunov-stability monitor for multi-turn LLM agents. Detects token spirals, classifies failure patterns, and tells you why a task failed — no extra LLM calls.\n\n``` python\nfrom state_harness import GrowthRatioGuard, FailureReport\n\nguard = GrowthRatioGuard(token_budget=50_000)\n\nwith guard:\n    for turn in agent_loop:\n        result = llm.invoke(turn.prompt)\n        guard.record_step(tokens_used=result.usage.total_tokens)\n\n# What went wrong? (zero-cost, no LLM calls)\nreport = FailureReport.from_guard(guard)\nprint(report)\n```\n\n```\n⚠️  STABILITY TRIPPED at turn 12\n\nPattern: Context Accumulation Spiral (confidence: 92%)\n  • Last 5 turns all exceeded 1.5× baseline (4/4 were accelerating).\n  • Peak growth ratio: 5.2× baseline.\n  • Without intervention, projected cost was $0.0396 (actual: $0.0039).\n\nEnergy: ▁▁▁▁▁▂▂▃▄▆█\n  Baseline: 1050 tokens/turn\n  Peak ratio: 5.2× baseline\n\nCost: $0.0039 (saved ~$0.0357 by tripping early)\n\nSuggested actions:\n  🔴 1. Enable RG history compression in your agent loop.\n     → Compressing older messages reduces prompt tokens by 40-60%.\n  🟡 2. Lower the growth ratio threshold to 1.8×.\n     → A lower threshold would have caught it earlier.\n  🟢 3. 
Add a sliding-window context strategy.\n     → Send only the last N messages plus a summary of earlier ones.\n```\n\nProduction multi-agent systems fail at rates of 41–87% ([Kore.ai 2026](https://kore.ai)). When an agent spirals — replaying full context, retrying a broken tool, drifting off-task — a budget cap will kill it but tell you nothing about *why*.\n\nState-harness monitors token consumption relative to a warmup baseline via a [Lyapunov energy function](https://en.wikipedia.org/wiki/Lyapunov_stability). When the growth ratio exceeds a threshold for W consecutive steps, it trips and classifies the failure pattern (context spiral, retry storm, policy drift) with fix suggestions — from the energy trajectory alone, no LLM calls.\n\nRun `pip install state-harness` and wrap your agent loop.\n\n| Pattern | Signal | Example |\n|---|---|---|\n| Context Spiral | Token growth accelerating beyond baseline | Agent replaying full history each turn |\n| Retry Storm | Low-variance repeated calls | Tool failing, agent retrying identically |\n| Policy Drift | VSA similarity score dropping | Agent going off-topic mid-conversation |\n| Early Explosion | Token spike in first 3 turns | Oversized system prompt or tool response |\n| Budget Exhaustion | Cumulative spend hits ceiling | Complex task, not necessarily broken |\n\nState-harness does not improve resolve rates — a naive budget cap achieves comparable task success ([multi-trial results below](#multi-trial-validation-333-runs)). The value is:\n\n- **Failure diagnostics**: classified failure patterns with actionable fixes, not just \"budget exceeded.\" No extra LLM calls.\n- **Compute efficiency on long loops**: 38.6% fewer search nodes and 30% less wall time on SWE-bench by terminating dead-end branches early.\n\nValidated across 3,175 runs (4 benchmarks, 5-condition ablation, multi-trial with bootstrap CIs). Zero false positives across 7 models, including 4 local via Ollama. 
Details in [Benchmarks](#benchmarks).\n\n- **Search-tree agents** (MCTS, beam search): per-branch caps look fine in isolation; tree-level cost explosion is silent.\n- **Platform teams at scale**: failure classification at the edge, exported as OpenTelemetry attributes.\n- **Benchmarking**: the ~4–5% nondeterminism floor means single-run deltas <8% are noise.\n\nNot needed for chatbots, RAG, single-turn apps, or ReAct loops with <10 turns — `max_iterations` plus a budget cap suffice.\n\n```\npip install state-harness\n```\n\nPython ≥ 3.10. Pre-built wheels for Linux, macOS, Windows (x86_64 + ARM64). No Rust toolchain needed.\n\nTo build from source:\n\n```\ngit clone https://github.com/vishal-dehurdle/state-harness.git\ncd state-harness\n\npython -m venv .venv && source .venv/bin/activate\n\n# Install Rust (if not already installed)\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n\npip install maturin\nmaturin develop --release\n\n# Run tests\npip install pytest\npytest tests/\n```\n\n`GrowthRatioGuard` normalizes token usage against a baseline and trips only on *disproportionate* growth, not natural context-window accumulation.\n\n``` python\nfrom state_harness import GrowthRatioGuard, StabilityViolation\n\nguard = GrowthRatioGuard(\n    token_budget=100_000,     # hard ceiling\n    ratio_threshold=2.0,      # trip when turn is 2× the baseline\n    window=3,                 # 3 consecutive escalating turns to trip\n    budget_gate=8_000,        # don't trip until 8K tokens spent\n)\n\nwith guard:\n    for turn in agent_loop:\n        try:\n            result = llm.invoke(turn.prompt)\n            guard.record_step(\n                tokens_used=result.usage.total_tokens,\n                errors=0,\n            )\n        except StabilityViolation as e:\n            print(f\"Agent killed: {e}\")\n            break\n\nprint(f\"Total cost: {guard.total_tokens} tokens\")\nprint(f\"Baseline: {guard.baseline} tokens/turn\")\nprint(f\"Peak ratio: {guard.current_ratio}×\")\n```\n\nAfter any 
execution (tripped or not):\n\n``` python\nimport json\n\nfrom state_harness import FailureReport\n\nreport = FailureReport.from_guard(guard, model=\"gemini-2.5-flash\")\n\n# Human-readable terminal output\nprint(report)\n\n# Structured dict for logging / dashboards\nprint(json.dumps(report.to_dict(), indent=2))\n```\n\nThe report classifies the failure pattern, provides evidence, estimates cost, and suggests fixes — no LLM calls.\n\nFor lower-level control using raw token counts (no normalization):\n\n``` python\nfrom state_harness import BoundaryGuard\n\nwith BoundaryGuard(token_budget=100_000, lambda_=1.0, window=5) as guard:\n    for turn in agent_loop:\n        result = llm.invoke(turn.prompt)\n        guard.record_step(\n            tokens_used=result.usage.total_tokens,\n            errors=0,\n            tool_name=\"search\",\n        )\n```\n\nAs a decorator:\n\n``` python\nfrom state_harness import boundary_guard\n\n@boundary_guard(\n    token_budget=50_000,\n    token_counter=lambda r: r.usage.total_tokens,\n)\ndef agent_step(prompt: str):\n    return llm.invoke(prompt)\n```\n\nWith LangGraph:\n\n``` python\nfrom langgraph.prebuilt import create_react_agent\nfrom state_harness.adapters import monitor_graph\n\nagent = create_react_agent(model, tools=[search, calculate])\nsafe = monitor_graph(agent, token_budget=100_000)\n\nresult = safe.invoke({\"messages\": [(\"user\", \"Fix the login bug\")]})\n\n# After execution — always available:\nprint(safe.total_tokens)  # cumulative usage\nprint(safe.tripped)       # did stability trip?\nprint(safe.report)        # full FailureReport with pattern + suggestions\n```\n\nFor streaming:\n\n```\nfor chunk in safe.stream({\"messages\": [(\"user\", \"Refactor this module\")]}):\n    print(chunk)\n```\n\nWith a trip callback (e.g., for Slack alerts):\n\n```\nsafe = monitor_graph(\n    agent,\n    token_budget=100_000,\n    on_trip=lambda report: slack.send(f\"Agent tripped: {report.pattern}\"),\n)\n```\n\n## Advanced: per-tool wrapping with LangGraphMiddleware\n\n``` python\nfrom 
state_harness import BoundaryGuard\nfrom state_harness.adapters import LangGraphMiddleware\n\nguard = BoundaryGuard(token_budget=150_000)\nmiddleware = LangGraphMiddleware(guard)\n\n@middleware.wrap_tool\ndef search_database(query: str):\n    return db.search(query)\n\nwith guard:\n    result = agent.invoke({\"messages\": [...]})\n```\n\nWith CrewAI:\n\n``` python\nfrom crewai import Agent, Task, Crew\nfrom state_harness.adapters import CrewAICallback\n\ncallback = CrewAICallback(token_budget=200_000)\n\ncrew = Crew(\n    agents=[researcher, writer],\n    tasks=[research_task, write_task],\n    step_callback=callback.step_callback,\n    task_callback=callback.task_callback,\n)\n\nresult = crew.kickoff()\nprint(callback.report)  # FailureReport\ncallback.close()\n```\n\nWith a plain agent loop (no framework):\n\n``` python\nfrom state_harness import BoundaryGuard\nfrom state_harness.adapters import VanillaHook\n\nguard = BoundaryGuard(token_budget=50_000)\nhook = VanillaHook(guard)\n\nwith guard:\n    for step in agent_loop:\n        hook.before_call(tool_name=\"search\")\n        result = execute_tool(step)\n        hook.after_call(tokens_used=result.tokens)\n```\n\nFrom the command line:\n\n```\n# Simulate a token trajectory — see what the guard would do\nstate-harness simulate 1000 1200 1500 2000 3000 5000 8000 --budget 50000\n\n# Analyze a saved report\nstate-harness analyze report.json\nstate-harness analyze report.json --json    # JSON output\nstate-harness analyze report.json --otel    # OpenTelemetry attributes\n\n# Batch analyze all reports in a directory\nstate-harness batch --dir ./reports/ --output results.csv\n```\n\n`FailureReport` supports multiple output formats:\n\n```\nreport = FailureReport.from_guard(guard)\n\n# JSON (for logging, APIs, storage)\nreport.to_json()            # pretty-printed\nreport.to_json(indent=None) # compact, single line\n\n# CSV (for batch analysis of 1000s of runs)\nwith open(\"results.csv\", \"w\") as f:\n    f.write(FailureReport.csv_header() + \"\\n\")\n    for r in reports:\n        f.write(r.to_csv_row() + \"\\n\")\n\n# 
OpenTelemetry (for Datadog, Grafana, Honeycomb)\nfrom opentelemetry import trace\nspan = trace.get_current_span()\nspan.set_attributes(report.to_otel_attributes())\n# Adds: state_harness.pattern, state_harness.confidence, etc.\n```\n\nThree mechanisms, implemented in Rust (via PyO3):\n\n``` mermaid\ngraph TD\n    A[\"Agent Loop\"] --> B[\"GrowthRatioGuard\\n(Python SDK)\"]\n    B --> |\"Normalizes tokens → growth ratio\\nWarmup baseline · Budget gate\"| C{\" \"}\n    C --> D[\"Lyapunov Monitor\\nV(k) = S + λθ\\nΔV ≥ 0?\"]\n    C --> E[\"RG Decimator\\nTF-IDF\\nCompression\"]\n    C --> F[\"Holographic Engine\\n(VSA)\\nDrift Detection\"]\n    \n    style D fill:#1a1a1a,stroke:#555,color:#e8e8e8\n    style E fill:#1a1a1a,stroke:#555,color:#e8e8e8\n    style F fill:#1a1a1a,stroke:#555,color:#e8e8e8\n    style B fill:#0d1117,stroke:#30363d,color:#e6edf3\n```\n\n| Component | Purpose | Speed |\n|---|---|---|\n| Lyapunov Monitor | Tracks energy derivative ΔV(k). Trips when ΔV ≥ 0 for W consecutive steps. | ~1μs/step |\n| RG Decimator | RG-inspired decimation of conversation history (TF-IDF scoring). Retains structurally important messages. | ~100μs/compress |\n| Holographic Engine | VSA-based policy drift detection. Binds domain invariants to high-dimensional vectors. | ~10μs/check |\n\n5-condition ablation across 4 benchmarks (3,175 total runs). Full methodology in the [research paper](https://vishalvermalabs.com/papers/empirical-lyapunov-stability-agent-failure).\n\n| Condition | Lyapunov | RG Decimation | VSA Dual-Gate | Description |\n|---|---|---|---|---|\n| A. Baseline | — | — | — | Unmonitored agent |\n| B. Lyapunov-only | ✅ | — | — | Energy monitoring, no intervention |\n| C. Lyapunov+RG | ✅ | ✅ | — | + history compression on violation |\n| D. Full-stack | ✅ | ✅ | ✅ | + policy drift gating |\n| E. 
Naive Cap | — | — | — | Hard budget cap (control) |\n\n| Benchmark | Runs | Stability Trips | Cost Savings (D vs A) | Resolve-Rate Δ | Diagnostics |\n|---|---|---|---|---|---|\n| MINT (reasoning + coding) | 1,136 | 0 | ~0% | −0.7pp (noise) | N/A (no trips) |\n| τ³-bench (customer service) | 750 | 0 | 8.1% | within ±12pp nondeterminism | N/A (no trips) |\n| SWE-bench Verified (coding) | 333 + 148 | ~38% | 38.6% (nodes) | −3.6pp (within ±4–5% noise) | Pattern classification |\n| Custom Local (4 models) | 240 | 3 (true pos.) | 15.2% | 0pp | Pattern classification |\n| MINT Local (Qwen3:4B) | 568 | 0 | ~0% | +1.8pp | N/A (no trips) |\n\nResolve-rate deltas fall within LLM nondeterminism (~4–5% stdev). No trips on short/medium loops (1,886 runs). Savings concentrate on long-loop search trees.\n\n37 Django instances, SWE-bench Verified. Agent: moatless-tools SearchTree, 50-node budget. Model: Gemini 2.5 Flash.\n\n| Condition | Resolved | Rate | Total Nodes | Wall Time | Nodes/Resolve |\n|---|---|---|---|---|---|\n| A. Baseline | 15 / 37 | 40.5% | 945 | 80 min | 63.0 |\n| B. Lyapunov | 16 / 37 | 43.2% | 620 | 69 min | 38.8 |\n| D. Full-stack | 14 / 37 | 37.8% | 580 | 56 min | 41.4 |\n| E. Naive Cap | 21 / 37 | 56.8% | 876 | 77 min | 41.7 |\n\nNote: Single-trial resolve rates have ~±8pp standard error. E's apparent 56.8% is not statistically significant vs A's 40.5%. Multi-trial results below confirm this.\n\nFull-stack monitoring: 38.6% fewer nodes (945 → 580), 30% less wall time (80 → 56 min). Baseline had 7 tasks burning the full 50-node budget (all failed); with monitoring, zero hit the ceiling. Lyapunov alone (Condition B, ~5 lines of code) delivers ~90% of the savings.\n\n**Ablation — each mechanism contributes independently:**\n\n| Layer Added | Compute (nodes) | Δ vs Baseline | Cumulative Reduction |\n|---|---|---|---|\n| A. No monitoring | 945 | — | — |\n| B. + Lyapunov | 620 | −325 | 34.4% |\n| D. 
+ RG + VSA | 580 | −40 | 38.6% |\n\nLyapunov alone delivers ~90% of the benefit. RG and VSA add incremental value.\n\n3 trials per condition (A, D, E) across all 37 instances — **333 total runs**. 12 runs stuck in Docker (28+ min), counted as failures:\n\n| Condition | Trial 1 | Trial 2 | Trial 3 | Mean ± σ |\n|---|---|---|---|---|\n| A. Baseline | 18/37 (48.6%) | 16/37 (43.2%) | 15/37 (40.5%) | 44.1% ± 4.1% |\n| D. Full-stack | 15/37 (40.5%) | 16/37 (43.2%) | 14/37 (37.8%) | 40.5% ± 2.7% |\n| E. Naive Cap | 19/37 (51.4%) | 15/37 (40.5%) | 17/37 (45.9%) | 45.9% ± 5.4% |\n\nCross-condition variance (2.9%) ≤ within-condition nondeterminism (4.1%). All differences fall within the noise band.\n\nThe ~4% within-condition stdev converges with τ³-bench (±4.6%), establishing a ~4–5% nondeterminism floor for Gemini 2.5 Flash on code tasks. Single-run deltas <8% are unreliable.\n\nBootstrap CIs (10,000 resamples) and Welch's t-tests: A−D = +3.6pp [−0.9, +8.1], p ≈ 0.17; A−E = −1.8pp [−8.1, +4.5], p ≈ 0.68; D−E = −5.4pp [−10.8, 0.0], p ≈ 0.09. Full analysis in [paper §7.3.1](https://vishalvermalabs.com/papers/empirical-lyapunov-stability-agent-failure).\n\n50 tasks × 3 trials × 5 conditions = **750 total runs**. Agent handles airline reservations via tool calls. Model: Gemini 2.5 Flash. Concurrency=1.\n\n| Condition | Trial Pass | Rate | Task Pass (maj) | Rate | Cost | Cost Δ |\n|---|---|---|---|---|---|---|\n| A. Baseline | 99/150 | 66.0% | 35/50 | 70.0% | $2.47 | — |\n| B. Lyapunov-only | 83/150 | 55.3% | 28/50 | 56.0% | $2.42 | −2.0% |\n| C. Lyapunov+RG | 79/150 | 52.7% | 26/50 | 52.0% | $1.69 | −31.8% |\n| D. Full-stack | 86/150 | 57.3% | 30/50 | 60.0% | $2.28 | −8.1% |\n| E. 
Naive Cap | 81/150 | 54.0% | 26/50 | 52.0% | $2.33 | −5.7% |\n\n**Key findings:**\n\n- **Zero stability trips across 750 runs.** All airline tasks classified as stable; no interventions.\n- **Pass-rate variance is nondeterminism.** Naive cap (E, zero monitoring) drops −16pp from baseline, *worse* than full-stack (D, −10pp). The ~10–16pp spread is intrinsic variance.\n- **25% of tasks flip pass/fail** within the same condition across trials (~±12pp nondeterminism floor).\n- **8.1% cost savings** from passive monitoring (zero interventions).\n\n284 tasks × 4 conditions = **1,136 total runs** across GSM8K (48), MATH (100), HumanEval (45), MBPP (91). Agent uses up to 5 turns per task.\n\n| Condition | GSM8K | MATH | Total | Tokens |\n|---|---|---|---|---|\n| A. Baseline | 91.7% | 39.0% | 29.2% | 1,909,582 |\n| B. Lyapunov | 91.7% | 41.0% | 29.9% | 1,904,421 |\n| C. Lyapunov+RG | 89.6% | 37.0% | 28.2% | 1,910,926 |\n| D. Full-stack | 87.5% | 39.0% | 28.5% | 1,949,708 |\n\nZero stability violations across 1,136 runs. Token usage is essentially invariant (at most 2.1% overhead, in Condition D).\n\nFailed tasks cost disproportionately more:\n\n| Task | Success Avg | Failure Avg | Ratio |\n|---|---|---|---|\n| GSM8K | 2,613 tok | 8,857 tok | 3.4× |\n| MATH | 5,154 tok | 8,188 tok | 1.6× |\n\nHumanEval and MBPP show 0% across all conditions — a MINT framework limitation in code execution evaluation, consistent across conditions (the harness does not introduce new failure modes).\n\n20 custom tasks (5 easy, 10 medium, 5 hard) × 4 models × 3 conditions = **240 runs**. 
Hardware: Apple M4 MacBook Pro, 16 GB RAM, Ollama local inference.\n\n| Model | Size | Baseline | Harness | Naive Cap | Token Savings | FP |\n|---|---|---|---|---|---|---|\n| Llama 3.2:3B | 2.0 GB | 45% | 45% | 60% | 1.2% | 0 |\n| Phi-4-Mini | 2.5 GB | 30% | 30% | 40% | 20.7% | 0 |\n| Qwen3:4B | 2.5 GB | 30% | 30% | 40% | 0.9% | 0 |\n| Gemma4:E4B | 9.6 GB | 35% | 35% | 70% | 37.9% | 0 |\n\n**Key findings:**\n\n- **Zero false positives across 80 harness runs**: 4 model families, 3 difficulty tiers. Growth-ratio generalizes without threshold retuning.\n- **Small-model self-sabotage:** Naive cap beats baseline by +17.5pp avg (+12.5pp median). Small models solve early turns correctly, then destroy solutions in later turns. Strongest on Gemma4:E4B (+35pp).\n- **Model-family behavioral signatures:**\n  - *Llama 3.2:3B:* Classic spirals (ratios: 2.3×, 5.9×, 7.6×); 3 true-positive trips\n  - *Phi-4-Mini:* Spike-and-recover; 20.7% passive savings\n  - *Qwen3:4B:* 255K tokens but flat ratios (≤1.06×); stable despite 3× volume\n  - *Gemma4:E4B:* Decreasing ratios; 37.9% passive savings, zero trips\n\nDeploying ≤4B models via Ollama? State-harness works out of the box (zero false positives). The self-sabotage finding suggests adding a turn limit (2–3 turns) for open-ended code generation.\n\n| Task | Harness (max=5) | Naive Cap (max=2) | Δ |\n|---|---|---|---|\n| GSM8K | 37.5% | 27.1% | +10.4pp |\n| MATH | 0.0% | 0.0% | — |\n| HumanEval | 11.1% | 11.1% | — |\n| MBPP | 14.3% | 14.3% | — |\n| Total | 12.7% | 10.9% | +1.8pp |\n\nZero interventions across 284 tasks. With max 5 turns and W=3, the monitor **cannot trigger** within available post-warmup turns (with the default 3-turn warmup, only 2 monitored turns remain), which is a structural guarantee.\n\n## Full reproduction steps (all three benchmarks)\n\n```\n# 1. Clone repos\ngit clone https://github.com/vishal-dehurdle/state-harness.git\ngit clone https://github.com/sierra-research/tau-bench.git tau3-bench\n\n# 2. 
Install state-harness\ncd state-harness\npython -m venv .venv && source .venv/bin/activate\npip install maturin && maturin develop --release\n\n# 3. Install τ³-bench (with state-harness agent)\ncd ../tau3-bench\nuv sync\ncp ../state-harness/tau3_integration/harness_agent.py src/tau2/agent/\ncp ../state-harness/tau3_integration/naive_cap_agent.py src/tau2/agent/\n\n# 4. Configure Vertex AI\nexport GOOGLE_CLOUD_PROJECT=your-project-id\nexport VERTEXAI_LOCATION=asia-south1\n\n# 5. Run τ³ 5-phase benchmark\nbash benchmarks/tau3/run_5phase_airline.sh\n\n# 6. Run SWE-bench (requires Docker images)\nbash benchmarks/swe_bench/run_benchmark.sh\nbash benchmarks/swe_bench/run_benchmark_dbe.sh\n\n# 7. Run MINT\nbash benchmarks/mint/run_mint_fullstack.sh\n```\n\n**Ablation conditions are controlled via environment variables:**\n\n| Variable | Values | Effect |\n|---|---|---|\n| `HARNESS_RG` | `on` / `off` | Enable/disable RG history compression |\n| `HARNESS_VSA` | `on` / `off` | Enable/disable VSA policy drift detection |\n| `HARNESS_RATIO_THRESHOLD` | float (e.g., `2.0`) | Override growth ratio threshold |\n| `HARNESS_BUDGET_GATE` | int (e.g., `8000`) | Override minimum spend before trip |\n\nSee [benchmarks/](https://github.com/vishal-dehurdle/state-harness/blob/main/benchmarks) for setup, configs, and reproduction instructions.\n\n- **Multi-trial SWE-bench**: 333 runs (3 trials × 3 conditions × 37 instances) confirming non-invasiveness within ±4% noise band\n- **Local model validation**: 240 runs across 4 open-weight models (Llama, Phi, Qwen, Gemma) + 568 MINT runs on Qwen3:4B\n- **Terminal-Bench**: Terminal-based agent tasks; command-line tool loops where spirals manifest as repeated failed commands\n- **SWE-bench Pro**: Harder, contamination-resistant variant of SWE-bench\n- **Cross-model validation**: 7 models total: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B\n\n- **37 SWE-bench instances**: A larger sample would improve statistical 
power (n=3 trials gives limited degrees of freedom for t-tests).\n- **No causal intervention**: The harness currently kills spiraling tasks. Redirect/repair is on the roadmap.\n- **Physics-inspired, not physics-equivalent**: Terms like \"Renormalization Group\" and \"Lyapunov stability\" are used as structural inspirations. The mathematical mapping is analogical, not isomorphic.\n- **Custom benchmark scale**: The 20-task local battery is smaller than standard benchmarks. The self-sabotage finding (mean +17.5pp, median +12.5pp) is consistent across 4 models but requires larger-scale replication.\n\n| Parameter | Default | Description |\n|---|---|---|\n| `token_budget` | 100,000 | Hard ceiling on cumulative tokens |\n| `ratio_threshold` | 2.0 | Growth ratio above which a turn counts as \"escalating\" (domain-tuned: airline=2.0, retail=2.5, telecom=2.0) |\n| `window` | 3 | Consecutive escalating turns before circuit breaker trips |\n| `warmup_turns` | 3 | Turns used to establish baseline (no monitoring during warmup) |\n| `budget_gate` | 8,000 | Minimum cumulative tokens before the monitor can trip (retail: 12,000) |\n| `lambda_` | 1.0 | Error weighting in the Lyapunov energy function |\n\n**Environment variable overrides** (highest precedence, for threshold sweeps):\n\n| Env Var | Description |\n|---|---|\n| `HARNESS_RATIO_THRESHOLD` | Override ratio_threshold (e.g., `2.5`) |\n| `HARNESS_BUDGET_GATE` | Override budget_gate (e.g., `12000`) |\n\n**Tuning tips:**\n\n- **More aggressive** (catch spirals earlier): `ratio_threshold=1.8, window=2`\n- **More conservative** (fewer false positives): `ratio_threshold=2.5, window=3`\n- **High-value tasks**: Increase `budget_gate` to 20K+ to let expensive tasks run longer\n- **Complex domains** (retail, multi-tool): Start with `ratio_threshold=2.5`\n\n- **Lyapunov stability**: V(k) = S(k) + λθ(k) models token consumption as a dynamical system. 
ΔV ≥ 0 for W consecutive steps → unstable.\n- **Renormalization Group (RG)**: Message compression via coarse-graining — eliminates high-frequency noise, preserves scale-invariant task objectives.\n- **Vector Symbolic Architecture (VSA)**: Domain policies bound to high-dimensional bipolar vectors (10,000-d, i8), enabling constant-time drift detection outside the LLM context window.\n\nImplements the framework from:\n\n*Empirical Lyapunov Stability: Growth-Ratio Energy Functions as Leading Indicators of Agent Task Failure.* Vishal Verma, 2026. [Read the full paper →](https://vishalvermalabs.com/papers/empirical-lyapunov-stability-agent-failure)\n\nFull ablation, multi-trial validation, local-model results, and failure taxonomy. Key results reproduced in [Benchmarks](#benchmarks) above.\n\nIf you use this library or refer to these findings in your research, please cite the preprint:\n\n```\n@misc{verma2026empirical,\n  author       = {Verma, Vishal},\n  title        = {Empirical Lyapunov Stability: Growth-Ratio Energy Functions as Leading Indicators of Agent Task Failure},\n  month        = jun,\n  year         = 2026,\n  publisher    = {Zenodo},\n  version      = {1.0.0},\n  doi          = {10.5281/zenodo.20722987},\n  url          = {https://doi.org/10.5281/zenodo.20722987}\n}\n```\n\nBased on the theoretical framework from:\n\n*The Fluid Dynamics of Multi-Agent AI: Resolving d'Alembert's Paradox of Generative Workflows.* Vishal Verma, 2026.\n\nSee [CONTRIBUTING.md](https://github.com/vishal-dehurdle/state-harness/blob/main/CONTRIBUTING.md) for dev setup, code style, and PR guidelines.\n\n- **Adaptive threshold**: Auto-tune τ based on task complexity signal from early turns\n- **Causal intervention**: Instead of killing spiraling tasks, redirect them (prompt injection, tool restriction)\n- **Streaming support**: Token-level monitoring for streaming LLM responses\n- **Multi-model validation**: 7 models validated: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + 4 local models via Ollama\n- **Dashboard / observability**: Optional lightweight UI for monitoring energy 
trajectories in real time\n\nSee [SECURITY.md](https://github.com/vishal-dehurdle/state-harness/blob/main/SECURITY.md). Do **not** open public issues for security reports.\n\nSplit-core licensing:\n\n| Component | License | Notes |\n|---|---|---|\n| Rust Core (`src/`) | BSL 1.1 | Free for non-commercial + ARR < $1M. Converts to Apache 2.0 on May 26, 2030. |\n| Python SDK (`python/`) | Apache 2.0 | Fully permissive. |\n\nSee [LICENSE.md](https://github.com/vishal-dehurdle/state-harness/blob/main/LICENSE.md) for full details.", "url": "https://wpnews.pro/news/show-hn-i-applied-lyapunov-stability-theory-to-detect-when-llm-agents-spiral", "canonical_source": "https://github.com/vishal-dehurdle/state-harness", "published_at": "2026-06-22 03:27:45+00:00", "updated_at": "2026-06-22 03:40:54.340081+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-tools", "developer-tools"], "entities": ["state-harness", "Kore.ai", "SWE-bench", "Ollama", "OpenTelemetry", "GitHub", "vishal-dehurdle"], "alternates": {"html": "https://wpnews.pro/news/show-hn-i-applied-lyapunov-stability-theory-to-detect-when-llm-agents-spiral", "markdown": "https://wpnews.pro/news/show-hn-i-applied-lyapunov-stability-theory-to-detect-when-llm-agents-spiral.md", "text": "https://wpnews.pro/news/show-hn-i-applied-lyapunov-stability-theory-to-detect-when-llm-agents-spiral.txt", "jsonld": "https://wpnews.pro/news/show-hn-i-applied-lyapunov-stability-theory-to-detect-when-llm-agents-spiral.jsonld"}}