{"slug": "show-hn-adaptive-runtime-ai-agent-layer-no-gpu-crash-recovery", "title": "Show HN: Adaptive Runtime – AI agent layer, no GPU, crash recovery", "summary": "Adaptive Runtime, a new open-source AI agent layer, launched to solve production failures in AI systems by providing crash recovery, state persistence, and confidence-based decision-making without requiring a GPU. The runtime processes events through five engines—context, confidence, decision, state, and recovery—to automatically handle anomalies, service overloads, and retries with back-off. Designed to run on a $5 VPS, the tool aims to bridge the gap between AI development and reliable production deployment.", "body_md": "**Runtime Intelligence Layer for Stateful AI Systems**\n\nNot a chatbot framework. Not an LLM wrapper. Not a workflow builder.\n\nAn adaptive runtime intelligence layer: the missing piece between your AI logic and production reality.\n\nMost AI frameworks solve the *model* problem.\n\nNobody solves the *runtime* problem.\n\n```\nYour AI agent in development:   Works perfectly.\nYour AI agent in production:    Crashes. Forgets state. Retries blindly. 
Dies silently.\n```\n\nProduction AI systems fail because of:\n\n- 💥 **No crash recovery**: state lost on restart\n- 🧠 **No memory**: agent forgets context between sessions\n- 🔁 **Retry chaos**: blind retries with no back-off\n- 📉 **No confidence scoring**: decisions made without certainty\n- 🌊 **No contextual awareness**: can't adapt to changing conditions\n\n**Adaptive Runtime fixes this.**\n\n```\n[16:08:13][RUNTIME]          Event received: service_overload\n[16:08:13][CONTEXT_ENGINE]   risk=high  stability=low  pressure=0.65\n[16:08:13][CONFIDENCE_ENGINE] confidence=0.84\n[16:08:13][DECISION_ENGINE]  ACTION: RESTART_SERVICE\n[16:08:13][STATE_ENGINE]     State persisted\n[16:08:13][RECOVERY_ENGINE]  Checkpoint #3 created\n\n  → restart_service  [high]  conf=0.840\n\n[16:08:14][RUNTIME]          Event received: anomaly_detected\n[16:08:14][CONTEXT_ENGINE]   risk=low   stability=stable  pressure=0.32\n[16:08:14][CONFIDENCE_ENGINE] confidence=0.62\n[16:08:14][DECISION_ENGINE]  ACTION: FLAG_FOR_REVIEW\n[16:08:14][STATE_ENGINE]     State persisted\n\n  → flag_for_review  [low]   conf=0.620\n```\n\nThe runtime **thinks**, **decides**, **remembers**, and **recovers** — automatically.\n\n```\nEvent (CPU spike, anomaly, timeout, auth failure...)\n  │\n  ▼\n┌─────────────────┐\n│  Context Engine │  → Analyzes conditions: risk, stability, pressure score\n└────────┬────────┘\n         │\n         ▼\n┌──────────────────────┐\n│  Confidence Engine   │  → Calculates adaptive confidence (with decay + history)\n└────────┬─────────────┘\n         │\n         ▼\n┌──────────────────┐\n│  Decision Engine │  → Selects action: restart / throttle / rollback / recover...\n└────────┬─────────┘\n         │\n         ▼\n┌──────────────────┐\n│   State Engine   │  → Persists state to SQLite (survives crashes)\n└────────┬─────────┘\n         │\n         ▼\n┌──────────────────────┐\n│   Recovery Engine    │  → Creates checkpoint, handles retry with back-off\n└──────────────────────┘\n```\n\n```\npip 
install pydantic aiosqlite\n```\n\n``` python\nimport asyncio\nfrom adaptive_runtime import Runtime\n\nasync def main():\n    runtime = Runtime(agent_id=\"my-agent\")\n    await runtime.start()\n\n    result = await runtime.process({\n        \"type\": \"service_overload\",\n        \"severity\": 0.82,\n        \"cpu\": 94,\n        \"memory\": 88,\n    })\n\n    print(result.action)      # \"restart_service\"\n    print(result.confidence)  # 0.7831\n    print(result.reason)      # \"high_resource_pressure\"\n    print(result.priority)    # \"high\"\n\n    await runtime.stop()\n\nasyncio.run(main())\n```\n\n**That's it.** No API keys. No cloud setup. No GPU. Runs on a $5 VPS.\n\n``` python\nimport asyncio\nfrom adaptive_runtime import Runtime\n\nasync def monitor():\n    runtime = Runtime(agent_id=\"prod-monitor\", checkpoint_every=5)\n\n    # Subscribe to critical events\n    @runtime.bus.subscribe(\"anomaly_detected\")\n    async def on_anomaly(event):\n        print(f\"  ⚠ Anomaly handler fired — severity={event['severity']}\")\n\n    await runtime.start()\n\n    # Simulate real production events\n    events = [\n        {\"type\": \"service_overload\", \"severity\": 0.91, \"cpu\": 96, \"memory\": 92},\n        {\"type\": \"anomaly_detected\",  \"severity\": 0.74, \"error_rate\": 0.6},\n        {\"type\": \"auth_failure\",      \"severity\": 0.55},\n        {\"type\": \"timeout\",           \"severity\": 0.45, \"latency_ms\": 4200},\n        {\"type\": \"recovery_needed\",   \"severity\": 0.30},\n    ]\n\n    for event in events:\n        result = await runtime.process(event)\n        print(f\"  [{result.priority.upper()}] {event['type']:25s} → {result.action}\")\n\n    # Runtime remembers everything\n    history = await runtime.event_history(limit=5)\n    print(f\"\\n  Last {len(history)} events remembered across sessions.\")\n\n    await runtime.stop()\n\nasyncio.run(monitor())\n```\n\nOutput:\n\n```\n  [HIGH]    service_overload          → scale_up_immediate\n  [NORMAL] 
 anomaly_detected          → flag_for_review\n  ⚠ Anomaly handler fired — severity=0.74\n  [NORMAL]  auth_failure              → trigger_security_audit\n  [LOW]     timeout                   → cache_warmup\n  [LOW]     recovery_needed           → run_recovery\n\n  Last 5 events remembered across sessions.\n```\n\nThe comparison with LangChain and AutoGen will come up. Here's the honest answer:\n\n| | LangChain / AutoGen | Adaptive Runtime |\n|---|---|---|\n| Purpose | LLM orchestration | Runtime behavior |\n| Core abstraction | Prompt chains | Stateful events |\n| Intelligence | Language model | Probabilistic engine |\n| Dependencies | Heavy (openai, tiktoken, ...) | Minimal (pydantic, aiosqlite) |\n| GPU required | Sometimes | Never |\n| Crash recovery | ❌ | ✅ Built-in |\n| State persistence | External setup | ✅ Built-in SQLite |\n| Confidence scoring | ❌ | ✅ Adaptive |\n| Runs on $5 VPS | Barely | ✅ Designed for it |\n| Use case | Chat, RAG, agents | Runtime resilience |\n\n**TL;DR:** LangChain makes LLMs useful. Adaptive Runtime makes AI systems *reliable*.\n\nThey solve different problems. Use both, or use this standalone.\n\nMost AI problems in production are not model problems.\n\nThey are runtime problems.\n\nAdaptive Runtime is built around the belief that future AI systems need:\n\n- **Memory**: state that survives crashes and restarts\n- **Resilience**: self-healing with checkpoints and retry logic\n- **Contextual behavior**: decisions that adapt to real conditions\n- **Confidence awareness**: knowing *how certain* a decision is\n- **Lightweight cognition**: intelligence without neural dependency\n\nNot just prompts. Not just workflows. **Runtime intelligence.**\n\n**State Engine.** Persistent agent memory. Survives crashes. 
SQLite by default.\n\n```\nawait state_engine.save_state({\"health\": \"ok\", \"version\": \"1.2\"})\nstate = await state_engine.load_state()          # Restored after restart\nawait state_engine.patch_state({\"last\": \"ok\"})   # Partial update\n```\n\n**Context Engine.** Transforms raw signals into contextual understanding, with no ML needed.\n\n```\nctx = context_engine.analyze({\n    \"type\": \"service_overload\", \"cpu\": 94, \"memory\": 88, \"severity\": 0.82\n})\n# → risk=\"high\", stability=\"low\", context=\"resource_pressure\", pressure=0.65\n```\n\n**Confidence Engine.** Adaptive probabilistic scoring with historical weighting and decay.\n\n```\nconf = confidence_engine.calculate(event, context_risk=\"high\")\n# → conf.final = 0.7831  (lower when risk is high, adapts from history)\n\nconfidence_engine.record_outcome(success=True, confidence=0.78, context_risk=\"high\")\n```\n\n**Decision Engine.** Explainable rule-based action selection. Extensible with custom rules.\n\n```\ndecision = decision_engine.decide(event, \"resource_pressure\", \"high\", 0.78)\n# → action=\"restart_service\", reason=\"high_resource_pressure\", priority=\"high\"\n\n# Add your own rules:\ncustom_rules = [(\"my_context\", \"high\", 0.70, \"my_action\", \"my_reason\")]\nengine = DecisionEngine(custom_rules=custom_rules)\n```\n\n**Recovery Engine.** Crash recovery, checkpoint snapshots, exponential back-off retry.\n\n```\nawait recovery_engine.create_checkpoint(state)    # Save checkpoint\nstate = await recovery_engine.restore_latest()    # Restore after crash\nresult = await recovery_engine.retry(fn, fallback=fallback_fn)  # Retry with back-off\n```\n\n```\n✅ Raspberry Pi\n✅ $5 VPS (512MB RAM)  \n✅ Old laptop\n✅ Edge devices\n✅ Offline / air-gapped systems\n✅ Serverless (cold start friendly)\n```\n\nNo GPU. No cloud lock-in. 
No heavy ML frameworks.\n\nJust Python + asyncio + SQLite.\n\n```\nadaptive_runtime/\n│\n├── core/\n│   ├── state_engine.py       # State persistence and memory\n│   ├── context_engine.py     # Event → contextual classification\n│   ├── confidence_engine.py  # Adaptive probabilistic confidence\n│   ├── decision_engine.py    # Rule-based action selection\n│   └── recovery_engine.py    # Crash recovery + retry orchestration\n│\n├── runtime/\n│   ├── runtime_manager.py    # Main orchestrator (Runtime class)\n│   ├── event_bus.py          # Async pub/sub event bus\n│   └── cache.py              # TTL-based in-memory cache\n│\n├── storage/\n│   ├── sqlite_store.py       # Async SQLite persistence\n│   └── memory_store.py       # In-process ephemeral store (testing)\n│\n├── observability/\n│   ├── logger.py             # Structured color logger\n│   └── metrics.py            # Lightweight in-memory metrics\n│\n├── examples/\n│   ├── agent_demo.py         # Basic event processing\n│   ├── monitoring_demo.py    # Continuous monitoring + event bus\n│   └── automation_demo.py    # Retry + crash recovery\n│\n└── tests/\n    └── test_engines.py       # 12 unit tests — all engines\n```\n\n```\n# Clone\ngit clone https://github.com/stateflow-dev/adaptive-runtime.git\ncd adaptive-runtime\n\n# Install\npip install pydantic aiosqlite\n\n# Run demos\npython examples/agent_demo.py\npython examples/monitoring_demo.py\npython examples/automation_demo.py\n\n# Run tests\npip install pytest pytest-asyncio\npytest tests/ -v\n# → 12 passed\n```\n\n| | Feature | Status |\n|---|---|---|\n| ✅ | 5 Core Engines | Tier 1 — Released |\n| ✅ | SQLite + Memory store | Tier 1 — Released |\n| ✅ | Async event bus | Tier 1 — Released |\n| ✅ | Retry + crash recovery | Tier 1 — Released |\n| 🔜 | REST API adapter (FastAPI) | Tier 2 |\n| 🔜 | Multi-agent orchestration | Tier 2 |\n| 🔜 | Plugin system | Tier 2 |\n| 🔜 | Real-time dashboard | Tier 2 |\n| 🔜 | Distributed runtime | Tier 3 |\n\nMeasured on a mid-range Windows 
laptop (Python 3.10, SQLite, no GPU).\n\n| Metric | Result |\n|---|---|\n| Cold start | 446 ms |\n| Idle memory | 29 MB |\n| CPU idle usage | <0% |\n| SQLite save latency | 36.5 ms avg (n=50) |\n| SQLite load latency | 2.7 ms avg (n=50) |\n| Event processing | 109.2 ms avg (n=50) |\n| GPU required | ❌ Never |\n\nRuns comfortably on a $5 VPS (512MB RAM). No GPU. No cloud lock-in.\n\nIssues and PRs welcome. Please open an issue first for major changes.\n\nMIT © [Stateflow Labs](https://github.com/stateflow-dev)\n\n**\"The biggest AI problems in production are not model problems. They are runtime problems.\"**", "url": "https://wpnews.pro/news/show-hn-adaptive-runtime-ai-agent-layer-no-gpu-crash-recovery", "canonical_source": "https://github.com/stateflow-dev/adaptive-runtime", "published_at": "2026-05-29 11:30:31+00:00", "updated_at": "2026-05-29 11:47:09.802725+00:00", "lang": "en", "topics": ["ai-agents", "ai-infrastructure", "ai-tools", "ai-products", "mlops"], "entities": ["Adaptive Runtime"], "alternates": {"html": "https://wpnews.pro/news/show-hn-adaptive-runtime-ai-agent-layer-no-gpu-crash-recovery", "markdown": "https://wpnews.pro/news/show-hn-adaptive-runtime-ai-agent-layer-no-gpu-crash-recovery.md", "text": "https://wpnews.pro/news/show-hn-adaptive-runtime-ai-agent-layer-no-gpu-crash-recovery.txt", "jsonld": "https://wpnews.pro/news/show-hn-adaptive-runtime-ai-agent-layer-no-gpu-crash-recovery.jsonld"}}