Show HN: Adaptive Runtime – AI agent layer, no GPU, crash recovery

Adaptive Runtime is a new open-source AI agent layer built to address production failures in AI systems. It provides crash recovery, state persistence, and confidence-based decision-making without requiring a GPU. The runtime passes each event through five engines (context, confidence, decision, state, and recovery) to handle anomalies, service overloads, and retries with back-off automatically. Designed to run on a $5 VPS, the tool aims to bridge the gap between AI development and reliable production deployment.
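The five-engine flow described above can be pictured as a chain of small functions. The sketch below is not Adaptive Runtime's code. The following are all hypothetical, chosen only to show how one event could move through context, confidence, decision, state and recovery stages:

- `MiniRuntime` and `Decision`
- the pressure formula
- the 10% high-risk discount
- the 0.7 action threshold

State and checkpoints are kept in memory rather than in SQLite, and the numbers will not match the project's own logs.

```python
from dataclasses import dataclass, field

# Hypothetical five-stage pipeline: illustrates the flow only, not the project's API.


@dataclass
class Decision:
    action: str
    confidence: float
    priority: str


@dataclass
class MiniRuntime:
    state: dict = field(default_factory=dict)        # stand-in for SQLite-persisted state
    checkpoints: list = field(default_factory=list)  # stand-in for recovery checkpoints

    def context(self, event: dict) -> dict:
        # Assumed pressure score: mean of CPU and memory utilization, scaled to 0..1.
        pressure = (event.get("cpu", 0) + event.get("memory", 0)) / 200
        high = pressure >= 0.8 or event.get("severity", 0) >= 0.8
        return {"risk": "high" if high else "low", "pressure": round(pressure, 2)}

    def confidence(self, event: dict, ctx: dict) -> float:
        # Assumed rule: start from event severity, discount by 10% when risk is high.
        base = event.get("severity", 0.5)
        return round(base * (0.9 if ctx["risk"] == "high" else 1.0), 3)

    def decide(self, ctx: dict, conf: float) -> Decision:
        # Assumed rule table: act automatically only if risk is high and confidence >= 0.7.
        if ctx["risk"] == "high" and conf >= 0.7:
            return Decision("restart_service", conf, "high")
        return Decision("flag_for_review", conf, "low")

    def process(self, event: dict) -> Decision:
        ctx = self.context(event)                     # 1. context
        conf = self.confidence(event, ctx)            # 2. confidence
        decision = self.decide(ctx, conf)             # 3. decision
        self.state[event["type"]] = decision.action   # 4. state: last action per event type
        self.checkpoints.append(dict(self.state))     # 5. recovery: snapshot after each event
        return decision


if __name__ == "__main__":
    rt = MiniRuntime()
    print(rt.process({"type": "service_overload", "severity": 0.82, "cpu": 94, "memory": 88}))
    print(rt.process({"type": "anomaly_detected", "severity": 0.74, "error_rate": 0.6}))
```

Each stage is its own method, which mirrors the split into separate engines. For example, the rule table in `decide` can be replaced without touching how context or confidence is computed.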

Runtime Intelligence Layer for Stateful AI Systems

Not a chatbot framework. Not an LLM wrapper. Not a workflow builder. An adaptive runtime intelligence layer: the missing piece between your AI logic and production reality.

Most AI frameworks solve the model problem. Nobody solves the runtime problem.

Your AI agent in development: works perfectly.
Your AI agent in production: crashes, forgets state, retries blindly, dies silently.

Production AI systems fail because of:

- 💥 No crash recovery: state is lost on restart
- 🧠 No memory: the agent forgets context between sessions
- 🔁 Retry chaos: blind retries with no back-off
- 📉 No confidence scoring: decisions are made without any measure of certainty
- 🌊 No contextual awareness: the agent can't adapt to changing conditions

Adaptive Runtime fixes this:

```
16:08:13  RUNTIME            Event received: service overload
16:08:13  CONTEXT ENGINE     risk=high stability=low pressure=0.65
16:08:13  CONFIDENCE ENGINE  confidence=0.84
16:08:13  DECISION ENGINE    ACTION: RESTART SERVICE
16:08:13  STATE ENGINE       State persisted
16:08:13  RECOVERY ENGINE    Checkpoint 3 created
          → restart service  high  conf=0.840
16:08:14  RUNTIME            Event received: anomaly detected
16:08:14  CONTEXT ENGINE     risk=low stability=stable pressure=0.32
16:08:14  CONFIDENCE ENGINE  confidence=0.62
16:08:14  DECISION ENGINE    ACTION: FLAG FOR REVIEW
16:08:14  STATE ENGINE       State persisted
          → flag for review  low  conf=0.620
```

The runtime thinks, decides, remembers, and recovers automatically.

```
Event (CPU spike, anomaly, timeout, auth failure...)
          │
          ▼
┌───────────────────┐
│  Context Engine   │ → Analyzes conditions: risk, stability, pressure score
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Confidence Engine │ → Calculates adaptive confidence with decay + history
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Decision Engine  │ → Selects action: restart / throttle / rollback / recover...
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   State Engine    │ → Persists state to SQLite (survives crashes)
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Recovery Engine  │ → Creates checkpoint, handles retry with back-off
└───────────────────┘
```

Install the dependencies:

```bash
pip install pydantic aiosqlite
```

Process a single event:

```python
import asyncio
from adaptive_runtime import Runtime


async def main():
    runtime = Runtime(agent_id="my-agent")
    await runtime.start()

    result = await runtime.process({
        "type": "service_overload",
        "severity": 0.82,
        "cpu": 94,
        "memory": 88,
    })

    print(result.action)      # "restart_service"
    print(result.confidence)  # 0.7831
    print(result.reason)      # "high resource pressure"
    print(result.priority)    # "high"

    await runtime.stop()


asyncio.run(main())
```

That's it. No API keys. No cloud setup. No GPU. Runs on a $5 VPS.

A longer example, with continuous monitoring and an event-bus subscription:

```python
import asyncio
from adaptive_runtime import Runtime


async def monitor():
    runtime = Runtime(agent_id="prod-monitor", checkpoint_every=5)

    # Subscribe to critical events
    @runtime.bus.subscribe("anomaly_detected")
    async def on_anomaly(event):
        print(f"  ⚠ Anomaly handler fired: severity={event['severity']}")

    await runtime.start()

    # Simulate real production events
    events = [
        {"type": "service_overload", "severity": 0.91, "cpu": 96, "memory": 92},
        {"type": "anomaly_detected", "severity": 0.74, "error_rate": 0.6},
        {"type": "auth_failure", "severity": 0.55},
        {"type": "timeout", "severity": 0.45, "latency_ms": 4200},
        {"type": "recovery_needed", "severity": 0.30},
    ]

    for event in events:
        result = await runtime.process(event)
        print(f"  {result.priority.upper()}  {event['type']:25s} → {result.action}")

    # Runtime remembers everything
    history = await runtime.event_history(limit=5)
    print(f"\n  Last {len(history)} events remembered across sessions.")

    await runtime.stop()


asyncio.run(monitor())
```

Output:

```
  HIGH  service_overload          → scale_up_immediate
  NORMAL  anomaly_detected          → flag_for_review
  ⚠ Anomaly handler fired: severity=0.74
  NORMAL  auth_failure              → trigger_security_audit
  LOW  timeout                   → cache_warmup
  LOW  recovery_needed           → run_recovery

  Last 5 events remembered across sessions.
```

How does this compare to LangChain or AutoGen? This question will come up, so here's the honest answer:

| | LangChain / AutoGen | Adaptive Runtime |
|---|---|---|
| Purpose | LLM orchestration | Runtime behavior |
| Core abstraction | Prompt chains | Stateful events |
| Intelligence | Language model | Probabilistic engine |
| Dependencies | Heavy (openai, tiktoken, ...) | Minimal (pydantic, aiosqlite) |
| GPU required | Sometimes | Never |
| Crash recovery | ❌ | ✅ Built-in |
| State persistence | External setup | ✅ Built-in (SQLite) |
| Confidence scoring | ❌ | ✅ Adaptive |
| Runs on $5 VPS | Barely | ✅ Designed for it |
| Use case | Chat, RAG, agents | Runtime resilience |

TL;DR: LangChain makes LLMs useful. Adaptive Runtime makes AI systems reliable. They solve different problems. Use both, or use this standalone.

Most AI problems in production are not model problems. They are runtime problems.

Adaptive Runtime is built around the belief that future AI systems need:

- Memory: state that survives crashes and restarts
- Resilience: self-healing with checkpoints and retry logic
- Contextual behavior: decisions that adapt to real conditions
- Confidence awareness: knowing how certain a decision is
- Lightweight cognition: intelligence without neural-network dependencies

Not just prompts. Not just workflows. Runtime intelligence.

State Engine: persistent agent memory that survives crashes, backed by SQLite by default.

```python
# Inside an async function, given a state engine instance
await state_engine.save_state({"health": "ok", "version": "1.2"})
state = await state_engine.load_state()         # Restored after restart
await state_engine.patch_state({"last": "ok"})  # Partial update
```

Context Engine: transforms raw signals into contextual understanding, with no ML needed.
```python
ctx = context_engine.analyze({
    "type": "service_overload",
    "cpu": 94,
    "memory": 88,
    "severity": 0.82,
})
# → risk="high", stability="low", context="resource_pressure", pressure=0.65
```

Confidence Engine: adaptive probabilistic scoring with historical weighting and decay.

```python
conf = confidence_engine.calculate(event, context_risk="high")
# → conf.final = 0.7831 (lower when risk is high; adapts from history)

confidence_engine.record_outcome(success=True, confidence=0.78, context_risk="high")
```

Decision Engine: explainable, rule-based action selection, extensible with custom rules.

```python
decision = decision_engine.decide(event, "resource_pressure", "high", 0.78)
# → action="restart_service", reason="high resource pressure", priority="high"
```

Add your own rules:

```python
custom_rules = [
    ("my_context", "high", 0.70, "my_action", "my reason"),
]
engine = DecisionEngine(custom_rules=custom_rules)
```

Recovery Engine: crash recovery, checkpoint snapshots, and retry with exponential back-off.

```python
# Inside an async function, given a recovery engine instance
await recovery_engine.create_checkpoint(state)                  # Save checkpoint
state = await recovery_engine.restore_latest()                  # Restore after crash
result = await recovery_engine.retry(fn, fallback=fallback_fn)  # Retry with back-off
```

Runs on:

- ✅ Raspberry Pi
- ✅ $5 VPS (512 MB RAM)
- ✅ Old laptops
- ✅ Edge devices
- ✅ Offline / air-gapped systems
- ✅ Serverless (cold-start friendly)

No GPU. No cloud lock-in. No heavy ML frameworks. Just Python + asyncio + SQLite.
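The Recovery Engine described above retries with exponential back-off and can fall back to an alternative when retries run out. The README does not show how `recovery_engine.retry` works internally, so what follows is only a minimal asyncio sketch of that general pattern. The following are assumptions for illustration, not the project's API:

- the helper name `retry_with_backoff`
- its parameters (`max_attempts`, `base_delay`, `max_delay`)
- the use of full jitter

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
    fallback: Optional[Callable[[], Awaitable[T]]] = None,
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
) -> T:
    """Hypothetical helper: retry `fn` with capped exponential back-off and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_exc: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception as exc:  # CancelledError is a BaseException, so it is not caught
            last_exc = exc
            if attempt == max_attempts - 1:
                break  # no sleep after the final attempt
            # Ceiling doubles per attempt (base, 2*base, 4*base, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, ceiling] so clients don't retry in lockstep.
            await asyncio.sleep(random.uniform(0, ceiling))
    if fallback is not None:
        return await fallback()
    raise last_exc  # all attempts failed and there is no fallback


async def demo() -> tuple:
    """Usage: one call succeeds on attempt 3; another exhausts retries and uses the fallback."""
    attempts = 0

    async def flaky() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("service unavailable")
        return "ok"

    async def always_down() -> str:
        raise TimeoutError("still down")

    async def use_cache() -> str:
        return "cached"

    first = await retry_with_backoff(flaky, base_delay=0.01)
    second = await retry_with_backoff(always_down, fallback=use_cache, base_delay=0.01)
    return first, attempts, second


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ('ok', 3, 'cached')
```

Catching `Exception` rather than `BaseException` matters in async code. Since Python 3.8, `asyncio.CancelledError` derives from `BaseException`. Cancelling the surrounding task therefore still stops the retry loop instead of being counted as one more failure.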
```
adaptive_runtime/
│
├── core/
│   ├── state_engine.py        # State persistence and memory
│   ├── context_engine.py      # Event → contextual classification
│   ├── confidence_engine.py   # Adaptive probabilistic confidence
│   ├── decision_engine.py     # Rule-based action selection
│   └── recovery_engine.py     # Crash recovery + retry orchestration
│
├── runtime/
│   ├── runtime_manager.py     # Main orchestrator (Runtime class)
│   ├── event_bus.py           # Async pub/sub event bus
│   └── cache.py               # TTL-based in-memory cache
│
├── storage/
│   ├── sqlite_store.py        # Async SQLite persistence
│   └── memory_store.py        # In-process ephemeral store (testing)
│
├── observability/
│   ├── logger.py              # Structured color logger
│   └── metrics.py             # Lightweight in-memory metrics
│
├── examples/
│   ├── agent_demo.py          # Basic event processing
│   ├── monitoring_demo.py     # Continuous monitoring + event bus
│   └── automation_demo.py     # Retry + crash recovery
│
└── tests/
    └── test_engines.py        # 12 unit tests (all engines)
```

```bash
# Clone
git clone https://github.com/stateflow-dev/adaptive-runtime.git
cd adaptive-runtime

# Install
pip install pydantic aiosqlite

# Run demos
python examples/agent_demo.py
python examples/monitoring_demo.py
python examples/automation_demo.py

# Run tests
pip install pytest pytest-asyncio
pytest tests/ -v    # → 12 passed
```

Roadmap:

| | Feature | Status |
|---|---|---|
| ✅ | 5 core engines | Tier 1 (released) |
| ✅ | SQLite + memory store | Tier 1 (released) |
| ✅ | Async event bus | Tier 1 (released) |
| ✅ | Retry + crash recovery | Tier 1 (released) |
| 🔜 | REST API adapter (FastAPI) | Tier 2 |
| 🔜 | Multi-agent orchestration | Tier 2 |
| 🔜 | Plugin system | Tier 2 |
| 🔜 | Real-time dashboard | Tier 2 |
| 🔜 | Distributed runtime | Tier 3 |

Measured on a mid-range Windows laptop (Python 3.10, SQLite, no GPU):
| Metric | Result |
|---|---|
| Cold start | 446 ms |
| Idle memory | 29 MB |
| CPU usage (idle) | ~0% |
| SQLite save latency | 36.5 ms avg (n=50) |
| SQLite load latency | 2.7 ms avg (n=50) |
| Event processing | 109.2 ms avg (n=50) |
| GPU required | ❌ Never |

Runs comfortably on a $5 VPS (512 MB RAM). No GPU. No cloud lock-in.

Issues and PRs are welcome. Please open an issue first for major changes.

MIT © Stateflow Labs · https://github.com/stateflow-dev

"The biggest AI problems in production are not model problems. They are runtime problems."