# Show HN: Adaptive Runtime – AI agent layer, no GPU, crash recovery

> Source: <https://github.com/stateflow-dev/adaptive-runtime>
> Published: 2026-05-29 11:30:31+00:00

**Runtime Intelligence Layer for Stateful AI Systems**

Nota chatbot framework.Notan LLM wrapper.Nota workflow builder.An

adaptive runtime intelligence layer— the missing piece between your AI logic and production reality.

Most AI frameworks solve the *model* problem.

Nobody solves the *runtime* problem.

```
Your AI agent in development:   Works perfectly.
Your AI agent in production:    Crashes. Forgets state. Retries blindly. Dies silently.
```

Production AI systems fail because of:

- 💥
**No crash recovery**— state lost on restart - 🧠
**No memory**— agent forgets context between sessions - 🔁
**Retry chaos**— blind retries with no back-off - 📉
**No confidence scoring**— decisions made without certainty - 🌊
**No contextual awareness**— can't adapt to changing conditions

**Adaptive Runtime fixes this.**

```
[16:08:13][RUNTIME]          Event received: service_overload
[16:08:13][CONTEXT_ENGINE]   risk=high  stability=low  pressure=0.65
[16:08:13][CONFIDENCE_ENGINE] confidence=0.84
[16:08:13][DECISION_ENGINE]  ACTION: RESTART_SERVICE
[16:08:13][STATE_ENGINE]     State persisted
[16:08:13][RECOVERY_ENGINE]  Checkpoint #3 created

  → restart_service  [high]  conf=0.840

[16:08:14][RUNTIME]          Event received: anomaly_detected
[16:08:14][CONTEXT_ENGINE]   risk=low   stability=stable  pressure=0.32
[16:08:14][CONFIDENCE_ENGINE] confidence=0.62
[16:08:14][DECISION_ENGINE]  ACTION: FLAG_FOR_REVIEW
[16:08:14][STATE_ENGINE]     State persisted

  → flag_for_review  [low]   conf=0.620
```

The runtime **thinks**, **decides**, **remembers**, and **recovers** — automatically.

```
Event (CPU spike, anomaly, timeout, auth failure...)
  │
  ▼
┌─────────────────┐
│  Context Engine │  → Analyzes conditions: risk, stability, pressure score
└────────┬────────┘
         │
         ▼
┌──────────────────────┐
│  Confidence Engine   │  → Calculates adaptive confidence (with decay + history)
└────────┬─────────────┘
         │
         ▼
┌──────────────────┐
│  Decision Engine │  → Selects action: restart / throttle / rollback / recover...
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   State Engine   │  → Persists state to SQLite (survives crashes)
└────────┬─────────┘
         │
         ▼
┌──────────────────────┐
│   Recovery Engine    │  → Creates checkpoint, handles retry with back-off
└──────────────────────┘
pip install pydantic aiosqlite
python
import asyncio
from adaptive_runtime import Runtime

async def main():
    runtime = Runtime(agent_id="my-agent")
    await runtime.start()

    result = await runtime.process({
        "type": "service_overload",
        "severity": 0.82,
        "cpu": 94,
        "memory": 88,
    })

    print(result.action)      # "restart_service"
    print(result.confidence)  # 0.7831
    print(result.reason)      # "high_resource_pressure"
    print(result.priority)    # "high"

    await runtime.stop()

asyncio.run(main())
```

**That's it.** No API keys. No cloud setup. No GPU. Runs on a $5 VPS.

``` python
import asyncio
from adaptive_runtime import Runtime

async def monitor():
    runtime = Runtime(agent_id="prod-monitor", checkpoint_every=5)

    # Subscribe to critical events
    @runtime.bus.subscribe("anomaly_detected")
    async def on_anomaly(event):
        print(f"  ⚠ Anomaly handler fired — severity={event['severity']}")

    await runtime.start()

    # Simulate real production events
    events = [
        {"type": "service_overload", "severity": 0.91, "cpu": 96, "memory": 92},
        {"type": "anomaly_detected",  "severity": 0.74, "error_rate": 0.6},
        {"type": "auth_failure",      "severity": 0.55},
        {"type": "timeout",           "severity": 0.45, "latency_ms": 4200},
        {"type": "recovery_needed",   "severity": 0.30},
    ]

    for event in events:
        result = await runtime.process(event)
        print(f"  [{result.priority.upper()}] {event['type']:25s} → {result.action}")

    # Runtime remembers everything
    history = await runtime.event_history(limit=5)
    print(f"\n  Last {len(history)} events remembered across sessions.")

    await runtime.stop()

asyncio.run(monitor())
```

Output:

```
  [HIGH]    service_overload          → scale_up_immediate
  [NORMAL]  anomaly_detected          → flag_for_review
  ⚠ Anomaly handler fired — severity=0.74
  [NORMAL]  auth_failure              → trigger_security_audit
  [LOW]     timeout                   → cache_warmup
  [LOW]     recovery_needed           → run_recovery

  Last 5 events remembered across sessions.
```

This question will come up. Here's the honest answer:

| LangChain / AutoGen | Adaptive Runtime |
|
|---|---|---|
Purpose |
LLM orchestration | Runtime behavior |
Core abstraction |
Prompt chains | Stateful events |
Intelligence |
Language model | Probabilistic engine |
Dependencies |
Heavy (openai, tiktoken, ...) | Minimal (pydantic, aiosqlite) |
GPU required |
Sometimes | Never |
Crash recovery |
❌ | ✅ Built-in |
State persistence |
External setup | ✅ Built-in SQLite |
Confidence scoring |
❌ | ✅ Adaptive |
Runs on $5 VPS |
Barely | ✅ Designed for it |
Use case |
Chat, RAG, agents | Runtime resilience |

**TL;DR:** LangChain makes LLMs useful. Adaptive Runtime makes AI systems *reliable*.

They solve different problems. Use both, or use this standalone.

Most AI problems in production are not model problems.

They areruntime problems.

Adaptive Runtime is built around the belief that future AI systems need:

**Memory**— state that survives crashes and restarts** Resilience**— self-healing with checkpoints and retry logic** Contextual behavior**— decisions that adapt to real conditions** Confidence awareness**— knowing*how certain*a decision is**Lightweight cognition**— intelligence without neural dependency

Not just prompts. Not just workflows. **Runtime intelligence.**

Persistent agent memory. Survives crashes. SQLite by default.

```
await state_engine.save_state({"health": "ok", "version": "1.2"})
state = await state_engine.load_state()          # Restored after restart
await state_engine.patch_state({"last": "ok"})   # Partial update
```

Transforms raw signals into contextual understanding — no ML needed.

```
ctx = context_engine.analyze({
    "type": "service_overload", "cpu": 94, "memory": 88, "severity": 0.82
})
# → risk="high", stability="low", context="resource_pressure", pressure=0.65
```

Adaptive probabilistic scoring with historical weighting and decay.

```
conf = confidence_engine.calculate(event, context_risk="high")
# → conf.final = 0.7831  (lower when risk is high, adapts from history)

confidence_engine.record_outcome(success=True, confidence=0.78, context_risk="high")
```

Explainable rule-based action selection. Extensible with custom rules.

```
decision = decision_engine.decide(event, "resource_pressure", "high", 0.78)
# → action="restart_service", reason="high_resource_pressure", priority="high"

# Add your own rules:
custom_rules = [("my_context", "high", 0.70, "my_action", "my_reason")]
engine = DecisionEngine(custom_rules=custom_rules)
```

Crash recovery, checkpoint snapshots, exponential back-off retry.

```
await recovery_engine.create_checkpoint(state)    # Save checkpoint
state = await recovery_engine.restore_latest()    # Restore after crash
result = await recovery_engine.retry(fn, fallback=fallback_fn)  # Retry with back-off
✅ Raspberry Pi
✅ $5 VPS (512MB RAM)  
✅ Old laptop
✅ Edge devices
✅ Offline / air-gapped systems
✅ Serverless (cold start friendly)
```

No GPU. No cloud lock-in. No heavy ML frameworks.

Just Python + asyncio + SQLite.

```
adaptive_runtime/
│
├── core/
│   ├── state_engine.py       # State persistence and memory
│   ├── context_engine.py     # Event → contextual classification
│   ├── confidence_engine.py  # Adaptive probabilistic confidence
│   ├── decision_engine.py    # Rule-based action selection
│   └── recovery_engine.py    # Crash recovery + retry orchestration
│
├── runtime/
│   ├── runtime_manager.py    # Main orchestrator (Runtime class)
│   ├── event_bus.py          # Async pub/sub event bus
│   └── cache.py              # TTL-based in-memory cache
│
├── storage/
│   ├── sqlite_store.py       # Async SQLite persistence
│   └── memory_store.py       # In-process ephemeral store (testing)
│
├── observability/
│   ├── logger.py             # Structured color logger
│   └── metrics.py            # Lightweight in-memory metrics
│
├── examples/
│   ├── agent_demo.py         # Basic event processing
│   ├── monitoring_demo.py    # Continuous monitoring + event bus
│   └── automation_demo.py    # Retry + crash recovery
│
└── tests/
    └── test_engines.py       # 12 unit tests — all engines
# Clone
git clone https://github.com/stateflow-dev/adaptive-runtime.git
cd adaptive-runtime

# Install
pip install pydantic aiosqlite

# Run demos
python examples/agent_demo.py
python examples/monitoring_demo.py
python examples/automation_demo.py

# Run tests
pip install pytest pytest-asyncio
pytest tests/ -v
# → 12 passed
```

| Feature | Status | |
|---|---|---|
| ✅ | 5 Core Engines | Tier 1 — Released |
| ✅ | SQLite + Memory store | Tier 1 — Released |
| ✅ | Async event bus | Tier 1 — Released |
| ✅ | Retry + crash recovery | Tier 1 — Released |
| 🔜 | REST API adapter (FastAPI) | Tier 2 |
| 🔜 | Multi-agent orchestration | Tier 2 |
| 🔜 | Plugin system | Tier 2 |
| 🔜 | Real-time dashboard | Tier 2 |
| 🔜 | Distributed runtime | Tier 3 |

Measured on a mid-range Windows laptop (Python 3.10, SQLite, no GPU).

| Metric | Result |
|---|---|
| Cold start | 446 ms |
| Idle memory | 29 MB |
| CPU idle usage | <0% |
| SQLite save latency | 36.5 ms avg (n=50) |
| SQLite load latency | 2.7 ms avg (n=50) |
| Event processing | 109.2 ms avg (n=50) |
| GPU required | ❌ Never |

Runs comfortably on a $5 VPS (512MB RAM). No GPU. No cloud lock-in.

Issues and PRs welcome. Please open an issue first for major changes.

MIT © [Stateflow Labs](https://github.com/stateflow-dev)

**"The biggest AI problems in production are not model problems.They are runtime problems."**