After months of production experience running LLM calls at scale, we realized something uncomfortable: every AI agent eventually crashes. Not because the code is wrong, but because LLM APIs fail in ways you can't predict.
Timeouts. Rate limits. Empty responses. Schema violations. Drift. These aren't edge cases — they're the norm.
So we built NeuralBridge: an embedded SDK that makes LLM calls self-healing.
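Try running 100,000 LLM calls through any single provider. You'll see every failure mode listed above: timeouts, rate limits, empty responses, schema violations, and drift.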
Most teams solve this by building their own retry logic, circuit breakers, and fallback chains. It works — until it doesn't. Because the next failure is always the one you didn't anticipate.
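For context, here is a minimal sketch of the kind of hand-rolled retry and fallback logic many teams end up writing. It is not part of NeuralBridge: `TransientLLMError`, `call_with_retry`, and `call_with_fallback` are hypothetical names, and each provider is assumed to be a plain callable that takes a prompt and returns text.

```python
import random
import time
from typing import Callable, Optional, Sequence


class TransientLLMError(Exception):
    """Hypothetical error type for retryable failures (timeouts, rate limits)."""


def call_with_retry(
    provider: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a single provider with exponential backoff plus a little jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            text = provider(prompt)
            if not text.strip():
                # Treat an empty response as a retryable failure too.
                raise TransientLLMError("empty response")
            return text
        except TransientLLMError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, let the caller decide
            # Delay doubles each attempt, with up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    **retry_kwargs,
) -> str:
    """Try providers in order, moving on once one has exhausted its retries."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return call_with_retry(provider, prompt, **retry_kwargs)
        except TransientLLMError as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Note the limitation: only failures that someone mapped to `TransientLLMError` are handled. A malformed-but-non-empty response, for example, passes straight through, which is exactly the "failure you didn't anticipate" problem.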
Instead of a gateway (which adds latency and infrastructure), we embedded the reliability logic directly into the SDK:
```python
from neuralbridge import SelfHealingEngine

engine = SelfHealingEngine()
result = engine.call("Write a Python function for binary search")

# If the engine hit a fault and recovered, report what happened
if result.recovered:
    print(f"Fault: {result.diagnosis}")
    print(f"Recovery: {result.recovery_action}")
```
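When a call fails, the engine diagnoses the fault, applies a recovery action, and exposes both on the result as `result.diagnosis` and `result.recovery_action`. Here is where it stands today: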
| Metric | Value |
|---|---|
| Auto-recovery rate | 84.1% of faults |
| Fault patterns recognized | 280+ |
| Recovery strategies | 30+ |
| Learned rules (flywheel) | 88+ |
| Diagnosis latency | 19 µs (P50) |
| Install size | 375 KB |
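
To make "fault patterns" and "recovery strategies" a bit more concrete, here is a rough sketch of what a diagnose-then-recover lookup could look like. This is an illustration only, not NeuralBridge's actual internals: `FaultRule`, `RULES`, `diagnose`, and the recovery names are all hypothetical, and matching on error-message text with regular expressions is an assumption on our part.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultRule:
    name: str            # diagnosis label
    pattern: re.Pattern  # matched against the error message
    recovery: str        # name of the recovery action to apply


# Hypothetical rules for illustration; a real table would be far larger.
RULES = [
    FaultRule("rate_limit", re.compile(r"\b429\b|rate limit", re.I), "backoff_and_retry"),
    FaultRule("timeout", re.compile(r"timed? ?out", re.I), "retry_with_longer_timeout"),
    FaultRule("schema_violation", re.compile(r"invalid json|schema", re.I), "reprompt_with_schema"),
]


def diagnose(error_message: str) -> Optional[FaultRule]:
    """Return the first rule whose pattern matches, or None if unrecognized."""
    for rule in RULES:
        if rule.pattern.search(error_message):
            return rule
    return None
```

With a table like this, diagnosis is a pure in-memory lookup with no network hop. An unrecognized message returns `None`, which is the point where a learned rule would have to be added.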
We went Apache 2.0 because reliability infrastructure should be a commodity. The SDK is free and open. Pro features (enterprise SSO, audit logs, priority support) fund continued development.
```bash
pip install neuralbridge-sdk
```

```python
import neuralbridge as nb

result = nb.run("Explain quantum computing in one sentence")
print(result.text)
```
We'd love your feedback, issues, and contributions. What failure patterns have you seen in production that we should handle?