With the standalone IDE running, I had a sandboxed environment to write and execute Neural Document Architecture (NDA) programs. However, executing the binary AST via a standard recursive tree-walk interpreter was adding unacceptable dispatch overhead.
Every opcode instruction required match branching, dynamic type checking, and variable lookup cycles. I needed a Just-In-Time (JIT) compiler to turn the AST into native machine code.
The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap
We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:
I started by designing a Tier-1 Closure-Based JIT Compiler.
Instead of compiling directly to machine instructions, the compiler walks the AST at load-time and generates a chain of nested Rust closures (Box<dyn Fn>).
This approach resolves all opcode matches, scope checks, and control-flow branches at compile-time. At runtime, the JIT engine simply walks down a flat, pre-compiled chain of function pointers. This removes the per-node opcode dispatch from the hot path, although each closure call is still an indirect call through a pointer.
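To make the idea concrete, here is a minimal sketch of closure compilation on a toy expression tree. The Expr enum, the Compiled alias, and the compile and demo functions are hypothetical names for illustration only; they are not the NDA node set or the real compiler, which is shown further down.

```rust
// Hypothetical mini-AST for illustration; not the NDA node set from this post.
enum Expr {
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// A compiled node: a boxed closure that produces its value when called.
type Compiled = Box<dyn Fn() -> i64>;

// Walk the tree once. The `match` on node kind runs here, at compile time,
// and never again while the compiled program executes.
fn compile(expr: &Expr) -> Compiled {
    match expr {
        Expr::Const(v) => {
            let v = *v;
            Box::new(move || v)
        }
        Expr::Add(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move || l() + r())
        }
        Expr::Mul(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move || l() * r())
        }
    }
}

// Usage: compile (2 + 3) * 4 once, then call the result as often as needed.
fn demo() -> i64 {
    let ast = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Const(2)), Box::new(Expr::Const(3)))),
        Box::new(Expr::Const(4)),
    );
    let program = compile(&ast);
    program()
}
```

Each closure owns its already-compiled children, so running the program is just a series of closure calls with no inspection of the tree.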
Here is how the compiler defines the JIT function type and registers the compilation sequence in src/compiler/nda_jit.rs:
// compiler/nda_jit.rs: Closure JIT definitions
use std::sync::Arc;

pub enum JitControlFlow {
    Continue,
    Break,
    Return,
}

// A compiled JIT closure: accepts a mutable state reference of *any* lifetime 'a
pub type JitFn = Arc<dyn for<'a> Fn(&mut JitState<'a>) -> Result<JitControlFlow, String> + Send + Sync>;

// Compile a sequence of NDA AST nodes into a flat chain of closures
fn compile_sequence(nodes: &[NdaNode], counter: &mut usize, registry: &VarRegistry) -> Vec<JitFn> {
    nodes.iter().map(|n| compile_node(n, counter, registry)).collect()
}
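The post does not show how a compiled sequence is driven at runtime, so the following is only a sketch of one plausible driver. The State struct, the StepFn and StepResult aliases, and run_sequence are assumptions made for this example; the real JitState carries more fields and a lifetime parameter.

```rust
use std::sync::Arc;

// Mirrors the enum above, with derives added so the example can compare values.
#[derive(Debug, PartialEq)]
pub enum JitControlFlow {
    Continue,
    Break,
    Return,
}

// Assumed minimal state for the sketch.
pub struct State {
    pub stack: Vec<i64>,
}

pub type StepResult = Result<JitControlFlow, String>;
pub type StepFn = Arc<dyn Fn(&mut State) -> StepResult + Send + Sync>;

// Hypothetical driver: run each closure in order and stop early when one
// signals Break or Return, handing that signal back to the caller.
pub fn run_sequence(steps: &[StepFn], state: &mut State) -> StepResult {
    for step in steps {
        // Explicit reborrow so `state` stays usable on the next iteration.
        match step(&mut *state)? {
            JitControlFlow::Continue => {}
            other => return Ok(other),
        }
    }
    Ok(JitControlFlow::Continue)
}

// Usage: the fourth closure never runs because the third one breaks.
pub fn demo() -> (JitControlFlow, Vec<i64>) {
    let push = |v: i64| -> StepFn {
        Arc::new(move |s: &mut State| -> StepResult {
            s.stack.push(v);
            Ok(JitControlFlow::Continue)
        })
    };
    let brk: StepFn = Arc::new(|_s: &mut State| -> StepResult { Ok(JitControlFlow::Break) });
    let steps = vec![push(1), push(2), brk, push(3)];
    let mut state = State { stack: Vec::new() };
    let flow = run_sequence(&steps, &mut state).unwrap();
    (flow, state.stack)
}
```

Returning the control-flow signal as a value keeps loops and early returns out of the closures themselves: an enclosing loop closure can decide what a Break means without unwinding.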
To understand why this compiler is so fast, we have to look at how the AST nodes compile into closures.
In a standard interpreter, executing an assignment like let a = 5 and a load like a + 1 requires querying a hash map by string name on every loop iteration. The JIT closure compiler bypasses this by pre-allocating variable slots at load-time and wrapping the runtime actions in nested closures that hold direct index offsets.
Here is the exact implementation in src/compiler/nda_jit.rs for compiling Let and Load nodes:
// compiler/nda_jit.rs: Compiling Let and Load AST nodes to closures
fn compile_node(node: &NdaNode, counter: &mut usize, registry: &VarRegistry) -> JitFn {
    *counter += 1;
    match node {
        // Compile a variable declaration
        NdaNode::Let { name_hash, init } => {
            // Resolve the slot once, at compile time
            let slot = registry.get_or_create_slot(*name_hash);
            let init_fn = compile_node(init, counter, registry);
            Arc::new(move |state: &mut JitState<'_>| {
                state.executed_nodes += 1;
                // Evaluate the initialization expression
                init_fn(state)?;
                let val = state.stack.pop().ok_or("Stack underflow in Let init")?;
                // Write directly to the pre-allocated flat array index
                if slot >= state.variables.len() {
                    state.variables.resize(slot + 1, None);
                }
                state.variables[slot] = Some(val);
                Ok(JitControlFlow::Continue)
            })
        }
        // Compile a variable reference load
        NdaNode::Load { name_hash } => {
            let slot = registry.get_or_create_slot(*name_hash);
            Arc::new(move |state: &mut JitState<'_>| {
                state.executed_nodes += 1;
                // Flat array read by index, no hash map lookup
                let val = state.variables.get(slot)
                    .and_then(|v| v.as_ref())
                    .ok_or_else(|| format!("Load of uninitialized variable slot {}", slot))?;
                state.stack.push(val.clone());
                Ok(JitControlFlow::Continue)
            })
        }
        // ... other nodes (Matrix, Norm, Loop, Add) compile similarly
    }
}
By resolving variable lookups to slot indices during compilation and mapping them directly to pre-allocated indices in JitState::variables, we reduce variable load/store operations from hash table lookups to flat memory offsets.
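The VarRegistry itself is not shown in this post, so here is a sketch of how such a registry could hand out slots. Everything in it is an assumption: the u64 type for name_hash, the RefCell-wrapped HashMap, and the slot_count helper. The only detail taken from the code above is that get_or_create_slot is callable through a shared &VarRegistry reference.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical registry: maps a name hash to a dense slot index.
// Interior mutability lets `get_or_create_slot` take `&self`, matching
// the `&VarRegistry` parameter used by `compile_node`.
#[derive(Default)]
pub struct VarRegistry {
    slots: RefCell<HashMap<u64, usize>>,
}

impl VarRegistry {
    // Return the existing slot for this hash, or assign the next free index.
    pub fn get_or_create_slot(&self, name_hash: u64) -> usize {
        let mut slots = self.slots.borrow_mut();
        let next = slots.len();
        *slots.entry(name_hash).or_insert(next)
    }

    // Number of slots assigned so far (helper added for this sketch).
    pub fn slot_count(&self) -> usize {
        self.slots.borrow().len()
    }
}

// Usage: the hash map is consulted only while compiling. At runtime the
// variables live in a flat vector addressed by the resolved indices.
pub fn demo() -> (usize, usize, usize, Vec<Option<i64>>) {
    let registry = VarRegistry::default();
    let a = registry.get_or_create_slot(0xA);
    let b = registry.get_or_create_slot(0xB);
    let a_again = registry.get_or_create_slot(0xA);

    let mut variables: Vec<Option<i64>> = vec![None; registry.slot_count()];
    variables[a] = Some(5); // let a = 5
    variables[b] = variables[a].map(|v| v + 1); // let b = a + 1
    (a, b, a_again, variables)
}
```

Because the closures capture only the usize slot, the registry never has to outlive compilation or be shared with the hot path.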
However, I immediately hit a massive Rust lifetime wall.
The JIT execution closures needed to query my persistent Merkle database (SiteMap) to resolve content-addressed function calls. Because the JIT closures were stored and executed dynamically, satisfying Rust’s borrow checker required wrapping the SiteMap in an Arc<SiteMap>.
This meant that every variable assignment, function call, and closure jump required cloning the Arc, which atomically increments the reference count (and decrements it again on drop). The CPU was wasting cycles on atomic read-modify-write operations in the hot path.
To fix this, I refactored the JIT engine to accept a direct &SiteMap reference instead. I solved the lifetime constraint by using Higher-Ranked Trait Bounds (HRTBs):
type JitFn = Arc<dyn for<'a> Fn(&mut JitState<'a>) -> Result<JitControlFlow, String> + Send + Sync>;
By specifying for<'a>, I explicitly told the compiler that the JIT closure could accept a JitState of any lifetime 'a. This allowed the JIT engine to reference the live, stack-allocated database directly, eliminating Arc clones and reference-counting heap writes entirely.
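Here is a self-contained sketch of that pattern. The SiteMap stand-in with its lookup method, the two-field JitState, the simplified Result<(), String> return type, and compile_call are all assumptions for the example; the real types carry more than this.

```rust
use std::sync::Arc;

// Stand-in for the database; the real SiteMap is not shown in this post.
pub struct SiteMap {
    pub entries: Vec<(u64, i64)>,
}

impl SiteMap {
    pub fn lookup(&self, hash: u64) -> Option<i64> {
        self.entries.iter().find(|(h, _)| *h == hash).map(|(_, v)| *v)
    }
}

// The state borrows the SiteMap for 'a instead of owning an Arc<SiteMap>.
pub struct JitState<'a> {
    pub site_map: &'a SiteMap,
    pub stack: Vec<i64>,
}

// for<'a>: the closure must work for every borrow lifetime, so the closure
// type is not tied to any particular SiteMap borrow.
pub type JitFn = Arc<dyn for<'a> Fn(&mut JitState<'a>) -> Result<(), String> + Send + Sync>;

// Hypothetical compile step for a content-addressed call.
pub fn compile_call(hash: u64) -> JitFn {
    Arc::new(move |state: &mut JitState<'_>| -> Result<(), String> {
        let value = state
            .site_map
            .lookup(hash)
            .ok_or_else(|| format!("unknown function hash {hash:#x}"))?;
        state.stack.push(value);
        Ok(())
    })
}

// Usage: the closure is compiled before the SiteMap even exists, then runs
// against a stack-local SiteMap through a plain reference.
pub fn demo() -> Vec<i64> {
    let call = compile_call(0x2A);
    let site_map = SiteMap { entries: vec![(0x2A, 7)] };
    let mut state = JitState { site_map: &site_map, stack: Vec::new() };
    call(&mut state).unwrap();
    call(&mut state).unwrap();
    state.stack
}
```

The closure captures only the hash, so nothing in it holds a reference count on the database; the borrow lives in the state that is passed in per call.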
I wrapped this JIT engine in a custom JIT Sandbox (NdaJitSandbox). Before any program was committed to the codebase, the sandbox:
(... AssertUnwindSafe).
Here is the architectural comparison mapping the JIT compilation pipeline and sandbox verification execution path:
Fig 1: The two-tier JIT sandbox compilation pipeline and execution pathways.
When I shared the performance gains (the JIT sandbox executing a 4-layer network block in 206µs including compile-and-run time),
analyzed the structural benefits:
"The format itself enforces consistency at write time, so the model can commit incrementally — each triple is either valid against the current graph or it isn't. The correction happens at write speed, not at review time."
By compiling directly to closures, I was allowing the model's output to bypass the serialization wall completely.
But my JIT closures still relied on heap allocations and standard integer loops. I needed to push compiler performance to match—and exceed—native Rust scalar math.
In the next post, I'll document how I optimized the JIT math by introducing slot-based registries and division-free byte loops.
How do you handle runtime extensibility in compiled languages? Have you worked with closure chains or dynamic function dispatch in Rust? How do you tackle borrow checker constraints when dealing with dynamic state sharing? Let's discuss in the comments below!
Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.