14× faster embeddings: how we rebuilt the ONNX path in Manticore

Manticore Search rebuilt its ONNX path and now produces embeddings about 14× faster than the previous SentenceTransformers/Candle backend. The new ONNX Runtime backend, released in Manticore Search 27.1.5, lifts throughput from 5–11 docs/sec to 70–230 docs/sec on the same hardware. That matters because with auto-embeddings, embedding speed is INSERT speed.

When we shipped Auto Embeddings (/blog/auto-embeddings/), the feature that turns any text column into a vector automatically with no separate model service to run, the most common piece of feedback was about speed. The previous path went through SentenceTransformers on top of Candle (https://github.com/huggingface/candle), Hugging Face's pure-Rust ML inference runtime, and it left a lot of CPU on the floor: most workloads sat in the low double digits of docs/sec no matter how we fed them, and concurrent calls serialised on a single model session.

So we spent a few weeks rebuilding how Manticore runs ONNX models. The new ONNX Runtime backend shipped in Manticore Search 27.1.5 (/blog/manticore-search-27-1-5/). ONNX (Open Neural Network Exchange) is the portable model format that most of the popular open-source embedding models (MiniLM, BGE, E5, and friends) already publish.

The result is a backend that's ~14× faster on average than the previous SentenceTransformers/Candle path on the same hardware (an ordinary, inexpensive 16-core / 32-thread server), with the same model and the same weights, averaged over the full threads × batch workload grid. That advantage holds whether you run 1 client thread or 32. The old path stayed in the 5–11 docs/sec range across the entire grid; the new one lives in the 70–230 docs/sec band.

This post is the engineering log: what we tried, what surprised us, what we threw away, and what the final design looks like.
TL;DR

- ~14× faster on average than the previous SentenceTransformers/Candle path, averaged across the full threads × batch workload grid (1 / 2 / 4 / 8 / 16 / 32 threads × batch sizes 1…128) on the same box (16 cores / 32 threads), same model, same weights.
- Released in Manticore Search 27.1.5 (/blog/manticore-search-27-1-5/), the new ONNX path is now the default fast path for any HuggingFace model that ships an `.onnx` file.
- On `all-MiniLM-L12-v2`, the old Candle path sat at 5–11 docs/sec across every configuration we tried. The new ONNX path lands in the 70–230 docs/sec range, and the same ~14× margin holds whether you run 1 client thread or 32.
- Single-insert latency on our test box: ~14 ms with a single client, ~56 ms under 8-way concurrent load. Both are well below the 200+ ms Candle was hitting.
- Want maximum bulk ingest throughput? Use a high batch size (32–128) on a single client thread. The new backend parallelises inside the call, so client-side fan-out just piles coordination overhead on top. Peak on our box was 233 docs/sec at 1 thread + batch=64.
- The two changes that mattered most: turning `intra_op_spinning` off, and giving up on batching documents inside the worker.
- No user-facing API changes. A table that already points at an ONNX-capable `MODEL_NAME` picks up the new path automatically. Switching an existing table to a different model isn't a one-liner, because Manticore doesn't allow altering `MODEL_NAME` on a `FLOAT_VECTOR` field in place. You don't have to recreate the whole table either: you can add a new column with the new model alongside, rebuild its embeddings, and drop the old one.

Why this matters

With auto-embeddings, the database itself runs the model on every INSERT. That means embedding speed is INSERT speed: your ingest throughput is whatever the embedding step can sustain.

The old SentenceTransformers/Candle path left performance on the table.
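
Because the model runs inside each INSERT, the number of rows per statement is the main knob on the client side. Below is a minimal sketch of the "one client thread, large batch" pattern from the TL;DR. It only builds the SQL text and does not talk to a server. The table name `docs`, the column name `body`, and the helper names `escape_sql` and `batched_inserts` are hypothetical, and the backslash escaping of quotes is an assumption you should check against your client library (or replace with prepared statements).

```rust
/// Escape a string for use inside a single-quoted SQL literal.
/// Assumption: the server accepts backslash escaping for `\` and `'`.
fn escape_sql(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '\'' => out.push_str("\\'"),
            _ => out.push(c),
        }
    }
    out
}

/// Build one multi-row INSERT statement per chunk of `batch_size` documents.
/// `table` and `column` are illustrative names, not taken from the article.
fn batched_inserts(table: &str, column: &str, docs: &[&str], batch_size: usize) -> Vec<String> {
    assert!(batch_size > 0, "batch_size must be positive");
    docs.chunks(batch_size)
        .map(|chunk| {
            // Each document becomes one `('...')` row in the VALUES list.
            let rows: Vec<String> = chunk
                .iter()
                .map(|d| format!("('{}')", escape_sql(d)))
                .collect();
            format!("INSERT INTO {} ({}) VALUES {}", table, column, rows.join(", "))
        })
        .collect()
}

fn main() {
    let docs = ["first doc", "it's the second", "third"];
    // A batch size of 2 keeps the output short; the article suggests 32-128.
    for stmt in batched_inserts("docs", "body", &docs, 2) {
        println!("{stmt}");
    }
}
```

Sending these statements sequentially from a single connection matches the configuration that peaked in the benchmark; the parallelism then comes from inside the embedding call rather than from client fan-out.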
With the old backend, concurrency hit lock contention, batched calls plateaued because of padding overhead, and between calls the runtime parked threads in ways that prevented the next call from picking up where the previous one left off.

The headline symptom was simple: `top` would show the box well under full load no matter what you threw at it. The whole sweep (single-row INSERTs, 128-row bulk INSERTs, one client thread, thirty-two client threads) sat at 5–11 docs/sec, because nothing about how you fed it could buy you more CPU.

The new ONNX path raises the floor by an order of magnitude and gives users meaningful performance tuning options. A single-thread, single-row INSERT now lands at 72 docs/sec, already ~7× the old Candle ceiling. Add concurrency or batch size and it climbs into the 130–230 docs/sec range, with the top of the grid at 233 docs/sec on a single client thread at `--batch-size=64`. Averaged across the whole threads × batch matrix, the new path is ~14× the old one.

Why ONNX, and not Candle

Manticore's embeddings library has supported a few backends for a while. The Candle path is great for correctness and easy to ship. But for production inference of small encoder models like the MiniLM and BGE family, ONNX Runtime is hard to beat:

- ONNX Runtime (ORT), Microsoft's official, hand-tuned C++ inference engine for ONNX models, does graph fusion, constant folding, and kernel autotuning.
- Most of the popular embedding models on HuggingFace already publish a pre-fused `model.onnx` in their `onnx/` directory. The on-disk file is already in the shape ORT wants.

On the same `all-MiniLM-L12-v2` weights, on CPU, the ONNX path is a noticeable step up over the Candle path. Same quality, much less per-document work.

The ORT session is created with a small set of opinions:

```rust
let session = ort::session::Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .with_intra_threads(0)?          // let ORT pick = all cores
    .with_intra_op_spinning(false)?  // do NOT busy-wait between calls
    .with_flush_to_zero()?           // kill denormals on attention softmax
    .with_approximate_gelu()?        // ~10% faster activation, no quality loss
    .commit_from_file(&onnx_path)?;
```

Most of these are uncontroversial, "of course you turn that on" knobs. One is not: `with_intra_op_spinning(false)`. We'll come back to it. It's the single biggest win in the whole branch, and it's not really an ORT setting so much as a load-shape decision.

The concurrency model: the part most readers will find new

If you give a Rust developer "make ONNX go fast" with no other constraints, they reach for one of two patterns. We tried both. They are both wrong for this workload.

Pattern 1: a single shared `Session` behind a `Mutex` (a lock that lets only one thread touch the session at a time). Easy to reason about, easy to get right. Throughput collapses under concurrency because every caller serialises on the lock. Fine for a CLI tool, awful for a database serving many concurrent INSERTs.

Pattern 2: a session pool, one `Session` per CPU. No more lock contention, but cold-start time multiplies, RAM use multiplies, and small inputs pay a dispatch cost just to land on a session. We had a working version of this in a development branch and it never quite delivered.

The thing that unlocked the design is something most Rust ONNX wrappers get wrong: on Linux and macOS, ORT's C `Run` API is thread-safe. You can share one `Session` across many concurrent callers without any locking. The C++ side already serialises what needs serialising; the Rust API just hides it behind borrow-checker rules that do not match what the underlying library actually allows.

So we wrap the session in a small platform-aware type:

#[cfg(not(target_os = "windows"))]
struct SessionWrapper { inner: std::cell::UnsafeCell