Native Inference Engine for macOS 14 or newer

Latest release: v0.1.1

Grab the signed .dmg from the latest release, drag Embershard into Applications, and right-click → Open the first time. You will need to approve the launch: go to System Settings → Privacy & Security, scroll down to the "Security" section, and click "Open Anyway". No clone, no toolchain. Apple Silicon, macOS 14 or newer.

Embershard is a macOS chat app with its own LLM inference engine underneath. The interesting part is what isn't there: at inference time the chat path never calls into llama.cpp. Embershard opens the GGUF on its own, pushes the weights to Metal, assembles its transformer compute graph directly on ggml, keeps the KV cache resident across turns, and runs its own byte-level BPE / SentencePiece tokenizer. ggml is used purely as a bag of tensor kernels.

That independence is deliberate, and so is the narrow scope: Embershard runs the llama and qwen2 families (Llama 3.x, Mistral, Qwen 2.5, and anything that reports those architectures in its GGUF) and nothing else. It is a focused engine checked for numerical parity against the reference, not a drop-in GGUF runner.

Embershard grew out of [ds4], which set the template for a small, honest, self-contained native project. ds4 was the inspiration; the engine, app, and writing here are their own thing.

Orchestrated by me, written with Claude Code. App icon by DinosoftLabs.

Wrapping libllama is the easy path, and it is the one most apps take. Embershard takes the harder one so the whole hot loop (graph construction, the KV-cache layout, the sampler, tokenization) lives in code we own and can reason about. llama.cpp and ggml still made it possible: their kernels, the GGUF format and its tooling, the quant formats, and a great deal of hard-won engineering were the map we followed while building everything above the tensor ops. We link ggml for those ops and the Metal backend, and keep llama.cpp around only for the experimental multi-agent orchestrator. Thanks to Georgi Gerganov and the contributors.

Beta, but the core is measured rather than asserted:

- Logit parity. The llama and qwen2 forward pass matches llama.cpp to a cosine of 0.999999; greedy continuations come out token-for-token identical.
- Resident KV cache in F16 / Q8_0 / Q4_0, incremental O(n) decode, reused across turns. When the context fills, a sliding window evicts the oldest tokens while keeping absolute RoPE positions intact (no re-roping), so long conversations keep going.
- Throughput at parity with llama.cpp on the same model: both are memory-bandwidth bound on identical ggml kernels, so there is nothing to win or lose here.
- One engine for everything. Plain chat and the planner → executor agent pipeline both run on es_gx; llama.cpp is not in the inference path.
- Tokenizer parity. Token IDs match llama.cpp across the test corpus, with two backends: byte-level BPE (gpt2: llama-bpe, qwen2) and SentencePiece (llama/SPM: Llama 2, Mistral v0.1/v0.2, TinyLlama).
- Sharded GGUFs (-00001-of-N) load by following the split metadata.
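The sliding-window behaviour described in the list above can be illustrated with a toy model. This is a sketch under stated assumptions, not the es_gx layout: the class name `SlidingKVCache` and its fields are hypothetical, and a real cache would overwrite slots in a ring buffer rather than shift a Python list. The point it shows is that each cached entry keeps the absolute position it was rotated for, so eviction never forces the survivors to be re-rotated.

```python
from dataclasses import dataclass, field


@dataclass
class SlidingKVCache:
    """Toy KV cache: fixed capacity, oldest entries evicted first."""
    capacity: int
    next_pos: int = 0                           # absolute position of the next token
    slots: list = field(default_factory=list)   # (abs_pos, key, value), oldest first

    def append(self, key, value):
        if len(self.slots) == self.capacity:
            self.slots.pop(0)                   # evict the oldest token
        # The key is assumed to be stored already rotated for next_pos,
        # so surviving entries are left untouched when the window slides.
        self.slots.append((self.next_pos, key, value))
        self.next_pos += 1

    def positions(self):
        return [pos for pos, _, _ in self.slots]


cache = SlidingKVCache(capacity=4)
for tok in range(6):
    cache.append(f"k{tok}", f"v{tok}")
print(cache.positions())  # [2, 3, 4, 5]
```

After six tokens in a four-slot window, the cache holds positions 2 through 5 and the next token will be assigned position 6; positions are never renumbered to start from zero.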

- Architectures stop at llama/qwen2. Gemma, Phi, and MoE models (gpt-oss, Mixtral, …) are unsupported and filtered out of the browser.
- A model that exceeds the GPU working set is not streamed from SSD; it fails cleanly, and the browser filters by available RAM up front. SSD streaming is future work.
- No bespoke Metal kernels yet: prefill uses ggml_flash_attn_ext, decode a manual ggml path.
- Tokenizers past BPE/SPM (tiktoken-style for gpt-oss, etc.) aren't written, and won't matter until the matching architectures land.
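To get a feel for how much of the working set the resident KV cache takes in each of the three cache types, a back-of-the-envelope estimate helps. The sketch below is an assumption-laden illustration, not Embershard's actual admission logic: the function names and the `fits` check are hypothetical, and the model shape in the usage line (32 layers, 8 KV heads, head dimension 128) is only an example. The block sizes are those of the ggml formats: F16 is 2 bytes per element, while Q8_0 and Q4_0 store an fp16 scale plus 32 quantized values per block.

```python
# Bytes per 32-element block for each cache type.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_ELEMS = 32


def kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, cache_type="f16"):
    """Size of K plus V across all layers for n_ctx cached tokens."""
    row_elems = n_head_kv * head_dim
    if row_elems % BLOCK_ELEMS != 0:
        raise ValueError("row size must be a multiple of 32")
    row_bytes = row_elems // BLOCK_ELEMS * BLOCK_BYTES[cache_type]
    return 2 * n_layer * n_ctx * row_bytes      # factor 2: keys and values


def fits(model_bytes, cache_bytes, budget_bytes):
    """Hypothetical admission check: weights plus cache under a budget."""
    return model_bytes + cache_bytes <= budget_bytes


for kind in ("f16", "q8_0", "q4_0"):
    print(kind, kv_cache_bytes(32, 8192, 8, 128, kind))
# f16 1073741824
# q8_0 570425344
# q4_0 301989888
```

For this example shape, an 8192-token cache costs 1 GiB in F16 and a little over half that in Q8_0, which is why the cache type matters when a model sits close to the memory limit.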

es_gx.c             GGUF load (single or sharded) -> Metal weights, forward
                    graph, resident KV cache (F16/Q8_0/Q4_0) + sliding window,
                    host-side sampling
es_tok.c            tokenizer: byte-level BPE (gpt2) and SentencePiece (llama)
NativeEngine.swift  multi-turn chat + planner→executor pipeline on es_gx

(llama.cpp / es_engine.c / es_orchestrator.c stay for the CLI and tests.)

Per layer the forward pass is the familiar stack: RMSNorm, Q/K/V projection (with bias on qwen2), RoPE (NORMAL for llama, NEOX for qwen2), attention over the resident K/V cache (flash for prefill, manual mul_mat/softmax for decode), output projection and residual, a SwiGLU-gated FFN and residual. A final norm and output projection give the logits for the last position; temperature, top-k, top-p, min-p, repeat penalty and seed are applied on the host.
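The host-side sampling step at the end of that paragraph can be sketched in a few lines. This is a minimal illustration, assuming one plausible ordering of the filters (temperature, top-k, min-p, top-p); the order, defaults, and function name are assumptions rather than what es_gx does, and the repeat penalty is left out for brevity.

```python
import math
import random


def sample(logits, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05, rng=None):
    """Pick a token id from raw logits using a simple host-side filter chain."""
    rng = rng or random.Random()
    if temperature <= 0:                        # treat as greedy decoding
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature, then softmax (shifted by the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    cands = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # top-k: keep the k most likely tokens.
    if top_k > 0:
        cands = cands[:top_k]
    # min-p: drop tokens below min_p times the best probability.
    cands = [c for c in cands if c[0] >= min_p * cands[0][0]]
    # top-p: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for p, i in cands:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break

    # Draw from the surviving candidates, weighted by probability.
    r = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

Passing a seeded `random.Random(seed)` as `rng` makes a run reproducible, which is the role the seed parameter plays in the sampler settings; a temperature of zero falls back to greedy argmax, the mode used for parity testing.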

You need macOS 14+ on Apple Silicon, CMake 3.20+, the Xcode command-line tools, and a llama.cpp checkout under vendor/ (for ggml and the Metal backend).

git clone --depth 1 https://github.com/ggerganov/llama.cpp.git vendor/llama.cpp

cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

Targets:

embershard   llama.cpp CLI (single-shot + REPL, agent mode)
test_gx      native-engine parity gate vs llama.cpp (logits + greedy)
tok_test     tokenizer parity gate vs llama.cpp
gen_gx       standalone multi-turn generation, links ggml only (no libllama)
test_engine  agent/orchestrator integration test

The engine is gated against llama.cpp on the same GGUF — this is how the numbers above are produced, not a claim taken on faith:

./build/test_gx /path/to/model.gguf "The capital of France is" 32

./build/tok_test /path/to/model.gguf

Pass criteria: argmax agrees, cosine > 0.999, greedy sequences identical, tokenizer 12/12 cases byte-for-byte.
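The logit and greedy criteria are simple enough to state as code. The following is an illustrative sketch of the checks, not the test_gx source: the function names are hypothetical, and the inputs are assumed to be the last-position logits and the greedy token sequences from the native engine and from llama.cpp for the same prompt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def argmax(v):
    return max(range(len(v)), key=v.__getitem__)


def parity_ok(native_logits, ref_logits, native_tokens, ref_tokens, min_cos=0.999):
    """Apply the three logit/greedy criteria to one prompt."""
    return (
        argmax(native_logits) == argmax(ref_logits)
        and cosine(native_logits, ref_logits) > min_cos
        and native_tokens == ref_tokens
    )
```

Checking argmax and the greedy sequence alongside the cosine matters because a high cosine alone does not guarantee the same top token when two logits are nearly tied.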

Generation with no libllama in the link at all:

./build/gen_gx /path/to/model.gguf "My name is Alice." "What is my name?"

A SwiftUI front end that links the engine as a static library and bundles the ggml dylibs; every token is produced by es_gx.

cd app
swift build -c release      # development build
./make_dmg.sh               # signed .app + drag-to-Applications .dmg

Inside: tabbed chats and projects, per-chat skills, three chat modes when you start a new one — Standard, Agentic (planner → executor, for multi-step tasks), and Arena (ask up to four models at once and watch them answer side by side, concurrently) — a HuggingFace browser restricted to engine-compatible official GGUFs with the publisher shown, and an inference panel for context size, max tokens and the full sampler.

First launch on another Mac (ad-hoc signed): right-click → Open → Open, or xattr -dr com.apple.quarantine /path/to/Embershard.app.

The browser lists official, engine-compatible GGUFs filtered by your machine's RAM (the Qwen 2.5 family; SmolLM2 from HuggingFaceTB). Search also surfaces community repacks, flagged as such. Import a local .gguf and, if it isn't a llama/qwen2 model, it's marked unsupported and kept out of the chat picker.

