Native Inference Engine for macOS 14 or newer

Embershard, a macOS chat app with its own LLM inference engine, has been released in beta v0.1.1 for Apple Silicon devices running macOS 14 or newer. The app bypasses llama.cpp for inference, instead directly loading GGUF weights to Metal and assembling transformer compute graphs on ggml, achieving logit parity and throughput comparable to llama.cpp. It supports Llama and Qwen2 model families with a resident KV cache and sliding window context management.

Latest release: v0.1.1

Grab the signed .dmg from the latest release, drag Embershard into Applications, and right-click → Open. The first time, you'll need to approve its opening: go to System Settings → Privacy & Security, scroll down to the "Security" section, and click "Open Anyway". No clone, no toolchain. Requires Apple Silicon and macOS 14 or newer.

Embershard is a macOS chat app with its own LLM inference engine underneath. The interesting part is what isn't there: at inference time the chat path never calls into llama.cpp. Embershard opens the GGUF on its own, pushes the weights to Metal, assembles its transformer compute graph directly on ggml, keeps the KV cache resident across turns, and runs its own byte-level BPE / SentencePiece tokenizer. ggml is used purely as a bag of tensor kernels.

That independence is deliberate, and so is the narrow scope: Embershard runs the llama and qwen2 families (Llama 3.x, Mistral, Qwen 2.5, and anything that reports those architectures in its GGUF) and nothing else. It is a focused engine checked for numerical parity against the reference, not a drop-in GGUF runner.

Embershard grew out of ds4, which set the template for a small, honest, self-contained native project. ds4 was the inspiration; the engine, app, and writing here are their own thing. Orchestrated by me, written with ClaudeCode. App icon by DinosoftLabs.

Wrapping libllama is the easy path, and it is the one most apps take. Embershard takes the harder one so the whole hot loop (graph construction, the KV-cache layout, the sampler, tokenization) lives in code we own and can reason about. llama.cpp and ggml still made it possible: their kernels, the GGUF format and its tooling, the quant formats, and a great deal of hard-won engineering were the map we followed while building everything above the tensor ops. We link ggml for those ops and the Metal backend, and keep llama.cpp around only for the experimental multi-agent orchestrator.
Thanks to Georgi Gerganov and the contributors.

Beta, but the core is measured rather than asserted:

- Logit parity. The llama and qwen2 forward pass matches llama.cpp to a cosine of 0.999999; greedy continuations come out token-for-token identical.
- Resident KV cache in F16 / Q8_0 / Q4_0, with incremental O(n) decode, reused across turns. When the context fills, a sliding window evicts the oldest tokens while keeping absolute RoPE positions intact (no re-roping), so long conversations keep going.
- Throughput at parity with llama.cpp on the same model. Both are memory-bandwidth bound on identical ggml kernels, so there is nothing to win or lose here.
- One engine for everything. Plain chat and the planner → executor agent pipeline both run on es_gx; llama.cpp is not in the inference path.
- Tokenizer parity. Token IDs match llama.cpp across the test corpus, with two backends: byte-level BPE (gpt2: llama-bpe, qwen2) and SentencePiece (llama/SPM: Llama 2, Mistral v0.1/v0.2, TinyLlama).
- Sharded GGUFs (-00001-of-N) load by following the split metadata.

Known limits:

- Architectures stop at llama / qwen2. Gemma, Phi, and MoE models (gpt-oss, Mixtral, …) are unsupported and filtered out of the browser.
- A model that exceeds the GPU working set is not streamed from SSD: loading it fails cleanly, and the browser filters by available RAM up front. SSD streaming is future work.
- No bespoke Metal kernels yet: prefill uses ggml_flash_attn_ext, decode a manual ggml path.
- Tokenizers past BPE/SPM (tiktoken-style for gpt-oss, etc.) aren't written, and won't matter until the matching architectures land.

Source layout:

- es_gx.c: GGUF load (single or sharded) → Metal weights, forward graph, resident KV cache (F16/Q8_0/Q4_0) + sliding window, host-side sampling
- es_tok.c: tokenizer, byte-level BPE (gpt2) and SentencePiece (llama)
- NativeEngine.swift: multi-turn chat + planner→executor pipeline on es_gx

llama.cpp / es_engine.c / es_orchestrator.c stay for the CLI and tests.
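The sliding-window behaviour mentioned above (evict the oldest tokens, keep absolute RoPE positions) can be illustrated with a small bookkeeping sketch. This is not Embershard's code: the ring-buffer layout, the `kv_window` struct, `KV_SLOTS`, and the function names are all assumptions made for illustration, and the real K/V tensors are left out entirely. The point it shows is that a new token always receives the next absolute position, so surviving entries never need to be renumbered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KV_SLOTS 4  /* hypothetical, tiny capacity so eviction is easy to see */

/* Hypothetical bookkeeping only; K/V rows would be stored alongside. */
typedef struct {
    int32_t pos[KV_SLOTS];  /* absolute position held by each slot */
    size_t  used;           /* number of occupied slots */
    size_t  head;           /* slot holding the oldest token */
    int32_t next_pos;       /* absolute position of the next token */
} kv_window;

static void kv_init(kv_window *w) {
    assert(w != NULL);
    w->used = 0;
    w->head = 0;
    w->next_pos = 0;
}

/* Append one token and return the slot its K/V row should be written to.
 * When the window is full the oldest slot is overwritten. Positions of the
 * surviving slots are left untouched, which is the "no re-roping" property. */
static size_t kv_push(kv_window *w) {
    size_t slot;
    assert(w != NULL);
    if (w->used < KV_SLOTS) {
        slot = (w->head + w->used) % KV_SLOTS;
        w->used++;
    } else {
        slot = w->head;                       /* evict the oldest token */
        w->head = (w->head + 1) % KV_SLOTS;
    }
    w->pos[slot] = w->next_pos++;             /* position keeps counting up */
    return slot;
}

/* Oldest absolute position still inside the window. */
static int32_t kv_oldest_pos(const kv_window *w) {
    assert(w != NULL && w->used > 0);
    return w->pos[w->head];
}
```

With four slots, pushing six tokens leaves positions 2 through 5 in the window, and the next token is assigned position 6 rather than restarting from a renumbered offset.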
Per layer the forward pass is the familiar stack: RMSNorm, Q/K/V projection (with bias on qwen2), RoPE (NORMAL for llama, NEOX for qwen2), attention over the resident K/V cache (flash for prefill, manual mul_mat/softmax for decode), output projection and residual, a SwiGLU-gated FFN and residual. A final norm and output projection give the logits for the last position; temperature, top-k, top-p, min-p, repeat penalty and seed are applied on the host.

You need macOS 14+ on Apple Silicon, CMake 3.20+, the Xcode command-line tools, and a llama.cpp checkout under vendor/ (for ggml and the Metal backend).

    git clone --depth 1 https://github.com/ggerganov/llama.cpp.git vendor/llama.cpp
    cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j$(sysctl -n hw.ncpu)

Targets:

- embershard: llama.cpp CLI (single-shot + REPL, agent mode)
- test_gx: native-engine parity gate vs llama.cpp (logits + greedy)
- tok_test: tokenizer parity gate vs llama.cpp
- gen_gx: standalone multi-turn generation, links ggml only (no libllama)
- test_engine: agent/orchestrator integration test

The engine is gated against llama.cpp on the same GGUF. This is how the numbers above are produced, not a claim taken on faith:

- Load once with llama.cpp (reference) and once with es_gx, feed identical token IDs, compare last-token logits, then greedy-generate with both and diff the token sequences.

      ./build/test_gx /path/to/model.gguf "The capital of France is" 32

- Tokenizer: es_tok vs llama_tokenize over a corpus.

      ./build/tok_test /path/to/model.gguf

Pass criteria: argmax agrees, cosine clears the 0.999 threshold, greedy sequences identical, tokenizer 12/12 cases byte-for-byte.

Generation with no libllama in the link at all:

    ./build/gen_gx /path/to/model.gguf "My name is Alice." "What is my name?"

A SwiftUI front end links the engine as a static library and bundles the ggml dylibs; every token is produced by es_gx.
    cd app
    swift build -c release    # development build
    ./make_dmg.sh             # signed .app + drag-to-Applications .dmg

Inside: tabbed chats and projects, per-chat skills, three chat modes when you start a new one (Standard; Agentic, a planner → executor pipeline for multi-step tasks; and Arena, which asks up to four models at once so you can watch them answer side by side, concurrently), a HuggingFace browser restricted to engine-compatible official GGUFs with the publisher shown, and an inference panel for context size, max tokens and the full sampler.

First launch on another Mac (ad-hoc signed): right-click → Open → Open, or run

    xattr -dr com.apple.quarantine /path/to/Embershard.app

The browser lists official, engine-compatible GGUFs filtered by your machine's RAM (the Qwen 2.5 family; SmolLM2 from HuggingFaceTB). Search also surfaces community repacks, flagged as such. Import a local .gguf and, if it isn't a llama / qwen2 model, it's marked unsupported and kept out of the chat picker.
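For readers who want to reproduce the logit half of the parity gate on their own dumps, the two criteria named in the pass criteria (matching argmax, cosine at or above a threshold) can be sketched as below. This is a minimal illustration, not the code in the repo's parity test: the function names and the `min_cos` parameter are hypothetical, and the cosine test is written in squared form so the sketch needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the largest logit; the first index wins on ties. */
static size_t argmax_f32(const float *x, size_t n) {
    size_t best = 0;
    assert(x != NULL && n > 0);
    for (size_t i = 1; i < n; i++) {
        if (x[i] > x[best]) best = i;
    }
    return best;
}

/* Returns 1 when cos(a, b) >= min_cos, for min_cos in [0, 1].
 * Uses dot^2 >= min_cos^2 * |a|^2 * |b|^2 with dot > 0, which is the same
 * inequality with both sides squared, so no square root is required.
 * Sums are accumulated in double to limit rounding error. */
static int cosine_at_least(const float *a, const float *b, size_t n,
                           double min_cos) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    assert(a != NULL && b != NULL && min_cos >= 0.0 && min_cos <= 1.0);
    for (size_t i = 0; i < n; i++) {
        dot += (double)a[i] * (double)b[i];
        na  += (double)a[i] * (double)a[i];
        nb  += (double)b[i] * (double)b[i];
    }
    if (na == 0.0 || nb == 0.0 || dot <= 0.0) return 0;
    return dot * dot >= min_cos * min_cos * na * nb;
}

/* Both checks together: same top token and near-identical direction. */
static int logits_match(const float *ref, const float *got, size_t n,
                        double min_cos) {
    return argmax_f32(ref, n) == argmax_f32(got, n)
        && cosine_at_least(ref, got, n, min_cos);
}
```

Calling `logits_match(ref, got, n_vocab, 0.999)` on the last-token logits from the reference and from the engine under test mirrors the threshold quoted above; the greedy-sequence diff is a separate token-by-token comparison.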