{"slug": "native-inference-engine-for-macos-14-or-newer", "title": "Native Inference Engine for macOS 14 or newer", "summary": "Embershard, a macOS chat app with its own LLM inference engine, has been released in beta v0.1.1 for Apple Silicon devices running macOS 14 or newer. The app bypasses llama.cpp for inference, instead directly loading GGUF weights to Metal and assembling transformer compute graphs on ggml, achieving logit parity and throughput comparable to llama.cpp. It supports Llama and Qwen2 model families with a resident KV cache and sliding window context management.", "body_md": "**Latest release:** v0.1.1\n\nGrab the signed `.dmg` from the latest release, drag Embershard into\nApplications, and right-click → Open the first time (you'll need to approve its opening: go to System Settings → Privacy & Security, scroll down to the \"Security\" section, and click \"Open Anyway\"). No clone, no toolchain. Apple Silicon, macOS 14 or newer.\n\nEmbershard is a macOS chat app with its own LLM inference engine underneath. The\ninteresting part is what *isn't* there: at inference time the chat path never\ncalls into llama.cpp. Embershard opens the GGUF on its own, pushes the weights to\nMetal, assembles its transformer compute graph directly on `ggml`, keeps the KV\ncache resident across turns, and runs its own byte-level BPE / SentencePiece\ntokenizer. `ggml` is used purely as a bag of tensor kernels.\n\nThat independence is deliberate, and so is the narrow scope: Embershard runs the\n`llama` and `qwen2` families (Llama 3.x, Mistral, Qwen 2.5, and anything that\nreports those architectures in its GGUF) and nothing else. It is a focused engine\nchecked for numerical parity against the reference — not a drop-in GGUF runner.\n\nEmbershard grew out of [ds4], which set the template for a small, honest, self-contained native project. 
ds4 was the inspiration; the engine, app, and writing here are their own thing.\n\nOrchestrated by me, written with ClaudeCode. App icon by DinosoftLabs.\n\nWrapping libllama is the easy path, and it is the one most apps take. Embershard\ntakes the harder one so the whole hot loop — graph construction, the KV-cache\nlayout, the sampler, tokenization — lives in code we own and can reason about.\n`llama.cpp` and `ggml` still made it possible: their kernels, the GGUF format and\nits tooling, the quant formats, and a great deal of hard-won engineering were the\nmap we followed while building everything above the tensor ops. We link `ggml`\nfor those ops and the Metal backend, and keep llama.cpp around only for the\nexperimental multi-agent orchestrator. Thanks to Georgi Gerganov and the\ncontributors.\n\nBeta, but the core is measured rather than asserted:\n\n- **Logit parity.** The `llama` and `qwen2` forward pass matches llama.cpp to a cosine of 0.999999; greedy continuations come out token-for-token identical.\n- **Resident KV cache** in F16 / Q8_0 / Q4_0, incremental O(n) decode, reused across turns. When the context fills, a sliding window evicts the oldest tokens while keeping absolute RoPE positions intact (no re-roping), so long conversations keep going.\n- **Throughput at parity** with llama.cpp on the same model — both are memory-bandwidth bound on identical ggml kernels, so there is nothing to win or lose here.\n- **One engine for everything.** Plain chat and the planner → executor agent pipeline both run on `es_gx`; llama.cpp is not in the inference path.\n- **Tokenizer parity.** Token IDs match llama.cpp across the test corpus, with two backends: byte-level BPE (`gpt2`: `llama-bpe`, `qwen2`) and SentencePiece (`llama`/SPM: Llama 2, Mistral v0.1/v0.2, TinyLlama).\n- **Sharded GGUFs** (`-00001-of-N`) load by following the split metadata.\n\n- Architectures stop at `llama`/`qwen2`. 
Gemma, Phi, and MoE models (gpt-oss, Mixtral, …) are unsupported and filtered out of the browser.\n- A model that exceeds the GPU working set is not streamed from SSD — loading it fails cleanly, and the browser filters by available RAM up front. SSD streaming is future work.\n- No bespoke Metal kernels yet: prefill uses `ggml_flash_attn_ext`, decode a manual `ggml` path.\n- Tokenizers past BPE/SPM (tiktoken-style for gpt-oss, etc.) aren't written, and won't matter until the matching architectures land.\n\n```\nes_gx.c             GGUF load (single or sharded) -> Metal weights, forward\n                    graph, resident KV cache (F16/Q8_0/Q4_0) + sliding window,\n                    host-side sampling\nes_tok.c            tokenizer: byte-level BPE (gpt2) and SentencePiece (llama)\nNativeEngine.swift  multi-turn chat + planner→executor pipeline on es_gx\n\n(llama.cpp / es_engine.c / es_orchestrator.c stay for the CLI and tests.)\n```\n\nPer layer the forward pass is the familiar stack: RMSNorm, Q/K/V projection (with bias on qwen2), RoPE (NORMAL for llama, NEOX for qwen2), attention over the resident K/V cache (flash for prefill, manual mul_mat/softmax for decode), output projection and residual, a SwiGLU-gated FFN and residual. 
A final norm and output projection give the logits for the last position; temperature, top-k, top-p, min-p, repeat penalty and seed are applied on the host.\n\nYou need macOS 14+ on Apple Silicon, CMake 3.20+, the Xcode command-line tools,\nand a `llama.cpp` checkout under `vendor/` (for `ggml` and the Metal backend).\n\n```\ngit clone --depth 1 https://github.com/ggerganov/llama.cpp.git vendor/llama.cpp\n\ncmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release\ncmake --build build -j$(sysctl -n hw.ncpu)\n```\n\nTargets:\n\n```\nembershard   llama.cpp CLI (single-shot + REPL, agent mode)\ntest_gx      native-engine parity gate vs llama.cpp (logits + greedy)\ntok_test     tokenizer parity gate vs llama.cpp\ngen_gx       standalone multi-turn generation, links ggml only (no libllama)\ntest_engine  agent/orchestrator integration test\n```\n\nThe engine is gated against llama.cpp on the same GGUF — this is how the numbers above are produced, not a claim taken on faith:\n\n```\n# Load once with llama.cpp (reference) and once with es_gx, feed identical token\n# IDs, compare last-token logits, then greedy-generate with both and diff the\n# token sequences.\n./build/test_gx /path/to/model.gguf \"The capital of France is\" 32\n\n# Tokenizer: es_tok vs llama_tokenize over a corpus.\n./build/tok_test /path/to/model.gguf\n```\n\nPass criteria: argmax agrees, cosine > 0.999, greedy sequences identical, tokenizer 12/12 cases byte-for-byte.\n\nGeneration with no libllama in the link at all:\n\n```\n./build/gen_gx /path/to/model.gguf \"My name is Alice.\" \"What is my name?\"\n```\n\nA SwiftUI front end that links the engine as a static library and bundles the\n`ggml` dylibs; every token is produced by `es_gx`.\n\n```\ncd app\nswift build -c release      # development build\n./make_dmg.sh               # signed .app + drag-to-Applications .dmg\n```\n\nInside: tabbed chats and projects, per-chat skills, three chat modes when 
you\nstart a new one — **Standard**, **Agentic** (planner → executor, for multi-step\ntasks), and **Arena** (ask up to four models at once and watch them answer side\nby side, concurrently) — a HuggingFace browser restricted to engine-compatible\nofficial GGUFs with the publisher shown, and an inference panel for context size,\nmax tokens and the full sampler.\n\nFirst launch on another Mac (ad-hoc signed): right-click → Open → Open, or\n`xattr -dr com.apple.quarantine /path/to/Embershard.app`.\n\nThe browser lists official, engine-compatible GGUFs filtered by your machine's\nRAM (the Qwen 2.5 family; SmolLM2 from HuggingFaceTB). Search also surfaces\ncommunity repacks, flagged as such. Import a local `.gguf` and, if it isn't a\n`llama`/`qwen2` model, it's marked unsupported and kept out of the chat picker.", "url": "https://wpnews.pro/news/native-inference-engine-for-macos-14-or-newer", "canonical_source": "https://github.com/tictacguy/embershard", "published_at": "2026-06-17 06:55:49+00:00", "updated_at": "2026-06-17 07:22:54.156061+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-infrastructure", "ai-tools", "developer-tools"], "entities": ["Embershard", "llama.cpp", "ggml", "Metal", "GGUF", "Llama", "Qwen2", "ClaudeCode"], "alternates": {"html": "https://wpnews.pro/news/native-inference-engine-for-macos-14-or-newer", "markdown": "https://wpnews.pro/news/native-inference-engine-for-macos-14-or-newer.md", "text": "https://wpnews.pro/news/native-inference-engine-for-macos-14-or-newer.txt", "jsonld": "https://wpnews.pro/news/native-inference-engine-for-macos-14-or-newer.jsonld"}}