{"slug": "edgesync-llm-kv-cache-fragment-engine-for-on-device-llm-inference-go-android", "title": "EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android)", "summary": "EdgeSync-LLM, a new KV cache fragment engine for on-device LLM inference, stores and retrieves transformer KV tensors via HNSW approximate nearest-neighbor search, enabling exact hits at ~8ms TTFT and partial hits at ~280ms TTFT. Designed for ARM64 Android and compatible with llama.cpp, MLC-LLM, and ONNX Runtime, it skips the most expensive prefill step by injecting cached fragments directly into the engine's KV cache.", "body_md": "An **engine-agnostic KV cache fragment system** for on-device LLM inference.\nDesigned for ARM64 Android (Cortex-A55/A78), portable to any platform running\nllama.cpp, MLC-LLM, or ONNX Runtime.\n\nA **reusable KV cache layer** that sits between the application and the LLM\nengine. Instead of re-running the full prefill on every request, it stores\nslices of the attention KV tensors (Keys and Values), retrieves them via\napproximate nearest-neighbor search (HNSW), and injects them directly into\nthe engine's KV cache — skipping the most expensive part of inference.\n\nThis is not a \"semantic cache\" that stores response strings. 
It stores the\n**actual transformer KV tensors**, identified by token range and layer range,\nand reconstructs them at request time.\n\n```\n              [ PROMPT ]\n                  │\n                  ▼\n         [ Embedding Model ]        MiniLM-L6-v2 (384-dim, ~8ms CPU)\n                  │\n                  ▼\n           [ HNSW Index ]           Pure Go, M=16, efSearch=50\n                  │\n          ┌───────┴───────────────────────────┐\n          │                                   │\n    sim ≥ 0.92                          0.75 ≤ sim < 0.92       sim < 0.75\n          │                                   │                      │\n   ┌──────▼──────┐                  ┌─────────▼──────┐      ┌───────▼────────┐\n   │ EXACT HIT   │                  │  PARTIAL HIT   │      │     MISS       │\n   │             │                  │                │      │                │\n   │ Inject KV   │                  │ Inject prefix  │      │ Full prefill   │\n   │ fragment    │                  │ Generate delta │      │ Extract frag.  
│\n   │ ~8ms TTFT   │                  │ ~280ms TTFT    │      │ Store in HNSW  │\n   └─────────────┘                  └────────────────┘      └────────────────┘\n          │                                   │                      │\n          └───────────────────────────────────┴──────────────────────┘\n                                              │\n                                     [ KVAdapter Layer ]\n                                              │\n                         ┌────────────────────┼─────────────────────┐\n                         ▼                    ▼                     ▼\n                  [ llamacpp ]           [ mlc-llm ]         [ onnx runtime ]\n                 (GGML tensor API)    (TVM paged KV)      (past_key_values)\n├── cache/\n│   ├── fragment.go          ← KVFragment: formal definition of a cache unit\n│   │                          (dimensions, TTL, eviction policy, storage key)\n│   ├── differential.go      ← DifferentialEngine: EXACT / PARTIAL / MISS router\n│   └── schema.go            ← SQLite WAL schema for fragment metadata\n│\n├── adapter/\n│   ├── interface.go         ← KVAdapter: engine-agnostic contract\n│   │                          (ExtractFragment / InjectFragment / Generate)\n│   ├── llamacpp.go          ← llama.cpp adapter (GGML tensor API, CGO)\n│   ├── mlc.go               ← MLC-LLM adapter (TVM paged KV, mlc4j)\n│   └── onnx.go              ← ONNX Runtime adapter (past_key_values)\n│\n├── core/\n│   ├── hnsw.go              ← Pure Go HNSW index (M=16, efSearch=50)\n│   └── cosine_neon.c        ← ARM NEON fp16 cosine similarity\n│\n├── sdk/android/\n│   └── EdgeSyncLLM.kt       ← Kotlin JNI bridge (suspend coroutines)\n│\n├── monitor/\n│   └── energy_android.go    ← Android /sys/class/power_supply/ profiler\n│\n├── prefetch/\n│   └── predictor.go         ← N-gram prefetch predictor (top-3 candidates)\n│\n└── benchmark/\n    └── runner.go            ← Falsifiable benchmark: 3 modes × 1000 requests\n```\n\nThe 
**atomic unit** of the cache. Formally defined in `cache/fragment.go`.\n\n| Field | Type | Meaning |\n|---|---|---|\n| `TokenStart / TokenEnd` | int | Token range covered `[start, end)` |\n| `LayerStart / LayerEnd` | int | Transformer layers captured |\n| `LayerStride` | int | Sampling interval (2 = every other layer) |\n| `Keys / Values` | `[]byte` | Raw attention tensors (engine-serialized) |\n| `TokenIDs` | `[]int32` | Input tokens → used to verify prefix |\n| `ContentHash` | string | SHA-256 of TokenIDs (not tensors) |\n| `EmbeddingVector` | `[]float32` | 384-dim semantic vector for HNSW lookup |\n| `ExpiresAt` | time.Time | TTL: 30 min (session) → 7 days (promoted) |\n| `HitCount` | int | Auto-promotes at hit ≥ 5 |\n| `Engine` | string | \"llamacpp\" / \"mlc\" / \"onnx\" |\n\n**Invariants enforced at construction:**\n\n- `TokenSpan ∈ [64, 2048]` tokens\n- `LayerEnd ≤ model.NumLayers`\n- `len(TokenIDs) == TokenSpan`\n- `len(Keys) > 0 && len(Values) > 0`\n- `LayerStride ≥ 1`\n\nDefined in `adapter/interface.go`. Any engine implements 6 methods:\n\n```\nExtractFragment(ctx, tokenIDs, layerStart, layerEnd, layerStride, embedding)\n    → *KVFragment, error\n\nInjectFragment(ctx, fragment)\n    → error\n\nGenerate(ctx, prompt, startTokenPos, maxTokens)\n    → text string, tokensGenerated int, error\n\nTokenize(ctx, text)\n    → []int32, error\n\nClearKVCache(ctx)\n    → error\n\nClose()\n    → error\n```\n\nCross-engine reuse: engine B can inject a fragment from engine A if and only if\nB lists A in `CompatibleWith()`. Current compatibility matrix:\n\n| Producer → Consumer | llamacpp | mlc | onnx |\n|---|---|---|---|\n| llamacpp | ✓ | — | — |\n| mlc | — | ✓ | — |\n| onnx | — | — | ✓ |\n\nCross-engine reuse (e.g. llamacpp → onnx) requires a KV tensor reshape adapter\n(transpose `[seq, heads, dim]` → `[heads, seq, dim]`). 
Not implemented yet.\n\nThe benchmark in `benchmark/runner.go` compares 3 modes over 1000 requests\ndrawn from 8 semantic prompt clusters (64 unique prompts + 4 variants each).\n\n**Timing model** (not ad-hoc random ranges — derived from Cortex-A55 measurements):\n\n| Constant | Value | Source |\n|---|---|---|\n| Prefill | 6.8 ms/token | llama.cpp bench, Snapdragon 685 |\n| Generate | 18.4 ms/token | same |\n| HNSW search | 3.2 ms | N=1000, efSearch=50 |\n| Fragment inject | 0.029 ms/MB | LPDDR4X bandwidth |\n| Fragment size | ~6 MB | 128 tokens, 12 layers, Q4_K_M |\n\n**Expected results:**\n\n| Mode | Avg TTFT | Hit rate | Mem BW | Energy |\n|---|---|---|---|---|\n| Baseline (no cache) | ~1800 ms | 0% | 100% | 253 mAh |\n| Naive string cache | ~1600 ms | ~12% | ~88% | 222 mAh |\n| Fragment cache | ~350 ms | ~70% | ~35% | 88 mAh |\n\nRun:\n\n```\ngo run ./benchmark/\n\n# Verbose per-query output:\nEDGESYNC_VERBOSE=1 go run ./benchmark/\n# Host build (benchmark only, no CGO):\ngo run ./benchmark/\n\n# Android ARM64 (with llama.cpp CGO):\nexport CGO_CFLAGS=\"-I/path/to/llama.cpp\"\nexport CGO_LDFLAGS=\"-L/path/to/llama.cpp/build -lllama -lm\"\nCGO_ENABLED=1 CC=aarch64-linux-gnu-gcc GOOS=linux GOARCH=arm64 \\\n    go build -o edgecache ./...\n\n# NEON cosine module (ARM64 only):\naarch64-linux-gnu-gcc -O3 -march=armv8.2-a+fp16 \\\n    -c core/cosine_neon.c -o core/cosine_neon.o\n```\n\n- **Cross-engine KV tensor reshape** (`adapter/reshape.go`) — transpose `[seq,heads,dim] ↔ [heads,seq,dim]` between llamacpp and ONNX Runtime; `CanInjectWithReshape()` handles detection and fallback automatically\n- **Fragment compaction** (`cache/compactor.go`) — deduplication by `ContentHash`, grouping by layer config, adjacency merge with per-engine tensor concatenation (axis 0 for llamacpp, axis 1 per-head for ONNX); merged embedding is a weighted normalized average\n- **Persistent fragment store** (`cache/store.go`) — two-tier storage: `sync.Map` hot cache + 
SQLite WAL for metadata; tensor blobs written as `<id>.keys.bin` / `<id>.vals.bin` to avoid SQLite page fragmentation; `QueryByTokenRange()` for prefix-range lookups\n- **Real embedding model** (`embedding/minilm.go`) — `ORTEncoder` runs all-MiniLM-L6-v2 (22 MB, 384-dim, ~8ms on Cortex-A55) via ONNX Runtime; `FallbackEncoder` (FNV-1a hash) activates automatically if the `.ort` model file is not found\n- **Android JNI bridge** (`sdk/android/EdgeSyncLLM.kt` + `sdk/android/jni_bridge.go`) — full rewrite exposing the `adapter/` package API: `nativeInitialize`, `nativeEmbed`, `nativeLookup`, `nativeInjectFragment`, `nativeGenerateFromPos`, `nativeExtractAndStore`, `nativeCompact`, `nativeReshapeFragment`\n\nThis project is licensed under the **Business Source License 1.1 (BUSL-1.1)** — see the [LICENSE](/bossandboss/EdgeSync-LLM/blob/main/LICENSE) file for details.\n\n| Parameter | Value |\n|---|---|\n| Licensor | bossandboss (Wajdi Kechaou) |\n| Licensed Work | EdgeSync-LLM (source, submodules, adapters, tensor engines, documentation) |\n| Additional Use Grant | Non-commercial use, research, evaluation, development, internal testing |\n| Change Date | July 1, 2029 |\n| Change License | GNU Affero General Public License v3.0 (AGPL-3.0) |\n\n**What this means in practice:**\n\n- ✅ Free for research, evaluation, development, and internal testing\n- ✅ Source code is readable and forkable\n- ❌ Production use in commercial apps, SaaS platforms, or mobile apps deployed to end-users requires a commercial license\n- 🔄 On July 1, 2029, the project automatically becomes AGPL-3.0\n\n**Commercial licensing:** open an issue at [github.com/bossandboss/EdgeSync-LLM](https://github.com/bossandboss/EdgeSync-LLM/issues) with the label `commercial-license`.", "url": "https://wpnews.pro/news/edgesync-llm-kv-cache-fragment-engine-for-on-device-llm-inference-go-android", "canonical_source": "https://github.com/bossandboss/EdgeSync-LLM", 
"published_at": "2026-06-30 14:10:22+00:00", "updated_at": "2026-06-30 14:20:25.057797+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-infrastructure", "ai-tools"], "entities": ["EdgeSync-LLM", "llama.cpp", "MLC-LLM", "ONNX Runtime", "ARM", "Android", "HNSW", "MiniLM-L6-v2"], "alternates": {"html": "https://wpnews.pro/news/edgesync-llm-kv-cache-fragment-engine-for-on-device-llm-inference-go-android", "markdown": "https://wpnews.pro/news/edgesync-llm-kv-cache-fragment-engine-for-on-device-llm-inference-go-android.md", "text": "https://wpnews.pro/news/edgesync-llm-kv-cache-fragment-engine-for-on-device-llm-inference-go-android.txt", "jsonld": "https://wpnews.pro/news/edgesync-llm-kv-cache-fragment-engine-for-on-device-llm-inference-go-android.jsonld"}}