EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android)

EdgeSync-LLM, a new KV cache fragment engine for on-device LLM inference, stores and retrieves transformer KV tensors via HNSW approximate nearest-neighbor search, enabling exact hits at ~8ms TTFT and partial hits at ~280ms TTFT. Designed for ARM64 Android and compatible with llama.cpp, MLC-LLM, and ONNX Runtime, it skips the most expensive prefill step by injecting cached fragments directly into the engine's KV cache.

An engine-agnostic KV cache fragment system for on-device LLM inference. Designed for ARM64 Android (Cortex-A55/A78), portable to any platform running llama.cpp, MLC-LLM, or ONNX Runtime.

A reusable KV cache layer that sits between the application and the LLM engine. Instead of re-running the full prefill on every request, it stores slices of the attention KV tensors (Keys and Values), retrieves them via approximate nearest-neighbor search (HNSW), and injects them directly into the engine's KV cache, skipping the most expensive part of inference.

This is not a "semantic cache" that stores response strings. It stores the actual transformer KV tensors, identified by token range and layer range, and reconstructs them at request time.

                           PROMPT
                              │
                              ▼
      Embedding Model (MiniLM-L6-v2, 384-dim, ~8ms CPU)
                              │
                              ▼
           HNSW Index (Pure Go, M=16, efSearch=50)
                              │
           ┌──────────────────┼────────────────┐
           │                  │                │
      sim ≥ 0.92      0.75 ≤ sim < 0.92   sim < 0.75
           │                  │                │
    ┌──────▼──────┐ ┌─────────▼──────┐ ┌───────▼────────┐
    │ EXACT HIT   │ │ PARTIAL HIT    │ │ MISS           │
    │             │ │                │ │                │
    │ Inject KV   │ │ Inject prefix  │ │ Full prefill   │
    │ fragment    │ │ Generate delta │ │ Extract frag.  │
    │ ~8ms TTFT   │ │ ~280ms TTFT    │ │ Store in HNSW  │
    └──────┬──────┘ └─────────┬──────┘ └───────┬────────┘
           │                  │                │
           └──────────────────┬────────────────┘
                              │
                       KVAdapter Layer
                              │
           ┌──────────────────┼────────────────┐
           ▼                  ▼                ▼
       llamacpp            mlc-llm       onnx runtime
    GGML tensor API     TVM paged KV    past_key_values

    ├── cache/
    │   ├── fragment.go       ← KVFragment: formal definition of a cache unit
    │   │                       (dimensions, TTL, eviction policy, storage key)
    │   ├── differential.go   ← DifferentialEngine: EXACT / PARTIAL / MISS router
    │   └── schema.go         ← SQLite WAL schema for fragment metadata
    │
    ├── adapter/
    │   ├── interface.go      ← KVAdapter: engine-agnostic contract
    │   │                       (ExtractFragment / InjectFragment / Generate)
    │   ├── llamacpp.go       ← llama.cpp adapter (GGML tensor API, CGO)
    │   ├── mlc.go            ← MLC-LLM adapter (TVM paged KV, mlc4j)
    │   └── onnx.go           ← ONNX Runtime adapter (past_key_values)
    │
    ├── core/
    │   ├── hnsw.go           ← Pure Go HNSW index (M=16, efSearch=50)
    │   └── cosine_neon.c     ← ARM NEON fp16 cosine similarity
    │
    ├── sdk/android/
    │   └── EdgeSyncLLM.kt    ← Kotlin JNI bridge (suspend coroutines)
    │
    ├── monitor/
    │   └── energy_android.go ← Android /sys/class/power_supply/ profiler
    │
    ├── prefetch/
    │   └── predictor.go      ← N-gram prefetch predictor (top-3 candidates)
    │
    └── benchmark/
        └── runner.go         ← Falsifiable benchmark: 3 modes × 1000 requests

`KVFragment` is the atomic unit of the cache, formally defined in `cache/fragment.go`.
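As a rough illustration of what such a cache unit could look like in Go, the sketch below declares a struct using the field names documented in this README, together with a `Validate` method covering the construction-time invariants. The half-open token range, the little-endian encoding fed to SHA-256, and the `Validate` / `HashTokenIDs` names are assumptions made for this example; they are not taken from `cache/fragment.go`.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

// KVFragment mirrors the documented fields; the exact layout is illustrative.
type KVFragment struct {
	TokenStart, TokenEnd int // assumed half-open range [TokenStart, TokenEnd)
	LayerStart, LayerEnd int
	LayerStride          int
	Keys, Values         []byte
	TokenIDs             []int32
	ContentHash          string
	EmbeddingVector      []float32
	ExpiresAt            time.Time
	HitCount             int
	Engine               string
}

const (
	minTokenSpan = 64
	maxTokenSpan = 2048
)

// HashTokenIDs returns the hex SHA-256 of the token IDs.
// Serializing each ID as a little-endian uint32 is an assumption.
func HashTokenIDs(ids []int32) string {
	buf := make([]byte, 4*len(ids))
	for i, id := range ids {
		binary.LittleEndian.PutUint32(buf[4*i:], uint32(id))
	}
	sum := sha256.Sum256(buf)
	return hex.EncodeToString(sum[:])
}

// Validate checks the documented invariants for a model with numLayers layers.
func (f *KVFragment) Validate(numLayers int) error {
	span := f.TokenEnd - f.TokenStart
	switch {
	case span < minTokenSpan || span > maxTokenSpan:
		return fmt.Errorf("token span %d outside [%d, %d]", span, minTokenSpan, maxTokenSpan)
	case f.LayerEnd > numLayers:
		return fmt.Errorf("LayerEnd %d exceeds model layers %d", f.LayerEnd, numLayers)
	case len(f.TokenIDs) != span:
		return fmt.Errorf("len(TokenIDs)=%d, want %d", len(f.TokenIDs), span)
	case len(f.Keys) == 0 || len(f.Values) == 0:
		return errors.New("Keys and Values must be non-empty")
	case f.LayerStride < 1:
		return fmt.Errorf("LayerStride %d must be >= 1", f.LayerStride)
	}
	return nil
}

func main() {
	ids := make([]int32, 128)
	f := &KVFragment{
		TokenStart: 0, TokenEnd: 128,
		LayerStart: 0, LayerEnd: 12, LayerStride: 2,
		Keys: []byte{1}, Values: []byte{1},
		TokenIDs:    ids,
		ContentHash: HashTokenIDs(ids),
		ExpiresAt:   time.Now().Add(30 * time.Minute), // session TTL
		Engine:      "llamacpp",
	}
	fmt.Println(f.Validate(32)) // <nil>

	f.LayerStride = 0
	fmt.Println(f.Validate(32)) // reports the stride violation
}
```

Hashing the token IDs rather than the tensors keeps the identity of a fragment independent of how a given engine serializes `Keys` and `Values`.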
| Field | Type | Meaning |
|---|---|---|
| `TokenStart` / `TokenEnd` | `int` | Token range covered (start and end) |
| `LayerStart` / `LayerEnd` | `int` | Transformer layers captured |
| `LayerStride` | `int` | Sampling interval (2 = every other layer) |
| `Keys` / `Values` | `[]byte` | Raw attention tensors (engine-serialized) |
| `TokenIDs` | `[]int32` | Input tokens, used to verify the prefix |
| `ContentHash` | `string` | SHA-256 of `TokenIDs` (not of the tensors) |
| `EmbeddingVector` | `[]float32` | 384-dim semantic vector for HNSW lookup |
| `ExpiresAt` | `time.Time` | TTL: 30 min (session) → 7 days (promoted) |
| `HitCount` | `int` | Auto-promotes at hit count ≥ 5 |
| `Engine` | `string` | `"llamacpp"` / `"mlc"` / `"onnx"` |

Invariants enforced at construction:

    TokenSpan ∈ [64, 2048] tokens
    LayerEnd ≤ model.NumLayers
    len(TokenIDs) == TokenSpan
    len(Keys) > 0 && len(Values) > 0
    LayerStride ≥ 1

The `KVAdapter` contract is defined in `adapter/interface.go`. Any engine implements 6 methods:

    ExtractFragment(ctx, tokenIDs, layerStart, layerEnd, layerStride, embedding) → (KVFragment, error)
    InjectFragment(ctx, fragment) → error
    Generate(ctx, prompt, startTokenPos, maxTokens) → (text string, tokensGenerated int, error)
    Tokenize(ctx, text) → ([]int32, error)
    ClearKVCache(ctx) → error
    Close() → error

Cross-engine reuse: engine B can inject a fragment from engine A if and only if B lists A in `CompatibleWith`. Current compatibility matrix:

| Producer → Consumer | llamacpp | mlc | onnx |
|---|---|---|---|
| llamacpp | ✓ | - | - |
| mlc | - | ✓ | - |
| onnx | - | - | ✓ |

Cross-engine reuse (e.g. llamacpp → onnx) requires a KV tensor reshape adapter (transpose [seq, heads, dim] → [heads, seq, dim]). Not implemented yet.

The benchmark in `benchmark/runner.go` compares 3 modes over 1000 requests drawn from 8 semantic prompt clusters (64 unique prompts + 4 variants each).
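To make the benchmark's inputs more concrete, here is a minimal sketch of how a simulated request might be routed and costed. It uses the similarity thresholds from the request-flow diagram (0.92 and 0.75) and the constants from the README's timing-model table. The cost formula itself (search, plus inject, plus prefill of the uncached tokens) and the names `Classify` and `EstimateTTFTMs` are assumptions for illustration. The README does not spell out the formula used in `benchmark/runner.go`, and this sketch ignores the embedding step and the first decode step, so it is not meant to reproduce the TTFT figures quoted elsewhere in this document.

```go
package main

import "fmt"

// Outcome is the routing decision for one request.
type Outcome int

const (
	Miss Outcome = iota
	PartialHit
	ExactHit
)

// Similarity thresholds from the request-flow diagram.
const (
	exactThreshold   = 0.92
	partialThreshold = 0.75
)

// Constants from the timing-model table.
const (
	prefillMsPerToken = 6.8
	hnswSearchMs      = 3.2
	injectMsPerMB     = 0.029
	fragmentSizeMB    = 6.0
)

// Classify maps a cosine similarity to EXACT / PARTIAL / MISS.
func Classify(sim float64) Outcome {
	switch {
	case sim >= exactThreshold:
		return ExactHit
	case sim >= partialThreshold:
		return PartialHit
	default:
		return Miss
	}
}

// EstimateTTFTMs is a hypothetical cost model: every request pays the HNSW
// search, hits pay the fragment injection, and any token not covered by a
// cached fragment pays the per-token prefill cost.
func EstimateTTFTMs(o Outcome, promptTokens, cachedTokens int) float64 {
	inject := injectMsPerMB * fragmentSizeMB
	switch o {
	case ExactHit:
		return hnswSearchMs + inject
	case PartialHit:
		remaining := promptTokens - cachedTokens
		if remaining < 0 {
			remaining = 0
		}
		return hnswSearchMs + inject + prefillMsPerToken*float64(remaining)
	default:
		return hnswSearchMs + prefillMsPerToken*float64(promptTokens)
	}
}

func main() {
	// Three example similarities for a 128-token prompt with 88 cached tokens.
	for _, sim := range []float64{0.95, 0.80, 0.40} {
		o := Classify(sim)
		fmt.Printf("sim=%.2f outcome=%d ttft=%.1f ms\n", sim, o, EstimateTTFTMs(o, 128, 88))
	}
}
```

Keeping the thresholds and timing constants as named values makes it straightforward to check how sensitive an estimate is to any single measurement.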
Timing model (not ad-hoc random ranges; derived from Cortex-A55 measurements):

| Constant | Value | Source |
|---|---|---|
| Prefill | 6.8 ms/token | llama.cpp bench, Snapdragon 685 |
| Generate | 18.4 ms/token | same |
| HNSW search | 3.2 ms | N=1000, efSearch=50 |
| Fragment inject | 0.029 ms/MB | LPDDR4X bandwidth |
| Fragment size | ~6 MB | 128 tokens, 12 layers, Q4_K_M |

Expected results:

| Mode | Avg TTFT | Hit rate | Mem BW | Energy |
|---|---|---|---|---|
| Baseline (no cache) | ~1800 ms | 0% | 100% | 253 mAh |
| Naive string cache | ~1600 ms | ~12% | ~88% | 222 mAh |
| Fragment cache | ~350 ms | ~70% | ~35% | 88 mAh |

Run:

    go run ./benchmark/

Verbose per-query output:

    EDGESYNC_VERBOSE=1 go run ./benchmark/

Host build (benchmark only, no CGO):

    go run ./benchmark/

Android ARM64 (with llama.cpp CGO):

    export CGO_CFLAGS="-I/path/to/llama.cpp"
    export CGO_LDFLAGS="-L/path/to/llama.cpp/build -lllama -lm"
    CGO_ENABLED=1 CC=aarch64-linux-gnu-gcc GOOS=linux GOARCH=arm64 \
      go build -o edgecache ./...
NEON cosine module (ARM64 only):

    aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+fp16 \
      -c core/cosine_neon.c -o core/cosine_neon.o

- Cross-engine KV tensor reshape (`adapter/reshape.go`): transpose [seq, heads, dim] ↔ [heads, seq, dim] between llamacpp and ONNX Runtime; `CanInjectWithReshape` handles detection and fallback automatically
- Fragment compaction (`cache/compactor.go`): deduplication by `ContentHash`, grouping by layer config, adjacency merge with per-engine tensor concatenation (axis 0 for llamacpp, axis 1 per-head for ONNX); merged embedding is a weighted normalized average
- Persistent fragment store (`cache/store.go`): two-tier storage with a `sync.Map` hot cache + SQLite WAL for metadata; tensor blobs written as `<id>.keys.bin` / `<id>.vals.bin` to avoid SQLite page fragmentation; `QueryByTokenRange` for prefix-range lookups
- Real embedding model (`embedding/minilm.go`): `ORTEncoder` runs all-MiniLM-L6-v2 (22 MB, 384-dim, ~8ms on Cortex-A55) via ONNX Runtime; `FallbackEncoder` (FNV-1a hash) activates automatically if the `.ort` model file is not found
- Android JNI bridge (`sdk/android/EdgeSyncLLM.kt` + `sdk/android/jni_bridge.go`): full rewrite exposing the `adapter/` package API: `nativeInitialize`, `nativeEmbed`, `nativeLookup`, `nativeInjectFragment`, `nativeGenerateFromPos`, `nativeExtractAndStore`, `nativeCompact`, `nativeReshapeFragment`

This project is licensed under the Business Source License 1.1 (BUSL-1.1); see the [LICENSE](https://github.com/bossandboss/EdgeSync-LLM/blob/main/LICENSE) file for details.
| Parameter | Value |
|---|---|
| Licensor | bossandboss (Wajdi Kechaou) |
| Licensed Work | EdgeSync-LLM (source, submodules, adapters, tensor engines, documentation) |
| Additional Use Grant | Non-commercial use, research, evaluation, development, internal testing |
| Change Date | July 1, 2029 |
| Change License | GNU Affero General Public License v3.0 (AGPL-3.0) |

What this means in practice:

- ✅ Free for research, evaluation, development, and internal testing
- ✅ Source code is readable and forkable
- ❌ Production use in commercial apps, SaaS platforms, or mobile apps deployed to end-users requires a commercial license
- 🔄 On July 1, 2029, the project automatically becomes AGPL-3.0

Commercial licensing: open an issue at [github.com/bossandboss/EdgeSync-LLM](https://github.com/bossandboss/EdgeSync-LLM/issues) with the label `commercial-license`.