[ARTICLE · art-44944] src=github.com ↗ pub= topic=large-language-models verified=true sentiment=· neutral

EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android)

EdgeSync-LLM, a new KV cache fragment engine for on-device LLM inference, stores and retrieves transformer KV tensors via HNSW approximate nearest-neighbor search, enabling exact hits at ~8ms TTFT and partial hits at ~280ms TTFT. Designed for ARM64 Android and compatible with llama.cpp, MLC-LLM, and ONNX Runtime, it skips the most expensive prefill step by injecting cached fragments directly into the engine's KV cache.

6 min read · published Jun 30, 2026

An engine-agnostic KV cache fragment system for on-device LLM inference. Designed for ARM64 Android (Cortex-A55/A78), portable to any platform running llama.cpp, MLC-LLM, or ONNX Runtime.

EdgeSync-LLM is a reusable KV cache layer that sits between the application and the LLM engine. Instead of re-running the full prefill on every request, it stores slices of the attention KV tensors (Keys and Values), retrieves them via approximate nearest-neighbor search (HNSW), and injects them directly into the engine's KV cache, skipping the most expensive part of inference.

This is not a "semantic cache" that stores response strings. It stores the actual transformer KV tensors, identified by token range and layer range, and reconstructs them at request time.

              [ PROMPT ]
                  │
                  ▼
         [ Embedding Model ]        MiniLM-L6-v2 (384-dim, ~8ms CPU)
                  │
                  ▼
           [ HNSW Index ]           Pure Go, M=16, efSearch=50
                  │
          ┌───────┴───────────────────────────┐
          │                                   │
    sim ≥ 0.92                          0.75 ≤ sim < 0.92       sim < 0.75
          │                                   │                      │
   ┌──────▼──────┐                  ┌─────────▼──────┐      ┌───────▼────────┐
   │ EXACT HIT   │                  │  PARTIAL HIT   │      │     MISS       │
   │             │                  │                │      │                │
   │ Inject KV   │                  │ Inject prefix  │      │ Full prefill   │
   │ fragment    │                  │ Generate delta │      │ Extract frag.  │
   │ ~8ms TTFT   │                  │ ~280ms TTFT    │      │ Store in HNSW  │
   └─────────────┘                  └────────────────┘      └────────────────┘
          │                                   │                      │
          └───────────────────────────────────┴──────────────────────┘
                                              │
                                     [ KVAdapter Layer ]
                                              │
                         ┌────────────────────┼─────────────────────┐
                         ▼                    ▼                     ▼
                  [ llamacpp ]           [ mlc-llm ]         [ onnx runtime ]
                 (GGML tensor API)    (TVM paged KV)      (past_key_values)
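
The three-way routing in the diagram can be sketched in a few lines of Go. This is only an illustration of the threshold logic: the thresholds 0.92 and 0.75 come from the diagram, while the names `Route` and `Classify` are hypothetical and are not taken from cache/differential.go.

```go
package main

import "fmt"

// Route is the outcome of a cache lookup (hypothetical type name).
type Route int

const (
	Miss Route = iota
	PartialHit
	ExactHit
)

// Similarity thresholds as shown in the diagram.
const (
	exactThreshold   = 0.92
	partialThreshold = 0.75
)

func (r Route) String() string {
	switch r {
	case ExactHit:
		return "EXACT"
	case PartialHit:
		return "PARTIAL"
	default:
		return "MISS"
	}
}

// Classify maps the best cosine similarity returned by the index to a route.
func Classify(sim float64) Route {
	switch {
	case sim >= exactThreshold:
		return ExactHit
	case sim >= partialThreshold:
		return PartialHit
	default:
		return Miss
	}
}

func main() {
	for _, s := range []float64{0.97, 0.92, 0.80, 0.75, 0.40} {
		fmt.Printf("sim=%.2f -> %s\n", s, Classify(s))
	}
}
```

Both boundaries are inclusive on the upper route, matching `sim ≥ 0.92` and `0.75 ≤ sim < 0.92` in the diagram.
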
├── cache/
│   ├── fragment.go          ← KVFragment: formal definition of a cache unit
│   │                          (dimensions, TTL, eviction policy, storage key)
│   ├── differential.go      ← DifferentialEngine: EXACT / PARTIAL / MISS router
│   └── schema.go            ← SQLite WAL schema for fragment metadata
│
├── adapter/
│   ├── interface.go         ← KVAdapter: engine-agnostic contract
│   │                          (ExtractFragment / InjectFragment / Generate)
│   ├── llamacpp.go          ← llama.cpp adapter (GGML tensor API, CGO)
│   ├── mlc.go               ← MLC-LLM adapter (TVM paged KV, mlc4j)
│   └── onnx.go              ← ONNX Runtime adapter (past_key_values)
│
├── core/
│   ├── hnsw.go              ← Pure Go HNSW index (M=16, efSearch=50)
│   └── cosine_neon.c        ← ARM NEON fp16 cosine similarity
│
├── sdk/android/
│   └── EdgeSyncLLM.kt       ← Kotlin JNI bridge (suspend coroutines)
│
├── monitor/
│   └── energy_android.go    ← Android /sys/class/power_supply/ profiler
│
├── prefetch/
│   └── predictor.go         ← N-gram prefetch predictor (top-3 candidates)
│
└── benchmark/
    └── runner.go            ← Falsifiable benchmark: 3 modes × 1000 requests

The atomic unit of the cache is the KVFragment, formally defined in cache/fragment.go.

| Field | Type | Meaning |
|---|---|---|
| TokenStart / TokenEnd | int | Token range covered [start, end) |
| LayerStart / LayerEnd | int | Transformer layers captured |
| LayerStride | int | Sampling interval (2 = every other layer) |
| Keys / Values | []byte | Raw attention tensors (engine-serialized) |
| TokenIDs | []int32 | Input tokens → used to verify prefix |
| ContentHash | string | SHA-256 of TokenIDs (not tensors) |
| EmbeddingVector | []float32 | 384-dim semantic vector for HNSW lookup |
| ExpiresAt | time.Time | TTL: 30 min (session) → 7 days (promoted) |
| HitCount | int | Auto-promotes at hit ≥ 5 |
| Engine | string | "llamacpp" / "mlc" / "onnx" |
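
Two of the field semantics above, ContentHash and the HitCount-driven TTL promotion, can be sketched as follows. This is a minimal illustration under stated assumptions: the little-endian byte encoding of token IDs, the choice to refresh the 7-day TTL on every hit at or past the threshold, and the names `fragmentMeta`, `contentHash` and `recordHit` are all hypothetical and not confirmed by the source.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"time"
)

const (
	sessionTTL       = 30 * time.Minute   // TTL for a fresh fragment
	promotedTTL      = 7 * 24 * time.Hour // TTL after promotion
	promoteThreshold = 5                  // hits needed for promotion
)

// fragmentMeta holds only the fields needed for this sketch.
type fragmentMeta struct {
	TokenIDs    []int32
	ContentHash string
	HitCount    int
	ExpiresAt   time.Time
}

// contentHash returns the hex SHA-256 of the token IDs.
// Assumption: each ID is encoded as 4 little-endian bytes.
func contentHash(tokenIDs []int32) string {
	buf := make([]byte, 4*len(tokenIDs))
	for i, id := range tokenIDs {
		binary.LittleEndian.PutUint32(buf[4*i:], uint32(id))
	}
	sum := sha256.Sum256(buf)
	return hex.EncodeToString(sum[:])
}

// newMeta creates session-scoped metadata expiring 30 minutes after now.
func newMeta(tokenIDs []int32, now time.Time) *fragmentMeta {
	return &fragmentMeta{
		TokenIDs:    tokenIDs,
		ContentHash: contentHash(tokenIDs),
		ExpiresAt:   now.Add(sessionTTL),
	}
}

// recordHit counts a hit and, once the threshold is reached,
// extends the expiry to 7 days from now (assumed refresh policy).
func (m *fragmentMeta) recordHit(now time.Time) {
	m.HitCount++
	if m.HitCount >= promoteThreshold {
		m.ExpiresAt = now.Add(promotedTTL)
	}
}

func main() {
	now := time.Now()
	m := newMeta([]int32{101, 2023, 2003}, now)
	for i := 0; i < promoteThreshold; i++ {
		m.recordHit(now)
	}
	fmt.Println(m.ContentHash[:12], m.HitCount, m.ExpiresAt.Sub(now))
}
```

Hashing the token IDs rather than the tensors keeps the key cheap to compute and independent of the engine's serialization format.
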

Invariants enforced at construction:

TokenSpan ∈ [64, 2048] tokens

LayerEnd ≤ model.NumLayers

len(TokenIDs) == TokenSpan

len(Keys) > 0 && len(Values) > 0

LayerStride ≥ 1
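
The invariants above could be enforced in a constructor along these lines. This is a sketch, not the code in cache/fragment.go: it assumes TokenSpan means TokenEnd − TokenStart, keeps only the fields the checks need, and uses a hypothetical `NewKVFragment` signature that takes the model's layer count as a parameter.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	minTokenSpan = 64
	maxTokenSpan = 2048
)

// KVFragment is reduced here to the fields the invariants refer to.
type KVFragment struct {
	TokenStart, TokenEnd int
	LayerStart, LayerEnd int
	LayerStride          int
	Keys, Values         []byte
	TokenIDs             []int32
}

// TokenSpan is the number of tokens covered by [TokenStart, TokenEnd).
func (f *KVFragment) TokenSpan() int { return f.TokenEnd - f.TokenStart }

// NewKVFragment validates f against the listed invariants and
// returns a pointer to a copy on success.
func NewKVFragment(f KVFragment, numLayers int) (*KVFragment, error) {
	span := f.TokenSpan()
	switch {
	case span < minTokenSpan || span > maxTokenSpan:
		return nil, fmt.Errorf("token span %d outside [%d, %d]", span, minTokenSpan, maxTokenSpan)
	case f.LayerEnd > numLayers:
		return nil, fmt.Errorf("LayerEnd %d exceeds model layers %d", f.LayerEnd, numLayers)
	case len(f.TokenIDs) != span:
		return nil, fmt.Errorf("len(TokenIDs)=%d, want %d", len(f.TokenIDs), span)
	case len(f.Keys) == 0 || len(f.Values) == 0:
		return nil, errors.New("Keys and Values must be non-empty")
	case f.LayerStride < 1:
		return nil, errors.New("LayerStride must be >= 1")
	}
	return &f, nil
}

func main() {
	_, err := NewKVFragment(KVFragment{TokenEnd: 10}, 12)
	fmt.Println("short span:", err)
}
```
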

The KVAdapter contract is defined in adapter/interface.go. Any engine implements 6 methods:

ExtractFragment(ctx, tokenIDs, layerStart, layerEnd, layerStride, embedding)
    → *KVFragment, error

InjectFragment(ctx, fragment)
    → error

Generate(ctx, prompt, startTokenPos, maxTokens)
    → text string, tokensGenerated int, error

Tokenize(ctx, text)
    → []int32, error

ClearKVCache(ctx)
    → error

Close()
    → error
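
Written out as a Go interface, the six methods listed above might look like the sketch below. The parameter types are inferred from the listing and are assumptions, as are the reduced `KVFragment` struct, the `stubAdapter` (a toy engine with one token per byte and placeholder tensors) and the `storeOnMiss` helper, none of which come from the project.

```go
package main

import (
	"context"
	"fmt"
)

// KVFragment is reduced to a few fields for this sketch.
type KVFragment struct {
	Engine       string
	TokenIDs     []int32
	Keys, Values []byte
}

// KVAdapter mirrors the six methods listed above (types are assumed).
type KVAdapter interface {
	ExtractFragment(ctx context.Context, tokenIDs []int32, layerStart, layerEnd, layerStride int, embedding []float32) (*KVFragment, error)
	InjectFragment(ctx context.Context, fragment *KVFragment) error
	Generate(ctx context.Context, prompt string, startTokenPos, maxTokens int) (text string, tokensGenerated int, err error)
	Tokenize(ctx context.Context, text string) ([]int32, error)
	ClearKVCache(ctx context.Context) error
	Close() error
}

// stubAdapter is a toy engine used only to show the contract compiles.
type stubAdapter struct{ injected *KVFragment }

var _ KVAdapter = (*stubAdapter)(nil) // compile-time interface check

func (s *stubAdapter) ExtractFragment(_ context.Context, ids []int32, _, _, _ int, _ []float32) (*KVFragment, error) {
	// Placeholder tensors: a real engine would serialize its KV cache here.
	return &KVFragment{Engine: "stub", TokenIDs: ids, Keys: []byte{0}, Values: []byte{0}}, nil
}

func (s *stubAdapter) InjectFragment(_ context.Context, f *KVFragment) error {
	s.injected = f
	return nil
}

func (s *stubAdapter) Generate(_ context.Context, _ string, _, _ int) (string, int, error) {
	return "", 0, nil
}

func (s *stubAdapter) Tokenize(_ context.Context, text string) ([]int32, error) {
	ids := make([]int32, len(text))
	for i := 0; i < len(text); i++ {
		ids[i] = int32(text[i]) // toy tokenizer: one token per byte
	}
	return ids, nil
}

func (s *stubAdapter) ClearKVCache(context.Context) error { s.injected = nil; return nil }
func (s *stubAdapter) Close() error                       { return nil }

// storeOnMiss tokenizes a prompt and extracts a fragment from it.
// Layer range 0..12 and stride 1 are placeholder values.
func storeOnMiss(a KVAdapter, prompt string) (*KVFragment, error) {
	ctx := context.Background()
	ids, err := a.Tokenize(ctx, prompt)
	if err != nil {
		return nil, err
	}
	return a.ExtractFragment(ctx, ids, 0, 12, 1, nil)
}

func main() {
	frag, err := storeOnMiss(&stubAdapter{}, "hello")
	fmt.Println(len(frag.TokenIDs), frag.Engine, err)
}
```
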

Cross-engine reuse: engine B can inject a fragment from engine A if and only if B lists A in CompatibleWith(). Current compatibility matrix:

Producer → Consumer llamacpp mlc onnx
llamacpp
mlc
onnx

Cross-engine reuse (e.g. llamacpp → onnx) requires a KV tensor reshape adapter (transpose [seq, heads, dim] ↔ [heads, seq, dim]). Not implemented yet.
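
To make the layout change concrete, here is a minimal sketch of swapping the two leading axes of a row-major tensor. It works on plain float32 values for readability; the real fragments are engine-serialized bytes, possibly quantized, so an actual adapter would need to decode them first. The function name `swapLeadingAxes` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// swapLeadingAxes converts a row-major [a, b, dim] tensor into [b, a, dim].
// With a=seq and b=heads it maps [seq, heads, dim] to [heads, seq, dim];
// calling it again with a and b exchanged restores the original layout.
func swapLeadingAxes(src []float32, a, b, dim int) ([]float32, error) {
	if a <= 0 || b <= 0 || dim <= 0 || len(src) != a*b*dim {
		return nil, errors.New("shape does not match tensor length")
	}
	dst := make([]float32, len(src))
	for i := 0; i < a; i++ {
		for j := 0; j < b; j++ {
			from := (i*b + j) * dim // offset of row [i][j] in src
			to := (j*a + i) * dim   // offset of row [j][i] in dst
			copy(dst[to:to+dim], src[from:from+dim])
		}
	}
	return dst, nil
}

func main() {
	// seq=2, heads=3, dim=2
	src := []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	dst, _ := swapLeadingAxes(src, 2, 3, 2)
	fmt.Println(dst)
}
```

Copying whole `dim`-length rows keeps the innermost axis contiguous, which is the part that matters for memory bandwidth on mobile hardware.
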

The benchmark in benchmark/runner.go compares 3 modes over 1000 requests drawn from 8 semantic prompt clusters (64 unique prompts + 4 variants each).

Timing model (not ad-hoc random ranges — derived from Cortex-A55 measurements):

| Constant | Value | Source |
|---|---|---|
| Prefill | 6.8 ms/token | llama.cpp bench, Snapdragon 685 |
| Generate | 18.4 ms/token | same |
| HNSW search | 3.2 ms | N=1000, efSearch=50 |
| Fragment inject | 0.029 ms/MB | LPDDR4X bandwidth |
| Fragment size | ~6 MB | 128 tokens, 12 layers, Q4_K_M |
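
The constants above can be combined into a back-of-the-envelope comparison. The sketch below only multiplies out the table values: for a 128-token prompt, full prefill costs 128 × 6.8 ms ≈ 870 ms, while an index search plus injection of a 6 MB fragment costs 3.2 + 6 × 0.029 ≈ 3.4 ms. It leaves out embedding time and the first decode step, so it is not a reproduction of the benchmark's TTFT figures; the function names are hypothetical.

```go
package main

import "fmt"

// Timing constants from the table above (milliseconds).
const (
	prefillMsPerToken  = 6.8
	generateMsPerToken = 18.4
	hnswSearchMs       = 3.2
	injectMsPerMB      = 0.029
)

// fullPrefillMs is the prefill cost of a prompt with the given token count.
func fullPrefillMs(tokens int) float64 {
	return float64(tokens) * prefillMsPerToken
}

// searchAndInjectMs is the cost of one index search plus injecting
// a fragment of the given size in megabytes.
func searchAndInjectMs(fragmentMB float64) float64 {
	return hnswSearchMs + fragmentMB*injectMsPerMB
}

func main() {
	fmt.Printf("prefill, 128 tokens:   %.1f ms\n", fullPrefillMs(128))
	fmt.Printf("search + inject, 6 MB: %.3f ms\n", searchAndInjectMs(6))
}
```
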

Expected results:

| Mode | Avg TTFT | Hit rate | Mem BW | Energy |
|---|---|---|---|---|
| Baseline (no cache) | ~1800 ms | 0% | 100% | 253 mAh |
| Naive string cache | ~1600 ms | ~12% | ~88% | 222 mAh |
| Fragment cache | ~350 ms | ~70% | ~35% | 88 mAh |

Run:

go run ./benchmark/

Run with verbose output:

EDGESYNC_VERBOSE=1 go run ./benchmark/

Cross-compile for ARM64 with the llama.cpp adapter (CGO):

export CGO_CFLAGS="-I/path/to/llama.cpp"
export CGO_LDFLAGS="-L/path/to/llama.cpp/build -lllama -lm"
CGO_ENABLED=1 CC=aarch64-linux-gnu-gcc GOOS=linux GOARCH=arm64 \
    go build -o edgecache ./...

Compile the NEON cosine kernel:

aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+fp16 \
    -c core/cosine_neon.c -o core/cosine_neon.o

  • Cross-engine KV tensor reshape (adapter/reshape.go): transpose [seq,heads,dim] ↔ [heads,seq,dim] between llamacpp and ONNX Runtime; CanInjectWithReshape() handles detection and fallback automatically
  • Fragment compaction (cache/compactor.go): deduplication by ContentHash, grouping by layer config, adjacency merge with per-engine tensor concatenation (axis 0 for llamacpp, axis 1 per-head for ONNX); merged embedding is a weighted normalized average
  • Persistent fragment store (cache/store.go): two-tier storage with a sync.Map hot cache plus SQLite WAL for metadata; tensor blobs written as <id>.keys.bin / <id>.vals.bin to avoid SQLite page fragmentation; QueryByTokenRange() for prefix-range lookups
  • Real embedding model (embedding/minilm.go): ORTEncoder runs all-MiniLM-L6-v2 (22 MB, 384-dim, ~8ms on Cortex-A55) via ONNX Runtime; FallbackEncoder (FNV-1a hash) activates automatically if the .ort model file is not found
  • Android JNI bridge (sdk/android/EdgeSyncLLM.kt + sdk/android/jni_bridge.go): full rewrite exposing the adapter/ package API: nativeInitialize, nativeEmbed, nativeLookup, nativeInjectFragment, nativeGenerateFromPos, nativeExtractAndStore, nativeCompact, nativeReshapeFragment
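
The "weighted normalized average" used for merged embeddings during compaction can be illustrated as follows. This is a sketch under assumptions: the source does not say what the weights are (token span is one plausible choice), and `mergeEmbeddings` is a hypothetical name, not the function in cache/compactor.go.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// mergeEmbeddings returns the weighted sum of a and b rescaled to unit
// length, so the result can be used directly for cosine-similarity search.
// The weights wa and wb are assumed to be non-negative, e.g. token spans.
func mergeEmbeddings(a, b []float32, wa, wb float64) ([]float32, error) {
	if len(a) == 0 || len(a) != len(b) {
		return nil, errors.New("embeddings must be non-empty and of equal length")
	}
	if wa < 0 || wb < 0 || wa+wb == 0 {
		return nil, errors.New("weights must be non-negative and not both zero")
	}
	sum := make([]float64, len(a))
	var sq float64
	for i := range a {
		sum[i] = wa*float64(a[i]) + wb*float64(b[i])
		sq += sum[i] * sum[i]
	}
	if sq == 0 {
		return nil, errors.New("merged vector has zero norm")
	}
	inv := 1 / math.Sqrt(sq)
	out := make([]float32, len(a))
	for i, v := range sum {
		out[i] = float32(v * inv)
	}
	return out, nil
}

func main() {
	a := []float32{1, 0}
	b := []float32{0, 1}
	merged, err := mergeEmbeddings(a, b, 3, 1) // a covers three times as many tokens
	fmt.Println(merged, err)
}
```

Because the result is renormalized, dividing by the total weight first would not change the output; only the ratio of the two weights matters.
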

This project is licensed under the Business Source License 1.1 (BUSL-1.1) — see the LICENSE file for details.

| Parameter | Value |
|---|---|
| Licensor | bossandboss (Wajdi Kechaou) |
| Licensed Work | EdgeSync-LLM (source, submodules, adapters, tensor engines, documentation) |
| Additional Use Grant | Non-commercial use, research, evaluation, development, internal testing |
| Change Date | July 1, 2029 |
| Change License | GNU Affero General Public License v3.0 (AGPL-3.0) |

What this means in practice:

  • ✅ Free for research, evaluation, development, and internal testing
  • ✅ Source code is readable and forkable
  • ❌ Production use in commercial apps, SaaS platforms, or mobile apps deployed to end-users requires a commercial license
  • 🔄 On July 1, 2029, the project automatically becomes AGPL-3.0

Commercial licensing: open an issue at github.com/bossandboss/EdgeSync-LLM with the label commercial-license.
