EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android)

An engine-agnostic KV cache fragment system for on-device LLM inference. Designed for ARM64 Android (Cortex-A55/A78), portable to any platform running llama.cpp, MLC-LLM, or ONNX Runtime.

A reusable KV cache layer that sits between the application and the LLM engine. Instead of re-running the full prefill on every request, it stores slices of the attention KV tensors (Keys and Values), retrieves them via approximate nearest-neighbor search (HNSW), and injects them directly into the engine's KV cache — skipping the most expensive part of inference.

This is not a "semantic cache" that stores response strings. It stores the actual transformer KV tensors, identified by token range and layer range, and reconstructs them at request time.

              [ PROMPT ]
                  │
                  ▼
         [ Embedding Model ]        MiniLM-L6-v2 (384-dim, ~8ms CPU)
                  │
                  ▼
           [ HNSW Index ]           Pure Go, M=16, efSearch=50
                  │
          ┌───────┴───────────────────────────┐
          │                                   │
    sim ≥ 0.92                          0.75 ≤ sim < 0.92       sim < 0.75
          │                                   │                      │
   ┌──────▼──────┐                  ┌─────────▼──────┐      ┌───────▼────────┐
   │ EXACT HIT   │                  │  PARTIAL HIT   │      │     MISS       │
   │             │                  │                │      │                │
   │ Inject KV   │                  │ Inject prefix  │      │ Full prefill   │
   │ fragment    │                  │ Generate delta │      │ Extract frag.  │
   │ ~8ms TTFT   │                  │ ~280ms TTFT    │      │ Store in HNSW  │
   └─────────────┘                  └────────────────┘      └────────────────┘
          │                                   │                      │
          └───────────────────────────────────┴──────────────────────┘
                                              │
                                     [ KVAdapter Layer ]
                                              │
                         ┌────────────────────┼─────────────────────┐
                         ▼                    ▼                     ▼
                  [ llamacpp ]           [ mlc-llm ]         [ onnx runtime ]
                 (GGML tensor API)    (TVM paged KV)      (past_key_values)
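
The three-way routing in the diagram can be sketched as a small classifier. This is a minimal illustration that only assumes the two similarity thresholds shown above (0.92 and 0.75); the names `Route` and `Classify` are hypothetical and are not taken from cache/differential.go.

```go
package main

import "fmt"

// Route is the cache decision taken for one prompt.
type Route int

const (
	Miss Route = iota
	PartialHit
	ExactHit
)

// Similarity thresholds as shown in the diagram above.
const (
	exactThreshold   = 0.92
	partialThreshold = 0.75
)

// Classify maps the best cosine similarity returned by the index to a route.
func Classify(sim float64) Route {
	switch {
	case sim >= exactThreshold:
		return ExactHit
	case sim >= partialThreshold:
		return PartialHit
	default:
		return Miss
	}
}

// String makes routes readable in logs.
func (r Route) String() string {
	switch r {
	case ExactHit:
		return "EXACT"
	case PartialHit:
		return "PARTIAL"
	default:
		return "MISS"
	}
}

func main() {
	for _, sim := range []float64{0.97, 0.80, 0.40} {
		fmt.Printf("sim=%.2f -> %s\n", sim, Classify(sim))
	}
}
```

Both boundaries are inclusive on the upper route, matching `sim ≥ 0.92` and `0.75 ≤ sim < 0.92` in the diagram.
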
├── cache/
│   ├── fragment.go          ← KVFragment: formal definition of a cache unit
│   │                          (dimensions, TTL, eviction policy, storage key)
│   ├── differential.go      ← DifferentialEngine: EXACT / PARTIAL / MISS router
│   └── schema.go            ← SQLite WAL schema for fragment metadata
│
├── adapter/
│   ├── interface.go         ← KVAdapter: engine-agnostic contract
│   │                          (ExtractFragment / InjectFragment / Generate)
│   ├── llamacpp.go          ← llama.cpp adapter (GGML tensor API, CGO)
│   ├── mlc.go               ← MLC-LLM adapter (TVM paged KV, mlc4j)
│   └── onnx.go              ← ONNX Runtime adapter (past_key_values)
│
├── core/
│   ├── hnsw.go              ← Pure Go HNSW index (M=16, efSearch=50)
│   └── cosine_neon.c        ← ARM NEON fp16 cosine similarity
│
├── sdk/android/
│   └── EdgeSyncLLM.kt       ← Kotlin JNI bridge (suspend coroutines)
│
├── monitor/
│   └── energy_android.go    ← Android /sys/class/power_supply/ profiler
│
├── prefetch/
│   └── predictor.go         ← N-gram prefetch predictor (top-3 candidates)
│
└── benchmark/
    └── runner.go            ← Falsifiable benchmark: 3 modes × 1000 requests

KVFragment is the atomic unit of the cache, formally defined in cache/fragment.go.

Field	Type	Meaning
`TokenStart / TokenEnd`	int	Token range covered `[start, end)`
`LayerStart / LayerEnd`	int	Transformer layers captured
`LayerStride`	int	Sampling interval (2 = every other layer)
`Keys / Values`	`[]byte`	Raw attention tensors (engine-serialized)
`TokenIDs`	`[]int32`	Input tokens, used to verify the prefix
`ContentHash`	string	SHA-256 of TokenIDs (not tensors)
`EmbeddingVector`	`[]float32`	384-dim semantic vector for HNSW lookup
`ExpiresAt`	time.Time	TTL: 30 min (session) → 7 days (promoted)
`HitCount`	int	Auto-promotes at hit ≥ 5
`Engine`	string	"llamacpp" / "mlc" / "onnx"

Invariants enforced at construction:

TokenSpan ∈ [64, 2048] tokens

LayerEnd ≤ model.NumLayers

len(TokenIDs) == TokenSpan

len(Keys) > 0 && len(Values) > 0

LayerStride ≥ 1
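
A constructor-time check for these invariants could look like the sketch below. The struct is reduced to the fields the invariants mention, `TokenSpan` is assumed to be `TokenEnd - TokenStart`, and the `Validate` method is a hypothetical name rather than the project's API.

```go
package main

import "fmt"

// KVFragment is reduced here to the fields the invariants refer to.
type KVFragment struct {
	TokenStart, TokenEnd int // token range [start, end)
	LayerStart, LayerEnd int
	LayerStride          int
	Keys, Values         []byte
	TokenIDs             []int32
}

// TokenSpan is assumed to be the length of the half-open token range.
func (f *KVFragment) TokenSpan() int { return f.TokenEnd - f.TokenStart }

// Validate checks the five invariants against a model with numLayers layers.
func (f *KVFragment) Validate(numLayers int) error {
	span := f.TokenSpan()
	switch {
	case span < 64 || span > 2048:
		return fmt.Errorf("token span %d outside [64, 2048]", span)
	case f.LayerEnd > numLayers:
		return fmt.Errorf("layer end %d exceeds model layers %d", f.LayerEnd, numLayers)
	case len(f.TokenIDs) != span:
		return fmt.Errorf("have %d token IDs, want %d", len(f.TokenIDs), span)
	case len(f.Keys) == 0 || len(f.Values) == 0:
		return fmt.Errorf("keys and values must be non-empty")
	case f.LayerStride < 1:
		return fmt.Errorf("layer stride %d must be >= 1", f.LayerStride)
	}
	return nil
}

func main() {
	f := KVFragment{
		TokenStart: 0, TokenEnd: 128,
		LayerStart: 0, LayerEnd: 12, LayerStride: 2,
		Keys: []byte{1}, Values: []byte{1},
		TokenIDs: make([]int32, 128),
	}
	fmt.Println(f.Validate(32)) // a fragment that satisfies all invariants
	f.LayerStride = 0
	fmt.Println(f.Validate(32)) // reports the stride violation
}
```
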

KVAdapter is defined in adapter/interface.go. Any engine implements 6 methods:

ExtractFragment(ctx, tokenIDs, layerStart, layerEnd, layerStride, embedding)
    → *KVFragment, error

InjectFragment(ctx, fragment)
    → error

Generate(ctx, prompt, startTokenPos, maxTokens)
    → text string, tokensGenerated int, error

Tokenize(ctx, text)
    → []int32, error

ClearKVCache(ctx)
    → error

Close()
    → error

Cross-engine reuse: engine B can inject a fragment from engine A if and only if B lists A in CompatibleWith(). Current compatibility matrix:

Producer → Consumer	llamacpp	mlc	onnx
llamacpp	✓	—	—
mlc	—	✓	—
onnx	—	—	✓

Cross-engine reuse (e.g. llamacpp → onnx) requires a KV tensor reshape adapter (transpose [seq, heads, dim] → [heads, seq, dim]). Not implemented yet.
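
The layout change itself is a plain axis swap. The sketch below shows it for a dense, row-major `[]float32` buffer, which is an assumption made for illustration: real fragments are engine-serialized bytes and may be quantized, so they would need decoding first. `TransposeSeqHeads` is a hypothetical name.

```go
package main

import "fmt"

// TransposeSeqHeads copies a row-major [seq, heads, dim] tensor into a new
// row-major [heads, seq, dim] tensor. The innermost dim axis stays contiguous,
// so each head vector is moved with a single copy.
func TransposeSeqHeads(src []float32, seq, heads, dim int) ([]float32, error) {
	if seq < 0 || heads < 0 || dim < 0 || len(src) != seq*heads*dim {
		return nil, fmt.Errorf("len(src)=%d does not match %d*%d*%d", len(src), seq, heads, dim)
	}
	dst := make([]float32, len(src))
	for s := 0; s < seq; s++ {
		for h := 0; h < heads; h++ {
			from := (s*heads + h) * dim // offset of (s, h, 0) in the source
			to := (h*seq + s) * dim     // offset of (h, s, 0) in the destination
			copy(dst[to:to+dim], src[from:from+dim])
		}
	}
	return dst, nil
}

func main() {
	// 2 positions, 3 heads, 2 values per head, filled with 0..11.
	src := make([]float32, 12)
	for i := range src {
		src[i] = float32(i)
	}
	dst, err := TransposeSeqHeads(src, 2, 3, 2)
	fmt.Println(dst, err)
}
```

Calling the function again with `seq` and `heads` swapped restores the original layout, which is a convenient round-trip check.
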

The benchmark in benchmark/runner.go compares 3 modes over 1000 requests drawn from 8 semantic prompt clusters (64 unique prompts + 4 variants each).

Timing model (not ad-hoc random ranges — derived from Cortex-A55 measurements):

Constant	Value	Source
Prefill	6.8 ms/token	llama.cpp bench, Snapdragon 685
Generate	18.4 ms/token	same
HNSW search	3.2 ms	N=1000, efSearch=50
Fragment inject	0.029 ms/MB	LPDDR4X bandwidth
Fragment size	~6 MB	128 tokens, 12 layers, Q4_K_M
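
To get a feel for the scale of these constants, they can be combined with simple products. The sketch below is only arithmetic on the table values; the function names are hypothetical, and it leaves out embedding time, generation, and any overhead the real benchmark may model.

```go
package main

import "fmt"

// Constants copied from the timing table above.
const (
	prefillMsPerToken  = 6.8
	generateMsPerToken = 18.4
	hnswSearchMs       = 3.2
	injectMsPerMB      = 0.029
)

// PrefillMs estimates the prefill cost for a prompt of the given length.
func PrefillMs(tokens int) float64 { return prefillMsPerToken * float64(tokens) }

// GenerateMs estimates the decode cost for the given number of output tokens.
func GenerateMs(tokens int) float64 { return generateMsPerToken * float64(tokens) }

// LookupAndInjectMs estimates one HNSW search plus injecting a fragment of sizeMB.
func LookupAndInjectMs(sizeMB float64) float64 { return hnswSearchMs + injectMsPerMB*sizeMB }

func main() {
	fmt.Printf("prefill, 128 tokens:  %.1f ms\n", PrefillMs(128))
	fmt.Printf("search + inject 6 MB: %.3f ms\n", LookupAndInjectMs(6))
}
```

With the table values, prefilling 128 tokens costs 6.8 × 128 = 870.4 ms, while searching the index and injecting a 6 MB fragment costs 3.2 + 0.029 × 6 = 3.374 ms.
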

Expected results:

Mode	Avg TTFT	Hit rate	Mem BW	Energy
Baseline (no cache)	~1800 ms	0%	100%	253 mAh
Naive string cache	~1600 ms	~12%	~88%	222 mAh
Fragment cache	~350 ms	~70%	~35%	88 mAh

Run:

go run ./benchmark/

EDGESYNC_VERBOSE=1 go run ./benchmark/
go run ./benchmark/

export CGO_CFLAGS="-I/path/to/llama.cpp"
export CGO_LDFLAGS="-L/path/to/llama.cpp/build -lllama -lm"
CGO_ENABLED=1 CC=aarch64-linux-gnu-gcc GOOS=linux GOARCH=arm64 \
    go build -o edgecache ./...

aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+fp16 \
    -c core/cosine_neon.c -o core/cosine_neon.o

- Cross-engine KV tensor reshape (adapter/reshape.go): transpose [seq,heads,dim] ↔ [heads,seq,dim] between llamacpp and ONNX Runtime; CanInjectWithReshape() handles detection and fallback automatically
- Fragment compaction (cache/compactor.go): deduplication by ContentHash, grouping by layer config, adjacency merge with per-engine tensor concatenation (axis 0 for llamacpp, axis 1 per-head for ONNX); merged embedding is a weighted normalized average
- Persistent fragment store (cache/store.go): two-tier storage with a sync.Map hot cache + SQLite WAL for metadata; tensor blobs written as <id>.keys.bin / <id>.vals.bin to avoid SQLite page fragmentation; QueryByTokenRange() for prefix-range lookups
- Real embedding model (embedding/minilm.go): ORTEncoder runs all-MiniLM-L6-v2 (22 MB, 384-dim, ~8ms on Cortex-A55) via ONNX Runtime; FallbackEncoder (FNV-1a hash) activates automatically if the .ort model file is not found
- Android JNI bridge (sdk/android/EdgeSyncLLM.kt + sdk/android/jni_bridge.go): full rewrite exposing the adapter/ package API: nativeInitialize, nativeEmbed, nativeLookup, nativeInjectFragment, nativeGenerateFromPos, nativeExtractAndStore, nativeCompact, nativeReshapeFragment
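
The deduplication step of fragment compaction can be sketched from the definition of ContentHash as the SHA-256 of TokenIDs. The byte serialization (little-endian int32) and the names `ContentHash`, `Dedupe`, and `fragment` are assumptions made for illustration; cache/compactor.go may encode and select fragments differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ContentHash hashes the token IDs only, never the tensors.
// Little-endian int32 encoding is an assumption of this sketch.
func ContentHash(tokenIDs []int32) string {
	buf := make([]byte, 4*len(tokenIDs))
	for i, id := range tokenIDs {
		binary.LittleEndian.PutUint32(buf[4*i:], uint32(id))
	}
	sum := sha256.Sum256(buf)
	return hex.EncodeToString(sum[:])
}

// fragment is a hypothetical, trimmed-down stand-in for KVFragment.
type fragment struct {
	ID       string
	TokenIDs []int32
}

// Dedupe keeps the first fragment seen for each content hash and preserves order.
func Dedupe(frags []fragment) []fragment {
	seen := make(map[string]struct{}, len(frags))
	out := make([]fragment, 0, len(frags))
	for _, f := range frags {
		h := ContentHash(f.TokenIDs)
		if _, dup := seen[h]; dup {
			continue
		}
		seen[h] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{ID: "a", TokenIDs: []int32{1, 2, 3}},
		{ID: "b", TokenIDs: []int32{1, 2, 3}}, // same tokens as "a"
		{ID: "c", TokenIDs: []int32{3, 2, 1}},
	}
	for _, f := range Dedupe(frags) {
		fmt.Println(f.ID, ContentHash(f.TokenIDs)[:12])
	}
}
```

Because the hash covers token IDs rather than tensors, two fragments extracted from the same tokens with different layer ranges would collide under this sketch, which is presumably why the compactor also groups by layer config.
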

This project is licensed under the Business Source License 1.1 (BUSL-1.1) — see the LICENSE file for details.

Parameter	Value
Licensor	bossandboss (Wajdi Kechaou)
Licensed Work	EdgeSync-LLM (source, submodules, adapters, tensor engines, documentation)
Additional Use Grant	Non-commercial use, research, evaluation, development, internal testing
Change Date	July 1, 2029
Change License	GNU Affero General Public License v3.0 (AGPL-3.0)

What this means in practice:

✅ Free for research, evaluation, development, and internal testing
✅ Source code is readable and forkable
❌ Production use in commercial apps, SaaS platforms, or mobile apps deployed to end-users requires a commercial license
🔄 On July 1, 2029, the project automatically becomes AGPL-3.0

Commercial licensing: open an issue at github.com/bossandboss/EdgeSync-LLM with the label commercial-license.
