# EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android)

> Source: <https://github.com/bossandboss/EdgeSync-LLM>
> Published: 2026-06-30 14:10:22+00:00

A **engine-agnostic KV cache fragment system** for on-device LLM inference.
Designed for ARM64 Android (Cortex-A55/A78), portable to any platform running
llama.cpp, MLC-LLM, or ONNX Runtime.

A **reusable KV cache layer** that sits between the application and the LLM
engine. Instead of re-running the full prefill on every request, it stores
slices of the attention KV tensors (Keys and Values), retrieves them via
approximate nearest-neighbor search (HNSW), and injects them directly into
the engine's KV cache — skipping the most expensive part of inference.

This is not a "semantic cache" that stores response strings. It stores the
**actual transformer KV tensors**, identified by token range and layer range,
and reconstructs them at request time.

```
              [ PROMPT ]
                  │
                  ▼
         [ Embedding Model ]        MiniLM-L6-v2 (384-dim, ~8ms CPU)
                  │
                  ▼
           [ HNSW Index ]           Pure Go, M=16, efSearch=50
                  │
          ┌───────┴───────────────────────────┐
          │                                   │
    sim ≥ 0.92                          0.75 ≤ sim < 0.92       sim < 0.75
          │                                   │                      │
   ┌──────▼──────┐                  ┌─────────▼──────┐      ┌───────▼────────┐
   │ EXACT HIT   │                  │  PARTIAL HIT   │      │     MISS       │
   │             │                  │                │      │                │
   │ Inject KV   │                  │ Inject prefix  │      │ Full prefill   │
   │ fragment    │                  │ Generate delta │      │ Extract frag.  │
   │ ~8ms TTFT   │                  │ ~280ms TTFT    │      │ Store in HNSW  │
   └─────────────┘                  └────────────────┘      └────────────────┘
          │                                   │                      │
          └───────────────────────────────────┴──────────────────────┘
                                              │
                                     [ KVAdapter Layer ]
                                              │
                         ┌────────────────────┼─────────────────────┐
                         ▼                    ▼                     ▼
                  [ llamacpp ]           [ mlc-llm ]         [ onnx runtime ]
                 (GGML tensor API)    (TVM paged KV)      (past_key_values)
├── cache/
│   ├── fragment.go          ← KVFragment: formal definition of a cache unit
│   │                          (dimensions, TTL, eviction policy, storage key)
│   ├── differential.go      ← DifferentialEngine: EXACT / PARTIAL / MISS router
│   └── schema.go            ← SQLite WAL schema for fragment metadata
│
├── adapter/
│   ├── interface.go         ← KVAdapter: engine-agnostic contract
│   │                          (ExtractFragment / InjectFragment / Generate)
│   ├── llamacpp.go          ← llama.cpp adapter (GGML tensor API, CGO)
│   ├── mlc.go               ← MLC-LLM adapter (TVM paged KV, mlc4j)
│   └── onnx.go              ← ONNX Runtime adapter (past_key_values)
│
├── core/
│   ├── hnsw.go              ← Pure Go HNSW index (M=16, efSearch=50)
│   └── cosine_neon.c        ← ARM NEON fp16 cosine similarity
│
├── sdk/android/
│   └── EdgeSyncLLM.kt       ← Kotlin JNI bridge (suspend coroutines)
│
├── monitor/
│   └── energy_android.go    ← Android /sys/class/power_supply/ profiler
│
├── prefetch/
│   └── predictor.go         ← N-gram prefetch predictor (top-3 candidates)
│
└── benchmark/
    └── runner.go            ← Falsifiable benchmark: 3 modes × 1000 requests
```

The **atomic unit** of the cache. Formally defined in `cache/fragment.go`

.

| Field | Type | Meaning |
|---|---|---|
`TokenStart / TokenEnd` |
int | Token range covered `[start, end)` |
`LayerStart / LayerEnd` |
int | Transformer layers captured |
`LayerStride` |
int | Sampling interval (2 = every other layer) |
`Keys / Values` |
`[]byte` |
Raw attention tensors (engine-serialized) |
`TokenIDs` |
`[]int32` |
Input tokens → used to verify prefix |
`ContentHash` |
string | SHA-256 of TokenIDs (not tensors) |
`EmbeddingVector` |
`[]float32` |
384-dim semantic vector for HNSW lookup |
`ExpiresAt` |
time.Time | TTL: 30 min (session) → 7 days (promoted) |
`HitCount` |
int | Auto-promotes at hit ≥ 5 |
`Engine` |
string | "llamacpp" / "mlc" / "onnx" |

**Invariants enforced at construction:**

`TokenSpan ∈ [64, 2048]`

tokens`LayerEnd ≤ model.NumLayers`

`len(TokenIDs) == TokenSpan`

`len(Keys) > 0 && len(Values) > 0`

`LayerStride ≥ 1`

Defined in `adapter/interface.go`

. Any engine implements 6 methods:

```
ExtractFragment(ctx, tokenIDs, layerStart, layerEnd, layerStride, embedding)
    → *KVFragment, error

InjectFragment(ctx, fragment)
    → error

Generate(ctx, prompt, startTokenPos, maxTokens)
    → text string, tokensGenerated int, error

Tokenize(ctx, text)
    → []int32, error

ClearKVCache(ctx)
    → error

Close()
    → error
```

Cross-engine reuse: engine B can inject a fragment from engine A if and only if
B lists A in `CompatibleWith()`

. Current compatibility matrix:

| Producer → Consumer | llamacpp | mlc | onnx |
|---|---|---|---|
llamacpp |
✓ | — | — |
mlc |
— | ✓ | — |
onnx |
— | — | ✓ |

Cross-engine reuse (e.g. llamacpp → onnx) requires a KV tensor reshape adapter
(transpose `[seq, heads, dim]`

→ `[heads, seq, dim]`

). Not implemented yet.

The benchmark in `benchmark/runner.go`

compares 3 modes over 1000 requests
drawn from 8 semantic prompt clusters (64 unique prompts + 4 variants each).

**Timing model** (not ad-hoc random ranges — derived from Cortex-A55 measurements):

| Constant | Value | Source |
|---|---|---|
| Prefill | 6.8 ms/token | llama.cpp bench, Snapdragon 685 |
| Generate | 18.4 ms/token | same |
| HNSW search | 3.2 ms | N=1000, efSearch=50 |
| Fragment inject | 0.029 ms/MB | LPDDR4X bandwidth |
| Fragment size | ~6 MB | 128 tokens, 12 layers, Q4_K_M |

**Expected results:**

| Mode | Avg TTFT | Hit rate | Mem BW | Energy |
|---|---|---|---|---|
| Baseline (no cache) | ~1800 ms | 0% | 100% | 253 mAh |
| Naive string cache | ~1600 ms | ~12% | ~88% | 222 mAh |
Fragment cache |
~350 ms |
~70% |
~35% |
88 mAh |

Run:

```
go run ./benchmark/

# Verbose per-query output:
EDGESYNC_VERBOSE=1 go run ./benchmark/
# Host build (benchmark only, no CGO):
go run ./benchmark/

# Android ARM64 (with llama.cpp CGO):
export CGO_CFLAGS="-I/path/to/llama.cpp"
export CGO_LDFLAGS="-L/path/to/llama.cpp/build -lllama -lm"
CGO_ENABLED=1 CC=aarch64-linux-gnu-gcc GOOS=linux GOARCH=arm64 \
    go build -o edgecache ./...

# NEON cosine module (ARM64 only):
aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+fp16 \
    -c core/cosine_neon.c -o core/cosine_neon.o
```

-
**Cross-engine KV tensor reshape**(`adapter/reshape.go`

) — transpose`[seq,heads,dim] ↔ [heads,seq,dim]`

between llamacpp and ONNX Runtime;`CanInjectWithReshape()`

handles detection and fallback automatically -
**Fragment compaction**(`cache/compactor.go`

) — deduplication by`ContentHash`

, grouping by layer config, adjacency merge with per-engine tensor concatenation (axis 0 for llamacpp, axis 1 per-head for ONNX); merged embedding is a weighted normalized average -
**Persistent fragment store**(`cache/store.go`

) — two-tier storage:`sync.Map`

hot cache + SQLite WAL for metadata; tensor blobs written as`<id>.keys.bin`

/`<id>.vals.bin`

to avoid SQLite page fragmentation;`QueryByTokenRange()`

for prefix-range lookups -
**Real embedding model**(`embedding/minilm.go`

) —`ORTEncoder`

runs all-MiniLM-L6-v2 (22 MB, 384-dim, ~8ms on Cortex-A55) via ONNX Runtime;`FallbackEncoder`

(FNV-1a hash) activates automatically if the`.ort`

model file is not found -
**Android JNI bridge**(`sdk/android/EdgeSyncLLM.kt`

+`sdk/android/jni_bridge.go`

) — full rewrite exposing the`adapter/`

package API:`nativeInitialize`

,`nativeEmbed`

,`nativeLookup`

,`nativeInjectFragment`

,`nativeGenerateFromPos`

,`nativeExtractAndStore`

,`nativeCompact`

,`nativeReshapeFragment`

This project is licensed under the **Business Source License 1.1 (BUSL-1.1)** — see the [LICENSE](/bossandboss/EdgeSync-LLM/blob/main/LICENSE) file for details.

| Parameter | Value |
|---|---|
| Licensor | bossandboss (Wajdi Kechaou) |
| Licensed Work | EdgeSync-LLM (source, submodules, adapters, tensor engines, documentation) |
| Additional Use Grant | Non-commercial use, research, evaluation, development, internal testing |
| Change Date | July 1, 2029 |
| Change License | GNU Affero General Public License v3.0 (AGPL-3.0) |

**What this means in practice:**

- ✅ Free for research, evaluation, development, and internal testing
- ✅ Source code is readable and forkable
- ❌ Production use in commercial apps, SaaS platforms, or mobile apps deployed to end-users requires a commercial license
- 🔄 On July 1, 2029, the project automatically becomes AGPL-3.0

**Commercial licensing:** open an issue at [github.com/bossandboss/EdgeSync-LLM](https://github.com/bossandboss/EdgeSync-LLM/issues) with the label `commercial-license`

.
