Quantized LoRA Adapters for On-Device LLMs: Hot-Swapping Task-Specific Behaviors on Android Without Reloading the Base Model

wpnews.pro

---
title: "QLoRA Adapters on Android: Hot-Swap LLM Tasks in Under 100ms"
published: true
description: "Load a 4-bit quantized base model once on Android and hot-swap 2MB LoRA adapters for different tasks using llama.cpp, Kotlin, and NEON optimizations."
tags: kotlin, android, architecture, mobile
canonical_url: https://blog.mvpfactory.co/qlora-adapters-android-hot-swap-llm-tasks
---

## What We Will Build

Let me show you a pattern I use in every project that runs on-device LLMs:  a single 4-bit quantized base model into memory via `mmap`, then dynamically swapping ~2MB LoRA adapter weights to switch between summarization, code review, translation — any task you need. All in under 100ms on modern Android hardware.

By the end of this tutorial, you will have a lifecycle-aware Kotlin service that manages a base model and multiple QLoRA adapters, with proper native memory cleanup and an LRU cache for instant task switching.

## Prerequisites

- Android Studio with NDK installed
- A device or emulator with at least 8GB RAM (Pixel 7+ recommended)
- `llama.cpp` built for Android with LoRA support enabled
- A GGUF-quantized base model (Q4_K_M, 7B parameters)
- One or more LoRA adapter files (~1.5–3MB each)
- Familiarity with Kotlin coroutines and Jetpack Lifecycle

## Step 1: Understand Why One Model Per Task Does Not Scale

Most teams start with the obvious approach — one fine-tuned model per task. Here is the gotcha that will save you hours of frustration: a 7B parameter model quantized to 4-bit (Q4_K_M) runs around 3.8–4.2GB in RAM. Need three tasks? That is 12GB of model weight, untenable on any shipping Android device.

The mistake is treating model specialization as a model-level concern when it is actually a **weight-delta concern**. QLoRA adapters encode task-specific behavior as small rank-decomposition matrices layered on top of a frozen base model.

| Approach | RAM for 3 tasks | Cold-start latency | Task-switch latency |
|---|---|---|---|
| 3 separate Q4 models | ~12.0 GB | 8–12s each | 8–12s (full reload) |
| 1 base + 3 LoRA adapters | ~4.2 GB + 6 MB | 8–12s (once) | 50–90ms |
| 1 merged model per task | ~12.0 GB on disk | 8–12s each | 8–12s (full reload) |

The adapter approach cuts both memory and switching latency by orders of magnitude.

## Step 2: The mmap Trick That Makes Sub-100ms Swaps Possible

The docs do not mention this, but the key to fast adapter swaps is how `llama.cpp` handles model  on Android. When you load a GGUF model with `mmap` enabled, the OS maps the file directly into virtual address space without copying it into the process heap. Base model weights get page-faulted on demand from flash storage.

LoRA adapters, by contrast, are small enough to live entirely in resident memory. A swap means:

1. Deallocating the current adapter's rank-decomposition matrices (~2MB)
2. Allocating and  the new adapter (~2MB)
3. **No base model teardown or reload**

On a Pixel 8 with UFS 4.0 storage, this benchmarks consistently at 50–90ms. The base model's memory-mapped pages stay warm in the page cache across swaps.

## Step 3: NEON-Optimized Matrix Fusion for Merged Inference

You do not want to compute `base_output + lora_output` as two separate matrix multiplications at inference time. The better path is fusing the LoRA weights into the base weights for active layers using ARM NEON intrinsics.

The math: for a given layer, the effective weight becomes `W_eff = W_base + (alpha/r) * B * A`, where `A` and `B` are the low-rank matrices and `r` is the adapter rank. With rank 8–16 (typical for mobile adapters), this fusion takes 15–30ms across all target layers on an 8-core ARM processor using NEON SIMD.

Your actual inference path sees **zero overhead** from using an adapter versus a natively fine-tuned model. That is the whole point.

## Step 4: Build the Kotlin Service with Lifecycle-Aware Adapter Management

Here is the minimal setup to get this working. The lifecycle management is where mobile teams stumble — the model  and adapter math are well-documented, but keeping native memory from leaking when Android kills your activity is not.

kotlin

class AdapterManager(

private val baseModel: LlamaModel

) : DefaultLifecycleObserver {

private var activeAdapter: LoraAdapter? = null
private val adapterCache = LruCache<String, ByteArray>(3) // cache top 3

suspend fun switchAdapter(taskId: String): Result<Long> {
    val startNs = System.nanoTime()
    activeAdapter?.detach()

    val weights = adapterCache.get(taskId)
        ?: loadAdapterFromAssets(taskId).also { adapterCache.put(taskId, it) }

    activeAdapter = baseModel.attachLoraAdapter(weights)
    val elapsedMs = (System.nanoTime() - startNs) / 1_000_000
    return Result.success(elapsedMs)
}

override fun onStop(owner: LifecycleOwner) {
    activeAdapter?.detach()
    activeAdapter = null
}

}

Key design decisions:

- The `LruCache` holds adapter bytes for up to 3 adapters. At ~2MB each, the 6MB cache cost is negligible, and cache hits eliminate file-read latency.
- Detaching adapters in `onStop` prevents leaked native memory when the app backgrounds. This matters because `llama.cpp` allocations live outside the JVM heap — the garbage collector will never touch them.
- The `suspend` function keeps the swap off the main thread while remaining trivially callable from ViewModels.

This maps well to on-device agentic workflows. An on-device agent can break a goal into steps — one step might need an intent-analysis adapter, the next a response-generation adapter, and a third a summarization adapter. Sub-100ms swaps make multi-adapter pipelines viable on mobile.

## Step 5: Know Your Memory Budget

| Component | RAM (resident) | RAM (virtual/mapped) |
|---|---|---|
| Base model (Q4_K_M, 7B) | ~800 MB active pages | 4.0 GB mapped |
| Active LoRA adapter | 2 MB | 2 MB |
| Cached adapters (x2) | 4 MB | 4 MB |
| Fusion workspace (NEON) | 12 MB | 12 MB |
| **Total** | **~818 MB** | **~4.02 GB** |

The distinction between resident and mapped memory matters. Android's `mmap` means your app's PSS (Proportional Set Size) reflects only actively accessed pages, not the full model file. Most OEMs' low-memory-killer thresholds will not trigger against ~800MB resident on flagships with 8–12GB RAM.

## Gotchas

- **Native memory leaks are silent killers.** `llama.cpp` allocations live outside the JVM heap. If you forget to detach adapters in `onStop`, your app will crash after extended sessions. I have seen this happen to teams repeatedly. Use `DefaultLifecycleObserver` — do not rely on `onDestroy`.
- **Fuse at swap time, not at inference time.** If you compute `W_base + LoRA_delta` per token, you add latency to every single generation step. Pay the 15–30ms fusion cost once during the swap and get native performance on every token after.
- **Do not skip the LRU cache.** Reading 2MB from assets on every swap adds unnecessary I/O. Cache your top adapters in memory — the 6MB cost is trivial compared to the base model.
- **Watch your adapter rank.** Rank 8–16 is the sweet spot for mobile. Higher ranks give marginal quality gains but increase fusion time and adapter file size significantly.
- **Test on real hardware.** Emulator benchmarks are meaningless for `mmap` and NEON performance. Always profile on a physical device with UFS storage.

During long benchmarking sessions like these, I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running for break reminders and guided desk exercises — no amount of elegant adapter architecture helps if you are wrecked from six hours of unbroken profiling.

## Conclusion

Load your base model once with `mmap`, then treat adapters as the unit of task specialization. The per-adapter cost (~2MB, ~70ms swap) makes multi-task on-device LLMs practical today on flagship Android hardware. Fuse LoRA weights into base weights using NEON SIMD before inference, and bind adapter lifecycle to Android component lifecycle to prevent the silent native memory leaks that crash apps after extended sessions.

The pattern is simple: one base model, many tiny adapters, lifecycle-aware cleanup. That is your path to shipping multi-task LLMs on Android without melting the device.

source & further reading

dev.to — original article Gartner Says 40% of Apps Will Have AI Agents by December. Here's the Plumbing Nobody Puts on the Slide. Your agent returned 200 OK. Was it actually right? Calgary Energy Tech and Offshore Engineering Teams: A Match Made in the Cloud

Quantized LoRA Adapters for On-Device LLMs: Hot-Swapping Task-Specific Behaviors on Android Without Reloading the Base Model

Run your AI side-project on zahid.host