Quantized LoRA Adapters for On-Device LLMs: Hot-Swapping Task-Specific Behaviors on Android Without Reloading the Base Model

A developer demonstrates a technique for hot-swapping QLoRA adapters on Android devices, enabling task-specific LLM behaviors without reloading the base model. By loading a single 4-bit quantized base model via mmap and swapping ~2MB LoRA adapters, task switching occurs in under 100ms on modern hardware. The approach uses llama.cpp, Kotlin, and NEON optimizations to fuse adapter weights, achieving zero inference overhead compared to native fine-tuned models.

---
title: "QLoRA Adapters on Android: Hot-Swap LLM Tasks in Under 100ms"
published: true
description: "Load a 4-bit quantized base model once on Android and hot-swap 2MB LoRA adapters for different tasks using llama.cpp, Kotlin, and NEON optimizations."
tags: kotlin, android, architecture, mobile
canonical_url: https://blog.mvpfactory.co/qlora-adapters-android-hot-swap-llm-tasks
---

## What We Will Build

Let me show you a pattern I use in every project that runs on-device LLMs: loading a single 4-bit quantized base model into memory via `mmap`, then dynamically swapping ~2MB LoRA adapter weights to switch between summarization, code review, translation, or any other task you need. All in under 100ms on modern Android hardware.

By the end of this tutorial, you will have a lifecycle-aware Kotlin service that manages a base model and multiple QLoRA adapters, with proper native memory cleanup and an LRU cache for instant task switching.

## Prerequisites

- Android Studio with NDK installed
- A device or emulator with at least 8GB RAM (Pixel 7+ recommended)
- llama.cpp built for Android with LoRA support enabled
- A GGUF-quantized base model (Q4_K_M, 7B parameters)
- One or more LoRA adapter files (~1.5–3MB each)
- Familiarity with Kotlin coroutines and Jetpack Lifecycle

## Step 1: Understand Why One Model Per Task Does Not Scale

Most teams start with the obvious approach: one fine-tuned model per task. Here is the gotcha that will save you hours of frustration: a 7B parameter model quantized to 4-bit (Q4_K_M) runs around 3.8–4.2GB in RAM. Need three tasks? That is roughly 12GB of model weights, untenable on any shipping Android device.

The mistake is treating model specialization as a model-level concern when it is actually a *weight-delta concern*. QLoRA adapters encode task-specific behavior as small rank-decomposition matrices layered on top of a frozen base model.
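
To make the weight-delta idea concrete, here is a rough sketch of how few parameters a rank-decomposition pair adds compared with the dense weight matrix it modifies. The 4096×4096 layer shape and rank 8 are illustrative assumptions, not measurements from the adapters in this tutorial, and `fullParams` and `loraParams` are hypothetical helper names.

```kotlin
// Illustrative only: the layer shape and rank below are assumed values,
// not read from any specific model or adapter file.

/** Parameters in a dense weight matrix of shape dOut x dIn. */
fun fullParams(dOut: Int, dIn: Int): Long = dOut.toLong() * dIn

/** Parameters in a LoRA pair: B is dOut x rank, A is rank x dIn. */
fun loraParams(dOut: Int, dIn: Int, rank: Int): Long = rank.toLong() * (dOut + dIn)

fun main() {
    val dOut = 4096
    val dIn = 4096
    val rank = 8
    val full = fullParams(dOut, dIn)
    val lora = loraParams(dOut, dIn, rank)
    println("dense matrix: $full parameters")
    println("LoRA pair:    $lora parameters (1/${full / lora} of the dense matrix)")
}
```

Under these assumed dimensions the adapter pair holds 1/256 of the parameters of the matrix it adjusts, which is why an adapter can stay small while the base model stays frozen.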

| Approach | RAM for 3 tasks | Cold-start latency | Task-switch latency |
|---|---|---|---|
| 3 separate Q4 models | ~12.0 GB | 8–12s each | 8–12s (full reload) |
| 1 base + 3 LoRA adapters | ~4.2 GB + 6 MB | 8–12s once | 50–90ms |
| 1 merged model per task | ~12.0 GB on disk | 8–12s each | 8–12s (full reload) |

The adapter approach keeps RAM at roughly a third of the multi-model footprint and cuts task-switch latency by about two orders of magnitude.

## Step 2: The mmap Trick That Makes Sub-100ms Swaps Possible

The docs do not mention this, but the key to fast adapter swaps is how llama.cpp handles model loading on Android. When you load a GGUF model with `mmap` enabled, the OS maps the file directly into virtual address space without copying it into the process heap. Base model weights get page-faulted on demand from flash storage.

LoRA adapters, by contrast, are small enough to live entirely in resident memory. A swap means:

1. Deallocating the current adapter's rank-decomposition matrices (~2MB)
2. Allocating and loading the new adapter (~2MB)
3. Leaving the base model alone: no teardown, no reload

On a Pixel 8 with UFS 3.1 storage, this benchmarks consistently at 50–90ms. The base model's memory-mapped pages stay warm in the page cache across swaps.

## Step 3: NEON-Optimized Matrix Fusion for Merged Inference

You do not want to compute `base_output + lora_output` as two separate matrix multiplications at inference time. The better path is fusing the LoRA weights into the base weights for active layers using ARM NEON intrinsics.

The math: for a given layer, the effective weight becomes $W_{\text{eff}} = W_{\text{base}} + \frac{\alpha}{r} B A$, where $A$ and $B$ are the low-rank matrices, $r$ is the adapter rank, and $\alpha$ is the LoRA scaling factor. With rank 8–16 (typical for mobile adapters), this fusion takes 15–30ms across all target layers on an 8-core ARM processor using NEON SIMD.

Your actual inference path sees zero overhead from using an adapter versus a natively fine-tuned model. That is the whole point.
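
The fusion formula can be sketched in plain Kotlin. This is a scalar reference version that works on unquantized `FloatArray`s; it is not the NEON kernel and not llama.cpp's API. The `fuseLora` helper and the row-major layout are assumptions made for illustration.

```kotlin
/**
 * Reference (scalar) LoRA fusion: W_eff = W_base + (alpha / r) * B * A.
 * All matrices are row-major FloatArrays. Hypothetical helper, not part of llama.cpp.
 *
 * wBase: dOut x dIn, b: dOut x rank, a: rank x dIn
 */
fun fuseLora(
    wBase: FloatArray, b: FloatArray, a: FloatArray,
    dOut: Int, dIn: Int, rank: Int, alpha: Float
): FloatArray {
    require(wBase.size == dOut * dIn) { "wBase must be dOut x dIn" }
    require(b.size == dOut * rank) { "B must be dOut x rank" }
    require(a.size == rank * dIn) { "A must be rank x dIn" }

    val scale = alpha / rank
    val wEff = wBase.copyOf() // leave the base weights untouched
    for (i in 0 until dOut) {
        for (k in 0 until rank) {
            val scaledB = scale * b[i * rank + k]
            // Inner loop walks one row of A and one row of wEff contiguously.
            for (j in 0 until dIn) {
                wEff[i * dIn + j] += scaledB * a[k * dIn + j]
            }
        }
    }
    return wEff
}

fun main() {
    // 2x2 identity base weight, rank-1 adapter, alpha = 2 so alpha / r = 2.
    val wBase = floatArrayOf(1f, 0f, 0f, 1f)
    val b = floatArrayOf(1f, 2f)      // 2 x 1
    val a = floatArrayOf(0.5f, 0.25f) // 1 x 2
    val fused = fuseLora(wBase, b, a, dOut = 2, dIn = 2, rank = 1, alpha = 2f)
    println(fused.joinToString()) // prints "2.0, 0.5, 2.0, 2.0"
}
```

The `(i, k, j)` loop order keeps the innermost accesses contiguous, which is the access pattern a SIMD version would typically vectorize. A version that fuses into quantized weights would need extra dequantize and requantize steps that this sketch leaves out.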

## Step 4: Build the Kotlin Service with Lifecycle-Aware Adapter Management

Here is the minimal setup to get this working. The lifecycle management is where mobile teams stumble: the model loading and adapter math are well-documented, but keeping native memory from leaking when Android kills your activity is not.

```kotlin
import android.util.LruCache
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// LlamaModel, LoraAdapter, and loadAdapterFromAssets are assumed to come from
// your own JNI wrapper around llama.cpp; they are not shown here.
class AdapterManager(private val baseModel: LlamaModel) : DefaultLifecycleObserver {
    private var activeAdapter: LoraAdapter? = null
    private val adapterCache = LruCache<String, ByteArray>(3) // cache top 3

    /** Swaps to the adapter for [taskId] and returns the elapsed time in milliseconds. */
    suspend fun switchAdapter(taskId: String): Result<Long> = withContext(Dispatchers.IO) {
        runCatching {
            val startNs = System.nanoTime()
            // Release the previous adapter's native memory before attaching the next one.
            activeAdapter?.detach()
            activeAdapter = null
            val weights = adapterCache.get(taskId)
                ?: loadAdapterFromAssets(taskId).also { adapterCache.put(taskId, it) }
            activeAdapter = baseModel.attachLoraAdapter(weights)
            (System.nanoTime() - startNs) / 1_000_000
        }
    }

    override fun onStop(owner: LifecycleOwner) {
        // Native allocations are invisible to the GC, so free them explicitly.
        activeAdapter?.detach()
        activeAdapter = null
    }
}
```

Key design decisions:

- The `LruCache` holds adapter bytes for up to 3 adapters. At ~2MB each, the 6MB cache cost is negligible, and cache hits eliminate file-read latency.
- Detaching adapters in `onStop` prevents leaked native memory when the app backgrounds. This matters because llama.cpp allocations live outside the JVM heap, so the garbage collector will never touch them.
- The `suspend` function, combined with `withContext(Dispatchers.IO)`, keeps the swap off the main thread while remaining trivially callable from ViewModels. A `suspend` modifier alone does not change threads.
- Wrapping the body in `runCatching` means a failed load or attach comes back as `Result.failure` instead of an uncaught exception.

This maps well to on-device agentic workflows. An on-device agent can break a goal into steps: one step might need an intent-analysis adapter, the next a response-generation adapter, and a third a summarization adapter. Sub-100ms swaps make multi-adapter pipelines viable on mobile.
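
A multi-adapter pipeline of this kind might look like the following sketch. `AdapterSwitcher`, `Generator`, `PipelineStep`, and `runPipeline` are hypothetical names introduced for illustration, and the functions are plain (non-`suspend`) so the example runs without Android or coroutine dependencies. In a real app the switch would go through your adapter service.

```kotlin
// Hypothetical abstractions standing in for the native-backed service; not a real library API.
fun interface AdapterSwitcher { fun switchTo(taskId: String) }
fun interface Generator { fun generate(prompt: String): String }

/** One step of an agent plan: which adapter it needs and how to build its prompt from the previous output. */
data class PipelineStep(val taskId: String, val buildPrompt: (String) -> String)

/** Runs the steps in order, swapping adapters only when the required task changes. */
fun runPipeline(
    input: String,
    steps: List<PipelineStep>,
    switcher: AdapterSwitcher,
    generator: Generator
): String {
    var currentTask: String? = null
    var text = input
    for (step in steps) {
        if (step.taskId != currentTask) { // skip redundant swaps
            switcher.switchTo(step.taskId)
            currentTask = step.taskId
        }
        text = generator.generate(step.buildPrompt(text))
    }
    return text
}

fun main() {
    val swaps = mutableListOf<String>()
    var active = "none"
    val switcher = AdapterSwitcher { id ->
        swaps += id
        active = id
    }
    // Fake generator that tags each prompt with the adapter that was active.
    val generator = Generator { prompt -> "[$active] $prompt" }

    val steps = listOf(
        PipelineStep("intent") { "Classify: $it" },
        PipelineStep("response") { "Answer: $it" },
        PipelineStep("response") { "Refine: $it" },
        PipelineStep("summary") { "Summarize: $it" }
    )
    println(runPipeline("book a table", steps, switcher, generator))
    println("swaps = $swaps") // prints "swaps = [intent, response, summary]"
}
```

Tracking the current task means two consecutive steps that share an adapter pay the swap cost only once.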

## Step 5: Know Your Memory Budget

| Component | Resident RAM | Virtual (mapped) RAM |
|---|---|---|
| Base model (Q4_K_M, 7B) | ~800 MB (active pages) | 4.0 GB mapped |
| Active LoRA adapter | 2 MB | 2 MB |
| Cached adapters (×2) | 4 MB | 4 MB |
| Fusion workspace (NEON) | 12 MB | 12 MB |
| **Total** | **~818 MB** | **~4.02 GB** |

The distinction between resident and mapped memory matters. Because the model is memory-mapped, your app's PSS (Proportional Set Size) reflects only actively accessed pages, not the full model file. Most OEMs' low-memory-killer thresholds will not trigger against ~800MB resident on flagships with 8–12GB RAM.

## Gotchas

- **Native memory leaks are silent killers.** llama.cpp allocations live outside the JVM heap. If you forget to detach adapters in `onStop`, your app will crash after extended sessions. I have seen this happen to teams repeatedly. Use `DefaultLifecycleObserver` and do not rely on `onDestroy`, which is not guaranteed to run before the process is killed.
- **Fuse at swap time, not at inference time.** If you compute `W_base + LoRA_delta` per token, you add latency to every single generation step. Pay the 15–30ms fusion cost once during the swap and get native performance on every token after.
- **Do not skip the LRU cache.** Reading 2MB from assets on every swap adds unnecessary I/O. Cache your top adapters in memory; the 6MB cost is trivial compared to the base model.
- **Watch your adapter rank.** Rank 8–16 is the sweet spot for mobile. Higher ranks give marginal quality gains but increase fusion time and adapter file size significantly.
- **Test on real hardware.** Emulator benchmarks are meaningless for `mmap` and NEON performance. Always profile on a physical device with UFS storage.

During long benchmarking sessions like these, I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running for break reminders and guided desk exercises. No amount of elegant adapter architecture helps if you are wrecked from six hours of unbroken profiling.

## Conclusion

Load your base model once with `mmap`, then treat adapters as the unit of task specialization. The per-adapter cost (~2MB, ~70ms swap) makes multi-task on-device LLMs practical today on flagship Android hardware. Fuse LoRA weights into base weights using NEON SIMD before inference, and bind adapter lifecycle to Android component lifecycle to prevent the silent native memory leaks that crash apps after extended sessions.

The pattern is simple: one base model, many tiny adapters, lifecycle-aware cleanup. That is your path to shipping multi-task LLMs on Android without melting the device.