Quantized LoRA Adapters for On-Device LLMs: Hot-Swapping Task-Specific Behaviors on Android Without Reloading the Base Model

A developer demonstrates a technique for hot-swapping QLoRA adapters on Android devices, enabling task-specific LLM behaviors without reloading the base model. By loading a single 4-bit quantized base model via mmap and swapping ~2MB LoRA adapters, task switching occurs in under 100ms on modern hardware. The approach uses llama.cpp, Kotlin, and NEON optimizations to fuse adapter weights, achieving zero inference overhead compared to native fine-tuned models.

---
title: "QLoRA Adapters on Android: Hot-Swap LLM Tasks in Under 100ms"
published: true
description: "Load a 4-bit quantized base model once on Android and hot-swap 2MB LoRA adapters for different tasks using llama.cpp, Kotlin, and NEON optimizations."
tags: kotlin, android, architecture, mobile
canonical_url: https://blog.mvpfactory.co/qlora-adapters-android-hot-swap-llm-tasks
---

## What We Will Build

Let me show you a pattern I use in every project that runs on-device LLMs: loading a single 4-bit quantized base model into memory via `mmap`, then dynamically swapping ~2MB LoRA adapter weights to switch between summarization, code review, translation, or any other task you need. All in under 100ms on modern Android hardware.

By the end of this tutorial, you will have a lifecycle-aware Kotlin service that manages a base model and multiple QLoRA adapters, with proper native memory cleanup and an LRU cache for instant task switching.

## Prerequisites

- Android Studio with NDK installed
- A device or emulator with at least 8GB RAM (Pixel 7+ recommended)
- llama.cpp built for Android with LoRA support enabled
- A GGUF-quantized base model (Q4_K_M, 7B parameters)
- One or more LoRA adapter files (~1.5–3MB each)
- Familiarity with Kotlin coroutines and Jetpack Lifecycle

## Step 1: Understand Why One Model Per Task Does Not Scale

Most teams start with the obvious approach: one fine-tuned model per task.
Here is the gotcha that will save you hours of frustration: a 7B parameter model quantized to 4-bit (Q4_K_M) runs around 3.8–4.2GB in RAM. Need three tasks? That is roughly 12GB of model weights, untenable on any shipping Android device.

The mistake is treating model specialization as a model-level concern when it is actually a weight-delta concern. QLoRA adapters encode task-specific behavior as small rank-decomposition matrices layered on top of a frozen base model.

| Approach | RAM for 3 tasks | Cold-start latency | Task-switch latency |
|---|---|---|---|
| 3 separate Q4 models | ~12.0 GB | 8–12s each | 8–12s (full reload) |
| 1 base + 3 LoRA adapters | ~4.2 GB + 6 MB | 8–12s once | 50–90ms |
| 1 merged model per task | ~12.0 GB (on disk) | 8–12s each | 8–12s (full reload) |

The adapter approach cuts RAM roughly threefold and task-switch latency by about two orders of magnitude.

## Step 2: The mmap Trick That Makes Sub-100ms Swaps Possible

The docs do not mention this, but the key to fast adapter swaps is how llama.cpp handles model loading on Android. When you load a GGUF model with `mmap` enabled, the OS maps the file directly into virtual address space without copying it into the process heap. Base model weights get page-faulted on demand from flash storage.

LoRA adapters, by contrast, are small enough to live entirely in resident memory. A swap means:

1. Deallocating the current adapter's rank-decomposition matrices (~2MB)
2. Allocating and loading the new adapter (~2MB)
3. No base model teardown or reload

On a Pixel 8 with UFS 4.0 storage, this benchmarks consistently at 50–90ms. The base model's memory-mapped pages stay warm in the page cache across swaps.

## Step 3: NEON-Optimized Matrix Fusion for Merged Inference

You do not want to compute `base_output + lora_output` as two separate matrix multiplications at inference time. The better path is fusing the LoRA weights into the base weights for active layers using ARM NEON intrinsics.
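To make the difference between the two paths concrete, here is a minimal Kotlin sketch, assuming the standard LoRA update in which the adapter contributes `(alpha / r) * B * A` on top of the base weight. It uses plain `FloatArray` math on unquantized row-major matrices. It is not llama.cpp's implementation and contains no NEON code, and the function names `matVec`, `forwardUnfused`, and `fuseLora` are hypothetical. The point is only that fusing once gives the same output as adding the adapter contribution on every forward pass.

```kotlin
// Illustrative sketch only: unquantized floats, scalar loops, hypothetical names.
// Shapes (row-major): W is rows x cols, B is rows x rank, A is rank x cols.

/** Computes y = M * x for a row-major matrix with the given shape. */
fun matVec(m: FloatArray, rows: Int, cols: Int, x: FloatArray): FloatArray {
    require(m.size == rows * cols) { "matrix size does not match rows * cols" }
    require(x.size == cols) { "vector length must equal cols" }
    return FloatArray(rows) { i ->
        var acc = 0f
        for (j in 0 until cols) acc += m[i * cols + j] * x[j]
        acc
    }
}

/** Unfused path: the base matmul plus two small adapter matmuls on every call. */
fun forwardUnfused(
    w: FloatArray, a: FloatArray, b: FloatArray,
    rows: Int, cols: Int, rank: Int, alpha: Float, x: FloatArray
): FloatArray {
    val base = matVec(w, rows, cols, x)       // W * x
    val ax = matVec(a, rank, cols, x)         // A * x, length = rank
    val bax = matVec(b, rows, rank, ax)       // B * (A * x)
    val scale = alpha / rank
    return FloatArray(rows) { i -> base[i] + scale * bax[i] }
}

/** Fused path: builds W + (alpha / rank) * B * A once; the input W is not modified. */
fun fuseLora(
    w: FloatArray, a: FloatArray, b: FloatArray,
    rows: Int, cols: Int, rank: Int, alpha: Float
): FloatArray {
    require(w.size == rows * cols && a.size == rank * cols && b.size == rows * rank)
    val scale = alpha / rank
    val fused = w.copyOf()
    for (i in 0 until rows) {
        for (k in 0 until rank) {
            val bik = scale * b[i * rank + k]
            for (j in 0 until cols) fused[i * cols + j] += bik * a[k * cols + j]
        }
    }
    return fused
}
```

In this sketch, `fuseLora` would run once per adapter swap, after which every forward pass is a single `matVec` over the fused weights. The inner loop over `j` walks contiguous memory, which is the kind of loop a SIMD implementation would vectorize.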
The math: for a given layer, the effective weight becomes `W_eff = W_base + (alpha / r) * B * A`, where `A` and `B` are the low-rank matrices, `r` is the adapter rank, and `alpha` is the LoRA scaling factor. With rank 8–16 (typical for mobile adapters), this fusion takes 15–30ms across all target layers on an 8-core ARM processor using NEON SIMD.

Your actual inference path sees zero overhead from using an adapter versus a natively fine-tuned model. That is the whole point.

## Step 4: Build the Kotlin Service with Lifecycle-Aware Adapter Management

Here is the minimal setup to get this working. The lifecycle management is where mobile teams stumble: the model loading and adapter math are well-documented, but keeping native memory from leaking when Android kills your activity is not.

```kotlin
class AdapterManager(private val baseModel: LlamaModel) : DefaultLifecycleObserver {
    private var activeAdapter: LoraAdapter? = null
    private val adapterCache = LruCache