Six Lines, Zero API Calls: Running LLMs On-Device in React Native


Every AI feature I've worked on has done the same quiet thing: collect the user's text, send it to someone else's server, pay per token, and pray the network holds. That's fine until it isn't:

Your user is on a flight, with no network and a dead feature.

It's a journaling app, where "we send your private thoughts to a third party" is a hard no.

Finance notices the OpenAI bill climbing in a straight line with usage.

There's another option most React Native devs still treat as exotic: run the model on the device. No API call, no network, no per-token cost. The first time I wired this into an offline text-enhancement tool with Expo, the surprise wasn't that it worked. It was that the actual model code was about six lines. The hard parts were everywhere except the model.

This is a walkthrough of react-native-executorch (by Software Mansion, the Reanimated and Gesture Handler folks), built on Meta's ExecuTorch runtime. We'll build a working local chat screen, but more importantly I'll show you the traps. The ones that cost me an afternoon each. The ones an AI-generated tutorial will confidently get wrong because the API changed underneath it.

React Native ExecuTorch provides a declarative way to run AI models on-device using React Native, powered by ExecuTorch 🚀. It offers out-of-the-box support for a wide range of LLMs, computer vision models, and more. Visit our HuggingFace page to explore these models.

ExecuTorch, developed by Meta, is a novel framework allowing AI model execution on devices like mobile phones or microcontrollers.

React Native ExecuTorch bridges the gap between React Native and native platform capabilities, enabling developers to efficiently run local AI models on mobile devices. This can be achieved without the need for extensive expertise in native programming or machine learning.


Before any code, one idea that saves a lot of confusion: the LLM is one part of your app, not the app itself. The library gives you the model and nothing else. Everything around it is still your job.

It helps to picture three things working together:

Your normal app code is predictable. Same input, same output, every time: the buttons, the list, the navigation.

The model is not. Give it the same prompt twice and you'll get slightly different answers, because it generates text by predicting likely next words, not by looking facts up. That's also why it sometimes states wrong things with total confidence. It isn't a defect you can patch, it's how the thing works, so plan for it.

The person reading the output is the final check. They decide what to trust and what to ignore.

react-native-executorch owns only the middle piece. It hands you a stream of words and a few status flags. It does not manage your chat UI, decide when to run the model, or judge whether the answer is any good. Those are yours to build.

New Architecture only. The library does not support the old RN architecture. If your app is still on it, that's your first migration.

Expo SDK 54+ if you're on Expo (which I'd recommend). Older SDKs break on the file-system APIs the library now depends on.

A custom dev build, not Expo Go. This relies on native modules. Expo Go will not load it. This trips up everyone the first time.

A real iOS device for release builds. Because ExecuTorch runs natively, you can't produce an iOS release build targeting the simulator. Debug on the sim is fine; release testing needs hardware.

That last pair isn't optional advice, it's the difference between "why won't this run" and a working build. Write them on a sticky note.

Installation is two steps: install the core package, then add a resource fetcher adapter.

npm install react-native-executorch

Then a resource fetcher adapter. These are platform-specific, so install the one that matches your project.

For Expo projects:

npm install react-native-executorch-expo-resource-fetcher expo-file-system expo-asset

For bare React Native projects:

npm install react-native-executorch-bare-resource-fetcher @dr.pogodin/react-native-fs @kesha-antonov/react-native-background-downloader

Before you call any other API, you must initialize ExecuTorch with that adapter, once, at your app's entry point:

// App.tsx (or index.js), top level, runs once

import { initExecutorch } from "react-native-executorch";

import { ExpoResourceFetcher } from "react-native-executorch-expo-resource-fetcher";

initExecutorch({ resourceFetcher: ExpoResourceFetcher });

Skip this and the first model you load throws ResourceFetcherAdapterNotInitialized. It's the most common setup mistake, and an easy one to miss because initExecutorch lives at your entry point, far from where you actually call useLLM.

One more thing, if you plan to bundle a model with the app via require() instead of downloading it: add the binary extensions to Metro.

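// metro.config.js (Expo shown; bare React Native gets its default config from @react-native/metro-config)
const { getDefaultConfig } = require("expo/metro-config");
const defaultConfig = getDefaultConfig(__dirname);
defaultConfig.resolver.assetExts.push("pte"); // exported model
defaultConfig.resolver.assetExts.push("bin"); // tokenizer
module.exports = defaultConfig;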

Here's the whole "load an LLM" surface:

import { models, useLLM } from "react-native-executorch";

function Chat() {
  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });

  // ...
}

models.llm.* is a factory of pre-exported, ready-to-run models. One factory call gives the runtime everything it needs, already bundled, including the model in .pte format, already converted. Software Mansion hosts the full lineup on HuggingFace, so you point at a model and the library handles fetching and wiring up the rest. No manual file juggling.

I'm using LFM2.5 1.2B here because it's the library's own default and small enough to behave on mid-range hardware. You've got real choices though. The bundled lineup includes:

Text models: Qwen 3 (0.6B / 1.7B / 4B), Llama 3.2 (1B / 3B), Phi 4 Mini, SmolLM 2, Hammer 2.1

Vision-capable: Gemma 4 and LFM2.5-VL

Why I'd start small: a 4B model is noticeably smarter and noticeably more likely to crash with an out-of-memory error on a budget Android. Pick the smallest model that clears your quality bar, then size up only if you must.
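The "smallest model that clears your bar" rule can be written down as a tiny selection function. This is a sketch under assumptions: ModelCandidate, its fields, and pickModel are hypothetical names of mine, not part of react-native-executorch, and the memory figures are whatever you measured or read off the published benchmarks.

```typescript
// Hypothetical shape: every value here is supplied by you, from your own
// measurements or from published benchmark numbers. Not a library type.
interface ModelCandidate {
  name: string;
  peakMemoryMB: number; // measured or published peak RAM use
  passesQualityBar: boolean; // outcome of your own evaluation
}

// Returns the lowest-memory candidate that passes the quality bar and fits
// the memory budget, or null when nothing qualifies.
function pickModel(
  candidates: ModelCandidate[],
  memoryBudgetMB: number,
): ModelCandidate | null {
  const viable = candidates
    .filter((c) => c.passesQualityBar && c.peakMemoryMB <= memoryBudgetMB)
    .sort((a, b) => a.peakMemoryMB - b.peakMemoryMB);
  return viable.length > 0 ? viable[0] : null;
}
```

A null result is the useful case: it tells you the feature should be hidden or degraded on that device instead of crashing it.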

The hook gives you state to drive your UI:

llm.downloadProgress: 0 to 1 while the model downloads on first launch

llm.isReady: flips true when it's loaded and usable

llm.error: populated if anything blows up

llm.isGenerating: true while tokens are streaming

llm.response: the generated text, updated token by token
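Those flags are easier to render from if you collapse them into one screen state first. A minimal sketch, assuming error is null or undefined when nothing has failed; LlmFlags, ScreenState, and toScreenState are hypothetical names of mine, and LlmFlags only mirrors the fields listed above.

```typescript
// Structural subset of the hook state described above (not a library type).
interface LlmFlags {
  downloadProgress: number; // 0 to 1
  isReady: boolean;
  isGenerating: boolean;
  error: unknown; // assumed null/undefined when nothing has failed
}

type ScreenState =
  | { kind: "error"; message: string }
  | { kind: "downloading"; percent: number }
  | { kind: "generating" }
  | { kind: "idle" };

// Error wins over everything; "not ready" covers download and load time.
function toScreenState(llm: LlmFlags): ScreenState {
  if (llm.error != null) {
    return { kind: "error", message: String(llm.error) };
  }
  if (!llm.isReady) {
    const clamped = Math.min(1, Math.max(0, llm.downloadProgress));
    return { kind: "downloading", percent: Math.round(clamped * 100) };
  }
  return llm.isGenerating ? { kind: "generating" } : { kind: "idle" };
}
```

The component then switches on kind, so there is exactly one place that decides which of the four screens the user sees.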

There are two ways to use this hook, and the docs name them well: functional vs managed. The distinction matters, so don't skim it.

You pass the full message array every time, you keep the history, you get a token stream back. Nothing is remembered for you.

import { models, useLLM, type Message } from "react-native-executorch";

import { View, Text, Button } from "react-native";

function Chat() {
  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });

  const handleGenerate = async () => {
    const chat: Message[] = [
      { role: "system", content: "You are a concise, helpful assistant." },

      { role: "user", content: "Explain a closure in one sentence." },
    ];

    // resolves to the full string; llm.response updates live as it streams

    const final = await llm.generate(chat);

    console.log("done:", final);
  };

  if (!llm.isReady) {
    return (
      <Text>Downloading model… {Math.round(llm.downloadProgress * 100)}%</Text>
    );
  }

  return (
    <View>
      <Button title="Generate" onPress={handleGenerate} />

      <Text>{llm.response}</Text>
    </View>
  );
}

Note the shape of generate: it both returns a promise and streams into llm.response. So you render llm.response for the live typewriter effect, and await the return value when you need the finished string for, say, saving to a DB. Same call, two consumption patterns.
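Because nothing is remembered for you on this path, multi-turn use means carrying the array yourself. A minimal sketch of that bookkeeping, assuming messages have the role/content shape used in the example above; ChatMessage, withUserTurn, and withAssistantReply are hypothetical helpers of mine, not library exports.

```typescript
// Local mirror of the { role, content } message shape used above.
type Role = "system" | "user" | "assistant";
interface ChatMessage {
  role: Role;
  content: string;
}

// Builds the array you would pass to generate(): history plus the new question.
function withUserTurn(history: ChatMessage[], userText: string): ChatMessage[] {
  return [...history, { role: "user", content: userText }];
}

// Once generate() resolves, store the finished reply so the next call sees it.
function withAssistantReply(chat: ChatMessage[], reply: string): ChatMessage[] {
  return [...chat, { role: "assistant", content: reply }];
}
```

In a component this would look roughly like: build chat with withUserTurn, await llm.generate(chat), then save withAssistantReply(chat, final) into state. Both helpers return new arrays, so they are safe to use with React state.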

If you're building an actual back-and-forth chat, you don't want to hand-roll the history array. sendMessage plus messageHistory plus configure do it for you:

import { useEffect } from "react";

import { View, Text, Button } from "react-native";

import { models, useLLM, DEFAULT_SYSTEM_PROMPT } from "react-native-executorch";

function ManagedChat() {
  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });

  const { configure } = llm;

  useEffect(() => {
    configure({
      chatConfig: {
        systemPrompt: `${DEFAULT_SYSTEM_PROMPT} Keep answers short.`,
      },

      generationConfig: {
        temperature: 0.7,

        topP: 0.9,
      },
    });
  }, [configure]);

  const send = () => llm.sendMessage("Who are you?");

  return (
    <View>
      {llm.messageHistory.map((m, i) => (
        <Text key={i}>
          {m.role}: {m.content}
        </Text>
      ))}

      <Button title="Send" onPress={send} disabled={!llm.isReady} />
    </View>
  );
}

configure only affects the managed path. chatConfig and toolsConfig do nothing to generate(). That's a subtle footgun: set a system prompt in configure, then call generate and wonder why it's ignored. Mode and config have to match.

My take: use managed for chat, functional for one-shot transforms (summarize this, rewrite that, extract JSON from this). The text-enhancement tool I mentioned was pure functional. There's no conversation, just input string to improved string, and managed state would've been overhead I'd have to fight.

This is the section a docs-paraphrase can't write, so here's the honest list of what actually bit me.

1. Unmounting mid-generation crashes the app. Hard crash, not a warning. If the user navigates away while tokens are still streaming, the app goes down. The fix is to interrupt and wait:

// before unmount / on a stop button

llm.interrupt();

// then wait until llm.isGenerating === false before tearing down

Wire a stop button to interrupt() and check isGenerating in your navigation guards. I learned this the way everyone does, with a back-button press during a long answer.
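The "interrupt, then wait" rule is easier to get right as a small state machine than as scattered flags. A sketch under assumptions: reduceGuard and its types are hypothetical names of mine, and your screen is responsible for feeding it events (for example, when isGenerating changes or the user presses back) and for acting on the two output flags.

```typescript
type Phase = "idle" | "generating" | "stopping";
type GuardEvent = "generationStarted" | "leaveRequested" | "generationEnded";

interface GuardState {
  phase: Phase;
  leavePending: boolean; // user asked to leave while tokens were streaming
}

interface GuardResult {
  state: GuardState;
  callInterrupt: boolean; // caller should call llm.interrupt() now
  navigateNow: boolean; // caller may tear the screen down now
}

function reduceGuard(state: GuardState, event: GuardEvent): GuardResult {
  switch (event) {
    case "generationStarted":
      return {
        state: { phase: "generating", leavePending: false },
        callInterrupt: false,
        navigateNow: false,
      };
    case "leaveRequested":
      if (state.phase === "idle") {
        return { state, callInterrupt: false, navigateNow: true };
      }
      // Interrupt only on the first request; later presses just keep waiting.
      return {
        state: { phase: "stopping", leavePending: true },
        callInterrupt: state.phase === "generating",
        navigateNow: false,
      };
    case "generationEnded":
      return {
        state: { phase: "idle", leavePending: false },
        callInterrupt: false,
        navigateNow: state.leavePending,
      };
  }
}
```

Navigation only ever happens from the idle phase, which is the whole point: the screen cannot unmount while generation is still running.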

2. First launch downloads a model. A big one. These files run from roughly 700MB to over a gigabyte. The hosted models stream down on first use and cache in your app's documents directory, but if you don't render downloadProgress, the user stares at a dead screen and force-quits. Build the download UX first, not last. And consider letting users pick a model, or bundling a small one for offline-from-install.

3. RAM, not CPU, is your ceiling. Crashes on cheaper devices are almost always out-of-memory, not slowness. Use quantized models. If you're testing on an Android emulator and it dies, bump the emulator's RAM before you blame your code. I wasted real time debugging "my" bug that was just a starved emulator.

4. Expo Go will never work. Said it above, saying it again, because you will forget once and spend ten minutes confused. Native modules mean a custom dev build.

5. One model runner at a time. The architecture is built around a single active model instance. Don't try to stand up two useLLM components side by side and expect both to run.

6. Token batching exists for a reason. A fast model can push 60+ tokens/sec, and if every token triggers a React re-render, the UI starts to jank. The library batches token emissions (default around 10 tokens or 80ms, whichever comes first). If generation feels choppy or your list stutters, tune outputTokenBatchSize and batchTimeInterval in generationConfig rather than reaching for a FlatList rewrite.
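To make "N tokens or T milliseconds, whichever comes first" concrete, here is a sketch of the idea. It is not the library's implementation: createTokenBatcher and its options are hypothetical names of mine, and for simplicity it only checks the clock when a token arrives, so the caller has to flush() once at the end.

```typescript
interface BatcherOptions {
  batchSize?: number; // flush after this many buffered tokens
  intervalMs?: number; // or once this much time has passed since the last flush
  now?: () => number; // injectable clock, handy for tests
}

function createTokenBatcher(
  onFlush: (text: string) => void,
  options: BatcherOptions = {},
) {
  const batchSize = options.batchSize ?? 10;
  const intervalMs = options.intervalMs ?? 80;
  const now = options.now ?? Date.now;

  let buffer: string[] = [];
  let lastFlushMs = now();

  function flush(): void {
    lastFlushMs = now();
    if (buffer.length === 0) return;
    onFlush(buffer.join(""));
    buffer = [];
  }

  function push(token: string): void {
    buffer.push(token);
    const elapsed = now() - lastFlushMs;
    if (buffer.length >= batchSize || elapsed >= intervalMs) flush();
  }

  return { push, flush };
}
```

Each onFlush call is one state update and therefore one re-render, so a larger batch size trades smoothness of the typewriter effect for fewer renders.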

Once the basic loop works, the library has more than chat:

Tool calling. Define functions the model can invoke (check weather, toggle a setting, hit a local API). You give it tools plus an executeToolCallback, and in managed mode it parses and runs the calls for you. Use a model whose chat template actually supports it. Hammer 2.1 is purpose-built for function calling.
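On your side of that callback, the work is mostly a lookup table from tool name to handler. A sketch under assumptions: ToolCallLike is my guess at what a parsed call carries (a name and an arguments object), not the library's type, and makeToolExecutor is a hypothetical helper. It is synchronous to keep the example short; wrap it in an async function if the callback is expected to return a promise.

```typescript
// Assumed shape of a parsed tool call; the library's own type may differ.
interface ToolCallLike {
  toolName: string;
  arguments: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => string;

// Returns a function that runs the matching handler, or null for unknown tools.
function makeToolExecutor(handlers: Record<string, ToolHandler>) {
  return (call: ToolCallLike): string | null => {
    const known = Object.prototype.hasOwnProperty.call(handlers, call.toolName);
    if (!known) return null;
    try {
      return handlers[call.toolName](call.arguments);
    } catch (err) {
      // Report the failure as text so the model can react to it.
      return `Tool ${call.toolName} failed: ${String(err)}`;
    }
  };
}
```

The own-property check matters because the tool name comes from model output: without it, a hallucinated name like "toString" would resolve to a built-in method.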

Structured output. Need clean JSON instead of prose? There's a helper that turns a schema (plain JSON Schema or Zod) into formatting instructions, plus a validator to fix and check the result. This is how you'd build an offline "extract fields from this text" feature.
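To show what "check the result" involves, here is a hand-rolled sketch of the validation step. It is deliberately not the library's helper: parseModelJson and FieldSpec are hypothetical names of mine, and the check only covers flat objects with primitive fields.

```typescript
type FieldType = "string" | "number" | "boolean";
type FieldSpec = Record<string, FieldType>;

// Pulls the outermost {...} out of model output and verifies required fields.
// Returns null when there is no parseable object or a field has the wrong type.
function parseModelJson(
  raw: string,
  spec: FieldSpec,
): Record<string, unknown> | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return null;
  }

  const obj = parsed as Record<string, unknown>;
  for (const [key, type] of Object.entries(spec)) {
    if (typeof obj[key] !== type) return null;
  }
  return obj;
}
```

Small models like to wrap JSON in a sentence or two of chatter, which is why the sketch slices from the first brace to the last instead of parsing the whole string.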

Vision and audio. Gemma 4 and LFM2.5-VL take a capabilities array and accept an imagePath or audio buffer on sendMessage. On-device OCR into an LLM is a genuinely good offline-translation pattern.

RAG. There's a companion @react-native-rag/executorch package that plugs this LLM (and on-device embeddings) into a vector store for fully local retrieval-augmented generation. If your "model is one component" instinct is itching, that's the package that proves the point.
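If "vector store" sounds heavier than it is, the retrieval step boils down to ranking stored chunks by similarity to the question's embedding. A sketch of that core in plain TypeScript, assuming the embeddings already exist; Chunk, cosineSimilarity, and topK are my own illustrative names and say nothing about the companion package's API.

```typescript
interface Chunk {
  text: string;
  embedding: number[]; // produced elsewhere by an embedding model
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The retrieved chunk texts are then pasted into the prompt ahead of the question, which is all "grounded" means here. A linear scan like this is fine for a few thousand notes; beyond that an indexed store earns its keep.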

Get the basic loop running, then start making it real. A sensible order:

Validate on real hardware first. Clone the repo, run examples/llm on an actual phone (not the simulator), and watch the first-launch download happen. The number that matters is cold-start time on a mid-range device, and it should drive your model choice more than any benchmark table.

Study a real, shipped app. The minimal example gets you running; a production app shows you the parts the docs skip. Private Mind is Software Mansion's open-source, fully offline AI chatbot built on this library (it's live on the App Store and Play Store). Clone it and poke around: model download and management, on-device benchmarking, chat history, and custom assistant presets are all in there to learn from.

Check the benchmark pages before you commit to a model. The docs publish real inference-time, memory, and model-size numbers per model and device. Pick from those, not from vibes.

Build the download screen before the chat screen. On-device AI lives or dies on that first impression, and it's the one thing the model code can't do for you.

Lazy-load with preventLoad. Every hook takes a preventLoad flag so a model doesn't download or eat RAM until the user actually opens that feature. On a multi-feature app, this is the difference between a 1GB cold install and a fast one.

Once it runs, the fun part is what you build on top. A few possibilities the library makes surprisingly easy:

Hybrid privacy: redact on-device, then go to the cloud. usePrivacyFilter is a local model that finds personal info (names, emails, phone numbers, addresses, even API keys) in text without it ever leaving the phone. Scrub that text locally, then send the safe version to a bigger cloud model. You get GPT-class quality without handing over the sensitive bits.
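The "scrub locally" half is a pure string operation once something has told you where the sensitive spans are. A sketch under a loud assumption: I'm treating the detector's output as a list of character ranges with labels. PiiSpan and redact are hypothetical names of mine, and the real shape of what usePrivacyFilter returns may be different.

```typescript
// Hypothetical detector output: [start, end) character range plus a label.
interface PiiSpan {
  start: number;
  end: number; // exclusive
  label: string;
}

// Replaces each detected span with a [LABEL] placeholder. Overlapping spans
// are merged into the earlier placeholder so no covered text can leak.
function redact(text: string, spans: PiiSpan[]): string {
  const ordered = [...spans].sort((a, b) => a.start - b.start);
  let out = "";
  let cursor = 0;
  for (const span of ordered) {
    if (span.end <= span.start) continue; // ignore empty or inverted ranges
    if (span.start < cursor) {
      cursor = Math.max(cursor, span.end); // extend the redacted region
      continue;
    }
    out += text.slice(cursor, span.start) + `[${span.label}]`;
    cursor = span.end;
  }
  return out + text.slice(cursor);
}
```

Only the return value of redact would go to the cloud model. Keeping typed placeholders instead of deleting the text gives the bigger model enough context to still write a sensible answer.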

A fully offline voice assistant. Chain speech-to-text (Whisper is built in), then the LLM, then text-to-speech. Speak, get a spoken answer, no network anywhere in the loop. The library even supports streaming LLM output straight into speech as it generates.
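If you ever wire that LLM-to-speech handoff by hand, the usual trick is to cut the token stream at sentence boundaries so speech can start before generation finishes. A sketch of that chunking, not the library's built-in streaming: createSentenceChunker is a hypothetical helper, and it assumes you feed it only newly generated text, not the whole cumulative response.

```typescript
// Buffers streamed text and emits one complete sentence at a time.
function createSentenceChunker(onSentence: (sentence: string) => void) {
  // A sentence ends at ., ! or ? followed by whitespace. Requiring the
  // whitespace avoids cutting inside things like "3.5" while streaming.
  const boundary = /[.!?]+\s/;
  let pending = "";

  function push(delta: string): void {
    pending += delta;
    let match = boundary.exec(pending);
    while (match !== null) {
      const cut = match.index + match[0].length;
      const sentence = pending.slice(0, cut).trim();
      if (sentence.length > 0) onSentence(sentence);
      pending = pending.slice(cut);
      match = boundary.exec(pending);
    }
  }

  // Call once generation ends to emit whatever is left over.
  function flush(): void {
    const rest = pending.trim();
    pending = "";
    if (rest.length > 0) onSentence(rest);
  }

  return { push, flush };
}
```

Each emitted sentence would be handed to text-to-speech while the model keeps generating the next one.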

Point-the-camera answers. Run on-device OCR on a photo of a sign, menu, or document, feed the extracted text to the LLM, and you've got an offline translator or document Q&A tool. Like a private, no-signal version of Google Lens.

Answers from the user's own notes. The companion @react-native-rag/executorch package wires the model and on-device embeddings into a local vector store, so the model can answer questions grounded in the user's documents, fully offline.

Each of these is the same core idea pointed at a different problem: the model is one piece, and the product is the experience you wrap around it.

source & further reading

dev.to: original article
