{"slug": "six-lines-zero-api-calls-running-llms-on-device-in-react-native", "title": "Six Lines, Zero API Calls: Running LLMs On-Device in React Native", "summary": "Software Mansion's react-native-executorch library, built on Meta's ExecuTorch runtime, enables running AI models on-device in React Native without API calls or network connectivity. A developer demonstrates building a local chat screen with just six lines of model code, highlighting that the library handles model execution while developers must build the surrounding UI and logic. The library requires React Native's New Architecture, Expo SDK 54+, and custom dev builds.", "body_md": "Every AI feature I've worked on has done the same quiet thing: collect the user's text, send it to someone else's server, pay per token, and pray the network holds. That's fine until it isn't:\n\nYour user is on a flight, with no network and a dead feature.\n\nIt's a journaling app, where \"we send your private thoughts to a third party\" is a hard no.\n\nFinance notices the OpenAI bill climbing in a straight line with usage.\n\nThere's another option most React Native devs still treat as exotic: run the model on the device. No API call, no network, no per-token cost. The first time I wired this into an offline text-enhancement tool with Expo, the surprise wasn't that it worked. It's that the actual model code was about six lines. The hard parts were everywhere except the model.\n\nThis is a walkthrough of react-native-executorch (by Software Mansion, the Reanimated and Gesture Handler folks), built on Meta's ExecuTorch runtime. We'll build a working local chat screen, but more importantly I'll show you the traps. The ones that cost me an afternoon each. The ones an AI-generated tutorial will confidently get wrong because the API changed underneath it.\n\n**React Native ExecuTorch** provides a declarative way to run AI models on-device using React Native, powered by **ExecuTorch** 🚀. 
It offers out-of-the-box support for a wide range of LLMs, computer vision models, and more. Visit our [HuggingFace](https://huggingface.co/software-mansion) page to explore these models.\n\n[ExecuTorch](https://executorch.ai), developed by Meta, is a novel framework allowing AI model execution on devices like mobile phones or microcontrollers.\n\nReact Native ExecuTorch bridges the gap between React Native and native platform capabilities, enabling developers to efficiently run local AI models on mobile devices. This can be achieved without the need for extensive expertise in native programming or machine learning.\n\nThe minimum supported versions are:\n\nBefore any code, one idea that saves a lot of confusion: the LLM is one part of your app, not the app itself. The library gives you the model and nothing else. Everything around it is still your job.\n\nIt helps to picture three things working together:\n\n**Your normal app code** is predictable. Same input, same output, every time: the buttons, the list, the navigation.\n\n**The model** is not. Give it the same prompt twice and you'll get slightly different answers, because it generates text by predicting likely next words, not by looking facts up. That's also why it sometimes states wrong things with total confidence. It isn't a defect you can patch, it's how the thing works, so plan for it.\n\n**The person reading the output** is the final check. They decide what to trust and what to ignore.\n\nreact-native-executorch owns only the middle piece. It hands you a stream of words and a few status flags. It does not manage your chat UI, decide when to run the model, or judge whether the answer is any good. Those are yours to build.\n\n**New Architecture only.** The library does not support the old RN architecture. If your app is still on it, that's your first migration.\n\n**Expo SDK 54+** if you're on Expo (which I'd recommend). 
Older SDKs break on the file-system APIs the library now depends on.\n\n**A custom dev build, not Expo Go.** This relies on native modules. Expo Go will not load it. This trips up *everyone* the first time.\n\n**A real iOS device for release builds.** Because ExecuTorch runs natively, you can't produce an iOS *release* build targeting the simulator. Debug on the sim is fine; release testing needs hardware.\n\nThat last pair isn't optional advice, it's the difference between \"why won't this run\" and a working build. Write them on a sticky note.\n\nInstallation is two steps: install the core package, then add a resource fetcher adapter.\n\n```\nnpm install react-native-executorch\n```\n\nThen a **resource fetcher adapter**. These are platform-specific, so install the one that matches your project.\n\n```\n# Expo projects\n\nnpm install react-native-executorch-expo-resource-fetcher expo-file-system expo-asset\n# Bare React Native\n\nnpm install react-native-executorch-bare-resource-fetcher @dr.pogodin/react-native-fs @kesha-antonov/react-native-background-downloader\n```\n\n**Before you call any other API**, you must initialize ExecuTorch with that adapter, once, at your app's entry point:\n\n``` js\n// App.tsx (or index.js), top level, runs once\n\nimport { initExecutorch } from \"react-native-executorch\";\n\nimport { ExpoResourceFetcher } from \"react-native-executorch-expo-resource-fetcher\";\n\ninitExecutorch({ resourceFetcher: ExpoResourceFetcher });\n```\n\nSkip this and the first model you load throws `ResourceFetcherAdapterNotInitialized`. It's the most common setup mistake, and an easy one to miss because `initExecutorch` lives at your entry point, far from where you actually call `useLLM`.\n\nOne more, if you plan to bundle a model with the app via `require()` instead of downloading it. 
Add the binary extensions to Metro:\n\n```\n// metro.config.js (assuming an Expo project; bare React Native takes getDefaultConfig from @react-native/metro-config)\n\nconst { getDefaultConfig } = require(\"expo/metro-config\");\n\nconst defaultConfig = getDefaultConfig(__dirname);\n\ndefaultConfig.resolver.assetExts.push(\"pte\"); // exported model\n\ndefaultConfig.resolver.assetExts.push(\"bin\"); // tokenizer\n\nmodule.exports = defaultConfig;\n```\n\nHere's the whole \"load an LLM\" surface:\n\n``` js\nimport { models, useLLM } from \"react-native-executorch\";\n\nfunction Chat() {\n  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });\n\n  // ...\n}\n```\n\n`models.llm.*` is a factory of pre-exported, ready-to-run models. One factory call gives the runtime everything it needs, including the exported model in `.pte` format, already converted.\n\n[Software Mansion hosts the full lineup on HuggingFace](https://huggingface.co/software-mansion), so you point at a model and the library handles fetching and wiring up the rest. No manual file juggling.\n\nI'm using LFM2.5 1.2B here because it's the library's own default and small enough to behave on mid-range hardware. You've got real choices though. The bundled lineup includes:\n\n**Text models:** Qwen 3 (0.6B / 1.7B / 4B), Llama 3.2 (1B / 3B), Phi 4 Mini, SmolLM 2, Hammer 2.1\n\n**Vision-capable:** Gemma 4 and LFM2.5-VL\n\n**Why I'd start small:** a 4B model is noticeably smarter and noticeably more likely to crash with an out-of-memory error on a budget Android. Pick the smallest model that clears your quality bar, then size up only if you must.\n\nThe hook gives you state to drive your UI:\n\n`llm.downloadProgress`: 0 to 1 while the model downloads on first launch\n\n`llm.isReady`: flips true when it's loaded and usable\n\n`llm.error`: populated if anything blows up\n\n`llm.isGenerating`: true while tokens are streaming\n\n`llm.response`: the generated text, updated *token by token*\n\nThere are two ways to use this hook, and the docs name them well: **functional** vs **managed**. The distinction matters, so don't skim it.\n\nIn functional mode, you pass the full message array every time, you keep the history, and you get a token stream back. 
Nothing is remembered for you.\n\n``` js\nimport { models, useLLM, type Message } from \"react-native-executorch\";\n\nimport { View, Text, Button } from \"react-native\";\n\nfunction Chat() {\n  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });\n\n  const handleGenerate = async () => {\n    const chat: Message[] = [\n      { role: \"system\", content: \"You are a concise, helpful assistant.\" },\n\n      { role: \"user\", content: \"Explain a closure in one sentence.\" },\n    ];\n\n    // resolves to the full string; llm.response updates live as it streams\n\n    const final = await llm.generate(chat);\n\n    console.log(\"done:\", final);\n  };\n\n  if (!llm.isReady) {\n    return (\n      <Text>Loading model… {Math.round(llm.downloadProgress * 100)}%</Text>\n    );\n  }\n\n  return (\n    <View>\n      <Button title=\"Generate\" onPress={handleGenerate} />\n\n      <Text>{llm.response}</Text>\n    </View>\n  );\n}\n```\n\nNote the shape of `generate`: it both returns a promise *and* streams into `llm.response`. So you render `llm.response` for the live typewriter effect, and `await` the return value when you need the finished string for, say, saving to a DB. Same call, two consumption patterns.\n\nIf you're building an actual back-and-forth chat, you don't want to hand-roll the history array. 
In managed mode, `sendMessage` plus `messageHistory` plus `configure` does it for you:\n\n``` js\nimport { useEffect } from \"react\";\n\nimport { View, Text, Button } from \"react-native\";\n\nimport { models, useLLM, DEFAULT_SYSTEM_PROMPT } from \"react-native-executorch\";\n\nfunction ManagedChat() {\n  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });\n\n  const { configure } = llm;\n\n  useEffect(() => {\n    configure({\n      chatConfig: {\n        systemPrompt: `${DEFAULT_SYSTEM_PROMPT} Keep answers short.`,\n      },\n\n      generationConfig: {\n        temperature: 0.7,\n\n        topP: 0.9,\n      },\n    });\n  }, [configure]);\n\n  const send = () => llm.sendMessage(\"Who are you?\");\n\n  return (\n    <View>\n      {llm.messageHistory.map((m, i) => (\n        <Text key={i}>\n          {m.role}: {m.content}\n        </Text>\n      ))}\n\n      <Button title=\"Send\" onPress={send} disabled={!llm.isReady} />\n    </View>\n  );\n}\n```\n\n`configure` only affects the managed path. `chatConfig` and `toolsConfig` do nothing to `generate()`. That's a subtle footgun: set a system prompt in `configure`, then call `generate` and wonder why it's ignored. Mode and config have to match.\n\n**My take:** use managed for chat, functional for one-shot transforms (summarize this, rewrite that, extract JSON from this). The text-enhancement tool I mentioned was pure functional. There's no conversation, just *input string to improved string*, and managed state would've been overhead I'd have to fight.\n\nThis is the section a docs-paraphrase can't write, so here's the honest list of what actually bit me.\n\n**1. Unmounting mid-generation crashes the app.** Hard crash, not a warning. If the user navigates away while tokens are still streaming, you go down. 
The fix is to interrupt and wait:\n\n```\n// before unmount / on a stop button\n\nllm.interrupt();\n\n// then wait until llm.isGenerating === false before tearing down\n```\n\nWire a stop button to `interrupt()` and gate `isGenerating` into your navigation guards. I learned this the way everyone does, with a back-button press during a long answer.\n\n**2. First launch downloads a model. A big one.** These files run from roughly 700MB to over a gigabyte. The hosted models stream down on first use and cache in your app's documents directory, but if you don't render `downloadProgress`, the user stares at a dead screen and force-quits. Build the loading UX *first*, not last. And consider letting users pick a model, or bundling a small one for offline-from-install.\n\n**3. RAM, not CPU, is your ceiling.** Crashes on cheaper devices are almost always out-of-memory, not slowness. Use quantized models. If you're testing on an Android emulator and it dies, bump the emulator's RAM before you blame your code. I wasted real time debugging \"my\" bug that was just a starved emulator.\n\n**4. Expo Go will never work.** Said it above, saying it again, because you *will* forget once and spend ten minutes confused. Native modules mean a custom dev build.\n\n**5. One model runner at a time.** The architecture is built around a single active model instance. Don't try to stand up two `useLLM` components side by side and expect both to run.\n\n**6. Token batching exists for a reason.** A fast model can push 60+ tokens/sec, and if every token triggers a React re-render, your UI jank-fest begins. The library batches token emissions (default around 10 tokens or 80ms, whichever comes first). 
If generation feels choppy or your list stutters, tune `outputTokenBatchSize` and `batchTimeInterval` in `generationConfig` rather than reaching for a `FlatList` rewrite.\n\nOnce the basic loop works, the library has more than chat:\n\n**Tool calling.** Define functions the model can invoke (check weather, toggle a setting, hit a local API). You give it `tools` plus an `executeToolCallback`, and in managed mode it parses and runs the calls for you. Use a model whose chat template actually supports it. Hammer 2.1 is purpose-built for function calling.\n\n**Structured output.** Need clean JSON instead of prose? There's a helper that turns a schema (plain JSON Schema or Zod) into formatting instructions, plus a validator to fix and check the result. This is how you'd build an offline \"extract fields from this text\" feature.\n\n**Vision and audio.** Gemma 4 and LFM2.5-VL take a `capabilities` array and accept an `imagePath` or audio buffer on `sendMessage`. On-device OCR into an LLM is a genuinely good offline-translation pattern.\n\n**RAG.** There's a companion `@react-native-rag/executorch` package that plugs this LLM (and on-device embeddings) into a vector store for fully local retrieval-augmented generation. If your \"model is one component\" instinct is itching, that's the package that proves the point.\n\nGet the basic loop running, then start making it real. A sensible order:\n\n**Validate on real hardware first.** Clone the repo, run `examples/llm` on an actual phone (not the simulator), and watch the first-launch download happen. The number that matters is cold-start time on a mid-range device, and it should drive your model choice more than any benchmark table.\n\n**Study a real, shipped app.** The minimal example gets you running; a production app shows you the parts the docs skip. 
[Private Mind](https://github.com/software-mansion-labs/private-mind) is Software Mansion's open-source, fully offline AI chatbot built on this library (it's live on the App Store and Play Store). Clone it and poke around: model downloading and management, on-device benchmarking, chat history, and custom assistant presets are all in there to learn from.\n\n**Check the benchmark pages before you commit to a model.** The docs publish real inference-time, memory, and model-size numbers per model and device. Pick from those, not from vibes.\n\n**Build the loading screen before the chat screen.** On-device AI lives or dies on that first impression, and it's the one thing the model code can't do for you.\n\n**Lazy-load with preventLoad.** Every hook takes a `preventLoad` flag so a model doesn't download or eat RAM until the user actually opens that feature. On a multi-feature app, this is the difference between a 1GB cold install and a fast one.\n\nOnce it runs, the fun part is what you build on top. A few possibilities the library makes surprisingly easy:\n\n**Hybrid privacy: redact on-device, then go to the cloud.** `usePrivacyFilter` is a local model that finds personal info (names, emails, phone numbers, addresses, even API keys) in text without it ever leaving the phone. Scrub that text locally, then send the safe version to a bigger cloud model. You get GPT-class quality without handing over the sensitive bits.\n\n**A fully offline voice assistant.** Chain speech-to-text (Whisper is built in), then the LLM, then text-to-speech. Speak, get a spoken answer, no network anywhere in the loop. The library even supports streaming LLM output straight into speech as it generates.\n\n**Point-the-camera answers.** Run on-device OCR on a photo of a sign, menu, or document, feed the extracted text to the LLM, and you've got an offline translator or document Q&A tool. 
Like a private, no-signal version of Google Lens.\n\n**Answers from the user's own notes.** The companion `@react-native-rag/executorch` package wires the model and on-device embeddings into a local vector store, so the model can answer questions grounded in the user's documents, fully offline.\n\nEach of these is the same core idea pointed at a different problem: the model is one piece, and the product is the experience you wrap around it.", "url": "https://wpnews.pro/news/six-lines-zero-api-calls-running-llms-on-device-in-react-native", "canonical_source": "https://dev.to/vikrantnegi/six-lines-zero-api-calls-running-llms-on-device-in-react-native-3ahl", "published_at": "2026-06-22 04:49:24+00:00", "updated_at": "2026-06-22 05:09:49.584965+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "developer-tools", "ai-infrastructure"], "entities": ["Software Mansion", "Meta", "ExecuTorch", "React Native", "Expo", "react-native-executorch", "HuggingFace"], "alternates": {"html": "https://wpnews.pro/news/six-lines-zero-api-calls-running-llms-on-device-in-react-native", "markdown": "https://wpnews.pro/news/six-lines-zero-api-calls-running-llms-on-device-in-react-native.md", "text": "https://wpnews.pro/news/six-lines-zero-api-calls-running-llms-on-device-in-react-native.txt", "jsonld": "https://wpnews.pro/news/six-lines-zero-api-calls-running-llms-on-device-in-react-native.jsonld"}}