A 230-million-parameter model just outscored a 1-billion-parameter transformer on data extraction — and you can run it on a Raspberry Pi 5 today. Liquid AI released LFM 2.5-230M on June 25, 2026. On the CaseReportBench data extraction benchmark, it scored 22.51 against Alibaba’s Qwen3.5-0.8B (13.83) and Google’s Gemma 3 1B (2.28) — models with 3.5x and 4.3x more parameters, respectively. This is not a quirky benchmark result. It is an architecture story with concrete deployment implications for developers building data pipelines and edge AI systems.
What LFM 2.5-230M Is #
LFM 2.5-230M is the smallest model in Liquid AI’s LFM 2.5 family. Liquid AI is a Boston-based company spun out of MIT CSAIL, founded by researchers with backgrounds in dynamical systems, signal processing, and robotics. The company raised $250M Series A at a $2.35B valuation with AMD Ventures leading.
The model was pre-trained on 19 trillion tokens and refined through a three-stage pipeline: continual pre-training, supervised fine-tuning, and multi-stage reinforcement learning targeting tool use and structured extraction. It carries a 32,000-token context window — unusually large for a 230M model. Its stated use case is explicit: lightweight data extraction pipelines and agentic tool-calling at the edge.
The Architecture: Not a Transformer, Not Mamba #
The performance gap is explained by architecture. LFM 2.5-230M is built on the LFM2 design — a hybrid that combines gated short-range convolutions with a minority of Grouped Query Attention (GQA) layers.
Standard transformers carry a KV cache that grows with context length (O(n) memory). On a Raspberry Pi or mobile device with constrained RAM, that scaling behavior is a hard wall. Pure state-space models like Mamba avoid the KV cache entirely but lose long-range coherence. LFM2 splits the difference: convolution layers handle local sequence mixing at O(1) per-step decode cost, while a small number of GQA blocks preserve long-range interaction without ballooning memory usage.
The architecture layout was not hand-tuned — it was found via a hardware-in-the-loop search that optimized for quality under strict speed and memory budgets. The LFM2 technical report on arXiv covers the full methodology. The 42 tokens-per-second result on a Raspberry Pi 5 is not incidental — it is what happens when architecture constraints are designed around the target hardware from the start.
Benchmarks: Where It Wins, Where It Does Not #
LFM 2.5-230M is a specialist. The benchmarks reflect that clearly.
| Model | Parameters | CaseReportBench | BFCLv3 (Tool Use) | IFEval |
|---|---|---|---|---|
| LFM 2.5-230M | 230M | 22.51 | 43.26 | 71.71 |
| IBM Granite 4.0 | 350M | — | 39.58 | — |
| Qwen3.5-0.8B | 800M | 13.83 | — | — |
| Gemma 3 1B | 1,000M | 2.28 | 16.61 | 63.49 |
On MMLU-Pro, the picture flips: Qwen3.5-0.8B scores 37.42 against LFM 2.5-230M’s 20.25. General knowledge reasoning is not this model’s territory. If you need a general-purpose assistant, use something else. If you need structured output from a data pipeline running on constrained hardware, the benchmark case is strong.
Deployment: One Command to Start #
The full local inference stack has day-one support. You can run it right now via Ollama:
ollama run hf.co/LiquidAI/LFM2.5-230M-Instruct-GGUF:Q4_K_M
GGUF checkpoints are available for llama.cpp directly, with vLLM and MLX (Apple Silicon) support shipping on the same day. LM Studio and Jan offer GUI options. At Q4 quantization, the model runs under 1GB of RAM — it fits on a Raspberry Pi 5 4GB with room left for your application stack.
Hardware inference speeds from the official release: 213 tok/s on a Samsung Galaxy S25 Ultra, 42 tok/s on a Raspberry Pi 5. That second figure is interactive speed. A background extraction daemon running on a Pi at 42 tok/s is a real production option, not a proof of concept. Model checkpoints are available on HuggingFace, with community GGUF variants packaged by Unsloth on the same day.
Licensing #
LFM 2.5-230M is free for individuals and organizations with under $10 million in annual revenue. Organizations above that threshold require an enterprise license. Both instruct and base checkpoints are available. If you are using community GGUF variants, verify the licensing terms before commercial deployment — they may differ from the official release.
What This Signals #
Gartner projects SLM usage will surpass LLM usage by 2027. That forecast has been circulating for a while. What LFM 2.5-230M adds is not just another small transformer — it is evidence that non-transformer architectures can outperform larger transformers on specific tasks while running on hardware that most AI infrastructure discussions ignore entirely.
The question for developers in 2026 is not whether large models are capable — they are. It is whether your data extraction pipeline actually requires a billion-parameter transformer, or whether a 230M hybrid running at 42 tok/s on a $35 board covers the work at a fraction of the cost. Based on these benchmarks, that question deserves a serious answer.