Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

Liquid AI released LFM2.5-230M, its smallest open-weight model optimized for on-device agentic tasks like data extraction and tool use, achieving 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5. The 230M-parameter model outperforms larger rivals on instruction following and extraction benchmarks but trails on general knowledge, with day-one support across major inference frameworks.

Liquid AI has shipped LFM2.5-230M, the company’s smallest model to date. The release targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are open-weight on Hugging Face. The pitch is narrow on purpose. This is not a general reasoning model. It is built for data extraction and tool use on edge hardware.

## TL;DR

- Liquid AI’s LFM2.5-230M is its smallest model yet: 230M params, open-weight, built on LFM2.
- Runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5.
- Beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.
- Tuned for tool use and extraction; not for math, code generation, or creative writing.
- Day-one support across llama.cpp, MLX, vLLM, SGLang, and ONNX, with a 293–375 MB footprint.

## What is LFM2.5-230M?

LFM2.5-230M is a 230-million-parameter, text-only model. It is built on the LFM2 architecture. The model has 14 layers total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid layout targets fast CPU inference.

The context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. It supports ten languages, including English, Chinese, Arabic, and Japanese.

Liquid AI ships two checkpoints. LFM2.5-230M-Base is the pre-trained model for fine-tuning. LFM2.5-230M is the general-purpose instruction-tuned version. The license is lfm1.0.

## Training and Post-Training

The model was pre-trained on 19 trillion tokens. That total includes a 32K context extension phase. The post-training recipe then runs in three stages. First comes supervised fine-tuning with distillation from the larger LFM2.5-350M. Second is direct preference optimization (DPO). Third is multi-domain reinforcement learning. This preserves flexibility for downstream specialization.
The distillation step is what keeps a 230M model competitive with larger checkpoints. It inherits behavior from the bigger LFM2.5-350M on targeted tasks.

## Benchmarks

The Liquid AI team evaluated LFM2.5-230M across ten benchmarks. They span knowledge, instruction following, data extraction, and tool use. The instruction-following results support the narrow pitch. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a clinical data-extraction test, it scores 22.51.

| Model | Params | IFEval | IFBench | CaseReportBench | BFCLv4 | MMLU-Pro |
|---|---|---|---|---|---|---|
| LFM2.5-230M | 230M | 71.71 | 38.40 | 22.51 | 21.03 | 20.25 |
| LFM2.5-350M | 350M | 76.96 | 40.69 | 32.45 | 21.86 | 20.01 |
| Granite 4.0-H-350M | 350M | 61.27 | 17.22 | 12.44 | 13.28 | 13.14 |
| Qwen3.5-0.8B Instruct | 800M | 59.94 | 22.87 | 13.83 | 18.70 | 37.42 |
| Gemma 3 1B IT | 1B | 63.49 | 20.33 | 2.28 | 7.17 | 14.04 |

On instruction following and data extraction, LFM2.5-230M beats every model in the table except its larger sibling, LFM2.5-350M. It trails on broad knowledge: MMLU-Pro is 20.25, behind Qwen3.5-0.8B’s 37.42. It is also weak on some agentic tool use. On τ²-Bench Telecom it scores just 5.26.

Liquid AI is direct about the limits. It does not recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.

## Use Cases With Examples

The model fits two jobs well.

- The first is large-scale data extraction pipelines. Picture a pipeline parsing 100,000 clinical reports into structured fields. A 4-bit build with a 293–375 MB memory footprint runs that on commodity CPUs. You extract locally, with no per-token API bill.
- The second job is lightweight on-device agentic workloads. Think of a home automation hub that turns speech into tool calls, or a phone assistant that routes a request to the right function.

As an early signal, Liquid AI deployed the model on a Unitree G1 humanoid robot.
It ran entirely on the robot’s onboard NVIDIA Jetson Orin. There the model acted as a skill-selection layer. It turned one natural-language instruction into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA’s SONIC framework.

## Tool Use: How It Works

LFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text answer.

By default the call is a Python list. It sits between the `<|tool_call_start|>` and `<|tool_call_end|>` tokens. Here is the documented pattern, with the tool JSON abbreviated:

```
<|im_start|>system
List of tools: [{"name": "get_candidate_status", "parameters": {"candidate_id": {"type": "string"}}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
```

You can also force JSON-formatted calls through the system prompt.

## Running It: A Minimal Example

The model works with Transformers 5.0.0 and up. The recommended generation settings are `temperature` 0.1, `top_k` 50, and `repetition_penalty` 1.05. Note the `do_sample=True` flag, which is required for those sampling settings to apply.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-230M"

# Load the instruction-tuned checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is C. elegans?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Sample with the recommended settings.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Liquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, via Unsloth and TRL. Each ships as a Colab notebook.

Check out the model weights on Hugging Face, the technical details (https://www.liquid.ai/blog/lfm2-5-230m), and the docs (https://docs.liquid.ai/lfm/models/complete-library).
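Step three of the tool-use loop, executing the call, means pulling the Pythonic call out of the model's reply first. The sketch below shows one way to do that with Python's standard `ast` module, without ever calling `eval` on model output. It assumes the delimiters are spelled `<|tool_call_start|>` and `<|tool_call_end|>` and that calls use keyword arguments only. The `parse_tool_calls` and `run_tool_calls` helpers, the `TOOLS` registry, and the stub `get_candidate_status` with its placeholder return value are all hypothetical and not part of any Liquid AI library.

```python
import ast
import re

# Assumed delimiter spelling; adjust if your tokenizer differs.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(text):
    """Return a list of (function_name, kwargs) pairs found in model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        tree = ast.parse(payload.strip(), mode="eval")
        # The default format is a Python list of calls; accept a bare call too.
        body = tree.body
        nodes = body.elts if isinstance(body, ast.List) else [body]
        for node in nodes:
            if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
                raise ValueError("expected a plain function call")
            if node.args or any(kw.arg is None for kw in node.keywords):
                raise ValueError("only keyword arguments are supported in this sketch")
            # literal_eval only accepts literals, so no code is executed here.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


def get_candidate_status(candidate_id):
    # Hypothetical stub; a real tool would query a backend.
    return {"candidate_id": candidate_id, "status": "placeholder"}


# Only functions registered here can be invoked by the model.
TOOLS = {"get_candidate_status": get_candidate_status}


def run_tool_calls(text):
    """Execute every parsed call against the registry and collect results."""
    return [TOOLS[name](**kwargs) for name, kwargs in parse_tool_calls(text)]


sample = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status of candidate ID 12345."
)
results = run_tool_calls(sample)
print(results)
```

Each result would then go back to the model as a tool message so it can write its plain-text answer. Looking names up in an explicit registry, rather than evaluating the string, keeps a malformed or unexpected call from running arbitrary code.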