{"slug": "liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for", "title": "Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference", "summary": "Liquid AI released LFM2.5-230M, its smallest open-weight model optimized for on-device agentic tasks like data extraction and tool use, achieving 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5. The 230M-parameter model outperforms larger rivals on instruction following and extraction benchmarks but trails on general knowledge, with day-one support across major inference frameworks.", "body_md": "Liquid AI shipped **LFM2.5-230M**, the company’s smallest model to date. The release targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are open-weight on Hugging Face.\n\nThe pitch is narrow on purpose. This is not a general reasoning model. It is built for data extraction and tool use on edge hardware.\n\n**TL;DR**\n\n- Liquid AI’s LFM2.5-230M is its smallest model yet: 230M params, open-weight, built on LFM2.\n- Runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5.\n- Beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.\n- Tuned for tool use and extraction; not for math, code generation, or creative writing.\n- Day-one support across llama.cpp, MLX, vLLM, SGLang, and ONNX, with a 293–375 MB footprint.\n\n**What is LFM2.5-230M?**\n\nLFM2.5-230M is a 230-million-parameter, text-only model. It is built on the LFM2 architecture. The model has 14 layers total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid layout targets fast CPU inference.\n\nThe context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. 
It supports ten languages, including English, Chinese, Arabic, and Japanese.\n\nLiquid AI ships two checkpoints. LFM2.5-230M-Base is the pre-trained model for fine-tuning. LFM2.5-230M is the general-purpose instruction-tuned version. The license is lfm1.0.\n\n**Training and Post-Training**\n\nThe model was pre-trained on 19 trillion tokens. That total includes a 32K context extension phase. The post-training recipe then runs in three stages.\n\nFirst comes supervised fine-tuning with distillation from the larger LFM2.5-350M. Second is direct preference optimization (DPO). Third is multi-domain reinforcement learning. This preserves flexibility for downstream specialization.\n\nThe distillation step is what keeps a 230M model competitive with larger checkpoints. It inherits behavior from the bigger LFM2.5-350M on targeted tasks.\n\n**Benchmark**\n\nLiquid AI evaluated LFM2.5-230M across ten benchmarks. They span knowledge, instruction following, data extraction, and tool use.\n\nInstruction following is where the model does best against larger rivals. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a clinical data-extraction test, it scores 22.51.\n\n| Model | Params | IFEval | IFBench | CaseReportBench | BFCLv4 | MMLU-Pro |\n|---|---|---|---|---|---|---|\n| LFM2.5-230M | 230M | 71.71 | 38.40 | 22.51 | 21.03 | 20.25 |\n| LFM2.5-350M | 350M | 76.96 | 40.69 | 32.45 | 21.86 | 20.01 |\n| Granite 4.0-H-350M | 350M | 61.27 | 17.22 | 12.44 | 13.28 | 13.14 |\n| Qwen3.5-0.8B (Instruct) | 800M | 59.94 | 22.87 | 13.83 | 18.70 | 37.42 |\n| Gemma 3 1B IT | 1B | 63.49 | 20.33 | 2.28 | 7.17 | 14.04 |\n\nLFM2.5-230M leads every non-Liquid model in the table on instruction following and data extraction; only its larger sibling, LFM2.5-350M, scores higher. It trails on broad knowledge: MMLU-Pro is 20.25, behind Qwen3.5-0.8B’s 37.42. It is also weak on some agentic tool use. On τ²-Bench Telecom it scores just 5.26.\n\nLiquid AI is direct about the limits. 
It does not recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.\n\n**Use Cases With Examples**\n\nThe model fits two jobs well.\n\n- The first is large-scale data extraction pipelines. Picture a pipeline parsing 100,000 clinical reports into structured fields. A 4-bit build with a 293–375 MB memory footprint runs that on commodity CPUs. You extract locally, with no per-token API bill.\n\n- The second job is lightweight on-device agentic workloads. Think a home automation hub that turns speech into tool calls. Or a phone assistant that routes a request to the right function.\n\nAs an early signal, Liquid AI deployed the model on a Unitree G1 humanoid robot. It ran entirely on the robot’s onboard NVIDIA Jetson Orin. There the model acted as a skill-selection layer. It turned one natural-language instruction into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA’s SONIC framework.\n\n**Tool Use: How It Works**\n\nLFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text answer.\n\nBy default the call is a Python list. It sits between the `<|tool_call_start|>` and `<|tool_call_end|>` tokens. 
Here is the documented pattern, with the tool JSON abbreviated:\n\n```\n<|im_start|>system\nList of tools: [{\"name\": \"get_candidate_status\",\n  \"parameters\": {\"candidate_id\": {\"type\": \"string\"}}}]<|im_end|>\n<|im_start|>user\nWhat is the current status of candidate ID 12345?<|im_end|>\n<|im_start|>assistant\n<|tool_call_start|>[get_candidate_status(candidate_id=\"12345\")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>\n```\n\nYou can also force JSON-formatted calls through the system prompt.\n\n**Running It: A Minimal Example**\n\nThe model works with Transformers 5.0.0 and up. The recommended generation settings are temperature 0.1, top_k 50, and repetition_penalty 1.05. Note the `do_sample=True` flag, which is required for those sampling settings to apply.\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_id = \"LiquidAI/LFM2.5-230M\"\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    device_map=\"auto\",\n    dtype=\"bfloat16\",\n)\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n\ninputs = tokenizer.apply_chat_template(\n    [{\"role\": \"user\", \"content\": \"What is C. elegans?\"}],\n    add_generation_prompt=True,\n    tokenize=True,\n    return_dict=True,\n    return_tensors=\"pt\",\n).to(model.device)\n\noutput = model.generate(\n    **inputs,\n    do_sample=True,\n    temperature=0.1,\n    top_k=50,\n    repetition_penalty=1.05,\n    max_new_tokens=512,\n)\nprint(tokenizer.decode(output[0][inputs[\"input_ids\"].shape[-1]:], skip_special_tokens=True))\n```\n\nLiquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, via Unsloth and TRL. 
Each ships as a Colab notebook.\n\nCheck out the model weights on Hugging Face, the [Technical details](https://www.liquid.ai/blog/lfm2-5-230m), and the [Docs](https://docs.liquid.ai/lfm/models/complete-library).", "url": "https://wpnews.pro/news/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for", "canonical_source": "https://www.marktechpost.com/2026/06/27/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for-on-device-inference/", "published_at": "2026-06-28 04:58:20+00:00", "updated_at": "2026-06-28 05:10:26.863988+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-products", "ai-tools", "ai-infrastructure"], "entities": ["Liquid AI", "LFM2.5-230M", "Hugging Face", "Galaxy S25 Ultra", "Raspberry Pi 5", "Qwen3.5-0.8B", "Gemma 3 1B", "NVIDIA Jetson Orin"], "alternates": {"html": "https://wpnews.pro/news/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for", "markdown": "https://wpnews.pro/news/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for.md", "text": "https://wpnews.pro/news/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for.txt", "jsonld": "https://wpnews.pro/news/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for.jsonld"}}