Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

wpnews.pro

Liquid AI shipped ** LFM2.5-230M**, it’s the company’s smallest model to date. The release targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are open-weight on Hugging Face.

The pitch is narrow on purpose. This is not a general reasoning model. It is built for data extraction and tool use on edge hardware.

TL;DR

Liquid AI’s LFM2.5-230M is its smallest model yet: 230M params, open-weight, built on LFM2.
Runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 on a Raspberry Pi 5.
Beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.
Tuned for tool use and extraction; not for math, code generation, or creative writing.
Day-one support across llama.cpp, MLX, vLLM, SGLang, and ONNX, with a 293–375 MB footprint.

What is LFM2.5-230M?

LFM2.5-230M is a 230-million-parameter, text-only model. It is built on the LFM2 architecture. The model has 14 layers total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid layout targets fast CPU inference.

The context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. It supports ten languages, including English, Chinese, Arabic, and Japanese.

Liquid AI team ships two checkpoints. LFM2.5-230M-Base is the pre-trained model for fine-tuning. LFM2.5-230M is the general-purpose instruction-tuned version. The license is lfm1.0.

Training and Post-Training

The model was pre-trained on 19 trillion tokens. That total includes a 32K context extension phase. The post-training recipe then runs in three stages.

First comes supervised fine-tuning with distillation from the larger LFM2.5-350M. Second is direct preference optimization (DPO). Third is multi-domain reinforcement learning. This preserves flexibility for downstream specialization.

The distillation step is what keeps a 230M model competitive with larger checkpoints. It inherits behavior from the bigger LFM2.5-350M on targeted tasks.

Benchmark

Liquid AI team evaluated LFM2.5-230M across ten benchmarks. They span knowledge, instruction following, data extraction, and tool use.

The instruction-following results support that. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a clinical data-extraction test, it scores 22.51.

Model	Params	IFEval	IFBench	CaseReportBench	BFCLv4	MMLU-Pro
LFM2.5-230M	230M	71.71	38.40	22.51	21.03	20.25
LFM2.5-350M	350M	76.96	40.69	32.45	21.86	20.01
Granite 4.0-H-350M	350M	61.27	17.22	12.44	13.28	13.14
Qwen3.5-0.8B (Instruct)	800M	59.94	22.87	13.83	18.70	37.42
Gemma 3 1B IT	1B	63.49	20.33	2.28	7.17	14.04

LFM2.5-230M leads on instruction following and data extraction. It trails on broad knowledge: MMLU-Pro is 20.25, behind Qwen3.5-0.8B’s 37.42. It is also weak on some agentic tool use. On τ²-Bench Telecom it scores just 5.26.

Liquid AI is direct about the limits. It does not recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.

Use Cases With Examples

The model fits two jobs well.

The first is large-scale data extraction pipelines. Picture a pipeline parsing 100,000 clinical reports into structured fields. A 4-bit build with a 293–375 MB memory footprint runs that on commodity CPUs. You extract locally, with no per-token API bill.
The second job is lightweight on-device agentic workloads. Think a home automation hub that turns speech into tool calls. Or a phone assistant that routes a request to the right function.

As an early signal, Liquid AI deployed the model on a Unitree G1 humanoid robot. It ran entirely on the robot’s onboard NVIDIA Jetson Orin. There the model acted as a skill-selection layer. It turned one natural-language instruction into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA’s SONIC framework.

Tool Use: How It Works

LFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text answer.

By default the call is a Python list. It sits between the <|tool_call_start|>

and <|tool_call_end|>

tokens. Here is the documented pattern, with the tool JSON abbreviated:

<|im_start|>system
List of tools: [{"name": "get_candidate_status",
  "parameters": {"candidate_id": {"type": "string"}}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>

You can also force JSON-formatted calls through the system prompt.

Running It: A Minimal Example

The model works with Transformers 5.0.0 and up. The recommended generation settings are temperature 0.1, top_k 50, and repetition_penalty 1.05. Note the do_sample=True

flag, which is required for those sampling settings to apply.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-230M"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is C. elegans?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Liquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, via Unsloth and TRL. Each ships as a Colab notebook.

Interactive Explainer

Check out the ** Model weight on HF**,

and

Technical details.

DocsAlso, feel free to follow us on

and don’t forget to join ourTwitter

and Subscribe to

150k+ML SubReddit. Wait! are you on telegram?

our Newsletter

now you can join us on telegram as well.Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

source & further reading

marktechpost.com — original article DeepSeek Releases DSpark, a Speculative Decoding Framework That Accelerates DeepSeek-V4 Per-User Generation 60–85% Over MTP-1 Meta’s Astryx Brings a CLI and MCP Server to an Open-Source React Design System Agents Can Read Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics

Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

Run your AI side-project on zahid.host