Source: deemwar-products.github.io

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

Mochallama, a new Java library, enables running llama.cpp inference directly within a Java process using JDK 22's Foreign Function and Memory (FFM) API, eliminating the need for separate daemon processes or native builds. The tool ships with a CLI that includes its own JDK runtime via npm, supports tool-calling local LLMs like Qwen2.5-1.5B, and offers Spring Boot integration with OpenAI-compatible endpoints for chat completions and tool use. This approach fills a gap in JVM-based local LLM deployment by providing in-process inference without JNI's crash risks or the overhead of HTTP-based solutions like Ollama.

2 min read · Published Jun 5, 2026

The 10-second hook

No Java install, no daemon, no native build. Use `npx` to run a tool-calling local LLM and start chatting:

The CLI ships its own jlink JDK 22 runtime image via npm, so no JDK is needed on the host. `qwen2.5-1.5b` is the default tool-capable preset; the model downloads on first run into `~/.chatbot_models`.

Embed it: the smallest plain-Java snippet

Add two dependencies, the Java jar plus the platform aggregator that resolves the right native classifier jar for your host:

JVM flags

JDK 22+ is required (FFM became GA in JDK 22). Run with `--enable-native-access=ALL-UNNAMED`.
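To make "in-process via FFM" concrete, here is a minimal, self-contained JDK 22 sketch that calls the C library's `strlen` directly through `java.lang.foreign`, with no JNI glue and no native build step. It is not mochallama code: the class and method names are illustrative, and it assumes a 64-bit host whose default linker lookup exposes `strlen` (as on typical Linux and macOS systems). `Linker.downcallHandle` is a restricted method, which is what the `--enable-native-access` flag is about; on JDK 22, leaving the flag off produces a runtime warning rather than an error.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Illustrative only: calls libc's strlen through FFM (JDK 22+); not mochallama's API.
// Run with: java --enable-native-access=ALL-UNNAMED FfmStrlen.java
final class FfmStrlen {
    private static final Linker LINKER = Linker.nativeLinker();

    // size_t strlen(const char *s), modelled as (ADDRESS) -> JAVA_LONG on a 64-bit host.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    static long strlen(String s) {
        // The confined arena frees the native copy of the string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s); // UTF-8, NUL-terminated copy
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("native strlen call failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("llama.cpp")); // prints 9
    }
}
```

A binding to llama.cpp follows the same pattern at a much larger scale: look up exported symbols, describe their C signatures with `FunctionDescriptor`, and manage native memory through arenas.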

Or one Spring dependency

The starter autoconfigures a local model service and the OpenAI-compatible endpoints; no `spring-ai` dependency is required:

Tell it which model to load. A Hugging Face id is the simplest option: it resolves and caches the GGUF file on first start. In `src/main/resources/application.properties`:

Start the app. The model loads asynchronously, so the endpoints return `503` until `state: READY`. Then point any OpenAI client at it: `POST /v1/chat/completions` handles non-streaming requests, `stream: true` SSE, and `tools` / `tool_choice`, while `GET /v1/models` lists the loaded model.
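As a sketch of "point any OpenAI client at it", here is a plain `java.net.http` client that waits out the `503` phase and then sends a streaming chat completion. It is a hypothetical helper, not part of mochallama, and it rests on assumptions: the app listens on Spring Boot's default port 8080, `qwen2.5-1.5b` stands in for whatever id `GET /v1/models` actually reports, and the stream follows OpenAI's SSE convention of `data:` lines ending with a `[DONE]` sentinel.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical client for an OpenAI-compatible server; not part of mochallama.
// Assumptions: server on localhost:8080 (Spring Boot default); MODEL is a placeholder id.
final class ChatCompletionsClient {
    static final URI BASE = URI.create("http://localhost:8080");
    static final String MODEL = "qwen2.5-1.5b"; // placeholder: use the id from GET /v1/models

    // Minimal JSON string escaping for backslashes, quotes and newlines (sketch only).
    static String json(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }

    // OpenAI-style chat completion body with a single user message.
    static String body(String model, String userText, boolean stream) {
        return "{\"model\":" + json(model)
                + ",\"stream\":" + stream
                + ",\"messages\":[{\"role\":\"user\",\"content\":" + json(userText) + "}]}";
    }

    static HttpRequest chatRequest(String userText, boolean stream) {
        return HttpRequest.newBuilder(BASE.resolve("/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(MODEL, userText, stream)))
                .build();
    }

    // The model loads asynchronously: poll GET /v1/models while it answers 503.
    static void awaitReady(HttpClient client) throws IOException, InterruptedException {
        HttpRequest probe = HttpRequest.newBuilder(BASE.resolve("/v1/models")).GET().build();
        while (client.send(probe, HttpResponse.BodyHandlers.discarding()).statusCode() == 503) {
            Thread.sleep(Duration.ofSeconds(1));
        }
    }

    // Keep each SSE "data:" payload; stop at the OpenAI-style "[DONE]" sentinel (assumed).
    static List<String> sseData(Stream<String> lines) {
        return lines.filter(l -> l.startsWith("data:"))
                .map(l -> l.substring("data:".length()).strip())
                .takeWhile(d -> !d.equals("[DONE]"))
                .toList();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        awaitReady(client);
        HttpResponse<Stream<String>> res = client.send(
                chatRequest("Say hello in five words.", true), HttpResponse.BodyHandlers.ofLines());
        try (Stream<String> lines = res.body()) {
            sseData(lines).forEach(System.out::println); // one JSON chunk per line
        }
    }
}
```

Setting `stream` to `false` returns a single JSON completion instead; `tools` and `tool_choice` would be extra top-level fields in the same body.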

A real multi-turn CLI

`mochallama chat` is a stateful REPL: it keeps the full conversation history instead of answering amnesiac single turns.

Sessions persist at `~/.chatbot_models/sessions/<id>.json`; pass `--no-save` for an ephemeral run. Inside the REPL, the slash commands `/reset`, `/help`, and `/exit` are available.
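The difference between a stateful REPL and amnesiac single turns comes down to what is sent on each turn. The sketch below is hypothetical (the `ChatSession` class and its methods are illustrative, not mochallama's internals): every user and assistant message is appended to one list, the whole list forms the next prompt, `/reset` clears it, and the session file path mirrors the location the article gives.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a stateful chat REPL's history; not mochallama's internals.
final class ChatSession {
    record Message(String role, String content) {}

    private final List<Message> history = new ArrayList<>();

    void userSays(String text) { history.add(new Message("user", text)); }

    void assistantSays(String text) { history.add(new Message("assistant", text)); }

    // A stateful REPL sends the full history each turn, not just the latest message.
    List<Message> promptForNextTurn() { return List.copyOf(history); }

    // Mirrors the REPL's /reset command: forget the conversation so far.
    void reset() { history.clear(); }

    // Location the article gives for saved sessions: ~/.chatbot_models/sessions/<id>.json
    static Path sessionFile(String id) {
        return Path.of(System.getProperty("user.home"), ".chatbot_models", "sessions", id + ".json");
    }

    public static void main(String[] args) {
        ChatSession s = new ChatSession();
        s.userSays("My name is Ada.");
        s.assistantSays("Nice to meet you, Ada.");
        s.userSays("What is my name?");
        System.out.println(s.promptForNextTurn().size()); // prints 3
        s.reset();
        System.out.println(s.promptForNextTurn().size()); // prints 0
        System.out.println(sessionFile("demo"));
    }
}
```

Because the third turn carries the first two, the model can answer "What is my name?"; an amnesiac CLI would send only the last message.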

Honest positioning

Today most local-LLM paths for the JVM reach your app over HTTP: Ollama, llama-server, LM Studio and friends all run as separate processes, and Spring AI and LangChain4j just point an HTTP client at them. The other in-process options are either non-JVM or, on the JVM, either pure-Java Jlama (which reimplements inference on the incubating Vector API and does not use GGUF) or JNI bindings, whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + a Spring-autoconfigured OpenAI wire API + tools and SSE together + zero native install.
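Part of what FFM buys over hand-written JNI is that Java-side access to native memory is checked: reading past the end of a `MemorySegment`, or touching one after its arena is closed, throws an exception instead of corrupting memory. The sketch below (illustrative names, plain JDK 22 API, not mochallama code) shows both checks. One caveat worth keeping in mind: these checks cover the Java side of the boundary only, so a fault inside the native library itself can still bring down the process.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative: FFM bounds and lifetime checks on native memory (JDK 22+).
// These guard Java-side access only; they cannot catch faults inside native code.
final class FfmSafetyChecks {
    static String readPastEnd() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment one = arena.allocate(ValueLayout.JAVA_INT); // exactly 4 bytes
            one.set(ValueLayout.JAVA_INT, 0, 42);
            one.get(ValueLayout.JAVA_INT, 4); // one int past the end of the segment
            return "no exception";
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        }
    }

    static String readAfterFree() {
        Arena arena = Arena.ofConfined();
        MemorySegment stale = arena.allocate(ValueLayout.JAVA_INT);
        arena.close(); // frees the memory and invalidates every segment from this arena
        try {
            stale.get(ValueLayout.JAVA_INT, 0);
            return "no exception";
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        }
    }

    public static void main(String[] args) {
        System.out.println(readPastEnd());   // prints IndexOutOfBoundsException
        System.out.println(readAfterFree()); // prints IllegalStateException
    }
}
```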

It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider portability you still want Spring AI or LangChain4j; mochallama slots in under them as the local provider via its Spring AI `ChatModel` adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare.

What to do next

- Quickstart: time-to-first-success with npx, plain Java, and Spring Boot.
- Why mochallama: the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions.
- Examples: curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming.
- Compare: mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.
