Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM


Source: deemwar-products.github.io · Published 2026-06-05T08:42Z · Topic: large-language-models

Mochallama, a new Java library, enables running llama.cpp inference directly within a Java process using JDK 22's Foreign Function and Memory (FFM) API, eliminating the need for separate daemon processes or native builds. The tool ships with a CLI that includes its own JDK runtime via npm, supports tool-calling local LLMs like Qwen2.5-1.5B, and offers Spring Boot integration with OpenAI-compatible endpoints for chat completions and tool use. This approach fills a gap in JVM-based local LLM deployment by providing in-process inference without JNI's crash risks or the overhead of HTTP-based solutions like Ollama.


The 10-second hook

No Java install, no daemon, no native build: npx a tool-calling local LLM and start chatting.

The CLI ships its own jlink JDK 22 runtime image via npm, so this needs no JDK on the host. The default tool-capable preset is qwen2.5-1.5b; the model downloads on first run into ~/.chatbot_models.
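The download-on-first-run behaviour is a fetch-once cache. The sketch below shows that pattern in plain Java under stated assumptions: the class name, method name, and the injected downloader are hypothetical, and nothing here is taken from mochallama's own code. It writes to a temporary ".part" file first so an interrupted download is not mistaken for a complete model.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Supplier;

// Hypothetical illustration of a fetch-once model cache; not mochallama's API.
final class ModelCacheSketch {

    /** Returns the cached file, invoking the downloader only when the file is missing. */
    static Path resolve(Path cacheDir, String fileName, Supplier<InputStream> downloader) {
        Path target = cacheDir.resolve(fileName);
        if (Files.isRegularFile(target)) {
            return target; // cache hit: later runs skip the download
        }
        try {
            Files.createDirectories(cacheDir);
            // Download to a side file, then rename, so a partial file never looks complete.
            Path partial = cacheDir.resolve(fileName + ".part");
            try (InputStream in = downloader.get()) {
                Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            }
            Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass something like Path.of(System.getProperty("user.home"), ".chatbot_models") as the cache directory and a downloader that opens the remote model stream.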

Embed it: the smallest plain-Java snippet

It takes two dependencies: the Java jar plus the platform aggregator, which resolves the right native classifier jar for your host.

JVM flags

JDK 22+ is required (the FFM API is final, not preview, from JDK 22). Run with --enable-native-access=ALL-UNNAMED.
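For readers new to FFM, the mechanism underneath is a downcall from Java into a native function. The sketch below shows that mechanism with the C standard library's strlen instead of llama.cpp. It does not use mochallama's API, the class and method names are hypothetical, and it assumes JDK 22 or later.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical illustration of an FFM downcall; requires JDK 22+.
final class FfmDowncallSketch {

    /** Calls the C library's strlen on a UTF-8 copy of the given text. */
    static long nativeStrlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Look up the symbol in the platform's default (C standard) library.
        MemorySegment strlenAddress = linker.defaultLookup().find("strlen").orElseThrow();
        // C signature: size_t strlen(const char *s)
        MethodHandle strlen = linker.downcallHandle(
                strlenAddress,
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        // The confined arena frees the off-heap string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // null-terminated UTF-8
            return (long) strlen.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }
}
```

Without --enable-native-access the JDK prints a warning about restricted methods when the downcall handle is created, which is why the flag above is recommended.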

Or one Spring dependency

The starter autoconfigures a local model service and the OpenAI-compatible endpoints; no spring-ai dependency is required.

Tell it which model to load in src/main/resources/application.properties. A Hugging Face id is the simplest option: the starter resolves and caches the GGUF on first start.

Start the app (the model loads asynchronously, and endpoints return 503 until state: READY), then point any OpenAI client at it. POST /v1/chat/completions handles non-streaming requests, stream:true SSE, and tools / tool_choice; GET /v1/models lists the loaded model.
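The wire behaviour described above can be exercised from plain Java with the JDK's built-in HTTP client types. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, the base URL is whatever your Spring Boot app listens on, the stream is assumed to end with the OpenAI-style "data: [DONE]" sentinel, and the hand-rolled JSON escaping covers only quotes, backslashes, and newlines (a real client would use a JSON library).

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical client-side helpers for an OpenAI-compatible endpoint.
final class OpenAiWireSketch {

    /** Minimal JSON string escaping; illustration only. */
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }

    /** Builds a single-turn chat completion body. */
    static String chatBody(String model, String userMessage, boolean stream) {
        return "{\"model\":" + jsonString(model)
                + ",\"stream\":" + stream
                + ",\"messages\":[{\"role\":\"user\",\"content\":" + jsonString(userMessage) + "}]}";
    }

    /** Builds (but does not send) the POST /v1/chat/completions request. */
    static HttpRequest chatRequest(String baseUrl, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Polls a status probe, treating 503 as "model still loading". */
    static boolean awaitReady(IntSupplier statusProbe, int maxAttempts, long pauseMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int status = statusProbe.getAsInt();
            if (status == 200) {
                return true;
            }
            if (status != 503) {
                throw new IllegalStateException("Unexpected status " + status);
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    /** Extracts SSE "data:" payloads, stopping at the [DONE] sentinel. */
    static List<String> ssePayloads(List<String> lines) {
        List<String> payloads = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("data:")) {
                continue; // blank separators, comments, other SSE fields
            }
            String payload = line.substring("data:".length()).strip();
            if (payload.equals("[DONE]")) {
                break;
            }
            payloads.add(payload);
        }
        return payloads;
    }
}
```

In use, the status probe would wrap a real request (for example a GET against /v1/models, assuming it also answers 503 while the model loads), and the request from chatRequest would be sent with java.net.http.HttpClient.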

A real multi-turn CLI

mochallama chat is a stateful REPL: it keeps the full conversation history, not amnesiac single turns.

Sessions persist at ~/.chatbot_models/sessions/<id>.json. Pass --no-save for an ephemeral run. Inside the REPL, the slash commands /reset, /help, and /exit are available.
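To make "stateful" concrete, here is a minimal sketch of a REPL loop body that sends the whole history to the model on every turn and handles the three slash commands. It is an illustration under assumptions, not mochallama's implementation: the class, record, and method names are hypothetical, the model is a plain function you inject, and /reset is assumed to clear the conversation.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stateful REPL core; the model is injected as a function.
final class ReplSketch {

    record Message(String role, String content) {}

    enum Outcome { CONTINUE, EXIT }

    private final List<Message> history = new ArrayList<>();
    private final Function<List<Message>, String> model;

    ReplSketch(Function<List<Message>, String> model) {
        this.model = model;
    }

    /** Handles one line of input: a slash command or a user turn. */
    Outcome handle(String line) {
        String input = line.strip();
        switch (input) {
            case "/exit":
                return Outcome.EXIT;
            case "/reset":
                history.clear(); // assumed meaning: forget the conversation so far
                return Outcome.CONTINUE;
            case "/help":
                System.out.println("Commands: /reset, /help, /exit");
                return Outcome.CONTINUE;
            default:
                history.add(new Message("user", input));
                // The model sees every prior turn, not just the latest message.
                String reply = model.apply(List.copyOf(history));
                history.add(new Message("assistant", reply));
                System.out.println(reply);
                return Outcome.CONTINUE;
        }
    }

    List<Message> history() {
        return List.copyOf(history);
    }

    /** Session path following the layout quoted above: <home>/.chatbot_models/sessions/<id>.json */
    static Path sessionFile(Path home, String id) {
        return home.resolve(".chatbot_models").resolve("sessions").resolve(id + ".json");
    }
}
```

A persistence layer would serialize history() to sessionFile(...) after each turn and skip that step when the run is ephemeral.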

Honest positioning

Today the common local-LLM paths for the JVM reach your app over HTTP: Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The in-process alternatives are either non-JVM or, on the JVM, pure-Java Jlama (which reimplements inference on the incubating Vector API and does not use GGUF) or JNI bindings whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.

It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j; mochallama slots in under them as the local provider via its Spring AI ChatModel adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare.

What to do next

Quickstart: time-to-first-success with npx, plain Java, and Spring Boot.

Why mochallama: the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions.

Examples: curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming.

Compare: mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.


Original article: deemwar-products.github.io/mochallama/
