# Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

> Source: <https://deemwar-products.github.io/mochallama/>
> Published: 2026-06-05 08:42:24+00:00

## The 10-second hook

No Java install, no daemon, no native build: `npx` a tool-calling local LLM and start chatting.

The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. `qwen2.5-1.5b` is the default tool-capable preset; the model downloads on first run into `~/.chatbot_models`.

## Embed it: the smallest plain-Java snippet

You need two dependencies: the Java jar, plus the platform aggregator that resolves the right native classifier jar for your host.

**JVM flags.** JDK 22+ is required (FFM is GA there). Run with `--enable-native-access=ALL-UNNAMED`.
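To make "in-process via FFM" concrete, here is a minimal sketch of an FFM downcall using only the JDK's `java.lang.foreign` API. It calls the C library's `strlen`, not llama.cpp, and it is not mochallama's binding code. The class name `FfmStrlen` is hypothetical, and the sketch assumes JDK 22+ on a 64-bit platform where `size_t` maps to a Java `long`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Illustrative only: a downcall into libc, not into llama.cpp.
public class FfmStrlen {

    public static long strlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Look up the native symbol and describe its C signature:
        // size_t strlen(const char *s), assuming a 64-bit size_t.
        MethodHandle handle = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        // The confined arena frees the off-heap string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // NUL-terminated UTF-8
            return (long) handle.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("native call failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("hello"));
    }
}
```

`Linker.downcallHandle` is a restricted method, which is why the `--enable-native-access` flag matters: without it, JDK 22 prints a warning when native access is first used from the unnamed module.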

## Or one Spring dependency

The starter autoconfigures a local model service and the OpenAI-compatible endpoints, with no `spring-ai` dependency required.

Tell it which model to load. A Hugging Face id is the simplest option (it resolves and caches the GGUF on first start). Set it in `src/main/resources/application.properties`.

Start the app (the model loads asynchronously, and endpoints return `503` until `state: READY`), then point any OpenAI client at it. `POST /v1/chat/completions` handles non-streaming requests, `stream:true` SSE, and `tools` / `tool_choice`; `GET /v1/models` lists the loaded model.
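As a rough illustration of talking to that endpoint from plain Java, the sketch below builds a non-streaming chat request with the JDK's `java.net.http` client and retries while the server answers `503`. Several details are assumptions rather than anything stated above: the base URL `http://localhost:8080` (Spring Boot's default port), the request body following the usual OpenAI chat-completions format, the one-second retry interval, and the class and method names themselves.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ChatClientSketch {

    // Assumption: the Spring Boot app listens on its default port.
    static final String BASE_URL = "http://localhost:8080";

    // Minimal escaping for quotes and backslashes only; use a JSON library in real code.
    public static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Body in the OpenAI chat-completions format, with a single user message.
    public static String chatBody(String model, String userMessage) {
        return "{\"model\":\"" + escapeJson(model) + "\","
                + "\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(userMessage) + "\"}]}";
    }

    public static HttpRequest chatRequest(String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .timeout(Duration.ofMinutes(2))
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(model, userMessage)))
                .build();
    }

    // Resends the request while the server reports 503 (model still loading).
    public static String chatWhenReady(HttpClient client, HttpRequest request, int maxAttempts)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 503) {
                return response.body();
            }
            Thread.sleep(1000); // assumed retry interval
        }
        throw new IllegalStateException("model still loading after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws Exception {
        // "local-model" is a placeholder; GET /v1/models lists the real name.
        HttpRequest request = chatRequest("local-model", "Say hello in one sentence.");
        System.out.println(chatWhenReady(HttpClient.newHttpClient(), request, 60));
    }
}
```

Any existing OpenAI SDK should work the same way once its base URL points at the app; the hand-rolled client here only keeps the example dependency-free.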

## A real multi-turn CLI

`mochallama chat` is a stateful REPL: it keeps the full conversation history instead of treating each turn as an amnesiac single exchange.

Sessions persist at `~/.chatbot_models/sessions/<id>.json`. Pass `--no-save` for an ephemeral run. Inside the REPL, the slash commands `/reset`, `/help`, and `/exit` are available.
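The difference between a stateful REPL and single-turn calls can be sketched in a few lines of Java. This is an illustration of the idea, not mochallama's implementation. The class `ChatHistorySketch`, the `generate` stand-in for the model call, and the assumption that `/reset` clears the accumulated history are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ChatHistorySketch {

    public record Message(String role, String content) {}

    private final List<Message> history = new ArrayList<>();

    // Hypothetical stand-in for the model: a real engine would receive the
    // whole history on every turn, which is what makes the chat stateful.
    private String generate(List<Message> fullHistory) {
        return "(reply based on " + fullHistory.size() + " messages of context)";
    }

    /** Handles one line of input; returns false when the REPL should stop. */
    public boolean handle(String line) {
        switch (line.trim()) {
            case "/exit":
                return false;
            case "/reset":
                history.clear(); // assumed meaning: start a fresh conversation
                return true;
            case "/help":
                System.out.println("Commands: /reset, /help, /exit");
                return true;
            default:
                history.add(new Message("user", line));
                history.add(new Message("assistant", generate(history)));
                return true;
        }
    }

    public List<Message> history() {
        return List.copyOf(history);
    }
}
```

Each ordinary line appends a user message and an assistant reply, so the context grows turn by turn until it is reset.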

## Honest positioning

Today, the usual local-LLM paths for the JVM reach your app over HTTP: Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The in-process options are either non-JVM or, on the JVM, either pure-Java Jlama (which reimplements inference on the incubating Vector API, GGUF-less) or JNI bindings whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.

It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j; mochallama slots in **under** them as the local provider via its Spring AI `ChatModel` adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in [Compare](/mochallama/compare).

## What to do next

- [Quickstart](/mochallama/quickstart): time-to-first-success with npx, plain Java, and Spring Boot.
- [Why mochallama](/mochallama/why): the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions.
- [Examples](/mochallama/examples/): curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming.
- [Compare](/mochallama/compare): mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.