Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

Mochallama, a new Java library, enables running llama.cpp inference directly within a Java process using JDK 22's Foreign Function and Memory (FFM) API, eliminating the need for separate daemon processes or native builds. The tool ships with a CLI that includes its own JDK runtime via npm, supports tool-calling local LLMs like Qwen2.5-1.5B, and offers Spring Boot integration with OpenAI-compatible endpoints for chat completions and tool use. This approach fills a gap in JVM-based local LLM deployment by providing in-process inference without JNI's crash risks or the overhead of HTTP-based solutions like Ollama.

The 10-second hook No Java install, no daemon, no native build — npx a tool-calling local LLM and start chatting: The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. qwen2.5-1.5b is the default tool-capable preset; the model downloads on first run into ~/.chatbot models . Embed it: the smallest plain-Java snippet Two dependencies — the Java jar plus the platform aggregator that resolves the right native classifier jar for your host: JVM flags JDK 22+ is required FFM is GA there . Run with --enable-native-access=ALL-UNNAMED . Or one Spring dependency The starter autoconfigures a local model service and the OpenAI-compatible endpoints — no spring-ai dependency required: Tell it which model to load — a Hugging Face id is the simplest it resolves + caches the GGUF on first start . In src/main/resources/application.properties : Start the app the model loads asynchronously — endpoints return 503 until state: READY , then point any OpenAI client at it. POST /v1/chat/completions handles non-streaming, stream:true SSE, and tools / tool choice ; GET /v1/models lists the loaded model. A real multi-turn CLI mochallama chat is a stateful REPL — it keeps the full conversation history, not amnesiac single turns. Sessions persist at ~/.chatbot models/sessions/<id .json . Pass --no-save for an ephemeral run. Inside the REPL, slash commands /reset , /help , and /exit are available. Honest positioning Today every local-LLM path for the JVM reaches your app over HTTP — Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The other in-process options are non-JVM, or on the JVM are pure-Java Jlama reimplements inference on the incubating Vector API, GGUF-less or JNI bindings whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM GA + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install. It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j — mochallama slots in under them as the local provider via its Spring AI ChatModel adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare /mochallama/compare . What to do next Quickstart /mochallama/quickstart — time-to-first-success: npx, plain Java, and Spring Boot. Why mochallama /mochallama/why — the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions. Examples /mochallama/examples/ — curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming. Compare /mochallama/compare — mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.