{"slug": "show-hn-run-llama-cpp-in-process-from-java-with-project-panama-ffm", "title": "Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM", "summary": "Mochallama, a new Java library, enables running llama.cpp inference directly within a Java process using JDK 22's Foreign Function and Memory (FFM) API, eliminating the need for separate daemon processes or native builds. The tool ships with a CLI that includes its own JDK runtime via npm, supports tool-calling local LLMs like Qwen2.5-1.5B, and offers Spring Boot integration with OpenAI-compatible endpoints for chat completions and tool use. This approach fills a gap in JVM-based local LLM deployment by providing in-process inference without JNI's crash risks or the overhead of HTTP-based solutions like Ollama.", "body_md": "## The 10-second hook\n\nNo Java install, no daemon, no native build — `npx` a tool-calling local LLM and start chatting:\n\nThe CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. `qwen2.5-1.5b` is the default tool-capable preset; the model downloads on first run into `~/.chatbot_models`.\n\n## Embed it: the smallest plain-Java snippet\n\nTwo dependencies — the Java jar plus the platform aggregator that resolves the right native classifier jar for your host:\n\nJVM flags\n\nJDK 22+ is required (FFM is GA there). Run with `--enable-native-access=ALL-UNNAMED`.\n\n## Or one Spring dependency\n\nThe starter autoconfigures a local model service and the OpenAI-compatible endpoints — no `spring-ai` dependency required:\n\nTell it which model to load — a Hugging Face id is the simplest (it resolves and caches the GGUF on first start). In `src/main/resources/application.properties`:\n\nStart the app (the model loads asynchronously — endpoints return `503` until `state: READY`), then point any OpenAI client at it. 
`POST /v1/chat/completions` handles non-streaming, `stream:true` SSE, and `tools` / `tool_choice`; `GET /v1/models` lists the loaded model.\n\n## A real multi-turn CLI\n\n`mochallama chat` is a stateful REPL — it keeps the full conversation history, not amnesiac single turns.\n\nSessions persist at `~/.chatbot_models/sessions/<id>.json`. Pass `--no-save` for an ephemeral run. Inside the REPL, slash commands `/reset`, `/help`, and `/exit` are available.\n\n## Honest positioning\n\nToday every local-LLM path for the JVM reaches your app over HTTP — Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The other in-process options are either non-JVM or, on the JVM, pure-Java Jlama (which reimplements inference on the incubating Vector API, GGUF-less) or JNI bindings whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.\n\nIt is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j — mochallama slots in **under** them as the local provider via its Spring AI `ChatModel` adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. 
See the full, PR-welcome breakdown in [Compare](/mochallama/compare).\n\n## What to do next\n\n- [Quickstart](/mochallama/quickstart) — time-to-first-success: npx, plain Java, and Spring Boot.\n- [Why mochallama](/mochallama/why) — the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions.\n- [Examples](/mochallama/examples/) — curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming.\n- [Compare](/mochallama/compare) — mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.", "url": "https://wpnews.pro/news/show-hn-run-llama-cpp-in-process-from-java-with-project-panama-ffm", "canonical_source": "https://deemwar-products.github.io/mochallama/", "published_at": "2026-06-05 08:42:24+00:00", "updated_at": "2026-06-05 08:47:46.755539+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "natural-language-processing", "generative-ai"], "entities": ["Llama.cpp", "Project Panama", "JDK", "Spring", "Hugging Face", "OpenAI", "GGUF", "Mochallama"], "alternates": {"html": "https://wpnews.pro/news/show-hn-run-llama-cpp-in-process-from-java-with-project-panama-ffm", "markdown": "https://wpnews.pro/news/show-hn-run-llama-cpp-in-process-from-java-with-project-panama-ffm.md", "text": "https://wpnews.pro/news/show-hn-run-llama-cpp-in-process-from-java-with-project-panama-ffm.txt", "jsonld": "https://wpnews.pro/news/show-hn-run-llama-cpp-in-process-from-java-with-project-panama-ffm.jsonld"}}