The 10-second hook #
No Java install, no daemon, no native build: `npx` a tool-calling local LLM and start chatting:
The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. `qwen2.5-1.5b` is the default tool-capable preset; the model downloads on first run into `~/.chatbot_models`.
Embed it: the smallest plain-Java snippet #
Two dependencies are needed, the Java jar and the platform aggregator that resolves the right native classifier jar for your host:
JVM flags
JDK 22+ is required: the Foreign Function & Memory (FFM) API is GA there. Run with `--enable-native-access=ALL-UNNAMED`.
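The two requirements above can be checked at startup. The following is a minimal sketch, not part of mochallama: the class name `NativeAccessPreflight` and its helpers are hypothetical, and the flag check is only a heuristic, since it inspects the JVM's command-line arguments and will not see native access granted in other ways (for example through a jar manifest).

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical preflight helper; names are illustrative, not a mochallama API.
final class NativeAccessPreflight {
    // The FFM API is final from feature release 22 onwards.
    static final int MIN_FEATURE = 22;

    static boolean jdkIsSupported(int featureRelease) {
        return featureRelease >= MIN_FEATURE;
    }

    // Heuristic: looks for the flag among the JVM's input arguments.
    static boolean nativeAccessRequested(List<String> jvmArgs) {
        return jvmArgs.stream().anyMatch(arg -> arg.startsWith("--enable-native-access="));
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!jdkIsSupported(feature)) {
            System.err.println("JDK " + feature + " found, but JDK " + MIN_FEATURE + "+ is required.");
        }
        if (!nativeAccessRequested(jvmArgs)) {
            System.err.println("Tip: start the JVM with --enable-native-access=ALL-UNNAMED");
        }
    }
}
```

Failing fast with a readable message tends to be friendlier than letting the first native call surface the problem.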
Or one Spring dependency #
The starter autoconfigures a local model service and the OpenAI-compatible endpoints, with no `spring-ai` dependency required:
Tell it which model to load. A Hugging Face id is the simplest option (it resolves and caches the GGUF on first start). In `src/main/resources/application.properties`:
Start the app (the model loads asynchronously, so endpoints return `503` until `state: READY`), then point any OpenAI client at it. `POST /v1/chat/completions` handles non-streaming requests, `stream:true` SSE, and `tools` / `tool_choice`; `GET /v1/models` lists the loaded model.
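To make the wire contract concrete, here is a minimal sketch of a dependency-free client using the JDK's `java.net.http.HttpClient`. Several things are assumptions rather than documented behaviour: the base URL `http://localhost:8080` (Spring Boot's default port), the placeholder model id `local-model`, that `GET /v1/models` is among the endpoints answering `503` while the model loads, and that streamed responses follow the usual OpenAI convention of `data:` lines ending with `data: [DONE]`. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch; not a mochallama API.
final class ChatCompletionsSketch {
    // Assumption: the app listens on Spring Boot's default port.
    static final String BASE_URL = "http://localhost:8080";

    // Minimal JSON string escaping for this sketch (quotes, backslashes,
    // common whitespace). A JSON library is the better choice in real code.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Builds a single-turn chat request body in the OpenAI wire format.
    static String chatBody(String model, String userText, boolean stream) {
        return "{\"model\":" + jsonString(model)
                + ",\"stream\":" + stream
                + ",\"messages\":[{\"role\":\"user\",\"content\":" + jsonString(userText) + "}]}";
    }

    // Returns the payload of an SSE "data:" line, or null for any other line
    // and for the assumed "[DONE]" end-of-stream sentinel.
    static String ssePayload(String line) {
        if (!line.startsWith("data:")) return null;
        String payload = line.substring(5).strip();
        return payload.equals("[DONE]") ? null : payload;
    }

    // Polls until the server stops answering 503, up to maxAttempts tries.
    static boolean awaitReady(HttpClient http, int maxAttempts) throws Exception {
        HttpRequest probe = HttpRequest.newBuilder(URI.create(BASE_URL + "/v1/models")).GET().build();
        for (int i = 0; i < maxAttempts; i++) {
            int status = http.send(probe, HttpResponse.BodyHandlers.discarding()).statusCode();
            if (status != 503) return true;
            Thread.sleep(1000);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        if (!awaitReady(http, 120)) {
            System.err.println("Model did not become ready in time.");
            return;
        }
        HttpRequest chat = HttpRequest.newBuilder(URI.create(BASE_URL + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody("local-model", "Hello!", false)))
                .build();
        System.out.println(http.send(chat, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```

For streaming, the same request with `stream` set to `true` can be sent with `HttpResponse.BodyHandlers.ofLines()`, passing each line through `ssePayload` and skipping `null` results. Note that `awaitReady` throws if the app has not started listening yet, so it only covers the window in which the server is up but the model is still loading.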
A real multi-turn CLI #
`mochallama chat` is a stateful REPL: it keeps the full conversation history rather than treating each turn in isolation.
Sessions persist at `~/.chatbot_models/sessions/<id>.json`. Pass `--no-save` for an ephemeral run. Inside the REPL, the slash commands `/reset`, `/help`, and `/exit` are available.
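What "stateful" means here can be illustrated with a small sketch of a history-keeping REPL loop. This is not mochallama's implementation: the class `ReplHistorySketch` is hypothetical, the model is stubbed out as a plain function, session persistence is omitted, and it assumes `/reset` clears the accumulated history.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a multi-turn REPL; not mochallama's code.
final class ReplHistorySketch {
    record Message(String role, String content) {}

    private final List<Message> history = new ArrayList<>();
    // Stand-in for the model: receives the whole history, returns a reply.
    private final Function<List<Message>, String> model;

    ReplHistorySketch(Function<List<Message>, String> model) {
        this.model = model;
    }

    // Handles one input line; returns false when the REPL should exit.
    boolean handle(String line) {
        return switch (line.strip()) {
            case "/exit" -> false;
            case "/reset" -> { history.clear(); yield true; } // assumption: reset clears history
            case "/help" -> { System.out.println("Commands: /reset, /help, /exit"); yield true; }
            default -> {
                history.add(new Message("user", line));
                // Every turn sends the full history, not just the latest line.
                String reply = model.apply(List.copyOf(history));
                history.add(new Message("assistant", reply));
                System.out.println(reply);
                yield true;
            }
        };
    }

    List<Message> history() {
        return List.copyOf(history);
    }

    public static void main(String[] args) throws Exception {
        ReplHistorySketch repl = new ReplHistorySketch(h -> "(stub reply, " + h.size() + " messages seen)");
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null && repl.handle(line)) {
            // Loop until /exit or end of input.
        }
    }
}
```

The point of the sketch is the `default` branch: because the entire message list is passed on each turn, the model can refer back to earlier exchanges, which is what separates a multi-turn session from repeated single-turn calls.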
Honest positioning #
Today nearly every local-LLM path for the JVM reaches your app over HTTP: Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The in-process options are either non-JVM or, on the JVM, limited to pure-Java Jlama (which reimplements inference on the incubating Vector API, GGUF-less) and JNI bindings whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.
It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j; mochallama slots in under them as the local provider via its Spring AI `ChatModel` adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare.
What to do next #
Quickstart: time-to-first-success with npx, plain Java, and Spring Boot.
Why mochallama: the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions.
Examples: curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming.
Compare: mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.