llmaker is an open-source platform for running the complete modern LLM stack on your own infrastructure: large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer, all provisioned, networked, and production-shaped from a single command.
Build private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine.
Quickstart · Why llmaker · Stacks · The agent · Architecture · CLI · Roadmap
Running a model locally is easy. Shipping an application is not. A production
retrieval system needs a vector database, an embeddings service, a caching layer,
an orchestration layer, and observability, each containerized, networked, and
configured to discover the others. Assembling that is a recurring tax: a sprawl of
`docker run` flags, a brittle Compose file, and hundreds of lines of framework glue.
llmaker removes that tax. One CLI provisions the entire stack on a private network
and operates it as a single fleet, with live status, logs, and a resource
dashboard across every model and service. Stacks are declarative and
reconcilable (`apply --prune`), models are OpenAI-compatible, and retrieval is traced out of the box.

From a single model to a complete application:
llmaker stack up assistant        # one command: a private ChatGPT-style UI over a local model
llmaker stack init rag            # …or scaffold any stack to edit, then apply it:
llmaker apply                     # assistant · voice · rag · research · code · chatbot · faq · recommend · sql
llmaker up --model llama3:8b      # a local endpoint: explicit, or a preset:
llmaker up chat                   # chat · code · small · embed · vision
llmaker chat <name> # test it in the terminal
llmaker open <name> # open its built-in web UI
llmaker service catalog # browse what's available
llmaker service add qdrant        # vector database → qdrant:6333
llmaker service add redis         # cache / memory → redis:6379
llmaker service add langfuse      # observability → langfuse:3000
llmaker ls # every model + service, one view (--json)
llmaker top # live resource dashboard (TUI)
llmaker status <name> # gauges, loaded models, endpoints
llmaker logs <name> -f # stream logs from any container
llmaker pull mistral --on chat # download a model with progress
llmaker stop / start / rm # lifecycle management
AGENT=$(llmaker service ls --json | jq -r '.[]|select(.service=="agent").url')
curl "$AGENT/api/ingest" -F file=@handbook.pdf # add knowledge
curl "$AGENT/api/chat" -d '{"question":"refund policy?"}' # grounded answer + sources
curl "$AGENT/api/recommend" -d '{"like":["sku1","sku2"]}' # semantic recommendations
Everything lands on a private network where each container discovers the others by name: no Compose file and no glue code.
- **The complete stack, curated.** Models and the infrastructure around them, from one versioned catalog: vector databases (Qdrant, Chroma, pgvector, Weaviate), Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, Langfuse.
- **Automatic service discovery.** Every model and service joins a private Docker network and resolves by name. Your application reaches `chat:8080` and `qdrant:6333` with zero IP wiring.
- **A retrieval & tool agent, built in.** A FastAPI + LangGraph service: rewrite → retrieve → rerank → generate (multi-turn, MMR), a tool-calling loop (calculator, knowledge base, self-hosted web search, SQL), and a semantic recommendation API.
- **Observability by default.** Every instance exposes Prometheus `/metrics` (requests, tokens/sec, CPU/RAM/GPU) for scraping, and the RAG stack ships Langfuse: every query is traced (retrieval hits and scores, model and token usage) with no setup.
- **Measurable quality.** An evaluation harness (`/api/eval`) grades answers for groundedness, relevance, and correctness with an LLM judge, so retrieval quality is something you track across changes, not guess at.
- **More than RAG.** First-class endpoints for summarization (map-reduce over long docs), structured JSON extraction, and speech-to-text (Whisper), plus optional Redis-backed conversation memory.
- **Declarative, reconcilable.** Define your stack in one file. `llmaker apply` brings it to the desired state in dependency order; `--prune` removes what's no longer declared.
- **OpenAI-compatible.** Each model exposes a stable `/v1/*` API (chat, completions, embeddings, streaming) behind one contract. Ollama runs it today, with a llama.cpp backend scaffolded but still maturing.
- **Private by design.** Containers bind to `127.0.0.1` by default. Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in.
- **Operable.** A single static Go binary, a labeled-container model with no state file to drift, `--json` output everywhere, and a live `top` dashboard.
- **Data ownership.** Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API.
- **No assembly tax.** The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked, not as a Compose file you maintain by hand.
- **Predictable cost.** Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits.
- **Portability.** The same `stack.yaml` runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application.
| | Model runners (Ollama, LM Studio) | DIY Docker Compose | Frameworks (LangChain) | llmaker |
|---|---|---|---|---|
| Run local models, OpenAI-compatible | β | β | β | β |
| Vector DB, embeddings, cache (curated) | β | manual | β | β |
| Service discovery between containers | β | manual | n/a | β |
| One-command application (RAG, recsys) | β | β | β | β |
| Built-in retrieval & recommendation agent | β | β | you code it | β |
| Observability / tracing integrated | β | manual | manual | β |
| Declarative provisioning & reconciliation | β | partial | β | β |
Requires Docker. Run `llmaker doctor` afterward to validate your environment.
curl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh
go install github.com/raiyanyahya/llmaker/cmd/llmaker@latest
git clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build
Homebrew and winget packages are on the roadmap. The agent image is built locally with make image-agent until it is published to a registry.
Provision and run a complete retrieval-augmented generation stack:
llmaker stack up assistant # scaffold + apply in one step (assistant needs no agent image)
llmaker stack init rag # generate stack.yaml (assistant | voice | rag | research | code | chatbot | faq | recommend | sql)
make image-agent # build the agent image once (stacks that include the agent)
llmaker apply -f stack.yaml # provision the stack: model + services, networked
llmaker ls # inspect models and services in one view
Resolve the agent endpoint and use it:
AGENT=$(llmaker service ls --json | jq -r '.[] | select(.service=="agent").url')
curl "$AGENT/api/ingest" -F file=@handbook.pdf # ingest documents
curl "$AGENT/api/chat" -d '{"question":"…","history":[],"top_k":4}' # query, with sources
llmaker also runs individual models, the easiest way to expose a local, OpenAI-compatible endpoint:
llmaker up --model llama3:8b # provision a model instance
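Any OpenAI client can then talk to it, for example from Python:
from openai import OpenAI

# Only base_url changes; the key is a placeholder while API_KEY is unset (the default).
client = OpenAI(base_url="http://127.0.0.1:11500/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)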
A stack is a model plus the services around it, provisioned together. Scaffold
and run one in a single step with `llmaker stack up <name>`, or generate a
`stack.yaml` to edit with `llmaker stack init <name>` and apply it with
`llmaker apply`.
| Template | Application | Components |
|---|---|---|
| `assistant` | A private, ChatGPT-style assistant over a local model: chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI |
| `voice` | Talk to a model: speech-to-text in the browser via self-hosted Whisper. No agent image to build | LLM · Open WebUI · Whisper |
| `rag` | Document Q&A: ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres |
| `research` | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent |
| `code` | A code assistant: ingest a repo, ask grounded questions and review | code LLM · Qdrant · embeddings · agent |
| `chatbot` | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent |
| `faq` | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent |
| `recommend` | A semantic recommendation engine ("more like this"), no LLM required | Qdrant · embeddings · agent |
| `sql` | Ask your database in plain English: the agent runs read-only SQL (enforced) and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent |
The catalog's `agent` is a FastAPI + LangGraph service (`agent/`) that turns a bare model and vector store into an application. It is a standard service on the network, configured by environment to discover the others by name.
Retrieval is an explicit graph, `rewrite → retrieve → rerank → generate`:

- `rewrite`: collapses multi-turn history into a standalone query, so follow-ups that depend on context ("and when was *it* released?") resolve correctly. The model is only invoked when there is history to resolve.
- `retrieve`: embeds the query and retrieves a candidate set from the vector store.
- `rerank`: applies Maximal Marginal Relevance for relevant, non-redundant context.
- `generate`: produces the answer from that context and the conversation.
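The rerank step can be illustrated with a minimal sketch of Maximal Marginal Relevance. This is not the agent's code: the function names, the `lam` trade-off parameter, and the `(doc_id, vector)` candidate shape are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k=4, lam=0.7):
    """Pick k candidates, trading query relevance against redundancy.

    candidates: list of (doc_id, vector) pairs (hypothetical shape).
    lam=1.0 is pure relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(vec, chosen) for _, chosen in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]
```

With a low `lam`, a near-duplicate of an already selected chunk loses to a slightly less relevant but different one, which is the "relevant, non-redundant" behavior described above.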
POST /api/ingest      multipart file or text                      → chunk, embed, store
POST /api/chat        { question, history?, top_k?, session_id? } → answer + sources
POST /api/agent       { question, history?, session_id? }         → tool-using answer + tool calls
POST /api/summarize   { text, instructions?, max_words? }         → summary (map-reduce for long text)
POST /api/extract     { text, fields: { name: description } }     → JSON with exactly those keys
POST /api/transcribe  multipart audio file                        → { text } (needs a whisper service)
POST /api/eval        { cases: [{ question, reference? }] }       → graded answers + summary
POST /api/items       { items: [{ id, text, metadata? }] }        → index for recommendation
POST /api/recommend   { query } or { like: [id, …] }              → ranked items
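As a sketch of calling these endpoints from code, the helper below builds a `/api/chat` request with only the standard library. The base URL and the JSON `Content-Type` header are assumptions, and the response is returned as raw decoded JSON because its exact shape is documented in agent/README.md, not here.

```python
import json
import urllib.request


def build_chat_request(agent_url, question, history=None, top_k=4):
    """Build a POST /api/chat request; nothing is sent until it is opened."""
    body = {"question": question, "history": history or [], "top_k": top_k}
    return urllib.request.Request(
        f"{agent_url.rstrip('/')}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed; check the agent docs
        method="POST",
    )


def ask(agent_url, question, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(agent_url, question, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In practice `agent_url` would be the value resolved with `llmaker service ls --json`, as in the shell examples.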
**Tool calling.** Beyond retrieval, `/api/agent` runs a tool-calling loop where
the model decides which tools to invoke (a calculator, the knowledge base as
retrieval-as-a-tool, the current time, a self-hosted web search via SearXNG with
no paid API, and an optional read-only SQL tool over your database), and the loop
executes them until it has an answer. The response includes every tool call it
made. Adding a tool is one entry in `agent/app/tools.py`.
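The shape of such a loop can be sketched as follows. Everything here is hypothetical: the `TOOLS` registry, the step format returned by the model, and the scripted stand-in model are illustrative and do not reflect the actual layout of `agent/app/tools.py`.

```python
import ast
import operator
from datetime import datetime, timezone

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression):
    """Evaluate +, -, *, / arithmetic on numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


# Hypothetical registry: tool name -> callable taking keyword arguments.
TOOLS = {
    "calculator": calculator,
    "current_time": lambda: datetime.now(timezone.utc).isoformat(),
}


def run_agent(model, question, max_steps=5):
    """Execute tools the model asks for until it answers; record every call."""
    messages = [{"role": "user", "content": question}]
    calls = []
    for _ in range(max_steps):
        step = model(messages)  # assumed: {"tool", "args"} or {"answer"}
        if "answer" in step:
            return {"answer": step["answer"], "tool_calls": calls}
        args = step.get("args", {})
        result = TOOLS[step["tool"]](**args)
        calls.append({"tool": step["tool"], "args": args, "result": result})
        messages.append({"role": "tool", "content": str(result)})
    return {"answer": None, "tool_calls": calls}


def scripted_model(messages):
    """Stand-in for an LLM: request one calculation, then answer with it."""
    if messages[-1]["role"] == "user":
        return {"tool": "calculator", "args": {"expression": "19 * 3"}}
    return {"answer": f"The total is {messages[-1]['content']}."}
```

The `max_steps` bound is a design choice of this sketch so that a model that never answers cannot loop forever.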
**Tracing.** The `rag` stack provisions Langfuse and the agent traces every query to it with zero configuration: each request (RAG or tool-using) appears as a trace with its retrieval, tool, and generation steps. Tracing is enabled by the template and is otherwise opt-in via two environment variables.
**Evaluation.** `/api/eval` runs a question set through the same pipeline and grades each answer: groundedness and relevance by LLM-as-judge, plus correctness against a reference and context recall against expected sources when you supply them. You get per-case scores and an aggregate summary, and every case is traced to Langfuse alongside your live traffic, so retrieval quality is measurable, not a vibe.
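The step from per-case scores to an aggregate summary can be sketched as a per-metric mean. The `scores` dictionary shape and the metric names as keys are assumptions for illustration; the real response format is defined by the agent.

```python
from statistics import mean


def summarize(cases):
    """Average each metric over the cases that actually report it."""
    by_metric = {}
    for case in cases:
        for name, value in case["scores"].items():
            if value is not None:  # e.g. correctness without a reference
                by_metric.setdefault(name, []).append(value)
    return {name: round(mean(values), 3) for name, values in by_metric.items()}
```

Skipping missing values keeps optional metrics, such as correctness when no reference is supplied, from dragging the average down.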
**Beyond chat.** Two everyday tasks are first-class endpoints: `/api/summarize`
condenses text (map-reducing long inputs chunk by chunk so a whole report fits),
and `/api/extract` turns text into a typed JSON object from the fields you name,
parsed defensively so a chatty model never breaks the contract. With a `whisper`
service on the network, `/api/transcribe` adds speech-to-text.
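One way to parse model output defensively is shown below: take the outermost braces, ignore surrounding chatter, and always return exactly the requested keys. This is a sketch of the idea, not the agent's parser; `parse_extraction` is a hypothetical name.

```python
import json


def parse_extraction(raw, fields):
    """Return a dict with exactly the keys in `fields`; missing ones are None."""
    data = {}
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            parsed = json.loads(raw[start:end + 1])
            if isinstance(parsed, dict):
                data = parsed
        except json.JSONDecodeError:
            pass  # malformed JSON falls through to an all-None result
    # Drop unexpected keys and fill in absent ones so the contract holds.
    return {name: data.get(name) for name in fields}
```

Whatever the model returns, the caller receives the same set of keys, so downstream code never has to handle a missing or extra field.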
**Memory.** The agent is stateless by default (the client passes `history`). Set
`REDIS_URL` and it persists history server-side: send a `session_id` with
`/api/chat` or `/api/agent` and prior turns are loaded, prepended, and saved
automatically. History is capped and expiring, and best-effort, so Redis being
down never fails a chat. `llmaker stack init chatbot` wires it up.
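The "capped, expiring, best-effort" behavior can be sketched with a small wrapper around any store that offers `get(key)` and `set(key, value, ex=seconds)`, which is the shape of the redis-py client. The class name, key prefix, and default limits are assumptions; the two in-memory stores stand in for a healthy and an unreachable Redis.

```python
import json


class SessionMemory:
    """Best-effort per-session history on top of a Redis-like store."""

    def __init__(self, store, max_turns=20, ttl_seconds=3600):
        self.store, self.max_turns, self.ttl = store, max_turns, ttl_seconds

    def load(self, session_id):
        try:
            raw = self.store.get(f"session:{session_id}")
            return json.loads(raw) if raw else []
        except Exception:
            return []  # a store outage simply means "no history"

    def save(self, session_id, history):
        try:
            capped = history[-self.max_turns:]  # keep only the latest turns
            self.store.set(f"session:{session_id}", json.dumps(capped), ex=self.ttl)
        except Exception:
            pass  # never fail the chat because memory is unavailable


class DictStore:
    """In-memory stand-in for Redis (ignores expiry)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ex=None):
        self.data[key] = value


class DownStore:
    """Simulates an unreachable store."""

    def get(self, key):
        raise ConnectionError("store unavailable")

    def set(self, key, value, ex=None):
        raise ConnectionError("store unavailable")
```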
**Recommendations** reuse the same embeddings and vector store, with no model
involved: index items once, then retrieve by free-text intent (`query`) or by
example (`like`, which averages the seed items into a profile and excludes them from the results).
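Recommendation by example can be sketched in a few lines: average the seed vectors into a profile, rank everything else by cosine similarity, and leave the seeds out. The in-memory `index` dictionary is an assumption standing in for the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_like(index, like_ids, k=3):
    """index: {item_id: vector}. Returns up to k item ids, seeds excluded."""
    seeds = [index[item_id] for item_id in like_ids]
    # Component-wise mean of the seed vectors forms the taste profile.
    profile = [sum(dim) / len(seeds) for dim in zip(*seeds)]
    scored = [
        (cosine(profile, vec), item_id)
        for item_id, vec in index.items()
        if item_id not in like_ids
    ]
    return [item_id for _, item_id in sorted(scored, reverse=True)[:k]]
```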
Full agent contract and configuration: agent/README.md.
Compose a stack from the catalog directly, or let a template do it:
llmaker service catalog # list available services
llmaker service add qdrant        # vector database → qdrant:6333
llmaker service add redis         # cache / memory → redis:6379
llmaker service add embeddings    # embeddings (HF TEI) → embeddings:80
llmaker service add searxng       # web search → searxng:8080
llmaker service add whisper       # speech-to-text → whisper:8000
llmaker service add open-webui    # ChatGPT-style UI → open-webui:8080
llmaker service add langfuse      # observability → langfuse:3000
| Category | Services |
|---|---|
| Vector databases | Qdrant · Chroma · pgvector (Postgres) · Weaviate |
| Cache / memory | Redis (powers per-session agent memory) |
| Embeddings | HuggingFace Text-Embeddings-Inference |
| Search | SearXNG (self-hosted metasearch) |
| Speech-to-text | Whisper (faster-whisper, OpenAI-compatible) |
| Observability | Langfuse |
| Web UI & apps | Open WebUI (ChatGPT-style UI) · n8n (workflow automation) · Flowise (visual LLM app builder) |
| Agent | LangGraph retrieval & recommendation agent |
Every model and service joins a private Docker network (`llmaker-net`) and is addressable there by name: service discovery without IPs, links, or a Compose file. Applications running on the host or in their own container reach the stack the same way:
docker run --rm --network llmaker-net redis:7-alpine redis-cli -h redis ping # → PONG
Adding a service is a single entry in `internal/service/catalog.go`; the CLI, fleet view, and declarative engine pick it up automatically.
`stack init` generates a `stack.yaml`; it can also be authored by hand. `apply`
reconciles the running stack to the file, provisioning services before the
applications that depend on them, and `--prune` removes anything not declared.
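Provisioning in dependency order amounts to a topological sort. The sketch below uses Python's standard `graphlib`; the dependency map is hypothetical and only illustrates a rag-style stack, it is not read from a real `stack.yaml`.

```python
from graphlib import TopologicalSorter


def provision_order(depends_on):
    """depends_on maps each component to the components it needs first."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map: the agent needs everything else to be up.
deps = {
    "qdrant": set(),
    "embeddings": set(),
    "chat": set(),
    "agent": {"qdrant", "embeddings", "chat"},
}
order = provision_order(deps)
print(order)  # the agent always comes last; the others may appear in any order
```

A cycle in the map raises `graphlib.CycleError`, which is a convenient point to reject an invalid stack definition.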
Give the file a top-level `name:` and `--prune` is scoped to that stack, so applying one stack never deletes another's containers (scaffolded stacks are named automatically). An unnamed file prunes the whole managed fleet.
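The scoping rule can be sketched as a small planning function. The container record shape (`name`, `stack`) is an assumption made for illustration; llmaker tracks this information through container labels.

```python
def plan_prune(running, declared, stack_name=None):
    """Return the names of managed containers that should be removed.

    running:    list of {"name": ..., "stack": ...} records (hypothetical shape)
    declared:   set of names present in the applied file
    stack_name: top-level name of the file, or None for an unnamed file
    """
    doomed = []
    for container in running:
        # A named file only ever touches its own stack's containers.
        in_scope = stack_name is None or container.get("stack") == stack_name
        if in_scope and container["name"] not in declared:
            doomed.append(container["name"])
    return doomed
```

With `stack_name` set, a container that belongs to another stack is never a prune candidate, whatever the file declares.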
defaults: { backend: ollama }
instances:
  - { name: chat, model: llama3:8b, memory: 8g }   # → chat:8080
services:
  - use: qdrant                                    # → qdrant:6333
  - { name: cache, use: redis }                    # → cache:6379
  - { name: embeddings, use: embeddings, env: { MODEL_ID: BAAI/bge-small-en-v1.5 } }
  - use: agent                                     # → agent:8800
Unset ports are assigned automatically; a stack may be services-only. See examples/stack.yaml and examples/llm.yaml.
┌──────────────────────────────────────────────────────────────────────┐
│ llmaker CLI (Go, single static binary)                               │
│ orchestration · Docker SDK · private networking · declarative apply  │
└───────────────────────────────┬──────────────────────────────────────┘
                                │ provision · start · stop · HTTP
                                ▼
──────────────── llmaker-net (private network, DNS by name) ────────────────
┌─ Model instance ────────────┐  ┌─ Services ──────────────────────────────┐
│ engine + facade (FastAPI)   │  │ qdrant · embeddings · redis · pgvector  │
│ Ollama · llama.cpp*         │  │ langfuse · …                            │
│ OpenAI /v1/* · web UI       │  │ qdrant:6333   embeddings:80             │
│ chat:8080                   │  └─────────────────────────────────────────┘
└─────────────────────────────┘                       ▲
               ▲                                      │
               └──────────────────┬───────────────────┘
                 ┌─ Agent (FastAPI + LangGraph) ───┐
                 │ rewrite → retrieve → rerank →   │ agent:8800
                 │ generate · ingest · recommend   │
                 └─────────────────────────────────┘
host ports (127.0.0.1:PORT) mapped per container
\* The llama.cpp backend is scaffolded but still maturing; Ollama is the verified default (see the roadmap).
The control plane is a single Go binary; the data plane is containers on a private
network. Orchestration logic is decoupled from Docker behind a `Runtime`
interface, and the fleet is tracked entirely through container labels, so there is no local state file to drift out of sync. Model facades and the agent are Python (FastAPI), each communicating over the same HTTP contract.
| Command | Description |
|---|---|
| `llmaker stack up <name>` | Scaffold a stack and apply it in one command |
| `llmaker stack init <template>` | Generate a ready-to-apply stack definition to edit |
| `llmaker apply -f stack.yaml` | Provision / reconcile a declarative stack (`--prune`) |
| `llmaker up [preset]` | Provision a model instance: preset, flags, or interactive wizard |
| `llmaker stop / start / restart / rm` | Instance lifecycle (`restart` = stop + start; `rm --force` removes a running one) |
| `llmaker service catalog` | List available services |
| `llmaker service add <type> [name]` | Provision a service (`--env`, `--port`, `--memory`) |
| `llmaker service ls / rm` | Manage services (`--json`) |
| `llmaker ls` | List the fleet, models and services (`--json`, `--quiet`) |
| `llmaker top` | Live resource dashboard across the fleet |
| `llmaker status <name>` | Detailed instance status (`--json`) |
| `llmaker pull <model> --on <name>` | Download a model with progress (`--default`) |
| `llmaker chat [name]` | Interactive or one-shot chat (`--message`, stdin) |
| `llmaker open <name>` | Open a container's web UI (`--print`) |
| `llmaker logs <name> -f` | Stream logs from any container |
| `llmaker doctor` | Validate the environment (Docker, GPU, platform caveats) |
| Setting | Where | Default |
|---|---|---|
| backend / model | `--backend` · `--model` · `stack.yaml` | ollama · backend default |
| memory · cpus · gpu | flags · `stack.yaml` | host-derived |
| port · host | `--port` · `--host` | auto · 127.0.0.1 |
| service environment | `service add --env` · `env:` in `stack.yaml` | per-service defaults |
| `API_KEY` · `CORS_ORIGINS` · `KEEP_ALIVE` | `--api-key` · `--cors` · `--keep-alive` | open · * · 5m |
Per-service and agent configuration (model URLs, chunking, reranking, tracing keys) is documented in agent/README.md and facade/README.md.
Every container binds to `127.0.0.1` by default; nothing is exposed until you opt in, and exposure pairs with authentication:
llmaker up --host 0.0.0.0 --api-key "$(openssl rand -hex 16)"
When `API_KEY` is set, every `/v1/*` and `/api/*` request requires a bearer token
(liveness probes excepted). The agent enforces its own `API_KEY` identically. The Langfuse keys and database password in the catalog are development defaults; rotate them before exposing a stack beyond localhost.
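A client that talks to a protected instance needs to attach the key as a bearer token. The sketch below assumes the conventional `Authorization: Bearer <key>` header and reuses the example endpoint `http://127.0.0.1:11500/v1`; `authed_request` is a hypothetical helper, not part of llmaker.

```python
import json
import urllib.request


def authed_request(url, payload, api_key):
    """Build a JSON POST request carrying the API key as a bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed header format
        },
        method="POST",
    )


request = authed_request(
    "http://127.0.0.1:11500/v1/chat/completions",
    {"model": "llama3:8b", "messages": [{"role": "user", "content": "Hello"}]},
    api_key="replace-with-your-key",
)
# urllib.request.urlopen(request) would send it to a running instance.
```

With the OpenAI Python client the same effect comes from passing the key as `api_key`, since that client sends it as a bearer token.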
Docker on macOS cannot pass through the Apple GPU; a containerized engine runs
CPU-only. `llmaker doctor` detects and reports this. On Linux with NVIDIA, `--gpu`
reserves GPUs via the NVIDIA Container Toolkit.
| Image | Size | Use |
|---|---|---|
| `llmaker-ollama:latest` | ~8.5 GB | GPU-capable (Linux + NVIDIA) |
| `llmaker-ollama:cpu` | ~360 MB | CPU-only: laptops, CI, macOS |
| `llmaker-agent:latest` | ~510 MB | LangGraph agent: RAG, tools, eval, summarize/extract, transcribe |
Images are resolved with a pull-if-missing policy, so locally built images
(`make image-agent`) are used directly without contacting a registry.
make build # build ./bin/llmaker
make check # gofmt + vet + go test (CI parity)
make facade-setup && make facade-test # model facade (pytest)
make agent-setup && make agent-test # retrieval/recommendation agent (pytest)
make images # build backend + agent images
The Go control plane is tested against an in-memory runtime (no Docker required).
The model facade and the agent (routes, the LangGraph pipeline, reranking,
tracing, and recommendation) are tested against in-memory fakes. CI runs Go race
tests, `gofmt`, a ruff-linted Python test matrix, and image builds on every push.
cmd/llmaker/          CLI entrypoint
internal/
  backend/            inference engines and image references
  service/            the service catalog
  engine/             domain model, ports, labels, Runtime interface
  dockerrt/           Docker implementation and the private network
  enginetest/         in-memory Runtime for tests
  config/             stack.yaml parsing and dependency ordering
  cli/ · ui/ · tui/   Cobra commands and the terminal interface
facade/               model facade (FastAPI) + per-model web UI
agent/                retrieval & recommendation agent (FastAPI + LangGraph)
images/               backend and agent Dockerfiles
Status: alpha. Checked capabilities are implemented and covered by the test suite; the core stack is verified end-to-end against live Docker.

- [x] Model instances: OpenAI-compatible facade, per-model UI, fleet management
- [x] Service catalog: vector databases, cache, embeddings, search, observability
- [x] Private networking: automatic service discovery by name
- [x] Declarative stacks: `stack init` templates and reconciling `apply --prune`
- [x] Retrieval agent: LangGraph `rewrite → retrieve → rerank → generate`, multi-turn
- [x] Recommendation engine: semantic `query` and "more like this"
- [x] Integrated observability: Langfuse tracing
- [x] Tool-calling agent: calculator, knowledge base, time, web search, read-only SQL
- [x] Self-hosted web search: SearXNG service + a `web_search` agent tool
- [x] Evaluation harness: `/api/eval` graded by LLM-as-judge, traced to Langfuse
- [x] Summarization & extraction: `/api/summarize` (map-reduce), `/api/extract` (typed JSON)
- [x] Speech-to-text: Whisper service + `/api/transcribe`
- [x] Conversation memory: Redis-backed per-session history (`session_id`)
- [ ] More agent tooling: dedicated cross-encoder reranking; richer eval datasets
- [ ] Additional backends: llama.cpp model management; Metal on macOS
- [ ] Distribution: multi-architecture images, package managers, releases
Contributions are welcome. Keep the suite green (`make check`, `make facade-test`,
`make agent-test`), match the surrounding style, and include tests. Adding a service is a single catalog entry; adding a model backend is a single facade adapter.
Apache 2.0 © Raiyan Yahya.