Show HN: Self hosting a modern LLM stack

Llmaker, an open-source platform for self-hosting a complete LLM stack including models, vector databases, embeddings, caching, observability, and an agent layer, launched on Hacker News. The platform provisions and manages the entire stack from a single CLI command, enabling private retrieval-augmented chatbots and recommendation engines without third-party APIs. It aims to eliminate the complexity of assembling and networking multiple containerized services for production LLM applications.

llmaker is an open-source platform for running the complete modern LLM stack on your own infrastructure — large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer — provisioned, networked, and production-shaped from a single command. Build private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine. Quickstart quickstart · Why llmaker why-self-host-your-llm-stack · Stacks stacks · The agent the-agent · Architecture architecture · CLI cli-reference · Roadmap roadmap Running a model locally is easy. Shipping an application is not. A production retrieval system needs a vector database, an embeddings service, a caching layer, an orchestration layer, and observability — each containerized, networked, and configured to discover the others. Assembling that is a recurring tax: a sprawl of docker run flags, a brittle Compose file, and hundreds of lines of framework glue. llmaker removes that tax. One CLI provisions the entire stack on a private network and operates it as a single fleet — live status, logs, and a resource dashboard across every model and service. Stacks are declarative and reconcilable apply --prune , models are OpenAI-compatible , and retrieval is traced out of the box . From a single model to a complete application: ── Build a complete application stack ────────────────────────── llmaker stack up assistant one command → a private ChatGPT-style UI over a local model llmaker stack init rag …or scaffold any stack to edit, then apply it: llmaker apply assistant · voice · rag · research · code · chatbot · faq · recommend · sql ── …or run a single model OpenAI-compatible ────────────────── llmaker up --model llama3:8b a local endpoint — explicit, or a preset: llmaker up chat chat · code · small · embed · vision llmaker chat <name test it in the terminal llmaker open <name open its built-in web UI ── …or compose the stack à la carte, service by service ──────── llmaker service catalog browse what's available llmaker service add qdrant vector database → qdrant:6333 llmaker service add redis cache / memory → redis:6379 llmaker service add langfuse observability → langfuse:3000 ── Operate the fleet ─────────────────────────────────────────── llmaker ls every model + service, one view --json llmaker top live resource dashboard TUI llmaker status <name gauges, loaded models, endpoints llmaker logs <name -f stream logs from any container llmaker pull mistral --on chat download a model with progress llmaker stop / start / rm lifecycle management ── Consume it — the agent's API, or any OpenAI client ────────── AGENT=$ llmaker service ls --json | jq -r '. |select .service=="agent" .url' curl "$AGENT/api/ingest" -F file=@handbook.pdf add knowledge curl "$AGENT/api/chat" -d '{"question":"refund policy?"}' grounded answer + sources curl "$AGENT/api/recommend" -d '{"like": "sku1","sku2" }' semantic recommendations Everything lands on a private network where each container discovers the others by name — no Compose file and no glue code. The complete stack, curated | Models and the infrastructure around them — vector databases Qdrant, Chroma, pgvector, Weaviate , Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, Langfuse — from one versioned catalog. | Automatic service discovery | Every model and service joins a private Docker network and resolves by name. Your application reaches chat:8080 and qdrant:6333 with zero IP wiring. | A retrieval & tool agent, built in | A FastAPI + LangGraph service: rewrite → retrieve → rerank → generate multi-turn, MMR , a tool-calling loop calculator, knowledge base, self-hosted web search, SQL , and a semantic recommendation API. | Observability by default | Every instance exposes Prometheus /metrics requests, tokens/sec, CPU/RAM/GPU for scraping, and the RAG stack ships Langfuse — every query traced retrieval hits and scores, model and token usage with no setup. | Measurable quality | An evaluation harness /api/eval grades answers for groundedness, relevance, and correctness with an LLM judge — retrieval quality you can track across changes, not guess at. | More than RAG | First-class endpoints for summarization map-reduce over long docs , structured JSON extraction, and speech-to-text Whisper , plus optional Redis-backed conversation memory. | Declarative, reconcilable | Define your stack in one file. llmaker apply brings it to the desired state in dependency order; --prune removes what's no longer declared. | OpenAI-compatible | Each model exposes a stable /v1/ API chat, completions, embeddings, streaming behind one contract — Ollama runs it today, with a llama.cpp backend | Private by design | Containers bind to 127.0.0.1 by default. Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in. | Operable | A single static Go binary, a labeled-container model with no state file to drift, --json output everywhere, and a live top dashboard. | Data ownership. Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API. No assembly tax. The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked — not as a Compose file you maintain by hand. Predictable cost. Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits. Portability. The same stack.yaml runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application. | Model runners Ollama, LM Studio | DIY Docker Compose | Frameworks LangChain | llmaker | | |---|---|---|---|---| | Run local models, OpenAI-compatible | ✓ | — | — | ✓ | | Vector DB, embeddings, cache — curated | — | manual | — | ✓ | | Service discovery between containers | — | manual | n/a | ✓ | | One-command application RAG, recsys | — | — | — | ✓ | | Built-in retrieval & recommendation agent | — | — | you code it | ✓ | | Observability / tracing integrated | — | manual | manual | ✓ | | Declarative provisioning & reconciliation | — | partial | — | ✓ | Requires Docker . Run llmaker doctor afterward to validate your environment. Prebuilt binary Linux / macOS curl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh Go toolchain go install github.com/raiyanyahya/llmaker/cmd/llmaker@latest From source git clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build Homebrew and winget packages are on the roadmap. The agent image is built locally with make image-agent until it is published to a registry. Provision and run a complete retrieval-augmented generation stack: llmaker stack up assistant scaffold + apply in one step assistant needs no agent image llmaker stack init rag generate stack.yaml assistant | voice | rag | research | code | chatbot | faq | recommend | sql make image-agent build the agent image once stacks that include the agent llmaker apply -f stack.yaml provision the stack — model + services, networked llmaker ls inspect models and services in one view Resolve the agent endpoint and use it: AGENT=$ llmaker service ls --json | jq -r '. | select .service=="agent" .url' curl "$AGENT/api/ingest" -F file=@handbook.pdf ingest documents curl "$AGENT/api/chat" -d '{"question":"…","history": ,"top k":4}' query, with sources llmaker also runs individual models — the easiest way to expose a local, OpenAI-compatible endpoint: llmaker up --model llama3:8b provision a model instance python from openai import OpenAI client = OpenAI base url="http://127.0.0.1:11500/v1", api key="not-needed" client.chat.completions.create model="llama3:8b", messages= {"role": "user", "content": "Hello"} A stack is a model plus the services around it, provisioned together. Scaffold and run one in a single step with llmaker stack up <name , or generate a stack.yaml to edit with llmaker stack init <name and apply it with llmaker apply . | Template | Application | Components | |---|---|---| assistant | A private, ChatGPT-style assistant over a local model — chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI | voice | Talk to a model — speech-to-text in the browser via self-hosted Whisper. No agent image to build | LLM · Open WebUI · Whisper | rag | Document Q&A — ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres | research | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent | code | A code assistant — ingest a repo, ask grounded questions and review | code LLM · Qdrant · embeddings · agent | chatbot | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent | faq | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent | recommend | A semantic recommendation engine — "more like this", no LLM required | Qdrant · embeddings · agent | sql | Ask your database in plain English — the agent runs read-only SQL enforced and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent | The catalog's agent is a FastAPI + LangGraph service agent/ that turns a bare model and vector store into an application. It is a standard service on the network, configured by environment to discover the others by name. Retrieval as an explicit graph — rewrite → retrieve → rerank → generate : rewrite — collapses multi-turn history into a standalone query, so follow-ups that depend on context "and when was it released?" resolve correctly. The model is only invoked when there is history to resolve. retrieve — embeds the query and retrieves a candidate set from the vector store. rerank — applies Maximal Marginal Relevance https://en.wikipedia.org/wiki/Maximal marginal relevance for relevant, non-redundant context. generate — produces the answer from that context and the conversation. POST /api/ingest multipart file or text → chunk, embed, store POST /api/chat { question, history?, top k?, session id? } → answer + sources POST /api/agent { question, history?, session id? } → tool-using answer + tool calls POST /api/summarize { text, instructions?, max words? } → summary map-reduce for long text POST /api/extract { text, fields: { name: description } } → JSON with exactly those keys POST /api/transcribe multipart audio file → { text } needs a whisper service POST /api/eval { cases: { question, reference? } } → graded answers + summary POST /api/items { items: { id, text, metadata? } } → index for recommendation POST /api/recommend { query } or { like: id, … } → ranked items Tool calling. Beyond retrieval, /api/agent runs a tool-calling loop where the model decides which tools to invoke — a calculator , the knowledge base retrieval as a tool , the current time , a self-hosted web search SearXNG, no paid API , and an optional read-only SQL tool over your database — and the loop executes them until it has an answer. The response includes every tool call it made. Adding a tool is one entry in agent/app/tools.py . Tracing. The rag stack provisions Langfuse and the agent traces every query to it, with zero configuration — each request RAG or tool-using appears as a trace with its retrieval, tool, and generation steps. Tracing is enabled by the template and is otherwise opt-in via two environment variables. Evaluation. /api/eval runs a question set through the same pipeline and grades each answer — groundedness and relevance by LLM-as-judge, plus correctness against a reference and context recall against expected sources when you supply them. You get per-case scores and an aggregate summary, and every case is traced to Langfuse alongside your live traffic — so retrieval quality is measurable, not a vibe. Beyond chat. Two everyday tasks are first-class endpoints: /api/summarize condenses text map-reducing long inputs chunk by chunk so a whole report fits , and /api/extract turns text into a typed JSON object from the fields you name — parsed defensively so a chatty model never breaks the contract. With a whisper service on the network, /api/transcribe adds speech-to-text. Memory. The agent is stateless by default the client passes history . Set REDIS URL and it persists history server-side: send a session id with /api/chat or /api/agent and prior turns are loaded, prepended, and saved automatically — capped and expiring, and best-effort so Redis being down never fails a chat. llmaker stack init chatbot wires it up. Recommendations reuse the same embeddings and vector store, with no model involved: index items once, then retrieve by free-text intent query or by example like , which averages the seed items into a profile and excludes them from the results . Full agent contract and configuration: agent/README.md /raiyanyahya/llmaker/blob/master/agent/README.md . Compose a stack from the catalog directly, or let a template do it: llmaker service catalog list available services llmaker service add qdrant vector database → qdrant:6333 llmaker service add redis cache / memory → redis:6379 llmaker service add embeddings embeddings HF TEI → embeddings:80 llmaker service add searxng web search → searxng:8080 llmaker service add whisper speech-to-text → whisper:8000 llmaker service add open-webui ChatGPT-style UI → open-webui:8080 llmaker service add langfuse observability → langfuse:3000 | Category | Services | |---|---| | Vector databases | Qdrant · Chroma · pgvector Postgres · Weaviate | | Cache / memory | Redis powers per-session agent memory | | Embeddings | HuggingFace Text-Embeddings-Inference | | Search | SearXNG self-hosted metasearch | | Speech-to-text | Whisper faster-whisper, OpenAI-compatible | | Observability | Langfuse | | Web UI & apps | Open WebUI ChatGPT-style UI · n8n workflow automation · Flowise visual LLM app builder | | Agent | LangGraph retrieval & recommendation agent | Every model and service joins a private Docker network llmaker-net and is addressable there by name — service discovery without IPs, links, or a Compose file. Applications running on the host or in their own container reach the stack the same way: docker run --rm --network llmaker-net redis:7-alpine redis-cli -h redis ping → PONG Adding a service is a single entry in internal/service/catalog.go ; the CLI, fleet view, and declarative engine pick it up automatically. stack init generates one of these; it can also be authored by hand. apply reconciles the running stack to the file — provisioning services before the applications that depend on them — and --prune removes anything not declared. Give the file a top-level name: and --prune is scoped to that stack , so applying one stack never deletes another's containers scaffolded stacks are named automatically . An unnamed file prunes the whole managed fleet. stack.yaml → llmaker apply -f stack.yaml --prune defaults: { backend: ollama } instances: - { name: chat, model: llama3:8b, memory: 8g } → chat:8080 services: - use: qdrant → qdrant:6333 - { name: cache, use: redis } → cache:6379 - { name: embeddings, use: embeddings, env: { MODEL ID: BAAI/bge-small-en-v1.5 } } - use: agent → agent:8800 Unset ports are assigned automatically; a stack may be services-only. See examples/stack.yaml /raiyanyahya/llmaker/blob/master/examples/stack.yaml and . /raiyanyahya/llmaker/blob/master/examples/llm.yaml examples/llm.yaml ┌──────────────────────────────────────────────────────────────────────┐ │ llmaker CLI Go — single static binary │ │ orchestration · Docker SDK · private networking · declarative apply │ └───────────────────────────────┬──────────────────────────────────────┘ │ provision · start · stop · HTTP ▼ ════════════════ llmaker-net private network, DNS by name ════════════════ ┌── Model instance ───────────┐ ┌── Services ───────────────────────────┐ │ engine ⇄ facade FastAPI │ │ qdrant · embeddings · redis · pgvector │ │ Ollama · llama.cpp │ │ langfuse · … │ │ OpenAI /v1/ · web UI │ │ qdrant:6333 embeddings:80 │ │ chat:8080 │ └────────────────────────────────────────┘ └─────────────────────────────┘ ▲ ▲ │ └──────────────┬──────────────────┘ ┌── Agent FastAPI + LangGraph ───┐ │ rewrite → retrieve → rerank → │ agent:8800 │ generate · ingest · recommend │ └───────────────────────────────────┘ host ports 127.0.0.1:PORT mapped per container The llama.cpp backend is scaffolded but still maturing; Ollama is the verified default — see the roadmap. The control plane is a single Go binary; the data plane is containers on a private network. Orchestration logic is decoupled from Docker behind a Runtime interface, and the fleet is tracked entirely through container labels — there is no local state file to drift out of sync. Model facades and the agent are Python FastAPI , each communicating over the same HTTP contract. | Command | Description | |---|---| llmaker stack up <assistant|voice|rag|research|code|chatbot|faq|recommend|sql | Scaffold a stack and apply it in one command | llmaker stack init <template | Generate a ready-to-apply stack definition to edit | llmaker apply -f stack.yaml | Provision / reconcile a declarative stack — --prune | llmaker up preset | Provision a model instance — preset, flags, or interactive wizard | llmaker stop | start | restart | rm <name ... | Instance lifecycle — restart = stop+start, rm --force removes a running one | llmaker service catalog | List available services | llmaker service add <type name | Provision a service — --env , --port , --memory | llmaker service ls | rm | stop | start | restart | Manage services — --json | llmaker ls | List the fleet — models and services — --json , --quiet | llmaker top | Live resource dashboard across the fleet | llmaker status <name | Detailed instance status — --json | llmaker pull <model --on <name | Download a model with progress — --default | llmaker chat name | Interactive or one-shot chat — --message , stdin | llmaker open <name | Open a container's web UI — --print | llmaker logs <name -f | Stream logs from any container | llmaker doctor | Validate the environment Docker, GPU, platform caveats | | Setting | Where | Default | |---|---|---| | backend / model | --backend · --model · stack.yaml | ollama · backend default | | memory · cpus · gpu | flags · stack.yaml | host-derived | | port · host | --port · --host | auto · 127.0.0.1 | | service environment | service add --env · env: in stack.yaml | per-service defaults | API KEY · CORS ORIGINS · KEEP ALIVE | --api-key · --cors · --keep-alive | open · · 5m | Per-service and agent configuration model URLs, chunking, reranking, tracing keys is documented in agent/README.md /raiyanyahya/llmaker/blob/master/agent/README.md and . /raiyanyahya/llmaker/blob/master/facade/README.md facade/README.md Every container binds to 127.0.0.1 by default; nothing is exposed until you opt in, and exposure pairs with authentication: llmaker up --host 0.0.0.0 --api-key "$ openssl rand -hex 16 " When API KEY is set, every /v1/ and /api/ request requires a bearer token liveness probes excepted . The agent enforces its own API KEY identically. The Langfuse keys and database password in the catalog are development defaults — rotate them before exposing a stack beyond localhost. Docker on macOS cannot pass through the Apple GPU; a containerized engine runs CPU-only. llmaker doctor detects and reports this. On Linux with NVIDIA, --gpu reserves GPUs via the NVIDIA Container Toolkit. | Image | Size | Use | |---|---|---| llmaker-ollama:latest | ~8.5 GB | GPU-capable Linux + NVIDIA | llmaker-ollama:cpu | ~360 MB | CPU-only — laptops, CI, macOS | llmaker-agent:latest | ~510 MB | LangGraph agent — RAG, tools, eval, summarize/extract, transcribe | Images are resolved with a pull-if-missing policy, so locally built images make image-agent are used directly without contacting a registry. make build build ./bin/llmaker make check gofmt + vet + go test CI parity make facade-setup && make facade-test model facade pytest make agent-setup && make agent-test retrieval/recommendation agent pytest make images build backend + agent images The Go control plane is tested against an in-memory runtime no Docker required . The model facade and the agent — routes, the LangGraph pipeline, reranking, tracing, and recommendation — are tested against in-memory fakes. CI runs Go race tests, gofmt , a ruff-linted Python test matrix, and image builds on every push. cmd/llmaker/ CLI entrypoint internal/ backend/ inference engines and image references service/ the service catalog engine/ domain model, ports, labels, Runtime interface dockerrt/ Docker implementation and the private network enginetest/ in-memory Runtime for tests config/ stack.yaml parsing and dependency ordering cli/ · ui/ · tui/ Cobra commands and the terminal interface facade/ model facade FastAPI + per-model web UI agent/ retrieval & recommendation agent FastAPI + LangGraph images/ backend and agent Dockerfiles Status: alpha.Checked capabilities are implemented and covered by the test suite; the core stack is verified end-to-end against live Docker. - Model instances — OpenAI-compatible facade, per-model UI, fleet management - Service catalog — vector databases, cache, embeddings, search, observability - Private networking — automatic service discovery by name - Declarative stacks — stack init templates and reconciling apply --prune - Retrieval agent — LangGraph rewrite → retrieve → rerank → generate , multi-turn - Recommendation engine — semantic query and "more like this" - Integrated observability — Langfuse tracing - Tool-calling agent — calculator, knowledge base, time, web search, read-only SQL - Self-hosted web search — SearXNG service + a web search agent tool - Evaluation harness — /api/eval graded by LLM-as-judge, traced to Langfuse - Summarization & extraction — /api/summarize map-reduce , /api/extract typed JSON - Speech-to-text — Whisper service + /api/transcribe - Conversation memory — Redis-backed per-session history session id - More agent tooling — dedicated cross-encoder reranking; richer eval datasets - Additional backends — llama.cpp model management; Metal on macOS - Distribution — multi-architecture images, package managers, releases Contributions are welcome. Keep the suite green make check , make facade-test , make agent-test , match the surrounding style, and include tests. Adding a service is a single catalog entry; adding a model backend is a single facade adapter. Apache 2.0 /raiyanyahya/llmaker/blob/master/LICENSE © Raiyan Yahya.