
Show HN: Self hosting a modern LLM stack

Llmaker, an open-source platform for self-hosting a complete LLM stack including models, vector databases, embeddings, caching, observability, and an agent layer, launched on Hacker News. The platform provisions and manages the entire stack from a single CLI command, enabling private retrieval-augmented chatbots and recommendation engines without third-party APIs. It aims to eliminate the complexity of assembling and networking multiple containerized services for production LLM applications.

Published Jun 29, 2026 · 18 min read

llmaker is an open-source platform for running the complete modern LLM stack on your own infrastructure: large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer, all provisioned, networked, and production-shaped from a single command.

Build private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine.

Quickstart · Why llmaker · Stacks · The agent · Architecture · CLI · Roadmap

Running a model locally is easy. Shipping an application is not. A production retrieval system needs a vector database, an embeddings service, a caching layer, an orchestration layer, and observability, each containerized, networked, and configured to discover the others. Assembling that is a recurring tax: a sprawl of docker run flags, a brittle Compose file, and hundreds of lines of framework glue.

llmaker removes that tax. One CLI provisions the entire stack on a private network and operates it as a single fleet, with live status, logs, and a resource dashboard across every model and service. Stacks are declarative and reconcilable (apply --prune), models are OpenAI-compatible, and retrieval is traced out of the box. From a single model to a complete application:

llmaker stack up assistant      # one command → a private ChatGPT-style UI over a local model
llmaker stack init rag          # …or scaffold any stack to edit, then apply it:
llmaker apply                   #   assistant · voice · rag · research · code · chatbot · faq · recommend · sql

llmaker up --model llama3:8b    # a local endpoint, explicit or via a preset:
llmaker up chat                 #   chat · code · small · embed · vision
llmaker chat <name>             # test it in the terminal
llmaker open <name>             # open its built-in web UI

llmaker service catalog         # browse what's available
llmaker service add qdrant      # vector database  → qdrant:6333
llmaker service add redis       # cache / memory   → redis:6379
llmaker service add langfuse    # observability    → langfuse:3000

llmaker ls                      # every model + service, one view   (--json)
llmaker top                     # live resource dashboard (TUI)
llmaker status <name>           # gauges, loaded models, endpoints
llmaker logs <name> -f          # stream logs from any container
llmaker pull mistral --on chat  # download a model with progress
llmaker stop / start / rm       # lifecycle management

AGENT=$(llmaker service ls --json | jq -r '.[]|select(.service=="agent").url')
curl "$AGENT/api/ingest"    -F file=@handbook.pdf            # add knowledge
curl "$AGENT/api/chat"      -d '{"question":"refund policy?"}'   # grounded answer + sources
curl "$AGENT/api/recommend" -d '{"like":["sku1","sku2"]}'   # semantic recommendations

Everything lands on a private network where each container discovers the others by name, with no Compose file and no glue code.

The complete stack, curated. Models and the infrastructure around them (vector databases such as Qdrant, Chroma, pgvector, and Weaviate, plus Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, and Langfuse) from one versioned catalog.

Automatic service discovery. Every model and service joins a private Docker network and resolves by name. Your application reaches chat:8080 and qdrant:6333 with zero IP wiring.

A retrieval & tool agent, built in. A FastAPI + LangGraph service: rewrite → retrieve → rerank → generate (multi-turn, MMR), a tool-calling loop (calculator, knowledge base, self-hosted web search, SQL), and a semantic recommendation API.

Observability by default. Every instance exposes Prometheus /metrics (requests, tokens/sec, CPU/RAM/GPU) for scraping, and the RAG stack ships Langfuse, so every query is traced (retrieval hits and scores, model and token usage) with no setup.

Measurable quality. An evaluation harness (/api/eval) grades answers for groundedness, relevance, and correctness with an LLM judge: retrieval quality you can track across changes, not guess at.

More than RAG. First-class endpoints for summarization (map-reduce over long docs), structured JSON extraction, and speech-to-text (Whisper), plus optional Redis-backed conversation memory.

Declarative, reconcilable. Define your stack in one file. llmaker apply brings it to the desired state in dependency order; --prune removes what's no longer declared.

OpenAI-compatible. Each model exposes a stable /v1/* API (chat, completions, embeddings, streaming) behind one contract. Ollama runs it today, with a llama.cpp backend that is scaffolded but still maturing.

Private by design. Containers bind to 127.0.0.1 by default. Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in.

Operable. A single static Go binary, a labeled-container model with no state file to drift, --json output everywhere, and a live top dashboard.

Data ownership. Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API.

No assembly tax. The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked, not as a Compose file you maintain by hand.

Predictable cost. Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits.

Portability. The same stack.yaml runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application.

| | Model runners (Ollama, LM Studio) | DIY Docker Compose | Frameworks (LangChain) | llmaker |
|---|---|---|---|---|
| Run local models, OpenAI-compatible | ✓ | - | - | ✓ |
| Vector DB, embeddings, cache (curated) | - | manual | - | ✓ |
| Service discovery between containers | - | manual | n/a | ✓ |
| One-command application (RAG, recsys) | - | - | - | ✓ |
| Built-in retrieval & recommendation agent | - | - | you code it | ✓ |
| Observability / tracing integrated | - | manual | manual | ✓ |
| Declarative provisioning & reconciliation | - | partial | - | ✓ |

Requires Docker. Run llmaker doctor afterward to validate your environment.

curl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh

go install github.com/raiyanyahya/llmaker/cmd/llmaker@latest

git clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build

Homebrew and winget packages are on the roadmap. The agent image is built locally with make image-agent until it is published to a registry.

Provision and run a complete retrieval-augmented generation stack:

llmaker stack up assistant    # scaffold + apply in one step (assistant needs no agent image)
llmaker stack init rag        # generate stack.yaml (assistant | voice | rag | research | code | chatbot | faq | recommend | sql)
make image-agent              # build the agent image once (stacks that include the agent)
llmaker apply -f stack.yaml   # provision the stack: model + services, networked
llmaker ls                    # inspect models and services in one view

Resolve the agent endpoint and use it:

AGENT=$(llmaker service ls --json | jq -r '.[] | select(.service=="agent").url')

curl "$AGENT/api/ingest" -F file=@handbook.pdf                     # ingest documents
curl "$AGENT/api/chat"   -d '{"question":"…","history":[],"top_k":4}'   # query, with sources

llmaker also runs individual models, which is the easiest way to expose a local, OpenAI-compatible endpoint:

llmaker up --model llama3:8b          # provision a model instance

# Python: call the instance through the OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11500/v1", api_key="not-needed")
response = client.chat.completions.create(model="llama3:8b",
    messages=[{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)  # the model's reply

A stack is a model plus the services around it, provisioned together. Scaffold and run one in a single step with llmaker stack up <name>, or generate a stack.yaml to edit with llmaker stack init <name> and apply it with llmaker apply.

| Template | Application | Components |
|---|---|---|
| assistant | A private, ChatGPT-style assistant over a local model: chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI |
| voice | Talk to a model, with speech-to-text in the browser via self-hosted Whisper. No agent image to build | LLM · Open WebUI · Whisper |
| rag | Document Q&A: ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres |
| research | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent |
| code | A code assistant: ingest a repo, ask grounded questions and review code | LLM · Qdrant · embeddings · agent |
| chatbot | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent |
| faq | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent |
| recommend | A semantic recommendation engine ("more like this"), no LLM required | Qdrant · embeddings · agent |
| sql | Ask your database in plain English: the agent runs read-only SQL (enforced) and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent |

The catalog's agent is a FastAPI + LangGraph service (agent/) that turns a bare model and vector store into an application. It is a standard service on the network, configured by environment to discover the others by name.

Retrieval as an explicit graph (rewrite → retrieve → rerank → generate):

  • rewrite: collapses multi-turn history into a standalone query, so follow-ups that depend on context ("and when was it released?") resolve correctly. The model is only invoked when there is history to resolve.
  • retrieve: embeds the query and retrieves a candidate set from the vector store.
  • rerank: applies Maximal Marginal Relevance for relevant, non-redundant context.
  • generate: produces the answer from that context and the conversation.
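To make the rerank step concrete, here is a minimal sketch of Maximal Marginal Relevance in plain Python. It is not the agent's actual code: the function name `mmr_rerank`, the `lambda_mult` weight, and the toy two-dimensional vectors are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_rerank(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Pick up to k document indices, trading relevance against redundancy.

    lambda_mult=1.0 is pure relevance; lower values favour diversity.
    """
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: docs 0 and 1 are near-duplicates, doc 2 covers a different angle.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.5, 1.0]]
print(mmr_rerank(query, docs, k=2))
```

With these vectors, pure relevance would keep both near-duplicates, while the balanced setting swaps the second one for the more distinct document.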

POST /api/ingest      multipart file or text  →  chunk, embed, store
POST /api/chat        { question, history?, top_k?, session_id? }  →  answer + sources
POST /api/agent       { question, history?, session_id? }  →  tool-using answer + tool calls
POST /api/summarize   { text, instructions?, max_words? }  →  summary (map-reduce for long text)
POST /api/extract     { text, fields: { name: description } }  →  JSON with exactly those keys
POST /api/transcribe  multipart audio file  →  { text }   (needs a whisper service)
POST /api/eval        { cases: [{ question, reference? }] }  →  graded answers + summary
POST /api/items       { items: [{ id, text, metadata? }] }  →  index for recommendation
POST /api/recommend   { query }  or  { like: [id, …] }  →  ranked items
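As a rough illustration of the request shapes listed above, the sketch below builds a /api/chat body and posts JSON with the standard library only. The helper names, the JSON content type, and the example base URL are assumptions; resolve the real agent URL with llmaker service ls --json as shown earlier, and check agent/README.md for the exact response fields.

```python
import json
import urllib.request

def build_chat_payload(question, history=None, top_k=None, session_id=None):
    """Build a /api/chat body, sending only the optional fields that are set."""
    payload = {"question": question}
    if history is not None:
        payload["history"] = history
    if top_k is not None:
        payload["top_k"] = top_k
    if session_id is not None:
        payload["session_id"] = session_id
    return payload

def post_json(base_url, path, payload, timeout=60):
    """POST a JSON body and decode the JSON response (hypothetical helper)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

payload = build_chat_payload("refund policy?", history=[], top_k=4)
print(json.dumps(payload))

# Against a running agent (the URL here is an assumed example):
# reply = post_json("http://127.0.0.1:8800", "/api/chat", payload)
```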

Tool calling. Beyond retrieval, /api/agent runs a tool-calling loop where the model decides which tools to invoke (a calculator, the knowledge base as a retrieval tool, the current time, a self-hosted web search via SearXNG with no paid API, and an optional read-only SQL tool over your database), and the loop executes them until it has an answer. The response includes every tool call it made. Adding a tool is one entry in agent/app/tools.py.
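The shape of such a loop can be sketched in a few lines. This is an illustration under stated assumptions, not the code in agent/app/tools.py: the message format, the `run_tool_loop` function, and the scripted stand-in model are all hypothetical, and a real agent would call an LLM where `scripted_model` sits.

```python
def calculator(args):
    """Toy calculator tool supporting two named operations."""
    operations = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return str(operations[args["op"]](args["a"], args["b"]))

TOOLS = {"calculator": calculator}

def run_tool_loop(model, question, tools, max_steps=5):
    """Ask the model what to do, run the tool it names, repeat until it answers."""
    messages = [{"role": "user", "content": question}]
    tool_calls = []
    for _ in range(max_steps):
        decision = model(messages)
        if "answer" in decision:
            return {"answer": decision["answer"], "tool_calls": tool_calls}
        name, args = decision["tool"], decision.get("args", {})
        result = tools[name](args)
        # Record the call so the response can report every tool invocation.
        tool_calls.append({"tool": name, "args": args, "result": result})
        messages.append({"role": "tool", "name": name, "content": result})
    # Step budget exhausted without a final answer.
    return {"answer": None, "tool_calls": tool_calls}

def scripted_model(messages):
    """Stand-in for an LLM: call the calculator once, then answer."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"answer": f"2 + 3 = {last['content']}"}
    return {"tool": "calculator", "args": {"op": "add", "a": 2, "b": 3}}

result = run_tool_loop(scripted_model, "What is 2 + 3?", TOOLS)
print(result["answer"])
print(result["tool_calls"])
```

The step cap is a design choice in this sketch: it keeps a model that never settles on an answer from looping forever.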

Tracing. The rag stack provisions Langfuse and the agent traces every query to it, with zero configuration: each request (RAG or tool-using) appears as a trace with its retrieval, tool, and generation steps. Tracing is enabled by the template and is otherwise opt-in via two environment variables.

Evaluation. /api/eval runs a question set through the same pipeline and grades each answer: groundedness and relevance by LLM-as-judge, plus correctness against a reference and context recall against expected sources when you supply them. You get per-case scores and an aggregate summary, and every case is traced to Langfuse alongside your live traffic, so retrieval quality is measurable, not a vibe.

Beyond chat. Two everyday tasks are first-class endpoints: /api/summarize condenses text (map-reducing long inputs chunk by chunk so a whole report fits), and /api/extract turns text into a typed JSON object from the fields you name, parsed defensively so a chatty model never breaks the contract. With a whisper service on the network, /api/transcribe adds speech-to-text.
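One way to read "parsed defensively" is sketched below: locate the JSON object inside whatever the model wrote, then return exactly the requested keys. The function name `parse_extraction` and the choice to fill missing keys with None are assumptions for illustration; the agent's real parser may differ.

```python
import json

def parse_extraction(raw_output, fields):
    """Return a dict with exactly the requested keys, whatever the model wrote.

    Surrounding chatter is ignored, unexpected keys are dropped, and keys
    that are missing (or output that cannot be parsed) become None.
    """
    parsed = {}
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start != -1 and end > start:
        try:
            candidate = json.loads(raw_output[start:end + 1])
            if isinstance(candidate, dict):
                parsed = candidate
        except json.JSONDecodeError:
            parsed = {}
    return {name: parsed.get(name) for name in fields}

chatty = 'Sure! Here is the JSON: {"name": "Ada", "age": 36, "extra": true} Hope that helps.'
print(parse_extraction(chatty, ["name", "age", "email"]))
print(parse_extraction("I could not find anything.", ["name"]))
```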

Memory. The agent is stateless by default (the client passes history). Set REDIS_URL and it persists history server-side: send a session_id with /api/chat or /api/agent and prior turns are loaded, prepended, and saved automatically. History is capped and expiring, and best-effort, so Redis being down never fails a chat. llmaker stack init chatbot wires it up.
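The capped, expiring, best-effort behaviour can be sketched as follows. This is a hypothetical illustration, not the agent's implementation: the key format, limits, and helper names are assumptions. The store methods used (`rpush`, `ltrim`, `expire`, `lrange`) mirror the redis-py client, and a small in-memory stand-in keeps the example runnable without Redis.

```python
import json

class FakeStore:
    """In-memory stand-in exposing the few list commands the sketch needs."""
    def __init__(self):
        self.lists, self.ttl = {}, {}
    def _slice(self, items, start, end):
        n = len(items)
        s = start if start >= 0 else max(n + start, 0)
        e = end if end >= 0 else n + end
        return items[s:e + 1]
    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
    def ltrim(self, key, start, end):
        self.lists[key] = self._slice(self.lists.get(key, []), start, end)
    def lrange(self, key, start, end):
        return self._slice(self.lists.get(key, []), start, end)
    def expire(self, key, seconds):
        self.ttl[key] = seconds

class BrokenStore:
    """Simulates Redis being unreachable."""
    def __getattr__(self, name):
        def fail(*args, **kwargs):
            raise ConnectionError("store unavailable")
        return fail

def save_turn(store, session_id, role, content, max_messages=20, ttl_seconds=3600):
    """Append one message, keep only the newest max_messages, refresh the TTL."""
    key = f"history:{session_id}"  # hypothetical key format
    try:
        store.rpush(key, json.dumps({"role": role, "content": content}))
        store.ltrim(key, -max_messages, -1)
        store.expire(key, ttl_seconds)
    except Exception:
        pass  # best-effort: memory failures must not fail the chat

def load_history(store, session_id):
    """Return prior messages, or an empty history if the store is down."""
    try:
        return [json.loads(item) for item in store.lrange(f"history:{session_id}", 0, -1)]
    except Exception:
        return []

store = FakeStore()
for text in ["first", "second", "third"]:
    save_turn(store, "s1", "user", text, max_messages=2)
print(load_history(store, "s1"))
print(load_history(BrokenStore(), "s1"))
```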

Recommendations reuse the same embeddings and vector store, with no model involved: index items once, then retrieve by free-text intent (query) or by example (like, which averages the seed items into a profile and excludes them from the results).
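The "by example" mode described above can be sketched in plain Python: average the seed vectors into a profile, score every other item against it, and leave the seeds out. The function name, the in-memory `items` dict, and the toy two-dimensional embeddings are assumptions; the real service would query the vector store rather than scan a dict.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_like(items, like_ids, k=3):
    """Rank items by similarity to the averaged seed profile, excluding seeds."""
    seeds = [items[item_id] for item_id in like_ids]
    dimensions = len(seeds[0])
    profile = [sum(vec[d] for vec in seeds) / len(seeds) for d in range(dimensions)]
    scored = [
        (cosine(profile, vec), item_id)
        for item_id, vec in items.items()
        if item_id not in like_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Toy embeddings: sku3 sits between the two seeds, sku4 points elsewhere.
items = {
    "sku1": [1.0, 0.0],
    "sku2": [0.8, 0.2],
    "sku3": [0.9, 0.1],
    "sku4": [0.0, 1.0],
}
print(recommend_like(items, ["sku1", "sku2"]))
```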

Full agent contract and configuration: agent/README.md.

Compose a stack from the catalog directly, or let a template do it:

llmaker service catalog          # list available services
llmaker service add qdrant       # vector database     → qdrant:6333
llmaker service add redis        # cache / memory      → redis:6379
llmaker service add embeddings   # embeddings (HF TEI) → embeddings:80
llmaker service add searxng      # web search          → searxng:8080
llmaker service add whisper      # speech-to-text      → whisper:8000
llmaker service add open-webui   # ChatGPT-style UI    → open-webui:8080
llmaker service add langfuse     # observability       → langfuse:3000

| Category | Services |
|---|---|
| Vector databases | Qdrant · Chroma · pgvector (Postgres) · Weaviate |
| Cache / memory | Redis (powers per-session agent memory) |
| Embeddings | HuggingFace Text-Embeddings-Inference |
| Search | SearXNG (self-hosted metasearch) |
| Speech-to-text | Whisper (faster-whisper, OpenAI-compatible) |
| Observability | Langfuse |
| Web UI & apps | Open WebUI (ChatGPT-style UI) · n8n (workflow automation) · Flowise (visual LLM app builder) |
| Agent | LangGraph retrieval & recommendation agent |

Every model and service joins a private Docker network (llmaker-net) and is addressable there by name: service discovery without IPs, links, or a Compose file. Applications running on the host or in their own container reach the stack the same way:

docker run --rm --network llmaker-net redis:7-alpine redis-cli -h redis ping   # → PONG

Adding a service is a single entry in internal/service/catalog.go; the CLI, fleet view, and declarative engine pick it up automatically.

stack init generates one of these; it can also be authored by hand. apply reconciles the running stack to the file, provisioning services before the applications that depend on them, and --prune removes anything not declared. Give the file a top-level name: and --prune is scoped to that stack, so applying one stack never deletes another's containers (scaffolded stacks are named automatically). An unnamed file prunes the whole managed fleet.
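The scoping rule for --prune can be illustrated with a small planning function. This is a sketch of the described behaviour, not llmaker's Go implementation: the `plan_apply` name and the dict-based view of running containers (with a hypothetical `stack` label) are assumptions.

```python
def plan_apply(declared_names, running, stack_name=None, prune=False):
    """Work out which containers to create and which to remove.

    running is a list of {"name": ..., "stack": ...} dicts standing in for
    labeled containers. With a stack_name, pruning only considers containers
    labeled with that stack; without one, it considers the whole fleet.
    """
    running_names = {container["name"] for container in running}
    to_create = [name for name in declared_names if name not in running_names]
    to_remove = []
    if prune:
        for container in running:
            in_scope = stack_name is None or container.get("stack") == stack_name
            if in_scope and container["name"] not in declared_names:
                to_remove.append(container["name"])
    return to_create, to_remove

running = [
    {"name": "chat", "stack": "rag"},
    {"name": "old-cache", "stack": "rag"},
    {"name": "ui", "stack": "assistant"},
]
declared = ["chat", "qdrant"]

print(plan_apply(declared, running, stack_name="rag", prune=True))
print(plan_apply(declared, running, stack_name=None, prune=True))
```

In the named case only the stale container from the same stack is removed; in the unnamed case the other stack's container is pruned as well, which is why naming the file matters.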

defaults: { backend: ollama }
instances:
  - { name: chat, model: llama3:8b, memory: 8g }   # → chat:8080
services:
  - use: qdrant                                    # → qdrant:6333
  - { name: cache, use: redis }                    # → cache:6379
  - { name: embeddings, use: embeddings, env: { MODEL_ID: BAAI/bge-small-en-v1.5 } }
  - use: agent                                     # → agent:8800
Unset ports are assigned automatically; a stack may be services-only. See examples/stack.yaml and examples/llm.yaml.
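Dependency-ordered provisioning amounts to a topological sort. The sketch below uses Python's standard graphlib on the names from the example stack; the dependency edges themselves are an assumption made for illustration (the stack file shown here does not spell them out), and llmaker's own ordering logic lives in its Go config package.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of components it is assumed to depend on.
depends_on = {
    "chat": set(),
    "qdrant": set(),
    "cache": set(),
    "embeddings": set(),
    "agent": {"chat", "qdrant", "cache", "embeddings"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Whatever order the independent services come out in, the agent is always provisioned last because everything it talks to must exist first.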

┌───────────────────────────────────────────────────────────────────────┐
│  llmaker CLI   (Go, single static binary)                             │
│  orchestration · Docker SDK · private networking · declarative apply  │
└───────────────────────────────┬───────────────────────────────────────┘
                                │  provision · start · stop · HTTP
                                ▼
   ════════════════ llmaker-net  (private network, DNS by name) ════════════════
    ┌── Model instance ───────────┐   ┌── Services ────────────────────────────┐
    │ engine ⇄ facade (FastAPI)   │   │ qdrant · embeddings · redis · pgvector │
    │ Ollama · llama.cpp*         │   │ langfuse · …                           │
    │ OpenAI /v1/* · web UI       │   │ qdrant:6333   embeddings:80            │
    │ chat:8080                   │   └────────────────────────────────────────┘
    └─────────────────────────────┘                   ▲
                    ▲                                 │
                    └──────────────┬──────────────────┘
                    ┌── Agent (FastAPI + LangGraph) ───┐
                    │ rewrite → retrieve → rerank →    │   agent:8800
                    │ generate · ingest · recommend    │
                    └──────────────────────────────────┘
              host ports (127.0.0.1:PORT) mapped per container
  * The llama.cpp backend is scaffolded but still maturing; Ollama is the verified default (see the roadmap).

The control plane is a single Go binary; the data plane is containers on a private network. Orchestration logic is decoupled from Docker behind a Runtime interface, and the fleet is tracked entirely through container labels, so there is no local state file to drift out of sync. Model facades and the agent are Python (FastAPI), each communicating over the same HTTP contract.

| Command | Description |
|---|---|
| llmaker stack up <name> | Scaffold a stack and apply it in one command |
| llmaker stack init <template> | Generate a ready-to-apply stack definition to edit |
| llmaker apply -f stack.yaml | Provision / reconcile a declarative stack (--prune) |
| llmaker up [preset] | Provision a model instance: preset, flags, or interactive wizard |
| llmaker stop / start / restart / rm | Instance lifecycle: restart = stop+start, rm --force removes a running one |
| llmaker service catalog | List available services |
| llmaker service add <type> [name] | Provision a service (--env, --port, --memory) |
| llmaker service ls / rm | Manage services (--json) |
| llmaker ls | List the fleet, models and services (--json, --quiet) |
| llmaker top | Live resource dashboard across the fleet |
| llmaker status <name> | Detailed instance status (--json) |
| llmaker pull <model> --on <name> | Download a model with progress (--default) |
| llmaker chat [name] | Interactive or one-shot chat (--message, stdin) |
| llmaker open <name> | Open a container's web UI (--print) |
| llmaker logs <name> -f | Stream logs from any container |
| llmaker doctor | Validate the environment (Docker, GPU, platform caveats) |

| Setting | Where | Default |
|---|---|---|
| backend / model | --backend · --model · stack.yaml | ollama · backend default |
| memory · cpus · gpu | flags · stack.yaml | host-derived |
| port · host | --port · --host | auto · 127.0.0.1 |
| service environment | service add --env · env: in stack.yaml | per-service defaults |
| API_KEY · CORS_ORIGINS · KEEP_ALIVE | --api-key · --cors · --keep-alive | open · * · 5m |

Per-service and agent configuration (model URLs, chunking, reranking, tracing keys) is documented in agent/README.md and facade/README.md.

Every container binds to 127.0.0.1 by default; nothing is exposed until you opt in, and exposure pairs with authentication:

llmaker up --host 0.0.0.0 --api-key "$(openssl rand -hex 16)"

When API_KEY is set, every /v1/* and /api/* request requires a bearer token (liveness probes excepted). The agent enforces its own API_KEY identically. The Langfuse keys and database password in the catalog are development defaults; rotate them before exposing a stack beyond localhost.
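On the client side, a bearer token is normally sent in the Authorization header. The sketch below generates a key of the same shape as the openssl command above and builds the headers; the helper names are hypothetical, and the exact header the facade expects should be confirmed in facade/README.md.

```python
import secrets

def generate_api_key(num_bytes=16):
    """Random hex key; 16 bytes gives 32 hex characters, like `openssl rand -hex 16`."""
    return secrets.token_hex(num_bytes)

def auth_headers(api_key):
    """Headers for an authenticated JSON request, assuming the usual Bearer scheme."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

key = generate_api_key()
print(len(key))
print(auth_headers("example-key")["Authorization"])
```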

Docker on macOS cannot pass through the Apple GPU; a containerized engine runs CPU-only. llmaker doctor detects and reports this. On Linux with NVIDIA, --gpu reserves GPUs via the NVIDIA Container Toolkit.

| Image | Size | Use |
|---|---|---|
| llmaker-ollama:latest | ~8.5 GB | GPU-capable (Linux + NVIDIA) |
| llmaker-ollama:cpu | ~360 MB | CPU-only: laptops, CI, macOS |
| llmaker-agent:latest | ~510 MB | LangGraph agent: RAG, tools, eval, summarize/extract, transcribe |

Images are resolved with a pull-if-missing policy, so locally built images (make image-agent) are used directly without contacting a registry.

make build        # build ./bin/llmaker
make check        # gofmt + vet + go test (CI parity)

make facade-setup && make facade-test     # model facade (pytest)
make agent-setup  && make agent-test      # retrieval/recommendation agent (pytest)

make images       # build backend + agent images

The Go control plane is tested against an in-memory runtime (no Docker required). The model facade and the agent (routes, the LangGraph pipeline, reranking, tracing, and recommendation) are tested against in-memory fakes. CI runs Go race tests, gofmt, a ruff-linted Python test matrix, and image builds on every push.

cmd/llmaker/            CLI entrypoint
internal/
  backend/              inference engines and image references
  service/              the service catalog
  engine/               domain model, ports, labels, Runtime interface
    dockerrt/           Docker implementation and the private network
    enginetest/         in-memory Runtime for tests
  config/               stack.yaml parsing and dependency ordering
  cli/ · ui/ · tui/     Cobra commands and the terminal interface
facade/                 model facade (FastAPI) + per-model web UI
agent/                  retrieval & recommendation agent (FastAPI + LangGraph)
images/                 backend and agent Dockerfiles

Status: alpha. Checked capabilities are implemented and covered by the test suite; the core stack is verified end-to-end against live Docker.

  • Model instances: OpenAI-compatible facade, per-model UI, fleet management
  • Service catalog: vector databases, cache, embeddings, search, observability
  • Private networking: automatic service discovery by name
  • Declarative stacks: stack init templates and reconciling apply --prune
  • Retrieval agent: LangGraph rewrite → retrieve → rerank → generate, multi-turn
  • Recommendation engine: semantic query and "more like this"
  • Integrated observability: Langfuse tracing
  • Tool-calling agent: calculator, knowledge base, time, web search, read-only SQL
  • Self-hosted web search: SearXNG service + a web_search agent tool
  • Evaluation harness: /api/eval graded by LLM-as-judge, traced to Langfuse
  • Summarization & extraction: /api/summarize (map-reduce), /api/extract (typed JSON)
  • Speech-to-text: Whisper service + /api/transcribe
  • Conversation memory: Redis-backed per-session history (session_id)
  • More agent tooling: dedicated cross-encoder reranking; richer eval datasets
  • Additional backends: llama.cpp model management; Metal on macOS
  • Distribution: multi-architecture images, package managers, releases

Contributions are welcome. Keep the suite green (make check, make facade-test, make agent-test), match the surrounding style, and include tests. Adding a service is a single catalog entry; adding a model backend is a single facade adapter.

Apache 2.0 © Raiyan Yahya.
