{"slug": "show-hn-self-hosting-a-modern-llm-stack", "title": "Show HN: Self hosting a modern LLM stack", "summary": "Llmaker, an open-source platform for self-hosting a complete LLM stack including models, vector databases, embeddings, caching, observability, and an agent layer, launched on Hacker News. The platform provisions and manages the entire stack from a single CLI command, enabling private retrieval-augmented chatbots and recommendation engines without third-party APIs. It aims to eliminate the complexity of assembling and networking multiple containerized services for production LLM applications.", "body_md": "**llmaker is an open-source platform for running the complete modern LLM stack on\nyour own infrastructure** — large language models, vector databases, embeddings,\ncaching, observability, and a built-in retrieval & agent layer — provisioned,\nnetworked, and production-shaped from a single command.\n\nBuild private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine.\n\n[Quickstart](#quickstart) · [Why llmaker](#why-self-host-your-llm-stack) · [Stacks](#stacks) · [The agent](#the-agent) · [Architecture](#architecture) · [CLI](#cli-reference) · [Roadmap](#roadmap)\n\nRunning a model locally is easy. Shipping an *application* is not. A production\nretrieval system needs a vector database, an embeddings service, a caching layer,\nan orchestration layer, and observability — each containerized, networked, and\nconfigured to discover the others. Assembling that is a recurring tax: a sprawl of\n`docker run`\n\nflags, a brittle Compose file, and hundreds of lines of framework glue.\n\nllmaker removes that tax. One CLI provisions the entire stack on a private network\nand **operates it as a single fleet** — live status, logs, and a resource\ndashboard across every model and service. 
Stacks are **declarative and\nreconcilable** (`apply --prune`\n\n), models are **OpenAI-compatible**, and retrieval\nis **traced out of the box**. From a single model to a complete application:\n\n```\n# ── Build a complete application stack ──────────────────────────\nllmaker stack up assistant      # one command → a private ChatGPT-style UI over a local model\nllmaker stack init rag          # …or scaffold any stack to edit, then apply it:\nllmaker apply                   #   assistant · voice · rag · research · code · chatbot · faq · recommend · sql\n\n# ── …or run a single model (OpenAI-compatible) ──────────────────\nllmaker up --model llama3:8b    # a local endpoint — explicit, or a preset:\nllmaker up chat                 #   chat · code · small · embed · vision\nllmaker chat <name>             # test it in the terminal\nllmaker open <name>             # open its built-in web UI\n\n# ── …or compose the stack à la carte, service by service ────────\nllmaker service catalog         # browse what's available\nllmaker service add qdrant      # vector database  → qdrant:6333\nllmaker service add redis       # cache / memory   → redis:6379\nllmaker service add langfuse    # observability    → langfuse:3000\n\n# ── Operate the fleet ───────────────────────────────────────────\nllmaker ls                      # every model + service, one view   (--json)\nllmaker top                     # live resource dashboard (TUI)\nllmaker status <name>           # gauges, loaded models, endpoints\nllmaker logs <name> -f          # stream logs from any container\nllmaker pull mistral --on chat  # download a model with progress\nllmaker stop / start / rm       # lifecycle management\n\n# ── Consume it — the agent's API, or any OpenAI client ──────────\nAGENT=$(llmaker service ls --json | jq -r '.[]|select(.service==\"agent\").url')\ncurl \"$AGENT/api/ingest\"    -F file=@handbook.pdf            # add knowledge\ncurl \"$AGENT/api/chat\"      -d '{\"question\":\"refund policy?\"}'   
# grounded answer + sources\ncurl \"$AGENT/api/recommend\" -d '{\"like\":[\"sku1\",\"sku2\"]}'   # semantic recommendations\n```\n\nEverything lands on a private network where each container discovers the others by name — no Compose file and no glue code.\n\nThe complete stack, curated |\nModels and the infrastructure around them — vector databases (Qdrant, Chroma, pgvector, Weaviate), Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, Langfuse — from one versioned catalog. |\nAutomatic service discovery |\nEvery model and service joins a private Docker network and resolves by name. Your application reaches `chat:8080` and `qdrant:6333` with zero IP wiring. |\nA retrieval & tool agent, built in |\nA FastAPI + LangGraph service: `rewrite → retrieve → rerank → generate` (multi-turn, MMR), a tool-calling loop (calculator, knowledge base, self-hosted web search, SQL), and a semantic recommendation API. |\nObservability by default |\nEvery instance exposes Prometheus `/metrics` (requests, tokens/sec, CPU/RAM/GPU) for scraping, and the RAG stack ships Langfuse — every query traced (retrieval hits and scores, model and token usage) with no setup. |\nMeasurable quality |\nAn evaluation harness (`/api/eval` ) grades answers for groundedness, relevance, and correctness with an LLM judge — retrieval quality you can track across changes, not guess at. |\nMore than RAG |\nFirst-class endpoints for summarization (map-reduce over long docs), structured JSON extraction, and speech-to-text (Whisper), plus optional Redis-backed conversation memory. |\nDeclarative, reconcilable |\nDefine your stack in one file. `llmaker apply` brings it to the desired state in dependency order; `--prune` removes what's no longer declared. |\nOpenAI-compatible |\nEach model exposes a stable `/v1/*` API (chat, completions, embeddings, streaming) behind one contract — Ollama runs it today, with a llama.cpp backend\n|\nPrivate by design |\nContainers bind to `127.0.0.1` by default. 
Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in. |\nOperable |\nA single static Go binary, a labeled-container model with no state file to drift, `--json` output everywhere, and a live `top` dashboard. |\n\n- **Data ownership.** Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API.\n- **No assembly tax.** The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked — not as a Compose file you maintain by hand.\n- **Predictable cost.** Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits.\n- **Portability.** The same `stack.yaml` runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application.\n\n| | Model runners (Ollama, LM Studio) | DIY Docker Compose | Frameworks (LangChain) | llmaker |\n|---|---|---|---|---|\n| Run local models, OpenAI-compatible | ✓ | — | — | ✓ |\n| Vector DB, embeddings, cache — curated | — | manual | — | ✓ |\n| Service discovery between containers | — | manual | n/a | ✓ |\n| One-command application (RAG, recsys) | — | — | — | ✓ |\n| Built-in retrieval & recommendation agent | — | — | you code it | ✓ |\n| Observability / tracing integrated | — | manual | manual | ✓ |\n| Declarative provisioning & reconciliation | — | partial | — | ✓ |\n\nRequires Docker. Run `llmaker doctor` afterward to validate your environment.\n\n```\n# Prebuilt binary (Linux / macOS)\ncurl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh\n\n# Go toolchain\ngo install github.com/raiyanyahya/llmaker/cmd/llmaker@latest\n\n# From source\ngit clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build\n```\n\nHomebrew and winget packages are on the roadmap. 
The agent image is built locally with `make image-agent` until it is published to a registry.\n\nProvision and run a complete retrieval-augmented generation stack:\n\n```\nllmaker stack up assistant    # scaffold + apply in one step (assistant needs no agent image)\nllmaker stack init rag        # generate stack.yaml (assistant | voice | rag | research | code | chatbot | faq | recommend | sql)\nmake image-agent              # build the agent image once (stacks that include the agent)\nllmaker apply -f stack.yaml   # provision the stack — model + services, networked\nllmaker ls                    # inspect models and services in one view\n```\n\nResolve the agent endpoint and use it:\n\n```\nAGENT=$(llmaker service ls --json | jq -r '.[] | select(.service==\"agent\").url')\n\ncurl \"$AGENT/api/ingest\" -F file=@handbook.pdf                     # ingest documents\ncurl \"$AGENT/api/chat\"   -d '{\"question\":\"…\",\"history\":[],\"top_k\":4}'   # query, with sources\n```\n\nllmaker also runs individual models — the easiest way to expose a local, OpenAI-compatible endpoint:\n\n```\nllmaker up --model llama3:8b          # provision a model instance\n```\n\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http://127.0.0.1:11500/v1\", api_key=\"not-needed\")\nclient.chat.completions.create(model=\"llama3:8b\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello\"}])\n```\n\nA stack is a model plus the services around it, provisioned together. Scaffold\nand run one in a single step with `llmaker stack up <name>`, or generate a\n`stack.yaml` to edit with `llmaker stack init <name>` and apply it with\n`llmaker apply`.\n\n| Template | Application | Components |\n|---|---|---|\n| `assistant` | A private, ChatGPT-style assistant over a local model — chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI |\n| `voice` | Talk to a model — speech-to-text in the browser via self-hosted Whisper. 
No agent image to build | LLM · Open WebUI · Whisper |\n| `rag` | Document Q&A — ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres |\n| `research` | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent |\n| `code` | A code assistant — ingest a repo, ask grounded questions and review | code LLM · Qdrant · embeddings · agent |\n| `chatbot` | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent |\n| `faq` | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent |\n| `recommend` | A semantic recommendation engine — \"more like this\", no LLM required | Qdrant · embeddings · agent |\n| `sql` | Ask your database in plain English — the agent runs read-only SQL (enforced) and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent |\n\nThe catalog's `agent` is a FastAPI + LangGraph service (`agent/`) that turns a\nbare model and vector store into an application. It is a standard service on the\nnetwork, configured by environment to discover the others by name.\n\n**Retrieval as an explicit graph** — `rewrite → retrieve → rerank → generate`:\n\n- **rewrite** — collapses multi-turn history into a standalone query, so follow-ups that depend on context (\"and when was *it* released?\") resolve correctly. The model is only invoked when there is history to resolve.\n- **retrieve** — embeds the query and retrieves a candidate set from the vector store.\n- **rerank** — applies [Maximal Marginal Relevance](https://en.wikipedia.org/wiki/Maximal_marginal_relevance) for relevant, non-redundant context.\n- **generate** — produces the answer from that context and the conversation.\n\n```\nPOST /api/ingest      multipart file or text  →  chunk, embed, store\nPOST /api/chat        { question, history?, top_k?, session_id? 
}  →  answer + sources\nPOST /api/agent       { question, history?, session_id? }  →  tool-using answer + tool calls\nPOST /api/summarize   { text, instructions?, max_words? }  →  summary (map-reduce for long text)\nPOST /api/extract     { text, fields: { name: description } }  →  JSON with exactly those keys\nPOST /api/transcribe  multipart audio file  →  { text }   (needs a whisper service)\nPOST /api/eval        { cases: [{ question, reference? }] }  →  graded answers + summary\nPOST /api/items       { items: [{ id, text, metadata? }] }  →  index for recommendation\nPOST /api/recommend   { query }  or  { like: [id, …] }  →  ranked items\n```\n\n**Tool calling.** Beyond retrieval, `/api/agent` runs a tool-calling loop where\nthe model decides which tools to invoke — a **calculator**, the **knowledge base**\n(retrieval as a tool), the **current time**, a self-hosted **web search**\n(SearXNG, no paid API), and an optional read-only **SQL** tool over your\ndatabase — and the loop executes them until it has an answer. The response\nincludes every tool call it made. Adding a tool is one entry in\n`agent/app/tools.py`.\n\n**Tracing.** The `rag` stack provisions Langfuse and the agent traces every query\nto it, with zero configuration — each request (RAG or tool-using) appears as a\ntrace with its retrieval, tool, and generation steps. Tracing is enabled by the\ntemplate and is otherwise opt-in via two environment variables.\n\n**Evaluation.** `/api/eval` runs a question set through the same pipeline and\ngrades each answer — *groundedness* and *relevance* by LLM-as-judge, plus\n*correctness* against a reference and *context recall* against expected sources\nwhen you supply them. 
You get per-case scores and an aggregate summary, and every\ncase is traced to Langfuse alongside your live traffic — so retrieval quality is\nmeasurable, not a vibe.\n\n**Beyond chat.** Two everyday tasks are first-class endpoints: `/api/summarize`\ncondenses text (map-reducing long inputs chunk by chunk so a whole report fits),\nand `/api/extract` turns text into a typed JSON object from the fields you name —\nparsed defensively so a chatty model never breaks the contract. With a `whisper`\nservice on the network, `/api/transcribe` adds speech-to-text.\n\n**Memory.** The agent is stateless by default (the client passes `history`). Set\n`REDIS_URL` and it persists history server-side: send a `session_id` with\n`/api/chat` or `/api/agent` and prior turns are loaded, prepended, and saved\nautomatically — capped and expiring, and best-effort so Redis being down never\nfails a chat. `llmaker stack init chatbot` wires it up.\n\n**Recommendations** reuse the same embeddings and vector store, with no model\ninvolved: index items once, then retrieve by free-text intent (`query`) or by\nexample (`like`, which averages the seed items into a profile and excludes them\nfrom the results).\n\nFull agent contract and configuration: [agent/README.md](/raiyanyahya/llmaker/blob/master/agent/README.md).\n\nCompose a stack from the catalog directly, or let a template do it:\n\n```\nllmaker service catalog          # list available services\nllmaker service add qdrant       # vector database     → qdrant:6333\nllmaker service add redis        # cache / memory      → redis:6379\nllmaker service add embeddings   # embeddings (HF TEI) → embeddings:80\nllmaker service add searxng      # web search          → searxng:8080\nllmaker service add whisper      # speech-to-text      → whisper:8000\nllmaker service add open-webui   # ChatGPT-style UI    → open-webui:8080\nllmaker service add langfuse     # observability       → langfuse:3000\n```\n\n| 
Category | Services |\n|---|---|\n| Vector databases | Qdrant · Chroma · pgvector (Postgres) · Weaviate |\n| Cache / memory | Redis (powers per-session agent memory) |\n| Embeddings | HuggingFace Text-Embeddings-Inference |\n| Search | SearXNG (self-hosted metasearch) |\n| Speech-to-text | Whisper (faster-whisper, OpenAI-compatible) |\n| Observability | Langfuse |\n| Web UI & apps | Open WebUI (ChatGPT-style UI) · n8n (workflow automation) · Flowise (visual LLM app builder) |\n| Agent | LangGraph retrieval & recommendation agent |\n\nEvery model and service joins a private Docker network (`llmaker-net`) and is\naddressable there by name — service discovery without IPs, links, or a Compose\nfile. Applications running on the host or in their own container reach the stack\nthe same way:\n\n```\ndocker run --rm --network llmaker-net redis:7-alpine redis-cli -h redis ping   # → PONG\n```\n\nAdding a service is a single entry in `internal/service/catalog.go`; the CLI,\nfleet view, and declarative engine pick it up automatically.\n\n`stack init` generates a `stack.yaml`; it can also be authored by hand. `apply`\nreconciles the running stack to the file — provisioning services before the\napplications that depend on them — and `--prune` removes anything not declared.\nGive the file a top-level `name:` and `--prune` is **scoped to that stack**, so\napplying one stack never deletes another's containers (scaffolded stacks are\nnamed automatically). 
An unnamed file prunes the whole managed fleet.\n\n```\n# stack.yaml  →  llmaker apply -f stack.yaml [--prune]\ndefaults: { backend: ollama }\ninstances:\n  - { name: chat, model: llama3:8b, memory: 8g }   # → chat:8080\nservices:\n  - use: qdrant                                    # → qdrant:6333\n  - { name: cache, use: redis }                    # → cache:6379\n  - { name: embeddings, use: embeddings, env: { MODEL_ID: BAAI/bge-small-en-v1.5 } }\n  - use: agent                                     # → agent:8800\n```\n\nUnset ports are assigned automatically; a stack may be services-only. See\n[examples/stack.yaml](/raiyanyahya/llmaker/blob/master/examples/stack.yaml) and\n[`examples/llm.yaml`](/raiyanyahya/llmaker/blob/master/examples/llm.yaml).\n\n```\n┌──────────────────────────────────────────────────────────────────────┐\n│  llmaker CLI   (Go — single static binary)                            │\n│  orchestration · Docker SDK · private networking · declarative apply  │\n└───────────────────────────────┬──────────────────────────────────────┘\n                                │  provision · start · stop · HTTP\n                                ▼\n   ════════════════ llmaker-net  (private network, DNS by name) ════════════════\n    ┌── Model instance ───────────┐   ┌── Services ───────────────────────────┐\n    │ engine ⇄ facade (FastAPI)   │   │ qdrant · embeddings · redis · pgvector │\n    │ Ollama · llama.cpp*         │   │ langfuse · …                           │\n    │ OpenAI /v1/* · web UI       │   │ qdrant:6333   embeddings:80            │\n    │ chat:8080                   │   └────────────────────────────────────────┘\n    └─────────────────────────────┘                  ▲\n                    ▲                                 │\n                    └──────────────┬──────────────────┘\n                    ┌── Agent (FastAPI + LangGraph) ───┐\n                    │ rewrite → retrieve → rerank →     │   agent:8800\n                    │ generate · 
ingest · recommend     │\n                    └───────────────────────────────────┘\n              host ports (127.0.0.1:PORT) mapped per container\n```\n\n* The llama.cpp backend is scaffolded but still maturing; Ollama is the verified default — see the roadmap.\n\nThe control plane is a single Go binary; the data plane is containers on a private\nnetwork. Orchestration logic is decoupled from Docker behind a `Runtime`\n\ninterface, and the fleet is tracked entirely through container labels — there is\nno local state file to drift out of sync. Model facades and the agent are Python\n(FastAPI), each communicating over the same HTTP contract.\n\n| Command | Description |\n|---|---|\n`llmaker stack up <assistant|voice|rag|research|code|chatbot|faq|recommend|sql>` |\nScaffold a stack and apply it in one command |\n`llmaker stack init <template>` |\nGenerate a ready-to-apply stack definition to edit |\n`llmaker apply -f stack.yaml` |\nProvision / reconcile a declarative stack — `--prune` |\n`llmaker up [preset]` |\nProvision a model instance — preset, flags, or interactive wizard |\n`llmaker stop | start | restart | rm <name>...` |\nInstance lifecycle — `restart` = stop+start, `rm --force` removes a running one |\n`llmaker service catalog` |\nList available services |\n`llmaker service add <type> [name]` |\nProvision a service — `--env` , `--port` , `--memory` |\n`llmaker service ls | rm | stop | start | restart` |\nManage services — `--json` |\n`llmaker ls` |\nList the fleet — models and services — `--json` , `--quiet` |\n`llmaker top` |\nLive resource dashboard across the fleet |\n`llmaker status <name>` |\nDetailed instance status — `--json` |\n`llmaker pull <model> --on <name>` |\nDownload a model with progress — `--default` |\n`llmaker chat [name]` |\nInteractive or one-shot chat — `--message` , stdin |\n`llmaker open <name>` |\nOpen a container's web UI — `--print` |\n`llmaker logs <name> -f` |\nStream logs from any container |\n`llmaker doctor` |\nValidate the 
environment (Docker, GPU, platform caveats) |\n\n| Setting | Where | Default |\n|---|---|---|\n| backend / model | `--backend` · `--model` · `stack.yaml` | `ollama` · backend default |\n| memory · cpus · gpu | flags · `stack.yaml` | host-derived |\n| port · host | `--port` · `--host` | auto · `127.0.0.1` |\n| service environment | `service add --env` · `env:` in `stack.yaml` | per-service defaults |\n| `API_KEY` · `CORS_ORIGINS` · `KEEP_ALIVE` | `--api-key` · `--cors` · `--keep-alive` | open · `*` · `5m` |\n\nPer-service and agent configuration (model URLs, chunking, reranking, tracing\nkeys) is documented in [agent/README.md](/raiyanyahya/llmaker/blob/master/agent/README.md) and\n[`facade/README.md`](/raiyanyahya/llmaker/blob/master/facade/README.md).\n\nEvery container binds to `127.0.0.1` by default; nothing is exposed until you opt\nin, and exposure pairs with authentication:\n\n```\nllmaker up --host 0.0.0.0 --api-key \"$(openssl rand -hex 16)\"\n```\n\nWhen `API_KEY` is set, every `/v1/*` and `/api/*` request requires a bearer token\n(liveness probes excepted). The agent enforces its own `API_KEY` identically. The\nLangfuse keys and database password in the catalog are **development defaults** —\nrotate them before exposing a stack beyond localhost.\n\nDocker on macOS cannot pass through the Apple GPU; a containerized engine runs\nCPU-only. `llmaker doctor` detects and reports this. 
On Linux with NVIDIA, `--gpu` reserves GPUs via the NVIDIA Container Toolkit.\n\n| Image | Size | Use |\n|---|---|---|\n| `llmaker-ollama:latest` | ~8.5 GB | GPU-capable (Linux + NVIDIA) |\n| `llmaker-ollama:cpu` | ~360 MB | CPU-only — laptops, CI, macOS |\n| `llmaker-agent:latest` | ~510 MB | LangGraph agent — RAG, tools, eval, summarize/extract, transcribe |\n\nImages are resolved with a pull-if-missing policy, so locally built images\n(`make image-agent`) are used directly without contacting a registry.\n\n```\nmake build        # build ./bin/llmaker\nmake check        # gofmt + vet + go test (CI parity)\n\nmake facade-setup && make facade-test     # model facade (pytest)\nmake agent-setup  && make agent-test      # retrieval/recommendation agent (pytest)\n\nmake images       # build backend + agent images\n```\n\nThe Go control plane is tested against an in-memory runtime (no Docker required).\nThe model facade and the agent — routes, the LangGraph pipeline, reranking,\ntracing, and recommendation — are tested against in-memory fakes. 
CI runs Go race\ntests, `gofmt`, a ruff-linted Python test matrix, and image builds on every push.\n\n```\ncmd/llmaker/            CLI entrypoint\ninternal/\n  backend/              inference engines and image references\n  service/              the service catalog\n  engine/               domain model, ports, labels, Runtime interface\n    dockerrt/           Docker implementation and the private network\n    enginetest/         in-memory Runtime for tests\n  config/               stack.yaml parsing and dependency ordering\n  cli/ · ui/ · tui/     Cobra commands and the terminal interface\nfacade/                 model facade (FastAPI) + per-model web UI\nagent/                  retrieval & recommendation agent (FastAPI + LangGraph)\nimages/                 backend and agent Dockerfiles\n```\n\nStatus: alpha. Checked capabilities are implemented and covered by the test suite; the core stack is verified end-to-end against live Docker.\n\n- Model instances — OpenAI-compatible facade, per-model UI, fleet management\n- Service catalog — vector databases, cache, embeddings, search, observability\n- Private networking — automatic service discovery by name\n- Declarative stacks — `stack init` templates and reconciling `apply --prune`\n- Retrieval agent — LangGraph `rewrite → retrieve → rerank → generate`, multi-turn\n- Recommendation engine — semantic `query` and \"more like this\"\n- Integrated observability — Langfuse tracing\n- Tool-calling agent — calculator, knowledge base, time, web search, read-only SQL\n- Self-hosted web search — SearXNG service + a `web_search` agent tool\n- Evaluation harness — `/api/eval` graded by LLM-as-judge, traced to Langfuse\n- Summarization & extraction — `/api/summarize` (map-reduce), `/api/extract` (typed JSON)\n- Speech-to-text — Whisper service + `/api/transcribe`\n- Conversation memory — Redis-backed per-session history (`session_id`)\n- More agent tooling — dedicated cross-encoder reranking; 
richer eval datasets\n- Additional backends — llama.cpp model management; Metal on macOS\n- Distribution — multi-architecture images, package managers, releases\n\nContributions are welcome. Keep the suite green (`make check`, `make facade-test`,\n`make agent-test`), match the surrounding style, and include tests. Adding a\nservice is a single catalog entry; adding a model backend is a single facade\nadapter.\n\n[Apache 2.0](/raiyanyahya/llmaker/blob/master/LICENSE) © Raiyan Yahya.", "url": "https://wpnews.pro/news/show-hn-self-hosting-a-modern-llm-stack", "canonical_source": "https://github.com/raiyanyahya/llmaker", "published_at": "2026-06-29 03:18:27+00:00", "updated_at": "2026-06-29 03:58:24.456111+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools", "ai-agents", "ai-products"], "entities": ["llmaker", "Qdrant", "Chroma", "pgvector", "Weaviate", "Redis", "Langfuse", "Open WebUI"], "alternates": {"html": "https://wpnews.pro/news/show-hn-self-hosting-a-modern-llm-stack", "markdown": "https://wpnews.pro/news/show-hn-self-hosting-a-modern-llm-stack.md", "text": "https://wpnews.pro/news/show-hn-self-hosting-a-modern-llm-stack.txt", "jsonld": "https://wpnews.pro/news/show-hn-self-hosting-a-modern-llm-stack.jsonld"}}