{"slug": "show-hn-forge-guardrails-take-an-8b-model-from-53-to-99-on-agentic-tasks", "title": "Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks", "summary": "Forge is a reliability layer for self-hosted LLM tool-calling that uses guardrails and context management to dramatically improve performance on multi-step agentic tasks, boosting an 8B model from 53% to 99% accuracy. It offers three integration methods: a full WorkflowRunner for structured agent loops, composable guardrails middleware for custom orchestration, and an OpenAI-compatible proxy server that transparently applies guardrails to any client. The system supports multiple backends including Ollama, llama-server, Llamafile, and Anthropic, with the top self-hosted configuration scoring 86.5% across 26 evaluation scenarios.", "body_md": "A reliability layer for self-hosted LLM tool-calling. You give forge a set of tools; the model calls whichever it wants in whatever order. Workflow structure is opt-in — `required_steps`, `prerequisites`, and `terminal_tool` let you constrain the loop when you need to, but forge's guardrails (rescue parsing, retry nudges, response validation) apply with zero required steps too.\n\nForge takes an 8B local model from single digits to 84% across forge's 26-scenario v0.7.0 eval suite — and even lifts Sonnet 4.6 from 85% to 98% on the same workload (Anthropic numbers measured in v0.6.0; not re-run in v0.7.0 since the cost is non-trivial).\n\n**What forge isn't:**\n\n- **Not an agent orchestrator.** Forge sits inside one agentic loop and makes its tool calls reliable. Multi-agent graphs, DAG planners, and cross-agent coordination are out of scope.\n- **Not a coding harness.** Forge is domain-agnostic. 
If you're building a coding agent (or already using one like opencode, aider, Cline), [proxy mode](#proxy-server) lifts your existing harness with forge's guardrails — no rewrite.\n\n**Three ways to use it:**\n\n- **Proxy server** — Drop-in proxy (`python -m forge.proxy`) speaking both the OpenAI chat-completions and Anthropic Messages (`/v1/messages`) APIs, sitting between any client and a local model server. Point OpenAI-compatible tools (opencode, Continue, aider) **or Claude Code** at it and forge applies guardrails transparently — the client thinks it's talking to a smarter model. Most popular entry point.\n- **WorkflowRunner** — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. **SlotWorker** adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.\n- **Guardrails middleware** — Use forge's reliability stack ([composable middleware](/antoinezambelli/forge/blob/main/examples/foreign_loop.py)) inside your own orchestration loop. 
You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.\n\nSupports Ollama, llama-server (llama.cpp), Llamafile, vLLM, and Anthropic as backends.\n\n- Python 3.12+\n- A running LLM backend (see below)\n\n```\npip install forge-guardrails                # core only\npip install \"forge-guardrails[anthropic]\"   # + Anthropic client\n```\n\nFor development:\n\n```\ngit clone https://github.com/antoinezambelli/forge.git\ncd forge\npip install -e \".[dev]\"\n```\n\n**llama-server** (recommended — top 10 eval configs all run on llama-server):\n\n```\n# Install from https://github.com/ggml-org/llama.cpp/releases\nllama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080\n```\n\n**Ollama** (alternative — easier setup, slightly weaker on harder workloads):\n\n```\n# Install from https://ollama.com/download\nollama pull ministral-3:8b-instruct-2512-q4_K_M\n```\n\n**Anthropic** (API, no local GPU needed):\n\n```\npip install -e \".[anthropic]\"\nexport ANTHROPIC_API_KEY=sk-...\n```\n\nSee [Backend Setup](/antoinezambelli/forge/blob/main/docs/BACKEND_SETUP.md) for full instructions and [Model Guide](/antoinezambelli/forge/blob/main/docs/MODEL_GUIDE.md) for which model fits your hardware.\n\nStart llama-server however you normally do (e.g. in a separate shell):\n\n```\nllama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080\n```\n\nThen the Python you'll run (e.g. 
from another shell):\n\n```python\nimport asyncio\nfrom pydantic import BaseModel, Field\nfrom forge import (\n    Workflow, ToolDef, ToolSpec,\n    WorkflowRunner, LlamafileClient,\n    ContextManager, TieredCompact,\n)\n\ndef get_weather(city: str) -> str:\n    return f\"72°F and sunny in {city}\"\n\nclass GetWeatherParams(BaseModel):\n    city: str = Field(description=\"City name\")\n\nworkflow = Workflow(\n    name=\"weather\",\n    description=\"Look up weather for a city.\",\n    tools={\n        \"get_weather\": ToolDef(\n            spec=ToolSpec(\n                name=\"get_weather\",\n                description=\"Get current weather\",\n                parameters=GetWeatherParams,\n            ),\n            callable=get_weather,\n        ),\n    },\n    required_steps=[],\n    terminal_tool=\"get_weather\",\n    system_prompt_template=\"You are a helpful assistant. Use the available tools to answer the user.\",\n)\n\nasync def main():\n    client = LlamafileClient(\n        gguf_path=\"path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf\",\n        mode=\"native\",\n        recommended_sampling=True,\n    )\n    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)\n    runner = WorkflowRunner(client=client, context_manager=ctx)\n    await runner.run(workflow, \"What's the weather in Paris?\")\n\nasyncio.run(main())\n```\n\nFor multi-step workflows, multi-turn conversations, and backend auto-management, see the [User Guide](/antoinezambelli/forge/blob/main/docs/USER_GUIDE.md). If you're building a long-running session (CLI, chat server, voice assistant), see the [long-running session advisory](/antoinezambelli/forge/blob/main/docs/USER_GUIDE.md#long-running-sessions-filtering-transient-messages) for important guidance on filtering transient messages.\n\nDrop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (`/v1/messages`). 
Point your client at the proxy (e.g. `http://localhost:8081/v1`) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model.\n\nThis is the path for **using forge with an existing harness** (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite.\n\n```\n# External mode — you manage the backend, forge proxies it\npython -m forge.proxy --backend-url http://localhost:8080 --port 8081\n\n# Managed mode — forge starts the backend and the proxy together\npython -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081\n\n# Managed vLLM — pass a model directory or HF repo id via --model-path\npython -m forge.proxy --backend vllm --model-path /path/to/awq-dir --port 8081\n```\n\nThen configure your client to use `http://localhost:8081/v1` as the API base URL.\n\n**Claude Code:** the proxy also serves the Anthropic Messages API on `POST /v1/messages`, so you can point Claude Code at a forge-guarded local model — set `ANTHROPIC_BASE_URL=http://localhost:8081` and `ANTHROPIC_AUTH_TOKEN=anything` for the `claude` process. See [Using forge with Claude Code](/antoinezambelli/forge/blob/main/docs/USER_GUIDE.md#using-forge-with-claude-code) for the full setup (native-vs-prompt FC, Anthropic-shape downstreams, `cache_control`).\n\n**Backend compatibility:**\n\n- **Managed mode** spins up the backend for you. Supported backends: `llamaserver`, `llamafile`, `ollama`, `vllm` (use `--backend <name>` with `--gguf` for the GGUF-based backends, `--model-path` for vllm, or `--model` for ollama).\n- **External mode** is backend-agnostic — forge talks `POST /v1/chat/completions` to whatever you point `--backend-url` at, as long as it speaks the OpenAI schema. 
Tool calls must come back in OpenAI `tool_calls` format or in one of forge's rescue-parsed formats (Mistral `[TOOL_CALLS]`, Qwen `<tool_call>` XML, fenced JSON). For a vLLM server, add `--backend vllm` so the proxy adopts vLLM's `--served-model-name` (vLLM 404s on a mismatched `model` field, unlike llama.cpp).\n\nOn every `POST /v1/chat/completions`, forge applies (in order):\n\n- **Response validation** — each tool call in the model's response is checked against the `tools` array in the request. Calls to unknown tool names or with malformed shapes are caught before the response returns to your client.\n- **Rescue parsing** — when the model emits tool calls in the wrong format (JSON in a code fence, Mistral's `[TOOL_CALLS]name{args}`, Qwen's `<tool_call>...</tool_call>` XML), forge extracts the structured call and re-emits it in the canonical OpenAI `tool_calls` schema. Biggest practical lift for Mistral-family models.\n- **Retry loop with error tracking** — if validation fails, forge retries inference up to `--max-retries` (default 3) with a corrective tool-result message on the canonical channel, rather than returning a malformed response. From your client's perspective the proxy looks like a single request that just took a few extra ms.\n- **Synthetic `respond` tool injection** — when tools are present in the request, forge injects a synthetic `respond` tool the model calls instead of producing bare text. The `respond` call is stripped from the outbound response — the client sees a normal text response (`finish_reason: \"stop\"`) and never knows the tool exists. Essential for small local models (~8B) that can't be trusted to choose correctly between text and tool calls. 
See [ADR-013](/antoinezambelli/forge/blob/main/docs/decisions/013-text-response-intent.md) for the full analysis.\n\nProxy mode is single-shot per request; some forge features need multi-turn workflow state that the OpenAI chat-completions schema doesn't carry:\n\n- **Prerequisite enforcement and step-ordering** — these need a workflow definition spanning turns. Available in `WorkflowRunner`.\n- **Context compaction and session memory** — proxy mode forwards the inbound message list as-is; managing the rolling window is the client's job.\n- **VRAM-aware budget detection** — opt in with `--budget-mode forge-full` or `--budget-mode forge-fast`; otherwise proxy uses the backend's reported budget.\n\nFor the full guardrail surface, use `WorkflowRunner` directly. The proxy trades depth for \"use forge with your existing setup, no rewrite.\"\n\n| Flag | Default | Purpose |\n|---|---|---|\n| `--max-retries N` | 3 | Retry budget per validation failure |\n| `--no-rescue` | (rescue on) | Disable rescue parsing (debugging only) |\n| `--budget-mode {backend,manual,forge-full,forge-fast}` | `backend` | Context budget source |\n| `--budget-tokens N` | — | Manual token budget (requires `--budget-mode manual`) |\n| `--serialize` / `--no-serialize` | auto | Force request serialization (single-slot backends) |\n\nYou can run the forge proxy as a Docker container.\n\n**Build the image:**\n\n```\ndocker build -t forge-proxy .\n```\n\n**Run the container:**\n\n```\n# Connect to an external backend (e.g. vLLM hosted on the same machine)\ndocker run -p 8081:8081 forge-proxy --backend-url http://host.docker.internal:8000 --backend vllm --budget-mode manual --budget-tokens 8192\n```\n\nNote: If your backend is running on `localhost` of the host machine, use `http://host.docker.internal:PORT` (on macOS/Windows) or the host's IP address to allow the container to reach it.\n\n| Backend | Best for | Native FC? 
|\n|---|---|---|\n| Ollama | Easiest setup, model management built-in | Yes |\n| llama-server | Best performance, full control | Yes (with `--jinja`) |\n| Llamafile | Single binary, zero dependencies | No (prompt-injected) |\n| vLLM | High-throughput serving, AWQ/GPTQ weights | Yes (server-side parser) |\n| Anthropic | Frontier baseline, hybrid workflows | Yes |\n\nSee [Backend Setup](/antoinezambelli/forge/blob/main/docs/BACKEND_SETUP.md) for installation and [Model Guide](/antoinezambelli/forge/blob/main/docs/MODEL_GUIDE.md) for which model to pick.\n\n```\npython -m pytest tests/ -v --tb=short\npython -m pytest tests/ --cov=forge --cov-report=term-missing\n```\n\n26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See [Eval Guide](/antoinezambelli/forge/blob/main/docs/EVAL_GUIDE.md) for full CLI reference.\n\n```\n# llama-server (start in another terminal first; see Eval Guide)\npython -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf \"path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf\" --runs 10 --stream --verbose\n\n# Batch eval (JSONL output, automatic resume)\npython -m tests.eval.batch_eval --config all --runs 50\n\n# Reports — ASCII table by default; --html / --markdown export views\npython -m tests.eval.report eval_results.jsonl\npython -m tests.eval.report eval_results.jsonl --html docs/results/dashboard.html\npython -m tests.eval.report eval_results.jsonl --markdown docs/results/raw/\n```\n\n```\nsrc/forge/\n  __init__.py          # Public API exports\n  errors.py            # ForgeError hierarchy\n  server.py            # setup_backend(), ServerManager, BudgetMode\n  core/\n    messages.py        # Message, MessageRole, MessageType, MessageMeta\n    workflow.py        # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow\n    inference.py       # run_inference() — shared front half (compact, 
fold, validate, retry)\n    runner.py          # WorkflowRunner — the agentic loop\n    slot_worker.py     # SlotWorker — priority-queued slot access\n    steps.py           # StepTracker\n  guardrails/\n    guardrails.py      # Guardrails facade — applies the full stack in foreign loops\n    nudge.py           # Nudge dataclass\n    response_validator.py  # ResponseValidator, ValidationResult\n    step_enforcer.py   # StepEnforcer, StepCheck\n    error_tracker.py   # ErrorTracker\n  clients/\n    base.py            # ChunkType, StreamChunk, LLMClient protocol\n    ollama.py          # OllamaClient (native FC)\n    llamafile.py       # LlamafileClient (native FC or prompt-injected)\n    anthropic.py       # AnthropicClient (frontier baseline)\n  context/\n    manager.py         # ContextManager, CompactEvent\n    strategies.py      # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact\n    hardware.py        # HardwareProfile, detect_hardware()\n  prompts/\n    templates.py       # Tool prompt builders (prompt-injected path)\n    nudges.py          # Retry and step-enforcement nudge templates\n  tools/\n    respond.py         # Synthetic respond tool (respond_tool(), respond_spec())\n  proxy/\n    __main__.py        # CLI entry point: python -m forge.proxy\n    proxy.py           # ProxyServer — programmatic start/stop API\n    server.py          # Raw asyncio HTTP server, SSE streaming\n    handler.py         # Request handler — bridge between HTTP and run_inference\n    convert.py         # OpenAI messages ↔ forge Messages conversion\ntests/\n  unit/                # 865 deterministic tests — no LLM backend required\n  eval/                # Eval harness — model qualification against real backends\n```\n\n- [User Guide](/antoinezambelli/forge/blob/main/docs/USER_GUIDE.md) — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory\n- [Model Guide](/antoinezambelli/forge/blob/main/docs/MODEL_GUIDE.md) — Which model and backend for your hardware\n- [Backend Setup](/antoinezambelli/forge/blob/main/docs/BACKEND_SETUP.md) — Backend installation and server setup\n- [Eval Guide](/antoinezambelli/forge/blob/main/docs/EVAL_GUIDE.md) — Eval harness CLI reference, batch eval\n- [Architecture](/antoinezambelli/forge/blob/main/docs/ARCHITECTURE.md) — Full design document\n- [Workflow Internals](/antoinezambelli/forge/blob/main/docs/WORKFLOW.md) — Workflow design and runner internals\n- [Contributing](/antoinezambelli/forge/blob/main/CONTRIBUTING.md) — How to set up, test, and add new backends or scenarios\n\nThe forge guardrail framework and ablation study are published as:\n\nZambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling. [https://doi.org/10.1145/3786335.3813193]\n\nA pre-publication preprint is also available at [docs/forge_ieee_preprint.pdf](/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf) — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher's release timing.\n\n[MIT](/antoinezambelli/forge/blob/main/LICENSE) — Copyright (c) 2025-2026 Antoine Zambelli", "url": "https://wpnews.pro/news/show-hn-forge-guardrails-take-an-8b-model-from-53-to-99-on-agentic-tasks", "canonical_source": "https://github.com/antoinezambelli/forge", "published_at": "2026-05-19 12:23:07+00:00", "updated_at": "2026-05-19 22:04:16.768578+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "open-source", "developer-tools"], "entities": ["Forge", "Ministral-3 8B Instruct", "llama-server", "Ollama", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/show-hn-forge-guardrails-take-an-8b-model-from-53-to-99-on-agentic-tasks", "markdown": "https://wpnews.pro/news/show-hn-forge-guardrails-take-an-8b-model-from-53-to-99-on-agentic-tasks.md", "text": "https://wpnews.pro/news/show-hn-forge-guardrails-take-an-8b-model-from-53-to-99-on-agentic-tasks.txt", "jsonld": 
"https://wpnews.pro/news/show-hn-forge-guardrails-take-an-8b-model-from-53-to-99-on-agentic-tasks.jsonld"}}