{"slug": "llm-apis-with-built-in-chatbot-in-1-line-of-code", "title": "LLM APIs with built-in chatbot in 1 line of code", "summary": "Flama 2.0 introduces a CLI tool that allows users to download, package, and serve large language models from HuggingFace with a single command, including a built-in chat interface and production-ready API. The tool supports models like Google's Gemma 4 and enables local interaction and agentic workflows without boilerplate code.", "body_md": "Serving LLMs with the Flama CLI\n\nFlama 2.0 brings first-class support for generative AI: downloading, packaging, and serving large language models (LLMs) is now as simple as running a few commands in your terminal. No boilerplate code, no custom serving infrastructure, no configuration files. Just the CLI and a model.\n\nIn this post, we walk through the entire workflow: fetching a model from HuggingFace, interacting with it locally in your terminal, and serving it over HTTP with a production-ready API and a built-in chat interface. We will also show how a locally served model can power agentic workflows, using Claude CLI as a practical example.\n\nBefore we dive into the details, we recommend keeping the following resources at hand:\n\n- Official Flama documentation: [Flama documentation](https://flama.dev/docs/)\n- Generative AI section: [Generative AI docs](https://flama.dev/docs/generative-ai/overview/)\n- Flama GitHub repository: [Flama on GitHub](https://github.com/vortico/flama)\n\nFetching a model with `flama get`\n\nThe first step in serving an LLM with Flama is downloading and packaging a model into a `.flm` artifact (a *Flama Lightweight Model* file). 
The `flama get` command handles this in a single step: it downloads the model weights and\nconfiguration from a supported source and serialises them into the portable `.flm` format.\n\nAll examples in this post assume Flama has been installed with the LLM extras via\n[uv](https://docs.astral.sh/uv/):\n\n```\nuv pip install \"flama[llm,pydantic]\"\n```\n\nAlternatively, you can run any command without a prior install by using\n`uvx --from \"flama[llm,pydantic]\" flama ...`, but for brevity we assume Flama is already installed throughout.\n\nLet us fetch a quantised version of Google's Gemma 4 model, optimised for Apple Silicon via the MLX Community:\n\n```\nflama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit\nDownloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2.3 GB 28.7 MB/s 0:00:00\nPackaging...\nModel saved to mlx-community_gemma-4-E2B-it-qat-4bit.flm\n```\n\nTwo options are required: `--source` tells Flama where to download from (currently HuggingFace), and `--family` declares\nwhether the artifact is a traditional machine-learning model (`ml`) or a generative model (`llm`). For large language\nmodels, you always pass `--family llm`.\n\nThe output path defaults to `<model-name>.flm` with slashes replaced by underscores. If you prefer a custom path, pass\n`--output`:\n\n```\nflama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit --output models/gemma.flm\n```\n\nWhat happens under the hood\n\nWhen you run `flama get`, the following happens:\n\n- Flama resolves the model identifier against the HuggingFace Hub and discovers the files that make up the model (weights, tokenizer, configuration).\n- Files are downloaded concurrently (up to 8 parallel downloads by default, configurable with `--max-concurrent`).\n
- Once all files are on disk, Flama packages them into a single `.flm` archive alongside a manifest that records the model family, the originating library, and metadata such as the model name and creation timestamp.\n\nThe result is a self-contained, portable artifact. The `.flm` format is framework-agnostic: the same file runs on vLLM\n(Linux with CUDA) or MLX (Apple Silicon), with Flama selecting the appropriate backend at load time based on what is\navailable in the environment.\n\nInteracting with the model locally\n\nOnce you have a packaged `.flm` artifact, you can interact with it directly from your terminal using the\n`flama model` command. No server, no HTTP, no code. This is invaluable for quick testing, prompt experimentation, and\npipeline scripting.\n\nOne-shot queries with `flama model run`\n\nThe `run` sub-command sends a prompt to the model, waits for the full response, and prints it:\n\n```\necho \"What is Flama?\" | flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run --system \"Be concise.\"\nFlama is a Python framework for building production-ready APIs with a focus on machine learning\nand generative AI, enabling one-line model serving behind HTTP endpoints.\n```\n\nYou can tune generation with `--param` flags:\n\n```\necho \"Explain dependency injection in three sentences.\" | \\\n  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run \\\n    --system \"You are a software engineering instructor.\" \\\n    --param temperature=0.7 \\\n    --param max_tokens=256\n```\n\nFor multi-turn conversations, use the `--transport conversation` flag and pass a JSON message list:\n\n```\necho '[{\"role\": \"user\", \"content\": \"Hi!\"}, {\"role\": \"assistant\", \"content\": \"Hello! 
How can I help?\"}, {\"role\": \"user\", \"content\": \"What is an API?\"}]' | \\\n  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run --transport conversation\n```\n\nStreaming responses with `flama model stream`\n\nFor an interactive, token-by-token experience (especially useful with longer responses), use `stream` instead of `run`.\nTokens are printed as they are generated, giving you immediate feedback:\n\n```\necho \"What is flama (python package)?\" | \\\n  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm stream --system \"Be concise.\"\nFlama is a Python framework for building production-ready ML and generative AI APIs.\nIt lets you serve models behind HTTP endpoints with minimal code, supporting\nvLLM and MLX backends, OpenAI/Anthropic/Ollama-compatible protocols,\nbuilt-in chat interfaces, and the Model Context Protocol for agent tooling.\n```\n\nThe streaming output appears progressively in your terminal, token by token, making it feel like a real conversation. This is especially satisfying when working with models that produce longer, more detailed responses.\n\nYou can also ask the model about itself and the framework it runs on:\n\n```\necho \"How do I serve an ML model with Flama?\" | \\\n  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm stream \\\n    --system \"You are a helpful assistant that knows about the Flama Python framework.\"\n```\n\nIf you want to see multiple output channels (for instance, reasoning and output), pass `--channel` once per channel:\n\n```\necho \"Solve step by step: what is 23 * 47?\" | \\\n  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm stream \\\n    --channel thinking --channel output\n```\n\nServing the model over HTTP\n\nThe true power of Flama lies in going from a local model to a production-ready HTTP API in a single command. No Python\ncode, no configuration files, no Docker images. 
Just `flama serve`.\n\nThe `flama serve` command\n\nTo serve the model we downloaded earlier:\n\n```\nflama serve --model file=mlx-community_gemma-4-E2B-it-qat-4bit.flm,url=/,name=gemma\nINFO:     Started server process [52341]\nINFO:     Waiting for application startup.\nINFO:     Model starting (name: gemma)\nINFO:     Model ready (name: gemma)\nINFO:     Application startup complete.\nINFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)\n```\n\nThat is it. A single command and your model is live behind a full HTTP API. Let us unpack what `--model` accepts:\n\n- `file` (required): Path to the `.flm` artifact.\n- `url`: The URL prefix under which the model's endpoints are mounted (default: `/`).\n- `name`: The resource name, used for OpenAPI tags and dependency injection.\n- `serving`: Comma-separated list of dialects to enable (e.g., `native,openai,anthropic,ollama`). When omitted, all dialects are mounted.\n- `params`: Default generation parameters (e.g., `temperature=0.7`).\n\nYou can serve multiple models in a single application:\n\n```\nflama serve \\\n  --model file=gemma.flm,url=/gemma,name=gemma,serving=native+openai \\\n  --model file=qwen.flm,url=/qwen,name=qwen,serving=native+anthropic\n```\n\nAnd you can configure the server with the usual options:\n\n```\nflama serve \\\n  --model file=mlx-community_gemma-4-E2B-it-qat-4bit.flm,url=/,name=gemma \\\n  --server-host 0.0.0.0 \\\n  --server-port 9000 \\\n  --app-title \"My LLM API\" \\\n  --app-docs /docs/\n```\n\nThe built-in chat interface\n\nWhen the native serving dialect is enabled (which it is by default), your model comes with a built-in chat interface\naccessible at the `/chat/` route (relative to the model's URL prefix). If you served the model at `/`, then navigate\nto `http://127.0.0.1:8000/chat/`.\n\nYou will be greeted with a polished, production-quality chat interface where you can type prompts and watch the model's responses stream in token by token. 
The interface renders Markdown, LaTeX math (via KaTeX), and Mermaid diagrams out of the box, so technical answers look exactly as intended.\n\nThe chat interface requires no frontend code, no build step, and no external dependencies. It is a self-contained\nsingle-page application served directly from the framework. Every model you serve gets its own chat window (e.g.,\n`/gemma/chat/`, `/qwen/chat/`), each connected to its respective model's streaming endpoint.\n\nPowering agentic workflows\n\nOne of the most compelling use cases for a locally served LLM is powering agentic workflows. Because Flama exposes your model through industry-standard protocols (OpenAI, Anthropic, Ollama), any tool that speaks those protocols can use your local model as its backend.\n\nUsing Claude CLI with a local model\n\nA practical example is using the [Claude CLI](https://docs.anthropic.com/en/docs/claude-code) with a local model served\nby Flama. Instead of sending your prompts to Anthropic's servers, you can route them through your own locally running\nmodel.\n\nFirst, make sure your model is served with the Anthropic dialect enabled:\n\n```\nflama serve --model file=mlx-community_gemma-4-E2B-it-qat-4bit.flm,url=/,name=gemma,serving=native+anthropic\n```\n\nThen, configure Claude CLI to use your local Flama endpoint by adding the following to your\n`~/.claude/settings.json` (or `.claude/settings.local.json` for a project-specific override):\n\n```\n{\n  \"env\": {\n    \"ANTHROPIC_BASE_URL\": \"http://127.0.0.1:8000/anthropic\",\n    \"ANTHROPIC_DEFAULT_SONNET_MODEL\": \"gemma\",\n    \"ANTHROPIC_DEFAULT_OPUS_MODEL\": \"gemma\",\n    \"ANTHROPIC_DEFAULT_HAIKU_MODEL\": \"gemma\"\n  }\n}\n```\n\nWith this configuration in place, simply launch Claude CLI as usual:\n\n```\nclaude\n```\n\nClaude CLI will now route all its requests through your local Flama-served model instead of Anthropic's cloud. 
Here is what a typical interaction looks like:\n\n```\n❯ claude\nWhat is this repo about?\n∴ Thinking…\n  The user is asking what the repository is about.\n  I should use the Agent tool to perform a general-purpose search or exploration,\n  as this is a broad question about the project's context.\n⏺ Agent(Find out what the repository is about)\n  ⎿  Done (2 tool uses · 14.5k tokens · 36s)\n∴ Thinking…\n  The agent returned a detailed summary of the project, which is a Python library\n  called \"Flama\" for developing and deploying APIs, especially ML APIs, focusing\n  on simplifying boilerplate code while providing robust features like schema\n  validation, dependency injection, and automatic documentation.\n⏺ The repository is a Python library called Flama, which provides a framework\n  for developing and deploying APIs, with a special focus on Machine Learning\n  (ML) applications.\n  It aims to make deploying ML APIs \"ridiculously simple\" by simplifying\n  boilerplate code while offering robust features such as:\n  - API development for GraphQL, REST, and ML endpoints.\n  - Resource handling with CRUD methods over SQLAlchemy tables.\n  - Schema validation for inputs and outputs.\n  - Dependency Injection via Components.\n  - Automatic generation of OpenAPI documentation (Swagger UI, ReDoc).\n  What specific part of the framework or implementation would you like to\n  explore next?\n✻ Cogitated for 1m 13s\n```\n\nYour agentic tasks (code generation, file editing, research) run entirely on your local hardware. 
This gives you:\n\n- **Privacy**: Your prompts and code never leave your machine.\n- **Cost**: No API usage charges for development and experimentation.\n- **Speed**: No network latency to cloud providers (especially valuable for iterative agent loops).\n- **Control**: You choose the model, the quantisation, and the generation parameters.\n\nThis same pattern works with any agent framework that supports custom API base URLs: LangChain, CrewAI, AutoGen, or any custom tool that accepts a base URL configuration.\n\nConclusions\n\nFlama 2.0 makes the journey from \"I want to use an LLM\" to \"I have a production-ready API\" as short as possible. The CLI provides three levels of interaction:\n\n- `flama get`: Download and package any model from HuggingFace into a portable `.flm` artifact.\n- `flama model`: Interact with the model directly in your terminal, for quick testing and scripting.\n- `flama serve`: Serve the model over HTTP with OpenAI/Anthropic/Ollama compatibility, a built-in chat interface, and streaming support.\n\nNo boilerplate Python code, no YAML configuration, no container orchestration. Just the CLI.\n\nThe fact that Flama speaks the protocols your tools already understand (OpenAI, Anthropic, Ollama) means that adopting a local model in your workflow requires changing nothing but a base URL. Your existing SDKs, agent frameworks, and chat interfaces work without modification.\n\nIn upcoming posts, we will explore how to build MCP servers with Flama to expose tools and resources to AI agents, and how to combine LLM serving with the Model Context Protocol for truly powerful agentic applications.\n\nSupport our work\n\nIf you find Flama useful for building robust Machine Learning and Generative AI APIs, we'd be thrilled if you showed\nyour support by giving us a ⭐ on [GitHub](https://github.com/vortico/flama). 
Your stars are the best fuel for our\ndevelopment efforts!\n\nYou can also stay updated with the latest news and development threads by following us on\n[𝕏](https://x.com/VorticoTech).\n\nAbout the authors\n\n[Vortico](https://vortico.tech/): We specialize in software development, helping businesses enhance and expand their AI and technology capabilities.", "url": "https://wpnews.pro/news/llm-apis-with-built-in-chatbot-in-1-line-of-code", "canonical_source": "https://flama.dev/blog/serving_llms_with_flama_cli/", "published_at": "2026-06-25 11:08:01+00:00", "updated_at": "2026-06-25 11:14:08.707114+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure", "ai-tools", "generative-ai"], "entities": ["Flama", "HuggingFace", "Google", "Gemma 4", "MLX Community", "Claude CLI", "vLLM", "Apple Silicon"], "alternates": {"html": "https://wpnews.pro/news/llm-apis-with-built-in-chatbot-in-1-line-of-code", "markdown": "https://wpnews.pro/news/llm-apis-with-built-in-chatbot-in-1-line-of-code.md", "text": "https://wpnews.pro/news/llm-apis-with-built-in-chatbot-in-1-line-of-code.txt", "jsonld": "https://wpnews.pro/news/llm-apis-with-built-in-chatbot-in-1-line-of-code.jsonld"}}