LLM APIs with built-in chatbot in 1 line of code

Flama 2.0 introduces a CLI tool that lets users download, package, and serve large language models from HuggingFace with a single command, including a built-in chat interface and a production-ready API. The tool supports models such as Google's Gemma 4 and enables local interaction and agentic workflows without boilerplate code.

Serving LLMs with the Flama CLI

Flama 2.0 brings first-class support for generative AI: downloading, packaging, and serving large language models (LLMs) is now as simple as running a few commands in your terminal. No boilerplate code, no custom serving infrastructure, no configuration files. Just the CLI and a model.

In this post, we walk through the entire workflow: fetching a model from HuggingFace, interacting with it locally in your terminal, and serving it over HTTP with a production-ready API and a built-in chat interface. We will also show how a locally served model can power agentic workflows, using Claude CLI as a practical example.

Before we dive into the details, we recommend having the following resources at hand:

- Official Flama documentation: https://flama.dev/docs/
- Generative AI section: https://flama.dev/docs/generative-ai/overview/
- Flama GitHub repository: https://github.com/vortico/flama

Fetching a model with flama get

The first step in serving an LLM with Flama is downloading and packaging a model into a .flm artifact (a Flama Lightweight Model file). The flama get command handles this in a single step: it downloads the model weights and configuration from a supported source and serialises them into the portable .flm format.
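For readers who would rather drive this step from a script than from a shell, here is a minimal Python sketch that assembles and runs the flama get command. It assumes the flama executable is on your PATH and uses only the flags that appear in the example command in this section. The helper names build_flama_get_command and fetch_model are hypothetical and are not part of Flama.

```python
import shlex
import subprocess

# Families named in this post: "ml" for traditional models, "llm" for generative ones.
KNOWN_FAMILIES = {"ml", "llm"}


def build_flama_get_command(model: str, family: str = "llm", source: str = "huggingface") -> list[str]:
    """Build the argument list for `flama get` (hypothetical helper, not a Flama API)."""
    if family not in KNOWN_FAMILIES:
        raise ValueError(f"unknown family {family!r}, expected one of {sorted(KNOWN_FAMILIES)}")
    if not model:
        raise ValueError("a model identifier is required")
    return ["flama", "get", "--family", family, "--source", source, model]


def fetch_model(model: str, family: str = "llm", source: str = "huggingface") -> None:
    """Run `flama get`; assumes the `flama` CLI is installed and on PATH."""
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_flama_get_command(model, family, source), check=True)


# Show the command that would be executed, without downloading anything.
print(shlex.join(build_flama_get_command("mlx-community/gemma-4-E2B-it-qat-4bit")))
# prints: flama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit
```

Calling fetch_model("mlx-community/gemma-4-E2B-it-qat-4bit") would then perform the actual download. Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual model identifiers.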

All examples in this post assume Flama has been installed with the LLM extras via uv (https://docs.astral.sh/uv/):

    uv pip install "flama[llm,pydantic]"

Alternatively, you can run any command without a prior install by using uvx --from "flama[llm,pydantic]" flama ..., but for brevity we assume Flama is already installed throughout.

Let us fetch a quantised version of Google's Gemma 4 model, optimised for Apple Silicon via the MLX Community:

    flama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit
    Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2.3 GB 28.7 MB/s 0:00:00
    Packaging...
    Model saved to mlx-community gemma-4-E2B-it-qat-4bit.flm

Two options are required: --source tells Flama where to download from (currently HuggingFace), and --family declares whether the artifact is a traditional machine-learning model (ml) or a generative model (llm). For large language models, you always pass --family llm. The output path defaults to