Best Small Language Models on Hugging Face Right Now!

A 4-billion-parameter AI model released in early 2025 is now outperforming models seven times its size on standard reasoning benchmarks, with Google's Gemma 3 4B scoring 89.2% on GSM8K math reasoning and Microsoft's Phi-4-mini achieving 83.7% on ARC-C. These small language models, defined as under 7 billion parameters, can run on a single consumer GPU, laptop, or smartphone, eliminating the need for cloud infrastructure or API rate limits. The shift is driven by better training data, distillation from frontier models, and architectural improvements like mixture-of-experts, making small models viable for real-world deployment.

Introduction

Here is something that should shift how you think about AI model size: a 4-billion-parameter model released in early 2025 is now outscoring models that were 7x larger on standard reasoning benchmarks. Google's Gemma 3 4B posts an 89.2% on GSM8K math reasoning. Microsoft's Phi-4-mini (https://huggingface.co/microsoft/Phi-4-mini-instruct) at 3.8B hits 83.7% on ARC-C, the highest score in its entire size class. These numbers used to belong to 30B+ models. So the question "do I really need a 70B model for this?" deserves a second look.

For the purposes of this article, "small" means under 7 billion parameters: models that can run on a single consumer GPU, a laptop, or even a modern smartphone with the right setup. That threshold matters because it marks the boundary between models that require serious infrastructure and models that anyone can actually deploy. No cloud bill. No waiting on API rate limits. Just a model running locally, doing real work.

What you will get from this article: a curated look at the best small language models currently available on Hugging Face, what each one is actually good at, the benchmark numbers that back those claims up, and the code to get started with each one.

Why Small Language Models Are Worth Your Attention Right Now

The honest reason most people ignored small models until recently is that they were not good enough. A 3B model from 2022 would struggle with multi-step reasoning, fall apart on code generation, and produce generic, forgettable outputs on anything nuanced. That reputation stuck even as the models quietly got much better.

Three things changed the trajectory:

Better training data, not more of it.
Microsoft trained Phi-4-mini (https://huggingface.co/microsoft/Phi-4-mini-instruct) on 5 trillion tokens, but the emphasis was on quality: synthetic data generated to be reasoning-dense, filtered public web content, and structured educational material. The bet paid off. A 3.8B model trained carefully on the right data can outperform a 13B model trained carelessly on everything. Qwen3-0.6B, at just 600 million parameters, supports over 100 languages because its training corpus was built with that goal in mind, not as an afterthought.

Distillation from frontier models. DeepSeek-R1-Distill-Qwen-1.5B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) is a 1.5B model that learned to reason by being trained on outputs from a much larger reasoning model. The result is a tiny model that can walk through problems step by step in a way that felt impossible at that size two years ago. Distillation is now a standard playbook: take a massive, capable teacher and compress its behavior into a fraction of the parameters.

Architectural improvements. Mixture-of-Experts (MoE) and other selective-activation designs changed what "parameter count" even means. Google's Gemma 3n E4B (https://huggingface.co/google/gemma-3n-E4B-it), which uses a nested MatFormer design rather than MoE routing, has 8 billion total parameters but runs with the memory footprint of a 4B model while drawing on the capacity of an 8B one. Hybrid attention mechanisms and longer context windows (128K is now common even in sub-5B models) pushed capabilities even further without bloating the model size.

If you have spent time on Hugging Face model pages, you know they can be dense. Before diving into the model list, here is a quick breakdown of the terms that will come up repeatedly.

Parameters. Parameters are the numerical weights inside a model that determine how it responds to input. More parameters generally mean more capacity to store knowledge and handle complex reasoning, but not always better outputs.

The benchmarks you will see referenced.
MMLU-Pro (https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) is a harder version of the classic Massive Multitask Language Understanding (MMLU) test. The original MMLU covers 57 academic subjects (law, medicine, history, physics, and more); MMLU-Pro draws on the same kind of material but raises the number of answer choices from four to ten and leans on reasoning-heavy questions, so the options are genuinely tricky. A score of 50+ on MMLU-Pro from a sub-5B model is notable. A score above 70 is exceptional.

GSM8K (https://huggingface.co/datasets/openai/gsm8k), short for Grade School Math 8K, is a set of 8,500 grade-school math word problems that require multi-step reasoning to solve. It sounds simple but consistently separates models that reason from models that pattern-match. Scores are reported as a percentage of problems solved correctly.

HumanEval (https://huggingface.co/datasets/openai_humaneval/openai_humaneval) tests code generation. The model is given a Python function signature and a docstring, and it has to write code that passes a hidden test suite. Scores above 60% from a sub-5B model are genuinely impressive.

ARC-C (https://huggingface.co/datasets/allenai/ai2_arc), the Challenge set of the AI2 Reasoning Challenge, is a collection of science questions from standardized exams, specifically the ones that stumped other AI systems. It tests common-sense and scientific reasoning.

Base models vs. instruct models vs. thinking models. A base model is trained to predict the next token: it generates text but does not follow instructions reliably. An instruct model has been fine-tuned to respond helpfully to prompts in a conversational format. That is what you want for most applications. Thinking or reasoning models (like Qwen3's "thinking mode" or the DeepSeek-R1 distills) go a step further: they generate a chain-of-thought reasoning process before answering, which improves accuracy on complex problems at the cost of slower response times. Most models in this list are instruct variants.

Quantization and GGUF. A model fresh off training stores its weights in 16-bit or 32-bit floating point format: precise but large.
Quantization compresses those weights to fewer bits. Q4 means 4-bit quantization: each weight uses 4 bits instead of 16, cutting memory usage by roughly 75%. According to community testing (https://www.promptlayer.com/models/microsoftphi-4-mini-instruct-gguf/), Q4_K_M quantization retains around 90–95% of the original model's output quality while requiring only a fraction of the memory. GGUF is the file format that packages these quantized models for use with llama.cpp (https://github.com/ggerganov/llama.cpp), the most widely used local inference engine. If you see a model listed as "X GB (Q4)," that is the approximate RAM you need to load the quantized version.

1. Qwen3.5-4B (Alibaba)

If there is one model on this list that covers the most ground, it is Qwen3.5-4B. Released by Alibaba in March 2026, it sits at the center of the Qwen3.5 small series, a lineup that goes from 0.8B all the way to 9B, all sharing the same architecture and all carrying an Apache 2.0 license, which means you can use them in commercial products without worrying about usage restrictions.

The headline number is the context window. According to the official model card (https://huggingface.co/Qwen/Qwen3.5-4B), Qwen3.5-4B supports a native context length of 262,144 tokens, extensible to over one million. For a 4B model, that is extraordinary. Most models this size cap out at 128K.

The model operates in thinking mode by default, generating a reasoning chain before it responds. You can turn this off for faster, direct answers when you do not need the depth.

Best for: General-purpose tasks across languages, instruction following, long-document processing, and any application where multimodal input might come up down the line.
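Before loading any model on this list, it helps to check whether its weights will fit in memory. The quantization arithmetic described earlier (parameter count times bits per weight) can be sketched as follows. This is a back-of-the-envelope estimate only: `estimate_weight_memory_gb` and `savings_vs_fp16` are hypothetical helpers, they count weights alone, and real usage is higher once the KV cache, activations, and runtime overhead are included. Mixed-precision formats such as Q4_K_M also average somewhat more than 4 bits per weight, so actual GGUF files tend to be larger than the pure 4-bit figure.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def savings_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to 16-bit storage."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    # Compare common precisions for a 4-billion-parameter model
    for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
        gb = estimate_weight_memory_gb(4.0, bits)
        print(f"4B model at {label:<9}: ~{gb:.1f} GB of weights")
    print(f"4-bit saves {savings_vs_fp16(4):.0%} versus 16-bit")
```

Leaving a few gigabytes of headroom on top of the estimate is a sensible default, especially with long prompts, since the KV cache grows with context length.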
Code: Load and run inference

```python
# Install: pip install transformers torch accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

# Specify the model ID from Hugging Face Hub
model_id = "Qwen/Qwen3.5-4B"

# Load the tokenizer -- handles text encoding and chat formatting
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model; torch_dtype="auto" uses the precision stored in the checkpoint,
# device_map="auto" places layers across available hardware automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Build the conversation as a list of message dicts
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between supervised and unsupervised learning in simple terms."},
]

# Apply the model's built-in chat template to format the messages correctly
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    # Setting enable_thinking=False skips the reasoning chain for faster output.
    # Remove this line if you want the model to reason step by step before answering.
    enable_thinking=False,
)

# Tokenize and move inputs to the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the response -- max_new_tokens caps output length
generated_ids = model.generate(**model_inputs, max_new_tokens=512)

# Decode only the newly generated tokens (not the input prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```

What this code does: It loads the model and tokenizer from Hugging Face, formats a conversation using the model's built-in chat template, generates a response, and decodes only the new tokens so you do not get the prompt repeated back at you. The enable_thinking=False flag puts the model in direct response mode; remove it if you want it to reason through the problem first.

2. Microsoft Phi-4-mini-instruct (3.8B)

Phi-4-mini is Microsoft's bet that the right training data beats raw scale. At 3.8B parameters trained on 5 trillion tokens (https://huggingface.co/unsloth/Phi-4-mini-instruct-GGUF) of carefully filtered and synthetic data, it posts an ARC-C score of 83.7%, the highest of any model under 10 billion parameters (https://awesomeagents.ai/leaderboards/small-language-model-leaderboard/) on that benchmark. Its GSM8K score of 88.6% and SimpleQA factual accuracy of 91.1% sit comfortably alongside models that are two to three times its size.

The Q4_K_M GGUF file comes in at 2.49 GB (https://www.promptlayer.com/models/microsoftphi-4-mini-instruct-gguf/), which means it runs on machines with as little as 4 GB of RAM. For anyone wanting capable AI on a mid-range laptop without GPU requirements, Phi-4-mini is probably the most practical option on this list.

What it gives up is multilingual depth and multimodal input. It was trained primarily on English text, so it will underperform on non-English tasks. If your use case is English-language reasoning, knowledge retrieval, or structured tasks, that trade-off is fine.

Best for: Reasoning-heavy tasks, knowledge-intensive Q&A, and anyone running on tight hardware with an English-language workload.
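Scores like the GSM8K figure above are exact-match accuracy: the final number in the model's answer either matches the reference or it does not. The sketch below shows one way such a score could be computed. It assumes reference solutions end with a `#### <number>` line, as in the GSM8K dataset, and treats the last number in the model's output as its answer. The helper names and sample data are hypothetical, and real evaluation harnesses apply stricter prompting and parsing rules, so do not expect this to reproduce published numbers.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators (e.g. 1,250)
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_answer(solution: str) -> float:
    """Extract the final answer that follows the '####' marker."""
    return float(solution.split("####")[-1].strip().replace(",", ""))


def accuracy(outputs: list[str], solutions: list[str]) -> float:
    """Percentage of outputs whose final number equals the reference answer."""
    correct = sum(
        last_number(out) == reference_answer(sol)
        for out, sol in zip(outputs, solutions)
    )
    return 100.0 * correct / len(solutions)


if __name__ == "__main__":
    # Made-up model outputs and references, for illustration only
    outputs = [
        "Each box holds 6 eggs, so 3 boxes hold 3 * 6 = 18 eggs.",
        "The total cost is $1,250.",
        "I think the answer is 41.",
    ]
    solutions = [
        "3 * 6 = 18\n#### 18",
        "5 * 250 = 1250\n#### 1250",
        "6 * 7 = 42\n#### 42",
    ]
    print(f"Accuracy: {accuracy(outputs, solutions):.1f}%")
```

Taking the last number is a deliberately crude heuristic: a model that reasons correctly but ends with an unrelated figure would be marked wrong, which is one reason reported scores vary between evaluation setups.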
Code: Basic inference call with transformers

```python
# Install: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-4-mini-instruct"

# Load the tokenizer for Phi-4-mini
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model in bfloat16 for memory efficiency on GPU
# Use torch_dtype=torch.float32 if running on CPU only
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Phi-4-mini uses a system/user/assistant chat format
messages = [
    {"role": "system", "content": "You are a helpful assistant focused on clear, accurate answers."},
    {"role": "user", "content": "What is the difference between a list and a tuple in Python?"},
]

# Apply the model's chat template -- Phi-4-mini expects this specific formatting
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate the response
outputs = model.generate(
    inputs,
    max_new_tokens=300,  # Keep responses focused
    temperature=0.7,     # Slight randomness for natural output
    do_sample=True,      # Required for temperature to take effect
)

# Decode and print only the generated portion
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```

What this code does: Loads Phi-4-mini in bfloat16 format (roughly half the memory of float32), formats the conversation using the model's built-in chat template, and prints only the new response by slicing off the input tokens. The temperature=0.7 setting keeps outputs natural without being too unpredictable.

3. Google Gemma 3 4B IT

Gemma 3 4B IT is the model that surprises people once they actually run it. On code and math, it punches well above what you would expect from 4 billion parameters.
A 71.3% on HumanEval (https://awesomeagents.ai/leaderboards/small-language-model-leaderboard/) is competitive with models twice its size, and 89.2% on GSM8K math reasoning puts it in genuinely strong territory for grade-level and early undergraduate math problems.

It supports multimodal input (text and images) and comes with a 128K context window, long enough to feed it a full paper or a sizable codebase for analysis. The IT in the name stands for Instruction Tuned, which just means this is the version fine-tuned to follow instructions in conversation rather than the raw pre-trained base.

Best for: Code generation, math-heavy tasks, and projects where you want multimodal input without going above 4B parameters.

```python
# Install: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-3-4b-it"

# Load tokenizer -- handles Gemma's specific chat format
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model; bfloat16 cuts memory roughly in half vs float32
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Gemma uses a role-based chat template -- always pass messages this way
messages = [
    {"role": "user", "content": "Write a Python function that checks if a string is a palindrome."},
]

# Tokenize using the model's built-in chat template
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

# Run generation
with torch.no_grad():  # Disables gradient tracking during inference
    outputs = model.generate(
        inputs,
        max_new_tokens=400,
        do_sample=True,
        temperature=0.7,
    )

# Strip the input tokens and decode just the response
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```

What this code does: Loads Gemma 3 4B IT, wraps a coding prompt in the expected chat format, and generates a response.
The torch.no_grad() context manager tells PyTorch not to track gradients during inference, which saves memory and speeds things up. In transformers, generate() already disables gradient tracking internally, so the wrapper is redundant here but harmless; it matters most when you call the model's forward pass directly.

4. Google Gemma 3n E4B (The Mobile One)

Gemma 3n E4B is a different kind of model. Google built it specifically for on-device deployment (phones, edge hardware, local apps), and the architecture reflects that priority in ways that other models on this list do not.

The key innovation is MatFormer, a nested transformer architecture that embeds a smaller model (E2B) inside the larger one (E4B). The E4B has 8 billion raw parameters but only needs 3 GB of memory to run, because Per-Layer Embeddings (PLE) keep a large portion of the weights on CPU while only the core transformer layers sit in accelerator memory. The net result: you get 4B-class performance at 4B-class memory requirements, but the underlying model has twice the capacity.

Best for: On-device and mobile deployment, multimodal apps (text + image + audio in one model), and any scenario where memory efficiency is the top priority.

5. Meta Llama 3.2 3B Instruct

Llama 3.2 3B Instruct does not have the flashiest benchmark numbers on this list, but it has something most of the others do not: a massive, active community behind it. With over 2.18 million downloads on Hugging Face (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), it is the most widely deployed small model here, which means more fine-tunes, more integrations, more community tooling, and more real-world testing than most alternatives.

At just 2 GB in Q4 quantization, it is also the lightest fully capable model on this list. It handles tool calling and structured outputs cleanly (Meta built it with agentic use cases in mind), making it a natural fit for pipelines where the model needs to call external APIs or produce JSON that another system consumes.

Best for: Tool calling, structured output pipelines, mobile apps, and any project that benefits from broad community support.
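To make the "JSON that another system consumes" idea concrete, here is a minimal sketch of the receiving side of a tool-calling pipeline: it parses the model's output, validates it, and dispatches to a Python function. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch_tool_call` helper are all assumptions made for illustration; the exact format Llama 3.2 emits depends on the chat template and prompt you use.

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> str:
    """Hypothetical tool: a real one would call an external weather API."""
    return f"Sunny in {city}"


# Registry of tools the model is allowed to call
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> Any:
    """Parse a JSON tool call and run the matching registered tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc

    if not isinstance(call, dict):
        raise ValueError("Tool call must be a JSON object")

    name = call.get("name")
    arguments = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be a JSON object")

    return TOOLS[name](**arguments)


if __name__ == "__main__":
    # Stand-in for text generated by the model
    raw = '{"name": "get_weather", "arguments": {"city": "Lagos"}}'
    print(dispatch_tool_call(raw))
```

Validating before dispatching is the important part: small models occasionally emit malformed JSON or invent tool names, and failing loudly with a ValueError gives the pipeline a chance to retry or fall back.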
```python
# Install: pip install transformers torch
# Note: You need to accept the Llama 3.2 license on Hugging Face before downloading
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Load tokenizer -- Llama 3.2 uses its own special chat tokens
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in bfloat16 to halve memory versus float32 (roughly 6 GB of weights for
# a 3B model; the ~2 GB figure applies to the Q4-quantized version)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Define the conversation -- system prompt sets the model's behavior
messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise and accurate."},
    {"role": "user", "content": "Summarize the key differences between REST and GraphQL APIs."},
]

# Apply chat template -- critical for Llama models, controls special tokens
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate the response
with torch.no_grad():
    output = model.generate(
        inputs,
        max_new_tokens=300,
        temperature=0.6,  # Lower temp = more focused, more deterministic output
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # Prevents padding warnings
    )

# Decode only the model's response (not the input)
response = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```

What this code does: The key thing to note here is pad_token_id=tokenizer.eos_token_id. Llama models often produce a warning during generation because the tokenizer does not define a separate pad token. Setting it to the end-of-sequence token suppresses that warning cleanly without changing output quality.

6. HuggingFaceTB SmolLM3-3B

SmolLM3 is Hugging Face's own model, and what sets it apart is transparency. The weights are open. The training data mixture is publicly documented (https://huggingface.co/HuggingFaceTB/SmolLM3-3B). The training config is published. The evaluation code is shared.
For researchers, educators, or teams building on top of models and needing to understand exactly what they are working with, that openness is rare.

The model is built on a three-stage pretraining curriculum spanning 11.2 trillion tokens: the first stage is dominated by general web text, the second introduces higher-quality math and code data, and the third focuses on reasoning. This staged approach loosely mirrors how human education works, and according to the SmolLM3 blog post (https://huggingface.co/blog/smollm3), it produces a model that places first or second on knowledge and reasoning benchmarks within the 3B class, including HellaSwag and ARC. When reasoning mode is enabled, AIME 2025 performance jumps from 9.3% to 36.7%.

It also supports tool calling out of the box, handles 6 European languages natively, and extends to 128K context via YaRN. The modeling code requires transformers v4.53.0 or later.

Best for: Research, reproducible experiments, open-source projects where transparency matters, and European multilingual deployments.
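Reasoning mode is switched at the prompt level rather than by loading a different checkpoint. The SmolLM3 model card describes `/think` and `/no_think` flags placed in the system prompt for this; treat the exact flag syntax, and the way it is combined with a custom system prompt here, as assumptions to verify against the card. The hypothetical `build_messages` helper only constructs the message list, which you would then pass to `tokenizer.apply_chat_template` in place of a plain message list.

```python
def build_messages(prompt: str, thinking: bool = True, system_prompt: str = "") -> list[dict[str, str]]:
    """Build a chat message list with a reasoning-mode flag in the system prompt.

    Assumption: the model's chat template reads "/think" or "/no_think"
    from the system message to turn extended reasoning on or off.
    """
    flag = "/think" if thinking else "/no_think"
    system_content = f"{flag} {system_prompt}".strip()
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": prompt},
    ]


if __name__ == "__main__":
    # Fast, direct answer without a reasoning trace
    print(build_messages("What is 17 * 23?", thinking=False))
    # Step-by-step reasoning, plus an ordinary system instruction
    print(build_messages("What is 17 * 23?", thinking=True, system_prompt="Be brief."))
```

Keeping the toggle in one helper makes it easy to compare the two modes on the same prompts, which is a quick way to see where the extra latency of reasoning mode actually pays off.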
```python
# Install: pip install "transformers>=4.53.0" torch accelerate
# SmolLM3 requires transformers v4.53.0+ -- older versions will fail
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"

# Use "cuda" for GPU or "cpu" for CPU-only inference
device = "cuda"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the model -- for multi-GPU setups, use device_map="auto" instead
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Build and apply the chat template
messages = [
    {"role": "user", "content": "Explain the concept of attention in transformer models."},
]

# SmolLM3 uses a standard chat template -- apply it before tokenizing
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(device)

# Generate the response
outputs = model.generate(
    inputs,
    max_new_tokens=400,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```

What this code does: Straightforward load and generate. The one thing to watch here is the transformers version: SmolLM3's architecture requires v4.53.0 or higher. Running an older version will throw an error, not produce bad output, so it is easy to catch.

7. DeepSeek-R1-Distill-Qwen-1.5B

Most 1.5B models are roughly good for autocomplete, simple chat, and not much else. DeepSeek-R1-Distill-Qwen-1.5B is a notable exception. It was trained on outputs from DeepSeek-R1, a much larger frontier reasoning model, meaning it learned to reason by watching a far more capable teacher. The result is a 1.5B model that can produce multi-step reasoning chains on math and logic problems where other models its size give up and guess.

At around 1 GB in Q4 quantization, it is the smallest model on this list with genuine reasoning capability.
It fits on almost any hardware: a Raspberry Pi with enough RAM, an old laptop, embedded devices. That footprint combined with the reasoning behavior makes it useful for any scenario where you need lightweight inference on structured problems and cannot afford a larger model.

The trade-off: it is not a general-purpose chatbot. Its strengths are math, logic, and reasoning. For creative tasks or open-ended conversation, it will underperform relative to its size class.

Best for: Edge devices, embedded systems, lightweight reasoning pipelines, and any project where a roughly 1 GB model size is a hard requirement.

8. Qwen3-0.6B

Qwen3-0.6B sits at the edge of what is currently worth calling a language model. At 600 million parameters, it runs on hardware that most people would not even consider using for AI, and it still manages to do useful things. The 19.1 million downloads (https://huggingface.co/Qwen/Qwen3-0.6B) on Hugging Face tell you that a lot of people have found a real purpose for it.

It carries the same dual-mode architecture as the rest of the Qwen3 family: thinking mode for problems that need reasoning, non-thinking mode for fast direct responses. Over 100 languages are supported. For tasks like text classification, short-form autocomplete, basic summarization, or lightweight on-device features in mobile apps, it is genuinely capable relative to its size.

Do not expect it to write complex code, handle multi-step reasoning across long inputs, or compete with 3B+ models on benchmarks. That is not what it was made for. It was made to run anywhere, and it does.

Best for: Autocomplete, text classification, simple on-device features, ultra-constrained hardware, and rapid prototyping where a larger model is overkill.

Conclusion

The story this article keeps coming back to is simple: small no longer means limited. A 3.8B model is hitting benchmark numbers that looked like 30B territory a year ago.
A model running in 2 GB of RAM is handling reasoning tasks that used to require enterprise infrastructure. That is not marketing; it is what the benchmark data actually shows, and it is reproducible on hardware most people already have.

The practical implication is that the decision to reach for a frontier API as a default is worth questioning for a growing range of tasks. If your workload is English-language reasoning, code generation, or structured outputs, Phi-4-mini (https://huggingface.co/microsoft/Phi-4-mini-instruct) or Gemma 3 4B IT (https://huggingface.co/google/gemma-3-4b-it) will cover most of it on a laptop. If you are building something multilingual, Qwen3.5-4B (https://huggingface.co/Qwen/Qwen3.5-4B) is a commercial-friendly Apache 2.0 model with a 262K context window and native image understanding. If you are targeting mobile or edge hardware, Gemma 3n E4B (https://huggingface.co/google/gemma-3n-E4B-it) was purpose-built for exactly that, and nothing on this list touches it in that category. And if you want to know exactly what you are shipping (every data source, every training decision), SmolLM3-3B (https://huggingface.co/HuggingFaceTB/SmolLM3-3B) is the only fully transparent option in this class.

Shittu Olumide (https://www.linkedin.com/in/olumide-shittu/) is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts.