# Foundation vs. Instruct vs. Chat Models: One Question, Three Answers

> Source: <https://dev.to/vishalmysore/foundation-vs-instruct-vs-chat-models-one-question-three-answers-3gi>
> Published: 2026-06-16 23:08:32+00:00

*A hands-on tutorial you can run for free in Google Colab.*

Run it yourself: open `foundation_instruct_chat_tutorial.ipynb` in Google Colab and run every cell top to bottom. It uses the SmolLM2-135M family, which is small enough for a free CPU runtime, no GPU needed.

People say "LLM," "GPT," "an AI model," and "ChatGPT" as if they were the same thing. They aren't. There's a ladder of training stages between "a model that read the internet" and "an assistant you can chat with," and the words **foundation**, **instruct**, and **chat** mark the rungs.

The cleanest way to feel the difference is to do something deliberately unfair: ask the **exact same question** to three versions of the **same model family** and watch how differently they behave. Our question is deliberately boring so the *behavior* stands out:

"What is the capital of France?"

We use three checkpoints from Hugging Face's SmolLM2 family:

| Model type | Hugging Face ID | One-line summary |
|---|---|---|
| Foundation (base) | `HuggingFaceTB/SmolLM2-135M` | Predicts the next token. Knows things, isn't helpful. |
| Instruct | `HuggingFaceTB/SmolLM2-135M-Instruct` | Fine-tuned to follow a single instruction. |
| Chat | `HuggingFaceTB/SmolLM2-135M-Instruct` (used conversationally) | Same weights, driven through a multi-turn message list. |

Notice that the chat row reuses the instruct checkpoint. That's not a shortcut — it's the honest reality, and we'll come back to why.

A **foundation model** (also called a *base* or *pretrained* model) is trained on exactly one objective: given a stretch of text, **predict the next token**. Nothing else. It reads a huge slice of the internet and gets very good at continuing text in a statistically plausible way.

What it is *never* taught is that a question deserves an answer. So when you feed it:

```
What is the capital of France?
```

it doesn't think *"I should answer that."* It thinks *"On the internet, what usually **comes after** a line like this?"* And the answer is often… **more quiz questions**, a worksheet, or a tangent:

```
What is the capital of France? What is the capital of Germany? What is the
capital of Italy? ...
```
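To make the "what usually comes after this?" behavior concrete, here is a minimal sketch using a toy bigram counter in place of a real language model. It is far simpler than a transformer, and the names `corpus`, `train_bigrams`, and `complete` are illustrative assumptions rather than anything from the notebook. The point it shares with a base model is the objective: pick a likely next token, with no notion of answering.

```python
from collections import Counter, defaultdict

# Toy "internet": quiz-style text, where questions tend to follow questions.
corpus = (
    "What is the capital of France? What is the capital of Germany? "
    "What is the capital of Italy? What is the capital of Spain?"
)

def train_bigrams(text):
    """Count which word follows which word in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def complete(follows, prompt, max_new_words=5):
    """Greedy decoding: always append the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

follows = train_bigrams(corpus)
completion = complete(follows, "What is the capital of France?")
print(completion)
```

In this toy corpus the question is only ever followed by another question, so the completion starts a new quiz line instead of saying "Paris".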

In the notebook we pass the raw string straight into the pipeline with no formatting:

```
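from transformers import pipeline

test_query = "What is the capital of France?"  # same question for every model
base_pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M")
base_raw_out = base_pipe(test_query, max_new_tokens=30, do_sample=False)
print(base_raw_out[0]['generated_text'])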
```

**Takeaway:** a foundation model is a **text completer**, not an assistant. It contains enormous knowledge but has no concept of being *helpful*. It's the raw clay everything else is shaped from.

An **instruct model** starts from that same base model and goes through a second stage of training — **fine-tuning on (instruction → response) pairs**. Thousands to millions of examples of the shape *"Here's a request. Here's a good response."* This teaches the model a new contract: **when the user asks for something, actually do it and then stop.**

But there's a crucial detail people miss: an instruct model only behaves correctly when you wrap your text in the **exact special format it was trained on.** That format uses control tokens — for SmolLM2 they look like this:

```
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
```

You don't type those tokens by hand. Every instruct model ships with a **chat template** baked into its tokenizer that builds them for you:

```
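from transformers import AutoTokenizer

instruct_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
test_query = "What is the capital of France?"
tokenizer = AutoTokenizer.from_pretrained(instruct_id)
formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": test_query}],
    tokenize=False,              # return a string, not token IDs
    add_generation_prompt=True,  # appends the 'assistant' cue
)
print(formatted_prompt)  # shows the control tokens the model receives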
```

Feed *that* to the same-sized model and you get a clean, direct answer:

```
The capital of France is Paris.
```

The notebook prints the formatted prompt **before** generating, so you can literally see the hidden scaffolding the model receives. That "aha" — *oh, there's a whole structure under the hood* — is the most important thing in the tutorial.
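To show what that scaffolding amounts to, here is a minimal pure-Python sketch of a ChatML-style formatter. The helper `to_chatml` is hypothetical and only approximates what a chat template does. The real template lives in the tokenizer and may add details this sketch omits, such as a default system message, so in practice you should rely on `apply_chat_template` rather than hand-rolled strings.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render role-tagged messages into a ChatML-style prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it is its turn to write.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```

The output ends with an open assistant turn. That unfinished turn is the cue that makes "continue the text" and "answer the question" the same task for the model.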

**Takeaway:** an instruct model = a base model **+ instruction tuning + a required prompt format**. Skip the format and even a well-trained instruct model can fall back to rambling.

Here's the part that surprises people: a **chat model is usually the same weights as the instruct model.** The difference isn't *what* the model is — it's *how you drive it.*

Instead of one instruction in, one response out, you maintain a **running list of role-tagged messages**:

```
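from transformers import pipeline

# Same instruct checkpoint; what changes is that we pass a message list.
chat_pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")
chat_history = [
    {"role": "user", "content": "What is the capital of France?"},
]
chat_out = chat_pipe(chat_history, max_new_tokens=30)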
```

The pipeline applies the chat template for you and returns the **whole conversation** with the assistant's reply appended. For a single turn, that looks identical to the instruct example. The magic only appears when the conversation **continues**.

So in the notebook we append the reply and ask a deliberately vague follow-up:

```
conversation = chat_out[0]['generated_text']        # user + assistant so far
conversation.append({"role": "user",
                     "content": "And what is a famous landmark there?"})
follow_up = chat_pipe(conversation, max_new_tokens=40)
```

The word **"there"** is meaningless on its own. But because we passed the *entire history*, the model resolves "there" → **Paris** and names a landmark. That carried-over context is what turns a one-shot Q&A into something that feels like a conversation.
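The bookkeeping behind that carried-over context can be sketched without any model at all. In the snippet below, `chat_turn` and `stub_generate` are hypothetical names. The stub is not a language model; it only reports whether the word "Paris" appears anywhere in the messages it was handed. That is enough to show that "there" is resolvable only when the history travels with the new turn.

```python
def chat_turn(history, user_text, generate):
    """Append a user message, get a reply from `generate`, and record it."""
    history.append({"role": "user", "content": user_text})
    reply = generate(history)  # the model sees every earlier turn
    history.append({"role": "assistant", "content": reply})
    return reply

def stub_generate(history):
    """Hypothetical stand-in for a real model: reports what context it can see."""
    seen = " ".join(m["content"] for m in history)
    return "context includes Paris" if "Paris" in seen else "no context about Paris"

history = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
follow_up = "And what is a famous landmark there?"
with_history = chat_turn(history, follow_up, stub_generate)
without_history = chat_turn([], follow_up, stub_generate)
print(with_history)
print(without_history)
```

The same follow-up question gives two different results, and the only thing that changed is the list passed in. The model itself holds no memory between calls.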

**Takeaway:** a chat model is an instruct model **driven through a multi-turn message list**, so each new turn can use the previous turns as context. The system prompt, the `user`/`assistant` roles, and the growing history are the "chat" part.

| Model | Trained to… | You give it… | Reply to "What is the capital of France?" |
|---|---|---|---|
| Foundation | continue text | a raw string | echoes / continues the document; may never answer |
| Instruct | follow one instruction | a chat-templated string | a direct answer: "The capital of France is Paris." |
| Chat | converse over many turns | a list of messages | a direct answer + remembers context for follow-ups |

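Read top to bottom, it's a progression, not three unrelated things:

1. **Foundation:** pretrained to predict the next token.
2. **Instruct:** the foundation model, fine-tuned on (instruction → response) pairs.
3. **Chat:** the instruct model, driven through a multi-turn message list.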

When you talk to a commercial assistant, you're using stage 3, sitting on stage 2, built on stage 1.

SmolLM2-135M is **tiny** — about 135 million parameters, versus the tens or hundreds of *billions* in frontier models. At this size the model will sometimes get a fact wrong, repeat itself, or trail off. **That's expected, and it's not the point.** The tutorial is designed to make the *behavioral* gap between the three modes visible on a free laptop or Colab CPU — not to win a trivia contest. The exact same three-stage structure scales all the way up to the largest models in production.
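A quick back-of-the-envelope calculation shows why a model this size fits on a free CPU runtime. This is a rough sketch: it takes the parameter count from the model name and counts only the bytes needed to store the weights, ignoring activations, caches, and framework overhead.

```python
PARAMS = 135_000_000  # approximate parameter count, taken from the model name

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_megabytes(num_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000

sizes = {dtype: weight_megabytes(PARAMS, dtype) for dtype in BYTES_PER_PARAM}
for dtype, mb in sizes.items():
    print(f"{dtype}: ~{mb:.0f} MB")
```

By the same arithmetic, a model with tens of billions of parameters needs tens of gigabytes for its weights alone, which is why frontier models need dedicated hardware.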

To try it yourself:

1. Open `foundation_instruct_chat_tutorial.ipynb` in Colab (`File → Open notebook → Upload`, or push it to GitHub and use the Colab badge).
2. Run every cell (`Runtime → Run all`). The first run downloads the models, so give it a minute.
3. Change `test_query` to something open-ended like `"Write a haiku about the sea."` and watch how the three modes diverge even more.
4. Set `do_sample=True` with `temperature=0.7` for more varied, creative output.
5. Swap in `HuggingFaceTB/SmolLM2-360M-Instruct` and feel the quality jump.

Once you've *seen* the three behaviors with your own eyes, the vocabulary (base, instruct, chat, chat template, system prompt) stops being jargon and starts being obvious.

*Happy experimenting!* 🚀
