# AI Metrics Decoded: From Parameters to TOPS

> Source: <https://dev.to/sreeraj_sreenivasan_2b932/ai-metrics-decoded-from-parameters-to-tops-58k6>
> Published: 2026-05-26 05:47:25+00:00

Picture this: your team picks a 70B parameter model for a new feature. It runs great on your MacBook. You push to production. The GPU bill arrives. Your manager is not happy.

Or this: your AI API costs explode halfway through the month and nobody knows why.

These are not horror stories. They happen to real engineers — usually the ones who skipped learning the core units of measurement behind AI systems.

As a junior engineer, you're going to face questions like:

Understanding the seven core metrics below gives you the language — and the instincts — to answer confidently.

Let's break them down.

**What it is:** The learned weights inside a neural network. Think of them as the "memory" of the model — numbers that get adjusted during training to capture patterns in data.

**The unit:** Just a raw count. We usually express it in millions (M), billions (B), or trillions (T).

**Why it matters to you:**

| Parameter Count | Approx. VRAM Needed (fp16) | Typical Use Case |
|---|---|---|
| 1B–3B | ~2–6 GB | Mobile / edge apps |
| 7B–8B | ~16 GB | Single consumer GPU |
| 13B–14B | ~28 GB | Single pro GPU (A100 40GB) |
| 70B | ~140 GB | Multi-GPU setup |
| 405B+ | ~800 GB+ | Cluster of H100s |

Rule of thumb: 1 billion parameters ≈ 2 GB of VRAM in half-precision (fp16). Double it for full precision (fp32).
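To turn that rule of thumb into something you can run, here is a minimal sketch. The function name and the `int8` entry are illustrative additions rather than part of any library, and the result covers the weights only: the KV cache and activations need extra memory on top.

```python
# Bytes needed to store one parameter at each precision.
# fp16 and fp32 follow the rule of thumb above; int8 (1 byte) is an added assumption.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for size in (8e9, 70e9):
    gb = estimate_weight_memory_gb(size)
    print(f"{size / 1e9:.0f}B params in fp16: ~{gb:.0f} GB")
```

Treat the result as a floor, not a budget: if the estimate already fills your GPU, the model will likely not fit once a real workload is running.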

More parameters generally means a more capable model *and* a more expensive one to run.

**What it is:** The unit of text that a model reads and generates. Not words — fragments.

**Quick visual:**

```
Input text:  "Learning AI is fun!"
             ↓ Tokenizer
Tokens:      ["Learn"] ["ing"] [" AI"] [" is"] [" fun"] ["!"]
Token count: 6 tokens
```

**Why it matters to you:**

```
# Quick check: how many tokens is your prompt?
# Using tiktoken (OpenAI's tokenizer, also used by many OSS models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Learning AI is fun!"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")   # a handful of tokens; the exact count depends on the encoding
print(f"Tokens: {tokens}")             # the integer token IDs, one per token
```

Quick cheat sheet:

- 1 token ≈ 0.75 English words
- 1,000 tokens ≈ 750 words ≈ 1.5 pages
- Non-English text (Hindi, Mandarin, Arabic) uses 30–70% more tokens for the same content
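When you don't have a tokenizer handy, the cheat sheet can be applied directly. This is a rough sketch that assumes English text and the 0.75 words-per-token ratio above; `estimate_tokens` is a hypothetical helper, and a real tokenizer will give different numbers.

```python
import math

WORDS_PER_TOKEN = 0.75  # cheat-sheet ratio for English; varies by tokenizer and language


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count, rounding up."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("Learning AI is fun!"))  # 4 words / 0.75, rounded up → 6
```

Use it for capacity planning and budgeting, not for billing: for anything that has to match your invoice, count with the provider's own tokenizer.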

This is where a lot of junior engineers get confused. FLOPS and TOPS *sound* similar. They are not the same thing.

**What it is:** A measure of raw compute power for **floating point arithmetic** — the kind of math needed for training and running neural networks.

**The scale:**

| Unit | Value | Context |
|---|---|---|
| GFLOPS | 10⁹ FLOPS | Your laptop GPU |
| TFLOPS | 10¹² FLOPS | Cloud GPUs (A100: ~312 TFLOPS) |
| PFLOPS | 10¹⁵ FLOPS | Entire GPU clusters |

**Used for:** Server-scale training and inference. When someone says *"the H100 delivers 989 TFLOPS of FP16 performance"*, this is what they mean.

**Common GPUs you'll actually use:**

| GPU | FP16 TFLOPS | Best For |
|---|---|---|
| RTX 4090 | ~165 | Local dev / fine-tuning |
| A100 40GB | ~312 | Production inference |
| H100 SXM | ~989 | Large-scale training |

**What it is:** Similar idea, but used for **integer or mixed-precision operations** on **edge hardware and NPUs (Neural Processing Units)**.

**The key difference:**

```
FLOPS  →  Floating point math  →  GPUs / server chips  →  Training & inference at scale
TOPS   →  Integer / INT8 math  →  NPUs / edge chips    →  On-device inference
```

**Real-world examples:**

| Device | TOPS | Use Case |
|---|---|---|
| Apple M4 Neural Engine | ~38 TOPS | On-device ML on MacBook |
| Qualcomm Snapdragon X Elite | ~45 TOPS | AI PCs / laptops |
| NVIDIA Jetson Orin | ~275 TOPS | Edge AI / robotics |
| Google TPU v5e | ~393 TOPS | Cloud inference at scale |

When do you care about TOPS? When you're deploying a model to a phone, a laptop, or an embedded device, not a data centre. If you're picking a chip for on-device inference, TOPS is your number.

Yes, confusingly, **FLOPs** (with a lowercase "s", no "per second") is a *different* metric from FLOPS.

**What it is:** The **total number of floating point operations** performed during an entire training run. It's a measure of compute budget, not hardware speed.

**The unit:** A raw count of operations, usually written in scientific notation (for example 10²³) or as PetaFLOPs (10¹⁵ operations).

**Real-world examples:**

| Model | Estimated Training FLOPs |
|---|---|
| GPT-3 (175B) | ~3.14 × 10²³ |
| LLaMA 2 70B | ~8.4 × 10²³ |
| Gemini Ultra | ~5 × 10²⁴ (estimated) |
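If you want to sanity-check numbers like these, a widely used approximation for dense transformers is training FLOPs ≈ 6 × parameters × training tokens. That rule is not from this article, and the 300 billion training tokens used for GPT-3 below is taken from its paper, so treat this as a sketch under those assumptions.

```python
def estimate_training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training compute as 6 * N * D (dense transformer rule of thumb)."""
    return 6 * num_params * num_tokens


# GPT-3: 175B parameters, ~300B training tokens (assumption, see lead-in)
gpt3_flops = estimate_training_flops(175e9, 300e9)
print(f"GPT-3: ~{gpt3_flops:.2e} FLOPs")  # about 3.15e+23, close to the table's figure
```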

**Why it matters to you:** Directly as a junior engineer, probably not yet. But understanding it helps you reason about:

Quick analogy: FLOPS (the hardware rate) is your car's horsepower. FLOPs (training cost) is the total miles driven on a road trip. One is speed, one is distance.
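The analogy maps onto a simple division: time = distance / speed. The sketch below assumes a hypothetical cluster of 1,024 A100s sustaining 40% of their peak rate. Both numbers are made up for illustration, and real utilization varies a lot.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float, peak_flops_per_gpu: float,
                  num_gpus: int, utilization: float = 0.4) -> float:
    """Days needed to spend `total_flops` on a cluster that sustains
    `utilization` (a fraction between 0 and 1) of its peak FLOPS."""
    sustained_flops = peak_flops_per_gpu * num_gpus * utilization
    return total_flops / sustained_flops / SECONDS_PER_DAY


# GPT-3-scale budget (~3.14e23 FLOPs) on 1,024 A100s (~312 TFLOPS each)
days = training_days(3.14e23, 312e12, num_gpus=1024)
print(f"~{days:.1f} days")  # roughly four weeks under these assumptions
```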

These three are the metrics you'll track the most in production. They live in your dashboards, your SLAs, and your post-mortems.

**What it is:** How long (in milliseconds) from sending your request to receiving the **first token** of the response.

**Why it matters:** This is what determines if your app *feels* fast. Even if the full response takes 10 seconds, a 200ms TTFT makes the experience feel responsive. It's the AI equivalent of "First Contentful Paint" in web dev.

```
User sends prompt
        ↓
  [ ... processing ... ]   ← this duration is TTFT
        ↓
First token arrives → streaming begins → user sees output
```

**Good TTFT benchmarks:**

| Scenario | Target TTFT |
|---|---|
| Real-time chat | < 300ms |
| Interactive coding assistant | < 500ms |
| Background document processing | < 2,000ms (acceptable) |
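Measuring TTFT yourself is mostly a matter of timing the first item of a streamed response. Here is a minimal sketch: `fake_stream` is a stand-in that simulates a model with `time.sleep`, so swap in whatever token iterator your client library actually returns.

```python
import time
from typing import Iterable, Iterator


def measure_ttft(stream: Iterable[str]) -> float:
    """Return the seconds elapsed until the stream yields its first token."""
    start = time.perf_counter()
    for _token in stream:
        return time.perf_counter() - start
    raise ValueError("stream ended without producing a token")


def fake_stream(first_delay: float = 0.05, n_tokens: int = 5) -> Iterator[str]:
    """Simulated model: waits `first_delay` seconds, then yields tokens."""
    time.sleep(first_delay)  # stands in for queueing + prompt processing
    for i in range(n_tokens):
        yield f"tok{i}"


ttft = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Start the timer right before the request is sent, not after the connection is open, so that network and queueing time are included in what you measure.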

**What it is:** How many tokens the model generates per second during the response. Also called **generation speed** or **throughput**.

**Why it matters:** TPS determines whether your streaming response feels smooth or painfully slow.
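TTFT and TPS combine into a rough estimate of how long a user waits for a full response: total time ≈ TTFT + output tokens / TPS. The helper below is a back-of-the-envelope sketch with a hypothetical name, not a benchmark.

```python
def estimate_response_seconds(ttft_ms: float, output_tokens: int, tps: float) -> float:
    """Approximate end-to-end time: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + output_tokens / tps


# 200 ms TTFT, 500-token answer
print(estimate_response_seconds(200, 500, tps=50))   # about 10.2 seconds
print(estimate_response_seconds(200, 500, tps=20))   # about 25.2 seconds
```

Same TTFT, very different experience: for long answers the generation speed dominates the total wait.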

**What affects TPS:**

**What it is:** Your **rate limit** from the API provider. The maximum number of tokens your account can process per minute.

**Why it matters:** Hit your TPM limit and your requests start getting throttled or rejected with `429 Too Many Requests`. This is a very common production issue for junior engineers on their first real deployment.

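```
# A common mistake: not accounting for TPM in batch jobs.
# load_10000_prompts, call_llm_api, and process are placeholders for your own code.

prompts = load_10000_prompts()   # Each ~500 tokens

for prompt in prompts:
    response = call_llm_api(prompt)   # 🚨 You'll hit the TPM limit fast
    process(response)

# Better approach: add client-side rate limiting
import time

TPM_LIMIT = 40000   # tokens per minute (check your plan)
tokens_this_minute = 0
minute_start = time.monotonic()

for prompt in prompts:
    # Rough estimate of input tokens only (~1.3 tokens per word).
    # Many providers count output tokens toward TPM too, so leave headroom.
    estimated_tokens = len(prompt.split()) * 1.3

    # Start a fresh window once a full minute has passed
    if time.monotonic() - minute_start >= 60:
        tokens_this_minute = 0
        minute_start = time.monotonic()

    # Out of budget: wait for the current window to end, then reset
    if tokens_this_minute + estimated_tokens > TPM_LIMIT:
        time.sleep(max(0, 60 - (time.monotonic() - minute_start)))
        tokens_this_minute = 0
        minute_start = time.monotonic()

    response = call_llm_api(prompt)
    tokens_this_minute += estimated_tokens
    process(response)
```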
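Client-side pacing lowers the odds of a `429`, but it can't rule one out, so it is common to pair it with retries and exponential backoff. In this sketch, `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on a 429, and the delay values are arbitrary defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(); on RateLimitError wait 1s, 2s, 4s, ... (capped) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: let the caller handle it
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so parallel workers don't retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the placeholder from the batch job above, usage would look like `call_with_backoff(lambda: call_llm_api(prompt))`. If your provider sends a `Retry-After` header, prefer that value over a computed delay.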

Let me show you a real decision you'll face: **"Should we use an 8B or 70B model?"**

Here's how the metrics interact:

```
                    8B Model          70B Model
─────────────────────────────────────────────────
Parameters          8 billion         70 billion
VRAM Required       ~16 GB            ~140 GB
GPU Setup           1× A100 40GB      4× A100 40GB
Est. TPS            ~80–120 TPS       ~15–30 TPS
TTFT (A100)         ~150ms            ~400ms
API Cost (est.)     ~$0.15/M tokens   ~$0.90/M tokens
Quality             Good              Excellent
─────────────────────────────────────────────────
```

**The real-world math:** Say your app handles 1,000 users/day, each generating ~2,000 tokens per session.

```
Daily tokens = 1,000 users × 2,000 tokens = 2,000,000 tokens

8B model cost:  2M tokens × $0.15/M = $0.30/day  → $9/month
70B model cost: 2M tokens × $0.90/M = $1.80/day  → $54/month
```

That's a 6× cost difference. For a startup, that matters.
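The same arithmetic fits in a small function you can rerun as your traffic changes. This sketch assumes a single blended price per million tokens and a 30-day month; real pricing usually differs for input and output tokens, and `monthly_cost_usd` is a hypothetical helper.

```python
def monthly_cost_usd(users_per_day: int, tokens_per_user: int,
                     price_per_million: float, days: int = 30) -> float:
    """Estimate monthly API spend from daily traffic and a per-million-token price."""
    daily_tokens = users_per_day * tokens_per_user
    return daily_tokens / 1_000_000 * price_per_million * days


small = monthly_cost_usd(1_000, 2_000, price_per_million=0.15)
large = monthly_cost_usd(1_000, 2_000, price_per_million=0.90)
print(f"8B:  ${small:.2f}/month")   # $9.00/month
print(f"70B: ${large:.2f}/month")   # $54.00/month
print(f"Ratio: {large / small:.0f}x")
```

Try it with 100× the users: the ratio stays at 6×, but the absolute gap grows from $45 to $4,500 a month.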

**The senior engineer's question isn't "which model is better?" It's *"which model is good enough for this use case at this scale?"***

Start with the smaller model. Benchmark it against your quality requirements. Scale up only if you have to.

| Metric | Full Name | Measures | Typical Unit |
|---|---|---|---|
| Parameters | — | Model size / capacity | M, B, T |
| Tokens | — | Text unit for I/O and cost | count |
| FLOPS | Floating Point Ops/sec | Hardware speed (server) | TFLOPS |
| TOPS | Tera Operations/sec | Hardware speed (edge/NPU) | TOPS |
| FLOPs | Floating Point Ops (total) | Training compute cost | PetaFLOPs |
| TTFT | Time To First Token | Latency / responsiveness | milliseconds |
| TPS | Tokens Per Second | Generation speed | tokens/sec |
| TPM | Tokens Per Minute | API rate limit | tokens/min |

You now have the vocabulary. Here's how to build on it:

- Run a model locally with `llama.cpp` or `Ollama`

The engineers who understand these numbers don't just write code. They make better architectural decisions, avoid expensive surprises, and earn trust faster.

That's the real reason to care.

*Got questions? Drop them in the comments.*
