AI Metrics Decoded: From Parameters to TOPS

wpnews.pro

Picture this: your team picks a 70B parameter model for a new feature. It runs great on your MacBook. You push to production. The GPU bill arrives. Your manager is not happy.

Or this: your AI API costs explode halfway through the month and nobody knows why.

These are not horror stories. They happen to real engineers — usually the ones who skipped learning the core units of measurement behind AI systems.

As a junior engineer, you're going to face questions like:

Understanding the seven core metrics below gives you the language — and the instincts — to answer confidently.

Let's break them down.

Parameters

What it is: The learned weights inside a neural network. Think of them as the "memory" of the model: numbers that get adjusted during training to capture patterns in data.

The unit: Just a raw count. We usually express it in:

Why it matters to you:

Parameter Count	Approx. VRAM Needed (fp16)	Typical Use Case
1B–3B	~4–6 GB	Mobile / edge apps
7B–8B	~16 GB	Single consumer GPU
13B–14B	~28 GB	Single pro GPU (A100 40GB)
70B	~140 GB	Multi-GPU setup
405B+	~800 GB+	Cluster of H100s

Rule of thumb: 1 billion parameters ≈ 2 GB of VRAM in half-precision (fp16). Double it for full precision (fp32).
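The rule of thumb can be turned into a quick calculator. This is a minimal sketch: the function names are made up for illustration, and it counts only the memory for the weights themselves, so real deployments need extra headroom for activations and caches.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def estimate_weight_vram_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the approximate memory (in GB) needed just to hold the weights."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def gpus_needed(vram_gb: float, gpu_memory_gb: float) -> int:
    """Return the minimum number of GPUs whose combined memory covers vram_gb."""
    return math.ceil(vram_gb / gpu_memory_gb)


for billions in (8, 70):
    fp16_gb = estimate_weight_vram_gb(billions * 1e9, "fp16")
    fp32_gb = estimate_weight_vram_gb(billions * 1e9, "fp32")
    print(f"{billions}B: ~{fp16_gb:.0f} GB fp16, ~{fp32_gb:.0f} GB fp32, "
          f"{gpus_needed(fp16_gb, 40)}x 40 GB GPU(s) for fp16")
```

For a 70B model this gives roughly 140 GB in fp16, which matches the table above and explains why a single 40 GB card is not enough.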

More parameters usually means a more capable model, and almost always a more expensive one to run.

Tokens

What it is: The unit of text that a model reads and generates. Not words, but fragments of words.

Quick visual:

Input text:  "Learning AI is fun!"
             ↓ Tokenizer
Tokens:      ["Learn"] ["ing"] [" AI"] [" is"] [" fun"] ["!"]
Token count: 6 tokens

Why it matters to you:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Learning AI is fun!"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")   # → 6
print(f"Tokens: {tokens}")             # → [71668, 287, 15592, 374, 2523, 0]

Quick cheat sheet:

1 token ≈ 0.75 English words
1,000 tokens ≈ 750 words ≈ 1.5 pages
Non-English text (Hindi, Mandarin, Arabic) uses 30–70% more tokens for the same content
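When you only need a ballpark figure and don't want to load a tokenizer, the cheat sheet can be applied directly. The sketch below is an approximation under those ratios, not a replacement for a real tokenizer; `estimate_tokens` and its `overhead` parameter are hypothetical names chosen for this example.

```python
# From the cheat sheet: 1 token ≈ 0.75 English words.
TOKENS_PER_WORD = 1 / 0.75


def estimate_tokens(text: str, overhead: float = 0.0) -> int:
    """Estimate the token count of text from its whitespace-separated word count.

    overhead is an extra fraction to add, e.g. 0.3 to 0.7 for the
    non-English languages mentioned in the cheat sheet.
    """
    words = len(text.split())
    return round(words * TOKENS_PER_WORD * (1 + overhead))


sample = " ".join(["word"] * 750)          # a 750-word document
print(estimate_tokens(sample))             # English estimate
print(estimate_tokens(sample, overhead=0.5))  # with 50% extra tokens
```

Use the real tokenizer for billing-critical numbers; the estimate is mainly useful for sizing prompts and budgets early on.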

This is where a lot of junior engineers get confused. FLOPS and TOPS sound similar. They are not the same thing.

FLOPS (Floating Point Operations Per Second)

What it is: A measure of raw compute power for floating point arithmetic, the kind of math needed for training and running neural networks.

The scale:

Unit	Value	Context
GFLOPS	10⁹ FLOPS	Your laptop GPU
TFLOPS	10¹² FLOPS	Cloud GPUs (A100: ~312 TFLOPS)
PFLOPS	10¹⁵ FLOPS	Entire GPU clusters

Used for: Server-scale training and inference. When someone says "the H100 delivers 989 TFLOPS of FP16 performance", this is what they mean.

Common GPUs you'll actually use:

GPU	FP16 TFLOPS	Best For
RTX 4090	~165	Local dev / fine-tuning
A100 40GB	~312	Production inference
H100 SXM	~989	Large-scale training

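TOPS (Tera Operations Per Second)

What it is: A similar throughput measure, counted in trillions of operations per second and usually quoted for integer or mixed-precision operations on edge hardware and NPUs (Neural Processing Units).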

The key difference:

FLOPS  →  Floating point math  →  GPUs / server chips  →  Training & inference at scale
TOPS   →  Integer / INT8 math  →  NPUs / edge chips    →  On-device inference

Real-world examples:

Device	TOPS	Use Case
Apple M4 Neural Engine	~38 TOPS	On-device ML on MacBook
Qualcomm Snapdragon X Elite	~45 TOPS	AI PCs / laptops
NVIDIA Jetson Orin	~275 TOPS	Edge AI / robotics
Google TPU v5e	~393 TOPS	Cloud inference at scale

When do you care about TOPS? When you're deploying a model to a phone, a laptop, or an embedded device, not a data centre. If you're picking a chip for on-device inference, TOPS is your number.

Yes, confusingly, FLOPs (with a lowercase "s", no "per second") is a different metric from FLOPS.

What it is: The total number of floating point operations performed during an entire training run. It's a measure of compute budget, not hardware speed.

The unit: Usually expressed as:

Real-world examples:

Model	Estimated Training FLOPs
GPT-3 (175B)	~3.14 × 10²³
LLaMA 2 70B	~8.4 × 10²³
Gemini Ultra	~5 × 10²⁵ (estimated)
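Figures like these can be sanity-checked with a widely used approximation that is not part of the original article: training FLOPs ≈ 6 × parameters × training tokens. The sketch below applies it to GPT-3, assuming the roughly 300 billion training tokens reported for that model; the function name is illustrative.

```python
def estimate_training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training compute as 6 * parameters * training tokens."""
    return 6 * num_params * num_tokens


# Assumption: GPT-3 has 175B parameters and was trained on ~300B tokens.
gpt3_flops = estimate_training_flops(175e9, 300e9)
print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")  # 3.15e+23
```

The result lands very close to the ~3.14 × 10²³ figure in the table, which is why this shortcut is popular for back-of-the-envelope budgeting.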

Why it matters to you: Directly as a junior engineer, probably not yet. But understanding it helps you reason about:

Quick analogy: FLOPS (the hardware rate) is your car's horsepower. FLOPs (training cost) is the total miles driven on a road trip. One is speed, one is distance.
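The two metrics connect through simple division: total FLOPs divided by FLOPS gives time. Here is a rough sketch using the GPT-3 and A100 figures quoted earlier. The `utilization` parameter and the 40% value are assumptions for illustration, since real training runs never reach a GPU's peak rate.

```python
SECONDS_PER_DAY = 86_400


def training_gpu_days(total_flops: float,
                      peak_flops_per_sec: float,
                      utilization: float = 1.0) -> float:
    """Return GPU-days needed to perform total_flops on one GPU.

    utilization is the fraction of peak FLOPS actually achieved (0 to 1).
    """
    effective_rate = peak_flops_per_sec * utilization
    return total_flops / effective_rate / SECONDS_PER_DAY


gpt3_flops = 3.14e23      # training FLOPs from the table above
a100_flops = 312e12       # ~312 TFLOPS FP16 peak

print(f"At peak:            {training_gpu_days(gpt3_flops, a100_flops):,.0f} GPU-days")
print(f"At 40% utilization: {training_gpu_days(gpt3_flops, a100_flops, 0.4):,.0f} GPU-days")
```

Even at an unrealistic 100% utilization this works out to more than 11,000 GPU-days on a single A100, which is why training at this scale is spread across thousands of GPUs.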

These three (TTFT, TPS, and TPM) are the metrics you'll track the most in production. They live in your dashboards, your SLAs, and your post-mortems.

TTFT (Time To First Token)

What it is: How long (in milliseconds) it takes from sending your request to receiving the first token of the response.

Why it matters: This is what determines if your app feels fast. Even if the full response takes 10 seconds, a 200ms TTFT makes the experience feel responsive. It's the AI equivalent of "First Contentful Paint" in web dev.

User sends prompt
        ↓
  [ ... processing ... ]   ← this duration is TTFT
        ↓
First token arrives → streaming begins → user sees output

Good TTFT benchmarks:

Scenario	Target TTFT
Real-time chat	< 300ms
Interactive coding assistant	< 500ms
Background document processing	< 2,000ms (acceptable)

TPS (Tokens Per Second)

What it is: How many tokens the model generates per second during the response. Also called generation speed or throughput.

Why it matters: TPS determines whether your streaming response feels smooth or painfully slow.
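Both TTFT and TPS can be measured from any streaming response by timing the tokens as they arrive. The sketch below assumes the response is exposed as a Python iterator of tokens; `fake_stream` is a hypothetical stand-in for a real streaming API call, used here only so the example runs on its own.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float]:
    """Consume a token stream and return (ttft_seconds, tokens_per_second)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    ttft = first_token_at - start
    generation_time = end - first_token_at
    # TPS covers the tokens after the first one, over the time since it arrived.
    tps = (count - 1) / generation_time if count > 1 and generation_time > 0 else 0.0
    return ttft, tps


def fake_stream(n_tokens: int, ttft: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: waits ttft, then emits tokens."""
    time.sleep(ttft)
    for i in range(n_tokens):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


ttft, tps = measure_stream(fake_stream(n_tokens=20, ttft=0.2, gap=0.01))
print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.0f} tokens/sec")
```

Keeping the two numbers separate matters: a slow first token and a slow stream have different causes and different fixes.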

What affects TPS:

TPM (Tokens Per Minute)

What it is: Your rate limit from the API provider: the maximum number of tokens your account can process per minute.

Why it matters: Hit your TPM limit and your requests start getting throttled or rejected with 429 Too Many Requests. This is a very common production issue for junior engineers on their first real deployment.


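# Naive version: no rate limiting.
# load_10000_prompts, call_llm_api, and process are placeholders for your own code.
prompts = load_10000_prompts()   # Each ~500 tokens

for prompt in prompts:
    response = call_llm_api(prompt)   # 🚨 You'll hit TPM limit fast
    process(response)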

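import time

TPM_LIMIT = 40000   # tokens per minute (check your plan)
tokens_this_minute = 0
minute_start = time.time()

# Throttled version: pause when the next request would exceed the TPM budget.
for prompt in prompts:
    # Rough estimate of prompt tokens only (~1.3 tokens per word).
    # Depending on the provider, response tokens may count toward the limit too.
    estimated_tokens = int(len(prompt.split()) * 1.3)

    if tokens_this_minute + estimated_tokens > TPM_LIMIT:
        # Wait out the rest of the current one-minute window, then reset the counter.
        sleep_time = 60 - (time.time() - minute_start)
        if sleep_time > 0:
            time.sleep(sleep_time)
        tokens_this_minute = 0
        minute_start = time.time()

    response = call_llm_api(prompt)
    tokens_this_minute += estimated_tokens
    process(response)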
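Client-side throttling relies on estimates, so an occasional 429 can still slip through. A common complement is to retry with exponential backoff. The sketch below is one possible shape for that, not any provider's official client: `RateLimitError` is a hypothetical exception standing in for however your HTTP library reports a 429, and `flaky_call` simulates an API that rejects the first two attempts.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def call_with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on RateLimitError, wait and retry with growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Delay doubles each attempt; random jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


attempts = {"n": 0}


def flaky_call():
    """Simulated API call that is rate limited on the first two attempts."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


print(call_with_backoff(flaky_call, base_delay=0.01))
```

In the loop above, you would wrap the API call, for example `call_with_backoff(lambda: call_llm_api(prompt))`, so a rejected request is retried instead of crashing the batch.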

Let me show you a real decision you'll face: "Should we use an 8B or 70B model?"

Here's how the metrics interact:

                    8B Model          70B Model
─────────────────────────────────────────────────
Parameters          8 billion         70 billion
VRAM Required       ~16 GB            ~140 GB
GPU Setup           1× A100 40GB      4× A100 40GB
Est. TPS            ~80–120 TPS       ~15–30 TPS
TTFT (A100)         ~150ms            ~400ms
API Cost (est.)     ~$0.15/M tokens   ~$0.90/M tokens
Quality             Good              Excellent
─────────────────────────────────────────────────

The real-world math: Say your app handles 1,000 users/day, each generating ~2,000 tokens per session.

Daily tokens = 1,000 users × 2,000 tokens = 2,000,000 tokens

8B model cost:  2M tokens × $0.15/M = $0.30/day  → $9/month
70B model cost: 2M tokens × $0.90/M = $1.80/day  → $54/month

That's a 6× cost difference. For a startup, that matters.
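The same arithmetic is easy to wrap in a small helper so you can rerun it as traffic or pricing changes. This is a sketch using the estimated prices from the comparison table; the function name and the 30-day month are assumptions for the example.

```python
def monthly_cost_usd(users_per_day: int,
                     tokens_per_user: int,
                     price_per_million_tokens: float,
                     days: int = 30) -> float:
    """Return the estimated monthly API cost in US dollars."""
    daily_tokens = users_per_day * tokens_per_user
    daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
    return daily_cost * days


cost_8b = monthly_cost_usd(1_000, 2_000, 0.15)
cost_70b = monthly_cost_usd(1_000, 2_000, 0.90)

print(f"8B:  ${cost_8b:.2f}/month")
print(f"70B: ${cost_70b:.2f}/month")
print(f"Ratio: {cost_70b / cost_8b:.0f}x")
```

Try it with 100,000 users per day and the absolute gap grows from tens of dollars to thousands per month, even though the ratio stays the same.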

The senior engineer's question isn't "which model is better?" It's "which model is good enough for this use case at this scale?"

Start with the smaller model. Benchmark it against your quality requirements. Scale up only if you have to.

Metric	Full Name	Measures	Typical Unit
Parameters	—	Model size / capacity	M, B, T
Tokens	—	Text unit for I/O and cost	count
FLOPS	Floating Point Ops/sec	Hardware speed (server)	TFLOPS
TOPS	Tera Operations/sec	Hardware speed (edge/NPU)	TOPS
FLOPs	Floating Point Ops (total)	Training compute cost	PetaFLOPs
TTFT	Time To First Token	Latency / responsiveness	milliseconds
TPS	Tokens Per Second	Generation speed	tokens/sec
TPM	Tokens Per Minute	API rate limit	tokens/min

You now have the vocabulary. Here's how to build on it:

llama.cpp or Ollama locally

The engineers who understand these numbers don't just write code. They make better architectural decisions, avoid expensive surprises, and earn trust faster.

That's the real reason to care.

Got questions? Drop them in the comments.

source & further reading

dev.to — original article