AI Metrics Decoded: From Parameters to TOPS

A developer explains that understanding the core AI metrics (parameters, tokens, FLOPS, TOPS, FLOPs, TTFT, TPS, and TPM) is essential for avoiding costly deployment mistakes, such as choosing a 70B-parameter model that runs on a laptop but incurs massive GPU bills in production. The guide breaks down each metric with practical examples, including a rule of thumb that 1 billion parameters require about 2 GB of VRAM in half precision, and clarifies the difference between FLOPS (floating-point operations per second, quoted for GPUs) and TOPS (integer operations per second, quoted for edge devices).

Picture this: your team picks a 70B-parameter model for a new feature. It runs great on your MacBook. You push to production. The GPU bill arrives. Your manager is not happy.

Or this: your AI API costs explode halfway through the month and nobody knows why.

These are not horror stories. They happen to real engineers, usually the ones who skipped learning the core units of measurement behind AI systems.

As a junior engineer, you're going to face questions about model size, cost, and speed. Understanding the core metrics below gives you the language, and the instincts, to answer confidently. Let's break them down.

Parameters

What it is: The learned weights inside a neural network. Think of them as the "memory" of the model: numbers that get adjusted during training to capture patterns in data.

The unit: Just a raw count, usually expressed in millions (M), billions (B), or trillions (T).

Why it matters to you: Parameter count sets how much memory the model needs, and therefore which hardware can run it.

| Parameter Count | Approx. VRAM Needed (fp16) | Typical Use Case |
|---|---|---|
| 1B–3B | ~2–6 GB | Mobile / edge apps |
| 7B–8B | ~16 GB | Single consumer GPU |
| 13B–14B | ~28 GB | Single pro GPU (A100 40GB) |
| 70B | ~140 GB | Multi-GPU setup |
| 405B+ | ~800 GB+ | Cluster of H100s |

Rule of thumb: 1 billion parameters ≈ 2 GB of VRAM in half precision (fp16). Double it for full precision (fp32). That figure covers the weights alone; activations and the KV cache need extra headroom.

More parameters generally mean a more capable model, and always a more expensive one to run.

Tokens

What it is: The unit of text that a model reads and generates.
Not words, but fragments of words.

Quick visual:

```
Input text:  "Learning AI is fun!"
                  ↓ Tokenizer
Tokens:      "Learn" "ing" " AI" " is" " fun" "!"
Token count: 6 tokens
```

Why it matters to you: Tokens are the unit you are billed in and rate-limited on, so prompt length translates directly into cost.

Quick check: how many tokens is your prompt?

```python
# Using tiktoken (OpenAI's tokenizer, also used by many OSS models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Learning AI is fun!"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")  # → 6
print(f"Tokens: {tokens}")            # → [71668, 287, 15592, 374, 2523, 0]
```

Quick cheat sheet:

- 1 token ≈ 0.75 English words
- 1,000 tokens ≈ 750 words ≈ 1.5 pages
- Non-English text (Hindi, Mandarin, Arabic) often uses 30–70% more tokens for the same content

This is where a lot of junior engineers get confused. FLOPS and TOPS sound similar. They are not the same thing.

FLOPS (Floating Point Operations Per Second)

What it is: A measure of raw compute power for floating point arithmetic, the kind of math needed for training and running neural networks.

The scale:

| Unit | Value | Context |
|---|---|---|
| GFLOPS | 10⁹ FLOPS | Your laptop GPU |
| TFLOPS | 10¹² FLOPS | Cloud GPUs (A100: ~312 TFLOPS) |
| PFLOPS | 10¹⁵ FLOPS | Entire GPU clusters |

Used for: Server-scale training and inference. When someone says "the H100 delivers 989 TFLOPS of FP16 performance", this is what they mean.

Common GPUs you'll actually use:

| GPU | FP16 TFLOPS | Best For |
|---|---|---|
| RTX 4090 | ~165 | Local dev / fine-tuning |
| A100 40GB | ~312 | Production inference |
| H100 SXM | ~989 | Large-scale training |

TOPS (Tera Operations Per Second)

What it is: Similar idea, but used for integer or mixed-precision operations on edge hardware and NPUs (Neural Processing Units).
The key difference:

```
FLOPS → Floating point math → GPUs / server chips → Training & inference at scale
TOPS  → Integer / INT8 math → NPUs / edge chips   → On-device inference
```

Real-world examples:

| Device | TOPS | Use Case |
|---|---|---|
| Apple M4 Neural Engine | ~38 TOPS | On-device ML on MacBook |
| Qualcomm Snapdragon X Elite | ~45 TOPS | AI PCs / laptops |
| NVIDIA Jetson Orin | ~275 TOPS | Edge AI / robotics |
| Google TPU v5e | ~393 TOPS | Cloud inference at scale |

When do you care about TOPS? When you're deploying a model to a phone, a laptop, or an embedded device, not a data centre. If you're picking a chip for on-device inference, TOPS is your number.

FLOPs (total floating point operations)

Yes, confusingly, FLOPs (lowercase "s", no "per second") is a different metric from FLOPS.

What it is: The total number of floating point operations performed during an entire training run. It's a measure of compute budget, not hardware speed.

The unit: Usually expressed in scientific notation (for example 10²³ FLOPs) or in petaFLOPs (10¹⁵ FLOPs).

Real-world examples:

| Model | Estimated Training FLOPs |
|---|---|
| GPT-3 175B | ~3.14 × 10²³ |
| LLaMA 2 70B | ~2.9 × 10²³ |
| Gemini Ultra | ~5 × 10²⁴ (estimated) |

Why it matters to you: Directly, as a junior engineer, probably not yet. But understanding it helps you reason about the compute budget behind the models you use.

Quick analogy: FLOPS (the hardware rate) is your car's horsepower. FLOPs (training cost) is the total miles driven on a road trip. One is a rate, the other is a total.

The next three metrics (TTFT, TPS, and TPM) are the ones you'll track the most in production. They live in your dashboards, your SLAs, and your post-mortems.

TTFT (Time To First Token)

What it is: How long (in milliseconds) it takes from sending your request to receiving the first token of the response.

Why it matters: This is what determines whether your app feels fast. Even if the full response takes 10 seconds, a 200ms TTFT makes the experience feel responsive. It's the AI equivalent of "First Contentful Paint" in web dev.

```
User sends prompt
       ↓
 ... processing ...   ← this duration is TTFT
       ↓
First token arrives → streaming begins → user sees output
```

Good TTFT benchmarks:

| Scenario | Target TTFT |
|---|---|
| Real-time chat | < 300ms |
| Interactive coding assistant | < 500ms |
| Background document processing | < 2,000ms (acceptable) |

TPS (Tokens Per Second)

What it is: How many tokens the model generates per second during the response. Also called generation speed or throughput.

Why it matters: TPS determines whether your streaming response feels smooth or painfully slow.

What affects TPS: Mainly model size and the hardware it runs on. In the comparison further down, the 8B model generates several times faster than the 70B model.

TPM (Tokens Per Minute)

What it is: Your rate limit from the API provider: the maximum number of tokens your account can process per minute.

Why it matters: Hit your TPM limit and your requests start getting throttled or rejected with 429 Too Many Requests. This is a very common production issue for junior engineers on their first real deployment.

```python
# A common mistake: not accounting for TPM in batch jobs
prompts = load_10000_prompts()  # Each ~500 tokens

for prompt in prompts:
    response = call_llm_api(prompt)  # 🚨 You'll hit TPM limit fast
    process(response)
```

```python
# Better approach: add rate limiting
import time

TPM_LIMIT = 40000  # tokens per minute (check your plan)
tokens_this_minute = 0
minute_start = time.time()

for prompt in prompts:
    estimated_tokens = len(prompt.split()) * 1.3  # rough estimate, prompt only

    # Start a fresh window once a full minute has passed
    if time.time() - minute_start >= 60:
        tokens_this_minute = 0
        minute_start = time.time()

    # Budget exhausted: wait out the rest of the current window
    if tokens_this_minute + estimated_tokens > TPM_LIMIT:
        sleep_time = 60 - (time.time() - minute_start)
        if sleep_time > 0:
            time.sleep(sleep_time)
        tokens_this_minute = 0
        minute_start = time.time()

    response = call_llm_api(prompt)
    tokens_this_minute += estimated_tokens
    process(response)
```

Let me show you a real decision you'll face: "Should we use an 8B or 70B model?"

Here's how the metrics interact:

```
                    8B Model            70B Model
─────────────────────────────────────────────────────
Parameters          8 billion           70 billion
VRAM Required       ~16 GB              ~140 GB
GPU Setup           1× A100 40GB        4× A100 40GB
Est. TPS            ~80–120 TPS         ~15–30 TPS
TTFT (A100)         ~150ms              ~400ms
API Cost (est.)     ~$0.15/M tokens     ~$0.90/M tokens
Quality             Good                Excellent
─────────────────────────────────────────────────────
```

The real-world math: Say your app handles 1,000 users/day, each generating ~2,000 tokens per session.

```
Daily tokens = 1,000 users × 2,000 tokens = 2,000,000 tokens

8B model cost:  2M tokens × $0.15/M = $0.30/day → $9/month
70B model cost: 2M tokens × $0.90/M = $1.80/day → $54/month
```

That's a 6× cost difference. For a startup, that matters.

The senior engineer's question isn't "which model is better?" It's "which model is good enough for this use case at this scale?"

Start with the smaller model. Benchmark it against your quality requirements. Scale up only if you have to.

| Metric | Full Name | Measures | Typical Unit |
|---|---|---|---|
| Parameters | n/a | Model size / capacity | M, B, T |
| Tokens | n/a | Text unit for I/O and cost | count |
| FLOPS | Floating Point Ops/sec | Hardware speed (server) | TFLOPS |
| TOPS | Tera Operations/sec | Hardware speed (edge/NPU) | TOPS |
| FLOPs | Floating Point Ops (total) | Training compute cost | PetaFLOPs |
| TTFT | Time To First Token | Latency / responsiveness | milliseconds |
| TPS | Tokens Per Second | Generation speed | tokens/sec |
| TPM | Tokens Per Minute | API rate limit | tokens/min |

You now have the vocabulary. A good way to build on it is to run a model with llama.cpp or Ollama locally.

The engineers who understand these numbers don't just write code. They make better architectural decisions, avoid expensive surprises, and earn trust faster.

That's the real reason to care.

Got questions? Drop them in the comments.
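Postscript: the two back-of-the-envelope calculations this guide leans on (the 2 GB-per-billion-parameters rule and the monthly token bill from the 8B vs 70B comparison) fit in a few lines of Python. This is only a rough sketch. The function names `estimate_vram_gb` and `monthly_cost_usd` are made up for illustration, the prices are the estimated figures from the comparison rather than any provider's real rates, and a 30-day month with one session per user per day is assumed.

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Weights-only VRAM estimate in GB.

    Uses the rule of thumb from this guide: 2 bytes per parameter in fp16,
    4 bytes in fp32. Activations and KV cache are NOT included.
    """
    return params_billions * bytes_per_param


def monthly_cost_usd(users_per_day: int, tokens_per_session: int,
                     price_per_million_tokens: float, days: int = 30) -> float:
    """Approximate API bill, assuming a flat price per million tokens."""
    daily_tokens = users_per_day * tokens_per_session
    return daily_tokens / 1_000_000 * price_per_million_tokens * days


if __name__ == "__main__":
    # Hypothetical inputs taken from the 8B vs 70B comparison in this guide
    for size in (8, 70):
        print(f"{size}B model: ~{estimate_vram_gb(size):.0f} GB VRAM in fp16")
    for name, price in (("8B", 0.15), ("70B", 0.90)):
        cost = monthly_cost_usd(1_000, 2_000, price)
        print(f"{name} model: ${cost:.2f}/month")
```

Plugging in your own traffic and your provider's actual per-million-token price is the quickest way to sanity-check a model choice before the bill arrives.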