[ARTICLE · art-21329] src=dev.to pub= topic=artificial-intelligence verified=true sentiment=· neutral

When 8 GPUs Is All You Need

A developer argues that 4 to 8 dedicated GPUs, such as H200 NVLink cards, are sufficient for most production inference workloads on 70B to 200B parameter models, challenging the assumption that production inference needs a multi-node cluster. By this analysis, a 4x GPU server covers most needs and 8 GPUs handle larger models and redundancy, while multi-node clusters are only necessary for pre-training from scratch or hyperscale serving.

3 min read · Published Jun 4, 2026

TL;DR: 4 GPUs covers most 70B-200B production inference needs. 8 GPUs handles larger models and redundancy. You only need a multi-node cluster if you're pre-training from scratch or serving at hyperscale.

Most AI teams I talk to start the same way: they see what hyperscalers are selling, assume they need a cluster, and either overspend on compute they don't fully use, or underspec their first server and hit a wall three months in.

The wall is always the same. The model grows. Latency climbs. The team realizes the single GPU they started on was a proof of concept, not a production spec. Mid-project, mid-budget, rethinking everything.

For most inference workloads, 4 to 8 dedicated GPUs is where the math works. AI-based search platforms are the clearest case. If you're embedding an LLM into a search product, you're serving queries continuously, at low latency, with a model in the 70B to 200B parameter range. That workload needs memory bandwidth and consistency. A 4x or 8x H200 NVLink server holds the full model in VRAM, keeps GPU-to-GPU communication off the PCIe bus, and gives you predictable latency because no other tenant's workload shares the hardware.

AI media analytics has the same profile: processing video metadata, running multimodal inference pipelines, classifying content at scale. Continuous throughput workloads that run around the clock. Dedicated hardware economics beat cloud once these pipelines stop being intermittent.
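One way to make "stop being intermittent" concrete is a break-even calculation: how many hours per month a pipeline has to run before a flat-rate dedicated server costs less than paying a cloud provider by the hour. The sketch below is only that, a sketch. The prices are placeholders I made up for illustration, not quotes from any provider, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def break_even_hours(dedicated_monthly_cost: float, cloud_hourly_rate: float) -> float:
    """Hours of use per month at which cloud spend equals the dedicated flat fee."""
    return dedicated_monthly_cost / cloud_hourly_rate


def break_even_utilization(dedicated_monthly_cost: float, cloud_hourly_rate: float) -> float:
    """The same break-even point, expressed as a fraction of the month."""
    return break_even_hours(dedicated_monthly_cost, cloud_hourly_rate) / HOURS_PER_MONTH


# Placeholder prices, for illustration only. Substitute your own quotes.
dedicated = 7300.0  # hypothetical flat monthly fee for a dedicated GPU server
cloud = 20.0        # hypothetical hourly rate for a comparable cloud instance

hours = break_even_hours(dedicated, cloud)
share = break_even_utilization(dedicated, cloud)
print(f"Break-even at {hours:.0f} hours/month ({share:.0%} utilization)")
```

A pipeline that runs around the clock sits at 100% utilization, so any break-even point below that favors dedicated hardware. The comparison ignores egress, storage, and staffing costs, which can move the result in either direction.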

Redundant dual-datacenter setups belong in the conversation earlier than most teams think. Two 4x GPU servers across two EU datacenters give you active-active inference with geographic redundancy. For teams with uptime requirements or data residency obligations, this architecture is simpler to operate than a single large cluster, and the data stays in the EU locations you specify.
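To illustrate what active-active means at the request-routing level, here is a minimal sketch of a round-robin router that skips a datacenter marked unhealthy. It is an assumption about how you might wire this up, not a description of any particular load balancer. The class name, the endpoint URLs, and the health-marking interface are all hypothetical, and a real deployment would more likely rely on DNS or a dedicated load balancer with active health checks.

```python
class ActiveActiveRouter:
    """Round-robin over inference endpoints, skipping any marked unhealthy."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._healthy = set(self._endpoints)  # every endpoint starts healthy
        self._next = 0                        # index of the next endpoint to try

    def mark(self, endpoint, healthy):
        """Record the result of a health check for one endpoint."""
        if healthy:
            self._healthy.add(endpoint)
        else:
            self._healthy.discard(endpoint)

    def choose(self):
        """Return the next healthy endpoint, or raise if none is available."""
        count = len(self._endpoints)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._endpoints[index]
            if candidate in self._healthy:
                self._next = (index + 1) % count
                return candidate
        raise RuntimeError("no healthy inference endpoint available")


# Hypothetical endpoints, one 4x GPU server per EU datacenter.
router = ActiveActiveRouter([
    "https://inference-dc1.example.com",
    "https://inference-dc2.example.com",
])
print(router.choose())  # both sites healthy: traffic alternates between them
router.mark("https://inference-dc2.example.com", healthy=False)
print(router.choose())  # dc2 is down: every request goes to dc1
```

The point of the sketch is that both sites serve traffic in normal operation, so losing one halves capacity rather than triggering a cold failover. Each server therefore has to be sized to carry the full load on its own if you want to ride out an outage without degraded latency.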

On shared cloud infrastructure, GPU memory bandwidth degrades under load. Your workload competes with whatever else runs on that physical node. For inference, where time-to-first-token and tokens-per-second determine whether your product feels fast or broken, that unpredictability compounds.

On dedicated bare metal:

| Spec | Detail |
| --- | --- |
| Memory bandwidth | H200 provides 4.8 TB/s of HBM3e memory bandwidth |
| GPU interconnect | NVLink keeps GPU-to-GPU traffic off the PCIe bus |
| Hardware sizing | CPU, RAM, and NVMe matched to the GPU config from day one |
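Memory bandwidth matters because token generation is typically bandwidth-bound: at small batch sizes, each generated token requires reading roughly all of the model weights from HBM. A rough ceiling on single-stream decode speed follows from the 4.8 TB/s figure above. The sketch assumes ideal tensor parallelism, where every GPU streams its shard of the weights in parallel, and it ignores KV cache reads, interconnect overhead, and kernel efficiency. Real throughput will be lower, so treat the result as an upper bound rather than a benchmark. The function name is hypothetical.

```python
H200_BANDWIDTH_TBPS = 4.8  # HBM3e memory bandwidth per H200, in TB/s


def decode_ceiling_tokens_per_s(params_billion: float,
                                bytes_per_param: float,
                                num_gpus: int) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes each token requires one full pass over the weights and that the
    weights are sharded evenly, so the GPUs' bandwidth adds up.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    aggregate_bytes_per_s = num_gpus * H200_BANDWIDTH_TBPS * 1e12
    return aggregate_bytes_per_s / weight_bytes


# 2 bytes per parameter is FP16/BF16, 1 byte per parameter is FP8.
for params, bytes_per_param, label in [(70, 2.0, "70B FP16"),
                                       (200, 2.0, "200B FP16"),
                                       (405, 1.0, "405B FP8")]:
    for gpus in (4, 8):
        ceiling = decode_ceiling_tokens_per_s(params, bytes_per_param, gpus)
        print(f"{label} on {gpus}x H200: at most ~{ceiling:.0f} tokens/s per stream")
```

Batching changes the picture. Serving many requests at once amortizes each weight read across the batch, which is why aggregate tokens per second can be far higher than this per-stream ceiling.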

For teams with EU data residency requirements, dedicated infrastructure in EU datacenters means your training data and inference logs stay where your compliance team needs them.

You don't have to start at 8. For 70B to 200B models, a 4x H200 NVLink server covers most production inference needs. With FP8 quantization and careful sharding, the same configuration can handle 405B-class workloads at moderate concurrency. That gives you room to validate your serving stack before expanding.
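The sizing claims above come down to simple arithmetic: the weight footprint is the parameter count times bytes per parameter, and whatever is left of total HBM goes to KV cache, activations, and runtime overhead. A back-of-the-envelope sketch, assuming 141 GB of HBM3e per H200 and counting weights only. The function names are hypothetical, and how much headroom you actually need depends on context length and concurrency.

```python
H200_HBM_GB = 141  # HBM3e capacity per H200, in GB

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB: parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion: float, precision: str, num_gpus: int) -> float:
    """Memory left for KV cache and overhead once the weights are loaded.

    A negative result means the weights alone do not fit.
    """
    return num_gpus * H200_HBM_GB - weight_gb(params_billion, precision)


for params, precision in [(70, "fp16"), (200, "fp16"), (405, "fp8"), (405, "fp16")]:
    for gpus in (4, 8):
        spare = headroom_gb(params, precision, gpus)
        verdict = "fits" if spare > 0 else "does not fit"
        print(f"{params}B {precision} on {gpus}x H200: {verdict}, "
              f"{spare:+.0f} GB after weights")
```

Under these assumptions a 405B model at FP8 leaves a 4x server with limited headroom, which is consistent with the "moderate concurrency" caveat: KV cache grows with every concurrent request and with context length. At FP16 the same model's weights exceed 4x capacity outright and need the 8x configuration.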

The HPE ProLiant DL385 Gen11 supports configurations with up to 8 GPUs, so teams that plan slot and power headroom from day one can grow from 4 to 8 on the same server without a chassis change.

| GPU | Right for |
| --- | --- |
| H200 NVLink | 70B to 405B models, production inference, memory-heavy workloads |
| H100 | Teams where ecosystem stability matters: vLLM and TensorRT-LLM have years of H100 optimization |
| RTX Pro 6000 | Parallel inference on smaller models, visual computing, VDI, rendering alongside AI workloads |

Pre-training a frontier model from scratch requires more than 8 GPUs. The multi-node cluster conversation is real and the interconnect requirements are different.

True hyperscale inference, serving hundreds of millions of daily requests across many model variants, outgrows a single server.

Most teams building new AI products are in a different phase: proving latency targets, validating the model in production, getting the inference stack right. That work fits on 4 to 8 dedicated GPUs.

The right configuration depends on your model, your precision target, and your concurrency requirements. If you're speccing out an EU-based deployment, start here: Leaseweb GPU Servers

Disclaimer: I'm on the infrastructure team at Leaseweb. EU-native, Netherlands-owned.
