What Actually Happens When You Call an LLM API

A developer explains the real-world latency behind LLM API calls, tracing the journey from user prompt through submarine cables to data centers and GPUs. The post highlights how geographic distance, especially for users in Africa, adds 100-200ms of unavoidable latency due to the speed of light, before any inference begins. It also notes the vast disparity in data center infrastructure, with Nigeria having 17 facilities compared to over 5,500 in the United States.

you've felt it. you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking out loud with a machine.

then the next day, same model, same prompt, you wait. three seconds. five. the cursor blinks. nothing. then it all comes at once. you probably blamed your wifi. it wasn't your wifi.

what actually happened in those extra seconds is a story that starts in a building you'll never visit, runs through a cable at the bottom of an ocean, and ends on a gpu that was busy doing someone else's thinking before it got to yours. and if you're building in africa, or anywhere that isn't virginia, ireland, or frankfurt, that story has a chapter in it specifically about you.

let's follow a single request from the moment you hit send. your prompt leaves your device and travels as packets of data through your ISP, hits a submarine fibre cable, crosses an ocean, arrives at a data centre, gets routed to the right server, waits for a gpu to become available, gets processed, and the response travels back the same way.

that whole round trip happens in what feels like nothing. except it isn't nothing. every step costs time. and some of those steps cost more depending on where you're sitting on the planet.

before we get to the interesting parts, ground this. a data centre is a building, sometimes the size of several football pitches, filled with servers. those servers are computers without screens. stacked in metal racks. thousands of them. running twenty-four hours a day, seven days a week, never switching off. every api call you make, every message you send on whatsapp, every google search, every youtube video: all of it is touching a server in a building like this somewhere.

the building needs three things to function: power, cooling, and connectivity. the power runs the servers. the cooling stops them melting, because servers generate enormous heat at this density.
the connectivity is the fibre cable that connects the building to the rest of the internet.

nigeria has 17 of these buildings. the united states has over 5,500. that gap matters. we'll come back to it.

latency is the time it takes for data to travel from point A to point B and back. it is bounded by physics. data moves through fibre optic cable at roughly two-thirds the speed of light, about 200,000 kilometres per second. you cannot make it faster. you can only make the distance shorter.

lagos to london is approximately 5,000 kilometres in a straight line. at two-thirds the speed of light that's about 25 milliseconds each way, so the minimum possible round-trip time is around 50 milliseconds from the distance alone. real cables don't run in straight lines, so the true floor is higher. add routing, congestion, and processing and a real round trip is more like 100 to 150ms. that's pure transport. the model still has to think.

most developers building in nigeria are hitting llm servers in us-east-1 (virginia), eu-west-1 (ireland), or eu-central-1 (frankfurt). that's not a complaint. those are where the servers are. but it means every api call carries 100 to 200ms of latency just from geography, before inference even begins.

for a streaming chatbot, you feel this. that pause before the first token appears isn't just the model being slow. part of it is the speed of light, applied to distance.

when your prompt arrives at the server, it doesn't get processed the way you might imagine, like a search engine matching keywords. the model runs your prompt through billions of mathematical operations, layer by layer, to predict what the most likely next token should be. then the next. then the next. each token is generated one at a time, sequentially, until the response is complete. this is inference.

a token is roughly three-quarters of an english word. the exact split depends on the model's tokenizer: a short word like "hello" is typically one token, while a longer word like "infrastructure" may be split into two. a post this length is a couple of thousand tokens.

why does this matter? because every token costs compute. a longer prompt costs more compute on the input side. a longer response costs more on the output side.
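
to put rough numbers on the two costs so far, distance and token-by-token generation, here's a minimal python sketch. it's back-of-envelope arithmetic, not a benchmark. the function names are made up for illustration, the 50 tokens per second decode speed is a hypothetical figure (real speeds vary by model, provider, and load), and it ignores routing, queueing, and prompt processing entirely.

```python
# rough latency budget for one llm api call.
# every constant here is an illustrative assumption, not a measurement.

SPEED_OF_LIGHT_KM_S = 299_792   # speed of light in a vacuum
FIBRE_FACTOR = 2 / 3            # light in fibre travels at roughly 2/3 of that
WORDS_PER_TOKEN = 0.75          # rule of thumb: a token is ~3/4 of a word


def min_round_trip_ms(distance_km: float) -> float:
    """physical floor on round-trip time over fibre, ignoring routing."""
    fibre_speed_km_s = SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR
    return 2 * distance_km / fibre_speed_km_s * 1000


def estimate_tokens(word_count: int) -> int:
    """approximate token count from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_total_ms(distance_km: float, response_words: int,
                      tokens_per_second: float) -> float:
    """network floor plus sequential generation time.

    tokens_per_second is a hypothetical decode speed chosen by the caller.
    """
    generation_ms = estimate_tokens(response_words) / tokens_per_second * 1000
    return min_round_trip_ms(distance_km) + generation_ms


if __name__ == "__main__":
    lagos_london_km = 5_000  # straight-line figure used in the text
    print(f"round-trip floor: {min_round_trip_ms(lagos_london_km):.0f} ms")
    print(f"300 words is about {estimate_tokens(300)} tokens")
    total = estimate_total_ms(lagos_london_km, 300, 50)
    print(f"total at 50 tok/s: {total:.0f} ms")
```

under these assumptions the distance floor comes out at about 50ms and a 300-word response at about 400 tokens. generation dominates the total, but the network cost is paid before the first token shows up, which is why it's the part you notice in a streaming ui.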
and all of that compute is happening on a gpu inside a data centre consuming real electricity.

your laptop has a cpu, a central processing unit. it's designed for general tasks: running your browser, compiling your code, handling your operating system. very fast at one thing at a time.

a gpu, a graphics processing unit, was originally designed to render video games. it has thousands of smaller cores that can do many calculations simultaneously. it turns out this parallel architecture is exactly what llm inference needs: running the same mathematical operations across billions of parameters at once.

a single high-end gpu used for llm inference, an nvidia h100, costs around $30,000. a data centre running a frontier model has thousands of them.

when you call an llm api, your request is routed to one of these gpus, or for the largest models, a group of them working together. if that hardware is busy processing other users' requests, yours waits. that wait is real. it shows up as latency on your end. this is a large part of what rate limits are actually enforcing: the physical capacity of the hardware.

you've noticed that sometimes the very first call in a while takes noticeably longer. this isn't imaginary. it's a cold start.

models are large. a frontier model can be hundreds of gigabytes of weights, the numbers that encode what the model knows. those weights need to be loaded into gpu memory before inference can happen. if no request has come in for a while, the system may have partially unloaded the model to free up memory for other things. the first request has to wait for the model to load back in. subsequent requests hit the already-warm model and feel faster.

serverless llm deployments are especially prone to this. you pay less when traffic is low. but your users feel the first request after a quiet period.

nigeria's 17 data centres, 14 of them in lagos, run almost entirely on diesel generators. the national grid provides on average four hours of power per day.
every data centre makes up the difference with generators burning diesel around the clock. this is expensive. it's also why local cloud infrastructure hasn't scaled the way it has in markets with stable power.

the consequence for you as a developer: every llm api call you make routes to a server that is not in nigeria. not in west africa. often not even on the continent. you are paying the latency cost of that distance on every single request, for every single user you have.

this isn't a software problem. it's a geography and infrastructure problem. and it has a direct effect on how your ai-powered products feel to the people using them.

three practical things:

stream the response. don't wait for the full response before showing anything. streaming tokens as they arrive makes the experience feel faster even when it isn't. the perceived latency drops dramatically because the user sees something happening.

cache aggressively. if you're calling the same prompt or near-identical prompts repeatedly, cache the response. inference is expensive. latency is expensive. caching eliminates both for repeated queries.

pick the right model for the job. a 70 billion parameter model is slower and more expensive than a 7 billion parameter model. for many tasks (classification, extraction, short-form generation) the smaller model is sufficient and returns results significantly faster. frontier models are not always the right tool.

data centres exist because computation has to live somewhere physical. it takes power, water, land, and connectivity to run the infrastructure that makes ai feel effortless. africa accounts for less than 1% of global data centre capacity while housing 18% of the world's population. the gap between what the continent generates as digital demand and what it owns as infrastructure is where the latency comes from, where the dependency comes from, where the value extraction happens.

knowing it's a physics problem, not a code problem, changes where you look.
knowing that equinix, aws, and microsoft own most of the continent's usable capacity changes what you think about it.

it's probably not your code. it's a building somewhere running on diesel.

AI helped me research, structure, and edit this piece. The arguments, the examples, and the opinions are mine. So is whatever's wrong with them.