Alessio Dalla Piazza, Simone Businaro, Paolo Maccacaro, Giorgio Roffo
Table of contents #
GPU scarcity is not a temporary supply-chain hiccup. Itβs a structural problem. Hyperscalers are stockpiling capacity, and the primary drivers are simple: demand for AI compute far outpaces what fabs can produce, and the biggest players lock up supply years in advance. On top of that, energy costs keep climbing, pushed higher by geopolitical instability: tensions between the US, Israel, and Iran, port closures disrupting supply chains, and the broader energy market uncertainty that follows. It all adds up to per-hour GPU pricing that keeps going up.
In practice, H100s are already hard to get on short notice, and large B200 clusters in Europe are harder still: expensive, scarce, and often tied to capacity commitments negotiated well in advance.
The dynamics feel familiar if youβve watched hardware markets shift before. Look at DDR5 memory. In 2025, DDR5 prices rose sharply, with spot markets seeing extreme spikes. The transition from DDR4 didnβt just bring faster speeds. It brought a sustained price premium driven by retooling costs, new fabrication processes, and demand that outpaced supply. GPU pricing follows the same pattern, but amplified. When OpenAI, Google, Meta, and Microsoft are competing for the same wafers, the rest of us feel the squeeze.
Not time to short NVIDIA yet π (not financial advice).
But if youβre building AI-powered products, as we do at Equixly with our AI-driven Penetration Testing Agent, you canβt just wait for prices to normalize. You have to be smart about how you use the GPUs you can get. And that starts with understanding what makes LLM inference fundamentally different from traditional workloads. The core idea is simple: LLM inference shouldnβt be routed like stateless web traffic, because the KV cache makes each inference node stateful.
Why round robin doesnβt work for LLM inference #
If youβre not familiar with round robin, itβs the simplest load balancing strategy out there. You have multiple backend servers, and you send each new request to the next in line, cycling through them one by one. Request 1 goes to Server A, request 2 to Server B, request 3 to Server C, then back to A, and so on. It distributes work evenly and doesnβt care about what each server is doing. Simple, effective, and it has worked great for stateless web services for decades.
But LLM inference is not stateless.
When a large language model processes a prompt, it builds an internal data structure called the KV cache (key-value cache). Weβll go deeper on this in a moment, but the short version is: It stores intermediate computations so the model doesnβt have to redo them. This cache is local to the GPU that ran the computation. If the next request in the same conversation lands on a different node, that cache is gone. The new node has to recompute everything from scratch, a process called prefill, which is an expensive phase of inference.
Round robin is completely blind to this. It treats every request as independent and routes them without knowing which node holds useful cached state. The result: redundant computation, wasted GPU cycles, higher latency, and lower throughput.
What about session affinity? #
The natural follow-up is: Why not use sticky sessions?
Session affinity (also called sticky sessions) means you pin a user to a specific backend server. Once a user connects to Server B, all subsequent requests go to Server B as well. This is common in web apps that store session state server-side. It sounds like it should solve the caching problem, right?
Itβs better than round robin, but itβs still too coarse. It makes assumptions at the wrong level:
It pins users, not prefixes. Two different users might share the same system prompt or document context. Session affinity canβt recognize that and route them to the node that already has that prefix cached.It creates hot spots. Power users or long conversations pile up on a single node, while other nodes sit idle. You end up with unbalanced GPU utilization across the fleet.It doesnβt handle node failures well. If a node goes down, every session pinned to it loses its KV cache and must cold-start elsewhere.It ignores multi-turn cache dynamics. In a multi-step agentic workflow, the relevant cache might span multiple prefixes that donβt map cleanly to a single session ID.
In short, each strategy sees one level deeper than the last. Round robin sees nothing: Every request is interchangeable. Session affinity sees the user, but not whatβs actually cached for them. Cache-aware routing identifies the node that actually determines the cost: The one that already holds the relevant prefix in its KV cache. That last level is the one that fits how LLM inference behaves, and itβs what the rest of this post builds toward.
The KV cache: The hidden variable in LLM serving #
To understand why routing matters so much, we need to talk about the KV cache. And to understand the KV cache, we need a quick detour into how transformers actually work.
What is the attention mechanism?
The attention mechanism is the core idea behind modern LLMs. In simple terms, itβs how the model figures out which words (tokens) in a sentence are relevant to each other.
Imagine youβre reading the sentence βThe cat sat on the mat because it was tired.β When you read βitβ, your brain instantly links it back to βcatβ. Thatβs similar to what attention does for the model. In the decoder-only LLMs we use for inference, each token computes a score over all previous tokens to determine whatβs important in context. This is called causal attention: the model can only look backward, never forward, because it generates one token at a time, left to right.
This computation uses what we call tensors. If you havenβt worked with them before, think of a tensor as a multi-dimensional array of numbers. A simple list of numbers is a 1D tensor. A table of numbers is a 2D tensor (a matrix). The attention mechanism projects each token into query, key, and value tensors, compact numerical representations of that token. A tokenβs query is scored against the keys of all the previous tokens, and those scores decide how much of each tokenβs value gets mixed in. The key and value tensors are the ones worth caching, since every later token reuses them.
How the KV cache works
In a transformer-based LLM, every layer of the model computes these key (K) and value (V) tensors for each token in the input sequence. Once computed, they can be reused for subsequent tokens without recomputing them. That reuse is the KV cache.
Important note on architecture: what weβre describing here is the classic quadratic attention mechanism, where the cost grows as O(nΒ²) with sequence length. In hybrid architectures, where some layers use linear attention instead, this picture changes somewhat. But for now, the vast majority of production LLMs still use standard quadratic attention, with the matrix multiplication that makes the KV cache so important.
Hereβs a simplified view of what happens during inference:
Prompt: "You are a security assistant. Analyze this API endpoint..."
βββββββββββββββββββββββ PREFILL PHASE ββββββββββββββββββββββββ
β β
β Token 1 ββ> Attention ββ> K1, V1 ββ β
β Token 2 ββ> Attention ββ> K2, V2 ββ€ β
β Token 3 ββ> Attention ββ> K3, V3 ββΌββ> KV Cache (GPU HBM)β
β ... β β
β Token n ββ> Attention ββ> Kn, Vn ββ β
β β
β Cost: O(n^2) attention across all tokens β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββ DECODE PHASE βββββββββββββββ
β β
β New token ββ> Attend to [K1..Kn, V1..Vn]β
β (read from KV Cache) β
β β
β Cost: O(n), only the new token vs cache β
β β
ββββββββββββββββββββββββββββββββββββββββββββ
Cache HIT: Skip prefill entirely, decode from cached state
Cache MISS: Redo full prefill, O(n^2) recomputation
Four consequences follow from this:
Prefill is expensive. Processing the initial prompt (the prefill phase) means computing attention across all input tokens. For a 4,096-token prompt with standard quadratic attention, thatβs O(nΒ²) work across all layers. The KV cache stores the result, so generating the next token requires attention only between the new token and the cached keys/values: O(n) instead of O(nΒ²).Cache hits eliminate the dominant cost. If a node already holds the KV cache for a given prefix, generating the next token needs only a single forward pass for the new token, no prefill recomputation. Decode is not free (each token still requires a forward pass and is memory-bandwidth-bound), but skipping prefill removes the most expensive phase, especially for long prompts. This is often the difference between milliseconds and seconds of added latency.Cache misses can become extremely expensive at scale. Every cache miss triggers a full prefill. Depending on the architecture, precision, and batching configuration, prefilling a 4K-token prompt on a 70B model (which spans at least two H100s) can add a second or more of latency. Multiply that across thousands of requests per minute, and youβre burning GPU cycles on recomputation that could have been avoided with smarter routing.KV cache memory is finite. Each cached conversation takes up GPU HBM (high-bandwidth memory). A single 70B model serving a 4K-context conversation can use on the order of 1-4 GB of KV cache, depending on precision, GQA configuration, and the number of layers. Nodes have a limited budget, and eviction policies determine what stays and what gets thrown away.**This is why the routing layer also needs to consider metrics such as cache utilization, hit rates, and eviction frequency.**The system has to track which nodes are near capacity and route around them before they start evicting useful state.
The multi-cloud KV cache problem
Multi-cloud makes this sharply worse. When your GPU fleet spans multiple cloud providers (and in the current scarcity landscape, it has to), the KV cache becomes an even more painful constraint.
βββββββββββββββββββββββ THE PROBLEM ββββββββββββββββββββββββ
β β
β Cloud Provider A Cloud Provider B β
β ββββββββββββββββ ββββββββββββββββ β
β β Node A1 β β Node B1 β β
β β KV Cache: β X β KV Cache: β β
β β [######..] β<ββββββββββββββ [........] β β
β β β Too slow β β β
β β Node A2 β to share β Node B2 β β
β β KV Cache: β β KV Cache: β β
β β [####....] β β [##......] β β
β ββββββββββββββββ ββββββββββββββββ β
β β
β * KV cache lives in GPU HBM, expensive to transfer β
β * Cross-provider WAN latency makes migration impracticalβ
β * Each cloud is effectively its own cache island β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
KV cache tensors live in GPU HBM. Theyβre large and tightly coupled to the specific model weights and runtime state. Can you technically externalize or transfer them? Yes. Projects like NIXL and llm-dβs tiered caching prove that KV transfer is possible, especially over RDMA or specialized high-speed interconnects within a datacenter. But across cloud providers, over general-purpose WAN with different interconnects, regions, and latency profiles? Itβs usually not economical for latency-sensitive inference. The transfer time often exceeds the cost of just recomputing the prefill from scratch.
This means that in a multi-cloud setup, each cloud provider is effectively its own cache island. A request that was served on a node in Provider A built up a valuable KV cache there, but if the next request lands on Provider B, that cache might as well not exist. The routing layer has to be aware of these boundaries. It canβt treat the fleet as a flat pool. It has to understand that routing a request to a different cloud provider is almost always a cache miss, and factor that into its decisions.
This constraint shaped a fundamental design choice for us: Route within a cache island first, and cross cloud boundaries only as a last resort or for fresh conversations that have no cache to preserve.
Within a single node, you can also mitigate HBM pressure by off the KV cache to CPU RAM instead of evicting it β vLLMβs KV off connector asynchronously spills cache to DRAM and reloads it on a hit, which helps with preemption recovery and shared-prefix reuse. We chose not to use it: it adds transfer latency and operational complexity, and our bet is on routing each request back to the node that still holds its cache in HBM rather than paging cache around.
Key takeaway:Β LLM inference is stateful. If routing ignores KV-cache locality, it wastes GPU cycles by recomputing prefixes that may already exist elsewhere in the fleet.
Inspiration: vLLM Semantic Router and llm-d #
We didnβt start from scratch. Two open-source projects shaped our thinking.
vLLM Semantic Router
The vLLM Semantic Router is a signal-driven decision routing framework developed by engineers at Red Hat, IBM, and the broader vLLM community. It introduces the concept of routing as an intelligence layer, using encoder-based signals (domain classification, semantic similarity, complexity estimation) to decide which model, path, or policy should handle a given request.
Key ideas we drew from:
Routing as an intelligence layer. The Semantic Routerβs core thesis is that routing decisions should be informed by content-level signals, not just network-level metrics. This shaped how we think about the proxy. Itβs less about prefix-cache-aware scheduling (thatβs closer to llm-dβs domain) and more about the principle that the router shouldunderstandthe request before dispatching it.Semantic caching with category-aware thresholds. Not all queries are equal. The router adjusts similarity thresholds, TTLs, and cache quotas per query category, which avoids the one-size-fits-all problem of naive caching.The 1/W law. Their research shows that tokens per watt roughly halve when the serving context window doubles, thereby making context-length-aware routing a real energy-efficiency lever on top of its latency benefits.
llm-d
llm-d takes a more infrastructure-focused approach. Itβs a distributed inference serving stack built on Kubernetes that provides:
Prefix-cache-aware routing. The inference scheduler knows which nodes hold which prefix caches and routes accordingly, maximizing cache hits and minimizing redundant prefill.Disaggregated prefill/decode. Splitting the prefill phase (compute-heavy) from the decode phase (memory-bandwidth-heavy) onto different node types, optimizing hardware utilization for each.Tiered KV caching. Off KV cache entries from GPU HBM to CPU memory, local SSD, or remote storage, extending the effective cache capacity beyond what a single GPU can hold.Workload autoscaling. SLO-aware scaling that adjusts fleet size based on actual inference metrics (queue depth, time-to-first-token, throughput) rather than generic CPU/memory metrics.
Both projects confirmed our intuition: The load balancer for LLM inference must understand the KV cache topology. Treat the fleet as interchangeable boxes, and you pay for it in recomputed prefills.
Our solution: a cache-aware routing proxy with auto-scaling #
At Equixly, we built a routing proxy that sits between our API gateway and the GPU inference nodes. It combines cache-aware routing with automatic fleet scaling.
Architecture overview
The system has three core components:
Prefix index. Each inference node periodically reports its KV cache state: which tokenized prefix blocks it currently holds, organized as a set of block-level hashes. The proxy maintains an in-memory index mapping these block hashes to nodes and their cache occupancy. This is critical: Matching must happen at thetokenized level, not at the raw text level, because the KV cache blocks are keyed on token IDs, not characters. Two prompts that look almost identical as text can diverge once tokenized (a leading space or a different whitespace run shifts the token boundaries), so only a token-level comparison lines up with the cache blocks a node actually holds.Cache-aware router. For each incoming request, the router tokenizes the prompt prefix (typically the system prompt plus shared context), splits it into fixed-size token blocks, and hashes each block. It then queries the index to compute aprefix overlap score for each candidate node, basically how many contiguous prefix blocks the node already holds. The node with the highest overlap wins. If multiple nodes tie, the one with the lowest current load is selected. If no node has meaningful overlap, the router falls back to least-loaded routing. The router also factors innode-level metrics like cache utilization and eviction rate. A node thatβs 95% full on KV cache memory is a bad target even if it has a high prefix overlap, because your cache entry is likely to get evicted soon anyway.Auto-scaler. The proxy monitors aggregate fleet metrics: average queue depth, cache hit ratio, p99 latency, and per-node cache utilization. In practice, this is one background control loop: Every node exposes a lightweight metrics endpoint, and the loop polls the fleet at fixed intervals, aggregates the samples, and compares them against configured thresholds. When thresholds are breached, it fires asynchronous API calls to cloud providers to provision new GPU nodes. When load subsides, it drains and terminates excess nodes.
Routing decision flow
The routing algorithm follows this logic:
FUNCTION route_request(request):
tokens β tokenize(request.prefix)
blocks β split_into_blocks(tokens, block_size)
hashes β [compute_hash(block) FOR block IN blocks]
// score each node by contiguous prefix block overlap
FOR each node IN fleet:
node.overlap β count_matching_prefix_blocks(node, hashes)
candidates β nodes WHERE overlap > 0
ββ Any candidate with overlap > 0?
β
βββ YES β Sort by overlap (descending), then load (ascending)
β Route to first candidate
β
βββ NO β Route to least-loaded node (cold start)
The tokenize and split_into_blocks
steps are important. Matching happens on token blocks, not raw text. Two slightly different prompts might share the same token-level prefix up to a divergence point. By scoring overlap at the block level, the router can partially reuse a cache even when the full prefix is not an exact match. In many LLM workloads, the system prompt and tool definitions are shared across requests, so even requests from different users can benefit from KV cache reuse. Session affinity canβt achieve that.
Auto-scaling trigger logic
The auto-scaler runs as a background loop:
EVERY scaling_interval:
metrics β collect_fleet_metrics()
// scale up: demand exceeds capacity
IF metrics.avg_queue_depth > queue_threshold
OR metrics.cache_hit_ratio < hit_ratio_floor
OR metrics.p99_latency > latency_ceiling:
desired_nodes β compute_desired_capacity(metrics)
current_nodes β count_active_nodes()
deficit β desired_nodes - current_nodes
IF deficit > 0:
FOR i IN 1..deficit:
// fire async API call to cloud provider
cloud_api.provision_node_async(
gpu_type = preferred_gpu,
region = select_cheapest_region(),
callback = on_node_ready
)
// scale down: excess capacity
ELSE IF metrics.avg_queue_depth < drain_threshold
AND metrics.utilization < utilization_floor:
excess_nodes β select_drainable_nodes(count = scale_down_step)
FOR node IN excess_nodes:
node.drain() // stop accepting new requests
AWAIT node.idle() // wait for in-flight requests to complete
cloud_api.terminate_node(node)
A few design decisions worth calling out:
Region selection at scale-up time. When spinning up new nodes, the proxy queries available regions and picks the best option that meets latency requirements. Multi-cloud buys us redundancy, but it also opens up pricing arbitrage across providers.Async provisioning. Node provisioning is fire-and-forget with a callback. The proxy doesnβt block on cloud API responses. When a new node comes online, it registers itself in the prefix index and starts accepting traffic.Graceful drain on scale-down. Nodes are never killed mid-request. The drain process stops routing new requests, waits for in-flight completions, and only then terminates the instance. This avoids wasted computation and client-visible errors.
Putting it together
The full life cycle looks like this:
βββββββββββββββ
β Request β
ββββββββ¬βββββββ
β
v
βββββββββββββββββββ
β Compute prefix β
β hash β
ββββββββββ¬βββββββββ
β
ββββββββββββββΌβββββββββββββ
β β β
v v v
ββββββββββββ ββββββββββββ ββββββββββββ
β Node A β β Node B β β Node C β
β cache:## β β cache:# β β cache:_ β
ββββββββββββ ββββββββββββ ββββββββββββ
β
v (best cache overlap)
ββββββββββββ
β Route to β
β Node A β
ββββββββββββ
- - - - - - - - - - - - - - - - - - - -
Background:
βββββββββββββββββββββββββ
β Auto-scaler loop β
β β
β queue_depth > T ? βββ> Provision Node D
β cache_hits < F ? βββ> (async cloud API)
β p99 > ceiling ? β
βββββββββββββββββββββββββ
From theory to practice
The pseudocode above describes the high-level design. Now letβs get concrete about how the routing key actually works, since thatβs where cache affinity is won or lost.
Important caveat first: What follows is a simplified slice of a much larger system. The real routing layer also considers node health, current concurrency per host, queue depth, GPU memory pressure, rate limits, request priority tiers, whether a node is mid-drain for scale-down, geographic proximity to the client, and more. Weβre zooming in on the cache-affinity heuristic specifically because itβs the most interesting piece and the hardest to get right. But donβt mistake this for the whole picture.
The idea: For every API request, we extract a stable routing key that acts as a cache-affinity heuristic. It doesnβt precisely identify which KV cache blocks live on which node (thatβs the prefix indexβs job). Instead, it approximates the following: βRequests with similar prefixes should land on the same node.β Itβs probabilistic, not exact, and itβs one signal among many.
FUNCTION extract_routing_key(request):
model β request.model // e.g. "equixly-70b"
budget β 512 characters, split evenly
// grab the system prompt (first half of budget)
sys_content β first_message_where(role = "system")
.content[0 .. budget/2]
// grab the FIRST user message only (second half)
usr_content β first_message_where(role = "user")
.content[0 .. budget/2]
// system + first user because they NEVER CHANGE across turns.
// Turn 1: [system, user]
// Turn 2: [system, user, assistant, user]
// Turn 3: [system, user, assistant, user, assistant, user]
// ^^^^^^^^^^^^^^^^^^^^
// always the same prefix
//
// This gives us a stable heuristic: same conversation
// tends to land on the same node, keeping the KV cache warm.
// It's not perfect (two chats with identical system+first-user
// will collide), but it's a good approximation for routing.
RETURN model + ":" + sys_content + usr_content
Once we have the key, we hash it to pick a preferred node. The simplified version uses hash MOD hosts.length
, but in production with autoscaling (where hosts are added and removed), a plain modulo would remap most keys every time the fleet size changes. The real system uses consistent hashing or rendezvous hashing to minimize disruption when nodes join or leave. Hereβs the simplified view:
FUNCTION route_completion(request, hosts):
key β extract_routing_key(request)
primary_idx β fast_hash(key) MOD hosts.length
// try the cache-affine host first
// if it fails (5xx, timeout, no replica), try next in order
FOR attempt IN 0 .. hosts.length:
idx β (primary_idx + attempt) MOD hosts.length
resp β forward_request(hosts[idx], request)
IF resp is OK or client error (4xx):
RETURN resp // done, cache-affine hit (or miss)
IF resp is server error (5xx):
CONTINUE // try next host
RETURN 502 "All upstream instances failed"
Again, this is the simplified version. The production system weighs multiple signals before picking a host:
Cache affinity(the hash above) is the starting point** Concurrent request countper node, so we donβt pile onto a busy host Queue depth**, because a node with a deep queue will be slow even with a warm cache** GPU memory pressure**, since a node near its KV cache capacity limit might evict your prefix before you benefit from it** Health status**, including recent error rates and latency percentiles** Geographic proximity**, when routing across regions within a cloud provider** Priority tiers**, so high-priority requests get preference on less loaded nodes** Drain state**, so we donβt send traffic to a node thatβs about to be terminated
The cache-affinity hash gives us a good default. The other signals override it when they need to. The result is a system thatβs cache-aware by default but load-aware, health-aware, and cost-aware when it matters.
Why does this approach work in practice? Two reasons:
The routing key is stable across turns. If your key changes every time the user sends a new message, you lose cache affinity. By using only the system + first user messages, the key stays the same throughout the conversation. Itβs a heuristic, not a guarantee, but it works well for the common case.The fallback is ordered, not random. If the preferred host is down, we donβt pick a random fallback. We walk the list in a deterministic order. This means that when a host recovers, requests naturally flow back to it without any coordination.
The production implementation does this without ever calling JSON.parse
on the request body, just fast string scanning. When youβre processing thousands of requests per second at the edge, avoiding a full JSON parse on every request adds up.
Why the performance of the routing layer itself matters
Thereβs a subtle trap here. Youβre building a routing layer to save GPU time, but the router itself runs on every single request. If the routing logic is slow, youβre adding latency to every request before it even reaches the GPU. That defeats the purpose.
This is why we care a lot about the performance of the routing algorithm itself. A few principles we follow:
Never unmarshal the full request body. A chat completion request can be large, especially with long conversation histories. Parsing the entire JSON into an object means allocating memory for every field, every message, every token. We donβt need any of that for routing. We only need the model name, the system prompt prefix, and the first user message prefix. So instead of parsing, we scan the raw bytes looking for the specific keys we need. ThinkindexOf / strstr
style scanning. Itβs O(n) in the worst case, but in practice it finds what it needs within the first few hundred bytes and stops.
// slow:
FUNCTION route_naive(raw_body):
parsed β JSON.parse(raw_body) // allocates entire object tree
model β parsed.model // walks nested structure
system β parsed.messages[0].content
...
// equixly - fast:
FUNCTION route_fast(raw_body):
// scan for "model" key, extract value directly from bytes
model β scan_for_key(raw_body, "model", max_len = 128)
// scan for first "system" role, then its "content"
system β scan_for_key_after(raw_body, "system", "content", max_len = 256)
// scan for first "user" role, then its "content"
user β scan_for_key_after(raw_body, "user", "content", max_len = 256)
...
// No allocation, no object tree, no GC pressure
A note on correctness: Raw byte scanning makes assumptions about the shape of incoming requests. It works because we control the API contract and know what our clients send. If youβre dealing with arbitrary inputs, deeply nested objects, escaped content inside strings, or multimodal payloads with base64 blobs, a streaming parser or validated schema is the safer choice. The tradeoff is deliberate: we accept a constrained input shape in exchange for routing latency under 1ms.
Use a fast hash with good distribution. The hash function runs on every request, so it has to be cheap. We use FNV-1a (a few XORs and multiplies per byte) with a Murmur3 finalizer. The finalizer is important: it avalanches high-bit differences into low bits, which matters a lot when your modulus is small (like % 2 or % 3 for a small fleet). Without the finalizer, youβd get poor distribution, and some nodes would get much more traffic than others.Cap the routing key budget. We donβt hash the entire system prompt or the entire first user message. We cap at 512 characters total (256 per field). This bounds the hash computation time regardless of how long the prompt is. In practice, 256 characters of the system prompt is more than enough to differentiate between workloads, and 256 characters of the first user message is enough to differentiate between conversations.Zero-copy where possible. The request body is read once into a buffer. That same buffer is forwarded to the upstream host. We never copy it, never transform it, never re-encode it. The routing key extraction works on a read-only view of the same byte.
Lessons learned #
A few things weβd hand to anyone building inference infrastructure under these constraints:
LLM inference is stateful. The node that served the last turn holds a warm KV cache; send the follow-up anywhere else, and you eat a full prefill you didnβt need.Cache locality can matter as much as raw capacity. A cache hit skips the prefill phase entirely, often the difference between milliseconds and seconds of latency. More GPUs donβt help if you keep landing on cold ones.Routing has to be token- and model-aware. The cache is keyed on token IDs from a specific tokenizer, so routing inference like generic HTTP traffic throws away cache hits youβve already paid to compute.Small routing choices have outsized infrastructure impact. A sub-millisecond decision made for every request changes how many GPUs you have to rent, and at current prices, that compounds quickly.
Being smart is the only option #
GPU scarcity is not going away. The demand curve for inference compute is steep, and the supply side, constrained by fabrication capacity, energy infrastructure, and geopolitics, canβt keep up. Throwing money at the problem (renting more GPUs, reserving larger clusters) is a losing strategy if youβre not using those GPUs efficiently.
At Equixly, the cache-aware routing proxy has become core infrastructure rather than a nice-to-have. By understanding the KV cache topology of our inference fleet and routing requests accordingly, we extract more useful work from every GPU-second we pay for. By auto-scaling across multiple cloud providers based on real inference metrics, we avoid both over-provisioning (wasted spend) and under-provisioning (degraded service).
Renting more GPUs stopped being a strategy once supply tightened. Whatβs left is using the ones you can actually get, and using them well.
[ ]
Alessio Dalla Piazza
CTO & FOUNDER
Former Founder & CTO of CYS4, he embarked on active digital surveillance work in 2014, collaborating with global and local law enforcement to combat terrorism and organized crime. He designed and utilized advanced eavesdropping technologies, identifying Zero-days in products like Skype, VMware, Safari, Docker, and IBM WebSphere. In June 2016, he transitioned to a research role at an international firm, where he crafted tools for automated offensive security and vulnerability detection. He discovered multiple vulnerabilities that, if exploited, would grant complete control. His expertise served the banking, insurance, and industrial sectors through Red Team operations, Incident Management, and Advanced Training, enhancing client security.
[ ]
Simone Businaro
Head of Solution Architecture
Simone is a seasoned Cloud and Solution Architect with extensive experience in leading transformation projects for major enterprise customers such as UniCredit and Generali. He specializes in Private and Public Cloud adoption, as well as designing Cloud Native platforms including Kubernetes and OpenShift. He is adept at crafting comprehensive automation solutions that seamlessly integrate service lifecycle with customers' internal ITIL and financial processes, ensuring streamlined, efficient, and scalable solutions.
[ ]
Paolo Maccacaro
Staff Cloud DevOps Engineer
Paolo is an experienced Cloud Native DevOps Engineer with a specialized focus on infrastructure design and automation. He possesses a robust background in operations, with extensive experience in both on-premise environments and major cloud hyperscalers. Having worked in both Italy and the UK, Paolo has collaborated with various startups and large international corporations. Throughout his career, he has held diverse roles, contributing effectively as an Individual Contributor, Team Lead, and Architect.
[ ]
Giorgio Roffo
Head of AI
Giorgio is an AI leader with a Ph.D. in Computer Science, focused on machine learning and pattern recognition. His expertise spans computer vision, scalable AI systems, and applied machine learning. He has worked across industry and academia, translating advanced research into reliable production technology. His work includes medical AI, computer vision, and security-focused intelligent systems, with publications in top-tier international venues and multiple research awards.