We fight GPU scarcity without compromise

wpnews.pro

Alessio Dalla Piazza, Simone Businaro, Paolo Maccacaro, Giorgio Roffo

Table of contents #

GPU scarcity is not a temporary supply-chain hiccup. It’s a structural problem. Hyperscalers are stockpiling capacity, and the primary drivers are simple: demand for AI compute far outpaces what fabs can produce, and the biggest players lock up supply years in advance. On top of that, energy costs keep climbing, pushed higher by geopolitical instability: tensions between the US, Israel, and Iran, port closures disrupting supply chains, and the broader energy market uncertainty that follows. It all adds up to per-hour GPU pricing that keeps going up.

In practice, H100s are already hard to get on short notice, and large B200 clusters in Europe are harder still: expensive, scarce, and often tied to capacity commitments negotiated well in advance.

The dynamics feel familiar if you’ve watched hardware markets shift before. Look at DDR5 memory. In 2025, DDR5 prices rose sharply, with spot markets seeing extreme spikes. The transition from DDR4 didn’t just bring faster speeds. It brought a sustained price premium driven by retooling costs, new fabrication processes, and demand that outpaced supply. GPU pricing follows the same pattern, but amplified. When OpenAI, Google, Meta, and Microsoft are competing for the same wafers, the rest of us feel the squeeze.

Not time to short NVIDIA yet 😉 (not financial advice).

But if you’re building AI-powered products, as we do at Equixly with our AI-driven Penetration Testing Agent, you can’t just wait for prices to normalize. You have to be smart about how you use the GPUs you can get. And that starts with understanding what makes LLM inference fundamentally different from traditional workloads. The core idea is simple: LLM inference shouldn’t be routed like stateless web traffic, because the KV cache makes each inference node stateful.

Why round robin doesn’t work for LLM inference #

If you’re not familiar with round robin, it’s the simplest load balancing strategy out there. You have multiple backend servers, and you send each new request to the next in line, cycling through them one by one. Request 1 goes to Server A, request 2 to Server B, request 3 to Server C, then back to A, and so on. It distributes work evenly and doesn’t care about what each server is doing. Simple, effective, and it has worked great for stateless web services for decades.

But LLM inference is not stateless.

When a large language model processes a prompt, it builds an internal data structure called the KV cache (key-value cache). We’ll go deeper on this in a moment, but the short version is: It stores intermediate computations so the model doesn’t have to redo them. This cache is local to the GPU that ran the computation. If the next request in the same conversation lands on a different node, that cache is gone. The new node has to recompute everything from scratch, a process called prefill, which is an expensive phase of inference.

Round robin is completely blind to this. It treats every request as independent and routes them without knowing which node holds useful cached state. The result: redundant computation, wasted GPU cycles, higher latency, and lower throughput.

What about session affinity? #

The natural follow-up is: Why not use sticky sessions?

Session affinity (also called sticky sessions) means you pin a user to a specific backend server. Once a user connects to Server B, all subsequent requests go to Server B as well. This is common in web apps that store session state server-side. It sounds like it should solve the caching problem, right?

It’s better than round robin, but it’s still too coarse. It makes assumptions at the wrong level:

It pins users, not prefixes. Two different users might share the same system prompt or document context. Session affinity can’t recognize that and route them to the node that already has that prefix cached.It creates hot spots. Power users or long conversations pile up on a single node, while other nodes sit idle. You end up with unbalanced GPU utilization across the fleet.It doesn’t handle node failures well. If a node goes down, every session pinned to it loses its KV cache and must cold-start elsewhere.It ignores multi-turn cache dynamics. In a multi-step agentic workflow, the relevant cache might span multiple prefixes that don’t map cleanly to a single session ID.

In short, each strategy sees one level deeper than the last. Round robin sees nothing: Every request is interchangeable. Session affinity sees the user, but not what’s actually cached for them. Cache-aware routing identifies the node that actually determines the cost: The one that already holds the relevant prefix in its KV cache. That last level is the one that fits how LLM inference behaves, and it’s what the rest of this post builds toward.

The KV cache: The hidden variable in LLM serving #

To understand why routing matters so much, we need to talk about the KV cache. And to understand the KV cache, we need a quick detour into how transformers actually work.

What is the attention mechanism?

The attention mechanism is the core idea behind modern LLMs. In simple terms, it’s how the model figures out which words (tokens) in a sentence are relevant to each other.

Imagine you’re reading the sentence “The cat sat on the mat because it was tired.” When you read “it”, your brain instantly links it back to “cat”. That’s similar to what attention does for the model. In the decoder-only LLMs we use for inference, each token computes a score over all previous tokens to determine what’s important in context. This is called causal attention: the model can only look backward, never forward, because it generates one token at a time, left to right.

This computation uses what we call tensors. If you haven’t worked with them before, think of a tensor as a multi-dimensional array of numbers. A simple list of numbers is a 1D tensor. A table of numbers is a 2D tensor (a matrix). The attention mechanism projects each token into query, key, and value tensors, compact numerical representations of that token. A token’s query is scored against the keys of all the previous tokens, and those scores decide how much of each token’s value gets mixed in. The key and value tensors are the ones worth caching, since every later token reuses them.

How the KV cache works

In a transformer-based LLM, every layer of the model computes these key (K) and value (V) tensors for each token in the input sequence. Once computed, they can be reused for subsequent tokens without recomputing them. That reuse is the KV cache.

Important note on architecture: what we’re describing here is the classic quadratic attention mechanism, where the cost grows as O(n²) with sequence length. In hybrid architectures, where some layers use linear attention instead, this picture changes somewhat. But for now, the vast majority of production LLMs still use standard quadratic attention, with the matrix multiplication that makes the KV cache so important.

Here’s a simplified view of what happens during inference:

Prompt: "You are a security assistant. Analyze this API endpoint..."

  ┌────────────────────── PREFILL PHASE ───────────────────────┐
  │                                                            │
  │  Token 1 ──> Attention ──> K1, V1  ─┐                      │
  │  Token 2 ──> Attention ──> K2, V2  ─┤                      │
  │  Token 3 ──> Attention ──> K3, V3  ─┼──> KV Cache (GPU HBM)│
  │  ...                                │                      │
  │  Token n ──> Attention ──> Kn, Vn  ─┘                      │
  │                                                            │
  │  Cost: O(n^2) attention across all tokens                  │
  │                                                            │
  └────────────────────────────────────────────────────────────┘

  ┌────────────── DECODE PHASE ──────────────┐
  │                                          │
  │  New token ──> Attend to [K1..Kn, V1..Vn]│
  │              (read from KV Cache)        │
  │                                          │
  │  Cost: O(n), only the new token vs cache │
  │                                          │
  └──────────────────────────────────────────┘

  Cache HIT:  Skip prefill entirely, decode from cached state
  Cache MISS: Redo full prefill, O(n^2) recomputation

Four consequences follow from this:

Prefill is expensive. Processing the initial prompt (the prefill phase) means computing attention across all input tokens. For a 4,096-token prompt with standard quadratic attention, that’s O(n²) work across all layers. The KV cache stores the result, so generating the next token requires attention only between the new token and the cached keys/values: O(n) instead of O(n²).Cache hits eliminate the dominant cost. If a node already holds the KV cache for a given prefix, generating the next token needs only a single forward pass for the new token, no prefill recomputation. Decode is not free (each token still requires a forward pass and is memory-bandwidth-bound), but skipping prefill removes the most expensive phase, especially for long prompts. This is often the difference between milliseconds and seconds of added latency.Cache misses can become extremely expensive at scale. Every cache miss triggers a full prefill. Depending on the architecture, precision, and batching configuration, prefilling a 4K-token prompt on a 70B model (which spans at least two H100s) can add a second or more of latency. Multiply that across thousands of requests per minute, and you’re burning GPU cycles on recomputation that could have been avoided with smarter routing.KV cache memory is finite. Each cached conversation takes up GPU HBM (high-bandwidth memory). A single 70B model serving a 4K-context conversation can use on the order of 1-4 GB of KV cache, depending on precision, GQA configuration, and the number of layers. Nodes have a limited budget, and eviction policies determine what stays and what gets thrown away.**This is why the routing layer also needs to consider metrics such as cache utilization, hit rates, and eviction frequency.**The system has to track which nodes are near capacity and route around them before they start evicting useful state.

The multi-cloud KV cache problem

Multi-cloud makes this sharply worse. When your GPU fleet spans multiple cloud providers (and in the current scarcity landscape, it has to), the KV cache becomes an even more painful constraint.

┌────────────────────── THE PROBLEM ───────────────────────┐
  │                                                          │
  │  Cloud Provider A              Cloud Provider B          │
  │  ┌──────────────┐              ┌──────────────┐          │
  │  │   Node A1    │              │   Node B1    │          │
  │  │  KV Cache:   │      X       │  KV Cache:   │          │
  │  │  [######..]  │<─────────────│  [........]  │          │
  │  │              │  Too slow    │              │          │
  │  │   Node A2    │  to share    │   Node B2    │          │
  │  │  KV Cache:   │              │  KV Cache:   │          │
  │  │  [####....]  │              │  [##......]  │          │
  │  └──────────────┘              └──────────────┘          │
  │                                                          │
  │  * KV cache lives in GPU HBM, expensive to transfer      │
  │  * Cross-provider WAN latency makes migration impractical│
  │  * Each cloud is effectively its own cache island        │
  │                                                          │
  └──────────────────────────────────────────────────────────┘

KV cache tensors live in GPU HBM. They’re large and tightly coupled to the specific model weights and runtime state. Can you technically externalize or transfer them? Yes. Projects like NIXL and llm-d’s tiered caching prove that KV transfer is possible, especially over RDMA or specialized high-speed interconnects within a datacenter. But across cloud providers, over general-purpose WAN with different interconnects, regions, and latency profiles? It’s usually not economical for latency-sensitive inference. The transfer time often exceeds the cost of just recomputing the prefill from scratch.

This means that in a multi-cloud setup, each cloud provider is effectively its own cache island. A request that was served on a node in Provider A built up a valuable KV cache there, but if the next request lands on Provider B, that cache might as well not exist. The routing layer has to be aware of these boundaries. It can’t treat the fleet as a flat pool. It has to understand that routing a request to a different cloud provider is almost always a cache miss, and factor that into its decisions.

This constraint shaped a fundamental design choice for us: Route within a cache island first, and cross cloud boundaries only as a last resort or for fresh conversations that have no cache to preserve.

Within a single node, you can also mitigate HBM pressure by off the KV cache to CPU RAM instead of evicting it — vLLM’s KV off connector asynchronously spills cache to DRAM and reloads it on a hit, which helps with preemption recovery and shared-prefix reuse. We chose not to use it: it adds transfer latency and operational complexity, and our bet is on routing each request back to the node that still holds its cache in HBM rather than paging cache around.

Key takeaway: LLM inference is stateful. If routing ignores KV-cache locality, it wastes GPU cycles by recomputing prefixes that may already exist elsewhere in the fleet.

Inspiration: vLLM Semantic Router and llm-d #

We didn’t start from scratch. Two open-source projects shaped our thinking.

vLLM Semantic Router

The vLLM Semantic Router is a signal-driven decision routing framework developed by engineers at Red Hat, IBM, and the broader vLLM community. It introduces the concept of routing as an intelligence layer, using encoder-based signals (domain classification, semantic similarity, complexity estimation) to decide which model, path, or policy should handle a given request.

Key ideas we drew from:

Routing as an intelligence layer. The Semantic Router’s core thesis is that routing decisions should be informed by content-level signals, not just network-level metrics. This shaped how we think about the proxy. It’s less about prefix-cache-aware scheduling (that’s closer to llm-d’s domain) and more about the principle that the router shouldunderstandthe request before dispatching it.Semantic caching with category-aware thresholds. Not all queries are equal. The router adjusts similarity thresholds, TTLs, and cache quotas per query category, which avoids the one-size-fits-all problem of naive caching.The 1/W law. Their research shows that tokens per watt roughly halve when the serving context window doubles, thereby making context-length-aware routing a real energy-efficiency lever on top of its latency benefits.

llm-d

llm-d takes a more infrastructure-focused approach. It’s a distributed inference serving stack built on Kubernetes that provides:

Prefix-cache-aware routing. The inference scheduler knows which nodes hold which prefix caches and routes accordingly, maximizing cache hits and minimizing redundant prefill.Disaggregated prefill/decode. Splitting the prefill phase (compute-heavy) from the decode phase (memory-bandwidth-heavy) onto different node types, optimizing hardware utilization for each.Tiered KV caching. Off KV cache entries from GPU HBM to CPU memory, local SSD, or remote storage, extending the effective cache capacity beyond what a single GPU can hold.Workload autoscaling. SLO-aware scaling that adjusts fleet size based on actual inference metrics (queue depth, time-to-first-token, throughput) rather than generic CPU/memory metrics.

Both projects confirmed our intuition: The load balancer for LLM inference must understand the KV cache topology. Treat the fleet as interchangeable boxes, and you pay for it in recomputed prefills.

Our solution: a cache-aware routing proxy with auto-scaling #

At Equixly, we built a routing proxy that sits between our API gateway and the GPU inference nodes. It combines cache-aware routing with automatic fleet scaling.

Architecture overview

The system has three core components:

Prefix index. Each inference node periodically reports its KV cache state: which tokenized prefix blocks it currently holds, organized as a set of block-level hashes. The proxy maintains an in-memory index mapping these block hashes to nodes and their cache occupancy. This is critical: Matching must happen at thetokenized level, not at the raw text level, because the KV cache blocks are keyed on token IDs, not characters. Two prompts that look almost identical as text can diverge once tokenized (a leading space or a different whitespace run shifts the token boundaries), so only a token-level comparison lines up with the cache blocks a node actually holds.Cache-aware router. For each incoming request, the router tokenizes the prompt prefix (typically the system prompt plus shared context), splits it into fixed-size token blocks, and hashes each block. It then queries the index to compute aprefix overlap score for each candidate node, basically how many contiguous prefix blocks the node already holds. The node with the highest overlap wins. If multiple nodes tie, the one with the lowest current load is selected. If no node has meaningful overlap, the router falls back to least-loaded routing. The router also factors innode-level metrics like cache utilization and eviction rate. A node that’s 95% full on KV cache memory is a bad target even if it has a high prefix overlap, because your cache entry is likely to get evicted soon anyway.Auto-scaler. The proxy monitors aggregate fleet metrics: average queue depth, cache hit ratio, p99 latency, and per-node cache utilization. In practice, this is one background control loop: Every node exposes a lightweight metrics endpoint, and the loop polls the fleet at fixed intervals, aggregates the samples, and compares them against configured thresholds. When thresholds are breached, it fires asynchronous API calls to cloud providers to provision new GPU nodes. When load subsides, it drains and terminates excess nodes.

Routing decision flow

The routing algorithm follows this logic:

FUNCTION route_request(request):
    tokens  ←  tokenize(request.prefix)
    blocks  ←  split_into_blocks(tokens, block_size)
    hashes  ←  [compute_hash(block) FOR block IN blocks]

    // score each node by contiguous prefix block overlap
    FOR each node IN fleet:
        node.overlap  ←  count_matching_prefix_blocks(node, hashes)

    candidates  ←  nodes WHERE overlap > 0

    ┌─ Any candidate with overlap > 0?
    │
    ├── YES → Sort by overlap (descending), then load (ascending)
    │         Route to first candidate
    │
    └── NO  → Route to least-loaded node (cold start)

The tokenize and split_into_blocks

steps are important. Matching happens on token blocks, not raw text. Two slightly different prompts might share the same token-level prefix up to a divergence point. By scoring overlap at the block level, the router can partially reuse a cache even when the full prefix is not an exact match. In many LLM workloads, the system prompt and tool definitions are shared across requests, so even requests from different users can benefit from KV cache reuse. Session affinity can’t achieve that.

Auto-scaling trigger logic

The auto-scaler runs as a background loop:

EVERY scaling_interval:
    metrics  ←  collect_fleet_metrics()

    // scale up: demand exceeds capacity
    IF metrics.avg_queue_depth  >  queue_threshold
    OR metrics.cache_hit_ratio  <  hit_ratio_floor
    OR metrics.p99_latency      >  latency_ceiling:

        desired_nodes  ←  compute_desired_capacity(metrics)
        current_nodes  ←  count_active_nodes()
        deficit        ←  desired_nodes - current_nodes

        IF deficit > 0:
            FOR i IN 1..deficit:
                // fire async API call to cloud provider
                cloud_api.provision_node_async(
                    gpu_type   =  preferred_gpu,
                    region     =  select_cheapest_region(),
                    callback   =  on_node_ready
                )

    // scale down: excess capacity
    ELSE IF metrics.avg_queue_depth  <  drain_threshold
        AND   metrics.utilization     <  utilization_floor:

        excess_nodes  ←  select_drainable_nodes(count = scale_down_step)
        FOR node IN excess_nodes:
            node.drain()       // stop accepting new requests
            AWAIT node.idle()  // wait for in-flight requests to complete
            cloud_api.terminate_node(node)

A few design decisions worth calling out:

Region selection at scale-up time. When spinning up new nodes, the proxy queries available regions and picks the best option that meets latency requirements. Multi-cloud buys us redundancy, but it also opens up pricing arbitrage across providers.Async provisioning. Node provisioning is fire-and-forget with a callback. The proxy doesn’t block on cloud API responses. When a new node comes online, it registers itself in the prefix index and starts accepting traffic.Graceful drain on scale-down. Nodes are never killed mid-request. The drain process stops routing new requests, waits for in-flight completions, and only then terminates the instance. This avoids wasted computation and client-visible errors.

Putting it together

The full life cycle looks like this:

                      ┌─────────────┐
                      │   Request   │
                      └──────┬──────┘
                             │
                             v
                    ┌─────────────────┐
                    │  Compute prefix │
                    │      hash       │
                    └────────┬────────┘
                             │
                ┌────────────┼────────────┐
                │            │            │
                v            v            v
          ┌──────────┐ ┌──────────┐ ┌──────────┐
          │  Node A  │ │  Node B  │ │  Node C  │
          │ cache:## │ │ cache:#  │ │ cache:_  │
          └──────────┘ └──────────┘ └──────────┘
                │
                v  (best cache overlap)
          ┌──────────┐
          │ Route to │
          │  Node A  │
          └──────────┘

      - - - - - - - - - - - - - - - - - - - -

      Background:
          ┌───────────────────────┐
          │   Auto-scaler loop    │
          │                       │
          │  queue_depth > T ?    │──>  Provision Node D
          │  cache_hits  < F ?    │──>  (async cloud API)
          │  p99 > ceiling ?      │
          └───────────────────────┘

From theory to practice

The pseudocode above describes the high-level design. Now let’s get concrete about how the routing key actually works, since that’s where cache affinity is won or lost.

Important caveat first: What follows is a simplified slice of a much larger system. The real routing layer also considers node health, current concurrency per host, queue depth, GPU memory pressure, rate limits, request priority tiers, whether a node is mid-drain for scale-down, geographic proximity to the client, and more. We’re zooming in on the cache-affinity heuristic specifically because it’s the most interesting piece and the hardest to get right. But don’t mistake this for the whole picture.

The idea: For every API request, we extract a stable routing key that acts as a cache-affinity heuristic. It doesn’t precisely identify which KV cache blocks live on which node (that’s the prefix index’s job). Instead, it approximates the following: “Requests with similar prefixes should land on the same node.” It’s probabilistic, not exact, and it’s one signal among many.

FUNCTION extract_routing_key(request):
    model    ←  request.model                      // e.g. "equixly-70b"
    budget   ←  512 characters, split evenly

    // grab the system prompt (first half of budget)
    sys_content  ←  first_message_where(role = "system")
                      .content[0 .. budget/2]

    // grab the FIRST user message only (second half)
    usr_content  ←  first_message_where(role = "user")
                      .content[0 .. budget/2]

    // system + first user because they NEVER CHANGE across turns.
    // Turn 1: [system, user]
    // Turn 2: [system, user, assistant, user]
    // Turn 3: [system, user, assistant, user, assistant, user]
    //                  ^^^^^^^^^^^^^^^^^^^^
    //                  always the same prefix
    //
    // This gives us a stable heuristic: same conversation
    // tends to land on the same node, keeping the KV cache warm.
    // It's not perfect (two chats with identical system+first-user
    // will collide), but it's a good approximation for routing.

    RETURN model + ":" + sys_content + usr_content

Once we have the key, we hash it to pick a preferred node. The simplified version uses hash MOD hosts.length

, but in production with autoscaling (where hosts are added and removed), a plain modulo would remap most keys every time the fleet size changes. The real system uses consistent hashing or rendezvous hashing to minimize disruption when nodes join or leave. Here’s the simplified view:

FUNCTION route_completion(request, hosts):
    key          ←  extract_routing_key(request)
    primary_idx  ←  fast_hash(key) MOD hosts.length

    // try the cache-affine host first
    // if it fails (5xx, timeout, no replica), try next in order
    FOR attempt IN 0 .. hosts.length:
        idx   ←  (primary_idx + attempt) MOD hosts.length
        resp  ←  forward_request(hosts[idx], request)

        IF resp is OK or client error (4xx):
            RETURN resp              // done, cache-affine hit (or miss)
        IF resp is server error (5xx):
            CONTINUE                 // try next host

    RETURN 502 "All upstream instances failed"

Again, this is the simplified version. The production system weighs multiple signals before picking a host:

Cache affinity(the hash above) is the starting point** Concurrent request countper node, so we don’t pile onto a busy host Queue depth**, because a node with a deep queue will be slow even with a warm cache** GPU memory pressure**, since a node near its KV cache capacity limit might evict your prefix before you benefit from it** Health status**, including recent error rates and latency percentiles** Geographic proximity**, when routing across regions within a cloud provider** Priority tiers**, so high-priority requests get preference on less loaded nodes** Drain state**, so we don’t send traffic to a node that’s about to be terminated

The cache-affinity hash gives us a good default. The other signals override it when they need to. The result is a system that’s cache-aware by default but load-aware, health-aware, and cost-aware when it matters.

Why does this approach work in practice? Two reasons:

The routing key is stable across turns. If your key changes every time the user sends a new message, you lose cache affinity. By using only the system + first user messages, the key stays the same throughout the conversation. It’s a heuristic, not a guarantee, but it works well for the common case.The fallback is ordered, not random. If the preferred host is down, we don’t pick a random fallback. We walk the list in a deterministic order. This means that when a host recovers, requests naturally flow back to it without any coordination.

The production implementation does this without ever calling JSON.parse

on the request body, just fast string scanning. When you’re processing thousands of requests per second at the edge, avoiding a full JSON parse on every request adds up.

Why the performance of the routing layer itself matters

There’s a subtle trap here. You’re building a routing layer to save GPU time, but the router itself runs on every single request. If the routing logic is slow, you’re adding latency to every request before it even reaches the GPU. That defeats the purpose.

This is why we care a lot about the performance of the routing algorithm itself. A few principles we follow:

Never unmarshal the full request body. A chat completion request can be large, especially with long conversation histories. Parsing the entire JSON into an object means allocating memory for every field, every message, every token. We don’t need any of that for routing. We only need the model name, the system prompt prefix, and the first user message prefix. So instead of parsing, we scan the raw bytes looking for the specific keys we need. ThinkindexOf / strstr

style scanning. It’s O(n) in the worst case, but in practice it finds what it needs within the first few hundred bytes and stops.

// slow:
FUNCTION route_naive(raw_body):
    parsed  ←  JSON.parse(raw_body)     // allocates entire object tree
    model   ←  parsed.model             // walks nested structure
    system  ←  parsed.messages[0].content
    ...

// equixly - fast:
FUNCTION route_fast(raw_body):
    // scan for "model" key, extract value directly from bytes
    model   ←  scan_for_key(raw_body, "model", max_len = 128)
    // scan for first "system" role, then its "content"
    system  ←  scan_for_key_after(raw_body, "system", "content", max_len = 256)
    // scan for first "user" role, then its "content"
    user    ←  scan_for_key_after(raw_body, "user", "content", max_len = 256)
    ...
    // No allocation, no object tree, no GC pressure

A note on correctness: Raw byte scanning makes assumptions about the shape of incoming requests. It works because we control the API contract and know what our clients send. If you’re dealing with arbitrary inputs, deeply nested objects, escaped content inside strings, or multimodal payloads with base64 blobs, a streaming parser or validated schema is the safer choice. The tradeoff is deliberate: we accept a constrained input shape in exchange for routing latency under 1ms.

Use a fast hash with good distribution. The hash function runs on every request, so it has to be cheap. We use FNV-1a (a few XORs and multiplies per byte) with a Murmur3 finalizer. The finalizer is important: it avalanches high-bit differences into low bits, which matters a lot when your modulus is small (like % 2 or % 3 for a small fleet). Without the finalizer, you’d get poor distribution, and some nodes would get much more traffic than others.Cap the routing key budget. We don’t hash the entire system prompt or the entire first user message. We cap at 512 characters total (256 per field). This bounds the hash computation time regardless of how long the prompt is. In practice, 256 characters of the system prompt is more than enough to differentiate between workloads, and 256 characters of the first user message is enough to differentiate between conversations.Zero-copy where possible. The request body is read once into a buffer. That same buffer is forwarded to the upstream host. We never copy it, never transform it, never re-encode it. The routing key extraction works on a read-only view of the same byte.

Lessons learned #

A few things we’d hand to anyone building inference infrastructure under these constraints:

LLM inference is stateful. The node that served the last turn holds a warm KV cache; send the follow-up anywhere else, and you eat a full prefill you didn’t need.Cache locality can matter as much as raw capacity. A cache hit skips the prefill phase entirely, often the difference between milliseconds and seconds of latency. More GPUs don’t help if you keep landing on cold ones.Routing has to be token- and model-aware. The cache is keyed on token IDs from a specific tokenizer, so routing inference like generic HTTP traffic throws away cache hits you’ve already paid to compute.Small routing choices have outsized infrastructure impact. A sub-millisecond decision made for every request changes how many GPUs you have to rent, and at current prices, that compounds quickly.

Being smart is the only option #

GPU scarcity is not going away. The demand curve for inference compute is steep, and the supply side, constrained by fabrication capacity, energy infrastructure, and geopolitics, can’t keep up. Throwing money at the problem (renting more GPUs, reserving larger clusters) is a losing strategy if you’re not using those GPUs efficiently.

At Equixly, the cache-aware routing proxy has become core infrastructure rather than a nice-to-have. By understanding the KV cache topology of our inference fleet and routing requests accordingly, we extract more useful work from every GPU-second we pay for. By auto-scaling across multiple cloud providers based on real inference metrics, we avoid both over-provisioning (wasted spend) and under-provisioning (degraded service).

Renting more GPUs stopped being a strategy once supply tightened. What’s left is using the ones you can actually get, and using them well.

[ ]

Alessio Dalla Piazza

CTO & FOUNDER

Former Founder & CTO of CYS4, he embarked on active digital surveillance work in 2014, collaborating with global and local law enforcement to combat terrorism and organized crime. He designed and utilized advanced eavesdropping technologies, identifying Zero-days in products like Skype, VMware, Safari, Docker, and IBM WebSphere. In June 2016, he transitioned to a research role at an international firm, where he crafted tools for automated offensive security and vulnerability detection. He discovered multiple vulnerabilities that, if exploited, would grant complete control. His expertise served the banking, insurance, and industrial sectors through Red Team operations, Incident Management, and Advanced Training, enhancing client security.

[ ]

Simone Businaro

Head of Solution Architecture

Simone is a seasoned Cloud and Solution Architect with extensive experience in leading transformation projects for major enterprise customers such as UniCredit and Generali. He specializes in Private and Public Cloud adoption, as well as designing Cloud Native platforms including Kubernetes and OpenShift. He is adept at crafting comprehensive automation solutions that seamlessly integrate service lifecycle with customers' internal ITIL and financial processes, ensuring streamlined, efficient, and scalable solutions.

[ ]

Paolo Maccacaro

Staff Cloud DevOps Engineer

Paolo is an experienced Cloud Native DevOps Engineer with a specialized focus on infrastructure design and automation. He possesses a robust background in operations, with extensive experience in both on-premise environments and major cloud hyperscalers. Having worked in both Italy and the UK, Paolo has collaborated with various startups and large international corporations. Throughout his career, he has held diverse roles, contributing effectively as an Individual Contributor, Team Lead, and Architect.

[ ]

Giorgio Roffo

Head of AI

Giorgio is an AI leader with a Ph.D. in Computer Science, focused on machine learning and pattern recognition. His expertise spans computer vision, scalable AI systems, and applied machine learning. He has worked across industry and academia, translating advanced research into reliable production technology. His work includes medical AI, computer vision, and security-focused intelligent systems, with publications in top-tier international venues and multiple research awards.

source & further reading

equixly.com — original article Equixly vs. XBOW How an AI agent talked itself into an XXE — and was right AI red teaming vs. AI penetration testing