Claude please rack me a datacenter, make no mistakes


We let Claude rack tens of millions worth of compute. We are not sorry.

If you, like me, were an Infrastructure Engineer at Railway in January 2026 and your CEO just went out and raised a hundred-million-dollar funding round, you would be looking to invest a large chunk of it in the hottest commodity of the year - DDR5 DIMMs! So first you'd order 100s of beefy servers - great.

Now what?

Getting your 8-figure investment operational isn’t so straightforward. We’re talking 4 geographies, 8-9 different datacenter facilities, a half-dozen suppliers, dozens of network providers, dozens of technicians, hundreds of line items, thousands of cables, and lots of velcro. Also, everything needs to line up perfectly inside a 2-to-3 week installation window. The slightest upset causes downstream delays that can throw timelines back by months.

A rough chart of all the steps involved before $$$ becomes deployable compute

Unlike most companies, Railway doesn’t believe in headcount-maxxing; we believe in leverage and high-agency - so we decided we’d assign this all to one-and-a-half people in planning.

So how the fuck did we pull this off?

With the first generation of buildouts, we had the luxury of time.

Our Gen 1 deployments scaled in 8 phases spread across 18 months. We’d build an initial site skeleton in each region, fill in 20% capacity, take a couple of months doing other stuff, then repeat at the next site. From the initial skeleton, we executed copy-pasta scale-ups in batches of 15-20% of site capacity and slowly filled them out.

This kept pace with demand at the time - each server order and install gave us roughly 3 months of capacity runway. In 2026, with the current rate of growth, that same capacity increment would last us 3 weeks...

As a consequence of insane growth, Railway now runs workloads on AWS, GCP, and on our own metal. Combine that growing demand with the state of supply chain (huge demand and resulting shortages of everything from DRAM to glass fiber) - it pays to order all you can up front, get delivery whenever the material becomes available, and operationalize it as fast-as-possible.

To add to urgency, this isn’t even a cost-thing! That compute you buy may be the only compute you can get your hands on; even adding more cloud VMs is now a battle to get allocation.

Demand has required us to add compute at an ever-increasing rate (and this chart does not track build, storage, logging or network nodes)

The 2025-era artisanal human-led Excel build-documentation and site provisioning flow doesn’t cut it anymore. Even hiring enough people and getting them up to speed wouldn’t be fast enough; and given our small team there’s no one we could pull in without dropping something else. Humans were most definitely the bottleneck here.

So how can we get an LLM to cook with physical infrastructure? LLMs work great with closed loops they can iterate on, so that’s what we had to build.

If we deconstruct a datacenter build, it’s very similar to a life-size Lego set. Things assemble in a deterministic order with a known library of parts. So in theory we could write a build_site.py that threw out a magic Excel sheet. That works, and we built something that way last year for the Amsterdam region - but the limiter there was - again - human.

The fixed template generation worked great when you built an entire site, but when the script has a single off-by-one error and you don’t notice, that’s a 6am phone call from a tech asking where port 34 is on a 32-port switch (true story).

The templates needed a lot of manual curation, and any desired deviations added lots of friction and special-casing. Worse, if someone changed the real-life install to not match, the template was unforgiving and any further applications of the template would need manual fixes.

But it’s 2026, and tokens are (still) cheap. To generate this stuff, we first need to give the LLM a framework within which to operate, which includes 4 things:

Version control for physical infrastructure - all add/remove/move operations for hardware need to be tracked as immutable changesets that can be atomically applied and inspected
Design rules checks - a set of auditable rules needs to exist that can be run against devices, cables, racks or sites to verify that they meet our requirements
A library of parts - each server configuration, optic, network switch or cable needs to be modeled as a part in a database, and trackable by model number, manufacturer and order code. We define a schema of attributes for each item kind, so populating this can be as simple as throwing a catalogue URL at an LLM and letting it create the part via MCP
An attribute and constraint system - parts need to expose things like “part X fits in a QSFP cage,” and “Part Y has Port 3 on the front that supports QSFP and QSFP-DD optics,” etc… The more granular the attributes/constraints, the better. (For example, for a PDU, we model the breaker arrangement and trip thresholds.) These are again generated by letting the LLM parse a spec sheet and create the constraint data
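
To make the first primitive concrete, here is a minimal sketch of an immutable changeset applied atomically. This is not Railyard's actual data model: the `Op` and `Changeset` classes, the `apply` function and the asset names are all hypothetical, and the inventory is reduced to a plain mapping of asset ID to rack.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)  # frozen: an operation cannot be edited once recorded
class Op:
    kind: Literal["add", "remove"]
    asset_id: str
    rack: str = ""


@dataclass(frozen=True)
class Changeset:
    description: str  # the rationale travels with the change
    ops: tuple[Op, ...]


def apply(inventory: dict[str, str], cs: Changeset) -> dict[str, str]:
    """Return a new inventory with every op applied, or raise and change nothing."""
    draft = dict(inventory)  # work on a copy so a failure leaves the input untouched
    for op in cs.ops:
        if op.kind == "add":
            if op.asset_id in draft:
                raise ValueError(f"{op.asset_id} already exists")
            draft[op.asset_id] = op.rack
        elif op.kind == "remove":
            if op.asset_id not in draft:
                raise ValueError(f"{op.asset_id} not found")
            del draft[op.asset_id]
    return draft


# Hypothetical usage: rack two servers next to an existing switch.
inventory = {"sw-01": "rack-a1"}
racked = apply(
    inventory,
    Changeset(
        "rack two compute servers",
        (Op("add", "srv-01", "rack-a1"), Op("add", "srv-02", "rack-a1")),
    ),
)
```

Because `apply` only ever returns a fresh mapping, a changeset that fails halfway through leaves the previous inventory intact, which is the property that makes it safe to let an LLM propose changes freely.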

An example of changeset history at a US datacenter - each change is a set of Add/Remove/Modify operations for devices and cables

Sample DRC errors for a changeset indicating incomplete cabling; more severe issues block as errors

For fiber cables we model both generic types and SKUs for specific lengths, as lengths are only known after a cage layout is known - this unblocks getting a draft layout for techs to do measurements

The attributes for switch ports encode the port type and the supported optic kinds
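
As a rough illustration of the attribute and constraint idea, the sketch below checks whether an optic fits a port by comparing the optic's form factor against the kinds the port says it supports. The `Optic` and `Port` classes, the `fits` helper and the SKU strings are hypothetical; the real schema is presumably much richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Optic:
    sku: str  # hypothetical order code
    form_factor: str  # e.g. "QSFP" or "QSFP-DD"


@dataclass(frozen=True)
class Port:
    name: str
    supported: frozenset[str]  # optic kinds this cage accepts


def fits(optic: Optic, port: Port) -> bool:
    """True if the optic's form factor is one the port declares support for."""
    return optic.form_factor in port.supported


# "Port 3 on the front supports QSFP and QSFP-DD optics"
port3 = Port("Port 3", frozenset({"QSFP", "QSFP-DD"}))
ok = fits(Optic("EXAMPLE-QSFPDD", "QSFP-DD"), port3)
rejected = fits(Optic("EXAMPLE-OSFP", "OSFP"), port3)
```

The point of keeping the constraint as data rather than code is that a new part only needs new attributes, not a new rule.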

These primitives alone aren’t enough. LLMs can generate a lot of information cheaply, so we needed the right tools and observability as an operator to leverage them effectively. For us, this involves two additional things:

Making Railyard, our internal DCIM, “real-time” with fast push UI updates and richer interfaces
A curated set of MCP tools with skills documenting processes. Tools range from one that bulk-adds cables, to another that runs design rule checks and returns paginated output. Skills walk through the workflow of racking and cabling equipment using these tools

With this setup, an LLM can translate an instruction like “rack me 5 compute servers at site sjc10” into a set of operations that it can validate and present to the user for approval.

The following is a breakdown of the plan for one such run:

Once executed, the devices show up in Railyard:

A human user can then look at the changeset in the UI, interrogate the LLM about the final result, and then decide whether or not to accept and apply it. Easy! We’re done right? Nope. (Sorry, this is a long blogpost.) Applying a changeset is just the operator saying they accept a change; it still hasn’t manifested in real-life.

If you work with great technicians: their feedback is worth gold. For example, how should you lace cables in a way that’s easy to maintain? It's an art. Thus, initially, we want to be able to modify a design as we gain more expertise, so applying a changeset doesn’t make things permanent but rather, it kicks off a lifecycle for every involved asset, be it a cable or rack device. Our lifecycle is simple, with 5 valid states: Draft, Buildout, Active, Repair, and Decommission

Draft - these assets are mutable, you can delete or move them without penalty via changesets. We assume the design is still in flux at this stage. Iteration happens via changesets, so the history of the design evolving is tracked - including the rationale via changeset descriptions.

Once you’re happy, you promote assets from Draft to Buildout.

Buildout - signals that the information is released to technicians on site. Assets in this state may already be partly installed, so mutation requires that we document the source and end states clearly for any change such that technicians can track it.

Promoting assets to Buildout does two additional things to make logistics easier:

Because a user may promote multiple changesets, we create a Work Order to encapsulate all the different cable/device add/remove/modify operations under one collection
Because the devices and cables are fed from a component library, we auto-generate a Bill of Materials using that data with all the quantities to purchase to fulfill the work order
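
Generating a Bill of Materials from a work order is mostly a counting exercise once every device and cable points at a library part. A minimal sketch, assuming a work order is a list of `(action, order_code)` pairs and that only additions need purchasing; the `PARTS` table and its order codes are made up for illustration:

```python
from collections import Counter

# Hypothetical part library keyed by order code.
PARTS = {
    "CAB-DAC-1M": "1 m DAC cable",
    "SRV-COMPUTE": "compute server",
}


def bill_of_materials(ops: list[tuple[str, str]]) -> list[tuple[str, str, int]]:
    """Count the parts a work order adds and return (code, description, qty) rows."""
    counts = Counter(code for action, code in ops if action == "add")
    return [(code, PARTS[code], qty) for code, qty in sorted(counts.items())]


work_order = [
    ("add", "SRV-COMPUTE"),
    ("add", "CAB-DAC-1M"),
    ("add", "CAB-DAC-1M"),
    ("remove", "CAB-DAC-1M"),  # removals do not need purchasing
]
bom = bill_of_materials(work_order)
```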

Once the technicians have completed all the work, a qualification process is undertaken and the devices are promoted to the Ready state.

Ready is the ultimate “in-production” marker, and the only mutation possible is if there is a removal or in-place repair.
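
The lifecycle can be read as a small state machine. The sketch below uses the five state names listed earlier and treats the in-production state (called Active in the state list and Ready in the prose) as a single state. The allowed transitions are an assumption inferred from the descriptions, not a dump of Railyard's rules.

```python
from enum import Enum


class State(Enum):
    DRAFT = "Draft"
    BUILDOUT = "Buildout"
    ACTIVE = "Active"  # the "in-production" state
    REPAIR = "Repair"
    DECOMMISSION = "Decommission"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    State.DRAFT: {State.BUILDOUT},
    State.BUILDOUT: {State.ACTIVE},
    State.ACTIVE: {State.REPAIR, State.DECOMMISSION},
    State.REPAIR: {State.ACTIVE, State.DECOMMISSION},
    State.DECOMMISSION: set(),
}


def promote(current: State, target: State) -> State:
    """Return the target state if the move is allowed, otherwise raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = promote(State.DRAFT, State.BUILDOUT)
state = promote(state, State.ACTIVE)
```

Skipping a step, for example going straight from Draft to Active, raises instead of silently marking unqualified hardware as in production.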

This lets us model the design, build and operate lifecycle, but how does that translate to real life?

When we build-out sites, we leverage a single global contractor who has teams in every single geo we are present in. But a key challenge is in getting all the information we put together into a format that is actually useful to the technicians in the cage.

We found that the printed cabling matrix is still king here. But unlike how we started - this document is the output of our tooling rather than the input. Using the go-excelize library, our Golang backend can easily fill in a skeleton template with the cables and device lists affected.

Working with our contractors, we optimised this sheet towards their preferred workflow

Cables are always oriented by device hierarchy in the network, eg: switch to server cables are grouped and ordered by switch and port
Cables are split up by rack to align with the workflow where team leads assign racks to certain techs to cable
Cabling errors detected by the validation workflows (covered in a future blogpost) can be presented in-line next to each cabling line
Cables from completed work orders are marked in green, allowing techs to skip over them
Things that were “nice to have” like a changelog or matrix breakdowns by cable type are now super cheap to generate
Label strings are autogenerated in a format that is compatible with our preferred cable labeling system
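
The ordering and grouping rules above are easy to express in code. This is a sketch only: the `Cable` fields, the `label` format and the `matrix_by_rack` helper are hypothetical, and the real export is produced by the Go backend rather than Python.

```python
from collections import defaultdict
from typing import NamedTuple


class Cable(NamedTuple):
    rack: str
    switch: str
    switch_port: int  # kept numeric so port 2 sorts before port 10
    server: str
    server_port: str


def label(c: Cable) -> str:
    """Hypothetical label format; the real one depends on the labeling system."""
    return f"{c.switch}:{c.switch_port} <-> {c.server}:{c.server_port}"


def matrix_by_rack(cables: list[Cable]) -> dict[str, list[str]]:
    """Split cables per rack, ordered by switch and then port within each rack."""
    sheets: dict[str, list[str]] = defaultdict(list)
    for c in sorted(cables, key=lambda c: (c.rack, c.switch, c.switch_port)):
        sheets[c.rack].append(label(c))
    return dict(sheets)
```

Each key of the returned mapping corresponds to one per-rack section that a team lead can hand to a single tech.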

Because the sheet isn’t the easiest thing to process when verifying links, we had Claude build a little UI that operates on a JSON export of the same data - but mocks deeper views. This tool is available to technicians via an iPad in every cage.

Techtool is a read-only view of buildout - it allows interactively tracing through racks, devices and cables

Some “improvements” were regressions from the technicians’ point of view, so it was important for us to be onsite to solicit feedback during the first install. For example, Railyard can generate full color rack elevations - but technicians preferred the simpler Excel wireframe elevation because it was easier to read. These wireframe elevations are generated via code and added to the Excel export.

For buildouts, we provide simple wireframe elevations for clarity. Devices are colour coded by type, and the cell text includes device name and their mounting direction.

So now that the technicians know what to install and where, we can finally get started with the building. This can take a few weeks provided all the material is on-site, assuming everything goes to plan. Sounds like we’re done? Remember, long blogpost, so not quite yet: techs will make mistakes, lasers can have bad days and components can shake loose in transit - basically shit happens in the real world. Verifying everything is ready is a mission in itself, but Claude can help here too.

Our latest Ashburn cage mid-buildout; the boxes are just the final 10% of material being unpacked

With our Gen 1 sites, qualifying hardware was us walking through the cage with a torch and a checklist. We’d spot check a few patches, confirm that there were no red error LEDs anywhere, and check that, broadly, every server link was up once we’d installed the OS by hand. This caught 90% of things, but it required physical presence and lots of time. The 10% of things we couldn't catch became too hard to fix after production workloads landed there.

With Gen 2, the manual process didn’t work.

Firstly, the one project lead couldn’t be in 3 cities at once, and secondly, there were way too many cables, servers, and switches for us to check effectively. So once again, we built a process for an LLM to take this on.

We first began by deleting the Notion Runbooks that were the corpus of our “how to stand up a DC pod” knowledge, and we encoded those steps as a /dc-provision skill. The skill instructs the LLM on how to guide the operator through setup. This involves generating instructions (e.g. “plug laptop into port X”) and scripts for the operator to run. Scripts are templated based on DCIM data for that specific site, reducing the manual effort required.

An extract of the /dc-provision skill

Once the basic network config is set up, the skill transitions into a qualification process. This is composed of two main harnesses:

Fabric validation - MCP tools expose port config, port to interface mappings, transceiver DOM stats, and LLDP neighbour information for each switch. The tools abstract away the Network Device OS, so we are able to interrogate both SONiC and Arista EOS devices using the same frontend
Servers boot into a mini Linux burn-in environment. The agent that runs as PID 1 in this environment exposes an HTTP API that the LLM can interrogate. The agent exposes hardware sensor data, LLDP neighbours and ethtool-derived stats. A small dropbear SSH server allows Claude to SSH in for more intensive debugging

With these two harnesses, the skill can validate wiring, mounting and connection quality (light levels). Part of the input to the skill is a manifest of device serial numbers against that rack position; the LLM can compare this data against the LLDP neighbour info, DCIM data and metadata from devices to determine that devices are installed and connected to our spec.
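
The core of the wiring check is a diff between what the DCIM says should be on each switch port and what LLDP actually reports. A minimal sketch, assuming both sides have already been normalised into a mapping of `(switch, port)` to neighbour hostname; the `validate_links` function and the finding strings are hypothetical.

```python
Link = tuple[str, str]  # (switch name, port name)


def validate_links(expected: dict[Link, str], observed: dict[Link, str]) -> list[str]:
    """Compare DCIM cabling against LLDP neighbours and list every mismatch."""
    findings = []
    for key, want in sorted(expected.items()):
        got = observed.get(key)
        if got is None:
            findings.append(f"{key[0]} {key[1]}: no LLDP neighbour, expected {want}")
        elif got != want:
            findings.append(f"{key[0]} {key[1]}: sees {got}, expected {want}")
    # Ports with a neighbour that the design does not know about at all.
    for key in sorted(set(observed) - set(expected)):
        findings.append(f"{key[0]} {key[1]}: unexpected neighbour {observed[key]}")
    return findings
```

An empty list means the fabric matches the design; anything else is a line item for the technicians, which is roughly what the validation report surfaces.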

A sample fabric validation report from Claude; we’ve seen a steady drop in qualification failures with each successive build as the documentation/feedback loops improve

Server qualification requires its own blog post. The solution is LLM-driven, but includes a 10-hour burn-in process with the LLM reviewing results to spot outliers and detect patterns across the fleet. Simple failures like metrics crossing failure thresholds cause a programmatic fail, but Claude has spotted novel failure scenarios such as servers with outlier temps that turned out to be missing fans and thus were not alerted on by the BMC.
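
In the post the outlier spotting is done by the LLM reading burn-in results. As a point of comparison, a purely statistical version of "this server runs hotter than its peers" can be sketched with a modified z-score based on the median absolute deviation. This is a swapped-in technique, not what Railway describes; the function name, the sample hostnames and the conventional 3.5 cutoff are all assumptions.

```python
from statistics import median


def temp_outliers(temps: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return hosts whose temperature is far from the fleet median.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which is less
    sensitive to the outliers themselves than a mean/stddev z-score.
    """
    values = list(temps.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # every host reads (almost) the same: nothing to flag
        return []
    return sorted(
        host for host, v in temps.items() if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical burn-in readings in degrees Celsius.
fleet = {"srv-01": 61, "srv-02": 63, "srv-03": 62, "srv-04": 60, "srv-05": 84}
hot = temp_outliers(fleet)
```

A server can sit below every absolute alert threshold and still stand out against identical hardware in the same rack, which is the kind of signal a fixed BMC threshold misses.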

All qualification reports get written to our DCIM as notes providing memory to the LLM, audit-trail for operators and documentation for future on-call engineers.

A server qualification report generated by Claude - a single fan speed sensor stuck at max due to a faulty tachometer

With the site qualified, we’re pretty much at the point that we’re good to provision and go live; and in the process, we’re one step closer to bringing a DC online from a train (career goal for many a Platform Engineer at Railway). But we’d be selling you lies if we said vibing metal was 100% trouble free; in reality, it got to about 99%.

It might surprise you to learn that the LLM-ified buildout workflow performed on average much better than our prior human-driven design.

The main errors were often operator driven or originating from bad constraints given to the system - the fixes mostly involved adding more check rules and tightening the constraints.

Eg: during some cable moves, the operator forgot to re-add a number of cables, leading to a device being racked without network access - a design rule to validate a minimum 1-pair of diverse switch links fixed this footgun.
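
A design rule of that shape is short to write. The sketch below reads "a minimum 1-pair of diverse switch links" as "at least two links that land on different switches"; that reading, the function name and the data layout are assumptions.

```python
def check_diverse_uplinks(
    devices: list[str], cables: list[tuple[str, str]]
) -> list[str]:
    """Return devices that are cabled to fewer than two distinct switches.

    `cables` is a list of (device, switch) pairs taken from the changeset.
    """
    switches: dict[str, set[str]] = {d: set() for d in devices}
    for device, switch in cables:
        if device in switches:
            switches[device].add(switch)
    return sorted(d for d, s in switches.items() if len(s) < 2)
```

Two links to the same switch still fail the check, since they give no protection against that switch going away.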

The more interesting errors centered around instances where the LLM had no constraints that guided it in the proper direction. Eg: the LLM can pick power sockets based on phase balance and PDU breaker distribution, but it has no context as to the positioning of a PDU socket on a vertical rack PDU relative to the power input of a server - the result - a server at the bottom of the rack needing a 3m C14/C13 power cable to reach a socket at the top of the rack. For now, we’ve addressed this by way of a skill that would prompt for user input - but cabling could be modeled better.
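
One possible way to give the model that missing context is to record the rack-unit height of each PDU socket and estimate the cable run before picking one. This is only a sketch of the idea: the half-metre slack allowance, the function names and the notion of a socket "height in U" are assumptions, and it ignores phase balance entirely. The one hard fact used is that a rack unit is 1.75 inches, or 44.45 mm.

```python
from typing import Optional

RACK_UNIT_M = 0.04445  # 1U = 1.75 in = 44.45 mm


def required_cable_m(server_u: int, socket_u: int, slack_m: float = 0.5) -> float:
    """Estimate the vertical run between a server and a PDU socket, plus slack."""
    return abs(server_u - socket_u) * RACK_UNIT_M + slack_m


def pick_socket(
    server_u: int, free_sockets_u: list[int], max_cable_m: float
) -> Optional[int]:
    """Pick the closest free socket reachable with the given cable length."""
    reachable = [
        u for u in free_sockets_u if required_cable_m(server_u, u) <= max_cable_m
    ]
    if not reachable:
        return None
    return min(reachable, key=lambda u: abs(u - server_u))
```

With these assumptions, a server at U2 and a socket at U40 need roughly 2.2 m of cable, so a rule capped at a 1 m cable would steer the choice towards a socket near the bottom of the rack instead.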

4m power cables are A-OK for our robot overlords

LLMs are also not error free, and skills can be loose guidance with large context windows. Anywhere the harness isn’t guiding the LLM through a strict sequence of steps can lead to the LLM forgetting to do certain things. Eg: during /dc-provision Claude “forgot” a few steps during setup that led to a later setup phase failing. Though it was great at picking up the omission and fixing it, no harm done, it still cautions that these systems are non-deterministic and require guardrails. We still haven’t resolved this fully, but in the interim, we’ve found great mileage in having the LLM review device config during the validation phase as a final check.

It goes without saying; config such as the BGP settings driving critical parts of the system such as the dataplane network fabric or transit peering is 100% human reviewed. We also still ensure human-in-the-loop for any steps that may impact user workloads; but as we scale, safe automation will come for these too.

These checks aren’t the sole defence against misbehaving hardware. Once a site is validated, we run a second fully independent test-suite via railway app deploys targeted at the region - these checks verify every component meets our specs from a user-workload perspective.

It’s been 6 months since we decided to build this thing, and 5 since we started using it in anger. Just this past week, we verified the system works by forcing the author onto PTO, giving Mark (our Head of Engineering) the reins to Claude, and seeing how far he got - apart from a few papercuts, remarkably well - another batch of servers onboarded, a dozen or so failing servers identified and more compute shipped to prod!

Not exactly shipping servers from a beach; but close.

So what’s next? Like everything at Railway, this is just our beachhead. As we scale compute by 8x, 16x, 32x and onward - we are committed to investing in the tooling to get us there. For Railyard, this means investing in giving it more awareness of our infrastructure - and consolidating signals so its LLM assists can see broader patterns.

This lands us squarely on why we’re writing this post; because shock/surprise, we’re hiring! Two roles specifically aim at expanding on the things we’ve built here: Senior Infra Engineer: Datacenters is seeking a keen human to tame our metal wielding robots and take Railyard and its tools into 2027 and beyond (you’ll have an 8-figure ramp card for servers), while Senior Infra Engineer: Baremetal Orchestration seeks someone up for wrangling those servers into operational compute.

If either of those roles sounds interesting, or you want to bring your own spin on the farm-to-table of compute, drop us a line at careers@railway.com.

source & further reading

blog.railway.com — original article The New Calculus of AI-Based Coding
