{"slug": "claude-please-rack-me-a-datacenter-make-no-mistakes", "title": "Claude please rack me a datacenter, make no mistakes", "summary": "Railway assigned its datacenter buildout for tens of millions of dollars in compute hardware to one-and-a-half people, using an LLM to plan and execute the installation across four geographies and nine facilities. The company turned to AI after its 2025-era manual processes became a bottleneck, unable to keep pace with demand that now consumes a capacity increment in three weeks instead of three months. The approach treats a datacenter build like a life-size Lego set, generating deterministic assembly instructions to avoid human errors that previously caused delays and costly mistakes.", "body_md": "# Claude please rack me a datacenter, make no mistakes\n\nWe let Claude rack tens of millions worth of compute. We are not sorry.\n\nIf you, like me, were an Infrastructure Engineer at Railway in January 2026 and your CEO just went out and raised a hundred-million-dollar funding round, you would be looking to invest a large chunk of it in the hottest commodity of the year - DDR5 DIMMs!\n\nSo first you'd order 100s of beefy servers - great.\n\nNow what?\n\nGetting your 8-figure investment operational isn’t so straightforward. We’re talking 4 geographies, 8-9 different datacenter facilities, a half-dozen suppliers, dozens of network providers, dozens of technicians, hundreds of line items, thousands of cables, and lots of velcro. Also, everything needs to line up perfectly inside a 2-to-3 week installation window. 
The slightest upset causes downstream delays that can throw timelines back by months.\n\nA rough chart of all the steps involved before $$$ becomes deployable compute\n\nUnlike most companies, Railway doesn’t believe in headcount-maxxing; we believe in leverage and high-agency - so we decided we’d assign this all to one-and-a-half people in planning.\n\nSo how the fuck did we pull this off?\n\nWith the first generation of buildouts, we had the luxury of time.\n\nOur Gen 1 deployments scaled in 8 phases spread across 18 months. We’d build an initial site skeleton in each region, fill in 20% capacity, take a couple of months doing other stuff, then repeat at the next site. From the initial skeleton, we executed copy-pasta scale-ups in batches of 15-20% of site capacity and slowly filled them out.\n\nThis kept pace with demand at the time - each server order and install gave us roughly 3 months of capacity runway. In 2026, with the current rate of growth, that same capacity increment would last us 3 weeks...\n\nAs a consequence of [insane growth](https://x.com/JustJake/status/2029288508402311404), Railway now runs workloads on AWS, GCP, and on our own metal. Combine that growing demand with the state of the supply chain (huge demand and resulting shortages of everything from DRAM to glass fiber) and it pays to order all you can up front, get delivery whenever the material becomes available, and operationalize it as fast as possible.\n\nTo add to the urgency, this isn’t even a cost thing! That compute you buy may be the only compute you can get your hands on; even adding more cloud VMs is now a battle to get allocation.\n\nDemand has required us to add compute at an ever-increasing rate (and this chart does not track build, storage, logging or network nodes)\n\nThe 2025-era artisanal human-led Excel build-documentation and site provisioning flow doesn’t cut it anymore. 
Even hiring enough people and getting them up to speed wouldn’t be fast enough; and given our small team there’s no one we could pull in without dropping something else. Humans were most definitely the bottleneck here.\n\nSo how can we get an LLM to cook with physical infrastructure? LLMs work great with closed loops they can iterate on, so that’s what we had to build.\n\nIf we deconstruct a datacenter build, it’s very similar to a life-size Lego set. Things assemble in a deterministic order with a known library of parts. So in theory we could write a `build_site.py` that threw out a magic Excel sheet. That works, and we built something that way last year for the Amsterdam region - but the limiter there was - again - human.\n\nThe fixed template generation worked great when you built an entire site, but when the script has a single off-by-one error and you don’t notice, that’s a 6am phone call from a tech asking where port 34 is on a 32-port switch (true story).\n\nThe templates needed a lot of manual curation, and any desired deviations added lots of friction and special-casing. Worse, if someone changed the real-life install to not match, the template was unforgiving and any further applications of the template would need manual fixes.\n\nBut it’s 2026, and tokens are (still) cheap. To generate this stuff, we first need to give the LLM a framework within which to operate, which includes 4 things:\n\n- Version control for physical infrastructure - all add/remove/move operations for hardware need to be tracked as immutable changesets that can be atomically applied and inspected\n- Design rule checks - a set of auditable rules needs to exist that can be run against devices, cables, racks or sites to verify that they meet our requirements\n- A library of parts - each server configuration, optic, network switch or cable needs to be modeled as a part in a database, and trackable by model number, manufacturer and order code. 
We define a schema of attributes for each item kind, so populating this can be as simple as throwing a catalogue URL at an LLM and letting it create the part via MCP\n- An attribute and constraint system - parts need to expose things like “part X fits in a QSFP cage,” and “part Y has Port 3 on the front that supports QSFP and QSFP-DD optics,” etc… The more granular the attributes/constraints, the better. (For example, for a PDU, we model the breaker arrangement and trip thresholds.) These are again generated by letting the LLM parse a spec sheet and create the constraint data\n\nAn example of changeset history at a US datacenter - each change is a set of Add/Remove/Modify operations for devices and cables\n\nSample DRC errors for a changeset indicating incomplete cabling; more severe issues block as errors\n\nFor fiber cables we model both generic types and SKUs for specific lengths, as lengths are only known after a cage layout is known - this unblocks getting a draft layout for techs to do measurements\n\nThe attributes for switch ports encode the port type and the supported optic kinds\n\nThese primitives alone aren’t enough. LLMs can generate a lot of information cheaply, so we needed the right tools and observability as an operator to leverage them effectively. For us, this involves two additional things:\n\n- Making Railyard, our internal DCIM, “real-time” with fast push UI updates and richer interfaces\n- A curated set of MCP tools with skills documenting processes. Tools range from one that bulk-adds cables, to another that runs design rule checks and returns paginated output. 
Skills walk through the workflow of racking and cabling equipment using these tools\n\nWith this setup, an LLM can translate an instruction like “rack me 5 compute servers at site sjc10” into a set of operations that it can validate and present to the user for approval.\n\nThe following is a breakdown of the plan for one such run:\n\nOnce executed, the devices show up in Railyard:\n\nA human user can then look at the changeset in the UI, interrogate the LLM about the final result, and then decide whether or not to accept and apply it. Easy! We’re done, right? Nope. (Sorry, this is a long blogpost.) Applying a changeset is just the operator saying they accept a change; it still hasn’t manifested in real life.\n\nIf you work with great technicians, their feedback is worth gold. For example, how should you lace cables in a way that’s easy to maintain? It's an art. Thus, initially, we want to be able to modify a design as we gain more expertise, so applying a changeset doesn’t make things permanent; rather, it kicks off a lifecycle for every involved asset, be it a cable or rack device.\n\nOur lifecycle is simple, with 5 valid states: Draft, Buildout, Active, Repair, and Decommission\n\nDraft - these assets are mutable; you can delete or move them without penalty via changesets. We assume the design is still in flux at this stage. Iteration happens via changesets, so the history of the design evolving is tracked - including the rationale via changeset descriptions.\n\nOnce you’re happy, you promote assets from Draft to Buildout.\n\nBuildout - signals that the information is released to technicians on site. 
Assets in this state may already be partly installed, so mutation requires that we document the source and end states clearly for any change so that technicians can track it.\n\nPromoting assets to Buildout does two additional things to make logistics easier:\n\n- Because a user may promote multiple changesets, we create a Work Order to encapsulate all the different cable/device add/remove/modify operations under one collection\n- Because the devices and cables are fed from a component library, we auto-generate a Bill of Materials using that data with all the quantities to purchase to fulfill the work order\n\nOnce the technicians have completed all the work, a qualification process is undertaken and the devices are promoted to the Ready state.\n\nReady is the ultimate “in-production” marker, and the only mutations possible are removal or in-place repair.\n\nThis lets us model the design, build and operate lifecycle, but how does that translate to real life?\n\nWhen we build out sites, we leverage a single global contractor who has teams in every single geo we are present in. But a key challenge is in getting all the information we put together into a format that is actually useful to the technicians in the cage.\n\nWe found that the printed cabling matrix is still king here. But unlike how we started, this document is the output of our tooling rather than the input. 
Using the go-excelize library, our Golang backend can easily fill in a skeleton template with the cables and device lists affected.\n\nWorking with our contractors, we optimised this sheet towards their preferred workflow:\n\n- Cables are always oriented by device hierarchy in the network, eg: switch to server cables are grouped and ordered by switch and port\n- Cables are split up by rack to align with the workflow where team leads assign racks to certain techs to cable\n- Cabling errors detected by the validation workflows (covered in a future blogpost) can be presented in-line next to each cabling line\n- Cables from completed work orders are marked in green, allowing techs to skip over them\n- Things that were “nice to have” like a changelog or matrix breakdowns by cable type are now super cheap to generate\n- Label strings are autogenerated in a format that is compatible with our preferred cable labeling system\n\nBecause the sheet isn’t the easiest thing to process when verifying links, we had Claude build a little UI that operates on a JSON export of the same data - but mocks deeper views. This tool is available to technicians via an iPad in every cage.\n\nTechtool is a read-only view of buildout - it allows interactively tracing through racks, devices and cables\n\nSome “improvements” were regressions from the technicians’ point of view, so it was important for us to be onsite to solicit feedback during the first install. For example, Railyard can generate full-color rack elevations - but technicians preferred the simpler Excel wireframe elevation because it was easier to read. These wireframe elevations are generated via code and added to the Excel export.\n\nFor buildouts, we provide simple wireframe elevations for clarity. Devices are colour coded by type, and the cell text includes the device name and its mounting direction.\n\nSo now that the technicians know what to install and where, we can finally get started with the building. 
This can take a few weeks, provided all the material is on-site and everything goes to plan. Sounds like we’re done? Remember, long blogpost, so not quite yet: techs will make mistakes, lasers can have bad days and components can shake loose in transit - basically shit happens in the real world. Verifying everything is ready is a mission in itself, but Claude can help here too.\n\nOur latest Ashburn cage mid-buildout; the boxes are just the final 10% of material being unpacked\n\nWith our Gen 1 sites, qualifying hardware was us walking through the cage with a torch and a checklist. We’d spot check a few patches, confirm that there were no red error LEDs anywhere, and check that, broadly, every server link was up once we’d installed the OS by hand. This caught 90% of things, but it required physical presence and lots of time. The 10% of things we couldn't catch became too hard to fix after production workloads landed there.\n\nWith Gen 2, the manual process didn’t work.\n\nFirstly, the one project lead couldn’t be in 3 cities at once, and secondly, there were way too many cables, servers, and switches for us to check effectively. So once again, we built a process for an LLM to take this on.\n\nWe began by deleting the Notion Runbooks that were the corpus of our “how to stand up a DC pod” knowledge, and we encoded those steps as a `/dc-provision` skill. The skill instructs the LLM on how to guide the operator through setup. This involves generating instructions (eg: “plug laptop into port X”) and scripts for the operator to run. Scripts are templated based on DCIM data for that specific site, reducing the manual effort required.\n\nAn extract of the /dc-provision skill\n\nOnce the basic network config is set up, the skill transitions into a qualification process. 
This is composed of two main harnesses:\n\n- Fabric validation - MCP tools expose port config, port-to-interface mappings, transceiver DOM stats, and LLDP neighbour information for each switch. The tools abstract away the Network Device OS, so we are able to interrogate both SONiC and Arista EOS devices using the same frontend\n- Servers boot into a mini Linux burn-in environment. The agent that runs as PID 1 in this environment exposes an HTTP API that the LLM can interrogate. The agent exposes hardware sensor data, LLDP neighbours and `ethtool`-derived stats. A small `dropbear` SSH server allows Claude to SSH in for more intensive debugging\n\nWith these two harnesses, the skill can validate wiring, mounting and connection quality (light levels). Part of the input to the skill is a manifest of device serial numbers against their rack positions; the LLM can compare this data against the LLDP neighbour info, DCIM data and metadata from devices to determine that devices are installed and connected to our spec.\n\nA sample fabric validation report from Claude; we’ve seen a steady drop in qualification failures with each successive build as the documentation/feedback loops improve\n\nServer qualification requires its own blog post; the solution is LLM-driven, but includes a 10-hour burn-in process with the LLM reviewing results to spot outliers and detect patterns across the fleet. 
Simple failures like metrics crossing failure thresholds cause a programmatic fail, but Claude has spotted novel failure scenarios such as servers with outlier temps that turned out to be missing fans and were thus never flagged by the BMC.\n\nAll qualification reports get written to our DCIM as notes, providing memory to the LLM, an audit trail for operators and documentation for future on-call engineers.\n\nA server qualification report generated by Claude - a single fan speed sensor stuck at max due to a faulty tachometer\n\nThe site being qualified, we’re pretty much at the point that we’re good to provision and go live, and in the process, we’re one step closer to bringing a DC online from a train (career goal for many a Platform Engineer at Railway). But we’d be selling you lies if we said vibing metal was 100% trouble-free; in reality, it got to about 99%.\n\nIt might surprise you to learn that the LLM-ified buildout workflow performed on average much better than our prior human-driven design.\n\nThe main errors were often operator-driven or originated from bad constraints given to the system - the fixes mostly involved adding more check rules and tightening the constraints.\n\nEg: during some cable moves, the operator forgot to re-add a number of cables, leading to a device being racked without network access - a design rule to validate a minimum of one pair of diverse switch links fixed this footgun.\n\nThe more interesting errors centered around instances where the LLM had no constraints that guided it in the proper direction. Eg: the LLM can pick power sockets based on phase balance and PDU breaker distribution, but it has no context as to the positioning of a PDU socket on a vertical rack PDU relative to the power input of a server. The result: a server at the bottom of the rack needing a 3m C14/C13 power cable to reach a socket at the top of the rack. 
For now, we’ve addressed this by way of a skill that would prompt for user input - but cabling could be modeled better.\n\n4m power cables are A-OK for our robot overlords\n\nLLMs are also not error-free, and skills can be loose guidance with large context windows. Anywhere the harness isn’t guiding the LLM through a strict sequence of steps can lead to the LLM forgetting to do certain things. Eg: during `/dc-provision`, Claude “forgot” a few steps during setup that led to a later setup phase failing. Though it was great at picking up the omission and fixing it (no harm done), it is still a caution that these systems are non-deterministic and require guardrails. We still haven’t resolved this fully, but in the interim, we’ve found great mileage in having the LLM review device config during the validation phase as a final check.\n\nIt goes without saying: config such as the BGP settings driving critical parts of the system such as the dataplane network fabric or transit peering is 100% human-reviewed. We also still ensure human-in-the-loop for any steps that may impact user workloads, but as we scale, safe automation will come for these too.\n\nThese checks aren’t the sole defence against misbehaving hardware. Once a site is validated, we run a second fully independent test-suite via Railway app deploys targeted at the region - these checks verify every component meets our specs from a user-workload perspective.\n\nIt’s been 6 months since we decided to build this thing, and 5 since we started using it in anger. Just this past week, we verified the system works by forcing the author onto PTO, giving Mark (our Head of Engineering) the reins to Claude, and seeing how far he got. Apart from a few papercuts, remarkably well - another batch of servers onboarded, a dozen or so failing servers identified and more compute shipped to prod!\n\nNot exactly shipping servers from a beach, but close.\n\nSo what’s next? Like everything at Railway, this is just our beachhead. 
As we scale compute by 8x, 16x, 32x and onward, we are committed to investing in the tooling to get us there. For Railyard, this means investing in giving it more awareness of our infrastructure - and consolidating signals so its LLM assists can see broader patterns.\n\nThis lands us squarely on why we’re writing this post: because, shock/surprise, we’re hiring! Two roles specifically aim at expanding on the things we’ve built here: [Senior Infra Engineer: Datacenters](https://railway.com/careers/dc-engineer) is seeking a keen human to tame our metal-wielding robots and take Railyard and its tools into 2027 and beyond (you’ll have an 8-figure ramp card for servers), while [Senior Infra Engineer: Baremetal Orchestration](https://railway.com/careers/orchestration-baremetal) seeks someone up for wrangling those servers into operational compute.\n\nIf either of those roles sounds interesting, or you want to bring your own spin on the farm-to-table of compute, drop us a line at careers@railway.com.", "url": "https://wpnews.pro/news/claude-please-rack-me-a-datacenter-make-no-mistakes", "canonical_source": "https://blog.railway.com/p/datacenter-no-mistakes", "published_at": "2026-06-03 00:00:00+00:00", "updated_at": "2026-06-03 19:44:11.450293+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-startups", "ai-products"], "entities": ["Claude", "Railway", "DDR5 DIMMs"], "alternates": {"html": "https://wpnews.pro/news/claude-please-rack-me-a-datacenter-make-no-mistakes", "markdown": "https://wpnews.pro/news/claude-please-rack-me-a-datacenter-make-no-mistakes.md", "text": "https://wpnews.pro/news/claude-please-rack-me-a-datacenter-make-no-mistakes.txt", "jsonld": "https://wpnews.pro/news/claude-please-rack-me-a-datacenter-make-no-mistakes.jsonld"}}