Presentation: The Infrastructure Challenge Behind Production AI

Transcript

Renato Losio: We will chat about the infrastructure challenge behind production AI. A couple of words about today's topic. AI, as we all know, has moved from the experiments we were running a couple of years ago to always-on systems that run whole business operations. As adoption grows, one of the biggest challenges we all face is no longer just building models, but also running them reliably at scale. In the last few months, we have also seen large organizations, even GitHub, a platform that many of us rely on every day, discuss the challenges they face in scaling their capacity and the pressure they experience under the increased workload from AI. If a company at GitHub's scale is rethinking its infrastructure, what should the rest of us be learning?

Today, probably AI is not just increasing the load, but it's really changing the shape of the load we experience in production.

My name is Renato Losio. I'm a staff editor here at InfoQ. I'm joined today by four experts from very different companies, countries, sectors, and backgrounds, who are going to give their perspective. They're all experts in the field, and they will answer your questions and give their feedback on the open topics. Before starting our discussion, I'd like to give each of them a chance to give a short intro.

Luca Bianchi: I'm Luca Bianchi. I'm the CTIO of MESA. I'm focused on developing software for highly regulated sectors. We have products in that domain, and that change of workload is something that we are experiencing, because yesterday, our workload was pretty much databases. Now, they are pretty much tokens. Everything changed in the shape and the amount of workload.

Alex Infanzon: I'm Alex Infanzon. I'm a Solutions Architect with Cockroach Labs, working with engineering teams to help them solve data infrastructure challenges. I talk every day with AI-native companies, fintechs, and platform teams to help them figure out why agentic applications work in a prototype and fail in production. Lately, the most interesting conversations are not about models, they are about the infrastructure underneath them.

Meryem Arik: I'm Meryem. I'm one of the co-founders at Doubleword. We're an inference provider. We've been working in inference and model serving since before ChatGPT came out. Back then, people didn't think it was as important as it is now. Most recently, we've been talking and thinking a lot about tokenomics and the problem of token cost and scalability in inference as it goes to production.

Simerus Mahesh: My name is Simerus. I'm on the founding team of a company called Forge. It's a startup based in San Francisco, backed by a few folks like Greylock, Palantir, BoxGroup. Previously, I've worked at a few other companies like Meta, Google, PlayStation, where I've done a lot of infrastructure-related work, pertinent to data centers, or cloud infrastructure, and even just agentic systems. Now, I'm pretty focused on the AI security side of things, and just overall working on the infrastructure for deploying production AI agents.

Unexpected Infra Bottlenecks in Production AI

Renato Losio: I'd like to start with a very personal curiosity. It's like, for you, all really experts in the field, what's been the most surprising infrastructure bottleneck you have seen in production AI? Because I have my own experience, but I have a limited one. Do you have any suggestion, anything that you didn't expect to see?

Meryem Arik: Has anything happened that I didn't expect to see? I think everything that's happened has been along our prediction. We had two major predictions when we started the company four years ago. One is that open-source models would have to win for a number of reasons, like cost, performance, privacy being some of them. We had another bet that token cost would be a very big issue, and that people would scale very quickly in the amount they're spending on AI and tokens. Both of those have been proven to be true. I think the thing that has surprised us is how true that's gotten, how quickly.

Jevons Paradox is one of those things that we as an industry talk about and know well, but it's one thing talking about and knowing well, and there's another thing living through it where we maybe anticipated that the token cost would go up, let's say, aggregate 10x every year. It seems like it's more like 100x for most organizations. The scale of change has been much faster than we could have possibly expected. I think we were directionally correct. I think the thing that surprised us is the pace, which makes it much more difficult to think about how you build for production. Because if I was building, let's say, an application for production a year ago, I'm probably building it for an order of magnitude less scale than it needs to be for next year. That adds a lot of challenges and complexity.

Renato Losio: Have you seen anything similar? Do you have any other feedback?

Alex Infanzon: Yes, definitely. The token cost is increasing, and it's a big challenge. It's breaking the budget of companies, no matter how large or small they are. These costs are definitely hard to manage. I'm also seeing some other things that are breaking the bank. People used to think that GPU constraints were going to be the cause. What we are seeing is that it's not only GPU cost, it's also the energy used in data centers. Now, if you want to have a data center, you have to plan years in advance because you have to think about locations where there's enough energy to maintain and run these AI applications. That's at the infrastructure level. At the next level up, what we are seeing is that the legacy data layer is becoming very costly to maintain.

Traditional databases are not well suited to AI's high-velocity, high-constraint demands. Distributed SQL is one of the answers to that. We're seeing more and more that databases are used to store all this information and to make AI applications operational. You can store your agents' usage data and then report on how your tokens are behaving and how much memory they are consuming, do traceability, and so forth. The cost of infrastructure is just growing exponentially these days.

Resource Constraints

Renato Losio: Alex mentioned as well electrical power. I was thinking in general, what's the first resource you're likely to run out? Is it more really GPU, database capacity, network, engineering time?

Luca Bianchi: Actually, the first to run out, I think, is usually the availability of the external system. If you are provisioning GPUs yourself, either in the cloud or on-prem, it doesn't matter which, then electricity and the shortage of hardware are probably something that you need to take into account. If you are going, like probably most of us, directly to external providers, whether a public cloud or other alternatives, one of the most challenging issues I have faced when shifting to AI has been the availability of the external systems, which can change dramatically in just a few hours. Let me give you an example. I have experienced some regular or recurring issues with databases in the past.

You provision your database and you know that your database can scale up or down with a given latency, or maybe you can be throttled, but these are quite clear figures that you can handle and also mitigate. On the other side, when you are relying on an AI endpoint, say you are sending a message and you are expecting an answer back in, say, 30 seconds, that answer can fail, that answer can arrive in three or four times the expected time, or that answer can arrive and be truncated. It depends. I have experienced some really nasty issues when we were just moving from, say, noon to 3 p.m. The only difference was that a new region was awake. The U.S. was awake, its users were starting to use the same endpoints and the same constrained resources, and then we were throttled, just by shifting a couple of hours later in the day. This is quite a new challenge when you are dealing with external AI resources.
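The three failure modes described here (a failed call, a slow call, and a truncated answer) are often handled with a retry wrapper. The following is a minimal Python sketch under stated assumptions, not any provider's API: `Reply`, its `finish_reason` field, the `Throttled` error, and `call_with_retries` are all hypothetical names introduced for illustration.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Reply:
    text: str
    # Assumption: "stop" marks a complete answer, "length" a truncated one.
    finish_reason: str


class Throttled(Exception):
    """Hypothetical error raised when the provider rate-limits the caller."""


def call_with_retries(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Invoke call() until it returns a complete reply or attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            reply = call()
            if reply.finish_reason == "stop":
                return reply
            # A truncated answer is treated like a failure and retried.
            last_error = RuntimeError("reply was truncated")
        except (Throttled, TimeoutError) as exc:
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter, so that many throttled
            # clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

In this sketch `call` would wrap the real request with its own timeout. Retrying does not create capacity on a throttled endpoint; it only smooths over short disruptions, so a fallback provider or a queue may still be needed.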

Renato Losio: What's your experience with infrastructure bottlenecks on a personal level? What do you see as the most challenging part, or what have you faced so far?

Simerus Mahesh: I think tagging on to what Alex said about energy consumption and power, I think this is a very critical factor when it comes to working with infrastructure and managing data centers. I was on a data center optimization team where I directly optimized for power consumption, for example. I never found this to be the bottleneck, per se, especially for DevEx-related reasons or even production-related reasons. The main bottleneck, I think, was compute. Most companies that I've worked at, based on anecdotal experience, that have their own data centers, their shortage comes from compute, just because of the fact that it takes a lot to run workloads. It's a lot of things that are needed when doing something like this, especially at such a large scale when it comes to vertical scaling, horizontal scaling.

With the advent of AI and agents, you have things that sometimes keep running even after the response has completed or whatever, because it doesn't follow a regular thread-type model, but more like a process-type model. Let's say, for example, if an agent spins up a sub-agent, it can keep running even if the parent agent just dies for some reason. There are some parallels with operating systems there. Back to the point, compute is the main bottleneck that I've seen. These companies are spending a lot of effort optimizing, configuring their own Kubernetes environments to be able to manage this, handle this. It's an ever-growing system bend. It improves with time, but with AI workloads being more and more unpredictable, I have seen it increase and cause more uncertainty and breakage.

Planning Capacity for Unpredictable Workloads

Renato Losio: It's hard to prepare capacity, but I wonder, how do you plan capacity for a workload where, as Luca mentioned, things can change quickly and you can hardly forecast what the API is going to do? Compared to the past, the change is not a spike anymore, it's 10x or whatever overnight. How can you really plan for that? And how can you minimize the side effects?

Meryem Arik: I think it's so difficult that you should probably not try to do it unless it's your job. I've been working in open-source model inference for a while, and one of the key use cases used to be that you wanted open-source model inference so you could self-host it. Actually, the job of inference has just become too big for most companies to even attempt to self-host at any decent scale, because you end up having to run a full capacity planning operation and build an entire inference stack, which is far more than any business wants to do. I would say that this is best done by the inference companies. It's correct, as someone mentioned earlier, that when you end up on multi-tenant endpoints, you do end up with this noisy neighbor effect.

You'll end up that your endpoints are a little bit too slow when the U.S. wakes up. There's a lot of infrastructure providers, inference providers specifically, who have done a very good job at capacity planning, and they will probably do a better job than you will. As was just mentioned earlier, there just isn't enough compute in the world to satisfy the demand. Even the best capacity planning in the world and the best inference provider in the world still probably doesn't have enough compute. There are still times where there are going to be noisy neighbor effects. I think we've done a relatively good job of this. We mainly serve large volume tasks and large long-running agent tasks. We don't get hit by the same noisy neighbor thing that other people do, but it's very hard.

I think people that try to do that capacity planning themselves and GPU planning themselves will find it very difficult, unless it's their full-time job. That would be my word of caution.

Measuring the AI Cost Per Dev

Renato Losio: How are companies deciding the AI cost per developer? What is the average cost that companies should budget for that? Do you have any insight about how to measure the results based on the cost involved?

Alex Infanzon: It depends. It depends on what infrastructure you have procured, which models you are using, and what your AI application is using. Cost is very difficult to pin down. It's so difficult that Uber recently published, in May, that they had run out of their yearly budget. They consumed it in the first four months of the year. I think the cost was somewhere between $200 and $500 per user. What triggered the consumption was the incentives that they were giving people to use AI. They were encouraging people to use AI, and this went X number of times higher. Actually, they had games with a leaderboard. If you were using more AI, you were at the top of the leaderboard and so forth.

Renato Losio: You're saying basically wrong incentives that were pushing the cost.

Alex Infanzon: You have to make your engineers aware of the cost, and keep in mind that when you are using AI application agents, they are very eager to help the user to answer the question. They figure out ways. If something is not working, they do a loop and then they try again, try again, try a different route. They are creative. With this creativity, there is more token usage, more resources consumed. It can become a nightmare to try to contain the cost. It's not what is the cost right now, but how do you help your organizations, your IT organizations to be responsible and contain that cost? Because as the participant was asking you, it's a problem that is out there today. You are going to run out. Probably Meryem's team and product can help a lot with these kinds of situations.
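One common way to contain an agent that keeps retrying is to give each run a hard budget. The following is a minimal Python sketch, not a description of any product mentioned in this discussion: `TokenBudget`, `BudgetExceeded`, and `run_agent` are hypothetical names, and `step` stands in for one model or tool call that reports how many tokens it used.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when an agent run goes over its token or step limit."""


@dataclass
class TokenBudget:
    max_tokens: int
    max_steps: int
    used_tokens: int = 0
    steps: int = 0

    def charge(self, tokens: int) -> None:
        # Record the step first, so the counters reflect what was spent
        # even when the limit is exceeded.
        self.steps += 1
        self.used_tokens += tokens
        if self.used_tokens > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used_tokens} of {self.max_tokens} tokens"
            )


def run_agent(step, budget: TokenBudget) -> TokenBudget:
    """Run step() until it reports done, stopping at the budget limits.

    step() is a hypothetical callable returning (done, tokens_used).
    """
    while True:
        if budget.steps >= budget.max_steps:
            raise BudgetExceeded(f"reached {budget.max_steps} steps")
        done, tokens = step()
        budget.charge(tokens)
        if done:
            return budget
```

The token check runs after each step, so a run can overshoot by at most one step's worth of tokens. The step limit is checked before the next call, which is what stops a "try again, try a different route" loop that never finishes.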

Meryem Arik: I think the amount that teams are spending on token cost varies wildly. I don't think there's any issue with spending a huge amount of money on tokens as long as they're productive. For example, I have very good friends at companies like NVIDIA and similar companies whose token spend for people in their team is like $15,000 a month for one team member, and they don't mind it because they're like, actually, we needed to hire for this team. It's improving our velocity. We are doing so much more than we could. These are very productive tokens, and so it's fine spending $15,000 a month. I know other companies who are spending $200 a month per employee and are complaining. I think Uber's limit or cap was like $1,500 or whatever it is.

The amount that you're spending on tokens, I think you should spend as much as you think you're getting value out of, and if you are getting value out of $50,000 a month per employee, then you should spend that amount. You need to be getting the value out of it. It's like the value capture, and that is proving difficult for some people. There is for some people also this element of token maxing as Alex was saying, like people putting things on loops and they end up doing incredibly stupid things to try and solve basic problems. I have no issue with spending huge amounts on tokens if it's useful. For us, we've spent a lot on tokens, but I think we deploy them on the whole pretty productively, and so I don't mind.

Alex Infanzon: That is a problem also as well of getting a measure of the ROI. To your point, if you don't get the ROI, then you're just burning money with nothing.

Simerus Mahesh: We're building a governance platform to essentially govern what your agents are doing and how much they're spending. To answer part of the participant's question, you would need some observability-type wrapper around all the agents that you're deploying, or even running within your organization. There are specific ways you can collect this. For example, for all Claude Code sessions, for all Codex sessions, this telemetry and the logs are literally getting stored on your file system. You can directly access them if you want to do a naive solution. Going back to what Alex was saying, governance is a very big issue nowadays, I think. Because first of all, we don't want to let engineers do whatever, because what if they're pasting a bunch of production code, or what if they're pasting in API keys?

We need some sort of way to govern exactly what's getting fed into these models and stuff. That is a very big layer that me and my team are actually tackling right now. It is a very big problem. An out of the box solution obviously takes a lot of time to make, which is why we're focusing on this. The more custom solutions are going to be a bit hacky, which I think you can probably implement in-house maybe, and get a naïve solution working, but something more robust, probably going to be like an out of box solution for it.
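An in-house starting point for cost per developer can be as small as summing token counts from usage records. The following Python sketch makes loud assumptions: the JSON Lines layout with `user`, `input_tokens`, and `output_tokens` fields is hypothetical and is not the actual log format of Claude Code, Codex, or any other tool, and the prices are placeholders rather than any vendor's real rates.

```python
import json
from collections import defaultdict

# Placeholder prices in USD per million tokens (assumption, not real rates).
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


def cost_per_user(jsonl_lines):
    """Sum estimated spend per user from hypothetical JSON Lines records.

    Each record is assumed to look like:
    {"user": "ana", "input_tokens": 1200, "output_tokens": 300}
    """
    totals = defaultdict(float)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        cost = (
            record["input_tokens"] * PRICE_PER_MILLION["input"]
            + record["output_tokens"] * PRICE_PER_MILLION["output"]
        ) / 1_000_000
        totals[record["user"]] += cost
    return dict(totals)
```

A report like this only answers "how much"; judging whether the spend was productive, which is the harder question raised above, still needs a measure of outcome next to it.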

Tracking Company Data in Production

Renato Losio: Actually, that brings me to the next question that is related to how to track company data in production and what are some of the challenges for not experimenting with large data in production? Do you have experience as well with some sensitive data in your sector?

Luca Bianchi: It is a quite difficult question, because it is something that we know quite well, or we're supposed to know quite well, the challenges. It is quite difficult to find a solution that could fit all the different use cases. Let me explain a bit more. We know that we need to keep some data reserved and some other data secure. The degree of reservation or security that we need to provide can change based on the kind of customer, on the kind of sector that you are targeting, and also about the kind of data that you are managing. This means that sometimes you don't have one solution that could fit all. Let me give an example. I have had many customers that decided to go directly for self-hosted models, because then they can have all the data, all the data management, all the data processing within their perimeters.

The problem was that these costs, and even the size and the scalability of the system, were very difficult to plan beforehand, due to two factors. The first is that the price of the underlying hardware resources is constantly changing; the other is that the models and their accuracy are changing as well. If I need to plan for the next six months and decide now which model to bring into production, say some version of Qwen, I basically don't know what is going to happen within that six-month time frame. It is very difficult. Having 100% data locality is difficult.

On the other side, an approach that I've seen being used by a lot of companies, and we are leveraging that as well, is to have different kinds of models. Some local models that could handle very sensitive data and not anonymized data, and then an anonymization layer that could strip all the most sensitive data and then leverage on frontier models in order to be able to balance the shift between data security, data locality, and on the other side, cost and forward-looking, because being able to make predictions within six months, to me, is very difficult right now.
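The split between local models and an anonymization layer in front of frontier models can be sketched as a small router. This is a minimal Python illustration under assumptions: `anonymize` and `route` are hypothetical names, the two regular expressions are deliberately simplistic stand-ins for a real PII detector, and `local_model` and `frontier_model` are placeholder callables.

```python
import re

# Simplistic patterns for illustration only; a real anonymization layer
# would need far broader coverage (names, addresses, identifiers, ...).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER = re.compile(r"\b\d{9,}\b")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def route(prompt: str, must_stay_local: bool, local_model, frontier_model):
    """Send sensitive prompts to the local model, the rest to a frontier
    model after stripping the sensitive parts."""
    if must_stay_local:
        return local_model(prompt)
    return frontier_model(anonymize(prompt))
```

How `must_stay_local` is decided (by customer, by sector, by data class) is the policy question discussed above; the sketch only shows where that decision plugs in.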

Alex Infanzon: I've been noticing for the last couple of years that the database layer is becoming the control plane these days. In the old days, the database was just the repository in the back to get the data. These days, especially in agentic AI, the database has become the control plane. Why is that? Because a lot of the transactions and a lot of the metrics that the agents are generating, the telemetry, is stored in the database, and then you can then produce reports of usage, token utilization, and so forth. Furthermore, what you are also storing is the identity of the agents. You have metadata of the agents stored in the database. Why is this important? Now you have to manage the identity.

For security reasons, you have to manage the identity: treat each agent as an identity, control what it has access to, and be able to revoke that access. The database is also in the control plane because you now have to track what the agent is doing, to have a log so you can audit what's happening in your environment. That can also be stored in the database.

In terms of locality, which Luca was mentioning, you need a database that is distributed globally and can give you local latency for local applications. That's something you can achieve with distributed SQL, by having a database that is distributed across multiple regions in the world but is seen as a single logical database on top. That is the idea. Agents accessing data in Italy will have local latencies because they access nodes that are located in the region. The most important thing for agents is that all this information has to be synchronized and always consistent. That's the key point. You cannot allow for eventual consistency in these types of applications. They have to be consistent. Otherwise, your agents are going to act on wrong information. Rolling back things that agents did with wrong information is going to be tremendously difficult because those actions trigger other things, and it gets out of control.

Rollbacks for AI Applications

Renato Losio: How easy is it actually to roll back an AI capacity cleanly? I'm thinking mostly about traditional applications, but what does it actually mean for an AI application to roll back in this sense?

Meryem Arik: What do you mean with regards to capacity planning?

Renato Losio: If I think about a very simple old, traditional application, I'm thinking, I have my cluster of applications with a number of nodes growing and scaling down and the database that was scaling down capacity, whatever. I actually wonder when I think about a growing number of tokens, growing number of everything, but I think if I want to have an elastic capacity that I pay for what I use, how do you actually scale down? How do I roll back things on AI? Probably I have a completely different mental model I'm coming from. Probably I'm old enough that I'm not used to these new approaches.

Meryem Arik: The majority of people who are using AI in production, with a few exceptions of very regulated businesses, interact with them through third-party API providers.

Renato Losio: They give away the problem.

Meryem Arik: They give away the problem. This is the problem that I have to deal with. I have to deal with the problem of like, how do I scale up and scale down instances very quickly? How do I swap models in and out for each other very quickly? For example, on my stack, let's say I've got 20 different models that I offer and a fixed GPU capacity, how can I offer all of those models in line with the usage that people want? There's a very active research community figuring out how we can do better cold starts, how we can do faster scale-ups. For most people, they've given this problem away to an inference provider, but it's a very real problem that we solve and that we think about.

Simerus Mahesh: I think I can take a little different lens to this, actually. It obviously depends like what you mean by rollback. There are things, obviously, just like rolling back a system prompt, model version, or just even a feature flag, which can be pretty clean, just like standard rollback procedures. Rolling back an AI capability that has already taken action is a lot harder as we move to autonomous agents, coding agents that are working on production and stuff. With agents, the output isn't just text. The agent may have created files, changed configuration, opened pull requests, called cloud APIs, which are not really reversible to some degree, updated state in a database, or a queue, or whatever, or even triggered some downstream workflow on Jenkins or something like CI/CD pipeline, whatever.

At this point, I feel like rollback becomes less about reverting a deployment and more like compensating for side effects. I think the right approach is to design for rollback before launch itself, with feature flags, dry-run modes, and approval gates, especially with these autonomous agents, coming back to the point of governance, which is a very important layer for agents these days. Idempotent operations help too, similar to ACID-compliant databases. I think the cleanest rollback is often just preventing irreversible actions from happening automatically in the first place. I don't think there's a clear answer to this because there are so many variable workloads and different possibilities. I think prevention is probably the key thing to note here, especially with autonomous agents.
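Dry runs, approval gates, and idempotency keys can be combined in one small wrapper around agent actions. The following is a minimal Python sketch under assumptions: `Action`, `Gate`, `deny_all`, and the returned status strings are hypothetical names for illustration, and the in-memory set of keys stands in for durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    name: str
    run: Callable[[], str]
    reversible: bool


def deny_all(action: Action) -> bool:
    """Default approver: nothing irreversible runs without a real approver."""
    return False


@dataclass
class Gate:
    dry_run: bool = True
    approve: Callable[[Action], bool] = deny_all
    executed_keys: set = field(default_factory=set)

    def execute(self, action: Action, idempotency_key: str) -> str:
        # Idempotency: the same key never triggers the side effect twice.
        if idempotency_key in self.executed_keys:
            return f"skipped: {idempotency_key} already executed"
        # Dry run: report what would happen without doing it.
        if self.dry_run:
            return f"dry-run: would execute {action.name}"
        # Approval gate: irreversible actions need an explicit yes.
        if not action.reversible and not self.approve(action):
            return f"blocked: {action.name} needs approval"
        self.executed_keys.add(idempotency_key)
        return action.run()
```

The defaults are the cautious ones (dry run on, approvals denied), which matches the point that prevention is easier than compensation: an agent has to be granted the ability to act irreversibly rather than having it taken away afterwards.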

Luca Bianchi: I totally agree because it is quite difficult to roll back the work that an agent has done. That's quite strange because the history of the conversation is quite easy to be recovered and to be saved. The problem is that the reasoning part and the reason why an agent decided to use a given tool and not another, it is something that is not easily rollbackable. I can roll back the effect of an action. You have done this query to the database, I can roll back that query and do whatever I want. It is very difficult to roll back the reasoning process that led to that query or led to the usage of that tool.

It's a bit of uncertainty that we have in agentic systems and we need to deal with them, and to deal with them with guardrails maybe or just focusing on the effects of the actions.

Alex Infanzon: I totally agree with you. We are working with a very large credit card provider at the moment, and the reason I brought this in is because once you have agents, multi-agents working, one agent can trigger one thing on stale data or wrong data and pass that along to another agent, and that agent is going to do its own thing and maybe trigger two more agents doing actions, downstream actions, and they are going to persist the results that they created into somewhere. All that information and all these records are going to be wrong because they were derived from a wrong assumption from the initial agent. We are working for this credit card company. What we are working on is we are partnering with two other companies, one called DBOS, and that's a product that basically helps track the workflow of the agents.

The workflow is actually stored in the database and it can be rolled back and then take action. You know the exact steps that the agents took so you can trace back and fix things. The other company that we are working with is called Memori, and they are the ones responsible for storing and persisting the memory state of the agents into the database. Also, to be able to figure out how to write and provision better prompts using what the agents already learned that is stored in the database, and then things that they've done in the past, inject that into new prompts and then continue executing. That is also very important. The key thing is by leveraging vector search within the database you can actually look for similarities, things that had happened in the memory that are similar and that they are persisted in the database.

Renato Losio: Search is not really technically a rollback.

Real-Time Saga Pattern and Rollbacks

Participant 1: Would or could real-time Saga pattern that builds up over the conversation help with being able to roll back?

Alex Infanzon: The Saga pattern is good if you have a way to trace the steps of every single agent, every single action that they did, and record those updates. Then when the model fails or the agent fails, then you can revert and start tracing back and undoing things. That's the only way that you can do. In the past, all that was in the log of the database. It was easy just to go to the log and roll back a transaction. Today, it's more complicated than that because you are not rolling back a transaction. You are rolling back a series of actions that occur on different agents that have different states. That is the problem.
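The core of the Saga pattern as described here is to record each completed step together with its undo, then run the undos in reverse order when a later step fails. The following is a minimal single-process Python sketch; `run_saga` and the log strings are hypothetical, and a production version would persist the log durably and deal with compensations that themselves fail.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps is a list of (name, do, undo) tuples, where do and undo are
    callables taking no arguments. Returns (succeeded, log).
    """
    completed = []  # (name, undo) for every step that finished
    log = []
    for name, do, undo in steps:
        try:
            do()
        except Exception:
            log.append(f"failed:{name}")
            # Undo in reverse order: the most recent action first.
            for done_name, done_undo in reversed(completed):
                done_undo()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append((name, undo))
        log.append(f"done:{name}")
    return True, log
```

This only works for actions that have a meaningful `undo`. For steps that are not reversible, the compensation has to be a new forward action (a refund, a correction notice), which is the difficulty raised in this exchange.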

Renato Losio: As Luca mentioned, they could as well be not reversible. You might have to compensate, you might have to act as soon as you had said.

Participant 2: Actually, I was thinking about what the implication of that is. We mentioned privacy before, and we didn't talk too much about security. I was actually wondering, who is getting the worst of this? Is it more the data folks, the platform ones, the on-call ones, the security people? Who is getting the hardest part right now with capacity? What I feel is that we say capacity is limited. We have limited resources. We might scale 10x in a few hours or a few days, whatever. Even large companies like GitHub had challenges. What's next? Who is the one to blame? Who is having the hardest part right now?

Simerus Mahesh: I've done a lot of SRE work, software engineering work, production engineering work. I think I can confidently say that it's still the SREs, the on-call SRE people, or even the platform engineers, but mainly the on-call SRE people. The main reason is that AI workloads are changing shape very quickly. In a normal product, traffic growth is usually tied to user growth or a known launch. With AI systems, the same number of users can suddenly produce much more traffic or load on your production systems. A changed prompt, a new tool call added for your agent, or even increased context can make for a much more aggressive agent loop. It can increase the demand on your infrastructure tenfold.

This goes to show that production can break even when user traffic is normal. Typically, the first people that come to mind when trying to fix these sorts of things are the SREs. The request count might not even be alarming, but each request is now doing a lot more work behind the scenes. To illustrate the point, the agent can run longer, call more tools, create more state. The on-call person isn't just dealing with more traffic, they're dealing with a workload whose cost and behavior can change drastically overnight. Thus, the lack of sleep.
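Because request count can stay flat while the work per request grows, one option is to alert on work per request instead. The following is a minimal Python sketch under assumptions: `work_per_request_alert` is a hypothetical helper, tokens per request is used as the proxy for work, and the threshold factor of 3 is an arbitrary placeholder.

```python
from statistics import mean


def work_per_request_alert(baseline, recent, factor=3.0):
    """Return True when recent requests do much more work than the baseline.

    baseline and recent are lists of tokens used per request. The alert
    compares averages, so it fires even if the number of requests is flat.
    """
    if not baseline or not recent:
        return False  # not enough data to compare
    return mean(recent) > factor * mean(baseline)
```

For example, with a baseline averaging 1,000 tokens per request, a window averaging 4,500 tokens would trigger the alert even though the request rate has not changed. The same comparison could be applied to tool calls or run time per request.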

Alex Infanzon: I would say that the database is always guilty until proven otherwise. It's always the database which is the problem. You can see this when an engineer decides to change the embeddings of the database and the data platform team has to migrate the schema with 40 million rows or more. That's a huge problem. Or if the security team flags a governance gap in agent access, it's the data platform team that needs to build the audit trail. As Simerus was mentioning, the SRE team gets paged at 2 a.m. because the agent loop is hammering the database. Guess who is going to be on the call? It's the DBA. I've been a DBA for many years, so I know that you are always the person responsible until you can prove that you are not.

Renato Losio: Are we saying the old rule that the DBA is always the first culprit?

Alex Infanzon: Yes, you are always a culprit. Your database is slow. Your database is not good. The data is wrong. Yes, I know. You are always guilty until proven innocent. Yes, that is the key thing. It is, as we are saying, very complex to do this. The data platform team has my genuine sympathy for the work they've done. On the other hand, I would say databases are becoming the control plane again of agentic AI, because now we need to keep this state consistent across all of that, all the agents in your applications.

Renato Losio: Do you agree? Do you always see the database as the main culprit?

Luca Bianchi: Yes. I think that it could be, if not the main one, at least one of the most important. Because in this world where we have a lot of moving parts that can be replaced quite easily with maybe some generated code, everything hits the database. At the end of the day, you are going to be slapped in the face by your database: by a poorly designed database, by performance that is not sufficient, or by data that is not structured well enough to be retrieved. There is something quite different from the past, and it is the fact that we used to plan a database for humans. I can remember a very famous book from Hernandez, "Database Design for Mere Mortals". The idea was to allow developers to properly design a database for the data to be retrieved and shown to humans.

Now the logic is changing. In the near future, we will probably have agents retrieving data from databases, maybe from a number of databases. Some constraints, such as the fact that we cannot keep too many databases, or too many different databases, in the cognitive load of the same team, may become less important. I definitely agree databases are where everything fails.

Alex Infanzon: I have to clarify, the database is not failing. You are blaming the database. It's not failing. They're always saying, it's the database. To your point, there are so many agents, so many things accessing the database that a DBA is not capable of keeping up with the demand. That's why, at least at Cockroach Labs, we are working on making the database more intelligent. We are adding AI capabilities to our database. In real time, the AI agent can be monitoring: why is performance stalling? Why are my writes slowing down? Why is my latency going up? All those things can be done automatically by the AI inside the database. At the beginning of the 2000s, Oracle was talking about autonomous databases.

Renato Losio: It's quite some time ago.

Alex Infanzon: We are now realizing that capability by putting agents inside the database. That's from the infrastructure point of view.

Infrastructure - Wisdom from the Past, and the Present #

Renato Losio: Actually, I have a different question in that sense. If we look back at the past, what's one piece of conventional infrastructure wisdom that used to be true and is not true anymore with AI? Do you have any feeling for what has really changed? Because now we are saying, ok, the database is still blamed as the main culprit. Has something else changed in infrastructure, apart from provisioning, where we know that there's not enough capacity out there? Anything else that we can say has really changed significantly?

Meryem Arik: I think something that's changed pretty significantly is the unit economics of running software, which have changed really radically. It used to be the case that people loved investing in software businesses, because they had great margins, and they were super scalable and all of these things, because the underlying infrastructure cost was not the driver. Now that's not true. Now your infrastructure costs are so high that they will be a huge part of your cost base as a business going forward. The days of 70%, 80% margins, it's just not possible anymore. Infrastructure is actually a key cost driver, or maybe the key cost driver in a business. That's a new phenomenon that we're not used to seeing in software businesses.

Simerus Mahesh: AI agents specifically, and AI workloads in general, I think have made obsolete the idea that if demand spikes, you just autoscale your way through it. That definitely worked better when the workload was mostly human driven, bounded by normal user behavior, click patterns, database access patterns. With AI agents, one user action can now trigger a loop of model calls, tool calls, even code execution in sandbox environments, retries, database reads, database writes, cloud API calls. If that loop is inefficient or misconfigured, autoscaling doesn't actually solve the problem. It can actually amplify it by turning a product bug, or just a small inefficiency, into a huge infrastructure bill, or even an outage if the effects cascade downstream.

The new wisdom is: before you scale AI workloads, you first need to bound them. Put limits around runtime execution, tool calls, retries, your multi-tenancy environment setup, your sandbox environment for security and isolation, and blast radius overall. Elasticity and cloud compute are still super useful. The cloud is genuinely a great place. I think bounded autonomy has to come first with these AI agent workloads.
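One way to picture "bound first" is a per-run budget that the agent loop must charge before every step and every tool call. The sketch below is an illustration under assumptions, not an implementation discussed by the panel: `RunBudget`, `BudgetExceeded`, `run_agent`, and `stuck_step` are hypothetical names, and the limits are arbitrary.

```python
import time
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits one of its configured limits."""


class RunBudget:
    """Per-run limits on steps, tool calls, and wall-clock time (hypothetical names)."""

    def __init__(self, max_steps: int, max_tool_calls: int, max_seconds: float) -> None:
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
        self.deadline = time.monotonic() + max_seconds
        self.steps = 0
        self.tool_calls = 0

    def _check_deadline(self) -> None:
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("runtime limit reached")

    def charge_step(self) -> None:
        self._check_deadline()
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step limit reached")
        self.steps += 1

    def charge_tool_call(self) -> None:
        self._check_deadline()
        if self.tool_calls >= self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
        self.tool_calls += 1


def run_agent(step: Callable[[RunBudget], Optional[str]], budget: RunBudget) -> str:
    """Run `step` until it returns an answer, charging the budget before each iteration."""
    while True:
        budget.charge_step()
        answer = step(budget)
        if answer is not None:
            return answer


def stuck_step(budget: RunBudget) -> Optional[str]:
    """Simulates a misconfigured agent that calls a tool and never finishes."""
    budget.charge_tool_call()
    return None


budget = RunBudget(max_steps=5, max_tool_calls=50, max_seconds=60.0)
try:
    run_agent(stuck_step, budget)
    outcome = "finished"
except BudgetExceeded as exc:
    outcome = str(exc)  # the loop is cut off instead of running until the bill arrives
print(outcome, budget.steps, budget.tool_calls)
```

With a budget like this in place, a runaway loop fails fast inside one run, so autoscaling never sees the extra demand in the first place.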

Alex Infanzon: I agree with you. One thing that I've been thinking a lot about in recent years is that I used to think I should do my capacity planning for my expected growth, but with agentic AI, that's really impossible. I think the better idea is to plan for elasticity, to make your architecture elastic. Otherwise, you cannot predict the actual workload that these agents are going to generate. That is a huge problem. I cannot say that with this amount of database infrastructure I can handle it. I need to allow my database to be elastic. To your point, Simerus, what you definitely need is guardrails in the agents to constrain their resource usage. To me, elasticity is going to be key to allow my agentic application to grow or shrink as needed.

Simerus Mahesh: I agree. I do think, though, that elasticity should come second to bounding. You should bound your agents first and ensure that they're secure and don't have a blast radius that's too big. With elasticity, like when you use AWS, you essentially have infinite compute, it's just a matter of cost from the user's perspective. What this can typically do is trigger more and more compute, depending on what you use for autoscaling. If you use Karpenter to generate more nodes, or just create more pods within your Kubernetes environments, you create more and more compute and containers running your programs for things that could be misconfigurations, or behavior that you don't want. I think that's very scary, because it can rack up your AWS bills, for example, or even just lead to slowness.

Renato Losio: There are basically two problems that I see people complaining about. On one side, your AI workload can trigger that scaling and you suddenly have a huge bill. On the other side, cloud providers, due to the capacity constraints that Meryem mentioned before, are putting soft and hard limits in place that most of the time are not enough for what you want to do. You may be using Bedrock, or anything else to access your model, and suddenly you don't have enough capacity there. You have the two problems at the same time: people who complain that they don't have enough capacity, and people who don't put enough constraints on the infrastructure that they actually provision.
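Both sides of that problem meet in the retry path: a client that hits a provider limit should back off rather than hammer the endpoint, and it should give up after a fixed number of attempts rather than loop forever. The following is a generic sketch, not tied to any provider's SDK; `ThrottledError` is a hypothetical stand-in for whatever throttling or capacity error a real client raises, and the delay values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider's throttling or capacity error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on ThrottledError at most max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # Capped exponential delay, scaled by a random factor ("full jitter").
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(ceiling * rng())
    raise ValueError("max_attempts must be at least 1")


# Usage with a fake model call that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky_model_call() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThrottledError("capacity limit reached")
    return "ok"


waits: list[float] = []  # record delays instead of sleeping, to keep the demo fast
result = call_with_backoff(flaky_model_call, sleep=waits.append, rng=lambda: 1.0)
print(result, waits)
```

The `sleep` and `rng` parameters are injected only so the behavior can be checked deterministically; in normal use the defaults apply.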

Alex Infanzon: Yes. What database vendors like CockroachDB are trying to do, to minimize this, is separate compute from storage, which gives you elasticity of compute. If you need more compute resources, you add more compute; if you need more storage, you add more storage. Segregating the two lets you manage your infrastructure better.

Actionable Insights #

Renato Losio: Say I'm an attendee at this roundtable. I enjoyed it and you convinced me of the importance of the topic. What can I do as a practitioner, what should be an action item that I can take care of? It can be reading a book, reading an article, provisioning an instance, or setting some guardrails. What advice can you give that someone can implement tomorrow?

Luca Bianchi: Starting from tomorrow, reconsider your architecture in light of the adoption of AI and how it impacts the infrastructure. I think that architecture is the first and foremost point that you should consider. I think some evergreen books, such as "Building Evolutionary Architectures" by Neal Ford, are really worth reading, because they give you some tenets, some principles that you can apply even in this changing world. Expect things to change. Expect new databases with shared compute to be released pretty soon. Expect new agents. Expect new models. The question is, how do you stay in business with that? The only answer that I could find is building an architecture that is able to evolve, that is able to adapt and use these new advancements as soon as they become available.

Meryem Arik: The biggest piece of advice that I would give is for anyone who hasn't tried open-source models in, let's say, 6 months or 12 months, try the latest open-source models. They have made huge capability improvements over the last months, like GLM 5.2. They are incredibly good, much cheaper. The latency is really good depending on the inference provider. Open-source models, you should definitely try them and have them.

Simerus Mahesh: Going off of what Luca was saying, but not entirely: I don't think you should start by redesigning your entire architecture. That's something that is constantly evolving and something you can't really do in one sitting, because migration is a very big thing. I would pick one production AI workflow, preferably one that is critical either to developers internally or to production itself. Then trace what actually happens when a user or a developer triggers it. Map the full path: model calls, tool calls, database queries. Then ask yourself questions about this path. What's the maximum runtime? What is the maximum number of tool calls? What happens if a dependency slows down? What happens if a model retry goes wrong? I think the goal is to find the places where the system is unbounded.

You obviously may not fix everything, but I think you can usually identify one or two concrete things that you can then build on as the starting point for changing your agentic system or your AI system. I think that's a good first step.
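The questions above can be turned into a small checklist over the mapped path. This is a minimal sketch of that idea, assuming the workflow has already been traced by hand; the `Step` fields, the `find_unbounded` helper, and the example workflow are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One hop in a traced workflow; None means no limit is configured."""
    name: str
    timeout_s: Optional[float] = None
    max_retries: Optional[int] = None
    max_calls_per_run: Optional[int] = None


LIMIT_FIELDS = ("timeout_s", "max_retries", "max_calls_per_run")


def find_unbounded(steps: list[Step]) -> dict[str, list[str]]:
    """Return, per step name, the limits that are missing."""
    findings: dict[str, list[str]] = {}
    for step in steps:
        missing = [name for name in LIMIT_FIELDS if getattr(step, name) is None]
        if missing:
            findings[step.name] = missing
    return findings


# A hypothetical traced path for one workflow: model call, one tool, one database query.
workflow = [
    Step("model_call", timeout_s=30.0, max_retries=2, max_calls_per_run=10),
    Step("search_tool", timeout_s=10.0),
    Step("orders_db_query", max_retries=3, max_calls_per_run=50),
]

findings = find_unbounded(workflow)
for name, missing in findings.items():
    print(f"{name}: no limit set for {', '.join(missing)}")
```

Each entry in `findings` is a candidate for the "one or two concrete things" to fix first.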

Alex Infanzon: For me, it's basically to be more rigorous in evaluating your data infrastructure. Ask your vendor to show you how the data infrastructure works under failure, not under a predefined load. The benchmarks of today, such as TPC-C, are very constrained. They work perfectly on a system that is up and running 100% of the time, but your vendor should now show you what happens if the network partitions, or what happens if a node goes down, or what happens if somebody asks me to change my schema. Can I change a schema online? What happens if a whole region fails? What happens if I'm updating my database software?

For all of this, you need to ask your vendor for benchmarks that show what happens when you are under pressure, because AI agents are going to stress your infrastructure, and things are going to happen, we know that.
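To show the shape of a benchmark that reports behavior under failure rather than only on a healthy system, here is a deliberately tiny simulation. It is a toy under stated assumptions: `ToyCluster` models nothing more than "a write succeeds while a majority of nodes are up", it does not represent any real database, and the numbers it reports are properties of the toy only.

```python
class ToyCluster:
    """Toy quorum model: a write succeeds only while a majority of nodes are up."""

    def __init__(self, node_count: int) -> None:
        self.up = [True] * node_count

    def kill(self, index: int) -> None:
        """Inject a failure by marking one node as down."""
        self.up[index] = False

    def write(self, key: str, value: int) -> bool:
        # The key and value are ignored; only quorum availability is modelled.
        return sum(self.up) > len(self.up) // 2


def success_rate(cluster: ToyCluster, requests: int) -> float:
    """Fraction of writes that succeed in the cluster's current state."""
    ok = sum(cluster.write(f"k{i}", i) for i in range(requests))
    return ok / requests


cluster = ToyCluster(node_count=3)
report = {"healthy": success_rate(cluster, 100)}

cluster.kill(0)  # one node down: a majority (2 of 3) is still available
report["one node down"] = success_rate(cluster, 100)

cluster.kill(1)  # two nodes down: quorum is lost
report["two nodes down"] = success_rate(cluster, 100)

for phase, rate in report.items():
    print(f"{phase}: {rate:.0%} of writes succeeded")
```

A real evaluation would replace the toy with the vendor's system and add the other scenarios from the list above, such as network partitions, online schema changes, and rolling upgrades.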


source & further reading

infoq.com — original article