Presentation: Designing AI Platforms for Reliability: Tools for Certainty, Agents for Discovery

NVIDIA hired former startup founder Aaron Erickson to build a GPU allocation system for internal AI researchers after his own AI-powered reorg startup failed. Erickson's previous company, Orgspace, attempted to use ChatGPT for corporate restructuring but ultimately shut down, leading him to join the chipmaker where he now focuses on managing GPU resources for initiatives like Nemotron and BioNeMo.

Transcript

Aaron Erickson: I'm Aaron Erickson. We're going to talk about why it's not all AI, but sometimes it's AI. "Tools for Certainty, Agents for Discovery" is the name of the talk. Who here has ever had this question? "I know, what if we decided to do some of this with AI?" Has anybody ever had that, and then decided that's a terrible idea? I have all sorts of bad ideas, one of which was something, I know, obviously, GraphQL, terrible idea. Agile, come on, what was that, the 2010s? We all did Agile, and we discovered it was a bad dream or something, maybe this'll become one of those things. Maybe this'll be a new paver stone on this road to hell that is filled with all sorts of good intentions.

Key Works

To that end, let me talk about one thing I built. Before my current life, I worked at this company called Orgspace. We had this idea of what if you could write software to do reorgs? Who has ever done a reorg? Who has ever been subjected to a reorg? Who has ever had a pleasant experience after being subjected to a reorg? No, nobody's had a pleasant experience after being subjected to a reorg. This was some software we built to help you do that. Then, 2023 happened, and there was one immutable truth about 2023. If you had founded a startup and you wanted to raise another round, if you didn't do AI, you weren't going to be a startup anymore. How many non-AI startups have gotten funded that aren't like a coffee shop since 2023? Not very many. We had to have an answer for this. I thought, you know what would be brilliant, is to do this thing where we maybe ask ChatGPT to do the reorg for us. I know. Some of you are thinking, does Aaron just watch Black Mirror and think, that's a good idea. No, we probably shouldn't develop products that way. We were desperate. We were running out of money. We thought, what if we do this plugin? Who remembers ChatGPT plugins? It was like MCP before MCP was cool.
What we were doing is, what if you just type in, "I've heard people are flattening their orgs, can you help me with that?" I think we should do engineering. There we go. It gave a good answer. It pulled in some data from Orgspace, our reorg software. It came up with the kind of plan that if you called up a consulting firm, they'd probably write that plan, because as we learned in the keynote, it produces the most mid output you could imagine. Who here has ever gotten a reorg and ever thought the plan for the reorg wasn't the most mid output you could imagine? It's exactly in that category. It came up with that and we could do this thing where, great, let's actually do the plan. We would do the plan and it would generate this flattened organization for you. It would literally branch your organization, figure out who should move around, who should be in this team or that team according to whatever you type in for how you want to do the reorg, and it would generate it. Then you would come back to this piece of software where it would write your reorg email for you. If you wanted to do it in iambic pentameter, you could do that. If you wanted to do it as a Homeric epic, you could do that. If you want a reorg email as a haiku, you could do it. This was like the thing you would do in 2023.

GPU Fleet Governance, at NVIDIA

Why am I here? It turns out we did not become the future of HR. Everybody can rest easy that you're not reorged by a robot. You're reorged by a consultant, which is totally not a robot. Yes, I know, it was terrible. I crashed out. I ended up at this chip company somewhere in Santa Clara. I heard they do AI things. It was cool. I saw this headquarters, I'm like, that's a headquarters? Isn't that amazing? What did we do? My first job there wasn't doing AI. My first job there was building a system that would allocate GPUs to initiatives.
We have all these internal researchers at NVIDIA, and they all have really cool AI models that they're building, things like Nemotron, things like BioNeMo, things like Cosmos, all these cool things. In the old world, I was building HR software, so we had human resources. We also had open positions in the old software. We also had employees in the old software. Complex hierarchies. Who hasn't seen a complex org with multiple lines of reporting and stuff like that? Performance management, you have calibration at the end of the year. We need that stuff. It turns out a lot of things in my new world were not that different. When people would request GPUs, it's almost like requesting headcount. In fact, more expensive. We're talking like, if you want 1,000 H100s, that might be $20 million, $30 million, $40 million to have it for a month. These are really expensive, more so than people in some cases. You would have idle GPU clusters that are like open positions. We can go down the list. You had AI training jobs. You had things like cloud providers with regions and blocks in them that would be this complex hierarchy of things. Then performance management, the GPUs also have to perform. We have observability. We need to know when fan failures happen, stuff like that. What did I do? Reinvent what I've seen before. We built this thing called Llo11yPop. You might think it's spelled funny, and you'd be right. You might think, who named this thing? That was me. Which is why at NVIDIA, I'm never allowed to name things anymore. They took that away from me. I feel so terrible. What it did was actually, I thought, was pretty clever. We built this system that would use AI. You'd have these things called retrieval agents. Retrieval agents were built to do one simple thing, which was convert a question about something into an API call. You could use Elasticsearch. We used Elasticsearch at the time. 
We would convert some kinds of questions, we'd give it some examples and some RAG, into the proper query. We found if we constrained it, it would work really well. We then had things called analyst agents. They were built for a different reason. They were built to understand what kind of questions should I be able to ask? They might know things like, if I see these conditions about an H100, I should ask this question to get data to validate it. Today, we might call this a deep agent framework. At the time, we were just a bunch of dummies trying to figure out how do we take multiple instances of an LLM and have it actually do useful work against a database, and then turn around and say, in the full vision of this, wouldn't it be great if this could just be an autonomous data center? For the time, it was, ok, why don't we just imagine flagging every GPU cluster that looks a little weird and let the LLM analyze it for a little bit, and then just raise a Jira ticket or raise a Slack message or raise a something. That's what task agents were for, initially. The big vision of it was maybe instead of raising a Jira ticket to tell a human, go do the thing, what if we could instead just have it automatically run some workload that would then remediate it? That was the big vision.

Lessons Learned from the Llo11yPop Project

We didn't get all the way there. As we think about the lessons from the Llo11yPop project, I think we learned a lot, one of which was that there's a lot of things that are rare context. The question I remember would always stump the system was when somebody asked, where are the zombie nodes? A zombie node in a GPU cluster is one of the groups of eight GPUs that can't connect to the network properly. Usually, it's something like that or something where the job is continuing to run, but it's not reporting results correctly. There's a lot of nuanced versions of this.
Sometimes the AI would get that right, but most of the time it wouldn't, unless you gave it some examples, or even considered doing a little bit of post-training so it understands some of the vocabulary, or a little bit of RAG or graph RAG or any of these techniques. With those, you could actually have it understand that. No system is going to understand unless you go find all that rare context, all those definitions of what this is. We call these semantic layers today. I think we have terms for this. At the time, we didn't have great terms for this. That's one of the biggest lessons. One of the other lessons we learned in doing this, so who here has ever built a text-to-SQL system? This is this dumb idea that if you just ask a question, the right kind of LLM might be able to write the right kind of SQL that will automatically answer it. The first couple times you do it, you'll do a couple queries. You'll be almost right, but it's just powerful enough that you're like, I want to have this system where I can just ask an arbitrary question, and if the data can address it, the generated SQL would tell me the answer. Now it turns out, you don't get very good accuracy if you make it do joins, and that's with interns. Also, the AI doesn't have very good results if you do joins. All sorts of things don't have very good results. That's why we invented ORMs, because we don't like joining things, apparently. That's my angry intern-generated picture. I have 8 interns a year. What we do is we flatten the schema. We make it do real simple things like selects, where clauses in some examples, group bys and maybe things like that that it really understands. It can get more complex things correct, but we found that if you keep it really simple, you increase your accuracy significantly. Oftentimes, this would be something that would take us from maybe 70% right to up in the upper 90s, which for some use cases is good enough.
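To make the "flatten the schema" idea concrete, here is a minimal sketch, not the Llo11yPop implementation. It assumes a toy SQLite database; the table names, the `gpu_flat` view, and the `is_simple_query` check are all hypothetical. A human writes the join once as a view, and generated SQL is only accepted if it is a single simple SELECT against that view.

```python
import re
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy two-table schema plus a flattened view (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE gpus (
            gpu_id INTEGER PRIMARY KEY, cluster_id INTEGER, model TEXT, status TEXT
        );
        INSERT INTO clusters VALUES (1, 'us-west'), (2, 'us-east');
        INSERT INTO gpus VALUES
            (10, 1, 'H100', 'healthy'),
            (11, 1, 'H100', 'fan_failure'),
            (12, 2, 'H100', 'healthy');
        -- The join is written once by a human, so the model never has to.
        CREATE VIEW gpu_flat AS
            SELECT g.gpu_id, g.model, g.status, c.region
            FROM gpus g JOIN clusters c ON c.cluster_id = g.cluster_id;
        """
    )
    return conn


# Keywords a generated query is not allowed to contain in this sketch.
FORBIDDEN = re.compile(r"\b(join|insert|update|delete|drop|alter|union)\b", re.IGNORECASE)


def is_simple_query(sql: str) -> bool:
    """Naive check: one SELECT statement, reading from gpu_flat, no joins or writes."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped or not stripped.lower().startswith("select"):
        return False
    if FORBIDDEN.search(stripped):
        return False
    return re.search(r"\bfrom\s+gpu_flat\b", stripped, re.IGNORECASE) is not None


def run_generated_sql(conn: sqlite3.Connection, sql: str) -> list:
    """Run model-generated SQL only if it passes the simplicity check."""
    if not is_simple_query(sql):
        raise ValueError("query rejected: only simple SELECTs on gpu_flat are allowed")
    return conn.execute(sql).fetchall()
```

The keyword check is deliberately crude (it does not catch subqueries, for example); the point is only that the model is steered toward selects, where clauses, and group bys on one wide table.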
We also learned that LLMs, at least at the time (granted, this is something that was true as of a couple years ago, and I think it might still be this way), classify better than they code. If you ask an LLM, is it a this or that or that, and there's five things, it can actually get that pretty well. They might not code as well to figure out exactly how that is. Now they're actually getting pretty good at it, but that was the goal. One of the ways that we figured out how to take advantage of this is we said, if the query matches this pattern, say we're just counting GPUs, just run this query in this pattern, and here's where you put the variables. This is almost what we consider the way out of this road to hell, which is one of the topics that I really care about: when do you decide to go to a deterministic system? When do you decide to make the job of the AI easier and go into this world where, it's this query and we know how it works? That's actually pretty important. That was one of the first things we learned: off-ramps to determinism really help reliability. One of the other things we learned, and I think you've probably seen this theme in some of the other talks where we talk about how you can only have so many tools, or when you're doing Cursor rules you can only have so many, is that if you have too many options, particularly too many options that look similar, the ability for it to classify properly actually goes down significantly. We started to notice sharply increased error rates if we had 50 agents that the system could choose from and a lot of them were similar in scope or similar in what they're supposed to do. Just like if you open that menu and you're at Cheesecake Factory. Who here has ever gone to that menu and had to look at it for 20 minutes to figure out what you want? Most people are like that. That's why restaurants that have simpler menus do better sometimes. LLMs, again, they suffer from this problem sometimes as well.
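The "off-ramp to determinism" described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's code: the template names, the `gpus` table, and `keyword_classifier` (a keyword stand-in for what would be an LLM classification call) are all hypothetical. The classifier only picks a pattern and a variable; the SQL itself is fixed and parameterized.

```python
import sqlite3
from typing import Callable, Optional, Tuple

# Fixed, human-reviewed queries. The model never writes SQL on this path.
TEMPLATES = {
    "count_gpus_by_status": "SELECT COUNT(*) FROM gpus WHERE status = ?",
    "count_gpus_by_model": "SELECT COUNT(*) FROM gpus WHERE model = ?",
}

Classification = Optional[Tuple[str, str]]


def keyword_classifier(question: str) -> Classification:
    """Stand-in for an LLM classifier: return (template name, variable) or None."""
    q = question.lower()
    if "how many" not in q:
        return None
    for status in ("healthy", "fan_failure"):
        if status.replace("_", " ") in q:
            return ("count_gpus_by_status", status)
    for model in ("h100", "a100"):
        if model in q:
            return ("count_gpus_by_model", model.upper())
    return None


def answer(
    conn: sqlite3.Connection,
    question: str,
    classify: Callable[[str], Classification] = keyword_classifier,
) -> Optional[int]:
    """Take the deterministic off-ramp when a known pattern matches."""
    match = classify(question)
    if match is None:
        return None  # No off-ramp: the caller falls back to the free-form path.
    template_name, variable = match
    (count,) = conn.execute(TEMPLATES[template_name], (variable,)).fetchone()
    return count
```

Keeping the template list short also speaks to the menu problem: the classifier chooses among a handful of clearly distinct options rather than dozens of similar ones.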
That was one of the ways we went to solve that. Another thing that we did was we did these purpose-built agent hierarchies. You're wondering, why is he always talking about org structure? That's a thing. It seems like a thing. In this case, it kind of worked out, where you can have a VP agent that has wide context but isn't great at any particular thing. That's like every VP of engineering. We're not good at anything, but we have a lot of context. We know how to pass context around. That's what we do. You might have a manager agent that's really good at like, how do you ask questions, down to the individual agents. The most important agents in the system are the individual agents that are doing specific tasks, just like in a system, just like in a company. It's not the managers like me that do the important things. It's the ICs. It's the ICs that actually do the real engineering, and so those things have to have a lot of context about how you do any individual task. That's how these two work. Those are some of the lessons. One of the other major lessons we relearned was, you have a construct of a test pyramid. Who here knows what that is? Test pyramids. In the old world, back before AI, back when cavemen ruled the world and all this other stuff, you had a test pyramid where you had the end-to-end tests. You had fewer of them, and they were more expensive to run, and that's the similar thing where we would have, at the top level, the hierarchy, tests that call multiple LLMs and then have to get an answer back out of an entire system. You might have individual LLMs in this multi-LLM architecture that only do one thing, that have simpler tests. You run those more often. We found that if you have this pyramid of eval, similar to how you have a pyramid of tests in the old world, this would work really well. This was a slide from the Llo11yPop people we wrote as a blog, and this was one of the contributions from the team, which I have a team. 
In fact, I have people here that are on the team that are very annoyed if we don't have really good evals. You can't vibe test this stuff. Actually, accuracy matters. That was one of the first things we also realized. How you build your OODA loop-based system, your observe, orient, decide, act system, is something like that. People are talking about these ideas of deep agents now, which are rediscovering some elements of this idea, but actually doing it for real. There's a lot you can read about that these days.

Agent Archetypes

Part two is, let's talk about agent archetypes. The first thing I like to think about when I'm thinking about, is a problem an AI problem, is a problem an LLM problem, is it some other kind of problem, or should we just write some software? Imagine what you could accomplish if you had just a whole army of dumb interns that could do one thing particularly well, but you could scale it. You can scale it up to 1,000 or 10,000. That's one of my favorite use cases. Yes, Michael Burry. If you've seen The Big Short, you might know this line. At the beginning of the movie, The Big Short, Michael Burry asks his associate, "So, I want you to look at all the mortgage bonds". The associate replies, "So, you want to know what the top selling mortgage bonds are, right?" "No, I want to know what's in each one of them. I want to know if any one of those is risky". If you can imagine if this was done in say 2024, 2025, you might say, why don't we have an agentic system, instead of the guy looking at Michael Burry in this scene, actually look at every single mortgage bond, look at it for certain kinds of anomalies, look at it for certain kinds of things that might be wrong, and then just flag those. That is what I call a worker agent problem. It's similar to a task like, paint this design on all the rocks on the beach.
It's not a mechanical thing, it has to be a little bit different for every single one, but that little bit of difference is something the LLM can do. The task is fundamentally the same across a very large group of records. There's some problems where this kind of worker agent approach can work really well. I love this kind of thing for, look at all the GPU clusters and find the ones that have patterns of fan failure and give it some ability to be a little bit creative, to be maybe a little bit wrong in its analysis of figuring out things that might be a new failure mode you might discover. Things that look exceptional, maybe you don't know for sure it's wrong, but it's good to check. That's a worker agent type of problem. Go look at all 100,000 clusters, analyze them for whatever issue you care about, maybe have multiple that have different kinds of prompts attached to them or different kinds of ways you might query to look at different aspects of the problem. Another kind I really like, I call it a ruminative agent. This is something that if you use, ChatGPT I think has a feature where it thinks all night and figures out what you want to see in the morning. I don't know if it works really well. I love the idea of having a bot that looks at all the data gathered across a bunch of things and maybe uses a graph technique or uses some other kind of memory technique to then see if there's other kinds of patterns across all the clusters or other kinds of failures that have certain kinds of common characteristics. There are ways you can set up an agent that are going to work a little bit differently than just one examining each one, but looks at the patterns across them. That's to me a ruminative agent. It might do inference all night looking at those patterns to try to find the ones that are really most important ones. They get more exotic as we go. We have middle manager agents. "Bot, please go solve for this. Here's a set of agents that you can use to achieve that. 
Here's a set of capabilities. Then what I want you to do is manage the context around all of these to do the best job you can with some kind of thing where it's a measurable metric". That's the important thing, that it's a measurable metric. That way you can measure any actions it might take. This is going to be initially low-stakes stuff. You're not going to use this to shut down clusters or do anything that's really expensive at first. As you start to do this, you might think, this might work really well for solving for certain kinds of things where you can use a test function to know whether you should continue. After doing a couple versions of the actions that it's going to dictate the system do, you ask, did it actually work? Then if it didn't work, it can roll it back, just like one of us would do. Another kind I like, and this is more of a passive agent, this is your consultant agent. This is the agent that's asking the other agent, what is it you do here? How is it that you communicate across the system? Are you talking about quantities of money that are outside the boundaries that we would normally talk about? Think of the Delta Airlines agent that apparently, if you gave it the right prompt, would give you free first-class upgrades. Does anybody remember that prompt? I just need to know for science and for my next flight, please. Whatever you did to get past that, please let me know. The idea is you have this consultant agent that could actually go in and understand the patterns of communication. Understand, are we using language now that's not great? Are we using biased language? Are we giving away refunds that are too large? Some people call this an observer agent.
That's another name of the pattern, but it's something you can employ, something you can look at as you're looking at how different agents talk to each other, as you're looking at even the reasoning traces of some of the newer models that have more detailed reasoning in them. You have tool selector agents. This gets at some of the problem of when you have too many possible tools: you can actually have a tool selector that knows how to understand the details of the tools and map the right tool to the right task. As we start to get more elaborate agentic systems, instead of going across a fixed workflow where it's like, do these five steps in order using these five things, you might say, we're going to do like what Jim Fan does with his Voyager paper, where we allow the AI agent to construct the different things it needs in terms of the workflow, construct its own workflow to achieve some end. Part of doing that is a tool selector agent that knows how to take its intended action and actually convert that into the right outcome. Really important agent. I think we're starting to see this especially in the Claude skills, where there's an idea that you have to pick the right skill and you have to think about how do you pick the right one for the right case. Then, of course, the director agent. The director agent is what we envision at the top of that chain, where you're saying, I have this intent. I don't even know how to measure the intent other than I have this maybe top-level metric I care about. I'm going to talk to multiple manager agents. I'm going to delegate as appropriate. I'm going to try to create some outcome. I think this is aspirational. I don't know that many people have achieved this yet. I think this is absolutely achievable in more closed domains, where you understand a lot of the outcomes and what the failure modes are. You'll start to see this more and more over things that are significantly more complex.

What About Hallucinations?
Now, the question in the room, the elephant in the room, as they call it, how do you make them more accurate? How do you make them useful? How do you make them not hallucinate? The hallucination might be just making up the facts. It might be writing the wrong query. It might be generating the wrong SQL that has these results that don't make any sense. One of the first things that I remember, one of the first arguments you get: who here goes on LinkedIn sometimes? Who here has seen that post? The R's in strawberry. We know how to solve that problem. It's a really easy problem to solve. Just use ChatGPT 5. No, you could do this for about a year and a half now, when you could put in a system prompt and you could tell ChatGPT, please use Python when counting things, or please use Python with anything that involves math. This has been something where it could generate Python in line that would just count the number of R's because it could just write Python that knows how to do that. You've been able to get the right answer to this for quite some time. It turns out that if you allow it to write code, which in some cases does a better job of handling how language works or how math works, you can actually do a good job. Another way I think about this, who here does long division in their head? One person does long division in their head, why? Why would you do that? When we have perfectly good things called calculators. We use calculators, they're deterministic. They know how to do math, they're really good at that. We don't reinvent how to do math every single time we need to do an equation. We look at a times table, even before computers, we would look at that. You use a calculator. Imagine you are Delta Airlines, or whatever airline it was, and you're giving an agent the goal, how do you make the customer happy? Would you just let them use an unlimited budget to do that? No. You would give them guardrails.
If you had a brand-new associate, a human person, that were working the customer service desk at an airline, you would say, I don't care what reasoning you came to, you are not giving any refunds over this amount. You would have a hard guardrail in the system that's a deterministic guardrail. That's not an AI thing, that's just a, you cannot do refunds over this amount. Maybe the agent could trick somebody into giving a bunch of smaller refunds somehow, but you could also manage that just like you would manage that with a human as well. You would give them guardrails, they'd have limits. You would govern them with humans. This is obvious, but it's important because I think a lot of people get on the stage and hope that just AI agents are going to require no humans. I don't think that's true. Another example. Imagine you are managing a large cluster of any kind of computer or any kind of large distributed system. Let's say you do a routine operation, and do you reinvent how you fix the DNS server every single time? Do you look up how DNS works and look at the network architecture diagram and figure it out on your own? No. You use a runbook. Who here uses runbooks and still has AI? How else are you going to fix the AI if you don't have a runbook? No, you have other AI. No, that's not going to work either. Use a deterministic runbook, we don't have to reinvent this every time. One of the other things, and this gets to the theory, why I care about this in terms of weaving together deterministic systems that know how to ground the AI, along with AI that knows how to maybe discover new ways to accomplish something. If you give it the right combination of tools, the AI can discover the right combination of tools. It can use the deterministic tools to actually get reliability. That's the theory here. We see this in practice. We see this in practice with systems like deep research. 
If deep research only looked at one document on the way to doing its query, it would almost always get wrong results, because the whole point of deep research is this continually re-grounding the query in real-world data to understand, step one, understand this fact. Step two, maybe do some rumination. Then step three, four, go and check it again. It does this over and over again until it gets the answer that you want. You can put as many tokens as you want at this. If it's designed well, you get this scalability where the longer you let it run, the more you get correct answers. It's actually amazing that this works. In fact, we built this blueprint in NVIDIA so that deep research isn't just something you have to go to OpenAI or some other company to get. You can actually build this into your products. If you want to have deep research in something like a thing that figures out why the GPU is going badly, or any other hard question, you could embed this into your product. Deep research is just a pattern. It's not necessarily a product that you have to buy from some other vendor. This gets to the real point here. Most effective AI agents, they have access to useful tools, they're governed by guardrails, they have feedback loops in them. The feedback loops where you understand what queries went badly or what things didn't work are that thing that allows you to improve it. AI agents built well are the first kind of software I've seen in a long career that theoretically should get better the more that you use them. If it's a well-designed system, the system should get more accurate the more that you use them. The more training it gets, the more ability it gets to handle edge cases. When I think about this, I think about what the platforms look like. I think there really is a dividing line. I think there's two important layers in the AI platforms of the future. There is a tools layer. The tools layer is made of deterministic software. 
It may be written by AI, guided by a human. If it's the transaction system at, say, a payment processor, that might just be written by a human. It might be old-school software as of three years ago where it just does the thing. It's totally cool. Everything doesn't have to be AI. Even as an NVIDIA employee, I will say, not everything has to be AI. I know, it's controversial as heck. Then the AI agents on the other side allow you to do the stochastic things, allow you to do the things that are fuzzy, allow you to do the things like interpreting a fuzzy input and trying to figure out, what category does this fall into? What is the best guess of where this would fall into? Then maybe route the call or route the sales lead or route the, I think something's wrong with this GPU, but I don't know why. Maybe this other system can give its best guess. Maybe it can give it to three different systems, and then come out with the right answer. Those are going to work with tools that then gather the data in a very deterministic way, in a very defined way. That's one of the most important things about systems like these. I think a lot of the conversation in the industry where people are saying, agents don't work or agents are hard, it's because I think there's been a lot of desire to say, why don't we just have AI agents do everything? Have AI agents construct every single tool call they'll ever make, rewrite every single tool call they ever make, or don't give them deterministic tools and have them regenerate the code every time. Listen, folks, this isn't magic. This is just math. I think we have to figure out where is the appropriate case for one or the other and build platforms that allow AI to construct, out of multiple tools, how to generate the right outcomes. I think once you frame it that way, it becomes a lot easier to think about how these are going to run in production.
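As a small illustration of the tools layer, the refund limit from the airline example earlier can be written as a plain deterministic function that the agent must call. This is a sketch under assumptions: the `MAX_REFUND_USD` value, the `RefundDecision` type, and the `issue_refund` name are all hypothetical. The agent's reasoning is passed in only so it can be logged; it has no effect on the outcome.

```python
from dataclasses import dataclass

MAX_REFUND_USD = 200.0  # Hypothetical policy limit, set by humans, not by the agent.


@dataclass(frozen=True)
class RefundDecision:
    approved: bool
    amount_usd: float
    reason: str


def issue_refund(requested_usd: float, agent_reasoning: str) -> RefundDecision:
    """Deterministic tool: the cap applies no matter what reasoning the agent supplies."""
    # agent_reasoning is accepted for audit logging only and is never evaluated.
    if requested_usd <= 0:
        return RefundDecision(False, 0.0, "amount must be positive")
    if requested_usd > MAX_REFUND_USD:
        return RefundDecision(False, 0.0, "exceeds hard limit; escalate to a human")
    return RefundDecision(True, requested_usd, "within limit")
```

A fuller version would also track cumulative refunds per customer to cover the "bunch of smaller refunds" case mentioned in the talk.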
Every now and then, regardless of how I feel about this, I read a report where somebody says AI doesn't work. I'm in New York, who here has ever ridden in one of those things? When I do this talk in San Francisco, it's two-thirds of the room. This is what I do. Every time I'm feeling bad or I'm wondering, is this real? I book one of those, and I just ride around for a while. It's almost like therapy for people that are worried about the bubble or whatever. I just go on one of those, and I think about, how much reinforcement learning did this have to undertake to be able to cross a street where there's crowds going back and forth, to drive through San Francisco's Tenderloin, which if you've ever driven through that, I try to avoid it, people just walk around wherever. It's like New York almost. No, I book a Waymo, and it makes me feel better. We have at least one existence proof that you can deliver an AI agent that can safely deliver humans from one location to another in very unexpected conditions. If we can build AI agents that can transport us, I think we can build AI agents that can look at a GPU cluster and know what the problem is, or at least have a pretty good idea of what the problem is. We can do self-driving systems that aren't cars, that just work with information. I don't buy the line. I think we just have to not expect it to work on day one. That's the problem with some of this AI stuff, especially LLMs: LLMs have almost conditioned us to have results at zero-shot that are just there, because when you use an LLM, it kind of looks right all the time. The writing is professional. It's polite. It's the kind of thing you'd write to your CEO. Why do you think they like it so much? That's why you have to investigate it. Sometimes, as humans, I've written reports that have wrong stuff in them. I've hallucinated, but it looks good, so people believe you. That's the problem with LLMs.
The payoff is too fast, when in reality, agent development is not that different than software development. It's just the failure modes are different. We have to try to react to those, because once you get it to work, again, it's that first software ever that improves the more you use it.

What Are Good Agent Problems?

Maybe you're not doing self-driving cars. Maybe you're not even doing self-driving data centers or self-driving accounting. What are good agent problems? My favorite kind are the dumb diamonds. Is the cover on the TPS report properly filled out? They're not that hard to figure out, but you need a person to look at it sometimes. You had that boss in Office Space that was like, is this filled out correctly? Yes, looks like it. There's a whole bunch of problems in business that are dumb diamonds, that are like the diamond in the flowchart. A human mostly just checks the box, but is looking for that every-now-and-then exception. That's really important. That's one kind I really like. Classifier kinds of problems. There's problems that I was talking about before where LLMs are really great at classifying. Is it this or this or this? You can use vision models to do a similar thing, where it's, look at the picture and is it one of these things? Does it make a decision based on that? Those are really great agent problems. It doesn't have to be like an elaborate OODA loop kind of agent. It could be an agent with three steps in it where it uses a classifier in the course of making some dumb diamond decision that results in some outcome that now has taken, not a human's job, but it's taken a low-level intellectual task that nobody really wants to do and just makes it not a thing. Content organizers. This is one of my favorite kinds because our team built one. We named it Codex right before OpenAI named their thing Codex. Julie works on it every day. Use prompt X to rewrite content Y into format Z. We do this with this technique.
I think we're calling it template RAG, where what we do is we have this system that takes all of our meeting transcripts from all of our Teams meetings. Unfortunately, NVIDIA uses Teams. I tell candidates that we use Teams, and they're like, I don't know about NVIDIA. What we do is we then run it through a template. The template defines what the structure of the wiki should be. You run the transcript through the template, and what you get on the other side of it isn't just a meeting summary, but it's a meeting summary exactly in the form that you want. Then if later on you want to say, then tell me what the impact would be on this industry or this team, you can then rerun all that content through the templates, and then you can get the new perspective that you wanted based on the kind of summarization you want, or the kind of tool you want to give it to maybe add new metadata or add new information to what you're generating as what would ideally be an organized wiki of content that is very legible to everybody, but in the same format. Similar to how you might have somebody that's a note taker that does the same thing. This is just trying to organize information so it's more legible to humans. I think it's pretty important. Another kind of system is a scaled inspector. I talked about this a little bit, but look at every X, and it could be GPU clusters, it could be transactions in a stream of transactions. It could be anything where there's a lot of them, but you're looking for certain kinds of conditions, and you want to give the AI the ability to say, is this something that's a transaction that looks questionable? Is this a GPU cluster that looks like it's got this pattern of failure? You can go on down the line for the kinds of things you think might be wrong, or even give it a little bit more latitude to come up with ideas that maybe you didn't think of that might be matching failure conditions. 
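The scaled inspector pattern described above can be sketched in a few lines. This is a minimal illustration, not the speaker's system: the record fields (`allocated`, `utilization`, `failures_last_hour`), the two condition checks, and the way the budget is counted are all assumptions made for the example. In a real system each check might be a call to a classifier or an LLM rather than a plain predicate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Finding:
    item_id: str
    condition: str


# Hypothetical condition checks. Each one answers a single question about an item.
def looks_like_zombie_node(cluster: dict) -> bool:
    # Assumed pattern: GPUs are marked allocated but are doing almost no work.
    return cluster["allocated"] and cluster["utilization"] < 0.05


def has_repeated_failures(cluster: dict) -> bool:
    # Assumed pattern: several failures within the last hour.
    return cluster["failures_last_hour"] >= 3


CONDITIONS: Dict[str, Callable[[dict], bool]] = {
    "zombie_node": looks_like_zombie_node,
    "repeated_failures": has_repeated_failures,
}


def inspect_all(
    items: Iterable[dict],
    conditions: Dict[str, Callable[[dict], bool]],
    budget: int,
) -> List[Finding]:
    """Run every condition against every item until the check budget is spent."""
    findings: List[Finding] = []
    checks_used = 0
    for item in items:
        for name, check in conditions.items():
            if checks_used >= budget:
                return findings  # stop once the budget is exhausted
            checks_used += 1
            if check(item):
                findings.append(Finding(item["id"], name))
    return findings


clusters = [
    {"id": "c1", "allocated": True, "utilization": 0.90, "failures_last_hour": 0},
    {"id": "c2", "allocated": True, "utilization": 0.01, "failures_last_hour": 4},
]
findings = inspect_all(clusters, CONDITIONS, budget=100)
```

The budget parameter is the knob the talk alludes to: the inspector looks at as many items as you are willing to pay for, and each additional condition is just another entry in the dictionary.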
You can have this look at literally everything all the time, for the amount of budget you want to spend. Actually, I think it's a pretty powerful technique. One of the other kinds, and this gets into things that aren't necessarily LLMs, and I'm going to get on this topic, which is, we make another mistake in the industry of equating AI to LLMs. LLMs are a subclass of AI. They're actually a tiny subclass of AI. They just happen to get a lot of news now. Constraint navigators are really neat. They can just say, we have this solution space. It's something that's really hard to search. For example, we have a problem where we want to be able to reorganize our GPU clusters. It's a little bit of a bin packing problem, but with tens of thousands of bins, and millions of GPUs that have to be arranged in very specific ways, so that if you're saying you're running a thousand GPUs to run a big training job, they should be in the same hall. It's a really hard bin packing problem. This is not unlike the game of Go. In Go, you can't search the entire space of solutions. That's why AlphaGo was such an important discovery: the ability to actually look at what is statistically the best possible option here, given that I can't possibly search everything. It honors lots of constraints: what are the rules of the game? If you do that correctly, you can actually solve really hard problems. It's not necessarily the ideal solution, but it's your best guess at how you might do that reorganization, amongst other possibilities. The advice I would give people is to start with small, composable skills. You can do this with Claude skills now. It's actually pretty powerful. Chain two or three together, and over time you're going to start to build skills that compose more of them, and that gets you more comfortable. See what you can do with your tool set. See what you can do within the safe boundaries of what you're trying to accomplish.
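To make the GPU placement constraint mentioned above concrete (every GPU for a job should sit in the same hall), here is a small sketch. It is deliberately not an AlphaGo-style search: it swaps in a simple greedy best-fit-decreasing heuristic, which is a standard baseline for bin packing. The hall names, capacities, and job sizes are invented for the example.

```python
from typing import Dict, Optional


def place_jobs(
    hall_capacity: Dict[str, int],
    jobs: Dict[str, int],
) -> Dict[str, Optional[str]]:
    """Assign each job to one hall that can hold all of its GPUs.

    Greedy best-fit decreasing: place the largest jobs first, and put each
    job in the hall that leaves the least free capacity afterwards.
    A job that fits in no single hall is mapped to None.
    """
    free = dict(hall_capacity)  # remaining GPUs per hall
    placement: Dict[str, Optional[str]] = {}
    ordered = sorted(jobs.items(), key=lambda kv: kv[1], reverse=True)
    for job, gpus_needed in ordered:
        candidates = [hall for hall, cap in free.items() if cap >= gpus_needed]
        if not candidates:
            placement[job] = None  # constraint cannot be honored
            continue
        best = min(candidates, key=lambda hall: free[hall] - gpus_needed)
        free[best] -= gpus_needed
        placement[job] = best
    return placement


halls = {"hall_a": 1000, "hall_b": 600}
jobs = {"train_big": 1000, "train_mid": 500, "eval": 200}
placement = place_jobs(halls, jobs)
```

A greedy pass like this gives a quick feasible answer but can miss better arrangements, which is exactly why a search-based navigator becomes attractive once the number of halls and jobs grows.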
Then that gives you the power to say, now we can solve bigger problems, once we get really good at the small agents. I say this because you read that study. I've read it a million times. I think I'm on the leaderboard for the most Waymos ordered, because I see that study that says 5% of AI projects succeed, which I think is great. Five percent succeed, 95% fail, and often it's because they're too ambitious. That's why I say start small, get comfortable with it, and then we go from there.

Diverse AIs

This gets to the thing I really want to emphasize. There's more than one kind of AI. I think about AlphaGo solvers. There are combinations of them. I think there's one that does code now, where it is a solver for the best code using genetic algorithms and an AlphaGo-style solver to do more advanced code generation. More than just what you do with an LLM. One of the things my team builds is something called time series foundation models. I'll talk about that in a little bit. There are things called protein language models. Who here knows what one of those is? This is the thing that allows you to figure out novel molecules that might cure a disease. The way those work, instead of using words, instead of using human language, we use the language of DNA, we use the language of chemistry, but with the same transformer model, to figure out what might be the best molecule to solve this disease or to react to this antigen. Then we can even do things like figure out, is this going to cause increased liver toxicity? Which would then help a drug company not put drugs up for FDA approval that they can know early aren't going to work, because they're going to be too toxic to the liver. That's one case. There are a lot of cases like this. In fact, we worked on a system, which was like that same OODA loop system we talked about before, that was taking healthcare data, so take your patient medical record, take the presenting condition. Who here has ever been to the ER?
Did you know that the doctor at the ER is vibe doctoring? You thought vibe coding was bad. Talk about the vibe doctor. What the vibe doctor has to do is look at you for five minutes and figure out, that person can't breathe. They should get CPR. No, that person has what we think is this condition. Here are the parts of the medical record that might be relevant. Maybe I have time to look it up. I probably don't. Then I'm going to make a decision what medication you should have. This should terrify you, that we don't have scaled AI actually helping us figure out what is the second, third, fourth, fifth best option, and is there something that the doctor's not seeing at that point, then figure out what is the right thing to do. In my dream, in the regulated medical space, as we say, where AI is going to be really hard, I think this is going to be one of the most important things you start to see scaled out. I think Microsoft had a video, they call it Health Superintelligence. It's really just that same multi-agent model applied to healthcare. Healthcare is very much a stochastic thing. When people make a diagnosis in healthcare, and if they're 45% right, that's considered good. That's actually good, which means that if you can solve this with AI, it's 80% better. The accuracy rate we would never accept in a financial transaction is twice as good as what you get in a healthcare decision. I think that's really important to think about when you think about the power of AI, is, where are those questions that are stochastic questions where the bar is this low? If we're 80% right, it's actually a lot better than what we had before. Self-driving is like that. Self-driving isn't perfect, but it turns out human drivers are terrible, and so all it has to do is meet this minimum bar of not being as bad as a human, and it's technically better. In fact, most of the time it's been shown as about 10 times better on a per mile basis. 
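As a quick check on the diagnostic figures quoted above, take them at face value and assume (the talk does not spell this out) that 80% is the accuracy an AI-assisted diagnosis would reach against a 45% human baseline. The comparison then works out roughly as:

```latex
\[
\frac{0.80 - 0.45}{0.45} \approx 0.78
\qquad\text{and}\qquad
\frac{0.80}{0.45} \approx 1.78
\]
```

That is, about a 78% relative improvement, or a little under twice as accurate, which is consistent with both the "80% better" and the "twice as good" phrasings.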
I think those are the really interesting areas where we can use AI. I think of domain-specific reasoning models. Think about a general reasoning model that has Chain of Thought built into the model. You can do this. You've done mathematical reasoning models. We've done biological reasoning models. We've done all sorts of reasoning models. There are these smaller models that you compare with a larger one to then get better results out of the larger one. I talked to one healthcare company that was going to spend $1.5 billion training a model, and I think I made our sales team mad when I said, you might be able to do it with a model one-tenth the size, but with a protein reasoning model attached to it. That actually worked out pretty well. He came back and said, thank you. I'm like, sure, can I get a commission on what I said? They said no, but still, that's the point. You might be able to get more with less with reasoning models. We also think of things like world models. Yann LeCun, not Meta anymore, talks about world models, that there's so much more you can learn from interacting with the world in 10 minutes than an LLM would possibly ever know. When we think about the advances in the next five years, it's going to be using world models that have things that gain real-world experience, and so we have this model called Cosmos that you can then post-train. There's an early version of what we'll really see in five years of these really elaborate world models to understand not just the relationship between words, but the relationship between objects in three-dimensional or four-dimensional space. I know it sounds like space age stuff, but these models are much larger. These models might be in the hundreds of trillions of parameters or even in the quadrillions of parameters to understand all this stuff, and we just don't have enough compute to do it. That's why I think we're so aggressive in trying to just build out these data centers everywhere. 
NV - Tesseract, and Open-Source Models

I wouldn't be here if I wasn't at least pitching something I work on. Our team works on something we call Tesseract. Tesseract is a new model, a time series transformer. Who has seen one of these before? Time series transformer models. ChatGPT will predict the next paragraph after the context you give it. It's trained on this corpus of all the world's text. You give it your question and it's able to predict the answer. It does so pretty well. Time series transformers are different. They're trained on the notions of time and the relationship between data over time. Tons of time series data goes into one of these, and you're able to actually forecast data based on patterns it can find in the prior data. Using this, we do things like anomaly detection. If you are using one of these on a factory line, we can know where the anomaly started based on the patterns in the signals, based on the patterns in the data, and know that every object made on this assembly line after this point probably has an error. Then we can go back and we can actually save a bunch of inventory. We can do things like forecasting. This can work in financial services. It works in supply chain. It works in all sorts of domains where the data is large, where the patterns are not necessarily well understood but are discoverable, and where training the same way you would with an LLM, but on this kind of data, can then tell you new things, can do actual forecasting. These things can get surprisingly accurate. They might be a little bit more expensive to train. You might spend a million dollars post-training one of these time series foundation models. They're not as generalizable as human language, so they require a little more post-training.
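The factory-line idea above (forecast the next reading, find where the anomaly started, treat everything after that point as suspect) can be illustrated without a transformer at all. The sketch below swaps in a naive rolling-mean forecaster in place of a time series foundation model; the window size, threshold, and sensor readings are assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def first_anomaly_index(
    signal: Sequence[float],
    window: int = 5,
    threshold: float = 3.0,
) -> Optional[int]:
    """Return the index of the first reading that the forecast cannot explain.

    The forecast for each point is the mean of the previous `window` readings.
    A point is anomalous when its forecast error exceeds `threshold` times
    the standard deviation of that same window.
    """
    for i in range(window, len(signal)):
        history = signal[i - window:i]
        forecast = mean(history)
        spread = stdev(history) or 1e-9  # avoid a zero threshold on flat data
        if abs(signal[i] - forecast) > threshold * spread:
            return i
    return None


# Hypothetical sensor readings from an assembly line; the shift starts at index 7.
readings = [10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 9.9, 14.0, 14.2, 13.9]
start = first_anomaly_index(readings)
suspect_items = readings[start:] if start is not None else []
```

A trained forecasting model would replace the rolling mean with a far better prediction, but the surrounding logic stays the same: compare what was forecast with what was observed, and flag the point where they diverge.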
Once you get that, if you have an economically valuable decision that you make at scale, this is one of the best techniques for being able to do better anomaly detection or forecasting on those kinds of things, to actually get a better idea of what's going to happen. These are literally models that predict the future. I'm just happy I get to work on them. That's that part. Then the other exciting part, and this is something I'm excited to announce, is we have this new open-source model. Everybody's coming out with a new model all the time. Qwen's coming out, and then Llama's coming out, and all these things. I don't know if there's going to be another Llama or if the next Llama's going to be great. I don't know if we're going to just take whatever OpenAI decides to open-source, which are not their state-of-the-art models. They don't have incentives to release those. Whereas what we want to do is we want the best possible open-source models to be out there in the industry so we can raise what the minimum bar should be. What Nemotron's doing is giving us the ability to say, not only do we have that, we can have a state-of-the-art model that has open weights. We can give you all the data that we trained it with, so that if you want to then post-train it or do any fine-tuning you want or study it or anything else, it's there. It's very much in the open-source strategy of NVIDIA because what we want is for the entire market to succeed. For that we need as much open-source as possible. We've been, especially this year, really just investing in as much open-source as we can, to the point where we pretty much just turned around everything that we're doing to work on more open-source in that capacity.

Summary

I'm going to start summarizing this a little bit. Determinism is good. Determinism, I hear sometimes, is a word people use when they don't want to use AI. I think it's a false dichotomy. Determinism is good. Stochastic systems are good. Non-determinism is good.
You just use what you think is the right thing for the job. We don't have to be purists about it. Dumb agents are fine. I like dumb agents to start because they're probably going to work. I care about, does it work? As you go up the learning curve of figuring out how these work, start with dumb agents. Great agents are defined not just by automating a workflow. We've known how to automate for a long time. They're defined by what they allow you to discover: new ways of working, new ways of understanding why something fails. This is where systems get better over time, because they can discover things. Then when you find out something works, you can just make that part of the system. I think that's incredible. The feedback mechanism can make it better in the way that it's used. Bottom up over top down: top down is one of the biggest reasons these things fail. I think it was even pointed out in that MIT study that when it's top down from a CEO who doesn't have enough context on how things work, there are all sorts of failure modes. They may not understand how much institutional knowledge is required to do something effectively. They just think the process is check, check, check, and it's done. Whereas bottom up, the people doing the work actually know all the complexities. I would rather have the agents built around the person that's actually doing that work day-to-day at the coalface than around some hypothetical way that the CEO might understand it. I love CEOs, but they don't always understand the details of everything about how a company is run. Rare context, that is the specific way that your company might understand a topic, or the way that your company might understand how a zombie node works. Go back to that point earlier. It's that rare context, the thing that the LLM would never know, that's going to be really important for you to employ in these systems if you want them to be effective.
Because people are going to talk to these systems or have language in these systems that is very nuanced to your company or to your organization. Mercilessly use evals. If you don't have evals, you're not serious about what you're doing. You're just vibe checking. That's not good enough. Use evals. LLM as a judge is pretty good. There are other, better ways to do it. If you're not measuring accuracy, I don't think you're serious about what you're doing. Design the system to improve over time. Feedback loops, this is how systems in general get better. Even if we weren't working with AI, we would say feedback loops matter. That's why we invested in Agile. To get feedback to make it better. We can automate that process. We can get feedback on whether the answer was right or not. We can get that little bit of feedback from humans. That's why every AI app has an up-down button. If those bits of feedback are used and are rolled back into the system, ideally in an automated way, it gets better with use. Really important.

Final Note

I will wrap this by saying the world will belong to those with the wildest imaginations. This gets back to the theme of our keynote speaker, which is, there are AI systems that might be able to connect two distant ideas, might be able to connect chocolate and peanut butter together, but they don't yet. I know some people have ideas for that, but for now, the people that can connect two distant ideas together into some unified idea, say for example chips and AI, that can connect these two things together in a way that wasn't done before, I think those are going to be the people that do the most innovation over time. I really encourage you to not limit yourself with what's possible. Limit yourself with what evals tell you doesn't work, but keep your mind open to a lot of these ideas.