{"slug": "presentation-building-evals-for-ai-adoption-from-principles-to-practice", "title": "Presentation: Building Evals for AI Adoption: From Principles to Practice", "summary": "Mallika Rao, a former engineering leader at Twitter, Walmart, and Netflix, warned that \"evaluation debt\", the gap between evolving AI system architectures and stagnant testing frameworks, poses a greater risk to product success than model performance itself. Drawing on experience building personalized search and rewards systems serving billions of users, Rao explained that outdated evaluation methods like precision-recall and static testing silently accumulate until they catastrophically break user trust and product pipelines. The presentation outlined how enterprise systems across different domains share the same root cause of failure: evaluation infrastructure that fails to keep pace with increasingly sophisticated AI architectures.", "body_md": "## Transcript\n\n**Mallika Rao**: How many of you have test datasets that are most definitely 100% wholesome and 100% covered for the products that you're building?\n\nI'm Mallika Rao. I have led search infrastructure teams at Twitter, trillions of documents, sub-100 millisecond latency budgets at global scale, highly personalized ranked search formats. We took it from just an inverted keyword lexical search to a highly personalized search at Twitter. Then I was at Walmart, where we built for a premium subscription service, a rewards model to boost growth, commerce, retention, acquisition, and all that good stuff. Most recently, the content systems at Netflix, where we process billions of personalization decisions every day for global scale, including recommendations, ranking, slicing the data, at scale. I've shipped intelligent systems that can scale up to over a billion users. Here's what I have learned at scale. Very rarely do the models actually come in the way of shipping products that thrive. 
It's actually your evaluation frameworks that can break your products, break your pipelines, and actually touch that user trust, which is so critical for shipping AI products at scale.\n\nToday, I want to talk about something that's invisible to your dashboards, but deadly to your products, and that is evaluation debt. It accumulates silently and explodes spectacularly. We'll talk about what is evaluation debt, what are the symptoms of it. What are the challenges that enterprise systems, enterprise companies face in building evaluation frameworks, and that can scale and evolve. Then we'll talk about a couple of case studies that I'll bring from my experience building personalization systems across all these companies. Then we'll talk about what are the key takeaways, what can we do to assess where we are? What are our maturity models? How can we shape our adoption models going forward? Then wrap with some principles.\n\n## Two Systems, One Problem\n\nLet me start with two systems that I've worked on and very close to my heart. First, personalized search at billions of queries per day. Real-time semantic understanding, not just keyword searches, and like I said, sub-100 millisecond latency budgets distributed across multiple data centers, global scale, and every query touches hundreds of microservices internally. My learnings here, of course, span Twitter, Netflix, and a little bit of search at Walmart as well, where building personalization systems across recommendations, ranking, and search have their own challenges. Second, we look at cash rewards for, let's say, 25 million users every month, dollar denominated transactions, zero scope for error. This is spanning physical stores and online presence, and compliance requirements across 50 states. 
This was at Walmart scale, so spanning physical and online presence across stores.\n\nVery different systems, very different architectures, very different engineering challenges, and very different business stakes, but the same error pattern, same infrastructure gaps, and the same root cause, the way I see it, evaluation debt.\n\n## What is Evaluation Debt?\n\nLet's define what is this evaluation debt. It's what happens when your system architectures have evolved, gotten more sophisticated, but your evaluation infrastructure doesn't. It's stuck in 2018. You add LLMs, embeddings, vector stores, new ranking layers, new personalization insights, multistage pipelines, you have agents for your workflows, but your evaluation has not evolved. It's still stuck. It's not progressing with the product. You're still doing precision, recall, static testing. You're doing some latency graphs, unit testing, and maybe a few manually verified examples here and there. That gap is what grows slowly, invisibly, quarter over quarter, until you notice it, and one day, it catches up pretty spectacularly.\n\nHere's the thing about distributed systems using AI at an infrastructure layer. They don't fail the way traditional systems fail. A database crashes, you notice it. Your monitoring system sees a drift, you notice it. Your services go down. You have your runbooks, alerting mechanism, your observability frameworks to catch you up. AI systems fail weirdly. They fail semantically. They return results that are technically correct, but completely wrong for the user. Your dashboards are green, your metrics look good, but something's not ok with how your users are responding to your products. There are silent failures, pathological edge cases you never anticipated would happen in your products, and trust degradation that compounds day after day, month over month, and that's how I see evaluation debt.\n\nHere's the mental model that changed how I think about testing and evaluations at scale. 
Evaluation is more like a stack. It's not really a score. It's not binary. Each layer requires different infrastructure, different tooling, different expertise, and that's why it's more organizational, where different pillars, different orgs in the same company have to come together to make that dream a reality. Let's talk about what the stack looks like. Layer 1 is your model correctness, table stakes, that's what everyone measures and should. Precision, recall, F1, can the model predict the right thing at the right time on a test set? That's table stakes. Layer 2 is infrastructure robustness, your constraints, your P95s, how are your P99s looking? How are your monitoring dashboards looking? What's happening across the architecture stack all the way from your APIs to microservices to your databases to the caching layers? End-to-end, how does the infrastructure respond to the agent taking the AI workloads?\n\nLayer 3 is what I call the product guardrails. This is where it gets interesting. Does your system avoid generating harmful outputs? Do you have semantic plausibility checks? Can you detect when your model is producing technically correct but absurd results? This is where product and engineering leads have to sit together and just put in those hours to actually understand how do those guardrails need to look, and what is ok and what's acceptable. If you have things that are acceptable at a certain iteratable version of your software, then that's ok, but it needs to be aligned between product and engineering. Layer 4 is human experience. This one is often completely missing in companies. Does the user understand why they are seeing something on your product right now? Does this feel trustworthy? Is the visual presentation consistent across platforms? Does this build confidence or create chaos and confusion in the user's head? Again, this is where design, research, product, and engineering needs to have a stake in this decision-making. 
We'll talk about how we can checklist it in our deployment pipelines.\n\nLayer 5 is systemic output and systemic impact that it has on your engineering organization. The hardest to measure but the most important at scale. What are the long-term effects of trust, governance, compliance, privacy on your business metrics? Are you optimizing for quarterly metrics at the expense of user trust and user experience? You will only have a view into that if it is done at a cadence, as a ritual, as a rhythm in your companies. Most organizations are actually evaluating layers 1 and 2. Maybe they have some basic guardrails for 3, but that's where mostly it stops. The best organizations are the ones that are incorporating all the five layers and calibrating and evolving them as the product evolves. The evaluation debt is something that I see as the gap that is there between these two worlds, what you think you're doing and what you should actually be doing.\n\n## Symptoms of Evaluation Debt\n\nWhat does evaluation debt actually look like in a production distributed scaled AI system? Let's look at some of the symptoms that we can look at and put together a pattern saying maybe we have this evaluation debt. First symptom is silent regressions. Your monitoring dashboards are green. Your metrics are good. They haven't dropped. Your services are healthy. Your support tickets, not so much. They're increasing. Your users are complaining on social media. We saw that with Twitter Search. They were actually going to Google to find tweets. There is a mismatch between what your instrumentation tells you and what your users are truly experiencing. It comes in a little slowly, but the signal is developing.\n\nSecond symptom is impossible failures. The kinds of things that you never anticipated that could happen in the product, but they happen in production. 
Edge cases you can't reproduce in staging, failures that occur under specific combinations of user state, time of the day, patterns, behaviors, and load conditions that you never anticipated in your test harness. Third symptom is this edge case explosion. Every launch surfaces a new category of weird behavior. It keeps adding on, because we did not pay attention to how things would pan out. Then you're playing a whack-a-mole of sorts, fixing one semantic failure, three more pop up, and the engineering team now is spending a ton on its on-call.\n\nFourth symptom is long-term decay. A little hard to see in the earlier days, but your trust metrics start drifting downward over quarters. User engagement slowly declines. You can't pinpoint a single root cause, so it looks like death by a thousand cuts. This is where the organizational setup actually plays a very important role: how we have set up the AI product strategy with the engineering strategy, and how we are leveraging our tech leads, our L6s, staff-plus engineers to collaborate with product, research, and design in making these decisions and catching these symptoms sooner rather than later. Here's the dangerous thing. Across all these symptoms, you actually don't see the fires. You only see the smolderings. Smolderings are the worst, because they hide and they show up in categorical ways to cause systemic damage to your product and to your business.\n\n## Why Traditional Evals Fail Modern AI\n\nLet's look at what were those major shifts which actually broke traditional evaluation. First is something that I call a contamination crisis. Very obviously, public benchmarks like MMLU or HumanEval are contaminated and compromised. Models train on test sets, sometimes intentionally, but mostly accidentally when scraping the internet. They've seen the answers. 
For example, GPT-4 scored 85% on MMLU, pretty impressive, but later the analysis actually showed that 15% of that data or the questions were actually part of its training data. It wasn't learned. If you're using public benchmarks, know that your scores are inflated by a certain amount, and that amount is usually unknown till you have internal product evaluations built in your orgs. You think you're at 90%, but you're actually at 72-ish. Every serious AI org is probably now investing in an internal private evaluation set, evaluation framework, which refreshes quarterly exactly for this reason.\n\nThe second shift is agent systems behave very differently and they break traditional metrics. Let's say an AI agent books your flight. It's a sequence of eight steps. Every step is working at 95% accuracy, but that's 95 to the power of 8. That's something like 66%. Two-thirds success, even though every step in your infrastructure, every step in your pipeline was actually pretty high accuracy. The traditional precision, recall is useless unless it actually leads to some ground truth that the users can actually use in how they respond to the product. What we actually need is success rates, trajectory analysis, goal achievement metrics, and entire new instrumentation that can go end-to-end. We'll talk about this a little bit more later.\n\nThe third shift I see is using LLM-as-Judge without grounding. At scale, you do need automation, so we are going to use these LLMs to be able to judge and calibrate hundreds of thousands of data, but that's not the only natural solution. The problem is that they have systematic biases. They have length biases, style biases. They have their own sycophancies. They like their own prior responses. The solution needs to be something more richer where it's a three-tier system. The humans evaluate 1,000 golden samples. Then the LLM as the judge can do another 100,000 samples. 
Then, again, we have a manual layer which can come and audit at a rhythm every quarter or whatever cadence is suitable for how quickly that particular organization is evolving. Putting all this together to give a richer tiered system for calibrations. Then this is where the static tests actually fail these dynamic systems. That's exactly where the evaluation debt starts accumulating without us noticing yet.\n\n## Benchmark Theater\n\nBefore we get to the case studies, let me talk about this benchmark theater which I have seen happen a lot. It happens all the time as AI engineering orgs start adopting AI in their products. Not on the developer productivity side, but directly on the product side. A team optimizes their model for a public benchmark. The score goes up, let's say from 85% to 92%. There are celebrations all over, your VP is very happy, and you ship to production. The real-world performance, not so much. It's unchanged, or maybe even a little worse. I'll give you a real example from the industry, Twitter Search. We were ranking and searching with Lucene, TensorFlow models, dense vectors. We had our own embedding stores. We had recently taken our search from a keyword inverted search index, all the way to a slightly more personalized search experience. This was the first time we were having a personalized Twitter Search experience for our users, where the users were also learning how to use the product.\n\nNow the ranking team in the beginning optimizes for maybe dense retrieval metrics, the model wanted to crush a public semantic similarity benchmark. Embedding recall jumps. The offline NDCG metrics were looking great, but search quality was actually dropping, and it took us a while to actually notice it. It was dropping for some product surfaces, and that's where the challenge comes in. 
It was dropping for trending, fast-evolving topics, and later we found out that the benchmark that we had defined was pretty static and news agnostic, and the Twitter product was not news agnostic. It was optimizing for freshness. The traffic was based on recent relevant data, which the benchmarks had not captured. The model learns to over-analyze and emphasize lexical similarity, and underweight freshness and authority signals. Your offline metrics can soar, but the online metrics, engagement, the recommendation metrics, all this is falling because the benchmark distribution never resembled real-time, high-velocity Twitter Search, which was what the product was optimizing for.\n\nThis is the gap between what is the product trying to achieve and how are we trying to define our benchmarks. The benchmarks eventually won, but it was a huge loss for the product. This is the danger of optimizing for this benchmark theater, which is metrics and something that we have defined at the start of the process, but did not evolve to keep up with the product. You're performing for the benchmarks instead of performing for your users and product. The fix is private evaluation sets that actually match your production workload. It's more infrastructure to maintain. More of your engineers will be involved in labeling, calibration, judging the outputs, but it's the only way to actually avoid this trap.\n\n## Case Study 1: Search and Recommendations\n\nLet me show you what happens when evaluation debt catches up with you in real-time distributed systems. Let's start with our first case study. I'll walk you through when we were building the highly personalized search version at Twitter. The goal was to actually understand the context, not just keywords, which is where we were at the time. Serving billions of queries across multiple data centers at global scale. It was our main search, local events discovery. 
We had to pick up the relevant signals: trending topics, artists, and content recommendations. The system architecture was this: we had the inverted keyword index, which was based on Lucene. The ML ranking models were trained on terabytes of historical engagement data. We set up our dense embeddings for semantic understanding, user history signals when available and when latency budgets allow, temporal context for trending and seasonality of content. We had to take in local signals for geographically relevant data.\n\nThink merchandising. If there are events happening in your city, we want to boost those tweets up to the users. The scale requirements, of course, there were billions of queries, sub-100 millisecond latency budgets at the 95th percentile, hundreds of ranking features being scored in real time, so it was very compute intensive, distributed across multiple data centers. This was not on the cloud. It was all on-prem. We had to optimize for fault tolerance as well. The engineering promise was we wanted to give the users more contextual understanding beyond simple keyword matching and personalization that adapts very quickly, in real time to the user needs. Fresh timely results that actually capture what's happening now and not two days ago. All of this fast enough for a very interactive real time experience. This was the new frontier search experience that we were building at Twitter. The models were also being trained pretty freshly.\n\nHere's where the evaluation debt started creeping in. This is a very important slide, in my opinion, because at scale, you hit four systematic challenges that traditional evaluation doesn't capture. These challenges are across recommendations, search signals, personalizations, real-time adaptive interfaces. Challenge number one is a very cross-cutting one, model staleness versus retraining cost. Retraining a ranking model is very expensive in compute. 
It's going to cost us a lot of engineering time for data pipeline updates, validation, and the deployment and the adoption. That's real infrastructure cost. User preferences shift rapidly. New trends emerge, new preferences emerge, seasonal patterns change. There are many local events happening that we need to incorporate. By the time you're finishing retraining your Jan data, you already have to be serving March queries. The model is already stale. You're stuck in a loop.\n\nYou retrain too often, you get highly personalized results, but the cost explodes. If you don't retrain, it's cheap, but it's a very generic recommendation system. How do you apply this ML very effectively? If you have a large foundational model for your enterprise, do you break it up? Do your downstream teams have the power budget to actually use the model's intelligence? Do your product surfaces need more fine-tunability, more smaller models that they can actually play around and experiment with? How is your tech stack looking to actually adopt these foundational model changes so that you can iterate very quickly through these changes? Challenge two is surface-specific optimization is infeasible. Local event search needs very heavily weighted recency and location signals. Artist discovery needs taste graph traversal and semantic similarity. Trending topics needs velocity metrics and engagement signals. Very different signals and metrics going on to be able to render this product surface.\n\nThese are fundamentally different ranking objectives as well. In an ideal world, you would fine-tune separate models for each surface, but at scale, that might not be practical for every enterprise. The infrastructure overhead basically is just massive, and the maintenance burden also compounds. At least you start with shipping a one-size-fits-all model, and it underperforms for every surface because it was optimized either for none or just for one surface. 
Challenge three is expensive personalization signals. Looking up a user's full history is very expensive. It adds up to 50 to 100 milliseconds, and those are 50 to 100 milliseconds you don't have in your budget. Most of these queries are also one-off. New users, logged out sessions, very exploratory, incognito mode. You don't even have user history available. The tradeoff is full personalization gives you great results, but it actually kills your latency SLA.\n\nThe challenge four is eval-product misalignment. This is what I mentioned before as well, and this is a critical one. Product reality might be 75% to 85% relevant results in the first iteration is good enough. Users are forgiving based on what kind of product. They'll scroll through a few results. As long as most of the results are relevant and engaging, and there's some diversity, the experience is acceptable. Our evaluation infrastructure, our engineering community is probably optimizing for some random metrics, which is way higher than that 75% to 85% product requirements. We are over-optimizing for the wrong metrics, and those were experiences the users didn't need in the first place. The insight here is that evaluation should strongly mirror that product reality, not necessarily exceed it. If good enough is 80% for the product, your evaluation should also measure that 80% threshold and not some arbitrary higher number all the time.\n\nThat was our evaluation debt in this case study. We were evaluating for perfection while shipping good enough. The mismatch was invisible on our dashboards until the user trust started eroding. For example, the user searches for Bob Dylan type of music. They're probably expecting Joni Mitchell, Joan Baez, Leonard Cohen, folk exploration, maybe local artists nearby. 
What they got was literal tweets mentioning Bob Dylan, Taylor Swift, because she gets clicks and we were actually benchmarking for clicks, wrong genre entirely, and concert announcements from 2018, which was not at all fresh. Here's the invisible part, our dashboards were actually telling us we are super green, all the metrics are great, latency metrics are looking great. Technically we were right. User satisfaction was declining quarter over quarter, and that wasn't really visible on our dashboards.\n\nWhy did this fail? First, because of the semantic gap. The model treated this as a keyword search, that's what the benchmark actually optimized for, but that was not the main goal that the product was going after. We did not trigger the taste graph traversal. We had those dense vectors available, but they were working in silos, because of which we were not able to integrate those signals. Second was optimization mismatch. We optimized for clicks when we built the benchmarks, but we needed discovery quality. Taylor Swift gets clicks, but that's not what the user wanted. Third was signal integration failures. Even if the infrastructure and the engineering teams had the right recipes in place, we had not integrated those signals to come together to boost the search quality up. The model was optimized, infrastructure was solid, every service was performing within SLA, but we were still measuring the wrong things.\n\nHere's where we should walk through the five layers. Layer 1 is the model correctness, it passed, 92% accuracy. Embeddings were working. Layer 2 was infrastructure robustness, this was also there. Our latency budgets looked good, metrics were good, they passed our uptimes. The distributed systems were handling load spikes, no service degradation. Layer 3 was the product guardrails, and this is where we failed. 
We had no semantic intent validation, no checks of whether what the users really wanted was what we were showing them contextually, and no staleness filters to actually demote the content. Layer 4 was also human experience. This is also an important one with respect to perception, visual representation, UX. We failed here because we did not have any evaluation for, does this answer what the user actually wanted? It was more going for benchmarks and rows and spreadsheets when we had set up these metrics being passed. We weren't really testing for comprehension per se, and we weren't really measuring the discovery quality at all.\n\nLayer 5 was systemic impact. Here as well we failed because we were optimizing for the wrong objective, engagement instead of satisfaction. We had no long-term trust metrics, which was a huge one for how our users interacted with Twitter. No feedback loop from how these performed on the road, in the field, what was the product hearing, and tying them back to where we were actually calibrating these in our evaluation frameworks. High test accuracy, excellent infrastructure metrics, but we still had declining user satisfaction. This is often what it looks like in production distributed systems, where it starts compounding and it usually shows up in user trust.\n\nHere's what we learned from all of this. Technical correctness is not semantic relevance. The model can be right and still we could be showing not so engaging and relevant responses for the user's intent, which is a big one. Semantic relevance is not discovery quality unless you're really measuring for it. You can return relevant results that don't help the user explore or learn something new. Then discovery quality in turn is not user satisfaction. You can show great results, but present them in a way that is confusing or doesn't really seem intuitive. Then user satisfaction is not long-term trust, and this is very critical. 
You have good short-term metrics to measure how your product is doing, and maybe it's also good, but slowly eroding user's trust behind the scenes, and this also affects the user's relationship with your product. Each of these transitions actually needs a new evaluation framework layer, different instrumentation, different infrastructure, richer tooling, more platformization. If you test only layer 1, you're flying blind on all the other layers where actually the critical information is passing.\n\nWhat we changed was we aligned the evaluations with product reality. We actually sat down with the product managers, the product team, design and research to understand where is it that the product was evolving in the next few months, and tried to match the benchmark metrics. Second, stratified evaluation strategy. Not all surfaces need the same level of evaluation metrics. If it's high intent queries like artist names, we demand a 90% accuracy, whereas if it's exploratory queries like music of this type, what's happening in the city, show me all the results, top 10 results from the last 24 hours in Japan. These kinds of exploratory queries, there is some room for error, and maybe a 70% accuracy is fine. Having strategic weights for how you want to structure your evaluation frameworks also gave us a lot of progress.\n\nThird was proxy metrics for expensive signals. Wherever we cannot use the whole member viewing history, all of the users' history, we can use directional signals to say that this is good enough. It's in the right ballpark. It might not be 100% accurate, but it's relevant and engaging, and you can adaptively continue to recommend to continue that user's trust. Fourth was we realigned on our optimization objectives regularly. It changes with every iteration. We build software in iterations. We need to build our evaluation frameworks as well in iterations. 
Just make sure that those rhythms are happening side by side, so that every time a product evolves, our evaluation frameworks also evolve, and that happens cross-cutting across all the product surfaces. The result was an evaluation framework that actually mirrored our infrastructure framework, and product reality instead of chasing irrelevant metrics and numbers.\n\nWe also have a toolkit towards the end, where I've captured most of this recovery framework, and you can adjust it to your domains. This recovery framework actually goes over, how did we fix it? What did we do week 1 to week 4? What happened in that one month? What did we evaluate? How many datasets did we go through? How much did we spend? What was the ROI in avoiding most of similar type of incidents? It's an interesting study, so feel free to go through it later.\n\n## Case Study 2: Walmart Rewards\n\nLet's go through a completely different domain, financial infrastructure at scale, Walmart Rewards, a cashback loyalty program. This was not relevance, recommendations, search personalization. This was money at stake, real cash, real redemption, trust was everything. The system was at global scale, online and physical stores, dollar-denominated transactions. The promise was that customers earn cash rewards on purchases, and they can redeem those rewards at any point in time. The balance is always accurate, and the transaction is always correct. We tested it in five states, California, Texas, New York, Florida, and Illinois. We picked these strategically, high volume, diverse demographics, and different tax jurisdictions. There was some thought into it. We ran for three weeks in beta. Model accuracy was about 99.8%. Infrastructure, all systems green. Business metrics, they passed. We got the green light, and we shipped.\n\nIt looked great for the first couple of weeks. Week 1, week 2, smooth launch. Transactions were flowing correctly, no visible issues. 
Then by week 3, we started seeing tickets, let's say, and I'm making it a slightly hypothetical case here, for a state like Louisiana, start quietly spiking. In this scenario, customers begin to notice and report that their accounts show zero rewards available. There wasn't a crash or an outage, but it was just not accurate as to what they were seeing on their rewards dashboards. The calculation was correct from the backend, and until we actually started delving richer and deeper into the root cause analysis, we couldn't actually see the hypothetical cracks in the system. Louisiana was a unique tax requirement state for rewards points. Most states treat rewards as simple accounts, non-taxable, straightforward. This state classified rewards above a certain threshold as taxable rebates, triggering completely different accounting requirements.\n\nThe backend handled this distinction perfectly. The calculations were tax aware, and the accounting was correct, but the display layer wasn't. This was purely our UX. Imagine the flow. The system computes an immediately redeemable rewards balance. In Louisiana, that requires the tax withholding step first, unlike other states. If that withholding logic hasn't been implemented in the UI layer first, the system could perfectly logically report an immediately redeemable balance of zero. The user has something more than zero but is seeing zero, which is the error case. From the system's point of view, everything is correct. Technically, the backend is still doing the right thing. From the customer's point of view, it's catastrophic. They see it as, I earned $47, but I'm seeing zero, is the system a fraud?\n\nThis was like a 0.2% display error, which sounds minuscule on paper, but with 25 million users per month, that's just like 50,000 incorrect displays. Something that is the opposite of a rounding error. That's where it becomes really critical. 
In this hypothetical scenario, call center volume in Louisiana exploded. More engineering hours were spent on solving this issue. Leadership had to jump in. The deeper cost, I think, that we paid was trust. Imagine tracking redemption rates and watching them drop from 68% to 41%, where your main OKRs are to actually track growth and commerce, and they're dropping. Even after we fixed the issue, recovery took us months and years to return to the baseline.\n\nI have another sneaky issue here, which is the iOS perception failure. This speaks to the UX issues and how important it is to actually validate our models from a UX perspective. How well is the UX team incorporated into your adoption and deployment models? Can they actually stop your deployments if there are perception failures? This is actually a great example of that layer 4 principle that we talked about. I'll skip over the details, but that was part of the incident as well, where even if the buttons work or the users are able to access the feature, if it is not inviting and engaging enough, then your metrics are still suffering.\n\nAgain, this is a five-layer breakdown for Walmart, the way we did it for the previous case study. Layer 1 passed. The reward calculation was mathematically correct. The metrics were right. Technically, we had the math behind it. Layer 2 passed as well. Infrastructure, no issues. Layer 3 was where the problems were. We were still missing the guardrails. Product and eng were not in sync. UX, product, and eng were probably not in sync to say that what is technically accurate is not what is visually accurate for the users. Layer 5 was the systemic impact, where the main cost of all of this was eroded trust. We had no trust recovery algorithm. We didn't model what would happen if users saw an error in their financial balance. We did not benchmark that. 
We did not capture that as a metric.\n\nI'm not saying that the product team did not try to measure that, but we weren't measuring it in our evaluation frameworks. This is a good example of how it actually spans across these layers, and correctness is not trust. These systems were correct. Our infrastructure was right, but the users didn't trust it. Trust is what matters the most in these financial infrastructures. In most of the world-class businesses, trust and taste are what the users are going for. This graph actually shows that relationship very clearly. It took us one week to destroy that trust. Redemption rates went from 68% to 41%. We took 12 months to recover. That's a 48 to 1 asymmetry. To recover from it, we had to spend weeks to be able to actually invest in emergency error analysis, expanding our golden set, actually sitting and doing a lot of calibrations.\n\nThere's no way around human evaluations and manually labeling the data, and actually taking a good taxonomy of all your errors. How do your systems fail? Do you have a good view into what are the top 20 ways your product can fail? The final metrics that we got were, I think, GOH cases, errors reduced from 0.2% to 0.02%, and call center volume started going back to normal again, and user complaints dropped by 95%. We knew that it was in recovery mode. That was a year of innovation, experimentation, prototyping, not unlocked to its fullest potential. The lesson was that technical correctness is easy to recover. Trust erosion, not so much. Trust requires that sustained excellence over years that the business would have actually invested in its brand and taste building.\n\nThat's one of the key takeaways: these two failures look different, search and discovery versus financial infrastructure, entertainment versus e-commerce, different surfaces and different users, but the pattern is identical. 
The model passes, infrastructure passes, guardrails maybe not so much, UX evaluation might be missing, there's no systemic foresight, and that's when we know that correct is not equal to trust. I think evals are so much about building that user trust, and it keeps evolving because your product is evolving.\n\n## Why AI Adoption Fails\n\nI've captured some thoughts on how AI adoption fails. I'm going to skip through this, but there are some good examples on why we shouldn't be saying we'll add evals later. Or why we shouldn't be saying, evals are the ML team's responsibility, product builds features and ships features. Why we shouldn't be going for that 95% benchmark and say ship, when we don't know if those metrics are actually capturing trust. Why if you don't build for the evaluations today, you'll probably be paying compounded interest six months from now. Go through that. I have the slides with all the details, and you'll also get a toolkit towards the end.\n\n## The Maturity Model\n\nHere's where the maturity model eventually comes into place in most of these organizations building this AI infrastructure from scratch. Level 0 is your YOLO mode. Test in production, reactive firefighting, I hope you're not doing this. Level 1 is basic metrics, table stakes, precision, recall, static testing, your unit testing. Level 2 is that multi-layer evaluation that we talked about, all five layers covered. At this level, we are still doing it in silos, we don't have the integration. Level 3 is where you have integrated across your cross-domain product surfaces, across tech stacks, across engineering pillars, across organizations, across design, product, research. How do you bring people and decisions in one place to be able to keep them evolving? Level 4 is adaptive systems. If your product is evolving, your evaluation frameworks need to evolve. 
Do you have the right level of decision-making leadership to actually fund this and invest in it?\n\nBecause most companies think they're at level 3, but they're probably at level 1.5, maybe at 2. That gap is your evaluation debt. This is also a diagnostic, I think it's a fantastic way that helped us actually take stock of where we are, and you can use it in your domains, in your teams, Monday morning rituals, go back to your teams and try to run it. Ask yourself these four questions. If you answer yes to all four, you are below three, and that's ok. That's where most of the organizations are, but we start to evolve beyond that. Again, the Monday morning audit goes over, where should you start? How can you start your error taxonomy? How can you go through all your infrastructure tooling, platformizations, your self-serve mechanisms, your runbooks, your on-call mechanisms, to be able to actually have that deep view into how your infrastructure makes decisions. Then, how do you estimate the impact of having these evaluation frameworks? What's working? What's not working? How do you have these feedback loops with product?\n\nThen, are you able to create a phased evaluation framework as part of your roadmap? Can you invest in UX frameworks? Do your UX teams have the authority to block deployments if something is failing visually? Those kinds of things. There's also an implementation blueprint based on where you are, like level 1 versus level 3. You can start at different places. There is an extensive decision-making tree that we used to be able to decide where should we go from here and where is our 80-20 ROI. I've included this in your companion toolkit towards the end so you can adjust it to your domains as well. This is a good audit of what you can do in one month, going from week 0 to week 3, to be able to understand what's working and what's not working, and how can you eval your framework.\n\n## Key Takeaways (Principles)\n\nEvaluation is a stack. 
It's not a score. It's not binary. Silent regressions are inevitable, so we should be building for them: add guardrails. Principle three is human in the loop is unavoidable, not optional. LLMs scale evaluations, but you have to have that human in the loop at every level of your product iteration. Invest in your labeling pipeline right from the beginning. Principle four is, guardrails and trust are first-class evaluation layers, not an afterthought. Five is, evaluation needs to evolve monthly if your product is evolving monthly. If you're adding new product surfaces, you should be adding evaluation frameworks. If you are experimenting and prototyping, you should be adding those metrics into wherever you're A/B testing and wherever you have those offline, online metrics.\n\n## The Companion Toolkit\n\nBefore we close, something you can use for Monday morning is all of this. You can scan the code. It's going to be there. It's a full companion toolkit, something that we used. I have fine-tuned it, bringing signals from different domains and different companies. What's in there is exactly that. It has a maturity assessment framework just to know where you are, where your org is, or maybe where your team is. Then you have the five-layer evaluation checklist. You have the error taxonomy. How do you adopt? What is your adoption? How are your pipelines looking? Are they usable by your downstream users? There is also an evaluation debt audit that we like to do pretty regularly. We're always surprised how the gaps either show up or have increased every time we come back to it. They're pretty battle-tested across multiple domains. They also list what they are not. I hope they're also honest. Fine-tune it for your context.\n\nThey're yours. If you're starting tomorrow, and you should, you should begin with two questions. Be as specific as possible when you're answering these questions. What are you not evaluating today? Not, we'll try this and we'll figure it out, but what's the gap right now? 
Then, what's the cost if you're wrong? If you have not evaluated certain things that are important for your users, what's the cost of being wrong? That gap is where your evaluation debt mostly is. Once you start answering these questions and coaching your teams to think in this format, and I really like the takeaway from the keynote where the focal point is thinking from the product's perspective, we slowly start to bridge that gap. That's what good evaluation systems should be giving you, not just reliability and velocity, but the confidence to move fast without actually breaking things, especially user trust.\n\n## Conclusion\n\nEvaluation debt shouldn't really be an afterthought. It should be an evolving system, and not really a checklist that only one or two teams inside your company should be doing, but more of a holistic mechanism that can bring these principles together. You have the checklist.", "url": "https://wpnews.pro/news/presentation-building-evals-for-ai-adoption-from-principles-to-practice", "canonical_source": "https://www.infoq.com/presentations/eval-ai-adoption/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global", "published_at": "2026-05-29 12:00:00+00:00", "updated_at": "2026-05-29 12:13:39.365872+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-products", "ai-infrastructure", "mlops"], "entities": ["Mallika Rao", "Twitter", "Walmart", "Netflix"], "alternates": {"html": "https://wpnews.pro/news/presentation-building-evals-for-ai-adoption-from-principles-to-practice", "markdown": "https://wpnews.pro/news/presentation-building-evals-for-ai-adoption-from-principles-to-practice.md", "text": "https://wpnews.pro/news/presentation-building-evals-for-ai-adoption-from-principles-to-practice.txt", "jsonld": "https://wpnews.pro/news/presentation-building-evals-for-ai-adoption-from-principles-to-practice.jsonld"}}