Open Source CEO by Bill Kerr
Unfiltered: The System Vanta Built To Grade Its Own AI
How they build language models as the judge, human as the jury and more. 🧗🏻
👋 Howdy to the 4,298 new legends who joined since our last edition! You are now part of a 473,087-strong tribe outperforming the competition together.
LATEST POSTS 📚
If you’re new, not yet a subscriber, or just plain missed it, here are some of our recent editions.
⚡️ [The Founder Lightspeed Backed Twice](https://www.opensourceceo.com/p/ali-hussain-interview). An interview with Ali Hussain, Co-Founder & CEO at Tabs.
🗑️ [The Future New Media: a16z](https://www.opensourceceo.com/p/a16z-guest-post). How the famous VC firm has thrown out the old playbook.
🌟 How Vanta Learned To Trust AI. Wiring AI into the one product where AI isn't supposed to be trusted.
**PARTNERS** 💫
AI is the fastest-moving field most of us have ever worked in, and the hiring at frontier companies looks the same.
The Athyna AI Job Board watches openings at the labs and AI-first companies actually shaping the field, matches them to your profile, and sends the strongest fits straight to you.
Set up a profile once. Let the right roles find you.
Every founder hits the wall where there's more work than people. Viktor is the AI employee that lives in your Slack and Teams.
Brief him like a new hire and he writes the docs, builds the landing page, runs the campaign, and ships the code. Viktor connects to 3,200+ tools and gets to work before you ask.
Interested in sponsoring these emails? See our partnership options here.
HOUSEKEEPING 📨
Not much to report as of today. It’s 8:19pm on Sunday evening in Melbourne, which is technically 19 minutes after I planned to ship this newsletter. It’s cold in Melbourne, kinda miserable, too. The only things keeping me going are training and dog walks. But things are okay generally.
Anywho, we have a nice piece for you today. This is another one from the cutting room floor of our recent Vanta piece. These are the raw interviews with two of Vanta’s AI leaders. Super interesting stuff. I am sure you’ll love it. Cheers!
INTERVIEW 🎙️
Iccha Sethi is SVP of Engineering at Vanta, where she leads all of engineering across the company's trust management platform. Before Vanta, she was Senior Director of Engineering for Compute Products at GitHub, overseeing Actions, Codespaces, npm, Packages, and Pages, and before that held principal engineering and director roles at InVision, Atlassian, and Rackspace. Vanta, now past $300M in ARR with over 16,000 customers, is well beyond product-market fit and pushing hard into enterprise. Iccha joined to lead all of engineering through that transition, from Series B to what is now a Series D company.
What makes her perspective worth paying attention to is the overlap between her background and the specific problem Vanta is now trying to solve. She's been inside two of the companies most associated with developer infrastructure at scale (Atlassian and GitHub), and she experienced compliance pain firsthand at both, as a practitioner, not as a vendor. She came to Vanta already knowing what it feels like when the process breaks. Now she's on the other side of it, building the AI systems that are supposed to fix it, while managing the harder question that nobody in this space talks about enough: how do you build AI-powered features into a trust product when AI is, by nature, not trustworthy?
There's a whole set of security and compliance companies that aren't even thinking about AI in this space. One group is almost too afraid to touch it because they think security and AI are orthogonal—AI is too risky, it's non-deterministic, it's probabilistic. Then there are the companies that are a bit more AI-forward. They know AI sells, and they're just trying to get it out there. What people truly need to be thinking about is, now that we have a new piece of technology that can analyze information and is really good at coming up with solutions, how can we actually use it? That's what we've been trying to do at Vanta. I've been through a compliance process several times, and it's really manual. We SaaS-ified it, which meant automating some workflows, but even then, it's still do X, then Y, then Z, then ask this person, then go write this document.
What we've done instead is focus on what actually takes a long time in a flow. Writing a policy takes a long time. Analyzing all your information and developing a good set of risks takes a long time. When things change, figuring out which risks need updating and their impact takes a long time. That's still a lot of wading through information for a human, and LLMs are really good at analyzing text and generating suggestions. So that's where Vanta has been investing: identifying what's genuinely painful in that flow, where AI is actually good, and being honest about its non-deterministic nature, figuring out the right place to bring a human back into the loop.
Security and compliance are often mistakenly thought of as a cost center, and they're never funded sufficiently. But it's actually mitigating risk for the business, and every new customer who buys from you wants you to prove you're secure before they sign. Historically, if you ask most security teams, they'll tell you they could use more people. What Vanta helps you do is achieve a lot with a smaller team. There's a ton of context in any organization (especially as you grow, a lot is happening in different parts of the company), and it helps you pull all that context together, context that no single person can hold in their head. You don't need to hire a huge team when AI can do a lot of that analysis and workflow streamlining for you. So I'd put efficiency first, speed second, and quality third.
Every feature in Vanta will have some form of AI at this point. We just call them agentic features, and the quality we're talking about is the quality of the outcomes they're enabling in the compliance space. For example, say your company was fully in-office and decided to go partly remote, or vice versa. Now you have to update your policies to reflect that your infrastructure situation has changed. Your policies impact your frameworks and your controls. A human is fallible; they won't always remember that this policy maps to this framework, which maps to these controls, and that they need to go update or tweak those, too.
What Vanta does in a case like that is, when you go to update your policy, the AI agent flags that the update affects your compliance posture across certain controls. In that way, your compliance program improves in quality because the agent helps the human do their job better by catching what they might have missed.
We have multiple features like this. We understand what we call your ‘trust graph’—the whole picture of your company: your risks, your frameworks, your vendors, your customer commitments, your legal obligations. The goal is to be your compliance brain, so that even if no single person in your organization holds all of this, we are nudging you and surfacing suggestions to make sure your program is actually sound. That's how AI raises the bar on compliance quality.
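To make the policy-to-framework-to-control chain concrete, here is a minimal Python sketch of how a change to one object could be traced to everything downstream of it. The graph contents, the node names, and the `affected_by` helper are all hypothetical illustrations, not Vanta's actual data model or implementation.

```python
from collections import deque

# Hypothetical trust graph: each object points at the objects that depend on it.
TRUST_GRAPH = {
    "policy:remote-work": ["framework:soc2", "framework:iso27001"],
    "framework:soc2": ["control:device-encryption", "control:vpn-access"],
    "framework:iso27001": ["control:vpn-access"],
    "control:device-encryption": [],
    "control:vpn-access": [],
}


def affected_by(changed, graph):
    """Return every object reachable from `changed`, in breadth-first order."""
    seen = {changed}
    queue = deque([changed])
    order = []
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


# A policy edit surfaces the controls a human might forget to revisit.
impacted = affected_by("policy:remote-work", TRUST_GRAPH)
controls_to_review = [n for n in impacted if n.startswith("control:")]
print(controls_to_review)
```

The breadth-first walk visits each object once even when two frameworks share a control, so a shared control is flagged a single time.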
The other side of it, how we build these things, matters too. If you give incorrect suggestions, users lose trust in the product. They start ignoring it, like they ignored Clippy. So we put a lot of work into making sure all our AI features are genuinely high quality. Evals, experiments, LLMs as judges, an eval maturity model, tracking metrics. All of that is important to us because the higher the quality of our AI features, the higher the quality of your compliance program.
It varies by customer segment, and we've deliberately built it that way. For smaller companies, the approach is: give us your context. We have agents that go and pull information in the background, and then we validate it with you. We take a first pass and present it back to check whether it’s accurate. Then they edit and refine from there. We have something called ‘policy builder’. It's very interactive, walking through what they want to achieve and who's responsible for what, and we build up policies together that the human can then review and edit.
The further upmarket you go, the more that changes. Larger companies typically come in with their own policies already written. They're not looking for help creating a policy as they've already done the work. What they want help with is connecting their existing policies to the rest of their program.
It always starts with the customer problem. Engineering, product, and design work closely together. Then we encourage engineers and product managers to go build a quick prototype. That first pass lets you see what you want the prompt to be, what the flow should look like, where the human should be in the loop versus where the AI should be, and how you should surface the response. Not all AI features need to be a chatbot in your face. Some show up more subtly on a page; more MCP-first.
Once you've nailed the basic flow, you get into what we call the ‘quality hill climb’. We work closely with GRC subject matter experts (we have an entire organization of GRC experts dedicated to helping us build products). We work with them to build the ‘golden dataset’: what are the inputs into the system, and what would a GRC expert produce as an output? We have a firm policy of never training on customer data.
That's where the in-house expertise earns its keep. They help us create these synthetic, expert-grounded datasets. We start running the feature through that ‘golden dataset’ and scoring whether the AI output matches what a human GRC expert would produce. Whenever it falls short, we go back until we feel good about it.
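A scoring loop of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the two golden examples, the stubbed `feature_under_test` (a real feature would call a model), the set-overlap metric, and the 0.75 bar. The interview does not say how Vanta actually scores outputs.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    scenario: str            # input to the feature
    expert_risks: frozenset  # what a GRC expert would produce


# Hypothetical, synthetic golden dataset (no customer data).
GOLDEN = [
    GoldenExample("remote workforce", frozenset({"lost laptop", "unsecured wifi"})),
    GoldenExample("new vendor", frozenset({"data sharing", "vendor outage"})),
]


def feature_under_test(scenario):
    """Stand-in for the AI feature; returns canned outputs for the demo."""
    canned = {
        "remote workforce": {"lost laptop", "unsecured wifi"},
        "new vendor": {"data sharing"},
    }
    return canned.get(scenario, set())


def overlap(predicted, expected):
    """Jaccard overlap between the feature output and the expert output."""
    union = predicted | expected
    return len(predicted & expected) / len(union) if union else 1.0


def hill_climb_report(dataset, bar=0.75):
    """Score every golden example and mark whether it clears the bar."""
    report = {}
    for ex in dataset:
        score = overlap(feature_under_test(ex.scenario), ex.expert_risks)
        report[ex.scenario] = (score, score >= bar)
    return report


report = hill_climb_report(GOLDEN)
for scenario, (score, ok) in report.items():
    print(f"{scenario}: {score:.2f} {'pass' if ok else 'needs work'}")
```

Examples that fall below the bar are the ones to take back into the next iteration of the prompt or flow.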
When we're ready to move to production, the key is instrumentation. One of our engineering managers, Andy, built our ‘AI evaluation maturity model’. Every agentic feature needs traces, so you can see what tool calls the model is making, what responses it's returning. From that foundation, you build evaluators, and as you scale up traffic and exposure, you move to LLM-as-a-judge. And the most mature state is running experiments. The biggest difference from traditional software is that it doesn't end when you ship. The old ‘I'm done with this feature’ doesn't exist anymore. This is a constantly evolving field, and something we shipped a year ago might be running on a deprecated model. You have to keep evaluating, and that’s something we take very seriously.
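The first rung described here, traces, can be pictured with a small sketch. The `Trace` class, its field names, and the example tool names are hypothetical. In practice a team would more likely use a tracing library, and nothing here reflects Vanta's internal tooling.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Records the tool calls one agentic feature makes while handling a request."""
    feature: str
    steps: list = field(default_factory=list)

    def record(self, tool, arguments, response):
        # Keep inputs and outputs together so an evaluator can inspect them later.
        self.steps.append({
            "tool": tool,
            "arguments": arguments,
            "response": response,
            "at": time.time(),
        })

    def tool_calls(self):
        return [step["tool"] for step in self.steps]


trace = Trace(feature="policy-update-check")
trace.record("lookup_policy", {"name": "remote-work"}, {"found": True})
trace.record("list_controls", {"policy": "remote-work"}, {"count": 2})
print(trace.tool_calls())
```

Once every feature emits records like these, evaluators and an LLM judge have something concrete to read.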
When we move toward production, we go through release phases. There's private preview, where we work closely with early adopter customers or design partners who've been waiting for a feature, reviewing quality and checking whether the response meets the bar. From there, public preview, and then GA, where it stops being humanly feasible to review every agentic output. We still do some human evaluation, especially for anything a customer has thumbed down or marked as inaccurate, but for the bulk of outputs where customers don't give you that explicit signal, you still need to know whether your AI feature is doing a good job. That's where the LLM-as-a-judge evaluator comes in.
If you ask a GRC expert how they would evaluate whether AI is doing well, they'd say that they look for certain things in the response, certain fields structured this way, and they want the tone to be consistent with Vanta as a brand. So we build that into a prompt, essentially encoding how a human reviewer thinks, and run all the feature outputs through the judge, which scores them across multiple dimensions. Some are about the correctness of the output itself, while others are about tone and brand consistency. It produces scores, and then humans use those scores to prioritize which outputs to dig into, rather than reviewing everything blindly.
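The last step, using judge scores to decide what humans look at first, could be sketched as follows. The rubric dimensions, the 1 to 5 scale, and the hard-coded scores are assumptions for illustration. In a real system the scores would come from the judge model rather than a literal list.

```python
# Hypothetical rubric dimensions a judge prompt might score from 1 to 5.
RUBRIC = ("correctness", "structure", "tone")

# Stand-in for judge output; a real pipeline would get these from an LLM call.
judged = [
    {"id": "out-1", "scores": {"correctness": 5, "structure": 5, "tone": 4}},
    {"id": "out-2", "scores": {"correctness": 2, "structure": 4, "tone": 5}},
    {"id": "out-3", "scores": {"correctness": 4, "structure": 3, "tone": 5}},
]


def weakest(output):
    """An output is only as good as its lowest-scoring dimension."""
    return min(output["scores"][dim] for dim in RUBRIC)


def review_queue(outputs, top_n=2):
    """Return the ids humans should dig into first, worst score first."""
    ranked = sorted(outputs, key=weakest)
    return [o["id"] for o in ranked[:top_n]]


print(review_queue(judged))
```

Ranking by the weakest dimension is one design choice among several. A weighted sum would let correctness count for more than tone.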
The non-deterministic nature of LLMs means that even with the same model and the same prompt, you can get different outputs. I ran the same demo six times yesterday with the same model and prompt, and on the sixth run, it gave me a completely different answer. That can happen because frontier companies are also constantly tweaking their models underneath.
The other reason is coverage. When you build a feature, you design for the customers you know and the use cases you've imagined, but maybe you didn't consider a specific type of company or setup.
So everyone else is having a great experience, but a customer in a particular industry with a non-standard infrastructure configuration is not. You might have done a perfectly reasonable job designing the feature, and just didn't have that case in your dataset. The LLM-as-a-judge workflow catches things like that and flags that you need to go back and adjust how the feature handles that new scenario.
A percentage of my time is forward-thinking—for the company and for the engineering org—and a percentage is running the engineering org operationally. I actually separate those. Forward-thinking for the engineering org has been a big chunk of my time lately, especially with everything happening in AI. I'm constantly experimenting with how we can be more agentic and what frontiers we should be pushing in the compliance space. Another big part is thinking about how engineers at Vanta use AI to improve their own productivity. For greenfield work, it's easy to move fast. For brownfield areas, you have to think harder about architecture and API design. I just created an AI developer experience team to focus on exactly those questions.
On the operational side, I spend time with stakeholders outside engineering—product, design, our CEO, Christina, marketing, and sales. Then a lot of time with my direct reports, and I do a lot of office hours. In the last three months, I've done over 50 of them. It's my style to stay connected to the ground. Right now, I have 28 Slack messages sitting there, and probably 15 of them are from individual contributor engineers with whom I have a direct line on some topic or other. That accessibility matters to me, no matter how large the engineering org gets.
Ignacio Andreu is the Director of Engineering for AI at Vanta. He joined the team in late 2024, after four years as a Software Engineering Manager at Google, where he led the internal tooling for generating and annotating data to train the LLMs powering Google Search's AI Overviews. Earlier in his career, he co-founded Masterbranch, a startup that aggregated open-source code contributions to build verifiable developer CVs, raising €470k and landing coverage in TechCrunch before eventually moving to San Francisco and the American tech ecosystem. At Vanta, his AI team grew from 3 to over 20 engineers in 15 months, and now has 45 people.
Ignacio is an interesting voice on AI in enterprise software because of the specific problem he's working on. Vanta sells compliance, which is fundamentally about trust. LLMs, historically, are not trustworthy. Threading that needle, building AI-powered workflows into a product where a wrong answer isn't a minor annoyance but a potential audit failure, puts him in one of the more demanding engineering contexts in the current AI wave. He came from Google Search, where latency is measured in milliseconds and 3B daily users create zero tolerance for sloppiness, and he's now applying that data-first, eval-driven rigor to a company of a very different scale with very different stakes. He has strong opinions about what gets lost when engineers apply 2019 thinking to 2025 models. And he doesn't sound like he's read them in a blog post.
I'm going to answer that more in product terms, because I think that's where it actually lands. Vanta has been very useful for a long time, but we can now do much more for our customers with AI. There are many manual processes in the compliance space that LLMs and agents can handle, but one very important distinction we make is that we always want the user to be in control. This is your program; this is your information. You need to know what you're signing off on. So even when our agent is working in the background doing more, you are still in control. You are always the decision-maker.
From an engineering point of view, that means we start by involving LLMs in many of these decisions, using agentic workflows, giving those agents more context, and providing enough information to the LLM to actually reach an outcome. When we talk about agentic development, especially with reasoning models, we're talking about outcomes-based design. We tell the LLM what we want to happen, but we don't need to be prescriptive about every single step. We let the LLM decide.
From an engineering perspective, it also means that the cycle has changed. It's a lot more data reviews, data iterations, evals, and evaluators, and you keep improving the quality because AI features are broad by nature. The variety of input you can see is very wide.
It's both. Think of the Vanta agent as your 24/7 compliance engineer. Someone is constantly looking at your program, aware of what's happening, and recommending what to do based on the shape of your specific setup. In compliance, everything is interconnected. Think of all your objects as a graph. When you change a policy at your company, that usually means you have to change some of your controls. Classically, before something like an agentic platform, you'd change your policy, then manually search through all your controls, find the ones you need to change, maybe think of ones you need to add, review if they don't contradict each other, then move to the next step, and so on.
Now you can start in one corner, let your Vanta agent find everything for you, and do the work. It's both speed and helping companies become more compliant over time through constant monitoring. If you think about Vanta's origins, we monitored your resources, your AWS, your IDP, whatever it was, so you could improve your compliance. With agentic flows, this is a step further. We're monitoring your entire program, and what matters most is that if you miss something, it will catch it for you.
One is intrinsic to the time we're living in—everything changes so fast. What we're talking about doing right now was literally not possible six months ago. That is one key element of how we need to think, which may not be a purely technical problem but is a real one. Lately, when we work on features, my go-to question is ‘Are we doing the 2019 version of this feature or the 2024 version?’ The engineers need to understand what is actually possible right now, because it's easy to have an AI intuition that pushes you toward doing something that makes the model less powerful, where you write very detailed instructions and remove a lot of the capabilities that these LLMs have. Getting that balance right is hard.
And then there's the technical side: software engineering is becoming semi-data science. You have to look at the data, understand it, cluster your problems, then look back at your prompt, your harness, your system, and your context, and decide where to add.
It's evals, primarily, which require datasets and evaluators. When something drops, you want to see how it changes your datasets. You usually end up with two different datasets. One is almost like a regression dataset. Things that can't go wrong, your baseline. The other is more aspirational. Tasks that don't work well today, or a mix. That second one is where you can actually see how a new model improves things, and you can do that without writing a single line of code.
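One way to picture the two-dataset comparison is the sketch below. The pass/fail lists and the rule "no regression failures, then look at aspirational gain" are assumptions for illustration. The interview does not describe the exact decision rule.

```python
def pass_rate(results):
    """Fraction of examples that passed."""
    return sum(results) / len(results)


def compare_models(current, candidate):
    """Compare pass/fail results for two models on both datasets."""
    # Regression set: anything that used to pass and now fails is a blocker.
    regressions = [
        i for i, (old, new) in enumerate(zip(current["regression"], candidate["regression"]))
        if old and not new
    ]
    # Aspirational set: this is where a new model can show improvement.
    gain = pass_rate(candidate["aspirational"]) - pass_rate(current["aspirational"])
    return {
        "safe_to_adopt": not regressions,
        "regressions": regressions,
        "aspirational_gain": gain,
    }


# Hypothetical results: True means the evaluator passed that example.
current_model = {"regression": [True, True, True], "aspirational": [False, False, True, False]}
new_model = {"regression": [True, True, True], "aspirational": [True, False, True, True]}

summary = compare_models(current_model, new_model)
print(summary)
```

Because only the model identifier changes between runs, the comparison needs no change to the feature code.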
After that, I believe engineers should use coding as well as general tools that actually expose these models to them. A lot of this is an art. There's no pure science behind prompting. Model providers such as Anthropic and OpenAI usually publish prompting guidance when they release a new model. You can learn a lot from that. And if they open-source things, like OpenAI's Codex, you can study their prompt structure. People should do that more. Then there's Twitter. I opened it today, and the first tweet was from OpenAI announcing o3. I might or might not have an email from OpenAI about it, but I saw it there first. If you use it intentionally, engage with the content you actually want to keep seeing, your timeline changes. It really does pay off.
Vanta employs GRC professionals (former auditors, heads of security, CISOs) as part of our workforce. So what ‘good’ looks like is a mix of things. There’s tone and verbiage, which matters for something like the Vanta agent, but also deep technical correctness. We leverage those experts to help us annotate information, build evaluators, and get to quality. We do a lot of that work with what we call in-house GRC subject matter experts.
The other thing is data reviews. Once we think a feature is good enough, we put it in front of our early access testers, watch the data come in, and look at it. When we don't understand the nuances, we bring in the experts. Over time, more and more things become clear before we need to escalate, so we invoke those experts only where needed. And the ‘good’ bar varies widely depending on the task. The risk of getting it wrong is different depending on what you're doing. If you're telling a customer they're ready for an audit, that bar is a lot higher than, say, helping them connect an integration.
The first thing is our own Vanta instance, used in-house. After that, we have a group of customers who have volunteered. Think of them as AI-forward companies who want to try things as soon as they're ready, and they're comfortable with rough edges. It's slightly different from a design partnership, which is a bigger commitment, more of a formal feedback loop, and more intensive. Early access is a lighter-touch process. Right now, we have around 20 companies in that group. In some cases, we can only release a specific thing to a few of them because they match the target use case, but we try to keep in very close contact. We integrate the account, we ping them directly, they book time with the team, we jump on calls, and go through it together. After we've validated with those early access customers, we move to a broader rollout, from 1% to 25%. And we're not afraid to roll back. If we ship something and the quality isn't what we want, we roll it back fast. Looking at the data is very, very important.
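A staged rollout with a fast rollback is commonly built on stable bucketing, sketched below. The hashing scheme, the `is_enabled` helper, and the feature and account names are hypothetical. The interview only says the rollout goes from 1% to 25% and can be rolled back.

```python
import hashlib


def bucket(account_id, feature):
    """Map an account to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{account_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(account_id, feature, percent):
    """An account sees the feature when its bucket falls below the rollout percent."""
    return bucket(account_id, feature) < percent


accounts = [f"acct-{n}" for n in range(1000)]
at_1 = {a for a in accounts if is_enabled(a, "policy-agent", 1)}
at_25 = {a for a in accounts if is_enabled(a, "policy-agent", 25)}
rolled_back = {a for a in accounts if is_enabled(a, "policy-agent", 0)}

print(len(at_1), len(at_25), len(rolled_back))
```

Because each account keeps the same bucket, widening from 1% to 25% only adds accounts, and setting the percentage to 0 turns the feature off for everyone at once.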
Your eval datasets need to evolve over time. Not just because you find gaps, but because user expectations drift. As a user of any AI product, what felt impressive six months ago is now your baseline. The same happens with your product and, because the input space is so broad, the types of data you see and the intentions behind them shift too. Keeping your evals current isn't about keeping the lights on. It's about keeping your feature actually good.
Google is a very special company, very different in good and bad ways. What carries forward for me is the importance of quality and its data-driven basis. A very Google way of looking at problems is to look at the data, understand the data, and have evaluators. Google Search has had AI in it for maybe 15 years (a stack of ML models, and now clearly LLMs too). That data and quality orientation are core to how I think. If your feature isn't good quality, if you're not looking at the data, then in my opinion, you shouldn't have the feature. Delete it. That might be a spicy take, but I do believe it.
Some things don't carry over. Google has a lot more resources, so you can throw money at problems in ways that don't translate. The other big thing is latency. I worked in Google Search, which is a very specific part of the company. Latency-first, always. If you've used AI Overviews or AI Mode in Search, it's probably the fastest LLM product for the quality level you can see. That's very deliberate. Google's position is speed and intelligence, not one or the other, and that serves billions of users and billions of searches a day. At Vanta, the trade-offs are different, but I still have great respect for that standard. You adapt it to your problem and your customer.
A lot of meetings. That’s the sad story of my life right now. But turning that around, where I think I'm actually adding value is in looking at the problems coming down the pipe and spending a lot of time thinking about what the right approach is, given the state of technology today, whether it's a novel or a conservative approach. The other big thing is scale. When I joined, my team was three people. Now it's 45, and it's been about a year and a half. There are also more AI people spread across Vanta. So a lot of my time goes into thinking about how I can bring more engineers along.
I think this might be controversial, but the role of a product engineer is shifting. Every product engineer in the near future will need to do prompting and evals, just as another tool, to build products. It'll be like another framework, another front-end library. Not every problem will call for it, but it will be a common way of solving things. More and more people at Vanta need to get there, so I spend a decent amount of time thinking about how we can make that easier and bring people up to speed. One of my teams is the AI platform team. We build the foundations that everyone else builds AI features on.
The rituals are per-project, so everything I describe applies at that level. We have a team-wide meeting on Mondays for general updates, ideas, and a chance for everyone to check in. Below that, each project or workstream has its own weekly meetings. Workstreams can cross team boundaries. They have the freedom to run things however they need to, but they must produce a specific progress report as output. Every AI feature has a data review. The frequency depends on the phase, but a lot of them start daily, then move to several times a week, and in some cases stay there or drop to once a week.
We also have what we call a more open build session. You're working on something other people should know about, so you come and talk about it. Then bi-weekly demos where anyone can share a piece of code, an eval, a product, a design. Anything people will find interesting. And every other month, we do a hack day. The AI teams pick one or two topics, and before the day itself, we talk through different ideas, and people sign up to work on them. By the time hack day arrives, everyone already knows what they're building. We run a dedicated Slack channel, where each project has its own thread, and people post throughout the day. We usually do it on a Friday. On Monday, we come back and demo. We also create a Slack canvas with all the logs from the day and share it broadly. That format has become the way we validate approaches and ideas that are floating around.
How Vanta Learned To Trust AI - June, 2026
‘Shadow AI’ is real. Vanta wants to help manage it - June, 2026
Vanta’s Agentic Trust Platform redefines how enterprises earn, prove, and scale trust - November, 2025
Tying Engineering Metrics to Business Metrics - November, 2024
And that’s it! You can keep up with Iccha and Ignacio on LinkedIn or check out Vanta on their website.
BRAIN FOOD 🧠
TOOLS WE RECOMMEND 🛠️
Every week, we highlight tools we like and those we actually use inside our business and give them an honest review. Today, we are highlighting Granola*—an AI-powered notepad that takes, summarizes, and organizes meeting notes without using intrusive recording bots.
See the full set of tools we use inside of Athyna & Open Source CEO here.
**HOW I CAN HELP** 🥳
Hiring global talent: If you’re hiring tech, business or ops talent and want to do it for 80% less, check out my startup, Athyna. 🌏
See my tech stack: Find our suite of tools & resources for both this newsletter and Athyna here. 🧰
Reach an audience of tech leaders: Advertise with us if you want to get in front of founders, investors and leaders in tech. 👀