Presentation: AI Agents to Make Sense of Data at OpenAI


Transcript #

Bonnie Xu: Today I'll be talking about how OpenAI deployed AI agents to help our teams answer their data questions. Let me first paint you a picture. Your business lead comes to you and asks the question, how many ChatGPT Pro users do we have in Italy? You consult a data scientist, but they're actually like, "This is hard. Let me get back to you." They don't know what table to look at, so they ask another engineer, but then the other engineer doesn't know. After three code deep dives, two quick meetings, and five Slack threads, we finally have an answer. Simple questions shouldn't be this difficult and this time-consuming, but they are. The reason why this is hard is that there is so much data.

I'm Bonnie Xu. I'm on the data productivity team at OpenAI. I'm here to talk to you today about how we solved this problem. Here are some key takeaways I hope you get from my talk. Firstly, the importance of the right data context. Then, I'll be talking about how important memory is for self-learning, and then how important evals are for making sure that the model doesn't regress.

Data Platform Overview #

Let me kick off with an overview of our data platform to illustrate why we need an AI agent in the first place, then I'll go into implementation specifics, and then learnings and the next steps that we have. At OpenAI, 80% of the company directly uses our data platform. That's 80% of the company using our team's 15 tools to process over 600 petabytes of data a day across 70k total datasets. The data is just growing even more rapidly. That means we have so many more questions to answer, but there's so much more data now to sift through to get the right result. When ChatGPT launched in 2022, we were asking ourselves, how many users do we have? As the product has evolved, we've introduced more regions, different plans, more features. We're now asking ourselves the question, how many daily active instant checkout users do we have in New York? That's a much harder question to answer now, but fundamentally, we're looking for the same type of answer.

One of the reasons why this is hard is that table discovery becomes a lot harder at scale. Helene is a data scientist at OpenAI, and this is what she asked on Slack a few months ago. She's having difficulty finding the right table to use because there are a lot of similar-sounding tables, and it's unclear what data is in them. Here's Eric, another data scientist at OpenAI, struggling to make sense of the nuances of each table. This is a hard problem because some tables have encrypted IDs, some tables have unencrypted IDs, but we still might want to join them. Some tables have columns that adjust for fraud rates. Some tables don't. Some tables are pre-filtered by feedback. Some are not. Missing one nuance can lead to an answer that is wrong by an order of magnitude, and this can be catastrophic when making important business decisions.

Not to mention, writing SQL is hard. This SQL statement is 160 lines of code, and I have no idea if it's wrong or not. Because who can remember all the different ways we format dates, or how to write performant queries, or the aggravating fact that Trino arrays are one-indexed?

Kepler - AI Data Analyst (Internal Tool) #

What's a better way of doing this? At OpenAI, we built Kepler, an AI data analyst that takes the full context of the data platform and answers these data questions for you. At its core, the Kepler service leverages the model to produce AI-powered results when you need them. This could be on Slack. For example, on the left here, we have a Slack agent to ping. This could be in your IDE, like Cursor, where you hook up our MCP server. Maybe you are asking our web agent for table information, or you can connect Kepler to MCP platforms for workloads, for example. Let's go through an example. Let's see Kepler in action. Let's say I want to find, for New York taxi trips, which pickup-dropoff ZIP pairs are the most unreliable, so the biggest spread between typical and worst-case duration, and when that happens.

You can see right now that Kepler first does an internal knowledge search. It's looking at, let's get some initial information and let's see what's out there. This is this chain of thought that you're seeing being streamed right now. Here we have the table schema, so Kepler knows how to write the query, and it's running through the query right now. Basically, it's writing all these different queries to try to get the right data. Bucketing is important here. We're looking at percentiles to determine the worst durations. You can see from the SQL, ok, we got back some results. We have to adjust the thresholds because those results weren't what we're looking for. Running some more queries, some sorting. You can imagine doing this manually yourself takes a lot of time.

The agent is just going through these query and result set steps for you and on your behalf. Again, running another query. We have to adjust, so on and so forth. Some sorting going on here. There's a lot of steps in the analysis. Our agent tries to do a thorough job. If you read the SQL, you can tell that there's some pickup. Ratios, I think, are used for the duration. This is the CTE involved. Now there's the analysis part. We're satisfied with the results that we want and we need to produce the actual thing that we're going to show the user. That's the summary here that you see on the screen. Some light formatting, because we want to make sure that it looks pretty. That's always important in the analysis. You can see here we are looking at timestamp. Finally, we get the right answer.

We measure typical by p50 and reliability by the p95/p50 ratio. We get the right dataset and the amount in numbers. This was an example dataset from 2016 that we had available in our data catalog. The results are in New York time. Here's all the results, so that ZIP code and then those are the durations. This is the SQL that actually was used. As you can see, it's pretty big. Then we have a link to the raw results. Here's the final answer. Morning commutes, weekdays, at rush hour, and late nights are the most unreliable, for anyone taking a New York taxi this week. Let's say I also want to do something slightly different. Let's say I want to plot a graph representing these results. I want something a little more visual to better understand. Kepler is able to do that too.
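As a rough illustration of the metric described here, the sketch below ranks ZIP pairs by their p95/p50 duration ratio. It is not the SQL Kepler generated: the nearest-rank percentile method, the function names, and the sample durations are all assumptions made for the example, and the time-of-day bucketing from the demo is left out.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (an assumed method; the talk does not say which one is used)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def unreliability(durations_by_pair):
    """Rank (pickup_zip, dropoff_zip) pairs by p95/p50 ratio, most unreliable first."""
    scores = {}
    for pair, durations in durations_by_pair.items():
        p50 = percentile(durations, 50)   # "typical" trip duration
        p95 = percentile(durations, 95)   # near worst-case trip duration
        scores[pair] = p95 / p50
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up trip durations in minutes, keyed by a hypothetical ZIP pair.
trips = {
    ("10001", "10019"): [10, 11, 12, 12, 13, 14, 15, 16, 18, 45],
    ("10003", "10012"): [8, 8, 9, 9, 9, 10, 10, 10, 11, 12],
}
ranking = unreliability(trips)
```

A ratio close to 1 means the worst trips look like the typical trip; a large ratio means a long tail of slow trips.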

You see here it's figuring it out. It was able to come up with something. Here is the right query that it used to generate the chart result. If I scroll down here, then you can actually see the analysis. On the left, it's the ZIP codes. On the right, it's the unreliability.

That was a toy example using a dataset. Let me give something maybe some of you folks might relate to more, debugging an anomaly. In this example, we're looking at what caused a big upward spike in ChatGPT user growth for weekly active users late March. I'm just going to be sharing screenshots of the chain of thought here. The first part is looking at the right table to check the spike. If there actually was a spike, we should see the numbers before spike and after spike increase. Another part of the chain of thought is actually delving into the data. How do we know that that table was correct? Here you can see Kepler is referencing a dashboard and a Notion document just to confirm that. Also, Kepler interactively delves into the data. It does this by actually just running different queries to slice it up here.

That's what you see by dimensions. Kepler is running queries based on plan type, based on region. This is how it can see what motivated the spike. Like, was it a specific region that increased and that's what caused it? What a human might do for data analysis. Kepler also tries to come up with reasons now that it has all the data. An example hypothesis is, we logged too much. We have duplication issues. Kepler is looking around at internal company context to check that this is the case. Fortunately for our use case, it is not. Kepler was able to figure out that it was related to ImageGen that we launched. Now that Kepler is at this hypothesis, it actually does a web search to check for the timeline to cross reference itself.

Here in this chain of thought, you can see Kepler references the release notes at around that timeframe and also a TechCrunch article because that's how it can really check the trends as well. Finally, we arrive at the result. Kepler was able to figure out that it was due to the ImageGen trend. It worked. I always get nervous whenever Kepler shows an answer. I'm like, is this right or not? Fortunately, it's right most of the time. I'll be talking about that as well.

Let's talk about some other things that Kepler can do. Don't feel bad about asking Kepler questions. Kepler is available 24/7 for a quick conversation. You can also do follow-ups. That's how you get a thread with 56 replies like the one Dominique is having. The really lovely thing is that for all follow-ups, Kepler just stores the whole context, so you don't need to repeat yourself. If you see Kepler veering off track, because we stream the chain of thought and you can see where Kepler is going, you can interrupt Kepler and Kepler will take that feedback and produce a different result. Here's a video example of what I mean by follow-ups. Here, similar thing. I'm asking about just the New York taxi scene because I'm really interested in that. I was asking about the results in the pickup. We have this graph.

Now let's say I wanted to dig into something a little more specific. I want to look specifically at pickup trends on February 14th, a very arbitrary day. I'm asking Kepler here. You can see Kepler is reasoning. This is our Slack agent interface that you're seeing here. There's a little bit of running queries. You can tell that, or hopefully you caught that. Kepler didn't need to do all the initial knowledge search again because we have the context from the previous response about the right table to use. Kepler was able to just immediately run the query for that answer and get the results that there were 375 total pickups in this dataset. Also, Kepler can handle follow-ups. If I ask a very incomplete question, Kepler just picks the best default if I don't follow up.

The other thing that's nice and that we found useful, at least for our purposes, is there's a lot of commonly repeatable processes that people do. Things like feature product analysis or data validation on dev tables versus prod tables. Basically, we have these workflows which are like custom shareable instructions that you can just instantly rerun in the UI.

How Does the Magic Happen? #

That was a lot of demoing and showing you guys what it might look like if you were using Kepler at the company. Let me now talk about how this all actually happens behind the scenes. This is a big picture diagram of how things happen. On the left here, we have the entry points. You've seen a couple of them. There's the UI. There's also the Slack agents. Then, you can hook up local or remote MCP. There's the top part, which is the preprocessed offline information that I'll get to. These are the knowledge bases. There's the bottom part for sync calls, so I'm directly making API calls to our data warehouse or to other data platform sources like Spark or Airflow. This is really the core of it all. It's Kepler talking to the model armed with this toolbox.

I'm just going to take a step back here and explain why using MCP has been so helpful for us. The reason is that Kepler can use all these tools to start off with, and then use them again after realizing that something is wrong. For example, if Kepler just picks two tables to join but uses the wrong key, you'll get no results. Kepler can go back and repeat the steps: do the internal knowledge search, check the schemas, run the query, and redo it all again. You can tell a little bit from my previous example where Kepler was just constantly running queries and looking at results. Basically, this is the agent reasoning by itself. Instead of you giving the feedback, Kepler is running tools, getting feedback, then using the right tools to take the next potential steps depending on whatever feedback is given.

The really lovely thing is that Kepler can interactively explore the data itself and context is carried over the whole time. We have the whole steps which just make for a better answer at the end.
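To make the tool-feedback loop concrete, here is a minimal sketch of the wrong-join-key case mentioned above. Everything in it is hypothetical: `run_query` is a stub standing in for a warehouse call, and `choose_next_key` is a hard-coded stand-in for the model's reasoning, which in the real system decides the next step from the accumulated context.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    # Every tool call and its result is kept, so context carries across steps.
    history: list = field(default_factory=list)

def run_query(join_key):
    """Hypothetical warehouse stub: only the correct join key returns rows."""
    return [("NY", 42)] if join_key == "user_id" else []

def choose_next_key(state, candidate_keys):
    """Stand-in for the model: pick the first join key that has not been tried yet."""
    tried = {args for tool, args, _ in state.history if tool == "run_query"}
    for key in candidate_keys:
        if key not in tried:
            return key
    return None

def answer(question, candidate_keys, max_steps=5):
    state = AgentState(question)
    for _ in range(max_steps):
        key = choose_next_key(state, candidate_keys)
        if key is None:
            break
        rows = run_query(key)
        state.history.append(("run_query", key, rows))
        if rows:  # an empty result is the feedback that triggers another attempt
            return rows, state
    return [], state

rows, state = answer("users by region", ["encrypted_user_id", "user_id"])
```

The first attempt joins on the wrong key and returns nothing; the loop records that outcome and tries again rather than reporting an empty answer.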

Agents without context can give wildly wrong answers. Take this example where somehow the agent thought that there were 5 million ChatGPT users compared to the actual 800 million answer. Just a minor rounding detail. Or in this case, when the agent thought that Sora was the Kingdom Hearts video game character. Maybe in my regular ChatGPT history, this would be more right. Yes, I really wanted to ask about Sora, the video gen product that we just released. As a start, we need table metadata context. As we saw earlier, fitting all 70k tables with their schemas and query history is just too much data to put into a model's context window. We have to do some preprocessing ahead of time. Table schema information is particularly important because that way the model knows how to actually query the table. That's where the SQL generation bits come from.

It needs to know the columns and their types. Schemas alone aren't enough to understand the semantics and the relationships between data. That's why here you see query history, lineage. Those are important things to provide this extra context. All of this basically gets fed into an embedding that we use the OpenAI API for, and then stored so that it can be retrieved live via specific table search and semantic search when the agent is actually answering questions. The other common problem with table metadata is that descriptions get easily outdated and often they become a burden to maintain. This leads to the agent getting bad results. It's incredibly tedious to manually update them, especially when you have 70k tables. We solve this by autogenerating as much as possible. We also include information beyond what is just in our data catalog.
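The embed-then-retrieve idea can be sketched with nothing but the standard library. The talk says the real system uses embeddings from the OpenAI API; the bag-of-words `embed` function below is a toy stand-in chosen only so the example runs offline, and the table names and descriptions are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical table metadata, prepared ahead of time by an offline job.
TABLE_DOCS = {
    "chatgpt_daily_active_users": "daily active chatgpt users by plan and region, first-party traffic only",
    "spark_job_observability": "spark job runs with stage usage and failure information",
    "checkout_events_raw": "raw instant checkout events, not deduplicated",
}

# Offline step: embed each table's name and description once and store the vectors.
INDEX = {name: embed(name.replace("_", " ") + " " + doc) for name, doc in TABLE_DOCS.items()}

def search_tables(question, top_k=2):
    """Runtime step: embed the question and return the closest tables."""
    query = embed(question)
    scored = sorted(((cosine(query, vec), name) for name, vec in INDEX.items()), reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]
```

For instance, `search_tables("how many daily active users by region")` returns only the daily active users table, so just that table's context needs to go into the model's context window.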

That's actually what makes the generation so good. It's not enough to look at the table by itself just as is. You need to understand how the table was created and where it came from. This is the secret to the agent really understanding the differences between tables, knowing that a table was filtered down because it came from a subset of logs. We achieve this by essentially running an offline job that generates Codex tasks for tables. These Codex tasks are launched in parallel daily. They crawl the codebase to understand things like a table's purpose, downstream usage patterns, exact grain and primary keys, the freshness, when to use other tables, so on and so forth. Instead of knowing that a table is only about ChatGPT analytics, you know that the table contains first-party ChatGPT traffic and not third-party traffic, and that it's enriched by safety signals.

Or you know that some fields might be actually null because their upstream signals are missing or outside the hourly window. Here's an example Codex generation that gives information on a Spark observability table. As you can see, Codex easily goes through the agent files to crawl the codebase. It looks at the Airflow folder, looks at the projects. That's how we generate some tables. Gets this information, maybe it's job information, table info, stage usage, so on and so forth. Again, since this is all refreshed periodically by an offline job, the context stays fresh without any manual involvement. The really lovely thing about this is that you can also get lineage information, which is super useful in knowing how the tables relate. This is an example that I just pulled from one of our sample datasets.

As you can see here, it's a little richer than maybe what you might have if a human were just manually writing this. This is all fed into the agent and also into our UI, so humans can also take advantage of it.

How do we get the company context? We actually have an internal knowledge service at OpenAI that ingests things like Slack threads, or Notion docs, or Google Drive docs. All of these go into blob storage with metadata so we can have the content and the source that it came from. Since these documents are quite large, they're entire docs sometimes, they're broken down into chunks and again embedded using the OpenAI embedding API. Then there's a retrieval service that similarly does RAG search, does permissions checking, and also caches so that we can pull these efficiently. That's why when we ask a question, we get the why and the context, not just the what.
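A minimal sketch of the chunking and permissions steps might look like the following. The chunk size, the overlap, the `allowed_groups` field, and the group names are assumptions for illustration; the talk does not describe how the internal knowledge service actually sizes chunks or models permissions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    source: str                # where the text came from, e.g. a doc or thread identifier
    allowed_groups: frozenset  # hypothetical access groups permitted to read the source
    text: str

def chunk_document(source, text, allowed_groups, max_words=50, overlap=10):
    """Split a document into overlapping word windows, keeping source metadata on each chunk."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + max_words])
        chunks.append(Chunk(source, frozenset(allowed_groups), piece))
        if start + max_words >= len(words):
            break  # the final window already reached the end of the document
    return chunks

def visible_chunks(chunks, user_groups):
    """Permissions check at retrieval time: keep only chunks the user may read."""
    return [chunk for chunk in chunks if chunk.allowed_groups & set(user_groups)]
```

Carrying the source and access groups on every chunk is what lets a retrieval step both cite where an answer came from and filter results per user.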

If you see a dip in weekly active users, you might find that Slack thread that points to an incident or an outage, and that gives you a much richer analysis and understanding of the problem.

Memory is useful for things like corrections and learnings. Here's an example. Before, we just had a SQL statement that looks at user IDs, since we're looking at daily active users. Let's say that for this particular use case, we really mean users that sent at least one message, we want to exclude external, and we care about the PST time zone. After the memory, our system will always generate this sort of query instead of this one because we have this correction saved. That way, the agent is basically able to produce the right results each time.

Memory is ingested similarly to table knowledge. We have all these corrections that we make. A user can submit a correction manually or the agent can also do so. We put things in an embedding and then retrieve them at runtime where relevant. For us, memory is really the mechanism that helps the agent continuously learn and improve. Context will get you maybe 80%, 90% of the way there. Sometimes you need those final little corrections that are just really hard to infer. Let's say you rolled out a feature and you have a particular string for your Statsig gate. That's an example of memory that you need to find the right result, but it's really tricky to know that otherwise.

Here's an example. We have three scopes right now. There are user level ones because users might want their own customizations. We also want to protect potentially private information.

There's channel level for team scoped memories, and then also global memories for just general fixes that can benefit everyone. We're currently rolling out memory suggestions, so Kepler can prompt to generate a memory and then the user can confirm. Then it can be inserted in the right scope. We're also looking at ways we can create evals that only pass in the right memories just to make sure that all this is working correctly. The plan is to also have memory be compacted in case users generate a lot of the same memory. Also, sometimes there might be memories generated that are accidental or maybe they're just not that reusable. In that case, we want to prune them in an offline job. The other thing to make the memories richer and just to have a lot better signal is having them be edited.
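The three scopes can be sketched as a small lookup. This is an illustration under stated assumptions, not Kepler's implementation: the exact-match `topic` field replaces the embedding-based retrieval described in the talk, and the ordering that puts user memories before channel and global ones is a guess at a sensible precedence.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Memory:
    scope: str            # "user", "channel", or "global"
    owner: Optional[str]  # user ID or channel ID; None for global memories
    topic: str            # stand-in for embedding-based matching
    correction: str

# Assumed precedence: the most specific scope is listed first.
SCOPE_PRIORITY = {"user": 0, "channel": 1, "global": 2}

def applicable_memories(memories, topic, user_id, channel_id):
    """Return memories visible to this user in this channel, most specific first."""
    hits = []
    for memory in memories:
        if memory.topic != topic:
            continue
        if (
            memory.scope == "global"
            or (memory.scope == "user" and memory.owner == user_id)
            or (memory.scope == "channel" and memory.owner == channel_id)
        ):
            hits.append(memory)
    return sorted(hits, key=lambda memory: SCOPE_PRIORITY[memory.scope])

memories = [
    Memory("global", None, "dau", "exclude external"),
    Memory("user", "u1", "dau", "use PST day boundaries"),
    Memory("user", "u2", "dau", "use UTC day boundaries"),
    Memory("channel", "c-growth", "dau", "count users with at least one message"),
    Memory("global", None, "revenue", "report in USD"),
]
hits = applicable_memories(memories, "dau", user_id="u1", channel_id="c-growth")
```

Here user `u1` never sees the correction saved by `u2`, which is the privacy property the user scope is meant to give.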

In the UI right now, you can actually trigger an edit and then a resync if you want to make an update. Our users are really helping contribute live here as well. If all of the context isn't enough, our agent can make live calls to our data warehouse to find out what it needs. It's just a short API call away. One of the reasons, for example, why you might need this is if a table is new, like you generated a testing table, so it doesn't really exist in any of our services yet, or it doesn't exist in the offline job yet. You can just directly query the data warehouse and Kepler will get that information. Hopefully, this image really illustrates how important context is.

How Do We Measure the Response Quality? #

The next important factor to consider is how we ensure we don't cause regressions. Let me now talk about how we measure response quality. In the words of the wise Greg Brockman, evals are surprisingly often all you need. There's a lot of truth to that in our case. Our evals consist of sets of question-answer pairs. A question is usually some important metric we want to get right. Then we have a manually curated expected SQL statement that we want the generated answer to match. We hit our agent's query generation endpoint to turn our natural language question into SQL and then run the query. We do the same with the expected SQL to get the expected results. We have the generated query, the generated query results, expected SQL, expected SQL results, and we feed that all in to the OpenAI evals grader.

This evals grader is actually doing model grading. This is particularly important because a lot of times generated SQL might differ by a little bit, but it still doesn't meaningfully change the results. All of this ends up giving us a score and a reason, so we can see how our evals did. Here are some key takeaways from our evals process. Firstly, I just want to say that exact SQL text equality isn't really a good indicator of whether a SQL eval passed. You could write a date filter in multiple different ways and the meaning is still the same. We normalize things. We convert everything into its AST representation. This helps us get around these minor SQL syntax differences. Also, when we compare result sets, we actually give a little wiggle room for things that don't meaningfully impact the answer.

In some cases, a float or an int, it doesn't really matter. In some cases it does, but in some cases it doesn't. Again, the LLM reasoning is really good because it does a much better job where there's nuance. Like in the float versus int case, sometimes it actually is important for precision purposes, but sometimes it's not. The other thing is that because the model is pretty good at reasoning, it gives us a much more informative response from the results it's seen. The third thing is that we also expose the chain of thought in our evals. This is really helpful for debugging failures. In one case, we saw that an eval was failing because it was preferring a curated table over a raw table.
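The point about text equality versus result equality can be shown with a small, deterministic sketch. Note what it is not: the talk describes a model-based grader plus AST normalization, while the code below swaps in a plain execution-and-compare check against an in-memory SQLite table. The table, the two queries, and the tolerance are invented for the example.

```python
import math
import sqlite3

def run(conn, sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())

def values_match(a, b, rel_tol=1e-6):
    """Treat 2 and 2.0 as equal: the 'wiggle room' for numeric types."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b

def results_match(got, expected):
    return len(got) == len(expected) and all(
        len(g) == len(e) and all(values_match(x, y) for x, y in zip(g, e))
        for g, e in zip(got, expected)
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", "2025-03-01"), ("b", "2025-03-01"), ("a", "2025-03-02")],
)

# Two queries with different text (date filter style, int vs float) but the same meaning.
expected_sql = (
    "SELECT day, COUNT(DISTINCT user_id) FROM events "
    "WHERE day >= '2025-03-01' AND day <= '2025-03-02' GROUP BY day"
)
generated_sql = (
    "SELECT day, COUNT(DISTINCT user_id) * 1.0 FROM events "
    "WHERE day BETWEEN '2025-03-01' AND '2025-03-02' GROUP BY day"
)
passed = results_match(run(conn, generated_sql), run(conn, expected_sql))
```

A fixed tolerance like this cannot tell when the float versus int difference does matter, which is the nuance the talk hands to the model grader.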

After digging through the chain of thought, we realized it's because it thought that the question was more on the dashboarding side, and so that's why it picked that table. That wasn't immediately obvious in the beginning.

How Are We Doing This Safely? #

With great data access comes great responsibility. Kepler does not provide any extra authorization. It actually passes through the user's authentication. This means that Kepler won't grant you extra access to tables you don't have access to. When you don't have access, Kepler will actually helpfully tell you which access group to join. Or it might use a similar table that you do have access to. Data security is also something that we take very seriously at OpenAI. Users should only be accessing the data that they have a legitimate purpose to access. When we ingest internal knowledge, we actually ingest pre-sanitized queries so that important IDs, for example, aren't accidentally leaked. Especially on Slack, for a Slack agent, when the audience is more broad, we also redact sensitive outputs. This is done by intercepting the results and then passing them to our internal anonymization service that detects PII.
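The intercept-and-redact step could look roughly like the sketch below. The internal anonymization service is not described in the talk, so the two regular expressions here are only a stand-in for it, and the `user-<hex>` ID format is entirely hypothetical.

```python
import re

# Stand-ins for a real PII detector; both patterns are illustrative assumptions.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
USER_ID = re.compile(r"\buser-[0-9a-f]{8,}\b")  # hypothetical internal ID format

def redact_cell(value):
    """Mask sensitive substrings in string cells; leave other types untouched."""
    if not isinstance(value, str):
        return value
    value = EMAIL.sub("[REDACTED_EMAIL]", value)
    return USER_ID.sub("[REDACTED_ID]", value)

def redact_rows(rows):
    """Intercept query results before they are posted to a broad audience."""
    return [tuple(redact_cell(cell) for cell in row) for row in rows]

safe = redact_rows([
    ("alice@example.com", 3),
    ("ticket from user-1a2b3c4d5e", 7),
    ("2025-03-01", 12),
])
```

Aggregate values and dates pass through unchanged, so the summary stays useful while identifiers are masked.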

Sometimes there's actually a reasonable use case for the users to see the raw results. We do actually allow this by linking to an external UI. Users who have permissions to those tables would have been able to run the queries themselves anyway, and we do a permissions check there to make sure that they should actually be seeing that data. This is true for all the tabular pieces that the agent generates. Just like a human, Kepler can also make mistakes. That's why we stream Kepler's chain of thought as Kepler is answering the question, as you saw in those screenshots before. It's been really helpful as an audit and also because in a bunch of cases it's important to understand the assumptions that went into the answer.

If Kepler ran any queries that resulted in data, Kepler will link those and provide the reference ID so you can click into the raw results.

User Feedback (Kepler) #

Now let's see what internal users are saying about Kepler. One of our users mentioned, for example, how they felt Kepler was the most useful bot that we had at the company. Another user mentioned how Kepler was really good at writing SQL queries, to the point where maybe writing them by hand is a total waste of time. At the bottom there, you can see Kepler is pretty good at sanity checking data and making sure your assumptions about the data are correct. This is my favorite quote, actually. Someone mentioned how Kepler felt to them like the closest thing to AGI that they've used. We have a lot of really nice people.

Key Learnings #

The user love is really great to see. How did we get to this point? Let me talk about some key learnings we had along the way. We owe a lot of our initial success to just having a really quick feedback loop with our users. We partnered with a key team. They would give us feedback. We'd immediately improve, get feedback, immediately improve, so on and so forth. Let me give an example of what I mean. Initially, we thought that all questions were just metrics questions, like you generate SQL, it gets a result, the Kepler agent tells you. Actually, a lot of questions can just be answered by company context or a doc, or it's just table information, like an access group.

As a result of this user feedback, we reworked our backend to also accommodate these, so the model wouldn't waste time on things like running no-op queries, SELECT 1s, or doing table search when it's not really relevant. We can answer these types of important questions that Jimmy is asking right now. The other thing that really helped us was meeting users where they are. A lot of agent interfaces are actually web, for example, ChatGPT. We actually started with the Slack interface because OpenAI is a really Slack-heavy company. People post analytics updates on Slack all the time. People ask data questions on Slack all the time. We realized the key to success was actually getting people to ping Kepler with their analytics questions instead. Now I'm going to talk about some lessons learned.

It turns out if you give the model too much information, especially when it's overlapping, it can get really confused. We realized this initially because we have a lot of tool calls that are a little similar, just because one might use service auth and one might use user auth, but the model was getting really confused because it just couldn't understand the subtle nuances. We actually ratcheted down the tool calls that the agent can use for easier tool discovery. The other thing is that we found the results were worse when there were really specific instructions. This is because there are just so many different types of questions that people ask. While there might be a similar general overall path, there are a lot of little branches in logic.

Being overly prescriptive actually hurt us because the model would try to follow an exact set of instructions that didn't really make sense for the user's question. We actually changed our prompting to be a little more general so that while Kepler would get a rough starting point, we leave it to the reasoning of GPT-5 to understand the exact path that it should take. Because at the end of the day, it does have all that context.

What's Next? #

What's next up for Kepler? One thing we'd like to do is to fine-tune a model dedicated specifically for Kepler. There's a lot of little areas where Kepler doesn't get things right. For example, some SQL quirks. We have a lot of data on the questions that users ask and the right SQL that can be generated. The plan is to use that data and train a model on it so that Kepler gets even better for internal use cases. We care a lot about our users' trust in Kepler. We want the responses to be validated and correct. That's super important to our success, because if people don't trust Kepler why would they use it? That's why we're planning to build in extra validation steps so Kepler can check itself like a human would. One example process is, you might look at a number at a dashboard somewhere after you run the query. Kepler can take this result and can do a comparison just to make sure that the number matches, and so that gives higher confidence that actually that is the right answer.
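The dashboard cross-check described here is a planned step, so the following is only a guess at its simplest form: compare the query result with the dashboard number and accept it when the relative difference is small. The function name and the 1% tolerance are assumptions.

```python
def agrees_with_dashboard(query_value, dashboard_value, rel_tol=0.01):
    """Return True when the query result is within rel_tol of the dashboard number."""
    if dashboard_value == 0:
        return query_value == 0
    return abs(query_value - dashboard_value) <= rel_tol * abs(dashboard_value)
```

For example, a query result of 1004 against a dashboard value of 1000 would pass at the 1% tolerance, while 1200 would be flagged for another look.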

Key Takeaways #

If there's anything you take away from this talk, I hope these three things stick with you. Firstly, it's really important to have context beyond just table metadata. The code and rich context that I talked about and the company context go a long way. Also, incorporating memory is really valuable so that your agent can continuously improve. Evals are particularly important to make sure that your model remains consistently good.

Questions and Answers #

Participant 1: I'm curious about the user personas of the people that are using Kepler. Is it the data analysts that were doing analytics before? Is it the end users who now don't need them and they just ask Kepler, or somewhere in between?

Bonnie Xu: We actually started out with the data scientists primarily because they have the most context on what a right answer might be. Especially because initially the responses just weren't as good. Now that Kepler has gotten a lot better, we've branched out. We now have users across GTM. We have users in finance and econ, API as well. There are folks working on Sora and on ChatGPT that also ask Kepler questions. We've branched out to the whole company at this point.

Participant 2: What strategies do you have to manage the context window for your agent?

Bonnie Xu: I think putting everything into the embedding and then returning that at runtime is really helpful. The RAG search is actually pretty good. Then the other nice thing is that it can just pick up little pieces. That's how we do it efficiently. I think the other thing, we also have some tagging actually when looking things up so that it can find the right information, and we do limit the search to an extent. We're also looking into reducing the number of tool calls or just getting to the right answer faster. That's also one way we reduce the context window size as well, just to make sure that the responses still work at the end of the day.

Participant 3: Do you use the questions asked to the bot to then drive the data modeling? Is that agentic as well? If you've got lots of users consistently asking a question about some tables that haven't been joined, do you then bring that background? For the fine-tuning, what signal are you using? Is there a user action that happens in Slack after they've got their analysis that gives an insight into whether you've got good feedback or not?

Bonnie Xu: Your first question is about whether we use the top questions for evals? We actually do. Our evals, they're manually generated now just so we can get the right set, but they're pretty much based on the top common questions that people might ask and also the important ones to get right. Like ChatGPT weekly active users, or what's our revenue for Sora, or something else. Yes, we do actually use that input to inform them. That's a lot of the basis of our eval question pairs.

Then for your second question on fine-tuning and how we get the right inputs for that, we do actually have feedback. There is a manual feedback mechanism that people can use to upvote or downvote. Maybe you've seen this in ChatGPT where you can thumbs up, thumbs down. That's a really important source for us. We also do get feedback from Slack threads, for example, so actually in the thread. Those are also inputs, and even just the conversations. We'd like to extend this, though. We could actually look at all the conversations and do a semantic analysis, for example. That's one other way. We aren't doing that right now, but we'd like to move in that direction.

Participant 4: I'm assuming that for your evals you do them before, like assuming that there isn't any memory. How do you make sure that when there is memory included in the context that the memory didn't actually regress the performance of the tool?

Bonnie Xu: The memory check is actually pretty lightweight since it just does RAG search. Yes, that's true. Sometimes a bad memory does meaningfully cause a failure that shouldn't happen. In those cases, though, that's again why the chain of thought is useful: we can inspect what's going on. Then there's the memory pruning to make sure that these memories make sense. Pretty much it's just a little bit of auditing on our side to make sure that that happens. There are bad memories too, absolutely.

Participant 5: A question around memory. You talked about having different levels of memory: user, team, and global scope. I'm assuming that you generate these memories in some kind of offline process. You also allow editing of these memories so that users can make small changes. If that's the case, when you rerun this memory pipeline, how do you ensure that the user-edited memories are not overwritten? My assumption is that the memories are idempotent in nature, meaning that if I have a conversation and generate a memory from it, then even if I rerun, I should get the same memory UUID or something of that sort. If that is not the case, then how are you dealing with the conflicts? If you rerun the same pipeline, you might generate a different memory, and then you would end up in a conflict. First, how do you avoid overwriting? If you don't, how are you avoiding the conflicts?

Bonnie Xu: Your question is basically about conflicting memories. I think there are a couple of things that we do in this case. Firstly, we are actually planning to introduce a way for users to see their memories. The whole reason for scopes is that you don't pollute global memory if you shouldn't. Say I'm a user and I have this very specific analytics case, and another user says, no, actually do something very different. Those are things that are more meant for personal memories. Even within scopes, memories can conflict. Let me take a step back here. For memories, we actually do insert them at runtime, and then we prune them offline, since there could be a lot of them. It's that offline pruning process that helps take care of a bit of that.

Obviously, there is still a gap between when that offline job runs and when the agent runs live. The other nice thing is that we have the model to fall back on when it retrieves a memory that is incorrect, or that just doesn't make sense in the context of the question, because that will sometimes happen even in its own knowledge search. It is able to treat that as just one signal and either disregard it or weight it less heavily. We've actually seen that in the chain of thought. In most cases, again, it works out. In these edge cases, like you mentioned, it's just another signal. That's why we provide so many signals, because if at least 80% of the signals are good, that's usually enough to push in the right direction.
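
The pattern described in this answer (scoped memories, inserted at runtime and pruned offline) could be sketched roughly as follows. This is an illustration under stated assumptions, not how Kepler works: the `MemoryStore` class, its method names, the topic key, and especially the pruning rule (keep only the newest memory per scope, owner, and topic) are invented here, since the talk does not say how pruning decides what to keep.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Most specific scope first; scope names follow the question above.
SCOPES = ("user", "team", "global")


@dataclass
class Memory:
    scope: str
    owner: Optional[str]  # user id or team id; None for global memories
    topic: str
    text: str
    created_at: int       # logical clock, so the example is deterministic


class MemoryStore:
    """Hypothetical store: cheap inserts at runtime, cleanup in an offline job."""

    def __init__(self) -> None:
        self._items: List[Memory] = []
        self._clock = 0

    def insert(self, scope: str, owner: Optional[str], topic: str, text: str) -> Memory:
        """Runtime path: append without resolving conflicts."""
        if scope not in SCOPES:
            raise ValueError(f"unknown scope: {scope}")
        self._clock += 1
        memory = Memory(scope, owner, topic, text, self._clock)
        self._items.append(memory)
        return memory

    def retrieve(self, user: str, team: str, topic: str) -> List[Memory]:
        """Return memories visible to this user, most specific scope and newest first."""
        visible = [
            m for m in self._items
            if m.topic == topic and (
                (m.scope == "user" and m.owner == user)
                or (m.scope == "team" and m.owner == team)
                or m.scope == "global"
            )
        ]
        return sorted(visible, key=lambda m: (SCOPES.index(m.scope), -m.created_at))

    def prune(self) -> int:
        """Offline path. ASSUMED rule: keep the newest memory per (scope, owner, topic)."""
        newest: Dict[Tuple[str, Optional[str], str], Memory] = {}
        for m in self._items:
            key = (m.scope, m.owner, m.topic)
            if key not in newest or m.created_at > newest[key].created_at:
                newest[key] = m
        removed = len(self._items) - len(newest)
        self._items = sorted(newest.values(), key=lambda m: m.created_at)
        return removed


if __name__ == "__main__":
    store = MemoryStore()
    store.insert("user", "alice", "active_users", "exclude test accounts")
    store.insert("user", "bob", "active_users", "include test accounts")
    store.insert("global", None, "active_users", "use the deduplicated table")
    store.insert("user", "alice", "active_users", "exclude test and internal accounts")
    print([m.text for m in store.retrieve("alice", "data", "active_users")])
    print("pruned:", store.prune())
```

Because user-scoped memories are only visible to their owner, the conflicting preferences of the two hypothetical users never reach each other or the global scope. Between pruning runs, the retrieval order leaves it to the model to decide which memory to trust.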

Participant 6: Is there any thinking about either open sourcing it as a framework, or maybe as a service for enterprise? Are you finding that users are using Kepler instead of the source systems themselves, just because it's easier?

Bonnie Xu: Yes, of course. The first question was open source. I do love open source. I don't know if I'm the right person with the right authority, unfortunately, to make that decision.

Then on your second question of using Kepler versus just directly querying the source. I think, at least from what we've heard from our users, directly using Kepler is a lot faster. It's more productive, because when you're working with different sources, you have to go across all of them, and you might be doing some curation yourself. Let's say, you're doing Kepler DAC. You're looking at Databricks for your data catalog, and then you're looking at Codex for some code files or whatever. Then maybe you're looking at Airflow for your Airflow job. That's a lot of stuff you have to do, and then you have to connect the dots. Kepler, by contrast, is really that layer on top, that abstraction that does it for you. That's why it's just a lot faster. We've also seen a lot of folks who, because Kepler operates independently of them, just launch a couple of Kepler questions and come back to them later. You just become so much more productive that way.

source & further reading

infoq.com — original article