Improving a data pipeline with DSPy

A developer built a data pipeline using DSPy and Claude Code to analyze whether AI discourse changed in academic research and policy discussions during the ChatGPT era. The pipeline processes NBER journal abstracts and BIS central banker speeches from 2022 to early 2026, comparing DSPy's programmatic prompt optimization against hand-crafted prompts for classification. The project aims to measure the impact of ChatGPT's release on economics research and policymaking.

As a developer, I have plenty of experience building full-stack apps, backend services, cloud infrastructure, and, increasingly over the past three years, AI engineering (https://www.latent.space/p/ai-engineer). One area that I wanted to get more hands-on experience with is the sub-discipline of data engineering. But in a world of coding agents and increasingly capable models, where code rarely gets written by hand anymore, what does it mean to get “hands on” with something? How do I go about learning something new if I don’t have to set up the environment myself, or go through the minutiae of the docs, or bang my head against the wall trying to resolve an obscure bug?

I don’t have a great answer for that yet, but I thought I’d give it a try by having a coding agent help me build a data pipeline and maybe run a little research experiment. And while I’m at it, why not throw in another new thing I want to get familiar with as well? I’ve been reading a fair amount about DSPy but had yet to use it in practice.

After a brainstorming session with Claude Code, we came up with an approach. I wanted something that was personally interesting to me and that I already had some context on, so that I could better judge the results. I was an economics major in college and I still have an interest in the field, so we decided to build a data pipeline that could be suitable for basic economic research. The data needed to be publicly and freely available. Claude gave me some options, and we settled on a couple of data sources from two different domains: NBER journal article abstracts for academic research, and BIS central banker speeches for policy. One nice thing about the NBER data is that it includes what are called JEL codes, which are topics or subtopics assigned by the author of the paper. This could be useful for doing some type of classification in the data pipeline. The BIS data was more like free-form text.

I decided I wanted to tackle two research questions in one.
First, what I call the outer question: could DSPy’s programmatic prompt optimization improve performance over hand-crafted prompts? And second, what I call the inner question: has “AI discourse” changed measurably during the ChatGPT era (2022 to now) in academic research and in policy discussions?

We were able to pull both sources from the start of 2022 to early 2026. ChatGPT was released on November 30, 2022, so that gave us an eleven-month pre-ChatGPT baseline, and then a few years of post-release data to look at.

To be clear, this project is vibe-coding-meets-vibe-research and is not to be taken as serious data engineering or serious economics research. But the further I got into the brainstorming and planning, the more interesting the research questions became to me. I genuinely wanted to know if DSPy would outperform hand-tuned prompts on a basic classification task, and I genuinely wanted to know if the release of ChatGPT and the rise of LLMs had had a measurable effect on academic economics research and on policymaking.

As I mentioned at the start, I’m not deeply familiar with the finer points of data engineering as a discipline. But I was pretty sure many of the software design principles we use to build high-quality software would carry over. In my guidance to Claude, I stressed the importance of a modular pipeline design with clean interfaces between the components. This is the architecture we came up with:

    Source Adapters → Preprocessor → Classification Engine → Output Writer
                                     ↑ pluggable: baseline OR dspy
                                     ↑ Doc-level cache

We were drawing from multiple sources, so we had a shared SourceAdapter interface that normalized the source documents into a RawDocument class. We had a preprocessor component that allowed for text cleaning and configurable filtering. Most importantly, we had a ClassificationEngine interface that could be swapped between the hand-crafted-prompts implementation and the DSPy implementation.
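To make the adapter idea concrete, here is a minimal sketch of how two sources might be normalized into one record type before preprocessing. It is not the project's actual code: the field names on `RawDocument`, the dict layouts the adapters read, and the `preprocess` and `load_all` helpers are all assumptions made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Protocol


@dataclass(frozen=True)
class RawDocument:
    """Source-agnostic record. Field names are illustrative assumptions."""
    doc_id: str
    source: str                      # "nber" or "bis"
    date: str                        # ISO date, e.g. "2023-05-01"
    title: str
    text: str
    jel_codes: tuple[str, ...] = ()  # only NBER carries author-assigned JEL codes


class SourceAdapter(Protocol):
    """Anything that can yield normalized documents."""
    def fetch(self) -> Iterator[RawDocument]: ...


class NberAdapter:
    """Normalizes NBER abstract records (hypothetical dict layout)."""
    def __init__(self, records: Iterable[dict]) -> None:
        self.records = list(records)

    def fetch(self) -> Iterator[RawDocument]:
        for r in self.records:
            yield RawDocument(
                doc_id=f"nber:{r['id']}", source="nber", date=r["date"],
                title=r["title"], text=r["abstract"],
                jel_codes=tuple(r.get("jel", ())),
            )


class BisAdapter:
    """Normalizes BIS speech records (hypothetical dict layout)."""
    def __init__(self, records: Iterable[dict]) -> None:
        self.records = list(records)

    def fetch(self) -> Iterator[RawDocument]:
        for r in self.records:
            yield RawDocument(
                doc_id=f"bis:{r['id']}", source="bis", date=r["date"],
                title=r["title"], text=r["speech"],
            )


def preprocess(docs: Iterable[RawDocument], min_chars: int = 40) -> Iterator[RawDocument]:
    """Collapse whitespace and drop documents shorter than min_chars."""
    for doc in docs:
        cleaned = " ".join(doc.text.split())
        if len(cleaned) >= min_chars:
            yield replace(doc, text=cleaned)


def load_all(adapters: Iterable[SourceAdapter]) -> list[RawDocument]:
    """Run every adapter through the same preprocessing step."""
    return [doc for adapter in adapters for doc in preprocess(adapter.fetch())]
```

The point of the shape is that everything downstream of `load_all` only ever sees `RawDocument`, so adding a third source would mean writing one more adapter and nothing else.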
Each engine controls its own prompt construction, API calls, and response parsing, and a document-level cache ensures retries and reruns don’t lead to duplicate spending. We then had an OutputWriter that serialized the classified documents to Parquet for analysis, and an evaluation stage that analyzed the Parquet files and generated reports and figures. Lastly, the actual pipeline consisted of code to orchestrate the entire flow end-to-end.

After reviewing the pipeline design, my takeaway was that while data engineering might be its own discipline with its own specific knowledge, many of the software design principles I use in my normal work did indeed carry over.

We have two different data sources coming from two different domains: NBER journal article abstracts for academic research, and BIS central banker speeches for policy. We defined a fixed taxonomy of ten topics that could be used to classify documents from both sources: labor markets, growth productivity, monetary policy, fiscal policy, climate economics, trade, financial stability, inequality, digital economy, and ai automation. We also defined a fixed output schema (topic + AI-relevance fields) to be used across both sources, and fixed the model to claude-sonnet-4-6 with temperature 0.

Normally, if I wanted to test a hypothesis rigorously, I would try to design an experiment where one variable changes and everything else is held constant, leading to a clean setup that is straightforward to interpret. For example, I might have limited the experiment to one dataset, maybe the NBER data since it has a rough ground-truth measure in the form of the JEL codes. But this is a vibe-coding-meets-vibe-research project meant primarily as a learning exercise, and we’re looking at two research questions for the price of one, so the setup is not so clean and we’re layering a few things on top of each other. Such is the price of vibing.

So, the setup that we went with is to test two “engines,” where both engines satisfy the same basic interface of reading a document and outputting the output schema.
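As a rough illustration of what “same interface, same output schema” plus a document-level cache can look like, here is a minimal sketch. The topic strings come from the taxonomy above; everything else (the `Classification` field names, the `classify` signature, the on-disk JSON cache, and the toy `KeywordEngine` that stands in for an LLM-backed engine) is an assumption for illustration, not the project's real implementation.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Protocol

TOPICS = (
    "labor markets", "growth productivity", "monetary policy", "fiscal policy",
    "climate economics", "trade", "financial stability", "inequality",
    "digital economy", "ai automation",
)


@dataclass(frozen=True)
class Classification:
    """Fixed output schema (field names are hypothetical)."""
    topic: str
    ai_relevant: bool

    def __post_init__(self) -> None:
        # Reject anything outside the fixed taxonomy.
        if self.topic not in TOPICS:
            raise ValueError(f"unknown topic: {self.topic!r}")


class ClassificationEngine(Protocol):
    name: str
    def classify(self, doc_id: str, text: str) -> Classification: ...


class CachedEngine:
    """Wraps any engine with an on-disk, per-document cache."""
    def __init__(self, engine: ClassificationEngine, cache_dir: str) -> None:
        self.engine = engine
        self.name = engine.name
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, doc_id: str, text: str) -> Path:
        # Key on engine name, document id, and text so engines never share entries.
        raw = f"{self.name}\x00{doc_id}\x00{text}".encode("utf-8")
        return self.cache_dir / f"{hashlib.sha256(raw).hexdigest()}.json"

    def classify(self, doc_id: str, text: str) -> Classification:
        path = self._path(doc_id, text)
        if path.exists():  # cache hit: no call to the wrapped engine
            return Classification(**json.loads(path.read_text(encoding="utf-8")))
        result = self.engine.classify(doc_id, text)
        path.write_text(json.dumps(asdict(result)), encoding="utf-8")
        return result


class KeywordEngine:
    """Toy stand-in for an LLM-backed engine; makes no API calls."""
    name = "toy-keyword"

    def __init__(self) -> None:
        self.calls = 0

    def classify(self, doc_id: str, text: str) -> Classification:
        self.calls += 1
        lowered = text.lower()
        topic = "monetary policy" if "inflation" in lowered else "trade"
        return Classification(topic=topic, ai_relevant="artificial intelligence" in lowered)
```

With this shape, a rerun after a crash pays only for documents that were not already classified, and swapping the baseline for the DSPy engine means passing a different object to `CachedEngine`.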
The first engine, the baseline, has fixed prompts written by Claude, i.e. the “hand-crafted prompt” approach. The second engine uses a prompt optimized by DSPy. This is the setup for the “outer” research question.

For the hand-crafted approach, Claude came up with a ~2,200-token system prompt that included, in Claude’s words, “explicit disambiguation guidance” around topic intersection, “source-aware notes” tailored to the two specific data sources, and “pre-emit quality checks,” described as “a short checklist the model can run mentally before producing JSON” that “cuts schema validation failures to ~0.”

For the DSPy approach, Claude created a dspy.Signature that takes in a document from either source (since they have been normalized) and outputs the same JSON schema. The DSPy setup also used dspy.ChainOfThought for reasoning and MIPROv2 for prompt optimization. The DSPy training process only used the NBER dataset because that is the only source that includes an external measure (author-assigned JEL codes) that we could optimize against. The NBER data was split into three sets, train/dev/test (60/20/20), with train and dev used during optimization and test held out for later analysis.

Next, we ran both engines over both datasets and evaluated the results. One key distinction is that NBER has an “answer key” in the form of the JEL codes, so we could measure accuracy using the held-out test set. The BIS dataset has no such “answer key,” so the analysis is limited to label-free questions such as whether the two engines agree on topic classification and AI relevance.

After reviewing and approving both the technical design and the experiment design, in true vibe-coding-meets-vibe-research fashion, I asked Claude to proceed with building the pipeline, fetching the data, running the data through the pipeline, and analyzing the results.
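For readers who want to see the evaluation mechanics, here is a minimal sketch of a reproducible 60/20/20 split and a top-k accuracy measure. Both functions are hypothetical helpers, not the project's code. The sketch assumes each engine returns a ranked list of topics per document and that the JEL codes have already been mapped to a set of acceptable taxonomy topics; how the project actually did that mapping is not described here.

```python
import random
from typing import Iterable, Mapping, Sequence


def split_60_20_20(doc_ids: Iterable[str], seed: int = 0):
    """Shuffle deterministically, then cut into train/dev/test (60/20/20)."""
    ids = sorted(doc_ids)               # sort first so the seed fully fixes the order
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 6 // 10        # integer math avoids float rounding surprises
    n_dev = len(ids) * 2 // 10
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]        # remainder goes to the held-out test set
    return train, dev, test


def accuracy_at_k(
    ranked_predictions: Mapping[str, Sequence[str]],
    gold_topics: Mapping[str, Iterable[str]],
    k: int,
) -> float:
    """Share of documents whose top-k predicted topics include a gold topic.

    ranked_predictions: doc_id -> topics ordered best first (assumed engine output).
    gold_topics: doc_id -> acceptable topics derived from JEL codes (assumed mapping).
    """
    if not gold_topics:
        raise ValueError("gold_topics must not be empty")
    hits = 0
    for doc_id, gold in gold_topics.items():
        top_k = set(ranked_predictions.get(doc_id, ())[:k])
        if top_k & set(gold):
            hits += 1
    return hits / len(gold_topics)
```

Only the test portion of the split should feed `accuracy_at_k`; train and dev are what the optimizer gets to see.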
One interesting thing that happened during the actual run was that at one point, Claude stopped because it believed it had just spent over $400 in API calls in the early stages, way above its own estimates and above the $300 hard cap I had put on this project. To its credit, Claude recognized that this didn’t really make sense, and I agreed (I sincerely doubt Anthropic is in the business of extending credit to individual API users). Sure enough, I switched over to the Anthropic API console and saw we had only spent about $115 so far. I told Claude something must be off in its cost accounting or it was using incorrect API pricing data. After a few seconds, it found the issue: something to do with a race condition in how it was fetching cost metrics from DSPy internals. Crisis averted, and I asked Claude to proceed with the rest of the research.

For the “outer” research question of whether a DSPy-optimized prompt could outperform a hand-crafted prompt, we found that the DSPy engine did indeed outperform the baseline engine on the held-out NBER test set, by 5 percentage points on accuracy@1 (62.7% vs 57.7%) and 6.7 percentage points on accuracy@3 (88.3% vs 81.5%).

Claude included some other aspects in the experiment design that looked at both NBER and BIS, such as inter-engine agreement, cross-run consistency, and efficiency, but I was primarily interested in the DSPy vs. baseline comparison and less interested in these other aspects, so I won’t go over them in depth here. You can read more about these other aspects in Claude’s blog post draft (https://github.com/perceptiontheory/econ-discourse-pipeline/blob/main/blog-post.md). To be clear, the words you are reading here are my own; I had Claude draft its own blog post as part of the exercise.

One thing that’s maybe worth bringing up on the efficiency front is that, at least in the setup we used here, the DSPy approach cost up to 1.7x more than the baseline.
This is because the baseline approach was able to leverage API caching on input tokens, whereas DSPy does not seem to support the cache control feature out of the box. The DSPy-optimized prompt also included a handful of few-shot demos, so its system prompt was larger than the baseline’s. This cost difference could be addressed with more up-front engineering effort, but I thought it was worth calling out that in the basic setup we had here, the performance improvement of DSPy did come with additional operational cost.

Onto the “inner” research question: how has “AI discourse” changed during the ChatGPT era, across both academic economic research and policymaking? Claude focused heavily on the “outer” research question in its analysis of the results, but I actually found the “inner” research results to be equally, if not more, interesting.

In terms of “AI relevance” of the documents, there is a clear upward trend during this timeframe, modest at first but then accelerating, especially from mid-2025 onward. Claude’s analysis summarized the findings well:

“Both venues talk about AI more after ChatGPT, but the relative shift is bigger in central-bank speeches than in academic research. That's intuitive: NBER had been building an AI-and-economics literature for years before ChatGPT; central bankers had not. Once the technology became a public-policy issue, it entered the speech corpus on a sharply upward path.”

The other interesting aspect was how AI was “framed” in the documents, which was another dimension in the classification output. The options were: mixed, neutral, opportunity, and risk. Risk-only framings were relatively rare, suggesting that both researchers and central bankers are taking a measured approach that weighs both the potential upsides and the potential downsides of AI.

It’s pretty clear to me that how we learn new things in the AI era is going to evolve significantly compared to earlier eras.
My experience with building data pipelines was limited, but I was able to design and build a demonstration pipeline in one brainstorming chat session and one Claude Code session. I had never worked with DSPy before, but with Claude’s help I was able to integrate it into the data pipeline to perform meaningful analysis. Does this mean that I’m now an expert in data pipelines and DSPy optimization? Of course not. But I did learn a fair amount about both during this exercise. I now have a pretty good mental model of how to build a basic data pipeline that is modular and ready to be extended or modified for other analyses. And I now have a decent sense of how to use DSPy as intended, at least at a basic level. I hope to continue my exploration of DSPy in the future, particularly for more advanced optimization using GEPA.

So overall, I think my vibe-coding-meets-vibe-research project was a success. We certainly didn’t see “earth-shattering” results from either research question, but that wasn’t the goal either. As a learning exercise, I think there’s some potential in this vibe-coding-meets-vibe-research idea, and I hope to do more of it in the future.

If you want to go deeper, I am publishing the code and the markdown artifacts in a public GitHub repo:

- Repo: https://github.com/perceptiontheory/econ-discourse-pipeline
- Claude’s blog post draft: https://github.com/perceptiontheory/econ-discourse-pipeline/blob/main/blog-post.md
- Project brief: https://github.com/perceptiontheory/econ-discourse-pipeline/blob/main/project-docs/project-brief-v2.md
- Research log: https://github.com/perceptiontheory/econ-discourse-pipeline/blob/main/research-log.md