{"slug": "learning-to-replicate-expert-judgment-in-financial-tasks", "title": "Learning to Replicate Expert Judgment in Financial Tasks", "summary": "A proprietary model trained on high-quality human annotations outperforms all frontier models on financial information filtering tasks, achieving over 80% accuracy at a fraction of the cost, while Gemini, Claude, and GPT variants averaged only ~50% accuracy on the same tasks.", "body_md": "## Judging information\n\nOutperforming the market is hard. When every investor has access to the same sources of public information, alpha must come from unique insight built on taste and judgment. A strong investor’s judgment is difficult to articulate and teach directly to others, whether human or AI. It comes from experience.\n\nEven when we decompose an investor’s job into its simplest constituent tasks, those tasks turn out to be surprisingly difficult for LLMs. In this post, we consider a simple special case: filtering and processing financial documents to surface information relevant to investment decisions.\n\nInvestors are bombarded with information every day: news articles, research reports, company documents, emails, internal write-ups, and more. Reading is the easy part. The real work is the small, repeated judgments made along the way: filtering, interpreting, segmenting, and identifying where the useful signal lies. These judgments are embedded throughout an investor’s daily workflow and consume substantial time.\n\nWe wanted to see if we could automate the information triage task: identifying what is relevant and interesting to read. This alone could greatly augment investors’ productivity, letting them spend their freed-up attention on higher-level synthesis and decision making.\n\nGiven that LLMs perform poorly on simple financial tasks, we asked: is it possible to teach LLMs financial judgment? 
We find that with **high-quality human annotations**, we can teach LLMs to interpret text with expert-level taste and judgment. **Our proprietary model outperforms all frontier models we tested on information accuracy and recall, at a fraction of their cost.**\n\nWe describe our training process and results on a subset of data cleared for public release. Based on our results, we further describe the seeds of a vision of *differentiated intelligence*, with models tuned for specific organizational needs.\n\n## Frontier model performance\n\nWe evaluated models on six information filtering tasks drawn from investors’ daily workflows. We have many other tasks internally that show the same pattern: the frontier models we tested underperform our internally trained models.\n\nWe measured accuracy: the percentage of documents that were correctly labeled according to our investors. For classification tasks, we also calculated the F1 score ([F-score](https://en.wikipedia.org/wiki/F-score), Wikipedia).\n\nThese tasks are trivial for investors, but investors get stuck when articulating their decision process. Consider the following example of classifying a news article as relevant to an investment professional:\n\nThe Greenland example is unlikely to be taken seriously given the context of the article, while the China tariffs are highly relevant. Yet both examples touch on geopolitics and finance.\n\nIn contrast to our investors, the frontier models we tested perform surprisingly poorly. Variants of Gemini, Claude, and GPT averaged a mere ~50% accuracy when given a prompt that simply states each of the six tasks to perform.\n\nWe first tried to improve LLM performance with stronger prompting. Our experts wrote instructions based on real task descriptions, and also suggested reframing certain tasks. 
For example, while an article about a small IPO is clearly financially relevant, it lacks the broad significance that would make it interesting to a Bridgewater macro investor. LLM performance on the article classification task improved when the models were asked to sort news stories into three labels: relevant and interesting, relevant but uninteresting, and irrelevant.\n\nThese changes boosted accuracy from a coin flip to the mid-70s. We saw no further gains in accuracy from automatic prompt-optimization methods. With our best prompts, the frontier models we tested still fell short of 80% accuracy, the threshold investors expect from a system they could trust in their daily workflow.\n\nOur results also suggest that newer models aren’t improving rapidly at this task, especially per dollar spent. GPT 5.4 costs 43% more than 5.2 but is only marginally more accurate.\n\nAn explicit prompt can only convey the intuition an expert is able to put into words, while the judgments that matter most are often the hardest to articulate. Fine-tuning sidesteps this: rather than contorting the expert’s intuition into a static prompt, the training process lets the model develop its own judgment. Could we train open-weight models to outperform the frontier models we tested on these tasks?\n\n## Training dataset construction\n\nThe first challenge of training a custom model was acquiring a dataset that reflects **high-quality investor taste**. In particular, much of the information is only useful through an investment professional’s judgment. Take the example of a small-cap IPO: this would be interesting for small-cap investors but less so to a macro fund like Bridgewater. Assessing these kinds of cases requires judgment and Bridgewater context.\n\nWe initially sourced a dataset from vendors providing non-expert labeling. Models trained on this dataset still performed poorly. 
After examining the model’s reasoning traces, we realized that the labels in the dataset were often wrong. Since expert labelers are costly, we devised a verification scheme that routes only the contested examples to experts.\n\nThe scheme worked as follows: we trained a model on the dataset from non-expert labelers, then evaluated it on the same data. Examples where the model’s answer differed from the labelers’ were sent to our experts for reevaluation: if a model can’t match an example from its own training set, then either the example is genuinely difficult or the original label was wrong. This procedure was used to clean the training set; the final evaluation was done on a held-out test set.\n\n## Training recipe\n\nWe trained our models on [Tinker](https://thinkingmachines.ai/tinker/) from Thinking Machines Lab. Tinker allowed us to iterate quickly without worrying about GPU infrastructure.\n\nWe chose Qwen3-235B as the base model because its fine-tuning performance is widely studied in the academic literature.\n\nWe began with standard GRPO and importance-sampling loss as a simple, critic-free starting point. This baseline produced a massive jump in model performance, but it still fell short of our desired 80% threshold.\n\n| Model / Training | Average Accuracy | Average Pos F1 |\n|---|---|---|\n| Qwen Base | 44.8% | 55.24% |\n| Qwen + GRPO | 73.48% | 88.95% |\n\nWe made the following modifications to our training recipe to push performance further:\n\n### 1. Interleaved batching\n\nFor our multi-task training recipe, we compared three batching strategies: training each task sequentially, fully mixing tasks within a batch, and interleaving one batch per task in round-robin order. We found interleaving worked best, improving accuracy by 12.1% over fully mixed batches.\n\n### 2. 
CISPO loss with asymmetric clipping\n\nWe replaced the standard importance-sampling loss with [CISPO loss with asymmetric clipping](https://arxiv.org/abs/2510.13786) (arXiv). Across the loss functions and clipping schemes we tried, this performed best, improving accuracy by 10.1% over the importance-sampling baseline.\n\n### 3. On-policy distillation with strong teachers\n\nWe train with on-policy distillation (OPD; see [On-Policy Distillation](https://thinkingmachines.ai/blog/on-policy-distillation/), Kevin Lu in collaboration with others, Thinking Machines), constructing the advantage as follows:\n\nThe reward is penalized when the student drifts from the teacher’s distribution, regularizing the policy while it learns the task.\n\nEvery 20 steps, we promote the current checkpoint to the teacher, but only if validation accuracy has reached a new high, so we never distill toward a weaker model. This gave a further 3.1% gain over a frozen base-model teacher.\n\n## Results\n\nFinding the optimal training recipe required several iterations of different approaches. Tinker’s accessibility allowed us to run fast experiments and refine our approach.\n\nRelative to the best frontier baseline, our trained model improves average accuracy from 78.2% to 84.7%, a gain of 6.5 percentage points. Error rate falls from 21.8% to 15.3%, meaning the trained model makes 29.8% fewer mistakes than the best frontier baseline. We find this level of accuracy is sufficient for our daily work.\n\nOur trained model is also vastly cheaper due to its smaller size: a 13.8x reduction in inference costs per task. 
As we plan to rely on more models trained to help with specific tasks and to scale AI across the organization, cost is an important consideration.\n\nWe ablated each part of our training recipe to show how each portion contributes to performance.\n\n| Training Method Ablations | Average Accuracy | Average Pos F1 |\n|---|---|---|\n| Qwen + Final Recipe | 84.66% | 92.99% |\n| Interleaved Batching | 72.18% | 89.01% |\n| CISPO + Asymmetric Clips | 74.56% | 90.64% |\n| OPD | 72.39% | 87.93% |\n| OPD w/ Best Val Accuracy Teacher | 81.55% | 89.41% |\n\n## Conclusion\n\nThe frontier models we tested struggle with relatively simple financial tasks, and model advances don’t improve performance much. In contrast, we’ve shown that **high-quality proprietary datasets** labeled by expert investors and used for fine-tuning create custom models that understand our context and perform well on our tasks. We have found that this outcome holds true well beyond the six tasks we’ve discussed in this post.\n\nBesides delivering higher accuracy, custom models are also substantially cheaper. 
We expect to see more productivity gains from custom model training in the future, especially with the availability of training infrastructure like Tinker that enables rapid experimentation.\n\nOur results show the possibility of a future of differentiated intelligence, where custom models tuned to specific organizational needs outperform frontier models.\n\n## Citation\n\nPlease cite this work as:\n\n```\nSu, Sarah; Zhu, Kevin; Xiao, Emily; Alur, Rohan; Kang, Daniel (Bridgewater AIA Labs), \"Learning to replicate expert judgment in financial tasks\",\nThinking Machines Lab: News, June 2026.\n```\n\nOr use the BibTeX citation:\n\n```\n@article{su2026expertjudgment,\n  author = {Sarah Su, Kevin Zhu, Emily Xiao, Rohan Alur, Daniel Kang (Bridgewater AIA Labs)},\n  title = {Learning to replicate expert judgment in financial tasks},\n  journal = {Thinking Machines Lab: News},\n  year = {2026},\n  note = {https://thinkingmachines.ai/news/learning-to-replicate-expert-judgment-in-financial-tasks/}\n}\n```\n\n", "url": "https://wpnews.pro/news/learning-to-replicate-expert-judgment-in-financial-tasks", "canonical_source": "https://thinkingmachines.ai/news/learning-to-replicate-expert-judgment-in-financial-tasks/", "published_at": "2026-06-30 19:46:09+00:00", "updated_at": "2026-06-30 19:49:50.258688+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "natural-language-processing", "ai-research"], "entities": ["Bridgewater", "Gemini", "Claude", "GPT"], "alternates": {"html": "https://wpnews.pro/news/learning-to-replicate-expert-judgment-in-financial-tasks", "markdown": "https://wpnews.pro/news/learning-to-replicate-expert-judgment-in-financial-tasks.md", "text": "https://wpnews.pro/news/learning-to-replicate-expert-judgment-in-financial-tasks.txt", "jsonld": "https://wpnews.pro/news/learning-to-replicate-expert-judgment-in-financial-tasks.jsonld"}}