{"slug": "one-of-legal-s-hottest-startups-is-helping-lawyers-finally-answer-is-the-ai-s", "title": "One of legal's hottest startups is helping lawyers finally answer: Is the AI's work any good?", "summary": "Crosby, a legal tech startup, released the Redline Bench on Wednesday, a benchmark to measure how well AI models perform contract review tasks. The tool aims to solve the legal industry's challenge of defining 'good' legal work, as ambiguity in contract edits makes evaluation difficult. Initial tests placed ChatGPT 5.5 at the top with a 50.5% score, followed by Gemini 3.5 Flash and Claude Opus 4.8.", "body_md": "Legal technology wants its [vibe-coding](https://www.businessinsider.com/ai-coding-agents-tools-software-engineering-jobs-future-2025-6) moment. But first, it has to prove the tools can think like a lawyer.\n\nTaking up the task is Crosby, a startup-meets-law-firm that sells basic legal services to companies, including [Cursor](https://www.businessinsider.com/cursor-ceo-michael-truell-spacex-elon-musk-anthropic-2026-6) and Rogo. On Wednesday, it released the Redline Bench, a tool built to measure how well artificial intelligence models perform real-world legal tasks, starting with contract review.\n\nSoftware engineers have spent the past few years watching these systems get shockingly good at writing code and debugging errors. Now legal tech companies are chasing a similar prize: artificial intelligence that can review contracts, spot risks, and haggle over terms faster and cheaper than lawyers.\n\nBut law has a problem that coding does not, says [Ryan Daniels](https://www.businessinsider.com/crosby-ceo-job-interview-work-trial-sunday-2026-4), a former in-house lawyer turned Crosby founder. \"It's really hard to define 'good' or 'bad,'\" he said.\n\nModels can write code that either runs or breaks. Legal work is a murkier target. A sales contract can be edited, or \"redlined,\" in lots of defensible ways, Daniels explains. 
A change that one lawyer sees as prudent, another might call too aggressive.\n\nThat ambiguity has become a headache for companies racing to automate legal work, from the scrappy neofirms to the model labs themselves. [Anthropic](https://www.businessinsider.com/anthropic-legal-ai-tool-plugin-competition-lexisnexis-thomson-reuters-2026-2) has spent the past few months courting in-house lawyers with tools built for them. That push has been closely watched by investors. Earlier this year, Anthropic's new legal plugin stirred a sell-off in [legal tech stocks](https://www.businessinsider.com/anthropic-cowork-legal-plugin-publishing-stocks-legalzoom-thomson-reuters-relx-2026-2).\n\nBenchmarks are one of the main ways companies track progress. The labs building frontier models use them as stress tests, measuring whether a new system is better at tasks than the last one.\n\nCoding has [hundreds of benchmarks](https://arxiv.org/html/2503.05860v2#S1) for evaluating models. But the legal industry still lacks a shared way to answer the question: Is the AI's work any good?\n\nCrosby has been working on a new yardstick. The company pulled its engineers and lawyers into a tactical unit called Crosby Intelligence to build agents for Crosby's law firm and a benchmark to grade them against. That team includes engineer Sharan Ramjee, who worked on transformer models to sniff out fraud at Stripe, and Ross Weiser, a lawyer who joined from elite law firm Sullivan & Cromwell.\n\nCrosby also partnered with Micro1, a company that helps model-makers recruit expert workers, to find more lawyers who could help define what counts as good legal work.\n\nTo build the benchmark, senior lawyers simulated software deals and marked the contract changes they considered most important at each stage of the negotiation. Those changes were turned into weighted criteria.\n\nWhen Crosby runs a new test, it gives models the same contracts and asks them to make their own edits. 
Then a panel of three judges compares these redlines with the lawyer-built rubric. The judges vote pass or fail on each item, and the final score shows how often the models made the kinds of edits that lawyers considered important.\n\nRedline Bench will be made public so any lab can put its models through Crosby's paces. Crosby also plans to regularly release reports tracking how major models compare.\n\nThe first release of the Redline Bench put ChatGPT 5.5 at the top of the heap, with a score of 50.5%, meaning the model's redlines matched roughly half of the edits that lawyers prioritized. Gemini 3.5 Flash followed at 45.1%, and Claude Opus 4.8 scored 44.4%.\n\nCrosby was able to test Anthropic's highly capable new model, [Fable 5](https://www.businessinsider.com/why-white-house-ordered-export-controls-anthropic-mythos-fable-2026-6), only once before Anthropic pulled it off the shelves. The results were promising, with a score of 47.3%. When access is restored, Crosby will run the benchmark again and update it.\n\nCrosby isn't the only company trying to measure how the models stack up. Harvey, one of the best-funded legal startups, has released benchmarks for case law research and contract review.\n\nAnthropic and [OpenAI](https://www.businessinsider.com/how-openai-lawyer-nicole-diaz-uses-ai-for-legal-work-2026-6) also build their own benchmarks to measure performance on real-world tasks. But Daniels said those results can be hard to trust. Over time, the labs tune their systems to perform well on their own tests, he said.\n\nThe stakes are bigger than a scoreboard. Billions of investment dollars are riding on the promise that artificial intelligence can lower legal bills and absorb work that used to pile up on the general counsel's desk.\n\nLawyers will only use the tools if they trust them. 
Crosby wants to give them a reason to.", "url": "https://wpnews.pro/news/one-of-legal-s-hottest-startups-is-helping-lawyers-finally-answer-is-the-ai-s", "canonical_source": "https://www.businessinsider.com/crosby-releases-redline-bench-evaluate-ai-models-for-contract-review-2026-6", "published_at": "2026-06-17 09:30:01+00:00", "updated_at": "2026-06-17 09:59:50.203314+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-tools", "ai-products", "large-language-models", "ai-research"], "entities": ["Crosby", "Ryan Daniels", "Anthropic", "ChatGPT 5.5", "Gemini 3.5 Flash", "Claude Opus 4.8", "Micro1", "Sharan Ramjee"], "alternates": {"html": "https://wpnews.pro/news/one-of-legal-s-hottest-startups-is-helping-lawyers-finally-answer-is-the-ai-s", "markdown": "https://wpnews.pro/news/one-of-legal-s-hottest-startups-is-helping-lawyers-finally-answer-is-the-ai-s.md", "text": "https://wpnews.pro/news/one-of-legal-s-hottest-startups-is-helping-lawyers-finally-answer-is-the-ai-s.txt", "jsonld": "https://wpnews.pro/news/one-of-legal-s-hottest-startups-is-helping-lawyers-finally-answer-is-the-ai-s.jsonld"}}