{"slug": "i-asked-three-ai-models-the-same-api-question-only-one-had-it-right", "title": "I asked three AI models the same API question. Only one had it right.", "summary": "A developer built a tool that queries three AI models in parallel after discovering that models confidently invented a non-existent Bitrix24 API method. A benchmark of 60 questions found that while general knowledge questions achieved a 92.5% median consensus, narrow technical questions, such as specific API methods, scored only 33% median consensus. The tool signals when models disagree, indicating that at least one is likely guessing, rather than determining which answer is correct.", "body_md": "One Tuesday I wasted two hours chasing a Bitrix24 (an ERP/CRM platform) API method that doesn't exist. The model I asked described it like it was right there in the docs - full description, code example, confident tone. The method was `crm.item.userfield.add`. Made up.\n\nThe real one is `userfieldconfig.add`. It's in the official documentation.\n\nThat evening I kept thinking about one thing: what if I could see when models disagree? Not which one is right - I won't always know. Just a signal. *Something's off here, check before you use it.*\n\nSo I built a tool. Three models, same question, in parallel. Watch where they split. I added an interface, then more features, then other people started using it. Now it's a product, which still feels weird to say about something I built for my own Tuesday afternoons.\n\nA few weeks ago I ran a benchmark - 60 questions, half general knowledge, half narrow technical (specific API methods, library behavior, niche platforms).\n\nGeneral questions: median consensus **92.5%**. Models hedge on subjective questions and tend to say the same things in slightly different words.\n\nTechnical questions: median consensus **33%**.\n\nThe Bitrix case is the clearest example. 
Question: how do you create a custom user field for a smart process in Bitrix24?\n\nThree answers:\n\n`crm.item.userfield.add`\n\n`crm.userfield.add`\n\n`userfieldconfig.add`\n\nI checked all three against the official docs. Only one - `userfieldconfig.add` - was the right method for smart processes. The other two were either invented or borrowed from a different part of the API where they don't apply.\n\nAll three answered with the same confident tone. No hedging, no uncertainty. If you'd asked just one and gotten a wrong answer, you'd have had no reason not to trust it.\n\nWorth being precise about what the consensus score means.\n\nIt doesn't tell you which answer is correct - the synthesizer model underneath doesn't have access to ground truth either. It tells you something simpler: when three independently queried models converge, you're asking about something well-covered in training data. When they diverge, the data is thin or inconsistent, and at least one model is guessing.\n\n**33% consensus** means three models, three different answers. Someone's wrong. Probably two of them.\n\nGeneral questions cluster at **90–95%**. That's just well-covered territory, not a useful signal either way. The outliers are what matter - specific API methods, recent spec changes, niche platform behavior. These appear rarely enough in training data that different models develop different \"memories\" of the same fact.\n\nYou can't fix this by switching to a better model. It's a triangulation problem.\n\nI'm a single developer. I built this because I kept running into the same specific thing - not just \"AI got it wrong,\" but \"AI got it wrong and sounded exactly as confident as when it gets it right.\" That's hard to work around without a cross-check.\n\nFree tier: 3 queries - try it on something you've been trusting one model for. 
Founding tier: $9/month for the first 100 people, price locked for 3 years.", "url": "https://wpnews.pro/news/i-asked-three-ai-models-the-same-api-question-only-one-had-it-right", "canonical_source": "https://dev.to/concordance_ai/i-asked-three-ai-models-the-same-api-question-only-one-had-it-right-3lfd", "published_at": "2026-05-25 15:09:34+00:00", "updated_at": "2026-05-25 15:34:09.322588+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-products", "ai-startups"], "entities": ["Bitrix24", "crm.item.userfield.add", "userfieldconfig.add", "crm.userfield.add"], "alternates": {"html": "https://wpnews.pro/news/i-asked-three-ai-models-the-same-api-question-only-one-had-it-right", "markdown": "https://wpnews.pro/news/i-asked-three-ai-models-the-same-api-question-only-one-had-it-right.md", "text": "https://wpnews.pro/news/i-asked-three-ai-models-the-same-api-question-only-one-had-it-right.txt", "jsonld": "https://wpnews.pro/news/i-asked-three-ai-models-the-same-api-question-only-one-had-it-right.jsonld"}}