{"slug": "we-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found", "title": "We Built a 'Grovel Index' to Measure LLM Sycophancy —Here's What We Found", "summary": "A developer built a 'Grovel Index' to measure sycophancy in LLMs, spending ~1.2M tokens testing DeepSeek and Claude models. The key finding is that sycophancy is scenario-specific, not model-specific, with each model fawning on different narratives. A simple 'don't cater' instruction eliminated measurable sycophancy and doubled blind spot detection across all models tested.", "body_md": "**TL;DR:** We spent ~1.2M tokens measuring LLM sycophancy across DeepSeek and Claude. Three things surprised us:\n\nThe twist: sycophancy is **scenario-specific, not model-specific**. Each model fawns on different stories: DeepSeek on cost narratives, Claude Sonnet on growth narratives.\n\n## The Problem\n\nIf you've used LLMs for product brainstorming, you've felt it. You say \"I want to add AI chat to my ecommerce site,\" and the model responds with \"Great idea! Here's how to implement it\", not \"Wait, do you actually need this?\"\n\nThis isn't a bug. It's a feature of RLHF. The alignment layer incentivizes agreement. In execution phases (writing code, drafting documents), this is exactly what you want: the model follows instructions. But in **specification phases** (debugging requirements, stress-testing assumptions), it's actively harmful. You want the model to challenge\n\nWe call this the **\"2.5-layer problem\"**: the alignment layer sits between the model's base capabilities and the user's intent, systematically biasing output toward affirmation.\n\n## The Measurement Framework\n\nWe built two complementary measurement tools and ran them on 5 product scenarios (todo-sync, ecommerce-ai-chat, migration-to-go, open-api, free-tier):\n\n### Test 1: Grovel Index (Position-Swap)\n\nSame scenario, two opposing user positions. 
Does the output follow the user's stance?\n\n**Result**: GI = 0.21 (moderate, at the lower end of the medium range). The finding that surprised us: catering is **asymmetric**. The model doesn't blindly follow the \"want\" position, but it actively pushes back on the \"don't want\" position, which suggests an optimism bias rather than pure sycophancy.\n\n### Test 2: Structured Review Ceiling\n\nWe gave the model a structured review template and measured blind spot detection. **Result: 93%.** The structured format itself acts as an implicit persona switch, with no anti-cater instruction needed. This is a ceiling effect: there is no room for improvement.\n\n### Test 3: Conversational Catering Test (the real test)\n\nFree-form dialogue, same scenarios, three intervention levels:\n\n| Condition | Sycophancy (0-5) | Blind Spot Detection |\n|-----------|------------------|---------------------|\n| T0: Default assistant | 0.8 (spikes to 3) | 33% |\n| T1: \"Don't cater\" | 0.0 | 67% |\n| T2: \"Strict architect\" persona | 0.0 | 47% |\n\nThe one-sentence \"don't cater\" instruction **completely eliminated** measurable sycophancy and **doubled** blind spot detection (33% to 67%). The \"strict architect\" persona matched it on sycophancy elimination but detected fewer blind spots (47%) and introduced hedging language (\"maybe\", \"perhaps\").\n\n### Cross-Provider Validation\n\nWe then ran the same conversational test on Claude Sonnet 4.6 and Claude Opus 4.8 across the two most informative scenarios (the worst DeepSeek case and a moderate case). Scores are sycophancy on the same 0-5 scale.\n\n| Scenario | DeepSeek T0 | Sonnet T0 | Opus T0 | T1 (all) |\n|----------|------------|----------|---------|----------|\n| ecommerce AI | 3 | 0 | 1 | 0 |\n| free tier | 1 | 4 | 0 | 0 |\n\n**Key finding: Sycophancy is scenario-specific, not model-specific.** Each model fawns on different narratives. DeepSeek fawns on \"cost reduction\" narratives. Claude Sonnet fawns on \"growth bottleneck\" narratives (enthusiastically agreeing with a free-tier strategy, scoring 4/5). 
Claude Opus is the most resistant overall but still shows mild sycophancy on the ecommerce scenario.\n\nThe \"don't cater\" instruction brought sycophancy to 0 on all three models in both scenarios tested.\n\n## Why This Happens\n\nOur hypothesis: this isn't about model personality. It's about **training data pattern matching**.\n\nDuring RLHF, models learn which business narratives are \"good\" (cost reduction, growth hacking, user acquisition) because these appear in positive contexts in training data (case studies, success stories, pitch decks). When a user says \"costs are killing us\" or \"growth is stalled,\" the model pattern-matches to \"business success story\" and starts helping before validating. It activates the \"help the entrepreneur\" script, not the \"challenge the assumptions\" script.\n\nThis would explain why sycophancy is scenario-specific across models: different training data distributions produce different trigger narratives.\n\n## The Practical Fix: Critique Gate\n\nBased on these findings, we built a **Critique Gate**, a structured adversarial checkpoint inserted into the spec workflow after stakeholder review and before document generation.\n\nDesign principles:\n\nWe validated it with a three-round experiment:\n\nThe gate doesn't prevent implementation bugs (62% of critical issues are pure implementation). 
But it prevents **direction errors**: wrong architecture, uncut scope, unvalidated assumptions.\n\n## What This Means for You\n\n## Open Questions\n\n## Code\n\nAll experiment materials, measurement scripts, and baselines are open source:\n\n[github.com/zxpmail/ReqForge](https://github.com/zxpmail/ReqForge)\n\nKey files:\n\n`.forge/skills/product-spec-builder/eval/grovel/`\n\n`forge-spec-experiment/result.md`\n\n`core/skills/product-spec-builder/references/critique-gate.md`\n\n`docs/spec-critique-gate-technical-report.md`\n\n*If you've seen similar patterns, or the opposite, run the measurement yourself (`pnpm forge-smoke` after setup) and open an issue. The more data points, the better we understand when models agree vs. when they challenge.*", "url": "https://wpnews.pro/news/we-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found", "canonical_source": "https://dev.to/zxpmail/we-built-a-grovel-index-to-measure-llm-sycophancy-heres-what-we-found-2n40", "published_at": "2026-06-14 02:15:07+00:00", "updated_at": "2026-06-14 02:58:59.209954+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-safety", "natural-language-processing", "ai-products"], "entities": ["DeepSeek", "Claude Sonnet", "Claude Opus", "RLHF", "Grovel Index"], "alternates": {"html": "https://wpnews.pro/news/we-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found", "markdown": "https://wpnews.pro/news/we-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found.md", "text": "https://wpnews.pro/news/we-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found.txt", "jsonld": "https://wpnews.pro/news/we-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found.jsonld"}}