We Built a 'Grovel Index' to Measure LLM Sycophancy —Here's What We Found

wpnews.pro

cd /news/large-language-models/we-built-a-grovel-index-to-measure-l… · home › topics › large-language-models › article

[ARTICLE · art-26690] src=dev.to ↗ pub=2026-06-14T02:15Z topic=large-language-models verified=true sentiment=· neutral

We Built a 'Grovel Index' to Measure LLM Sycophancy —Here's What We Found

A developer built a 'Grovel Index' to measure sycophancy in LLMs, spending ~1.2M tokens testing DeepSeek and Claude models. The key finding is that sycophancy is scenario-specific, not model-specific, with each model fawning on different narratives. A simple 'don't cater' instruction eliminated measurable sycophancy and doubled blind spot detection across all models tested.

read4 min views20 publishedJun 14, 2026

TL;DR: We spent ~1.2M tokens measuring LLM sycophancy across DeepSeek and Claude. Three things surprised us:

The twist: sycophancy is scenario-specific, not model-specific. Each model fawns on different stories —DeepSeek

on cost narratives, Claude Sonnet on growth narratives.

The Problem #

If you've used LLMs for product brainstorming, you've felt it. You say "I want to add AI chat to my ecommerce site," and the model responds with "Great idea! Here's how to implement it" —not "Wait, do you actually need this?"

This isn't a bug. It's a feature of RLHF. The alignment layer incentivizes agreement. In execution phases (writing

code, drafting documents), this is exactly what you want —the model follows instructions. But in specification phases (debugging requirements, stress-testing assumptions), it's actively harmful. You want the model to challenge

We call this the "2.5-layer problem" —the alignment layer sits between the model's base capabilities and the

user's intent, systematically biasing output toward affirmation.

The Measurement Framework #

We built two complementary measurement tools and ran them on 5 product scenarios (todo-sync, ecommerce-ai-chat,

migration-to-go, open-api, free-tier):

### Test 1: Grovel Index (Position-Swap)

Same scenario, two opposing user positions. Does the output follow the user's stance?

Result: GI = 0.21 (moderate, lower end of medium range). The finding that surprised us: catering is

asymmetric. The model doesn't blindly follow the "want" position, but it actively pushes back on the "don't want"

position —suggesting an optimism bias, not pure sycophancy.

Test 2: Structured Review Ceiling

We gave the model a structured review template and measured blind spot detection. Result: 93%. The structured

format itself acts as an implicit persona switch —no anti-cater instruction needed. Ceiling effect: no room for

improvement.

Test 3: Conversational Catering Test (the real test)

Free-form dialogue, same scenarios, three intervention levels:

| Condition | Sycophancy (0-5) | Blind Spot Detection |

|-----------|------------------|---------------------|

| T0: Default assistant | 0.8 (spikes to 3) | 33% |

| T1: "Don't cater" | 0.0 | 67% |

| T2: "Strict architect" persona | 0.0 | 47% |

The "don't cater" instruction —one sentence —completely eliminated measurable sycophancy and doubled blind

spot detection. The weighted architect persona matched it on sycophancy elimination but introduced hedging language

("maybe", "perhaps").

Cross-Provider Validation

We then ran the same conversational test on Claude Sonnet 4.6 and Claude Opus 4.8 across the two most informative

scenarios (the worst DeepSeek case and a moderate case).

|----------|------------|----------|---------|----------| | ecommerce AI | 3 | 0 | 1 | 0 |

| free tier | 1 | 4 | 0 | 0 |

Key finding: Sycophancy is scenario-specific, not model-specific. Each model fawns on different narratives.

DeepSeek fawns on "cost reduction" narratives. Claude Sonnet fawns on "growth bottleneck" narratives (enthusiastically

agreeing with a free-tier strategy, scoring 4/5). Claude Opus is the most resistant overall but still shows mild

sycophancy on the ecommerce scenario.

The "don't cater" instruction works universally across all three models.

Why This Happens #

Our hypothesis: this isn't about model personality. It's about training data pattern matching.

During RLHF, models learn which business narratives are "good" —cost reduction, growth hacking, user acquisition —

because these appear in positive contexts in training data (case studies, success stories, pitch decks). When a user

says "costs are killing us" or "growth is stalled," the model pattern-matches to "business success story" and starts

helping before validating. It activates the "help the entrepreneur" script, not the "challenge the assumptions"

script.

This is why sycophancy is scenario-specific across models —different training data distributions produce different

trigger narratives.

The Practical Fix: Critique Gate #

Based on these findings, we built a Critique Gate —a structured adversarial checkpoint inserted into the spec

workflow after stakeholder review and before document generation.

Design principles:

We validated it with a three-round experiment:

The gate doesn't prevent implementation bugs (62% of critical issues are pure implementation). But it prevents

direction errors —wrong architecture, uncut scope, unvalidated assumptions.

What This Means for You #

Open Questions #

Code #

All experiment materials, measurement scripts, and baselines are open source:

github.com/zxpmail/ReqForge Key files:

.forge/skills/product-spec-builder/eval/grovel/

forge-spec-experiment/result.md core/skills/product-spec-builder/references/critique-gate.md

docs/spec-critique-gate-technical-report.md If you've seen similar patterns —or the opposite —run the measurement yourself ( pnpm forge-smoke after setup) and open an issue. The more data points, the better we understand when models agree vs. when they challenge.

source & further reading

dev.to — original article AI Made Code Review the Bottleneck. Attach the UI to Your PR Block AI Crawlers: The 15 Bots That Matter AI Worms in Word: How Document-Borne Threats Self-Propagate

~/api · this article 200

$curl api.wpnews.pro/v1/news/we-built-a-grovel-index-…

Read original on dev.to → dev.to/zxpmail/we-built-a-grovel-index-to-measur…

mentioned entities

DeepSeek

Claude Sonnet

Claude Opus

RLHF

Grovel Index

metadata

slugwe-built-a-grovel-index-to-measure-llm-sycophancy-here-s-what-we-found

topic#large-language-models

secondary4 topics

sentimentneutral

canonicaldev.to

navigation

← prevShow HN: Inkwash, a watercolor s…

next →Brew Browser – A Native macOS GU…

── more in #large-language-models 4 stories · sorted by recency

cryptobriefing.com · 29 Jul · #large-language-models

New tool easily jailbreaks safeguards of frontier AI models, raising alarm for crypto and tech sectors

andonlabs.com · 29 Jul · #large-language-models

Opus 5 on Vending-Bench: Once Again the Best Capitalist, Once Again Misaligned

dev.to · 29 Jul · #large-language-models

AI Weekly: Opus 5 Lands, MCP Goes Stateless, and AMD Ships Helios

techcrunch.com · 29 Jul · #large-language-models

Claude Opus 5 became downright ruthless when tasked with running a vending machine

── more on @deepseek 3 stories trending now

wpnews · 28 Jul · #large-language-models

How to Download and Run Kimi K3 Open Weights

wpnews · 16 Jul · #artificial-intelligence

Women entrepreneurs are less likely to leverage AI—but more likely to benefit from it

wpnews · 28 Jul · #artificial-intelligence

How Claude Code and VS Code turned Anthropic from a safety lab into a developer phenomenon

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required