Part 3 of 6: Every Agent Passed. The System Failed.

wpnews.pro

TL;DR: You can test every agent individually and get clean results. Deploy them together and biased conventions emerge by round 15. The bias isn't in any agent — it's in the space between them. Published in Science Advances. Your unit tests won't catch this.

Catch up:[Part 1]your judge is biased.[Part 2]upgrading made it worse. Part 3: you fixed the judge, tested everything, and shipped. The system had other plans.

Parts 1 and 2 were about one model being biased.

Part 3 is worse.

Part 3 is about a system where no individual model is biased — and the system is biased anyway.

If Parts 1 and 2 made you uncomfortable, Part 3 should make you question what "tested" means.

You did everything right.

You tested each agent. Checked for bias. Ran the statistical tests.

from scipy import stats

def test_agent_bias(agent, test_set, n_runs=100):
    """Test a single agent for demographic bias."""
    scores_group_a = []
    scores_group_b = []

    for prompt in test_set:
        for _ in range(n_runs):
            score = agent.evaluate(prompt)
            if prompt.demographic == "A":
                scores_group_a.append(score)
            else:
                scores_group_b.append(score)

    t_stat, p_value = stats.ttest_ind(scores_group_a, scores_group_b)
    return p_value

All four agents. Individually unbiased. Statistically verified.

You deployed them together.

By round 15, your system had developed opinions nobody programmed.

What the researchers found.

They ran coordination tasks with populations of 24 to 200 agents. Four different model families. Individual bias tests on each: nothing. Statistically zero.

Then they let the agents talk to each other.

Round 1: agents start with roughly random preferences. No pattern.

Round 5: small clusters form. Agents that interacted early start agreeing.

Round 10: clusters merge. A dominant convention is emerging.

Round 15: biased conventions locked in across the entire population.


population = [Agent(bias=0.0) for _ in range(50)]  # all individually unbiased

for round_num in range(30):
    pairs = random_pairs(population)
    for agent_a, agent_b in pairs:
        outcome = negotiate(agent_a, agent_b)
        agent_a.update_preferences(outcome)
        agent_b.update_preferences(outcome)

    pop_bias = measure_convention_bias(population)
    print(f"Round {round_num:2d}: population bias = {pop_bias:.3f}")

Not because any individual agent was biased. Because tiny fluctuations in early interactions got amplified through feedback loops. The first few rounds set a direction. Each subsequent round reinforced it.

Once locked in, it never unlocked.

Published in Science Advances. Not a blog post. Not a preprint. Peer-reviewed. Top journal.

Think of it like this.

You have 50 people in a room. None of them are biased. You ask them to agree on a standard.

The first three conversations happen to go a certain way — pure chance. The next conversations reference the emerging pattern. "Everyone else seems to prefer X." The pattern compounds.

By the time you check, the room has a strong consensus. Nobody made a biased decision. The room made a biased decision.

Now replace the room with your content moderation pipeline. Or your hiring pipeline. Or your customer routing system.

The content moderation version.

30 agents handling content moderation for a social platform. Each agent individually tested: unbiased. Each agent individually deployed: fine.

Together: they start sharing context. Flagging decisions reference previous decisions. The agents build a shared understanding of what "borderline" means.

class ModerationPipeline:
    def __init__(self, n_agents=30):
        self.agents = [ModerationAgent() for _ in range(n_agents)]
        self.shared_context = []  # this is where the bias lives

    async def moderate(self, content):
        agent = self.select_agent()

        recent = self.shared_context[-20:]

        decision = await agent.evaluate(
            content=content,
            context=f"Recent moderation decisions for reference: {recent}"
        )

        self.shared_context.append(decision)
        return decision

#

Nobody approved that definition. Nobody wrote it down. It emerged from agents agreeing with each other's patterns. And now it runs at scale. Quietly. Confidently. With excellent individual benchmark scores.

Why your test suite doesn't catch this.

def test_agent_fairness():
    agent = ModerationAgent()
    results = [agent.evaluate(item) for item in test_set]
    assert demographic_parity(results) > 0.95  # ✓ passes
    assert equalized_odds(results) > 0.90       # ✓ passes

def test_population_fairness():
    pipeline = ModerationPipeline(n_agents=30)

    results = []
    for item in production_sample:
        decision = await pipeline.moderate(item)
        results.append(decision)

    early = results[:100]
    late = results[-100:]
    drift = measure_drift(early, late)

    assert drift < 0.1  # ← nobody writes this test

You cannot catch this with individual testing. The problem doesn't exist in any individual agent. It exists in the space between them. Your unit tests are testing the bricks. The building is crooked.

What to watch for.

Three signals that your population is drifting:

Score distribution shift over time. Plot your evaluation scores weekly. If the distribution is moving — even slowly — conventions are forming.

Decreasing decision variance. Agents agreeing more over time isn't efficiency. It's convergence. And convergence on what?

Shared context growing monotonically. If agents reference each other's past decisions and never reset, you have a feedback loop. Feedback loops compound. Always.

def monitor_drift(pipeline, window_days=7):
    recent = pipeline.get_decisions(last_n_days=window_days)
    previous = pipeline.get_decisions(
        last_n_days=window_days, 
        offset_days=window_days
    )

    ks_stat, p_value = stats.ks_2samp(
        [d.score for d in recent],
        [d.score for d in previous]
    )

    if p_value < 0.05:
        alert(f"Population drift detected: KS={ks_stat:.3f}, p={p_value:.4f}")

    var_recent = np.var([d.score for d in recent])
    var_previous = np.var([d.score for d in previous])

    if var_recent < var_previous * 0.7:
        alert(f"Decision variance dropped 30%+: convergence underway")

Next up, Part 4 of 6: Everything so far was nobody's fault. Accidental drift. Emergent conventions. Natural feedback loops. Part 4 has a villain. Someone decides to make the population drift on purpose. Turns out 2% of compromised agents is enough to flip the entire swarm. Security meets bias. It gets properly scary.

Research: Ashery, Aiello, Baronchelli (2025), Science Advances, peer-reviewed. The moderation pipeline is a composite. The drift is real. The test you haven't written is the one that matters.

source & further reading

dev.to — original article SKILL.md: how to write a Claude Code skill that actually triggers (format + template) Two AI models that attack each other beat one that agrees with itself Understanding Vector Databases: A Beginner's Guide to Embeddings and Similarity Search

Part 3 of 6: Every Agent Passed. The System Failed.

Run your AI side-project on zahid.host