I Analyzed 1,000 AI-Generated Blog Posts for Quality. Here's the Data.

wpnews.pro

Last year, I was doing something that felt increasingly absurd: manually reading AI-generated content to decide if it was "good enough."

PostAll — the content automation tool I've been building — was producing hundreds of blog posts per week for clients. And I had no systematic way to evaluate quality at scale. I was spot-checking. Vibes-checking, really. That doesn't work at volume.

So I built a programmatic quality analysis pipeline, ran it over 1,000 AI-generated posts, and let the numbers tell me what my gut was missing.

The findings surprised me. A few of them genuinely changed how I think about AI content quality.

First, a definition of terms, because "quality" is almost meaninglessly vague in this space.

I broke quality into five measurable dimensions: readability, grammar error rate, keyword density, structural completeness, and factual accuracy.

I used 1,000 posts across three categories: SaaS product descriptions, long-form "how-to" articles (1,200–2,000 words), and listicles (500–900 words). All were generated by PostAll using GPT-4o, with various prompting strategies.

The analysis pipeline isn't complicated. These are the core analyzers; the batch processing layer that runs them over every post is the piece that makes it useful:

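import anthropic
import language_tool_python
import textstat
from dataclasses import dataclass
import json

# LanguageTool rule IDs treated as style suggestions rather than errors.
# These two IDs are examples only (an assumption); list the rules you want to ignore.
STYLE_RULE_IDS = {"PASSIVE_VOICE", "TOO_LONG_SENTENCE"}

@dataclass
class QualityReport:
    post_id: str
    flesch_reading_ease: float
    flesch_kincaid_grade: float
    grammar_errors_per_1000_words: float
    keyword_density: float
    structural_score: int  # 0–5 based on element presence
    flagged_claims: list[str]
    overall_score: float

# Starting LanguageTool is slow, so create one instance and reuse it.
tool = language_tool_python.LanguageTool('en-US')

def analyze_readability(text: str) -> dict:
    """Return the two textstat readability scores used in the report."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    }

def analyze_grammar(text: str) -> float:
    """Return grammar errors per 1,000 words, ignoring style-only rules."""
    word_count = len(text.split())
    if word_count == 0:
        return 0.0  # avoid dividing by zero on an empty post
    matches = tool.check(text)
    errors = [m for m in matches if m.ruleId not in STYLE_RULE_IDS]
    return (len(errors) / word_count) * 1000

def analyze_structure(text: str, expected_elements: list[str]) -> int:
    """Count how many expected elements appear in the text (case-insensitive substring match)."""
    lowered = text.lower()
    # Lowercase the element too, otherwise "Conclusion" never matches the lowered text.
    return sum(1 for element in expected_elements if element.lower() in lowered)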

The STYLE_RULE_IDS exclusion is important. LanguageTool flags passive voice, comma splices, and other style choices that are sometimes intentional. Without filtering, style suggestions inflate the error count and make decent posts look broken.
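
The batch processing layer mentioned earlier is not shown in the post. A minimal sketch of what such a layer could look like, assuming each post can be analyzed independently and that the work is mostly waiting on LanguageTool and API calls (so threads are enough). The names analyze_batch and word_count_report are hypothetical, and the stand-in analyzer only counts words so the example runs without any external service:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def analyze_batch(
    posts: dict[str, str],
    analyze_one: Callable[[str, str], dict],
    max_workers: int = 8,
) -> tuple[list[dict], dict[str, str]]:
    """Run analyze_one(post_id, text) over every post.

    Returns (reports, failures). failures maps post_id to an error message,
    so one bad post does not abort the whole run.
    """
    reports: list[dict] = []
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            post_id: pool.submit(analyze_one, post_id, text)
            for post_id, text in posts.items()
        }
        # Collect results in submission order so output is deterministic.
        for post_id, future in futures.items():
            try:
                reports.append(future.result())
            except Exception as exc:  # record the failure and keep going
                failures[post_id] = f"{type(exc).__name__}: {exc}"
    return reports, failures


def word_count_report(post_id: str, text: str) -> dict:
    """Stand-in analyzer for the demo; a real one would build a QualityReport."""
    if not text.strip():
        raise ValueError("empty post")
    return {"post_id": post_id, "words": len(text.split())}


reports, failures = analyze_batch(
    {"a": "one two three", "b": "   ", "c": "four five"},
    word_count_report,
)
```

Keeping failures separate from reports means a single malformed post shows up in a log instead of silently dropping out of the averages.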

For factual accuracy, I used a two-step approach: first, extract named entities and statistics using Claude, then cross-reference against a curated knowledge base for the client's domain. This doesn't catch everything — but it catches the most common failure modes (wrong dates, made-up statistics, hallucinated product names).

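def extract_verifiable_claims(post_text: str) -> list[str]:
    """Ask Claude for the verifiable claims in a post; return [] if the reply is not a JSON array."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract all factual claims from this post that could be verified:
            - Statistics with numbers
            - Named products, companies, or people with specific attributes
            - Dates and years
            - Prices or measurements

            Return as a JSON array of strings. Only include claims, not opinions.

            Post:
            {post_text}"""
        }]
    )

    try:
        claims = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # A reply that is not bare JSON (for example, wrapped in prose) is treated as "no claims".
        return []
    # Guard against valid JSON that is not an array, such as an object.
    if not isinstance(claims, list):
        return []
    return [str(claim) for claim in claims]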
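
The second step, cross-referencing against the knowledge base, is not shown in the post. Here is a deliberately simple sketch of the idea, under loud assumptions: the knowledge base is a hypothetical dict mapping an entity name to the numbers known to be true for it, "Acme CRM" is an invented product, and matching is plain substring and number comparison. A real knowledge base would need a richer schema and number handling (this one splits "1,000" into two numbers).

```python
import re

# Hypothetical knowledge base: entity name -> numeric facts accepted for it.
KNOWLEDGE_BASE: dict[str, set[str]] = {
    "acme crm": {"2019", "49"},  # invented product: launch year and monthly price
}

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def check_claim(claim: str, knowledge_base: dict[str, set[str]]) -> str:
    """Classify a claim as 'supported', 'contradicted', or 'unverifiable'."""
    lowered = claim.lower()
    numbers = set(NUMBER_PATTERN.findall(lowered))
    for entity, known_numbers in knowledge_base.items():
        if entity in lowered and numbers:
            # Every number in the claim must be a known fact for the entity.
            return "supported" if numbers <= known_numbers else "contradicted"
    # No known entity mentioned, or nothing numeric to compare.
    return "unverifiable"


def flag_claims(claims: list[str], knowledge_base: dict[str, set[str]]) -> list[str]:
    """Return the claims that could not be confirmed against the knowledge base."""
    return [c for c in claims if check_claim(c, knowledge_base) != "supported"]


example_claims = [
    "Acme CRM launched in 2019",
    "Acme CRM launched in 2017",
    "A 2021 study found that 73% of readers skim",
]
flagged = flag_claims(example_claims, KNOWLEDGE_BASE)
```

In this sketch both contradicted and unverifiable claims are flagged, which mirrors how the post counts "unverifiable or contradicted" claims together.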

Here's what came back after running 1,000 posts through the pipeline.

Post Type	Avg. Flesch Reading Ease	Avg. FK Grade Level
Product descriptions	52.3	11.2
How-to articles	44.7	13.1
Listicles	61.8	9.4

For context: a score of 60–70 on the Flesch scale is considered "standard" — roughly plain English. A grade level of 8 is the target for most consumer-facing content.
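
For readers who want to see where those two numbers come from, here is a sketch of the standard Flesch Reading Ease and Flesch-Kincaid Grade formulas. The syllable counter is a rough vowel-group heuristic of my own, so scores will differ somewhat from textstat's; treat it as an illustration of the formulas, not a replacement for the library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict[str, float]:
    """Compute both scores from words per sentence and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "flesch_reading_ease": (
            206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
        ),
        "flesch_kincaid_grade": (
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
        ),
    }


easy = readability("The cat sat on the mat.")
hard = readability(
    "Organizational interoperability necessitates comprehensive documentation."
)
```

Both formulas depend on only two ratios, which is why "use short sentences" and "prefer common words" are the two levers that move the score.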

The problem: the how-to articles were reading at a college level on average. When I looked at the worst offenders (grade 16+), the pattern was consistent — they were posts where I'd given the model a technically dense brief without explicitly requesting plain language output.

The fix was embarrassingly simple:

READABILITY_INSTRUCTION = """
Write at a 7th to 8th grade reading level (Flesch-Kincaid).
Use short sentences. Prefer common words over technical jargon
unless the term is essential and explained on first use.
Aim for a Flesch Reading Ease score above 60.
"""

Adding this to the system prompt dropped the average grade level from 13.1 to 9.4 across the how-to category. One instruction. Measurable improvement.

On grammar, the average across all 1,000 posts was 2.1 errors per 1,000 words.

For human-written content, industry benchmarks sit around 3–5 errors per 1,000 words for first drafts. So the AI was actually outperforming average human first drafts on raw grammar.

But here's what the aggregate hides: the distribution was bimodal. 80% of posts had fewer than 1 error per 1,000 words. The remaining 20% had 8+ errors — almost always in posts that exceeded 1,500 words and involved a complex prompt with multiple constraints.
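
A small sketch of how a summary can expose that kind of split instead of hiding it behind one average. The thresholds (1 and 8 errors per 1,000 words) come from the paragraph above; the function name and the five sample values are made up purely to exercise the code and are not data from the study.

```python
from statistics import mean


def summarize_error_rates(
    rates: list[float], low: float = 1.0, high: float = 8.0
) -> dict[str, float]:
    """Report the mean alongside the share of posts in each tail."""
    if not rates:
        raise ValueError("rates must not be empty")
    total = len(rates)
    return {
        "mean": mean(rates),
        "share_below_low": sum(r < low for r in rates) / total,
        "share_at_or_above_high": sum(r >= high for r in rates) / total,
    }


# Toy input: four clean posts and one bad one.
summary = summarize_error_rates([0.2, 0.4, 0.0, 0.6, 9.0])
```

On the toy input the mean lands near 2 even though no single post is anywhere near 2, which is the same shape of problem the aggregate number above conceals.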

The hypothesis: when the model is managing a lot of constraints simultaneously (tone + structure + keyword requirements + specific claims to include), grammar degrades. It's spending its "attention" elsewhere.

My mitigation was to break complex posts into two passes: structure and content first, then a second pass focused specifically on prose quality and grammar. Error rate in the high-complexity category dropped from 9.3 to 2.8.
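
The two-pass idea can be sketched as follows. This is an outline under assumptions, not PostAll's code: generate is a hypothetical wrapper around whatever model call you use, taking a system prompt and a user prompt, and both instruction strings are my own wording.

```python
from typing import Callable

# Assumed wording for the two passes; tune these for your own briefs.
DRAFT_INSTRUCTION = (
    "Write the post. Focus on structure, required claims, and keywords."
)
POLISH_INSTRUCTION = (
    "Edit the draft below for grammar, prose quality, and consistency. "
    "Do not add or remove facts or change the structure."
)


def two_pass_generate(brief: str, generate: Callable[[str, str], str]) -> str:
    """Pass one builds the content; pass two only polishes the prose.

    generate(system_prompt, user_prompt) -> text is a hypothetical model wrapper.
    """
    draft = generate(DRAFT_INSTRUCTION, brief)
    return generate(POLISH_INSTRUCTION, draft)


# Demo with a fake model so the flow can be checked without an API call.
calls: list[tuple[str, str]] = []


def fake_generate(system_prompt: str, user_prompt: str) -> str:
    calls.append((system_prompt, user_prompt))
    return f"output-{len(calls)}"


result = two_pass_generate("Explain DNS for beginners", fake_generate)
```

Passing the model call in as a function keeps the two-pass logic testable and makes the extra call, and its token cost, explicit.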

Target range for SEO: 1–2% density for the primary keyword.

Density Range	% of Posts
Under 0.5% (keyword stuffing avoided, but too thin)	23%
0.5–1% (acceptable)	31%
1–2% (target range)	29%
2–3% (slightly over-optimized)	12%
Over 3% (keyword stuffed, will hurt rankings)	5%

The surprise: I expected over-stuffing to be the main failure mode. It turns out under-stuffing was almost 5x more common (23% of posts under 0.5% versus 5% over 3%). The model was avoiding the keyword in an apparent attempt to sound natural, which is a good instinct, but it overcorrected.

This one is prompt-sensitive. Explicitly specifying "include the primary keyword [X] approximately 8–12 times in a 1,000-word post" got the distribution into the target range for 74% of subsequent posts.
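
The post does not show how density is computed, so here is one possible definition as a sketch: occurrences of the keyword phrase divided by total words, times 100. That definition is an assumption; some SEO tools weight multi-word phrases by their length, which yields higher numbers. The bucket edges follow the table above, with ties at 1%, 2%, and 3% resolved by my own choice.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def keyword_density(text: str, keyword: str) -> float:
    """Percent density: phrase occurrences / total words * 100."""
    words = TOKEN_PATTERN.findall(text.lower())
    key_tokens = TOKEN_PATTERN.findall(keyword.lower())
    if not key_tokens:
        raise ValueError("keyword must contain at least one word")
    if not words:
        return 0.0
    size = len(key_tokens)
    # Slide a window over the words and count exact phrase matches.
    hits = sum(
        1 for i in range(len(words) - size + 1) if words[i:i + size] == key_tokens
    )
    return 100 * hits / len(words)


def density_bucket(density: float) -> str:
    """Map a density percentage onto the ranges used in the table."""
    if density < 0.5:
        return "under 0.5%"
    if density < 1.0:
        return "0.5-1%"
    if density <= 2.0:
        return "1-2% (target)"
    if density <= 3.0:
        return "2-3%"
    return "over 3%"


sample = " ".join(["email marketing"] * 2 + ["filler"] * 96)  # 100 words
sample_density = keyword_density(sample, "Email Marketing")
```

Under this definition, 8 to 12 mentions in a 1,000-word post works out to 0.8 to 1.2%, so it is worth checking which definition your own SEO tooling uses before setting a target count.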

I scored each post on five structural elements:

Score	% of Posts
5/5	38%
4/5	41%
3/5	15%
2/5 or below	6%

The most commonly missing element: the hook. 34% of posts opened with a definition, a generic statement about the topic, or the phrase "In today's digital landscape." (I have seen "In today's digital landscape" more times than I have seen sunrises. The model defaults to it under pressure.)

The second most commonly missing element: the concrete example. The model would describe a concept clearly but not ground it in a specific scenario. This is fixable with one line in the prompt: "Include at least one specific real-world example or case study."
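
A missing hook is one of the easier structural problems to catch in code. A minimal sketch, assuming a hand-maintained list of generic openers: only the first phrase comes from this post, and the other two are assumed examples you would replace with whatever your own model overuses.

```python
import re

# Only the first phrase comes from the post; the others are assumed examples.
GENERIC_OPENERS = (
    "in today's digital landscape",
    "in today's fast-paced world",
    "in the ever-evolving world of",
)


def first_sentence(text: str) -> str:
    """Return the text up to and including the first ., !, or ?."""
    match = re.search(r"[^.!?]*[.!?]?", text.strip())
    return match.group(0).strip() if match else ""


def has_generic_opener(text: str) -> bool:
    """True when the post opens with one of the known filler phrases."""
    # Normalize curly apostrophes so "today’s" matches "today's".
    opening = first_sentence(text).lower().replace("’", "'")
    return any(opening.startswith(phrase) for phrase in GENERIC_OPENERS)


weak = has_generic_opener("In today’s digital landscape, content is king. More text.")
strong = has_generic_opener("Last Tuesday our signup form broke. Here is why.")
```

A phrase list only catches openers you have already seen, so it works best as a regression check that grows as new filler patterns turn up.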

Factual accuracy is the one that should concern you.

Of the 1,000 posts, the pipeline flagged 147 posts (14.7%) as containing at least one unverifiable or contradicted claim.

Common failure types:

The hallucinated citations were the most dangerous. They looked authoritative. They had the right format. The study just didn't exist.

Three concrete changes came out of this analysis:

1. Readability is now a post-generation check, not just a prompt instruction.

Every post runs through textstat before it's delivered. If the Flesch-Kincaid grade is above 10, the post goes back for a rewrite pass with an explicit readability constraint.

2. Factual claims trigger a verification flag.

Any post with more than three verifiable claims (per the Claude extraction step above) gets flagged for human review before delivery. This catches the hallucinated-citation failure mode without requiring human review of everything.

3. Two-pass generation for anything over 1,200 words.

Pass one: structure and content. Pass two: prose quality, grammar, and consistency check. The compute cost is real — roughly 2x token usage for long-form. The error rate improvement justified it.
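
Taken together, the three changes amount to a small routing decision per post. A sketch of that decision using the thresholds stated above (grade above 10, more than three claims, over 1,200 words); the Routing class and route_post function are hypothetical names, not the pipeline's actual code.

```python
from dataclasses import dataclass

MAX_FK_GRADE = 10.0        # above this, send back for a readability rewrite
MAX_UNREVIEWED_CLAIMS = 3  # more than this, flag for human review
TWO_PASS_WORDS = 1200      # longer than this, use two-pass generation


@dataclass
class Routing:
    two_pass: bool
    needs_rewrite: bool
    needs_human_review: bool


def route_post(target_words: int, fk_grade: float, claim_count: int) -> Routing:
    """Decide which extra steps a post needs before delivery."""
    return Routing(
        two_pass=target_words > TWO_PASS_WORDS,
        needs_rewrite=fk_grade > MAX_FK_GRADE,
        needs_human_review=claim_count > MAX_UNREVIEWED_CLAIMS,
    )


long_dense = route_post(target_words=1500, fk_grade=11.2, claim_count=2)
short_claimy = route_post(target_words=800, fk_grade=9.4, claim_count=4)
```

Keeping the thresholds as named constants makes them easy to adjust per client without touching the routing logic.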

If I were starting over, I'd instrument this from day one, not month six.

The manual spot-checking I was doing before the pipeline wasn't just slow — it was inconsistent. I'd catch a grammar issue in one post and miss a hallucinated citation in the next. The programmatic check is less nuanced than human review, but it's consistent, which turns out to matter more at volume.

I'd also add a semantic coherence score earlier. Right now I can tell you a post has a grade-8 reading level and 1.3% keyword density. I can't tell you programmatically whether the argument in the post actually holds together. That's the next frontier.

Grammar is not the quality problem.

Every client who asks me "is the AI content good?" is thinking about grammar errors. The data says grammar errors are rare — and when they appear, they're predictable and fixable.

The real quality problems are factual accuracy and structural consistency. A post with perfect grammar that cites a study that doesn't exist is not a quality post. A post that opens with "In today's digital landscape" and never grounds the argument in a specific example is not a quality post.

You can't catch either of those by reading a paragraph. You need to measure them.

The pipeline code is on GitHub — link in my bio. If you're building content automation and haven't instrumented quality yet, this is the starting point I'd recommend.

What's your approach to quality scoring AI content at scale? Especially curious if anyone's cracked semantic coherence programmatically — that's the gap I haven't solved.

source & further reading

dev.to (original article)