This article was originally published on LucidShark Blog.
Something shifted on GitHub this year. Open any trending page, search for almost any library name, and you will find dozens of repositories that share a familiar fingerprint: a README generated in seconds, a handful of Python or TypeScript files with functions stretching hundreds of lines, zero test files, and a commit history that reads "initial commit" followed by "add features" followed by nothing.
The Hacker News thread from June 24 put words to a feeling many developers already had. Commenters described browsing GitHub and finding repository after repository that looked functional at a glance but fell apart on closer inspection. Not because the code was obviously wrong, but because it had never been measured against any quality standard before being pushed.
The core problem: AI tools make it trivially easy to generate code that compiles and appears to work. Nothing in the default vibe-coding workflow measures whether that code is maintainable, tested, or structurally sound before it hits a public repository.
Walk through a typical vibe-coded repository and you find the same patterns repeating. Functions that do twelve things at once, with cyclomatic complexity scores above 50 where anything above 10 is considered a maintenance liability. Utility logic copy-pasted verbatim across four files because the AI regenerated the same helper each time it needed it. A tests/ directory that either does not exist or contains three smoke tests that assert the application starts without crashing.
None of this is invisible. These metrics are measurable. A cyclomatic complexity of 52 on a routing function is not a matter of taste; it is a number that predicts how many bugs will emerge when someone tries to modify it six months later. A duplication ratio of 40 percent across a codebase is not a style preference; it is a guarantee that fixing a bug in one place will leave the same bug alive in three others.
The problem is not that AI writes uniquely bad code. It is that AI writes code at a volume and speed that overwhelms any informal quality signal humans previously relied on. A developer working alone might naturally notice they had copied the same function three times. An AI agent spinning through ten files in thirty seconds does not have that check, and the developer watching the output rarely stops to run a complexity analyzer.
Stars, forks, and issue counts are the metrics most developers use to evaluate an unfamiliar repository. The assumption is that a project with 800 stars has been vetted by 800 people who thought it was worth bookmarking. That assumption made reasonable sense in an era when creating a repository required meaningful human effort.
It no longer holds, for two reasons.
First, the repositories being created now are being created faster than any community can evaluate them. A project pushed on Monday can accumulate stars by Wednesday from people who read the README and saw working demo output without ever running the test suite or reading a single function body.
Second, stars measure interestingness, not correctness. A repository that generates impressive-looking output from a three-line prompt will get stars. Whether the underlying code has a cyclomatic complexity of 8 or 80 is invisible to anyone who does not run an analyzer. The same applies to test coverage: a project at 0 percent coverage looks identical to one at 80 percent from the outside.
Lagging vs. leading indicators: Stars and forks are lagging signals. They reflect past interest, not current code health. Cyclomatic complexity, coverage floors, and duplication bounds are leading signals. They predict future maintenance cost and defect rate before a single bug is filed.
Issues are slightly better because they surface after someone has actually tried to use the code. But filing an issue requires effort, and most people who encounter a confusing function simply close the tab rather than report it. The gap between actual code quality and visible issue count can be enormous.
Three metrics do most of the work when you want a fast, objective read on whether AI-generated code is shippable.
Cyclomatic complexity counts the number of linearly independent paths through a function. A function with a complexity of 1 has no branches. Every if, else if, for, while, case, and catch adds 1; a plain else adds nothing, because it introduces no new decision. The widely cited threshold for "easy to understand and test" is 10 or below. Between 11 and 20 is moderate risk. Above 20 is high risk. Above 50 is a function that will not be safely modified by anyone, human or AI, without a significant chance of regression.
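To make the counting rule concrete, here is a minimal sketch that scores a Python function using the standard library ast module. It is an illustration, not how LucidShark or any particular analyzer works: the set of node types, the handling of boolean operators, and the risk_band labels are assumptions chosen for this example. A plain else is not counted because it adds no new decision.

```python
import ast

# Node types treated as decision points in this sketch. Real analyzers differ
# in the details (for example, how they treat boolean operators or match/case).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Return 1 plus the number of decision points found inside func."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds one extra path per operator.
            score += len(node.values) - 1
    return score


def risk_band(score: int) -> str:
    """Map a score to the bands described in the text (labels are illustrative)."""
    if score <= 10:
        return "low"
    if score <= 20:
        return "moderate"
    if score <= 50:
        return "high"
    return "very high"


SAMPLE = '''
def route(request):
    if request is None:
        return "error"
    for header in request:
        if header == "x-debug" and request[header]:
            return "debug"
    return "ok"
'''

func = ast.parse(SAMPLE).body[0]
score = cyclomatic_complexity(func)
print(func.name, score, risk_band(score))  # route 5 low
```

The sample function scores 5: the base path, two if statements, one for loop, and one and operator. Nested functions are counted into their parent here, which a production tool would likely handle separately.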
AI coding agents frequently produce functions above 20 when asked to implement anything with real branching logic, because they optimize for correctness at a single point in time rather than for long-term maintainability. A function that handles 15 edge cases in one block is correct today and unmaintainable tomorrow.
Test coverage floors enforce a minimum percentage of code exercised by automated tests. A floor of 80 percent on line coverage does not guarantee the tests are good, but it does guarantee that 20 percent or less of the code has never been executed in a controlled environment. For AI-generated code specifically, coverage gaps tend to concentrate on error handling paths: the catch blocks, the null checks, the branch that only fires when a third-party service returns an unexpected status code. These are precisely the paths that cause production incidents.
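The arithmetic behind a coverage floor is simple enough to sketch. The example below assumes you already have two sets of line numbers from some coverage tool: the executable lines and the lines the tests actually ran. The function names and the sample module are hypothetical.

```python
def line_coverage(executable: set[int], executed: set[int]) -> float:
    """Percentage of executable lines that at least one test ran."""
    if not executable:
        return 100.0
    return 100.0 * len(executable & executed) / len(executable)


def check_floor(executable: set[int], executed: set[int], floor: float = 80.0):
    """Return (passes, percentage, sorted list of lines no test reached)."""
    pct = line_coverage(executable, executed)
    missed = sorted(executable - executed)
    return pct >= floor, pct, missed


# Hypothetical module: lines 1-10 are executable, and the tests never reach
# the error-handling lines 8-10.
executable_lines = set(range(1, 11))
executed_lines = {1, 2, 3, 4, 5, 6, 7}

ok, pct, missed = check_floor(executable_lines, executed_lines)
print(ok, pct, missed)  # False 70.0 [8, 9, 10]
```

Reporting the missed lines alongside the percentage matters in practice: it points straight at the untested error paths rather than leaving the developer to hunt for them.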
Duplication bounds cap the percentage of code that is copied from elsewhere in the same repository. A duplication ratio above 15 to 20 percent is a reliable signal that the codebase has not been refactored and that the same logic will need to be maintained in multiple places. For AI-generated code, duplication tends to be high because the agent regenerates common patterns rather than importing a shared utility, especially across long sessions where earlier decisions are no longer in the context window.
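One simple way to estimate a duplication ratio is to slide a fixed-size window over each file and flag any window that occurs more than once. The sketch below works on whole lines for brevity, whereas real detectors usually compare token sequences; the window size and the sample files are assumptions made for illustration.

```python
from collections import defaultdict


def duplication_ratio(files: dict[str, str], window: int = 3) -> float:
    """Fraction of non-blank lines that sit inside a repeated window of lines."""
    # Normalize: strip surrounding whitespace and drop blank lines, per file.
    normalized = {
        name: [ln.strip() for ln in text.splitlines() if ln.strip()]
        for name, text in files.items()
    }
    # Record every location where each window of consecutive lines occurs.
    seen = defaultdict(list)
    for name, lines in normalized.items():
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].append((name, i))
    # Mark every line covered by a window that occurs more than once.
    duplicated = set()
    for locations in seen.values():
        if len(locations) > 1:
            for name, start in locations:
                duplicated.update((name, start + k) for k in range(window))
    total = sum(len(lines) for lines in normalized.values())
    return len(duplicated) / total if total else 0.0


# The same helper regenerated in two files, as described in the text.
HELPER = "def slugify(s):\n    s = s.strip().lower()\n    return s.replace(' ', '-')\n"
repo = {
    "a.py": HELPER + "print(slugify('A b'))\n",
    "b.py": HELPER + "print(slugify('C d'))\n",
}

ratio = duplication_ratio(repo)
print(f"{ratio:.2f}")  # 0.75
```

Six of the eight lines belong to the copied helper, so the ratio is 0.75, far above a 15 to 20 percent bound. Extracting slugify into a shared module would bring it to zero.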
The right place to enforce these metrics is before the code ever leaves your machine, not in a CI pipeline that runs after the pull request is open. A pre-commit gate that runs in under ten seconds costs almost nothing and catches the class of problems that reviewers miss under cognitive load.
LucidShark integrates with Claude Code via MCP and runs complexity analysis, coverage checking, and duplication detection as a local hook. The configuration is a single JSON file in your project root.
{
  "quality_gates": {
    "complexity": {
      "max_cyclomatic": 15,
      "max_cognitive": 20,
      "fail_on_violation": true
    },
    "coverage": {
      "minimum_line_coverage": 80,
      "minimum_branch_coverage": 70,
      "fail_below_threshold": true
    },
    "duplication": {
      "max_duplication_ratio": 0.15,
      "min_token_length": 50,
      "fail_on_violation": true
    }
  },
  "hooks": {
    "pre_commit": true,
    "on_file_write": true
  }
}
With on_file_write enabled, LucidShark runs the check the moment Claude Code writes a file, before the developer even sees the output. If a function comes back with complexity 47, the gate fires immediately with the exact function name and line number rather than surfacing the issue three days later in a PR review.
The fail_on_violation: true flag (fail_below_threshold: true in the coverage section) is the critical setting. Without it, the gate reports problems but does not block the commit. That might seem like a gentler approach, but in practice a non-blocking gate accumulates warnings that developers learn to ignore within a week. The entire value of a quality gate comes from it being a hard stop, not an advisory.
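The difference between a blocking and an advisory gate can be sketched in a few lines. The code below is a hypothetical evaluator, not LucidShark's implementation: it reuses the key names from the configuration above, and the measured values, the evaluate function, and the report format are all assumptions made for illustration.

```python
import copy
import json

# Trimmed copy of the configuration shown above.
CONFIG = json.loads("""
{
  "quality_gates": {
    "complexity": {"max_cyclomatic": 15, "fail_on_violation": true},
    "coverage": {"minimum_line_coverage": 80, "fail_below_threshold": true},
    "duplication": {"max_duplication_ratio": 0.15, "fail_on_violation": true}
  }
}
""")


def evaluate(measured: dict, config: dict) -> tuple[list[str], bool]:
    """Return (violation messages, whether the commit should be blocked)."""
    gates = config["quality_gates"]
    found = []  # (message, is_blocking) pairs

    cx = gates["complexity"]
    for name, score in measured["cyclomatic"].items():
        if score > cx["max_cyclomatic"]:
            msg = f"{name}: complexity {score} exceeds {cx['max_cyclomatic']}"
            found.append((msg, cx["fail_on_violation"]))

    cov = gates["coverage"]
    if measured["line_coverage"] < cov["minimum_line_coverage"]:
        msg = f"line coverage {measured['line_coverage']} is below {cov['minimum_line_coverage']}"
        found.append((msg, cov["fail_below_threshold"]))

    dup = gates["duplication"]
    if measured["duplication_ratio"] > dup["max_duplication_ratio"]:
        msg = f"duplication {measured['duplication_ratio']} exceeds {dup['max_duplication_ratio']}"
        found.append((msg, dup["fail_on_violation"]))

    messages = [m for m, _ in found]
    blocked = any(blocking for _, blocking in found)
    return messages, blocked


# Hypothetical measurements for one commit.
measured = {
    "cyclomatic": {"handle_route": 47, "parse_args": 6},
    "line_coverage": 83.0,
    "duplication_ratio": 0.22,
}

violations, blocked = evaluate(measured, CONFIG)
print(violations, blocked)

# The same measurements with every gate switched to advisory mode.
advisory = copy.deepcopy(CONFIG)
advisory["quality_gates"]["complexity"]["fail_on_violation"] = False
advisory["quality_gates"]["duplication"]["fail_on_violation"] = False
advisory_violations, advisory_blocked = evaluate(measured, advisory)
print(advisory_violations, advisory_blocked)
```

Both runs report the same two violations; only the blocking configuration stops the commit. In a real hook the blocked flag would typically become a non-zero exit status.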
Threshold calibration: If your codebase already has files above these thresholds, start with a higher limit and tighten over time. A gate set at complexity 30 that stays switched on is more useful than one set at 15 that the team disables on day two. The goal is a ratchet, not a cliff.
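The ratchet idea can be expressed as a one-line rule: the limit may move down toward the target as the worst function in the codebase improves, but it never moves back up. The function below is a hypothetical sketch of that rule, with a target of 15 taken from the configuration above.

```python
def ratchet(current_limit: int, worst_observed: int, target: int = 15) -> int:
    """Tighten the limit toward target without ever loosening it.

    worst_observed is the highest complexity currently in the codebase.
    """
    return max(target, min(current_limit, worst_observed))


limit = 30
limit = ratchet(limit, worst_observed=24)  # a refactor brought the worst case down
print(limit)  # 24
limit = ratchet(limit, worst_observed=12)  # never tightens past the target
print(limit)  # 15
```

Running this after each release, and writing the result back into the project config, keeps the gate passable today while still converging on the stricter threshold.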
For teams using Claude Code with the MCP server, LucidShark exposes a lucidshark_analyze tool that Claude can call directly during a session. This means the agent itself can check its own output before writing the file, catching problems in the generation loop rather than at commit time.
The GitHub AI code dump phenomenon is not going to be solved by better AI models alone. A more capable model will still produce high-complexity code when asked to implement something complex without constraints. It will still omit tests when the user does not ask for them. It will still duplicate logic across a session when the earlier abstractions are no longer visible in the context window.
The constraint has to come from outside the model. Quality gates are that constraint. They convert a soft expectation ("write good code") into a hard requirement ("code above complexity 15 does not commit"). That is a different category of enforcement, and it is the only one that does not rely on the developer remembering to check something manually every time.
The repositories filling GitHub right now are not filled with bad code because the people who created them are bad developers. They are filled with unreviewed code because nothing in the workflow reviewed it. That is a tooling gap, and it has a straightforward fix.
Start gating your AI output today. LucidShark is open source, runs entirely on your machine, and integrates with Claude Code via MCP in under five minutes. Add complexity thresholds, coverage floors, and duplication bounds to your project config and every file your AI agent writes will be measured before it commits. Install LucidShark on GitHub or read the full MCP setup guide to get your first gate running today.