Third-parties should focus on scrutinising system cards

wpnews.pro

By default, I expect system cards will get worse, which would be bad. Some mechanisms could improve system cards, but I expect they will be outweighed. In any case, I think third-parties should focus on scrutinising system cards — this seems like a great activity for outsiders in the current strategic landscape. I'll sketch what that could look like, and offer some recommendations.

It would be bad if system cards degraded.

It's good for the outside community to have an accurate sense of the risks, so they can respond appropriately. For example: investing more resources into cyber-hardening, or other activities for making things go well.
If labs felt pressure to evaluate the risks accurately, they'd be better incentivised to reduce them.
If the risks were high enough, and a lab communicated that, then this might prompt drastic government action.
It's very plausible that, if labs build misaligned AIs that take over, then most of the employees had a genuine but incorrect belief that the AIs wouldn't take over, based on evidence that was actually flimsy and misleading. So it's important that third-parties provide epistemic checks on the labs, and scrutinising system cards seems like a great mechanism for that.

By default, I expect system cards to get worse, because…

1. The system will get genuinely more complicated.
   a. There will be more kinds of model, trained with more kinds of technique, interacting in more kinds of way, plus dozens of ad-hoc patches for the problems that arise.
   b. Already no one person holds the whole system in their head, and the fraction that fits in one head will keep shrinking.
   c. Any overall safety judgment will come from loosely aggregating many lines of evidence, rather than from a clean structured safety case.
2. I expect the cards themselves to get sloppier.
   a. They'll be rushed, written in less wall-clock time, because the pace of AI progress will accelerate.
   b. They'll be increasingly AI-generated: AIs will automate more of the experiment design, coding, analysis, and writing. That raises worries about slop, sabotage, and the like. See "Current AIs seem pretty misaligned to me" (Ryan Greenblatt, 15th April 2026).
      - When AIs are obviously bad at writing system cards, authors won't rely on them. When AIs are good at writing system cards, it's fine to rely on them. But there might be a U-shaped effect: system card quality might degrade when AIs are just barely passable.
   c. Capability and propensity evals will run on earlier checkpoints rather than the deployed model, partly as an adaptation to (a).
3. Scrutiny will get harder.
   a. Third-parties will have less time to scrutinise the cards, because the gap between a card's publication and a potential catastrophe will shrink.
   b. Authors will increasingly have to refer to private information about the system, because (i) labs will hold more algorithmic secrets (due to R&D acceleration) and (ii) labs will guard those secrets harder, due to competitiveness and security concerns.
4. There will be stronger incentives to mislead. This is my main worry.
   a. The objective risk will rise, so labs will need to be increasingly misleading if they want to claim that risk is low.
   b. Governments (and other actors) will grow more worried, and will be ready to act drastically if they judge the risk is too high. These actions would probably involve slowing down or otherwise hindering the lab.
   c. As the race nears its end stages, the stakes of slowing down or losing will become more apparent to the labs.
   d. Given (a–c), labs have a stronger incentive to keep people in the dark.
      i. They'll keep their own government in the dark, to avoid regulation or drastic domestic action.
      ii. They'll keep foreign governments in the dark, to avoid military conflict.
      iii. They'll keep the public in the dark, to avoid mass civil unrest.
      iv. They'll keep their own employees in the dark, to avoid internal pushback.

Some mechanisms could improve system cards.

AI automation could allow more thorough experimentation and analysis, possibly of higher quality. For example, AIs can run more replications and ablations of the experiments. They can analyse more of the transcripts, etc.
Labs will try harder — possibly much harder — to produce good cards, because the cards matter more.
External actors — governments, third-parties, the public, rival labs — will have more incentive to scrutinise them.

Third-parties should focus on scrutinising system cards.

This would pressure labs to maintain their quality, i.e. spend more resources on them, grow the team contributing to the system cards, and push less for optimistic conclusions.
If third-parties could show that system cards were degrading, then other actors would rely on them less. It's good for people to be calibrated about card quality, even if this doesn’t raise the quality itself.
If third-parties showed that the authors had significantly underestimated the risks, then that might trigger drastic government action.

I'll sketch what this might look like.

I’ll list some recommendations for how third-parties could ensure system-card quality. But these recommendations are pretty contingent: as the strategic landscape shifts, I'd expect the recommendations to change.

Top-tier

Maintain a list of improvements. Someone should keep a Google doc with a prioritised list of concrete interventions for improving system cards, share it with anyone at the labs, and update it as complaints come in.

Write critical reviews. When a lab publishes a system card, work through it and:

- Check whether any argument is locally invalid, i.e. the conclusion doesn't follow from the premises;
- Check whether any premises are unrealistic;
- Check whether any premises or conclusions clash with our best science, and (if possible) commission that science where it doesn't yet exist;
- Check for alternative explanations of the observations, e.g. training-gaming, eval awareness, collusion, unfaithful CoT, and so on;
- Hunt for egregious bugs and errors.

Pass sections of the card to the relevant third-parties. For example, if the card uses a particular eval, ask that eval's authors what they make of the results, and fold their views into your review.

Talk with lab employees. Explain why card quality matters, what concerns you about the current cards, and what lab employees can do to help.

Publicise your takes. Twitter threads seem pretty cheap, once you've formed an opinion. You could also reach out to legacy media and offer to give interviews.

Second-tier

Talk to governments. Explain why system cards matter, what concerns you about the current cards, and what governments can do to help.

Pressure labs to share key artefacts. Push them to give third-party risk assessors the most important artefacts (transcripts, code) where those assessors say it matters. When a lab leans on private information (“our private evals suggest X”), press them to disclose it to a trusted third-party, or treat the claim with scepticism.

Praise labs with good system cards. And praise the good bits in a bad system card. Invite authors of good system cards on your podcast to talk about them.

Build consensus that the cards are getting sloppy (if they are). I think "system cards are shoddy but labs deny this" looks much worse than "system cards are shoddy but labs admit this, say it's because they are rushed and lack sufficient understanding of the system, and want a coordinated slow-down so they can tread more carefully".

Pressure labs to uplift third-parties. If system cards grow increasingly complicated, and third-parties have less time to properly scrutinise them, then it's important that third-parties are accelerated with AI assistance. Therefore, labs should provide third-parties with the best internal models and scaffolds. Labs should also treat system card scrutiny as a target capability.

Track whether card quality still matters. I can imagine that system cards stop being important for making AI go well, and you don’t want to keep the strategy going due to inertia.

Create an org for all this. The org could do many of these activities in-house, and coordinate the community’s efforts. It could also be a more credible entity for governments to approach when they need help reading a system card.

Shoddy system cards are better than no system cards.

Labs shouldn't face more hostility for publishing a shoddy system card than for publishing no system card whatsoever. If they did, this would (i) differentially hurt the more transparent labs, and (ii) incentivise every lab to disclose less. Both effects would reduce transparency, which would make things far more dangerous. This could be a terrible outcome of pushing third-parties to scrutinise system cards.

To inoculate against this, I think scrutiny of system cards should be paired with hostility towards less transparent labs. To illustrate:

"In [company]'s system card, they reveal they accidentally did X during training. They claim this is okay because [bla]. We disagree because [bla]. Note that [rival company] hasn't revealed whether or not they did X, so their deployment is possibly much more dangerous. We call on [rival company] to reveal whether they did X."


Source: lesswrong.com (original article)