Can public chat data predict real-world AI misalignments?



Source: lesswrong.com · Published: 2026-06-17T03:53Z · Topic: ai-safety


OpenAI researchers tested whether public chat data from WildChat can predict real-world AI misalignments, finding that deployment simulations using public conversations can estimate rates of undesirable model behavior, offering external evaluators a way to assess frontier models without access to private production data.


This is an unofficial automated linkpost.

Frontier AI models are increasingly used in settings with real economic, legal, and societal consequences. As a result, governments, AI safety organizations and independent researchers need ways to evaluate how these systems behave under realistic conditions.

Traditional evaluations use hand-written, synthetic, or adversarial prompts to stress-test known risks and compare models under controlled conditions. But these prompts can be narrow, unrepresentative, or recognizable as tests. A complementary way to evaluate how models behave in the real world is to look at the real conversations users have with them. LLM developers can do this internally, by sampling examples from production data to check whether models responded appropriately and how often different failures occur. Evidence grounded in real usage helps close the gap between benchmark results and deployment behavior [1], and is less vulnerable to models behaving differently simply because they are being tested [2,3,4].

But outside evaluators generally cannot access this evidence. Because real user conversations are private, labs usually cannot share them with AI safety organizations, academics, or independent researchers. As a result, the most informative evidence about frontier model behavior relies on data that is often available only to the labs that built the models.

Today we shared work on Deployment Simulation, which uses recent production data to predict the rates of undesirable model behavior before deployment, including for rare and model-specific pathologies [1,5]. In this blog post, we ask whether external groups can use this technique to evaluate frontier language models by swapping the source dataset for a publicly available substitute, WildChat [6].
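To make the idea of estimating "how often different failures occur" concrete, here is a minimal Python sketch of the statistical step only. It is not the authors' method: it assumes that each sampled conversation has already been replayed through a model and labeled by some grader as showing the undesirable behavior or not. The function names (`wilson_interval`, `estimate_rate`) and the boolean `flags` input are hypothetical, and the Wilson score interval is simply one standard choice for proportions that may be small.

```python
import math


def wilson_interval(failures: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = failures / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def estimate_rate(flags: list[bool]) -> dict:
    """Estimate the rate of a flagged behavior from per-conversation labels.

    `flags` is a hypothetical input: one boolean per sampled conversation,
    True when a grader judged the response to show the undesirable behavior.
    """
    total = len(flags)
    failures = sum(flags)
    low, high = wilson_interval(failures, total)  # raises on an empty sample
    return {"rate": failures / total, "low": low, "high": high, "n": total}
```

With rare behaviors, the interval matters as much as the point estimate: a sample with zero flagged conversations still leaves a nonzero upper bound, which is one reason larger samples are needed to say anything about rare pathologies.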

Continue reading at alignment.openai.com →

source & further reading



Read original on lesswrong.com → www.lesswrong.com/posts/TexabXFDJ8vzTBt2P/can-pu…