{"slug": "outrunning-your-headlights", "title": "Outrunning your headlights", "summary": "A peculiar side effect of model intelligence in discovery-based research is that it's possible to run every statistical analysis and burn millions of tokens without gaining intuition on how to solve a problem. Pre-AI, the effort of constructing analysis pipelines forced researchers to consider whether their methods made sense, but now agentic coders remove that friction, leading to analytical cosplay that robs scientists of the ability to form opinions and ask nuanced follow-up questions.", "body_md": "This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch and run run_entropic_gwot and invoke semi-relaxed GWOT between your models’ RDMs?\n\n*Wtf does that even mean? Eh, could be interesting to see the result.* **Enter**\n\nA peculiar side effect of model intelligence in discovery-based research is that it’s possible to run every statistical/quantitative analysis under the sun, burn millions of tokens, and gain little intuition on how to make headway on your problem.[1]\n\nPre-AI, the effort associated with constructing a pipeline to run any meaningful quantitative analysis incurred a real cost. Indeed, it forced one to consider whether the analysis made any sense at all. People who had the necessary skills to execute were often thoughtful about their analysis by necessity. You couldn’t just *apply Mendelian Randomisation* if you didn't know of its existence; effort was required both to articulate your question well enough to search for the right tools and to de-risk by considering whether a tool fit your question.[2] This forced one to earn some intuition for how it worked and what to expect. 
The pain that such friction inflicted is now alleviated by the CLI agentic coder of your choice. Some would argue that this frees humans to spar with an intellectual partner at a new layer of abstraction rather than concern themselves with the specifics of code. I would argue that a productive spar requires both parties to hold opinions they are willing to defend, and on a day-to-day basis this is not what I see.\n\nInstead, the trajectory of decisions feels almost pre-ordained, primarily driven by three effects. The first is a veneer of problem-specific competence as models string shibboleths together. Shibboleths often serve as a heuristic for expert opinion, e.g. ‘septic patient with febrile tachycardia’ versus ‘high temperature and heart rate’. Deferring to expert opinion (or what masquerades as one) feels only natural. The second is an illusion of control as each *recommended* option is contrasted against straw-men alternatives. The third is sycophancy (of a kind less obvious than verbal submissiveness): an insidious bias towards writing tests that are likely to produce favourable empirical results or, worse, interpreting them in such a way as to support the framing you seem to desire.[3]\n\n*Great news, 90/126 nominal p-values are significant. Would you like me to proceed with the write-up?*[4]\n\nHolding on to one’s critical eye is like cupping sand when the path of least resistance is seductive and one **Enter** away.\n\nPerhaps most concerning is that after several rounds of this analytical cosplay, one is almost entirely robbed of the ability to form *an* opinion. As though waking up from a dissociative fugue, [5] the question\n\nFurthermore, our odometer for progress has not recalibrated. 
That is to say, *running analyses* still provides the rewarding sensation of making headway with none of the earned intuition necessary to ask nuanced follow-up questions or indeed call out approaches that are suspect. This is especially dangerous given that models inject assumptions in subtle ways when motivated to produce a positive result. One concrete example I have noticed in my own bioinformatics work is that language models have a tendency to recall baked-in knowledge to facilitate discovery analysis, e.g. classifying cells in single-cell RNA sequencing by manually printing a list of cell-specific gene markers to match against rather than using a canonical approach with a reference atlas. [Anthropic's BioMysteryBench](https://www.anthropic.com/research/Evaluating-Claude-For-Bioinformatics-With-BioMysteryBench) proudly touts this as model capability (which, I guess, would in one sense be savant-like were it to occur in a human), but I would argue it is also *unwanted behaviour* in the context of autonomous research. Simultaneously, it would be foolish to starve yourself of this intelligence substrate for the puritan ideal of *understanding everything*.\n\nIn sum, better solutions must be built to protect ‘belief quality’ but, in the meantime, I simply suggest we form a strong opinion with an anticipation of being proven wrong before letting the slop cannon rip. Introduce a little friction, slow down a little, and don’t **Outrun Your Headlights.**\n\n*Or build high beams.*\n\nI want to carefully define what I mean here by intuition in the context of discovery work. I refer to the working map of territory you carry in your head about a problem that is refined by stress-testing it against counterfactuals and alternate hypotheses. It allows you to curate targeted questions that meaningfully update your priors.\n\nYes, one can argue that there was no shortage of statistical abuse, or of practitioners not respecting the assumptions of their instruments. 
However, I see this as a distinct problem from language models suggesting approaches which might be completely foreign to the user; approaches which said user necessarily has no intuition to judge any result by.\n\nThe average LessWrong participant is less likely to be a victim of these effects but I describe what I believe to be pervasive in the broader research community.\n\nYes, I am aware that in some analyses such as differential expression, where the number of concurrent tests is high, nominal p-values can still be a useful signal of directionality. The problem is that models err towards optimism and misuse of such 'loopholes'.\n\nI use this pejoratively; there is a formal psychiatric condition specified in the DSM-5 describing sudden awakening followed by distress and retrograde amnesia for one's own identity.\n\nI will set aside the catastrophic alternative where one is happy to take a result at face value and isn’t conscious enough to even recognise that not understanding the analyses which led to the result is not okay.", "url": "https://wpnews.pro/news/outrunning-your-headlights", "canonical_source": "https://www.lesswrong.com/posts/L8YFcCw5ex3qjLyoJ/outrunning-your-headlights", "published_at": "2026-05-31 13:46:53+00:00", "updated_at": "2026-05-31 13:58:17.135558+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-research", "ai-tools", "machine-learning", "large-language-models"], "entities": ["Gromov-Wasserstein", "Mendelian Randomisation"], "alternates": {"html": "https://wpnews.pro/news/outrunning-your-headlights", "markdown": "https://wpnews.pro/news/outrunning-your-headlights.md", "text": "https://wpnews.pro/news/outrunning-your-headlights.txt", "jsonld": "https://wpnews.pro/news/outrunning-your-headlights.jsonld"}}