So we put an LLM next to a perfectly ordinary ML pipeline, and of course things got weird
There is a very specific kind of disappointment that only machine learning can produce. It is the moment when you have finally cleaned the dataset enough that it no longer looks like it was assembled by raccoons fighting in a spreadsheet, you have trained a model that is not embarrassing, the metrics are decent enough to stop people from asking whether this was all a waste of money, and then you realize that the thing is still not particularly good. Not bad. Just aggressively, stubbornly average. The kind of average that makes you stare at feature importances as if they might, out of pity, confess what they are missing.
We had one of those pipelines not too long ago. Nothing exotic, no shiny research project, just a normal piece of applied machine learning. Some structured fields, some text, a few values that were technically optional but in practice missing exactly when they would have been useful, and a model that did more or less what we asked of it, provided we did not ask for too much. In other words, a very normal ML system, which means it was held together by statistics, glue code, and the collective hope that nobody would change the input format again.
Then, as these things tend to go these days, somebody suggested putting an LLM in front of it.
Not instead of it, mind you. Just in front of it. As a helper. As a civilized little preprocessing assistant that would take the messy bits, especially the text, and turn them into something our existing model could digest without immediately developing opinions about random whitespace and half-finished sentences. And on paper, that sounded almost suspiciously reasonable, which is always the moment when I become nervous, because in software anything that sounds too reasonable is usually hiding three months of nonsense behind a friendly diagram.
The first experiments were, annoyingly, quite good. We had text fields that people had been trying to tame with rules, regexes, and a level of optimism that should probably be regulated, and the LLM just looked at them and extracted structure in a way that felt almost rude. Intent categories suddenly made sense, descriptions that had previously required a small archaeological expedition could be normalized into something consistent enough to use, and all the ugly language that humans produce when they are in a hurry, annoyed, or both, stopped looking like noise and started looking like usable signal. That is the part people like to show in slide decks, because it is real and it is impressive: you can take a system that was struggling with messy inputs and, with relatively little effort, give it better features than the team would have produced by hand in the same time. Traditional ML absolutely loves that, because traditional ML does not care where your features came from as long as they are useful and do not fall apart the moment somebody writes “cant log in” instead of “cannot login”.
But of course that is not where the story ends, because if it ended there we would all be living in LinkedIn posts by now, and I refuse to believe reality is that cruel.
The first thing the LLM changed was not the model quality. It changed our speed. That turned out to matter much more. Before, every new idea about feature extraction meant writing code, fitting it into a pipeline, waiting for a run, finding out that the idea was mediocre, and then pretending that this had at least been a valuable learning experience. With an LLM in the loop, a lot of that turned into experimentation at conversational speed. We could ask for candidate labels, summarized fields, normalized categories, rough confidence buckets, little bits of structure that would previously have been too annoying to build “just to try it”. That shifts the economics of iteration completely. You stop protecting every hypothesis as if it were a family heirloom and start trying ten things because trying them is suddenly cheap. It is a very real advantage, and it is probably the biggest one. Not because the LLM is magical, but because many ML teams spend an absurd amount of time wrestling raw material into a shape that a model can tolerate. If you reduce that friction, the whole system gets faster. It is worth noticing where that friction actually lived, though. None of those gains came from putting the LLM inside the model. They came from putting it upstream of the model, in the slow, messy, design-time part of the work, the part that happens long before any production system has woken up and started caring.
Unfortunately, faster is not the same thing as better, and this is where the AI marketing brochures usually develop a mysterious cough and change the subject.
Because the other thing the LLM gave us was a whole new class of failure modes. Before, if a preprocessing step was wrong, it was often gloriously wrong. A parser crashed. A rule misfired. A field was empty when it should not have been, and the system complained in a way even tired humans could understand. The LLM did not do that. It was far more polite. It would produce something plausible, often elegant, occasionally even insightful, and sometimes completely misleading in a way that was difficult to notice because nothing exploded. It just quietly shifted meaning. A slightly wrong classification here, a neat-sounding summary there, a category that looked stable until you saw enough edge cases, and suddenly your downstream model was learning from a world that was coherent enough to pass casual inspection and wrong enough to become expensive later. That is one of the more dangerous properties of these systems: they fail in complete sentences. Which is fine, sometimes even charming, when you are sitting in front of a notebook nodding along. It is a noticeably less appealing property once the same thing is wedged onto the hot path between your customers and your database, quietly redefining what a category means while everybody else is in a meeting.
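One way to get some of the old, loud failures back is to refuse free-form output at the boundary. What follows is a minimal sketch, not what we actually ran. It assumes the LLM response has already been parsed into a dict, and the taxonomy in `ALLOWED_INTENTS`, the `Extraction` record, and `validate_extraction` are all names invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical closed taxonomy; a real one would come from your own domain.
ALLOWED_INTENTS = frozenset({"login_issue", "billing", "cancellation", "other"})


class ExtractionError(ValueError):
    """Raised when LLM output does not fit the agreed schema."""


@dataclass(frozen=True)
class Extraction:
    intent: str
    confidence: float


def validate_extraction(raw: dict) -> Extraction:
    """Check a parsed LLM response and fail loudly instead of passing it on."""
    intent = raw.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ExtractionError(f"unknown intent: {intent!r}")

    confidence = raw.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ExtractionError(f"confidence is not a number: {confidence!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ExtractionError(f"confidence out of range: {confidence!r}")

    return Extraction(intent=intent, confidence=float(confidence))
```

This only catches answers that fall outside the schema. A label that is valid but wrong sails straight through, so the quiet failures described above still need spot checks against a sample that a human has labeled.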
At some point, because engineers are incapable of leaving a half-dangerous idea alone, we also tried using the LLM to generate additional training data. There are few phrases in modern tech that should trigger more suspicion than “synthetic data at scale”, because it is usually followed by either miracles or tragedy, and most teams are not lucky enough to get the miracles. The use case itself was sensible enough. We had rare cases that did matter, and like everybody else on earth we did not have enough examples of them. So we asked the model to generate variants, enrich classes, create examples that looked like the kind of thing we wished users had already produced for us. And again, in a narrow sense, it worked. The dataset got richer. The model got more exposure to patterns it had previously barely seen. Some results improved. The dashboards looked happier. Everybody became briefly insufferable.
Then came the awkward part, namely noticing that generated data has the same problem generated confidence always has: it can be very convincing while being ever so slightly detached from reality. The LLM was not lying exactly. Lying implies intention, and I am not prepared to grant a statistical parrot that much agency. But it was inventing examples from pattern knowledge rather than lived data, which means it was very good at producing the kind of thing that should exist and somewhat less reliable at producing the kind of thing that actually does. If the real world is messy in one direction and your synthetic data is neat in another, the model will happily learn the neat version and then meet reality later with the expression of a Victorian child being shown a nightclub. Again, useful tool, very real value, but only if you are disciplined enough to treat generated data as suspect until proven otherwise. And, just as importantly, only if you treat the generation step as something that happens once, deliberately, in a corner of the build pipeline where somebody is paying attention, rather than something that quietly regenerates itself in production because that seemed clever at the time.
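Treating generated data as suspect can be made mechanical. The sketch below assumes every example carries a provenance tag, and that the only rule worth enforcing is that generated rows may help with training but never get a vote in evaluation. The `Example` record, its `source` field, and `split_with_real_only_eval` are hypothetical names, not anything from our actual pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str
    source: str  # hypothetical provenance tag: "real" or "synthetic"


def split_with_real_only_eval(examples, eval_fraction=0.2, seed=0):
    """Hold out real examples for evaluation; synthetic ones go to training only."""
    unknown = {e.source for e in examples} - {"real", "synthetic"}
    if unknown:
        raise ValueError(f"unknown provenance tags: {sorted(unknown)}")

    real = [e for e in examples if e.source == "real"]
    synthetic = [e for e in examples if e.source == "synthetic"]
    if not real:
        raise ValueError("need at least one real example to evaluate on")

    # A fixed seed keeps the split reproducible from one build to the next.
    rng = random.Random(seed)
    rng.shuffle(real)

    n_eval = max(1, int(len(real) * eval_fraction))
    eval_set = real[:n_eval]
    train_set = real[n_eval:] + synthetic
    return train_set, eval_set
```

If the score on the real-only evaluation set does not move when the synthetic rows are added, that is a fairly strong hint that the generated examples are neat in a direction reality is not.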
After a while the actual lesson became fairly obvious, and also much less glamorous than the hype would have you believe. LLMs do not replace traditional ML particularly well in these kinds of systems, but they are extremely good at making traditional ML more effective around the edges. Wherever you have unstructured junk, ambiguous text, manual labeling work, feature ideas that are expensive to prototype, or brittle transformations maintained by somebody who has not felt joy since 2022, an LLM can help a lot. It can translate the messy world into a shape your boring, dependable model can use. And boring, dependable models are still wonderful things. They are cheaper, easier to reason about, easier to monitor, and much less likely to wake up one day and reinterpret half your taxonomy because the prompt changed by two words. There is a lot to be said for machinery that behaves like machinery, and rather a lot more to be said for keeping the unpredictable parts of your stack as far away from the runtime as you can politely arrange.
So yes, LLMs can absolutely turbocharge the training of traditional ML systems, but not because they sprinkle intelligence dust over your pipeline and turn your random forest into a philosopher king. They do it by making the annoying parts less annoying, the exploratory parts faster, and the unstructured parts more tractable. That is already a big deal. It just is not the kind of big deal people like to print on banners. The banner version is “AI transforms machine learning”. The useful version is “you can now get better input into your existing models faster, provided you also accept a fresh basket of subtle errors, evaluation headaches, and operational weirdness”. Not as catchy, I know. Marketing departments everywhere are devastated.
And that, really, is the whole thing. If your pipeline is already decent, an LLM can act like a very capable, slightly unreliable intern who reads a lot, works quickly, and absolutely needs supervision before anything goes into production. That can be immensely valuable. If your pipeline is bad, though, the LLM will not save you. It will simply help you industrialize the badness with tremendous efficiency, which, to be fair, is also a kind of achievement.
So the interesting question is not whether LLMs will replace traditional ML. In most practical systems, that is the wrong question asked by people who enjoy buying new hammers and then declaring every object in the room suspiciously nail-shaped. The better question is where your current pipeline still depends on tedious human translation, brittle rules, and a heroic willingness to clean up chaos manually. That is where an LLM earns its keep. Not in the center, where you want boring reliability, but at the boundaries, where the world arrives in all its glorious, inconsistent stupidity and somebody has to turn it into features.
And once you have stared at that arrangement long enough it stops looking like a workaround and starts looking like a design. Most of the value we got out of the LLM came from doing its messy, probabilistic, slightly theatrical work before anything important happened. By the time real traffic was hitting the model in production, the LLM was no longer in the room. What it left behind was structure, the kind of dull, predictable schema that does exactly the same boring thing every time you call it. The LLM had not become part of the runtime. It had quietly become part of the build step. Without anybody putting it on a slide, we had started treating the language model as a compiler, which is, mildly embarrassingly, the most useful thing it has ever been asked to do. A compiler is allowed to be slow, expensive, and occasionally weird, because the artifact it produces is not. Nobody worries about their compiled binary having a creative afternoon. The cleverness happened earlier. The runtime is boring on purpose, which is the one property production engineers actually want, even when they cannot bring themselves to admit it out loud in front of the AI vendors.
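The compiler framing is easy to sketch in miniature. The build step below calls a labeler once per distinct input and freezes the answers as JSON; the runtime loads that artifact and does nothing but look things up. Everything here is an assumption made for illustration: `fake_llm_label` stands in for a real LLM call, and `compile_mapping`, `load_runtime`, and the `"unknown"` fallback are invented names.

```python
import json
from typing import Callable, Iterable


def normalize_key(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return " ".join(text.lower().split())


def compile_mapping(raw_values: Iterable[str], label_fn: Callable[[str], str]) -> str:
    """Build step: call the labeler once per distinct value, freeze the result."""
    mapping = {}
    for value in raw_values:
        key = normalize_key(value)
        if key not in mapping:
            mapping[key] = label_fn(key)
    # Sorted keys make the artifact diffable and reviewable like any other file.
    return json.dumps(mapping, sort_keys=True, indent=2)


def load_runtime(artifact: str, fallback: str = "unknown") -> Callable[[str], str]:
    """Runtime: a plain lookup with an explicit fallback and no model call."""
    mapping = json.loads(artifact)

    def lookup(text: str) -> str:
        return mapping.get(normalize_key(text), fallback)

    return lookup


def fake_llm_label(text: str) -> str:
    # Stand-in for the slow, expensive, occasionally weird part.
    return "login_issue" if "log" in text else "other"


artifact = compile_mapping(["cant log in", "Cant  log in", "refund please"], fake_llm_label)
lookup = load_runtime(artifact)
```

Inputs the build step never saw land on the fallback instead of triggering a fresh model call. Whether that is acceptable depends on how long the tail of unseen inputs is, but it is at least a failure that shows up in a counter.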
And once you look at traditional ML through that same lens, it becomes a little hard to ignore that a fitted model is exactly the same shape of object. You spent a great deal of electricity, money, and unreasonable optimism turning data into a frozen artifact, and from that moment on, inference is just a function call. Same input, same output, no tokens, no improvisation, no quiet stylistic drift between Tuesday and Thursday. The intelligence was committed to the artifact, and now the artifact just runs. In a strict sense it is an even purer version of the idea than anything an LLM produces, because there is no sampling involved at all. The model is not deciding anything in the moment. It already decided, weeks ago, on a cluster somebody is still being billed for. Two different kinds of expensive thinking, two different kinds of frozen artifact, and at runtime the same desirable property of refusing to surprise anyone.
Which suggests a rather different way of putting these systems together. Not a pipeline with an LLM stapled to one end and a model stapled to the other, held vaguely upright by a script that nobody is willing to refactor, but a small, governed set of deterministic steps, some of them produced by a language model thinking very hard about intent, others produced by a training run thinking very hard about data, sitting next to each other at runtime looking equally innocent, equally observable, and equally unwilling to invent new behavior just because it is Tuesday. Keep the cleverness at build time. Keep the runtime aggressively unsurprising. Treat the two useful halves of modern AI, the one that turns language into structure and the one that turns data into prediction, as the same kind of citizen in the same boring, governed pipeline. The system we ended up with was a small, slightly accidental version of that idea.
The much more interesting version is the one where it is not accidental at all.
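For the curious, here is a rough sketch of what the deliberate version might look like at its very smallest. It is a toy under loud assumptions: `INTENT_TABLE` stands in for an artifact an LLM produced at build time, `PRIORITY_WEIGHTS` stands in for a fitted model, and the `Step` record with its `origin` and `version` fields is a made-up governance convention, not an existing framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical frozen artifacts. In a real system both would be loaded from
# versioned files produced by a build step or a training run.
INTENT_TABLE = {"cant log in": "login_issue"}
PRIORITY_WEIGHTS = {"login_issue": 0.8, "unknown": 0.1}


def add_intent(record: dict) -> dict:
    """Language to structure: a lookup in the LLM-built table."""
    key = " ".join(record["text"].lower().split())
    return {**record, "intent": INTENT_TABLE.get(key, "unknown")}


def add_priority(record: dict) -> dict:
    """Data to prediction: a stand-in for calling a fitted model."""
    return {**record, "priority": PRIORITY_WEIGHTS[record["intent"]]}


@dataclass(frozen=True)
class Step:
    name: str
    origin: str   # where the artifact came from: "llm_build" or "training_run"
    version: str
    fn: Callable[[dict], dict]


STEPS = (
    Step("intent_lookup", "llm_build", "v1", add_intent),
    Step("priority_model", "training_run", "v1", add_priority),
)


def run_pipeline(steps, record):
    """Apply each step in order and record which artifact versions ran."""
    trace = []
    current = dict(record)
    for step in steps:
        current = step.fn(current)
        trace.append(f"{step.name}@{step.version} ({step.origin})")
    return current, trace
```

Both steps have the same signature and leave the same kind of trace, and that is the whole idea: at runtime nobody should be able to tell, or need to care, which one had the more theatrical childhood.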