Searching for a Black Cat in a 2000-Dimensional Dark Room: A Machine Learning Algorithm Tournament

Epigraph:

“The experiment is over. The results look as though I’ve slightly broken the laws of physics governing conventional tabular ML.”

“Perhaps this is a competition that was never meant to be?”

“Consider this an invitation to replicate.”

Hide a needle in a haystack? Oh, yes!

Welcome to my little testing ground.

In this article, I’ll tell you how I threw twenty-one machine learning algorithms into the ring to duke it out — from good old linear regression, k-NN, and Random Forest to the holy trinity of tabular kings (XGBoost, LightGBM, CatBoost), a handful of multi-layer perceptrons, and attention-based neural networks. And I forced them all to solve a problem that seems completely absurd at first glance (or is it only at first glance?).

Most machine learning benchmarks, like MNIST or Titanic, have been run into the ground ages ago. Convolutions win on images, gradient boosting wins on tabular data. Predictable. Boring. So, I decided to set up a special stress test — a competition of a slightly different format, pushing tabular data algorithms to their absolute breaking point.

And yes… there will also be a newcomer in this race, one most of you have probably never heard of. It’s not hyped up, and it doesn’t have an army of fanboys on Kaggle. But it does have a slick name: the Polyharmonic Cascade. It’s a deep architecture derived from the principles of the theory of random functions and indifference. In this test, it played the role of the ultimate underdog. But what it did to the heavyweights looks like straight-up cheating. More on that later.

So, what is this task, exactly?

Let me start by outlining the first idea.

Usually, when people talk about the basics of machine learning, the very first example they draw looks something like this:

But this is way too simple.

Then they usually draw something slightly more complex:

Still too simple. Here, the classes can be easily separated by a basic non-linear surface. Something like this, for instance:

Resembling a hyperbolic paraboloid.

If we frame the problem from the start as obtaining this surface, this function, then we are dealing with a regression task. The more non-linear the function, and the more local minima and maxima it has, the harder it is to learn. Makes sense.

How about a function defined by an image like this? Beautiful, isn’t it?

We take this image and turn it into a regression task. Two input features — the x and y coordinates. The output is the normalized brightness, from 0 to 1. No classification, no cats and dogs. Just give it a point on the plane and tell me how bright it is there.

If we turn this into a 3D surface, here’s what we get:

That’s already a bit more complex than a hyperbolic paraboloid, right? Let the ML methods try to learn this!

But I bet you’re going to say that it’s still just basic interpolation and way too easy. Hold your horses — this was only the introduction.

A 512x512 pixel resolution gives us a dataset of 262,144 examples, which we can split into 240,000 for the training set and 22,144 for the test set. We’ll shuffle them randomly beforehand, of course. Not a bad setup for testing.
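For the curious, here is a minimal sketch of how such a dataset could be put together. It is not the generator from the repository: the synthetic `image`, the helper name `image_to_dataset`, and the seed are illustrative assumptions, and any 512x512 grayscale array could take the place of the picture.

```python
import numpy as np

def image_to_dataset(image, n_train, seed=0):
    """Turn a 2D brightness array into (x, y) -> brightness regression data."""
    h, w = image.shape
    # Pixel coordinates scaled to [0, 1]; one row per pixel, row-major order.
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    X = np.column_stack([xs.ravel(), ys.ravel()])
    # Brightness normalized to [0, 1].
    img = image.astype(np.float64)
    t = ((img - img.min()) / (img.max() - img.min())).ravel()
    # Shuffle once, then cut into train and test parts.
    order = np.random.default_rng(seed).permutation(h * w)
    train, test = order[:n_train], order[n_train:]
    return X[train], t[train], X[test], t[test]

# Stand-in for the real picture (assumption): a smooth wavy pattern.
gx, gy = np.meshgrid(np.linspace(-3, 3, 512), np.linspace(-3, 3, 512))
image = np.sin(gx * gy) + np.cos(2 * gx)

X_train, t_train, X_test, t_test = image_to_dataset(image, n_train=240_000)
print(X_train.shape, X_test.shape)  # (240000, 2) (22144, 2)
```

The split sizes follow directly from the arithmetic: 512 x 512 = 262,144 pixels, of which 240,000 go to training and the remaining 22,144 to testing.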

But I’ll say it again: you, just like me, probably looked at this and thought that the task still seems way too simple for modern methods! It sounds… well, like ordinary interpolation, even if it is of a highly complex two-dimensional function.

What’s the catch?

But I wouldn’t be me if I didn’t turn this task into a little nightmare. I took it upon myself to artificially inflate the dimensionality of the input space. Instead of just the two features, x and y, we’re going to feed the model 500, 1000, and eventually 2000 features. Only two of them will be the real coordinates (though the algorithms won’t know which ones). The remaining 498, 998, and 1998 features will be noise.

But this isn’t just any noise. Each noisy feature is simply a randomly shuffled version of those same x and y values. On their own, they are indistinguishable from the useful signals: same distribution, same amplitude, same mean. It is impossible to tell the real coordinates apart from the fake ones just by looking at the input data.
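A rough sketch of what "shuffled copies" means in practice. The function name `add_shuffled_noise`, the choice to alternate between x and y as the source column, and placing the real columns first are assumptions made for illustration; the repository's generator may differ.

```python
import numpy as np

def add_shuffled_noise(X, n_total, rng):
    """Pad the 2 real columns with independently shuffled copies of them."""
    n, d = X.shape
    noise = np.empty((n, n_total - d))
    for j in range(n_total - d):
        # Alternate between x and y as the source column, then shuffle its rows.
        noise[:, j] = rng.permutation(X[:, j % d])
    return np.hstack([X, noise])

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))   # stand-in for the (x, y) coordinates
X_wide = add_shuffled_noise(X, n_total=10, rng=rng)

# Each decoy holds exactly the same values as its source column, so the
# marginal distributions match; only the row order carries information.
print(np.allclose(np.sort(X_wide[:, 2]), np.sort(X[:, 0])))  # True
```

Because a permutation only reorders values, every per-column statistic (mean, variance, histogram) of a decoy is identical to that of the real coordinate it was copied from.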

This is the classic feature selection problem, but taken to the absolute extreme. The algorithm has to not only learn a highly complex function but also figure out exactly which two variables out of thousands it actually needs to do so.

But this is a classic real-world scenario. Financial data, genomic sequences, readings from hundreds of sensors. You never know in advance which measurements are important and which are garbage.

And now, for an unexpected twist. A matrix rotation!

Well, well… The task has gotten noticeably harder and more interesting. But I really wanted to add something else to it! One extra little option to make it even more beautiful and fascinating!

In the basic version, the two informative features are just two columns in the data table (even if we don’t know which ones).

Many algorithms, especially tree-based ones, are great at finding useful features through brute force: “What if I split the data by feature number 5? What about feature number 117?”. This works when the informative axes align with the coordinate axes.

But what if I do this: take this entire multidimensional feature space (both informative and noisy) and rotate it by a random angle? Multiply it by a random rotation matrix? Without changing the distance between points, without altering the geometry. Just a simple orthogonal transformation.

After a rotation like this, no single feature holds complete information. The useful signal is smeared in a thin layer across all two thousand coordinates. If previously the useful signal was lying on an imaginary “floor” (along the X and Y axes), after the rotation, this “floor” has flipped and is now hovering somewhere at a weird angle in 2000-dimensional space.
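One common way to draw such a transformation is the QR decomposition of a Gaussian matrix. This is a sketch under that assumption, not necessarily how the repository does it; `random_orthogonal` is a hypothetical helper name. Strictly speaking, the result is a random orthogonal matrix, which may include a reflection, but distances are preserved either way.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a random d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix column signs so the draw does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
d = 50
Q = random_orthogonal(d, rng)

X = rng.uniform(0, 1, size=(200, d))
X_rot = X @ Q

# Orthogonality: Q^T Q = I, so pairwise distances are unchanged.
print(np.allclose(Q.T @ Q, np.eye(d)))
d_before = np.linalg.norm(X[0] - X[1])
d_after = np.linalg.norm(X_rot[0] - X_rot[1])
print(np.isclose(d_before, d_after))

# How many rotated coordinates the first original axis contributes to.
print(np.count_nonzero(np.abs(Q[0]) > 1e-12))
```

Row 0 of `Q` shows how the first original feature is distributed over the rotated coordinates, which is the "smearing" described above.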

For a human, or an algorithm analyzing the columns one by one, the task has become absolutely unsolvable.

How do we measure success or failure?

After training, each algorithm receives a new test set of 262,144 examples. But with one crucial detail: the noise features in the test data are brand new, freshly generated random numbers (again, via permutation, but using a different method). If an algorithm simply memorized the training set, it will instantly fail on the test.
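The protocol can be sketched roughly as follows. Two things here are assumptions: that the same rotation matrix is applied to both the training and the test data (otherwise the hidden plane would move), and that "fresh" decoys simply means new permutations drawn with a different random seed. The helper `make_split` is hypothetical.

```python
import numpy as np

def make_split(coords, n_total, Q, rng):
    """Real (x, y) columns + freshly shuffled decoys, then the shared rotation."""
    n, d = coords.shape
    decoys = [rng.permutation(coords[:, j % d]) for j in range(n_total - d)]
    wide = np.column_stack([coords] + decoys)
    return wide @ Q

rng = np.random.default_rng(0)
n_total = 20
coords = rng.uniform(0, 1, size=(500, 2))   # stand-in for pixel coordinates
# One rotation shared by both sets (assumption).
Q, _ = np.linalg.qr(rng.standard_normal((n_total, n_total)))

X_train = make_split(coords, n_total, Q, np.random.default_rng(1))
X_test = make_split(coords, n_total, Q, np.random.default_rng(2))  # new decoys

# Undo the rotation: the informative columns agree, the decoy columns do not.
back_train, back_test = X_train @ Q.T, X_test @ Q.T
print(np.allclose(back_train[:, :2], back_test[:, :2]))  # True
print(np.allclose(back_train[:, 2:], back_test[:, 2:]))  # False
```

A model that latched onto specific decoy values during training finds none of them at test time; only the two-dimensional signal carries over.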

The result is measured by a single number — the Root Mean Square Error (RMSE) relative to the reference image. RMSE = 0 means perfect performance. RMSE ~0.2, as we will see, is the equivalent of saying, “I give up, here’s the average brightness.”
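Why would "give up" land at a specific number? A constant prediction equal to the mean has an RMSE equal to the standard deviation of the target, which is presumably where the figure of roughly 0.2 for this image comes from. A small check on stand-in data (the uniform `target` is an assumption, not the real brightness values):

```python
import numpy as np

def rmse(pred, target):
    """Root Mean Square Error between predictions and reference values."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
target = rng.uniform(0, 1, size=10_000)   # stand-in for normalized brightness

# A model that has learned nothing can still output the average brightness.
baseline = np.full_like(target, target.mean())

# Its RMSE equals the (population) standard deviation of the target.
print(np.isclose(rmse(baseline, target), target.std()))  # True
```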

Furthermore, during the final testing phase of the trained model, we can arrange this test set in the exact same sequential order as the pixels in the image. This way, the model’s output can easily be turned into a new image, which can then be visually and intuitively compared to the original. Alternatively, we can plot and visually compare the two 3D surfaces.

For example, something like this:

So, let’s see what happened when I threw down the gauntlet to modern methods.

Who made it to the starting line?

Any self-respecting competition needs a roster of participants. I rounded up about two dozen of them, counting the variations in hyperparameter settings.

Participants:

And right here, I need to explain a certain methodological decision.

I know perfectly well how scientists operate. If someone comes along and says, “This algorithm of mine tore the competition to shreds,” the first reaction is usually, “You just botched the implementation of the other algorithms or tweaked the hyperparameters so they’d fail on your specific task.” And that’s a fair bit of paranoia.

So, to protect my reputation, I decided to do it in the most cynical way possible. I didn’t write the code for the widely known methods myself.

I described the entire essence of the competition in detail — the data format, the noisy features, the dirty trick with the matrix rotation, the success criteria — and provided the data generation code. Then, I left it entirely up to Claude Opus 4.6.

I asked the top-tier (at the time) neural network to write the tests, set up adaptive hyperparameters, and basically do everything in its power to make the boosting algorithms and neural nets show their absolute best results on this task.

Claude nailed it: the code was clean, included early stopping, and had fine-tuned settings for different dimensionalities. It suited up the existing methods in the best armor it could think of.

However, during the testing phase, a second thought hit me: what if Claude, in its overzealousness, overcomplicated things somewhere (say, cranked up certain hyperparameters too much out of fear of the thousands of noisy features)? So, for some of the methods, I created duplicate scripts. Absolutely bare-bones methods, with out-of-the-box default settings. Let the truly fittest survive.

And finally, just to keep it real, I wrote one of the neural network variants entirely myself, in pure PyTorch, the way regular folks used to do it back in the day.

Now, onto the participants. If you’d like, let’s briefly go through them by group. Or you can just skip this part (if you’re already familiar with them all) and jump straight to the results.

If we group the methods by their combat tactics, we get the following picture:

The Linear Classic

Ridge regression is a method that tries to draw a straight line through the points. Or a plane. Or a hyperplane (depending on the number of dimensions). Either way, it’s powerless against my winding surface.

But I included it intentionally: let it serve as the benchmark of helplessness! Someone’s got to take on that role, right?

If any method shows the same RMSE as linear regression, we’ll consider that it hasn’t learned a single thing. Label on the diagrams — Ridge.
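To see why a linear model makes a good "benchmark of helplessness", here is a small sketch using closed-form ridge regression in plain NumPy. The wavy `t` surface is a made-up stand-in for the image, and `ridge_fit` is a hypothetical helper, not the script used in the tournament.

```python
import numpy as np

def ridge_fit(X, t, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, t_mean = X.mean(axis=0), t.mean()
    Xc, tc = X - x_mean, t - t_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ tc)
    return w, t_mean - x_mean @ w

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 2))
# A wavy stand-in surface with several hills and valleys (assumption).
t = 0.5 + 0.5 * np.sin(6 * np.pi * X[:, 0]) * np.sin(6 * np.pi * X[:, 1])

w, b = ridge_fit(X, t)
pred = X @ w + b
rmse_ridge = np.sqrt(np.mean((pred - t) ** 2))
ratio = rmse_ridge / t.std()
print(f"ridge RMSE / std of target = {ratio:.3f}")
```

On a surface like this, the best plane is almost flat, so the ratio stays close to 1: the linear model does barely better than predicting the average.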

k-Nearest Neighbors (k-NN)

This guy has a photographic memory: he memorizes all the training examples and, for a new object, looks for the nearest neighbors. But add some noisy features, and the concept of “nearest” loses all meaning. Let’s see how he handles the curse of dimensionality.

Label on the diagrams — kNN.

Support Vector Regression (SVR)

A classic kernel method. Mathematically beautiful, but on large datasets its training cost grows somewhere between quadratically and cubically with the number of examples, which turns training into a deep meditation. I had to use a subsample of 10,000 points.

Label on the diagrams: SVR(10k).

The Tree Aristocracy

Random Forest and its extreme counterpart, ExtraTrees. Both build ensembles of trees. Old school, reliable. They are capable of brute-forcing through coordinate axes. Let’s see how far that gets them. For each of them, there are two versions: one fine-tuned by Claude, and one “out of the box.”

Label on the diagrams: RF(cl), ExtraT(cl), RF(def), ExtraT(def).

The Boosting Heavyweights

The elite of tabular competitions gathered right here.

HistGradientBoosting — fast and efficient, inspired by LightGBM.

LightGBM itself — a master of histogram tricks and GPU utilization.

XGBoost — a veteran who has survived hundreds of Kaggle battles.

CatBoost — a Yandex product, famous for its ability to handle categorical features (which we don’t have, but it’s a good guy nonetheless).

Each of them got two versions: the one tuned by Claude and the factory default (except for CatBoost, whose default settings performed worse and thus didn’t make it into the competition). They all know how to find important features… as long as those features remain separate columns in a table.

Labels on the diagrams: HistGB(cl), HistGB(def), LGBM(cl), LGBM(def), XGB(cl), XGB(def), CatBoost.

Neural Networks

Three different architectures (in PyTorch, Adam optimizer). The first two are the result of Claude’s creativity: multi-layer networks with a bottleneck designed to force the model to extract the very essence. The third is my own implementation, architecturally simpler. Theoretically, neural networks are perfectly capable of learning a coordinate rotation. The question is whether the optimization algorithm can handle it, and whether they’ll have enough patience and computational resources to pull it off.

Labels on the diagrams: MLP-1, MLP-2, MLP-3.

TabNet

An exotic guest. A neural network specifically designed for tabular data, featuring an attention mechanism that is supposed to select important features on its own. Three versions with different settings. Let’s see if the attention lives up to the hype.

Labels on the diagrams — TabNet-1, TabNet-2, TabNet-def.

The Anomaly. Polyharmonic Cascade

The Polyharmonic Cascade stands apart from this motley crew. It is not a neural network. Its training process is structured differently from standard gradient descent. It’s a participant who showed up to the competition with its own, slightly suspicious training regimen. I wrote all of its code myself, in accordance with a recent cycle of scientific papers (December 2025), links to which (on ArXiv) will be provided at the end. Pure, untouched mathematics.

Labels on the diagrams — PHC Ne (where N is the number of epochs). In every test, the Polyharmonic Cascade (to better understand how its quality depends on training time) was tested multiple times with a varying number of epochs.

The code for all tests is available on GitHub (link at the end).

You can check it, reproduce it, find bugs, or improve it. I’m all for it.

And now, with all participants armed and the rules of fair play observed (even if it meant bringing in a third-party AI), let’s see what all of this has led to.

My testing ground is based on an Intel® Core™ i9-10920X processor and an NVIDIA GeForce RTX 3070 8GB GPU. For algorithms where it was possible and beneficial, training was done on the GPU. My hardware might not be perfect, and other configurations could yield different results (primarily in terms of timing).

The Start of the Competition: Warm-up and the First Blow to the Gut

As in any worthy competition, the first rounds were just warm-ups. But they matter: they quickly show who actually understands the rules of the game. A pure sanity check.

Three warm-up rounds:

Round 1: Just 2 honest features at the input, no noise.
Round 2: 10 features: 2 useful + 8 noisy “doppelgangers”.
Round 3: The same 10 features + a random rotation of the feature space.

Caveat: with only two honest features at the input, doing a rotation makes no sense. It would just be rotating the plane itself. The two features would remain just as informative as they were. Therefore, we only apply the rotation once noise is added.

Round 1. The 2D Sprint

When the input is just x and y, the task turns into ordinary 2D interpolation of a 3D surface. Most participants handled it.

Let’s break down the first round in a bit more detail to show what different RMSE values actually mean in practice.

Two dropped out of the competition.

Linear regression (Ridge) predictably gave up immediately. But it left the competition with its head held high, having fulfilled its vital role as the benchmark of helplessness, scoring an RMSE of 0.207 (an important value for future comparisons).

As you can see from the “reconstructed” image, an RMSE above 0.2 means the function was essentially not reproduced in any way.

Support Vector Regression, SVR(10k), even on a reduced subsample, also failed (0.202), took almost 8 minutes, and was sent packing (perhaps better parameters could be found for it, but in this tournament, it was the coup de grâce).

The neural networks trained for about 13 to 19 minutes. Does gradient descent just need time to “break through” to a complex surface? In the end, the neural networks roughly split into two groups. MLP-2 and MLP-3 showed RMSE values of 0.0614 and 0.055. The rest (MLP-1 and all TabNets) showed significantly weaker results, getting stuck in the 0.14–0.16 range. The attention mechanism, it seems, couldn’t figure out what to pay attention to when the choice is trivial.

Let’s visualize what these RMSE ranges mean, for better understanding later on.

An example is TabNet-def with an RMSE of 0.14796. This RMSE value corresponds to picking out the largest details, the hills and valleys, but a complete absence of any fine details.

If we look at it as a surface:

Now, let’s look for comparison at what MLP-3 achieved with a result of 0.055.

You can see that an RMSE of 0.055 means the neural network reproduced almost all the main non-linearities of the function, only diverging from the reference in the fine details. In other words, an RMSE in the 0.055 range is already pretty decent.

If we look at the training curve:

It’s clear that the neural network could have kept training and improving its result, but I intentionally limited the number of epochs so it wouldn’t stick out too much in terms of runtime compared to other methods in this test (especially since the result was already good enough to advance to the next rounds).

The heavyweights, **LightGBM, XGBoost, and HistGradientBoosting**, cleared the hurdle in the first round with results around 0.023–0.029. Better than the neural networks, and much faster to boot.

Here is the result from LightGBM (tuned by Claude) with an RMSE of 0.02645. As you can see, when the RMSE drops below 0.03, the quality is exceptionally high, and at first glance, it’s hard to tell the reconstructed function apart from the reference unless you squint at the tiniest details.

CatBoost lagged slightly behind the other boostings (0.047), but it’s a respectable result for the first round.

Random Forest, ExtraTrees, and k-NN showed almost perfect results in the first round, hitting an RMSE around 0.002–0.005. However, there’s a suspicion that in this first round, these methods (k-NN for sure) simply memorized all the points on the surface (a trick that won’t be possible to pull off in any of the following rounds).

The Polyharmonic Cascade, at 50 epochs, showed a result on par with the best boostings, though slower than most of them. But give it just a little more time, and at 100 epochs (about 4 minutes of training), it hits an RMSE of 0.016.

This means the Polyharmonic Cascade isn’t the best off the starting line (kNN and ExtraTrees will steamroll it in 2D), but it quickly breaks into the “very respectable” zone as an approximator of highly non-linear functions.

Warm-up Round 2. Enter the Noise

The rules change. The 2D sprint is over. The algorithms get 10 features: 2 useful ones and 8 noisy ones (which, as a reminder, are just shuffled versions of x and y).

**k-NN** dies instantly and drops out of the competition. Its RMSE jumps from 0.003 to 0.22. In a (mere) ten-dimensional space, the concept of a “nearest neighbor” loses all meaning. The method that just perfectly reproduced the image in the previous round turned into a pumpkin. The curse of dimensionality in all its glory.

As for the surface on the test set, k-NN churned out something resembling animal fur. Beautiful, but completely wrong.
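The effect is easy to reproduce at toy scale. The sketch below is a brute-force k-NN in NumPy on a made-up wavy surface, with `k=5` chosen arbitrarily; it is not the tournament script, and the absolute numbers will differ from the ones above. What it shows is the mechanism: once eight decoy columns dominate the distance, the "nearest" rows are no longer near in (x, y).

```python
import numpy as np

def knn_predict(X_train, t_train, X_query, k=5):
    """Brute-force k-NN regression: average the targets of the k closest rows."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argpartition(d2, k, axis=1)[:, :k]
    return t_train[nearest].mean(axis=1)

def surface(X):
    # Made-up smooth target (assumption), depends on the first two columns only.
    return 0.5 + 0.5 * np.sin(4 * np.pi * X[:, 0]) * np.sin(4 * np.pi * X[:, 1])

def with_decoys(X, n_decoys, rng):
    """Append independently shuffled copies of the two real columns."""
    decoys = [rng.permutation(X[:, j % 2]) for j in range(n_decoys)]
    return np.column_stack([X] + decoys)

rng = np.random.default_rng(0)
X_train, X_test = rng.uniform(0, 1, (2000, 2)), rng.uniform(0, 1, (200, 2))
t_train, t_test = surface(X_train), surface(X_test)

rmse = lambda p, t: float(np.sqrt(np.mean((p - t) ** 2)))
rmse_2d = rmse(knn_predict(X_train, t_train, X_test), t_test)
rmse_10d = rmse(knn_predict(with_decoys(X_train, 8, rng), t_train,
                            with_decoys(X_test, 8, rng)), t_test)
print(f"2 features: {rmse_2d:.3f}   2 + 8 decoys: {rmse_10d:.3f}")
```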

The trees no longer yield perfect results, but they hold up excellently in the 0.0242–0.038 RMSE range. They know how to brute-force through features and find the useful ones. The best of the bunch is the “out-of-the-box” ExtraTrees, scoring 0.0242 in under 8 seconds.

The boostings start to sweat a little, shifting into the RMSE > 0.04 range. By the way, a funny moment: Claude’s “smart” tuning for XGBoost resulted in a failure of 0.11 RMSE after more than 6 minutes of training, while XGBoost with default settings churned out 0.047 in 32 seconds.

The MLP neural networks trained for about the same amount of time as last time — 13 to 16 minutes. For MLP-1, the addition of noise, for some inexplicable reason, actually did it a favor, leading to a sharp improvement compared to the previous round, dropping from 0.146 to 0.0538. MLP-2 and MLP-3 got 0.07 and 0.055. Overall, they’ve become comparable in quality to some of the boostings.

The TabNet variants, training for 15 minutes to half an hour, achieved an RMSE of 0.12–0.15. They might need different settings/modes, but in the state used here, they look weaker than the rest.

The Polyharmonic Cascade behaves very similarly to the previous round (as we can see, adding 8 columns of garbage doesn’t bother it at all), showing an RMSE of 0.0266 at 50 epochs (2 minutes of training) and 0.02 at 100 epochs (4 minutes of training).

Next up is where things get really interesting.

Warm-up Round 3. The Blow to the Gut (Matrix Rotation)

And now, I applied that very “matrix rotation” to the coordinate system, the one that gives algorithms an existential crisis. The informative features stopped being separate columns; the signal was smeared across all 10 dimensions.

I certainly expected the algorithms to start making mistakes. But I didn’t expect a mass suicide.

Look at the numbers. Our “I give up” threshold is at the RMSE 0.2 mark.

That’s it. Absolutely all the trees and boostings collapsed in an instant, showing results on par with linear regression on the 10-feature test with rotation. Early stopping triggered in the scripts — the algorithms themselves realized they couldn’t solve the task and refused to learn.
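A toy illustration of why axis-aligned splits suffer. It measures how much target variance the single best threshold on each column can remove, before and after a random orthogonal transformation. Everything here is an assumption for demonstration: the step-shaped target, the 50 columns, and the helper `best_split_gain`. Real tree ensembles are more capable than one split, so this shows the mechanism, not the tournament result.

```python
import numpy as np

def best_split_gain(col, t):
    """Largest fraction of target variance removed by one threshold on `col`."""
    order = np.argsort(col)
    ts = t[order]
    n = len(ts)
    left_n = np.arange(1, n)                 # sizes of the left part
    left_sum = np.cumsum(ts)[:-1]
    right_sum = ts.sum() - left_sum
    # Between-group sum of squares for every possible cut position.
    between = left_sum**2 / left_n + right_sum**2 / (n - left_n) - ts.sum()**2 / n
    return between.max() / ((ts - ts.mean())**2).sum()

rng = np.random.default_rng(0)
n, d = 2000, 50
xy = rng.uniform(0, 1, (n, 2))
X = np.column_stack([xy] + [rng.permutation(xy[:, j % 2]) for j in range(d - 2)])
t = (xy[:, 0] > 0.5).astype(float)   # toy target: a step along the real x axis

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X_rot = X @ Q

gain_plain = np.array([best_split_gain(X[:, j], t) for j in range(d)])
gain_rot = np.array([best_split_gain(X_rot[:, j], t) for j in range(d)])
print(f"best column before rotation: {gain_plain.max():.2f}")
print(f"best column after rotation:  {gain_rot.max():.2f}")
```

Before the rotation, one column separates the target perfectly. After it, every column is a blend of all fifty originals, and no single threshold recovers more than a small share of the variance.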

And who survived?

The neural networks. Their RMSE stayed within the 0.055–0.084 range.

TabNet survived, but churned out its usual 0.14–0.16 result.

The Polyharmonic Cascade — 0.0274 at 50 epochs and 0.0224 at 100 epochs. The Cascade just kept building its surface as if nothing had happened.

The First Divide.

Right here, on a miserable ten features, the competition split into two leagues.

The No-Rotation League. The trees and boostings will continue to compete in these tests; they can only work when the informative axes align with the coordinate axes. We will monitor them fairly and closely. But we already know: if the space rotates, they are helpless.
The Rotation League. The neural networks (including TabNet) and the Polyharmonic Cascade participate here. They will compete in both categories: with and without rotation.

So, the warm-up is over. The real stress test is just beginning.

In the first full-fledged test, we add 498 phantom features that perfectly mimic the real ones. The useful signal now makes up a mere 0.4% of the input data.

This is where the real pressure on the algorithms’ architecture begins.

Round 4. Cutting Down the Forest, Chips Fly. (500 features, no rotation)

Five hundred features is serious business. But in this round, the informative features still lie strictly along two specific axes. Theoretically, methods like trees and boostings are capable of finding them. But what happens in practice?

The trees present a paradox. The versions “fine-tuned” by Claude completely break down: RandomForest and ExtraTrees yield an RMSE > 0.2 and drop out of the competition.

However, the bare-bones default versions prove to be fighters: ExtraTrees hits an RMSE of 0.0537 in 8 minutes, and RandomForest gets an RMSE of 0.0564 in 30 minutes. Which is quite good.

Boostings.

HistGB with Claude’s settings yielded an RMSE of 0.156 in a minute and a half. An attempt to forcibly crank up its number of iterations to massive values improved the result to 0.134, but it cost 1 hour and 12 minutes.

HistGB (default) trained very quickly, in 20 seconds, reaching an RMSE of 0.1338, but it hit the same mediocre quality ceiling as it did with Claude’s settings.

LightGBM, XGBoost, and CatBoost are holding steady in the 0.08–0.12 range. They are quite operational at five hundred features, but they are already lagging behind the trees.

Neural Networks.

MLP-1 and TabNet-1 couldn’t handle it and drop out.

MLP-2 reached an RMSE of 0.1881 after half an hour of training, which is a very weak result. But still, we’ll give it credit and let it pass to the next rounds.

**MLP-3**: kudos. It trained honestly for half an hour and showed an RMSE of 0.077, which is better than the best boosting (though worse than the trees). Slow, but it works.

TabNet-2 trained for a whopping 83 minutes and yielded an RMSE of 0.177, which is weak, but we’ll give it a chance to participate further.

TabNet-def, the best of the TabNets, trained in 15 minutes with an RMSE of 0.1626.

The Polyharmonic Cascade. An RMSE of 0.03071 in 7 and a half minutes, and an RMSE of 0.02097 in 15 minutes. This is not a typo. At five hundred features, it showed practically the same accuracy as it did at ten (0.02097 versus 0.02). And with 500 features, its RMSE drops significantly below the best trees and boostings, and it does so in a perfectly reasonable amount of time.

Round 5. The Meat Grinder for Neural Networks. (500 features, with rotation)

I activate the matrix rotation mode. Now, the “two useful columns” vanish. All that’s left is a hidden 2D plane, rotated in a 500-dimensional space.

As became obvious during the warm-up (Round 3), trees and boostings don’t survive a rotation. Therefore, I won’t provide their data here (though I did run them just in case, but there were no surprises — they are completely non-operational).

The battlefield is cleared. Only the neural networks and the Polyharmonic Cascade remain.

And this is where the brutality begins.

You can hear the crunch of breaking gradients, and out of 6 neural network variants, only one actually survives: MLP-3. It spent 28 minutes training and managed to pull off an RMSE of 0.086. The rest, alas.

You might be interested in seeing MLP-3’s training curve, which looks somewhat like a cardiogram.

Let’s move on.

Can we consider TabNet-2, which took 84 minutes of training to push its RMSE below the 0.2 mark and reach 0.1977, a success? Obviously not. But still, due to this mini-anomaly, I decided to allow it into the next rotation round.

The Polyharmonic Cascade? An RMSE of 0.03258 in 7 and a half minutes, and an RMSE of 0.022 in 15 minutes.

Did you catch that? 0.02097 without rotation in Round 4, and 0.022 with rotation. The difference is minimal. The training time is exactly the same. The Cascade practically didn’t even notice the rotation.

And here is its training curve for comparison. A strange feature here is that the error curves on the training set and the validation set (remember, the original sample was split into these, while an additionally generated test set is used for the final result) practically coincide and merge. There isn’t even a hint of any overfitting process.

Let’s compare the reconstructed images from the test set processed by the MLP-3 neural network and the Polyharmonic Cascade.

The difference is clearly visible to the naked eye.

But 500 features isn’t deep space just yet! Let’s see what happens when we double, and then quadruple, the noise dimensionality.

So, we now have 1000 features, and we’re crossing a psychological milestone.

Round 6. A Thousand Features Without Rotation. The Air Gets Thin

In this mode, the coordinate axes don’t lie. The informative features match the table columns.

Random Forest and ExtraTrees, in their “default” configurations, demonstrate a stubborn survivability: RMSE 0.057 and 0.059, but at the cost of 18 and 49 minutes of computation.

Boostings. They are alive and capable of finding informative features, but their results stagnate in the 0.089–0.18 range. The best result among them, 0.0899, comes from LightGBM (with default settings). XGBoost — 0.118. CatBoost, having spent 25 minutes, yields 0.125. (Some methods worked faster, some slower, but adding more iterations no longer changed the picture.)

Neural Networks.

MLP-2 couldn’t handle a thousand features anymore and leaves the competition. MLP-3 spent over an hour on the track and showed 0.0906 — slightly worse than at five hundred features. Slow, but the quality is comparable to the best of the boostings.

TabNet-2 and TabNet-def are conditionally alive, but showing weak results. TabNet-2 — RMSE 0.1703 in 1 hour 17 minutes. TabNet-def — 0.1546 in 44 minutes.

The Polyharmonic Cascade.

In the first 50 epochs (3 minutes) with an RMSE of 0.08219, the Cascade overtakes all the boostings and neural networks.

In 100 epochs (6 minutes), it hits an RMSE of 0.04492, leaving the trees behind and taking the lead.

By the twelfth minute, the result is already 0.02991. And after half an hour of training, an absolutely perfect 0.02045.

Round 7. A Thousand Features With Rotation. Two Solitudes

I rotated the thousand-dimensional space. And only two were left.

TabNet-2 finally confirmed its inability to learn and dropped out.

Only MLP-3 and the Polyharmonic Cascade remained.

MLP-3 trained for 1 hour 13 minutes and reached an RMSE of 0.07225. And for some reason, that’s even better than in the round without rotation.

The Polyharmonic Cascade.

It reached roughly the same metric (close to MLP-3) of RMSE 0.07525 in just 3 minutes.

In less than 12 minutes, it hit the 0.04419 level.

But then, at the 500th epoch, it showed a slightly worse result of 0.04549. Did the rotation and 1000 features finally take a toll on the Cascade’s quality?

Let’s look at the training curve.

It seems there was just an unfortunate spike right around the 500th epoch. By the way, unlike the 500-feature round, the train and val curves here diverge just a tiny bit, no longer overlapping perfectly.

Round 8. Coda

Two thousand features. The final competition in the “no rotation” mode.

Here, in the final stage, two neural networks drop out of the race: TabNet-def and MLP-3, having failed to handle the task.

Surprisingly, the sole surviving neural network was TabNet-2, which showed the weakest results throughout the entire tournament. And that’s how it finished, too: with an RMSE of 0.1622, reached after an hour and a half.

Boostings.

In the “no rotation” mode, they all survived and made it to the finish line.

But in terms of quality, by the 2000-feature mark, they had severely lost ground. Moreover, the boostings with default settings showed better results than those with Claude’s settings.

HistGB(def) and XGB(def) worked quickly, finishing in about 1 minute. But the results are weak: 0.1453 and 0.1226.

CatBoost thought for 46 minutes and got a result of 0.1288.

The best RMSE among the boostings, 0.0977, was shown by LGBM(def), wrapping up in 13 minutes. Not great, but I think it’s a respectable result, given the number of noisy features at the input.

Trees.

Stubborn opponents. In the “no rotation” mode, they provided the best quality among all the classic methods.

ExtraTrees achieved an RMSE of 0.06034 in 36 minutes.

Random Forest reached an RMSE of 0.0617 in 2 hours 4 minutes.

Not brilliant, but reliable.

And finally, the Polyharmonic Cascade.

Judging by the metrics, Round 8 was tougher on it than the previous ones.

But nevertheless. After 15 minutes of training at 200 epochs, it already had an RMSE of 0.05291, which is better than the metrics of all other participants.

By the 500th epoch (37 minutes), it further improved the result to 0.0471.

Round 9. A Black Cat in a Dark Room

Now I rotated the two-thousand-dimensional space and launched the final trial.

The battle is over, silence has fallen. Only one survived.

The MLP-3 neural network, despite having 1000 features well within its grasp, couldn’t train on 2 thousand and drops out in the final round.

The Polyharmonic Cascade finishes the competition in proud solitude.

RMSE 0.08254 in four minutes.

RMSE 0.0507 in seven and a half minutes.

RMSE 0.03895 in fifteen minutes.

RMSE 0.03227 in 37 minutes.

For some inexplicable reason, the Polyharmonic Cascade worked even better in this final, most difficult round than ever before. Something beautiful is happening here. It showed results not only better than its own in Round 8 (2k features without rotation), but it also surpassed its own metrics from Round 7, which had 1000 features with rotation. Let’s take a closer look at the results of the Polyharmonic Cascade in the final round.

Aftertaste

So, the tournament is over. And what’s the bottom line? We see that modern tabular ML is a brilliant, fast, but axis-dependent mechanism. XGBoost, LightGBM, CatBoost, HistGradientBoosting, Random Forest, ExtraTrees: they all think in terms of “split by feature X, then by feature Y.” As long as the nature of the data is aligned with the coordinate axes, they tear everyone and everything to shreds. But just apply a random linear transformation to the data (which is incredibly common in the real world), and this heavy artillery starts firing blanks.

We see that neural networks possess the right intuition (they understand rotation). But they train slower and struggle when there are too many noisy features. And when there are 2000 of them, they drown in the noise. Although I fully admit that the neural networks I used here might not have been implemented or tuned in the best possible way. Therefore, the question of their capabilities remains slightly open.

And we saw the newcomer. The Polyharmonic Cascade doesn’t use gradient descent in its classical form and doesn’t brute-force through features. In this test, it proved capable of grasping complex geometry where others see only noise. Links to publications describing the theory of the Polyharmonic Cascade and the results of its tests on other, more official benchmarks are provided at the end of the article.

A small digression:

I, of course, anticipate one of the obvious arguments in future comments that tend to follow such tests: “Why don’t you just apply PCA before the boostings (or trees), wise guy!”.

And I conducted such an experiment (the code is included in the repository). I took the best boosting algorithm from Round 8, LightGBM, and tried to apply it in Round 9 with rotation, but using Principal Component Analysis (PCA) beforehand. And no matter how many principal components I chose — 3, 5, 10, 50, 2000… LightGBM consistently output 0.215 and stopped early. It seems PCA not only failed to save the day, but died itself.
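A plausible explanation, offered as a hedged sketch rather than a verdict: PCA ranks directions by variance, and in this setup every column, real or decoy, has the same variance and the columns are essentially uncorrelated. The covariance matrix is then close to a multiple of the identity, and an orthogonal transformation leaves it that way, so no direction stands out. The toy check below uses 50 columns and made-up uniform coordinates (assumptions), not the 2000-feature data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 50
xy = rng.uniform(0, 1, (n, 2))
X = np.column_stack([xy] + [rng.permutation(xy[:, j % 2]) for j in range(d - 2)])

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X_rot = X @ Q

# PCA ranks directions by variance: the eigenvalues of the covariance matrix.
eigvals = np.linalg.eigvalsh(np.cov(X_rot, rowvar=False))
spread = eigvals.max() / eigvals.min()
print(f"largest / smallest eigenvalue: {spread:.2f}")
```

With a nearly flat spectrum, the top 3, 5, or 50 components are arbitrary directions with respect to the signal, and keeping all of them is just one more rotation of the same space.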

Thank you to everyone who read this far. I hope it was at least 10% as fascinating for you to read as it was for me to conduct this competition.

I ran this experiment alone, on my home hardware. But science does not exist in a vacuum (unless required by the experiment’s conditions).

What’s next?

As I said in the epigraph: this is an invitation to replicate. Everything described in the article can be run on your computer. No black boxes, no closed datasets.

Repository with the full test code, data generator, and cascade implementation:

https://github.com/xolod7/black-cat (Polyharmonic Cascade)

**Scientific articles of the cycle (arXiv):**

[https://arxiv.org/abs/2512.12731](https://arxiv.org/abs/2512.12731)

[https://arxiv.org/abs/2512.16718](https://arxiv.org/abs/2512.16718)

[https://arxiv.org/abs/2512.17671](https://arxiv.org/abs/2512.17671)

[https://arxiv.org/abs/2512.19524](https://arxiv.org/abs/2512.19524)

Still have questions? Want to discuss? In the comments, on GitHub, in private messages — I’m open.

Let’s figure it out together.

Source: the original article, “Searching for a Black Cat in a 2000-Dimensional Dark Room: A Machine Learning Algorithm Tournament”, was originally published in Towards AI on Medium.

source: pub.towardsai.net (original article)
