Tensordyne Converts AI Matrix Math To Logs To Crank Up Inference Oomph

wpnews.pro

Right off the bat, let’s give a shout out to the mathematician propeller-heads who create the transformations that make it possible to do all kinds of high performance computing to simulate, model, and generate insight from massive amounts of noisy data.

Transformations are the key to such codes, and they rely on math that predates computing as we know it by centuries. There are all kinds of neat transformations.

My personal favorite is the Fourier transform, which breaks down complex signals into their component sine and cosine waves and which is named for French Enlightenment-era mathematician and physicist Jean-Baptiste Joseph Fourier. Fourier transforms are a key component of both HPC and AI codes, of course, and are a nightmare to do with a pencil, let me tell you. (I could not do it today if you put a gun to my head, unless you gave me time to rifle through my bookshelf and read up. . . .)

A more recent transformation example is Google’s TurboQuant quantization, which shrinks the memory wall for AI inference. TurboQuant takes the input vectors for an AI workload, applies a random rotation matrix to the vectors, does scalar quantization of that data, and then rotates it back into vector space. (The math is more complex than this.) The upshot is that TurboQuant compresses the key-value cache by a factor of 6X. (I really need to dig into this Google innovation. Apologies for the delay.)
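To make the rotate, quantize, rotate-back idea concrete, here is a minimal sketch in plain Python. It is not Google's TurboQuant algorithm, whose math is more involved, as noted above. The function names (`random_rotation`, `quantize`, `roundtrip`), the Gram-Schmidt construction of the random orthogonal matrix, and the 4-bit uniform quantizer are all assumptions chosen purely for illustration.

```python
import math
import random

def random_rotation(dim, seed=0):
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian rows (illustrative choice)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def quantize(v, bits=4):
    """Uniform scalar quantization of each coordinate to 2**bits levels."""
    lo, hi = min(v), max(v)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((x - lo) / scale) for x in v]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

def roundtrip(v, bits=4, seed=0):
    """Rotate, quantize, dequantize, and rotate back with the transpose."""
    rot = random_rotation(len(v), seed)
    codes, lo, scale = quantize(matvec(rot, v), bits)
    return matvec(transpose(rot), dequantize(codes, lo, scale))

vector = [0.1 * i for i in range(8)]
restored = roundtrip(vector)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, restored)))
print("reconstruction error:", error)
```

Because the matrix is orthogonal, its transpose undoes the rotation exactly, so the only loss comes from the scalar quantization step. The sketch does not attempt to reproduce the 6X compression figure.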

Tensordyne, one of the second generation of AI chip startups that is focused heavily on inference, has its own transformation twist at the center of its “Napier” AI inference engines, and like many good ideas, it sounds absolutely and perfectly obvious once you say it. Here is the crux of it: Multiplying matrices of numbers is hard, even for computers with matrix multiplier units. But adding banks of numbers is a lot easier. (Even your own neurons like addition more than multiplication.) And the key insight from Tensordyne’s founders is that if you convert data to logarithms, you can add them and avoid the multiplication overhead entirely. The Napier chip does all of this conversion from matrix to logs under the covers, invisibly. It is named after Scottish mathematician, astronomer, and physicist John Napier of the late Renaissance, who invented logarithms and was also an early user of the decimal point, and who was therefore one of the first geeks to get us on the path to a floating point processing unit.
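The underlying identity is log2(a × b) = log2(a) + log2(b), so a multiply becomes an add once the operands are stored as logarithms. The sketch below shows that identity at work in a dot product. It is not a description of how Napier works internally, which Tensordyne has not detailed. The helper names (`to_log`, `log_mul`, `from_log`, `log_dot`), the separate sign field, and the choice to accumulate the products back in the linear domain are all assumptions made for illustration.

```python
import math

def to_log(x):
    """Encode a float as (sign, log2 of magnitude); zero gets sign 0 (assumed convention)."""
    if x == 0.0:
        return (0, float("-inf"))  # zero has no logarithm, so flag it with sign 0
    return (1 if x > 0 else -1, math.log2(abs(x)))

def log_mul(a, b):
    """Multiply two log-encoded values: multiply the signs, add the exponents."""
    return (a[0] * b[0], a[1] + b[1])

def from_log(v):
    """Decode a (sign, log2 magnitude) pair back to an ordinary float."""
    sign, exponent = v
    return 0.0 if sign == 0 else sign * 2.0 ** exponent

def log_dot(xs, ys):
    """Dot product whose multiplies are done as additions in the log domain."""
    return sum(from_log(log_mul(to_log(x), to_log(y))) for x, y in zip(xs, ys))

print(log_dot([1.5, -2.0, 0.0], [4.0, 0.5, 3.0]))
```

The result agrees with the ordinary dot product (6 − 1 + 0 = 5) up to floating point rounding. The hard part in real hardware is the accumulation, because sums have no equally simple form in the log domain; this sketch sidesteps that by converting each product back before adding.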

Here’s the upshot: This shift to log math provides more than an order of magnitude better performance, lower price, and lower power consumption than hybrid architectures from Nvidia and Amazon Web Services. This is clearly something that AI inference needs, as the vast wealth that Nvidia has amassed so aptly demonstrates. As I am fond of saying, economic substitution is a law, it’s not just a good idea. If the Tensordyne architecture pans out and can support all of the big inference engines, and Tensordyne can get enough HBM to make them in high volume, this may be a DeepSeek moment for AI hardware.

RK Anand, one of the co-founders of Tensordyne, tells The Next Platform that Broadcom is the chip shepherd for the Napier compute engine and that with Broadcom being the third largest buyer of HBM memory on the planet and the third largest buyer of chip wafers from Taiwan Semiconductor Manufacturing Co, supply shortages are not going to be any bigger an issue for Tensordyne than they are for anyone else.

“As long as we are within lead times, we can get any volumes customers need,” Anand states emphatically.

With that, let’s learn a little about Tensordyne’s people and dive into its AI inference system architecture, which is code-named “Pareto” appropriately enough, given that the Pareto curves for AI inference are going to be a deciding factor in AI system purchases.

Jumping From Cars To Datacenters

RK Anand and Gilles Backhus co-founded a company called Recogni back in September 2017, and the idea was to create AI inference systems for the automobile industry. Anand has a long history in Silicon Valley, and both co-founders are experts in signal processing, which comes as no surprise.

Anand, who is Tensordyne’s chief product officer, got his bachelor’s degree in electronics and communications engineering from Manipal Institute of Technology in 1988 and his master’s degree in computer engineering from Syracuse University in 1990. He spent six and a half years in the microprocessor division at Sun Microsystems after graduation as a senior engineering manager, and was the founding vice president of engineering when Juniper Networks was started in July 1996 to take on Cisco Systems in datacenter routers. Anand stayed at Juniper until September 2012, and two years later he was the chief executive officer at Kumu Networks, which invented full duplex wireless networks. (This is where Anand ran across Backhus, who did a thesis on full duplex wireless at Stanford University under professor Sachin Katti, who was also a chief technology officer at Intel for four years.) Anand also did a startup called OttoQ, which was an app to find open parking spaces, which was shut down when Recogni was started.

Backhus, who is the company’s vice president of AI, got his master’s of science in elektrotechnik from the Technical University of Munich in 2015, and jumped right into a machine learning startup in Munich before doing a short stint at Kumu Networks where he was also working on that thesis at Stanford University. Backhus worked at a few startups for a few years until starting Tensordyne with Anand.

The two have built a company with over 120 people that has raised $176 million across three rounds and that has just successfully taped out the Napier compute engine. In July 2022, Anand and Backhus hired Marc Bolitho, an expert in ADAS systems for self-driving cars as well as other embedded auto systems, to be Tensordyne’s chief executive officer. This was about the time the company started pivoting towards the datacenter and away from cars.

This was obviously the right move, considering how AI is now driving trillions of dollars in new IT budget globally over the next few years. No matter how many cars we all buy, no margins are sweeter than those that can take some revenue away from Nvidia, which has commanding market share in AI training and inference.

For now. Nvidia will have a huge and profitable business, but eventually, other companies will get access to HBM and chip wafers, they will innovate, and they will beat Nvidia for at least some of the business. This is as inevitable as AI itself was the moment that the first bit of symbolic logic was pressed into clay 12,000 years ago.

A Logarithmic Dynamo For Tensor Flows

Sadly, we have not yet had a deep dive on the Tensordyne Napier chip architecture. We suspect one will come out between now and whenever the Napier chips start shipping to customers at the end of the second quarter in 2027. Right now, according to Anand, the plan is to get cloud access to Napier engines by the end of 2026, with customers getting beta TDN Pod systems employing them in the first quarter of 2027.

But the company has given us information about the system architecture and how it stacks up to alternative high performance AI inference systems out there in the world.

Backhus did give us a little hint of what is going on under the hood of the Napier chip:

“The main compute workhorse has 48 cores, and it has a very state of the art, very transformer architecture inspired aperture. It is 128 by 128 systolic array, basically, but with lots of extra features, such as a novel accumulator design, for example. It can also do some legacy AI workloads that are still important to some customers very efficiently as well.

“So it has 48 of those cores, and they are being coupled with these vector processing units. The vector processing unit also has arithmetic units, but it can also use a lookup table, and it can work completely in parallel. So, if you have a softmax at the end of a matrix multiply, for example, you can do this in parallel, you don’t have to do it in sequence. You can interleave them and pipeline them, basically. And then we have the RISC-V cores, and that kind of rounds out the three computing categories that we have on the silicon.”

Here is a die shot of the Napier chip:

If you look at it carefully, you have HBM controllers on the left and right of the cores, and what appears to be I/O controllers along the top and bottom. The RISC-V cores could be embedded in the north and south blocks, but I don’t think so. You can see the four banks of compute, each with two columns of six Napier logarithmic cores, and the vector engines are probably embedded there.

In the center of those columns there are what we presume are massive banks of SRAM memory, and if you look even closer, you will see four little squares of something. My guess is each of those four little squares is a block of four RISC-V cores, for a total of 16 cores per device.

“Per rack, we have 320 Xeon cores, and then we have another 4,608 RISC-V cores per rack,” Backhus confirmed. “We are taking a two-tiered approach to the whole CPU problem. Whatever happens very close to the AI compute, and as part of the token loop, as part of the autoregressive loop of LLMs, that most of that gets executed in the RISC-V cores. So MoE routing is going to be there, checking for certain dictionary rules when you know that you want to discard certain tokens is there, and so on. Then in terms of interfacing with the slightly less frequent world of inference serving, that then happens on the Intel Xeon. And then if you want to go even through the 200 Gb/sec Ethernet link to another pod, you can totally do that. We have lots of bandwidth in our system. We have 64 200 Gb/sec links per rack that you can use to go to other systems, which is a very high bandwidth interface.”
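The core counts hang together arithmetically, which is worth checking because the per-chip RISC-V figure is only a guess from the die shot. The short sketch below multiplies out the layout described above; the constant names are made up for illustration, and the 288 chips per rack figure is the one quoted elsewhere in this article.

```python
# Layout as read off the die shot (the author's interpretation, not a vendor spec).
BANKS = 4
COLUMNS_PER_BANK = 2
LOG_CORES_PER_COLUMN = 6
RISCV_BLOCKS = 4
RISCV_CORES_PER_BLOCK = 4   # a guess from the die shot

CHIPS_PER_RACK = 288        # rack size quoted elsewhere in the article

log_cores_per_chip = BANKS * COLUMNS_PER_BANK * LOG_CORES_PER_COLUMN
riscv_cores_per_chip = RISCV_BLOCKS * RISCV_CORES_PER_BLOCK
riscv_cores_per_rack = riscv_cores_per_chip * CHIPS_PER_RACK

print(log_cores_per_chip, riscv_cores_per_chip, riscv_cores_per_rack)
# prints: 48 16 4608
```

The 48 log cores match what Backhus said, and 16 RISC-V cores per chip times 288 chips lands exactly on the 4,608 per rack that he confirmed, which supports the guess of four blocks of four.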

It is not clear if the Napier chip is composed of chiplets or is a single monolithic die. What Anand did confirm to me was that the RISC-V cores run at 1.5 GHz and the log cores run at 1.33 GHz, and that the whole Napier complex has 138 billion transistors implemented in a 3 nanometer process from TSMC. This is a serious chip. But here’s the thing. The Napier complex is nowhere near the reticle limit of current CMOS processes. So it is small. And the Napier chip only burns 300 watts compared to 1,200 watts for a “Blackwell” B300 GPU accelerator.

This is what happens when you add logs of floating point numbers rather than multiply floating point numbers.

This also means a rack of 288 of these devices can be air-cooled, which will be a boon for hedge funds, big banks, insurance companies, and others who have metropolitan datacenters and cannot easily support the power density or the weight of large GPU supercomputers.

Here are the specs for the Napier chip:

The Napier chip supports NVFP4, FP8, and FP16 data formats and processing, which is all that is needed at this point for AI inference. It processes 2.1 petaflops at dense FP8 precision. The chip has four banks of HBM4 memory, each with 36 GB and totaling 144 GB for the whole shebang with 4.7 TB/sec of bandwidth. Importantly, there is 256 MB of SRAM on the chip, with a combined 40 TB/sec of aggregate bandwidth. This is HBM capacity and bandwidth that is comparable to many GPUs shipping today.

Tensordyne puts nine of these chips into a compute tray, with a 40-core Xeon processor on it for host control and some decode work. It is not clear how much memory this host has, but it does not look like much based on this schematic:

The compute tray has two 200 Gb/sec QSFP ports to link to the outside world, and there are six ports on the back of the tray that are for the proprietary TDNLink interconnect that is used to link these Napier chips together in a 72-chip pod that has eight trays. There is an 8 TB NVM-Express flash card on the tray as well. This complex is 1U high and has 19 petaflops of aggregate performance at FP8 precision, backed by 1.3 TB of HBM memory with 42 TB/sec of bandwidth and 2.25 GB of SRAM with 360 TB/sec of bandwidth.

The Napier compute trays are linked over a backplane much like Nvidia did with its NVLink Switch, using a proprietary, cell-based transport and a homegrown protocol called TDNLink.

Cell-based architectures have been used as the interconnect fabric for big NUMA servers for almost three decades now, and this particular cell-based setup provides all-to-all links for the Napier accelerators within a single chassis. This cell-based architecture might even be derived from Juniper routers for all we know, or given Anand’s background, it may be homegrown. The slide above says it has over 100,000 deployments, which seems to argue that it is licensed from Hewlett Packard Enterprise, which owns Juniper now and which is also a manufacturing partner for the pair of top of rack switches in each TDN Pod.

By the way, don’t underestimate that less than 1 microsecond for a single hop over TDNLink. Anand says that a hyperscaler told them that with NVLink Switch, the average latency in the real world with real AI workloads is 10 microseconds to 13 microseconds. NVSwitch is very wide, but it ain’t necessarily as fast as we are all probably assuming. (We’d love to confirm this number with Nvidia, of course.)

What we would call a chassis, Tensordyne calls a TDN Pod, and it looks like this:

And here is how you rack it up:

The rackscale system, which is using Ethernet for an interconnect, weighs in at 608 petaflops at FP8 precision, with 42 TB of HBM, 74 GB of SRAM, 266 TB of flash, and 275 TB/sec of aggregate TDNLink bandwidth across those four chassis. And it “only” burns 120 kilowatts and can be air cooled.
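The tray and rack figures can be sanity-checked by multiplying the per-chip specs by the chip counts. The sketch below does only that; the dictionary keys and the `roll_up` helper are illustrative names, and the inputs are the per-chip numbers quoted in this article (nine chips per tray, 288 per rack).

```python
# Per-chip figures as quoted in the article.
CHIP = {
    "fp8_petaflops": 2.1,
    "hbm_gb": 144,
    "hbm_tb_per_sec": 4.7,
    "sram_mb": 256,
    "sram_tb_per_sec": 40,
    "watts": 300,
}
CHIPS_PER_TRAY = 9
CHIPS_PER_RACK = 288

def roll_up(chip, count):
    """Multiply every per-chip figure by the number of chips."""
    return {name: value * count for name, value in chip.items()}

tray = roll_up(CHIP, CHIPS_PER_TRAY)
rack = roll_up(CHIP, CHIPS_PER_RACK)

for label, total in (("tray", tray), ("rack", rack)):
    print(label, {name: round(value, 1) for name, value in total.items()})
```

The tray totals come out at 18.9 petaflops, 1,296 GB of HBM, 42.3 TB/sec, and 2,304 MB of SRAM, in line with the quoted 19 petaflops, 1.3 TB, 42 TB/sec, and 2.25 GB. Straight multiplication for the rack gives 604.8 petaflops and about 41.5 TB of HBM, close to but slightly under the quoted 608 petaflops and 42 TB. The chips alone account for 86.4 kilowatts of the 120 kilowatt rack figure.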

So what does this mean for large-scale inference workloads? Anand says this is how you need to think about it:

To drive 1 billion tokens per second, a modern GPU system needs around 2,000 racks, and you can get the same throughput with only 350 Tensordyne racks. (This does not assume you are adding Groq or Cerebras systems to do the decode, but rather are using all-GPU systems.) Or, you can use the same floorspace as that monster GPU system and push through six times as many tokens – which implies six times the revenue – and use 30 percent less power than that GPU system.
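The “six times” claim follows from the two rack counts, as a quick calculation shows. The variable names are illustrative, and the inputs are simply the round numbers Anand gave.

```python
TOKENS_PER_SEC = 1_000_000_000  # target throughput quoted by Anand
GPU_RACKS = 2000                # approximate all-GPU rack count
NAPIER_RACKS = 350              # Tensordyne rack count for the same throughput

tokens_per_gpu_rack = TOKENS_PER_SEC / GPU_RACKS
tokens_per_napier_rack = TOKENS_PER_SEC / NAPIER_RACKS
density_gain = GPU_RACKS / NAPIER_RACKS

print(round(tokens_per_gpu_rack), round(tokens_per_napier_rack), round(density_gain, 2))
# prints: 500000 2857143 5.71
```

So each Tensordyne rack would carry roughly 2.9 million tokens per second against 500,000 for a GPU rack, a ratio of about 5.7 that rounds up to the six times quoted.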

Here is a more direct comparison that Tensordyne has cooked up, pitting its own systems against a combination of “Rubin” GPUs for prefill and Groq 3 LPUs for decode from Nvidia as well as AWS Trainium3 for prefill and Cerebras CS-3s for decode, all driving a 2 trillion parameter mixture of experts model:

There are some holes in the data, with AWS not yet providing pricing for its Cerebras decode service. But one thing I can tell you for sure. If Napier takes off, those prices are going to have to come way down – or Nvidia, AWS, and others are going to have to partner with Tensordyne.

This is not really a DeepSeek moment so much as it is proprietary CPUs and even first generation RISC CPUs from the likes of Sun, HPE, and IBM meeting X86 CPUs in the datacenter. And that was a freaking bloodbath, with Sun and HPE knocked out of the market and IBM hanging on by the skin of its installed base that cannot leave trillions of lines of homegrown applications so easily.

Somebody better hire some lawyers to see how tight those Tensordyne patents are. Or, get out a checkbook to buy Tensordyne before all hell breaks loose. AMD, this is a good time for you to move. Or maybe even HPE or Google. . . .

source & further reading

nextplatform.com — original article
