Tensordyne Tapes Out LNS-Based AI Chip, Claims Huge Power Advantages

wpnews.pro

AI chip startup Tensordyne has taped out its data center inference chip, which the company said will offer an order-of-magnitude improvement in power efficiency compared with leading GPU alternatives. The company said systems based on its chip can achieve 17× the tokens per second per watt of an Nvidia GB300-based system for the same workload, or 13× the tokens per second per rack (see editor’s note at the end of this article).

The two biggest challenges in the data center AI inference market are speed and cost, said Tensordyne co-founder and VP of AI, Gilles Backhus.

“Everybody wants fast AI, and everybody needs low-cost AI,” Backhus said, noting that interest in Cerebras and Groq indicates the market is willing to pay a premium for faster tokens. “That’s a hard challenge, especially as models are still getting bigger.”

Open-source models are at one trillion parameters, while closed-source models are approaching ten trillion, Backhus said.

“It’s more important than ever that the best tokens available also come cheap,” he said. “Otherwise, certain business models and certain applications will not be unlocked. We believe we have the first system that can address both within one technology.”

System performance

Tensordyne’s chip, a hardware accelerator for the company’s logarithm-based math scheme, is built on TSMC 3 nm and will consume 300 W per package. It will offer 2.1 PFLOPS (dense FP8) compute with 144 GB HBM3e.

Named after the inventor of logarithms, John Napier, Tensordyne’s 72-chip Napier server is air-cooled at 30 kW and occupies a quarter of a rack. It includes 10 TB of HBM, enough to hold a 10T model in FP4. Tensordyne is positioning this against a full-rack Nvidia Blackwell-based system (72 chips per scale-up domain was selected to ease transition from NVL72-based infrastructure, Backhus said).
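The capacity claim can be checked with simple arithmetic. The sketch below uses only figures quoted in this article (72 chips per server, 144 GB of HBM per chip, a 10T-parameter model at 4 bits per parameter); it assumes decimal units and counts weights only, ignoring any scale factors, KV cache, and activations.

```python
# Back-of-the-envelope check using figures quoted in the article.
# Assumptions: decimal units (1 TB = 1000 GB); weights only.
CHIPS_PER_SERVER = 72
HBM_PER_CHIP_GB = 144
PARAMS = 10e12        # a "10T" model
BITS_PER_PARAM = 4    # FP4

hbm_total_tb = CHIPS_PER_SERVER * HBM_PER_CHIP_GB / 1000
weights_tb = PARAMS * BITS_PER_PARAM / 8 / 1e12
headroom_tb = hbm_total_tb - weights_tb

print(f"HBM per server: {hbm_total_tb:.2f} TB")
print(f"FP4 weights:    {weights_tb:.2f} TB")
print(f"Headroom:       {headroom_tb:.2f} TB")
```

On these assumptions the weights take roughly half of a server's HBM, which leaves the remainder for everything else a deployment needs to keep in memory.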

Each air-cooled full rack of four Napier servers (288 chips) offers 608 PFLOPS of dense FP8 compute, 74 GB of SRAM, 42 TB of HBM, and consumes 120 kW.

Tensordyne’s advantage is based on math. The company uses a proprietary number system, Pareto, based on the logarithmic number system (LNS). LNS is not new, but there has not been dedicated hardware acceleration for it until now. Tensordyne’s IP is its proprietary approximation for efficient addition in the log domain and its hardware implementation.
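To see why addition is the hard part, here is a minimal sketch of textbook LNS arithmetic in Python. It is not Tensordyne's Pareto format or its proprietary approximation, neither of which is described in this article; the function names are illustrative. A value is stored as a sign and the base-2 logarithm of its magnitude, so multiplication reduces to adding exponents, while addition needs the correction term log2(1 ± 2^d), which is the piece hardware designs typically approximate.

```python
import math

def to_lns(x: float) -> tuple[int, float]:
    """Encode a nonzero value as (sign, log2|x|)."""
    if x == 0:
        raise ValueError("zero needs a special code in LNS; omitted in this sketch")
    return (1 if x > 0 else -1, math.log2(abs(x)))

def from_lns(v: tuple[int, float]) -> float:
    sign, exponent = v
    return sign * 2.0 ** exponent

def lns_mul(a, b):
    # Multiplication is cheap: multiply the signs, add the exponents.
    return (a[0] * b[0], a[1] + b[1])

def lns_add(a, b):
    # Addition is the costly operation:
    # log2(2^x + 2^y) = max(x, y) + log2(1 + 2^-(|x - y|)), with a minus
    # sign inside the log when the operands have opposite signs.
    (sa, ea), (sb, eb) = a, b
    if ea < eb:  # make `a` the operand with the larger magnitude
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    d = eb - ea  # always <= 0
    if sa == sb:
        return (sa, ea + math.log2(1.0 + 2.0 ** d))
    if d == 0:
        raise ValueError("exact cancellation gives zero; not representable here")
    return (sa, ea + math.log2(1.0 - 2.0 ** d))

product = from_lns(lns_mul(to_lns(3.0), to_lns(-4.0)))
total = from_lns(lns_add(to_lns(3.0), to_lns(5.0)))
difference = from_lns(lns_add(to_lns(3.0), to_lns(-5.0)))
print(product, total, difference)
```

This sketch evaluates the correction term exactly with `math.log2`; an accelerator would replace that step with a lookup table or some other approximation.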

“We’ve tested it on any model you can think of,” Backhus said. “This is not an approach where we ask customers to train their models in our math or calibrate it, because we understood we can’t ask customers to do that, it would be too much hassle.”

Tensordyne’s software stack handles all conversion, either hiding it entirely from the user or exposing Tensordyne’s lower-level Python-based language if required. Hyperscale customers are using a mix of PyTorch for higher-level definitions and Triton for lower-level definitions, Backhus said. AI agents can be used to convert GPU-specific code.

“You can translate from any framework to any other framework, as long as you can give the agent a couple of examples and a clear knowledge base or a wiki that it can learn from and work from,” he said. “We’ve seen this work really well for dense models, for MoEs, for basically any model.”

Tensordyne’s hardware calculates microscaling/dynamic quantization on the fly in real time (somewhat analogous to Nvidia’s transformer engine).
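As a rough illustration of what microscaling means, the sketch below quantizes values in small blocks, each with its own shared power-of-two scale chosen from the block's largest magnitude. It is a generic example, not Tensordyne's scheme: the block size of 32 and the signed 8-bit element range are assumptions made for readability, and the function names are illustrative.

```python
import math

BLOCK = 32   # assumed block size
QMAX = 127   # assumed signed 8-bit element range

def quantize_block(values):
    """Return (shared_exponent, integer_codes) for one block."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that maps the block's largest value into range.
    exponent = math.ceil(math.log2(amax / QMAX))
    scale = 2.0 ** exponent
    codes = [max(-QMAX, min(QMAX, round(v / scale))) for v in values]
    return exponent, codes

def dequantize_block(exponent, codes):
    scale = 2.0 ** exponent
    return [q * scale for q in codes]

def quantize(values, block=BLOCK):
    """Split a flat list into blocks and quantize each independently."""
    return [quantize_block(values[i:i + block]) for i in range(0, len(values), block)]

# Two blocks with very different magnitudes get very different scales.
data = [0.01 * i for i in range(-16, 16)] + [100.0 * i for i in range(32)]
blocks = quantize(data)
for exponent, codes in blocks:
    print(f"shared exponent {exponent}, max code {max(abs(c) for c in codes)}")
```

Because the scale is chosen per block, a block of small values keeps fine resolution even when a neighboring block holds much larger ones. Doing this "on the fly" means the scale is computed from the data as it passes through, rather than in an offline calibration step.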

With Tensordyne’s math system, less silicon area is used for computation, so the chip can have 256 MB of SRAM, around 5× that of a current-generation GPU. This means multiple operations can be folded into a single operation without going to HBM, resulting in performance advantages. Extra silicon area can also be dedicated to balancing accelerated compute, SRAM, and CPU area; on-chip CPUs are used for MoE routing and some decode-loop operations. The accelerator is a 48-core design sized with transformers in mind, but it can handle legacy workloads efficiently too, Backhus said.

Cell-based NoC

Tensordyne uses a patented cell-based network-on-chip (NoC) that reduces tail latency, which is particularly important when parallelizing workloads over many chips. This is a key part of how Tensordyne achieves fast decode, said Tensordyne co-founder and chief product officer RK Anand.

“Like with the internet, MoEs have small messages that are bursty and random,” he said. “This fabric, with its low-latency focus, can handle congestion extremely well, because it’s cell-based. This has proven to be a huge advantage as it allows us not to go to multi-vendor disaggregation, but stay within one silicon and one system.”

Tensordyne’s single-hop chip-to-chip latency is less than 1 microsecond.

Disaggregated hardware solutions for fast tokens are meeting some of the industry’s needs, but they are incomplete without specialized networking design, Anand said, noting that there are also operational challenges with splitting workloads across different coding environments.

“[There is also] a reliability implication, which is that […] there’s a direct relationship between the number of devices, the number of racks, and the overall effective reliability of the system,” he said.

Tensordyne’s system can do both prefill and decode “exceptionally well” for the largest models in the world, Anand said. A typical setup for a 2T MoE model might use one of the four 72-chip servers in a rack for prefill and the other three for decode, achieving 1,300 tokens per second per user, he said. Tensordyne’s cost per million tokens in this scenario is around $11, an order of magnitude less than next-generation multi-architecture disaggregated solutions, according to the company’s figures.

Tensordyne partnered with HPE Juniper on the system’s scale-up interconnect and chassis. Chips inside a server are connected via vertical PCBs holding networking switches; avoiding cables improves reliability, Backhus said.

Two 200 Gb/s links per compute tray handle connectivity to wider data center networks via Ethernet. Each tray can hold 8 TB of hot context or KV cache in an NVMe SSD.

Napier can also run many models in parallel for agentic workloads, Backhus said, within a single scale-up domain (chips in the same scale-up domain can talk to each other faster via RDMA).

“Communicating between multiple models within one agentic stack is now possible,” he said. “This can massively reduce the tail latency and the speed at which agents operate, so multiple users, multiple workers, multiple models can coexist on a single [Napier].”

Systems are due to start shipping by Q2 2027, with a development cloud up and running for remote performance characterization by the end of 2026.

Editor’s note: Tensordyne’s simulations suggest its rack-scale systems will achieve 3 million tokens per second per megawatt, versus 183,000 for racks of NVL72-GB300s, based on public benchmark figures from InferenceX. The same figures show a single Tensordyne rack’s throughput at 363,000 tokens per second per rack, with an NVL72-GB300 rack at 27,400. This is for DeepSeek-R1-670B at the same high interactivity (210 tokens/second/user) at FP4.
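The ratios behind the headline claims can be recomputed from the rounded figures in this note. The short Python sketch below does only that; the variable names are illustrative, and the implied rack power values are derived here rather than stated by either company.

```python
# Rounded figures quoted in the editor's note.
TD_TOKENS_PER_S_PER_MW = 3_000_000
NV_TOKENS_PER_S_PER_MW = 183_000
TD_TOKENS_PER_S_PER_RACK = 363_000
NV_TOKENS_PER_S_PER_RACK = 27_400

# Ratios corresponding to the per-watt and per-rack claims.
efficiency_ratio = TD_TOKENS_PER_S_PER_MW / NV_TOKENS_PER_S_PER_MW
throughput_ratio = TD_TOKENS_PER_S_PER_RACK / NV_TOKENS_PER_S_PER_RACK

# Rack power implied by dividing per-rack throughput by per-megawatt throughput.
td_rack_kw = TD_TOKENS_PER_S_PER_RACK / TD_TOKENS_PER_S_PER_MW * 1000
nv_rack_kw = NV_TOKENS_PER_S_PER_RACK / NV_TOKENS_PER_S_PER_MW * 1000

print(f"tokens/s per MW ratio:   {efficiency_ratio:.1f}x")
print(f"tokens/s per rack ratio: {throughput_ratio:.1f}x")
print(f"implied rack power: Tensordyne {td_rack_kw:.0f} kW, NVL72-GB300 {nv_rack_kw:.0f} kW")
```

With these rounded inputs the per-rack ratio lands close to the quoted 13×, and the implied Tensordyne rack power is close to the 120 kW stated earlier. The per-megawatt ratio comes out a little under the quoted 17×; presumably the headline figure rests on unrounded values, though the article does not say so.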
