SANTA CLARA, Calif. — When EE Times visited Tenstorrent CEO Jim Keller’s office a year ago, the whiteboard outside his office door read: “We’re going to WIN!” When we returned a year later, the message had changed to: “Holy Shit, That’s Fast!”
In the aftermath of Tenstorrent’s TT-Deploy event, where the company showed initial demonstrations of what its chips can do when deployed at scale, Keller told EE Times that Tenstorrent can beat the performance of both GPUs and more specialized AI hardware with its BlackHole Galaxy server.
Keller believes AI inference is ultimately a networking and memory problem, and that Tenstorrent’s architecture is now proving that at scale.
At TT-Deploy, the company demonstrated performance across a range of workloads. For example, 16 Tenstorrent Galaxy servers (512 chips) can run inference on DeepSeek-671B at up to 350 tokens per second per user at batch 32.
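As a rough sanity check on those figures, the arithmetic can be sketched as below. The aggregate number assumes that "batch 32" means 32 concurrent users each receiving the quoted per-user rate; that reading is an assumption, and 350 is an "up to" figure, so the results are upper bounds rather than measured throughput.

```python
# Figures quoted in the article.
SERVERS = 16
CHIPS = 512
TOKENS_PER_SEC_PER_USER = 350  # "up to" figure
BATCH = 32

# Chips per Galaxy server implied by the article's totals.
chips_per_server = CHIPS // SERVERS

# ASSUMPTION: batch size equals the number of concurrent users,
# so aggregate throughput is per-user rate times batch.
aggregate_tokens_per_sec = TOKENS_PER_SEC_PER_USER * BATCH
tokens_per_sec_per_chip = aggregate_tokens_per_sec / CHIPS

# Cross-check against the 96-Galaxy order mentioned later in the article.
assert 96 * chips_per_server == 3072

print(f"chips per server: {chips_per_server}")
print(f"aggregate tokens/s (upper bound): {aggregate_tokens_per_sec}")
print(f"tokens/s per chip (upper bound): {tokens_per_sec_per_chip}")
```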
Tenstorrent’s fast tokens are a direct result of its ability to easily split large tensors across hundreds of chips, Keller said. Galaxy boxes have 56 Ethernet ports per box, while GPU servers might have eight external ports per server.
Keller invokes Rent’s Rule, developed at IBM in the 1960s, which states that the I/O required by a block of logic grows sub-linearly with the amount of logic; in practice, this means compute area grows faster than the available beachfront for communication. This is often a fatal flaw for other architectures, he said.
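Rent’s Rule is commonly written as T = t · g^p, where T is the number of external terminals, g is the number of logic gates, t is the average number of terminals per gate, and p is the Rent exponent (less than 1). The sketch below shows how I/O per gate shrinks as a block of logic grows. The values chosen for t and p are illustrative assumptions, not figures from Keller or Tenstorrent.

```python
def rent_terminals(gates: float, t: float = 4.0, p: float = 0.6) -> float:
    """Estimate external terminals with Rent's Rule: T = t * g**p.

    t and p are illustrative defaults (assumptions), not measured values.
    """
    if gates <= 0:
        raise ValueError("gates must be positive")
    return t * gates ** p


# Grow the logic block 100x at each step and watch terminals per gate fall.
for gates in (1e6, 1e8, 1e10):
    terminals = rent_terminals(gates)
    print(f"{gates:.0e} gates: {terminals:,.0f} terminals, "
          f"{terminals / gates:.2e} terminals per gate")
```

With any exponent below 1, a 100x increase in logic yields less than a 100x increase in terminals, which is the "beachfront" squeeze the rule describes.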
“There are no new laws,” he said. “The fundamentals of AI computation are rooted in HPC from the 1970s which have been well understood for decades.”
Successful AI infrastructure still comes down to balancing compute, memory, and I/O, he said.
“AI is mostly matrix computation and non-linear vector operations, and then to make it run fast, you need sufficient SRAM to hold the data and results for computation, and a buffer for data to move between memory, tensor processors and chips, which we have,” he said. “If you make the memory way too big it doesn’t help very much, and if it’s too small it’s really bad.”
Tenstorrent competitor Cerebras, under scrutiny for large-model performance following a hugely successful IPO, released performance figures for Kimi K2.6 (1T). This is the biggest model it has tackled publicly so far; Cerebras said it can hit 981 tokens per second on its CS-3 hardware.
According to Keller, Tenstorrent can beat this performance with large deployments of its BlackHole Galaxy servers at a fraction of the hardware cost.
“The Cerebras [IPO and subsequent valuation] was helpful, especially as we’re going to beat them on everything,” he said. “Challenge accepted!”
Disaggregated inference
Market leader Nvidia has licensed technology from Groq to accelerate the decode portion of LLM inference in a technique known as disaggregated inference. For every rack of Groq chips handling decode, three racks of Nvidia CPUs and GPUs are required: roughly one for prefill and two to hold the enormous KV cache.
Tenstorrent does not need any additional steps for fast decode, Keller said.
“I am often asked how we handle the KV cache,” he said. “It’s in the DRAM on the same chips as the decode, we don’t even think about it. We’re really good at that.”
The key is that Tenstorrent can connect arbitrary numbers of tensor processors together, Keller said. With enough chips, tensors will fit entirely into SRAM, but if the number of chips is not sufficient, the data can be streamed in from DRAM at the expense of some performance. Architectures without any DRAM, like Groq and Cerebras, cannot do this, he pointed out.
“They can scale to big models, they just need a lot of hardware,” he said. “Our answer is that even relatively modest-sized hardware can run big models, but if you want super-fast token rates, we can move the token rate anywhere we want.”
Could Tenstorrent hardware be used alongside GPUs for decode acceleration, similar to Nvidia’s disaggregated architecture?
“We have a customer who is using Galaxy to accelerate the GPUs they bought,” Keller said. “We have a PCIe card with our BlackHole chip on it, and we use Layer 2 Ethernet for transport, so it was pretty easy to hook up.”
The customer doubled or tripled their token rate using this method, Keller said.
“If they had bought only Tenstorrent in the first place, it would have been cheaper, because we can do prefill as well, and it’s cleaner,” he said. “But [the customer] had already bought the GPUs and they wanted to leverage their investment.”
Productizing this idea is currently a “maybe,” Keller added.
Workload co-design
The perception that hyperscalers and frontier labs have an advantage in hardware design as they are vertically integrated (i.e., they know their workloads intimately so they can co-design chips and models) may have been overstated, Keller said. Tenstorrent, like other companies, has optimizations for some popular non-linear functions in its hardware, but these can be tweaked in successive generations of silicon when needed.
The important things at the chip scale are building for large models, getting precision right, and properly dealing with both huge KV caches and compute-bound workloads like diffusion, Keller said.
“So far, everything works fine if you have a balance of DRAM, SRAM, computation, matrix-vector, and a NoC—Rent’s Rule seems to be solid,” he said.
Another old rule becoming applicable in new ways is Amdahl’s Law, which states that the overall speedup of a workload is limited by the portion that cannot be accelerated.
“Agentic computing is an Amdahl’s Law problem,” Keller said. “AI took an outrageous amount of compute, so CPUs would send the AI task and wait around for it to finish… agentic has started driving CPU demand because AI finally got fast enough to be bottlenecked by the scalar part of the problem.”
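Amdahl’s Law is usually written as speedup = 1 / ((1 − f) + f / s), where f is the fraction of the work that can be accelerated and s is the speedup applied to that fraction. The sketch below illustrates Keller’s point under an assumed split: the 90% AI share of an agentic task is a hypothetical number chosen for illustration, not a figure from the article.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when a fraction of the work is sped up by `factor`."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / factor)


# ASSUMPTION: 90% of an agentic task is AI inference, 10% is scalar CPU work.
AI_FRACTION = 0.9

for ai_speedup in (10, 100, 1000):
    overall = amdahl_speedup(AI_FRACTION, ai_speedup)
    print(f"AI portion {ai_speedup}x faster: overall {overall:.2f}x")
```

However fast the AI portion becomes, the overall speedup in this example can never exceed 10x, because the remaining 10% of scalar work sets the ceiling. That is why faster inference shifts the bottleneck to the CPU.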
Aiming for IPO
Keller declined to comment on reported takeover bids from companies including Intel and Qualcomm, confirming only that he has indeed met with the CEOs of both companies, as well as all the major hyperscalers, in order to pitch them Tenstorrent’s hardware IP.
“I’m hoping to get a big deal out of one of those guys, because our RISC-V CPU IP is great,” he said. “One of the hyperscalers is also looking at our AI IP for a small chip.”
While hyperscalers have developed their own big chips for AI, smaller AI chips like those used in edge devices cannot just use a cut-down version of the same IP, Keller said. Tenstorrent’s AI IP is designed to be scalable, and it has been fully productized (it comes with everything needed to scale from, say, one to 1,000 cores, Keller said).
The two big exits for Tenstorrent’s startup competitors in the last six months have been an (effective) acquisition and an IPO. Tenstorrent is aiming to IPO, Keller confirmed, and is building out its supply chain and international presence with that in mind.
“Right now our investors are very hot on IPO,” he said.
Does Tenstorrent’s potential as a decode accelerator necessarily make it an attractive acquisition target for a GPU company? Keller said some kind of strategic deal or joint go-to-market is more likely.
Both sovereign infrastructure and the big frontier labs want to control their own destiny when it comes to hardware and software, he said. “Lots of things could happen,” he added.
Following TT-Deploy, Tenstorrent has received orders for its hardware, Keller said, with the biggest purchase order being for a 96-Galaxy cluster to be shipped outside the U.S. (96 Galaxies is 3,072 Blackhole chips). Tenstorrent’s biggest customer to date remains AI& in Japan, whose CEO is former Tenstorrent executive David Bennett.
“Some of what happened is a bunch of people had $100-million orders with Nvidia, but Nvidia won’t ship for a year, so they’ve taken a $20-million Tenstorrent machine because it’s a lot cheaper,” Keller said. Tenstorrent is in the process of building 1,000 Galaxy servers, at least half of which have already been sold, he said.
“Our stuff is working pretty good, we have ten customers with Galaxies on site, we’re past the proof-of-concept stage,” Keller said. “We’re starting to get follow-on orders… I want to get ten happy customers, and then 20, and then 30.”
Read also:
[Tenstorrent Previews Large Compute Cluster, Generates Video Faster Than Real Time](https://www.eetimes.com/tenstorrent-previews-large-compute-cluster-generates-video-faster-than-real-time/)
[Tenstorrent Unveils Next-Gen Servers for Fast Tokens, No Disaggregation Needed](https://www.eetimes.com/tenstorrent-unveils-next-gen-servers-for-fast-tokens-no-disaggregation-needed/)