Gemini Flash Lite transcription of the HBM explainer video

Here is the English transcription of the video @ https://x.com/RJDAIGOGO/status/2068160949606133955, which explains the technology behind High Bandwidth Memory (HBM).

00:00 This is the mainboard of a consumer gaming graphics card, the 5090. The most eye-catching part is the GPU chip in the middle.

00:06 Surrounding the GPU chip, we can see a ring of VRAM modules. The instructions and data needed for GPU operations, as well as temporary results, are stored here.

00:14 Now, this is a node from an NVIDIA GB300 AI data center. This node contains four GPUs, but if you look at their surroundings, you won't see any VRAM modules like the ones on the 5090. Does this mean AI server GPUs don't need VRAM to work?

00:27 Certainly not. Look closely. Beside these four GPUs, there are two small rectangular frames. These are actually the VRAM, but they aren't placed separately like on a consumer card; they are packaged together with the GPU. This special type of memory is the main character of today's video: HBM (High Bandwidth Memory).

00:43 It’s an advanced memory chip that relies on 3D stacking technology to achieve high bandwidth and high density. It is currently a core necessity for high-end AI graphics cards, and at the same time, it’s the culprit behind the recent rise in memory prices. So, what is the principle of HBM, and how is it made? Let’s find out in this video.

01:05 HBM: High Bandwidth Memory. Bandwidth essentially refers to the speed at which data is transmitted; here, "high bandwidth" just means the data transmission is very fast.

01:15 A memory module and a GPU are connected via wires, and data is transmitted using voltage signals. For example, if you want to transmit the data "01010," we can use high voltage to represent 1 and low voltage to represent 0. So, during transmission, the voltage on the wire goes: Low, High, Low, High, Low.
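A minimal sketch of the encoding just described, assuming the simple two-level scheme from the video (the function name `bits_to_levels` is hypothetical, and real memory interfaces use more elaborate signaling):

```python
def bits_to_levels(bits: str) -> list:
    """Map each bit to a voltage level name: '1' -> 'High', '0' -> 'Low'."""
    mapping = {"0": "Low", "1": "High"}
    return [mapping[b] for b in bits]


levels = bits_to_levels("01010")
print(", ".join(levels))  # prints: Low, High, Low, High, Low
```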

01:34 If we want to increase the transmission speed, we can shorten the duration of each voltage signal, so the same "Low-High-Low-High-Low" sequence plays out in less time. This means increasing the transmission frequency.

01:48 Besides increasing the frequency, we can also add more wires to transmit data in parallel, which also increases the speed. This method is called increasing the "bus width."

01:58 Bandwidth = Frequency × Bus Width. So, to increase bandwidth, you must either increase the frequency, increase the bus width, or increase both.
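The formula can be tried out with a small sketch. The numbers below are purely illustrative and not taken from the video, and "frequency" is assumed here to mean the number of bits each wire carries per second:

```python
def bandwidth_gbit_per_s(per_wire_gbit_per_s: float, bus_width_bits: int) -> float:
    """Bandwidth = per-wire rate (frequency) x number of data wires (bus width)."""
    return per_wire_gbit_per_s * bus_width_bits


# Illustrative values only: 2 Gbit/s per wire over 32 wires.
base = bandwidth_gbit_per_s(2.0, 32)
faster = bandwidth_gbit_per_s(4.0, 32)  # double the frequency
wider = bandwidth_gbit_per_s(2.0, 64)   # double the bus width

print(base, faster, wider)  # prints: 64.0 128.0 128.0
print(base / 8)             # same base figure in GB/s: 8.0
```

Doubling either factor doubles the result, which is why a design can trade a higher frequency for a wider bus.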

02:07 However, with VRAM, increasing the frequency causes many problems. The process of the voltage signal changing from low to high (or vice-versa) is not instantaneous; it takes time. We must wait for the voltage signal to be fully "up" or "down" before starting the next cycle. Therefore, there is a clear upper limit to frequency. Furthermore, increasing the frequency inevitably increases interference between each data wire.
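A rough way to see this limit, under a deliberately simplified model that is an assumption of this sketch rather than something stated in the video: each bit needs time for the voltage to transition plus some time to sit stable, so the per-wire rate cannot exceed one bit per that combined time. The picosecond figures are hypothetical.

```python
def max_rate_gbit_per_s(transition_ps: float, stable_ps: float) -> float:
    """Upper bound on per-wire rate when each bit needs a transition plus a stable window.

    One bit per picosecond equals 1000 Gbit/s, hence the factor of 1000.
    """
    bit_time_ps = transition_ps + stable_ps
    return 1000.0 / bit_time_ps


# Hypothetical timings: 50 ps to swing the voltage, 50 ps held stable.
print(max_rate_gbit_per_s(50, 50))  # prints: 10.0
# Even with no stable window at all, the transition time alone caps the rate.
print(max_rate_gbit_per_s(50, 0))   # prints: 20.0
```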

02:30 Thus, HBM chooses to increase bandwidth through extremely high bus width.

02:34 A regular GDDR7 memory module, like the ones soldered onto the 5090, has a bus width of 32 bits. A fully populated 5090 supports 16 memory modules, so the total bus width can reach 512 bits.

02:46 For HBM3E memory, the single-module bus width can reach 1024 bits. This B300 GPU has eight HBM3E modules around it, so the total bus width reaches 8192 bits. Now, the question is: How does HBM achieve such a high bus width?
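The arithmetic behind these totals, using only the module counts and per-module widths quoted in the video (the helper name `total_bus_width` is hypothetical):

```python
def total_bus_width(modules: int, bits_per_module: int) -> int:
    """Total bus width is the per-module width times the number of modules."""
    return modules * bits_per_module


gddr7_total = total_bus_width(16, 32)   # 5090 with 16 GDDR7 modules
hbm3e_total = total_bus_width(8, 1024)  # B300 with 8 HBM3E modules

print(gddr7_total)                 # prints: 512
print(hbm3e_total)                 # prints: 8192
print(hbm3e_total // gddr7_total)  # prints: 16
```

So the HBM3E configuration has a bus 16 times wider, even though it uses half as many modules.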

03:03 As we mentioned, memory modules and the GPU transmit data through wires; adding more wires increases the bus width. 1 bit of bus width corresponds to one data wire. A regular GDDR7 has a bus width of 32 bits, meaning there are 32 wires between that module and the GPU.

03:22 Of course, in reality, you need more wires—for power, address, clock signals, etc. So the actual number of wires is far more than 32. GDDR7 has 266 wires, and older GDDR6 had 180. That is just with a bus width of 32 bits.

03:39 HBM's bus width can reach 1024 bits, which means just for data transmission, you need 1024 wires. Plus the power, address, and other supporting wires, the total count for HBM3E is 3982 wires. So, the first challenge for HBM is how to get roughly 4000 wires into such a small space.
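To put the wire counts side by side, here is a small sketch using only figures quoted in the video (wires per module, data bits per module, and modules per GPU). The dictionary layout and the `summarize` helper are hypothetical:

```python
# Figures as quoted in the video; the structure itself is just for illustration.
interfaces = {
    "GDDR7": {"data_bits": 32, "total_wires": 266, "modules": 16},
    "HBM3E": {"data_bits": 1024, "total_wires": 3982, "modules": 8},
}


def summarize(spec: dict) -> tuple:
    """Return (total wires per data bit, total wires across all modules on one GPU)."""
    wires_per_bit = spec["total_wires"] / spec["data_bits"]
    wires_per_gpu = spec["total_wires"] * spec["modules"]
    return wires_per_bit, wires_per_gpu


for name, spec in interfaces.items():
    per_bit, per_gpu = summarize(spec)
    print(f"{name}: {per_bit:.2f} wires per data bit, {per_gpu} wires per GPU")
# prints:
# GDDR7: 8.31 wires per data bit, 4256 wires per GPU
# HBM3E: 3.89 wires per data bit, 31856 wires per GPU
```

By these numbers, a GPU with eight HBM3E stacks has to route about 7.5 times as many wires as a 5090 with sixteen GDDR7 modules.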

04:02 Let's look at how regular GDDR VRAM is made. First, the core circuit is made on a thin silicon wafer. However, this wafer is very fragile, and the connection points are very small, which is unfavorable for subsequent processing and use.

04:16 So, a substrate is prepared. This substrate is made of glass fiber woven into a sheet and cured with epoxy resin, making it very sturdy.

04:24 The finished wafer is flipped and attached to the substrate ("flip-chip"). The connection points on the wafer surface directly touch the substrate. The substrate has multiple internal layers to route signals to its back, where solder balls are placed. This protects the wafer and makes it easier to solder.

04:47 This type of substrate is made by soaking glass fiber cloth in epoxy resin, heating it to form a hard board, and then attaching copper foil to both sides. To make the circuits, photosensitive film is applied, a mask is placed on top, and it is exposed to ultraviolet light. The areas exposed to light change, and a developer is used to wash away the unexposed parts, revealing the copper. Then, the board is soaked in an etchant; the exposed copper is dissolved, while the copper covered by the film remains. That’s how the circuit is made.

05:22 A typical packaging substrate has 4 to 8 layers, stacked like a mille-feuille, and laser-drilled holes connect the circuits of different layers.

05:30 However, the wire density is severely limited. The surface of glass fiber is uneven, making it difficult to attach the film perfectly. If lines are too fine, air bubbles can easily cause them to break. Also, chemical etching not only goes downwards but also sideways, which can hollow out the sides, leading to broken lines.

05:58 Because of these manufacturing limits, the wire density of the substrate cannot be very high. Also, the VRAM needs to be soldered onto the PCB, and its manufacturing process is similar, so the wire density there is also limited.

06:17 So, for HBM, which requires 4000 wires, what should be done? Glass fiber substrates aren't good enough; the surface is too rough. We need a brand-new material: silicon. That's right, the same silicon that chips are made from using lithography.

06:31 We make a large piece of silicon. This silicon's surface is polished to be extremely smooth. Using photolithography and etching, we open precise grooves in the silicon surface, then use deposition to fill them with copper to form the circuits.

06:44 HBM chips no longer need a substrate; they can be soldered directly onto this silicon. The GPU chip is also soldered directly to it. This way, the HBM can be interconnected with the GPU through this large silicon piece. This piece, responsible for the "bridging" of the wires, is called the "Silicon Interposer."

07:01 The silicon interposer is essentially a chip without transistors. Its surface is smoother than a mirror, so chip manufacturers can use mature lithography, etching, and thin-film deposition processes to make extremely fine metal lines on it.

07:13 The width of these lines is no longer limited to the ten-plus micrometers typical of a glass fiber substrate but can easily go under 1 micrometer, or even down to hundreds of nanometers. In the same area, a silicon interposer can hold dozens or even a hundred times more lines than a glass fiber substrate.
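A back-of-the-envelope sketch of why narrower lines mean more of them. It assumes a single routing layer and a gap between lines equal to the line width; both are simplifying assumptions of this sketch, not figures from the video, and the 0.5 micrometer value is just one example of "under 1 micrometer":

```python
def lines_per_mm(line_width_um: float, spacing_um: float) -> float:
    """Parallel lines that fit across 1 mm (1000 um) at a given width and spacing."""
    pitch_um = line_width_um + spacing_um
    return 1000.0 / pitch_um


# Assumed example values: 10 um lines on glass fiber, 0.5 um lines on silicon,
# each with spacing equal to the line width.
substrate = lines_per_mm(10.0, 10.0)
interposer = lines_per_mm(0.5, 0.5)

print(substrate)               # prints: 50.0
print(interposer)              # prints: 1000.0
print(interposer / substrate)  # prints: 20.0
```

Finer lines than this example would push the ratio further toward the larger multiples mentioned above.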

07:30 Thus, through the interconnection of the silicon interposer, HBM's wire density is greatly increased, solving the transmission speed problem.

07:38 Besides speed, AI chips have a tougher requirement for VRAM: Capacity. Looking at HBM and regular GDDR, their surface area is similar. But what is the capacity of a single GDDR chip? For the GDDR7 used on the 5090, the capacity of one chip is typically 2GB. Even with 16 of them, the total is only 32GB. That’s enough for gaming, but for AI, it’s too little. How do we increase the capacity? The answer is stacking.

08:11 By stacking multiple raw, unpackaged memory chips together, they can share the silicon interposer underneath. This increases the total capacity without needing to expand the footprint on the board. However, stacking this many chips is no easy task.

08:25 Stacking a dozen memory chips is easy to say, but how does the top chip communicate with the bottom one? Pulling wires from the side and connecting to the bottom? Impossible. Every layer would need thousands of wires; the silicon interposer wouldn't have enough space. Also, wires would have to be incredibly fine; even gold couldn't handle such thin, unsupported wires.

08:49 The solution engineers came up with is drilling holes in the silicon and inserting copper pillars. This way, each layer connects to the copper pillars, which in turn connect to the silicon interposer below. This technology is called TSV (Through-Silicon Via).

09:05 Of course, achieving TSV is not easy. First, we need to drill through the silicon. The hole diameter is about 5 to 10 micrometers, so using a drill bit is out of the question. The industry uses a technology called Deep Reactive Ion Etching (DRIE), most commonly in the form known as the "Bosch process."

09:24 After creating the holes, we use Chemical Vapor Deposition (CVD) to deposit an insulating layer of silicon dioxide on the hole walls. This layer is very dense, like a ceramic pipe for the copper pillar, completely isolating the copper from the silicon.

09:31 But that’s not enough. Copper atoms are "restless"—over time, they diffuse into the silicon, destroying its crystal structure. So, we need to add a barrier layer between the insulating layer and the copper, typically Tantalum or Tantalum Nitride. This barrier layer is only a few nanometers thick but effectively stops copper diffusion.

09:51 Finally, to ensure the copper fills the hole properly, we need a seed layer. Then, the silicon is placed in a copper sulfate solution for electroplating. Copper ions deposit on the bottom and walls of the hole, eventually filling it completely.

12:19 Now we have a single-layer chip. Next, we expose the copper, use lithography and electroplating to grow tiny copper bumps, and cover them with a thin layer of tin. These are the "micro-bumps."

12:33 We can then stack multiple chips and apply heat and pressure. The tin on the micro-bumps melts and solders them together. This way, every layer is connected by copper pillars and micro-bumps. Data from the top layers can travel all the way down to the silicon interposer.

12:49 Now, the core storage function of the HBM is formed, but the chip is still very fragile and cannot be used yet.

12:58 We have thinned each chip to a thickness of only dozens of micrometers. Supported only by the micro-bumps, they are like dried seaweed—the slightest external force will shatter them. The solution is to fill the gaps between these layers to prevent them from being "unsupported." There are two main approaches in the industry for this.

13:21 Currently, only three companies can mass-produce HBM: SK Hynix, Samsung, and Micron.

13:28 SK Hynix's approach is to stack all chips at once, heat them to solder all micro-bumps simultaneously, and then put them into a mold to inject liquid epoxy resin, allowing it to flow into the gaps between the chips. Then, heat and pressure are applied to cure the resin.

13:47 Samsung and Micron place a polymer film between each pair of layers. Heat and pressure melt the film, the micro-bumps solder together, and the film cures upon cooling. Then, they add the next layer and repeat.

13:59 SK Hynix’s approach has the edge: it’s much faster, and because there's nothing else "causing trouble" between chips during soldering, the yield is much higher.

14:12 HBM consists of multiple chips stacked together, so heat dissipation is a big issue. To improve this, manufacturers add extra "dummy" holes. These extra holes aren't for data, but specifically for heat dissipation. Additionally, SK Hynix mixes particles with high thermal conductivity into the epoxy resin, turning it into something like thermal paste for better cooling.

14:36 SK Hynix's HBM dissipates heat about twice as efficiently as Samsung's, which is why SK Hynix became the "big player" in the HBM market, with a market share of around 50%.

14:45 Through the silicon interposer, HBM can achieve a bus width of 1024 bits, and through the stacking process, HBM can achieve super-large capacity.

14:54 In this GB300 server node, each HBM chip stacks 12 storage dies, reaching 36GB. A single GPU with 8 such chips reaches 288GB, and with four GPUs, the total capacity is 1152GB—over 1TB. For a server node like this, how much is it worth? Answer: 2 million RMB.
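The capacity figures follow from simple multiplication. This sketch uses only numbers quoted in the video, including the earlier 16 × 2GB GDDR7 figure for the 5090; the constant names are hypothetical:

```python
# HBM3E figures for the GB300 node, as quoted in the video.
GB_PER_STACK = 36
DIES_PER_STACK = 12
STACKS_PER_GPU = 8
GPUS_PER_NODE = 4

gb_per_die = GB_PER_STACK / DIES_PER_STACK  # capacity of one stacked die
gb_per_gpu = GB_PER_STACK * STACKS_PER_GPU
gb_per_node = gb_per_gpu * GPUS_PER_NODE

# GDDR7 figures for the 5090: 16 chips of 2GB each.
gddr7_gb = 16 * 2

print(gb_per_die)             # prints: 3.0
print(gb_per_gpu)             # prints: 288
print(gb_per_node)            # prints: 1152
print(gb_per_gpu / gddr7_gb)  # prints: 9.0
```

One such GPU carries nine times the memory of the 5090 while using half as many memory packages.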

15:16 Currently, the HBM used on a large scale is HBM3E, and companies have already started producing HBM4. HBM4's bus width will reach 2048 bits, and the stack count will reach 16 layers. This means HBM4 will have even denser wiring and a higher stack.

15:35 HBM4 still uses the micro-bump solder technology, but this might be the last generation. Denser wiring will make the spacing between micro-bumps smaller and smaller, and there’s a high risk of bridging during soldering. So the future development direction of HBM is to eliminate micro-bumps and directly push the polished copper together, letting the copper surfaces diffuse and fuse into a whole. This way, there's no need to fill epoxy resin, allowing for even denser wiring, lower height, and better heat dissipation.

16:05 It is not clear when this technology will be used. A few years ago it was planned for HBM4, but it hasn't been adopted yet. Currently, all the top companies are actively laying out their plans, so let’s wait and see.

16:18 Alright, that’s all for today’s video. HBM has effectively facilitated the development of AI, providing key computing infrastructure. But its production process and equipment heavily overlap with traditional DRAM, which has taken up a lot of production capacity, leading to a shortage and a significant price increase for ordinary DRAM. So, are you optimistic about the future of HBM? Or, are you optimistic about the future of AI? See you next time.

source & further reading

gist.github.com: original article