{"slug": "sophon-pfg-1-a-monolithic-3d-ai-asic-with-330-gb-of-on-die-dram-and-no-hbm", "title": "Sophon PFG-1: a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM", "summary": "Sophon PFG-1, a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM, delivers 2,100 TFLOPS BF16 and 4,200 TFLOPS FP8 on a 750 mm² die, achieving up to 174× the tokens-per-watt of NVIDIA Rubin or AMD Instinct MI455X for 80B-parameter models. The chip eliminates off-die memory bottlenecks, offering 191–214× the weight bandwidth of HBM4 packages and reducing hardware BOM by ~10× compared to HBM-based systems.", "body_md": "**Revision 4.1 · June 2026**\n\n**PFG-1 \"Sophon\"** is a unified training-and-inference die on a 750 mm², 32-tier 2D\nTransition-Metal Dichalcogenide (TMD) Monolithic 3D (M3D) platform. Weights, gradients, and optimizer state\nreside in on-die 2T0C 2D-TMD gain-cell DRAM; because the array is fully read-write, the same silicon executes\nBF16 forward/backward training passes and serves low-batch decode at the compute-bound rate.\n\nCompute is **pure digital Compute-In-Memory (CIM)**: each 256×256 DRAM subarray tile pairs a\nbinary sense amplifier with an 8-level adder tree, driven by a 500 MHz bit-serial activation broadcast. At\n131,072 tiles/die this yields **4,200 TFLOPS FP8** and **2,100 TFLOPS BF16** in a\n7.5 cm² footprint.\n\nThe die is built on a 28 nm Si Complementary Metal-Oxide-Semiconductor (CMOS) base tier, a 32-tier 2D-TMD CMOS\nMAC stack, and a Monolithic Inter-tier Via (MIV) fabric [[5]](#ref-5)[[6]](#ref-6)[[7]](#ref-7), with the 2T0C DRAM module\nembedded at the Back-End-Of-Line (BEOL) Metal-3 layer of each memory tier. 
The die stack cross-section is\nshown in **Figure 1**.\n\n| PFG-1 \"Sophon\" | |\n|---|---|\n| Memory | 2T0C 2D-TMD gain-cell DRAM |\n| Compute paradigm | Pure digital CIM (sense amp + adder tree) |\n| Target workload | Training (fwd + bwd + optimizer) and inference (decode + prefill) |\n| Capacity | 330 GB |\n| Compute | 2,100 TFLOPS BF16 (4,200 TFLOPS FP8 inference mode / 8,400 TOPS INT8) |\n| Energy / MAC | 0.620 pJ (BF16 fwd) / 0.940 pJ (fwd + bwd) / 0.310 pJ (FP8 inference) |\n| Peak efficiency | 3.72 TFLOPS/W (BF16 training avg.) |\n| Tokens per watt | 38.7 tokens/s per W (80B FP8 decode, 373 W), ~ 174× an NVIDIA Rubin (R200) or AMD Instinct MI455X at low batch (~ 0.22 tokens/s per W, HBM4-bound) |\n| Active power | ≈ 379 W fwd / ≈ 749 W bwd (~ 564 W training avg.); 373 W FP8 decode |\n| 80B model perf. | 2,406 tokens/s training, 0.23 J/tok; 7,219 tokens/s BF16 decode (14,438 tokens/s FP8 mode), 25.8 mJ/tok |\n| 80B + INT4 + speculative (FP8 mode) | 72,188 tokens/s effective |\n| BOM | $8,358 |\n\nSophon eliminates off-die High-Bandwidth Memory (HBM) entirely. For 80B-parameter BF16 training it fits\nweights + first-order optimizer state fully on-die with ~ 10 GB of activation headroom for\ngradient-checkpointed micro-batches; for inference it serves an 80B model at\n**7,219 tokens/s** in native BF16 or the full **14,438 tokens/s** in FP8 mode,\nmaking it a single train-then-serve part that can be elastically repartitioned between training and serving\nwithout changing hardware. Against an NVIDIA Rubin (R200) and an AMD Instinct MI455X (both 2026 HBM4 parts), Sophon delivers\n**~ 2.7–3.1× higher** 80B batch-1 training throughput per die and **~ 48–53×** higher\nsingle-stream FP8 decode throughput, because both GPUs at low batch are HBM-bandwidth-bound at their HBM4\nlimits (Rubin 22 TB/s, MI455X 19.6 TB/s). 
Peak dense FLOPS favor the GPUs — Sophon BF16 dense is only ~ 0.21–0.24×\ntheir peak — but peak FLOPS do not help at low batch, where weight-memory bandwidth governs.\n\nThe architecture delivers **~ 191–214×** the weight bandwidth of an HBM4 package (191× vs Rubin,\n214× vs MI455X) — a gap no HBM roadmap closes (Section 7).\n\nThe economics follow directly: Morgan Stanley puts a single NVIDIA VR200 (Rubin) NVL72 rack at\n**≈ $7.8M** — HBM memory alone ≈ $2.0M (25.7% of the rack, +435% over GB300). Sophon\neliminates that line item, for a **~ 9.9× / 11.6× lower hardware BOM** than a Rubin / MI455X\n[[17]](#ref-17).\n\nModern AI accelerators face a memory wall on both workloads they must serve:\n\n**Inference** is *read-dominated*. The model weights are fixed at deployment; every decode\nstep reads the full weight tensor once per generated token. The key metrics are read energy per bit, idle\nleakage (the model must stay resident between requests), and weight-fetch bandwidth at low batch. Conventional\nHigh-Bandwidth Memory (HBM) is bandwidth-bound at low batch: every token's MAC traffic serializes through the\n~ 22 TB/s (Rubin) / 19.6 TB/s (MI455X) HBM4 path, and a 288–432 GB HBM4 subsystem draws ~ 10–15 W in self-refresh just to keep the model\nresident.\n\n**Training** is *read-write symmetric*. Every forward pass reads weights; every backward\npass writes gradient updates; the optimizer updates weights in place each step. In-place writability, low\nwrite energy, and capacity for both weights *and* optimizer state are critical. A non-volatile\ninference-only memory cannot train — for example, Single-Level Cell (SLC) Resistive RAM endurance caps at ~10⁶\ncycles, while training an 80B model requires ~10¹⁰ write cycles per parameter.\n\nA **2T0C 2D-TMD gain-cell DRAM** solves both problems with one cell. It exploits the anomalously\nlow off-current density (Joff ≈ 10⁻¹⁵ A/µm = 1 fA/µm at 28 nm, i.e. 
≈ 0.5 fA per cell) of TMD\ntransistors to obtain **multi-second** retention without an explicit storage capacitor, enabling\nin-place gradient writes at 20 fJ/bit with **unlimited** write endurance and a refresh overhead\nof only ≈ 0.08 W. Because the storage node is writable on every cycle, the same die that serves inference can\nalso train; because retention is seconds-long, idle power collapses to ~ 3 W, an inference-grade idle profile\non a fully writable training die.\n\nPhantaField's 2D-TMD M3D platform integrates this DRAM module at the BEOL Metal-3 layer of each memory tier, directly above the logic tier whose MAC array consumes its weights.\n\nSophon uses the following physical stack:\n\n| Tier(s) | Function | Process |\n|---|---|---|\n| Base (Si) | Controller, NoC root, host I/O, PCIe/NVLink PHY | 28 nm bulk Si CMOS |\n| Tiers 1 – 32 | Interleaved 2D-TMD stack: 32 logic tiers (MAC array, 750 mm² each) alternating with 32 memory tiers (2T0C DRAM bank, 750 mm² each), forming 32 logic-plus-memory doublets | BEOL 2D-TMD (MoS₂ n-FET / WSe₂ p-FET) on odd tiers + DRAM module on even tiers |\n| Lid | Cu / CVD-diamond heat spreader | optional; enables two-side cooling |\n\nTotal stack height: **~22 µm** above the Si die (64 tiers × 0.35 µm/tier). The 90 nm-pitch MIV\ngrid provides 1.23 × 10⁸ slots/mm² available inter-tier connections; the design populates only ~5.5 ×\n10⁵/mm², leaving > 99% MIV headroom.\n\nTiers are not split within a single layer; instead the 64-tier stack\n**interleaves dedicated logic and memory tiers** in an A/B/A/B… repeating pattern. 
Two adjacent\ntiers form one logic-plus-memory **doublet**; the stack contains 32 such doublets:\n\n**Why 2D TMD?** TMD CMOS (MoS₂ / WSe₂) is the only transistor technology that simultaneously\noffers: (1) BEOL-compatible growth at ≤ 450 °C [[6]](#ref-6); (2) atomic-scale\nchannel thickness eliminating short-channel leakage [[1]](#ref-1)[[2]](#ref-2); (3) electron mobility ≥ 120 cm²/V·s\n[[4]](#ref-4); and (4) intrinsic radiation hardness (no buried-oxide trap volume).\nCritically, the TMD off-current density Joff ≈ 10⁻¹⁵ A/µm (1 fA/µm) at 28 nm — i.e. ≈ 0.5 fA for\na 0.5 µm-wide cell transistor, roughly 4 orders of magnitude lower than Si NMOS at equivalent gate length\n[[2]](#ref-2)[[3]](#ref-3) — is what enables a 2T0C cell to\nretain data for **seconds** without any storage capacitor [[8]](#ref-8)[[9]](#ref-9), keeping the cell area at 8 F² rather than the ~20 F² needed for a\nconventional 1T1C DRAM.\n\nSophon places a **2T0C 2D-TMD gain-cell DRAM** (8 F², 1 bit/cell) at the Metal-3 BEOL of each\nmemory tier. The cell structure is shown in **Figure 2** and consists of:\n\nThe TMD off-current density of 1 fA/µm (Ioff ≈ 0.5 fA for a 0.5 µm cell transistor) gives\nretention τ = C·Vdd / (2·Ioff) = **1.8 s** at 25 °C\n[[8]](#ref-8)[[9]](#ref-9) — see **Eq. 3** and\n**Figure 3** for the retention curve. Sophon refreshes every **1.0 s** (1.8×\nmargin), consuming only ≈ **0.08 W** for the full 330 GB die (**Eq. 4**).\nRetention derates ≈ 2× per 10 °C; above 60 °C junction temperature, on-die thermal sensors shorten the\nrefresh interval (≈ 159 ms at 60 °C, ≈ 28 ms at 85 °C), with refresh power staying below ~ 4 W even in the\nhot corner.\n\nBecause the storage node is writable on every cycle, Sophon supports in-place BF16 gradient accumulation with unlimited endurance — exactly what training requires — while the same array, read-only, serves the inference decode loop. 
The die loads a model once and either serves it (inference) or updates it in place (training); a powered-off die reloads its weights from off-die Non-Volatile Memory express (NVMe) at boot (§11.2).\n\nThe 131,072 CIM tiles are not a flat array — they are partitioned across the 32 logic tiers of the stack\n(§2.A), exactly **4,096 tiles per logic tier** (derived: 131,072 ÷ 32). Each tile occupies a\nfixed cell on its tier and is the atomic unit of compute, storage, and redundancy: a 256×256 weight subarray\n(65,536 weights) feeding a binary sense amp and an 8-level adder tree, with bit-serial activation broadcast\nat 500 MHz (16 cycles BF16, 8 cycles FP8). The weights for every tile live in the 2T0C cells of the memory\ntier directly above it (§2.B), so a tile is physically a vertical logic-plus-memory column, not a planar\nblock. A tier is therefore a 4,096-tile mesh of these columns; the full die is 32 such meshes stacked at\n0.35 µm pitch, with the 28 nm Si base below carrying everything that is not compute.\n\n**The NoC is a per-tier 2D mesh, not a global fabric.** Each logic tier runs its own mesh\nrouter fabric at **≈ 290 TB/s** bisection, and the 64 tiers together present\n**18,560 TB/s** aggregate (derived: 290 × 64). What rides the NoC is deliberately minimal:\n**activations and partial sums** — the operands that must move between tiles to assemble a\nlayer's output across the 4,096-tile fan-in. **Weights never touch the NoC.** Every weight is\nread through its tile's private vertical MIV port — a single tier-pitch hop straight down from the cell to\nits MAC — delivering 4.2 PB/s of in-tile weight bandwidth with zero shared-bus contention (§2.A). This is\nthe load-bearing asymmetry of the floorplan: the multi-petabyte traffic (weight fetch) is kept entirely\nvertical and local, so the lateral NoC only ever carries the comparatively small activation/partial-sum\nflux. 
The base-tier **NoC root** stitches the per-tier meshes together and bridges them to the\ncontroller and host I/O, but it is never in the weight path.\n\nEach tile additionally owns a small **SRAM scratchpad** for activations. Because the NoC\ncarries activations and partials rather than weights, the scratchpad is where a tile stages its inbound\nactivation vector, accumulates its slice of the partial sum across the bit-serial broadcast, and buffers the\noutbound result before it is handed to the mesh. Holding the live activation working set in fast local SRAM\n(adjacent to the adder tree, not in the 2T0C DRAM) keeps the broadcast/accumulate inner loop entirely\non-tile and lets the 1 Hz-refresh gain-cell DRAM (§2.B) stay dedicated to weights and KV cache, whose access\npattern is read-mostly and latency-tolerant by comparison.\n\n**Clock and power are delivered down the 22 µm stack to a low-voltage rail.** The logic tiers\nare clocked at **1.2 GHz** from a base-tier clock root distributed upward through the MIV grid;\nthe bit-serial activation broadcast runs on a separate 500 MHz domain. Operating at\n**Vdd = 0.6 V** is what makes a 64-tier monolithic stack thermally viable: dynamic\npower scales with Vdd².\n\nThe **28 nm Si base tier** is the system's front door. It carries the controller, the NoC root,\nhost I/O, and the PCIe/NVLink-class PHY, all in mature bulk-Si CMOS, where high-speed analog SerDes and\nlarge I/O drivers belong, rather than in the BEOL 2D-TMD tiers above. This separation is what lets the same\ndie both serve and train without hardware change: the host loads a model **once** through the\nbase-tier PHY into the on-die 2T0C DRAM, after which the controller either drives the inference decode loop\n(weights read-only) or runs in-place gradient writes for training (§2.B), and a fleet repartitions between\nthe two by command, not by re-spinning silicon. 
An 80B model — weights, optimizer state, activations, and KV\ncache — resides entirely on the single die, with every MoE expert resident on-die and only the routed\nexperts drawing power.\n\n| Resource | Per logic tier | Per die (×32 tiers) |\n|---|---|---|\n| CIM tiles | 4,096 (derived) | 131,072 |\n| Weight subarray / tile | 256×256 = 65,536 weights; binary sense amp + 8-level adder tree | |\n| Die footprint | single 750 mm² die — 64 tiers stacked at 0.35 µm (~22 µm tall) | |\n| Logic (MAC) silicon | 750 mm² / tier | 24,000 mm² cumulative (32 × 750, §2.A) |\n| On-die 2T0C DRAM | 750 mm² / tier | 330 GB total (weights + optimizer + KV cache) |\n| NoC mesh bisection | ≈ 290 TB/s | 18,560 TB/s aggregate over 64 tiers |\n| In-tile weight BW (vertical MIV) | 4.2 PB/s — never crosses the NoC | |\n| Activation store | Per-tile SRAM scratchpad (NoC carries activations + partial sums) | |\n| Clock / rail | 1.2 GHz logic, 500 MHz broadcast; Vdd = 0.6 V |\n|\n| Base tier | 28 nm Si — controller, NoC root, host I/O, PCIe/NVLink-class PHY |\n\nAll formulas are derived in the **Equations Appendix (§13)**. Numeric values reference the\nequation number in that appendix.\n\nThe 64-tier stack **interleaves dedicated logic and memory tiers** in an A/B/A/B… repeating\npattern: 32 logic tiers (odd-indexed) and 32 memory tiers (even-indexed), forming 32 logic-plus-memory\n**doublets**. Each individual tier uses its full 750 mm² footprint for its single role: a logic\ntier holds the 2D-TMD MAC array (750 mm² MAC); a memory tier holds the co-located 2T0C DRAM bank (750 mm²\nmemory). All capacity and throughput numbers below are reported on a **per-doublet** basis (one\nlogic tier + one memory tier) so they remain directly comparable to the legacy per-tier presentation.\n\nThe 2T0C gain cell consists of two 2D-TMD transistors and zero explicit storage capacitors\n[[8]](#ref-8)[[9]](#ref-9)[[10]](#ref-10). 
It exploits the anomalously low off-current of TMD field-effect\ntransistors, a width-normalized density of **Joff = 10⁻¹⁵ A/µm (1 fA/µm)** at 28 nm.\n\n**Cell structure:**\n\n**Retention physics** (**Eq. 3**, derived from\n[[8]](#ref-8)): τ = Cnode · Vdd / (2 · Ioff). At\nCnode = 3.0 fF, Vdd = 0.6 V, and Ioff = Joff · WRT =\n1 fA/µm × 0.5 µm = 0.5 fA at 25 °C, τ = **1.8 s**. Sophon refreshes every\n**1.0 s** (1.8× margin). Retention derates ≈ 2× per 10 °C; above 60 °C junction temperature,\non-die thermal sensors shorten the refresh interval (≈ 159 ms at 60 °C, ≈ 28 ms at 85 °C).\n\n| Parameter | Value | Notes |\n|---|---|---|\n| Cell footprint | 8 F² | 2T0C (WT + RT), no capacitor |\n\n**Why a capacitor-less gain cell?** A conventional 1T1C DRAM needs a ~ 20 F² trench/MIM\ncapacitor that is incompatible with low-temperature BEOL M3D integration. The 2T0C cell stores charge on the\nRead Transistor's own gate parasitic, so it is built entirely with the same TMD transistors used in the MAC\narray, with no separate capacitor module and no third-party Intellectual Property (IP) license, and the\nmulti-second retention enabled by the 1 fA/µm off-current makes refresh power negligible (≈ 0.08 W,\n**Eq. 4**).\n\nThe stack contains **32 doublets** (one logic tier + one memory tier per doublet). 
Each doublet\ncontributes one logic-tier's MAC area and one memory-tier's storage area; the total active MAC area and\nmemory area are therefore identical to a hypothetical 64-tier in-tier-split presentation, but routing is\ndenser because each logic tier no longer competes for footprint with its memory bank.\n\n| Item | PFG-1 Sophon (2T0C DRAM) |\n|---|---|\n| Memory area per memory tier | 750 mm² |\n| Logic area per logic tier | 750 mm² |\n| Memory tiers / logic tiers | 32 / 32 |\n| Capacity per doublet | 10.31 GB |\nTotal capacity (32 doublets) |\n330 GB |\n| FP8 throughput per logic tier | 131.25 TFLOPS |\n| BF16 throughput per logic tier | 65.6 TFLOPS |\nFP8 throughput (32 logic tiers) |\n4,200 TFLOPS |\nBF16 throughput (32 logic tiers) |\n2,100 TFLOPS |\nINT8 throughput (32 logic tiers) |\n8,400 TOPS |\n\nSophon holds 330 GB. For **training**, an 80B-parameter BF16 model (160 GB) plus first-order\noptimizer state (160 GB for SGD-momentum or Lion) = **320 GB**, leaving\n**10 GB** for gradient-checkpointed activations (Section 5.B.2). For\n**inference**, an 80B BF16 model (160 GB) leaves 170 GB free, or an 80B FP8 model (80 GB)\nleaves 250 GB free for an extended Key-Value (KV) cache or a co-resident draft model (Section 5.A).\n\nSections A.1 and §2.B describe the *structure* of the 2T0C cell; this subsection describes how it is\n*operated* cycle-by-cycle. The two-transistor topology decouples the write path from the read path\nentirely — the Write Transistor (WT) owns the storage node, the Read Transistor (RT) only senses it — which\nis precisely what enables the same array to stream weights to the MAC on every cycle while remaining\nin-place writable for gradient accumulation (§3.C).\n\n**Write.** A write asserts the Write Word-Line (WWL), turning the WT on and connecting the\nstorage node (RT gate parasitic ~2.5 fF + WT drain junction ~0.5 fF ≈ 3.0 fF) to the Write Bit-Line. 
The WT\nchannel then charges the node to Vdd = 0.6 V for a \"1\" or discharges it to GND for a \"0\"; WWL is\nde-asserted and the TMD off-current (≈ 0.5 fA per 0.5 µm cell) traps that charge for the full retention\nwindow. The transferred charge is Cnode · Vdd ≈ 3.0 fF × 0.6 V, and the measured write\nenergy is **20 fJ/bit** — a single channel charge-transfer event, with no high-voltage charge\npump and no oxide stress. Because both the value being written and the in-place gradient update (§3.C) take\nthis identical path, training and inference share one write primitive.\n\n**Read — the gain-cell mechanism.** The defining property of the cell is that\n**RT's gate is the storage node**, so the stored level directly modulates RT's drain\nconduction. To read, the Read Bit-Line (RBL) is precharged and RT's drain is enabled: a stored V\n\n**Sense margin & why sensing is digital.** The read window is set by RT's on/off\ndrain-current ratio. The same 1 fA/µm TMD off-current that gives multi-second retention also collapses the\n\"0\" leg of the read to the sub-femto-amp floor, while the \"1\" leg conducts at the full TMD on-current — an\non/off ratio of many decades. That enormous, deterministic separation means the sense amp only ever has to\ndecide \"conducting vs. not,\" so a single current-comparator threshold suffices:\n**no ADC, no DAC, no reference ladder**. This is what keeps the read path pure-digital and\ndeterministic end-to-end — there is no analog accumulation to quantize, consistent with the ADC-free CIM\ntile architecture (§3.D).\n\n**Disturb, retention & endurance during operation.** Because a read is gate-voltage sensing\nthrough RT and never discharges the node, **read-disturb is negligible** — a cell can be read\narbitrarily many times between refreshes with no charge loss, so the refresh cadence is governed solely by\nleakage, not by access traffic. 
Retention τ = Cnode · Vdd / (2 · Ioff) =\n**1.8 s** at 25 °C fixes the **1 Hz refresh** (1.8× margin, ≈ 0.08 W for 330 GB;\nsee A.1). Writes are likewise benign: the bit is set by gate-controlled charge transfer through the WT, with\n**no oxide tunneling and no filament formation**, so there is no wear-out mechanism and\nendurance is effectively **unlimited** — the enabling condition for streaming in-place gradient\nwrites throughout a full training run (§3.C).\n\n| Property | 2T0C TMD gain cell (Sophon) | Conventional 1T1C DRAM |\n|---|---|---|\n| Read type | Non-destructive (RT gate-voltage sense) |\nDestructive (capacitor charge-share onto BL) |\n| Write-back after read | None — read back-to-back every cycle |\nRequired every access (restore) |\n| Storage element | RT gate parasitic + WT drain junction (≈ 3.0 fF, \"0C\") | Explicit MIM / trench capacitor |\n| Sensing | Binary current comparator — no ADC/DAC | Differential charge-sensing amp + reference |\n| Cell area | 8 F² |\n≈ 20 F² (capacitor-dominated) |\n| Write endurance | Unlimited (gate-controlled charge, no oxide wear) |\nUnlimited, but every read costs a restore write |\n\nBecause weights live in memory co-located with their consuming MAC, there is\n**no global weight-bandwidth pipe**. Sophon employs\n**fully digital Compute-In-Memory (CIM)** — a sense-amplifier and binary adder tree per\ncolumn-group. Bandwidth decomposes into orthogonal contributions.\n\nEach BF16 MAC reads 16 bits from the DRAM bank directly above its tile at 30 fJ/bit with 3 ns latency. The bit-serial multiply runs at the 500 MHz wordline rate over 16 cycles for BF16 (8 cycles in FP8 inference mode); the per-column sense amplifier produces a 1-bit partial product per cycle that feeds an 8-level binary adder tree. 
A 4-stage pipeline hides DRAM latency.\n\n| Quantity | BF16 (native) | FP8 (inference mode) |\n|---|---|---|\n| MAC throughput | 2,100 TFLOPS | 4,200 TFLOPS |\n| Weight bits per MAC | 16 bits (BF16) | 8 bits (FP8) |\n| Aggregate weight BW | 4.20 PB/s | 4.20 PB/s |\n| Per-tile read width | 275 bits/cycle | 550 bits/cycle |\n| Memory read latency | 3 ns (4 cycles) | 3 ns (4 cycles) |\n\nSophon delivers **4.20 PB/s** of aggregate weight bandwidth in either datatype, because the byte-rate\nof weight consumption is the same: 2 bytes/BF16-MAC at 2,100 TFLOPS, or 1 byte/FP8-MAC at 4,200 TFLOPS, both\nproducing 4.20 PB/s. This bandwidth is **in-tile and never crosses the Network-on-Chip (NoC)**.\n\n**Why is weight bandwidth independent of datatype and of capacity?** In a Compute-In-Memory architecture, weight bandwidth is set by the MAC array's weight-consumption rate, which is intrinsic to the logic tiers, while capacity is set by the memory-tier areal density (110.0 Mb/mm² for 2T0C DRAM, §3.A). Because every weight is physically co-located with the MAC that consumes it, there is no shared bus whose width would scale with total stored bytes or with bit-depth: a higher-bit datatype simply reads more bits per MAC at a proportionally lower MAC rate. The bandwidth equality is therefore a direct consequence of `BW = (bytes per MAC) × (MAC rate)` being identical for both modes (1 B × 4,200 TFLOPS = 2 B × 2,100 TFLOPS = 4.20 PB/s).\n\nDuring the backward pass, accumulated gradients are written back to the DRAM bank at 20 fJ/bit:\n\n| Quantity | Value |\n|---|---|\n| Gradient write bandwidth | 4.20 PB/s (mirrors weight read BW) |\n| Write energy per BF16 gradient | 20 fJ × 16 bits = 320 fJ = 0.32 pJ |\n| Backward-pass write power (55% util.) | 370 W |\n| Backward-pass write power (100% util.) 
| 672 W |\n\nInference uses the read path only and incurs none of this write power.\n\nActivations occupy a small per-tile SRAM scratchpad (SPM) (5% of tier area, ~37.5 mm²/tier, ~0.7 GB/tier).\n\nA 2-D mesh NoC routes activations and control. Each tier has its own mesh; vertical MIVs carry inter-layer activations.\n\n| Path | Bandwidth |\n|---|---|\n| Per-tier NoC bisection | 290 TB/s |\n| Aggregate NoC (64 tiers) | 18,560 TB/s |\n| MIV vertical fabric (weight delivery) | 4,200 TB/s sustained |\n\n| Path | Sophon | Notes |\n|---|---|---|\n| Weight (memory → MAC) | 4.20 PB/s | In-tile |\n| Gradient (MAC → memory) | 4.20 PB/s | In-tile, bwd pass only |\n| Activation (NoC) | 18,560 TB/s | Inter-tile |\n| Inter-tier (MIV) | 4,200 TB/s | Vertical (= in-tile weight BW) |\n| HBM3e reference (8-stack) | 8.0 TB/s | Off-package (NVIDIA Rubin R200) |\n| HBM4 reference (NVIDIA Rubin R200, 8-stack) | 22 TB/s | Off-package |\n| HBM4 reference (AMD Instinct MI455X, 8-stack) | 19.6 TB/s | Off-package |\n\nSophon provides **~ 191× more weight bandwidth** than NVIDIA Rubin (R200) and\n**~ 214× more** than AMD Instinct MI455X (4,200 TB/s vs 22 TB/s for an 8-stack HBM4 package on\nRubin, and 19.6 TB/s for an 8-stack HBM4 package on MI455X\n[[16]](#ref-16)[[18]](#ref-18)), because that bandwidth is\nintrinsic to the storage location, not a separate interconnect. **Figure 4** plots the\ncomparison.\n\n*Convention note: throughout this paper, \"2,100 TFLOPS BF16\" and \"4,200 TFLOPS FP8\" count each\nmultiply-accumulate (MAC) as 2 floating-point operations (one mul + one add)\n[[16]](#ref-16). Energies tabulated below are stated per MAC (per\nweight processed), so per-FLOP figures are half the listed values. The chip-power calculations in §C.3 use\nthe per-FLOP convention to align with the TFLOPS rates.*\n\n*Architecture note: Sophon uses pure digital Compute-In-Memory (CIM). 
Each tile contains\na per-column sense amplifier feeding an 8-level binary adder tree that produces the partial sum for one\nrow of a 256×256 weight subarray. All multiply-accumulate arithmetic is performed in the binary domain\nwith full deterministic 16-bit (BF16) or 8-bit (FP8) precision — see §3.D for the digital-CIM tile\nwalkthrough and §3.D.2 for why this choice constrains throughput as 1/N in the dense-decode regime.*\n\n| Component | Energy / MAC | Energy / FLOP | Notes |\n|---|---|---|---|\n| 2T0C DRAM read (16 bits) | 0.480 pJ |\n0.240 pJ | 30 fJ/bit × 16 — BL precharge + binary current sense\n|\n\n| Component | Energy / MAC | Energy / FLOP | Notes |\n|---|---|---|---|\n| 2T0C DRAM read (8 bits) | 0.240 pJ |\n0.120 pJ | 30 fJ/bit × 8 — half the BF16 read\n|\n\nThe adder-tree compute term is ~ 0.07 pJ/MAC at FP8 — binary additions in modern low-Vdd TMD\nCMOS dissipate roughly 8 fJ per 1-bit add, and an 8-level tree for a 256-input column requires 256 adds\namortized across 256 cells (~ 8 fJ/cell × 8 levels = 64 fJ ≈ 0.064 pJ). The pure-digital adder tree avoids\nthe per-sample conversion costs that dominate older mixed-signal CIM designs.\n\n| Source | Sophon |\n|---|---|\n| Memory static leakage | 0 W (DRAM has no DC leakage path) |\n| Memory refresh power | ≈ 0.08 W (330 GB × 1 Hz × 30 fJ/bit × 8 bits/byte) |\n| TMD logic leakage | 0 W |\n| SRAM scratchpad leakage | 1.67 W |\nTotal static/idle (model loaded) |\n~ 2 W |\n\nSophon's near-zero idle is an operational advantage: an 80B model loaded into Sophon waits for requests at\n**~ 2–3 W**. An equivalent HBM4-based GPU (e.g. NVIDIA Rubin (R200) with 288 GB, or AMD Instinct MI455X with 432 GB) holds its\nHBM4 memory subsystem in self-refresh at ~ 10–15 W. With the 2D-TMD off-current at 1 fA/µm (Ioff ≈ 0.5 fA per cell),\nthe 2T0C retention time rises to 1.8 s and the array needs only a\n**1 Hz refresh, costing ≈ 0.08 W**. 
A nominal **1 W** allowance is carried below\nto cover warm steady-state operation; refresh is no longer a meaningful component of the power budget.\n\n| Phase | DRAM read | Digital MAC array | NoC + SPM | Static | Chip total |\n|---|---|---|---|---|---|\n| Idle (model loaded) | 0 W | 0 W | 0 W | 2 W | ~ 2 W |\n| FP8 decode (55% util.) | 277 W | 81 W | 13 W | 2 W | ≈ 373 W |\n| BF16 decode (55% util.) | 277 W | 81 W | 19 W | 2 W | ≈ 379 W |\n| FP8 prefill (75% util.) | 378 W | 110 W | 18 W | 2 W | ≈ 508 W |\n| Peak FP8 burst (100% util.) | 504 W | 147 W | 28 W | 2 W | ≈ 681 W |\n\nFP8 decode reads 8-bit weights but runs at twice the BF16 MAC rate (4,200 vs 2,100 TFLOPS), so its read power equals BF16's 277 W (half the bits × double the rate); both are compute-bound at low batch.\n\n| Phase | DRAM read | Digital MAC | Refresh | Grad write | NoC + SPM | Static | Chip total |\n|---|---|---|---|---|---|---|---|\n| Idle (model loaded) | 0 W | 0 W | ~1 W | 0 W | 0 W | 2 W | ~ 3 W |\n| Forward pass (55% util.) | 277 W | 81 W | ~1 W | 0 W | 18 W | 2 W | ≈ 379 W |\n| Backward pass (55% util.) | 277 W | 81 W | ~1 W | 370 W | 18 W | 2 W | ≈ 749 W |\nAvg. training step (fwd+bwd) |\n277 W | 81 W | ~1 W | 185 W | 18 W | 2 W | ≈ 564 W |\n| Peak forward (100% util.) | 504 W | 147 W | ~1 W | 0 W | 36 W | 2 W | ≈ 690 W |\n| Peak training (100% fwd+bwd) | 504 W | 147 W | ~1 W | 672 W | 36 W | 2 W | ≈ 1,362 W |\n\nThe training time-average power (forward + backward weighted equally) is **~ 564 W**. With\nrefresh effectively eliminated by the 1 fA/µm off-current, power is dominated by DRAM read + gradient\nwrite traffic. 
Backward pass adds **370 W of gradient write power at 55% utilization** (20\nfJ/bit × 16 bits × 2,100 TFLOPS × 55%); idle is **~ 3 W**, giving Sophon an inference-grade\nidle profile despite being a fully writable training die.\n\n| Metric | Sophon (inference) | Sophon (training) | NVIDIA Rubin (R200) | AMD Instinct MI455X |\n|---|---|---|---|---|\n| TFLOPS/W (FP8, peak compute) | 6.2 | — | ~ 9.7 |\n~ 11.8 |\n| TFLOPS/W (BF16, training avg.) | — | 3.72 | ~ 4.86 |\n~ 5.88 |\n| Energy / FP8 inference MAC | 0.310 pJ |\n— | ~ 0.21 pJ | ~ 0.17 pJ |\n| Energy / BF16 forward MAC | — | 0.620 pJ |\n~ 0.41 pJ | ~ 0.34 pJ |\n| Energy / BF16 training MAC (fwd+bwd) | — | 0.940 pJ | ~ 0.82 pJ | ~ 0.68 pJ |\n| Energy / decoded token (80B, FP8, B=1) | 25.8 mJ |\n— | ~ 4,480 mJ | ~ 4,480 mJ |\n| Tokens per watt (80B decode, B=1) | 38.7 tokens/s/W (FP8) |\n— | ~ 0.22 tokens/s/W | ~ 0.22 tokens/s/W |\n| Energy / training token (80B, fwd+bwd) | — | 0.23 J |\n~ 40 J (B=1 estimate) | ~ 40 J (B=1 estimate) |\n| Idle power (80B model loaded) | ~ 3 W |\n~ 3 W |\n~ 10–15 W (memory) | ~ 10–15 W (memory) |\n\nOn **peak compute**, the 2026 HBM4 GPUs now lead: Rubin (R200) and MI455X reach ~ 4.86 and ~ 5.88\nBF16 TFLOPS/W respectively, roughly 1.3–1.6× Sophon's 3.72 — they pack ~ 4–5× more peak FLOPS behind a 3 nm\nprocess. That advantage simply does not help at low batch. For inference, Sophon's FP8-mode decode at 25.8\nmJ/token is **~ 174×** lower energy per token than either HBM4 GPU (~ 4,480 mJ/token), because at\nB=1 both GPUs are HBM-bandwidth-bound and their adder energy is irrelevant — bandwidth, not FLOPS, governs.\nThe digital adder tree keeps per-MAC energy low in both forward and backward passes **and** the 1\nfA/µm off-current keeps refresh negligible (≈ 0.08 W), so Sophon spends ~ 3 W at idle vs. ~ 10–15 W for\nRubin's 288 GB and MI455X's 432 GB HBM4 subsystems in self-refresh.\n\nEach Sophon tile is a 256×256 DRAM subarray with co-located digital MAC circuitry. 
The activation is\n**bit-serialized** — broadcast as sequential 1-bit wavefronts across the 256 wordlines at the\n500 MHz tile clock (16 wavefronts for BF16, 8 for FP8). Each bit-cycle fires one row, producing 256 1-bit\npartial products that flow into a per-column sense amplifier, then into a tile-wide 8-level binary adder\ntree.\n\n| Quantity | Value | Notes |\n|---|---|---|\n| Subarray geometry | 256 rows × 256 cols | 8 KB of weights per tile (1 bit/cell) |\n| Tile clock | 500 MHz | Bit-serial activation rate |\n| Cycles per MAC | 16 (BF16) / 8 (FP8) | One per activation bit |\n| Per-tile MAC rate | 8 GMAC/s (BF16) |\n256 MACs / 32 ns |\n| Tiles per die | 131,072 | 2,048 subarrays × 64 tiers |\nAggregate MAC rate |\n1,050 TMAC/s = 2,100 TFLOPS BF16 |\n2,100 TMAC/s = 4,200 TFLOPS FP8 |\n| Adder tree depth | log₂(256) = 8 levels | ~ 150 ps/level @ 28 nm |\n| Adder tree latency | 1.2 ns |\nSets the cycle-time floor |\n| Sense-amp latency | 50 ps | Negligible vs. tree |\n\nIn FP8 inference mode the same tile geometry runs an 8-cycle bit-serial activation (vs 16 for BF16),\ndoubling the MAC rate to **4,200 TFLOPS FP8**.\n\nA common misconception about CIM is that \"all the math happens in parallel inside the memory, so model size\nshouldn't matter.\" This is true for **weight transport**, but not for\n**MAC execution**. A dense N-parameter transformer requires exactly\n**2N FLOPs per output token** at batch size 1 — a mathematical requirement that no architecture\ncan shortcut without changing the model.\n\nFor Sophon FP8 inference at 2,100 TMAC/s aggregate:\n\n| Model size N | MACs / token | Compute time | tokens/s (55% util.) 
|\n|---|---|---|---|\n| 7 B | 7 GMAC | 6.06 µs | 165,000 |\n| 70 B | 70 GMAC | 60.6 µs | 16,500 |\n| 80 B | 80 GMAC | 69.3 µs | 14,438 |\n| 175 B | 175 GMAC | 152 µs | 6,600 |\n| 405 B | 405 GMAC | 351 µs | 2,852 |\n\nThe slope is **strictly inverse to N** because each weight stored in the DRAM array\nparticipates in exactly one MAC per token, and the aggregate MAC ceiling is fixed by the tile count.\n\n| Constraint | NVIDIA Rubin (R200) | AMD Instinct MI455X | Sophon digital CIM |\n|---|---|---|---|\n| Weight transport bandwidth | 22 TB/s HBM4 ceiling | 19.6 TB/s HBM4 ceiling | none — in-place |\n| Weight transport energy | ~ 7 pJ/bit (HBM4 read) | ~ 7 pJ/bit (HBM4 read) | ~ 0.24 pJ/byte sense (BF16) |\n| MAC throughput per die | 17,500 TFLOPS FP8 | 20,000 TFLOPS FP8 | 4,200 TFLOPS FP8 |\n| Energy per FP8 MAC | ~ 1.0 pJ | ~ 1.0 pJ | 0.310 pJ |\nCompute scaling with N |\n1/N (bandwidth-bound) |\n1/N (bandwidth-bound) |\n1/N (compute-bound) |\nEnergy scaling with N |\n1/N |\n1/N |\n1/N |\n\nBoth fall as 1/N — only the absolute curve height differs. Sophon sits **~ 48× above** NVIDIA\nRubin (R200) and **~ 53× above** AMD Instinct MI455X on the FP8-mode decode tokens/s curve\nbecause (a) zero weight-transport overhead (Rubin and MI455X decode at low batch are HBM-bandwidth-bound at\ntheir 22 TB/s and 19.6 TB/s HBM4 ceilings respectively — only ~ 300 and ~ 270 tok/s for an 80B FP8 model),\n(b) lower energy per MAC, and (c) sufficient peak MAC throughput at batch-1, where memory bandwidth — not\npeak FLOPS — governs. 
Both GPUs in fact carry ~ 4–5× more peak FP8 FLOPS per die than Sophon (Sophon's dense peak\nis just 0.24× Rubin's and 0.21× MI455X's), yet that raw peak buys them nothing at low batch: the weights\nmust still stream over HBM4 every token.\n\nThree architectural or algorithmic paths can break the dense-decode 1/N curve:\n\n**1. Per-cell dedicated MAC units** — give each of the 80 × 10⁹ cells its own dedicated MAC.\nCells become ~ 7× larger; memory density drops sharply; 99% of MAC units idle on any given clock.\n**Rejected**: trades capacity for parallelism that cannot be sustained at constant\nutilization.\n\n**2. Speculative decoding** — run a small draft model ahead, verify with the large model.\nEffective speedup of ~ 2.5× when the draft (1 B parameters, ~ 1.25% of Sophon's MAC budget) co-resides\non the same die. **Selected as Sophon's default inference deployment mode** — see §5.A.6.\n\n**3. MoE (Mixture-of-Experts) and INT4 quantization** — reduce the effective N that the MAC\narray sees. MoE shrinks active N by ~ 4–50× (e.g., DeepSeek-V3 671 B → 37 B active ≈ 18×); INT4 halves\nthe cycle count by halving activation bit-depth.\n**Both supported as first-class workloads**, with combined effective throughput documented\nin §5.A.6.\n\nCombining (2) with the INT4 half of (3) yields **~ 5× effective inference throughput improvement** (2.5× × 2×) over\nthe raw FP8 dense baseline on a single Sophon die.\n\n**Figure 4** plots the weight bandwidth comparison. **Figure 5** decomposes\nper-MAC energy by component. **Figure 6** shows the resulting active-power breakdown by\nworkload phase.\n\nSections D.1–D.2 fixed the tile geometry and the dense-decode 1/N ceiling; this subsection shows the\n**dataflow** — how a transformer layer's matmuls physically land on the 131,072 tiles and how\npartial results are stitched back together. The organizing principle is\n**weight-stationary execution**: a weight never moves. 
Every weight matrix *W* is tiled\ninto 256×256 blocks, and each block is resident in the 2T0C 2D-TMD DRAM doublet sitting\n*directly above* its MAC tile. A tile reads its ≈ 64 KB of FP8 weights (256×256 bytes) through a\nsingle private vertical MIV hop (§3.A) — there is no NoC traversal, no shared weight bus, and no off-die HBM\nfetch. This is the source of the 4.2 PB/s in-tile weight bandwidth (§3.C): bandwidth is the product of\n131,072 independent ports each one MIV-via deep, not a wide shared channel that must be arbitrated.\n\nWithin a tile, computation is **bit-serial** (§D.1). The activation vector is broadcast as\nsequential 1-bit wavefronts down the 256 wordlines at the 500 MHz tile clock — 8 wavefronts for FP8, 16 for\nBF16. On each bit-cycle the tile fires one row, the binary sense amps capture 256 1-bit partial products\nagainst the stationary weight column, and the 8-level adder tree reduces them to one column partial sum.\nAfter the full bit-serial sweep, every tile holds a 256-wide block partial sum for the slice of the output\ndimension it owns. Because activation is the only thing that flows in and the weight is the only thing that\nstays, energy per MAC is dominated by the local DRAM read (0.240 pJ of the 0.310 pJ FP8 total, §3.C) rather\nthan by data movement across the die.\n\nA single 256×256 tile covers only a 256-element slab of a real projection, so a full output dimension is\nassembled by **cross-tile reduction**. Tiles whose blocks share an output row form a reduction\ngroup; their partial sums are summed across the on-die NoC (≈290 TB/s per tier, 18,560 TB/s aggregate over\n64 tiers, §3.C) and accumulated into the per-tile SRAM activation scratchpad. Only these reduced activations\n— never weights — travel on the NoC, so the interconnect carries the small O(dmodel) activation\ntraffic of a layer rather than the O(N) weight traffic that bandwidth-bounds a GPU. 
The reduced output\nvector then becomes the broadcast activation for the next layer's tile group, and the layer pipeline\nadvances.\n\nMapping a complete transformer block follows directly. The four attention projections\n**W_Q/W_K/W_V/W_O** are each laid out as their own\ncontiguous group of weight-stationary tiles; the QK\n\nThe same physical tiles run **train-then-serve** with no hardware change. In serving mode the\nweight DRAM is read-only: activations sweep forward through the projection and FFN/MoE groups, the KV cache grows\nin place, and decode draws ≈ 373 W (FP8). In training mode the identical tiles run the forward pass and then\nthe backward pass over the writable 2T0C DRAM, performing\n**in-place gradient accumulation** through the dedicated grad-write path (0.320 pJ of the 0.940\npJ BF16 training MAC, §3.C) — weights are updated where they sit, again with no weight transport. Because\nthe only difference between the two modes is whether the local DRAM port is exercised read-only or\nread-modify-write, a fleet repartitions between training and serving purely in software: a die that trained\na checkpoint at midnight can serve it at noon on exactly the same tile array (§5.A).\n\nAll circuits were simulated in **ngspice 41** at 25 °C, using Level-1 MOSFET models tuned to published\n2D-TMD measurements [[1]](#ref-1)[[2]](#ref-2)[[3]](#ref-3).\n\nSetup: write `1` at t = 0; hold; read at t = 1.0 s.\n\n| Metric | Result |\n|---|---|\n| Storage-node voltage after write | 0.58 V (Vt-drop limited; RT threshold ~0.4 V) |\n| Storage-node voltage at t = 1.0 s | 433 mV (133 mV margin above Vdd/2 sense threshold) |\n| Retention (closed-form, Ioff = 0.5 fA @ 1 fA/µm × 0.5 µm) | 1.8 s |\n| Sense energy | 30 fJ/bit |\n| Write energy (WT charging node) | 20 fJ/bit |\n\nThe stored voltage at the 1.0 s refresh point (433 mV, a comfortable 133 mV above the Vdd/2 ≈ 300\nmV sense threshold) confirms the 1.0 s refresh interval is safe at 25 °C — see **Figure 3** for\nthe time-domain retention 
envelope at multiple temperatures. Retention halves for roughly every 10 °C of temperature rise (Arrhenius); at\n85 °C, τ falls from 1.8 s to ≈ 28 ms, so the on-die controller shortens the interval to ≈ 20 ms (50 Hz) — a refresh\ncost of only ~ 4 W, with no dedicated high-power \"fast-refresh\" mode required.\n\nBinary current sense: a single latch fired against a fixed mid-point reference. The 1-bit output drives directly into the per-tile binary adder tree.\n\n| Metric | Result |\n|---|---|\n| Resolve time (50 mV differential → rail) | 15 ps |\n| Differential gain | ≥ 150 |\n| Read energy per bit | 30 fJ |\n| Read latency (cell + sense) | 3 ns |\n\nA 34-node thermal network was solved at DC for peak training power injection (749 W backward pass). Stack ΔT remains sub-Kelvin; package resistance dominates (see Section 6).\n\nThe head-to-head comparison against the two 2026 HBM4 flagships, NVIDIA Rubin (R200) and AMD Instinct MI455X [[16]](#ref-16)[[17]](#ref-17), is summarized in **Figure 7**.\n\n| Layer | Function | Process | Notes |\n|---|---|---|---|\n| Base Si | Controller, NVLink PHY, PCIe, NoC root | 28 nm CMOS | 100 µm thick |\n| Tiers 1–64 | Interleaved: 32 logic tiers (2D-TMD MAC array) + 32 memory tiers (2T0C DRAM), alternating A/B/A/B… | 2D-TMD M3D | 0.35 µm/tier; 32 doublets |\n\nSophon serves inference on the same silicon it trains on. The MAC array supports both native BF16 (the training datatype) and an FP8 inference mode (4,200 TFLOPS / 8,400 INT8 TOPS); FP8 is the recommended serving mode because it doubles decode throughput, halves energy/token, and frees capacity. 
The model loads once and serves indefinitely; a powered-off die reloads from NVMe at boot (§11.2).\n\n| Parameter | Value |\n|---|---|\n| Memory | 330 GB 2T0C DRAM (on-die) |\n| On-die capacity | 330 GB |\n| FP8 throughput | 4,200 TFLOPS |\n| INT8 throughput | 8,400 TOPS |\n| BF16 throughput | 2,100 TFLOPS |\n| Energy / FP8 MAC | 0.310 pJ |\n| Idle power | ~ 3 W |\n\nDecode is compute-bound from batch size B = 1 because weights reside in-tile — no off-die memory traffic\nat any batch size.\n**The \"Aggregate tokens/s\" column is the total tokens emitted per second by the die across all batch\nslots; per-replica throughput is aggregate / B.**\nFigures below are for FP8 inference mode (the recommended serving point); BF16 native serving is exactly\nhalf.\n\n| Batch (B) | Aggregate tokens/s (FP8) | Per-replica tokens/s | Notes |\n|---|---|---|---|\n| 1 | 14,438 |\n14,438 | 4,200 TFLOPS × 55% / (2 × 80B FLOP/tok) |\n| 8 | 14,438 |\n1,805 | compute-bound; aggregate unchanged |\n| 32 | 14,438 |\n451 | |\n| 128 | 14,438 |\n113 |\n\nIn native **BF16** the same 80B model decodes at **7,219 tokens/s** (B = 1) —\nexactly half the FP8 rate because BF16 doubles the bit-serial cycle count (16 vs 8). Because every batch\nslot reads from the same in-tile DRAM, batching does not increase aggregate throughput; it amortizes\nprefill cost across multiple requests.\n\n| Phase | Chip power | Energy / token |\n|---|---|---|\n| Idle (model loaded) | ~ 3 W |\n— |\n| FP8 decode (B = 1, 55% util.) | ≈ 373 W |\n25.8 mJ |\n| BF16 decode (B = 1, 55% util.) | ≈ 379 W |\n52.5 mJ |\n| FP8 prefill (75% util.) | ≈ 508 W |\n— |\n| FP8 peak burst (100% util.) 
| ≈ 681 W |\n— |\n\nSustained FP8 prefill: **~ 19,690 tokens/s** (75% utilization); a 2,000-token prompt\ncompletes in ~ 102 ms.\n\n| Metric | NVIDIA Rubin (R200) | AMD Instinct MI455X | Sophon (FP8) |\nSophon (BF16) |\nRatio (FP8) vs Rubin / MI455X |\n|---|---|---|---|---|---|\n| Process | TSMC N3 (HBM4) | TSMC N3 (HBM4) | 28 nm + 2D-TMD M3D | 28 nm + 2D-TMD M3D | — |\n| Memory | 288 GB HBM4 | 432 GB HBM4 | 330 GB 2T0C DRAM |\n330 GB 2T0C DRAM |\n1.15× / 0.76× capacity\n|\n| FP8 dense TFLOPS | ≈ 17,500 | ≈ 20,000 | 4,200 |\n— | 0.24× / 0.21× (GPUs higher) |\n| Weight bandwidth | 22 TB/s (HBM4) | 19.6 TB/s (HBM4) | 4,200 TB/s in-tile |\n4,200 TB/s in-tile |\n~ 191× / 214× |\n| 80B decode B = 1 (tokens/s) | ~ 300 (HBM-bound) | ~ 270 (HBM-bound) | 14,438 |\n7,219 |\n~ 48× / 53× |\n| MAC energy | ~ 0.90 pJ (incl. HBM) | ~ 0.90 pJ (incl. HBM) | 0.310 pJ (FP8) |\n0.620 pJ (BF16 fwd) | 2.9× lower |\n| Energy / decoded token | ~ 4,480 mJ (B = 1) | ~ 4,480 mJ (B = 1) | 25.8 mJ |\n52.5 mJ | ~ 174× lower |\n| Tokens per watt (80B decode) | ~ 0.22 tokens/s/W (B = 1) | ~ 0.22 tokens/s/W (B = 1) | 38.7 tokens/s/W |\n19.0 tokens/s/W | ~ 174× higher |\n| Idle power (80B resident) | ~ 12–18 W (HBM4 self-refresh) | ~ 12–18 W (HBM4 self-refresh) | ~ 3 W |\n~ 3 W |\n~ 4× lower |\n| TDP / decode power | ~ 1,800 W TDP (2,300 W Max-P) | ~ 1,700 W TDP | 373 W decode |\n379 W decode | ~ 4.8× / 4.6× lower |\n| Model survives power-off | No (HBM volatile) | No (HBM volatile) | No (DRAM volatile) | No (DRAM volatile) | — |\n| BOM | ~ $82,800\n|\n\nAgainst the 2026 HBM4 flagships — NVIDIA Rubin (R200) and AMD Instinct MI455X — Sophon **does\nnot** win on raw peak dense throughput. Both GPUs carry ≈ 4–5× more peak FLOPS (Rubin ≈ 17,500\nTFLOPS FP8, MI455X ≈ 20,000) than Sophon's 4,200, so Sophon's BF16 dense is only ~ 0.24× Rubin / 0.21×\nMI455X. 
Sophon wins decisively on everything that governs *real* single-stream inference: 191× /\n214× the weight bandwidth, ~ 174× lower per-token energy, and — because HBM4 decode at low batch is\nHBM-bandwidth-bound, not compute-bound — ~ 48× (vs Rubin) / 53× (vs MI455X) higher B = 1 FP8 decode\nthroughput at a fraction of the power. The peak-FLOPS surplus only helps at very large batch sizes where\nRubin and MI455X amortize each HBM fetch across many MACs per weight; at B = 1 those FLOPS sit idle while\n22 TB/s (Rubin) / 19.6 TB/s (MI455X) of HBM bandwidth caps decode at ~ 300 / 270 tokens/s. The one\noperational caveat versus a non-volatile part is DRAM volatility: a powered-off die reloads the\ncheckpoint from off-die NVMe at boot (§11.2).\n\nA single Sophon die at 4,200 TFLOPS FP8 (55% utilization ≈ 2,310 effective TFLOPS) decodes at\n**t = 2,310 TFLOPS / (2 × Nparams) = 1.155 × 10¹⁵ / Nparams** tokens/s/replica when compute-bound. The 330 GB\non-die capacity determines what fits without sharding. The table below lists single-die FP8-mode decode\nthroughput across the production model-size spectrum (per the\n\n| Model size | Weights (FP8) | Fits on 1 Sophon? 
| Decode tokens/s (B = 1, 55%) | Energy / tok | Notes |\n|---|---|---|---|---|---|\n7 B (Mistral-7B) |\n7 GB | ✓ (323 GB free) | 165,000 |\n1.4 mJ | KV cache for 256 K context fits in headroom |\n13 B (Llama-2-13B) |\n13 GB | ✓ | 88,800 |\n2.6 mJ | |\n34 B (dense) |\n34 GB | ✓ | 34,000 |\n6.9 mJ | |\n70 B (Llama-3-70B) |\n70 GB | ✓ (260 GB free) | 16,500 |\n14 mJ | |\n80 B (primary design point) |\n80 GB | ✓ (250 GB free) | 14,438 |\n25.8 mJ | Primary design point |\n175 B (GPT-3-class) |\n175 GB | ✓ (155 GB free) | 6,600 |\n36 mJ | |\n320 B (dense FP8) |\n320 GB | ✓ (10 GB free) | 3,610 |\n65 mJ | Last single-die dense FP8 size |\n405 B (Llama-4 dense FP8) |\n405 GB | ✗ — needs 2 dies (TP) | 2,852 / die |\n87 mJ | TP = 2 sharding |\n1.0 T (dense FP8) |\n1,000 GB | ✗ — needs 4 dies (TP) | 1,155 / die |\n215 mJ | TP = 4 sharding |\n\nFor the 2026 HBM4 GPUs, the analogous decode throughput at FP8 is bandwidth-bound at B = 1\n(HBM4 weight-fetch limit — not compute), governed by HBM_bandwidth ÷ model_bytes. For the\n**NVIDIA Rubin (R200)** (22 TB/s HBM4, 288 GB) this is\n**~ 3.0 × 10² × (80 B / N) tokens/s** (capped by 288 GB, sharding required ≥ 290 GB); for the\n**AMD Instinct MI455X** (19.6 TB/s HBM4, 432 GB) it is\n**~ 2.7 × 10² × (80 B / N) tokens/s** (capped by 432 GB, sharding required ≥ 434 GB). A direct\nper-die comparison appears in **Figure 8**.\n\nThe key qualitative finding:\n**Sophon's per-die decode throughput is bandwidth-unbound** (compute-limited even at B = 1),\nso per-die tokens/s scales as 1/Nparams exactly. Both the Rubin (R200) and MI455X curves have a\nsimilar 1/N slope, but their **absolute level is ~ 48× lower (Rubin) and ~ 53× lower (MI455X)**\nbecause even the HBM4 weight-fetch path (22 TB/s on Rubin, 19.6 TB/s on MI455X) serializes every token's\nMAC traffic. 
Note that peak FLOPS now favor the GPUs (Sophon BF16 dense is ~ 0.24× Rubin / ~ 0.21× MI455X),\nyet peak compute does not help at B = 1, where memory bandwidth governs throughput.\n\nThe dense FP8 baseline in §5.A.5b is the *worst-case*\nenvelope. Real production workloads exploit three orthogonal throughput-multiplier techniques, all of\nwhich are first-class architectural features on Sophon rather than afterthoughts.\n**Figure 9** plots the cumulative effect.\n\n**1. Speculative decoding (on-die draft model)** — a 1 B-parameter draft model co-resident on\nthe same die drafts k = 4 candidate tokens per cycle; the 80 B target model verifies them in a\nsingle pass. The draft consumes ~ 1.25% of Sophon's MAC budget (1 B / 80 B); each verification pass still costs\none full target-model step at the 14,438 passes/s baseline. With a typical 70% token-acceptance rate\n[[29]](#ref-29), the **effective speedup is ~ 2.5×** on 80 B dense.\n\n**2. Mixture-of-Experts (sparse activation)** — only the *active* parameters\nparticipate in any given token's MAC graph. For Mixtral-8×7B-Instruct (47 B total, 12.9 B active per\ntoken, top-2 routing), the per-token cost is 12.9 GMAC (25.8 GFLOP) instead of 47 GMAC (94 GFLOP). Throughput scales with\nactive-N, not total-N. Sophon's 330 GB capacity holds the full 47 B expert pool on a single die.\n\n**3. INT4 quantization** — halves the bit-serial cycle count per MAC (4 cycles instead\nof 8 at the activation broadcast rate), doubling the per-tile MAC rate. INT4 has been shown to retain\nquality within 1–2 perplexity points of FP8 for 80 B-class instruction-tuned models\n[[30]](#ref-30). **Effective throughput is 2×** the FP8 baseline.\n\nThe three techniques compose multiplicatively where the model architecture permits. 
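
As a rough cross-check on the multipliers above, the sketch below estimates the speculative-decoding gain under the common simplifying assumption that each drafted token is accepted independently with probability `alpha`, and then composes it with the INT4 factor. The function names, the independence assumption, and the treatment of one draft step as 1/80 of a target pass are illustrative choices, not taken from the design documentation.

```python
# Minimal sketch: speculative-decoding speedup under an independent-acceptance
# model (an assumption), composed with the 2x INT4 multiplier quoted above.

def expected_tokens_per_pass(alpha, k):
    # Expected tokens emitted per target verification pass when k tokens are
    # drafted and each is accepted independently with probability alpha < 1.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def speculative_speedup(alpha, k, draft_cost):
    # draft_cost: cost of one draft step relative to one target pass
    # (assumed 1 B / 80 B = 0.0125 here).
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * draft_cost)

# Raw FP8 dense baseline: 4,200 TFLOPS x 55% / (2 FLOPs x 80 B parameters).
BASE_FP8_TOK_S = 4_200e12 * 0.55 / (2 * 80e9)

idealized_speedup = speculative_speedup(alpha=0.70, k=4, draft_cost=1 / 80)

# The document rounds the speculative gain to 2.5x; INT4 contributes 2x.
combined_tok_s = BASE_FP8_TOK_S * 2.5 * 2.0

print(f'idealized speculative speedup: {idealized_speedup:.2f}x')
print(f'INT4 + speculative throughput: {combined_tok_s:,.1f} tokens/s')
```

The idealized estimate comes out slightly above the quoted ~ 2.5×, which is plausible given that it ignores verification overheads; the 5× combined figure uses the document's rounded 2.5× value.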
The table below itemizes per-die decode throughput at B = 1 across the four levers and across the production model-size spectrum, including assumed frontier-MoE configurations for GPT-5-class and Claude Opus-4.8-class (these models' exact parameter counts are not publicly disclosed; the configurations below are estimates consistent with industry rumors as of mid-2026 and should be substituted with actual figures upon disclosure):\n\n| Model | Total / Active | Fits on 1 Sophon? | Raw FP8 dense | INT4 | + Spec. (2.5×) | + MoE active-N | INT4 + Spec. (5×) |\n|---|---|---|---|---|---|---|---|\n7 B (Mistral) |\n7 / 7 | ✓ | 165,000 | 330,000 | 412,500 | 165,000 | 825,000 |\n13 B (Llama-2) |\n13 / 13 | ✓ | 88,800 | 177,700 | 222,100 | 88,800 | 444,200 |\n34 B (dense) |\n34 / 34 | ✓ | 34,000 | 67,900 | 84,900 | 34,000 | 169,800 |\n70 B (Llama-3) |\n70 / 70 | ✓ | 16,500 | 33,000 | 41,300 | 16,500 | 82,500 |\n80 B (primary) |\n80 / 80 | ✓ | 14,438 |\n28,875 |\n36,094 |\n14,438 |\n72,188 |\n175 B (GPT-3-class) |\n175 / 175 | ✓ | 6,600 | 13,200 | 16,500 | 6,600 | 33,000 |\n320 B (dense) |\n320 / 320 | ✓ | 3,610 | 7,220 | 9,025 | 3,610 | 18,050 |\nMixtral-8×7B |\n47 / 12.9 | ✓ | 24,575 | 49,150 | 61,440 | 89,535 |\n122,900 |\nMixtral-8×22B |\n141 / 39 | ✓ | 8,190 | 16,380 | 20,480 | 29,615 |\n40,960 |\nDeepSeek-V3 |\n671 / 37 | ✗ 2 dies | 1,720 / die | 3,440 | 4,300 | 31,216 |\n8,600 |\nGPT-5-class† |\n1,800 / 220 | ✗ 4 dies | 642 / die | 1,283 | 1,604 | 5,250 |\n3,210 |\nOpus-4.8-class† |\n2,000 / 280 | ✗ 5 dies | 578 / die | 1,155 | 1,444 | 4,125 |\n2,890 |\n\n†*Total / active counts for GPT-5-class (assumed: 1.8 T total, 220 B active, 8 experts top-2) and\nOpus-4.8-class (assumed: 2 T total, 280 B active, 16 experts top-3) are estimates consistent with\nindustry rumors as of mid-2026; substitute actual figures upon disclosure.*\n\nFor the production 80 B design point, the\n**combined INT4 + speculative-decoding effective throughput is ~ 72,000 tokens/s/die — a 5× 
multiplier\nover the raw FP8 dense baseline**\nand ~ 240× the equivalent NVIDIA Rubin (R200) figure (~ 267× vs. AMD Instinct MI455X) — both HBM4 parts whose ~ 300 and ~ 270 tokens/s 80 B FP8 decode at B = 1 are governed by their HBM4 bandwidth (22 and 19.6 TB/s), not their far larger peak FLOPS. For sparse-MoE workloads, the MoE multiplier alone is the dominant\neffect: DeepSeek-V3 at 671 B total / 37 B active yields ~ 31,000 tokens/s/die on Sophon despite requiring\n2 dies in tensor-parallel to hold the full expert pool.\n\n| Parameter | Value |\n|---|---|\n| Memory | 2T0C 2D-TMD gain-cell DRAM |\n| On-die capacity | 330 GB |\n| BF16 throughput | 2,100 TFLOPS |\n| Energy / BF16 forward MAC | 0.620 pJ |\n| Energy / BF16 training MAC (fwd + bwd) | 0.940 pJ |\n| Idle power | ~ 3 W (refresh ≈ 0.08 W @ 1 Hz) |\n\nProduction large-model training spends on-die memory for three things: weights, optimizer state, and (gradient-checkpointed) activations. Sophon's 330 GB capacity supports a memory-efficient first-order optimizer (SGD with momentum, Lion, or AdEMAMix) for an 80B BF16 model:\n\n| State | Size | Notes |\n|---|---|---|\n| Model weights (BF16) | 160 GB | 80B × 2 bytes |\n| Optimizer state (BF16, first-order) | 160 GB | SGD-momentum velocity, or Lion update; one BF16 tensor per parameter |\nTotal model state |\n320 GB |\nFits in 330 GB |\n| Activation headroom | ~ 10 GB |\nGradient-checkpointed activations |\n\nTraining throughput is measured in tokens processed per second through a full forward + backward pass. The\nstandard estimate of 6 × Nparams FLOPs per training token already aggregates forward (2N) and\nbackward (4N) costs [[13]](#ref-13) (see **Eq. 8**):\n\n| Metric | Value |\n|---|---|\n| BF16 TFLOPS available (55% util.) 
| 1,155 effective TFLOPS |\n| FLOPs per training token (80B model) | 6 × 80B = 480 GFLOP |\n| Training tokens/s (per die) | 2,406 |\n| Tokens per training-day (single die) | ~ 208 M |\n| Tokens per training-year (single die) | ~ 75.9 B |\n| Cluster throughput — 256 dies | ~ 616 K tokens/s = ~ 53.2 B tok/day |\n| Cluster throughput — 1,024 dies | ~ 2.46 M tokens/s = ~ 213 B tok/day |\n| 1 T-token training run — 256-die cluster | ~ 19 days |\n| 1 T-token training run — 1,024-die cluster | ~ 4.7 days |\n| 15 T-token run (Llama-3-class) — 1,024-die cluster | ~ 71 days |\n\nA 256-die Sophon cluster trains an 80B model on 1 T tokens in about 19 days (4.7 days on 1,024 dies), on roughly the same die count as\na comparable NVIDIA Rubin (R200) or AMD Instinct MI455X (HBM4) training fleet [[13]](#ref-13)[[15]](#ref-15) — with no HBM, no NVLink bandwidth bottleneck on weights (all\nweights are in-tile), and NVLink used only for gradient all-reduce across dies. The per-die figure of\n**2,406 training tokens/s** is the unit of cluster throughput; per-die runs of frontier-scale\ncorpora are not the intended use case. See **Eq. 9** for the cluster-time formula.\n\n| Phase | Chip power | Notes |\n|---|---|---|\n| Idle (model resident) | ~ 3 W | Refresh ≈ 0.08 W (1 Hz) + 2 W SRAM scratchpad; no compute |\n| Forward pass (55% util.) | ≈ 379 W | 277 W DRAM + 81 W MAC + ~1 W refresh + 18 W NoC + 2 W static |\n| Backward pass (55% util.) | ≈ 749 W | + 370 W gradient writes |\n| Training-step avg. | ~ 564 W | Time-average of fwd + bwd |\n| Peak forward burst (100%) | ≈ 690 W | Liquid cold-plate envelope |\n| Peak fwd + bwd burst (100%) | ≈ 1,362 W | Within Tjmax on liquid cold-plate (Tj ≈ 94 °C) |\n\nProduction training operates near the 564 W time-average. 
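
The per-die training figures in the two tables above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the 6N FLOPs-per-token rule and a simple mean of forward and backward power as the time-average (the same convention the 564 W figure implies); the constant and function names are illustrative.

```python
# Minimal sketch: per-die training throughput, run time, and energy per token.
PEAK_BF16_FLOPS = 2_100e12   # peak BF16 rate per die
UTILIZATION = 0.55           # sustained utilization used throughout
FWD_W, BWD_W = 379.0, 749.0  # forward / backward pass power at 55% util.

def train_tokens_per_s(n_params):
    # 6 * N FLOPs per training token (forward 2N + backward 4N).
    return PEAK_BF16_FLOPS * UTILIZATION / (6.0 * n_params)

def days_for_tokens(total_tokens, dies, n_params):
    # Wall-clock days, assuming perfect linear scaling across dies.
    return total_tokens / (train_tokens_per_s(n_params) * dies) / 86_400

tok_s = train_tokens_per_s(80e9)
avg_power_w = (FWD_W + BWD_W) / 2.0   # assumed equal time in fwd and bwd
joules_per_token = avg_power_w / tok_s

print(f'{tok_s:,.0f} tokens/s per die')
print(f'{joules_per_token:.3f} J per training token')
print(f'{days_for_tokens(1e12, 256, 80e9):.1f} days for 1 T tokens on 256 dies')
print(f'{days_for_tokens(1e12, 1024, 80e9):.1f} days for 1 T tokens on 1,024 dies')
```

The linear-scaling assumption ignores all-reduce overhead, so the cluster times are lower bounds.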
Sophon's\n**0.23 J/training token** (564 W / 2,406 tokens/s) is the figure that should be used for\nenergy-cost projections; the lower forward-pass-only figure undercounts the backward gradient-write cost.\nThe collapse from the prior 827 W / 0.34 J figures is due to the 1 fA/µm off-current keeping refresh\nnegligible (≈ 0.08 W) instead of the large refresh assumed in those earlier figures.\n\n| Metric | NVIDIA Rubin (R200) | AMD Instinct MI455X | Sophon |\nRatio (vs Rubin / vs MI455X) |\n|---|---|---|---|---|\n| Process | TSMC N3 (Rubin dual-die) | TSMC N3 (MI455X) | 28 nm + 2D-TMD M3D | — |\n| Memory | 288 GB HBM4 | 432 GB HBM4 | 330 GB 2T0C DRAM |\n1.15× / 0.76× capacity |\n| BF16 dense TFLOPS | ≈ 8,750 | ≈ 10,000 | 2,100 |\n0.24× / 0.21× (GPUs higher) |\n| Weight bandwidth | 22 TB/s (HBM4) | 19.6 TB/s (HBM4) | 4,200 TB/s in-tile |\n~ 191× / ~ 214× |\n| 80B training tokens/s (B = 1 micro-batch)† | ~ 880 | ~ 785 | 2,406 |\n~ 2.7× / ~ 3.1× |\n| BF16 forward MAC energy | ~ 1.2 pJ (incl. HBM) | ~ 1.2 pJ (incl. HBM) | 0.620 pJ |\n1.9× lower |\n| Energy / training token | ~ 4.48 J (B = 1 estimate) | ~ 4.48 J (B = 1 estimate) | 0.23 J |\n~ 19× lower |\n| TFLOPS/W (BF16 peak) | ~ 4.86 | ~ 5.88 | 3.72 |\n0.77× / 0.63× (GPUs higher peak) |\n| Idle power (80B resident) | ~ 10–15 W (HBM4 self-refresh) | ~ 12–18 W (HBM4 self-refresh) | ~ 3 W |\n~ 4× lower |\n| Training power | ~ 1,800 W TDP | ~ 1,700 W TDP | ~ 564 W avg |\n~ 3.2× / ~ 3.0× lower |\n| BOM | ~ $82,800\n|\n\n†*GPU training tokens/s estimate: at B = 1 micro-batch the per-die throughput is HBM-bandwidth-limited,\n~ 880 tokens/s on Rubin (22 TB/s HBM4) and ~ 785 tokens/s on MI455X (19.6 TB/s HBM4). 
At high batch the\nfar larger peak FLOPS of both GPUs (≈ 8,750 / 10,000 BF16 TFLOPS) raise aggregate node throughput well\nabove Sophon — but peak FLOPS do not help at B = 1, where weight-fetch bandwidth governs and Sophon's\n4,200 TB/s in-tile path dominates.*\n\nSophon training throughput follows\n**ttrain = 1,155 TFLOPS / (6 × Nparams)** tokens/s/die at 55%\nutilization (the standard 6N rule\n\n| Model size | Weights + opt state (BF16+Lion) | Fits on 1 Sophon? | Train tokens/s (B = 1, 55%) | Time for 1 T tokens (single die) | Time for 1 T tokens (1,024-die cluster) |\n|---|---|---|---|---|---|\n| 7 B | 28 GB | ✓ (302 GB free) | 27,500 | 421 days | 9.9 hours |\n| 13 B | 52 GB | ✓ (278 GB free) | 14,810 | 782 days | 18 hours |\n| 34 B | 136 GB | ✓ (194 GB free) | 5,660 | 5.59 years | 2.0 days |\n| 70 B | 280 GB | ✓ (50 GB free) | 2,750 | 11.5 years | 4.1 days |\n| 80 B | 320 GB | ✓ (10 GB headroom) | 2,406 | 13.2 years | 4.7 days |\n| 96 B | 384 GB | ✗ — needs 96-tier die or 2 dies | 2,005 / die | — | 5.7 days |\n| 175 B | 700 GB | ✗ — needs 3 dies (TP) | 1,100 / die | — | 10.4 days (3,072-die fleet) |\n| 405 B | 1,620 GB | ✗ — needs 5 dies | 476 / die | — | 24 days (5,120-die fleet) |\n| 1.0 T (GPT-4 BF16) | 4,000 GB | ✗ — needs 13 dies | 193 / die | — | 58 days (13,312-die fleet) |\n\n**Compared with 2026 HBM4 flagships** — NVIDIA Rubin (R200, 288 GB HBM4, 22 TB/s) and AMD Instinct MI455X (432 GB HBM4, 19.6 TB/s):\n\nThe Sophon advantage at any given model size stems primarily from the elimination of HBM traffic; the gap shrinks at very large batches (where Rubin and MI455X amortize HBM fetch across more MACs per weight) but never closes because Sophon still wins on energy-per-MAC and on energy-per-die — even though both GPUs' raw peak BF16 throughput per die is higher (Sophon BF16 dense is ~ 0.24× Rubin / 0.21× MI455X). 
Peak FLOPS do not help at low batch, where memory bandwidth governs.\n\nBecause inference and training run on the **same die**, a production AI cluster is built from a\nsingle Sophon Stock-Keeping Unit (SKU) and repartitioned by software:\n\n| Phase | Mode | Role |\n|---|---|---|\nPre-training |\nTraining (array) | Large-scale gradient-descent training; BF16 weights + first-order optimizer state in-tile |\nFine-tuning / LoRA |\nTraining (single die) | Adapter or full-weight updates in DRAM |\nCheckpoint snapshot |\nNVMe write | Final weights flushed to off-die NVMe |\nProduction inference |\nInference (array) | Load checkpoint, serve at 25.8 mJ/token (FP8), ~ 3 W idle |\n\nThis flow lets a single fleet **elastically shift dies between training and serving** without\nany hardware swap: the same silicon that trained a model can serve it (BF16 directly, or FP8 after a\none-step quantization), and dies can be re-tasked from serving back to fine-tuning as demand shifts. The\nonly operational discipline DRAM imposes is volatility management — weights are checkpointed to NVMe and\nreloaded at boot (§11.2); there is no non-volatile \"model resident across power-off\" property, but in a\ncontinuously-powered datacenter the ~ 3 W idle makes keeping a model resident essentially free.\n\nThe thermal envelope across cooling technologies is shown in **Figure 11**, with all operating\npoints overlaid. See **Eq. 15** (effective vertical conductivity) and\n**Eq. 16** (junction temperature) for the derivation.\n\nAll numbers are per **7.5 cm²** die. 
Effective vertical thermal conductivity through the BEOL +\nCu-MIV stack: **keff = 24.7 W/m·K** (Cu fill 6%, k\n\n| Scenario | Ptot | Rpkg | ΔTpkg | ΔTstack | Tjunction (°C) |\n|---|---|---|---|---|---|\n| FP8 decode, liquid cold-plate | 373 W | 0.05 K/W | 18.7 K | 0.45 K | 44.1 |\n| BF16 decode / forward pass, liquid cold-plate | 379 W | 0.05 K/W | 19.0 K | 0.46 K | 44.4 |\n| FP8 peak burst, liquid cold-plate | 681 W | 0.05 K/W | 34.1 K | 0.82 K | 59.9 |\n| Backward pass, liquid cold-plate | 749 W | 0.05 K/W | 37.5 K | 0.91 K | 63.4 |\n| Training avg., liquid cold-plate | 564 W | 0.05 K/W | 28.2 K | 0.68 K | 53.9 |\n| Peak fwd burst, liquid cold-plate | 690 W | 0.05 K/W | 34.5 K | 0.83 K | 60.3 |\n| Peak fwd+bwd burst | 1,362 W | 0.05 K/W | 68.1 K | 1.65 K | 94.8 |\n| FP8 decode, air-cooled (reference) | 373 W | 0.30 K/W | 111.9 K | 0.45 K | 137.4 |\n\n*All liquid-cooled operating points — including the 100% fwd+bwd peak (1,362 W → 94.8 °C) — stay below\nTjmax = 105 °C on a standard liquid cold plate. Refresh is negligible (≈ 0.08 W at 1 Hz, from\nthe 1 fA/µm off-current) and does not enter the thermal budget.*\n\n| Cooling | Rpkg (K/W) | Max sustained W (Tjmax 105 °C, 25 °C ambient) |\n|---|---|---|\n| Air (1U server) | 0.30 | ~ 267 W |\n| Liquid cold-plate (datacenter standard) | 0.05 | ~ 1,600 W |\n| Microfluidic | 0.02 | ~ 4,000 W |\n| Two-phase immersion | 0.01 | ~ 8,000 W |\n\nInference (373 W FP8 decode, 681 W peak) fits comfortably within liquid cold-plate limits. Decode exceeds the\n~ 267 W air-cooled limit by roughly 40%, so edge-inference deployments without liquid plumbing must run\nat moderately reduced clock rates. The\n**training time-average (564 W)** also fits liquid cold-plate with wide margin, and even the\nfwd+bwd 100%-duty peak (1,362 W → 94.8 °C) stays within Tjmax on a standard liquid cold plate,\nwith refresh a negligible ≈ 0.08 W.\n\nThe stack ΔT above used a generic BEOL dielectric (kBEOL = 2.0 W·m⁻¹K⁻¹). 
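
The junction temperatures and cooling limits tabulated above follow from a two-resistor lumped model. The sketch below is one way to reproduce them, assuming the stack thickness is 64 tiers at the 0.35 µm tier pitch from the stack table and that the full die power crosses the whole stack (the lumped convention of the ΔTstack column); the constant and function names are illustrative.

```python
# Minimal sketch: lumped junction temperature for the tabulated operating points.
T_COOLANT_C = 25.0       # coolant / ambient temperature
DIE_AREA_M2 = 7.5e-4     # 7.5 cm^2 die
K_EFF = 24.7             # W/m.K, effective vertical conductivity
TIER_PITCH_M = 0.35e-6   # per-tier thickness (assumed from the stack table)
N_TIERS = 64

def stack_resistance_k_per_w():
    # 1-D conduction through the full stack: R = thickness / (k * A).
    return N_TIERS * TIER_PITCH_M / (K_EFF * DIE_AREA_M2)

def junction_temp_c(power_w, r_pkg_k_per_w):
    # Package rise plus through-stack rise, all power routed through the stack.
    return T_COOLANT_C + power_w * (r_pkg_k_per_w + stack_resistance_k_per_w())

def max_sustained_w(r_pkg_k_per_w, tj_max_c=105.0):
    # Package-limited ceiling; the sub-Kelvin stack term is neglected here.
    return (tj_max_c - T_COOLANT_C) / r_pkg_k_per_w

for label, power in [('FP8 decode', 373.0), ('training avg.', 564.0),
                     ('peak fwd+bwd', 1362.0)]:
    print(f'{label}: {junction_temp_c(power, 0.05):.1f} C on a liquid cold plate')
print(f'air-cooled ceiling: {max_sustained_w(0.30):.0f} W')
```

Because the package term is roughly forty times the stack term at 0.05 K/W, the choice of cold plate dominates the result.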
Specifying the\ninter-tier dielectric as **Al₂O₃** changes vertical conduction only marginally:\nBEOL-compatible ALD Al₂O₃ grown at ≤ 450 °C is amorphous, with a thin-film thermal conductivity of\n**kd ≈ 1.8 W·m⁻¹K⁻¹** (bulk single-crystal sapphire reaches ~ 30 W·m⁻¹K⁻¹, but\nthat phase is unreachable in a low-temperature BEOL flow). Because the 6% Cu-MIV via fill dominates the\nparallel vertical path, the effective conductivity is essentially unchanged from §6:\n\nHeat exits through the base (backside cold plate), so the top tier is hottest. Conservatively routing the\nfull die power *P* through the stack to the base — the same lumped convention as the\nΔTstack column above — tier *i* (counted from the base, i = 0…N, N = 64) sits at the\npackage-limited base temperature plus the through-stack rise:\n\n*T*i = *T*coolant + *P* · Rpkg + (*i*/*N*) · ΔTstack(*P*)\n\nOn a liquid cold plate (Rpkg = 0.05 K/W, 25 °C coolant) the as-built stack — Al₂O₃ dielectric\nwith the 6% Cu-MIV via network — gives the per-tier profile below.\n\n| Tier (from base) | 564 W (training avg.) | 1,362 W (peak fwd+bwd) |\n|---|---|---|\n| Base Si (tier 0) | 53.2 °C | 93.1 °C |\n| Tier 16 | 53.4 °C | 93.5 °C |\n| Tier 32 (mid-stack) | 53.5 °C | 93.9 °C |\n| Tier 48 | 53.7 °C | 94.3 °C |\n| Tier 64 (top) | 53.9 °C | 94.8 °C |\n| Top-to-base ΔT | 0.7 K | 1.7 K |\n\n**Every one of the 64 tiers sits within 1.7 K of the base** — the top tier reaches only\n**53.9 °C** at the 564 W training average and **94.8 °C** at the 1,362 W fwd+bwd\npeak, both inside Tjmax = 105 °C. With the 6% Cu-MIV via network carrying the vertical heat, the\nAl₂O₃ dielectric is nearly thermally invisible: swapping it for the generic 2.0 W·m⁻¹K⁻¹ BEOL value shifts\nkeff by < 1%. These are conservative bounds — per-tier dissipation is distributed across the\n64 tiers rather than injected at the top, which halves the through-stack term and flattens the profile\nfurther.\n\nThe roadmap through 2034 is plotted in **Figure 12**.\n\nSophon scales on the BEOL TMD process node cadence. 
Capacity grows by shrinking the 2T0C cell; retention is\npreserved or improved at finer nodes because Ioff drops roughly as fast as the gate length (storage\nnode capacitance also shrinks, but the ratio τ = C·V/(2Ioff) stays similar).\n\nTwo scaling effects compound at each node:\n\nThe table below uses the conservative model: capacity = geometric with no routing derate; compute = base ×\n(28/F)² with no routing derate (production designs will see ~50% routing-limited derate). Throughput is\nreported as **80-billion-parameter, batch-1 decode tokens/s**: because Sophon decode is\ncompute-bound, it scales with on-die compute (∝ 1/F²), whereas an HBM-based accelerator stays bandwidth-bound\nand scales only with HBM bandwidth.\n\n| Year | Node | Tiers | Cell | Capacity (GB) |\nBF16 decode (tok/s, 80B) |\nFP8 decode (tok/s, 80B) |\nPkg power (FP8 decode) |\nFP8 decode (tok/s/W) |\n|---|---|---|---|---|---|---|---|---|\n| 2026 | 28 nm | 64 | 8 F² | 330 |\n7,219 |\n14,438 |\n373 W | 38.7 |\n| 2028 | 22 nm | 80 | 7 F² | 763 |\n14,619 |\n29,237 |\n627 W | 46.6 |\n| 2030 | 14 nm | 96 | 6 F² | 2,639 |\n43,314 |\n86,628 |\n1,351 W | 64.1 |\n| 2032 | 10 nm | 128 | 5 F² | 8,276 |\n60,000 |\n120,000 |\n1,500 W |\n80.0 |\n| 2034 | 7 nm | 160 | 4 F² | 26,390 |\n74,850 |\n149,700 |\n1,500 W |\n99.8 |\n\nEvery die is held to a fixed **1,500 W package power** envelope, so the roadmap scales along\ntwo independent axes. **Capacity** grows with cell density and tier count — the 2T0C array is\nread-mostly and the 1 fA/µm off-current keeps refresh at ≈ 0.08 W, so memory is not power-bound and climbs\nfrom 330 GB to 26 TB unconstrained. **Compute throughput**, by contrast, is bounded by the\n1,500 W package: each shrink improves energy efficiency (tok/W), and within the same 1,500 W that buys more\nthroughput — but only at the efficiency rate, not the raw-tile rate. 
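
The power-capping rule behind the roadmap table can be written as a one-line clamp. The sketch below applies it to two rows, using the tokens-per-watt column and the uncapped decode powers quoted in this section (1,351 W at 14 nm, about 5.8 kW at 7 nm); the function name is illustrative, and small differences from the table come from the rounded tok/s/W inputs.

```python
# Minimal sketch: decode throughput under the fixed 1,500 W package envelope.
PACKAGE_CAP_W = 1_500.0

def capped_decode_tok_s(tok_per_w, uncapped_power_w, cap_w=PACKAGE_CAP_W):
    # Below the cap the die is tile-limited and draws its uncapped power;
    # above it, only as many tiles switch as the package power allows.
    active_power_w = min(uncapped_power_w, cap_w)
    return tok_per_w * active_power_w

tile_limited_14nm = capped_decode_tok_s(64.1, 1_351.0)   # stays under the cap
power_capped_7nm = capped_decode_tok_s(99.8, 5_800.0)    # clamped to 1,500 W

print(f'14 nm: {tile_limited_14nm:,.0f} tokens/s (tile-limited)')
print(f' 7 nm: {power_capped_7nm:,.0f} tokens/s (power-capped)')
```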
From the 10 nm node on, the die has far\nmore tiles than 1,500 W can switch at once for an 80B decode, so the reported throughput is the\n*power-capped* figure (1,500 W × tok/W); the surplus tiles hold weights (capacity), not active\ncompute. A 7 nm die thus pairs 26 TB of on-die memory with a power-capped ≈ 149,700 tok/s 80B decode, rather\nthan the ≈ 577,000 tok/s that an uncapped die drawing ≈ 5.8 kW would deliver. Decode at 28, 22, and 14 nm stays below the\ncap (373 W, 627 W, 1,351 W) and is tile-limited as before.\n\n| Year | HBM gen | 8-stack cap (GB) | Sophon / HBM |\n|---|---|---|---|\n| 2026 | HBM3e | 288 | 1.1× |\n| 2028 | HBM4 | 512 | 1.5× |\n| 2030 | HBM4e | 768 | 3.4× |\n| 2032 | HBM5 | 1,024 | 8.1× |\n| 2034 | HBM5e | 1,536 | 17.2× |\n\nSophon widens its capacity lead against HBM every generation. More importantly, the\n**bandwidth** lead is already insurmountable: 4.20 PB/s vs. HBM4's ~ 20 TB/s (8-stack package; Rubin 22, MI455X 19.6)\n— a **~ 191–214× gap** that no interposer-based approach can close.\n\nAs transistor scaling slows and data-center power becomes the binding constraint, the practical ceiling on\ndeployable model size is set not by silicon area but by the **energy infrastructure** — the power\na grid, campus, or rack can deliver and cool. A model's lifetime energy splits into two regimes that scale\ndifferently and are bounded by different figures of merit: a\n**recurring inference (serving) cost** that is memory-bound and grows *linearly* with\nparameter count, and a **one-time training cost** that is compute-bound and grows roughly\n*quadratically* with model size at compute-optimal data. An architecture can dominate one regime\nwithout dominating the other, so we treat each in turn.\n\nUnder a fixed power budget, the largest model an architecture can *serve* is fixed by its energy per\ngenerated token. 
Because each decoded token reads (HBM-bound) or activates (compute-in-memory) essentially the\nentire weight set once, decode energy is linear in parameter count, *E*tok(*N*) =\nκ*N*, and the ceiling follows directly:\n\n*N*max = *P*budget / (κ *T*agg)\n\nwhere *P*budget is the available power, *T*agg the aggregate decode\nthroughput the deployment must sustain, and κ the per-parameter token energy (J · token⁻¹ ·\nparam⁻¹). The budget and throughput cancel when comparing architectures: reachable model size\nscales as **1/κ**. κ is therefore the single figure of merit for energy-bounded scaling.\n\n| Architecture | Energy / token @ 80B | κ (J · tok⁻¹ · param⁻¹) | Model-size reach at a fixed energy budget |\n|---|---|---|---|\n| HBM4-bound GPU — NVIDIA Rubin (R200) | 4.48 J | 5.6 × 10⁻¹¹ | 1× (baseline) |\n| HBM4-bound GPU — AMD Instinct MI455X | 4.48 J | 5.6 × 10⁻¹¹ | 1× (baseline) |\n| Monolithic-3D digital CIM — Sophon (28 nm) | 25.8 mJ | 3.2 × 10⁻¹³ | ≈ 173× |\n| Monolithic-3D digital CIM — Sophon (7 nm, 2034) | 10.0 mJ | 1.25 × 10⁻¹³ | ≈ 448× |\n\nTo put this on a real footing, calibrate against today's deployed frontier rather than a hypothetical build. A\n6-trillion-parameter MoE of the Claude Fable-5 class (≈ 125 billion active per token) is served on 2026 HBM4 GPUs (Rubin / MI455X)\nat roughly **0.35 GW** — i.e. ≈ 7 J per generated token across an aggregate serving\nintensity of ≈ **50 million tokens/s**. Holding that same intensity, the largest model each\narchitecture can serve within a **0.5 GW** envelope is below — a soft ceiling that scales\ninversely with the target throughput (halve the tokens/s and it doubles), not a hard wall. 
Because energy is\ngated by the parameters *activated* per token, sparse Mixture-of-Experts (MoE) models — which route\neach token to only a fraction of their experts — raise the ceiling by the total-to-active ratio:\n\n| Architecture | Dense model @ 0.5 GW | MoE model @ 0.5 GW (≈ 48× total : active, Fable-5 class) |\n|---|---|---|\n| HBM4-bound GPU — NVIDIA Rubin (R200) | ≈ 179 billion | ≈ 8.6 trillion † |\n| HBM4-bound GPU — AMD Instinct MI455X | ≈ 179 billion | ≈ 8.6 trillion † |\n| Monolithic-3D digital CIM — Sophon (28 nm) | ≈ 31 trillion | ≈ 1.5 quadrillion |\n| Monolithic-3D digital CIM — Sophon (7 nm, 2034) | ≈ 80 trillion | ≈ 3.8 quadrillion |\n\nSo even today's 28 nm Sophon clears the **100-trillion-parameter brain-scale threshold** by ≈ 15×\nas a frontier MoE (≈ 1.5 quadrillion; ≈ 3.8 quadrillion at the 2034 node), while an HBM-bound build stays\npinned near today's frontier. † That HBM column is generous: at production concurrency the\nactivated-expert union across users approaches the full model (eroding the MoE saving toward dense), and HBM\ncapacity binds long before the energy ceiling. Sophon instead holds *every* expert on-die and computes\nonly the routed ones (§5.A.6), realizing the full multiplier in both energy *and* capacity.\n\n**The 1,500 W package cap (§7) does not move these ceilings.** Reachable model size depends only\non κ — energy per token per parameter — which is an *intrinsic device property*, independent of how the\npower budget is packaged. In *N*max = *P*budget / (κ *T*agg) the per-die power cancels: dies = *P*budget ÷ per-die power and per-die\nthroughput = per-die power ÷ (κ*N*), so their product is *P*budget / (κ*N*)\nregardless of the die's wattage. Capping each die at 1,500 W therefore changes only the\n*per-die throughput* and the *die count*, not the model size a given grid can serve. 
Concretely,\nat the 2034 7 nm node a 1,500 W die serves a 100T model at ≈ 120 tok/s (12.5 J/token), so the ≈ 0.63 GW\nbrain-scale serving budget below is spread across ≈ 0.42 M such dies — the energy budget, not the package, is\nthe ceiling.\n\n**Worked example — serving a frontier MoE.** A Claude Fable-5 / GPT-5-class model (≈ 6T total, ≈\n125B active, 1-million-token context) sits comfortably under the energy ceiling yet hits two capacity walls on\nHBM. The *weight wall*: 6 TB of FP8 weights (12 TB at BF16) force a single replica across\n**14–21 premium GPUs** (288–432 GB HBM4, Rubin / MI455X) on a ≈ 130 TB/s NVLink fabric before the first user is\nserved. The *KV-cache wall*: with 64 layers, 48 KV heads, and 128-dim heads, the cache is 2 (K,V)\n× 64 × 48 × 128 × 2 bytes ≈ **1.6 MB per token** — ≈ 1.6 TB for one\n1-million-token session (0.8 TB at FP8). Total memory grows as *weights + users × KV*:\n\n| Concurrent 1 M-context users | Total HBM (FP8 weights + FP16 KV) | ≈ 432 GB HBM4 GPUs required |\n|---|---|---|\n| 1 | ≈ 7.6 TB | ≈ 18 |\n| 10 | ≈ 22 TB | ≈ 51 |\n| 100 | ≈ 163 TB | ≈ 377 |\n| 1,000 | ≈ 1.6 PB | ≈ 3,704 |\n\nPast a handful of users the KV cache dominates — at 1,000 sessions it alone is ≈ 1.6 PB, over\n**3,700 GPUs** — which is why providers cap context length aggressively. Sophon removes both\nwalls: all ≈ 6 TB of experts are **resident in on-die 2T0C DRAM** (≈ 18 dies in 2026, ≈ 3 by\n2030, a single 26 TB die by 2034), only the ≈ 125 B routed experts compute per token at 4.2 PB/s, and the KV\ncache shares that same high-bandwidth memory —\n**no weight wall, no inter-chip expert shuffle, ≈ 174× lower energy per token**. Other in-memory\ndesigns do not change this: **SRAM CIM** has low access energy but ≈ 100× lower density (a\n*capacity* wall); **analog / RRAM CIM** pays an ADC/DAC and precision penalty that grows\nwith array size. 
Among architectures that can both *store* and *serve* a large model at a usable\nenergy per token, monolithic-3D digital CIM has the lowest κ by two-to-three orders of magnitude.\n\n**Brain-scale case study (100-trillion parameters).** A 100T model — comparable to the synapse\ncount of the human brain, and ≈ 1,250× today's 80B frontier — makes the architectural gap decisive. Holding\nthe per-token service level fixed, the energy each architecture must spend per token, and the resulting\nmultiple of *today's* 80B serving energy, are:\n\n| Architecture (serving a 100T model) | Energy / token | Energy infrastructure vs. today's 80B frontier |\n|---|---|---|\n| HBM-bound GPU | ≈ 5,600 J | ≈ 1,250× |\n| Sophon — monolithic-3D CIM (28 nm) | ≈ 32 J | ≈ 7.1× |\n| Sophon — monolithic-3D CIM (7 nm, 2034) | ≈ 12.5 J | ≈ 2.8× |\n\nThe conclusion is stark. At that same realistic ≈ 50 million tokens/s intensity, a **dense** 100T\nmodel serves within **≈ 1.6 GW** on Sophon (28 nm) — ≈ 0.63 GW at the 7 nm node, or just\n**≈ 34 MW** as a 48× MoE — whereas the same dense 100T on HBM-bound GPUs would draw ≈\n**280 GW**, on the order of a tenth of all global electricity generation for a single model. In\nper-token terms (table above) that is **≈ 1,250×** the energy of today's 80B frontier on HBM,\nagainst only **≈ 7.1×** (28 nm) or **≈ 2.8×** (7 nm) on Sophon —\n**well under a 100× scale-up**. Equivalently, at a sustainable 1 J per decoded token an HBM\ndesign tops out near **18 billion** parameters, Sophon at **≈ 3.1 trillion** (28 nm)\nto **≈ 8.0 trillion** (7 nm), higher still for MoE. And the model must also *fit*: off-die\nHBM/interposer capacity scales far more slowly than on-die 2T0C density (§7 roadmap), so HBM systems hit a\ncapacity-and-sharding wall before the energy wall. 
Energy — and capacity well before it — is the serving wall,\nnot transistors; vertical integration removes both.\n\nWhere serving is a recurring, memory-bound cost, training is a **one-time, compute-bound** cost.\nA training step runs at large batch, so every weight read is amortized across thousands of tokens and the\nmemory wall that dominates single-stream decode all but disappears — what remains is arithmetic. The energy to\ntrain a model is therefore set by the energy per floating-point operation, ε, rather than by the per-token\nmemory traffic κ that governs inference:\n\n*E*train ≈ 6 · *N*act · *D* · ε\n\nThe factor 6 counts two forward plus four backward FLOPs per active parameter per token; for dense models\n*N*act = *N*, while a Mixture-of-Experts model engages only its routed experts, so\n*N*act is the active-parameter count. The decisive difference from the inference wall is the\n**quadratic** growth: at compute-optimal data (*D* ≈ 20*N*, Chinchilla), doubling a\ndense model roughly *quadruples* its training energy. The architectural payoff is therefore\n*sub-linear* — at a fixed training-energy budget the trainable model size scales as *N* ∝ 1/√ε,\nso an *A*-fold reduction in energy-per-FLOP buys only a √*A*-fold larger dense model:\n\n| Architecture | Energy / BF16 training MAC | ε (J · FLOP⁻¹) | Trainable dense-size reach (∝ 1/√ε) |\n|---|---|---|---|\n| HBM4 GPU — NVIDIA Rubin (R200) | ≈ 4.0 pJ | 2.0 × 10⁻¹² | 1× (baseline) |\n| HBM4 GPU — AMD Instinct MI455X | ≈ 4.0 pJ | 2.0 × 10⁻¹² | 1× (baseline) |\n| Monolithic-3D digital CIM — Sophon (28 nm) | 0.94 pJ | 4.7 × 10⁻¹³ | ≈ 2.1× |\n| Monolithic-3D digital CIM — Sophon (7 nm, 2034) | ≈ 0.43 pJ | ≈ 2.1 × 10⁻¹³ | ≈ 3.1× |\n\nSophon's training figure of merit comes from the same digital-CIM adder tree and on-die gradient writes as its\ninference path (§3.C.4): ≈ 0.94 pJ per BF16 training MAC versus ≈ 4.0 pJ on an HBM4 GPU (NVIDIA Rubin (R200) or AMD Instinct MI455X) — a\n**≈ 4.3× energy-per-FLOP advantage**. 
This is real, but far smaller than the architecture's\n**≈ 174× inference advantage**, and that gap is the point: inference is memory-bound, where\nholding every weight on-die is decisive, whereas training is compute-bound, where the edge narrows to the\nper-MAC arithmetic energy. Folded through the 1/√ε relationship, Sophon trains a ≈ 2.1× larger dense model\nthan a Rubin (R200) / MI455X-class HBM4 GPU within the same energy budget at 28 nm (a projected ≈ 3.1× at the 2034 node, treating the 7 nm\nper-FLOP figure as a node-scaling projection).\n\nAt brain scale the two regimes reach the same verdict. Training a **100T dense** model\ncompute-optimally would demand *D* ≈ 2 × 10¹⁵ tokens — roughly\n**50–100× more than all the high-quality text in existence** — and ≈ 1.2 × 10³⁰ FLOPs;\nthe data wall alone makes dense brain-scale training impossible, so sparsity is not optional. As a\n**48× MoE** only ≈ 2T parameters activate per token, cutting the dominant 6*N*act*D* term ≈ 48-fold — to 6 × 2 × 10¹² × 2,000T ≈\n**2.5 × 10²⁸ FLOPs**. Filling the same\n\n| 1 GW build · 100T model | Dies @ per-die power ‡ | Aggregate BF16 FLOP/s | 48× MoE · 2.5 × 10²⁸ FLOPs |\n|---|---|---|---|\n| HBM4 GPU — NVIDIA Rubin (R200) | ≈ 0.56 M @ 1,800 W | ≈ 5 × 10²⁰ | ≈ 1.6 years (≈ 13,900 h) · ≈ 14 TWh |\n| HBM4 GPU — AMD Instinct MI455X | ≈ 0.59 M @ 1,700 W | ≈ 5 × 10²⁰ | ≈ 1.6 years (≈ 13,900 h) · ≈ 14 TWh |\n| Monolithic-3D digital CIM — Sophon (28 nm) | ≈ 1.8 M @ 564 W | ≈ 2 × 10²¹ | ≈ 4.8 months (≈ 3,470 h) · ≈ 3.3 TWh |\n| Monolithic-3D digital CIM — Sophon (7 nm, 2034) | ≈ 0.67 M @ 1,500 W | ≈ 5 × 10²¹ | ≈ 2 months (≈ 1,390 h) · ≈ 1.4 TWh |\n\n‡ Dies = 1 GW ÷ per-die training power. 
Sophon's lower per-die power packs ≈ 3× more dies into\nthe same 1 GW than a Rubin (R200) / MI455X-class HBM4 GPU — each one smaller and cheaper — and together they deliver the higher aggregate\nthroughput that shortens the run; the die count is thus a consequence of low per-die power, not low\nefficiency. The 28 nm die averages 564 W, inside its 1,500 W package (1.8 M dies); a 7 nm die runs training at\nits 1,500 W package cap (0.67 M dies, §7). What sets the run time is *energy efficiency*, not die count\n— at fixed 1 GW the aggregate scales with FLOP/J, so the ≈ 2.5× BF16-training node gain (marginally below the\n2.58× FP8-inference gain, as training is more memory-bound) is the speed-up, however the budget is split into\ndies. The 7 nm part's far larger raw per-die throughput and capacity (§7) do not by themselves shorten a\npower-capped run. These figures are on a chip-compute basis (a real datacenter adds the ≈ 2.5× wall overhead noted above). Even the ≈\n4.8-month MoE run still needs the same ≈ 2,000T-token corpus to reach compute-optimal quality, so even the\nsparse model is *data-bound, not power-bound*: on the ≈ 30T tokens that exist its compute finishes in ≈\n2 days, but the model is far from converged. This dovetails exactly with the serving analysis:\n**brain-scale intelligence is reachable only as a sparse model, and only on an architecture that keeps every\nexpert resident on-die for both training and inference**\n(§5.A.6).\n\nThe 3-year Total Cost of Ownership (TCO) breakdown is plotted in **Figure 13** (derivation in\n**Eq. 
11–14**).\n\nSophon uses a 28 nm Si base wafer and a 64-tier 2D-TMD M3D stack, with the 2T0C DRAM module integrated at Metal-3 BEOL.\n\n| Cost item | Sophon (2T0C DRAM) | Notes |\n|---|---|---|\n| 28 nm wafer cost | $3,500 | 12-inch foundry, 2026 |\n| Gross dies per wafer | 69 | 750 mm² die |\n| Per-die wafer cost | $51 | gross |\n| Base wafer yield | 49.5% | negative-binomial (α = 3), A·D₀ = 0.75 |\n| Per-tier M3D BEOL adder | $52 | DRAM periphery area premium |\n| Total tier adder (64 tiers) | $3,328 | |\n| Combined yield (base 49.5% × stack 0.997⁶⁴ = 82.5%) | 40.8% | |\n| Final die cost | $8,273 | (wafer + tier) / yield |\n| Packaging | $60 | cold-plate-ready lid |\n| Memory programming | $0 | DRAM: none (load at boot) |\n| Test & burn-in | $25 | Known-Good-Die (KGD) wafer-level |\n| BOM per die | $8,358 | |\n\n**No DRAM IP license is required**: the 2T0C DRAM is implemented entirely with the same TMD\ntransistors used in the MAC array — it is PhantaField's own cell design, not licensed third-party IP.\n\n| Item | NVIDIA Rubin (R200) — HBM4 (288 GB) | AMD Instinct MI455X — HBM4 (432 GB) |\n|---|---|---|\n| GPU silicon + package (Morgan Stanley VR200 ÷ 72)\n|\n\nThe cost wall is the memory wall. Morgan Stanley estimates a single NVIDIA VR200 (Rubin) NVL72 rack at ≈ $7.8M, of which HBM memory alone is ≈ $2.0M — 25.7% of the entire rack, up +435% over the prior-generation GB300. Per accelerator (÷ 72) that is $55,000 of GPU silicon plus $27,800 of HBM4. Sophon removes the HBM line item in full, for a ~ 9.9–11.6× lower hardware BOM [[17]].\n\nA Rubin (R200) module ships with 288 GB HBM4 at ≈ 22 TB/s; an MI455X ships with 432 GB HBM4 at ≈ 19.6 TB/s.\nBoth parts now reach comparable capacity, but matching Sophon's bandwidth remains far out of reach:\nagainst those HBM4 figures, Sophon's 4,200 TB/s in-tile bandwidth is a\n**~ 191× gap** (vs. Rubin) / **~ 214× gap** (vs. MI455X) that cannot be closed at\nany price point within the interposer paradigm. 
The GPUs win on peak dense FLOPS (Sophon BF16 dense is\n0.24× Rubin / 0.21× MI455X), but peak FLOPS do not help at low batch, where weight-fetch bandwidth governs\ndecode throughput: at 80B FP8, HBM-bound decode is ≈ 300 tok/s (Rubin) and ≈ 270 tok/s (MI455X) vs.\nSophon's 14,438 tok/s — a **48×** / **53×** advantage.\n\nThe table below uses a representative production-server duty cycle, a Power Usage Effectiveness (PUE) of 1.5, and a $0.10/kWh electricity tariff — yielding an effective $0.15/kWh after datacenter cooling and distribution overhead. Numbers are per single die over 3 years (26,280 hours).\n\n| TCO item (3 years, 80B model, single die) | NVIDIA Rubin (R200, HBM4) | AMD Instinct MI455X (HBM4) | Sophon (inference) | Sophon (training) |\n|---|---|---|---|---|\n| Hardware BOM | ~ $82,800 | ~ $96,700 | $8,358 | $8,358 |\n| Idle energy (70% idle, inference) | 4,599 kWh × $0.15 = $690 | 4,599 kWh × $0.15 = $690 | 55 kWh × $0.15 = $8 | — |\n| Active inference energy (30% busy, FP8) | 14,191 kWh × $0.15 = $2,129 | 13,403 kWh × $0.15 = $2,010 | 2,941 kWh × $0.15 = $441 | — |\n| Training duty cycle (50% idle / 50% training) | — | — | — | idle 39 kWh + active 7,411 kWh = $1,118 |\n| 3-year hardware + energy TCO | ~ $85,600 | ~ $99,400 | ~ $8,807 | ~ $9,476 |\n| TCO ratio vs. Rubin / MI455X | — | — | ~ 9.7× / 11.3× lower | ~ 9.0× / 10.5× lower |\n\nSophon's TCO advantage comes from two compounding effects:\n\nTwo numbers decide serving economics: **energy per token** — which sets the electricity\nbill and the thermal envelope — and **cost per token**, the fully-loaded $/token a\ndeployment actually pays. Both follow directly from the figures above. 
Energy per token is the decode\npower divided by the decode throughput,\n\nEtok = Pdecode / Rdecode\n\nand the fully-loaded cost per token amortizes the 3-year TCO (hardware BOM + energy) over every token\nthe die serves at a 30% production duty cycle (t3y = 9.46 × 10⁷ s):\n\nCtok = TCO3y / (0.30 · Rdecode · t3y)\n\n| Per-token economics (80B · FP8 · batch-1 single-stream) | NVIDIA Rubin (R200) | AMD Instinct MI455X | PFG-1 Sophon |\n|---|---|---|---|\n| Decode throughput Rdecode | ~ 300 tok/s | ~ 270 tok/s | 14,438 tok/s |\n| Decode power Pdecode | ~ 1,340 W | ~ 1,210 W | 373 W |\n| Energy per token Etok (= Pdecode / Rdecode) | ≈ 4.48 J | ≈ 4.48 J | 25.8 mJ |\n| Energy cost / 1M tokens (@ $0.15/kWh) | $0.187 | $0.187 | $0.0011 |\n| Tokens served / 3 yr (30% duty) | ≈ 8.5 billion | ≈ 7.7 billion | ≈ 410 billion |\n| 3-year TCO (hardware + energy) | ~ $85,600 | ~ $99,400 | $8,807 |\n| Fully-loaded cost / 1M tokens | ~ $10.1 | ~ $13.0 | ~ $0.021 |\n| Sophon cost-per-token advantage | ~ 468× | ~ 604× | — |\n\nAt batch-1, single-stream (interactive, low-latency) serving, Sophon delivers tokens at\n**~ 2 cents per million** — about **470–600× cheaper** than an HBM4 GPU, at\n**174× lower energy** per token. The cost gap is the product of two compounding effects:\nthe **~ 9.9–11.6× lower hardware BOM** (§9) and the **~ 48–53× higher\nsingle-stream throughput** (§5.A.5). It is largest at low batch, where the GPU re-reads all 80B\nweights from HBM for every token; at high batch the GPU amortizes each weight read across the batch and\nits cost per token falls toward its compute-bound floor — but interactive chat, agentic loops, and\nlong-context decode are precisely the batch-1, memory-bound regime where Sophon governs.\n\nThe 40.8% final die yield (§9, Eq. 11–12) reflects an unmitigated baseline — a raw wafer-level sort with no\narchitectural countermeasures. 
Production deployment applies a three-tier defect mitigation (DM) strategy that\nrecovers gross-defect dies and reduces effective cost per working die by a further\n**20–35%** relative to the unmitigated baseline.\n\nEach 2D-TMD CIM tile is provisioned with **4 spare columns per 256-column bank** (~1.6%\ncolumn-area overhead). Wafer-level Automated Optical Inspection (AOI) identifies defective bitlines; a\none-time electrical fuse (e-fuse) map reroutes those columns to spares before Known-Good-Die (KGD)\nselection. This converts the majority of single-column faults — typically the dominant failure mode in M3D\nvia layers — into repaired working dies.\n\n| Parameter | Value | Basis |\n|---|---|---|\n| Spare columns per bank | 4 / 256 | ~1.6% area overhead |\n| Targeted fault mode | Single-bitline open/short (MIV via defect) | Stapper\n|\n\nDies that fail Tier 1 repair due to clustered multi-column faults are evaluated at the\n**tile granularity** (each die contains 576 tiles, §3.D). A die with ≤ 10% tile failures (~58\ntiles) is re-characterised and deployed at reduced capacity:\n\n| Partial-good grade | Active tiles | Effective capacity | Effective TFLOPS (BF16) | Discount factor |\n|---|---|---|---|---|\n| PFG-1 Full | 576 / 576 | 330 GB | 2,100 | — |\n| PFG-1 Grade-B | 518–575 | 297–329 GB | 1,888–2,098 | 15% BOM discount |\n| PFG-1 Grade-C | 461–517 | 264–296 GB | 1,681–1,884 | 30% BOM discount |\n| Scrap threshold | < 461 tiles | < 264 GB | < 1,681 | Wafer-level scrap |\n\nGrade-B and Grade-C dies are targeted at edge-inference and MoE partial-expert deployments where capacity\nheadroom exceeds strict density requirements. 
Modelling of the negative-binomial defect distribution (α = 3)\nindicates that **~18% of otherwise-scrapped dies qualify for Grade-B or Grade-C** harvest.\n\nAll KGD candidates (full and partial-good) undergo a **24-hour elevated-voltage burn-in** at\nVDD + 10% and Tjunction = 85 °C to screen infant-mortality failures — primarily 2T0C\nretention outliers. Post burn-in, full parametric re-test confirms:\n\nField return data from analogous 28 nm BEOL products place the post-burn-in Annualised Failure Rate (AFR)\nbelow **0.1% per die-year** — consistent with the mission-life assumptions in §6 (Thermal) and\n§9 (TCO).\n\n| Scenario | Effective yield | Effective BOM / working die |\n|---|---|---|\n| Unmitigated baseline (§9) | 40.8% | $8,358 |\n| + Tier 1 column repair | ~50–52% | ~$6,750 |\n| + Tier 2 partial-good harvest | ~58–60% effective | ~$5,870 |\n| + Tier 3 KGD burn-in (AFR reduction) | Identical yield; eliminates infant mortality | Negligible $25 test adder already in BOM |\n\nThe Tier 1 + Tier 2 combined uplift reduces the effective cost per working die by **~29–30%**,\nwidening the BOM advantage over HBM4 systems (NVIDIA Rubin (R200) and AMD Instinct MI455X) from a list-price 9.9×/11.6× (Rubin/MI455X) to a realised ~14× / 16.4×: the ≈ $5,900 effective Sophon BOM\nafter defect harvest, set against the unchanged GPU list price.\n\nNote on M3D-specific defect modes. The dominant yield detractor in the 64-tier 2D-TMD M3D stack is not planar Si lithography (which is mature at 28 nm) but rather Monolithic Inter-tier Via (MIV) open/short defects at the ~90 nm via pitch. Tier 1 column redundancy is specifically architected to absorb MIV-induced single-bitline opens — the most frequent M3D failure signature observed in imec SCALE 2024 demonstration vehicles [[7]]. 
Tier 2 tile harvesting addresses clustered MIV fault regions that escape column repair, which are typically correlated with local TMD grain boundary density gradients from CVD non-uniformity.\n\nBeyond terrestrial datacenters, the Sophon platform is intrinsically suited to orbital and deep-space deployment. Two structural properties — one from the 2T0C cell, one from the 2D-TMD channel itself — give the stack radiation tolerance that bulk-silicon parts can only approximate with shielding, redundancy, or dedicated rad-hard process options.\n\nIn a conventional 1T1C DRAM, the bit lives as charge on a deep-trench or stacked capacitor of tens of\nfemtofarads; the capacitor and its substrate collection volume present a large sensitive cross-section, and\na single ionizing strike that collects enough charge flips the bit [[31]](#ref-31).\nThe 2T0C gain cell eliminates the capacitor entirely: state is held on the ~ 3.0 fF parasitic node (Cgs\nof the read transistor plus the write transistor's junction) confined to a sub-micron footprint at the\nMetal-3 BEOL — far above the silicon substrate. The radiation target area per bit shrinks by orders of\nmagnitude relative to a capacitor cell, and with it the single-event upset (SEU) cross-section of the 330 GB\narray.\n\nThe 2D-TMD channel is grown on amorphous dielectric, not on a bulk semiconductor. This removes the two\ndominant radiation-degradation mechanisms of silicon devices at the root. First, there is no substrate\nbeneath the active channel to accumulate displacement damage: the lattice-disorder-induced leakage paths,\ncharge-funneling collection, and parasitic latch-up structures of bulk CMOS simply do not exist in the upper\ntiers [[32]](#ref-32). Second, displacement damage in the channel itself is bounded\nby geometry: an energetic particle traversing a three-atom-thick sheet can at most knock individual atoms\nout of the monolayer, producing an isolated point defect. 
There is no three-dimensional volume in which a\ncollision cascade can develop, so the surrounding covalently bonded lattice remains crystalline and the\ntransistor continues to operate — in contrast to bulk silicon, where a single primary knock-on atom\ndisplaces thousands of lattice atoms [[33]](#ref-33).\n\nThese mechanisms are not merely theoretical. 2D-material devices have shown negligible performance change\nafter γ-ray, proton, and electron irradiation at space-relevant doses\n[[34]](#ref-34), and a wafer-scale monolayer MoS₂ RF system has operated in low\nEarth orbit for nine months with a bit error rate below 10⁻⁸ — with a predicted lifetime of ~ 271 years even\nin geosynchronous-orbit flux [[35]](#ref-35). Combined with the total-ionizing-dose\nimmunity noted in §2 (no buried-oxide trap vulnerability) and the seconds-scale refresh that bounds any\ntransient corruption window, these properties make the platform a natural fit for satellite inference\npayloads. Formal SEE characterization of the full Sophon stack for LEO/MEO flux environments remains a\nqualification milestone (§11.3).\n\n| Sub-system | Validation |\n|---|---|\n| 2D-TMD nFET/pFET DC | matches Liu Nature 2021 |\n| 2T0C retention (closed-form) | τ = C·V/(2·Ioff); ngspice Level-1 confirms margin |\n| 2T0C read/write energy | ngspice simulation (this work) |\n\nAll numeric assumptions in this paper trace to either a peer-reviewed publication, a vendor datasheet, or a Process Design Kit (PDK) module document. Numbers labelled \"this work\" are derived in the Equations Appendix (§13) from the listed source data.\n\n[1] **Radisavljevic, B., et al.** \"Single-layer MoS₂ transistors.\"\n*Nature Nanotechnology* 6, 147–150 (2011). 
DOI: 10.1038/nnano.2010.279.\n[https://doi.org/10.1038/nnano.2010.279](https://doi.org/10.1038/nnano.2010.279) → Source for\nMoS₂ baseline mobility (~ 200 cm²/V·s), Ion/Ioff > 10⁸.\n\n[2] **Liu, Y., Duan, X., Shin, H.-J., et al.** \"Promises and prospects of two-dimensional\ntransistors.\" *Nature* 591, 43–53 (2021). DOI: 10.1038/s41586-021-03339-z.\n[https://doi.org/10.1038/s41586-021-03339-z](https://doi.org/10.1038/s41586-021-03339-z) → Source\nfor TMD Ioff density ≈ 10⁻¹⁵ A/µm (1 fA/µm) at 28 nm gate length; comparative tables of MoS₂ vs\nSi scaling.\n\n[3] **Lan, H.-Y., et al.** \"Dual-Gate Synthetic MoS₂ MOSFETs with 4.56 µS/µm gm, 320\nµA/µm Id at 1 V Vd.\" *IEDM 2022* Technical Digest, paper 7.3. IEEE.\n[https://ieeexplore.ieee.org/document/10019462](https://ieeexplore.ieee.org/document/10019462) →\nSource for TMD nFET drive current, sub-threshold slope (~ 75 mV/dec), Vdd = 0.6 V operation.\n\n[4] **Sebastian, A., et al.** \"Benchmarking monolayer MoS₂ and WS₂ field-effect transistors.\"\n*Nature Communications* 12, 693 (2021). DOI: 10.1038/s41467-020-20732-w.\n[https://doi.org/10.1038/s41467-020-20732-w](https://doi.org/10.1038/s41467-020-20732-w) →\nWSe₂/WS₂ p-FET hole mobilities (60–120 cm²/V·s); CMOS-pair benchmarking.\n\n[5] **Shulaker, M. M., et al.** \"Three-dimensional integration of nanotechnologies for\ncomputing and data storage on a single chip.\" *Nature* 547, 74–78 (2017). DOI: 10.1038/nature22994.\n[https://doi.org/10.1038/nature22994](https://doi.org/10.1038/nature22994) → M3D nanosheet\nproof-of-concept; demonstrates low-temperature BEOL stacking compatible with this paper's TMD M3D approach.\n\n[6] **Vinet, M., et al. 
(CEA-Leti).** \"Monolithic 3D Integration: A Powerful Alternative to\nClassical 2D Scaling.\" *IEEE S3S Conference* 2014.\n[https://ieeexplore.ieee.org/document/7028181](https://ieeexplore.ieee.org/document/7028181) →\nEstablished M3D thermal budget constraints (≤ 450 °C BEOL ceiling) cited in §2.A.\n\n[7] **imec.** \"SCALE-3D: Scaling roadmap for monolithic 3D integration.\" imec Technology Forum\n2024.\n[https://www.imec-int.com/en/articles/monolithic-3d-integration](https://www.imec-int.com/en/articles/monolithic-3d-integration)\n→ MIV (Monolithic Inter-tier Via) pitch (~ 90 nm) and density (~ 10⁸/mm²) used in §2.A.\n\n[8] **Belmonte, A., et al. (imec).** \"Capacitor-less, Long-Retention (>400 s) DRAM Cell\nPaving the Way Towards Low-Power and High-Density Monolithic 3D DRAM.\" *IEDM 2020*, paper 28.2.\n[https://ieeexplore.ieee.org/document/9372074](https://ieeexplore.ieee.org/document/9372074) →\nImec 2T0C IGZO-channel demonstration; establishes 2T0C feasibility and validates closed-form retention model\nτ = C·V/(2·Ioff) used in §4.1.\n\n[9] **Liu, X., et al.** \"A 2T0C DRAM Based on Amorphous In-Ga-Zn-O Thin Film Transistors with\nRetention Time Larger Than 400 s.\" *IEEE Electron Device Letters* 41(8), 1184–1187 (2020).\n[https://ieeexplore.ieee.org/document/9118898](https://ieeexplore.ieee.org/document/9118898) →\nIndependent confirmation of long-retention 2T0C; basis for TMD adaptation in this paper.\n\n[10] **Wu, F., et al.** \"Vertically Stacked Multilayer Heterostructures for 2T0C DRAM.\"\n*Nature Electronics* 5, 519–526 (2022). DOI: 10.1038/s41928-022-00807-w.\n[https://doi.org/10.1038/s41928-022-00807-w](https://doi.org/10.1038/s41928-022-00807-w) →\n2D-material-based 2T0C with sub-µm² cells; closest published analogue to the Sophon cell.\n\n[11] **Horowitz, M.** \"Computing's energy problem (and what we can do about it).\"\n*ISSCC 2014* Keynote. 
IEEE.\n[https://ieeexplore.ieee.org/document/6757323](https://ieeexplore.ieee.org/document/6757323) →\nSource for the per-operation energy model (FP add ~ 0.4 pJ @ 45 nm, scaling by Vdd²); the TMD MAC\nenergy in §C.1 is computed by scaling this with Vdd² ratio and 0.85× TMD device factor (from\n[[3]](#ref-3)).\n\n[12] **Jouppi, N. P., et al.** \"Ten Lessons From Three Generations Shaped Google's TPUv4i.\"\n*ISCA 2021*.\n[https://ieeexplore.ieee.org/document/9499913](https://ieeexplore.ieee.org/document/9499913) →\nIndustrial benchmark for tile-array CIM energy per MAC and utilization figures (55% sustained, 75% peak).\n\n[13] **Patterson, D., et al.** \"Carbon Emissions and Large Neural Network Training.\"\n*arXiv:2104.10350* (2021).\n[https://arxiv.org/abs/2104.10350](https://arxiv.org/abs/2104.10350) → Source for the \"6 × Nparams\nFLOPs per training token\" estimator and per-token energy framework used in §5.B.3.\n\n[14] **Kaplan, J., et al.** \"Scaling Laws for Neural Language Models.\"\n*arXiv:2001.08361* (2020).\n[https://arxiv.org/abs/2001.08361](https://arxiv.org/abs/2001.08361) → Source for the 2 × Nparams\nFLOPs per inference token estimator used in §5.A.3.\n\n[15] **Hoffmann, J., et al. 
(Chinchilla).** \"Training Compute-Optimal Large Language Models.\"\n*arXiv:2203.15556* (2022).\n[https://arxiv.org/abs/2203.15556](https://arxiv.org/abs/2203.15556) → Source for the 1T–15T\ntraining-token range used in §5.B.3 cluster sizing.\n\n[16] **NVIDIA Corporation.**\n*NVIDIA Rubin (R200) Architecture Technical Brief* (2026).\n[https://www.nvidia.com/en-us/data-center/rubin/](https://www.nvidia.com/en-us/data-center/rubin/)\n→ Source for NVIDIA Rubin (R200) per-GPU specs: ≈ 17,500 TFLOPS dense FP8, ≈ 8,750 TFLOPS dense BF16,\n**288 GB HBM4**, ≈ 22 TB/s memory bandwidth per GPU, ≈ 1,800 W TDP (2,300 W Max-P).\n80B FP8 decode ≈ 300 tok/s (HBM-bound: 22 TB/s ÷ 80 GB), 80B batch-1 training ≈ 880 tok/s;\nenergy/token 4.48 J, tokens/W 0.22, BOM ≈ $82,800, TCO ≈ $85,600.\n\n[16b] **Advanced Micro Devices, Inc.**\n*AMD Instinct MI455X Architecture Technical Brief* (2026).\n[https://www.amd.com/en/products/accelerators/instinct/mi400/mi455x.html](https://www.amd.com/en/products/accelerators/instinct/mi400/mi455x.html)\n→ Source for AMD Instinct MI455X per-GPU specs: ≈ 20,000 TFLOPS dense FP8, ≈ 10,000 TFLOPS dense BF16,\n**432 GB HBM4**, ≈ 19.6 TB/s memory bandwidth per GPU, ≈ 1,700 W TDP.\n80B FP8 decode ≈ 270 tok/s (HBM-bound: 19.6 TB/s ÷ 80 GB), 80B batch-1 training ≈ 785 tok/s;\nenergy/token 4.48 J, tokens/W 0.22, BOM ≈ $96,700, TCO ≈ $99,400.\n\n[17] **NVIDIA Corporation / Advanced Micro Devices, Inc. 
/ Morgan Stanley Research.**\n*Rubin (R200) and Instinct MI455X Platform Specifications* (2026); and\n*Nvidia NVL72 Bill of Materials — GB300 vs VR200* (Morgan Stanley Research estimate, 2025).\n[https://www.nvidia.com/en-us/data-center/rubin/](https://www.nvidia.com/en-us/data-center/rubin/)\n→ Power references: NVIDIA Rubin (R200) ≈ 1,800 W TDP (2,300 W Max-P); AMD Instinct MI455X ≈ 1,700 W TDP.\nPer-accelerator BOM from Morgan Stanley’s VR200 NVL72 estimate (≈ $7.8M / rack ÷ 72 GPUs): GPU silicon\n≈ $55,000 + HBM4 ≈ $27,800 → Rubin BOM ≈ $82,800; MI455X scaled to 432 GB HBM4 ≈ $96,700. (These are\nrack-price allocations including vendor margin; the Sophon BOM is a pre-silicon build cost.)\n\n[18] **JEDEC Solid State Technology Association.** *JESD270-4: HBM4 Standard* (2025).\n[https://www.jedec.org/standards-documents/docs/jesd270-4](https://www.jedec.org/standards-documents/docs/jesd270-4)\n→ HBM4 package bandwidth: Rubin (R200) ≈ 22 TB/s, MI455X ≈ 19.6 TB/s; HBM read energy ≈ 7 pJ/bit,\nκHBM4 = 5.6×10⁻¹¹ J·tok⁻¹·param⁻¹; used as the ~ 191×/214× weight-bandwidth baseline.\n\n[19] **JEDEC.** *Roadmap: HBM4 and HBM5 — preliminary specifications*.\n[https://www.jedec.org/news/pressreleases](https://www.jedec.org/news/pressreleases) → Source for\nHBM4/HBM4e/HBM5/HBM5e roadmap capacity figures used in §7.\n\n[20] **Pop, E.** \"Energy Dissipation and Transport in Nanoscale Devices.\"\n*Nano Research* 3, 147–169 (2010). DOI: 10.1007/s12274-010-1019-z.\n[https://doi.org/10.1007/s12274-010-1019-z](https://doi.org/10.1007/s12274-010-1019-z) → Source\nfor BEOL effective thermal conductivity baseline (kBEOL ≈ 2.0 W/m·K).\n\n[21] **Mahajan, R., et al. 
(Intel).** \"Cooling a Microprocessor Chip.\"\n*Proceedings of the IEEE* 94(8), 1476–1486 (2006).\n[https://ieeexplore.ieee.org/document/1683998](https://ieeexplore.ieee.org/document/1683998) →\nSource for liquid cold-plate package thermal resistance (Rpkg ≈ 0.05 K/W).\n\n[22] **Bar-Cohen, A., et al.** \"Embedded Cooling for Wide Bandgap Power Amplifiers.\"\n*IEEE Trans. Components, Packaging and Manufacturing Tech.* 5(9), 1226–1239 (2015).\n[https://ieeexplore.ieee.org/document/7173025](https://ieeexplore.ieee.org/document/7173025) →\nSource for microfluidic Rpkg ≈ 0.02 K/W; two-phase immersion ≈ 0.01 K/W envelope.\n\n[23] **Cunningham, J. A.** \"The Use and Evaluation of Yield Models in Integrated Circuit\nManufacturing.\" *IEEE Trans. Semiconductor Manufacturing* 3(2), 60–71 (1990).\n[https://ieeexplore.ieee.org/document/55438](https://ieeexplore.ieee.org/document/55438) →\nNegative-binomial yield model with clustering parameter α = 3 (51.2% for A·D₀ = 0.75).\n\n[24] **Stapper, C. H.** \"Modeling of Defects in Integrated Circuit Photolithographic Patterns.\"\n*IBM Journal of R&D* 28(4), 461–475 (1984).\n[https://ieeexplore.ieee.org/document/5390244](https://ieeexplore.ieee.org/document/5390244) →\nMurphy yield model used as cross-check (49.5% for A·D₀ = 0.75) in the audit calculations; this more\nconservative value is the base yield used in §9.\n\n[25] **TechInsights.** *28 nm Foundry Wafer Cost Analysis, 2025–2026 Update*.\nTechInsights subscription report; public summary:\n[https://www.techinsights.com/wafer-cost-analysis](https://www.techinsights.com/wafer-cost-analysis)\n→ Source for the $3,500 28 nm 12-inch wafer cost.\n\n[26] **U.S. 
Energy Information Administration.**\n*Average Industrial Electricity Price, 2025*.\n[https://www.eia.gov/electricity/monthly/](https://www.eia.gov/electricity/monthly/) → Source for\nthe $0.10/kWh industrial tariff baseline used in TCO (§9).\n\n[27] **Uptime Institute.** *Global Data Center Survey 2024 — PUE Trends*.\n[https://uptimeinstitute.com/resources/research/global-data-center-survey-2024](https://uptimeinstitute.com/resources/research/global-data-center-survey-2024)\n→ Source for the PUE = 1.5 assumption (industry median for liquid-cooled facilities).\n\n[28] **PhantaField Inc.**\n*2T0C 2D-TMD Cell Characterization, Pre-Production Lot, May 2026*. (Internal report.) → Source for the\n30 fJ/bit read and 20 fJ/bit write energies in §A.1.\n\n[29] **Leviathan, Y., Kalman, M., Matias, Y.** \"Fast Inference from Transformers via\nSpeculative Decoding.\" *ICML 2023*.\n[https://arxiv.org/abs/2211.17192](https://arxiv.org/abs/2211.17192) → Source for the\nspeculative-decoding speedup model, k = 4 draft length, 70% token-acceptance rate baseline used in §5.A.6\nand Eq. 17.\n\n[30] **Lin, J., et al.** \"AWQ: Activation-aware Weight Quantization for LLM Compression and\nAcceleration.\" *MLSys 2024*.\n[https://arxiv.org/abs/2306.00978](https://arxiv.org/abs/2306.00978)\n→ Source for INT4 weight-only quantization quality bounds (≤ 1–2 perplexity points vs FP8 on 70B-class\ninstruction-tuned models) used in §5.A.6 and Eq. 17.\n\n[31] **Baumann, R. C.** \"Radiation-induced soft errors in advanced semiconductor technologies.\"\n*IEEE Transactions on Device and Materials Reliability* 5(3), 305–316 (2005). DOI:\n10.1109/TDMR.2005.853449.\n[https://doi.org/10.1109/TDMR.2005.853449](https://doi.org/10.1109/TDMR.2005.853449) → Source for\nthe single-event-upset mechanism: charge collection onto storage nodes, and the dependence of SEU\ncross-section on sensitive-node volume, used in §10.1.\n\n[32]\n**Schwank, J. R., Ferlet-Cavrois, V., Shaneyfelt, M. R., Paillet, P., Dodd, P. 
E.** \"Radiation\neffects in SOI technologies.\" *IEEE Transactions on Nuclear Science* 50(3), 522–538 (2003). DOI:\n10.1109/TNS.2003.812930.\n[https://ieeexplore.ieee.org/document/1208574](https://ieeexplore.ieee.org/document/1208574) →\nSource for dielectric isolation effects: reduced charge-collection volume, elimination of substrate\nfunneling, and latch-up immunity of devices isolated from the bulk substrate, used in §10.2.\n\n[33]\n**Komsa, H.-P., Kotakoski, J., Kurasch, S., Lehtinen, O., Kaiser, U., Krasheninnikov, A. V.**\n\"Two-dimensional transition metal dichalcogenides under electron irradiation: defect production and doping.\"\n*Physical Review Letters* 109, 035503 (2012). DOI: 10.1103/PhysRevLett.109.035503.\n[https://doi.org/10.1103/PhysRevLett.109.035503](https://doi.org/10.1103/PhysRevLett.109.035503)\n→ Source for displacement-threshold energies in TMD monolayers and the isolated-point-vacancy character of\nirradiation damage in atomically thin sheets, used in §10.2.\n\n[34] **Vogl, T., Sripathy, K., Sharma, A., et al.** \"Radiation tolerance of two-dimensional\nmaterial-based devices for space applications.\" *Nature Communications* 10, 1202 (2019). DOI:\n10.1038/s41467-019-09219-5.\n[https://doi.org/10.1038/s41467-019-09219-5](https://doi.org/10.1038/s41467-019-09219-5) →\nDemonstrates negligible performance change in 2D-material devices after γ-ray, proton, and electron\nirradiation at space-relevant doses, used in §10.\n\n[35] **Zhu, L., et al.** \"Radiation-tolerant atomic-layer-scale RF system for spaceborne\ncommunication.\" *Nature* 650, 346–352 (2026). 
DOI: 10.1038/s41586-025-10027-9.\n[https://www.nature.com/articles/s41586-025-10027-9](https://www.nature.com/articles/s41586-025-10027-9)\n→ On-orbit demonstration: a wafer-scale monolayer MoS₂ RF transmit/receive system operated at ~ 517 km LEO\nfor 9 months with bit error rate < 10⁻⁸, with a predicted ~ 271-year lifetime in GEO flux, used in §10.\n\nEvery numeric result in this paper is derived from the equations below. Source citations refer to §12.\n\nwhere F² is the cell footprint in lithographic squares (8 for the 2T0C DRAM cell), Fnm is the\nhalf-pitch in nm (28 nm baseline), p is the periphery overhead fraction (0.45 for DRAM), and b is bits per\ncell (1 for 2T0C). The 10¹² factor converts nm² to mm².\n\n**Worked example — Sophon 2T0C DRAM:** D = 10¹² / (8 × 28² × 1.45) × 1 = 110.0 Mb/mm². Source\nfor cell: [[10]](#ref-10) (analogous 2D-material 2T0C); validated by\n[[8]](#ref-8)[[9]](#ref-9).\n\nwhere Amem-tier is the full footprint of one memory tier (750 mm²) and Nmem-tiers =\n32. The 64-tier stack interleaves dedicated logic and memory tiers (32 of each); only the 32 memory tiers\ncontribute to capacity.\n\n**Sophon:** C = (110.0 × 750 × 32) / 8000 = **330.2 GB** (rounded to 330 GB).\n\nThe factor of 2 reflects the sense margin: data is reliably recovered while the stored voltage remains above\nVdd/2. Source: [[8]](#ref-8) (closed-form derivation);\n[[9]](#ref-9) (empirical confirmation).\n\n**Worked example:** Cnode = 3.0 fF (sum of Cgs,RT ≈ 2.5 fF + Cj,WT\n≈ 0.5 fF), Vdd = 0.6 V. 
The off-current is specified as a\n**width-normalized density** Joff = 10⁻¹⁵ A/µm = **1 fA/µm** for the\n2D-TMD nFET [[2]](#ref-2)[[3]](#ref-3); with a\nRead-Transistor channel width WRT = 0.5 µm the absolute leakage is Ioff = Joff\n· WRT = **0.5 fA** (5 × 10⁻¹⁶ A) at 25 °C: τ = (3.0 × 10⁻¹⁵ × 0.6) / (2 × 5 × 10⁻¹⁶)\n= **1.8 s** at 25 °C.\n\nThis is ≈ 4,800× longer than a 1T1C DRAM cell and reflects the exceptional sub-threshold off-state of the\natomically-thin TMD channel (Ion/Ioff > 10⁸, sub-threshold slope ≈ 75 mV/dec).\nRetention halves roughly every 10 °C of junction-temperature rise (Arrhenius): τ ≈ 159 ms at 60 °C and ≈ 28 ms\nat 85 °C.\n\nwith Cbits = capacity in bits, frefresh = 1 / Trefresh.\n\n**Sophon:** at 25 °C the retention τ = 1.8 s (Eq. 3) permits a relaxed refresh interval of\n**Trefresh = 1.0 s** (1.8× margin). P = (330 × 8 × 10⁹ bits) × (1 / 1.0 s) × (30 ×\n10⁻¹⁵ J/bit) = **≈ 0.08 W**.\n\nTotal energy per MAC operation is the sum of memory access and compute.\n**Sophon uses pure digital CIM** (binary sense amplifier + adder tree per column-group).\n\n**Sophon BF16 forward MAC:** E = (30 fJ/bit × 16 bits) + Eadder-tree,BF16 = 0.480 pJ\n+ 0.140 pJ = **0.620 pJ/MAC**.\n\n**Sophon BF16 backward MAC:** add gradient write Ewrite = 20 fJ/bit × 16 bits =\n0.320 pJ → **0.940 pJ/MAC total per weight per training step**.\n\n**Sophon FP8 inference MAC:** E = (30 fJ/bit × 8 bits) + Eadder-tree,FP8 = 0.240 pJ\n+ 0.070 pJ = **0.310 pJ/MAC**.\n\nEadder-tree,FP8 is computed from per-bit binary adder energy in 28 nm CMOS at 0.6 V\n[[11]](#ref-11) scaled to 2D-TMD: 8 fJ × 8 levels × 0.85 ≈ 0.054 pJ; with sign-bit\nand mantissa pipeline overhead the effective figure is 0.070 pJ/MAC. The BF16 adder-tree figure (0.140 pJ)\nis twice the FP8 figure because the bit-serial activation broadcast runs for 16 cycles instead of 8. 
The\nfully digital adder tree is the primary energy improvement of the digital-CIM architecture.\n\nwhere Rop is the peak operation rate (FLOPS), u is the utilization fraction, Eper op\nis per-FLOP energy (half of per-MAC energy, since 1 MAC = 2 FLOPs).\n\n**Sophon FP8 decode (55% util.):** P = 4,200 × 10¹² × 0.55 × (0.310 / 2) × 10⁻¹² + 15 W (NoC +\nstatic) = **≈ 373 W** (matches §C.3 table: DRAM read 277 W + digital MAC 81 W + NoC 13 W +\nstatic 2 W; the read is the full 0.240 pJ/MAC at the FP8 MAC rate, not halved).\n\n**Sophon BF16 forward (55% util.):** P = 2,100 × 10¹² × 0.55 × (0.620 / 2) × 10⁻¹² + ~1 W\nrefresh + 20 W NoC + 2 W static = **≈ 379 W** (refresh is negligible at ≈ 0.08 W thanks to the\n1 fA/µm off-current).\n\n**Sophon backward (55% util.):** + gradient write power 2,100 × 10¹² × 0.55 × (0.320 / 2) ×\n10⁻¹² = + 185 W extra at FLOP rate, or **370 W at MAC rate**. The §C.3 table uses 370 W →\n**≈ 749 W total**.\n\nUtilization 55% is from TPUv4i sustained workload data [[12]](#ref-12); peak 100%\nused for thermal worst-case.\n\nFrom Kaplan et al. [[14]](#ref-14):\n\n**Sophon 80B FP8 decode:** tokens/s = (4,200 × 10¹² × 0.55) / (2 × 80 × 10⁹) =\n**14,438 tokens/s**. **Sophon 80B BF16 decode:** tokens/s = (2,100 × 10¹² × 0.55)\n/ (2 × 80 × 10⁹) = **7,219 tokens/s**.\n\nFrom Patterson et al. [[13]](#ref-13):\n\nThe factor 6 (vs. 2 for inference) accounts for forward (2N) + backward (4N) compute.\n\n**Sophon 80B BF16:** tokens/s = (2,100 × 10¹² × 0.55) / (6 × 80 × 10⁹) =\n**2,406 tokens/s/die**.\n\n**Examples:**\n\n**Sophon FP8 decode:** E = 373 W / 14,438 tokens/s = **25.8 mJ/token**.\n**Sophon BF16 decode:** E = 379 W / 7,219 tokens/s = **52.5 mJ/token**.\n**Sophon training (time-avg fwd + bwd):** E = 564 W / 2,406 tokens/s =\n**0.234 J/token**.\n\nSource: Cunningham [[23]](#ref-23). 
A = 7.5 cm² die area, D₀ = 0.1 defects/cm²\n(mature 28 nm), α = 3 (typical clustering).\n\nY = (1 + 0.75/3)⁻³ = **0.512** → 51.2% (negative-binomial).\n\nCross-check with Murphy/Stapper [[24]](#ref-24): Y = ((1 − exp(−AD₀)) / AD₀)² =\n0.495 → 49.5%. The more conservative Murphy/Stapper value is used as the base wafer yield.\n\nWith Ytier = 0.997 (3σ M3D process control achievable per imec\n[[7]](#ref-7)): Ystack = 0.997⁶⁴ = **0.825**.\n\nCombined yield (base × stack): 0.495 × 0.825 = **0.408 → 40.8%** final die yield used in the\nBOM calculation (§9).\n\n**Sophon:** BOM = ($51 + 64 × $52) / 0.408 + $60 + $0 + $25 = $8,273 + $85 =\n**$8,358**.\n\nWafer cost from [[25]](#ref-25); tier adder estimated from per-tier mask +\nprocessing economics in [[7]](#ref-7)[[5]](#ref-5).\n\nwith 26,280 hours = 3 years × 8,760 h/year, PUE = 1.5 [[27]](#ref-27), ckWh\n= $0.10/kWh [[26]](#ref-26). Pavg\nis the duty-weighted average power.\n\n**Sophon inference (30% busy FP8 decode, 70% idle):** Pavg = 0.30 × 373 + 0.70 × 3 =\n114.0 W → energy 2,996 kWh × $0.15/kWh (the $0.10 tariff × PUE 1.5) = $449 → TCO = $8,358 + $449 = **$8,807**.\n\n**NVIDIA Rubin (R200) same duty:** Pavg = 0.30 × 1,800 + 0.70 × 250 = 715 W → energy\n18,790 kWh × $0.15 = $2,819 → TCO = $82,800 + $2,819 = **$85,619**.\n\n**AMD Instinct MI455X same duty:** Pavg = 0.30 × 1,700 + 0.70 × 250 = 685 W → energy\n18,002 kWh × $0.15 = $2,700 → TCO = $96,700 + $2,700 = **$99,400**.\n\n**Sophon training (50% busy training, 50% idle):** Pavg = 0.50 × 564 + 0.50 × 3 =\n283.5 W → energy 7,450 kWh × $0.15 = $1,118 → TCO = $8,358 + $1,118 = **$9,476**.\n\nParallel-conduction model with Cu fill fraction φCu = 0.06 (Monolithic Inter-tier Via density ×\nvia cross-section / total area), kCu = 380 W/m·K, kBEOL = 2.0 W/m·K\n[[20]](#ref-20):\n\nkeff = 0.06 × 380 + 0.94 × 2.0 = **24.7 W/m·K**.\n\nRstack = (Ntiers × ttier) / (keff × Adie) is the M3D\nstack resistance; Rpkg is the package-to-coolant resistance from\n[[21]](#ref-21)[[22]](#ref-22).\n\n**Sophon FP8 decode 
(373 W, liquid Rpkg = 0.05 K/W):** R\n\n**Sophon training avg. (564 W):** Tj = 25 + 564 × 0.0512 =\n**53.9 °C** (well below Tjmax = 105 °C).\n\nThe raw dense FP8 baseline of Eq. 7 can be multiplied by three orthogonal workload-level accelerators on a\nsingle Sophon die. Let s be the speculative-decoding multiplier, q be the quantization multiplier, and Nactive\n/ Ntotal be the MoE sparsity ratio. The effective decode throughput becomes:\n\nwith assumed multiplier values supported by published technique benchmarks:\n\n**Worked example — Sophon 80B dense, INT4 + speculative (FP8 mode):** with s = 2.5 and q = 2.0, tokens/s = (4,200 × 10¹²\n× 0.55 × 2.5 × 2.0) / (2 × 80 × 10⁹) = **72,188 tokens/s/die** ≈ 5× the raw FP8 baseline.\n\n**Worked example — Sophon DeepSeek-V3 MoE (671 B total / 37 B active), FP8 dense weights:**\ntokens/s = (4,200 × 10¹² × 0.55) / (2 × 37 × 10⁹) = **31,216 tokens/s/die** ≈ 18× the\nequivalent 671 B dense decode rate.\n\nNote that the three multipliers do not compose multiplicatively in every regime: speculative decoding's effective speedup depends on the small-model draft accuracy (which itself depends on the deployment domain), and the q = 2 INT4 multiplier and the MoE sparsity multiplier compose only when the model architecture supports both jointly. 
The benchmark table in §5.A.6 enumerates the realistic combinations.", "url": "https://wpnews.pro/news/sophon-pfg-1-a-monolithic-3d-ai-asic-with-330-gb-of-on-die-dram-and-no-hbm", "canonical_source": "https://www.phantafield.com/whitepaper", "published_at": "2026-06-29 01:23:38+00:00", "updated_at": "2026-06-29 01:28:27.786795+00:00", "lang": "en", "topics": ["ai-chips", "ai-infrastructure", "ai-products", "ai-research"], "entities": ["Sophon PFG-1", "NVIDIA Rubin", "AMD Instinct MI455X", "Morgan Stanley", "HBM4"], "alternates": {"html": "https://wpnews.pro/news/sophon-pfg-1-a-monolithic-3d-ai-asic-with-330-gb-of-on-die-dram-and-no-hbm", "markdown": "https://wpnews.pro/news/sophon-pfg-1-a-monolithic-3d-ai-asic-with-330-gb-of-on-die-dram-and-no-hbm.md", "text": "https://wpnews.pro/news/sophon-pfg-1-a-monolithic-3d-ai-asic-with-330-gb-of-on-die-dram-and-no-hbm.txt", "jsonld": "https://wpnews.pro/news/sophon-pfg-1-a-monolithic-3d-ai-asic-with-330-gb-of-on-die-dram-and-no-hbm.jsonld"}}