CUDA-like programming of Cerebras WSE


Source: github.com · Published: 2026-06-15T07:32Z · Topic: ai-chips


A high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE) has been developed, modeling 720,000 processing elements on a 2D mesh to enable performance analysis and software development. The simulator uses a CUDA-like programming model and a hybrid performance/functional execution track to balance accuracy and speed, supporting a full toolchain from a Python DSL to a custom 32-bit ISA binary.


Cerebras-Sim is a high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE). It models the hardware architecture, interconnects, and execution environment to enable performance analysis and software development for massively parallel 2D mesh architectures.

The simulator models the CS3 WSE as an array of 720,000 processing elements (PEs) and provides a full-stack simulation environment, from a high-level Python DSL down to a custom 32-bit ISA binary.

- Performance Analysis: Estimate total runtime and identify bottlenecks using a hybrid performance/functional model.
- Software Development: Verify kernel correctness via a CUDA-like programming model before deploying to hardware.
- Architectural Exploration: Model the impact of mesh bisection bandwidth, latency, and SRAM constraints.

- Processing Element (PE): Each core implements an 8-wide SIMD unit, vector registers, and a private 48KB local SRAM.
- Interconnect: A 2D mesh (800x900) where communication occurs via SEND/RECV primitives and global address space abstractions.
- Memory Hierarchy:
  - Local SRAM: Private high-speed memory per PE (analogous to CUDA Shared Memory).
  - Weight Server: External DRAM accessed via a global address space for large-scale model weights and data.
- Host-Device Interface: A driver model implementing a command queue (CS3Queue) and memory movement (cs3_memcpy).
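The figures above imply some simple capacity arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, "48KB" is assumed to mean 48 × 1024 bytes, and both the choice of 800 as the x extent and the row-major PE numbering are assumptions, since the source specifies neither.

```python
from typing import Tuple

# Hypothetical names; none of these come from the simulator's code.
MESH_WIDTH = 800                 # assumption: 800 is the x extent
MESH_HEIGHT = 900                # assumption: 900 is the y extent
SRAM_BYTES_PER_PE = 48 * 1024    # assumption: "48KB" means 48 KiB


def pe_count() -> int:
    """Number of processing elements on the modeled mesh."""
    return MESH_WIDTH * MESH_HEIGHT


def total_sram_bytes() -> int:
    """Aggregate local SRAM across all PEs."""
    return pe_count() * SRAM_BYTES_PER_PE


def pe_coords(index: int) -> Tuple[int, int]:
    """Map a linear PE index to (x, y), assuming row-major numbering."""
    if not 0 <= index < pe_count():
        raise IndexError(f"PE index {index} is outside the mesh")
    return index % MESH_WIDTH, index // MESH_WIDTH
```

Under these assumptions the mesh holds 720,000 PEs, matching the count quoted in the article, and roughly 35.4 GB of aggregate local SRAM.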

The simulator employs a Bulk Synchronous Parallel (BSP) model, dividing execution into discrete "supersteps":

- Compute: PEs perform local SIMD operations.
- Communicate: PEs exchange data across the mesh or with the Weight Server.
- Synchronize: A global barrier (SYNC) aligns the execution state.
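A minimal sketch of how a superstep's duration could be accounted for follows. It assumes the three phases do not overlap and that each phase ends only when the slowest PE finishes, which is a common BSP cost model but is not confirmed by the source. The function names and the fixed barrier cost are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def superstep_cycles(compute: Sequence[int], comm: Sequence[int],
                     barrier_cycles: int) -> int:
    """Cycles for one superstep, assuming non-overlapping phases.

    compute[i] and comm[i] are the cycles PE i spends in each phase.
    """
    if len(compute) != len(comm):
        raise ValueError("compute and comm must cover the same PEs")
    # The global barrier makes every PE wait for the slowest one per phase.
    return max(compute) + max(comm) + barrier_cycles


def program_cycles(supersteps: Iterable[Tuple[Sequence[int], Sequence[int]]],
                   barrier_cycles: int) -> int:
    """Total runtime as the sum of superstep durations."""
    return sum(superstep_cycles(c, m, barrier_cycles) for c, m in supersteps)
```

For example, three PEs with compute costs of 10, 30, and 20 cycles, communication costs of 5, 2, and 8 cycles, and a 4-cycle barrier give a 42-cycle superstep under this model.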

To balance accuracy and speed, the simulator uses a hybrid execution track:

- Performance Track (Global): All PEs are tracked for cycle counts and timing.
- Functional Track (Sampled): A stochastic sampling strategy is used where only a subset of blocks is fully simulated functionally to verify correctness.
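The two tracks can be pictured with the sketch below. It is an illustration under stated assumptions, not the project's scheduler: the function names, the uniform random sampling, the sampling fraction, and the seed handling are all hypothetical.

```python
import random
from typing import Callable, Dict, Set, Tuple


def choose_sampled_blocks(num_blocks: int, fraction: float, seed: int) -> Set[int]:
    """Pick a reproducible random subset of block indices."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    k = min(num_blocks, max(1, round(num_blocks * fraction)))
    return set(rng.sample(range(num_blocks), k))


def run_hybrid(num_blocks: int,
               cycles_for_block: Callable[[int], int],
               run_functional: Callable[[int], object],
               fraction: float = 0.1,
               seed: int = 0) -> Tuple[Dict[int, int], Dict[int, object]]:
    """Track timing for every block, but execute only the sampled ones."""
    sampled = choose_sampled_blocks(num_blocks, fraction, seed)
    cycles: Dict[int, int] = {}
    results: Dict[int, object] = {}
    for block in range(num_blocks):
        cycles[block] = cycles_for_block(block)      # performance track: all blocks
        if block in sampled:
            results[block] = run_functional(block)   # functional track: sample only
    return cycles, results
```

The trade-off is that functional bugs in unsampled blocks go unnoticed in a given run, so changing the seed between runs widens coverage over time.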

The project implements a complete toolchain: Python DSL → Tungsten-IR → ISA Binary → Simulator

Kernels are written in a CUDA-like Python DSL. For example, a simple SAXPY (z = 2.0 * x + y) kernel:

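@cs3_kernel(block_w=16, block_h=16)
def saxpy_kernel(ctx):
    # Load the two operands from the global address space.
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)

    # SAXPY with a fixed scale factor of 2.0.
    z = 2.0 * x + y

    # Store the result back to the global address space.
    ctx.store_global(None, 8, z)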
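To make the kernel's data flow concrete, the following sketch runs the same body against a stand-in context. MockContext is hypothetical and is not the project's API. It assumes that the second argument to load_global and store_global is a byte offset, that values are little-endian 32-bit floats, and that each call moves a single scalar. The article confirms none of these.

```python
import struct


class MockContext:
    """Hypothetical stand-in for the kernel context (assumed semantics)."""

    def __init__(self, size: int = 64) -> None:
        self.memory = bytearray(size)  # flat byte-addressed "global" memory

    def load_global(self, base, offset: int) -> float:
        # `base` is ignored here, mirroring the None passed by the kernel.
        return struct.unpack_from("<f", self.memory, offset)[0]

    def store_global(self, base, offset: int, value: float) -> None:
        struct.pack_into("<f", self.memory, offset, value)


def saxpy_body(ctx) -> None:
    """Same statements as the kernel above, without the decorator."""
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)
    z = 2.0 * x + y
    ctx.store_global(None, 8, z)


ctx = MockContext()
struct.pack_into("<f", ctx.memory, 0, 1.5)    # x
struct.pack_into("<f", ctx.memory, 4, 0.25)   # y
saxpy_body(ctx)
result = struct.unpack_from("<f", ctx.memory, 8)[0]  # 2.0 * 1.5 + 0.25
```

The inputs are chosen to be exactly representable in 32-bit floating point, so the stored result is exactly 3.25.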

- Frontend: A CUDA-like DSL embedded in Python using @cs3_kernel decorators.
- Intermediate Representation (Tungsten-IR): A dataflow-centric IR mapping compute nodes and synchronization points.
- Compiler Backend:
  - Mapping & Scheduling: Assigns IR nodes to the physical 2D mesh and manages the SRAM budget.
  - Assembler: Emits the final 32-bit binary stream.
- Simulator Engine: A Python-based engine that decodes the ISA and drives the hardware model.
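The article does not describe the instruction format, so the sketch below only illustrates how an assembler and decoder for a fixed 32-bit word can round-trip. The field layout (8-bit opcode, then three 8-bit register fields), the opcode numbers, and the little-endian byte order are all invented for illustration.

```python
import struct
from typing import Iterable, Tuple

# Made-up opcode numbers; the real ISA encoding is not given in the article.
OPCODES = {"ADD": 0x01, "MUL": 0x02, "SYNC": 0x10}
MNEMONICS = {value: name for name, value in OPCODES.items()}


def encode(op: str, dst: int = 0, src_a: int = 0, src_b: int = 0) -> int:
    """Pack one instruction into a 32-bit word: [op | dst | src_a | src_b]."""
    for field in (dst, src_a, src_b):
        if not 0 <= field <= 0xFF:
            raise ValueError("register fields must fit in 8 bits")
    return (OPCODES[op] << 24) | (dst << 16) | (src_a << 8) | src_b


def decode(word: int) -> Tuple[str, int, int, int]:
    """Inverse of encode for the same hypothetical layout."""
    return (MNEMONICS[(word >> 24) & 0xFF],
            (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)


def assemble(program: Iterable[tuple]) -> bytes:
    """Emit a binary stream of little-endian 32-bit words."""
    return b"".join(struct.pack("<I", encode(*ins)) for ins in program)
```

A fixed-width word keeps the decoder to a few shifts and masks, which matters when the decode loop itself runs in Python.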

Instead of exhaustive packet-level simulation, the system uses a latency-and-bandwidth-aware abstract model:

- Latency: Calculated based on physical Manhattan distance:

  $$\text{Latency}_{\text{op}} = \text{Base Latency} + (\text{Manhattan Distance} \times \text{Hop Latency})$$

- Bandwidth & Congestion: The simulator enforces a Bisection Bandwidth Constraint. If total bytes transferred per superstep exceed network capacity, a congestion multiplier is applied to "stretch" the superstep duration.
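The latency formula translates directly into code. The congestion multiplier is less certain: the article says only that one is applied when traffic exceeds capacity, so the linear ratio used below is an assumption, as are the function names and the example parameter values.

```python
from typing import Tuple

Coord = Tuple[int, int]


def manhattan_distance(src: Coord, dst: Coord) -> int:
    """Hop count between two PEs on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def op_latency(src: Coord, dst: Coord, base_latency: int, hop_latency: int) -> int:
    """Latency_op = Base Latency + (Manhattan Distance x Hop Latency)."""
    return base_latency + manhattan_distance(src, dst) * hop_latency


def congestion_multiplier(bytes_transferred: int, capacity_bytes: int) -> float:
    """Assumed linear stretch: 1.0 under capacity, demand/capacity above it."""
    return max(1.0, bytes_transferred / capacity_bytes)


def stretched_superstep(nominal_cycles: int, bytes_transferred: int,
                        capacity_bytes: int) -> float:
    """Superstep duration after applying the congestion multiplier."""
    return nominal_cycles * congestion_multiplier(bytes_transferred, capacity_bytes)
```

With a base latency of 10 cycles and 2 cycles per hop, a transfer from (0, 0) to (3, 4) covers 7 hops and costs 24 cycles. A superstep that moves three times the bisection capacity would take three times its nominal duration under the linear assumption.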

Component	Status	Details
ISA Decoder	✅ Complete	Full implementation of Compute, Mesh, Control, System, DSD, and Global memory opcodes.
Hardware Model	✅ Functional	Core logic, SRAM, 2D Mesh, and Host-Device IO are implemented.
Compiler	✅ Functional	AST parsing, IR generation, register allocation, and assembly are operational.
Simulation Engine	✅ Functional	Hybrid Performance/Functional tracks and BSP scheduling are implemented.
Advanced Mapping	🚧 In Progress	Optimizing spatial mapping for complex kernels.
Weight Server	🚧 In Progress	Integration with external weight servers for real-world model weights.

The project uses a Dual-Execution strategy to verify the compiler:

- Python Path: Executes the kernel as a Python function via KernelContext (Golden Reference).
- Binary Path: Compiles the kernel → executes the resulting binary on the BSPScheduler.
- Comparison: Bit-exact comparison of final memory states.
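The comparison step can be sketched as a byte-for-byte check of two memory images. Everything here is a stand-in: python_path and binary_path are hypothetical functions that imitate the two execution paths for the SAXPY example, and the tiny tuple-based program is not the project's ISA.

```python
import struct
from typing import Optional


def python_path(memory: bytes) -> bytearray:
    """Golden reference: compute SAXPY directly on a copy of memory."""
    out = bytearray(memory)
    x, y = struct.unpack_from("<ff", out, 0)
    struct.pack_into("<f", out, 8, 2.0 * x + y)
    return out


def binary_path(memory: bytes) -> bytearray:
    """Stand-in for compile-and-run: interpret a hand-written op list."""
    out = bytearray(memory)
    regs = {}
    program = [("load", 0, 0), ("load", 1, 4), ("saxpy", 2, 0, 1), ("store", 2, 8)]
    for ins in program:
        if ins[0] == "load":
            regs[ins[1]] = struct.unpack_from("<f", out, ins[2])[0]
        elif ins[0] == "saxpy":
            regs[ins[1]] = 2.0 * regs[ins[2]] + regs[ins[3]]
        elif ins[0] == "store":
            struct.pack_into("<f", out, ins[2], regs[ins[1]])
    return out


def first_mismatch(a: bytes, b: bytes) -> Optional[int]:
    """Return the first differing byte offset, or None if bit-exact."""
    if len(a) != len(b):
        return min(len(a), len(b))
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    return None


initial = bytearray(12)
struct.pack_into("<ff", initial, 0, 1.5, 0.25)
mismatch = first_mismatch(python_path(initial), binary_path(initial))
```

Reporting the first differing offset, rather than a plain pass or fail, makes it easier to trace a divergence back to the store that produced it.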

To run integration tests:

python3 -m unittest discover tests/integration

source & further reading


Read original on github.com → github.com/greg1232/cerebras-py-sim
