CUDA-like programming of Cerebras WSE

A high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE) has been developed, modeling 720,000 processing elements on a 2D mesh to enable performance analysis and software development. The simulator uses a CUDA-like programming model and a hybrid performance/functional execution track to balance accuracy and speed, supporting a full toolchain from a Python DSL to a custom 32-bit ISA binary.

Cerebras-Sim is a high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE). This project models the hardware architecture, interconnects, and execution environment to enable performance analysis and software development for massively parallel 2D mesh architectures.

Cerebras-Sim models the CS3 WSE as a massive array of 720,000 processing elements (PEs). It provides a full-stack simulation environment, from a high-level Python DSL down to a custom 32-bit ISA binary. Its main uses are:

- **Performance Analysis**: Estimate total runtime and identify bottlenecks using a hybrid performance/functional model.
- **Software Development**: Verify kernel correctness via a CUDA-like programming model before deploying to hardware.
- **Architectural Exploration**: Model the impact of mesh bisection bandwidth, latency, and SRAM constraints.

The hardware model covers the following components:

- **Processing Element (PE)**: Each core implements an 8-wide SIMD unit, vector registers, and a private 48 KB local SRAM.
- **Interconnect**: A 2D mesh (800x900) where communication occurs via `SEND`/`RECV` primitives and global address space abstractions.
- **Memory Hierarchy**:
  - **Local SRAM**: Private high-speed memory per PE (analogous to CUDA shared memory).
  - **Weight Server**: External DRAM accessed via a global address space for large-scale model weights and data.
- **Host-Device Interface**: A driver model implementing a command queue (`CS3Queue`) and memory movement (`cs3_memcpy`).

The simulator employs a Bulk Synchronous Parallel (BSP) model, dividing execution into discrete "supersteps":

1. **Compute**: PEs perform local SIMD operations.
2. **Communicate**: PEs exchange data across the mesh or with the Weight Server.
3. **Synchronize**: A global barrier (`SYNC`) aligns the execution state.

To balance accuracy and speed, the simulator uses a hybrid execution track:

- **Performance Track (Global)**: All PEs are tracked for cycle counts and timing.
- **Functional Track (Sampled)**: A stochastic sampling strategy fully simulates only a subset of blocks functionally to verify correctness.
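The superstep structure and the two execution tracks can be illustrated with a minimal sketch. This is not the project's implementation: the names `Superstep`, `superstep_duration`, `sample_blocks`, and `run` are hypothetical, and the sketch assumes that each block's compute and communicate phases run back to back and that the global barrier makes every superstep last as long as its slowest block.

```python
import random
from dataclasses import dataclass


@dataclass
class Superstep:
    """Per-block cycle counts for one superstep (hypothetical structure)."""
    compute_cycles: dict  # block_id -> cycles spent in local SIMD compute
    comm_cycles: dict     # block_id -> cycles spent exchanging data


def superstep_duration(step):
    """Performance track: the barrier waits for the slowest block."""
    return max(
        step.compute_cycles[b] + step.comm_cycles[b]
        for b in step.compute_cycles
    )


def sample_blocks(block_ids, fraction, seed=0):
    """Functional track: pick a random subset of blocks to simulate fully."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    k = max(1, round(len(block_ids) * fraction))
    return sorted(rng.sample(list(block_ids), k))


def run(steps, block_ids, fraction=0.5, seed=0):
    """Return (total cycles over all supersteps, functionally sampled blocks)."""
    total_cycles = sum(superstep_duration(s) for s in steps)
    return total_cycles, sample_blocks(block_ids, fraction, seed)


steps = [
    Superstep({0: 10, 1: 12}, {0: 5, 1: 1}),  # slowest block: 10 + 5 = 15
    Superstep({0: 3, 1: 4}, {0: 0, 1: 2}),    # slowest block: 4 + 2 = 6
]
total_cycles, sampled = run(steps, block_ids=[0, 1, 2, 3], fraction=0.5)
```

Every block contributes to the cycle count, while only the sampled blocks would be handed to a functional simulator. That split is the trade-off the hybrid track describes.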
The project implements a complete toolchain: Python DSL $\rightarrow$ Tungsten-IR $\rightarrow$ ISA Binary $\rightarrow$ Simulator.

Kernels are written in a CUDA-like Python DSL. For example, a simple SAXPY:

```python
@cs3_kernel(block_w=16, block_h=16)
def saxpy_kernel(ctx):
    # Load inputs from global memory (Weight Server)
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)

    # Compute: z = 2.0 * x + y
    z = 2.0 * x + y

    # Store result back to global memory
    ctx.store_global(None, 8, z)
```

- **Frontend**: A CUDA-like DSL embedded in Python using `@cs3_kernel` decorators.
- **Intermediate Representation (Tungsten-IR)**: A dataflow-centric IR mapping compute nodes and synchronization points.
- **Compiler Backend**:
  - **Mapping & Scheduling**: Assigns IR nodes to the physical 2D mesh and manages the SRAM budget.
  - **Assembler**: Emits the final 32-bit binary stream.
- **Simulator Engine**: A Python-based engine that decodes the ISA and drives the hardware model.

Instead of exhaustive packet-level simulation, the system uses a latency-and-bandwidth-aware abstract model:

- **Latency**: Calculated based on physical Manhattan distance:

  $$\text{Latency}_{\text{op}} = \text{Base Latency} + \text{Manhattan Distance} \times \text{Hop Latency}$$

- **Bandwidth & Congestion**: The simulator enforces a bisection bandwidth constraint. If the total bytes transferred in a superstep exceed network capacity, a congestion multiplier is applied to "stretch" the superstep duration.

| Component | Status | Details |
|---|---|---|
| ISA Decoder | ✅ Complete | Full implementation of Compute, Mesh, Control, System, DSD, and Global memory opcodes. |
| Hardware Model | ✅ Functional | Core logic, SRAM, 2D Mesh, and Host-Device IO are implemented. |
| Compiler | ✅ Functional | AST parsing, IR generation, register allocation, and assembly are operational. |
| Simulation Engine | ✅ Functional | Hybrid Performance/Functional tracks and BSP scheduling are implemented. |
| Advanced Mapping | 🚧 In Progress | Optimizing spatial mapping for complex kernels. |
| Weight Server | 🚧 In Progress | Integration with external weight servers for real-world model weights. |

The project uses a Dual-Execution strategy to verify the compiler:

- **Python Path**: Executes the kernel as a Python function via `KernelContext` (golden reference).
- **Binary Path**: Compiles the kernel $\rightarrow$ executes the resulting binary on the `BSPScheduler`.
- **Comparison**: Bit-exact comparison of final memory states.

To run integration tests: `python3 -m unittest discover tests/integration`
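
To make the "bit-exact comparison" step of the Dual-Execution strategy concrete, here is a small sketch of how two final memory images might be compared word by word. The helpers `pack_f32` and `bit_exact_mismatches` are hypothetical and not part of the project's API. The sketch also assumes memory is dumped as little-endian float32 words, which the source does not state.

```python
import struct


def pack_f32(values):
    """Serialize floats as little-endian IEEE 754 float32 words (assumed layout)."""
    return struct.pack(f"<{len(values)}f", *values)


def bit_exact_mismatches(golden: bytes, actual: bytes, word_size: int = 4):
    """Return the byte offsets of words whose bit patterns differ."""
    if len(golden) != len(actual):
        raise ValueError("memory images differ in size")
    return [
        off
        for off in range(0, len(golden), word_size)
        if golden[off:off + word_size] != actual[off:off + word_size]
    ]


# Golden reference (Python path) vs. binary path, as raw memory images.
golden = pack_f32([1.0, 0.0, 2.5])
actual = pack_f32([1.0, -0.0, 2.5])

# 0.0 == -0.0 as floats, but their sign bits differ, so a bit-exact
# comparison flags the second word (byte offset 4).
mismatches = bit_exact_mismatches(golden, actual)
```

Comparing raw bytes rather than float values is stricter than `==`: it distinguishes `0.0` from `-0.0` and treats identical NaN bit patterns as equal, which is what a compiler-verification check typically needs.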