Clojure is almost as fast as C (with some help)

A developer ported a C stress test to Clojure and achieved near-parity performance (0.86 ms vs 0.70 ms per frame) by using the JVM Vector API, static final species references, macros, and JDK 25, demonstrating that Clojure can match C speed with careful optimization.

I have a stress test written in C: 100,000 cubes flying around in space. The CPU rebuilds every cube’s 4x4 transform matrix on every frame and sends all of them to the GPU. That is around 900,000 sine evaluations and 6 MB of matrix data per frame, and after that the GPU still has to draw 3.6 million triangles. So the frame is half CPU work, half GPU work. I ported it to Clojure and wanted to see how close I could get to the C version’s FPS. I should say up front that I did not do the optimization work alone: I paired with Claude Code on it, and most of the digging in this post the benchmarks, the JIT logs, the failed attempts comes from that session. I did not expect much. The C version is built with clang at -O2 , and at that level clang auto-vectorizes the transform loop with NEON SIMD instructions without telling you anything. So when you benchmark a language against C, you are not really competing with the loop in the source file. You are competing with whatever the optimizer turned it into. The first measurements confirmed this. C computes all 100K matrices in 0.70 ms on a single thread. My best scalar Clojure loop - primitive arrays, type hints, unchecked math, every trick I knew - took 2.6 ms. Almost four times slower, and I had nothing left to try. HotSpot does not auto-vectorize a loop like this. Clang does. That is the entire gap. The JVM does have an answer though: the Vector API from Project Panama. Instead of hoping the JIT vectorizes your loop, you write the SIMD operations yourself, and since it is just a Java API, it works from Clojure too. My first attempt with the Vector API was a disaster. 7.7 ms. Slower than the scalar loop, ten times behind C. The code was correct, I could even see the vector intrinsics being compiled in the JIT logs, so for a while I was staring at it with no idea what was wrong. The problem is that the API only becomes fast when the JIT can treat the “species” the descriptor that says how wide your vectors are as a compile-time constant. I had stored it in a Clojure var. A var is a field lookup, the JIT cannot fold it, and every single vector operation silently fell back to a slow path that allocates objects. One indirection, 10x. Nothing warns you about this. Three things fixed it: - Reference the species as a static final field FloatVector/SPECIES 128 at every call site, so the JIT sees a constant. - Write the helper math as macros instead of functions, so the whole kernel ends up in one inlinable body. - Use a recent JDK. On JDK 21 the shuffle operations I use for transposing matrices in registers do not compile to the right NEON instructions, on JDK 25 they do. That upgrade alone took the pass from 2.5 ms to about 1 ms. Same code. After adding fused multiply-adds on top clang was already doing that for the C side , the Clojure pass landed at 0.86 ms against C’s 0.70 ms, single-threaded. When I run both apps side by side they average around 370 FPS on my M3 MacBook. At this point neither version is limited by the CPU pass anymore, the GPU is the bottleneck for both, which is what parity means for this test. The other thing I kept an eye on was garbage, because a hot loop that allocates will eventually stutter, and 0.86 ms means nothing if the GC interrupts you every second. The final kernel reads from plain float arrays, keeps everything in SIMD registers, and writes into one preallocated float array that goes directly to OpenGL. There is nothing for the collector to do. The heap sits flat at about 134 MB the whole time. For comparison, the broken first attempt was producing roughly 7.5 GB per second of temporary vector objects. Same algorithm, same API. The entire difference is whether the JIT gets to do its job. Nobody would call this idiomatic Clojure. There is no immutability in the hot path, no laziness, no sequences. It reads like C with parentheses. I am fine with that. You write normal Clojure for the 99% of the program where performance does not matter, and for the one loop where it does, the language lets you go this low without leaving it. The real thanks goes to the JVM developers. Project Panama’s Vector API is what made this possible: explicit SIMD from a dynamic language, landing within 20% of clang’s auto-vectorized output. Ten years ago my answer to this problem would have been writing the kernel in C and calling it through JNI. I am glad I do not have to do that anymore. Clojure code: ns cpu-stress-test :import org.lwjgl BufferUtils org.lwjgl.glfw GLFW GLFWErrorCallback Callbacks GLFWKeyCallbackI GLFWCursorPosCallbackI GLFWFramebufferSizeCallbackI org.lwjgl.opengl GL GL11 GL15 GL20 GL30 GL31 GL33 org.lwjgl.system MemoryUtil java.nio FloatBuffer java.util.concurrent Executors ExecutorService Callable jdk.incubator.vector FloatVector VectorShuffle :gen-class set warn-on-reflection true set unchecked-math :warn-on-boxed ;; ── configuration ─────────────────────────────────────────────────────────── def ^:const NUM-CUBES 100000 def ^:const CUBE-SPREAD 200.0 def ^:const ROTATION-SPEED 0.4 def ^:const INST-MOVE-SPEED 0.5 def ^:const OSC-AMPLITUDE 3.0 def ^:const WINDOW-W 1920 def ^:const WINDOW-H 1080 def ^:const NEAR-PLANE 0.1 def ^:const FAR-PLANE 500.0 def ^:const FPS-SAMPLE-COUNT 120 def ^:const STATIC-FLOATS 10 ; per cube: bp.xyz, rs.xyz, mf.xyz, scale def ^:const MODEL-FLOATS 16 ; per cube: column-major mat4 def ^:const TAU 6.28318530717958647692 def ^:const CAM-MOVE-SPEED 50.0 def ^:const CAM-SENSITIVITY 0.002 def CAM-FOV-RAD Math/PI / 70.0 180.0 def ncores .availableProcessors Runtime/getRuntime ;; thread counts to cycle through with T: 1,2,4,8,... clamped to core count def thread-counts vec sort distinct filter <= ^long % ^long ncores 1 2 4 8 16 long ncores ;; ── shaders ────────────────────────────────────────────────────────────────── def vertex-shader-src " version 330 core layout location = 0 in vec3 aPos; layout location = 1 in vec3 aNormal; layout location = 3 in vec4 aModel0; layout location = 4 in vec4 aModel1; layout location = 5 in vec4 aModel2; layout location = 6 in vec4 aModel3; layout location = 7 in vec3 aColor; uniform mat4 uView; uniform mat4 uProjection; out vec3 vColor; out vec3 vNormal; out vec3 vFragPos; void main { mat4 model = mat4 aModel0, aModel1, aModel2, aModel3 ; vec4 worldPos = model vec4 aPos, 1.0 ; gl Position = uProjection uView worldPos; vColor = aColor; vNormal = mat3 model aNormal; vFragPos = worldPos.xyz; } " def fragment-shader-src " version 330 core in vec3 vColor; in vec3 vNormal; in vec3 vFragPos; out vec4 FragColor; uniform vec3 uLightDir; uniform vec3 uViewPos; void main { vec3 norm = normalize vNormal ; vec3 lightDir = normalize uLightDir ; vec3 ambient = 0.15 vColor; float diff = max dot norm, lightDir , 0.0 ; vec3 diffuse = diff vColor; vec3 viewDir = normalize uViewPos - vFragPos ; vec3 halfDir = normalize lightDir + viewDir ; float spec = pow max dot norm, halfDir , 0.0 , 32.0 ; vec3 specular = 0.3 spec vec3 1.0 ; float dist = length vFragPos - uViewPos ; float fog = exp -dist 0.008 ; fog = clamp fog, 0.0, 1.0 ; vec3 result = ambient + diffuse + specular; result = mix vec3 0.02, 0.02, 0.05 , result, fog ; FragColor = vec4 result, 1.0 ; } " ;; ── cube geometry pos.xyz, normal.xyz ───────────────────────────────────── def cube-vertices float-array -0.5 -0.5 0.5 0 0 1 0.5 -0.5 0.5 0 0 1 0.5 0.5 0.5 0 0 1 0.5 0.5 0.5 0 0 1 -0.5 0.5 0.5 0 0 1 -0.5 -0.5 0.5 0 0 1 -0.5 -0.5 -0.5 0 0 -1 -0.5 0.5 -0.5 0 0 -1 0.5 0.5 -0.5 0 0 -1 0.5 0.5 -0.5 0 0 -1 0.5 -0.5 -0.5 0 0 -1 -0.5 -0.5 -0.5 0 0 -1 -0.5 0.5 -0.5 0 1 0 -0.5 0.5 0.5 0 1 0 0.5 0.5 0.5 0 1 0 0.5 0.5 0.5 0 1 0 0.5 0.5 -0.5 0 1 0 -0.5 0.5 -0.5 0 1 0 -0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 0.5 0 -1 0 0.5 -0.5 0.5 0 -1 0 -0.5 -0.5 0.5 0 -1 0 -0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 -0.5 1 0 0 0.5 0.5 -0.5 1 0 0 0.5 0.5 0.5 1 0 0 0.5 0.5 0.5 1 0 0 0.5 -0.5 0.5 1 0 0 0.5 -0.5 -0.5 1 0 0 -0.5 -0.5 -0.5 -1 0 0 -0.5 -0.5 0.5 -1 0 0 -0.5 0.5 0.5 -1 0 0 -0.5 0.5 0.5 -1 0 0 -0.5 0.5 -0.5 -1 0 0 -0.5 -0.5 -0.5 -1 0 0 ;; ── deterministic PRNG SplitMix64, seed 42 ───────────────────────────────── def ^:const SM-GAMMA unchecked-long Long/parseUnsignedLong "9E3779B97F4A7C15" 16 def ^:const SM-MIX1 unchecked-long Long/parseUnsignedLong "BF58476D1CE4E5B9" 16 def ^:const SM-MIX2 unchecked-long Long/parseUnsignedLong "94D049BB133111EB" 16 def ^:const INV-2P24 / 1.0 16777216.0 ;; Returns ^double, not ^float: Clojure has no primitive float fn return, so a ;; ^float hint would box every result and poison downstream arithmetic. The cloud ;; needn't be bit-identical to C the integer RNG state stream is , so double is fine. defn sm-mix ^double ^long s let z unchecked-multiply bit-xor s unsigned-bit-shift-right s 30 SM-MIX1 z unchecked-multiply bit-xor z unsigned-bit-shift-right z 27 SM-MIX2 z bit-xor z unsigned-bit-shift-right z 31 double unsigned-bit-shift-right z 40 INV-2P24 ;; ── instance generation: static CPU inputs + static GPU color buffer ───────── defn generate-instance-data let ^floats statics float-array NUM-CUBES STATIC-FLOATS ^FloatBuffer color BufferUtils/createFloatBuffer NUM-CUBES 3 half / CUBE-SPREAD 2.0 loop i 0, st long 42 if < i NUM-CUBES let s unchecked-add st SM-GAMMA px - sm-mix s CUBE-SPREAD half s unchecked-add s SM-GAMMA py - sm-mix s CUBE-SPREAD half s unchecked-add s SM-GAMMA pz - sm-mix s CUBE-SPREAD half s unchecked-add s SM-GAMMA hue sm-mix s TAU r + 0.5 0.5 Math/sin hue g + 0.5 0.5 Math/sin + hue 2.094 b + 0.5 0.5 Math/sin + hue 4.188 s unchecked-add s SM-GAMMA sc + 0.3 sm-mix s 0.7 s unchecked-add s SM-GAMMA rsx - sm-mix s 0.5 ROTATION-SPEED 2.0 s unchecked-add s SM-GAMMA rsy - sm-mix s 0.5 ROTATION-SPEED 2.0 s unchecked-add s SM-GAMMA rsz - sm-mix s 0.5 ROTATION-SPEED 2.0 s unchecked-add s SM-GAMMA mfx + 0.5 sm-mix s INST-MOVE-SPEED s unchecked-add s SM-GAMMA mfy + 0.5 sm-mix s INST-MOVE-SPEED s unchecked-add s SM-GAMMA mfz + 0.5 sm-mix s INST-MOVE-SPEED si i STATIC-FLOATS aset statics si float px aset statics + si 1 float py aset statics + si 2 float pz aset statics + si 3 float rsx aset statics + si 4 float rsy aset statics + si 5 float rsz aset statics + si 6 float mfx aset statics + si 7 float mfy aset statics + si 8 float mfz aset statics + si 9 float sc .put color float r .put color float g .put color float b recur inc i s do .flip color ;; transpose AoS - SoA one-time so the SIMD path can do contiguous ;; per-field vector loads: soa = bpx bpy bpz rsx ..rsz mfx ..mfz scale let soa object-array STATIC-FLOATS dotimes k STATIC-FLOATS aset soa int k float-array NUM-CUBES dotimes c NUM-CUBES let cs c STATIC-FLOATS dotimes k STATIC-FLOATS aset ^floats aget soa int k int c aget statics int + cs k statics soa color ;; ── shared polynomial sin/cos identical algorithm to the C version ───────── ;; Replaces Math/sin|cos fdlibm — not a HW intrinsic on Apple Silicon so the ;; C-vs-Clojure comparison isolates language/JIT instead of measuring two ;; different transcendental implementations. Range-reduce to -pi, pi , parabola ;; + one refinement. ^double on the arg vector ⇒ true primitive in/out, no boxing. def ^:const INV-TWO-PI 0.15915494309189535 def ^:const TWO-PI 6.283185307179586 def ^:const HALF-PI 1.5707963267948966 def ^:const B-SIN 1.2732395447351628 ; 4/pi def ^:const C-SIN -0.40528473456935109 ; -4/pi^2 def ^:const P-SIN 0.225 defn fast-sin ^double ^double x let k double long + x INV-TWO-PI if = x 0.0 0.5 -0.5 ; round to nearest r - x k TWO-PI ; r in -pi, pi ar Math/abs r y + B-SIN r C-SIN r ar ay Math/abs y + P-SIN - y ay y y ;; ── SIMD: the same polynomial sin on a FloatVector JDK Panama Vector API ─── ;; Lane-parallel version of fast-sin. Range reduction uses the float "magic ;; number" trick: v + 1.5 2^23 - 1.5 2^23 == round-to-nearest, all add/sub, so ;; it lowers to NEON — no per-lane libm and no copysign branch. ;; ;; CRITICAL for performance: the Vector API only compiles to actual SIMD when ;; C2 can constant-fold the species. A species stored in a Clojure Var is an ;; opaque field load, which silently drops every op onto the boxed Java fallback ;; path ~10x slower than scalar . So the species is referenced as the static ;; final field FloatVector/SPECIES 128 at every call site NEON = 128-bit = 4 ;; float lanes — this project is pinned to Apple Silicon , and the sin polynomial ;; is a macro, not a fn, so the whole kernel is one inlinable method body. def ^:const LANES 4 ; SPECIES 128 float lanes def ^:const MAGIC 12582912.0 ; 1.5 2^23 defmacro vbc "Broadcast a compile-time constant to a FloatVector. Inside a loop C2 hoists the resulting Replicate node out as loop-invariant, so this costs nothing." c FloatVector/broadcast FloatVector/SPECIES 128 float ~c defmacro vsin "Inline lanewise polynomial sin same algorithm as fast-sin on a FloatVector. Uses fma like the clang -O2 build of the C kernel fp-contract is on there ." x let x ~x v .mul x float INV-TWO-PI k .sub .add v float MAGIC float MAGIC ; round-to-nearest, lanewise r .fma k vbc - TWO-PI x ; r = x - k 2pi, in -pi, pi ar .abs r y .mul r .fma ar vbc C-SIN vbc B-SIN ; y = B r + C r |r| ay .abs y .fma .fma y ay .neg y vbc P-SIN y ; P y |y| - y + y ;; ── THE CPU WORKLOAD ───────────────────────────────────────────────────────── ;; Build column-major model matrices for cubes start, end at time t, writing ;; into the heap float model handed to glBufferData via LWJGL's array ;; overload — a pinned zero-copy JNI pass, symmetric with C handing malloc'd ;; memory to GL . Params are left unhinted a fn taking 4 primitives is illegal ;; and re-bound to primitives inside, so the body is fully primitive with no boxing. defn compute-range model statics start end t let ^floats model model ^floats statics statics end long end t double t osc double OSC-AMPLITUDE loop i long start when < i end let si i STATIC-FLOATS apx + aget statics si osc fast-sin aget statics + si 6 t apy + aget statics + si 1 osc fast-sin aget statics + si 7 t apz + aget statics + si 2 osc fast-sin aget statics + si 8 t ax aget statics + si 3 t ay aget statics + si 4 t az aget statics + si 5 t scale double aget statics + si 9 cx fast-sin + ax HALF-PI sx fast-sin ax cy fast-sin + ay HALF-PI sy fast-sin ay cz fast-sin + az HALF-PI sz fast-sin az base i MODEL-FLOATS aset model base float scale cy cz aset model + base 1 float scale + sx sy cz cx sz aset model + base 2 float scale + - cx sy cz sx sz aset model + base 3 float 0.0 aset model + base 4 float scale - cy sz aset model + base 5 float scale + - sx sy sz cx cz aset model + base 6 float scale + cx sy sz sx cz aset model + base 7 float 0.0 aset model + base 8 float scale sy aset model + base 9 float scale - sx cy aset model + base 10 float scale cx cy aset model + base 11 float 0.0 aset model + base 12 float apx aset model + base 13 float apy aset model + base 14 float apz aset model + base 15 float 1.0 recur inc i ;; SIMD variant: same math, LANES cubes at a time, reading SoA arrays with ;; contiguous vector loads. The 16 matrix elements are computed as 16 FloatVectors ;; lane j = cube i+j , then transposed in registers — four 4x4 transposes built ;; from two-vector rearranges NEON tbl — so each cube's 16 floats are written ;; with 4 contiguous vector stores. No scratch buffer, no scalar scatter. ;; Requires start,end to be LANES-aligned guaranteed by run-pass ; a scalar ;; tail covers any remainder. ;; ;; transpose4 turns column vectors A,B,C,W lane j = cube j into row vectors ;; T j = A j, B j, C j, W j and stores them at model i+j 16 + off : ;; ab-lo = A0 B0 A1 B1 cw-lo = C0 W0 C1 W1 - T0 = A0 B0 C0 W0 ;; ab-hi = A2 B2 A3 B3 cw-hi = C2 W2 C3 W3 T1 = A1 B1 C1 W1 ... ;; In a two-vector rearrange, shuffle index k in 0,4 picks this k and the ;; "exceptional" index k-4 negative picks other k . defmacro ^:private transpose4-store model i off A B C W zip-lo zip-hi cat-lo cat-hi let ab-lo .rearrange ~A ~zip-lo ~B ab-hi .rearrange ~A ~zip-hi ~B cw-lo .rearrange ~C ~zip-lo ~W cw-hi .rearrange ~C ~zip-hi ~W .intoArray .rearrange ab-lo ~cat-lo cw-lo ~model int + ~i 16 ~off .intoArray .rearrange ab-lo ~cat-hi cw-lo ~model int + + ~i 1 16 ~off .intoArray .rearrange ab-hi ~cat-lo cw-hi ~model int + + ~i 2 16 ~off .intoArray .rearrange ab-hi ~cat-hi cw-hi ~model int + + ~i 3 16 ~off defn compute-range-simd model soa start end t let ^floats model model ^objects soa soa start long start end long end tf float t osc float OSC-AMPLITUDE hp float HALF-PI lanes long LANES ^floats bpx aget soa 0 ^floats bpy aget soa 1 ^floats bpz aget soa 2 ^floats rsx aget soa 3 ^floats rsy aget soa 4 ^floats rsz aget soa 5 ^floats mfx aget soa 6 ^floats mfy aget soa 7 ^floats mfz aget soa 8 ^floats scl aget soa 9 zero FloatVector/zero FloatVector/SPECIES 128 one FloatVector/broadcast FloatVector/SPECIES 128 float 1.0 zip-lo VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 0 -4 1 -3 zip-hi VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 2 -2 3 -1 cat-lo VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 0 1 -4 -3 cat-hi VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 2 3 -2 -1 simd-end + start quot - end start lanes lanes loop i start when < i simd-end let ii int i BPX FloatVector/fromArray FloatVector/SPECIES 128 bpx ii BPY FloatVector/fromArray FloatVector/SPECIES 128 bpy ii BPZ FloatVector/fromArray FloatVector/SPECIES 128 bpz ii RSX FloatVector/fromArray FloatVector/SPECIES 128 rsx ii RSY FloatVector/fromArray FloatVector/SPECIES 128 rsy ii RSZ FloatVector/fromArray FloatVector/SPECIES 128 rsz ii MFX FloatVector/fromArray FloatVector/SPECIES 128 mfx ii MFY FloatVector/fromArray FloatVector/SPECIES 128 mfy ii MFZ FloatVector/fromArray FloatVector/SPECIES 128 mfz ii SCALE FloatVector/fromArray FloatVector/SPECIES 128 scl ii APX .add BPX .mul vsin .mul MFX tf osc APY .add BPY .mul vsin .mul MFY tf osc APZ .add BPZ .mul vsin .mul MFZ tf osc AX .mul RSX tf AY .mul RSY tf AZ .mul RSZ tf CX vsin .add AX hp SX vsin AX CY vsin .add AY hp SY vsin AY CZ vsin .add AZ hp SZ vsin AZ SXSY .mul SX SY CXSY .mul CX SY M0 .mul SCALE .mul CY CZ M1 .mul SCALE .fma SXSY CZ .mul CX SZ M2 .mul SCALE .fma .neg CXSY CZ .mul SX SZ M4 .mul SCALE .neg .mul CY SZ M5 .mul SCALE .fma .neg SXSY SZ .mul CX CZ M6 .mul SCALE .fma CXSY SZ .mul SX CZ M8 .mul SCALE SY M9 .mul SCALE .neg .mul SX CY M10 .mul SCALE .mul CX CY transpose4-store model ii 0 M0 M1 M2 zero zip-lo zip-hi cat-lo cat-hi transpose4-store model ii 4 M4 M5 M6 zero zip-lo zip-hi cat-lo cat-hi transpose4-store model ii 8 M8 M9 M10 zero zip-lo zip-hi cat-lo cat-hi transpose4-store model ii 12 APX APY APZ one zip-lo zip-hi cat-lo cat-hi recur + i lanes ;; scalar tail only runs if the range wasn't LANES-aligned loop i simd-end when < i end let ix int i td double tf od double osc apx + aget bpx ix od fast-sin aget mfx ix td apy + aget bpy ix od fast-sin aget mfy ix td apz + aget bpz ix od fast-sin aget mfz ix td ax aget rsx ix td ay aget rsy ix td az aget rsz ix td sc double aget scl ix cx fast-sin + ax HALF-PI sx fast-sin ax cy fast-sin + ay HALF-PI sy fast-sin ay cz fast-sin + az HALF-PI sz fast-sin az base i MODEL-FLOATS aset model base float sc cy cz aset model + base 1 float sc + sx sy cz cx sz aset model + base 2 float sc + - cx sy cz sx sz aset model + base 3 float 0.0 aset model + base 4 float sc - cy sz aset model + base 5 float sc + - sx sy sz cx cz aset model + base 6 float sc + cx sy sz sx cz aset model + base 7 float 0.0 aset model + base 8 float sc sy aset model + base 9 float sc - sx cy aset model + base 10 float sc cx cy aset model + base 11 float 0.0 aset model + base 12 float apx aset model + base 13 float apy aset model + base 14 float apz aset model + base 15 float 1.0 recur inc i ;; Run the transform pass for the whole cloud, fanned across nt threads, in ;; either scalar AoS statics or SIMD SoA soa mode. The calling thread ;; computes chunk 0 itself while the pool runs chunks 1..nt-1 cheaper than ;; parking the caller in invokeAll — closer to OpenMP's fork-join cost . defn run-pass ^ExecutorService pool model statics soa nt t simd? let n long NUM-CUBES nt long nt lanes long LANES run fn s e if simd? compute-range-simd model soa s e t compute-range model statics s e t if == nt 1 run 0 n let base quot + n dec nt nt ; ceil n / nt chunk long if simd? quot + base dec lanes lanes lanes ; round up to LANES base futs mapv fn c let start long c chunk e + start chunk end if < e n e n .submit pool reify Callable call when < start n run start end nil range 1 nt run 0 min chunk n run fn ^java.util.concurrent.Future f .get f futs nil ;; ── minimal column-major math view + projection only ─────────────────────── defn perspective ^FloatBuffer b fovy aspect near far let fovy double fovy aspect double aspect near double near far double far t Math/tan / fovy 2.0 dotimes i 16 .put b int i float 0 doto b .put 0 float / 1.0 aspect t .put 5 float / 1.0 t .put 10 float / - + far near - far near .put 11 float -1.0 .put 14 float / - 2.0 far near - far near defn look-at ^FloatBuffer b ex ey ez cx cy cz ux uy uz let ex double ex ey double ey ez double ez cx double cx cy double cy cz double cz ux double ux uy double uy uz double uz dx - cx ex dy - cy ey dz - cz ez dl Math/sqrt + dx dx dy dy dz dz fx / dx dl fy / dy dl fz / dz dl ax - fy uz fz uy ay - fz ux fx uz az - fx uy fy ux al Math/sqrt + ax ax ay ay az az sx / ax al sy / ay al sz / az al vx - sy fz sz fy vy - sz fx sx fz vz - sx fy sy fx doto b .put 0 float sx .put 1 float vx .put 2 float - fx .put 3 float 0 .put 4 float sy .put 5 float vy .put 6 float - fy .put 7 float 0 .put 8 float sz .put 9 float vz .put 10 float - fz .put 11 float 0 .put 12 float - + sx ex sy ey sz ez .put 13 float - + vx ex vy ey vz ez .put 14 float + fx ex fy ey fz ez .put 15 float 1 ;; ── shader compilation ──────────────────────────────────────────────────────── defn compile-shader ^CharSequence src ^long kind ^String label let id GL20/glCreateShader int kind GL20/glShaderSource id src GL20/glCompileShader id when zero? GL20/glGetShaderi id GL20/GL COMPILE STATUS binding out err println str "ERROR::SHADER::" label "::COMPILATION FAILED" println GL20/glGetShaderInfoLog id throw ex-info "shader compilation failed" {:label label} id defn build-program vs-src fs-src let vs compile-shader vs-src GL20/GL VERTEX SHADER "VERTEX" fs compile-shader fs-src GL20/GL FRAGMENT SHADER "FRAGMENT" program GL20/glCreateProgram GL20/glAttachShader program vs GL20/glAttachShader program fs GL20/glLinkProgram program GL20/glDeleteShader vs GL20/glDeleteShader fs when zero? GL20/glGetProgrami program GL20/GL LINK STATUS binding out err println "ERROR::SHADER::PROGRAM::LINKING FAILED" println GL20/glGetProgramInfoLog program throw ex-info "program linking failed" {} program ;; ── shared state for GLFW callbacks ────────────────────────────────────────── def cam-pos atom 0.0 50.0 150.0 def cam-yaw atom 0.0 def cam-pitch atom 0.0 def first-mouse atom true def last-mouse atom 0.0 0.0 def mouse-captured atom true def vsync-on atom false def show-stats atom true def fb-size atom WINDOW-W WINDOW-H def thread-idx atom 0 ; index into thread-counts def simd? atom true ; true = Panama SIMD default , false = scalar defn current-threads ^long long nth thread-counts @thread-idx defn process-keyboard ^long window ^double dt let yaw double @cam-yaw pitch double @cam-pitch speed CAM-MOVE-SPEED dt cp Math/cos pitch fx Math/sin yaw cp fz - Math/cos yaw cp len Math/sqrt + fx fx fz fz nfx if zero? len fx / fx len nfz if zero? len fz / fz len rx Math/cos yaw rz Math/sin yaw down? fn k = GLFW/glfwGetKey window int k GLFW/GLFW PRESS w? down? GLFW/GLFW KEY W s? down? GLFW/GLFW KEY S a? down? GLFW/GLFW KEY A d? down? GLFW/GLFW KEY D up? down? GLFW/GLFW KEY SPACE dn? down? GLFW/GLFW KEY LEFT SHIFT ddx + if w? nfx speed 0.0 if s? - nfx speed 0.0 if d? rx speed 0.0 if a? - rx speed 0.0 ddz + if w? nfz speed 0.0 if s? - nfz speed 0.0 if d? rz speed 0.0 if a? - rz speed 0.0 ddy + if up? speed 0.0 if dn? - speed 0.0 when or w? s? a? d? up? dn? swap cam-pos fn x y z + double x ddx + double y ddy + double z ddz defn make-key-callback reify GLFWKeyCallbackI invoke window key scancode action mods when = action GLFW/GLFW PRESS condp = key GLFW/GLFW KEY ESCAPE GLFW/glfwSetWindowShouldClose window true GLFW/GLFW KEY TAB do swap mouse-captured not if @mouse-captured do GLFW/glfwSetInputMode window GLFW/GLFW CURSOR GLFW/GLFW CURSOR DISABLED reset first-mouse true GLFW/glfwSetInputMode window GLFW/GLFW CURSOR GLFW/GLFW CURSOR NORMAL GLFW/GLFW KEY T do swap thread-idx fn i mod inc long i count thread-counts println "Worker threads:" current-threads GLFW/GLFW KEY M do swap simd? not println "Compute mode:" if @simd? "SIMD Panama " "scalar" GLFW/GLFW KEY V do swap vsync-on not GLFW/glfwSwapInterval int if @vsync-on 1 0 println "VSync:" if @vsync-on "ON" "OFF" GLFW/GLFW KEY F1 swap show-stats not nil defn make-cursor-callback reify GLFWCursorPosCallbackI invoke window xpos ypos if @first-mouse do reset last-mouse xpos ypos reset first-mouse false when @mouse-captured let lx ly @last-mouse xoff - double xpos double lx yoff - double ypos double ly max-pitch - / Math/PI 2.0 0.1 reset last-mouse xpos ypos swap cam-yaw fn y + double y xoff CAM-SENSITIVITY swap cam-pitch fn p let np - double p yoff CAM-SENSITIVITY cond np max-pitch max-pitch < np - max-pitch - max-pitch :else np defn make-fb-callback reify GLFWFramebufferSizeCallbackI invoke window w h reset fb-size w h GL11/glViewport 0 0 int w int h def log-stats? some? System/getenv "STRESS STATS" ; STRESS STATS=1 - also print stats to stdout defn update-title window threads cpu-ms fps frame-ms let window long window mode if @simd? "SIMD" "scalar" if @show-stats let rt Runtime/getRuntime used-mb quot - .totalMemory rt .freeMemory rt 1024 1024 max-mb quot .maxMemory rt 1024 1024 title format "CPU Stress Clojure, %s, %d thread%s | %d cubes | CPU: %.2f ms | FPS: %.1f | %.2f ms | Mem: %d/%d MB | VSync: %s" mode threads if == long threads 1 "" "s" NUM-CUBES cpu-ms fps frame-ms used-mb max-mb if @vsync-on "ON" "OFF" when log-stats? println title flush GLFW/glfwSetWindowTitle window title GLFW/glfwSetWindowTitle window "CPU Stress Test - 100K Cubes" ;; ── main ───────────────────────────────────────────────────────────────────── defn -main & args .set GLFWErrorCallback/createPrint System/err when-not GLFW/glfwInit binding out err println "GLFW init failed" System/exit 1 GLFW/glfwDefaultWindowHints GLFW/glfwWindowHint GLFW/GLFW CONTEXT VERSION MAJOR 3 GLFW/glfwWindowHint GLFW/GLFW CONTEXT VERSION MINOR 3 GLFW/glfwWindowHint GLFW/GLFW OPENGL PROFILE GLFW/GLFW OPENGL CORE PROFILE GLFW/glfwWindowHint GLFW/GLFW OPENGL FORWARD COMPAT GLFW/GLFW TRUE let window GLFW/glfwCreateWindow int WINDOW-W int WINDOW-H "CPU Stress Test - 100K Cubes Clojure " MemoryUtil/NULL MemoryUtil/NULL when = window MemoryUtil/NULL binding out err println "Window creation failed" GLFW/glfwTerminate System/exit 1 GLFW/glfwMakeContextCurrent window GLFW/glfwSwapInterval 0 let wbuf int-array 1 hbuf int-array 1 GLFW/glfwGetFramebufferSize window wbuf hbuf reset fb-size aget wbuf 0 aget hbuf 0 GLFW/glfwSetFramebufferSizeCallback window make-fb-callback GLFW/glfwSetCursorPosCallback window make-cursor-callback GLFW/glfwSetKeyCallback window make-key-callback GLFW/glfwSetInputMode window GLFW/GLFW CURSOR GLFW/GLFW CURSOR DISABLED GL/createCapabilities println format str "\n" "CPU STRESS TEST - %d CPU-TRANSFORMED CUBES Clojure / LWJGL \n" "OpenGL: %s\nGPU: %s\n" "Cores: %d | thread cycle T : %s | SIMD lanes: %d M to toggle Panama SIMD \n" "WASD/Space/Shift fly | Mouse look | Tab cursor | T threads | M SIMD | V vsync | F1 stats | Esc quit\n" NUM-CUBES GL11/glGetString GL11/GL VERSION GL11/glGetString GL11/GL RENDERER ncores pr-str thread-counts LANES GL11/glEnable GL11/GL DEPTH TEST GL11/glDepthFunc GL11/GL LESS GL11/glEnable GL11/GL CULL FACE GL11/glCullFace GL11/GL BACK let shader int build-program vertex-shader-src fragment-shader-src statics soa color-buf generate-instance-data model-bytes long NUM-CUBES MODEL-FLOATS 4 ^floats model float-array NUM-CUBES MODEL-FLOATS ^ExecutorService pool Executors/newFixedThreadPool ncores cube-buf doto BufferUtils/createFloatBuffer alength ^floats cube-vertices .put ^floats cube-vertices .flip stride-cube int 6 4 stride-model int MODEL-FLOATS 4 vao GL30/glGenVertexArrays vbo-cube GL15/glGenBuffers vbo-model GL15/glGenBuffers vbo-color GL15/glGenBuffers println format "Generated %d cubes streaming %.2f MB/frame of model matrices " NUM-CUBES / double model-bytes 1024.0 1024.0 let fw fh @fb-size println format "Framebuffer: %dx%d" long fw long fh GL30/glBindVertexArray vao ;; per-vertex cube geometry GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-cube GL15/glBufferData GL15/GL ARRAY BUFFER ^FloatBuffer cube-buf GL15/GL STATIC DRAW GL20/glVertexAttribPointer 0 3 GL11/GL FLOAT false stride-cube long 0 GL20/glEnableVertexAttribArray 0 GL20/glVertexAttribPointer 1 3 GL11/GL FLOAT false stride-cube long 3 4 GL20/glEnableVertexAttribArray 1 ;; per-instance model matrix mat4 = locations 3..6 , streamed each frame GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-model GL15/glBufferData GL15/GL ARRAY BUFFER model-bytes GL15/GL STREAM DRAW dotimes col 4 let loc int + 3 col GL20/glVertexAttribPointer loc 4 GL11/GL FLOAT false stride-model long col 4 4 GL20/glEnableVertexAttribArray loc GL33/glVertexAttribDivisor loc 1 ;; per-instance color location 7 , static GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-color GL15/glBufferData GL15/GL ARRAY BUFFER ^FloatBuffer color-buf GL15/GL STATIC DRAW GL20/glVertexAttribPointer 7 3 GL11/GL FLOAT false int 3 4 long 0 GL20/glEnableVertexAttribArray 7 GL33/glVertexAttribDivisor 7 1 GL30/glBindVertexArray 0 GL15/glBindBuffer GL15/GL ARRAY BUFFER 0 let u-view GL20/glGetUniformLocation shader "uView" u-proj GL20/glGetUniformLocation shader "uProjection" u-light GL20/glGetUniformLocation shader "uLightDir" u-viewpos GL20/glGetUniformLocation shader "uViewPos" view-buf BufferUtils/createFloatBuffer 16 proj-buf BufferUtils/createFloatBuffer 16 frame-times double-array FPS-SAMPLE-COUNT ;; ── render loop ────────────────────────────────────────────────────── loop last-t GLFW/glfwGetTime idx 0 cnt 0 elapsed 0.0 title-t 0.0 cpu-accum 0.0 cpu-frames 0 if GLFW/glfwWindowShouldClose window do .shutdown pool GL30/glDeleteVertexArrays vao GL15/glDeleteBuffers vbo-cube GL15/glDeleteBuffers vbo-model GL15/glDeleteBuffers vbo-color GL20/glDeleteProgram shader Callbacks/glfwFreeCallbacks window GLFW/glfwDestroyWindow window GLFW/glfwTerminate some- GLFW/glfwSetErrorCallback nil .free let now GLFW/glfwGetTime dt - now last-t aset frame-times idx dt idx' rem inc idx FPS-SAMPLE-COUNT cnt' if < cnt FPS-SAMPLE-COUNT inc cnt cnt total double loop i 0 acc 0.0 if < i cnt' recur inc i + acc aget frame-times i acc fps if total 0.0 / double cnt' total 0.0 frame-ms dt 1000.0 elapsed' + elapsed dt title-t' + title-t dt nt current-threads process-keyboard window dt ;; ── the CPU pass timed ────────────────────────────────────── let c0 GLFW/glfwGetTime run-pass pool model statics soa nt elapsed' 5.0 @simd? c1 GLFW/glfwGetTime cpu-accum' + cpu-accum - c1 c0 1000.0 cpu-frames' inc cpu-frames GL11/glClearColor float 0.02 float 0.02 float 0.05 float 1.0 GL11/glClear bit-or GL11/GL COLOR BUFFER BIT GL11/GL DEPTH BUFFER BIT let fw fh @fb-size aspect / double fw double max 1 long fh cx cy cz @cam-pos cx double cx cy double cy cz double cz yaw double @cam-yaw pitch double @cam-pitch cp Math/cos pitch fwx Math/sin yaw cp fwy Math/sin pitch fwz - Math/cos yaw cp look-at view-buf cx cy cz + cx fwx 20.0 + cy fwy 20.0 + cz fwz 20.0 0.0 1.0 0.0 perspective proj-buf CAM-FOV-RAD aspect NEAR-PLANE FAR-PLANE ;; stream the freshly-computed matrices orphan + upload GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-model GL15/glBufferData GL15/GL ARRAY BUFFER model GL15/GL STREAM DRAW GL20/glUseProgram shader GL20/glUniformMatrix4fv u-view false ^FloatBuffer view-buf GL20/glUniformMatrix4fv u-proj false ^FloatBuffer proj-buf GL20/glUniform3f u-light float 0.5 float 0.8 float 0.3 GL20/glUniform3f u-viewpos float cx float cy float cz GL30/glBindVertexArray vao GL31/glDrawArraysInstanced GL11/GL TRIANGLES 0 36 int NUM-CUBES GL30/glBindVertexArray 0 let refresh? = title-t' 0.5 when refresh? update-title window nt / cpu-accum' cpu-frames' fps frame-ms title-t'' if refresh? 0.0 title-t' cpu-accum'' if refresh? 0.0 cpu-accum' cpu-frames'' if refresh? 0 cpu-frames' GLFW/glfwSwapBuffers window GLFW/glfwPollEvents recur now idx' cnt' elapsed' title-t'' cpu-accum'' cpu-frames'' C code: define GL SILENCE DEPRECATION 1 define GLFW INCLUDE NONE 1 include <OpenGL/gl3.h include <GLFW/glfw3.h include <math.h include <omp.h include <stdbool.h include <stdint.h include <stdio.h include <stdlib.h include <mach/mach.h include <sys/sysctl.h // ── configuration ─────────────────────────────────────────────────────────── define NUM CUBES 100000 define CUBE SPREAD 200.0f define ROTATION SPEED 0.4f define MOVEMENT SPEED 0.5f define OSC AMPLITUDE 3.0f define WINDOW W 1920 define WINDOW H 1080 define NEAR PLANE 0.1f define FAR PLANE 500.0f define FPS SAMPLE COUNT 120 // per-cube static CPU inputs: bpx,bpy,bpz, rsx,rsy,rsz, mfx,mfy,mfz, scale define STATIC FLOATS 10 // per-cube streamed output: a column-major mat4 define MODEL FLOATS 16 ifndef M PI define M PI 3.14159265358979323846 endif define TAU 6.28318530717958647692f // ── shaders ────────────────────────────────────────────────────────────────── // Vertex shader is now trivial: the model matrix arrives per-instance built on // the CPU and we just transform by it. Fragment shader is unchanged. static const char vertex shader src = " version 330 core\n" "layout location = 0 in vec3 aPos;\n" "layout location = 1 in vec3 aNormal;\n" "layout location = 3 in vec4 aModel0;\n" "layout location = 4 in vec4 aModel1;\n" "layout location = 5 in vec4 aModel2;\n" "layout location = 6 in vec4 aModel3;\n" "layout location = 7 in vec3 aColor;\n" "uniform mat4 uView;\n" "uniform mat4 uProjection;\n" "out vec3 vColor;\n" "out vec3 vNormal;\n" "out vec3 vFragPos;\n" "void main {\n" " mat4 model = mat4 aModel0, aModel1, aModel2, aModel3 ;\n" " vec4 worldPos = model vec4 aPos, 1.0 ;\n" " gl Position = uProjection uView worldPos;\n" " vColor = aColor;\n" " vNormal = mat3 model aNormal;\n" " vFragPos = worldPos.xyz;\n" "}\n"; static const char fragment shader src = " version 330 core\n" "in vec3 vColor;\n" "in vec3 vNormal;\n" "in vec3 vFragPos;\n" "out vec4 FragColor;\n" "uniform vec3 uLightDir;\n" "uniform vec3 uViewPos;\n" "void main {\n" " vec3 norm = normalize vNormal ;\n" " vec3 lightDir = normalize uLightDir ;\n" " vec3 ambient = 0.15 vColor;\n" " float diff = max dot norm, lightDir , 0.0 ;\n" " vec3 diffuse = diff vColor;\n" " vec3 viewDir = normalize uViewPos - vFragPos ;\n" " vec3 halfDir = normalize lightDir + viewDir ;\n" " float spec = pow max dot norm, halfDir , 0.0 , 32.0 ;\n" " vec3 specular = 0.3 spec vec3 1.0 ;\n" " float dist = length vFragPos - uViewPos ;\n" " float fog = exp -dist 0.008 ;\n" " fog = clamp fog, 0.0, 1.0 ;\n" " vec3 result = ambient + diffuse + specular;\n" " result = mix vec3 0.02, 0.02, 0.05 , result, fog ;\n" " FragColor = vec4 result, 1.0 ;\n" "}\n"; // ── cube geometry pos.xyz, normal.xyz ───────────────────────────────────── static const float cube vertices = { -0.5f, -0.5f, 0.5f, 0, 0, 1, 0.5f, -0.5f, 0.5f, 0, 0, 1, 0.5f, 0.5f, 0.5f, 0, 0, 1, 0.5f, 0.5f, 0.5f, 0, 0, 1, -0.5f, 0.5f, 0.5f, 0, 0, 1, -0.5f, -0.5f, 0.5f, 0, 0, 1, -0.5f, -0.5f, -0.5f, 0, 0, -1, -0.5f, 0.5f, -0.5f, 0, 0, -1, 0.5f, 0.5f, -0.5f, 0, 0, -1, 0.5f, 0.5f, -0.5f, 0, 0, -1, 0.5f, -0.5f, -0.5f, 0, 0, -1, -0.5f, -0.5f, -0.5f, 0, 0, -1, -0.5f, 0.5f, -0.5f, 0, 1, 0, -0.5f, 0.5f, 0.5f, 0, 1, 0, 0.5f, 0.5f, 0.5f, 0, 1, 0, 0.5f, 0.5f, 0.5f, 0, 1, 0, 0.5f, 0.5f, -0.5f, 0, 1, 0, -0.5f, 0.5f, -0.5f, 0, 1, 0, -0.5f, -0.5f, -0.5f, 0, -1, 0, 0.5f, -0.5f, -0.5f, 0, -1, 0, 0.5f, -0.5f, 0.5f, 0, -1, 0, 0.5f, -0.5f, 0.5f, 0, -1, 0, -0.5f, -0.5f, 0.5f, 0, -1, 0, -0.5f, -0.5f, -0.5f, 0, -1, 0, 0.5f, -0.5f, -0.5f, 1, 0, 0, 0.5f, 0.5f, -0.5f, 1, 0, 0, 0.5f, 0.5f, 0.5f, 1, 0, 0, 0.5f, 0.5f, 0.5f, 1, 0, 0, 0.5f, -0.5f, 0.5f, 1, 0, 0, 0.5f, -0.5f, -0.5f, 1, 0, 0, -0.5f, -0.5f, -0.5f, -1, 0, 0, -0.5f, -0.5f, 0.5f, -1, 0, 0, -0.5f, 0.5f, 0.5f, -1, 0, 0, -0.5f, 0.5f, 0.5f, -1, 0, 0, -0.5f, 0.5f, -0.5f, -1, 0, 0, -0.5f, -0.5f, -0.5f, -1, 0, 0, }; // ── minimal column-major math ──────────────────────────────────────────────── typedef float vec3 3 ; typedef float mat4 16 ; static void vec3 normalize const vec3 v, vec3 out { float len = sqrtf v 0 v 0 + v 1 v 1 + v 2 v 2 ; if len == 0.0f { out 0 = v 0 ; out 1 = v 1 ; out 2 = v 2 ; return; } float inv = 1.0f / len; out 0 = v 0 inv; out 1 = v 1 inv; out 2 = v 2 inv; } static float vec3 dot const vec3 a, const vec3 b { return a 0 b 0 + a 1 b 1 + a 2 b 2 ; } static void vec3 cross const vec3 a, const vec3 b, vec3 out { out 0 = a 1 b 2 - a 2 b 1 ; out 1 = a 2 b 0 - a 0 b 2 ; out 2 = a 0 b 1 - a 1 b 0 ; } static void mat4 perspective float fovy rad, float aspect, float near, float far, mat4 m { float t = tanf fovy rad / 2.0f ; for int i = 0; i < 16; i++ m i = 0.0f; m 0 = 1.0f / aspect t ; m 5 = 1.0f / t; m 10 = - far + near / far - near ; m 11 = -1.0f; m 14 = - 2.0f far near / far - near ; } static void mat4 look at const vec3 eye, const vec3 center, const vec3 up, mat4 m { vec3 f, s, u, dir; dir 0 = center 0 -eye 0 ; dir 1 = center 1 -eye 1 ; dir 2 = center 2 -eye 2 ; vec3 normalize dir, f ; vec3 fxup; vec3 cross f, up, fxup ; vec3 normalize fxup, s ; vec3 cross s, f, u ; m 0 =s 0 ; m 1 =u 0 ; m 2 =-f 0 ; m 3 =0; m 4 =s 1 ; m 5 =u 1 ; m 6 =-f 1 ; m 7 =0; m 8 =s 2 ; m 9 =u 2 ; m 10 =-f 2 ; m 11 =0; m 12 =-vec3 dot s, eye ; m 13 =-vec3 dot u, eye ; m 14 =vec3 dot f, eye ; m 15 =1; } // ── deterministic PRNG SplitMix64, seed 42 ───────────────────────────────── static uint64 t rng state = 42; static float rng float void { rng state += 0x9E3779B97F4A7C15ULL; uint64 t z = rng state; z = z ^ z 30 0xBF58476D1CE4E5B9ULL; z = z ^ z 27 0x94D049BB133111EBULL; z = z ^ z 31 ; return float z 40 1.0f / 16777216.0f ; } // ── fly camera ─────────────────────────────────────────────────────────────── typedef struct { vec3 pos; float yaw, pitch, move speed, sensitivity, fov rad; } FlyCam; static void flycam forward const FlyCam self, vec3 out { float cp = cosf self- pitch ; out 0 = sinf self- yaw cp; out 1 = sinf self- pitch ; out 2 = -cosf self- yaw cp; } static void flycam right const FlyCam self, vec3 out { out 0 = cosf self- yaw ; out 1 = 0.0f; out 2 = sinf self- yaw ; } static void flycam view matrix const FlyCam self, mat4 m { vec3 fwd, target; flycam forward self, fwd ; target 0 = self- pos 0 + fwd 0 20.0f; target 1 = self- pos 1 + fwd 1 20.0f; target 2 = self- pos 2 + fwd 2 20.0f; vec3 world up = {0.0f, 1.0f, 0.0f}; mat4 look at self- pos, target, world up, m ; } static void flycam process keyboard FlyCam self, GLFWwindow window, float dt { float speed = self- move speed dt; vec3 fwd, r; flycam forward self, fwd ; vec3 fwd xz raw = {fwd 0 , 0.0f, fwd 2 }, fwd xz; vec3 normalize fwd xz raw, fwd xz ; flycam right self, r ; if glfwGetKey window, GLFW KEY W == GLFW PRESS { self- pos 0 +=fwd xz 0 speed; self- pos 2 +=fwd xz 2 speed; } if glfwGetKey window, GLFW KEY S == GLFW PRESS { self- pos 0 -=fwd xz 0 speed; self- pos 2 -=fwd xz 2 speed; } if glfwGetKey window, GLFW KEY A == GLFW PRESS { self- pos 0 -=r 0 speed; self- pos 2 -=r 2 speed; } if glfwGetKey window, GLFW KEY D == GLFW PRESS { self- pos 0 +=r 0 speed; self- pos 2 +=r 2 speed; } if glfwGetKey window, GLFW KEY SPACE == GLFW PRESS self- pos 1 +=speed; if glfwGetKey window, GLFW KEY LEFT SHIFT == GLFW PRESS self- pos 1 -=speed; } // ── memory usage for the title bar : resident set size / physical RAM ─────── static size t process rss mb void { struct mach task basic info info; mach msg type number t count = MACH TASK BASIC INFO COUNT; if task info mach task self , MACH TASK BASIC INFO, task info t &info, &count = KERN SUCCESS return 0; return size t info.resident size / 1024 1024 ; } static size t physical ram mb void { uint64 t mem = 0; size t len = sizeof mem ; if sysctlbyname "hw.memsize", &mem, &len, NULL, 0 = 0 return 0; return size t mem / 1024 1024 ; } // ── FPS counter ─────────────────────────────────────────────────────────────── typedef struct { double last time, frame times FPS SAMPLE COUNT , fps, frame ms; size t frame index, frame count; } FpsCounter; static void fps init FpsCounter self { self- last time = glfwGetTime ; for int i = 0; i < FPS SAMPLE COUNT; i++ self- frame times i = 0.0; self- frame index = 0; self- frame count = 0; self- fps = 0.0; self- frame ms = 0.0; } static float fps update FpsCounter self { double now = glfwGetTime ; double dt = now - self- last time; self- frame times self- frame index = dt; self- frame index = self- frame index + 1 % FPS SAMPLE COUNT; if self- frame count < FPS SAMPLE COUNT self- frame count++; double total = 0.0; for size t i = 0; i < self- frame count; i++ total += self- frame times i ; self- fps = total 0.0 ? double self- frame count / total : 0.0; self- frame ms = dt 1000.0; self- last time = now; return float dt; } // ── threading OpenMP ──────────────────────────────────────────────────────── // Cycled live with T. g num threads feeds the num threads ... clause below. static int g thread counts 16 ; static int g thread count n = 0; static int g thread idx = 0; static int g num threads = 1; // ── shared polynomial sin/cos identical algorithm in C and Clojure ───────── // Replaces libm sinf/cosf so the C-vs-Clojure comparison isolates language/JIT // instead of measuring two different transcendental implementations. Range-reduce // to -pi, pi in double, robust as the animation time grows unbounded , then a // parabola + one Newton-ish refinement. Max abs error ~9e-4 — invisible here. define INV TWO PI 0.15915494309189535 define TWO PI D 6.283185307179586 define HALF PI F 1.5707963267948966f define B SIN 1.2732395447351628f // 4/pi define C SIN -0.40528473456935109f // -4/pi^2 define P SIN 0.225f static inline float fast sinf float x { double xd = double x; double k = double long long xd INV TWO PI + xd = 0.0 ? 0.5 : -0.5 ; // round to nearest float r = float xd - k TWO PI D ; // r in -pi, pi float y = B SIN r + C SIN r fabsf r ; return P SIN y fabsf y - y + y; } static inline float fast cosf float x { return fast sinf x + HALF PI F ; } // ── static instance inputs + streamed model output ─────────────────────────── static float cpu static = NULL; // NUM CUBES STATIC FLOATS static float model data = NULL; // NUM CUBES MODEL FLOATS recomputed each frame static float color data = NULL; // NUM CUBES 3 static static bool generate instance data void { cpu static = malloc size t NUM CUBES STATIC FLOATS sizeof float ; model data = malloc size t NUM CUBES MODEL FLOATS sizeof float ; color data = malloc size t NUM CUBES 3 sizeof float ; if cpu static || model data || color data return false; const float half = CUBE SPREAD / 2.0f; for size t i = 0; i < NUM CUBES; i++ { float px = rng float CUBE SPREAD - half; float py = rng float CUBE SPREAD - half; float pz = rng float CUBE SPREAD - half; float hue = rng float TAU; float r = 0.5f + 0.5f sinf hue ; float g = 0.5f + 0.5f sinf hue + 2.094f ; float b = 0.5f + 0.5f sinf hue + 4.188f ; float sc = 0.3f + rng float 0.7f; float rsx = rng float - 0.5f ROTATION SPEED 2.0f; float rsy = rng float - 0.5f ROTATION SPEED 2.0f; float rsz = rng float - 0.5f ROTATION SPEED 2.0f; float mfx = 0.5f + rng float MOVEMENT SPEED; float mfy = 0.5f + rng float MOVEMENT SPEED; float mfz = 0.5f + rng float MOVEMENT SPEED; float s = &cpu static i STATIC FLOATS ; s 0 =px; s 1 =py; s 2 =pz; s 3 =rsx; s 4 =rsy; s 5 =rsz; s 6 =mfx; s 7 =mfy; s 8 =mfz; s 9 =sc; color data i 3+0 =r; color data i 3+1 =g; color data i 3+2 =b; } return true; } // THE CPU WORKLOAD: build every cube's column-major model matrix for time t . // This is exactly what the GPU vertex shader used to do per vertex; here it runs // once per cube on the CPU each frame. 9 transcendentals + ~30 flops per cube. static void compute models float elapsed { const float t = elapsed 5.0f; pragma omp parallel for schedule static num threads g num threads for int i = 0; i < NUM CUBES; i++ { const float s = &cpu static i STATIC FLOATS ; float apx = s 0 + OSC AMPLITUDE fast sinf s 6 t ; float apy = s 1 + OSC AMPLITUDE fast sinf s 7 t ; float apz = s 2 + OSC AMPLITUDE fast sinf s 8 t ; float ax = s 3 t, ay = s 4 t, az = s 5 t; float scale = s 9 ; float cx = fast cosf ax , sx = fast sinf ax ; float cy = fast cosf ay , sy = fast sinf ay ; float cz = fast cosf az , sz = fast sinf az ; float m = &model data i MODEL FLOATS ; m 0 = scale cy cz ; m 1 = scale sx sy cz + cx sz ; m 2 = scale -cx sy cz + sx sz ; m 3 = 0.0f; m 4 = scale -cy sz ; m 5 = scale -sx sy sz + cx cz ; m 6 = scale cx sy sz + sx cz ; m 7 = 0.0f; m 8 = scale sy ; m 9 = scale -sx cy ; m 10 = scale cx cy ; m 11 = 0.0f; m 12 = apx; m 13 = apy; m 14 = apz; m 15 = 1.0f; } } // ── shader compilation ──────────────────────────────────────────────────────── static GLuint compile shader const char src, GLenum kind, const char label { GLuint id = glCreateShader kind ; glShaderSource id, 1, &src, NULL ; glCompileShader id ; GLint ok = 0; glGetShaderiv id, GL COMPILE STATUS, &ok ; if ok { char log 1024 ; GLsizei len = 0; glGetShaderInfoLog id, sizeof log , &len, log ; fprintf stderr, "ERROR::SHADER::%s::COMPILATION FAILED\n%. s\n", label, int len, log ; glDeleteShader id ; return 0; } return id; } static GLuint build program const char vs src, const char fs src { GLuint vs = compile shader vs src, GL VERTEX SHADER, "VERTEX" ; if vs return 0; GLuint fs = compile shader fs src, GL FRAGMENT SHADER, "FRAGMENT" ; if fs { glDeleteShader vs ; return 0; } GLuint program = glCreateProgram ; glAttachShader program, vs ; glAttachShader program, fs ; glLinkProgram program ; glDeleteShader vs ; glDeleteShader fs ; GLint ok = 0; glGetProgramiv program, GL LINK STATUS, &ok ; if ok { char log 1024 ; GLsizei len = 0; glGetProgramInfoLog program, sizeof log , &len, log ; fprintf stderr, "ERROR::SHADER::PROGRAM::LINKING FAILED\n%. s\n", int len, log ; glDeleteProgram program ; return 0; } return program; } // ── shared state for GLFW callbacks ────────────────────────────────────────── static FlyCam cam = { .pos = {0.0f, 50.0f, 150.0f}, .yaw = 0.0f, .pitch = 0.0f, .move speed = 50.0f, .sensitivity = 0.002f, .fov rad = float M PI 70.0 / 180.0 , }; static bool first mouse = true; static double last mouse x = 0.0, last mouse y = 0.0; static bool mouse captured = true, vsync on = false, show stats = true; static int fb w = WINDOW W, fb h = WINDOW H; static void framebuffer size callback GLFWwindow window, int w, int h { void window; fb w = w; fb h = h; glViewport 0, 0, w, h ; } static void cursor pos callback GLFWwindow window, double xpos, double ypos { void window; if first mouse { last mouse x = xpos; last mouse y = ypos; first mouse = false; return; } if mouse captured return; float xoff = float xpos - last mouse x , yoff = float ypos - last mouse y ; last mouse x = xpos; last mouse y = ypos; cam.yaw += xoff cam.sensitivity; cam.pitch -= yoff cam.sensitivity; float max pitch = float M PI / 2.0 - 0.1f; if cam.pitch max pitch cam.pitch = max pitch; if cam.pitch < -max pitch cam.pitch = -max pitch; } static void key callback GLFWwindow window, int key, int scancode, int action, int mods { void scancode; void mods; if action = GLFW PRESS return; switch key { case GLFW KEY ESCAPE: glfwSetWindowShouldClose window, GLFW TRUE ; break; case GLFW KEY TAB: mouse captured = mouse captured; if mouse captured { glfwSetInputMode window, GLFW CURSOR, GLFW CURSOR DISABLED ; first mouse = true; } else glfwSetInputMode window, GLFW CURSOR, GLFW CURSOR NORMAL ; break; case GLFW KEY T: g thread idx = g thread idx + 1 % g thread count n; g num threads = g thread counts g thread idx ; printf "Worker threads: %d\n", g num threads ; break; case GLFW KEY V: vsync on = vsync on; glfwSwapInterval vsync on ? 1 : 0 ; printf "VSync: %s\n", vsync on ? "ON" : "OFF" ; break; case GLFW KEY F1: show stats = show stats; break; default: break; } } // ── main ───────────────────────────────────────────────────────────────────── int main void { if glfwInit == GLFW FALSE { fprintf stderr, "GLFW init failed\n" ; return 1; } glfwDefaultWindowHints ; glfwWindowHint GLFW CONTEXT VERSION MAJOR, 3 ; glfwWindowHint GLFW CONTEXT VERSION MINOR, 3 ; glfwWindowHint GLFW OPENGL PROFILE, GLFW OPENGL CORE PROFILE ; glfwWindowHint GLFW OPENGL FORWARD COMPAT, GLFW TRUE ; GLFWwindow window = glfwCreateWindow WINDOW W, WINDOW H, "CPU Stress Test - 100K Cubes C ", NULL, NULL ; if window { fprintf stderr, "Window creation failed\n" ; glfwTerminate ; return 1; } glfwMakeContextCurrent window ; glfwSwapInterval 0 ; glfwGetFramebufferSize window, &fb w, &fb h ; glfwSetFramebufferSizeCallback window, framebuffer size callback ; glfwSetCursorPosCallback window, cursor pos callback ; glfwSetKeyCallback window, key callback ; glfwSetInputMode window, GLFW CURSOR, GLFW CURSOR DISABLED ; // build the worker-thread cycle: 1,2,4,8,... clamped to the core count int max threads = omp get max threads ; for int v = 1; v <= max threads && g thread count n < 15; v = 2 g thread counts g thread count n++ = v; if g thread count n == 0 || g thread counts g thread count n - 1 = max threads g thread counts g thread count n++ = max threads; g thread idx = 0; g num threads = g thread counts 0 ; printf "\nCPU STRESS TEST - %d CPU-TRANSFORMED CUBES C + OpenMP, poly sin/cos \n" "OpenGL: %s\nGPU: %s\n" "Cores omp get max threads : %d | press T to cycle worker threads\n" "WASD/Space/Shift fly | Mouse look | Tab cursor | T threads | V vsync | F1 stats | Esc quit\n\n", NUM CUBES, glGetString GL VERSION , glGetString GL RENDERER , max threads ; glEnable GL DEPTH TEST ; glDepthFunc GL LESS ; glEnable GL CULL FACE ; glCullFace GL BACK ; GLuint shader = build program vertex shader src, fragment shader src ; if shader { glfwTerminate ; return 1; } if generate instance data { fprintf stderr, "Out of memory generating instance data\n" ; glfwTerminate ; return 1; } printf "Generated %d cubes streaming %.2f MB/frame of model matrices \n", NUM CUBES, double size t NUM CUBES MODEL FLOATS sizeof float / 1024.0 1024.0 ; printf "Framebuffer: %dx%d\n", fb w, fb h ; GLuint vao = 0, vbo cube = 0, vbo model = 0, vbo color = 0; glGenVertexArrays 1, &vao ; glGenBuffers 1, &vbo cube ; glGenBuffers 1, &vbo model ; glGenBuffers 1, &vbo color ; const GLsizei stride cube = 6 sizeof float ; const GLsizei stride model = MODEL FLOATS sizeof float ; glBindVertexArray vao ; // per-vertex cube geometry glBindBuffer GL ARRAY BUFFER, vbo cube ; glBufferData GL ARRAY BUFFER, sizeof cube vertices , cube vertices, GL STATIC DRAW ; glVertexAttribPointer 0, 3, GL FLOAT, GL FALSE, stride cube, void 0 ; glEnableVertexAttribArray 0 ; glVertexAttribPointer 1, 3, GL FLOAT, GL FALSE, stride cube, void 3 sizeof float ; glEnableVertexAttribArray 1 ; // per-instance model matrix mat4 = 4 vec4 attributes, locations 3..6 , streamed glBindBuffer GL ARRAY BUFFER, vbo model ; glBufferData GL ARRAY BUFFER, GLsizeiptr size t NUM CUBES MODEL FLOATS sizeof float , NULL, GL STREAM DRAW ; for int col = 0; col < 4; col++ { GLuint loc = 3 + col; glVertexAttribPointer loc, 4, GL FLOAT, GL FALSE, stride model, void size t col 4 sizeof float ; glEnableVertexAttribArray loc ; glVertexAttribDivisor loc, 1 ; } // per-instance color location 7 , static glBindBuffer GL ARRAY BUFFER, vbo color ; glBufferData GL ARRAY BUFFER, GLsizeiptr size t NUM CUBES 3 sizeof float , color data, GL STATIC DRAW ; glVertexAttribPointer 7, 3, GL FLOAT, GL FALSE, 3 sizeof float , void 0 ; glEnableVertexAttribArray 7 ; glVertexAttribDivisor 7, 1 ; glBindVertexArray 0 ; glBindBuffer GL ARRAY BUFFER, 0 ; GLint u view = glGetUniformLocation shader, "uView" ; GLint u proj = glGetUniformLocation shader, "uProjection" ; GLint u light = glGetUniformLocation shader, "uLightDir" ; GLint u viewpos = glGetUniformLocation shader, "uViewPos" ; FpsCounter fps; fps init &fps ; double elapsed = 0.0, title timer = 0.0; double cpu ms accum = 0.0; int cpu ms frames = 0; double cpu ms avg = 0.0; char title buf 256 ; while glfwWindowShouldClose window == GLFW FALSE { float dt = fps update &fps ; elapsed += dt; title timer += dt; flycam process keyboard &cam, window, dt ; // ── the CPU pass timed ────────────────────────────────────────────── double c0 = glfwGetTime ; compute models float elapsed ; double c1 = glfwGetTime ; cpu ms accum += c1 - c0 1000.0; cpu ms frames++; glClearColor 0.02f, 0.02f, 0.05f, 1.0f ; glClear GL COLOR BUFFER BIT | GL DEPTH BUFFER BIT ; float aspect = float fb w / float fb h 1 ? fb h : 1 ; mat4 view, proj; flycam view matrix &cam, view ; mat4 perspective cam.fov rad, aspect, NEAR PLANE, FAR PLANE, proj ; // stream the freshly-computed matrices orphan + upload glBindBuffer GL ARRAY BUFFER, vbo model ; glBufferData GL ARRAY BUFFER, GLsizeiptr size t NUM CUBES MODEL FLOATS sizeof float , model data, GL STREAM DRAW ; glUseProgram shader ; glUniformMatrix4fv u view, 1, GL FALSE, view ; glUniformMatrix4fv u proj, 1, GL FALSE, proj ; glUniform3f u light, 0.5f, 0.8f, 0.3f ; glUniform3f u viewpos, cam.pos 0 , cam.pos 1 , cam.pos 2 ; glBindVertexArray vao ; glDrawArraysInstanced GL TRIANGLES, 0, 36, NUM CUBES ; glBindVertexArray 0 ; if title timer = 0.5 { title timer = 0.0; cpu ms avg = cpu ms frames 0 ? cpu ms accum / cpu ms frames : 0.0; cpu ms accum = 0.0; cpu ms frames = 0; if show stats { snprintf title buf, sizeof title buf , "CPU Stress C+OpenMP, %d thread%s | %d cubes | CPU: %.2f ms | FPS: %.1f | %.2f ms | Mem: %zu/%zu MB | VSync: %s", g num threads, g num threads == 1 ? "" : "s", NUM CUBES, cpu ms avg, fps.fps, fps.frame ms, process rss mb , physical ram mb , vsync on ? "ON" : "OFF" ; if getenv "STRESS STATS" { puts title buf ; fflush stdout ; } // STRESS STATS=1 - stats on stdout too glfwSetWindowTitle window, title buf ; } else { glfwSetWindowTitle window, "CPU Stress Test - 100K Cubes" ; } } glfwSwapBuffers window ; glfwPollEvents ; } glDeleteVertexArrays 1, &vao ; glDeleteBuffers 1, &vbo cube ; glDeleteBuffers 1, &vbo model ; glDeleteBuffers 1, &vbo color ; glDeleteProgram shader ; free cpu static ; free model data ; free color data ; glfwTerminate ; return 0; }