Clojure is almost as fast as C (with some help) A developer ported a C stress test to Clojure and achieved near-parity performance (0.86 ms vs 0.70 ms per frame) by using the JVM Vector API, static final species references, macros, and JDK 25, demonstrating that Clojure can match C speed with careful optimization. I have a stress test written in C: 100,000 cubes flying around in space. The CPU rebuilds every cube’s 4x4 transform matrix on every frame and sends all of them to the GPU. That is around 900,000 sine evaluations and 6 MB of matrix data per frame, and after that the GPU still has to draw 3.6 million triangles. So the frame is half CPU work, half GPU work. I ported it to Clojure and wanted to see how close I could get to the C version’s FPS. I should say up front that I did not do the optimization work alone: I paired with Claude Code on it, and most of the digging in this post the benchmarks, the JIT logs, the failed attempts comes from that session. I did not expect much. The C version is built with clang at -O2 , and at that level clang auto-vectorizes the transform loop with NEON SIMD instructions without telling you anything. So when you benchmark a language against C, you are not really competing with the loop in the source file. You are competing with whatever the optimizer turned it into. The first measurements confirmed this. C computes all 100K matrices in 0.70 ms on a single thread. My best scalar Clojure loop - primitive arrays, type hints, unchecked math, every trick I knew - took 2.6 ms. Almost four times slower, and I had nothing left to try. HotSpot does not auto-vectorize a loop like this. Clang does. That is the entire gap. The JVM does have an answer though: the Vector API from Project Panama. Instead of hoping the JIT vectorizes your loop, you write the SIMD operations yourself, and since it is just a Java API, it works from Clojure too. My first attempt with the Vector API was a disaster. 7.7 ms. Slower than the scalar loop, ten times behind C. The code was correct, I could even see the vector intrinsics being compiled in the JIT logs, so for a while I was staring at it with no idea what was wrong. The problem is that the API only becomes fast when the JIT can treat the “species” the descriptor that says how wide your vectors are as a compile-time constant. I had stored it in a Clojure var. A var is a field lookup, the JIT cannot fold it, and every single vector operation silently fell back to a slow path that allocates objects. One indirection, 10x. Nothing warns you about this. Three things fixed it: - Reference the species as a static final field FloatVector/SPECIES 128 at every call site, so the JIT sees a constant. - Write the helper math as macros instead of functions, so the whole kernel ends up in one inlinable body. - Use a recent JDK. On JDK 21 the shuffle operations I use for transposing matrices in registers do not compile to the right NEON instructions, on JDK 25 they do. That upgrade alone took the pass from 2.5 ms to about 1 ms. Same code. After adding fused multiply-adds on top clang was already doing that for the C side , the Clojure pass landed at 0.86 ms against C’s 0.70 ms, single-threaded. When I run both apps side by side they average around 370 FPS on my M3 MacBook. At this point neither version is limited by the CPU pass anymore, the GPU is the bottleneck for both, which is what parity means for this test. The other thing I kept an eye on was garbage, because a hot loop that allocates will eventually stutter, and 0.86 ms means nothing if the GC interrupts you every second. The final kernel reads from plain float arrays, keeps everything in SIMD registers, and writes into one preallocated float array that goes directly to OpenGL. There is nothing for the collector to do. The heap sits flat at about 134 MB the whole time. For comparison, the broken first attempt was producing roughly 7.5 GB per second of temporary vector objects. Same algorithm, same API. The entire difference is whether the JIT gets to do its job. Nobody would call this idiomatic Clojure. There is no immutability in the hot path, no laziness, no sequences. It reads like C with parentheses. I am fine with that. You write normal Clojure for the 99% of the program where performance does not matter, and for the one loop where it does, the language lets you go this low without leaving it. The real thanks goes to the JVM developers. Project Panama’s Vector API is what made this possible: explicit SIMD from a dynamic language, landing within 20% of clang’s auto-vectorized output. Ten years ago my answer to this problem would have been writing the kernel in C and calling it through JNI. I am glad I do not have to do that anymore. Clojure code: ns cpu-stress-test :import org.lwjgl BufferUtils org.lwjgl.glfw GLFW GLFWErrorCallback Callbacks GLFWKeyCallbackI GLFWCursorPosCallbackI GLFWFramebufferSizeCallbackI org.lwjgl.opengl GL GL11 GL15 GL20 GL30 GL31 GL33 org.lwjgl.system MemoryUtil java.nio FloatBuffer java.util.concurrent Executors ExecutorService Callable jdk.incubator.vector FloatVector VectorShuffle :gen-class set warn-on-reflection true set unchecked-math :warn-on-boxed ;; ── configuration ─────────────────────────────────────────────────────────── def ^:const NUM-CUBES 100000 def ^:const CUBE-SPREAD 200.0 def ^:const ROTATION-SPEED 0.4 def ^:const INST-MOVE-SPEED 0.5 def ^:const OSC-AMPLITUDE 3.0 def ^:const WINDOW-W 1920 def ^:const WINDOW-H 1080 def ^:const NEAR-PLANE 0.1 def ^:const FAR-PLANE 500.0 def ^:const FPS-SAMPLE-COUNT 120 def ^:const STATIC-FLOATS 10 ; per cube: bp.xyz, rs.xyz, mf.xyz, scale def ^:const MODEL-FLOATS 16 ; per cube: column-major mat4 def ^:const TAU 6.28318530717958647692 def ^:const CAM-MOVE-SPEED 50.0 def ^:const CAM-SENSITIVITY 0.002 def CAM-FOV-RAD Math/PI / 70.0 180.0 def ncores .availableProcessors Runtime/getRuntime ;; thread counts to cycle through with T: 1,2,4,8,... clamped to core count def thread-counts vec sort distinct filter <= ^long % ^long ncores 1 2 4 8 16 long ncores ;; ── shaders ────────────────────────────────────────────────────────────────── def vertex-shader-src " version 330 core layout location = 0 in vec3 aPos; layout location = 1 in vec3 aNormal; layout location = 3 in vec4 aModel0; layout location = 4 in vec4 aModel1; layout location = 5 in vec4 aModel2; layout location = 6 in vec4 aModel3; layout location = 7 in vec3 aColor; uniform mat4 uView; uniform mat4 uProjection; out vec3 vColor; out vec3 vNormal; out vec3 vFragPos; void main { mat4 model = mat4 aModel0, aModel1, aModel2, aModel3 ; vec4 worldPos = model vec4 aPos, 1.0 ; gl Position = uProjection uView worldPos; vColor = aColor; vNormal = mat3 model aNormal; vFragPos = worldPos.xyz; } " def fragment-shader-src " version 330 core in vec3 vColor; in vec3 vNormal; in vec3 vFragPos; out vec4 FragColor; uniform vec3 uLightDir; uniform vec3 uViewPos; void main { vec3 norm = normalize vNormal ; vec3 lightDir = normalize uLightDir ; vec3 ambient = 0.15 vColor; float diff = max dot norm, lightDir , 0.0 ; vec3 diffuse = diff vColor; vec3 viewDir = normalize uViewPos - vFragPos ; vec3 halfDir = normalize lightDir + viewDir ; float spec = pow max dot norm, halfDir , 0.0 , 32.0 ; vec3 specular = 0.3 spec vec3 1.0 ; float dist = length vFragPos - uViewPos ; float fog = exp -dist 0.008 ; fog = clamp fog, 0.0, 1.0 ; vec3 result = ambient + diffuse + specular; result = mix vec3 0.02, 0.02, 0.05 , result, fog ; FragColor = vec4 result, 1.0 ; } " ;; ── cube geometry pos.xyz, normal.xyz ───────────────────────────────────── def cube-vertices float-array -0.5 -0.5 0.5 0 0 1 0.5 -0.5 0.5 0 0 1 0.5 0.5 0.5 0 0 1 0.5 0.5 0.5 0 0 1 -0.5 0.5 0.5 0 0 1 -0.5 -0.5 0.5 0 0 1 -0.5 -0.5 -0.5 0 0 -1 -0.5 0.5 -0.5 0 0 -1 0.5 0.5 -0.5 0 0 -1 0.5 0.5 -0.5 0 0 -1 0.5 -0.5 -0.5 0 0 -1 -0.5 -0.5 -0.5 0 0 -1 -0.5 0.5 -0.5 0 1 0 -0.5 0.5 0.5 0 1 0 0.5 0.5 0.5 0 1 0 0.5 0.5 0.5 0 1 0 0.5 0.5 -0.5 0 1 0 -0.5 0.5 -0.5 0 1 0 -0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 0.5 0 -1 0 0.5 -0.5 0.5 0 -1 0 -0.5 -0.5 0.5 0 -1 0 -0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 -0.5 1 0 0 0.5 0.5 -0.5 1 0 0 0.5 0.5 0.5 1 0 0 0.5 0.5 0.5 1 0 0 0.5 -0.5 0.5 1 0 0 0.5 -0.5 -0.5 1 0 0 -0.5 -0.5 -0.5 -1 0 0 -0.5 -0.5 0.5 -1 0 0 -0.5 0.5 0.5 -1 0 0 -0.5 0.5 0.5 -1 0 0 -0.5 0.5 -0.5 -1 0 0 -0.5 -0.5 -0.5 -1 0 0 ;; ── deterministic PRNG SplitMix64, seed 42 ───────────────────────────────── def ^:const SM-GAMMA unchecked-long Long/parseUnsignedLong "9E3779B97F4A7C15" 16 def ^:const SM-MIX1 unchecked-long Long/parseUnsignedLong "BF58476D1CE4E5B9" 16 def ^:const SM-MIX2 unchecked-long Long/parseUnsignedLong "94D049BB133111EB" 16 def ^:const INV-2P24 / 1.0 16777216.0 ;; Returns ^double, not ^float: Clojure has no primitive float fn return, so a ;; ^float hint would box every result and poison downstream arithmetic. The cloud ;; needn't be bit-identical to C the integer RNG state stream is , so double is fine. defn sm-mix ^double ^long s let z unchecked-multiply bit-xor s unsigned-bit-shift-right s 30 SM-MIX1 z unchecked-multiply bit-xor z unsigned-bit-shift-right z 27 SM-MIX2 z bit-xor z unsigned-bit-shift-right z 31 double unsigned-bit-shift-right z 40 INV-2P24 ;; ── instance generation: static CPU inputs + static GPU color buffer ───────── defn generate-instance-data let ^floats statics float-array NUM-CUBES STATIC-FLOATS ^FloatBuffer color BufferUtils/createFloatBuffer NUM-CUBES 3 half / CUBE-SPREAD 2.0 loop i 0, st long 42 if < i NUM-CUBES let s unchecked-add st SM-GAMMA px - sm-mix s CUBE-SPREAD half s unchecked-add s SM-GAMMA py - sm-mix s CUBE-SPREAD half s unchecked-add s SM-GAMMA pz - sm-mix s CUBE-SPREAD half s unchecked-add s SM-GAMMA hue sm-mix s TAU r + 0.5 0.5 Math/sin hue g + 0.5 0.5 Math/sin + hue 2.094 b + 0.5 0.5 Math/sin + hue 4.188 s unchecked-add s SM-GAMMA sc + 0.3 sm-mix s 0.7 s unchecked-add s SM-GAMMA rsx - sm-mix s 0.5 ROTATION-SPEED 2.0 s unchecked-add s SM-GAMMA rsy - sm-mix s 0.5 ROTATION-SPEED 2.0 s unchecked-add s SM-GAMMA rsz - sm-mix s 0.5 ROTATION-SPEED 2.0 s unchecked-add s SM-GAMMA mfx + 0.5 sm-mix s INST-MOVE-SPEED s unchecked-add s SM-GAMMA mfy + 0.5 sm-mix s INST-MOVE-SPEED s unchecked-add s SM-GAMMA mfz + 0.5 sm-mix s INST-MOVE-SPEED si i STATIC-FLOATS aset statics si float px aset statics + si 1 float py aset statics + si 2 float pz aset statics + si 3 float rsx aset statics + si 4 float rsy aset statics + si 5 float rsz aset statics + si 6 float mfx aset statics + si 7 float mfy aset statics + si 8 float mfz aset statics + si 9 float sc .put color float r .put color float g .put color float b recur inc i s do .flip color ;; transpose AoS - SoA one-time so the SIMD path can do contiguous ;; per-field vector loads: soa = bpx bpy bpz rsx ..rsz mfx ..mfz scale let soa object-array STATIC-FLOATS dotimes k STATIC-FLOATS aset soa int k float-array NUM-CUBES dotimes c NUM-CUBES let cs c STATIC-FLOATS dotimes k STATIC-FLOATS aset ^floats aget soa int k int c aget statics int + cs k statics soa color ;; ── shared polynomial sin/cos identical algorithm to the C version ───────── ;; Replaces Math/sin|cos fdlibm — not a HW intrinsic on Apple Silicon so the ;; C-vs-Clojure comparison isolates language/JIT instead of measuring two ;; different transcendental implementations. Range-reduce to -pi, pi , parabola ;; + one refinement. ^double on the arg vector ⇒ true primitive in/out, no boxing. def ^:const INV-TWO-PI 0.15915494309189535 def ^:const TWO-PI 6.283185307179586 def ^:const HALF-PI 1.5707963267948966 def ^:const B-SIN 1.2732395447351628 ; 4/pi def ^:const C-SIN -0.40528473456935109 ; -4/pi^2 def ^:const P-SIN 0.225 defn fast-sin ^double ^double x let k double long + x INV-TWO-PI if = x 0.0 0.5 -0.5 ; round to nearest r - x k TWO-PI ; r in -pi, pi ar Math/abs r y + B-SIN r C-SIN r ar ay Math/abs y + P-SIN - y ay y y ;; ── SIMD: the same polynomial sin on a FloatVector JDK Panama Vector API ─── ;; Lane-parallel version of fast-sin. Range reduction uses the float "magic ;; number" trick: v + 1.5 2^23 - 1.5 2^23 == round-to-nearest, all add/sub, so ;; it lowers to NEON — no per-lane libm and no copysign branch. ;; ;; CRITICAL for performance: the Vector API only compiles to actual SIMD when ;; C2 can constant-fold the species. A species stored in a Clojure Var is an ;; opaque field load, which silently drops every op onto the boxed Java fallback ;; path ~10x slower than scalar . So the species is referenced as the static ;; final field FloatVector/SPECIES 128 at every call site NEON = 128-bit = 4 ;; float lanes — this project is pinned to Apple Silicon , and the sin polynomial ;; is a macro, not a fn, so the whole kernel is one inlinable method body. def ^:const LANES 4 ; SPECIES 128 float lanes def ^:const MAGIC 12582912.0 ; 1.5 2^23 defmacro vbc "Broadcast a compile-time constant to a FloatVector. Inside a loop C2 hoists the resulting Replicate node out as loop-invariant, so this costs nothing." c FloatVector/broadcast FloatVector/SPECIES 128 float ~c defmacro vsin "Inline lanewise polynomial sin same algorithm as fast-sin on a FloatVector. Uses fma like the clang -O2 build of the C kernel fp-contract is on there ." x let x ~x v .mul x float INV-TWO-PI k .sub .add v float MAGIC float MAGIC ; round-to-nearest, lanewise r .fma k vbc - TWO-PI x ; r = x - k 2pi, in -pi, pi ar .abs r y .mul r .fma ar vbc C-SIN vbc B-SIN ; y = B r + C r |r| ay .abs y .fma .fma y ay .neg y vbc P-SIN y ; P y |y| - y + y ;; ── THE CPU WORKLOAD ───────────────────────────────────────────────────────── ;; Build column-major model matrices for cubes start, end at time t, writing ;; into the heap float model handed to glBufferData via LWJGL's array ;; overload — a pinned zero-copy JNI pass, symmetric with C handing malloc'd ;; memory to GL . Params are left unhinted a fn taking 4 primitives is illegal ;; and re-bound to primitives inside, so the body is fully primitive with no boxing. defn compute-range model statics start end t let ^floats model model ^floats statics statics end long end t double t osc double OSC-AMPLITUDE loop i long start when < i end let si i STATIC-FLOATS apx + aget statics si osc fast-sin aget statics + si 6 t apy + aget statics + si 1 osc fast-sin aget statics + si 7 t apz + aget statics + si 2 osc fast-sin aget statics + si 8 t ax aget statics + si 3 t ay aget statics + si 4 t az aget statics + si 5 t scale double aget statics + si 9 cx fast-sin + ax HALF-PI sx fast-sin ax cy fast-sin + ay HALF-PI sy fast-sin ay cz fast-sin + az HALF-PI sz fast-sin az base i MODEL-FLOATS aset model base float scale cy cz aset model + base 1 float scale + sx sy cz cx sz aset model + base 2 float scale + - cx sy cz sx sz aset model + base 3 float 0.0 aset model + base 4 float scale - cy sz aset model + base 5 float scale + - sx sy sz cx cz aset model + base 6 float scale + cx sy sz sx cz aset model + base 7 float 0.0 aset model + base 8 float scale sy aset model + base 9 float scale - sx cy aset model + base 10 float scale cx cy aset model + base 11 float 0.0 aset model + base 12 float apx aset model + base 13 float apy aset model + base 14 float apz aset model + base 15 float 1.0 recur inc i ;; SIMD variant: same math, LANES cubes at a time, reading SoA arrays with ;; contiguous vector loads. The 16 matrix elements are computed as 16 FloatVectors ;; lane j = cube i+j , then transposed in registers — four 4x4 transposes built ;; from two-vector rearranges NEON tbl — so each cube's 16 floats are written ;; with 4 contiguous vector stores. No scratch buffer, no scalar scatter. ;; Requires start,end to be LANES-aligned guaranteed by run-pass ; a scalar ;; tail covers any remainder. ;; ;; transpose4 turns column vectors A,B,C,W lane j = cube j into row vectors ;; T j = A j, B j, C j, W j and stores them at model i+j 16 + off : ;; ab-lo = A0 B0 A1 B1 cw-lo = C0 W0 C1 W1 - T0 = A0 B0 C0 W0 ;; ab-hi = A2 B2 A3 B3 cw-hi = C2 W2 C3 W3 T1 = A1 B1 C1 W1 ... ;; In a two-vector rearrange, shuffle index k in 0,4 picks this k and the ;; "exceptional" index k-4 negative picks other k . defmacro ^:private transpose4-store model i off A B C W zip-lo zip-hi cat-lo cat-hi let ab-lo .rearrange ~A ~zip-lo ~B ab-hi .rearrange ~A ~zip-hi ~B cw-lo .rearrange ~C ~zip-lo ~W cw-hi .rearrange ~C ~zip-hi ~W .intoArray .rearrange ab-lo ~cat-lo cw-lo ~model int + ~i 16 ~off .intoArray .rearrange ab-lo ~cat-hi cw-lo ~model int + + ~i 1 16 ~off .intoArray .rearrange ab-hi ~cat-lo cw-hi ~model int + + ~i 2 16 ~off .intoArray .rearrange ab-hi ~cat-hi cw-hi ~model int + + ~i 3 16 ~off defn compute-range-simd model soa start end t let ^floats model model ^objects soa soa start long start end long end tf float t osc float OSC-AMPLITUDE hp float HALF-PI lanes long LANES ^floats bpx aget soa 0 ^floats bpy aget soa 1 ^floats bpz aget soa 2 ^floats rsx aget soa 3 ^floats rsy aget soa 4 ^floats rsz aget soa 5 ^floats mfx aget soa 6 ^floats mfy aget soa 7 ^floats mfz aget soa 8 ^floats scl aget soa 9 zero FloatVector/zero FloatVector/SPECIES 128 one FloatVector/broadcast FloatVector/SPECIES 128 float 1.0 zip-lo VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 0 -4 1 -3 zip-hi VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 2 -2 3 -1 cat-lo VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 0 1 -4 -3 cat-hi VectorShuffle/fromValues FloatVector/SPECIES 128 int-array 2 3 -2 -1 simd-end + start quot - end start lanes lanes loop i start when < i simd-end let ii int i BPX FloatVector/fromArray FloatVector/SPECIES 128 bpx ii BPY FloatVector/fromArray FloatVector/SPECIES 128 bpy ii BPZ FloatVector/fromArray FloatVector/SPECIES 128 bpz ii RSX FloatVector/fromArray FloatVector/SPECIES 128 rsx ii RSY FloatVector/fromArray FloatVector/SPECIES 128 rsy ii RSZ FloatVector/fromArray FloatVector/SPECIES 128 rsz ii MFX FloatVector/fromArray FloatVector/SPECIES 128 mfx ii MFY FloatVector/fromArray FloatVector/SPECIES 128 mfy ii MFZ FloatVector/fromArray FloatVector/SPECIES 128 mfz ii SCALE FloatVector/fromArray FloatVector/SPECIES 128 scl ii APX .add BPX .mul vsin .mul MFX tf osc APY .add BPY .mul vsin .mul MFY tf osc APZ .add BPZ .mul vsin .mul MFZ tf osc AX .mul RSX tf AY .mul RSY tf AZ .mul RSZ tf CX vsin .add AX hp SX vsin AX CY vsin .add AY hp SY vsin AY CZ vsin .add AZ hp SZ vsin AZ SXSY .mul SX SY CXSY .mul CX SY M0 .mul SCALE .mul CY CZ M1 .mul SCALE .fma SXSY CZ .mul CX SZ M2 .mul SCALE .fma .neg CXSY CZ .mul SX SZ M4 .mul SCALE .neg .mul CY SZ M5 .mul SCALE .fma .neg SXSY SZ .mul CX CZ M6 .mul SCALE .fma CXSY SZ .mul SX CZ M8 .mul SCALE SY M9 .mul SCALE .neg .mul SX CY M10 .mul SCALE .mul CX CY transpose4-store model ii 0 M0 M1 M2 zero zip-lo zip-hi cat-lo cat-hi transpose4-store model ii 4 M4 M5 M6 zero zip-lo zip-hi cat-lo cat-hi transpose4-store model ii 8 M8 M9 M10 zero zip-lo zip-hi cat-lo cat-hi transpose4-store model ii 12 APX APY APZ one zip-lo zip-hi cat-lo cat-hi recur + i lanes ;; scalar tail only runs if the range wasn't LANES-aligned loop i simd-end when < i end let ix int i td double tf od double osc apx + aget bpx ix od fast-sin aget mfx ix td apy + aget bpy ix od fast-sin aget mfy ix td apz + aget bpz ix od fast-sin aget mfz ix td ax aget rsx ix td ay aget rsy ix td az aget rsz ix td sc double aget scl ix cx fast-sin + ax HALF-PI sx fast-sin ax cy fast-sin + ay HALF-PI sy fast-sin ay cz fast-sin + az HALF-PI sz fast-sin az base i MODEL-FLOATS aset model base float sc cy cz aset model + base 1 float sc + sx sy cz cx sz aset model + base 2 float sc + - cx sy cz sx sz aset model + base 3 float 0.0 aset model + base 4 float sc - cy sz aset model + base 5 float sc + - sx sy sz cx cz aset model + base 6 float sc + cx sy sz sx cz aset model + base 7 float 0.0 aset model + base 8 float sc sy aset model + base 9 float sc - sx cy aset model + base 10 float sc cx cy aset model + base 11 float 0.0 aset model + base 12 float apx aset model + base 13 float apy aset model + base 14 float apz aset model + base 15 float 1.0 recur inc i ;; Run the transform pass for the whole cloud, fanned across nt threads, in ;; either scalar AoS statics or SIMD SoA soa mode. The calling thread ;; computes chunk 0 itself while the pool runs chunks 1..nt-1 cheaper than ;; parking the caller in invokeAll — closer to OpenMP's fork-join cost . defn run-pass ^ExecutorService pool model statics soa nt t simd? let n long NUM-CUBES nt long nt lanes long LANES run fn s e if simd? compute-range-simd model soa s e t compute-range model statics s e t if == nt 1 run 0 n let base quot + n dec nt nt ; ceil n / nt chunk long if simd? quot + base dec lanes lanes lanes ; round up to LANES base futs mapv fn c let start long c chunk e + start chunk end if < e n e n .submit pool reify Callable call when < start n run start end nil range 1 nt run 0 min chunk n run fn ^java.util.concurrent.Future f .get f futs nil ;; ── minimal column-major math view + projection only ─────────────────────── defn perspective ^FloatBuffer b fovy aspect near far let fovy double fovy aspect double aspect near double near far double far t Math/tan / fovy 2.0 dotimes i 16 .put b int i float 0 doto b .put 0 float / 1.0 aspect t .put 5 float / 1.0 t .put 10 float / - + far near - far near .put 11 float -1.0 .put 14 float / - 2.0 far near - far near defn look-at ^FloatBuffer b ex ey ez cx cy cz ux uy uz let ex double ex ey double ey ez double ez cx double cx cy double cy cz double cz ux double ux uy double uy uz double uz dx - cx ex dy - cy ey dz - cz ez dl Math/sqrt + dx dx dy dy dz dz fx / dx dl fy / dy dl fz / dz dl ax - fy uz fz uy ay - fz ux fx uz az - fx uy fy ux al Math/sqrt + ax ax ay ay az az sx / ax al sy / ay al sz / az al vx - sy fz sz fy vy - sz fx sx fz vz - sx fy sy fx doto b .put 0 float sx .put 1 float vx .put 2 float - fx .put 3 float 0 .put 4 float sy .put 5 float vy .put 6 float - fy .put 7 float 0 .put 8 float sz .put 9 float vz .put 10 float - fz .put 11 float 0 .put 12 float - + sx ex sy ey sz ez .put 13 float - + vx ex vy ey vz ez .put 14 float + fx ex fy ey fz ez .put 15 float 1 ;; ── shader compilation ──────────────────────────────────────────────────────── defn compile-shader ^CharSequence src ^long kind ^String label let id GL20/glCreateShader int kind GL20/glShaderSource id src GL20/glCompileShader id when zero? GL20/glGetShaderi id GL20/GL COMPILE STATUS binding out err println str "ERROR::SHADER::" label "::COMPILATION FAILED" println GL20/glGetShaderInfoLog id throw ex-info "shader compilation failed" {:label label} id defn build-program vs-src fs-src let vs compile-shader vs-src GL20/GL VERTEX SHADER "VERTEX" fs compile-shader fs-src GL20/GL FRAGMENT SHADER "FRAGMENT" program GL20/glCreateProgram GL20/glAttachShader program vs GL20/glAttachShader program fs GL20/glLinkProgram program GL20/glDeleteShader vs GL20/glDeleteShader fs when zero? GL20/glGetProgrami program GL20/GL LINK STATUS binding out err println "ERROR::SHADER::PROGRAM::LINKING FAILED" println GL20/glGetProgramInfoLog program throw ex-info "program linking failed" {} program ;; ── shared state for GLFW callbacks ────────────────────────────────────────── def cam-pos atom 0.0 50.0 150.0 def cam-yaw atom 0.0 def cam-pitch atom 0.0 def first-mouse atom true def last-mouse atom 0.0 0.0 def mouse-captured atom true def vsync-on atom false def show-stats atom true def fb-size atom WINDOW-W WINDOW-H def thread-idx atom 0 ; index into thread-counts def simd? atom true ; true = Panama SIMD default , false = scalar defn current-threads ^long long nth thread-counts @thread-idx defn process-keyboard ^long window ^double dt let yaw double @cam-yaw pitch double @cam-pitch speed CAM-MOVE-SPEED dt cp Math/cos pitch fx Math/sin yaw cp fz - Math/cos yaw cp len Math/sqrt + fx fx fz fz nfx if zero? len fx / fx len nfz if zero? len fz / fz len rx Math/cos yaw rz Math/sin yaw down? fn k = GLFW/glfwGetKey window int k GLFW/GLFW PRESS w? down? GLFW/GLFW KEY W s? down? GLFW/GLFW KEY S a? down? GLFW/GLFW KEY A d? down? GLFW/GLFW KEY D up? down? GLFW/GLFW KEY SPACE dn? down? GLFW/GLFW KEY LEFT SHIFT ddx + if w? nfx speed 0.0 if s? - nfx speed 0.0 if d? rx speed 0.0 if a? - rx speed 0.0 ddz + if w? nfz speed 0.0 if s? - nfz speed 0.0 if d? rz speed 0.0 if a? - rz speed 0.0 ddy + if up? speed 0.0 if dn? - speed 0.0 when or w? s? a? d? up? dn? swap cam-pos fn x y z + double x ddx + double y ddy + double z ddz defn make-key-callback reify GLFWKeyCallbackI invoke window key scancode action mods when = action GLFW/GLFW PRESS condp = key GLFW/GLFW KEY ESCAPE GLFW/glfwSetWindowShouldClose window true GLFW/GLFW KEY TAB do swap mouse-captured not if @mouse-captured do GLFW/glfwSetInputMode window GLFW/GLFW CURSOR GLFW/GLFW CURSOR DISABLED reset first-mouse true GLFW/glfwSetInputMode window GLFW/GLFW CURSOR GLFW/GLFW CURSOR NORMAL GLFW/GLFW KEY T do swap thread-idx fn i mod inc long i count thread-counts println "Worker threads:" current-threads GLFW/GLFW KEY M do swap simd? not println "Compute mode:" if @simd? "SIMD Panama " "scalar" GLFW/GLFW KEY V do swap vsync-on not GLFW/glfwSwapInterval int if @vsync-on 1 0 println "VSync:" if @vsync-on "ON" "OFF" GLFW/GLFW KEY F1 swap show-stats not nil defn make-cursor-callback reify GLFWCursorPosCallbackI invoke window xpos ypos if @first-mouse do reset last-mouse xpos ypos reset first-mouse false when @mouse-captured let lx ly @last-mouse xoff - double xpos double lx yoff - double ypos double ly max-pitch - / Math/PI 2.0 0.1 reset last-mouse xpos ypos swap cam-yaw fn y + double y xoff CAM-SENSITIVITY swap cam-pitch fn p let np - double p yoff CAM-SENSITIVITY cond np max-pitch max-pitch < np - max-pitch - max-pitch :else np defn make-fb-callback reify GLFWFramebufferSizeCallbackI invoke window w h reset fb-size w h GL11/glViewport 0 0 int w int h def log-stats? some? System/getenv "STRESS STATS" ; STRESS STATS=1 - also print stats to stdout defn update-title window threads cpu-ms fps frame-ms let window long window mode if @simd? "SIMD" "scalar" if @show-stats let rt Runtime/getRuntime used-mb quot - .totalMemory rt .freeMemory rt 1024 1024 max-mb quot .maxMemory rt 1024 1024 title format "CPU Stress Clojure, %s, %d thread%s | %d cubes | CPU: %.2f ms | FPS: %.1f | %.2f ms | Mem: %d/%d MB | VSync: %s" mode threads if == long threads 1 "" "s" NUM-CUBES cpu-ms fps frame-ms used-mb max-mb if @vsync-on "ON" "OFF" when log-stats? println title flush GLFW/glfwSetWindowTitle window title GLFW/glfwSetWindowTitle window "CPU Stress Test - 100K Cubes" ;; ── main ───────────────────────────────────────────────────────────────────── defn -main & args .set GLFWErrorCallback/createPrint System/err when-not GLFW/glfwInit binding out err println "GLFW init failed" System/exit 1 GLFW/glfwDefaultWindowHints GLFW/glfwWindowHint GLFW/GLFW CONTEXT VERSION MAJOR 3 GLFW/glfwWindowHint GLFW/GLFW CONTEXT VERSION MINOR 3 GLFW/glfwWindowHint GLFW/GLFW OPENGL PROFILE GLFW/GLFW OPENGL CORE PROFILE GLFW/glfwWindowHint GLFW/GLFW OPENGL FORWARD COMPAT GLFW/GLFW TRUE let window GLFW/glfwCreateWindow int WINDOW-W int WINDOW-H "CPU Stress Test - 100K Cubes Clojure " MemoryUtil/NULL MemoryUtil/NULL when = window MemoryUtil/NULL binding out err println "Window creation failed" GLFW/glfwTerminate System/exit 1 GLFW/glfwMakeContextCurrent window GLFW/glfwSwapInterval 0 let wbuf int-array 1 hbuf int-array 1 GLFW/glfwGetFramebufferSize window wbuf hbuf reset fb-size aget wbuf 0 aget hbuf 0 GLFW/glfwSetFramebufferSizeCallback window make-fb-callback GLFW/glfwSetCursorPosCallback window make-cursor-callback GLFW/glfwSetKeyCallback window make-key-callback GLFW/glfwSetInputMode window GLFW/GLFW CURSOR GLFW/GLFW CURSOR DISABLED GL/createCapabilities println format str "\n" "CPU STRESS TEST - %d CPU-TRANSFORMED CUBES Clojure / LWJGL \n" "OpenGL: %s\nGPU: %s\n" "Cores: %d | thread cycle T : %s | SIMD lanes: %d M to toggle Panama SIMD \n" "WASD/Space/Shift fly | Mouse look | Tab cursor | T threads | M SIMD | V vsync | F1 stats | Esc quit\n" NUM-CUBES GL11/glGetString GL11/GL VERSION GL11/glGetString GL11/GL RENDERER ncores pr-str thread-counts LANES GL11/glEnable GL11/GL DEPTH TEST GL11/glDepthFunc GL11/GL LESS GL11/glEnable GL11/GL CULL FACE GL11/glCullFace GL11/GL BACK let shader int build-program vertex-shader-src fragment-shader-src statics soa color-buf generate-instance-data model-bytes long NUM-CUBES MODEL-FLOATS 4 ^floats model float-array NUM-CUBES MODEL-FLOATS ^ExecutorService pool Executors/newFixedThreadPool ncores cube-buf doto BufferUtils/createFloatBuffer alength ^floats cube-vertices .put ^floats cube-vertices .flip stride-cube int 6 4 stride-model int MODEL-FLOATS 4 vao GL30/glGenVertexArrays vbo-cube GL15/glGenBuffers vbo-model GL15/glGenBuffers vbo-color GL15/glGenBuffers println format "Generated %d cubes streaming %.2f MB/frame of model matrices " NUM-CUBES / double model-bytes 1024.0 1024.0 let fw fh @fb-size println format "Framebuffer: %dx%d" long fw long fh GL30/glBindVertexArray vao ;; per-vertex cube geometry GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-cube GL15/glBufferData GL15/GL ARRAY BUFFER ^FloatBuffer cube-buf GL15/GL STATIC DRAW GL20/glVertexAttribPointer 0 3 GL11/GL FLOAT false stride-cube long 0 GL20/glEnableVertexAttribArray 0 GL20/glVertexAttribPointer 1 3 GL11/GL FLOAT false stride-cube long 3 4 GL20/glEnableVertexAttribArray 1 ;; per-instance model matrix mat4 = locations 3..6 , streamed each frame GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-model GL15/glBufferData GL15/GL ARRAY BUFFER model-bytes GL15/GL STREAM DRAW dotimes col 4 let loc int + 3 col GL20/glVertexAttribPointer loc 4 GL11/GL FLOAT false stride-model long col 4 4 GL20/glEnableVertexAttribArray loc GL33/glVertexAttribDivisor loc 1 ;; per-instance color location 7 , static GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-color GL15/glBufferData GL15/GL ARRAY BUFFER ^FloatBuffer color-buf GL15/GL STATIC DRAW GL20/glVertexAttribPointer 7 3 GL11/GL FLOAT false int 3 4 long 0 GL20/glEnableVertexAttribArray 7 GL33/glVertexAttribDivisor 7 1 GL30/glBindVertexArray 0 GL15/glBindBuffer GL15/GL ARRAY BUFFER 0 let u-view GL20/glGetUniformLocation shader "uView" u-proj GL20/glGetUniformLocation shader "uProjection" u-light GL20/glGetUniformLocation shader "uLightDir" u-viewpos GL20/glGetUniformLocation shader "uViewPos" view-buf BufferUtils/createFloatBuffer 16 proj-buf BufferUtils/createFloatBuffer 16 frame-times double-array FPS-SAMPLE-COUNT ;; ── render loop ────────────────────────────────────────────────────── loop last-t GLFW/glfwGetTime idx 0 cnt 0 elapsed 0.0 title-t 0.0 cpu-accum 0.0 cpu-frames 0 if GLFW/glfwWindowShouldClose window do .shutdown pool GL30/glDeleteVertexArrays vao GL15/glDeleteBuffers vbo-cube GL15/glDeleteBuffers vbo-model GL15/glDeleteBuffers vbo-color GL20/glDeleteProgram shader Callbacks/glfwFreeCallbacks window GLFW/glfwDestroyWindow window GLFW/glfwTerminate some- GLFW/glfwSetErrorCallback nil .free let now GLFW/glfwGetTime dt - now last-t aset frame-times idx dt idx' rem inc idx FPS-SAMPLE-COUNT cnt' if < cnt FPS-SAMPLE-COUNT inc cnt cnt total double loop i 0 acc 0.0 if < i cnt' recur inc i + acc aget frame-times i acc fps if total 0.0 / double cnt' total 0.0 frame-ms dt 1000.0 elapsed' + elapsed dt title-t' + title-t dt nt current-threads process-keyboard window dt ;; ── the CPU pass timed ────────────────────────────────────── let c0 GLFW/glfwGetTime run-pass pool model statics soa nt elapsed' 5.0 @simd? c1 GLFW/glfwGetTime cpu-accum' + cpu-accum - c1 c0 1000.0 cpu-frames' inc cpu-frames GL11/glClearColor float 0.02 float 0.02 float 0.05 float 1.0 GL11/glClear bit-or GL11/GL COLOR BUFFER BIT GL11/GL DEPTH BUFFER BIT let fw fh @fb-size aspect / double fw double max 1 long fh cx cy cz @cam-pos cx double cx cy double cy cz double cz yaw double @cam-yaw pitch double @cam-pitch cp Math/cos pitch fwx Math/sin yaw cp fwy Math/sin pitch fwz - Math/cos yaw cp look-at view-buf cx cy cz + cx fwx 20.0 + cy fwy 20.0 + cz fwz 20.0 0.0 1.0 0.0 perspective proj-buf CAM-FOV-RAD aspect NEAR-PLANE FAR-PLANE ;; stream the freshly-computed matrices orphan + upload GL15/glBindBuffer GL15/GL ARRAY BUFFER vbo-model GL15/glBufferData GL15/GL ARRAY BUFFER model GL15/GL STREAM DRAW GL20/glUseProgram shader GL20/glUniformMatrix4fv u-view false ^FloatBuffer view-buf GL20/glUniformMatrix4fv u-proj false ^FloatBuffer proj-buf GL20/glUniform3f u-light float 0.5 float 0.8 float 0.3 GL20/glUniform3f u-viewpos float cx float cy float cz GL30/glBindVertexArray vao GL31/glDrawArraysInstanced GL11/GL TRIANGLES 0 36 int NUM-CUBES GL30/glBindVertexArray 0 let refresh? = title-t' 0.5 when refresh? update-title window nt / cpu-accum' cpu-frames' fps frame-ms title-t'' if refresh? 0.0 title-t' cpu-accum'' if refresh? 0.0 cpu-accum' cpu-frames'' if refresh? 0 cpu-frames' GLFW/glfwSwapBuffers window GLFW/glfwPollEvents recur now idx' cnt' elapsed' title-t'' cpu-accum'' cpu-frames'' C code: define GL SILENCE DEPRECATION 1 define GLFW INCLUDE NONE 1 include