Mike Acton: Convex Primitive Collision Detection – Reference and LLM-Optimized

This repository implements the collision query from K. Tracy, T. A. Howell, and Z. Manchester, "Differentiable Collision Detection for a Set of Convex Primitives" (arXiv:2207.00669, documents/2207.00669.pdf). For a pair of convex primitives (sphere, box, capsule, or convex polytope) it computes the minimum uniform scaling α that must be applied to both shapes for them to touch (the paper's problem (10)), and the contact points from eq. (24). α < 1 means they overlap, α > 1 means they are separated.

This is a narrow-phase solver. It assumes the caller has already run a cheap broadphase and discarded pairs whose world AABBs do not overlap, so only AABB-overlapping pairs are ever queried. The committed benchmark reflects that assumption — its 1000 pairs are all AABB-overlapping (near-contact or penetrating), so the timing measures real narrow-phase work rather than the trivial rejection of far-apart shapes.
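The broadphase contract above can be sketched as a plain AABB overlap filter. This is an illustration of the assumption the solver makes about its caller, not code from the repository: the `aabb` and `index_pair` types and both function names are hypothetical, and the all-pairs loop is quadratic only to keep the sketch short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { float min[3], max[3]; } aabb;     /* hypothetical world AABB */
typedef struct { uint32_t a, b; } index_pair;      /* hypothetical pair of indices */

/* Two boxes overlap only if their intervals overlap on every axis. */
bool aabb_overlap(const aabb *p, const aabb *q)
{
    for (int i = 0; i < 3; ++i) {
        if (p->max[i] < q->min[i] || q->max[i] < p->min[i])
            return false;
    }
    return true;
}

/* Keeps only AABB-overlapping pairs; returns how many were written.
   Pairs beyond out_capacity are dropped in this sketch. */
size_t broadphase_all_pairs(const aabb *boxes, uint32_t count,
                            index_pair *out, size_t out_capacity)
{
    size_t written = 0;
    for (uint32_t i = 0; i < count; ++i) {
        for (uint32_t j = i + 1; j < count; ++j) {
            if (written < out_capacity && aabb_overlap(&boxes[i], &boxes[j])) {
                out[written].a = i;
                out[written].b = j;
                ++written;
            }
        }
    }
    return written;
}
```

Only the pairs that survive a filter like this would be handed to the narrow-phase query.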

There are two implementations here:

- src/ is a reference C implementation that follows the paper directly.
- src-optimized/ is an optimized single-precision implementation that produces the same collision flags and the same distances (within a stated tolerance) and runs the committed 1000-pair benchmark about 102× faster than the reference: reference median ≈ 0.276 s, optimized median ≈ 0.0027 s (median-of-5, single thread, on my machine: gcc 11, x86-64, WSL2).

That 102× crossed the 100× target I set for the committed benchmark. It also holds up off that benchmark: on alternate-seed inputs it measures 97.6–101.7× (four seeds), all passing correctness. I would not call it a uniform 100× — two of the four seeds land just under — so I claim "100× on the committed benchmark, ~98–102× generally," and no more. Numbers and caveats are in Results and limits.

Two reasons, equally important:

- To provide the optimized collision routines. src-optimized/ is real, tested code you can build and use, held to the reference by an independent harness.
- To show how an LLM was used to do the optimization, concretely and reproducibly. Every phase of this project was generated by a language model from an instruction document I wrote, and every result was checked by a harness that the model could not talk its way around. I want the method to be inspectable, not a story you have to take on faith.

The model under test here was GPT-5.5. This is one model, one run — a case study in how to drive an LLM at an optimization problem, not a benchmark comparing models.

I find it clearest to separate the four roles explicitly.

Role	What it did
Me (the human)	Defined the problem and the output contract. Set the 100× target. Wrote the four instruction documents. Encoded my engineering approach as operating rules fed into every conversation. Course-corrected and decided what to keep.
GPT-5.5 (the model)	Generated the reference implementation, the test harness, and the optimized solver from those documents; proposed and implemented each optimization; kept the optimization log.
The test harness	The ground truth. Compared optimized output against the reference, validated it with independent code, certified the contact points, checked determinism, and timed it. Nothing here is claimed without it.
nagent (the LLM harness)	The agent loop that ran the optimization: structured, file-based, and grounded each turn by the proof harness.

Where the 100× target came from: I read the reference code and made a judgment call about what I thought was achievable on this hardware. It is not a derived bound or a proof of a ceiling — it is an engineer's estimate, and I state it as one. It turned out to be roughly the right order of magnitude to push hard against.

My approach is itself written down, in context/data-oriented-design.md. Those operating rules (start from the real data, state the cost, remove work before doing it faster, handle the common case straight-line) were injected into every optimization conversation. So "what I contributed" is not just the target and the prompts; it is the method the model was made to follow.

The structured state and the per-turn proof are not ceremony. An LLM left to optimize on its own tends to drift: it reasons from its recollection of the last result instead of a fresh measurement, it can lose a good change that was never committed, and it can report a result it did not actually run. Keeping the working state in inspectable files, committing every kept gain immediately, and injecting the real gate-and-speedup status every turn are what turn "the model says it is faster and correct" into "measured faster, gates pass, committed." That is the difference between a demo and a result.

Each phase is an instruction document and the artifact it produced. The documents live in prompts/.

- prompts/create-reference.md → the reference (src/). A faithful C11 port of the paper: the α solve and the contact points from eq. (24), with explicit input validation. This is the correctness anchor everything else is measured against.
- prompts/create-optimized-test-harness.md → the harness. This specifies the test, comparison, and measurement scaffold and its constraints: a fixed, committed 1000-pair input; a reference-vs-optimized comparator; an independent validator that shares no code with either solver; a contact-point certifier; a determinism check; and a median-of-5 timing protocol. Crucially, the harness was built and proven against an identity copy of the reference before any optimization existed (see performance-test-optimized/HARNESS-BASELINE.md), so the measurement pipeline itself was trusted before it was used to judge anything.
- prompts/create-optimized.md → the optimized solver (src-optimized/). This is my optimization approach turned into instructions the model iterates on: profile where the cycles go, rank candidates by payoff, prefer removing work, run a simplification pass, keep the common case branch-minimal, and treat data layout and batching as first-class. The model ran this loop inside nagent (https://github.com/macton/nagent), a data-oriented agent loop where the working state lives in plain files and the model acts only through a fixed set of structured tags. The proof harness was wired to run once per turn:

nagent --read prompts/create-optimized.md \
       --hook-per-run ./prove-optimized-harness.sh \
       "Continue until 100x target reached."

so every turn began with the real, measured gate status injected into the conversation — not the model's memory of it.

- prompts/create-visualizer.md → the visualizer (viz/). Described below.
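The median-of-5 timing protocol named in the harness phase can be sketched as follows. This is a generic illustration, not the harness's code: the function names are hypothetical, and `timespec_get` with `TIME_UTC` is used only because it is standard C11 (a monotonic clock would be the more careful choice where one is available).

```c
#include <stdlib.h>
#include <time.h>

#define TIMING_RUNS 5

/* qsort comparator for doubles, ascending. */
int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the five samples in place and returns the middle one. */
double median_of_5(double samples[TIMING_RUNS])
{
    qsort(samples, TIMING_RUNS, sizeof samples[0], compare_doubles);
    return samples[TIMING_RUNS / 2];
}

/* Wall-clock time in seconds. */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `work` five times and reports the median elapsed time,
   which is less sensitive to a single slow or fast outlier than the mean. */
double time_median(void (*work)(void *), void *arg)
{
    double samples[TIMING_RUNS];
    for (int i = 0; i < TIMING_RUNS; ++i) {
        double start = now_seconds();
        work(arg);
        samples[i] = now_seconds() - start;
    }
    return median_of_5(samples);
}
```

A speedup figure is then the ratio of two such medians, one for the reference and one for the optimized solver, taken on the same input.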

The full per-hypothesis history, with measurements and keep/revert decisions, is in src-optimized/OPTIMIZATION-LOG.md. The git history mirrors it: one commit per kept change, plus a commit recording each rejected trial. The shape of the progress matters more than any single step: it was incremental, measured, and reversible, and the dead ends were written down rather than hidden.

Kept (roughly in order):

Replace the reference's log-barrier Newton solve with a support/GJK + bisection computation of α — the single largest win.
Per-type specializations: separating-axis (SAT) paths for box-box and an asymmetric SAT for box-polytope; shifted GJK paths for sphere/capsule-polytope.
Move per-shape work into a build-stage precompute that is excluded from the timed solve (the runtime solves from a flat precomputed table).
Single precision throughout, made safe by re-centering each pair to metre scale before solving.
Stop building global polytope half-spaces up front; compute the few axes a pair actually needs, and precompute the polytope's unique hull edges for the box-poly SAT.
Compact the active-path build state; specialize and force-inline the hot support function.
Closed-form (analytic) contact witnesses for the radius-shape families (sphere/capsule, box-capsule, sphere/capsule-polytope), avoiding GJK for the witness where the geometry allows it.
Reduce bisection/refinement iteration counts where the extra steps did not change the result within tolerance.
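The first kept item replaces a Newton solve with a bisection on α driven by an overlap test. The sketch below shows that idea on the easiest possible case, two axis-aligned boxes, where the overlap test is a per-axis interval check. It is not the repository's support/GJK code: the `box3` type and function names are hypothetical, half-extents are assumed positive, and each box is assumed to scale about its own center. A closed form is included only so the bisection can be cross-checked.

```c
#include <math.h>
#include <stdbool.h>

typedef struct { float center[3]; float half[3]; } box3;  /* hypothetical axis-aligned box */

/* True if the boxes overlap after both are scaled by alpha about their centers. */
bool scaled_boxes_overlap(const box3 *a, const box3 *b, float alpha)
{
    for (int i = 0; i < 3; ++i) {
        float gap = fabsf(b->center[i] - a->center[i]);
        if (gap > alpha * (a->half[i] + b->half[i]))
            return false;                       /* separated on this axis */
    }
    return true;
}

/* Bisection on alpha: `hi` always overlaps, `lo` is the largest value
   known not to overlap (or 0). Fewer iterations trade accuracy for time. */
float alpha_by_bisection(const box3 *a, const box3 *b, int iterations)
{
    float lo = 0.0f, hi = 1.0f;
    while (!scaled_boxes_overlap(a, b, hi))     /* grow until the bracket is valid */
        hi *= 2.0f;
    for (int i = 0; i < iterations; ++i) {
        float mid = 0.5f * (lo + hi);
        if (scaled_boxes_overlap(a, b, mid)) hi = mid;
        else                                 lo = mid;
    }
    return hi;
}

/* Closed form for this special case: the axis with the largest gap ratio decides. */
float alpha_closed_form(const box3 *a, const box3 *b)
{
    float alpha = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float ratio = fabsf(b->center[i] - a->center[i]) / (a->half[i] + b->half[i]);
        if (ratio > alpha) alpha = ratio;
    }
    return alpha;
}
```

The iteration count is the knob the last kept item refers to: each bisection step halves the bracket, so removing steps is safe only while the remaining error stays inside the distance tolerance.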

Rejected (recorded, not hidden): a box-poly shifted-GJK path and a box-poly SAT path that either regressed or broke the tolerance/flag contract; several inlining, bracketing, and iteration-cap trials that did not measurably help; a copy-removal in the solve wrapper; and assorted witness-bookkeeping changes. Each is a one-line commit and a log entry with the reason it was dropped.

The log also records the cost of each hypothesis — wall-clock and tokens — so the price of the whole exercise is visible, not just the result.

The optimized solver is not bit-for-bit identical to the reference, and it is not supposed to be. It is accepted only when:

- the collision flags are identical: it flags exactly the same pairs as colliding as the reference; and
- every distance agrees within |Δ| ≤ 1 mm + 0.1%·|d_ref| + 5e-4·(|c1−c2|/α²) (build/compare_results). The 1 mm floor is the documented resolution; the relative term covers large separations; the last is a conditioning term: a fixed α error scales by |c1−c2|/α², so it grows only at extreme penetration, where single precision genuinely cannot resolve the depth and the value is least actionable.
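The distance tolerance can be written out as a small function. This is a sketch of the stated formula only, not the comparator's source: the function names are hypothetical, distances are assumed to be in metres, and it is an assumption that α here is the reference solver's value.

```c
#include <math.h>
#include <stdbool.h>

/* Allowed |d_opt - d_ref| in metres:
   1 mm floor + 0.1% of |d_ref| + 5e-4 * (|c1 - c2| / alpha^2). */
double distance_tolerance(double d_ref, double center_gap, double alpha)
{
    const double floor_m      = 1e-3;                               /* documented resolution */
    const double relative     = 1e-3 * fabs(d_ref);                 /* covers large separations */
    const double conditioning = 5e-4 * (center_gap / (alpha * alpha));
    return floor_m + relative + conditioning;
}

/* True if the optimized distance is acceptably close to the reference. */
bool distance_within_tolerance(double d_opt, double d_ref,
                               double center_gap, double alpha)
{
    return fabs(d_opt - d_ref) <= distance_tolerance(d_ref, center_gap, alpha);
}
```

As a worked case, d_ref = 2 m, |c1−c2| = 4 m and α = 2 give 0.001 + 0.002 + 0.0005 = 0.0035 m, so a 3 mm deviation passes and a 4 mm deviation fails. Because the last term divides by α², it becomes large only when α is small, which is the deep-penetration regime.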

Contact points are certified for validity, not matched: a face or edge contact has many equally valid witness points, so build/validate_contacts independently checks that each emitted point lies on both surfaces and is separated by the reported distance, rather than requiring it to equal the reference's choice.

viz/ is a small, self-contained web tool (prompts/create-visualizer.md) that renders one query pair at a time: the two primitives, the contact points emitted by both the reference and the optimized solver, and the separation between them. It is how I eyeball that a result is geometrically sane, not just within a tolerance number. The images in this README were produced by it.

cd viz
python3 -m http.server 8000      # ES modules need an origin

Measured on my machine (gcc 11.4.0, x86-64, WSL2), committed 1000-pair input, median-of-5, single thread.

- Speedup: ≈ 102× on the committed input (reference median ≈ 0.276 s, optimized ≈ 0.0027 s), over the 100× target.
- Generalization: four alternate-seed inputs measure 97.6×, 97.8×, 101.7×, 102.1×, all passing correctness. So it generalizes well, but not uniformly to 100×: I claim 100× on the committed benchmark and ~98–102× generally, not a universal 100×.
- Gates (every kept step): full reference test suite 178/0; comparator 0 flag mismatches, 0 distances over tolerance; independent validator 0 failures; contacts 1000/1000 valid; output byte-identical run-to-run; committed input checksum unchanged.

Two honest caveats. First, these are wall-clock medians on one noisy machine; treat them as the right order of magnitude, not three significant figures. Second, some of the late gains came from reducing solver iteration counts, which spends down the accuracy margin (max distance deviation grew from ~1 mm toward ~5 mm while staying inside the conditioned tolerance), and from two selective fast-math compiler flags (-fno-signed-zeros, -fno-math-errno; the more aggressive ones were tried and rejected when they broke the gates). The most durable headroom from here is structural (batching and data layout) rather than more iteration-shaving.

make clean && make
make test                              # 178 passed, 0 failed
./run-performance-test                 # reference timing on the committed input

make -f Makefile.optimized optimized
./prove-optimized-harness.sh           # prints a FINAL SUMMARY + PROOF verdict
./prove-optimized-harness.sh --verbose # same, streaming every step

The proof harness verifies the committed input's sha256 is unchanged (9bd4939dc3d6c7d66459fe064768bf2d904b59410c4d8929107c9264c96dd555), so the benchmark cannot be quietly edited to flatter a result.

Units: metres; positions and distances are float32. Every world-AABB corner within ±8,192 m; every primitive's world-AABB extent within [0.1, 250] m; results correct to 1 mm.
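The input contract above can be checked with a few comparisons per primitive. The sketch below is an illustration, not the library's validation code: the `world_aabb` type and function name are hypothetical, and it is an assumption that the [0.1, 250] m extent bound applies to each axis separately.

```c
#include <stdbool.h>

#define WORLD_LIMIT_M 8192.0f   /* every AABB corner must lie within +/- this */
#define EXTENT_MIN_M  0.1f      /* smallest allowed AABB extent */
#define EXTENT_MAX_M  250.0f    /* largest allowed AABB extent */

typedef struct { float min[3], max[3]; } world_aabb;   /* hypothetical */

/* Returns false instead of clamping when a primitive is out of contract.
   A NaN coordinate makes the extent NaN, which fails the range test below. */
bool world_aabb_in_contract(const world_aabb *box)
{
    for (int i = 0; i < 3; ++i) {
        float extent = box->max[i] - box->min[i];
        if (box->min[i] < -WORLD_LIMIT_M || box->max[i] > WORLD_LIMIT_M)
            return false;
        if (!(extent >= EXTENT_MIN_M && extent <= EXTENT_MAX_M))
            return false;
    }
    return true;
}
```

Bounding both position and size is what makes the single-precision claim checkable: with coordinates under 8,192 m, a float32 still resolves steps finer than 1 mm.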

The library allocates nothing — the caller passes a scratch buffer:

#include "src/collide.h"
size_t cp_collide_scratch_bytes(uint32_t prim_count);
void   cp_collide_pairs(const cp_prim *prims, uint32_t prim_count,
                        const cp_pair *pairs, uint32_t pair_count,
                        cp_result *results, void *scratch, size_t scratch_bytes);

Arrays in, arrays out; pairs reference primitives by index; a single query is pair_count = 1. Out-of-range coordinates/sizes, bad primitives, and non-convergence are reported per result via an explicit status: never clamped, never silently accepted.
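The calling pattern (ask for the scratch size, allocate it yourself, pass it in, read a status per result) can be shown with a deliberately tiny stand-in. Everything prefixed `demo_` below is hypothetical and exists only for illustration; it is not the `cp_` API and does no collision work. The real struct layouts and status values are in src/collide.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { DEMO_OK = 0, DEMO_BAD_INPUT = 1, DEMO_NO_SCRATCH = 2 } demo_status;
typedef struct { float value; demo_status status; } demo_result;

/* Library side: reports how much scratch a batch of `count` items needs. */
size_t demo_scratch_bytes(uint32_t count)
{
    return (size_t)count * sizeof(float);
}

/* Library side: never allocates. Works only inside the caller's buffer and
   writes an explicit status per result instead of clamping bad input. */
void demo_reciprocals(const float *in, uint32_t count, demo_result *results,
                      void *scratch, size_t scratch_bytes)
{
    float *staged = (float *)scratch;
    int have_scratch = scratch != NULL && scratch_bytes >= demo_scratch_bytes(count);

    for (uint32_t i = 0; i < count; ++i) {
        results[i].value = 0.0f;
        if (!have_scratch) { results[i].status = DEMO_NO_SCRATCH; continue; }
        if (in[i] == 0.0f) { results[i].status = DEMO_BAD_INPUT;  continue; }
        staged[i] = 1.0f / in[i];            /* intermediate lives in scratch */
        results[i].value  = staged[i];
        results[i].status = DEMO_OK;
    }
}

/* Caller side: owns the allocation and its lifetime. */
int demo_caller(const float *in, uint32_t count, demo_result *results)
{
    size_t bytes = demo_scratch_bytes(count);
    void *scratch = malloc(bytes);
    if (scratch == NULL && bytes != 0)
        return -1;                           /* allocation failed */
    demo_reciprocals(in, count, results, scratch, bytes);
    free(scratch);
    return 0;
}
```

Keeping allocation on the caller's side means the buffer can be sized once and reused across frames, and the per-result status lets one bad item in a batch be reported without affecting the rest.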

K. Tracy, T. A. Howell, Z. Manchester. Differentiable Collision Detection for a Set of Convex Primitives. arXiv:2207.00669. (documents/2207.00669.pdf)
