Low-level Haskell: The cursed way to emulate inline assembly in Haskell/GHC, or how to return multiple values from a foreign function

A Haskell developer explores methods to invoke low-level CPU instructions from GHC, such as computing the full 128-bit product of two 64-bit integers, despite GHC lacking inline assembly support. The post compares approaches including GHC's intrinsic `timesWord2#` and C FFI workarounds, highlighting performance trade-offs.

This article is an English version of my earlier post “【低レベルHaskell】Haskell GHC でもインラインアセンブリに肉薄したい!” (https://qiita.com/mod_poppo/items/793fdb08e62591d6f3fb), which is in Japanese. The translation was assisted by AI; if you don't like reading AI-generated content, please read the Japanese version.

Modern CPUs have many instructions specialized for particular purposes. Examples include SIMD, instructions useful for hashing and cryptography, and a variety of others. C and C++ have inline assembly and intrinsics, which let you write code that takes advantage of such instructions. Haskell (GHC), on the other hand, has no such mechanism.

But that's no reason to give up just yet. Let's find a way to invoke obscure CPU instructions from Haskell, and as efficiently as possible.

First, let me list a few CPU instructions that would be nice to use from Haskell.

The subject: the high and low halves of a product of 64-bit integers

Consider computing the product of two 64-bit integers, obtaining both the high 64 bits and the low 64 bits (128 bits in total). The ordinary multiplication found in C and Haskell, `(*) :: Word64 -> Word64 -> Word64`, can only compute the low 64 bits. At the machine-code / assembly level on x86, however, the multiply instruction produces the high 64 bits alongside the low 64 bits. For this kind of processing (“easy at the machine-code level, but non-trivial at the C or Haskell level”), we'd like to use inline assembly or intrinsics.
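To make “non-trivial at the C level” concrete, here is a minimal sketch of the same computation in portable C, using only 64-bit arithmetic. The function name `wideningMul_portable` is hypothetical and does not come from the original post; the approach is ordinary schoolbook multiplication on 32-bit halves.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): computes the full
   128-bit product of a and b using only 64-bit operations.
   Returns the low 64 bits and stores the high 64 bits in *outHigh. */
uint64_t wideningMul_portable(uint64_t a, uint64_t b, uint64_t *outHigh)
{
    /* Split each operand into 32-bit halves. */
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    /* Four partial products; each fits in 64 bits. */
    uint64_t p0 = a_lo * b_lo; /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi; /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo; /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi; /* weight 2^64 */

    /* Sum everything that lands in bits 32..63; the sum of three values
       below 2^32 cannot overflow 64 bits. Its upper half is the carry
       into the high word. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    *outHigh = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (p0 & 0xFFFFFFFFu);
}
```

This takes four multiplications plus a handful of shifts and additions, where x86-64 needs a single `mulq`. That gap is the motivation for reaching for intrinsics or inline assembly.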
Actually, GHC has an intrinsic `timesWord2# :: Word# -> Word# -> (# Word#, Word# #)`, so you can do this in one shot using it. I chose this subject anyway so that we can measure how much slower the alternatives get compared to a GHC intrinsic.

As another subject, carry-less multiplication (polynomial multiplication over a finite field) would also be useful for certain purposes. I won't go into detail in this article, but I've placed the test results in the repository.

In C

GCC and Clang have the `__int128` type, so you can compute this in one shot using it. No inline assembly or intrinsics required.

    #include <stdint.h>

    unsigned __int128 wideningMul(uint64_t a, uint64_t b)
    {
        return (unsigned __int128)a * (unsigned __int128)b;
    }

If we deliberately wrote it with inline assembly, it might look like this:

    uint64_t wideningMul_inlasm(uint64_t a, uint64_t b, uint64_t *outHigh)
    {
        uint64_t lo, hi;
        asm("movq %2, %%rax;"
            // mulq computes the product of %rax and the operand (here %3),
            // placing the high 64 bits in %rdx and the low 64 bits in %rax
            "mulq %3;"
            "movq %%rax, %0;"
            "movq %%rdx, %1;"
            : "=r"(lo), "=r"(hi)
            : "r"(a), "r"(b)
            : "%rax", "%rdx");
        *outHigh = hi;
        return lo;
    }

How to return multiple values

Now, this operation takes two `uint64_t`s and returns a 128-bit value, that is, two `uint64_t`s. Since C's syntax has no multiple-value return, you have to choose one of the following ways to return the values:

- Return a struct by value: define a struct like `struct uint128 { uint64_t lo, hi; }` and return it by value.
  - Returning `unsigned __int128` by value corresponds to this internally. See the x86_64 ABI for details.
- Take and pass a pointer: take the location where the second and later return values should be stored as a pointer argument.
  - Example: the `wideningMul_inlasm` function I wrote earlier.

As an example of the former, the C standard `div`, `ldiv`, and `lldiv` functions return a `{,l,ll}div_t` struct by value.
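The two conventions can be put side by side in a short sketch. The function names `wideningMul_byValue` and `wideningMul_byPointer` are hypothetical, and the code assumes a compiler that provides `unsigned __int128` (GCC or Clang on a 64-bit target).

```c
#include <stdint.h>

/* The struct shape mentioned in the text. */
struct uint128 { uint64_t lo, hi; };

/* Style 1 (hypothetical name): return both halves in a struct by value. */
struct uint128 wideningMul_byValue(uint64_t a, uint64_t b)
{
    /* Assumes the GCC/Clang __int128 extension is available. */
    unsigned __int128 p = (unsigned __int128)a * b;
    struct uint128 r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* Style 2 (hypothetical name): return the low half directly and write
   the high half through a pointer supplied by the caller. */
uint64_t wideningMul_byPointer(uint64_t a, uint64_t b, uint64_t *outHigh)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *outHigh = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
```

A caller of the first style writes `struct uint128 r = wideningMul_byValue(a, b);` and reads `r.lo` and `r.hi`. A caller of the second style has to declare a `uint64_t hi;` first and pass `&hi`, so the second result goes through memory unless the compiler inlines the call.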
The advantage of returning a struct by value is that, depending on the ABI, a small struct can be returned entirely in registers. The disadvantage is that other languages' C FFI may not support it. In fact, GHC's current C FFI does not support passing structs by value.

There is a proposal to make C structs passable by value through the FFI, but it has seen no movement:

- c structures · Wiki · Glasgow Haskell Compiler / GHC · GitLab: https://gitlab.haskell.org/ghc/ghc/-/wikis/c-structures
- Support C structures in Haskell FFI (#9700) · Issue · ghc/ghc: https://gitlab.haskell.org/ghc/ghc/-/issues/9700

Using the C FFI with a pointer

Safe FFI

When there's something Haskell can't do, let's borrow the power of another language. To that end, Haskell has an FFI. With it, you can call functions written in C.

Let's give it a try right away: include