{"slug": "low-level-haskell-the-cursed-way-to-emulate-inline-assembly-in-haskell-ghc", "title": "Low-level Haskell: The cursed way to emulate inline assembly in Haskell/GHC", "summary": "A Haskell developer explores methods to invoke low-level CPU instructions from GHC, such as computing the full 128-bit product of two 64-bit integers, despite GHC lacking inline assembly support. The post compares approaches including GHC's intrinsic `timesWord2#` and C FFI workarounds, highlighting performance trade-offs.", "body_md": "# Low-level Haskell: The cursed way to emulate inline assembly in Haskell/GHC, or how to return multiple values from a foreign function\n\nThis article is an English version of my earlier post “[【低レベルHaskell】Haskell (GHC) でもインラインアセンブリに肉薄したい！](https://qiita.com/mod_poppo/items/793fdb08e62591d6f3fb)” (in Japanese). The translation was assisted by AI (if you don’t like reading AI-generated content, please read the Japanese version!).\n\nModern CPUs have many instructions specialized for particular purposes. Examples include SIMD, instructions useful for hashing and cryptography, and a variety of others. C and C++ have inline assembly and intrinsics, which let you write code that takes advantage of such instructions.\n\nHaskell (GHC), on the other hand, has no such mechanism. But that’s no reason to give up just yet. Let’s find a way to invoke obscure CPU instructions from Haskell, and as efficiently as possible.\n\nFirst, let me list a few CPU instructions that would be nice to use from Haskell.\n\n## The subject: the high and low halves of a product of 64-bit integers\n\nConsider computing the product of two 64-bit integers, obtaining both the high 64 bits and the low 64 bits (128 bits in total).\n\nThe ordinary multiplication found in C and Haskell, `(*) :: Word64 -> Word64 -> Word64`, can only compute the low 64 bits. 
At the machine-code / assembly level on x86, however, the multiply instruction produces the high 64 bits as well.\n\nFor this kind of processing (“easy at the machine-code level, but non-trivial at the C or Haskell level”), we’d like to use inline assembly or intrinsics.\n\n(Actually, GHC has an intrinsic `timesWord2# :: Word# -> Word# -> (# Word#, Word# #)`, so you can do this in one shot using it. I chose this subject anyway so that we can measure how much slower the alternatives get compared to a GHC intrinsic.)\n\nAs another subject, **carry-less multiplication** (multiplication of polynomials over the finite field GF(2)) would also be useful for certain purposes. I won’t go into detail in this article, but I’ve placed the test results in the repository.\n\n### In C\n\nGCC/Clang have the `__int128` type, so you can compute this in one shot using it. No inline assembly or intrinsics required.\n\n```\nunsigned __int128 wideningMul(uint64_t a, uint64_t b)\n{\n    return (unsigned __int128)a * (unsigned __int128)b;\n}\n```\n\nIf we deliberately wrote it with inline assembly, it might look like this:\n\n```\nuint64_t wideningMul_inlasm(uint64_t a, uint64_t b, uint64_t *outHigh)\n{\n    uint64_t lo, hi;\n    asm(\"movq %2, %%rax;\"\n        // mulq computes the product of %rax and the operand (here %3),\n        // placing the high 64 bits in %rdx and the low 64 bits in %rax\n        \"mulq %3;\"\n        \"movq %%rax, %0;\"\n        \"movq %%rdx, %1;\"\n        : \"=r\"(lo), \"=r\"(hi)\n        : \"r\"(a), \"r\"(b)\n        : \"%rax\", \"%rdx\");\n    *outHigh = hi;\n    return lo;\n}\n```\n\n## How to return multiple values\n\nNow, this operation takes two `uint64_t`s and returns a 128-bit value, that is, two `uint64_t`s. 
Since C’s syntax has no multiple-value return, you have to choose one of the following ways to return the values:\n\n- Return a struct by value: define a struct like `struct uint128 { uint64_t lo, hi; }` and return it by value.\n    - Returning `unsigned __int128` by value corresponds to this internally. See the x86_64 ABI for details.\n- Take and pass a pointer: take the location where the second and later return values should be stored as a pointer argument.\n    - Example: the `wideningMul_inlasm` function I wrote earlier.\n\nAs an example of the former, the C standard `div`, `ldiv`, and `lldiv` functions return a `{,l,ll}div_t` struct by value.\n\nThe advantage of returning a struct by value is that, depending on the ABI, if the struct is small the values can be returned while kept in registers.\n\nThe disadvantage, on the other hand, is that other languages’ C FFI may not support it. In fact, GHC’s current C FFI does not support passing structs by value.\n\nThere is a proposal to make C structs passable by value through the FFI, but it has seen no movement:\n\n- [c structures · Wiki · Glasgow Haskell Compiler / GHC · GitLab](https://gitlab.haskell.org/ghc/ghc/-/wikis/c-structures)\n- [Support C structures in Haskell FFI (#9700) · Issue · ghc/ghc](https://gitlab.haskell.org/ghc/ghc/-/issues/9700)\n\n## Using the C FFI (with a pointer)\n\n### Safe FFI\n\nWhen there’s something Haskell can’t do, let’s borrow the power of another language! To that end, Haskell has an FFI. 
With it, you can call functions written in C.\n\nLet’s give it a try right away:\n\n```\n#include <stdint.h>\n\nextern uint64_t wideningMul_with_ptr(uint64_t a, uint64_t b, uint64_t *outHigh)\n{\n    unsigned __int128 result = (unsigned __int128)a * (unsigned __int128)b;\n    *outHigh = (uint64_t)(result >> 64);\n    return (uint64_t)result;\n}\n```\n\nAs I wrote earlier, the current GHC FFI can’t pass structs by value, so we’ll pass one of the return values via a pointer.\n\nThe Haskell side looks like this:\n\n``` haskell\nforeign import ccall \"wideningMul_with_ptr\"\n  c_wideningMul_with_ptr :: Word64 -> Word64 -> Ptr Word64 -> IO Word64\n\nwideningMulWithPtr :: Word64 -> Word64 -> Word128\nwideningMulWithPtr !a !b = unsafePerformIO $ do\n  Marshal.alloca $ \\outHigh -> do\n    lo <- c_wideningMul_with_ptr a b outHigh\n    hi <- peek outHigh\n    return $ Word128 hi lo\n```\n\nDealing with a pointer means performing `IO` in order to allocate space and read out the value. However, the operation as a whole (“multiplying two 64-bit integers”) can be considered pure, so I’ve used `unsafePerformIO` to write it as a pure function[1](#fn1).\n\n(By the way, for the `Word128` type I used the `Data.WideWord.Word128` type from the `wide-word` package.)\n\nTrying it out:\n\n```\n> wideningMulWithPtr 123 456\n56088\n> 123 * 456\n56088\n> wideningMulWithPtr (2^63) (2^62) -- using a CPU instruction\n42535295865117307932921825928971026432\n> 2^125 -- computing with arbitrary-precision arithmetic\n42535295865117307932921825928971026432\n```\n\nSo it seems to be computing correctly.\n\nIf writing the C code in a separate file is a hassle, using a package like [inline-c](http://hackage.haskell.org/package/inline-c), as in\n\n- @tanakh’s article “[Haskellにインラインアセンブリを書く](https://qiita.com/tanakh/items/08c15f6e72dbe2da61a8)” (Writing inline assembly in Haskell, in Japanese)\n\nmight be one option.\n\n### Unsafe FFI\n\nNow, Haskell’s FFI has a concept called the 
safety level. The default is `safe`, which means “it is safe for the called external code to call back into Haskell functions.”\n\nThe opposite of `safe` is `unsafe`, which means “who knows what happens if the called external code calls back into Haskell.”\n\nAn unsafe FFI carries risk, but in exchange it can be expected to have lower overhead.\n\nYou specify the safety level by writing `safe` or `unsafe` right after the calling convention in a `foreign import` declaration:\n\n``` haskell\nforeign import ccall unsafe \"wideningMul_with_ptr\"\n  c_wideningMul_with_ptr :: Word64 -> Word64 -> Ptr Word64 -> IO Word64\n```\n\nWhen using the inline-c package, you use the quasiquoters (`exp`, `pure`, `block`) found in `Language.C.Inline.Unsafe`.\n\n## Using the C FFI (calling twice)\n\nIn the previous section, we went through a pointer in order to return multiple values from a function written in C.\n\nHowever, passing values via a pointer is presumably slower than passing them in registers[citation needed]. You also need to allocate a memory region (GHC’s `alloca` function allocates on the heap, not the stack). If we can pass the values without using a pointer, so much the better.\n\nFortunately, our subject, “multiplying 64-bit integers,” is a low-cost operation. 
Given that, wouldn’t it be acceptable to perform the same computation twice, once for the high 64 bits and once for the low 64 bits?\n\n```\n// compute the low 64 bits\nextern uint64_t wideningMul_lo(uint64_t a, uint64_t b)\n{\n    unsigned __int128 result = (unsigned __int128)a * (unsigned __int128)b;\n    return (uint64_t)result;\n}\n\n// compute the high 64 bits\nextern uint64_t wideningMul_hi(uint64_t a, uint64_t b)\n{\n    unsigned __int128 result = (unsigned __int128)a * (unsigned __int128)b;\n    return (uint64_t)(result >> 64);\n}\n```\n\n``` haskell\nforeign import ccall unsafe \"wideningMul_lo\"\n  c_wideningMul_lo :: Word64 -> Word64 -> Word64\n\nforeign import ccall unsafe \"wideningMul_hi\"\n  c_wideningMul_hi :: Word64 -> Word64 -> Word64\n\nwideningMul2 :: Word64 -> Word64 -> Word128\nwideningMul2 !a !b = Word128 (c_wideningMul_hi a b) (c_wideningMul_lo a b)\n```\n\nI’ll compare the “call twice and keep everything in registers” approach against the “pass via a pointer” approach later.\n\n## Using the C FFI (using SIMD registers)\n\nGHC has no 128-bit integer type that the FFI can handle, but it can use 128-bit-wide SIMD registers, such as `Word64X2#`. 
Using one, you can pass 128 bits of data in a register in a single function call (in the System V ABI’s case).\n\n```\nextern __m128i wideningMul_xmm(uint64_t a, uint64_t b)\n{\n    union {\n        __m128i m128;\n        unsigned __int128 u128;\n    } u;\n    u.u128 = (unsigned __int128)a * (unsigned __int128)b;\n    return u.m128;\n}\n```\n\n``` haskell\n{-# LANGUAGE MagicHash #-}\n{-# LANGUAGE UnboxedTuples #-}\n{-# LANGUAGE UnliftedFFITypes #-}\n\nforeign import ccall unsafe \"wideningMul_xmm\"\n  c_wideningMul_xmm :: Word64 -> Word64 -> Word64X2#\n\nwideningMulXMM :: Word64 -> Word64 -> Word128\nwideningMulXMM !a !b = case unpackWord64X2# (c_wideningMul_xmm a b) of\n  (# lo, hi #) -> Word128 (W64# hi) (W64# lo)\n```\n\nThis isn’t a general-purpose way to return multiple values, but I brought it up because this particular subject happens to fit in 128 bits.\n\n## Black magic: `foreign import prim`\n\nAs I wrote earlier, current GHC cannot handle C functions that return a struct by value. And when you want to return multiple values from external code, you need to use a pointer or go through multiple function calls. However, GHC does have a means of returning values from external code using multiple registers (multiple-value return): that is `foreign import prim`.\n\n### About GHC’s PrimOps\n\nAn operation that corresponds directly to a machine instruction, such as integer addition, is called a **primitive operation** in GHC parlance (in other words, a GHC intrinsic). For integer addition, the function\n\n``` haskell\n(+#) :: Int# -> Int# -> Int#\n```\n\nis defined. In the old days, the intrinsics (pseudo-)defined in the `GHC.Prim` module of the ghc-prim package were exposed via the `GHC.Exts` module of the base package, and you used those. 
But due to the decoupling of the base package from GHC, `GHC.Exts` has been frozen, and the newest intrinsics are now available via the `GHC.Internal.Exts` module of the ghc-internal package or the `GHC.PrimOps` module of the ghc-experimental package.\n\nWell, since we want to define our own intrinsics, there’s no need to go into detail about how to use existing ones.\n\nAccording to the GHC Wiki page ([prim ops · Wiki · Glasgow Haskell Compiler / GHC · GitLab](https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/prim-ops)), GHC’s intrinsics come in three kinds: inline PrimOps, out-of-line PrimOps, and foreign out-of-line PrimOps (`foreign import prim`).\n\n- inline PrimOps: expanded into an instruction sequence on the spot. Hardcoded into GHC.\n- out-of-line PrimOps: follow a dedicated calling convention. Hardcoded into GHC.\n- foreign out-of-line PrimOps: follow a dedicated calling convention. Library developers can define them via `foreign import prim`. Intended for libraries bundled with GHC.\n\nOf these (or rather, among all the methods available in Haskell for hitting a specific CPU instruction), inline PrimOps are presumably the lowest-cost, but modifying GHC just to use a single instruction is quite a big undertaking[2](#fn2). By comparison, foreign out-of-line PrimOps are relatively easy to use.\n\n### Using `foreign import prim`\n\nAn example of Haskell code using `foreign import prim` looks like this:\n\n```\n{-# LANGUAGE GHCForeignImportPrim, UnliftedFFITypes, MagicHash, UnboxedTuples #-}\n\nforeign import prim \"wideningMul_prim\"\n  wideningMul_prim# :: Word# -> Word# -> (# Word#, Word# #)\n\nwideningMul :: Word64 -> Word64 -> Word128\nwideningMul (W64# a) (W64# b)\n  = case wideningMul_prim# a b of\n      (# lo, hi #) -> Word128 (W64# hi) (W64# lo)\n```\n\nAt first glance, the only change is that the `foreign import` calling convention went from `ccall` to `prim`. What a relief! 
(The `#`s on the type names and tuples are *common in low-level Haskell*, so nothing to be surprised about at this point. You can also write `ccall` FFI that uses `#` all over the place.)\n\nAs for the GHC extensions used: to use `foreign import prim`, you need the `GHCForeignImportPrim` extension[3](#fn3). Also, the arguments and return values can basically only be unlifted types, so the `UnliftedFFITypes` extension is required too. `MagicHash` and `UnboxedTuples` go without saying.\n\nThere’s a brief explanation of the `GHCForeignImportPrim` extension in the User’s Guide, but no detailed explanation of the calling convention.\n\nI’m about to present code that touches GHC’s internal calling convention, but this is **completely unsupported** and **may stop working depending on GHC’s configuration or differences between minor versions**.\n\nIn fact, I have **changed GHC’s internal calling convention with my own hands** before. That was the first of my contributions to GHC to get merged.\n\n- [Fewer FP registers than available are used for parameter passing on AArch64 (#17953) · Issue · ghc/ghc](https://gitlab.haskell.org/ghc/ghc/-/issues/17953)\n- [Support auto-detection of MAX_REAL_{FLOAT,DOUBLE}_REG up to 6 (#17953) (!5117) · Merge requests · Glasgow Haskell Compiler / GHC · GitLab](https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5117)\n\nIf you make use of `foreign import prim` for some reason, I recommend always keeping an eye on the development status of GHC proper.\n\n### Implementing a foreign out-of-line PrimOp\n\nNow that the Haskell side is ready to use our own PrimOp, the next step is to prepare the implementation.\n\nGHC’s out-of-line PrimOps are apparently expected to be written in Cmm, but\n\n- Cmm is hard to understand!\n- It seems you can’t use inline assembly or architecture-dependent intrinsics from Cmm!\n\nso we’ll write it directly in assembly. 
(As I’ll mention later, besides Cmm and raw assembly there’s a third path: messing with LLVM IR.)\n\nGHC’s calling convention for PrimOps, in the case of x86_64, is roughly:\n\n- Arguments and return values are basically passed in registers. The first is `%rbx`, the second is `%r14`, the third is `%rsi`, the fourth is `%rdi`, …\n- To return to the caller, do `jmp *(%rbp)`.\n\nFor the register usage, see [rts/include/stg/MachRegs/x86.h](https://gitlab.haskell.org/ghc/ghc/-/blob/master/rts/include/stg/MachRegs/x86.h).\n\nThe actual assembly code looks like this:\n\n```\n    .globl _wideningMul_prim\n_wideningMul_prim:\n    ## first argument: %rbx\n    ## second argument: %r14\n    movq %rbx, %rax\n    mulq %r14 ## compute the product of %rax and %r14, putting the high half in %rdx and the low half in %rax\n    movq %rax, %rbx ## put the first return value (lo) in %rbx\n    movq %rdx, %r14 ## put the second return value (hi) in %r14\n    jmp *(%rbp)\n```\n\nIn GHC’s internal calling convention, functions (and continuations) are always tail-called, so registers (except those reserved for the STG machine) are free to use.\n\nBy the way, if you really want to write it in C, there’s apparently a method: “write it as a function with a specific prototype, compile it with Clang, and edit the LLVM IR output by `-S -emit-llvm` to change the calling convention to ghccc[4](#fn4).” See the links below for details.\n\nA collection of links that may be helpful:\n\n- [Parsing Market Data with Ragel, clang and GHC primops - Ten Cache Misses](http://breaks.for.alienz.org/blog/2012/02/09/parsing-market-data-feeds-with-ragel/): modifies the LLVM IR that clang emits to use GHC’s calling convention\n- [haskell - foreign import prim call to LLVM - Stack Overflow](https://stackoverflow.com/questions/33910131/foreign-import-prim-call-to-llvm): same as above\n- [haskell - Using `foreign import prim` with a C function using STG calling convention - Stack Overflow](https://stackoverflow.com/questions/41213378/using-foreign-import-prim-with-a-c-function-using-stg-calling-convention): same as above\n- [Almost Inline ASM in Haskell With Foreign Import Prim - Brandon.Si(mmons)](http://brandon.si/code/almost-inline-asm-in-haskell-with-foreign-import-prim/): writes x86_64 assembly directly without going through LLVM\n    - I referred to this article heavily when writing this post\n    - The code is here → [jberryman/almost-inline-asm-haskell-example: An example of using `foreign import prim` in ghc haskell to call assembly with low overhead](https://github.com/jberryman/almost-inline-asm-haskell-example)\n\n### Implementing a foreign out-of-line PrimOp (improved)\n\nThe assembly code above has a few issues.\n\n- The function name we’re implementing this time is `wideningMul_prim`, but I gave its assembly name (symbol name) a leading underscore as `_wideningMul_prim`. Whether a leading underscore is added generally depends on the platform (OS). For example, it is added on macOS but not on Linux.\n- When “returning from the function” I used `jmp *(%rbp)`, but this doesn’t work if GHC is configured with `--disable-tables-next-to-code`.\n\nFor the first problem, when an underscore is added a macro `LEADING_UNDERSCORE` is defined in `ghcconfig.h`, so you can use that. If you change the assembly source’s extension to an uppercase `.S`, the preprocessor becomes available, and on GHC 9.12 and later you can `#include \"ghcconfig.h\"` (I made the change that enables this). 
Note that the assembly source’s filename must not start with an uppercase letter, or GHC will mistakenly think a Haskell module name was given on the command line (`Foo.S` won’t do; it must be `foo.S`).\n\n```\n#include \"ghcconfig.h\"\n#if defined(LEADING_UNDERSCORE)\n#define SYMBOL(name) _##name\n#else\n#define SYMBOL(name) name\n#endif\n    .globl SYMBOL(wideningMul_prim)\nSYMBOL(wideningMul_prim):\n    ...\n```\n\nAs for the second problem, what I’m calling “returning from the function” means “invoking the continuation pushed on top of the STG stack (`%rbp` on x86_64).” When GHC is configured with `--enable-tables-next-to-code` (TNTC), the address of the continuation object is the address of the code itself, but otherwise the address of the code is stored at the start of the continuation object. For this layout, refer to [rts/include/rts/storage/InfoTables.h](https://gitlab.haskell.org/ghc/ghc/-/blob/44309cd377f115d3e6b788097750bd534b47f3e4/rts/include/rts/storage/InfoTables.h#L186).\n\n```\n// with --enable-tables-next-to-code\nstruct StgInfoTable {\n    ...\n    // <- this address is pushed on the STG stack\n    StgCode code[]; // machine code\n};\n\n// with --disable-tables-next-to-code\nstruct StgInfoTable {\n    // <- this address is pushed on the STG stack\n    StgFunPtr entry; // pointer to machine code\n    ...\n};\n```\n\nThis too can be distinguished using the `TABLES_NEXT_TO_CODE` macro in `ghcconfig.h`.\n\n```\n    ...\n#if defined(TABLES_NEXT_TO_CODE)\n    jmp *(%rbp)\n#else\n    movq (%rbp),%rax\n    jmp *(%rax)\n#endif\n```\n\nBesides this, GHC also has a configuration called an unregisterised build, and I think the method described here won’t work in that case either.\n\n### Combining a C struct return with `foreign import prim` (the thunk approach)\n\nWhen the work fits in a single instruction as it does here, hand-coding the body of the PrimOp in assembly is no big deal, but for somewhat more complex code you’ll want to 
write it in a high-level language like C.\n\nSo consider the following approach:\n\n- Write the body of the processing as a C function (using struct return by value).\n- Write a PrimOp in assembly that calls the C function.\n- Call the assembly-written one from Haskell via `foreign import prim`.\n\nIn other words, you absorb the difference between the C calling convention and the GHC calling convention with a small amount of assembly code in between. Let’s call this intervening code a thunk. (In Haskell, “thunk” tends to evoke the lazy-evaluation sense, but this is a different sort of thunk.)\n\nThe C part looks like this:\n\n```\n// __int128 is equivalent to struct { uint64_t lo, hi }; on System V/x86_64 ABI\nextern unsigned __int128 wideningMul_uint128(uint64_t a, uint64_t b)\n{\n    return (unsigned __int128)a * (unsigned __int128)b;\n}\n```\n\nThe thunk written in assembly looks like this:\n\n```\n#include \"ghcconfig.h\"\n#if defined(LEADING_UNDERSCORE)\n#define SYMBOL(name) _##name\n#else\n#define SYMBOL(name) name\n#endif\n    .globl SYMBOL(wideningMul_thunk)\nSYMBOL(wideningMul_thunk):\n    ## GHC:\n    ##   first argument: %rbx\n    ##   second argument: %r14\n    ## C:\n    ##   first argument: %rdi\n    ##   second argument: %rsi\n    movq %rbx, %rdi\n    movq %r14, %rsi\n    subq $8, %rsp\n    callq SYMBOL(wideningMul_uint128)\n    addq $8, %rsp\n    ## C:\n    ##   first return value: %rax\n    ##   second return value: %rdx\n    ## GHC:\n    ##   first return value: %rbx\n    ##   second return value: %r14\n    movq %rax, %rbx\n    movq %rdx, %r14\n#if defined(TABLES_NEXT_TO_CODE)\n    jmp *(%rbp)\n#else\n    movq (%rbp),%rax\n    jmp *(%rax)\n#endif\n```\n\nWhen you call the C function, you might worry that the registers used by the STG machine get clobbered, but in the x86_64 case the STG registers are deliberately mapped onto the callee-saved registers of the C calling convention, so there’s no need to save them (in both the System V ABI and 
the Microsoft calling convention).\n\nYou also need to check the alignment of the stack pointer `%rsp`. In both the x86_64 System V ABI and the Microsoft calling convention, the stack pointer must be a multiple of 16 at the point `callq` executes. In GHC’s calling convention, you can assume that on entry to a function the stack pointer has the form `16 * n - 8`. That is, it mimics the state right after issuing a `callq` instruction with the stack pointer aligned to a multiple of 16. In recent GHC this has become `64 * n - 8` due to AVX (a change I made myself). For details, see GHC’s `rts/StgCRun.c`.\n\nTo state the conclusion about the stack pointer: subtract 8 from `%rsp` just before the `callq` instruction and you’ll satisfy the requirements of the C calling convention.\n\nThis is wishful thinking, but it would be fun to have a Haskell library that auto-generates thunks like this with Template Haskell, so you could directly call C functions that pass (return) structs by value. Somebody please make it!\n\n## Benchmark\n\nI’ve considered various ways to implement “obtain the product of two 64-bit integers as a 128-bit integer.” So let’s compare them. The comparison targets are:\n\n- Use Haskell’s arbitrary-precision arithmetic (`Integer` type).\n
- Convert the operands to the `Word128` type of the `wide-word` package, then multiply.\n    - The `wide-word` package internally uses `timesWord2#`.\n- Use `timesWord2#` from `GHC.Prim`.\n    - Internally to GHC, this is a kind of inline PrimOp called `WordMul2`.\n- Use the C FFI.\n    - Safe FFI / Unsafe FFI\n    - Using a pointer (unsafePerformIO / unsafeDupablePerformIO / ****PerformIO) / calling twice / using SIMD registers\n- Use `foreign import prim`.\n    - Implemented in assembly\n    - The thunk approach\n\nSince GHC already has the built-in `timesWord2#` for the operation of multiplying two 64-bit integers to get 128 bits, we can also compare “how much difference there is between inline PrimOps and (foreign) out-of-line PrimOps.”\n\nLet me predict which is fastest: the inline PrimOp `timesWord2#` should be fastest, followed by foreign out-of-line PrimOps (`foreign import prim`). I have a feeling that `Integer`’s arbitrary-precision arithmetic is slowest.\n\nHere are the actual benchmark results:\n\nIf you’d rather read it as text, see [here](https://github.com/minoki/hs-inline-asm-test/blob/ghc-9.12/widening-mul-report.txt).\n\nTo summarize:\n\n- The fastest is **GHC’s built-in `timesWord2#` function**, at about **4.0ns**. `Word128`, which internally uses `timesWord2#`, achieves equivalent performance.\n- The runner-up is **`foreign import prim` (assembly only)**, at about **4.5ns**. The **thunk approach** and “**unsafe FFI + calling twice without a pointer**” follow it (~5.8ns).\n- Using XMM registers with unsafe FFI is about **7.0ns**.\n- Next is “**unsafe FFI + passing a pointer**,” at about **19ns**. Using unsafeDupablePerformIO or the unspeakable blasphemous one instead of unsafePerformIO gives roughly the same.\n- Next is the one using **`Integer`’s arbitrary-precision arithmetic**, at about **32.5ns**.\n- Dead last is “**safe FFI**,” at over **60ns**. 
With safe FFI, “calling twice” is slower than “passing a pointer,” presumably because the cost of a safe FFI call outweighs the cost of passing a pointer.\n\nThe source code I used is on [GitHub](https://github.com/minoki/hs-inline-asm-test). The benchmark was run on WSL2 on my Zen4 machine.\n\n## Closing thoughts / summary\n\nUnsurprisingly, the version compiled to an inline instruction sequence built into GHC is fastest, but we found that `foreign import prim` achieves performance that comes close to it. (Since this was a microbenchmark, the gap might widen in more practical examples.)\n\nWhat’s surprising is that “unsafe FFI + calling twice without a pointer” holds its own against `foreign import prim`. In other words, the C FFI can rival `foreign import prim` depending on how you do it. If the performance difference is slight, choosing the C FFI over the black magic of `foreign import prim` would be a reasonable call.\n\nConversely, even with the C FFI, needlessly using the `safe` level made it slower than the “naively written in Haskell without using the CPU’s handy instructions” version.\n\n**When using the C FFI for speed, check that the safety level isn’t needlessly set to safe.**\n\n- NB: Use `safe` if the foreign call runs for a long time or may call back into Haskell, since an `unsafe` call blocks GC. `unsafe` is best for short calls.\n\nThat, I think, is a fairly important lesson.\n\nThose well-versed in Unsafe Haskell might think, “In this case you could use `unsafeDupablePerformIO`, or even better, `****PerformIO`…” Don’t worry, I’ve included the results of those experiments at the end as well.[↩︎](#fnref1)\n\nIf your change gets merged into the upstream GHC, the effort might pay off, but getting an intrinsic for a niche instruction merged into GHC requires a correspondingly convincing case. 
There was apparently an [issue proposing to add AES instructions](https://gitlab.haskell.org/ghc/ghc/issues/8153) to GHC in the past, but it was closed as won’t fix.[↩︎](#fnref2)\n\nA name that really gives off a “GHC-only! External libraries, keep out!” vibe.[↩︎](#fnref3)\n\nghccc is the name of GHC’s calling convention within LLVM (→ [LLVM calling-conventions documentation](https://llvm.org/docs/LangRef.html#calling-conventions)). It used to be called cc10.[↩︎](#fnref4)", "url": "https://wpnews.pro/news/low-level-haskell-the-cursed-way-to-emulate-inline-assembly-in-haskell-ghc", "canonical_source": "https://minoki.github.io/posts/2026-06-30-haskell-inline-asm.html", "published_at": "2026-07-01 10:33:38+00:00", "updated_at": "2026-07-01 10:50:53.944791+00:00", "lang": "en", "topics": ["developer-tools", "machine-learning"], "entities": ["GHC", "GCC", "Clang", "x86", "SIMD"], "alternates": {"html": "https://wpnews.pro/news/low-level-haskell-the-cursed-way-to-emulate-inline-assembly-in-haskell-ghc", "markdown": "https://wpnews.pro/news/low-level-haskell-the-cursed-way-to-emulate-inline-assembly-in-haskell-ghc.md", "text": "https://wpnews.pro/news/low-level-haskell-the-cursed-way-to-emulate-inline-assembly-in-haskell-ghc.txt", "jsonld": "https://wpnews.pro/news/low-level-haskell-the-cursed-way-to-emulate-inline-assembly-in-haskell-ghc.jsonld"}}