{"slug": "80386-multiplication-and-division", "title": "80386 Multiplication and Division", "summary": "The Intel 80386, released in 1985, was the first 32-bit x86 processor, introducing a flat 4GB address space, virtual memory with paging, and an effective protected mode. It significantly improved arithmetic performance over its predecessors by using dedicated hardware that processes multiplication and division at one bit per cycle, with an \"add-and-shift\" multiplication algorithm and an early-out optimization that reduces cycle counts when remaining multiplier bits are zero.", "body_md": "# 80386 Multiplication and Division\n\nWhen Intel released the 80386 in October 1985, it marked a watershed moment for personal computing. The 386 was the first 32-bit x86 processor, increasing the register width from 16 to 32 bits and vastly expanding the address space compared to its predecessors. This wasn't just an incremental upgrade—it was the foundation that would carry the PC architecture for decades to come.\n\nThe timing was significant. By the mid-1980s, the IBM PC had established x86 as the dominant PC architecture, but the 16-bit 8086/286 processors were hitting their limits. Memory was constrained to 1MB (or 16MB with the 286's limited protected mode). Competing 32-bit architectures like the Motorola 68020 threatened Intel's dominance. The 386 was Intel's answer: full 32-bit computing with backward compatibility for the massive library of existing DOS software.\n\nThe 386 introduced important and long-lasting x86 features: a flat 4GB address space, virtual memory with paging, and a protected mode that actually worked. It would go on to run Windows 3.0, Windows 95, early Linux, and countless other operating systems that shaped modern computing.\n\n## Faster arithmetic\n\nIn addition to its architectural advances, the 386 delivered a major jump in arithmetic performance. 
On the earlier 8086, multiplication and division were slow: 16-bit multiplication typically required 120–130 cycles, with division taking even longer at over 150 cycles. The 286 significantly improved on this by introducing faster microcode routines and modest hardware enhancements.\n\nThe 386 pushed performance further with dedicated hardware that processes multiplication and division at the rate of **one bit per cycle**, combined with a native 32-bit datapath. The microcode still orchestrates the operation, but the heavy lifting happens in specialized datapath logic that advances every cycle.\n\nHere are the actual cycle counts from the Intel 386 Programmer's Reference Manual:\n\n| Instruction | 8-bit | 16-bit | 32-bit |\n|---|---|---|---|\n| MUL | 9-14 | 9-22 | 9-38 |\n| IMUL | 9-14 | 9-22 | 9-38 |\n| DIV | 14 | 22 | 38 |\n| IDIV | 19 | 27 | 43 |\n\nThe ranges for MUL/IMUL reflect an \"early-out\" optimization: the loop exits early when the remaining multiplier bits are all zeros (or all ones for signed). Division has no early-out, so cycle counts are fixed at roughly `width + overhead` (for DIV, the table works out to `width + 6`).\n\nTo save silicon, the 386 reuses the main ALU for the per-iteration add/subtract work rather than having a separate multiplier unit. The microcode controls the iteration, while dedicated datapath logic handles the shifting and loop termination. Let's look at how these algorithms work.\n\n## Add-and-shift multiplication\n\nA classic choice for hardware multiplication is the Booth algorithm, but the 80386 does not use it. Instead, it uses an \"add-and-shift\" multiplication algorithm, which is similar to grade-school long multiplication. The difference is that instead of shifting each partial product left as we move from lower digits to higher, the accumulated product is shifted to the right. Here's the data layout:\n\nThree key internal registers participate in multiplication: MULTMP, TMPB, and SIGMA. A notable challenge is that x86 instructions support 8-bit, 16-bit, and 32-bit operands. 
Consistent with the design philosophy of the 8086, the 386 achieves this flexibility by reusing the same registers and microcode routines for all operand sizes. In most cases, the identical hardware and microcode sequence accommodate different widths seamlessly. The diagram above shows how, for example, the result of multiplying two 16-bit numbers is arranged within a 32-bit product: it occupies the lower half of the SIGMA register and the upper half of TMPB.\n\nHere is the multiplication algorithm in pseudocode:\n\n```\n1: COUNTR = width-1\n2: while (true):\n3:   if (TMPB[0]) SIGMA <= SIGMA + MULTMP\n4:   {SIGMA, TMPB} >>= 1      // arithmetic shift for signed\n5:   if (--COUNTR==0) break\n6:   if (remaining TMPB bits are all 0 or all 1 for signed) break\n7: {SIGMA, TMPB} >>= COUNTR   // compensate for early exit\n8: correction for signed multiplication\n```\n\nShifting to the right rather than the left simplifies the hardware circuits. Line 6 implements the important \"early-out\" optimization, which means the loop can terminate early if the remaining multiplier bits are all zeros—or all ones, in the case of signed multiplication. When this happens, line 7 adjusts for the early exit by shifting the accumulated result right by the number of remaining COUNTR bits.\n\nLines 1–7 fully describe unsigned multiplication. To extend this to signed multiplication, only a few tweaks are needed: use arithmetic (not logical) shifts on lines 4 and 7, and, as a final correction in line 8, subtract the multiplicand from the upper product register (SIGMA) if the multiplier was negative. For a deeper dive into the mathematics, see college-level computer organization resources such as [this one](https://web.ece.ucsb.edu/~parhami/pres_folder/f31-book-arith-pres-pt3.pdf).\n\nThe 80386 multiplication microcode closely mirrors the algorithm described above, and shows both the timing and the likely underlying hardware involved. 
The microcode routine shown here handles register-based multiplication, both unsigned and signed, and supports all three operand sizes: 8, 16, and 32 bits. Other forms, such as multiplying with a memory operand, are implemented similarly.\n\nBefore we examine the code, it’s helpful to quickly review the 80386’s microcode syntax and conventions. While the [8086 used 21-bit micro-operations](https://www.reenigne.org/blog/8086-microcode-disassembled/), the 80386 expanded these to 37 bits, adding fields to control more complex hardware functionality. Moves are written as `src->dest`, which simply means copying data from one register to another. The `alujmp` field directs either the ALU (using `src` and `alu_src` as inputs) or the microcode control flow, handling everything from arithmetic to jumps to indirect operations (`alu_src` as the jump target). Pay special attention to the `RPT` keyword found on the third micro-instruction of the upcoming listing: this signals the microcode sequencer to repeatedly execute a micro-instruction, decrementing the COUNTR register each time, and continuing until COUNTR reaches zero, i.e. 
looping for COUNTR+1 iterations.\n\n```\n; MUL/IMUL r\n; src     dest    alu_src        alujmp  uop sub busop\nDSTREG -> MULTMP  BITS_V         LDCNTR          ; MULTMP=r (multiplicand), COUNTR=width-1\neAX_AL -> TMPB    0              PASS2           ; TMPB=multiplier (AL/AX/EAX)\nSIGMA             TMPB           IMUL3   RPT DLY ; hardware mult loop with early-out\nSIGMA                            PASS            ; pass through SIGMA\nCOUNTR -> TMPD                                   ; save remaining COUNTR\nRESULT -> TMPC    TMPD           LDBSR8          ; load shift count: right shift, COUNTR\nSIGMA  -> TMPD    TMPC           SHIFT           ; shift {SIGMA,RESULT} to get low result\nSIGMA  -> eAX_AL  TMPD           MULFIX          ; write low result, set flags, signed mult correction\nSIGMA             TMPD           SHIFT   RNI     ; shift {0,ProdU} to get high result\nSIGMA  -> eDX_AH                                 ; write high result\n```\n\nThe `RESULT` register is used by both multiplication and division. For multiplication, it accumulates the lower half of the product as bits shift right out of TMPB during the loop. `MULFIX` is the correction for signed multiplication on pseudocode line 8.\n\n### Other variants\n\nBeyond the original single-operand form, the 386 supports two further forms of IMUL. The three-operand form dates back to the 80186, while the two-operand form is new with the 386:\n\n- **Two-operand**: `IMUL reg, r/m` multiplies reg by r/m, stores in reg (single-width result)\n- **Three-operand**: `IMUL reg, r/m, imm` multiplies r/m by immediate, stores in reg (single-width result)\n\nThese variants are interesting because they only produce a single-width result (discarding the upper half), making them faster for common cases where overflow isn't expected. The microcode for these uses a slightly different entry point that skips writing the upper result to EDX/DX/AH.\n\n## Division\n\nThe 80386 uses the standard [non-restoring division algorithm](https://en.wikipedia.org/wiki/Division_algorithm#Non-restoring_division). 
Here's the data layout:\n\nThe dividend is {SIGMA, DIVTMP} (max 64 bits), while the divisor is TMPB (max 32 bits). Each iteration shifts the dividend left by one bit and either adds or subtracts the divisor, building up the quotient in RESULT one bit at a time.\n\n```\n1: do:                               // loop body is DIV7\n2:     {SIGMA,DIVTMP} <<= 1;\n3:     if (SIGMA < 0) SIGMA += TMPB;\n4:     else           SIGMA -= TMPB;\n5:     RESULT = (RESULT << 1) | (SIGMA >= 0 ? 1 : 0)\n6:     COUNTR--;\n7: while (COUNTR >= 0)               // COUNTR+1 iterations in total\n8: if (SIGMA < 0) SIGMA += TMPB;     // DIV5\n```\n\nLet's look at the division routine (`DIV r`, opcodes F6 /6 and F7 /6) directly.\n\n```\n; DIV r\neAX_AL -> DIVTMP  BITS_V         LDCNTR          ; DIVTMP = lower half of dividend, COUNTR=width-1\neDX_AH                           PASS            ; SIGMA = upper half of dividend\nDSTREG -> TMPB                                   ; TMPB = divisor\nSIGMA             TMPB            DIV7   RPT DLY ; Loop: dividend={SIGMA,DIVTMP}, divisor=TMPB\nSIGMA             TMPB            DIV5           ; Final correction\nSIGMA                            PASS            ; Preserve remainder through ALU\nRESULT -> eAX_AL                         RNI     ; accumulator = quotient\nSIGMA  -> eDX_AH                                 ; upper-half reg = remainder\n```\n\nDIV7 and DIV5 are both single-cycle micro-operations. DIV7 implements the core of the division loop, corresponding to pseudocode lines 2–5 (excluding the COUNTR decrement). With each iteration, DIV7 updates SIGMA (the remainder) and RESULT (the quotient accumulator). The loop is controlled by the RPT keyword, which keeps the sequencer repeatedly executing DIV7 for COUNTR+1 iterations; there’s no early exit for division. After completing the main loop, DIV5 performs the final correction required by the non-restoring division algorithm (pseudocode line 8).\n\n### Signed division (IDIV)\n\nIDIV is more complex than DIV because it must handle signs. 
The approach is:\n\n- Convert dividend and divisor to absolute values\n- Perform unsigned division\n- Adjust signs of quotient and remainder\n\nHere's the IDIV microcode:\n\n```\n; IDIV r\n-1                BITS_V         ADD             ; COUNTR=width-2\nSIGMA  -> COUNTR\neDX_AH                           PASS            ; SIGMA=upper dividend\neAX_AL -> DIVTMP                                 ; DIVTMP=lower dividend\nDSTREG -> TMPB                                   ; TMPB=divisor\nSIGMA             TMPB           PREDIV          ; |dividend|, |divisor|, save signs, first iteration\nSIGMA             TMPB            DIV7   RPT DLY ; main division loop\nSIGMA             TMPB            DIV5           ; non-restoring correction\nSIGMA             TMPB           IDIV1           ; correct remainder sign\nSIGMA                            PASS\nSIGMA  -> TMPB                                   ; save remainder\nRESULT                           IDIV2           ; correct quotient sign -> SIGMA\nTMPB   -> eDX_AH                         RNI     ; write remainder\nSIGMA  -> eAX_AL                                 ; write quotient\n```\n\nThe key micro-ops are:\n\n- **PREDIV**: Computes absolute values of dividend and divisor, saves their signs in internal flip-flops, and performs the first division iteration\n- **IDIV1**: Corrects the remainder's sign (remainder has same sign as dividend)\n- **IDIV2**: Corrects the quotient's sign (negative if operand signs differ)\n\nThis explains why IDIV takes 5 more cycles than DIV: the extra cycles handle sign computation and correction.\n\n## Additional notes\n\nOne of the biggest hurdles in deciphering a CPU's microcode is interpreting the role and meaning of each micro-operation and constant. Their interdependence often makes the process both challenging and fascinating. 
Consider BITS_V: at first glance, seeing it used in LDCNTR and loop logic, you might assume it simply represents the instruction’s bit width (8, 16, or 32), meaning the RPT loop would run for that many iterations. This assumption seemed to suffice for MUL and DIV. However, when applied to IDIV and AAM, the microcode repeatedly failed to function as expected. After many hours spent troubleshooting, I finally came across a clue in a seemingly unrelated part of the microcode:\n\n```\n; PUSHAd\nESP               BITS_V         SUB         DLY      0\nSIGMA     INDSTK  -1             ADD             IN=+\n...\nSIGMA  -> eSP                                DLY\n```\n\nThis finally gave me the hint that BITS_V is `width-1` instead of `width`. Here PUSHA pushes 8 registers to the stack, so SP should be decreased by `8*2=16` or `8*4=32` bytes. The existence of `SIGMA-1` (SIGMA, -1, ADD) after `SIGMA=ESP-BITS_V` (ESP, BITS_V, SUB) clearly indicates that BITS_V is one less than 16 or 32.\n\n## Comparison with modern CPUs\n\nThe 386's iterative approach to multiplication and division was state-of-the-art for its time, but modern x86 processors have [moved far beyond it](https://uops.info/):\n\n| Era | Processor | 32-bit MUL | 32-bit DIV |\n|---|---|---|---|\n| 1985 | 80386 | 9-38 cycles | 38 cycles |\n| 1993 | Pentium | 10 cycles | 41 cycles |\n| 2000s | Core 2 | 3-4 cycles | 17-41 cycles |\n| 2020s | Zen 3/Alder Lake | 3-4 cycles | 13-19 cycles |\n\nModern CPUs use dedicated multiplier arrays (often Booth-encoded Wallace trees) that can multiply 64-bit numbers in just a few cycles. Division remains slower because it's inherently sequential: each quotient bit depends on the previous remainder. However, modern CPUs use radix-4 or radix-16 division (computing 2-4 bits per cycle) and sophisticated prediction to speed things up.\n\nThe 386's \"one bit per cycle\" approach is elegant in its simplicity and its reuse of the main ALU. 
For an FPGA implementation, this microcode-driven design is actually quite practical: it minimizes hardware while still achieving reasonable performance.\n\nFollow me on X ([@nand2mario](https://x.com/nand2mario)) for updates, or use [RSS](/feed.xml).\n\nCredits: This analysis of the 80386 draws on the microcode disassembly and silicon reverse engineering work of [reenigne](https://www.reenigne.org/blog/), [gloriouscow](https://github.com/dbalsom), [smartest blob](https://github.com/a-mcego), and [Ken Shirriff](https://www.righto.com). For a detailed examination of the silicon itself, see Ken Shirriff’s [silicon reverse engineering series on the 80386](https://www.righto.com/search/label/386).", "url": "https://wpnews.pro/news/80386-multiplication-and-division", "canonical_source": "https://nand2mario.github.io/posts/2026/80386_multiplication_and_division/", "published_at": "2026-01-24 00:00:00+00:00", "updated_at": "2026-05-23 15:53:07.582332+00:00", "lang": "en", "topics": ["semiconductor", "hardware"], "entities": ["Intel", "80386", "Motorola 68020", "IBM PC", "Windows 3.0", "Windows 95", "Linux"], "alternates": {"html": "https://wpnews.pro/news/80386-multiplication-and-division", "markdown": "https://wpnews.pro/news/80386-multiplication-and-division.md", "text": "https://wpnews.pro/news/80386-multiplication-and-division.txt", "jsonld": "https://wpnews.pro/news/80386-multiplication-and-division.jsonld"}}