{"slug": "the-absurdly-complicated-circuitry-for-the-386-processor-s-registers", "title": "The absurdly complicated circuitry for the 386 processor's registers", "summary": "The Intel 386 processor, introduced in 1985, was the first 32-bit CPU in the x86 architecture and contains numerous registers that provide much faster storage than main memory. The processor's register implementation is surprisingly complex, using six different circuit types for its 30 registers instead of a standard design, with features like triple-porting for simultaneous reads and writes, interleaved bit storage, and support for 8, 16, or 32-bit accesses. This complexity arises partly from backward compatibility requirements, as registers like EAX can be accessed as 32-bit, 16-bit, or even 8-bit values.", "body_md": "The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 architecture. Like most processors, the 386 contains numerous registers; registers are a key part of a processor because they provide storage that is much faster than main memory. The register set of the 386 includes general-purpose registers, index registers, and segment selectors, as well as registers with special functions for memory management and operating system implementation. 
In this blog post, I look at the silicon die of the 386 and explain how the processor implements its main registers.\n\nIt turns out that the circuitry that implements the 386's registers is much more complicated than one would expect.\nFor the 30 registers that I examine, instead of using a standard circuit, the 386 uses *six* different circuits,\neach one optimized for the particular characteristics of the register.\nFor some registers, Intel squeezes register cells together to double the storage capacity.\nOther registers support accesses of 8, 16, or 32 bits at a time.\nMuch of the register file is \"triple-ported\", allowing two registers to be read simultaneously while a value is written\nto a third register.\nFinally, I was surprised to find that registers don't store bits in order: the lower 16 bits of each register are interleaved, while the upper 16 bits are stored linearly.\n\nThe photo below shows the 386's shiny fingernail-sized silicon die under a special metallurgical microscope. I've labeled the main functional blocks. For this post, the Data Unit in the lower left quadrant of the chip is the relevant component. It consists of the 32-bit arithmetic logic unit (ALU) along with the processor's main register bank (highlighted in red at the bottom). The circuitry, called the datapath, can be viewed as the heart of the processor.\n\nThe datapath is built with a regular structure: each register or ALU functional unit is a horizontal stripe of circuitry, forming the horizontal bands visible in the image. For the most part, this circuitry consists of a carefully optimized circuit copied 32 times, once for each bit of the processor. Each circuit for one bit is exactly the same width—60 µm—so the functional blocks can be stacked together like microscopic LEGO bricks. To link these circuits, metal bus lines run vertically through the datapath in groups of 32, allowing data to flow up and down through the blocks. 
Meanwhile, control lines run horizontally, enabling ALU operations or register reads and writes; the irregular circuitry on the right side of the Data Unit produces the signals for these control lines, activating the appropriate control lines for each instruction.\n\nThe datapath is highly structured to maximize performance while minimizing its area on the die. Below, I'll look at how the registers are implemented according to this structure.\n\n## The 386's registers\n\nA processor's registers are one of the most visible features of the processor architecture. The 386 processor contains 16 registers for use by application programmers, a small number by modern standards, but large enough for the time. The diagram below shows the eight 32-bit general-purpose registers. At the top are four registers called EAX, EBX, ECX, and EDX. Although these registers are 32-bit registers, they can also be treated as 16 or 8-bit registers for backward compatibility with earlier processors. For instance, the lower half of EAX can be accessed as the 16-bit register AX, while the bottom byte of EAX can be accessed as the 8-bit register AL. Moreover, bits 15-8 can also be accessed as an 8-bit register called AH. In other words, there are four different ways to access the EAX register, and similarly for the other three registers. As will be seen, these features complicate the implementation of the register set.\n\n[80386 Programmer's Reference Manual](http://www.bitsavers.org/components/intel/80386/230985-001_80386_Programmers_Reference_Manual_1986.pdf#page=44), page 2-8.\n\nThe bottom half of the diagram shows that the 32-bit EBP, ESI, EDI, and ESP registers can also be treated as 16-bit registers BP, SI, DI, and SP. 
Unlike the previous registers,\nthese ones cannot be treated as 8-bit registers.\nThe 386 also has six segment registers that define the\nstart of memory segments; these are 16-bit registers.\nThe 16 application registers are rounded out by the status flags and instruction pointer (EIP);\nthey are viewed as 32-bit registers, but their implementation is more complicated.\nThe 386 also has numerous registers for operating system programming, but I won't discuss them here, since they\nare likely in other parts of the chip.[1](#fn:system-regs)\nFinally, the 386 has numerous temporary registers that are not visible to the programmer but are used by the microcode\nto perform complex instructions.\n\n## The 6T and 8T static RAM cells\n\nThe 386's registers are implemented with static RAM cells, a circuit that can hold one bit. These cells are arranged into a grid to provide multiple registers. Static RAM can be contrasted with the dynamic RAM that computers use for their main memory: dynamic RAM holds each bit in a tiny capacitor, while static RAM uses a faster but larger and more complicated circuit. Since main memory holds gigabytes of data, it uses dynamic RAM to provide dense and inexpensive storage. But the tradeoffs are different for registers: the storage capacity is small, but speed is of the essence. Thus, registers use the static RAM circuit that I'll explain below.\n\nThe concept behind a static RAM cell is to connect two inverters into a loop. If an inverter has a \"0\" as input, it will output a \"1\", and vice versa. Thus, the inverter loop will be stable, with one inverter on and one inverter off, and each inverter supporting the other. Depending on which inverter is on, the circuit stores a 0 or a 1, as shown below. 
Thus, the pair of inverters provides one bit of memory.\n\nTo be useful, however, the inverter loop needs a way to store a bit into it, as well as a way to read out the stored bit.\nTo write a new value into the circuit, two signals are fed in, forcing the inverters to the desired new values.\nOne inverter receives the new bit value, while the other inverter receives the complemented bit value.\nThis may seem like a brute-force way to update the bit, but it works.\nThe trick is that the inverters in the cell are small and weak, while the input signals are higher current,\nable to overpower the inverters.[2](#fn:flip)\nThese signals are fed in through wiring called \"bitlines\"; the bitlines can also be used to read the value\nstored in the cell.\n\nTo control access to the register,\nthe bitlines are connected to the inverters through pass transistors, which act as switches to\ncontrol access to the inverter loop.[3](#fn:pass)\nWhen the pass transistors are on, the\nsignals on the write lines can pass through to the inverters. But when the pass transistors are off, the\ninverters are isolated from the write lines.\nThe pass transistors are turned on by a control signal, called a \"wordline\" since it controls access to a word\nof storage in the register.\nSince each inverter is constructed from two transistors, the circuit above consists of six transistors—thus this circuit is called a \"6T\" cell.\n\nThe 6T cell uses the same bitlines for reading and writing, so you can't read and write to registers simultaneously.\nBut adding two transistors creates an \"8T\" circuit that lets you read from one register\nand write to another register at the same time. 
(In technical terms, the register file is two-ported.)\nIn the 8T schematic below, the two additional transistors (G and H) are used for reading.\nTransistor G buffers the cell's value; it turns on if the inverter output is high, pulling the read output bitline low.[4](#fn:precharge)\nTransistor H is a pass transistor that blocks this signal until a read is performed on this register;\nit is controlled by a read wordline.\nNote that there are two bitlines for writing (as before) along with one bitline for reading.\n\nTo construct registers (or memory), a grid is constructed from these cells. Each row corresponds to a register, while each column corresponds to a bit position. The horizontal lines are the wordlines, selecting which word to access, while the vertical lines are the bitlines, passing bits in or out of the registers. For a write, the vertical bitlines provide the 32 bits (along with their complements). For a read, the vertical bitlines receive the 32 bits from the register. A wordline is activated to read or write the selected register. 
To summarize: each row is a register, data flows vertically, and control signals flow horizontally.\n\n## Six register circuits in the 386\n\nThe die photo below zooms in on the register circuitry in the lower left corner of the 386 processor.\nYou can see the arrangement of storage cells into a grid, but note that the pattern changes from row to row.\nThis circuitry implements 30 registers: 22 of the registers hold 32 bits, while the bottom ones are 16-bit registers.\nBy studying the die, I determined that there are six different register circuits,\nwhich I've arbitrarily labeled (*a*) to (*f*).\nIn this section, I'll describe these six types of registers.\n\nI'll start at the bottom with the simplest circuit: eight 16-bit registers that I'm calling type (*f*).\nYou can see a \"notch\" on the left side of the register file\nbecause these registers are half the width of the other registers (16 bits versus 32 bits).\nThese registers are implemented with the 8T circuit described earlier, making them dual ported:\none register can be read while another register is written.\nAs described earlier, three vertical bus lines pass through each bit: one bitline for reading and two bitlines\n(with opposite polarity)\nfor writing.\nEach register has two control lines (wordlines): one to select a register for reading and another to select a register for writing.\n\nThe photo below shows how four cells of type (*f*) are implemented on the chip.\nIn this image, the chip's two metal layers have been removed along with most of the polysilicon wiring, showing the underlying silicon.\nThe dark outlines indicate regions of doped silicon, while the stripes across the doped region correspond to transistor\ngates.\nI've labeled each transistor with a letter corresponding to the earlier schematic.\nObserve that the layout of the bottom half is a mirrored copy of the upper half, saving a bit of space.\nThe left and right sides are approximately mirrored; the irregular shape allows separate 
read and write wordlines\nto control the left and right halves without colliding.\n\nFour cells of type (*f*), separated by dotted lines. The small irregular squares are remnants of polysilicon that weren't fully removed.\n\nThe 386's register file and datapath are designed with 60 µm of width assigned to each bit.\nHowever, the register circuit above is unusual:\nthe image above is 60 µm wide but there are two register cells side-by-side.\nThat is, the circuit crams *two* bits in 60 µm of width, rather than one.\nThus, this dense layout implements two registers per row (with interleaved bits), providing twice the density of the other register circuits.\n\nIf you're curious to know how the transistors above are connected,\nthe schematic below shows how the physical arrangement of the transistors above corresponds to two of the 8T memory cells\ndescribed earlier.\nSince the 386 has two overlapping layers of metal, it is very hard to interpret a die photo with the metal layers.\nBut see my [earlier article](https://www.righto.com/2023/11/reverse-engineering-intel-386.html) if you want these photos.\n\nAbove the type (*f*) registers are 10 registers of type (*e*), occupying five rows of cells.\nThese registers are the same 8T implementation as before, but these registers are 32 bits wide instead of 16.\nThus, the register takes up the full width of the datapath, unlike the previous registers.\nAs before, the double-density circuit implements two registers per row.\nThe silicon layout is identical (apart from being 32 bits wide instead of 16), so I'm not including a photo.\n\nAbove those registers are four (*d*) registers, which are more complex.\nThey are triple-ported registers, so one register can be written while two other registers are read.\n(This is useful for ALU operations, for instance, since two values can be added and the result written back\nat the same time.)\nTo support reading a second register, another vertical bus line is added for each bit.\nEach cell has two more 
transistors to connect the cell to the new bitline.\nAnother wordline controls the additional read path.\nSince each cell has two more transistors, there are 10 transistors in total and the circuit is called 10T.\n\nFour memory cells of type (*d*). The striped green regions are the remnants of oxide layers that weren't completely removed, and can be ignored.\n\nThe diagram above shows four memory cells of type (*d*).\nEach of these cells takes the full 60 µm of width, unlike the previous double-density cells.\nThe cells are mirrored horizontally and vertically;\nthis increases the density slightly since power lines can be shared between cells.\nI've labeled the transistors `A` through `H` as before, as well as the two additional transistors `I` and `J` for the\nsecond read line.\nThe circuit is the same as before, except for the two additional transistors, but\nthe silicon layout is significantly different.\n\nEach of the (*d*) registers has five control lines. Two control lines select a register for reading, connecting the register\nto one of the two vertical read buses.\nThe three write lines allow parts of the register to be written independently: the top 16 bits, the next 8 bits, or the\nbottom 8 bits.\nThis is required by the x86 architecture, where a 32-bit register such as EAX can also be accessed as the 16-bit AX register,\nthe 8-bit AH register, or the 8-bit AL register.\nNote that reading part of a register doesn't require separate control lines: the register provides all 32 bits and\nthe reading circuit can ignore the bits it doesn't want.\n\nProceeding upward, the three (*c*) registers have a similar 10T implementation.\nThese registers, however, do not support partial writes so all 32 bits must be written at once.\nAs a result, these registers only require three control lines (two for reads and one for writes).\nWith fewer control lines, the cells can be fit into less vertical space, so the layout is slightly more compact than\nthe previous type (*d*) cells. 
The diagram below shows four type (*c*) rows above two type (*d*) rows.\nAlthough the cells have the same ten transistors, they have been shifted around somewhat.\n\nCells of type (*c*) above two cells of type (*d*).\n\nNext are the four (*b*) registers, which support 16-bit writes and 32-bit writes (but not 8-bit writes).\nThus, these registers have four control lines (two for reads and two for writes).\nThe cells take slightly more vertical space than the (*c*) cells due to the additional control line, but the layout is\nalmost identical.\n\nFinally, the (*a*) register at the top has an unusual feature: it can receive a copy of the value in the register just\nbelow it.\nThis value is copied directly between the registers, without using the read or write buses.\nThis register has 3 control lines: one for read, one for write, and one for copying.\n\nA cell of type (*a*), which can copy the value in the cell of type (*b*) below.\n\nThe diagram above shows a cell of type (*a*) above a cell of type (*b*).\nThe cell of type (*a*) is based on the standard 8T circuit,\nbut with six additional transistors to copy the value of the cell below.\nSpecifically, two inverters buffer the output from cell (*b*), one inverter for each side of the cell.\nThese inverters are implemented with transistors I1 through I4.[5](#fn:inverters)\nTwo transistors, S1 and S2, act as pass-transistor switches between these inverters and the memory cell.\nWhen activated by the control line, the switch transistors allow the inverters to overwrite the memory cell with\nthe contents of the cell below.\nNote that cell (*a*) takes considerably more vertical space because of the extra transistors.\n\n## Speculation on the physical layout of the registers\n\nI haven't determined the mapping between the 386's registers and the 30 physical registers, but I can speculate.\nFirst, the 386 has four registers that can be accessed as 8, 16, or 32-bit registers: EAX, EBX, ECX, and EDX.\nThese must map onto the (*d*) registers, which 
support these access patterns.\n\nThe four index registers (ESP, EBP, ESI, and EDI) can be used as 32-bit registers or 16-bit registers,\nmatching the four (*b*) registers with the same properties.\nWhich one of these registers can be copied to the type (*a*) register?\nMaybe the stack pointer (ESP) is copied as part of interrupt handling.\n\nThe register file has eight 16-bit registers, type (*f*).\nSince there are six 16-bit segment registers in the 386, I suspect the 16-bit registers are the segment registers and two additional registers.\nThe [LOADALL](https://web.archive.org/web/20210624172529/https://asm.inightmare.org/opcodelst/index.php?op=LOADALL)\ninstruction gives some clues, suggesting that the two additional 16-bit registers are\nLDT (Local Descriptor Table register) and TR (Task Register).\nMoreover, `LOADALL` handles 10 temporary registers, matching the 10 registers of type (*e*) near the bottom\nof the register file.\nThe three 32-bit registers of type (*c*) may be the\nCR0 control register and the DR6 and DR7 debug registers.\n\nIn this article, I'm only looking at the main register file in the datapath.\nThe 386 presumably has other registers scattered around\nthe chip for various purposes.\nFor instance, the Segment Descriptor Cache contains multiple registers similar to type (*e*), probably holding cache entries.\nThe processor status flags and the instruction pointer (EIP) may not be implemented as discrete registers.[6](#fn:flags-eip)\n\nTo the right of the register file, a complicated block of circuitry uses seven-bit values to select registers. Two values select the registers (or constants) to read, while a third value selects the register to write. 
I'm currently analyzing this circuitry, which should provide more insight into how the physical registers are assigned.\n\n## The shuffle network\n\nThere's one additional complication in the register layout.\nAs mentioned earlier, the bottom 16 bits of the main registers can be treated as two 8-bit registers.[7](#fn:datapoint)\nFor example, the 8-bit AH and AL registers form the bottom 16 bits of the EAX register.\nI explained earlier how the registers use multiple write control lines to allow these different parts of the register\nto be updated separately.\nHowever, there is also a layout problem.\n\nTo see the problem, suppose you perform an 8-bit ALU operation on the AH register, which is bits 15-8 of the EAX register. These bits must be shifted down to positions 7-0 so they can take part in the ALU operation, and then must be shifted back to positions 15-8 when stored into AH. On the other hand, if you perform an ALU operation on AL (bits 7-0 of EAX), the bits are already in position and don't need to be shifted.\n\nTo support the shifting required for 8-bit register operations, the 386's register file physically interleaves the bits of the two lower bytes (but not the high bytes).\nAs a result, bit 0 of AL is next to bit 0 of AH in the register file, and so forth.\nThis allows multiplexers to easily select bits from AH or AL as needed.\nIn other words, each bit of AH and AL is in almost the correct physical position, so an 8-bit shift is not required.\n(If the bits were in order, each multiplexer would need to be connected to bits that are separated by eight positions,\nrequiring inconvenient wiring.)[8](#fn:8086)\n\nThe photo above shows the shuffle network.\nEach bit has three bus lines associated with it: two for reads and one for writes, and these all get shuffled.\nOn the left, the lines for the 16 bits pass straight through.\nOn the right, though, the two bytes are interleaved.\nThis shuffle network is located below the ALU and above the register file, 
so data words are shuffled when stored in the\nregister file and then unshuffled when read from the register file.[9](#fn:constants)\n\nIn the photo, the lines on the left aren't quite straight.\nThe reason is that the circuitry above is narrower than the circuitry below.\nFor the most part, each functional block in the datapath is constructed with the same width (60 µm) for each bit.\nThis makes the layout simpler since functional blocks can be stacked on top of each other and the vertical bus wiring\ncan pass straight through.\nHowever, the circuitry above the registers (for the barrel shifter) is about 10% narrower (54.5 µm), so the wiring\nneeds to squeeze in and then expand back out.[10](#fn:width)\nThere's a tradeoff of requiring more space for this wiring versus the space saved by making the barrel shifter\nnarrower and Intel must have considered the tradeoff worthwhile.\n(My hypothesis is that since the shuffle network required additional wiring to shuffle the bits, it didn't take up\nmore space to squeeze the wiring together at the same time.)\n\n## Conclusions\n\nIf you look in a book on processor design, you'll find a description of how registers can be created from static memory cells. However, the 386 illustrates that the implementation in a real processor is considerably more complicated. Instead of using one circuit, Intel used six different circuits for the registers in the 386.\n\nThe 386's register circuitry also shows the curse of backward compatibility. The x86 architecture supports 8-bit register accesses for compatibility with processors dating back to 1971. This compatibility requires additional circuitry such as the shuffle network and interleaved registers. 
Looking at the circuitry of x86 processors makes me appreciate some of the advantages of RISC processors, which avoid much of the ad hoc circuitry of x86 processors.\n\nIf you want more information about how the 386's memory cells were implemented, I wrote a [lower-level article](https://www.righto.com/2023/11/reverse-engineering-intel-386.html) earlier.\nI plan to write more about the 386, so\nfollow me on Bluesky ([@righto.com](https://bsky.app/profile/righto.com)) or [RSS](https://www.righto.com/feeds/posts/default) for updates.\n\n## Footnotes and references\n\n1. The 386 has multiple registers that are only relevant to operating systems programmers (see Chapter 4 of the [386 Programmer's Reference Manual](http://www.bitsavers.org/components/intel/80386/230985-003_386DX_Microprocessor_Programmers_Reference_Manual_1990.pdf)). These include the Global Descriptor Table Register (GDTR), Local Descriptor Table Register (LDTR), Interrupt Descriptor Table Register (IDTR), and Task Register (TR). There are four Control Registers CR0-CR3; CR0 controls coprocessor usage, paging, and a few other things. The six Debug Registers for hardware breakpoints are named DR0-DR3, DR6, and DR7. The two Test Registers for TLB testing are named TR6 and TR7. I expect that these registers are in the 386's Segment Unit and Paging Unit, rather than part of the processing datapath. [↩](#fnref:system-regs)\n\n2. Typically the write driver circuit generates a strong low on one of the bitlines, flipping the corresponding inverter to a high output. As soon as one inverter flips, it will force the other inverter into the right state. To support this, the pullup transistors in the inverters are weaker than normal. [↩](#fnref:flip)\n\n3. The pass transistor passes its signal through or blocks it. In CMOS, this is usually implemented with a transmission gate with an NMOS and a PMOS transistor in parallel. 
The cell uses only the NMOS transistor, which is much worse at passing a high signal than a low signal. Because there is one NMOS pass transistor on each side of the inverters, one of the transistors will be passing a low signal that will flip the state. [↩](#fnref:pass)\n\n4. The bitline is typically precharged to a high level for a read, and then the cell pulls the line low for a 0. This is more compact than including circuitry in each cell to pull the line high. [↩](#fnref:precharge)\n\n5. Note that buffering is needed so the (*b*) cell can write to the (*a*) cell. If the cells were connected directly, cell (*a*) could overwrite cell (*b*) as easily as cell (*b*) could overwrite cell (*a*). With the inverters in between, cell (*b*) won't be affected by cell (*a*). [↩](#fnref:inverters)\n\n6. In the 8086, the processor status flags are not stored as a physical register, but instead consist of flip-flops scattered throughout the chip ([details](https://www.righto.com/2023/02/silicon-reverse-engineering-intel-8086.html)). The 386 probably has a similar implementation for the flags.\n\n   In the 8086, the program counter (instruction pointer) does not exist as such. Instead, the instruction prefetch circuitry has a register holding the current prefetch address. If the program counter address is required (to push a return address or to perform a relative branch, for instance), the program counter value is derived from the prefetch address. If the 386 is similar, the program counter won't have a physical register in the register file. [↩](#fnref:flags-eip)\n\n7. The x86 architecture combines two 8-bit registers to form a 16-bit register for historical reasons. The TTL-based [Datapoint 2200](https://www.righto.com/2023/08/datapoint-to-8086.html) (1971) system had 8-bit A, B, C, D, E, H, and L registers, with the H and L registers combined to form a 16-bit indexing register for memory accesses. 
Intel created a microprocessor version of the Datapoint 2200's architecture, called the 8008. Intel's 8080 processor extended the register pairs so BC and DE could also be used as 16-bit registers. The 8086 kept this register design, but changed the 16-bit register names to AX, BX, CX, and DX, with the 8-bit parts called AH, AL, and so forth. Thus, the unusual physical structure of the 386's register file is due to compatibility with a programmable terminal from 1971. [↩](#fnref:datapoint)\n\n8. To support 8-bit and 16-bit operations, the 8086 processor used a similar interleaving scheme with the two 8-bit halves of a register interleaved. Since the 8086 was a 16-bit processor, though, its interleaving was simpler than the 32-bit 386. Specifically, the 8086 didn't have the upper 16 bits to deal with. [↩](#fnref:8086)\n\n9. The 386's constant ROM is located below the shuffle network. Thus, constants are stored with the bits interleaved in order to produce the right results. (This made the ROM contents incomprehensible until I figured out the shuffling pattern, but that's a topic for another article.) [↩](#fnref:constants)\n\n10. The main body of the datapath (ALU, etc.) has the same 60 µm cell width as the register file. However, the datapath is slightly wider than the register file overall. The reason? The datapath has a small amount of circuitry between bits 7 and 8 and between bits 15 and 16, in order to handle 8-bit and 16-bit operations. As a result, the logical structure of the registers is visible as stripes in the physical layout of the ALU below. 
(These stripes are also visible in the die photo at the beginning of this article.)\n\nPart of the ALU circuitry, displayed underneath the structure of the EAX register.", "url": "https://wpnews.pro/news/the-absurdly-complicated-circuitry-for-the-386-processor-s-registers", "canonical_source": "http://www.righto.com/2025/05/intel-386-register-circuitry.html", "published_at": "2025-05-01 17:04:00+00:00", "updated_at": "2026-05-23 17:43:13.987745+00:00", "lang": "en", "topics": ["semiconductor", "hardware", "research"], "entities": ["Intel", "Intel 386"], "alternates": {"html": "https://wpnews.pro/news/the-absurdly-complicated-circuitry-for-the-386-processor-s-registers", "markdown": "https://wpnews.pro/news/the-absurdly-complicated-circuitry-for-the-386-processor-s-registers.md", "text": "https://wpnews.pro/news/the-absurdly-complicated-circuitry-for-the-386-processor-s-registers.txt", "jsonld": "https://wpnews.pro/news/the-absurdly-complicated-circuitry-for-the-386-processor-s-registers.jsonld"}}