{"slug": "comparing-an-lz4-decompressor-on-four-legacy-cpus", "title": "Comparing an LZ4 Decompressor on Four Legacy CPUs", "summary": "The article describes the author's experience implementing LZ4 decompression algorithms across multiple legacy 8-bit and 16-bit CPUs, including the Z80, Intel 8080, 8086, and 6502. The author explains how constraints from earlier SNES and Sega Genesis projects informed these implementations, and provides a detailed technical breakdown of the LZ4 block format and decompression process. The piece serves as a comprehensive comparison of how the same decompression function can be adapted to different assembly languages and processor architectures.", "body_md": "A few years ago, I needed to save some cartridge space in a SNES project, and I did so by [compressing that data](https://bumbershootsoft.wordpress.com/2023/10/21/lz4-decompression-on-the-65816/) with the [LZ4 compression algorithm](https://lz4.org/) from 2012. I found that working within the constraints of the SNES allowed me to take some convenient shortcuts during decompression and I have since found those constraints and shortcuts to be widely relevant on the 8- and 16-bit platforms my hobby programming often involves. I accumulated two more implementations (for Motorola’s [6809](https://bumbershootsoft.wordpress.com/2024/02/24/lz4-decompression-on-the-6809/) and [68000](https://bumbershootsoft.wordpress.com/2025/07/12/lz4-decompression-on-the-68000/) processors) as I worked on projects for the Tandy Color Computer and the Sega Genesis.\n\nNow, as a consequence of my [recent experiences with the Sorcerer](https://bumbershootsoft.wordpress.com/2026/04/25/staying-a-spell-with-the-exidy-sorcerer/), I found myself with *four* new implementations, for the Z80, two CPUs related to it (the Intel 8080 and 8086), and the 6502. 
As I mentioned at the time, the Z80 implementation in particular was *very* straightforward, and it informed the other three pretty directly.\n\nMy article about this for the 68000 was kind of desultory—it’s pretty clear on rereading it that the only thing I found interesting about it was the way the 16-bit code ended up bearing more in common with the 8-bit 6809 code than the 16-bit 65816 code. I expect this article to be my last hurrah on the subject, so I’d also like to give it more respect and more scope. I’ll start with my potted summary of how LZ4 works and what variations I made to it, but then also look at *why* LZ4 is so friendly to the Z80, especially as it compares to the 8080 and 6502. After that, I’ll use the LZ4 implementation as a worked example of [last week’s article comparing all these CPUs](https://bumbershootsoft.wordpress.com/2026/05/02/comparing-the-z80-and-6502-to-their-relatives/) by showing how the Z80 implementation can be easily transformed into the 8080 and x86 implementations, and some of the very different implementation decisions the 6502 version must make.\n\n**Fair warning:** This article is very long and very dry compared to my usual fare. If you want to see the same function implemented in four different assembly languages, this is the good stuff and go on and have fun. If not, maybe come back next week when I expect to be goofing around with high level languages again.\n\n## A Review of LZ4\n\n*(I’ve repeated this section in every implementation article so that readers don’t have to click back for context; if you’ve seen this all before, feel free to skip ahead to the section “Other Restrictions on LZ4 Encoders.” You’re not missing anything.)*\n\nLZ77-class decompressors all work on the same two basic tricks:\n\n- If a string of data repeats itself throughout the text, you specify an offset and a length within the output to copy. 
This will be shorter than repeating the data itself.\n- If the length of the amount to repeat kicks you past the original end of the output, keep repeating the values that you originally wrote. This matches what happens with a naïve copy loop, anyway, so this normally works out. It also means that this class of decompressors can do run-length encoding for free, even when the run is a copy of multiple bytes instead of just one.\n\nLZ4’s block format hinges on the insight that we can think of a compressed stream as *alternating* between strings of copied values and backreferenced ranges. Backreferences might follow one another directly, but if two blocks of literals are adjacent, that’s just a single block of literals. As such, the fundamental unit of decompression is a string (possibly length zero) of literals, followed by a backreference. The format specification calls this pair a *sequence.*\n\nA sequence begins with a single byte giving the length of both the next literal string and the next backreference; the top four bits represent the string length directly, and the bottom four bits represent the backreference length after adding four to it. (If your backreference is shorter than four, it’s not worth the space to create a new sequence compared to just copying the literals again. Thus, four is the shortest meaningful backreference.) This is followed by that many literal characters to copy into the output, and then a two-byte little-endian value to indicate how far back in the output to copy the backreference from.\n\nIt may, of course, come to pass that your literal string is longer than 15, or your backreference is longer than 19. To represent this, the length nybble for that part is recorded as 15 and additional bytes are added to the stream to indicate how much to add to the length. These bytes occur right after the initial lengths byte for the literal string, or just after the offset for the backreference. 
Bytes are read and added to the length until one of them is not 255; then that value too is added in to produce the final result. (This does mean that a literal length of exactly fifteen needs an extra length byte of zero to indicate that we did in fact mean fifteen exactly.)\n\n### Our Variation\n\nEncoders have to worry about a few extra constraints, because there are old decoders that rely on hacks to allow decompression to go way faster. Since we’re decoding, we don’t have to care about those, but one of them is extremely useful to us: We are guaranteed that the final sequence contains only literals. The compressed data ends just before what would otherwise be an offset word.\n\nThe other important fact about offset words is that they may never be zero. After all, that would mean we would need to copy from underneath the very byte we were writing!\n\nCombined, these facts are what let us dispense with the traditional frame data: we can simply null-terminate the compressed data with a pair of zero bytes and use “our alleged backreference has offset zero” as the signal to stop decompressing.\n\n### Other Restrictions on LZ4 Encoders\n\nThere are three restrictions placed on the LZ4 encoder when creating compressed blocks. We only rely on the first.\n\n- The last sequence is only literals.\n- The last sequence has *at least five* literal characters in it, unless it is the only sequence in the entire block.\n- The backreference in the next-to-last sequence begins *at least 12 bytes before the end of the block.*\n\nThe specification explicitly states that a consequence of this is that no string of fewer than 13 bytes can be compressed. 
It also describes these rules as being “in place to ensure compatibility with a wide range of historical decoders which rely on these conditions for their speed-oriented design.” That’s interesting to me, because my own exploitation of these rules was to ensure correctness more easily and to reduce the amount of state the decoder needed. Speed was not really involved at all.\n\nI can, at least, come up with an easy efficiency-oriented reason to insist on the last sequence being only literals; the LZ4 frame format (which I have discarded) signals termination by pre-declaring the size of the compressed data. Insisting on ending with literals means we only have to actually adjust or check that counter after copying a block of literals. Without this restriction, we’d have to juggle the check in with reading the *extended length bytes of the backreference.*\n\nI am utterly in the dark as to the advantages the last two restrictions might grant, though, particularly since while decoding we generally won’t *know* that we’re in the last two sequences and neither of these restrictions applies to any *previous* sequences within the block. My best guess on the second is that for every block but the last you can transfer four bytes unconditionally with a 32-bit move and just adjust your destination pointer if it turns out that copied too much, and you will still (thanks to the minimum backreference size being four bytes long) never blow out your destination buffer. I don’t even have *bad* guesses for what the final restriction buys us.\n\n## LZ4 On Our Various Chips\n\nBefore digging into the details of the implementations, let’s take a high-level look at how the decompression algorithm interacts with the capabilities of our four CPUs.\n\n### Our Yardstick: The Zilog Z80\n\nThe LZ4 decompression algorithm revolves around making block copies, which the Z80 is quite good at thanks to the **`LDIR`** instruction. However, that is not the only way it fits well with the chip. 
We can only conveniently use two pointers at once on the Z80, with the remaining three general-purpose registers managing the overall computation and the index registers kept at a bit of a remove from the rest of the instruction set. LZ4 only needs two long-lived pointers as well: the source and destination pointers. A third pointer value shows up regularly as the alternate source pointer when copying from a backreference; however, this pointer effectively *temporarily replaces* the source pointer over its lifetime; it is, with one slight wobble, computed, used exclusively in place of the source pointer, and then thrown out. This means that to the extent that we *won’t* be able to fit everything in registers, our usage of the stack can restrict itself to storing temporary values.\n\nThe “wobble” shows up with long-running backreference copies. If the backreference part of a sequence is 19 bytes or longer, the remaining bytes in the length data appear *after* the offset. This requires us to stash the backreference pointer briefly after computing it so that we may read a few more bytes from the source pointer first. While the Z80 is not really well-suited to keeping local variables on a stack, it inherits an instruction from the 8080 that allows it to swap **`HL`** with the top value on the stack. This turns out to be sufficient to let us juggle our source pointers as needed, and everything works out.\n\n### Doing Things the Hard Way: The Intel 8080\n\nThe 8080 implementation almost exactly matches the Z80 one, except for assembler syntax. There are only two Z80-specific capabilities we use at all in the implementation: the **`LDIR`** block-copy command, and the single instruction **`SBC HL,BC`**. We replace the block copies with hand-written loops, and break out the 16-bit subtraction into a pair of 8-bit subtractions. 
These operations mean we make heavier use of the accumulator, and as a result we do need to save and restore registers to the stack a bit more frequently.\n\nThe translation here was otherwise very direct; we can read the 8080 version as being *exactly* the Z80 version with some blocks of instructions filling in for the missing ones.\n\n### Doing More With More: The Intel 8086\n\nThe 8086 has enough spare registers, and enough expressive capability, that we do not need to touch the stack at any point the 8-bits do. We *do* still hit the stack a bit though; the customary ABI preserves some of the registers we use, and we shouldn’t violate that without a good reason. Furthermore, I expanded the scope of the 8086 implementation to work with “far pointers”, covering the entire 1MB address range. This required some juggling of the segment registers, and I felt the code was a bit clearer if it used the stack while doing so.\n\nBeyond that, we get extremely good use out of the “string” instructions that serve as autoincrementing loads, stores, and moves. Not only do these serve the same purpose as the Z80’s **`LDIR`** instruction, they are actually useful in even more places in the implementation. A consequence of this, however, is that we often find that the main accumulator gets overwritten by *different* intermediate computations than it does on the 8-bit editions, which led me to shift around some parts of the order of operations.\n\n### And Now For Something Completely Different: The MOS Technology 6502\n\nThe 6502 is vastly more restricted, when viewed from the perspective of the Z80 or its relatives. 
It has *zero* registers that can double as pointers, its 8-bit stack struggles to accomplish much beyond preserving values in strict stack discipline, and it offers even less direct support for 16-bit operations than the 8080.\n\nThe gap narrows considerably, however, when we shift the perspective to what the 6502 is actually *good* at. I dedicated [an entire article](https://bumbershootsoft.wordpress.com/2022/11/15/cleanly-organizing-8-bit-dataflows/) to the differences in implementation philosophy encouraged by the 6502 compared to the Z80, and much of that analysis applies here. The weaker stack and smaller register bank mean that the 6502 relies far more heavily on scratch space elsewhere in RAM, but that very reliance means that we have *no register pressure whatsoever.* All arithmetic goes through the accumulator, and 16-bit values can’t be used out of any registers at all, so conceptually all the operands are coming out of memory and returning to it once processing is complete.\n\nThe LZ4 algorithm is definitely a less comfortable fit overall, though. The 6502 is much less comfortable with pointers than the Z80, or even the 8080, is, and we will have only limited opportunities to work around that. In my old article, I noted that the 6502 vastly prefers working with arrays to working with pointers, and that furthermore, when we have multiple arrays, it wishes to be working with corresponding indices *within* those arrays. I ended up writing an entire helper function to implement what the Z80 does with **`LDIR`**, and it was able to provide that abstraction of corresponding indices; however, the decompressor *as a whole* is not afforded that luxury. 
It is, after all, kind of the whole point of compressed data that we will be advancing the destination’s pointer more rapidly than the source’s!\n\n## Designing the API\n\nThe Z80 API is straightforward, and is basically dictated by the fact that we’re going to lean on **`LDIR`** for our copy operations. **`HL`** will be our source pointer, pointing to the beginning of the compressed data, and **`DE`** will be our destination pointer, pointing to the beginning of the buffer we wish to fill with decompressed data. Registers **`ABC`** will all be scratch registers for the function, and on function exit, both **`HL`** and **`DE`** will point one byte past the last byte read or written.\n\nAs I mentioned above, the 8080 implementation is extremely close to the Z80 one and as a result it will enjoy an identical API.\n\nThe 8086 was a bit more fraught. We’ve now reached a point where things like “calling conventions” actually exist, and the C compilers I’ve worked with are pretty consistent about what they expect. **`BP`** is used as a frame pointer to access arguments that were pushed on the stack prior to the call, 16-bit return values are returned in **`AX`**, the high 16 bits of 32-bit return values are returned in **`DX`**, those two along with **`CX`** are scratch registers, and everything else is preserved. This is extremely inconvenient for comparison purposes with my other implementations, so I’ve bent it heavily.\n\nI have retained the notion that the function’s API should closely match the block-copy instruction. For the 8086, that means these must be *far pointers* that range over the full 1MB of address space, with **`DS:SI`** as the source pointer and **`ES:DI`** as the destination pointer. While these are normally all preserved, I let the routine update **`SI`** and **`DI`** for use as a pair of return values (again, pointing just past their final byte processed). 
Those two pointers aside, I do also ensure that all segment registers are preserved alongside **`BP`** and **`BX`**.\n\nThe 6502 API is, like the others, dictated by the actual processing we do, but since its architecture is so different, the API is likewise distinct. Here, we set aside four bytes of scratch space in the zero page to hold the source and destination pointers. The routine updates those memory locations directly, leaving them in their final locations on function exit. Four additional scratch bytes are necessary as well, but I implemented the function such that they do not need to be in the zero page. Like most functions of any complexity, this one trashes all three of the 6502’s registers.\n\n## Implementation on the Z80 and Intel Chips\n\nLet’s now look at the implementations in detail. The 8080, Z80, and 8086 implementations are so similar that it’s worth simply tracking them in parallel.\n\nWe open with a function prologue… or we would, if we had one. Only the 8086 has one; it needs to save off the **`BX`** register because it will be trashing it over the course of the function and it needs to restore the caller’s value on the way out.\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             -----------------\nlz4dec:                                                     push bx\n```\n\nAfter that, we begin our main loop by reading the next byte from the compressed data and advancing the source pointer. We’ll need to split this into two 4-bit values over the course of the function, so we’ll need to stash a copy of the original byte for later.\n\nThe 8080 and Z80 code here is actually identical, though the instructions look very different. When reading a byte out of the memory location specified by **`HL`**, 8080 syntax refers to it as the pseudo-register **`M`**. 
Furthermore, instructions that work on register pairs are distinct from those that work on single registers, so they are usually named only by their high byte (so, **`H`** for what the Z80 calls **`HL`**, and so on). A further exception, which we also see here, is that the register pair that the Z80 calls **`AF`** (the accumulator and flags) is instead called **`PSW`** (the processor status word).\n\nThe 8086 code, on the other hand, is much different. Instead of pushing the byte read to the stack, it simply stores it in the **`BL`** register where it will be left alone until needed. More interestingly, it uses one of the CPU’s “string processing” instructions. The three basic instructions are **`LODSB`**, **`STOSB`**, and **`MOVSB`**:\n\n- `LODSB` (“Load String Byte”) is equivalent to `MOV AL,[DS:SI]; ADD SI,1`. This instruction replaces the first two instructions in the 8-bit case.\n- `STOSB` (“Store String Byte”) is the reverse; it is equivalent to `MOV [ES:DI],AL; ADD DI,1`.\n- `MOVSB` (“Move String Byte”) combines them into `MOV [ES:DI], [DS:SI]; ADD SI,1; ADD DI,1`. This is roughly equivalent to the Z80’s `LDI` instruction.\n- It is also possible to move 16-bit words (with `AX` as the other register in the load and store cases) by replacing the `B` with a `W`. Once the CPU goes 32-bit, a double-word variant also appears relying on `EAX`.\n- There is a “direction” flag that causes the pointers to be decremented instead of incremented; calling conventions all demand that this flag be set to increment mode at all function boundaries, and we don’t mess with it here.\n- The store- and move-string instructions may be repeated by prefixing the instruction with `REP`. 
This uses `CX` as a count register, and functions analogously to `LDIR` or `LDDR` depending on the direction flag.\n\nThe end result is that in this brief snippet the 8086 code is one instruction shorter:\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             -----------------\n.main:  mov a,m               ld a,(hl)                     lodsb\n        inx h                 inc hl\n        push psw              push af                       mov bl,al\n```\n\nNow we need to handle the literals. The 8086 code is, again, pretty straightforward; it isolates the high nybble by shifting **`AL`** right 4, then skips ahead to backreference processing if the result is zero. Otherwise it calls our **`.rdlen`** helper function to finalize the length and does the block copy with **`REP MOVSB`**. The Z80 and 8080 code is similar, with **`LDIR`** handling the block copy on the Z80 side and the 8080 (which lacks this instruction) filling it in with a more manual loop. There’s an interesting difference in the way they collect the high nybble, though:\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n        rrc                   rrca                          mov cl,4\n        rrc                   rrca                          shr al,cl\n        rrc                   rrca\n        rrc                   rrca\n        ani 15                and 15\n        jz .bkref             jr z,.bkref                   jz .bkref\n        call .rdlen           call .rdlen                   call .rdlen\n.lp1:   mov a,m               ldir                          rep movsb\n        stax d\n        inx h\n        inx d\n        dcx b\n        mov a,c\n        ora b\n        jnz .lp1\n```\n\nThe 8080, it turns out, doesn’t have any shift instructions. 
Instead, it has four “rotate accumulator” instructions, two of which are 9-bit rotations through the carry and two of which are 8-bit rotations strictly within the register (with the carry bit mirroring the bit that moves from least to most significant). On the 8080, we need to rotate right four times to swap the high and low nybbles and then mask out the upper bits. On the Z80, we actually *have* a proper shift-right instruction **`SRL A`** but this instruction, as part of its extended instruction set, is 2 bytes long and takes 8 cycles to execute, while **`RRCA`**, which it inherited from the 8080, is only 1 byte and takes only 4 cycles to execute. Copying the 8080’s more roundabout processing is both smaller and faster. This is one of those cases I noted more abstractly last week: the Z80 rewards staying within the 8080’s constraints when it’s not very awkward to do so.\n\nThe literals of this sequence handled, we now move on to the backreference part. Step one is always to read a 16-bit value from the compressed data and end decompression if it is zero. 
Once again, the string operations dramatically simplify the memory access on the 8086 side, and its 16-bit registers mean that the zero test may be done with a single operation as well:\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n.bkref: mov c,m               ld c,(hl)                     lodsw\n        inx h                 inc hl\n        mov b,m               ld b,(hl)\n        inx h                 inc hl\n        mov a,c               ld a,c                        test ax,ax\n        ora b                 or b\n        jz .done              jr z,.done                    jz .done\n```\n\n(The **`TEST`** instruction is basically an **`AND`** that throws away the result; when passed the same register twice it basically asks the CPU to set the Zero and Sign flags according to that register’s value.)\n\nAt this point, things start to diverge. Several things need to happen, and which operations interfere with which differs on each chip:\n\n- Construct the backreference pointer by subtracting the offset we just read from the destination pointer.\n- Read the rest of the backreference length out of the original source pointer, and add 4 to it.\n- Do the block copy with the backreference pointer standing in for the source pointer.\n- End this sequence with the source pointer one byte past the last length byte we read (or just past the offset, if this backreference was 18 characters or less).\n\nKeeping the source pointer happy means it will need to be saved and restored to the stack, on the 8-bits, and that also means that we’ll need to pop the length byte before pushing the source pointer onto the stack. This all works out very neatly, with a high level procedure something like this:\n\n- Pop the length byte from the stack and isolate the low nybble this time.\n- Save `HL` to the stack. 
This is our source pointer that we’ll be setting aside, but also then…\n- Compute `HL = DE-BC` to get our backreference pointer. The Z80’s extended instructions make this easy. The 8080 has to do more work, which also means re-saving and re-restoring the accumulator.\n- *Swap the backreference and source pointers,* finalize the backreference length by reading any extra length bytes, then swap them *back* and do the bulk copy.\n- Pop the source pointer back into `HL`, ready to begin the next sequence, and jump back to the top to begin processing the next sequence of literals.\n\nThe 8086, however, has to reorganize things a bit. The 8-bits were able to load their 16-bit offset directly into **`BC`** because that is what their 16-bit math demanded, and that meant that they were free to use **`A`** to hold the initial backreference length during the subtraction. The 8086, however, loaded *into* **`AX`** with its **`LODSW`** instruction, and needs to completely finish computing the backreference pointer before dealing with the length at all. Fortunately, since it isn’t using the stack to hold temporary values, this is just a matter of register moves. 
The backreference pointer is computed in **`DX`** and exchanged with **`SI`** as needed; the saved length value remains in **`BL`** for as long as we need it and is copied over once it needs to be passed to **`.rdlen`**.\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n        pop     psw           pop af                        ; Below\n        ani     15            and 15\n        push    h             push hl\n        mov     h,d           ld h,d                        mov dx,di\n        mov     l,e           ld l,e\n        push    psw           or a                          sub dx,ax\n        mov     a,l           sbc hl,bc\n        sub     c\n        mov     l,a\n        mov     a,h\n        sbb     b\n        mov     h,a\n        pop     psw\n        xthl                  ex (sp),hl\n        ; above               ; above                       mov al,bl\n                                                            and al,15\n        call .rdlen           call .rdlen                   call .rdlen\n        inx b                 inc bc                        add cx,4\n        inx b                 inc bc\n        inx b                 inc bc\n        inx b                 inc bc\n        xthl                  ex (sp),hl                    xchg dx,si\n                                                            push ds\n                                                            push es\n                                                            pop ds\n.lp2:   mov a,m               ldir                          rep movsb\n        stax d\n        inx h\n        inx d\n        dcx b\n        mov a,c\n        ora b\n        jnz .lp2\n        pop h                 pop hl                        mov si,dx\n                                                            pop ds\n        jmp .main             jr .main            
          jmp .main\n```\n\nThere’s one last subtlety we can see in the 8086 code; when we swap the pointers, we also need to copy the value of **`ES`** into **`DS`** for the copy so that the backreference pointer has the correct segment as well, then restore it once we’re done. It’s a bit inconvenient to move values in and out of segment registers; I just do stack operations here because it’s clear and simple.\n\nWe also can see some counterintuitive optimization in the 8-bit code; we add 4 to **`BC`** simply by executing **`INC BC`** four times. This turns out to be both faster and shorter than trying to add 4 to **`BC`** directly with 8- or 16-bit math.\n\nAll that’s left in the main function is the exit; each function needs to restore the stack and return. On the 8-bits, this involves popping the unused backreference length off the stack; on the 8086, it involves restoring the **`BX`** register, which we aren’t allowed to permanently trash in our calling convention.\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n.done:  pop psw               pop af\n                                                            pop bx\n        ret                   ret                           ret\n```\n\nNow we need to implement the **`.rdlen`** helper function. This function takes an initial length in **`A`** (**`AL`**), and returns a final length in **`BC`** (**`CX`**). If the initial length is under 15, that is also the final length; otherwise it keeps reading bytes and adding them to the total length until we read a byte whose value is not 255. 
The first thing to do is to copy our 8-bit (really, 4-bit) value into the 16-bit result and quit immediately if the value isn’t 15:\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n.rdlen: mvi b,0               ld b,0                        xor ah,ah\n        mov c,a               ld c,a                        mov cx,ax\n        cpi 15                cp 15                         cmp cl,15\n        rnz                   ret nz                        jne .rdend\n```\n\nThe 8-bits handle this by zeroing the destination high byte and copying the low byte; the 8086 handles it by zeroing the high byte of **`AX`** and copying the whole word over. The 8086 also lacks **`RET NZ`**, and instead jumps to the end of the function if it needs to quit immediately.\n\nReading an 8-bit value and adding it to a 16-bit counter is old hat for us at this point, but the 8086 gets to be a little sneaky here; since it zeroed out **`AH`** at the top of the function, it can load bytes exclusively into **`AL`** and then just do 16-bit adds. The 8-bit chips need to use the accumulator to carry out the add, but also need to preserve the byte read for the final check, so there’s some stack work here too.\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n.rdlp:  mov a,m               ld a,(hl)                     lodsb\n        inx h                 inc hl\n        push psw              push af                       add cx,ax\n        add c                 add c\n        jnc .rdok             jr nc,.rdok\n        inr b                 inc b\n.rdok:  mov c,a               ld c,a\n        pop psw               pop af\n```\n\nFinally we loop back if the value we had read was 255. 
The easiest way to check this on all platforms is to do an 8-bit increment on it and see if that makes it zero.\n\n```\n        Intel 8080            Zilog Z80                     Intel 8086\n    ------------------    -----------------             ------------------\n        inr a                 inc a                         inc al\n        jz .rdlp              jr z,.rdlp                    jz .rdlp\n.rdend: ret                   ret                           ret\n```\n\nThat gets us all three implementations for the ’80 cousins. Let’s shift gears to the more distinct 6502 now.\n\n## Implementation on the 6502\n\nIt turns out that, at least at first, the 6502 code doesn’t diverge that much from the Z80. While the golden rule of 6502 data processing is that we want that data to be in arrays and processing to be in corresponding elements, the only part of LZ4 decompression that lets us *do* that is the bulk copy that the Z80 handles with **`LDIR`**. Pretty much everywhere else, we will be locking down the **`Y`** register and using the “indirect indexed” mode as an expensive pointer mechanism. 
With **`src`** and **`dest`** as 4 bytes of data in the zero page that also serve as our arguments and return values, we can map those to **`HL`** and **`DE`** and our handling of the literals in a sequence tracks very closely:\n\n```\n        MOS 6502              Zilog Z80\n    ----------------      -----------------\nlz4dec: ldy #$00              ld a,(hl)\n        lda (src),y\n        inc src               inc hl\n        bne +\n        inc src+1\n+       pha                   push af\n        lsr                   rrca\n        lsr                   rrca\n        lsr                   rrca\n        lsr                   rrca\n                              and 15\n        beq .bkref            jr z,.bkref\n        jsr .rdlen            call .rdlen\n        jsr .ldir             ldir\n```\n\n(Syntax note: 6502 code requires a lot more of these short-term jumps when it’s doing its 16-bit math, so I’m using **`+`** as a temporary label. Branching to **`+`** just means “skip to the next **`+`** label” in each case.)\n\nThe only real divergence at this point, beyond **`INC HL`** and **`LDIR`** requiring more work, is that the 6502 does have proper logical-shift instructions and doesn’t need a masking step. Our replacement for **`LDIR`** turns out to have quite a lot going on in it, so we’ll hold off on the details there. Suffice to say that we have created another 16-bit variable (not necessarily in the zero page this time) named **`.count`** and it is the length value produced by **`.rdlen`** and consumed by **`.ldir`**, much like those uses of **`BC`** in the Z80 code.\n\nThe backreference processing on the 6502 diverges *completely* from the Z80 code, to the point that there’s no sense in putting the Z80 code alongside it for reference. 
It begins by reading the 16-bit offset from the compressed data stream to produce the backreference pointer, much like on the Z80, but instead of shuffling data around on the top of the stack, or even storing out the offset to more scratch RAM, we instead *do the subtraction as we load* and store *that* result to our scratch RAM under the name **`.bksrc`**:\n\n```\n        MOS 6502\n    ----------------\n.bkref: sec\n        lda dest\n        sbc (src),y\n        sta .bksrc\n        iny\n        lda dest+1\n        sbc (src),y\n        sta .bksrc+1\n```\n\nAs part of this read we’ve bumped up the **`Y`** register to let us quickly index the next two bytes. Our next step will be to renormalize the **`src`** pointer and the **`Y`** register so that when we visit the **`.rdlen`** function it will be ready to go:\n\n```\n        MOS 6502\n    ----------------\n        dey\n        clc\n        lda src\n        adc #$02\n        sta src\n        bcc +\n        inc src+1\n+\n```\n\nUnlike the Z80 and 8080, where repeated 16-bit **`INC`** instructions are the fastest way to add small numbers, the calculus flips in favor of direct 16-bit math on the 6502 basically immediately. Note that we check the carry flag this time instead of the zero flag, since we could have skipped over the page boundary without hitting it exactly, and also note that it remains both shorter and faster to do this branch-or-increment dance than it is to add zero to the high byte with carry.\n\nWith this work done, we now need to see if the offset was zero, which is a bit more problematic here because we didn’t actually *store* it. We used it immediately to create **`.bksrc`**. What we *can* do, however, is see whether the result of the subtraction changed anything. 
If **`.bksrc != dest`**, then the offset wasn’t zero and we can proceed with processing:\n\n```\n        MOS 6502\n    ----------------\n        lda dest\n        cmp .bksrc\n        bne .bkok\n        lda dest+1\n        cmp .bksrc+1\n        bne .bkok\n```\n\nOtherwise, it *was* zero and we need to pop our useless length byte and return, just like on the Z80.\n\n```\n        MOS 6502\n    ----------------\n        pla\n        rts\n```\n\nAssuming everything was fine, though, it’s time to pop our *still useful* length byte, isolate the backreference part, and complete the length read, including adding the extra four to the counter. As on the 8086, no juggling is necessary here because we never displaced the **`src`** pointer.\n\n```\n        MOS 6502\n    ----------------\n.bkok:  pla\n        and #$0f\n        jsr .rdlen\n        clc\n        lda .count\n        adc #$04\n        sta .count\n        bcc +\n        inc .count+1\n+\n```\n\n*Now* it is time to juggle some values. We will be hardcoding the pointer locations for our **`.ldir`** helper function as **`src`** and **`dest`**, so we need to stash the original **`src`** and replace it with the value in **`.bksrc`** before the call.\n\n```\n        MOS 6502\n    ----------------\n        lda src\n        ldx .bksrc\n        pha\n        stx src\n        lda src+1\n        ldx .bksrc+1\n        pha\n        stx src+1\n        jsr .ldir\n```\n\nStackwork can be expensive, but using the stack here produces shorter code than just using **`.bksrc`** itself as swap space, and if **`.bksrc`** is not on the zero page, the code is faster, too.\n\nWe are now at the very end of the loop. 
This is where we finally re-sync with the Z80 code, restoring the original **`src`** pointer and jumping back to the start of the loop:\n\n```\n        MOS 6502              Zilog Z80\n    ----------------      -----------------\n        pla                   pop hl\n        sta     src+1\n        pla\n        sta     src\n        jmp     lz4dec        jr lz4dec\n```\n\n### Implementing the Length Reader\n\nWith the main function complete, it’s time to move on to the helper functions. Both of these end up relying on the fact that I’ve kept the **`Y`** register mostly locked to zero, and *strictly* locked to zero at all loop points and function boundaries.\n\nWe’ll look at **`.rdlen`** first. It ends up synchronizing much more closely with the Z80 version. There isn’t much to say about it that we didn’t say about the eighters’ implementations, so here I’ll just quote the two in parallel:\n\n```\n        MOS 6502              Zilog Z80\n    ----------------      -----------------\n.rdlen: sta     .count        ld c,a\n        sty     .count+1      ld b,0\n        cmp     #$0f          cp 15\n        bne     .rdone        ret nz\n.rdlp:  lda     (src),y       ld a,(hl)\n        inc     src           inc hl\n        bne     +\n        inc     src+1\n+       tax                   push af\n        clc                   add c\n        adc     .count\n        sta     .count\n        bcc     .rdok         jr nc,.rdok\n        inc     .count+1      inc b\n.rdok:                        ld c,a\n                              pop af\n        inx                   inc a\n        beq     .rdlp         jr z,.rdlp\n.rdone: rts                   ret\n```\n\nLike the 8086, the 6502 has no conditional return statement so we use a simple branch-to-a-normal-return-statement instead. 
Also like the 8086, I stash the accumulator in a register (here, **`X`**) instead of relying on the stack like the Z80 does.\n\nMore interestingly, I actually played around a bit with consolidating the updates to **`src`** at the end of the function, relying instead on **`INY`** instructions inside of it. Based on a few other drafts and some napkin math, this turned out to only become worth it once your sequences extended to a kilobyte or so. Given the data I work with, I stuck with the version shown here. I think it’s clearer, anyway.\n\n### Supporting LDIR on the 6502\n\nNow we turn to a *new* helper function that we’ve only needed on the 6502, and we finally get to write some code that plays well with indirect-indexed addressing. We need to write a function that does the same work that the **`LDIR`** or **`REP MOVSB`** instructions did, taking a 16-bit counter (in the **`.count`** variable) and copying that many bytes from the buffer in **`src`** to **`dest`**. At the end of the function, **`src`** and **`dest`** should each point one byte past the last byte copied in their respective buffers.\n\nThe 6502, unlike even the 8080, does not have a 16-bit decrement instruction, so we’re going to need to organize this as two nested loops, one for looping the low byte of the counter, and the other for looping the high byte. We would also really prefer to let our decrement operations double as our loop-end test, which on the 6502 means it has to be a test against zero. We want the loop to end with something like **`DEC COUNTER; BNE LOOP; DEC COUNTER+1; BNE LOOP`**, but that carries a subtle trap. If the initial value of **`COUNTER`** here is *an even multiple of 256* (that is, if the byte in **`COUNTER`** is zero to start with), this actually works. If it isn’t, you will end up iterating the wrong number of times. A starting value of **`$0100`** loops 256 times, as we’d expect. 
A starting value of **`$0101`**, however, only loops once, and a starting value of **`$00d6`** will loop 65,494 times! The low byte runs down in 214 iterations, the high byte then wraps from zero to `$ff`, and 255 more passes of 256 iterations follow.\n\nThis bug isn’t unique to the 6502; the identical bug, for the identical reason, appears in the ColecoVision BIOS in some functions, because someone decided they wanted to rely on the automatic flag updates of the 8-bit decrements instead of doing a proper 16-bit check like we did in the 8080 code. The solution is simple enough, though; we need to check that low byte and increment the high byte if, and *only* if, the low byte isn’t zero.\n\n```\n        MOS 6502\n    ----------------\n.ldir:  ldx     .count\n        beq     .llp\n        inc     .count+1\n```\n\nI loaded into **`X`** instead of **`A`** here because I intend to keep the low byte of the counter in a register for faster access. We’ll need **`Y`** to index the **`src`** and **`dest`** arrays, and since the copy is in sync, we use it directly for both.\n\nThe core loop is pretty simple, with an odd little division of responsibility between the two index registers; the **`Y`** register counts up, incrementing the high byte of both **`src`** and **`dest`** any time it wraps around *as an offset.* The low bytes of **`src`** and **`dest`** can be completely unrelated and this all still works fine. The **`X`** register, on the other hand, is counting *down* the total bytes copied, and it decrements the high byte of the counter once it hits zero itself. That loop wastes no space:\n\n```\n        MOS 6502\n    ----------------\n.llp:   lda     (src),y\n        sta     (dest),y\n        iny\n        bne     +\n        inc     src+1\n        inc     dest+1\n+       dex\n        bne     .llp\n        dec     .count+1\n        bne     .llp\n```\n\nWe aren’t done when this loop is, though. We need to add the value in **`Y`** to both **`src`** and **`dest`** to get their final values right. 
We’ve been incrementing the high bytes along the way (we had to, in order to reach later parts of the buffer with just an 8-bit offset) and now we must add what’s left over. We also reset **`Y`** to zero on the way out so that the main function can use **`src`** and **`dest`** normally with no extra work.\n\n```\n        MOS 6502\n    ----------------\n        clc\n        tya\n        adc     src\n        sta     src\n        bcc     +\n        inc     src+1\n+       clc\n        tya\n        adc     dest\n        sta     dest\n        bcc     +\n        inc     dest+1\n+       ldy     #$00\n        rts\n```\n\nRelying on this function for both copies *technically* does some wasted work here; the backreference pointer doesn’t need that final update since we trash it immediately afterwards. Past a certain point, though, overall brevity and clarity have to win.\n\n## Catching Our Breath\n\nThis article ended up a lot longer than I expected, and in the service of an honestly kind of questionable goal. If you made it this far, I salute you. As for what *I’ve* ended up getting out of this, I’ve been [collecting these drafts in their own directory in my repository](https://github.com/michaelcmartin/bumbershoot/tree/master/asm/lz4core), so the practical outcome of this is that I now can LZ4-compress my data for Bumbershoot projects whenever I want.\n\nFrom a craftsmanship standpoint, switching so rapidly between CPUs was an interesting experience for me. I’ve gotten quite a lot more experience with the Z80 lately, and working with both the 8080 and 8086 underlined a lot of the habits I’d picked up along the way. 
Usually I’ve stumbled a bit in my Z80 work, because the less appropriate habits of 6502 or 68000 programming would wrong-foot me from time to time, but working on this project is the first time in a long time where I’ve felt that stumbling in the *other direction.* I’ve had a much stronger Z80 focus over the past year or so and it’s really showing. On the plus side, I think that means at this point that my proficiency in both is up to par, and I simply need to remind myself to shift gears properly as needed from project to project.\n\nNext week we’ll do something more freewheeling and fun, I promise.", "url": "https://wpnews.pro/news/comparing-an-lz4-decompressor-on-four-legacy-cpus", "canonical_source": "https://bumbershootsoft.wordpress.com/2026/05/09/comparing-an-lz4-decompressor-on-four-legacy-cpus/", "published_at": "2026-05-20 11:58:24+00:00", "updated_at": "2026-05-22 23:05:01.414093+00:00", "lang": "en", "topics": ["hardware", "open-source"], "entities": ["LZ4", "SNES", "Motorola", "Tandy Color Computer", "Sega Genesis", "Z80", "Intel 8080", "Intel 8086"], "alternates": {"html": "https://wpnews.pro/news/comparing-an-lz4-decompressor-on-four-legacy-cpus", "markdown": "https://wpnews.pro/news/comparing-an-lz4-decompressor-on-four-legacy-cpus.md", "text": "https://wpnews.pro/news/comparing-an-lz4-decompressor-on-four-legacy-cpus.txt", "jsonld": "https://wpnews.pro/news/comparing-an-lz4-decompressor-on-four-legacy-cpus.jsonld"}}