{"slug": "design-notes-for-snestang-0-3", "title": "Design notes for SNESTang 0.3", "summary": "The SNESTang 0.3 project ports the Super Nintendo Entertainment System (SNES) to Tang FPGAs, using a phase-based clock design with a main work clock (wclk) at 10.8 MHz to simplify logic and reduce clock domains. The architecture separates the CPU, PPU, and audio processor (SMP/APU), each with dedicated memory, unlike the NES where the PPU shares memory with the CPU. The design is primarily adapted from the FpgaSnes and SNES_FPGA codebases, with SDRAM clocked at 64.8 MHz (6x wclk) to ensure efficient memory access within a single work clock cycle.", "body_md": "Design notes for SNESTang 0.3\nThis documents the design of SNESTang 0.3 for my own reference and others who want to read the code. It also aims to be helpful to people porting other cores to Tang FPGAs.\nThe SNES architecture\n+-------+ +------+ +------+\n| WRAM | | SPPU |..| VRAM |\n| 128KB | +========+ | | |2x32KB|\n+-------+...| SCPU |...+------+ +------+\n| w/ DMA |\n+-----------+...+========+...+------+ +------+ +------+\n| Cartridge | | | SSMP |..| DSP |..| ARAM |\n| Rom 16MB | | |SPC700| | | | 64KB |\n| Sram/chip | +--------+ +------+ +------+ +------+\n+-----------+ | ^ Joy |\n|< > O O |\n| v 2x |\n+--------+\nSeveral things are worth pointing out,\n- The 16-bit CPU runs the main game program. The cartridge ROM, cartridge SRAM and WRAM (work RAM) live in the same 24-bit address space (max 16MB). The CPU reads or writes at most one word of this memory space per cycle for normal operations, or up to 2 words (one read and one write) when doing DMA.\n- The PPU graphics processor operates stand-alone, with no access to the CPU address space. The CPU can access VRAM through memory-mapped registers or DMA, but not the other way around.\n- The SMP/APU audio processor also operates independently from the CPU, with its own memory (64KB of ARAM). The audio DSP does not run code. 
It is driven by voice tables in ARAM written by the APU.\nThis architecture both inherits from and differs from the NES. For example, in the NES the PPU accesses the same memory as the CPU, so it has direct access to the cartridge and RAM, while here it does not. The benefit, on the other hand, is more memory bandwidth for the PPU. The APU was not developed by Nintendo but sourced from Sony, so it not only has its own memory but also runs at a different clock speed. For a detailed description of the SNES architecture, refer to SNES Architecture by Rodrigo Copetti.\nOriginal code bases\nWe are mainly porting from two code bases:\n- The original, archived FpgaSnes by srg320 (old but simple)\n- The active SNES_FPGA by gyurco (new but more complex).\nSome code is reused from the previous NESTang project for NES on Tang Primer 25K.\nClocks\nThe original SNES has the following clocks:\n- \"Master clock\" at 21.477MHz for CPU and PPU.\n- A CPU cycle takes 6, 8 or 12 master cycles.\n- The PPU outputs 1 dot every 4 cycles.\n- The SMP sound system works off a separate 24.576MHz clock.\n- The DSP runs on 1/6 of the clock cycles (i.e. 4.096MHz).\n- The SPC700 runs at 1/24 of the cycles (i.e. 1.024MHz).\nIn SNESTang we map the clocks to the FPGA in the following way. It may look a bit convoluted, but the goal is to have as few separate clock domains as possible, because crossing them adds complexity and latency. Latency matters because the SNES is faster than the NES and we do not have much timing leeway to make the design work.\nhclk\n: HDMI 720p pixel clock at 74.25MHz\nwclk\n: Main \"work clock\" at 10.8MHz for everything other than HDMI or SDRAM, including CPU, PPU, SMP, SD card, etc. This is about half of the SNES master clock speed, so a CPU cycle takes 3, 4 or 6 wclk cycles.\n- CPU/PPU runs slightly faster (10.8 > 21.477/2). See below (\"SNES video to HDMI\") for how we stay in sync.\n- Sound also runs faster than on the original SNES. See below (\"SMP audio to HDMI\") for how audio is synchronized.\nfclk\n: SDRAM clock at 64.8MHz. 
This is exactly 6x wclk and is generated from the same PLL, so the two are related clocks and domain crossing is easier.\n- It takes 5 cycles to access SDRAM, so at 6x wclk the SDRAM can finish an access within one wclk cycle, making the CPU/PPU logic simpler.\nThe original design in FpgaSnes uses signals like INT_CLK and DOT_CLK as clocks (ripple clocks), which is not good style and confuses Verilator. So we switch to a phase-based design similar to NESTang. This is similar to but slightly different from SNES_FPGA by gyurco.\nTimings\nTypical timings of the components:\nwclk /‾1‾‾\\____/‾2‾‾\\____/‾3‾‾\\____/‾4‾‾\\____/‾5‾‾\\____/‾6‾‾\\____/‾7‾‾\\____/‾8‾‾\\____/‾9‾‾\\____/‾10‾\\____\nphase | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 0 | 1 | 2 |\ncpu /‾‾‾‾‾‾‾‾‾\\___________________/‾‾‾‾‾‾‾‾‾\\_____________________________/‾‾‾‾‾‾‾‾‾\\____________________\n|---------- 6-cycle ----------|---------------- 8-cycle --------------| .... 12-cycle ...\nmem_rd ____________________/‾‾‾‾‾‾‾‾‾\\___________________/‾‾‾‾‾‾‾‾‾\\_____________________________/‾‾‾‾‾‾‾‾‾\\\nmem_wr /‾‾‾‾‾‾‾‾‾\\___________________/‾‾‾‾‾‾‾‾‾\\_____________________________/‾‾‾‾‾‾‾‾‾\\____________________\nppu \\_________/‾‾dot 1‾‾\\_________/‾‾dot 2‾‾\\____ ...\nvram_rd /‾‾‾‾‾‾‾‾‾\\_________/‾‾‾‾‾‾‾‾‾\\_________/‾‾‾‾ ...\ndsp \\_________/substep0‾\\_________/substep1‾\\____ ...\naram_rd /‾‾‾‾‾‾‾‾‾\\_________/‾‾‾‾‾‾‾‾‾\\_________/‾‾‾‾ ...\nThe SNES CPU uses variable clock speeds. A CPU cycle takes 6, 8 or 12 master clock cycles depending on the operation. Cycles with no memory or I/O operation take 6 master clock cycles (i.e. 3 wclk \"phases\", as our wclk runs at about half the speed of the SNES master clock). Cycles accessing memory take 8. 
Cycles accessing I/O take 12.\nTherefore, the actual timing for a CPU cycle looks like this:\n* (cpu line in the diagram) Phase 0 (marked by SYSCLKF_CE in code, \"falling sys clk\"): P65 computation is executed.\n* (mem_rd line) Middle phase (marked by SYSCLKR_CE, \"rising sys clk\"): memory read operations. The read result may be needed by the next CPU cycle.\n* (mem_wr line) Memory writes are done in the next CPU cycle's phase 0. This overlaps safely with the next CPU cycle because a write does not affect that cycle.\n* For DMA, there can be at most one read and one write, so the result of the read is used by the write in the next CPU cycle.\nThe PPU and audio DSP work similarly, albeit with memory operations in phase 0 and computation in phase 1.\nSDRAM controller\nThis is more complicated than on the NES because we have more memory components to support: the cartridge ROM, cartridge BSRAM (battery-backed SRAM that holds game saves), WRAM (128KB main work RAM), VRAM (64KB video RAM) and ARAM (64KB audio RAM).\nIn an FPGA, the fastest and most convenient memory is block RAM, or BRAM. Using BRAM takes a few lines of code and data accesses take exactly one cycle. For instance, the MiSTer SNES core places everything except the ROM in BRAM, which makes things much simpler. In contrast, the Tang Primer 25K has 56x18kb blocks, or 126K bytes of BRAM in total, which is in short supply given what the SNES needs. SNESTang places only the VRAM in BRAM. All other memories reside in the SDRAM. The remaining half of the BRAM is used for processor microcode, the HDMI video buffer, etc.\nWe use a dual-bank interleaved SDRAM controller (with CAS latency CL=2): sdram_snes.v. This allows parallel accesses from both the CPU and SMP, avoiding the need for complex time-multiplexing (which I spent some time on and then abandoned).\nThe nice thing about this design is that the SDRAM does not need to run at super-high speed. 
64.8MHz is about half the speed of MIST-SNES's 128MHz (which uses a CL3 3-way interleaving controller). In my experiments, I have not yet found a way to make the Tang Primer 25K SDRAM work reliably at 128MHz.\nHere are the timings for the SDRAM controller. Remember that 6 fclk cycles make 1 wclk cycle.\nfclk_#  CPU    ARAM\n0\n1       RAS1   DATA2\n2       CAS1\n3              RAS2/Refresh\n4       DATA1\n5              CAS2\nFor each memory access, RAS is row activation, followed by CAS (column activation and write data), and then DATA when read data is available. So you can see that the CPU channel is designed to operate within one wclk cycle. The ARAM channel crosses two wclk cycles. Note that RAS and CAS are shown at the point where they are registered on the memory side. Because the memory operates in CL2 (CAS latency 2) mode, read data is ready (DATA1) exactly two cycles after CAS1.\nMemory Layout\nThe SNES memory layout roughly looks like this:\n00-1F 20-3F 40-5F 60-7D 7E-7F 80-9F A0-BF C0-DF E0-FF\n0000 -----------------------------------------------------\n| RAM | | | RAM | |\n2000 |----------| | |-----------| |\n| I/O | | | I/O | |\n| | ROM | RAM | | ROM |\n8000 |----------| | |-----------| |\n| | | | | |\n| ROM | | | ROM | |\n| | | | | |\nffff -----------------------------------------------------\nDetails of the memory map are determined by the map_ctrl, rom_size and ram_size fields in the SNES header. See code for details.\nDMA\ncpu.v contains mainly DMA-related code. The actual 65C816 processor is in P65C816.v. The DMA controller is a Nintendo design and technically outside of the CPU. When DMA is active, the 65C816 is simply paused. Every DMA cycle is 4 wclks (8 master cycles). It always transfers one byte from Bus A (CA) to Bus B (PA), or vice versa. Actual operations are controlled by DMA registers like DMAEN, BBAD, etc. With these registers, we can do ROM-VRAM DMA, ROM-WRAM DMA, WRAM-CGRAM DMA, etc. It is very flexible, as long as the transfer is between Bus A and Bus B.\nDMA is implemented with the following timings. 
This example shows WRAM to WRAM DMA, the most complicated situation.\nwclk / 0 \\___/ 1 \\___/ 2 \\___/ 3 \\___/\nSYSCLKF_CE / \\_______________________/\nSYSCLKR_CE ________________/ \\________\nDMA |NxtAddr|\nSDRAM |MEM<=MDR| |DI<=MEM|\nMDR |MDR<=DI|\nSNES video to HDMI\nVideo upscaling to HDMI is done in snes2hdmi.v. It upscales the SNES video by 3 vertically (224 to 672 lines); the 1024-pixel output width corresponds to 4x horizontally. So 256x224 becomes 1024x672, leaving empty columns on the sides and narrow bars on the top and bottom. The final result is OK.\nHowever, this works quite differently from nes2hdmi.v for NES, mainly because we had to give up the frame buffer. The frame buffer for NES was introduced because the console and HDMI work at different pixel scanning speeds, so things become easier when a full frame buffer holds the whole frame image.\nFor SNESTang, however, a frame buffer would take too much space. The RGB5 pixel format takes 15 bits (rounded up to 2 bytes) to store one pixel. Therefore 256x224 would need 112KB, almost all the BRAM we have. One option would be to store the frame buffer in SDRAM. But given the high pixel clock of HDMI and our already busy SDRAM, implementing an SDRAM-backed frame buffer would be challenging here.\nThe approach I chose was to let HDMI generate pixels at 74.25MHz from a multi-line pixel buffer, which is fed by the SNES in a mostly-synchronized fashion. It is somewhat similar to the VGA line-doubler used in other FPGA cores to expand 240p to VGA. The scanning speed of 720p is different from that of the 256x224 video feed of the SNES, so the two tend to go out of sync over time. But if HDMI and the SNES start each frame at the same time, they do not drift too far from each other. After some calculation and experimentation, a 16-line buffer turns out to be enough, using 4 BRAM blocks. In order to sync the SNES to HDMI frames, the pause_snes_for_frame_sync signal was introduced. 
We pause the SNES during the first \"DRAM refresh period\" (middle of scanline, marked by the REFRESH signal), where there is no RAM access, and wait for HDMI to catch up.\nSMP audio to HDMI\nAudio generated by the SMP is streamed through the AUDIO_L[15:0] and AUDIO_R[15:0] signals of main.v. The sound should be 32K samples per second. As discussed above, in order to simplify clocking and allow the SMP to use SDRAM directly without clock domain crossing, the SMP also runs on wclk. The original dsp.v runs at 4.096MHz and now runs at 10.8/2=5.4MHz, 32% faster. The 32K sample rate is thus maintained by introducing an AUDIO_EN input signal to main.v. When the HDMI audio input FIFO is full, AUDIO_EN becomes 0, temporarily stopping sound generation. When there is empty space in the FIFO, it becomes 1 again, resuming sound.\nAcknowledgements\n- SNES Architecture by Rodrigo Copetti\n- FpgaSnes by srg320\n- SNES_FPGA by gyurco", "url": "https://wpnews.pro/news/design-notes-for-snestang-0-3", "canonical_source": "https://nand2mario.github.io/posts/2024/snes_design_0.3/", "published_at": "2024-01-07 00:00:00+00:00", "updated_at": "2026-05-23 15:57:24.507937+00:00", "lang": "en", "topics": ["hardware", "semiconductor", "open-source", "developer-tools"], "entities": ["SNESTang", "Tang FPGAs", "SNES", "SPPU", "SCPU", "SSMP", "SPC700", "DSP"], "alternates": {"html": "https://wpnews.pro/news/design-notes-for-snestang-0-3", "markdown": "https://wpnews.pro/news/design-notes-for-snestang-0-3.md", "text": "https://wpnews.pro/news/design-notes-for-snestang-0-3.txt", "jsonld": "https://wpnews.pro/news/design-notes-for-snestang-0-3.jsonld"}}