Z80 Assembly

Z80 Assembly is the language of the Zilog Z80 processor — the 8-bit CPU that powered the Sinclair ZX Spectrum, the Amstrad CPC, the MSX, the TI-83 calculator, and approximately half of the 1980s. It is the language where every clock cycle belongs to the programmer, every register has a name, every instruction has a cost measured in T-states, and the distance between the code and the hardware is exactly zero.

riclib did not move from Spectrum BASIC to Z80 assembly. riclib wrote Z80 assembly inside Spectrum BASIC — POKEing opcodes into memory one byte at a time, looking up the decimal values in the back of the manual. 201 was RET. 62 was LD A,n. He programmed in assembly for three years before he ever saw an assembler. When the assembler arrived, the mnemonics were a translation layer for numbers he already knew. Z80 assembly let you orchestrate the entire CPU — load registers, manipulate the stack, control interrupts, address every byte of the 64K address space, and do it at the full speed of the 3.5MHz clock, which in 1980s terms was everything and in modern terms is the speed at which a JavaScript runtime decides which polyfill to load.
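
The POKE workflow can be sketched in modern terms. This is a minimal illustration, not a record of any actual routine: the load address 32768 and the operand 42 are assumptions, the helper name is hypothetical, and only the opcode values 62 and 201 come from the text above.

```python
# Opcode values as looked up in the back of the manual (decimal).
LD_A_N = 62   # LD A,n : followed by one operand byte
RET = 201     # RET

def poke_listing(start, code):
    """Return the BASIC POKE statements that place `code` at `start`."""
    return [f"POKE {start + i},{byte}" for i, byte in enumerate(code)]

# Hypothetical routine: load 42 into A, then return to BASIC.
routine = [LD_A_N, 42, RET]
listing = poke_listing(32768, routine)  # 32768 is an assumed free address
for line in listing:
    print(line)
```

Three bytes, three POKEs, one byte at a time.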

“The Z80 had 158 documented instructions. The interesting ones were undocumented.”
— The Lizard, who reads the silicon, not the manual

The Undocumented Instructions

The Z80’s official instruction set — the one Zilog published, the one the manuals described — had 158 instructions. LD, ADD, SUB, AND, OR, XOR, CALL, RET, PUSH, POP, DJNZ, and the rest. Complete, sufficient, documented.

But the Z80 had more.

The index registers — IX and IY — were 16-bit registers that the documented instruction set treated as indivisible units. You could load IX. You could use IX as a base for indexed addressing. You could not, according to the documentation, access the high and low bytes of IX individually.

The Z80 disagreed. The silicon had no such restriction. The undocumented instructions operated on IXH, IXL, IYH and IYL, the individual bytes of the index registers, and they worked perfectly, and Zilog never published them, and every serious Z80 programmer used them, because the four extra 8-bit registers they provided were the difference between fitting your routine in the available cycles and not.
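
How those register halves are reached can be sketched from the encoding. The sketch assumes the commonly described mechanism: the DD or FD prefix that redirects HL to IX or IY also redirects H and L in many plain 8-bit opcodes. The base opcode values are the standard ones for LD A,H, LD A,L, INC H and DEC L; the helper name is hypothetical.

```python
# Standard one-byte opcodes that name H or L.
BASE_OPCODES = {
    "LD A,H": 0x7C,
    "LD A,L": 0x7D,
    "INC H": 0x24,
    "DEC L": 0x2D,
}
PREFIX = {"IX": 0xDD, "IY": 0xFD}

def encode_half(mnemonic, index_reg):
    """Encode `mnemonic` with H/L redirected to the halves of IX or IY."""
    return bytes([PREFIX[index_reg], BASE_OPCODES[mnemonic]])

ld_a_ixh = encode_half("LD A,H", "IX")   # behaves as LD A,IXH
dec_iyl = encode_half("DEC L", "IY")     # behaves as DEC IYL
print(ld_a_ixh.hex(), dec_iyl.hex())
```

An assembler that did not know the mnemonics could still emit these as raw data bytes, which is one way such instructions were used before tools caught up.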

riclib used them all. Undocumented instructions and all. The Z80 did not care whether Zilog had published the instruction. The Z80 executed it. The silicon was the documentation. The manual was a suggestion.

This was the second formative lesson, after POKE: the documentation describes what the manufacturer intended. The hardware describes what is possible. The gap between the two is where the interesting programming happens.

Clearing the Screen: A Case Study in Thinking Like the Machine

The difference between knowing Z80 assembly and understanding the Z80 is best illustrated by how you clear the screen.

The Spectrum’s display file starts at address 16384 and is 6144 bytes long (the attribute area adds another 768 bytes from 22528, but let’s clear the pixels first). The naive approach — the one every beginner writes, the one the manual suggests — uses LDIR:

    LD HL, 16384      ; source: first byte of screen
    LD DE, 16385      ; destination: second byte of screen
    LD BC, 6143       ; count: rest of screen
    LD (HL), 0        ; zero the first byte
    LDIR              ; copy forward: each byte copies the zero from the previous

LDIR copies BC bytes from (HL) to (DE), incrementing both, decrementing BC. Each iteration: 21 T-states (the final one, which does not repeat, takes 16). For 6143 bytes: 6143 × 21 = 129,003 T-states, give or take the five saved on the last byte. At 3.5MHz, that is approximately 37 milliseconds. Not slow. But not fast. The screen clears visibly, top to bottom, a wipe you can see if you are paying attention.
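
The LDIR figure can be checked with a few lines of arithmetic. This is a sketch under the text's own assumptions (3.5MHz clock, 21 T-states per byte); the 16 T-state final iteration is the value given in standard Z80 timing tables, and the function name is hypothetical.

```python
CLOCK_HZ = 3_500_000  # 3.5MHz, the figure used in the text

def ldir_tstates(count, final_iteration=21):
    """T-states for an LDIR of `count` bytes at 21 per repeated byte.

    Pass final_iteration=16 to account for the last, non-repeating byte.
    """
    return (count - 1) * 21 + final_iteration

simple = ldir_tstates(6143)        # the flat 21-per-byte estimate
refined = ldir_tstates(6143, 16)   # with the cheaper final iteration
ldir_ms = simple / CLOCK_HZ * 1000
print(simple, refined, round(ldir_ms, 1))
```

The refinement changes the total by five T-states, which is why the flat estimate is the one usually quoted.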

The trick — the one that separates the beginner from the programmer who has read the Z80 data sheet and noticed that PUSH is faster than LDIR — uses the stack pointer.

    DI                ; disable interrupts (we're about to hijack the stack)
    LD (saved_sp), SP ; save the real stack pointer (saved_sp: a 2-byte variable defined elsewhere)
    LD SP, 16384+6144 ; one past the last pixel byte: PUSH decrements SP before it writes
    LD HL, 0          ; the value we're pushing (zero = clear)
    ; now push HL 3072 times (6144 bytes / 2 bytes per push)
    REPT 3072
    PUSH HL
    ENDM
    LD SP, (saved_sp) ; restore the stack pointer
    EI                ; re-enable interrupts

PUSH HL: 11 T-states. It decrements SP, writes H, decrements SP again, and writes L: two bytes stored just below wherever the stack pointer was pointing, with SP left 2 lower. That is why SP starts one byte past the end of the screen. Each PUSH clears two bytes of screen. For 3072 pushes: 33,792 T-states. That is 9.7 milliseconds. The screen clears so fast it appears instantaneous: no wipe, no visible progression, just gone.
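
The stack trick can be modelled without a Z80. The class below is a deliberately tiny, hypothetical model (64K of flat RAM and a stack pointer, nothing else), assuming the standard PUSH behaviour: decrement SP, store the high byte, decrement again, store the low byte. It ignores that the bytes below 16384 are ROM on a real Spectrum.

```python
SCREEN_START = 16384
SCREEN_LEN = 6144

class TinyStack:
    """Just enough of a Z80 to model PUSH: 64K of memory and SP."""
    def __init__(self):
        # Fill with 0xFF so that cleared bytes stand out.
        self.mem = bytearray([0xFF] * 65536)
        self.sp = 0

    def push(self, value):
        # PUSH pre-decrements: high byte to SP-1, low byte to SP-2.
        self.sp = (self.sp - 1) & 0xFFFF
        self.mem[self.sp] = (value >> 8) & 0xFF
        self.sp = (self.sp - 1) & 0xFFFF
        self.mem[self.sp] = value & 0xFF

cpu = TinyStack()
cpu.sp = SCREEN_START + SCREEN_LEN   # one past the last pixel byte
for _ in range(SCREEN_LEN // 2):     # 3072 pushes
    cpu.push(0)

pixels = cpu.mem[SCREEN_START:SCREEN_START + SCREEN_LEN]
print(cpu.sp, any(pixels))
```

After the loop SP sits exactly on the first screen byte, every pixel byte is zero, and the bytes on either side of the display file are untouched.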

But 3072 pushes is a lot of instructions. The code is 3072 × 1 byte = 3KB just for the PUSH opcodes. If memory is tight — and on the Spectrum, memory was always tight — you use a loop:

    DI
    LD (saved_sp), SP
    LD SP, 16384+6144
    LD HL, 0
    LD B, 192         ; 192 iterations × 16 pushes = 3072 pushes
.loop:
    PUSH HL           ; unroll to 16 pushes per iteration
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    PUSH HL
    DJNZ .loop        ; 13 T-states taken, 8 not taken
    LD SP, (saved_sp)
    EI

Sixteen pushes per iteration, 192 iterations. The 16-push unroll is the practical version: within about 7% of the fully unrolled 3072-push version (36,283 T-states against 33,792), in 18 bytes of loop instead of 3KB. It works because 16 pushes per iteration means 192 iterations, and 192 fits in an 8-bit register (B), which means DJNZ works. DJNZ (Decrement B, Jump if Not Zero) is the Z80’s tightest loop: one register, one instruction, 13 T-states taken. Nobody unrolled to fewer than 12 pushes per iteration, because below 12 the iteration count exceeds 256 and you need a 16-bit loop counter, which costs more T-states per iteration than DJNZ and hands back only a few bytes of code. Twelve pushes: 256 iterations, B=0 (which DJNZ treats as 256). The sweet spot was 16 pushes: 192 iterations, clean arithmetic, 8-bit counter, fast.
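
The unroll trade-off is easy to tabulate. The sketch below counts only the loop itself (the PUSH instructions and the DJNZ), not the DI, LD SP and EI around it, and uses the T-state figures quoted in the text; the function name is hypothetical.

```python
PUSHES = 3072
PUSH_T, DJNZ_TAKEN, DJNZ_NOT_TAKEN = 11, 13, 8

def loop_cost(unroll):
    """Return (iterations, T-states, code bytes) for the DJNZ loop."""
    iterations = PUSHES // unroll
    # Every iteration but the last takes the branch.
    tstates = PUSHES * PUSH_T + (iterations - 1) * DJNZ_TAKEN + DJNZ_NOT_TAKEN
    code_bytes = unroll * 1 + 2   # 1-byte PUSH HL each, plus a 2-byte DJNZ
    return iterations, tstates, code_bytes

# Unroll factors that divide 3072 evenly and fit B (1 to 256 iterations).
for unroll in (12, 16, 24, 32):
    print(unroll, loop_cost(unroll))

fully_unrolled = PUSHES * PUSH_T   # no loop overhead, 3072 bytes of code
```

Larger unrolls keep shaving DJNZ overhead, but the returns shrink quickly: the pushes themselves dominate the total at every factor shown.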

And of course this matters because you are doing it 25 times per second. This is not a one-time screen clear. This is the inner loop of every game, every demo, every animation on the Spectrum. Clear the screen. Draw the frame. Clear the screen. Draw the frame. Twenty-five times per second (every second frame of the 50Hz PAL display). At 25 fps, the difference between the 37ms LDIR clear and the ~10ms PUSH clear is 27ms per frame, or 675 milliseconds per second: more than half a second of CPU time recovered, per second, available for the actual drawing. The bytes spent on 16 unrolled pushes pay for themselves every frame.
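
Put as a share of the frame budget, the gap is starker. A rough sketch, assuming the text's 3.5MHz clock and 25 frames per second and using the unrounded T-state totals (which is why it lands near 680ms rather than the rounded 675):

```python
CLOCK_HZ = 3_500_000
FPS = 25
LDIR_T = 6143 * 21    # LDIR clear, flat 21-per-byte estimate
PUSH_T = 3072 * 11    # fully unrolled PUSH clear

frame_budget = CLOCK_HZ // FPS        # T-states available per frame
ldir_share = LDIR_T / frame_budget    # fraction of the frame spent clearing
push_share = PUSH_T / frame_budget
saved_ms_per_second = (LDIR_T - PUSH_T) * FPS / CLOCK_HZ * 1000

print(frame_budget, round(ldir_share, 2), round(push_share, 2),
      round(saved_ms_per_second))
```

On these assumptions the LDIR clear eats over nine tenths of each frame before a single pixel is drawn; the PUSH clear takes about a quarter.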

The LDIR version is correct. The PUSH version is 3.8 times faster. The difference is not knowing a different instruction; PUSH is in every Z80 reference. The difference is realising that the stack pointer is just a register that points at memory, and you can point it anywhere, and if you point it at the screen and push zeros, the screen clears at the speed of stack operations, which is faster than block copy because PUSH moves two bytes in 11 T-states and LDIR needs 21 T-states per byte, 42 for the same two bytes.

This is the kind of trick that the manual does not teach. The manual says PUSH is for the stack. The silicon says PUSH decrements SP by two and writes two bytes wherever it lands. The gap between those two descriptions is where the screen clears in 10 milliseconds instead of 37, and where 675 milliseconds per second become available for everything else.

You do need to disable interrupts first — because the interrupt handler uses the stack, and if an interrupt fires while SP is pointing at the middle of the screen, the interrupt return address gets written to the display file and the screen gets a brief, exciting glitch before the machine crashes. DI before. EI after. This is the price of using the stack as a paintbrush.

The Cycle Count

Z80 programming was not just about getting the right answer. Z80 programming was about getting the right answer in the right number of clock cycles.

Every instruction had a cost in T-states — the fundamental clock periods of the Z80. LD A,B: 4 T-states. ADD A,(HL): 7 T-states. DJNZ: 13 T-states if the branch is taken, 8 if it falls through. The programmer who did not count T-states wrote code that was correct but slow. The programmer who counted T-states wrote code that was correct and fast. On a 3.5MHz processor, the difference between a 7-T-state instruction and a 4-T-state instruction was the difference between smooth scrolling and flickering garbage.
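
Converting T-states to wall-clock time makes the scale concrete. A small sketch using only the instruction costs quoted above and the 3.5MHz clock; the function name is hypothetical.

```python
CLOCK_HZ = 3_500_000

T_STATES = {
    "LD A,B": 4,
    "ADD A,(HL)": 7,
    "DJNZ (taken)": 13,
    "DJNZ (not taken)": 8,
}

def microseconds(tstates):
    """Duration of `tstates` clock periods at CLOCK_HZ, in microseconds."""
    return tstates / CLOCK_HZ * 1_000_000

timings = {name: round(microseconds(t), 2) for name, t in T_STATES.items()}
for name, us in timings.items():
    print(f"{name}: {us} us")
```

One T-state is a little under 0.29 microseconds, so three of them saved inside a loop that runs thousands of times per frame add up to milliseconds.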

This was optimisation at the atomic level. Not “profile and find the bottleneck.” Not “cache the database query.” Not “reduce the bundle size.” The bottleneck was the instruction. The optimisation was choosing a different instruction. The cost was measured in T-states of about 286 nanoseconds each, and every one of them was visible on the screen.

Modern programming has abstractions between the code and the consequence. Layers of runtime, operating system, virtual machine, garbage collector, JIT compiler — each one absorbing the programmer’s decisions and translating them into something the hardware eventually executes. Z80 assembly had no layers. The instruction was the consequence. You wrote LD A,(HL) and the accumulator loaded the byte at the address in HL and it took exactly 7 T-states and you could see the result on the screen in the next frame.

Measured Characteristics

CPU: Zilog Z80 (8-bit, 3.5MHz on the Spectrum)
Clock speed: 3.5MHz (the speed at which a modern JavaScript framework decides which polyfill to load)
Documented instructions: 158
Undocumented instructions: several dozen (the interesting ones)
Most useful undocumented: IXH, IXL, IYH, IYL (individual bytes of index registers)
Zilog’s position on undocumented instructions: they do not exist
The Z80’s position on undocumented instructions: they execute perfectly
riclib’s position: undocumented instructions and all
Register count: 20 (A, B, C, D, E, F, H, L, plus shadow set, plus IX, IY, SP, PC; I and R not counted)
Usable register count with undocumented: 24 (the extra 4 made the difference)
Distance between code and hardware: 0 (no runtime, no OS, no abstraction)
Preceding language: Spectrum BASIC (POKE was the gateway)
The lesson: the documentation is what the manufacturer intended; the hardware is what is possible
Legacy: every language since has felt like it’s hiding something

Type	Technology
First Observed	1976 (Zilog, Federico Faggin); in the lifelog, the moment Spectrum BASIC was too slow and every byte needed to be controlled
Severity	Formative (the language that taught the difference between knowing a computer and understanding one)
Natural Predator	None. Z80 assembly has no natural predator. Z80 assembly is the predator.
Tags	languages
Cited in	C++ episode, Curiosity episode, Lisp episode, MSX episode, O Notation episode, Pascal episode, Pointers episode, Recursion episode, Saudade episode, Spectrum BASIC episode, Stack Overflow episode, The Compiler episode, The Knack of Computer Programming episode, Turing Machine episode, webMethods episode