skip to content

miniTPU: building Google's TPU from scratch

Constructing a systolic array from NAND gates up in a digital logic sim

Motivation. NVIDIA is still THE most valuable company, with crazy margins. Accelerated compute will be more important as it is already, especially for Europe, where we have almost none of it. Time to understand how they work. Time to learn how to build TPUs.

At its core a TPU is not a CPU. It’s a fixed-function accelerator built to do (mostly) one thing well: matrix multiplication. At its heart a systolic array. A grid of tiny multiply-accumulate (MAC) units. That grid is what we’re going to build.

How would you actually build one until tape-out? Here’s a rough outline, in increasing order of pain and capex:

  1. A cycle-accurate simulator — software that models the array clock tick by clock tick. We focus on this although not on a professional simulator.
  2. RTL — the real logic, written in Verilog.
  3. An FPGA — that RTL running on reconfigurable silicon.
  4. An ASIC — taping out an actual chip, e.g through something like TinyTapeout.

Systolic arrays

The canonical reference is Kung & Leiserson’s 1978 paper, Systolic Arrays (for VLSI) (Kung & Leiserson, 1978) Systolic Arrays (for VLSI) Kung, Leiserson · 1978 . The name comes from systole, the contraction of a heart: data is pumped rhythmically through a mesh of simple cells, each one beating once per clock.

Funny anectode in the paper is a small aside where the authors suggest making transparencies to visualize how the inputs flow through the array. Pure retro-vibes. I genuinely hope germans schools system has advanced past this, though I have my doubts.

A scanned passage from the 1978 paper: 'The reader is invited to study the data flow of this problem more closely by making transparencies of the band matrices shown in the figures, and moving them over the network picture as described above.'
Straight from the paper — the officially recommended way to understand the data flow, circa 1978.

Read the paper and you’ll notice their matmul uses hex-connected arrays, where each cell talks to six neighbors. It’s not necessary, and not what’s used in practice. The hex layout was for banded matrix multiplication, where three operands move at once and nothing was stationary. Modern dense matmuls either keep the partial sums stationary inside each cell (output-stationary), so only two operands have to stream in, or the weight (weight-stationary) and pass the partial sum forward.

Figure 4-2 from the paper: a hex-connected systolic array for matrix multiplication. The A and B band matrices stream in diagonally from the upper left and upper right, and the C results flow upward out of the array.
The data flow this all refers to (Figure 4-2). The A and B band matrices stream in diagonally from the top, three operands moving at once, and the C results flow upward — hence six neighbors per cell.
  • Weight-stationary — each cell holds a fixed weight and reuses it across a whole batch of activations. Great when you multiply many inputs by the same matrix.
  • Output-stationary — each cell holds its accumulator, and the partial sum never leaves until the end. This minimizes partial-sum traffic and is the easiest to reason about.

We’ll build output-stationary because its simpler.

Before building anything, it helps to hand-trace a small one. Take a 3×3 matmul. Each cell (i,j) is responsible for exactly one output C[i][j] = Σ_k A[i][k]·B[k][j]. Row i of A enters from the left, column j of B enters from the top, and every clock the operands hop one cell over. The catch: for the right a and b to meet in the right cell at the right time, you have to stagger (skew) the inputs — feed each row and column one cycle later than the previous one, like a staircase:

A row 0: a00 a01 a02
A row 1: a10 a11 a12
A row 2: a20 a21 a22

For the practical side read Google’s TPUv1 paper (Jouppi et al., 2017) In-Datacenter Performance Analysis of a Tensor Processing Unit Jouppi, Young, Patil, Patterson, et al. · ISCA 2017 arXiv:1704.04760 . It’s less about the array itself and more about the engineering reality around it like memory bandwidth, roofline analysis, real production workloads. Worth noting how dated the workload mix already feels: back then it was CNNs and LSTMs; today it would be almost entirely transformers.

Building it from scratch

My first idea was simulate the array in Python, one class per processing element (PE), and wire them into a grid. But i remembered the Digital Logic Sim that Sebastian Lague built for one of his videos. I’d never done anything serious with digital logic, so this felt like the right kind of exercise. Booting it up, you get the bare minimum: it hands you a single NAND gate (plus some other goodies like ROM which we will use only at the end) and expects you to build the rest of the universe from there.

I built the usual basic gates (NOT, AND, OR, XOR) first, but I’ll skip them here since they’re a quick lookup. With those in hand, the build order is:

  1. mux2:1 — a 2-to-1 selector, we will use them to clear our registers
  2. half-addfull-add4bit-add8bit-add — arithmetic, to accumulate
  3. DFF4bit-register8bit-register — memory
  4. 4bit-mult — the multiplier
  5. PE — one processing element
  6. 2x2 array — four PEs wired into a grid

I prototyped everything at unsigned int4 (4-bit operands, 8-bit accumulator) to keep the wiring tractable.

A 2:1 multiplexer

Start with the first circuit: a multiplexer. A 2:1 MUX is just a digital switch. One select line picks which of two inputs reaches the output. In code out = sel ? b : a. You can build it from one NOT, two ANDs, and one OR: invert the select line, AND it with a, AND the original select with b, and OR the two halves together.

A 2:1 multiplexer built from a NOT gate, two AND gates and an OR gate, with SEL, A and B inputs and an OUT output.
A 2:1 MUX: out = SEL ? B : A, built from 1 NOT + 2 ANDs + 1 OR.

I’ll need it for the registers as it’s the clear mechanism for them. Wire clear to the select line, the live value to one input and a hard-wired zero to the other, and a high clear forces the register to zero regardless of what’s coming in.

Adders

Addition starts with the half adder: two bits in, XOR gives the sum, AND gives the carry.

A half adder: two inputs feeding an XOR gate for the sum and an AND gate for the carry-out.
Half adder — XOR produces the sum S, AND produces the carry Cout.

A half adder doesn’t accept a carry in, so it can’t be chained. The full adder fixes that having three inputs (A, B, and a carry-in), built from two XORs, two ANDs, and an OR:

A full adder built from two XOR gates, two AND gates and an OR gate, taking A, B and Cin and producing S and Cout.
Full adder — A, B and a carry-in Cin produce the sum S and carry-out Cout.

This can be chained into a 4-bit adder with four full adders in a row, each one’s carry-out feeding the next one’s carry-in (professionals call it ripple-carry adder i guess):

A 4-bit ripple-carry adder: four full-adder blocks chained, with the carry rippling from the low bit to the high bit.
4-bit adder — four full adders chained, carry rippling from low to high bit.

And an 8-bit adder is just two 4-bit adders chained the same way. The low adder’s carry-out wired into the high adder’s carry-in.

An 8-bit adder made of two 4-bit-add blocks, with the low block's carry-out feeding the high block's carry-in.
8-bit adder — two 4-bit adders chained: low Cout → high Cin.

One small gotcha here. First I implemented a 4bit adder subtractor that used a control signal to also subtract. Then I was confused where the carry in would go and just decided to neglect subtraction for simplicity. Who needs subtraction anyway…

Registers

So far the gates where just combinational and computed stuff. But a systolic array needs to hold on to its computation with values that move one cell per clock cycle. For that we need memory.

Concretely, the D flip-flop (DFF): a one-bit cell that captures its input D on the clock edge and then holds it, ignoring D until the next edge.

Why “clock edge”? The clock is just a square wave oscillating between 0 and 1. The edge is the instant it transitions from 0 to 1. Flip-flops latch precisely at that instant, which means the entire chip updates simultaneously, once per cycle.

Put N flip-flops side by side and you have an N-bit register. In our 4-bit register we put the MUX’s from earlier in front of each DFF, so a clear pulse zeroes all four bits on the next edge:

A 4-bit register: four MUX2-1 selectors feeding four D flip-flops, with CLEAR and CLOCK inputs and a 4-bit output.
4-bit register — a MUX2-1 (for clear) and a DFF per bit, all sharing CLOCK and CLEAR.

Two of those stacked give an 8-bit register, which is what will hold our accumulator:

An 8-bit register built from two 4-bit-register blocks sharing CLEAR and CLOCK.
8-bit register — two 4-bit registers stacked; this one holds the accumulator.

Multiplier

The last computational piece is the multiplier. Our 4×4 multiplier implementation will be a Braun’s array multiplier: generate all 16 partial products with a grid of AND gates (every bit of A ANDed with every bit of B), then sum the columns with the half and full adders you already built. The result is an 8-bit product.

A 4-bit Braun array multiplier: a grid of AND gates generating partial products, summed by rows of half-adders and full-adders into an 8-bit product.
4-bit multiplier — Braun's array: 16 AND gates for the partial products, then the adders sum the columns.

This schematic is still kinda messy because it’s from the first iteration, where I got the multiplier working before I figured out how to snap the wires and chips to the grid in the simulator. It works, so we won’t touch it…and I can’t be bothered to wire it all up again. Feel free to look online for a clean reference.

Processing element

Now we assemble the cell that does all the work. A PE is one MAC (multiply-accumulate) plus the plumbing that makes the array systolic. It combines:

  • the 4-bit multiplier (a × b),
  • the 8-bit adder (add the product to the running sum),
  • the 8-bit register (hold that running sum),
  • and two more registers that forward the a and b operands to the neighboring cells, one hop per clock.
A processing element wiring a 4bit-mult into an 8bit-add into an 8bit-register accumulator, with two 4-bit registers forwarding the A and B operands to neighbors.
The PE — 4bit-mult → 8bit-add → 8bit-register accumulator, plus two registers forwarding a (Aout) and b (Bout) to neighbors.

This is the heart of the whole thing. Each clock, the combinational logic computes a × b + acc; on the clock edge the accumulator register latches the new sum and the forwarding registers hand a and b to the right and downward neighbors.

2×2 systolic array

Wire four PEs into a 2×2 grid: a operands flow left-to-right along the rows, b operands flow top-to-bottom down the columns, and each PE accumulates its own output in place.

A 2x2 systolic array of four PEs, with A0/A1 entering from the left, B0/B1 from the top, and C and operand-passthrough outputs on the right.
The 2×2 array — four PEs; A streams left-to-right, B top-to-bottom, each PE accumulating one element of C.

Feed in the (staggered!) rows of A on the left and columns of B (or rows of B transposed) on the top, let it run for a handful of cycles, and each PE settles holding one element of the 2×2 product, read out through the C ports on the right. The A_out/B_out ports are just the operands continuing off the far edges. In a bigger array they’d be the next tile’s inputs.

Full demo

The last piece to make it a self-contained demo is feeding the matrix values. We use (completely oversized for what we need them to do) ROMs to hold the matrix values, plus a register for the second row of A and second column of B to apply a one-step delay so each row and column enters on the right cycle.

The full demo circuit: ROM blocks holding the matrix values feed into the 2x2-Systolic block, with the four C outputs on the right.
The finished demo — ROM holds the operands, the skew registers stagger them into the 2×2 array, and the four results (C00, C01, C10, C11) fall out the C ports on the right.

This design re-expressed as Verilog RTL would actually run on an FPGA; the same RTL, hardened and taped out through TinyTapeout, would be an actual chip you could hold. And if you wanted it to do something useful, you’d give it a host, e.g a little RISC-V core feeding the array (Gemmini-style, over RoCC) (Genc et al., 2021) Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration Genc, Kim, Amid, et al. · DAC 2021 arXiv:1911.09925 .

References

  1. Kung, Leiserson (1978). Systolic Arrays (for VLSI). [link]
  2. Jouppi, Young, Patil, Patterson, et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. ISCA 2017. arXiv:1704.04760
  3. Genc, Kim, Amid, et al. (2021). Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration. DAC 2021. arXiv:1911.09925