Every computer architecture begins with a contract: if you put these bits
here, the machine will do that. The contract is called an instruction set
architecture — an ISA — and this chapter writes Tilessa's down. Not as prose
that can drift out of date, but as Lean definitions the rest of the book
(the hardware, the proofs, the simulator) is obliged to agree with, because
the build checks them.
By the end of this chapter you will know what a tessera is made of, which
instructions it understands, how elements talk to their neighbours, and the
trick that lets the whole chip do floating-point math without giving any
element a floating-point unit.
(One design decision remains deliberately open and is marked where it
appears: the lane/packing order, which we will freeze together with the
compiler work so that both sides agree on a single answer.)
Hardware projects have a way of accumulating processors: a main core for
the application, a little helper core for booting, a vendor management core
hiding in a corner — each with its own instruction set, its own bugs, and
its own manual. Every extra ISA multiplies the work of understanding the
system, and for us it would multiply something worse: the proof burden.
So Tilessa has exactly one processor definition. Every element on the chip —
the workhorse tessera, the shrunken variant we tape out on hobby shuttles,
the conductor that boots everything — is the same core generator
instantiated with different feature flags, the way one template yields many
concrete types. One ISA, one meaning, everywhere. This pays off twice: in
hardware, there is a single design to verify; in software, programs running
on the conductor obey the same mathematical description as the hardware they
manage, so proofs about firmware and proofs about gates snap together
instead of meeting at an awkward translation layer.
The second unusual decision: there is no privileged mode. In a
conventional CPU, some instructions are reserved for the operating system,
and a mode bit tracks whether you are currently allowed to use them — so a
bug (or an attack) that flips the bit is catastrophic. Tilessa's privilege
is physical instead. The conductor is "privileged" because it is the only
element whose wiring reaches the control registers — the UART, the timer,
the switches that start and stop the mesh. A tessera cannot escalate for the
same reason a kitchen toaster cannot open a garage door: no wire goes there.
namespaceTilessa/-- The four mesh directions. -/inductiveDirwhere|east|west|north|southderivingRepr,DecidableEq/-- Feature flags for the core generator. Every element of a Tilessa is an
instantiation of this one structure — there is no second core. -/structureCoreConfigwhere/-- Datapath word width in bits (registers, scratch, straps). -/wordBits:Nat:=64/-- MAC operand width in bits (products are 2 × macBits wide). -/macBits:Nat:=32/-- Instruction memory size, in instruction words. -/imemWords:Nat/-- Local data scratchpad, in wordBits-wide words (weight-stationary storage). -/scratchWords:Nat/-- Byte-addressed data SRAM (conductor only; 0 = absent). -/dmemBytes:Nat:=0/-- Integer multiply-accumulate unit (the BFP throughput path). -/hasMAC:Bool/-- Optional scalar fp32 unit (add/sub/mul/min/max/cmp/convert). No M0
instantiation enables it: BFP + boundary units cover the ML path, and
element code uses soft-float. Configs without it HALT on `fp` instructions
(same fault semantics as every absent unit). -/hasFP:Bool/-- Decodes the control MMIO window (UART, timer, mesh control, DMA). -/hasMMIO:BoolderivingRepr,DecidableEq
Here are the three configurations this book builds. Read tessera first;
the other two differ only in the knobs.
/-- The standard mesh element. -/deftessera:CoreConfigwhereimemWords:=64scratchWords:=32hasMAC:=truehasFP:=falsehasMMIO:=false/-- The shuttle-constrained variant: "narrow" = a 32-bit datapath with a
16×16 MAC and minimal memories, ≈ 0.14 mm² of cells — the configuration that
fits a Tiny Tapeout footprint (24 tiles ≈ 0.26 mm² placed), where the full
tessera does not. Same ISA, same semantics; only the widths shrink. -/deftesseraNarrow:CoreConfigwherewordBits:=32macBits:=16imemWords:=16scratchWords:=16hasMAC:=truehasFP:=falsehasMMIO:=false/-- The boot/orchestration element: mesh-perimeter placement, outside the
torus, one gateway strap pair into an edge. Integer-only like everything
else (the monitor needs no FP). SRAM sized deliberately — revisit after the
monitor port is sized. -/defconductor:CoreConfigwhereimemWords:=4096scratchWords:=0dmemBytes:=8192hasMAC:=falsehasFP:=falsehasMMIO:=true
A tessera understands a deliberately small set of instructions: integer
arithmetic and logic, a multiply-accumulate unit (the workhorse of all
neural-network math), loads and stores to a small local scratchpad, send
and recv for talking to the four neighbours, branches, and halt. That is
the entire list — no operating-system instructions, no vector unit, and no
floating-point unit inside the element.
The floating-point story deserves a sentence before the next section tells
it in full: bulk math travels through the mesh in a block format that turns
it into integer arithmetic, and the expensive functions (exponentials,
reciprocals) live in shared units at the mesh edge. On the rare occasion
element code needs to touch an individual float outside that fast path, it
computes with integer instructions — the classic technique called
soft-float, costing roughly fifteen instructions per multiply. The fp
instructions below are a placeholder for a future variant that does carry
the unit; in every configuration in this book they simply halt, the same as
any instruction for hardware that is not there.
/-- 32 general-purpose registers, `wordBits` wide. -/abbrevReg:=Fin32inductiveAluOpwhere|add|sub|and|or|xor|sll|srl|sraderivingRepr,DecidableEq/-- Scalar fp32 operations — the optional `hasFP` extension (no M0
instantiation enables it). The multiply reuses the MAC mantissa array;
exp/recip are deliberately absent (boundary units). -/inductiveFpOpwhere|fadd|fsub|fmul|fmin|fmax|flt|feqderivingRepr,DecidableEqinductiveInstrwhere|alu(op:AluOp)(rdrs1rs2:Reg)|li(rd:Reg)(imm:BitVec32)/-- acc := acc + rs1 × rs2 (integer; the BFP throughput path). -/|mac(rs1rs2:Reg)/-- Clear the accumulator. -/|macz/-- Read the accumulator into a register. -/|rdacc(rd:Reg)|fp(op:FpOp)(rdrs1rs2:Reg)|itof(rdrs:Reg)|ftoi(rdrs:Reg)/-- Load from / store to the local scratchpad. -/|ldw(rd:Reg)(addr:BitVec8)|stw(rs:Reg)(addr:BitVec8)|send(d:Dir)(rs:Reg)|recv(d:Dir)(rd:Reg)|beq(rs1rs2:Reg)(offset:BitVec12)|bne(rs1rs2:Reg)(offset:BitVec12)|blt(rs1rs2:Reg)(offset:BitVec12)|jmp(target:BitVec12)|nop|haltderivingRepr,DecidableEq
Most processors handle errors with traps: the hardware detects a problem,
saves some state, and jumps into an operating-system handler. Traps need
cause registers, save-and-restore paths, and careful thought about what
happens when one arrives at exactly the wrong moment — much of the
difficulty of verifying a conventional CPU lives right there.
Tilessa deletes the machinery. There is exactly one fault behaviour,
everywhere: the element stops. An illegal instruction, an instruction for a
unit this configuration lacks, a scratchpad access out of range — all of
them halt the element, deterministically, and the halt is visible from
outside so the conductor can notice and react. No handler, no cause
register, no wrong moment. A stopped element cannot corrupt anything, and a
system whose only failure mode is "cleanly stopped" is one you can reason
about — and time — with straight-line logic.
/-- The only fault in the architecture. -/inductiveFaultwhere|haltedderivingRepr,DecidableEq
Here is the problem this section solves. Neural-network math is almost
entirely multiply-and-add, and doing it in ordinary floating point means
every single addition pays for alignment and normalisation hardware — the
shifters that line two numbers up before adding and line the result back up
afterwards. When we measured our own designs, that bookkeeping — not the
multipliers — dominated the silicon area. Giving every element in a mesh its
own float hardware would spend most of the chip on it.
Block floating point (BFP) is the old, good trick that avoids the bill.
Take a block of numbers — say thirty-two of them — and make them share one
exponent, so each number keeps only its mantissa (its significant digits).
Within the block everything is aligned by construction: multiply-accumulate
becomes an integer multiply and an integer add. No shifters, no rounding
logic, no float hardware in the elements at all. The exponent is handled
exactly once per block, at the edge of the mesh where blocks are formed,
instead of inside every element on every operation. Numerically the scheme
behaves like fixed point within a block and floating point across blocks —
and it is not exotic: it is the arithmetic pinned down by the industry's
OCP MX standard formats, the ones the machine-learning world is converging
on. Tilessa's math is MX math.
The numbers below make it concrete: hardware sums blocks eight at a time (a
tree we measured at about one-and-a-half multipliers of area — cheap), four
such sums make one 32-element MX block, and the running total lives in a
64-bit accumulator with room to spare. That "room to spare" is not a
comment that might rot — it is a theorem the build checks:
/-- Hardware block-sum tree width (measured: block-sum cost ≈ 1.5 multipliers). -/defbfpHwSum:Nat:=8/-- Format-level block size (OCP MX v1.0, shared E8M0 scale). -/defmxBlockSize:Nat:=32/-- Hardware sums per format block. -/defhwSumsPerBlock:Nat:=mxBlockSize/bfpHwSum/-- Accumulator significant bits. -/defaccBits:Nat:=40/-- The guard-bit budget is sufficient: eight full-magnitude 16×16 products
fit strictly within the accumulator. Checked by evaluation. -/theoremhwSum_fits_acc:bfpHwSum*(2^16-1)^2<2^accBits:=⊢ bfpHwSum*(2^16-1)^2<2^accBitsAll goals completed! 🐙example:hwSumsPerBlock=4:=⊢ hwSumsPerBlock=4All goals completed! 🐙
Instructions must eventually live in memory as bits. Each instruction is
one 64-bit word: an 8-bit opcode at the top, three 5-bit register fields
below it, and the low bits carry whatever the instruction needs — a 32-bit
immediate, a 12-bit branch target, an 8-bit scratchpad address, or a 2-bit
direction. (The layout is provisional until it survives the first
hardware; the function below is the authority, and the hardware decoder
will be tested against it.)
Encoding is only half a contract; the other half is that the hardware's
decoder gives everything back. Here is the decoder — written with the same
bit positions the RTL uses — and the theorem that the two are inverse: for
every instruction, decoding its encoding returns it unchanged. Small
fields are checked by exhaustive evaluation (every register, every opcode);
the wide immediates are handled symbolically by the bit-vector decision
procedure.
The elements are arranged in a two-dimensional grid whose edges wrap
around: the top row's northern neighbour is the bottom row, the way the
screen wraps in Pac-Man. (The wrap-around shape is called a torus, and its
virtue is that no element is special — every tessera has exactly four
neighbours, which keeps the hardware, the layout, and the proofs uniform.)
Neighbours communicate through the humblest device we could design, called
a strap: a mailbox that holds exactly one word. A sender may drop a word in
only when the mailbox is empty; the word then sits there — one cycle or one
hour, it does not matter — until the receiver takes it. That "it does not
matter" is the load-bearing phrase of the whole architecture. Because
nothing about a strap depends on when the other side arrives, a strap
stretched across a circuit board, or serialised down a two-wire link
between chips, behaves identically to a strap between adjacent elements.
That is what lets a mosaic grow beyond one piece of silicon, and it is
stated here as the theorem the mesh chapter will prove:
untrusted host
│ UART
┌──────┴──────┐
│ conductor │ privileged by wiring: the only
└──────┬──────┘ element on the control MMIO
│ gateway strap
┌───────────────────────────┴───────────────────────────┐
│ boundary — BFP ⇄ fp32 · exp/recip · DMA (smart edge) │
└────┬──────────────┬──────────────┬──────────────┬─────┘
│ │ │ │
┌───┴───┐ ┌───┴───┐ ┌───┴───┐ ┌───┴───┐
⟲──┤tessera├──────┤tessera├──────┤tessera├──────┤tessera├──⟲
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
│ │ │ │
┌───┴───┐ ┌───┴───┐ ┌───┴───┐ ┌───┴───┐
⟲──┤tessera├──────┤tessera├──────┤tessera├──────┤tessera├──⟲
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
⟲ ⟲ ⟲ ⟲
(⟲ = torus wrap links)
Every ── and │ is a strap. At a package boundary a strap becomes
the 2-wire chip-edge serial link — same semantics, different wire.
Strap semantics are latency-insensitive by construction, which yields the
scale-out cornerstone (target theorem for M3):
Mesh semantics are invariant under link serialization — the mosaic does not care where the chip edge is.
Crossing between chips takes only two wires per direction. Words leave on
one wire as a self-framed serial stream — a start bit, then the payload,
the same scheme serial terminals have used since the 1960s — and the
acknowledgement returns on the other. The receiving side simply withholds
its acknowledgement until its local mailbox has been emptied, so the sender
waits exactly as long as it must and not a cycle longer: the strap
discipline, stretched over a wire. Eight pins connect a chip to all four
neighbours — few enough to work even on hobbyist fabrication services. (The
first version shares one clock between boards; a link between
independently-clocked chips is the follow-on, and doubles as the prototype
for joining silicon dies inside one package.)
/-- Chip-edge link: wires per direction (tx, rx). -/defchipEdgeWiresPerDir:Nat:=2/-- All four directions cross the package in 8 pins. -/example:4*chipEdgeWiresPerDir=8:=by⊢ 4*chipEdgeWiresPerDir=8decideAll goals completed! 🐙
Weights and activations enter and leave the mesh through the boundary,
streamed by the conductor through a DMA port — deliberately the simplest
design that can work. A dedicated DMA engine earns its place only when a
measured workload demands one, and none has yet.
Lane/packing order — OPEN, frozen jointly with the compiler workstream;
one ordering, applied everywhere. This does not block M1 (the tessera model
and refinement proof are lane-order-independent).