Tilessa

1. The Shape of the Machine🔗

A Tilessa is a mosaic of tesserae.

Every computer architecture begins with a contract: if you put these bits here, the machine will do that. The contract is called an instruction set architecture — an ISA — and this chapter writes Tilessa's down. Not as prose that can drift out of date, but as Lean definitions the rest of the book (the hardware, the proofs, the simulator) is obliged to agree with, because the build checks them.

By the end of this chapter you will know what a tessera is made of, which instructions it understands, how elements talk to their neighbours, and the trick that lets the whole chip do floating-point math without giving any element a floating-point unit.

(One design decision remains deliberately open and is marked where it appears: the lane/packing order, which we will freeze together with the compiler work so that both sides agree on a single answer.)

1.1. One generator, many elements🔗

Hardware projects have a way of accumulating processors: a main core for the application, a little helper core for booting, a vendor management core hiding in a corner — each with its own instruction set, its own bugs, and its own manual. Every extra ISA multiplies the work of understanding the system, and for us it would multiply something worse: the proof burden.

So Tilessa has exactly one processor definition. Every element on the chip — the workhorse tessera, the shrunken variant we tape out on hobby shuttles, the conductor that boots everything — is the same core generator instantiated with different feature flags, the way one template yields many concrete types. One ISA, one meaning, everywhere. This pays off twice: in hardware, there is a single design to verify; in software, programs running on the conductor obey the same mathematical description as the hardware they manage, so proofs about firmware and proofs about gates snap together instead of meeting at an awkward translation layer.

The second unusual decision: there is no privileged mode. In a conventional CPU, some instructions are reserved for the operating system, and a mode bit tracks whether you are currently allowed to use them — so a bug (or an attack) that flips the bit is catastrophic. Tilessa's privilege is physical instead. The conductor is "privileged" because it is the only element whose wiring reaches the control registers — the UART, the timer, the switches that start and stop the mesh. A tessera cannot escalate for the same reason a kitchen toaster cannot open a garage door: no wire goes there.

namespace Tilessa /-- The four mesh directions. -/ inductive Dir where | east | west | north | south deriving Repr, DecidableEq /-- Feature flags for the core generator. Every element of a Tilessa is an instantiation of this one structure — there is no second core. -/ structure CoreConfig where /-- Datapath word width in bits (registers, scratch, straps). -/ wordBits : Nat := 64 /-- MAC operand width in bits (products are 2 × macBits wide). -/ macBits : Nat := 32 /-- Instruction memory size, in instruction words. -/ imemWords : Nat /-- Local data scratchpad, in wordBits-wide words (weight-stationary storage). -/ scratchWords : Nat /-- Byte-addressed data SRAM (conductor only; 0 = absent). -/ dmemBytes : Nat := 0 /-- Integer multiply-accumulate unit (the BFP throughput path). -/ hasMAC : Bool /-- Optional scalar fp32 unit (add/sub/mul/min/max/cmp/convert). No M0 instantiation enables it: BFP + boundary units cover the ML path, and element code uses soft-float. Configs without it HALT on `fp` instructions (same fault semantics as every absent unit). -/ hasFP : Bool /-- Decodes the control MMIO window (UART, timer, mesh control, DMA). -/ hasMMIO : Bool deriving Repr, DecidableEq

Here are the three configurations this book builds. Read tessera first; the other two differ only in the knobs.

/-- The standard mesh element. -/ def tessera : CoreConfig where imemWords := 64 scratchWords := 32 hasMAC := true hasFP := false hasMMIO := false /-- The shuttle-constrained variant: "narrow" = a 32-bit datapath with a 16×16 MAC and minimal memories, ≈ 0.14 mm² of cells — the configuration that fits a Tiny Tapeout footprint (24 tiles ≈ 0.26 mm² placed), where the full tessera does not. Same ISA, same semantics; only the widths shrink. -/ def tesseraNarrow : CoreConfig where wordBits := 32 macBits := 16 imemWords := 16 scratchWords := 16 hasMAC := true hasFP := false hasMMIO := false /-- The boot/orchestration element: mesh-perimeter placement, outside the torus, one gateway strap pair into an edge. Integer-only like everything else (the monitor needs no FP). SRAM sized deliberately — revisit after the monitor port is sized. -/ def conductor : CoreConfig where imemWords := 4096 scratchWords := 0 dmemBytes := 8192 hasMAC := false hasFP := false hasMMIO := true

1.2. The tessera ISA🔗

A tessera understands a deliberately small set of instructions: integer arithmetic and logic, a multiply-accumulate unit (the workhorse of all neural-network math), loads and stores to a small local scratchpad, send and recv for talking to the four neighbours, branches, and halt. That is the entire list — no operating-system instructions, no vector unit, and no floating-point unit inside the element.

The floating-point story deserves a sentence before the next section tells it in full: bulk math travels through the mesh in a block format that turns it into integer arithmetic, and the expensive functions (exponentials, reciprocals) live in shared units at the mesh edge. On the rare occasion element code needs to touch an individual float outside that fast path, it computes with integer instructions — the classic technique called soft-float, costing roughly fifteen instructions per multiply. The fp instructions below are a placeholder for a future variant that does carry the unit; in every configuration in this book they simply halt, the same as any instruction for hardware that is not there.

/-- 32 general-purpose registers, `wordBits` wide. -/ abbrev Reg := Fin 32 inductive AluOp where | add | sub | and | or | xor | sll | srl | sra deriving Repr, DecidableEq /-- Scalar fp32 operations — the optional `hasFP` extension (no M0 instantiation enables it). The multiply reuses the MAC mantissa array; exp/recip are deliberately absent (boundary units). -/ inductive FpOp where | fadd | fsub | fmul | fmin | fmax | flt | feq deriving Repr, DecidableEq inductive Instr where | alu (op : AluOp) (rd rs1 rs2 : Reg) | li (rd : Reg) (imm : BitVec 32) /-- acc := acc + rs1 × rs2 (integer; the BFP throughput path). -/ | mac (rs1 rs2 : Reg) /-- Clear the accumulator. -/ | macz /-- Read the accumulator into a register. -/ | rdacc (rd : Reg) | fp (op : FpOp) (rd rs1 rs2 : Reg) | itof (rd rs : Reg) | ftoi (rd rs : Reg) /-- Load from / store to the local scratchpad. -/ | ldw (rd : Reg) (addr : BitVec 8) | stw (rs : Reg) (addr : BitVec 8) | send (d : Dir) (rs : Reg) | recv (d : Dir) (rd : Reg) | beq (rs1 rs2 : Reg) (offset : BitVec 12) | bne (rs1 rs2 : Reg) (offset : BitVec 12) | blt (rs1 rs2 : Reg) (offset : BitVec 12) | jmp (target : BitVec 12) | nop | halt deriving Repr, DecidableEq

1.2.1. Fault semantics: trap-free🔗

Most processors handle errors with traps: the hardware detects a problem, saves some state, and jumps into an operating-system handler. Traps need cause registers, save-and-restore paths, and careful thought about what happens when one arrives at exactly the wrong moment — much of the difficulty of verifying a conventional CPU lives right there.

Tilessa deletes the machinery. There is exactly one fault behaviour, everywhere: the element stops. An illegal instruction, an instruction for a unit this configuration lacks, a scratchpad access out of range — all of them halt the element, deterministically, and the halt is visible from outside so the conductor can notice and react. No handler, no cause register, no wrong moment. A stopped element cannot corrupt anything, and a system whose only failure mode is "cleanly stopped" is one you can reason about — and time — with straight-line logic.

/-- The only fault in the architecture. -/ inductive Fault where | halted deriving Repr, DecidableEq

1.3. Arithmetic: block-float, MX-compliant🔗

Here is the problem this section solves. Neural-network math is almost entirely multiply-and-add, and doing it in ordinary floating point means every single addition pays for alignment and normalisation hardware — the shifters that line two numbers up before adding and line the result back up afterwards. When we measured our own designs, that bookkeeping — not the multipliers — dominated the silicon area. Giving every element in a mesh its own float hardware would spend most of the chip on it.

Block floating point (BFP) is the old, good trick that avoids the bill. Take a block of numbers — say thirty-two of them — and make them share one exponent, so each number keeps only its mantissa (its significant digits). Within the block everything is aligned by construction: multiply-accumulate becomes an integer multiply and an integer add. No shifters, no rounding logic, no float hardware in the elements at all. The exponent is handled exactly once per block, at the edge of the mesh where blocks are formed, instead of inside every element on every operation. Numerically the scheme behaves like fixed point within a block and floating point across blocks — and it is not exotic: it is the arithmetic pinned down by the industry's OCP MX standard formats, the ones the machine-learning world is converging on. Tilessa's math is MX math.

The numbers below make it concrete: hardware sums blocks eight at a time (a tree we measured at about one-and-a-half multipliers of area — cheap), four such sums make one 32-element MX block, and the running total lives in a 64-bit accumulator with room to spare. That "room to spare" is not a comment that might rot — it is a theorem the build checks:

/-- Hardware block-sum tree width (measured: block-sum cost ≈ 1.5 multipliers). -/ def bfpHwSum : Nat := 8 /-- Format-level block size (OCP MX v1.0, shared E8M0 scale). -/ def mxBlockSize : Nat := 32 /-- Hardware sums per format block. -/ def hwSumsPerBlock : Nat := mxBlockSize / bfpHwSum /-- Accumulator significant bits. -/ def accBits : Nat := 40 /-- The guard-bit budget is sufficient: eight full-magnitude 16×16 products fit strictly within the accumulator. Checked by evaluation. -/ theorem hwSum_fits_acc : bfpHwSum * (2 ^ 16 - 1) ^ 2 < 2 ^ accBits := bfpHwSum * (2 ^ 16 - 1) ^ 2 < 2 ^ accBits All goals completed! 🐙 example : hwSumsPerBlock = 4 := hwSumsPerBlock = 4 All goals completed! 🐙

1.3.1. Bits on the wire🔗

Instructions must eventually live in memory as bits. Each instruction is one 64-bit word: an 8-bit opcode at the top, three 5-bit register fields below it, and the low bits carry whatever the instruction needs — a 32-bit immediate, a 12-bit branch target, an 8-bit scratchpad address, or a 2-bit direction. (The layout is provisional until it survives the first hardware; the function below is the authority, and the hardware decoder will be tested against it.)

/-- Field packing helper: place `v` at bit offset `lo` of a 64-bit word. -/ def field (lo : Nat) (v : Nat) : BitVec 64 := (BitVec.ofNat 64 v) <<< lo /-- Direction number (matches the mesh chapter's convention). -/ def Dir.code : Dir Nat | .east => 0 | .west => 1 | .north => 2 | .south => 3 /-- Sub-opcode for ALU operations. -/ def AluOp.code : AluOp Nat | .add => 0 | .sub => 1 | .and => 2 | .or => 3 | .xor => 4 | .sll => 5 | .srl => 6 | .sra => 7 /-- Sub-opcode for the (currently dormant) FP extension. -/ def FpOp.code : FpOp Nat | .fadd => 0 | .fsub => 1 | .fmul => 2 | .fmin => 3 | .fmax => 4 | .flt => 5 | .feq => 6 /-- Encode one instruction as a 64-bit word: opcode at [63:56], rd at [55:51], rs1 at [50:46], rs2 at [45:41]. -/ def Instr.encode : Instr BitVec 64 | .nop => field 56 0 | .halt => field 56 1 | .li rd imm => field 56 2 ||| field 51 rd.val ||| (imm.zeroExtend 64) | .mac rs1 rs2 => field 56 3 ||| field 46 rs1.val ||| field 41 rs2.val | .macz => field 56 4 | .rdacc rd => field 56 5 ||| field 51 rd.val | .ldw rd addr => field 56 6 ||| field 51 rd.val ||| (addr.zeroExtend 64) | .stw rs addr => field 56 7 ||| field 46 rs.val ||| (addr.zeroExtend 64) | .send d rs => field 56 8 ||| field 46 rs.val ||| field 41 (Dir.code d) | .recv d rd => field 56 9 ||| field 51 rd.val ||| field 41 (Dir.code d) | .beq r1 r2 off => field 56 10 ||| field 46 r1.val ||| field 41 r2.val ||| (off.zeroExtend 64) | .bne r1 r2 off => field 56 11 ||| field 46 r1.val ||| field 41 r2.val ||| (off.zeroExtend 64) | .blt r1 r2 off => field 56 12 ||| field 46 r1.val ||| field 41 r2.val ||| (off.zeroExtend 64) | .jmp target => field 56 13 ||| (target.zeroExtend 64) | .alu op rd r1 r2 => field 56 (16 + AluOp.code op) ||| field 51 rd.val ||| field 46 r1.val ||| field 41 r2.val | .fp op rd r1 r2 => field 56 (24 + FpOp.code op) ||| field 51 rd.val ||| field 46 r1.val ||| field 41 r2.val | .itof rd rs => field 56 31 ||| field 51 rd.val ||| field 46 rs.val | .ftoi rd rs => field 56 32 ||| field 51 rd.val ||| field 46 rs.val example : Instr.encode .halt = 0x0100000000000000 := Instr.halt.encode = 72057594037927936 All goals completed! 🐙 example : Instr.encode (.mac 1 2) = (0x03 <<< 56 ||| 1 <<< 46 ||| 2 <<< 41 : BitVec 64) := (Instr.mac 1 2).encode = (3 <<< 56) ||| (1 <<< 46) ||| (2 <<< 41) All goals completed! 🐙

Encoding is only half a contract; the other half is that the hardware's decoder gives everything back. Here is the decoder — written with the same bit positions the RTL uses — and the theorem that the two are inverse: for every instruction, decoding its encoding returns it unchanged. Small fields are checked by exhaustive evaluation (every register, every opcode); the wide immediates are handled symbolically by the bit-vector decision procedure.

def decodeReg (x : BitVec 64) (lo : Nat) : Reg := ((x >>> lo).setWidth 5).toNat, ((x >>> lo).setWidth 5).isLt def decodeDir (x : BitVec 64) : Dir := match ((x >>> 41).setWidth 2).toNat with | 0 => .east | 1 => .west | 2 => .north | _ => .south def Instr.decode (x : BitVec 64) : Instr := let rd := decodeReg x 51 let rs1 := decodeReg x 46 let rs2 := decodeReg x 41 match ((x >>> 56).setWidth 8).toNat with | 0 => .nop | 1 => .halt | 2 => .li rd (x.setWidth 32) | 3 => .mac rs1 rs2 | 4 => .macz | 5 => .rdacc rd | 6 => .ldw rd (x.setWidth 8) | 7 => .stw rs1 (x.setWidth 8) | 8 => .send (decodeDir x) rs1 | 9 => .recv (decodeDir x) rd | 10 => .beq rs1 rs2 (x.setWidth 12) | 11 => .bne rs1 rs2 (x.setWidth 12) | 12 => .blt rs1 rs2 (x.setWidth 12) | 13 => .jmp (x.setWidth 12) | 16 => .alu .add rd rs1 rs2 | 17 => .alu .sub rd rs1 rs2 | 18 => .alu .and rd rs1 rs2 | 19 => .alu .or rd rs1 rs2 | 20 => .alu .xor rd rs1 rs2 | 21 => .alu .sll rd rs1 rs2 | 22 => .alu .srl rd rs1 rs2 | 23 => .alu .sra rd rs1 rs2 | 24 => .fp .fadd rd rs1 rs2 | 25 => .fp .fsub rd rs1 rs2 | 26 => .fp .fmul rd rs1 rs2 | 27 => .fp .fmin rd rs1 rs2 | 28 => .fp .fmax rd rs1 rs2 | 29 => .fp .flt rd rs1 rs2 | 30 => .fp .feq rd rs1 rs2 | 31 => .itof rd rs1 | 32 => .ftoi rd rs1 | _ => .halt /-- Lifting a register index into the word: the 64-bit field is the zero-extension of its 5-bit image. -/ theorem ofNat64_reg (r : Reg) : BitVec.ofNat 64 r.val = (BitVec.ofNat 5 r.val).zeroExtend 64 := r:RegBitVec.ofNat 64 r = BitVec.zeroExtend 64 (BitVec.ofNat 5 r) r:Reg(BitVec.ofNat 64 r).toNat = (BitVec.zeroExtend 64 (BitVec.ofNat 5 r)).toNat All goals completed! 🐙 set_option maxHeartbeats 1000000 in /-- The decoder is a left inverse of the encoder: no instruction is altered by the trip through its 64 bits. -/ theorem decode_encode (i : Instr) : Instr.decode i.encode = i := i:InstrInstr.decode i.encode = i cases i with Instr.decode Instr.nop.encode = Instr.nop All goals completed! 🐙 Instr.decode Instr.halt.encode = Instr.halt All goals completed! 🐙 Instr.decode Instr.macz.encode = Instr.macz All goals completed! 🐙 rs1:Regrs2:RegInstr.decode (Instr.mac rs1 rs2).encode = Instr.mac rs1 rs2 (rs1 rs2 : Reg), Instr.decode (Instr.mac rs1 rs2).encode = Instr.mac rs1 rs2; All goals completed! 🐙 rd:RegInstr.decode (Instr.rdacc rd).encode = Instr.rdacc rd (rd : Reg), Instr.decode (Instr.rdacc rd).encode = Instr.rdacc rd; All goals completed! 🐙 rd:Regrs:RegInstr.decode (Instr.itof rd rs).encode = Instr.itof rd rs (rd rs : Reg), Instr.decode (Instr.itof rd rs).encode = Instr.itof rd rs; All goals completed! 🐙 rd:Regrs:RegInstr.decode (Instr.ftoi rd rs).encode = Instr.ftoi rd rs (rd rs : Reg), Instr.decode (Instr.ftoi rd rs).encode = Instr.ftoi rd rs; All goals completed! 🐙 op:AluOprd:Regrs1:Regrs2:RegInstr.decode (Instr.alu op rd rs1 rs2).encode = Instr.alu op rd rs1 rs2 rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.add rd rs1 rs2).encode = Instr.alu AluOp.add rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.sub rd rs1 rs2).encode = Instr.alu AluOp.sub rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.and rd rs1 rs2).encode = Instr.alu AluOp.and rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.or rd rs1 rs2).encode = Instr.alu AluOp.or rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.xor rd rs1 rs2).encode = Instr.alu AluOp.xor rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.sll rd rs1 rs2).encode = Instr.alu AluOp.sll rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.srl rd rs1 rs2).encode = Instr.alu AluOp.srl rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.sra rd rs1 rs2).encode = Instr.alu AluOp.sra rd rs1 rs2 rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.add rd rs1 rs2).encode = Instr.alu AluOp.add rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.sub rd rs1 rs2).encode = Instr.alu AluOp.sub rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.and rd rs1 rs2).encode = Instr.alu AluOp.and rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.or rd rs1 rs2).encode = Instr.alu AluOp.or rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.xor rd rs1 rs2).encode = Instr.alu AluOp.xor rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.sll rd rs1 rs2).encode = Instr.alu AluOp.sll rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.srl rd rs1 rs2).encode = Instr.alu AluOp.srl rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.alu AluOp.sra rd rs1 rs2).encode = Instr.alu AluOp.sra rd rs1 rs2 ( (rd rs1 rs2 : Reg), Instr.decode (Instr.alu AluOp.sra rd rs1 rs2).encode = Instr.alu AluOp.sra rd rs1 rs2; All goals completed! 🐙) op:FpOprd:Regrs1:Regrs2:RegInstr.decode (Instr.fp op rd rs1 rs2).encode = Instr.fp op rd rs1 rs2 rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fadd rd rs1 rs2).encode = Instr.fp FpOp.fadd rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fsub rd rs1 rs2).encode = Instr.fp FpOp.fsub rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fmul rd rs1 rs2).encode = Instr.fp FpOp.fmul rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fmin rd rs1 rs2).encode = Instr.fp FpOp.fmin rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fmax rd rs1 rs2).encode = Instr.fp FpOp.fmax rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.flt rd rs1 rs2).encode = Instr.fp FpOp.flt rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.feq rd rs1 rs2).encode = Instr.fp FpOp.feq rd rs1 rs2 rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fadd rd rs1 rs2).encode = Instr.fp FpOp.fadd rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fsub rd rs1 rs2).encode = Instr.fp FpOp.fsub rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fmul rd rs1 rs2).encode = Instr.fp FpOp.fmul rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fmin rd rs1 rs2).encode = Instr.fp FpOp.fmin rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.fmax rd rs1 rs2).encode = Instr.fp FpOp.fmax rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.flt rd rs1 rs2).encode = Instr.fp FpOp.flt rd rs1 rs2rd:Regrs1:Regrs2:RegInstr.decode (Instr.fp FpOp.feq rd rs1 rs2).encode = Instr.fp FpOp.feq rd rs1 rs2 ( (rd rs1 rs2 : Reg), Instr.decode (Instr.fp FpOp.feq rd rs1 rs2).encode = Instr.fp FpOp.feq rd rs1 rs2; All goals completed! 🐙) d:Dirrs:RegInstr.decode (Instr.send d rs).encode = Instr.send d rs rs:RegInstr.decode (Instr.send Dir.east rs).encode = Instr.send Dir.east rsrs:RegInstr.decode (Instr.send Dir.west rs).encode = Instr.send Dir.west rsrs:RegInstr.decode (Instr.send Dir.north rs).encode = Instr.send Dir.north rsrs:RegInstr.decode (Instr.send Dir.south rs).encode = Instr.send Dir.south rs rs:RegInstr.decode (Instr.send Dir.east rs).encode = Instr.send Dir.east rsrs:RegInstr.decode (Instr.send Dir.west rs).encode = Instr.send Dir.west rsrs:RegInstr.decode (Instr.send Dir.north rs).encode = Instr.send Dir.north rsrs:RegInstr.decode (Instr.send Dir.south rs).encode = Instr.send Dir.south rs ( (rs : Reg), Instr.decode (Instr.send Dir.south rs).encode = Instr.send Dir.south rs; All goals completed! 🐙) d:Dirrd:RegInstr.decode (Instr.recv d rd).encode = Instr.recv d rd rd:RegInstr.decode (Instr.recv Dir.east rd).encode = Instr.recv Dir.east rdrd:RegInstr.decode (Instr.recv Dir.west rd).encode = Instr.recv Dir.west rdrd:RegInstr.decode (Instr.recv Dir.north rd).encode = Instr.recv Dir.north rdrd:RegInstr.decode (Instr.recv Dir.south rd).encode = Instr.recv Dir.south rd rd:RegInstr.decode (Instr.recv Dir.east rd).encode = Instr.recv Dir.east rdrd:RegInstr.decode (Instr.recv Dir.west rd).encode = Instr.recv Dir.west rdrd:RegInstr.decode (Instr.recv Dir.north rd).encode = Instr.recv Dir.north rdrd:RegInstr.decode (Instr.recv Dir.south rd).encode = Instr.recv Dir.south rd ( (rd : Reg), Instr.decode (Instr.recv Dir.south rd).encode = Instr.recv Dir.south rd; All goals completed! 🐙) rd:Regaddr:BitVec 8Instr.decode (Instr.ldw rd addr).encode = Instr.ldw rd addr (rd : Reg) (addr : BitVec 8), Instr.decode (Instr.ldw rd addr).encode = Instr.ldw rd addr; All goals completed! 🐙 rs:Regaddr:BitVec 8Instr.decode (Instr.stw rs addr).encode = Instr.stw rs addr (rs : Reg) (addr : BitVec 8), Instr.decode (Instr.stw rs addr).encode = Instr.stw rs addr; All goals completed! 🐙 t:BitVec 12Instr.decode (Instr.jmp t).encode = Instr.jmp t (t : BitVec 12), Instr.decode (Instr.jmp t).encode = Instr.jmp t; All goals completed! 🐙 rs1:Regrs2:Regoff:BitVec 12Instr.decode (Instr.beq rs1 rs2 off).encode = Instr.beq rs1 rs2 off (rs1 rs2 : Reg) (off : BitVec 12), Instr.decode (Instr.beq rs1 rs2 off).encode = Instr.beq rs1 rs2 off; All goals completed! 🐙 rs1:Regrs2:Regoff:BitVec 12Instr.decode (Instr.bne rs1 rs2 off).encode = Instr.bne rs1 rs2 off (rs1 rs2 : Reg) (off : BitVec 12), Instr.decode (Instr.bne rs1 rs2 off).encode = Instr.bne rs1 rs2 off; All goals completed! 🐙 rs1:Regrs2:Regoff:BitVec 12Instr.decode (Instr.blt rs1 rs2 off).encode = Instr.blt rs1 rs2 off (rs1 rs2 : Reg) (off : BitVec 12), Instr.decode (Instr.blt rs1 rs2 off).encode = Instr.blt rs1 rs2 off; All goals completed! 🐙 rd:Regimm:BitVec 32Instr.decode (Instr.li rd imm).encode = Instr.li rd imm rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 rd) <<< 51 ||| BitVec.zeroExtend 64 immInstr.decode (Instr.li rd imm).encode = Instr.li rd imm rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2Instr.decode (Instr.li rd imm).encode = Instr.li rd imm rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2hrd:decodeReg (Instr.li rd imm).encode 51 = rdInstr.decode (Instr.li rd imm).encode = Instr.li rd imm rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2hrd:decodeReg (Instr.li rd imm).encode 51 = rdhimm:BitVec.setWidth 32 (Instr.li rd imm).encode = immInstr.decode (Instr.li rd imm).encode = Instr.li rd imm All goals completed! 🐙

1.4. The mesh and the chip-edge strap🔗

The elements are arranged in a two-dimensional grid whose edges wrap around: the top row's northern neighbour is the bottom row, the way the screen wraps in Pac-Man. (The wrap-around shape is called a torus, and its virtue is that no element is special — every tessera has exactly four neighbours, which keeps the hardware, the layout, and the proofs uniform.)

Neighbours communicate through the humblest device we could design, called a strap: a mailbox that holds exactly one word. A sender may drop a word in only when the mailbox is empty; the word then sits there — one cycle or one hour, it does not matter — until the receiver takes it. That "it does not matter" is the load-bearing phrase of the whole architecture. Because nothing about a strap depends on when the other side arrives, a strap stretched across a circuit board, or serialised down a two-wire link between chips, behaves identically to a strap between adjacent elements. That is what lets a mosaic grow beyond one piece of silicon, and it is stated here as the theorem the mesh chapter will prove:

                          untrusted host
                                │ UART
                         ┌──────┴──────┐
                         │  conductor  │   privileged by wiring: the only
                         └──────┬──────┘   element on the control MMIO
                                │ gateway strap
    ┌───────────────────────────┴───────────────────────────┐
    │   boundary — BFP ⇄ fp32 · exp/recip · DMA (smart edge) │
    └────┬──────────────┬──────────────┬──────────────┬─────┘
         │              │              │              │
     ┌───┴───┐      ┌───┴───┐      ┌───┴───┐      ┌───┴───┐
  ⟲──┤tessera├──────┤tessera├──────┤tessera├──────┤tessera├──⟲
     └───┬───┘      └───┬───┘      └───┬───┘      └───┬───┘
         │              │              │              │
     ┌───┴───┐      ┌───┴───┐      ┌───┴───┐      ┌───┴───┐
  ⟲──┤tessera├──────┤tessera├──────┤tessera├──────┤tessera├──⟲
     └───┬───┘      └───┬───┘      └───┬───┘      └───┬───┘
         ⟲              ⟲              ⟲              ⟲
                                          (⟲ = torus wrap links)

  Every ── and │ is a strap. At a package boundary a strap becomes
  the 2-wire chip-edge serial link — same semantics, different wire.

Strap semantics are latency-insensitive by construction, which yields the scale-out cornerstone (target theorem for M3):

Mesh semantics are invariant under link serialization — the mosaic does not care where the chip edge is.

Crossing between chips takes only two wires per direction. Words leave on one wire as a self-framed serial stream — a start bit, then the payload, the same scheme serial terminals have used since the 1960s — and the acknowledgement returns on the other. The receiving side simply withholds its acknowledgement until its local mailbox has been emptied, so the sender waits exactly as long as it must and not a cycle longer: the strap discipline, stretched over a wire. Eight pins connect a chip to all four neighbours — few enough to work even on hobbyist fabrication services. (The first version shares one clock between boards; a link between independently-clocked chips is the follow-on, and doubles as the prototype for joining silicon dies inside one package.)

/-- Chip-edge link: wires per direction (tx, rx). -/ def chipEdgeWiresPerDir : Nat := 2 /-- All four directions cross the package in 8 pins. -/ example : 4 * chipEdgeWiresPerDir = 8 := 4 * chipEdgeWiresPerDir = 8 All goals completed! 🐙

1.5. Data staging🔗

Weights and activations enter and leave the mesh through the boundary, streamed by the conductor through a DMA port — deliberately the simplest design that can work. A dedicated DMA engine earns its place only when a measured workload demands one, and none has yet.

1.6. Remaining open item🔗

  • Lane/packing order — OPEN, frozen jointly with the compiler workstream; one ordering, applied everywhere. This does not block M1 (the tessera model and refinement proof are lane-order-independent).

end Tilessa