1. The Shape of the Machine🔗

A Tilessa is a mosaic of tesserae.

Every computer architecture begins with a contract: if you put these bits here, the machine will do that. The contract is called an instruction set architecture — an ISA — and this chapter writes Tilessa's down. Not as prose that can drift out of date, but as Lean definitions the rest of the book (the hardware, the proofs, the simulator) is obliged to agree with, because the build checks them.

By the end of this chapter you will know what a tessera is made of, which instructions it understands, how elements talk to their neighbours, and the trick that lets the whole chip do floating-point math without giving any element a floating-point unit.

(One design decision remains deliberately open and is marked where it appears: the lane/packing order, which we will freeze together with the compiler work so that both sides agree on a single answer.)

1.1. One generator, many elements🔗

Hardware projects have a way of accumulating processors: a main core for the application, a little helper core for booting, a vendor management core hiding in a corner — each with its own instruction set, its own bugs, and its own manual. Every extra ISA multiplies the work of understanding the system, and for us it would multiply something worse: the proof burden.

So Tilessa has exactly one processor definition. Every element on the chip — the workhorse tessera, the shrunken variant we tape out on hobby shuttles, the conductor that boots everything — is the same core generator instantiated with different feature flags, the way one template yields many concrete types. One ISA, one meaning, everywhere. This pays off twice: in hardware, there is a single design to verify; in software, programs running on the conductor obey the same mathematical description as the hardware they manage, so proofs about firmware and proofs about gates snap together instead of meeting at an awkward translation layer.

The second unusual decision: there is no privileged mode. In a conventional CPU, some instructions are reserved for the operating system, and a mode bit tracks whether you are currently allowed to use them — so a bug (or an attack) that flips the bit is catastrophic. Tilessa's privilege is physical instead. The conductor is "privileged" because it is the only element whose wiring reaches the control registers — the UART, the timer, the switches that start and stop the mesh. A tessera cannot escalate for the same reason a kitchen toaster cannot open a garage door: no wire goes there.

namespace Tilessa

/-- The four mesh directions. -/
inductive Dir where
  | east | west | north | south
deriving Repr, DecidableEq

/-- Feature flags for the core generator. Every element of a Tilessa is an
instantiation of this one structure — there is no second core. -/
structure CoreConfig where
  /-- Datapath word width in bits (registers, scratch, straps). -/
  wordBits     : Nat := 64
  /-- MAC operand width in bits (products are 2 × macBits wide). -/
  macBits      : Nat := 32
  /-- Instruction memory size, in instruction words. -/
  imemWords    : Nat
  /-- Local data scratchpad, in wordBits-wide words (weight-stationary storage). -/
  scratchWords : Nat
  /-- Byte-addressed data SRAM (conductor only; 0 = absent). -/
  dmemBytes    : Nat := 0
  /-- Integer multiply-accumulate unit (the BFP throughput path). -/
  hasMAC       : Bool
  /-- Optional scalar fp32 unit (add/sub/mul/min/max/cmp/convert). No M0
  instantiation enables it: BFP + boundary units cover the ML path, and
  element code uses soft-float. Configs without it HALT on `fp` instructions
  (same fault semantics as every absent unit). -/
  hasFP        : Bool
  /-- Decodes the control MMIO window (UART, timer, mesh control, DMA). -/
  hasMMIO      : Bool
deriving Repr, DecidableEq

Here are the three configurations this book builds. Read tessera first; the other two differ only in the knobs.

/-- The standard mesh element. -/
def tessera : CoreConfig where
  imemWords    := 64
  scratchWords := 32
  hasMAC       := true
  hasFP        := false
  hasMMIO      := false

/-- The shuttle-constrained variant: "narrow" = a 32-bit datapath with a
16×16 MAC and minimal memories, ≈ 0.14 mm² of cells — the configuration that
fits a Tiny Tapeout footprint (24 tiles ≈ 0.26 mm² placed), where the full
tessera does not. Same ISA, same semantics; only the widths shrink. -/
def tesseraNarrow : CoreConfig where
  wordBits     := 32
  macBits      := 16
  imemWords    := 16
  scratchWords := 16
  hasMAC       := true
  hasFP        := false
  hasMMIO      := false

/-- The boot/orchestration element: mesh-perimeter placement, outside the
torus, one gateway strap pair into an edge. Integer-only like everything
else (the monitor needs no FP). SRAM sized deliberately — revisit after the
monitor port is sized. -/
def conductor : CoreConfig where
  imemWords    := 4096
  scratchWords := 0
  dmemBytes    := 8192
  hasMAC       := false
  hasFP        := false
  hasMMIO      := true

1.2. The tessera ISA🔗

A tessera understands a deliberately small set of instructions: integer arithmetic and logic, a multiply-accumulate unit (the workhorse of all neural-network math), loads and stores to a small local scratchpad, send and recv for talking to the four neighbours, branches, and halt. That is the entire list — no operating-system instructions, no vector unit, and no floating-point unit inside the element.

The floating-point story deserves a sentence before the next section tells it in full: bulk math travels through the mesh in a block format that turns it into integer arithmetic, and the expensive functions (exponentials, reciprocals) live in shared units at the mesh edge. On the rare occasion element code needs to touch an individual float outside that fast path, it computes with integer instructions — the classic technique called soft-float, costing roughly fifteen instructions per multiply. The fp instructions below are a placeholder for a future variant that does carry the unit; in every configuration in this book they simply halt, the same as any instruction for hardware that is not there.

/-- 32 general-purpose registers, `wordBits` wide. -/
abbrev Reg := Fin 32

inductive AluOp where
  | add | sub | and | or | xor | sll | srl | sra
deriving Repr, DecidableEq

/-- Scalar fp32 operations — the optional `hasFP` extension (no M0
instantiation enables it). The multiply reuses the MAC mantissa array;
exp/recip are deliberately absent (boundary units). -/
inductive FpOp where
  | fadd | fsub | fmul | fmin | fmax | flt | feq
deriving Repr, DecidableEq

inductive Instr where
  | alu   (op : AluOp) (rd rs1 rs2 : Reg)
  | li    (rd : Reg) (imm : BitVec 32)
  /-- acc := acc + rs1 × rs2 (integer; the BFP throughput path). -/
  | mac   (rs1 rs2 : Reg)
  /-- Clear the accumulator. -/
  | macz
  /-- Read the accumulator into a register. -/
  | rdacc (rd : Reg)
  | fp    (op : FpOp) (rd rs1 rs2 : Reg)
  | itof  (rd rs : Reg)
  | ftoi  (rd rs : Reg)
  /-- Load from / store to the local scratchpad. -/
  | ldw   (rd : Reg) (addr : BitVec 8)
  | stw   (rs : Reg) (addr : BitVec 8)
  | send  (d : Dir) (rs : Reg)
  | recv  (d : Dir) (rd : Reg)
  | beq   (rs1 rs2 : Reg) (offset : BitVec 12)
  | bne   (rs1 rs2 : Reg) (offset : BitVec 12)
  | blt   (rs1 rs2 : Reg) (offset : BitVec 12)
  | jmp   (target : BitVec 12)
  | nop
  | halt
deriving Repr, DecidableEq

1.2.1. Fault semantics: trap-free🔗

Most processors handle errors with traps: the hardware detects a problem, saves some state, and jumps into an operating-system handler. Traps need cause registers, save-and-restore paths, and careful thought about what happens when one arrives at exactly the wrong moment — much of the difficulty of verifying a conventional CPU lives right there.

Tilessa deletes the machinery. There is exactly one fault behaviour, everywhere: the element stops. An illegal instruction, an instruction for a unit this configuration lacks, a scratchpad access out of range — all of them halt the element, deterministically, and the halt is visible from outside so the conductor can notice and react. No handler, no cause register, no wrong moment. A stopped element cannot corrupt anything, and a system whose only failure mode is "cleanly stopped" is one you can reason about — and time — with straight-line logic.

/-- The only fault in the architecture. -/
inductive Fault where
  | halted
deriving Repr, DecidableEq

1.3. Arithmetic: block-float, MX-compliant🔗

Here is the problem this section solves. Neural-network math is almost entirely multiply-and-add, and doing it in ordinary floating point means every single addition pays for alignment and normalisation hardware — the shifters that line two numbers up before adding and line the result back up afterwards. When we measured our own designs, that bookkeeping — not the multipliers — dominated the silicon area. Giving every element in a mesh its own float hardware would spend most of the chip on it.

Block floating point (BFP) is the old, good trick that avoids the bill. Take a block of numbers — say thirty-two of them — and make them share one exponent, so each number keeps only its mantissa (its significant digits). Within the block everything is aligned by construction: multiply-accumulate becomes an integer multiply and an integer add. No shifters, no rounding logic, no float hardware in the elements at all. The exponent is handled exactly once per block, at the edge of the mesh where blocks are formed, instead of inside every element on every operation. Numerically the scheme behaves like fixed point within a block and floating point across blocks — and it is not exotic: it is the arithmetic pinned down by the industry's OCP MX standard formats, the ones the machine-learning world is converging on. Tilessa's math is MX math.

The numbers below make it concrete: hardware sums blocks eight at a time (a tree we measured at about one-and-a-half multipliers of area — cheap), four such sums make one 32-element MX block, and the running total lives in a 64-bit accumulator with room to spare. That "room to spare" is not a comment that might rot — it is a theorem the build checks:

/-- Hardware block-sum tree width (measured: block-sum cost ≈ 1.5 multipliers). -/
def bfpHwSum : Nat := 8

/-- Format-level block size (OCP MX v1.0, shared E8M0 scale). -/
def mxBlockSize : Nat := 32

/-- Hardware sums per format block. -/
def hwSumsPerBlock : Nat := mxBlockSize / bfpHwSum

/-- Accumulator significant bits. -/
def accBits : Nat := 40

/-- The guard-bit budget is sufficient: eight full-magnitude 16×16 products
fit strictly within the accumulator. Checked by evaluation. -/
theorem hwSum_fits_acc : bfpHwSum * (2 ^ 16 - 1) ^ 2 < 2 ^ accBits := by⊢ bfpHwSum * (2 ^ 16 - 1) ^ 2 < 2 ^ accBits decideAll goals completed! 🐙

example : hwSumsPerBlock = 4 := by⊢ hwSumsPerBlock = 4 decideAll goals completed! 🐙

1.3.1. Bits on the wire🔗

Instructions must eventually live in memory as bits. Each instruction is one 64-bit word: an 8-bit opcode at the top, three 5-bit register fields below it, and the low bits carry whatever the instruction needs — a 32-bit immediate, a 12-bit branch target, an 8-bit scratchpad address, or a 2-bit direction. (The layout is provisional until it survives the first hardware; the function below is the authority, and the hardware decoder will be tested against it.)

/-- Field packing helper: place `v` at bit offset `lo` of a 64-bit word. -/
def field (lo : Nat) (v : Nat) : BitVec 64 :=
  (BitVec.ofNat 64 v) <<< lo

/-- Direction number (matches the mesh chapter's convention). -/
def Dir.code : Dir → Nat
  | .east => 0 | .west => 1 | .north => 2 | .south => 3

/-- Sub-opcode for ALU operations. -/
def AluOp.code : AluOp → Nat
  | .add => 0 | .sub => 1 | .and => 2 | .or => 3
  | .xor => 4 | .sll => 5 | .srl => 6 | .sra => 7

/-- Sub-opcode for the (currently dormant) FP extension. -/
def FpOp.code : FpOp → Nat
  | .fadd => 0 | .fsub => 1 | .fmul => 2 | .fmin => 3
  | .fmax => 4 | .flt => 5 | .feq => 6

/-- Encode one instruction as a 64-bit word: opcode at [63:56],
rd at [55:51], rs1 at [50:46], rs2 at [45:41]. -/
def Instr.encode : Instr → BitVec 64
  | .nop            => field 56 0
  | .halt           => field 56 1
  | .li rd imm      => field 56 2 ||| field 51 rd.val ||| (imm.zeroExtend 64)
  | .mac rs1 rs2    => field 56 3 ||| field 46 rs1.val ||| field 41 rs2.val
  | .macz           => field 56 4
  | .rdacc rd       => field 56 5 ||| field 51 rd.val
  | .ldw rd addr    => field 56 6 ||| field 51 rd.val ||| (addr.zeroExtend 64)
  | .stw rs addr    => field 56 7 ||| field 46 rs.val ||| (addr.zeroExtend 64)
  | .send d rs      => field 56 8 ||| field 46 rs.val ||| field 41 (Dir.code d)
  | .recv d rd      => field 56 9 ||| field 51 rd.val ||| field 41 (Dir.code d)
  | .beq r1 r2 off  => field 56 10 ||| field 46 r1.val ||| field 41 r2.val ||| (off.zeroExtend 64)
  | .bne r1 r2 off  => field 56 11 ||| field 46 r1.val ||| field 41 r2.val ||| (off.zeroExtend 64)
  | .blt r1 r2 off  => field 56 12 ||| field 46 r1.val ||| field 41 r2.val ||| (off.zeroExtend 64)
  | .jmp target     => field 56 13 ||| (target.zeroExtend 64)
  | .alu op rd r1 r2 =>
      field 56 (16 + AluOp.code op) ||| field 51 rd.val ||| field 46 r1.val ||| field 41 r2.val
  | .fp op rd r1 r2 =>
      field 56 (24 + FpOp.code op) ||| field 51 rd.val ||| field 46 r1.val ||| field 41 r2.val
  | .itof rd rs     => field 56 31 ||| field 51 rd.val ||| field 46 rs.val
  | .ftoi rd rs     => field 56 32 ||| field 51 rd.val ||| field 46 rs.val

example : Instr.encode .halt = 0x0100000000000000 := by⊢ Instr.halt.encode = 72057594037927936 native_decideAll goals completed! 🐙
example : Instr.encode (.mac 1 2)
    = (0x03 <<< 56 ||| 1 <<< 46 ||| 2 <<< 41 : BitVec 64) := by⊢ (Instr.mac 1 2).encode = ↑(3 <<< 56) ||| ↑(1 <<< 46) ||| ↑(2 <<< 41) native_decideAll goals completed! 🐙

Encoding is only half a contract; the other half is that the hardware's decoder gives everything back. Here is the decoder — written with the same bit positions the RTL uses — and the theorem that the two are inverse: for every instruction, decoding its encoding returns it unchanged. Small fields are checked by exhaustive evaluation (every register, every opcode); the wide immediates are handled symbolically by the bit-vector decision procedure.

def decodeReg (x : BitVec 64) (lo : Nat) : Reg :=
  ⟨((x >>> lo).setWidth 5).toNat, ((x >>> lo).setWidth 5).isLt⟩

def decodeDir (x : BitVec 64) : Dir :=
  match ((x >>> 41).setWidth 2).toNat with
  | 0 => .east | 1 => .west | 2 => .north | _ => .south

def Instr.decode (x : BitVec 64) : Instr :=
  let rd  := decodeReg x 51
  let rs1 := decodeReg x 46
  let rs2 := decodeReg x 41
  match ((x >>> 56).setWidth 8).toNat with
  | 0 => .nop
  | 1 => .halt
  | 2 => .li rd (x.setWidth 32)
  | 3 => .mac rs1 rs2
  | 4 => .macz
  | 5 => .rdacc rd
  | 6 => .ldw rd (x.setWidth 8)
  | 7 => .stw rs1 (x.setWidth 8)
  | 8 => .send (decodeDir x) rs1
  | 9 => .recv (decodeDir x) rd
  | 10 => .beq rs1 rs2 (x.setWidth 12)
  | 11 => .bne rs1 rs2 (x.setWidth 12)
  | 12 => .blt rs1 rs2 (x.setWidth 12)
  | 13 => .jmp (x.setWidth 12)
  | 16 => .alu .add rd rs1 rs2
  | 17 => .alu .sub rd rs1 rs2
  | 18 => .alu .and rd rs1 rs2
  | 19 => .alu .or  rd rs1 rs2
  | 20 => .alu .xor rd rs1 rs2
  | 21 => .alu .sll rd rs1 rs2
  | 22 => .alu .srl rd rs1 rs2
  | 23 => .alu .sra rd rs1 rs2
  | 24 => .fp .fadd rd rs1 rs2
  | 25 => .fp .fsub rd rs1 rs2
  | 26 => .fp .fmul rd rs1 rs2
  | 27 => .fp .fmin rd rs1 rs2
  | 28 => .fp .fmax rd rs1 rs2
  | 29 => .fp .flt  rd rs1 rs2
  | 30 => .fp .feq  rd rs1 rs2
  | 31 => .itof rd rs1
  | 32 => .ftoi rd rs1
  | _ => .halt

/-- Lifting a register index into the word: the 64-bit field is the
zero-extension of its 5-bit image. -/
theorem ofNat64_reg (r : Reg) :
    BitVec.ofNat 64 r.val = (BitVec.ofNat 5 r.val).zeroExtend 64 := byr:Reg⊢ BitVec.ofNat 64 ↑r = BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑r)
  apply BitVec.eq_of_toNat_eqr:Reg⊢ (BitVec.ofNat 64 ↑r).toNat = (BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑r)).toNat
  simpAll goals completed! 🐙

set_option maxHeartbeats 1000000 in
/-- The decoder is a left inverse of the encoder: no instruction is
altered by the trip through its 64 bits. -/
theorem decode_encode (i : Instr) : Instr.decode i.encode = i := byi:Instr⊢ Instr.decode i.encode = i
  cases i with
  | nop =>nop⊢ Instr.decode Instr.nop.encode = Instr.nop decideAll goals completed! 🐙
  | halt =>halt⊢ Instr.decode Instr.halt.encode = Instr.halt decideAll goals completed! 🐙
  | macz =>macz⊢ Instr.decode Instr.macz.encode = Instr.macz decideAll goals completed! 🐙
  | mac rs1 rs2 =>macrs1:Regrs2:Reg⊢ Instr.decode (Instr.mac rs1 rs2).encode = Instr.mac rs1 rs2 revert rs1 rs2mac⊢ ∀ (rs1 rs2 : Reg), Instr.decode (Instr.mac rs1 rs2).encode = Instr.mac rs1 rs2; native_decideAll goals completed! 🐙
  | rdacc rd =>rdaccrd:Reg⊢ Instr.decode (Instr.rdacc rd).encode = Instr.rdacc rd revert rdrdacc⊢ ∀ (rd : Reg), Instr.decode (Instr.rdacc rd).encode = Instr.rdacc rd; native_decideAll goals completed! 🐙
  | itof rd rs =>itofrd:Regrs:Reg⊢ Instr.decode (Instr.itof rd rs).encode = Instr.itof rd rs revert rd rsitof⊢ ∀ (rd rs : Reg), Instr.decode (Instr.itof rd rs).encode = Instr.itof rd rs; native_decideAll goals completed! 🐙
  | ftoi rd rs =>ftoird:Regrs:Reg⊢ Instr.decode (Instr.ftoi rd rs).encode = Instr.ftoi rd rs revert rd rsftoi⊢ ∀ (rd rs : Reg), Instr.decode (Instr.ftoi rd rs).encode = Instr.ftoi rd rs; native_decideAll goals completed! 🐙
  | alu op rd rs1 rs2 =>aluop:AluOprd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu op rd rs1 rs2).encode = Instr.alu op rd rs1 rs2 cases opalu.addrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.add rd rs1 rs2).encode = Instr.alu AluOp.add rd rs1 rs2alu.subrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.sub rd rs1 rs2).encode = Instr.alu AluOp.sub rd rs1 rs2alu.andrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.and rd rs1 rs2).encode = Instr.alu AluOp.and rd rs1 rs2alu.orrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.or rd rs1 rs2).encode = Instr.alu AluOp.or rd rs1 rs2alu.xorrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.xor rd rs1 rs2).encode = Instr.alu AluOp.xor rd rs1 rs2alu.sllrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.sll rd rs1 rs2).encode = Instr.alu AluOp.sll rd rs1 rs2alu.srlrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.srl rd rs1 rs2).encode = Instr.alu AluOp.srl rd rs1 rs2alu.srard:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.sra rd rs1 rs2).encode = Instr.alu AluOp.sra rd rs1 rs2 <;>alu.addrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.add rd rs1 rs2).encode = Instr.alu AluOp.add rd rs1 rs2alu.subrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.sub rd rs1 rs2).encode = Instr.alu AluOp.sub rd rs1 rs2alu.andrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.and rd rs1 rs2).encode = Instr.alu AluOp.and rd rs1 rs2alu.orrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.or rd rs1 rs2).encode = Instr.alu AluOp.or rd rs1 rs2alu.xorrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.xor rd rs1 rs2).encode = Instr.alu AluOp.xor rd rs1 rs2alu.sllrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.sll rd rs1 rs2).encode = Instr.alu AluOp.sll rd rs1 rs2alu.srlrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.srl rd rs1 rs2).encode = Instr.alu AluOp.srl rd rs1 rs2alu.srard:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.alu AluOp.sra rd rs1 rs2).encode = Instr.alu AluOp.sra rd rs1 rs2 (revert rd rs1 rs2alu.sra⊢ ∀ (rd rs1 rs2 : Reg), Instr.decode (Instr.alu AluOp.sra rd rs1 rs2).encode = Instr.alu AluOp.sra rd rs1 rs2; native_decideAll goals completed! 🐙)
  | fp op rd rs1 rs2 =>fpop:FpOprd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp op rd rs1 rs2).encode = Instr.fp op rd rs1 rs2 cases opfp.faddrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fadd rd rs1 rs2).encode = Instr.fp FpOp.fadd rd rs1 rs2fp.fsubrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fsub rd rs1 rs2).encode = Instr.fp FpOp.fsub rd rs1 rs2fp.fmulrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fmul rd rs1 rs2).encode = Instr.fp FpOp.fmul rd rs1 rs2fp.fminrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fmin rd rs1 rs2).encode = Instr.fp FpOp.fmin rd rs1 rs2fp.fmaxrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fmax rd rs1 rs2).encode = Instr.fp FpOp.fmax rd rs1 rs2fp.fltrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.flt rd rs1 rs2).encode = Instr.fp FpOp.flt rd rs1 rs2fp.feqrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.feq rd rs1 rs2).encode = Instr.fp FpOp.feq rd rs1 rs2 <;>fp.faddrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fadd rd rs1 rs2).encode = Instr.fp FpOp.fadd rd rs1 rs2fp.fsubrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fsub rd rs1 rs2).encode = Instr.fp FpOp.fsub rd rs1 rs2fp.fmulrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fmul rd rs1 rs2).encode = Instr.fp FpOp.fmul rd rs1 rs2fp.fminrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fmin rd rs1 rs2).encode = Instr.fp FpOp.fmin rd rs1 rs2fp.fmaxrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.fmax rd rs1 rs2).encode = Instr.fp FpOp.fmax rd rs1 rs2fp.fltrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.flt rd rs1 rs2).encode = Instr.fp FpOp.flt rd rs1 rs2fp.feqrd:Regrs1:Regrs2:Reg⊢ Instr.decode (Instr.fp FpOp.feq rd rs1 rs2).encode = Instr.fp FpOp.feq rd rs1 rs2 (revert rd rs1 rs2fp.feq⊢ ∀ (rd rs1 rs2 : Reg), Instr.decode (Instr.fp FpOp.feq rd rs1 rs2).encode = Instr.fp FpOp.feq rd rs1 rs2; native_decideAll goals completed! 🐙)
  | send d rs =>sendd:Dirrs:Reg⊢ Instr.decode (Instr.send d rs).encode = Instr.send d rs cases dsend.eastrs:Reg⊢ Instr.decode (Instr.send Dir.east rs).encode = Instr.send Dir.east rssend.westrs:Reg⊢ Instr.decode (Instr.send Dir.west rs).encode = Instr.send Dir.west rssend.northrs:Reg⊢ Instr.decode (Instr.send Dir.north rs).encode = Instr.send Dir.north rssend.southrs:Reg⊢ Instr.decode (Instr.send Dir.south rs).encode = Instr.send Dir.south rs <;>send.eastrs:Reg⊢ Instr.decode (Instr.send Dir.east rs).encode = Instr.send Dir.east rssend.westrs:Reg⊢ Instr.decode (Instr.send Dir.west rs).encode = Instr.send Dir.west rssend.northrs:Reg⊢ Instr.decode (Instr.send Dir.north rs).encode = Instr.send Dir.north rssend.southrs:Reg⊢ Instr.decode (Instr.send Dir.south rs).encode = Instr.send Dir.south rs (revert rssend.south⊢ ∀ (rs : Reg), Instr.decode (Instr.send Dir.south rs).encode = Instr.send Dir.south rs; native_decideAll goals completed! 🐙)
  | recv d rd =>recvd:Dirrd:Reg⊢ Instr.decode (Instr.recv d rd).encode = Instr.recv d rd cases drecv.eastrd:Reg⊢ Instr.decode (Instr.recv Dir.east rd).encode = Instr.recv Dir.east rdrecv.westrd:Reg⊢ Instr.decode (Instr.recv Dir.west rd).encode = Instr.recv Dir.west rdrecv.northrd:Reg⊢ Instr.decode (Instr.recv Dir.north rd).encode = Instr.recv Dir.north rdrecv.southrd:Reg⊢ Instr.decode (Instr.recv Dir.south rd).encode = Instr.recv Dir.south rd <;>recv.eastrd:Reg⊢ Instr.decode (Instr.recv Dir.east rd).encode = Instr.recv Dir.east rdrecv.westrd:Reg⊢ Instr.decode (Instr.recv Dir.west rd).encode = Instr.recv Dir.west rdrecv.northrd:Reg⊢ Instr.decode (Instr.recv Dir.north rd).encode = Instr.recv Dir.north rdrecv.southrd:Reg⊢ Instr.decode (Instr.recv Dir.south rd).encode = Instr.recv Dir.south rd (revert rdrecv.south⊢ ∀ (rd : Reg), Instr.decode (Instr.recv Dir.south rd).encode = Instr.recv Dir.south rd; native_decideAll goals completed! 🐙)
  | ldw rd addr =>ldwrd:Regaddr:BitVec 8⊢ Instr.decode (Instr.ldw rd addr).encode = Instr.ldw rd addr revert rd addrldw⊢ ∀ (rd : Reg) (addr : BitVec 8), Instr.decode (Instr.ldw rd addr).encode = Instr.ldw rd addr; native_decideAll goals completed! 🐙
  | stw rs addr =>stwrs:Regaddr:BitVec 8⊢ Instr.decode (Instr.stw rs addr).encode = Instr.stw rs addr revert rs addrstw⊢ ∀ (rs : Reg) (addr : BitVec 8), Instr.decode (Instr.stw rs addr).encode = Instr.stw rs addr; native_decideAll goals completed! 🐙
  | jmp t =>jmpt:BitVec 12⊢ Instr.decode (Instr.jmp t).encode = Instr.jmp t revert tjmp⊢ ∀ (t : BitVec 12), Instr.decode (Instr.jmp t).encode = Instr.jmp t; native_decideAll goals completed! 🐙
  | beq rs1 rs2 off =>beqrs1:Regrs2:Regoff:BitVec 12⊢ Instr.decode (Instr.beq rs1 rs2 off).encode = Instr.beq rs1 rs2 off revert rs1 rs2 offbeq⊢ ∀ (rs1 rs2 : Reg) (off : BitVec 12), Instr.decode (Instr.beq rs1 rs2 off).encode = Instr.beq rs1 rs2 off; native_decideAll goals completed! 🐙
  | bne rs1 rs2 off =>bners1:Regrs2:Regoff:BitVec 12⊢ Instr.decode (Instr.bne rs1 rs2 off).encode = Instr.bne rs1 rs2 off revert rs1 rs2 offbne⊢ ∀ (rs1 rs2 : Reg) (off : BitVec 12), Instr.decode (Instr.bne rs1 rs2 off).encode = Instr.bne rs1 rs2 off; native_decideAll goals completed! 🐙
  | blt rs1 rs2 off =>bltrs1:Regrs2:Regoff:BitVec 12⊢ Instr.decode (Instr.blt rs1 rs2 off).encode = Instr.blt rs1 rs2 off revert rs1 rs2 offblt⊢ ∀ (rs1 rs2 : Reg) (off : BitVec 12), Instr.decode (Instr.blt rs1 rs2 off).encode = Instr.blt rs1 rs2 off; native_decideAll goals completed! 🐙
  | li rd imm =>lird:Regimm:BitVec 32⊢ Instr.decode (Instr.li rd imm).encode = Instr.li rd imm
    have hb : Instr.encode (.li rd imm)
        = (BitVec.ofNat 64 2 <<< 56)
          ||| ((BitVec.ofNat 5 rd.val).zeroExtend 64 <<< 51)
          ||| imm.zeroExtend 64 := byi:Instr⊢ Instr.decode i.encode = i
      simp only [Instr.encode, field, ofNat64_reg]lird:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm⊢ Instr.decode (Instr.li rd imm).encode = Instr.li rd imm
    have hop : ((Instr.encode (.li rd imm) >>> 56).setWidth 8).toNat = 2 := byi:Instr⊢ Instr.decode i.encode = i
      have h : (Instr.encode (.li rd imm) >>> 56).setWidth 8 = 2#8 := byi:Instr⊢ Instr.decode i.encode = i
        rw [hbrd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm⊢ BitVec.setWidth 8
    ((2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm) >>> 56) =
  2#8]rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm⊢ BitVec.setWidth 8
    ((2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm) >>> 56) =
  2#8; bv_deciderd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immh:BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56) = 2#8⊢ (BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2
      rw [hrd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immh:BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56) = 2#8⊢ (2#8).toNat = 2]rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immh:BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56) = 2#8⊢ (2#8).toNat = 2; decidelird:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2⊢ Instr.decode (Instr.li rd imm).encode = Instr.li rd imm
    have hrd : decodeReg (Instr.encode (.li rd imm)) 51 = rd := byi:Instr⊢ Instr.decode i.encode = i
      have h5 : (Instr.encode (.li rd imm) >>> 51).setWidth 5
          = BitVec.ofNat 5 rd.val := byi:Instr⊢ Instr.decode i.encode = i rw [hbrd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2⊢ BitVec.setWidth 5
    ((2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm) >>> 51) =
  BitVec.ofNat 5 ↑rd]rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2⊢ BitVec.setWidth 5
    ((2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm) >>> 51) =
  BitVec.ofNat 5 ↑rd; bv_deciderd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2h5:BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51) = BitVec.ofNat 5 ↑rd⊢ decodeReg (Instr.li rd imm).encode 51 = rd
      apply Fin.extrd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2h5:BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51) = BitVec.ofNat 5 ↑rd⊢ ↑(decodeReg (Instr.li rd imm).encode 51) = ↑rd
      show ((Instr.encode (.li rd imm) >>> 51).setWidth 5).toNat = rd.valrd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2h5:BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51) = BitVec.ofNat 5 ↑rd⊢ (BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51)).toNat = ↑rd
      rw [h5rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2h5:BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51) = BitVec.ofNat 5 ↑rd⊢ (BitVec.ofNat 5 ↑rd).toNat = ↑rd]rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2h5:BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51) = BitVec.ofNat 5 ↑rd⊢ (BitVec.ofNat 5 ↑rd).toNat = ↑rd; simp only [BitVec.toNat_ofNat]rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2h5:BitVec.setWidth 5 ((Instr.li rd imm).encode >>> 51) = BitVec.ofNat 5 ↑rd⊢ ↑rd % 2 ^ 5 = ↑rd; omegalird:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2hrd:decodeReg (Instr.li rd imm).encode 51 = rd⊢ Instr.decode (Instr.li rd imm).encode = Instr.li rd imm
    have himm : (Instr.encode (.li rd imm)).setWidth 32 = imm := byi:Instr⊢ Instr.decode i.encode = i
      rw [hbrd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2hrd:decodeReg (Instr.li rd imm).encode 51 = rd⊢ BitVec.setWidth 32 (2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm) = imm]rd:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2hrd:decodeReg (Instr.li rd imm).encode 51 = rd⊢ BitVec.setWidth 32 (2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 imm) = imm; bv_decidelird:Regimm:BitVec 32hb:(Instr.li rd imm).encode = 2#64 <<< 56 ||| BitVec.zeroExtend 64 (BitVec.ofNat 5 ↑rd) <<< 51 ||| BitVec.zeroExtend 64 immhop:(BitVec.setWidth 8 ((Instr.li rd imm).encode >>> 56)).toNat = 2hrd:decodeReg (Instr.li rd imm).encode 51 = rdhimm:BitVec.setWidth 32 (Instr.li rd imm).encode = imm⊢ Instr.decode (Instr.li rd imm).encode = Instr.li rd imm
    simp only [Instr.decode, hop, hrd, himm]All goals completed! 🐙

1.4. The mesh and the chip-edge strap🔗

The elements are arranged in a two-dimensional grid whose edges wrap around: the top row's northern neighbour is the bottom row, the way the screen wraps in Pac-Man. (The wrap-around shape is called a torus, and its virtue is that no element is special — every tessera has exactly four neighbours, which keeps the hardware, the layout, and the proofs uniform.)

Neighbours communicate through the humblest device we could design, called a strap: a mailbox that holds exactly one word. A sender may drop a word in only when the mailbox is empty; the word then sits there — one cycle or one hour, it does not matter — until the receiver takes it. That "it does not matter" is the load-bearing phrase of the whole architecture. Because nothing about a strap depends on when the other side arrives, a strap stretched across a circuit board, or serialised down a two-wire link between chips, behaves identically to a strap between adjacent elements. That is what lets a mosaic grow beyond one piece of silicon, and it is stated here as the theorem the mesh chapter will prove:

                          untrusted host
                                │ UART
                         ┌──────┴──────┐
                         │  conductor  │   privileged by wiring: the only
                         └──────┬──────┘   element on the control MMIO
                                │ gateway strap
    ┌───────────────────────────┴───────────────────────────┐
    │   boundary — BFP ⇄ fp32 · exp/recip · DMA (smart edge) │
    └────┬──────────────┬──────────────┬──────────────┬─────┘
         │              │              │              │
     ┌───┴───┐      ┌───┴───┐      ┌───┴───┐      ┌───┴───┐
  ⟲──┤tessera├──────┤tessera├──────┤tessera├──────┤tessera├──⟲
     └───┬───┘      └───┬───┘      └───┬───┘      └───┬───┘
         │              │              │              │
     ┌───┴───┐      ┌───┴───┐      ┌───┴───┐      ┌───┴───┐
  ⟲──┤tessera├──────┤tessera├──────┤tessera├──────┤tessera├──⟲
     └───┬───┘      └───┬───┘      └───┬───┘      └───┬───┘
         ⟲              ⟲              ⟲              ⟲
                                          (⟲ = torus wrap links)

  Every ── and │ is a strap. At a package boundary a strap becomes
  the 2-wire chip-edge serial link — same semantics, different wire.

Strap semantics are latency-insensitive by construction, which yields the scale-out cornerstone (target theorem for M3):

Mesh semantics are invariant under link serialization — the mosaic does not care where the chip edge is.

Crossing between chips takes only two wires per direction. Words leave on one wire as a self-framed serial stream — a start bit, then the payload, the same scheme serial terminals have used since the 1960s — and the acknowledgement returns on the other. The receiving side simply withholds its acknowledgement until its local mailbox has been emptied, so the sender waits exactly as long as it must and not a cycle longer: the strap discipline, stretched over a wire. Eight pins connect a chip to all four neighbours — few enough to work even on hobbyist fabrication services. (The first version shares one clock between boards; a link between independently-clocked chips is the follow-on, and doubles as the prototype for joining silicon dies inside one package.)

/-- Chip-edge link: wires per direction (tx, rx). -/
def chipEdgeWiresPerDir : Nat := 2

/-- All four directions cross the package in 8 pins. -/
example : 4 * chipEdgeWiresPerDir = 8 := by⊢ 4 * chipEdgeWiresPerDir = 8 decideAll goals completed! 🐙

1.5. Data staging🔗

Weights and activations enter and leave the mesh through the boundary, streamed by the conductor through a DMA port — deliberately the simplest design that can work. A dedicated DMA engine earns its place only when a measured workload demands one, and none has yet.

1.6. Remaining open item🔗

Lane/packing order — OPEN, frozen jointly with the compiler workstream; one ordering, applied everywhere. This does not block M1 (the tessera model and refinement proof are lane-order-independent).

end Tilessa