Hardware Design Patterns Notes
Lecture #3
10 Sept 2009

Today is show & tell. We discuss various systems and attempt to dissect them into patterns. These notes mostly describe the systems themselves and do not distill them further than did the discussion.

Henry Cook: RF UltraSPARC crypto accelerator

Modular arithmetic unit: ALU <=> multi-port scratchpad mem <=> DMA.
Cipher/hash unit. Little state; operates on data streamed via DMA.
Communication with CPU via in-memory control queue & DMA. DMA engine responsible for high-level control; probably local control within the units. Units operate independently, but seem to execute only one task at a time each.

Ilia Lebedev: Fixed-Function Video Pipeline

3D accelerator, not general-purpose programmable. Consists of a pipeline of functional units, each of which is specialized to one graphics task.

Triangles come into MCMD vertex processors, which project from 3D into 2D. Potential for variable latency, so need more complex load-balancing at input. Don't know if DirectX standard requires ordering of overlapped triangles, so unclear if reordering occurs after vertex processors.

After culling/clipping, only triangles visible in 2D projection remain. Rasterizer spits out up to as many pixels/cycle as there are pixel processors, which access texture cache. The texture cache is addressed in such a way as to pull out the 2D neighbors of the indexed data. Pixel processors appear to have all-to-all communication with several Z-compares.

High-level pattern evident throughout design: streams of data arrive and are doled out to one of many identical processing units; their outputs are re-aggregated into a single stream, which serves as the input to the next stage in the graphics pipeline.

Rimas Avizienis: Stereo speaker amplifier

This device has a fixed-function datapath that accepts as input two serial audio channels and produces as output four audio channels to drive speakers.
There doesn't appear to be feedback in the pipeline: an audio sample that comes in will come out after a fixed latency. (There's no apparent support for exceptions in the pipeline.) The control unit is itself a microcontroller, which communicates with the datapath via the standard I2C bus.

Sarah Bird: Altera Soft DMA engine

This device provides four channels of DMA to/from a memory: it's a threaded pipeline with four contexts. It provides a configurable priority scheme.

Scott Beamer: Focal Point II Network Router

Single chip for a 24-port 10Gbit router; achieves full 240Gbit/sec (with large packets only). Designed to work in a variety of routers (L2, L3 with various feature support).

Unsurprisingly, nearly half of the 250M-transistor chip is SRAM (16Mbit), with another 20% or so dedicated to TCAM. Has a crossbar with 1Tbit/s bandwidth @ 3ns latency.

The beginning of the pipeline is a merge FIFO: packet headers from any of the 24 ports share the same pipeline. The pipeline, which supports L2 and L3 routing, performs IP routing, does an ARP lookup and then a MAC lookup, and then either sends the header to the scheduler or drops it if it would violate QoS.

Yunsup Lee: Yet Another Network Router (Broadcom BCM5600)

A 24x100Mbit and 2x1Gbit router chip. To cheaply meet significant intra-chip bandwidth demands, a single, fat bus is time-multiplexed between the several ingress/egress units, PCI bus, and memory controller. This pattern probably will not scale to faster routers.
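
A rough sketch of the time-multiplexed bus pattern and why it probably will not scale. This is not taken from the BCM5600 itself: the client names, the fixed round-robin slot order, and the assumption that every ingress bit crosses the shared bus at least once are all illustrative. Only the port counts and rates come from the notes above.

```python
def tdm_schedule(clients, n_slots):
    """Owner of each bus slot under fixed round-robin time multiplexing."""
    return [clients[slot % len(clients)] for slot in range(n_slots)]


def required_bus_mbps(port_rates_mbps):
    """Lower bound on shared-bus bandwidth in Mbit/s.

    Assumes (hypothetically) that every ingress bit crosses the bus at
    least once, so the bus must carry at least the sum of all port rates.
    """
    return sum(port_rates_mbps)


# Hypothetical bus clients; the real chip has several ingress/egress units.
clients = ["ingress0", "ingress1", "pci", "memctl"]
schedule = tdm_schedule(clients, 8)

# Port mix from the notes: 24 x 100 Mbit plus 2 x 1 Gbit.
bcm5600_ports = [100] * 24 + [1000] * 2

# The same bus pattern applied to a 24 x 10 Gbit port count.
fast_ports = [10_000] * 24

slow_need = required_bus_mbps(bcm5600_ports)
fast_need = required_bus_mbps(fast_ports)

print(schedule)
print(slow_need, "Mbit/s minimum for the 100 Mbit / 1 Gbit port mix")
print(fast_need, "Mbit/s minimum for 24 x 10 Gbit ports")
```

Under these assumptions the single bus must sustain the sum of all port rates, so its required bandwidth grows linearly with both port count and port speed, while each client still sees only its own slice of the slots. That is one plausible reading of why a faster design would move to something like a crossbar instead.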