Hardware Design Patterns

Notes

Lecture #3

10 Sept 2009

Today is show & tell. We discuss various systems and attempt to dissect them
into patterns. These notes mostly describe the systems themselves and do not
distill them further than the discussion did.

Henry Cook: RF UltraSPARC crypto accelerator

Modular arithmetic unit: ALU <=> multi-port scratchpad mem <=> DMA.
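The notes do not list which operations the modular arithmetic unit supports. As an assumption, public-key operations such as RSA reduce to modular exponentiation, which in turn is a sequence of modular multiplies on large operands, the kind of work an ALU paired with a scratchpad would perform. The following is a minimal square-and-multiply sketch; the function name `mod_exp` is hypothetical, and the scratchpad is modeled as plain Python integers.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right binary (square-and-multiply) modular exponentiation.

    Each loop step is one or two modular multiplies, standing in for the
    operations a modular arithmetic unit would run out of its scratchpad.
    """
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:              # scan exponent bits, MSB first
        result = (result * result) % modulus   # square for every bit
        if bit == "1":
            result = (result * base) % modulus # multiply when the bit is set
    return result

# Cross-check against Python's built-in three-argument pow().
print(mod_exp(4, 13, 497), pow(4, 13, 497))  # 445 445
```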

Cipher/hash unit. Little state; operates on data streamed via DMA.
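"Little state" is characteristic of hash functions that are built to be streamed. The internal state stays the same fixed size no matter how long the message is, so a unit can consume data chunk by chunk as DMA delivers it. The sketch below uses SHA-256 from Python's `hashlib` only as an example algorithm; the 64-byte chunk size stands in for a DMA transfer and is an arbitrary assumption.

```python
import hashlib

def hash_stream(chunks):
    """Hash a message that arrives as a sequence of DMA-sized chunks.

    Only the hasher's small internal state is carried between chunks;
    the full message never needs to be resident at once.
    """
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

message = b"abc" * 1000
chunks = [message[i:i + 64] for i in range(0, len(message), 64)]
print(hash_stream(chunks) == hashlib.sha256(message).hexdigest())  # True
```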

Communication with CPU via in-memory control queue & DMA.

DMA engine responsible for high-level control; probably local control within
the units.

Units operate independently, but seem to execute only one task at a time each.
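The control-queue arrangement can be modeled roughly as follows. The CPU appends descriptors to an in-memory queue, and each unit works through its own jobs strictly one at a time while running in parallel with the other units. The sketch assumes a hypothetical descriptor format of (unit name, job length in cycles) and assumes descriptors are routed to per-unit FIFOs, so a busy unit does not block jobs meant for another unit. The notes do not say how the real hardware handles this.

```python
from collections import deque

def simulate(control_queue, units):
    """Return the cycle at which each job finishes.

    control_queue: jobs in CPU submission order, each a hypothetical
    (unit_name, length_in_cycles) descriptor.
    units: names of the independent units.
    """
    # Route each descriptor to a per-unit FIFO (assumes no head-of-line blocking).
    per_unit = {name: deque() for name in units}
    for job_id, (unit, length) in enumerate(control_queue):
        per_unit[unit].append((job_id, length))

    finish = {}
    for unit, jobs in per_unit.items():
        clock = 0                # each unit has its own timeline (units run in parallel)
        while jobs:
            job_id, length = jobs.popleft()
            clock += length      # one task at a time: the next job starts when this ends
            finish[job_id] = clock
    return finish

jobs = [("mau", 5), ("cipher", 2), ("cipher", 3), ("mau", 1)]
print(simulate(jobs, ["mau", "cipher"]))  # {0: 5, 3: 6, 1: 2, 2: 5}
```

In this example all four jobs finish by cycle 6. A single unit running them back to back would take 11 cycles. Within each unit, however, jobs still complete serially.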

Ilia Lebedev: Fixed-Function Video Pipeline

3D accelerator, not general-purpose programmable. Consists of a pipeline of
functional units, each of which is specialized to one graphics task.

Triangles come into MIMD vertex processors, which project from 3D into 2D.
Latency can vary, so more complex load-balancing is needed at the input.
We don't know whether the DirectX standard requires overlapping triangles to be
drawn in order, so it is unclear whether reordering occurs after the vertex
processors.
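If ordering does have to be preserved, one plausible structure pairs a least-loaded dispatcher at the input with a reorder buffer at the output. This is a hypothetical sketch, not a description of the actual chip. Each triangle carries a sequence number, goes to whichever vertex unit frees up first, and results are held until every earlier sequence number has been released. For simplicity, all triangles are assumed to be waiting at cycle 0, and the latencies are made up.

```python
import heapq

def dispatch_least_loaded(latencies, num_units):
    """Send each triangle (in arrival order) to the unit that frees up first.

    latencies[i] is the variable processing time of triangle i.
    Returns (finish_time, seq) for every triangle.
    """
    free_at = [(0, unit) for unit in range(num_units)]  # (cycle unit is free, unit id)
    heapq.heapify(free_at)
    finished = []
    for seq, latency in enumerate(latencies):
        start, unit = heapq.heappop(free_at)            # least-loaded unit
        done = start + latency
        finished.append((done, seq))
        heapq.heappush(free_at, (done, unit))
    return finished

def reorder(finished):
    """Reorder buffer: release results strictly in sequence-number order."""
    held = set()
    next_seq = 0
    released = []
    for _, seq in sorted(finished):      # results arrive in completion order
        held.add(seq)
        while next_seq in held:          # drain everything that is now in order
            held.remove(next_seq)
            released.append(next_seq)
            next_seq += 1
    return released

finished = dispatch_least_loaded([5, 1, 1, 1], num_units=2)
print([seq for _, seq in sorted(finished)])  # completion order: [1, 2, 3, 0]
print(reorder(finished))                     # release order:    [0, 1, 2, 3]
```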

After culling/clipping, only triangles visible in the 2D projection remain.
The rasterizer spits out up to as many pixels per cycle as there are pixel
processors, which access the texture cache. The texture cache is addressed so
that a lookup pulls out the 2D neighbors of the indexed data.
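The notes don't say which addressing scheme the texture cache uses. Morton (Z-order) addressing is one common way to get 2D locality, and is used here as an assumption. It interleaves the bits of x and y, so any 2x2 texel quad aligned to even coordinates lands at four consecutive addresses, and larger aligned squares stay contiguous as well.

```python
def part1by1(n: int) -> int:
    """Spread the low 16 bits of n apart, inserting a 0 bit between each."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton_address(x: int, y: int) -> int:
    """Interleave coordinate bits as ... y1 x1 y0 x0."""
    return part1by1(x) | (part1by1(y) << 1)

# The 2x2 quad with corners (2, 0) and (3, 1) occupies one aligned run of 4.
print(sorted(morton_address(x, y) for x in (2, 3) for y in (0, 1)))  # [4, 5, 6, 7]
```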

Pixel processors appear to have all-to-all communication with several
Z-compares.
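A Z-compare is a depth test: a fragment is written only if it lies nearer to the viewer than what is already stored at that pixel. Any pixel processor may produce a fragment for any screen location, which is one plausible reason for the all-to-all connections. The sketch assumes the common "smaller z is nearer" convention and a hypothetical fragment format of (x, y, z, color).

```python
import math

def z_test(fragments, width, height):
    """Resolve fragments against a depth buffer; the nearest fragment wins."""
    depth = [[math.inf] * width for _ in range(height)]  # start "infinitely far"
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:        # Z-compare: keep only nearer fragments
            depth[y][x] = z
            color[y][x] = c
    return color

frame = z_test([(0, 0, 0.5, "red"), (0, 0, 0.2, "blue"), (0, 0, 0.9, "green")], 2, 1)
print(frame)  # [['blue', None]]
```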

High-level pattern evident throughout design: streams of data arrive and are
doled out to one of many identical processing units; their outputs are
re-aggregated into a single stream, which serves as the input to the next
stage in the graphics pipeline.
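The distribute/re-aggregate pattern can be sketched as a stage that deals items to N identical units and merges their outputs back into one stream, with stages chained together. This sketch assumes round-robin distribution and units of equal latency. Under those assumptions, collecting outputs in the same round-robin order keeps the stream in order without any extra bookkeeping. The function names are hypothetical.

```python
from typing import Callable, Iterable, List

def fan_out_fan_in(stream: Iterable, unit_fn: Callable, num_units: int) -> List:
    """One stage: deal items round-robin to identical units, then merge."""
    items = list(stream)
    lanes = [items[u::num_units] for u in range(num_units)]   # round-robin deal
    results = [[unit_fn(x) for x in lane] for lane in lanes]  # each unit's work
    # Collect in the same round-robin order so output order matches input order.
    return [results[i % num_units][i // num_units] for i in range(len(items))]

def pipeline(stream, stages):
    """Chain stages; each stage's merged output is the next stage's input."""
    for unit_fn, num_units in stages:
        stream = fan_out_fan_in(stream, unit_fn, num_units)
    return stream

print(pipeline(range(6), [(lambda v: v * 2, 4), (lambda v: v + 1, 3)]))
# [1, 3, 5, 7, 9, 11]
```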

Rimas Avizienis: Stereo speaker amplifier

This device has a fixed-function datapath that accepts as input two serial
audio channels and produces as output four audio channels to drive speakers.
There doesn't appear to be feedback in the pipeline: an audio sample that
comes in will come out after a fixed latency. (There's no apparent support
for exceptions in the pipeline.) The control unit is itself a
microcontroller, which communicates with the datapath via the standard I2C
bus.
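The notes don't say how two inputs become four outputs. As a hypothetical illustration, each channel could be split into a low band and a high band, for example for a woofer and a tweeter. Using only feed-forward (FIR) filters means no output is ever fed back, so every sample emerges after a fixed latency, which matches the observed behavior. The 2-tap filters below are deliberately crude examples, not the device's actual design.

```python
def crossover(samples):
    """Split one channel into two bands with 2-tap FIR filters.

    low[n]  = (x[n] + x[n-1]) / 2
    high[n] = (x[n] - x[n-1]) / 2
    The only state is the previous *input* sample, so there is no feedback.
    """
    prev = 0
    low, high = [], []
    for x in samples:
        low.append((x + prev) / 2)
        high.append((x - prev) / 2)
        prev = x
    return low, high

def amplifier(left, right):
    """Two input channels in, four speaker channels out."""
    left_low, left_high = crossover(left)
    right_low, right_high = crossover(right)
    return left_low, left_high, right_low, right_high

low, high = crossover([1, 0, 0, 0])    # impulse response
print(low, high)  # [0.5, 0.5, 0.0, 0.0] [0.5, -0.5, 0.0, 0.0]
```

The impulse response dies out after two samples. That finite response is what rules out the feedback paths that would make latency depend on the input history.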

Sarah Bird: Altera Soft DMA engine

This device provides four channels of DMA to/from a memory: it's a threaded
pipeline with four contexts. It provides a configurable priority scheme.
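The notes don't describe the actual priority options, so the following is a hypothetical arbiter offering two common schemes for choosing which of the four contexts issues each cycle. The first is fixed priority, given as a configurable ordered list. The second is round-robin, which starts searching just after the last channel granted. All names are illustrative.

```python
def make_arbiter(num_channels=4, fixed_priority=None):
    """Return grant(requests) -> channel id or None.

    fixed_priority: optional list of channel ids, highest priority first.
    If None, arbitrate round-robin starting after the last granted channel.
    """
    last = num_channels - 1
    def grant(requests):
        nonlocal last
        if fixed_priority is not None:
            for ch in fixed_priority:
                if requests[ch]:
                    return ch
            return None
        for offset in range(1, num_channels + 1):
            ch = (last + offset) % num_channels
            if requests[ch]:
                last = ch          # remember the winner for fairness next cycle
                return ch
        return None
    return grant

rr = make_arbiter()
print([rr([True] * 4) for _ in range(5)])  # [0, 1, 2, 3, 0]
```

Fixed priority can starve low-priority channels. Round-robin guarantees that each requesting channel is granted within four cycles.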

Scott Beamer: Focal Point II Network Router

Single chip for a 24-port 10Gbit router; achieves the full 240Gbit/sec (with large
packets only). Designed to work in a variety of routers (L2, L3 with various
feature support). Unsurprisingly, nearly half of the 250M-transistor chip is
SRAM (16Mbit), with another 20% or so dedicated to TCAM. Has a crossbar with
1Tbit/s bandwidth @ 3ns latency.
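A TCAM stores (value, mask) pairs, and a mask bit of 0 means "don't care." Hardware compares the key against every entry in parallel, and a priority encoder picks the first match. The loop below models that priority order sequentially. The sketch uses made-up IPv4 routes with longest prefixes listed first, which is one common way of getting longest-prefix match out of a TCAM. The table contents are hypothetical.

```python
def tcam_lookup(entries, key):
    """entries: (value, mask, result) in priority order.

    A bit is compared only where mask has a 1; mask 0 bits are "don't care".
    Returns the result of the first (highest-priority) match, else None.
    """
    for value, mask, result in entries:
        if (key & mask) == (value & mask):
            return result
    return None

routes = [
    (0x0A010100, 0xFFFFFF00, "port 3"),   # 10.1.1.0/24
    (0x0A000000, 0xFF000000, "port 1"),   # 10.0.0.0/8
    (0x00000000, 0x00000000, "default"),  # 0.0.0.0/0 matches everything
]
print(tcam_lookup(routes, 0x0A010107))  # 10.1.1.7 -> port 3
print(tcam_lookup(routes, 0x0A020304))  # 10.2.3.4 -> port 1
```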

The beginning of the pipeline is a merge FIFO: packet headers from any of the 24
ports share the same pipeline. The pipeline, which supports L2 and L3
routing, performs the IP route lookup, then an ARP lookup, then a MAC lookup,
and finally either sends the header to the scheduler or drops it if it would
violate QoS.
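The header pipeline can be sketched end to end. Headers from all ports merge into one FIFO, then pass through a route lookup, an ARP lookup, and a MAC lookup before reaching a final forward-or-drop decision. Everything specific here is an assumption. The tables are made up. The QoS check is modeled as a simple per-output-port queue-depth limit. Headers with no route are also dropped. Longest-prefix match uses Python's `ipaddress` module.

```python
import ipaddress
from collections import deque

# Hypothetical tables, for illustration only.
ROUTES = {ipaddress.ip_network("10.0.0.0/8"): "10.0.0.1",
          ipaddress.ip_network("10.1.0.0/16"): "10.1.0.1"}
ARP = {"10.0.0.1": "aa:aa:aa:aa:aa:01", "10.1.0.1": "aa:aa:aa:aa:aa:02"}
MAC_TO_PORT = {"aa:aa:aa:aa:aa:01": 3, "aa:aa:aa:aa:aa:02": 7}

def route(dst):
    """IP routing: longest-prefix match, returning the next-hop address."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]

def process(headers, queue_limit):
    """headers: (input_port, dst_ip) pairs from any of the ports."""
    merge_fifo = deque(headers)        # all ports share one pipeline
    scheduled, dropped = [], []
    depth = {}                         # headers queued per output port
    while merge_fifo:
        src_port, dst = merge_fifo.popleft()
        next_hop = route(dst)                              # IP route lookup
        mac = ARP.get(next_hop) if next_hop else None      # ARP lookup
        out_port = MAC_TO_PORT.get(mac) if mac else None   # MAC lookup
        # Stand-in QoS check: drop if unroutable or the output queue is full.
        if out_port is None or depth.get(out_port, 0) >= queue_limit:
            dropped.append((src_port, dst))
            continue
        depth[out_port] = depth.get(out_port, 0) + 1
        scheduled.append((dst, out_port))                  # hand to scheduler
    return scheduled, dropped

print(process([(0, "10.1.2.3"), (1, "10.9.9.9"), (2, "10.1.5.5")], queue_limit=1))
# ([('10.1.2.3', 7), ('10.9.9.9', 3)], [(2, '10.1.5.5')])
```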

Yunsup Lee: Yet Another Network Router (Broadcom BCM5600)

A 24x100Mbit and 2x1Gbit router chip. To cheaply meet significant intra-chip
bandwidth demands, a single, fat bus is time-multiplexed among the several
ingress/egress units, the PCI bus, and the memory controller. This pattern
probably will not scale to faster routers.
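Time-multiplexing can be sketched as a fixed, repeating slot table. Each client owns the bus during its slots, and its bandwidth is its fraction of the slots multiplied by the raw bus bandwidth. The slot table and the 16 Gbit/s bus figure are hypothetical, not BCM5600 specifications.

```python
from collections import Counter

def tdm_schedule(slot_table, cycles):
    """Which client owns the bus on each cycle (the table repeats forever)."""
    return [slot_table[c % len(slot_table)] for c in range(cycles)]

def bandwidth_share(slot_table, bus_gbps):
    """Each client's bandwidth = (its slots / total slots) * bus bandwidth."""
    counts = Counter(slot_table)
    return {client: bus_gbps * n / len(slot_table) for client, n in counts.items()}

slots = ["ingress", "egress", "ingress", "egress", "pci", "mem", "mem", "mem"]
print(bandwidth_share(slots, bus_gbps=16))
# {'ingress': 4.0, 'egress': 4.0, 'pci': 2.0, 'mem': 6.0}
```

The shares always add up to the bus bandwidth. Faster or more numerous ports therefore require a proportionally faster or wider bus, which is consistent with the expectation that this pattern will not scale.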