# EECS150 - Digital Design, Lecture 26 - High-Level Design (Part 2)

April 22, 2010 John Wawrzynek

Spring 2010

EECS150 - Lec26-hdl2


### List Processor Example

- Design a circuit that forms the sum of all the 2's complement integers stored in a linked-list structure starting at memory address 0:




# List Example Resource Scheduling

- In this case, first spread out, then pack.

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|----------|---------|---------|---------|---------|
| Memory | next<sub>1</sub> | | x<sub>1</sub> | |
| Adder | | numa<sub>1</sub> | | sum<sub>1</sub> |

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Memory | next<sub>1</sub> | | next<sub>2</sub> | x<sub>1</sub> | next<sub>3</sub> | x<sub>2</sub> | next<sub>4</sub> | x<sub>3</sub> | |
| Adder | | numa<sub>1</sub> | | numa<sub>2</sub> | sum<sub>1</sub> | numa<sub>3</sub> | sum<sub>2</sub> | numa<sub>4</sub> | sum<sub>3</sub> |

1. X←Memory[NUMA], NUMA←NEXT+1;
2. NEXT←Memory[NEXT], SUM←SUM+X;

- Three different loop iterations active at once.
- Short cycle time (no dependencies within a cycle).
- Full utilization (only 2 cycles per result).
- Initialization: x=0, numa=1, sum=0, next=memory[0].
- Extra control states (out of the loop):
  - one to initialize next
  - one to finish off, 2 cycles after next==0.
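The two-action loop can be tried out in software. The following is a minimal sketch, not the lecture's controller: it assumes that the node at address `p` stores its NEXT pointer in `memory[p]` and its value X in `memory[p + 1]`, that the first node sits at address 0, and that a NEXT pointer of 0 ends the list. The function name `sum_list` is made up for illustration.

```python
def sum_list(memory):
    """Sum the X fields of a linked list whose first node sits at address 0.

    Assumed layout (not stated in the slides): memory[p] is the NEXT
    pointer of the node at p, memory[p + 1] is its value X, and a NEXT
    pointer of 0 ends the list.
    """
    # Initialization state: x=0, numa=1, sum=0, next=memory[0].
    x, numa, total, nxt = 0, 1, 0, memory[0]
    while nxt != 0:
        # Action 1: both right-hand sides read the old register values.
        x, numa = memory[numa], nxt + 1
        # Action 2: likewise evaluated from the old register values.
        nxt, total = memory[nxt], total + x
    # Finish-off after next == 0: fetch and add the last node's value.
    x = memory[numa]
    total += x
    return total


# Nodes at addresses 0, 4, 2 holding the values 3, -2, 7.
memory = [4, 3, 0, 7, 2, -2]
print(sum_list(memory))
```

The tuple assignments mimic registers that all update on the same clock edge, which is why each action can read NEXT or NUMA and overwrite it in the same step.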




- Incremental cost:
  - Addition of another register & mux, adder mux, and control.
- Performance: find max time of the four actions
  - 1. X←Memory[NUMA], NUMA←NEXT+1;
  - 2. NEXT←Memory[NEXT], SUM←SUM+X;
  - 0.5+1+10+1+1+0.5 = 14 ns, same for all $\Rightarrow$ T>14ns, F<71MHz
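The arithmetic behind the 14 ns and 71 MHz figures can be checked in a few lines. This sketch uses only the six delay terms quoted above; the variable names are illustrative, and the slides do not say here which component each term belongs to.

```python
# Delay terms along the critical path, in ns, as listed above.
delays_ns = [0.5, 1, 10, 1, 1, 0.5]

min_period_ns = sum(delays_ns)
max_freq_mhz = 1e3 / min_period_ns  # 1/ns is GHz, so 1e3/ns is MHz

print(f"T >= {min_period_ns} ns, F <= {max_freq_mhz:.1f} MHz")
```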

## **Other Optimizations**

- Node alignment restriction:
  - Suppose the application of the list processor allows us to restrict the placement of nodes in memory so that they are aligned on even multiples of 2 bytes. Then:
    - NUMA addition can be eliminated.
    - Controller supplies "0" for the low bit of the memory address for NEXT, and "1" for X.
  - Furthermore, if we could use a memory with a 16-bit wide output, then we could fetch an entire node in one cycle:

    {NEXT, X} ← Memory[NEXT], SUM ← SUM + X;

    $\Rightarrow$ execution time cut in half (half as many cycles)
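A rough software model of the wide-memory version, under stated assumptions: each memory word is modeled as a `(next, x)` pair, the first node is at address 0, and a NEXT pointer of 0 ends the list. The name `sum_list_wide` and the dictionary-based memory are illustrative only.

```python
def sum_list_wide(memory):
    """memory[p] is an assumed (next, x) pair; next == 0 ends the list."""
    nxt, x, total = 0, 0, 0
    while True:
        # One cycle: {NEXT, X} <- Memory[NEXT], SUM <- SUM + X (old X).
        (nxt, x), total = memory[nxt], total + x
        if nxt == 0:
            break
    total += x  # one more cycle to add the last value fetched
    return total


memory = {0: (4, 3), 4: (2, -2), 2: (0, 7)}
print(sum_list_wide(memory))
```

Each loop pass stands for a single cycle that both fetches a node and accumulates the previously fetched value, which is where the halving of the cycle count comes from.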


# List Processor Conclusions

- Through careful optimization:
  - clock frequency increased from 32MHz to 71MHz
  - little cost increase.
- "Scheduling" was used to overlap and to maximize use of resources.
- Questions:
  - Consider the design process we went through:
    - Could a computer program go from RTL description to circuits automatically?
    - Could a computer program derive the optimizations that we did?

# Modulo Scheduling

Review of list processor scheduling:

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|----------|---------|---------|---------|---------|
| Memory | next<sub>1</sub> | | x<sub>1</sub> | |
| Adder | | numa<sub>1</sub> | | sum<sub>1</sub> |

- How did we know to "spread" out the schedule of one iteration to allow efficient packing?

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Memory | next<sub>1</sub> | | next<sub>2</sub> | x<sub>1</sub> | next<sub>3</sub> | x<sub>2</sub> | next<sub>4</sub> | x<sub>3</sub> | |
| Adder | | numa<sub>1</sub> | | numa<sub>2</sub> | sum<sub>1</sub> | numa<sub>3</sub> | sum<sub>2</sub> | numa<sub>4</sub> | sum<sub>3</sub> |

- The goal of *modulo scheduling* is to find the schedule for one *characteristic section* of the computation. This is the part the control loops over.
- The entire schedule can then be derived, by repeating the characteristic section or repeating it with some pieces omitted.


### Modulo Scheduling Procedure

1. Calculate the *minimal length of the characteristic section*: the maximum number of cycles that any one resource is used during one iteration of the computation (assuming a resource can only be used once per cycle).
2. Schedule one iteration of the computation on the characteristic section, wrapping around when necessary. Each time the computation wraps around, decrease the iteration subscript by one.
3. If the iteration will not fit on the minimal length section, increase the section length by one and try again.
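The three steps can be sketched as a small greedy scheduler. This is an illustration under assumptions, not the lecture's algorithm: every operation takes one cycle, its result is usable in the next cycle, operations are placed in the order given, and dependencies between different iterations are not modeled. All function names and the `(name, resource, dependencies)` encoding are hypothetical.

```python
from math import ceil


def min_section_length(ops, capacity):
    """Step 1: the busiest resource sets the shortest possible section."""
    uses = {}
    for _name, resource, _deps in ops:
        uses[resource] = uses.get(resource, 0) + 1
    return max(ceil(n / capacity[r]) for r, n in uses.items())


def try_schedule(ops, capacity, length):
    """Step 2: place each op in the earliest free slot, wrapping modulo length."""
    start, busy = {}, {}
    for name, resource, deps in ops:
        earliest = max((start[d] + 1 for d in deps), default=0)
        for t in range(earliest, earliest + length):
            slot = (resource, t % length)
            if busy.get(slot, 0) < capacity[resource]:
                busy[slot] = busy.get(slot, 0) + 1
                start[name] = t
                break
        else:
            return None  # no free slot for this op at this length
    # Report (slot within the section, number of wrap-arounds) per op.
    return {name: (t % length, t // length) for name, t in start.items()}


def modulo_schedule(ops, capacity):
    """Step 3: grow the section until one iteration fits."""
    length = min_section_length(ops, capacity)
    while True:
        placed = try_schedule(ops, capacity, length)
        if placed is not None:
            return length, placed
        length += 1


# One iteration of the list processor: (name, resource, dependencies).
ops = [
    ("next", "memory", []),
    ("numa", "adder", ["next"]),
    ("x", "memory", ["numa"]),
    ("sum", "adder", ["x"]),
]
length, placed = modulo_schedule(ops, {"memory": 1, "adder": 1})
print(length, placed)
```

The wrap-around count of each operation is the amount by which its iteration subscript drops. For the list processor the sketch puts next<sub>i</sub> and sum<sub>i-2</sub> in one cycle and numa<sub>i</sub> and x<sub>i-1</sub> in the other, which agrees with the packed schedule shown earlier.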

# Modulo Scheduling List Processor



- Finished schedule for 4 iterations:

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Memory | next<sub>1</sub> | | next<sub>2</sub> | x<sub>1</sub> | next<sub>3</sub> | x<sub>2</sub> | next<sub>4</sub> | x<sub>3</sub> | |
| Adder | | numa<sub>1</sub> | | numa<sub>2</sub> | sum<sub>1</sub> | numa<sub>3</sub> | sum<sub>2</sub> | numa<sub>4</sub> | sum<sub>3</sub> |

# Another Scheduling Example

- Assume A, B, C, D, E stored in a dual port memory.
- Assume a single adder.
- Minimal schedule section length = 3. (Both memory and adder are used for 3 cycles during one iteration.)

[Figure: compute graph for E (one iteration of a repeating calculation).]

| Resource | Cycle 1 | Cycle 2 | Cycle 3 |
|---------------|--------|--------|---------|
| memory port 1 | load A | load C |         |
| memory port 2 | load B | load D | store E |
| adder         | E =    | A + B  | C + D   |

Repeating schedule:

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 |
|---------------|--------|--------|---------|--------|--------|---------|--------|--------|---------|
| memory port 1 | load A | load C |         | load A | load C |         | load A | load C |         |
| memory port 2 | load B | load D | store E | load B | load D | store E | load B | load D | store E |
| adder         | E =    | A + B  | C + D   | E =    | A + B  | C + D   | E =    | A + B  | C + D   |
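As a quick check of the stated minimal section length, the per-iteration resource usage can be counted directly. The sketch below takes the operation lists and the three-cycle section from the tables above; the variable names are illustrative.

```python
from math import ceil

# Per-iteration resource usage in the example above.
memory_ops = ["load A", "load B", "load C", "load D", "store E"]
adder_ops = ["A + B", "C + D", "E ="]
memory_ports, adders = 2, 1

min_length = max(ceil(len(memory_ops) / memory_ports),
                 ceil(len(adder_ops) / adders))

# The characteristic section from the table, one entry per cycle.
section = [
    {"memory": ["load A", "load B"], "adder": ["E ="]},
    {"memory": ["load C", "load D"], "adder": ["A + B"]},
    {"memory": ["store E"], "adder": ["C + D"]},
]
# No cycle may use more memory ports or adders than exist.
fits = all(len(c["memory"]) <= memory_ports and len(c["adder"]) <= adders
           for c in section)
print(min_length, len(section), fits)
```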

### **Parallelism**

Parallelism is the act of **doing more than one thing at a time**. Optimization in hardware design often involves using parallelism to trade between cost and performance.

- Example: student final grade calculation:

- High performance hardware implementation:



### **Parallelism**

- Is there a lower cost hardware implementation? Different tree organization?
- Can factor out multiply by 0.2:



- How about sharing operators (multipliers and adders)?
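The factoring step can be written out explicitly. The weights below are an assumption inferred from the time-multiplexed statements in this lecture (0.2 applied to the sum of the three midterms, 0.4 to the project); the original figure is not available here.

$$
\text{grade} = 0.2\,mt_1 + 0.2\,mt_2 + 0.2\,mt_3 + 0.4\,proj = 0.2\,(mt_1 + mt_2 + mt_3) + 0.4\,proj
$$

Under that assumption the left form needs four multiplies and three adds, while the factored form needs two multiplies and three adds.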

# Time-Multiplexing

Controller sequence:

1. acc1 = mt1 + mt2;
2. acc1 = acc1 + mt3;
3. acc1 = 0.2 × acc1;
4. acc2 = 0.4 × proj;
5. grade = acc1 + acc2;

- Time multiplex single ALU for all adds and multiplies:
- Attempts to minimize cost at the expense of time.
  - Need to add extra register, muxes, control.
- If we adopt the above approach, we can then consider the combinational hardware circuit diagram as an *abstract computation-graph*.
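A small model of the time-multiplexed datapath may make the idea concrete. This is a sketch under assumptions: a register file modeled as a dictionary, a single ALU that does one add or multiply per cycle, and constants 0.2 and 0.4 held in hypothetical registers `k02` and `k04`. The controller program follows the five statements of the controller sequence.

```python
def alu(op, a, b):
    """The single shared ALU: one add or one multiply per cycle."""
    return a + b if op == "add" else a * b


def grade_time_multiplexed(mt1, mt2, mt3, proj):
    # Controller program: (op, destination, source a, source b), one per cycle.
    program = [
        ("add", "acc1", "mt1", "mt2"),
        ("add", "acc1", "acc1", "mt3"),
        ("mul", "acc1", "k02", "acc1"),
        ("mul", "acc2", "k04", "proj"),
        ("add", "grade", "acc1", "acc2"),
    ]
    regs = {"mt1": mt1, "mt2": mt2, "mt3": mt3, "proj": proj,
            "k02": 0.2, "k04": 0.4}
    for op, dst, a, b in program:
        regs[dst] = alu(op, regs[a], regs[b])
    return regs["grade"], len(program)


grade, cycles = grade_time_multiplexed(80, 90, 70, 100)
print(grade, cycles)
```

One ALU replaces all the adders and multipliers of the parallel version, at the price of five cycles per result plus the extra registers, muxes, and control.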



- This time-multiplexing "covers" the computation graph by performing the action of each node one at a time. (Sort of *emulates* it.)
- Using other primitives, other coverings are possible.

[Figure: time-multiplexed datapath: inputs mt1, mt2, mt3, proj; a single ALU; registers acc1, acc2.]

Spring 2009 EECS150 - Lec27-hld3

### HW versus SW

This time-multiplexed ALU approach is very similar to what a conventional software version would accomplish:

    add r2,r1,r3
    add r2,r2,r4
    mult r2,r4,r5

- CPUs time-multiplex function units (ALUs, etc.)
- This model matches our tendency to express computation sequentially even though most computations naturally contain parallelism.
- Our programming languages also strengthen a sequential tendency.
- In hardware we have the ability to exploit problem parallelism, which gives us a "knob" to trade off performance and cost.
- It may be best to express computations as abstract computation graphs (rather than "programs"); this should lead to a wider range of implementations.
- Note: modern processors spend much of their cost budget attempting to restore execution parallelism: "super-scalar execution".

### **Exploiting Parallelism in HW**

- Example: Video Codec



- Separate algorithm blocks implemented in separate HW blocks, or HW is time-multiplexed.
- Entire operation is pipelined (with possible pipelining within the blocks).
- "Loop unrolling" used within blocks or for the entire computation.

# **Optimizing Iterative Computations**

- Hardware implementations of computations almost always involve <u>looping</u>. Why?
- Is this true with software?
- Are there programs without loops?
  - Maybe in "throw away" code.
- We probably would not bother building such a thing into hardware, would we?
  - (FPGA may change this.)
- Fact is, our computations are closely tied to loops. Almost all our HW includes some looping mechanism.
- What do we use looping for?

# **Optimizing Iterative Computations**

Types of loops:

1. Looping over input data (streaming):
   - ex: MP3 player, video compressor, music synthesizer.
2. Looping over memory data:
   - ex: vector inner product, matrix multiply, list processing.
   - 1) and 2) are really very similar. 1) is often turned into 2) by buffering up input data and processing "offline". Even for "online" processing, buffers are used to smooth out temporary rate mismatches.
3. CPUs are one big loop:
   - Instruction fetch $\Rightarrow$ execute $\Rightarrow$ Instruction fetch $\Rightarrow$ execute $\Rightarrow$ ...
   - but they change their personality with each iteration.
4. Others?

#### Loops offer an opportunity for parallelism by executing more than one iteration at once, using parallel iteration execution and/or pipelining.


# **Pipelining Principle**

- With looping usually we are less interested in the latency of one <u>iteration</u> and more in the loop execution rate, or <u>throughput</u>.
- These can be different due to parallel iteration execution and/or pipelining.
- Pipelining review from CS61C:

Analogy to washing clothes:



### **Pipelining**

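| Stage | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 |
|-------|--------|--------|--------|--------|--------|--------|
| wash | load1 | load2 | load3 | load4 | | |
| dry | | load1 | load2 | load3 | load4 | |
| fold | | | load1 | load2 | load3 | load4 |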

- In the limit, as we increase the number of loads, the average time per load approaches 20 minutes.
- The <u>latency</u> (time from start to end) for one load = 60 min.
- The <u>throughput</u> = 3 loads/hour
- The pipelined throughput ≈ # of pipe stages x un-pipelined throughput.
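The laundry numbers can be reproduced with a few lines of arithmetic. The sketch assumes three equal stages of 20 minutes each, which is what a 60 minute latency and a limiting average of 20 minutes per load imply; the function names are illustrative.

```python
STAGE_MIN = 20   # each stage takes 20 minutes (assumed equal stages)
STAGES = 3       # wash, dry, fold


def pipelined_total_minutes(loads):
    """Time to finish `loads` loads when a new load starts every stage time."""
    return (STAGES + loads - 1) * STAGE_MIN


def unpipelined_total_minutes(loads):
    """Time when each load finishes all stages before the next one starts."""
    return loads * STAGES * STAGE_MIN


for loads in (1, 4, 100):
    avg = pipelined_total_minutes(loads) / loads
    print(loads, pipelined_total_minutes(loads), round(avg, 1))
```

For one load the pipelined and un-pipelined times are both 60 minutes; as the number of loads grows, the average time per load falls toward the 20 minute stage time.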




- CL block produces a new result every 5ns instead of every 9ns.

### Limits on Pipelining

- Without FF overhead, throughput improvement $\propto$ # of stages.
- After many stages are added, FF overhead begins to dominate:



- Other limiters to effective pipelining:
  - clock skew contributes to clock overhead
  - unequal stages
  - FFs dominate cost
  - clock distribution power consumption
  - feedback (dependencies between loop iterations)
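The effect of flip-flop overhead can be seen in a simple model where the clock period is the logic delay divided evenly among the stages plus a fixed per-stage overhead. The 8 ns and 1 ns values are hypothetical, chosen only because they give 9 ns for one stage and 5 ns for two, the figures mentioned earlier; the actual breakdown in the lecture's figure is not known here.

```python
def cycle_time_ns(logic_ns, ff_overhead_ns, stages):
    """Clock period when the logic is split into equal stages."""
    return logic_ns / stages + ff_overhead_ns


LOGIC_NS, FF_NS = 8.0, 1.0  # hypothetical values, not from the lecture

for stages in (1, 2, 4, 8, 16):
    period = cycle_time_ns(LOGIC_NS, FF_NS, stages)
    speedup = cycle_time_ns(LOGIC_NS, FF_NS, 1) / period
    print(stages, period, round(speedup, 2))
```

In this model the speedup can never exceed (logic + overhead) / overhead, however many stages are added, which is the sense in which the overhead comes to dominate.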

# **Pipelining Example**

- $F(x) = y_i = a x_i^2 + b x_i + c$



- x and y are assumed to be "streams".
- Divide into 3 (nearly) equal stages.
- Insert pipeline registers at dashed lines.
- Can we pipeline basic operators?
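A software model of a three-stage pipeline for this polynomial is sketched below. The stage boundaries are an assumption, since the figure with the dashed lines is not available: stage 1 computes x·x and b·x, stage 2 computes a·(x·x) and b·x + c, and stage 3 does the final add. The function name and register names are illustrative.

```python
def pipelined_poly(xs, a, b, c):
    """Stream xs through a 3-stage pipeline computing a*x*x + b*x + c."""
    r1 = r2 = None          # pipeline registers between the stages
    ys = []
    # Feed the inputs, then two empty slots to drain the pipeline.
    for x in list(xs) + [None, None]:
        # Stage 3 reads r2, stage 2 reads r1, stage 1 reads the input.
        if r2 is not None:
            ys.append(r2[0] + r2[1])
        new_r2 = None if r1 is None else (a * r1[0], r1[1] + c)
        new_r1 = None if x is None else (x * x, b * x)
        r1, r2 = new_r1, new_r2   # clock edge: both registers update together
    return ys


print(pipelined_poly([0, 1, 2, 3], a=2, b=3, c=4))
```

Once the pipeline is full, one result leaves per loop pass even though each individual result spends three passes in flight.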



# **Example: Pipelined Adder**

- Possible, but usually not done.
- (arithmetic units can often be made sufficiently fast without internal pipelining)








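### **Pipelining Loops with Feedback**

*"Loop carry dependency"*

- Example 1: $y_i = y_{i-1} + x_i + a$

Unpipelined version:

| Resource | Cycle 1 | Cycle 2 |
|------------------|-------------------------------|-------------------------------|
| add<sub>1</sub> | x<sub>i</sub>+y<sub>i-1</sub> | x<sub>i+1</sub>+y<sub>i</sub> |
| add<sub>2</sub> | y<sub>i</sub> | y<sub>i+1</sub> |

- Can't overlap the iterations because of the dependency.
- Can we "cut" the feedback and overlap iterations? Try putting a register after add<sub>1</sub>:

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|------------------|-------------------------------|----------------|-------------------------------|------------------|
| add<sub>1</sub> | x<sub>i</sub>+y<sub>i-1</sub> | | x<sub>i+1</sub>+y<sub>i</sub> | |
| add<sub>2</sub> | | y<sub>i</sub> | | y<sub>i+1</sub> |

- The extra register doesn't help the situation (it actually hurts).
- In general, can't pipeline feedback loops.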



### **Pipelining Loops with Feedback**

Example 2:
y<sub>i</sub> = a y<sub>i-1</sub> + x<sub>i</sub> + b



- Reorder to shorten the feedback loop and try putting a register after the multiply:



| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 |
|------------------|-------------------|----------------|---------------------|------------------|---------------------|------------------|
| add<sub>1</sub> | x<sub>i</sub>+b | | x<sub>i+1</sub>+b | | x<sub>i+2</sub>+b | |
| mult | ay<sub>i-1</sub> | | ay<sub>i</sub> | | ay<sub>i+1</sub> | |
| add<sub>2</sub> | | y<sub>i</sub> | | y<sub>i+1</sub> | | y<sub>i+2</sub> |

Still need 2 cycles/iteration

## **Pipelining Loops with Feedback**

- Example 2:
  - $y_i = a y_{i-1} + x_i + b$



- Once again, adding a register doesn't help. The best solution is to overlap the non-feedback part with the feedback part.
- Therefore the critical path includes a multiply in series with an add.
- Can overlap the first add with the multiply/add operation.
- Only 1 cycle/iteration. Higher performance solution (than the 2-cycle version).

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|------------------|-------------------|---------------------|---------------------|-------------------|
| add<sub>1</sub> | x<sub>i</sub>+b | x<sub>i+1</sub>+b | x<sub>i+2</sub>+b | |
| mult | | ay<sub>i-1</sub> | ay<sub>i</sub> | ay<sub>i+1</sub> |
| add<sub>2</sub> | | y<sub>i</sub> | y<sub>i+1</sub> | y<sub>i+2</sub> |

- Alternative is to move the register to after the multiply, but the critical path is the same.
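The overlapped schedule can be checked against the plain recurrence in software. This is a sketch with assumed names: `r` stands for a register holding x + b computed one cycle ahead, and each loop pass represents one clock cycle in which the multiply and second add are chained while the first add works on the next input.

```python
def recurrence_direct(xs, a, b, y0):
    """Reference: y[i] = a*y[i-1] + x[i] + b, computed one step at a time."""
    ys, y = [], y0
    for x in xs:
        y = a * y + x + b
        ys.append(y)
    return ys


def recurrence_overlapped(xs, a, b, y0):
    """One result per cycle: add1 for x[i+1] runs alongside mult/add2 for y[i]."""
    ys, y = [], y0
    r = xs[0] + b            # first cycle: only add1 is busy
    for i in range(len(xs)):
        # Within a cycle, mult and add2 are chained (the critical path) ...
        new_y = a * y + r
        # ... while add1 works on the next input in parallel.
        new_r = xs[i + 1] + b if i + 1 < len(xs) else None
        y, r = new_y, new_r  # clock edge
        ys.append(y)
    return ys


xs = [1, 2, 3, 4]
print(recurrence_overlapped(xs, a=2, b=1, y0=0))
print(recurrence_direct(xs, a=2, b=1, y0=0))
```

Only the x + b part can be moved ahead of time; the multiply and the second add still depend on the previous y, so they stay in the same cycle.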


# "C-slow" Technique

- Another approach to increasing throughput in the presence of feedback: try to fill in "holes" in the chart with another (independent) computation:

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 |
|------------------|-------------------|----------------|---------------------|------------------|---------------------|------------------|
| add<sub>1</sub> | x<sub>i</sub>+b | | x<sub>i+1</sub>+b | | x<sub>i+2</sub>+b | |
| mult | ay<sub>i-1</sub> | | ay<sub>i</sub> | | ay<sub>i+1</sub> | |
| add<sub>2</sub> | | y<sub>i</sub> | | y<sub>i+1</sub> | | y<sub>i+2</sub> |

- If we have a second similar computation, can interleave it with the first:
  - $x^1 \rightarrow F^1 \rightarrow y^1$, where $y^1_i = a^1 y^1_{i-1} + x^1_i + b^1$
  - $x^2 \rightarrow F^2 \rightarrow y^2$, where $y^2_i = a^2 y^2_{i-1} + x^2_i + b^2$
- Use muxes to direct each stream; the same hardware is used for both streams.
- Each produces 1 result / 2 cycles.

- Here the feedback depth=2 cycles (we say C=2).
- Each loop has throughput of  $F_{clk}/C$ . But the aggregate throughput is  $F_{clk}$ .
- With this technique we could pipeline even deeper, assuming we could supply C independent streams.

# "C-slow" Technique

- Essentially this means we go ahead and cut the feedback path:



- This makes operations in adjacent pipeline stages independent and allows a full cycle for each:

- C computations (in this case C=2) can use the pipeline simultaneously.
- Must be independent.
- Input MUX interleaves input streams.
- Each stream runs at half the pipeline frequency.
- Pipeline achieves full throughput.

#### Multithreaded Processors use this.

| Resource | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 |
|------------------|-----|-----|-----|-----|-----|-----|
| add<sub>1</sub> | x+b | x+b | x+b | x+b | x+b | x+b |
| mult | ay | ay | ay | ay | ay | ay |
| add<sub>2</sub> | y | y | y | y | y | y |
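The interleaving can be modeled in software. The sketch below assumes C = 2 and a two-stage pipeline in which stage 1 performs the first add and the multiply, a register follows, and stage 2 performs the second add. The function name, the tuple layout of the pipeline register, and the starting values are all illustrative.

```python
def c_slow_two_streams(x1, x2, a, b, y0=(0, 0)):
    """Interleave two independent streams of y = a*y_prev + x + b (C = 2).

    Stage 1 does add1 and the multiply; stage 2 does add2. A register
    sits between them, so each stage gets a full cycle.
    """
    xs = [v for pair in zip(x1, x2) for v in pair]   # input mux: s1, s2, s1, ...
    y = list(y0)              # most recent y of each stream
    stage1 = None             # pipeline register: (stream, x + b, a * y_prev)
    out = ([], [])
    for cycle in range(len(xs) + 1):
        # Stage 2: finish the item that entered on the previous cycle.
        if stage1 is not None:
            s, xb, ay = stage1
            y[s] = xb + ay
            out[s].append(y[s])
        # Stage 1: start the next item, which belongs to the other stream.
        if cycle < len(xs):
            s = cycle % 2
            stage1 = (s, xs[cycle] + b[s], a[s] * y[s])
        else:
            stage1 = None
    return out


print(c_slow_two_streams([1, 2, 3], [10, 20, 30], a=(2, 3), b=(1, 5)))
```

By the time a stream re-enters stage 1, its previous result has already left stage 2, so the feedback value is always ready. The pipeline emits one result per cycle overall while each stream sees one result every two cycles.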


# **Beyond Pipelining - SIMD Parallelism**

- An obvious way to exploit more parallelism from loops is to make multiple instances of the loop execution data-path and run them in parallel, sharing the same controller.
- For P instances, throughput improves by a factor of P.
- example:  $y_i = f(x_i)$



- Assumes the next 4 x values available at once. The validity of this assumption depends on the ratio of f repeat rate to input rate (or memory bandwidth).
- Cost $\propto$ P. Usually, much higher than for pipelining. However, potentially provides a high speedup. <u>Often applied after pipelining.</u>
- Limited, once again, by loop carry dependencies. Feedback translates to dependencies between parallel data-paths.
- Vector processors use this technique.
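A toy model of the P-way arrangement: the input is consumed in groups of P values, and each group stands for one step in which all P datapaths apply f at once. The function name `simd_map` and the `lanes` parameter are illustrative; actual hardware would do the work inside a group concurrently rather than in a Python loop.

```python
def simd_map(f, xs, lanes=4):
    """Apply f to xs in groups of `lanes`; each group models one step."""
    ys, steps = [], 0
    for i in range(0, len(xs), lanes):
        group = xs[i:i + lanes]
        ys.extend(f(x) for x in group)   # the P datapaths work side by side
        steps += 1
    return ys, steps


ys, steps = simd_map(lambda x: x * x, list(range(10)))
print(ys, steps)
```

Ten inputs take three steps with four lanes instead of ten steps with one, provided four x values really can be supplied per step.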


### SIMD Parallelism with Feedback

- Example, from earlier:



- In this example we end up with a "carry ripple" situation.
- Could employ look-ahead / parallel-prefix optimization techniques to speed up propagation.
- As with pipelining, this technique is most effective in the absence of a loop carry dependence.
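One way to picture the parallel-prefix idea is to treat each step of a recurrence as an affine map y → m·y + c and combine maps pairwise, since composition of such maps is associative. The sketch assumes the earlier example is $y_i = a y_{i-1} + x_i + b$; the scan shown is a Hillis-Steele style scan, chosen here for brevity and not taken from the lecture. All names are illustrative.

```python
def compose(f, g):
    """Return the affine map 'apply f, then g', each given as (m, c): y -> m*y + c."""
    return (g[0] * f[0], g[0] * f[1] + g[1])


def prefix_scan(maps):
    """Inclusive scan with about log2(n) combining passes."""
    maps, step = list(maps), 1
    while step < len(maps):
        # Every position combines with its neighbour `step` places back;
        # all positions in one pass could be evaluated in parallel.
        maps = [compose(maps[i - step], maps[i]) if i >= step else maps[i]
                for i in range(len(maps))]
        step *= 2
    return maps


def recurrence_by_scan(xs, a, b, y0):
    """Compute y[i] = a*y[i-1] + x[i] + b for all i via a prefix scan."""
    maps = prefix_scan([(a, x + b) for x in xs])
    return [m * y0 + c for m, c in maps]


print(recurrence_by_scan([1, 2, 3, 4], a=2, b=1, y0=0))
```

The dependency chain shrinks from n sequential steps to roughly log2(n) passes, at the cost of extra multipliers and adders, which is the same trade made by carry look-ahead adders.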
