

EECS 151/251A
Spring 2021
Digital Design and Integrated Circuits

Instructor:
John Wawrzynek

Lecture 19: Parallelism

## **Announcements**

- Virtual Front Row for today 4/1:
  - Bernard Chen
  - Matthew Tran
  - Jennifer Zhou
  - Suphakorn Lertruchtkul
  - Rahul Arya
- Please ask questions or make comments!
- Homework assignment 7 (power & memory) posted; due Monday.

#### <u>Parallelism</u>

Parallelism is the act of doing more than one thing at a time.

Optimization in hardware design often involves using parallelism to trade between cost and performance.

Parallelism can often also be used to improve energy efficiency.

Example, Student final grade calculation:

High performance hardware implementation:



As many operations as possible are done in parallel.


# A log(n) lower (time) bound to compute any function of n variables

- Assume we can only use binary operations, each taking unit time
- After 1 time unit, an output can only depend on two inputs
- Use induction to show that after k time units, an output can only depend on 2<sup>k</sup> inputs
  - After log<sub>2</sub> n time units, output depends on at most n inputs
- A binary tree performs such a computation
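The tree argument above can be tried out in software. The following is a minimal sketch, not part of the original slides: `tree_reduce` is a hypothetical helper that reduces a list level by level, counting levels as the delay in operator time units. It assumes the operator is associative and that all nodes on one level would run in parallel in hardware.

```python
from operator import add


def tree_reduce(op, values, k=2):
    """Reduce `values` with an associative `op` using a k-ary tree.

    Returns (result, levels). `levels` is the tree depth, i.e. the delay
    measured in operator time units if each level runs in parallel.
    """
    if not values:
        raise ValueError("need at least one value")
    level = list(values)
    levels = 0
    while len(level) > 1:
        next_level = []
        # Each group of up to k values feeds one operator node.
        for i in range(0, len(level), k):
            group = level[i:i + k]
            acc = group[0]
            for v in group[1:]:
                acc = op(acc, v)
            next_level.append(acc)
        level = next_level
        levels += 1
    return level[0], levels


print(tree_reduce(add, list(range(1, 9))))        # (36, 3): 8 -> 4 -> 2 -> 1
print(tree_reduce(add, list(range(1, 9)), k=4))   # (36, 2): 8 -> 2 -> 1
```

With this grouping, each level shrinks n values to ceil(n/k), so the level count for n inputs is ceil(log<sub>k</sub> n).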



# Example: Reductions with Trees



If each node (operator) is k-ary instead of binary, what is the delay?

#### Trees for optimization



- What property of "+" are we exploiting?
- Other associative operators? Boolean operations? Division? Min/Max?

#### **Parallelism**

- Is there a lower cost hardware implementation? Different tree organization?
- grade = $((0.2 \times mt1) + (0.2 \times mt2)) + ((0.2 \times mt3) + (0.4 \times proj))$;



- Can factor out multiply by 0.2 (use factoring and associativity):
- grade = (0.2 × ((mt1 + mt2) + mt3)) + (0.4 × proj);
- Compare the cost and critical path in both implementations.
  - How about sharing operators (multipliers and adders)?
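As a quick check of the comparison asked for above, here is a small sketch (not from the slides) that evaluates both forms of the grade expression. The function names and sample scores are made up for illustration, and the operator counts in the comments are read directly off the two expressions, assuming one hardware operator per arithmetic operation with no sharing.

```python
import math


def grade_direct(mt1, mt2, mt3, proj):
    # 4 multipliers, 3 adders.
    # Critical path: 1 multiply followed by 2 adds.
    return ((0.2 * mt1) + (0.2 * mt2)) + ((0.2 * mt3) + (0.4 * proj))


def grade_factored(mt1, mt2, mt3, proj):
    # 2 multipliers, 3 adders.
    # Critical path: 2 adds, then 1 multiply, then 1 add.
    return (0.2 * ((mt1 + mt2) + mt3)) + (0.4 * proj)


scores = (80, 90, 70, 95)  # hypothetical mt1, mt2, mt3, proj
print(grade_direct(*scores), grade_factored(*scores))
print(math.isclose(grade_direct(*scores), grade_factored(*scores)))
```

The factored form saves two multipliers but lengthens the critical path by one add, which is the cost versus performance trade the slide is pointing at.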



#### **Time-Multiplexing**

- Time multiplex single ALU for all adds and multiplies:
- Attempts to minimize cost at the expense of time.
  - Need to add extra register, muxes, control.



• If we adopt the above approach, we can then consider the combinational hardware circuit diagram as an abstract computation graph.



Using other primitives, other coverings are possible.

 This time-multiplexing "covers" the computation graph by performing the action of each node one at a time. (Sort of emulates it.)
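A minimal software sketch of this "covering" idea, under stated assumptions: a single shared ALU executes one graph node per cycle, with results held in named registers. The schedule follows the factored grade expression from earlier; the register names (`t`, `u`, `c02`, `c04`) and the function `run_time_multiplexed` are hypothetical and chosen only for illustration.

```python
import math
from operator import add, mul


def run_time_multiplexed(schedule, regs):
    """Execute one ALU operation per cycle; return (registers, cycles)."""
    regs = dict(regs)
    cycles = 0
    for dest, op, src_a, src_b in schedule:
        # The single shared ALU performs the action of one graph node.
        regs[dest] = op(regs[src_a], regs[src_b])
        cycles += 1
    return regs, cycles


# One node of the computation graph per entry: (dest, op, src_a, src_b).
schedule = [
    ("t", add, "mt1", "mt2"),
    ("t", add, "t", "mt3"),
    ("t", mul, "t", "c02"),
    ("u", mul, "proj", "c04"),
    ("grade", add, "t", "u"),
]
inputs = {"mt1": 80, "mt2": 90, "mt3": 70, "proj": 95, "c02": 0.2, "c04": 0.4}

regs, cycles = run_time_multiplexed(schedule, inputs)
print(regs["grade"], cycles)
```

Five graph nodes take five cycles on one ALU, whereas the fully parallel circuit would finish in a single (longer) combinational delay.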


#### **HW versus SW**

- This time-multiplexed ALU approach is very similar to what a conventional software version would accomplish:
- CPUs time-multiplex function units (ALUs, etc.)

```
add r2,r1,r3
add r2,r2,r4
mult r2,r4,r5
:
```

- This model matches our tendency to express computation sequentially even though most computations naturally contain parallelism.
- Our programming languages also strengthen a sequential tendency.
- In hardware, the ability to exploit problem parallelism gives us a "knob" to trade off performance and cost.
- Maybe best to express computations as abstract computation graphs (rather than "programs"); this should lead to a wider range of implementations.
- Note: modern high-performance processors spend much of their cost budget attempting to restore execution parallelism: "super-scalar execution".

#### **Exploiting Parallelism in HW**

Example: Video Codec



- Separate algorithm blocks implemented in separate HW blocks, or HW is time-multiplexed.
- Entire operation is pipelined (with possible pipelining within the blocks).
- "Loop unrolling" used within blocks or for entire computation.


#### Optimizing Iterative Computations

- Hardware implementations of computations almost always involve <u>looping</u>. Why?
- Is this true with software?
- Are there programs without loops?
  - Maybe in "throw away" code.
- We probably would not bother building such a thing into hardware, would we?
  - (FPGA could change this.)
- Fact is, our computations are closely tied to loops. Almost all our HW includes some looping mechanism.
- What do we use looping for?

#### Optimizing Iterative Computations

#### Types of loops:

- 1) Looping over input data (streaming):
  - ex: MP3 player, video compressor
- 2) Looping over memory data
  - ex: vector inner product, matrix multiply, list-processing
- 1) & 2) are really very similar. 1) is often turned into 2) by buffering up input data, and processing "offline". Even for "online" processing, buffers are used to smooth out temporary rate mismatches.
- 3) CPUs are one big loop.
  - Instruction fetch ⇒ execute ⇒ Instruction fetch ⇒ execute ⇒ ...
  - but change their personality with each iteration.
- 4) Others?

Loops offer opportunity for parallelism by executing more than one iteration at once, using parallel iteration execution &/or pipelining

#### Pipelining Principle

- With looping usually we are less interested in the latency of one <u>iteration</u> and more in the loop execution rate, or <u>throughput</u>.
- These can be different due to <u>parallel iteration execution &/or pipelining.</u>
- Pipelining review from CS61C:

Analogy to washing clothes:

- step 1: wash (20 minutes)
- step 2: dry (20 minutes)
- step 3: fold (20 minutes)

Sequential: 60 minutes × 4 loads ⇒ 4 hours

Overlapped (pipelined) ⇒ 2 hours


#### **Pipelining**

| stage | slot 1 | slot 2 | slot 3 | slot 4 | slot 5 | slot 6 |
|-------|--------|--------|--------|--------|--------|--------|
| wash  | load1  | load2  | load3  | load4  |        |        |
| dry   |        | load1  | load2  | load3  | load4  |        |
| fold  |        |        | load1  | load2  | load3  | load4  |

- In the limit, as we increase the number of loads, the average time per load approaches 20 minutes (1 load completed every 20 minutes)
- The <u>latency</u> (time from start to end) for one load = 60 min.
- The <u>throughput</u> = 3 loads/hour
- The pipelined throughput ≈ # of pipe stages × un-pipelined throughput.
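The laundry numbers above can be reproduced with a few lines of arithmetic. This is a sketch assuming equal 20-minute stages and no overhead between them; the function names are hypothetical.

```python
def sequential_minutes(loads, stages=3, stage_minutes=20):
    """Each load runs through all stages before the next one starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages=3, stage_minutes=20):
    """The first load fills the pipeline; each later load adds one slot."""
    if loads == 0:
        return 0
    return (stages + loads - 1) * stage_minutes


for n in (1, 4, 1000):
    total = pipelined_minutes(n)
    print(n, sequential_minutes(n), total, total / n)
```

For 4 loads this gives 240 minutes sequentially and 120 minutes pipelined, and the average time per load tends toward 20 minutes as the number of loads grows.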

#### Hardware Pipelining Example

Starting Design:



Cut the CL block into pieces (stages) and separate with registers:



CL block produces a new result every 5ns instead of every 9ns.

#### Limits on Pipelining

- Without FF overhead, throughput improvement $\propto$ # of stages.
- After many stages are added FF overhead begins to dominate:



- Other limiters to effective pipelining:
  - clock skew contributes to clock overhead
  - unequal stages
  - FFs dominate cost
  - clock distribution power consumption
  - feedback (dependencies between loop iterations); in CPUs, we call these data hazards
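To see why FF overhead eventually dominates, here is a rough model, offered as a sketch only. It assumes the combinational logic divides evenly across the stages and that each stage pays a fixed register overhead (setup plus clock-to-Q). The numbers (8 ns of logic, 1 ns of overhead) are hypothetical and are not taken from the figure.

```python
def clock_period_ns(total_logic_ns, stages, overhead_ns):
    """Clock period when the logic is split evenly into `stages` pieces."""
    return total_logic_ns / stages + overhead_ns


def speedup(total_logic_ns, stages, overhead_ns):
    """Throughput gain relative to the single-stage (registered) design."""
    base = clock_period_ns(total_logic_ns, 1, overhead_ns)
    return base / clock_period_ns(total_logic_ns, stages, overhead_ns)


LOGIC_NS = 8.0      # hypothetical total combinational delay
OVERHEAD_NS = 1.0   # hypothetical per-stage register overhead

for n in (1, 2, 4, 8, 16, 64):
    print(n, clock_period_ns(LOGIC_NS, n, OVERHEAD_NS),
          round(speedup(LOGIC_NS, n, OVERHEAD_NS), 2))
```

In this model the speedup can never exceed (logic + overhead) / overhead, here 9, no matter how many stages are added. Unequal stages and clock skew would make the real limit lower.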


#### Pipelining Example

•  $F(x) = y_i = a x_i^2 + b x_i + c$ 



- x and y are assumed to be "streams" of integers (or floats)
- Divide into 3 (nearly) equal stages.
- Insert pipeline registers at dashed lines.
- Can we pipeline basic operators?
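A cycle-by-cycle sketch of a 3-stage pipeline for this polynomial follows. The stage split used here is an assumption and may differ from the dashed lines in the figure: stage 1 computes $x^2$ and $bx$, stage 2 computes $a x^2$ and $bx + c$, and stage 3 does the final add. The names `r1`, `r2`, and `pipelined_poly` are hypothetical.

```python
def pipelined_poly(xs, a, b, c):
    """Simulate a 3-stage pipeline computing y = a*x*x + b*x + c."""
    r1 = None  # register after stage 1: (x*x, b*x)
    r2 = None  # register after stage 2: (a*x*x, b*x + c)
    ys = []
    # Two extra cycles with no input drain the pipeline.
    for x in list(xs) + [None, None]:
        # Evaluate stages back to front so each stage reads the values
        # its input register held at the start of the cycle.
        if r2 is not None:
            ys.append(r2[0] + r2[1])                               # stage 3
        r2 = (a * r1[0], r1[1] + c) if r1 is not None else None    # stage 2
        r1 = (x * x, b * x) if x is not None else None             # stage 1
    return ys


xs = [1, 2, 3, 4]
print(pipelined_poly(xs, a=1, b=2, c=3))
print([1 * x * x + 2 * x + 3 for x in xs])
```

Once the pipeline is full, one result appears per cycle, even though each individual result takes three cycles from input to output.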

Computation graph:



#### Example: Pipelined Ripple Adder



- Cost and energy increase when registers are added.
- Possible, but usually not done (arithmetic units can often be made sufficiently fast without internal pipelining).
- More common to pipeline multiplication.

#### Pipelining Loops with Feedback

"Loop carry dependency"

• Example 1:  $y_i = y_{i-1} + x_i + a$ 

Unpipelined version:

|                  | t →             |                 |
|------------------|-----------------|-----------------|
| add<sub>1</sub>  | $x_i + y_{i-1}$ | $x_{i+1} + y_i$ |
| add<sub>2</sub>  | $y_i$           | $y_{i+1}$       |



Can we "cut" the feedback and overlap iterations?

Try putting a register after add1:

|                  | t →             |       |                 |           |
|------------------|-----------------|-------|-----------------|-----------|
| add<sub>1</sub>  | $x_i + y_{i-1}$ |       | $x_{i+1} + y_i$ |           |
| add<sub>2</sub>  |                 | $y_i$ |                 | $y_{i+1}$ |



- Can't overlap the iterations because of the dependency.
- The extra register doesn't help the situation (actually hurts).
- In general, can't effectively pipeline feedback loops.

#### Pipelining Loops with Feedback

#### "Loop carry dependency"

However, we can overlap the "non-feedback" part of the iterations:

Add is associative and commutative. Therefore we can reorder the computation to shorten the delay of the feedback path:







"Shorten" the feedback path.



Pipelining is limited to 2 stages.
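The reordering can be checked with a small simulation. This sketch assumes the reordered form $y_i = y_{i-1} + (x_i + a)$: the first adder computes $x_i + a$ outside the loop and is separated by a pipeline register from the second adder, which is the only operator left on the feedback path. The names `s_reg`, `y_reg`, and both function names are hypothetical.

```python
def reference(xs, a, y0=0):
    """Direct evaluation of y_i = y_{i-1} + x_i + a."""
    ys, y = [], y0
    for x in xs:
        y = y + x + a
        ys.append(y)
    return ys


def two_stage(xs, a, y0=0):
    """Two-stage pipeline: add1 computes x_i + a, add2 closes the loop."""
    ys = []
    s_reg = None  # pipeline register between add1 and add2
    y_reg = y0    # feedback register holding y_{i-1}
    # One extra cycle with no input drains the pipeline.
    for x in list(xs) + [None]:
        # Stage 2 (add2): only one adder delay sits on the feedback path.
        if s_reg is not None:
            y_reg = y_reg + s_reg
            ys.append(y_reg)
        # Stage 1 (add1): non-feedback part, overlapped with stage 2.
        s_reg = x + a if x is not None else None
    return ys


xs = [1, 2, 3]
print(reference(xs, a=10), two_stage(xs, a=10))
```

Both versions produce one result per cycle, but the pipelined cycle only needs to fit one adder delay instead of two.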

### Pipelining Loops with Feedback

• Example 2:

$$y_i = a y_{i-1} + x_i + b$$



 Reorder to shorten the feedback loop and try putting register after multiply:



We just said we can't, but let's try it anyway.

|                  | t →               |                |                     |                  |                     |                  |
|------------------|-------------------|----------------|---------------------|------------------|---------------------|------------------|
| add<sub>1</sub>  | x<sub>i</sub>+b   |                | x<sub>i+1</sub>+b   |                  | x<sub>i+2</sub>+b   |                  |
| mult             | ay<sub>i-1</sub>  |                | ay<sub>i</sub>      |                  | ay<sub>i+1</sub>    |                  |
| add<sub>2</sub>  |                   | y<sub>i</sub>  |                     | y<sub>i+1</sub>  |                     | y<sub>i+2</sub>  |

Still need 2 cycles/iteration

#### "C-slow" Technique

 An approach to increasing throughput in the presence of feedback: try to fill in "holes" in the chart with another (independent) computation:

|                  | t →               |                |                     |                  |                     |                  |
|------------------|-------------------|----------------|---------------------|------------------|---------------------|------------------|
| add<sub>1</sub>  | x<sub>i</sub>+b   |                | x<sub>i+1</sub>+b   |                  | x<sub>i+2</sub>+b   |                  |
| mult             | ay<sub>i-1</sub>  |                | ay<sub>i</sub>      |                  | ay<sub>i+1</sub>    |                  |
| add<sub>2</sub>  |                   | y<sub>i</sub>  |                     | y<sub>i+1</sub>  |                     | y<sub>i+2</sub>  |

If we have a second similar computation, can interleave it with the first:

$$x^1 \rightarrow F^1 \rightarrow y^1_i = a^1 y^1_{i-1} + x^1_i + b^1$$

$$x^2 \rightarrow F^2 \rightarrow y^2_i = a^2 y^2_{i-1} + x^2_i + b^2$$

Use muxes to direct each stream. Time-multiplex one piece of HW for both streams. Each produces 1 result / 2 cycles.

- Here the feedback depth=2 cycles (we say C=2).
- Each loop has throughput of  $F_{clk}/C$ . But the aggregate throughput is  $F_{clk}$ .
- With this technique we could pipeline even deeper, assuming we could supply C independent streams.

#### "C-slow" Technique

 Essentially this means we go ahead and cut feedback path:



 Interleaving makes operations in adjacent pipeline stages independent and allows full cycle for each:

- C computations (in this case C=2) can use the pipeline simultaneously.
- Must be independent.
- Input MUX interleaves input streams.
- Each stream runs at half the pipeline frequency.
- Pipeline achieves full throughput.

Multithreaded Processors use this.

|                  | t → |     |     |     |     |     |
|------------------|-----|-----|-----|-----|-----|-----|
| add<sub>1</sub>  | x+b | x+b | x+b | x+b | x+b | x+b |
| mult             | ay  | ay  | ay  | ay  | ay  | ay  |
| add<sub>2</sub>  | y   | y   | y   | y   | y   | y   |
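The interleaving can be simulated in a few lines. This is a sketch for C=2 only: the first stage (mult and add<sub>1</sub>) and the second stage (add<sub>2</sub>) are separated by a pipeline register, and an input mux alternates between two independent streams, each with its own $a$ and $b$. The tuple layout of the register and the function names are hypothetical.

```python
def reference(xs, a, b, y0=0):
    """Direct evaluation of y_i = a*y_{i-1} + x_i + b."""
    ys, y = [], y0
    for x in xs:
        y = a * y + x + b
        ys.append(y)
    return ys


def c_slow_2(streams, y0=0):
    """Interleave two independent (xs, a, b) streams on one 2-stage pipeline."""
    assert len(streams) == 2, "this sketch models C=2 only"
    n = len(streams[0][0])
    assert len(streams[1][0]) == n, "streams must have equal length"
    y_regs = [y0, y0]     # feedback state, one per stream
    outs = [[], []]
    stage1 = None         # pipeline register: (stream id, a*y_prev, x+b)
    for cycle in range(2 * n + 1):
        # Stage 2 (add2) finishes the stream issued on the previous cycle.
        if stage1 is not None:
            sid, prod, s = stage1
            y_regs[sid] = prod + s
            outs[sid].append(y_regs[sid])
        # Stage 1 (mult, add1): the input mux picks streams round-robin.
        if cycle < 2 * n:
            sid = cycle % 2
            xs, a, b = streams[sid]
            stage1 = (sid, a * y_regs[sid], xs[cycle // 2] + b)
        else:
            stage1 = None
    return outs


s1 = ([1, 2, 3], 2, 1)
s2 = ([4, 5, 6], 3, 0)
print(c_slow_2([s1, s2]))
print(reference(*s1), reference(*s2))
```

Six results come out in seven cycles, so the pipeline approaches one result per cycle in aggregate while each stream still advances once every two cycles.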

#### Beyond Pipelining - SIMD Parallelism

- An obvious way to exploit more parallelism from loops is to make multiple instances of the loop execution data-path and run them in parallel, sharing the same controller.
- For P instances, throughput improves by a factor of P.
- example: y<sub>i</sub> = f(x<sub>i</sub>)



Usually called SIMD parallelism. Single Instruction Multiple Data

- Assumes the next 4 x values available at once. The validity of this assumption depends on the ratio of f repeat rate to input rate (or memory bandwidth).
- Cost $\propto$ P. Usually, much higher than for pipelining. However, potentially provides a high speedup. Often applied after pipelining.
- Vector processors use this technique.
- Limited, once again, by loop carry dependencies. Feedback translates to dependencies between parallel data-paths.
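A software sketch of the idea, assuming $y_i = f(x_i)$ with no dependence between iterations: P lanes each apply the same `f` to a different element on every step, so the step count drops from n to ceil(n/P). The function `simd_map` and its `lanes` parameter are hypothetical names, and the inner loop only stands in for lanes that would operate at the same time in hardware.

```python
def simd_map(f, xs, lanes=4):
    """Apply f to xs in groups of `lanes`; return (results, steps)."""
    ys = []
    steps = 0
    for i in range(0, len(xs), lanes):
        chunk = xs[i:i + lanes]
        # In hardware, every lane would compute during the same step
        # under control of the one shared controller.
        ys.extend(f(x) for x in chunk)
        steps += 1
    return ys, steps


ys, steps = simd_map(lambda x: x * x, list(range(10)), lanes=4)
print(ys, steps)
```

Ten inputs on four lanes need three steps, and the last step leaves two lanes idle, which is one reason the achieved speedup can fall short of P.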

#### SIMD Parallelism with Feedback

Example, from earlier:



- As with pipelining, this technique is most effective in the absence of a loop carry dependence.
- With loop carry dependence, end up with "carry ripple" situation.
- For associative operations we can employ look-ahead / parallel-prefix optimization techniques to speed up propagation (coming soon!)
