

EECS151/251A Fall 2021 Digital Design and Integrated Circuits

Instructors: John Wawrzynek

Lecture 22: Adders

# Announcements

Virtual Front Row for today 4/13:
 Jeremy Ferguson
 Khashayar Pirouzmand
 Daniel Guzman
 Keyi Hu

- Please ask questions or make comments!
- Homework assignment 9 will be out soon, due one week after posting.

# Outline



- "tricks with trees"
- Adder review, subtraction, carry-select
- Carry-lookahead
- Bit-serial addition, summary



**Tricks with Trees** 

#### A log(n) lower (time) bound to compute any function of n variables

- Assume we can only use binary operations, each taking unit time
- After 1 time unit, an output can depend on at most two inputs
- Use induction to show that after k time units, an output can depend on at most 2<sup>k</sup> inputs
  - After log<sub>2</sub> n time units, an output depends on at most n inputs, so a function of all n inputs needs at least log<sub>2</sub> n time units
- A binary tree performs such a computation



Demmel - CS267 Lecture 6+

### **Reductions with Trees - Review**



If each node (operator) is k-ary instead of binary, what is the delay?
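As a rough answer under the same unit-delay assumption (each k-ary operator still counted as one time unit, which is an idealization since wider gates are usually slower), the tree depth shrinks by a factor of $\log_2 k$:

$$T_{k\text{-ary}} = \lceil \log_k n \rceil = \left\lceil \frac{\log_2 n}{\log_2 k} \right\rceil$$

For example, n = 64 inputs need 6 levels of binary operators but only 3 levels of 4-input operators.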


# **Trees for optimization**



$$T = O(N)$$

 $((((((x_0 + x_1) + x_2) + x_3) + x_4) + x_5) + x_6) + x_7$ 



$$T = O(\log N)$$

 $((x_0 + x_1) + (x_2 + x_3)) + ((x_4 + x_5) + (x_6 + x_7))$ 

What property of "+" are we exploiting?

Other associative operators? Boolean operations? Division? Min/Max?
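The balanced-tree grouping can be sketched in a few lines of Python. This is an illustrative model only: the helper name `tree_reduce` is hypothetical, and it assumes a non-empty input list. It returns both the result and the number of tree levels, so the logarithmic depth is visible.

```python
import operator

def tree_reduce(values, op):
    """Reduce a non-empty list with a balanced binary tree; return (result, depth)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Combine adjacent pairs; all pairs in one level are independent.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

print(tree_reduce(list(range(8)), operator.add))   # (28, 3)
print(tree_reduce([5, 3, 9, 1, 7], min))           # (1, 3)
# Subtraction is not associative, so the tree gives a different answer
# than the left-to-right chain ((8 - 4) - 2) - 1 = 1:
print(tree_reduce([8, 4, 2, 1], operator.sub))     # (3, 2)
```

The regrouping is only valid because the operator is associative; the subtraction case shows what goes wrong otherwise.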

### **Parallel Prefix, or "Scan"**

If "+" is an associative operator and $x_0, ..., x_{p-1}$ are the input data, then the parallel prefix operation computes $y_j = x_0 + x_1 + ... + x_j$ for $j = 0, 1, ..., p-1$.
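One possible way to compute all the prefixes in a logarithmic number of steps is sketched below. It uses a doubling scheme in which every element combines with the partial result a fixed distance to its left. The function name `parallel_prefix` is hypothetical, and the list comprehension stands in for operations that would run in parallel in hardware.

```python
import operator

def parallel_prefix(xs, op):
    """Inclusive scan; returns (prefixes, number_of_steps)."""
    y = list(xs)
    dist, steps = 1, 0
    while dist < len(y):
        # All updates in one step are independent of each other.
        y = [y[j] if j < dist else op(y[j - dist], y[j]) for j in range(len(y))]
        dist *= 2
        steps += 1
    return y, steps

print(parallel_prefix([1, 2, 3, 4, 5], operator.add))   # ([1, 3, 6, 10, 15], 3)
print(parallel_prefix(["a", "b", "c"], operator.add))   # (['a', 'ab', 'abc'], 2)
```

The string example shows that only associativity is needed, not commutativity: operand order is preserved.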





### Adder review, subtraction, carry-select

#### 4-bit Adder Example

 Motivate the adder circuit design by hand addition:

- Add a0 and b0 as follows:

| a | b | r | c |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |

  - c is the carry to the next stage
  - $r = a \text{ XOR } b = a \oplus b$
  - c = a AND b = ab

- Add a1 and b1 as follows:

| ci | a | b | r | co |
|----|---|---|---|----|
| 0  | 0 | 0 | 0 | 0  |
| 0  | 0 | 1 | 1 | 0  |
| 0  | 1 | 0 | 1 | 0  |
| 0  | 1 | 1 | 0 | 1  |
| 1  | 0 | 0 | 1 | 0  |
| 1  | 0 | 1 | 0 | 1  |
| 1  | 1 | 0 | 0 | 1  |
| 1  | 1 | 1 | 1 | 1  |

$$r = a \oplus b \oplus c_i$$

$$co = ab + ac_i + bc_i$$
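The two cells can be modeled directly from these equations. The sketch below uses hypothetical function names and checks each row against ordinary integer addition.

```python
def half_adder(a, b):
    """Bit 0 stage: returns (r, c) with r = a XOR b and c = a AND b."""
    return a ^ b, a & b

def full_adder(a, b, ci):
    """Stage with carry-in: returns (r, co)."""
    r = a ^ b ^ ci
    co = (a & b) | (a & ci) | (b & ci)
    return r, co

# Every row of both truth tables must satisfy 2*carry + r == sum of the inputs.
for a in (0, 1):
    for b in (0, 1):
        r, c = half_adder(a, b)
        assert 2 * c + r == a + b
        for ci in (0, 1):
            r, co = full_adder(a, b, ci)
            assert 2 * co + r == a + b + ci
```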

# **Algebraic Proof of Carry Simplification**

Here $c$ denotes the carry-in $c_i$.

$$\begin{aligned}
c_{out} &= a'bc + ab'c + abc' + abc \\
&= a'bc + ab'c + abc' + abc + abc \\
&= a'bc + abc + ab'c + abc' + abc \\
&= (a' + a)bc + ab'c + abc' + abc \\
&= [1]bc + ab'c + abc' + abc \\
&= bc + ab'c + abc' + abc + abc \\
&= bc + ab'c + abc + abc' + abc \\
&= bc + a(b' + b)c + abc' + abc \\
&= bc + a[1]c + abc' + abc \\
&= bc + ac + ab[c' + c] \\
&= bc + ac + ab[1] \\
&= bc + ac + ab
\end{aligned}$$
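Since there are only eight input combinations, the algebra can also be cross-checked exhaustively. A minimal sketch, with hypothetical function names for the first and last lines of the derivation:

```python
from itertools import product

def carry_minterms(a, b, c):
    """First line of the derivation: a'bc + ab'c + abc' + abc."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (na & b & c) | (a & nb & c) | (a & b & nc) | (a & b & c)

def carry_simplified(a, b, c):
    """Last line of the derivation: bc + ac + ab."""
    return (b & c) | (a & c) | (a & b)

mismatches = [abc for abc in product((0, 1), repeat=3)
              if carry_minterms(*abc) != carry_simplified(*abc)]
print(mismatches)   # []
```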



# 4-bit Adder Example

- Gate Representation of FA-cell
  - $r_i = a_i \oplus b_i \oplus c_{in}$  $c_{out} = a_i c_{in} + a_i b_i + b_i c_{in}$

• Alternative Implementation (with 2-input gates):

$$\mathbf{r_i} = (\mathbf{a_i} \oplus \mathbf{b_i}) \oplus \mathbf{c_{in}}$$

$$c_{out} = c_{in}(a_i + b_i) + a_i b_i$$







"Full adder cell"

#### 4-bit adder:





# **Delay in Ripple Adders**

Ripple delay amount is a function of the data inputs:



However, we usually only consider the worst case delay on the critical path. There is always at least one set of input data that exposes the worst case delay.
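The data dependence can be illustrated with a small behavioral model. This is a sketch, not a timing simulator: it chains full-adder cells and reports the longest run of consecutive bit positions where $a_i \oplus b_i = 1$, used here as a simple proxy (an assumption of this example) for how far a carry may have to ripple.

```python
def full_adder(a, b, ci):
    return a ^ b ^ ci, (a & b) | (a & ci) | (b & ci)

def ripple_add(a, b, n):
    """Add two n-bit integers; return (sum, carry_out, longest_propagate_run)."""
    carry, total = 0, 0
    run = longest = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        r, carry = full_adder(ai, bi, carry)
        total |= r << i
        # Where a_i != b_i the stage's carry-out has to wait for its carry-in.
        run = run + 1 if ai ^ bi else 0
        longest = max(longest, run)
    return total, carry, longest

print(ripple_add(0b0101, 0b0101, 4))   # (10, 0, 0): no stage waits on a carry
print(ripple_add(0b1111, 0b0001, 4))   # (0, 1, 3): carry ripples through bits 1..3
print(ripple_add(0b1111, 0b0000, 4))   # (15, 0, 4): every stage depends on its carry-in
```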

# Adders (cont.)

#### Ripple Adder



s7 must wait for c7 which must wait for c6 ....

#### $T \propto n$, $\text{Cost} \propto n$

How do we make it faster, perhaps with more cost?





 $COST = 1.5 * COST_{ripple\_adder} + (n/2 + 1) * COST_{MUX}$ 

#### **Carry Select Adder**

Extending Carry-select to multiple blocks



What is the optimal number of blocks and number of bits per block?

- If the blocks are too small, delay is dominated by the total mux delay
- If the blocks are too large, delay is dominated by the adder ripple delay



$T \propto \sqrt{N}$, Cost ≈ 2 × ripple + muxes
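A functional sketch of the multi-block scheme follows. The names are hypothetical, and each block's ripple adder is replaced by a behavioral stand-in (integer addition), since only the select structure is of interest here. Every block after the first computes both possible results, and the incoming carry picks one.

```python
def ripple_block(a, b, cin, width):
    """Behavioral stand-in for a width-bit ripple adder; returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a, b, block_sizes):
    """block_sizes lists block widths from least to most significant."""
    result, carry, shift = 0, 0, 0
    for k, width in enumerate(block_sizes):
        mask = (1 << width) - 1
        a_blk, b_blk = (a >> shift) & mask, (b >> shift) & mask
        if k == 0:
            s, carry = ripple_block(a_blk, b_blk, 0, width)
        else:
            s0, c0 = ripple_block(a_blk, b_blk, 0, width)   # assumes carry-in = 0
            s1, c1 = ripple_block(a_blk, b_blk, 1, width)   # assumes carry-in = 1
            s, carry = (s1, c1) if carry else (s0, c0)      # the mux
        result |= s << shift
        shift += width
    return result, carry

print(carry_select_add(0xFFFF, 0x0001, [4, 4, 4, 4]))   # (0, 1)
print(carry_select_add(1234, 4321, [4, 4, 4, 4]))       # (5555, 0)
```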

# **Carry Select Adder**



Compare to ripple adder delay:

 $T_{total} = 2\sqrt{N}\, T_{FA} - T_{FA}$, assuming $T_{FA} = T_{MUX}$ 

For ripple adder $T_{total} = N T_{FA}$ 

"Cross-over" at N=3; carry select is faster for any value of N>3.

#### Is sqrt(N) really the optimum?

- From right to left, increase the size of each block to better match delays
- Ex: for a 64-bit adder, use block sizes [12 11 10 9 8 7 7]. The exact answer depends on the relative delay of the mux and the FA (note: one fewer block than the sqrt(N) solution)
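A simplified delay model makes the benefit of graded block sizes concrete. The assumptions here are illustrative only: $T_{FA} = T_{MUX} = 1$, the least significant block is a plain ripple adder with no mux, and each later block's mux can switch once both its own ripple result and the incoming carry are ready. The function name is hypothetical.

```python
def carry_select_delay(block_sizes, t_fa=1, t_mux=1):
    """block_sizes are listed from least to most significant block."""
    t = block_sizes[0] * t_fa                 # first block: ripple only
    for width in block_sizes[1:]:
        # The mux waits for the later of: this block's ripple, the incoming carry.
        t = max(t, width * t_fa) + t_mux
    return t

print(carry_select_delay([64]))                        # 64: plain ripple adder
print(carry_select_delay([8] * 8))                     # 15: eight blocks of sqrt(64) bits
print(carry_select_delay([7, 7, 8, 9, 10, 11, 12]))    # 13: graded sizes from the example
```

Under these assumptions the uniform case reproduces $2\sqrt{N} - 1$, and the graded sizes save two more units because each block's ripple finishes just as its carry-in arrives.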



# Carry-lookahead and Parallel Prefix

#### Adders with Delay ∝ log(n)

Can carry generation be made to be a kind of "reduction operation"?

Lowest delay for a reduction is a balanced tree.

- But in this case all intermediate values are required.
- One way is to use "Parallel Prefix" to compute the carries.



Figure: parallel-prefix tree over inputs $x_0, \ldots, x_5$; delay is log(N).

Parallel Prefix requires that the operation be associative, but simple carry generation is not!

- How do we arrange carry generation to be associative?
- Reformulate basic adder stage:



$$c_{i+1} = g_i + p_i c_i$$

$$s_i = p_i \oplus c_i$$

Ripple adder using p and g signals:

$g_i = a_i b_i$, $p_i = a_i \oplus b_i$

So far, no advantage over the ripple adder: $T \propto N$
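The reformulated stage can be sketched as follows, with $p_i = a_i \oplus b_i$ and the usual generate term $g_i = a_i b_i$. The function name is hypothetical. The loop is still sequential in the carry, which is why nothing has been gained yet.

```python
def pg_ripple_add(a, b, n, c0=0):
    """n-bit ripple adder written in terms of propagate/generate signals."""
    carry, total = c0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p, g = ai ^ bi, ai & bi
        total |= (p ^ carry) << i      # s_i = p_i XOR c_i
        carry = g | (p & carry)        # c_{i+1} = g_i + p_i c_i
    return total, carry

print(pg_ripple_add(200, 100, 8))   # (44, 1), i.e. 300 = 256 + 44
```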

"Group" propagate and generate signals:



P true if the group as a whole propagates a carry to c<sub>out</sub>
 G true if the group as a whole generates a carry
  $c_{out} = G + Pc_{in}$ 

Group P and G can be generated hierarchically.






8-bit Carry Lookahead Adder




$$P = P_a P_b$$

$$G = G_b + G_a P_b$$

$$C_{out} = G + c_{in} P$$
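The hierarchical construction can be sketched recursively: split a group in half, compute (G, P) for each half, and merge with the equations above, where subscript a is the less significant half and b the more significant one. All names here are hypothetical, and the recursion simply mirrors the tree structure rather than any particular circuit.

```python
def combine(group_a, group_b):
    """Merge adjacent groups given as (G, P); group_a is the less significant."""
    g_a, p_a = group_a
    g_b, p_b = group_b
    return g_b | (g_a & p_b), p_a & p_b

def group_pg(a, b, lo, hi):
    """(G, P) for bit positions lo..hi-1, built by recursive halving."""
    if hi - lo == 1:
        ai, bi = (a >> lo) & 1, (b >> lo) & 1
        return ai & bi, ai ^ bi          # single-bit g_i and p_i
    mid = (lo + hi) // 2
    return combine(group_pg(a, b, lo, mid), group_pg(a, b, mid, hi))

def carry_out(a, b, n, cin=0):
    g, p = group_pg(a, b, 0, n)
    return g | (p & cin)                 # C_out = G + c_in P

print(carry_out(200, 100, 8))      # 1, since 300 overflows 8 bits
print(carry_out(255, 0, 8, 1))     # 1, the whole group propagates the carry-in
print(carry_out(255, 0, 8, 0))     # 0
```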



# **Parallel-Prefix Carry Look-ahead Adders**

Ground truth specification of all carries directly (no grouping):

$$c_{0} = 0$$

$$c_{1} = g_{0} + p_{0}c_{0} = g_{0}$$

$$c_{2} = g_{1} + p_{1}c_{1} = g_{1} + p_{1}g_{0}$$

$$c_{3} = g_{2} + p_{2}c_{2} = g_{2} + p_{2}g_{1} + p_{1}p_{2}g_{0}$$

$$c_{4} = g_{3} + p_{3}c_{3} = g_{3} + p_{3}g_{2} + p_{3}p_{2}g_{1} + p_{3}p_{2}p_{1}g_{0}$$

Binary (G, P) associative operator: combining $(G'', P'')$ with the less significant $(G', P')$ gives $(G, P)$, where

$$G = G'' + G'P''$$

$$P = P'P''$$

Can be used to form all carries!

Use binary (G,P) operator to form parallel prefix tree
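A minimal sketch of a prefix adder built on this operator is given below. It uses a doubling scan (similar in spirit to a Kogge-Stone arrangement, though it makes no claim to match any specific circuit), assumes $c_0 = 0$, and uses hypothetical names throughout. It also checks exhaustively that the operator is associative.

```python
from itertools import product

def gp_op(high, low):
    """(G, P) from a more significant group 'high' and a less significant 'low'."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return g_hi | (g_lo & p_hi), p_hi & p_lo

def prefix_carries(a, b, n):
    """Return [c_0, ..., c_n] with c_0 = 0, using ceil(log2 n) scan steps."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    dist = 1
    while dist < n:
        gp = [gp[i] if i < dist else gp_op(gp[i], gp[i - dist]) for i in range(n)]
        dist *= 2
    # With c_0 = 0, c_{i+1} is just the group generate over bits 0..i.
    return [0] + [g for g, _ in gp]

def prefix_add(a, b, n):
    c = prefix_carries(a, b, n)
    s = 0
    for i in range(n):
        p = ((a >> i) ^ (b >> i)) & 1
        s |= (p ^ c[i]) << i             # s_i = p_i XOR c_i
    return s, c[n]

# Associativity check over all (G, P) triples.
pairs = list(product((0, 1), repeat=2))
associative = all(gp_op(gp_op(x, y), z) == gp_op(x, gp_op(y, z))
                  for x in pairs for y in pairs for z in pairs)
print(associative)               # True
print(prefix_add(200, 100, 8))   # (44, 1)
```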



# **Other Parallel Prefix Adder Architectures**



*Kogge-Stone adder*: minimum logic depth, and full binary tree with minimum fan-out, resulting in a fast adder but with a large area



*Brent-Kung adder*: minimum area, but high logic depth



*Ladner-Fischer adder*: minimum logic depth, large fan-out requirement up to n/2



*Han-Carlson adder*: hybrid design combining stages from the Brent-Kung and Kogge-Stone adder

# Carry look-ahead Wrap-up

- Adder delay O(log N).
- Cost?
- Can be combined with other techniques. Group P & G signals can be generated for sub-adders, with another carry propagation technique (for instance ripple) used within each group.
  - For instance, on an FPGA, ripple carry up to 32 bits is fast, so CLA is used to extend to larger adders. The CLA tree quickly generates the carry-in for the upper blocks.



Bit-serial Addition, Adder summary

# **Bit-serial Adder**



- Addition of 2 n-bit numbers:
  - takes n clock cycles,
  - uses 1 FF, 1 FA cell, plus registers
  - the bit streams may come from or go to other circuits, therefore the registers might not be needed.
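A cycle-by-cycle sketch of the bit-serial adder: one full-adder cell plus a single flip-flop holding the carry, with each loop iteration standing in for one clock cycle. The helper names are hypothetical, and the bit lists play the role of the shift registers (or of streams from other circuits), least significant bit first.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length bit streams (LSB first); returns (sum_bits, final_carry)."""
    carry_ff = 0                           # the single flip-flop, reset to 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):       # one iteration = one clock cycle
        s = a ^ b ^ carry_ff
        carry_ff = (a & b) | (a & carry_ff) | (b & carry_ff)
        sum_bits.append(s)
    return sum_bits, carry_ff

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

bits, carry = bit_serial_add(to_bits(13, 4), to_bits(6, 4))
print(bits, carry, from_bits(bits))   # [1, 1, 0, 0] 1 3, i.e. 19 = 16 + 3
```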

### **Adders on FPGAs**

Dedicated carry logic provides fast arithmetic carry capability for high-speed arithmetic functions.

On Virtex-5:

- Cin to Cout (per bit) delay = 40ps, versus 900ps for F to X delay.
- 64-bit add delay = 2.5ns.



# **Adder Final Words**

| Type            | Cost | Delay      |
|-----------------|------|------------|
| Ripple          | O(N) | O(N)       |
| Carry-select    | O(N) | O(sqrt(N)) |
| Carry-lookahead | O(N) | O(log(N))  |
| Bit-serial      | O(1) | O(N)       |

- Dynamic energy per addition for all of these is O(n).
- "O" notation hides the constants. Watch out for this!
- The "real" cost of the carry-select is at least 2X the "real" cost of the ripple. "Real" cost of the CLA is probably at least 2X the "real" cost of the carry-select.
- The actual multiplicative constants depend on the implementation details and technology.
- FPGA and ASIC synthesis tools will try to choose the best adder architecture automatically - assuming you specify addition using the "+" operator, as in "assign A = B + C"