## EECS150 - Digital Design, Lecture 24 - Arithmetic Blocks, Part 2 + Shifters

## April 15, 2010 John Wawrzynek

Spring 2010

EECS150 - Lec24-arith2

Page 1

#### Multiplication

$\cdots \quad (a_1b_0 + a_0b_1) \quad a_0b_0 \;\leftarrow\; \text{Product}$

Many different circuits exist for multiplication. Each one has a different balance between speed (performance) and amount of logic (cost).

. . .

#### "Shift and Add" Multiplier

Signed Multiplication:

Remember that for 2's complement numbers the MSB has negative weight:

$$X = \sum_{i=0}^{n-2} x_i 2^i - x_{n-1} 2^{n-1}$$

ex: $-6 = 11010_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1 \cdot 2^3 - 1 \cdot 2^4 = 0 + 2 + 0 + 8 - 16 = -6$

- Therefore for multiplication:
  - a) subtract final partial product
  - b) sign-extend partial products
- Modifications to shift & add circuit:
  - a) adder/subtractor
  - b) sign-extender on the P shift register
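
The two modifications above can be sketched in software. This is a behavioral model only, not the lecture's circuit: the function names and the choice of a 2n-bit accumulator are assumptions made for illustration.

```python
def to_signed(bits: int, width: int) -> int:
    """Interpret the low `width` bits of `bits` as a 2's complement value."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def signed_shift_add(a_bits: int, b_bits: int, n: int) -> int:
    """Multiply two n-bit 2's complement bit patterns with shift-and-add."""
    wide = 2 * n
    wide_mask = (1 << wide) - 1
    # (b) sign-extend the multiplicand so every partial product is sign-extended
    a_ext = to_signed(a_bits, n) & wide_mask
    product = 0
    for i in range(n):
        if (b_bits >> i) & 1:
            partial = (a_ext << i) & wide_mask
            if i == n - 1:
                # (a) the MSB of the multiplier has negative weight: subtract
                product = (product - partial) & wide_mask
            else:
                product = (product + partial) & wide_mask
    return to_signed(product, wide)
```

For instance, with n = 5 the pattern `0b11010` is -6, so `signed_shift_add(0b11010, 0b00011, 5)` models -6 * 3.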


#### **Bit-serial Multiplier**

• Bit-serial multiplier (n<sup>2</sup> cycles, one bit of result per n cycles):



#### Array Multiplier

Single cycle multiply: Generates all n partial products simultaneously.



#### Carry-Save Addition

- Speeding up multiplication is a matter of speeding up the summing of the partial products. "Carry-save" addition can help.
- Carry-save addition passes (saves) the carries to the output, rather than propagating them.
- Example: sum three numbers, $3_{10} = 0011$, $2_{10} = 0010$, $3_{10} = 0011$:

| Step | Output | Binary | Decimal |
|---|---|---|---|
| operand | | 0011 | 3 |
| operand | | 0010 | 2 |
| carry-save add | c | 0100 | 4 |
| | s | 0001 | 1 |
| operand | | 0011 | 3 |
| carry-save add | c | 0010 | 2 |
| | s | 0110 | 6 |
| carry-propagate add | | 1000 | 8 |

- In general, *carry-save* addition takes in 3 numbers and produces 2.
  - Whereas, *carry-propagate* takes 2 and produces 1.
  - With this technique, we can avoid carry propagation until final addition.
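
A minimal bitwise sketch of one carry-save step, assuming each bit position is an independent full adder whose carry is shifted left by one place. The function name is illustrative, not from the lecture.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three numbers to a (carry, sum) pair without propagating carries."""
    s = x ^ y ^ z                               # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carry, weighted one place up
    return c, s


# Reproduce the 3 + 2 + 3 example: the third input of the first step is 0.
c1, s1 = carry_save_add(0b0011, 0b0010, 0)
c2, s2 = carry_save_add(c1, s1, 0b0011)
total = c2 + s2                                 # the single carry-propagate add
```

The invariant is that `c + s` always equals `x + y + z`, so only the last step needs a real adder.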



#### Array Multiplier using Carry-save Addition



#### **Carry-save Addition**

CSA is associative and commutative. For example:

$$(((X_0 + X_1) + X_2) + X_3) = ((X_0 + X_1) + (X_2 + X_3))$$
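
Because the grouping does not matter, the carry-save adders can be arranged as a linear chain or as a tree. The sketch below compares the two under the assumption that a 3-input bitwise CSA is the only building block; the helper names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3-to-2 reduction: returns (carry, sum) with carry + sum == x + y + z."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z


def reduce_chain(operands: list[int]) -> int:
    """Fold operands in one at a time, then do one carry-propagate add."""
    ops = list(operands)
    while len(ops) > 2:
        c, s = carry_save_add(ops[0], ops[1], ops[2])
        ops = [c, s] + ops[3:]
    return sum(ops)


def reduce_tree(operands: list[int]) -> int:
    """Reduce independent groups of three per level, then one carry-propagate add."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        i = 0
        while i + 3 <= len(ops):
            nxt.extend(carry_save_add(*ops[i:i + 3]))
            i += 3
        nxt.extend(ops[i:])                     # leftovers pass to the next level
        ops = nxt
    return sum(ops)
```

Both orders give the same total; the tree form simply needs fewer levels of logic between inputs and the final adder.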




#### **Constant Multiplication**

- Our discussion so far has assumed both the multiplicand (A) and the multiplier (B) can vary at runtime.
- What if one of the two is a constant?

Y = C \* X

• "Constant Coefficient" multiplication comes up often in signal processing and other hardware. Ex:

$$y_i = \alpha y_{i-1} + x_i$$
  $x_i \longrightarrow y_i$ 

where $\alpha$ is an application-dependent constant that is hard-wired into the circuit.

• How do we build an array-style (combinational) multiplier that takes advantage of the constancy of one of the operands?


#### Multiplication by a Constant

- If the constant C in C\*X is a power of 2, then the multiplication is simply a shift of X.
- Ex: 4\*X



- What about division?
- What about multiplication by non-powers of 2?
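
As a quick check of the power-of-2 case, here is a small sketch. Treating division by a power of 2 as a right shift is an assumption about where the slide's question is heading; it holds as floor division, and the function names are illustrative.

```python
def times4(x: int) -> int:
    """4*X as a fixed left shift by 2 places."""
    return x << 2


def div4(x: int) -> int:
    """Floor of X/4 as a fixed right shift by 2 places (low bits are dropped)."""
    return x >> 2
```

Note that the right shift discards the remainder, so `div4(7)` yields 1 rather than 1.75.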

#### Multiplication by a Constant

- In general, a combination of fixed shifts and addition:
  - Ex: 6\*X = 0110 \* X = (2<sup>2</sup> + 2<sup>1</sup>)\*X



Details:


#### Multiplication by a Constant

• Another example: C = 23<sub>10</sub> = 010111



- In general, the number of additions equals the number of 1's in the constant minus one.
- Using carry-save adders (for all but one of these) helps reduce the delay and cost, but the adder count does not drop: it takes (the number of 1's in C) minus 2 carry-save adders plus one final carry-propagate adder.
- Is there a way to further reduce the number of adders (and thus the cost and delay)?
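
The shift-and-add decomposition and the adder count can be sketched as follows. This is a software model under the assumption that each 1 bit of C contributes one shifted copy of X; the function names are hypothetical.

```python
def one_positions(c: int) -> list[int]:
    """Bit positions of the 1's in the constant C, least significant first."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(c: int, x: int) -> tuple[int, int]:
    """Return (C*X, number of additions) using one fixed shift per 1 bit of C."""
    terms = [x << i for i in one_positions(c)]
    additions = max(len(terms) - 1, 0)
    return sum(terms), additions
```

For C = 6 this gives one addition, and for C = 23 it gives three, matching the "number of 1's minus one" rule.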

#### **Multiplication using Subtraction**

- Subtraction has approximately the same cost and delay as addition.
- Consider C\*X where C is the constant value 15<sub>10</sub> = 01111. C\*X requires 3 additions.
- We can "recode" 15

from  $01111 = (2^3 + 2^2 + 2^1 + 2^0)$ to  $1000\overline{1} = (2^4 - 2^0)$ 

where  $\overline{1}$  means negative weight.

Therefore, 15\*X can be implemented with only one subtractor.




#### Canonic Signed Digit Representation

- CSD represents numbers using 1, $\overline{1}$, and 0 with the least possible number of non-zero digits.
  - Strings of 2 or more non-zero digits are replaced.
  - Leads to a unique representation.
- Forming the CSD representation might take 2 passes:
  - First pass: replace all occurrences of 2 or more 1's. For example, $0111$ becomes $100\overline{1}$, since $7 = 8 - 1$.
  - Second pass: same as above, plus replace $0\overline{1}10$ by $00\overline{1}0$.
- Examples:

| Binary | After first pass | After second pass (CSD) |
|---|---|---|
| 011101 = 29 | $100\overline{1}01$ = 32 - 4 + 1 | (no change) |
| 0010111 = 23 | $001100\overline{1}$ | $010\overline{1}00\overline{1}$ = 32 - 8 - 1 |
| 0110110 = 54 | $10\overline{1}10\overline{1}0$ | $100\overline{1}0\overline{1}0$ = 64 - 8 - 2 |
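
The examples can be reproduced in software. Note that the sketch below does not implement the two-pass string replacement described above; it swaps in the standard digit-at-a-time recoding (look at the constant mod 4), which is assumed here to produce the same minimal, non-adjacent form. The letter `T` stands in for $\overline{1}$, and the function names are illustrative.

```python
def to_csd(c: int) -> list[int]:
    """Signed digits (-1, 0, or 1) of a non-negative constant, least significant first."""
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)     # c mod 4 == 1 gives +1, c mod 4 == 3 gives -1
            c -= d              # remaining value is now divisible by 4
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def csd_string(c: int) -> str:
    """Render most significant digit first, writing 'T' for a negative-weight 1."""
    symbols = {1: "1", 0: "0", -1: "T"}
    return "".join(symbols[d] for d in reversed(to_csd(c))) or "0"
```

Each non-zero digit costs one adder or subtractor input, so the number of non-zero digits is what matters for hardware cost.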

• Can we further simplify the multiplier circuits?


#### "Constant Coefficient Multiplication" (KCM)

Binary multiplier: $Y = 231*X = (2^7 + 2^6 + 2^5 + 2^2 + 2^1 + 2^0)*X$



CSD helps, but the multipliers are limited to shifts followed by adds.
CSD multiplier: $Y = 231*X = (2^8 - 2^5 + 2^3 - 2^0)*X$



- How about shift/add/shift/add ...?
  - KCM multiplier: $Y = 231*X = 7*33*X = (2^3 - 2^0)*(2^5 + 2^0)*X$



- No simple algorithm exists to determine the optimal KCM representation.
- Most approaches use an exhaustive search.
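
The three forms of 231\*X can be compared side by side. This is an arithmetic sketch only; the function names are made up for illustration, and the operation counts in the comments simply count the `+` and `-` signs.

```python
def binary_231(x: int) -> int:
    """Plain binary form: six shifted terms, five additions."""
    return (x << 7) + (x << 6) + (x << 5) + (x << 2) + (x << 1) + x


def csd_231(x: int) -> int:
    """CSD form: four terms, three additions/subtractions."""
    return (x << 8) - (x << 5) + (x << 3) - x


def kcm_231(x: int) -> int:
    """Factored form 7 * 33: shift/subtract, then shift/add (two operations)."""
    t = (x << 3) - x            # 7 * x
    return (t << 5) + t         # 33 * (7 * x)
```

The factored form wins here because the intermediate result `t` is reused, which a flat sum of shifted copies of X cannot do.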


## Fixed Shifters / Rotators

- "Fixed" shifters "hardwire" the shift amount into the circuit.
- Ex: Verilog: `X >> 2` (right shift X by 2 places)
- A fixed shifter/rotator is nothing but wires!
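
To see why no logic is needed, a word can be modeled as a list of bits, so that a fixed shift becomes a pure re-indexing. The list-of-bits representation (least significant bit first) and the helper names are assumptions for this sketch.

```python
def to_bits(x: int, n: int) -> list[int]:
    """Split an n-bit word into bits, least significant first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    """Reassemble a word from bits, least significant first."""
    return sum(b << i for i, b in enumerate(bits))


def fixed_shift_right(bits: list[int], k: int) -> list[int]:
    """Output bit i is wired to input bit i + k; vacated positions are tied to 0."""
    n = len(bits)
    return [bits[i + k] if i + k < n else 0 for i in range(n)]


def fixed_rotate_right(bits: list[int], k: int) -> list[int]:
    """Output bit i is wired to input bit (i + k) mod n; nothing is dropped."""
    n = len(bits)
    return [bits[(i + k) % n] for i in range(n)]
```

Because `k` is fixed, each output is connected to exactly one input (or to constant 0), which is the "nothing but wires" observation.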





## Variable Shifters / Rotators

- Example: X >> S, where S is unknown when we synthesize the circuit.
- Uses: shift instruction in processors (ARM includes a shift on every instruction), floating-point arithmetic, division/multiplication by powers of 2, etc.
- One way to build this is a simple shift-register:
  - a) Load word, b) shift enable for S cycles, c) read word.



- Worst-case delay is O(N), which is not good for processor design.
- Can we do it in O(logN) time and fit it in one cycle?


## Log Shifter / Rotator

• log(N) stages; each stage shifts (or not) by a power of 2 places.
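
A behavioral sketch of the staged structure, assuming stage i is controlled by bit i of the shift amount S and applies a fixed shift of 2^i places when that bit is set. The function names and the word-width parameter `n` are illustrative.

```python
def log_shift_right(x: int, s: int, n: int) -> int:
    """Shift an n-bit word right by s using one conditional stage per bit of s."""
    x &= (1 << n) - 1
    stage = 0
    while (1 << stage) < n:
        if (s >> stage) & 1:            # 2:1 mux select for this stage
            x >>= 1 << stage            # fixed shift by 2**stage (just wires)
        stage += 1
    return x


def log_rotate_right(x: int, s: int, n: int) -> int:
    """Same staging, but each stage wraps the dropped bits around to the top."""
    mask = (1 << n) - 1
    x &= mask
    stage = 0
    while (1 << stage) < n:
        if (s >> stage) & 1:
            k = 1 << stage
            x = ((x >> k) | (x << (n - k))) & mask
        stage += 1
    return x
```

For an 8-bit word the loop runs three times (shifts of 1, 2, and 4), so any shift amount from 0 to 7 passes through three mux levels.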





## "Improved" Shifter / Rotator

• How about this approach? Could it lead to even less delay?



- What is the delay of these big muxes?
- Look at a transistor-level implementation?

#### **Barrel Shifter**



#### **Connection Matrix**

*[Figure: connection matrix. Input lines ($x_7$ down to $x_1$ are labeled) cross the output lines ($y_0$ is labeled), with a connection point at each crossing.]*

#### Generally useful structure:

- N<sup>2</sup> control points.
- What other interesting functions can it do?
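
One way to explore the question is to model the matrix directly. In this sketch, entry `[j][i]` set to 1 is assumed to connect input $x_i$ to output $y_j$; the rotate and bit-reversal patterns are only example settings chosen for illustration, and all names are hypothetical.

```python
def rotate_matrix(n: int, s: int) -> list[list[int]]:
    """N x N control points that connect output j to input (j + s) mod n."""
    return [[1 if i == (j + s) % n else 0 for i in range(n)] for j in range(n)]


def reverse_matrix(n: int) -> list[list[int]]:
    """A different setting of the same control points: bit reversal."""
    return [[1 if i == n - 1 - j else 0 for i in range(n)] for j in range(n)]


def apply_matrix(matrix: list[list[int]], x_bits: list[int]) -> list[int]:
    """Each output is the OR of the inputs whose control point is closed."""
    return [int(any(m and x for m, x in zip(row, x_bits))) for row in matrix]
```

A rotate only ever closes one point per row along a diagonal, so most of the N<sup>2</sup> control points go unused for shifting; other settings give other routings.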

#### **Cross-bar Switch**



- N log(N) control signals.
- Supports all interesting permutations:
  - All one-to-one and one-to-many connections.
- Commonly used in communication hardware (switches, routers).
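
The N log(N) figure can be illustrated with a small model, assuming each of the N outputs has its own N-to-1 selector driven by a log2(N)-bit index. The function names are hypothetical.

```python
def crossbar(inputs: list, select: list[int]) -> list:
    """Output j carries whichever input its selector index names."""
    return [inputs[s] for s in select]


def control_bits(n: int) -> int:
    """N selectors, each needing ceil(log2(N)) bits to name one of N inputs."""
    return n * (n - 1).bit_length()
```

A permutation uses each index once (one-to-one), while repeating an index in `select` broadcasts one input to several outputs (one-to-many).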
