## EECS150 - Digital Design

# <u>Lecture 23 - Arithmetic Blocks, Part 2 + Shifters</u>

## April 12, 2011 John Wawrzynek

Spring 2011, EECS150 - Lec23-arith2

## **Multiplication**

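(Figure: pencil-and-paper binary multiplication of A by B. The two least-significant columns of partial-product bits that form the product are $a_1b_0 + a_0b_1$ and $a_0b_0$.)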

Many different circuits exist for multiplication. Each one has a different balance between speed (performance) and amount of logic (cost).

#### "Shift and Add" Multiplier

- Sums each partial product, one at a time.
- In binary, each partial product is a shifted version of A or 0.

Control Algorithm:

1. P ← 0, A ← multiplicand, B ← multiplier
2. If LSB of B == 1 then add A to P, else add 0
3. Shift [P][B] right 1
4. Repeat steps 2 and 3 n-1 times.
5. [P][B] has product.

What is the critical path for determining the min clock period?
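The control algorithm above can be sketched in Python as a behavioral model, assuming unsigned n-bit operands. The function name `shift_add_multiply` and the way the adder's carry-out is modeled (letting `p` briefly hold n+1 bits before the shift) are illustrative assumptions, not part of the lecture.

```python
def shift_add_multiply(a, b, n):
    """Behavioral model of an n-bit unsigned shift-and-add multiplier."""
    mask = (1 << n) - 1
    a &= mask          # A register: multiplicand
    b &= mask          # B register: multiplier, later the low product bits
    p = 0              # P register: running partial sum (high product bits)
    for _ in range(n):                 # step 2+3, performed n times in total
        if b & 1:                      # LSB of B selects A or 0
            p += a                     # p may now be n+1 bits (carry-out)
        # Shift the concatenation [carry, P, B] right by one place:
        # the LSB of P moves into the MSB of B.
        b = (b >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | b                # [P][B] holds the 2n-bit product


print(shift_add_multiply(3, 5, 4))
```

In this model the loop body corresponds to one clock cycle, so the n-bit add is the work that must fit in a single clock period.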

## "Shift and Add" Multiplier

#### Signed Multiplication:

Remember that for 2's complement numbers the MSB has negative weight:

$$X = \sum_{i=0}^{n-2} x_i 2^i - x_{n-1} 2^{n-1}$$

ex:
$$-6 = 11010_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1 \cdot 2^3 - 1 \cdot 2^4 = 0 + 2 + 0 + 8 - 16 = -6$$

- Therefore for multiplication:
  - a) subtract final partial product
  - b) sign-extend partial products
- Modifications to shift & add circuit:
  - a) adder/subtractor
  - b) sign-extender on P shift register
## **Bit-serial Multiplier**

• Bit-serial multiplier (n<sup>2</sup> cycles, one bit of result per n cycles):



Control Algorithm:

## **Array Multiplier**

Single cycle multiply: Generates all n partial products simultaneously.



#### **Carry-Save Addition**

- Speeding up multiplication is a matter of speeding up the summing of the partial products.
- "Carry-save" addition can help.
- Carry-save addition passes (saves) the carries to the output, rather than propagating them.
- Example: sum three numbers,
   3<sub>10</sub> = 0011, 2<sub>10</sub> = 0010, 3<sub>10</sub> = 0011
  - carry-save add 0011 and 0010: c = 0100 = 4<sub>10</sub>, s = 0001 = 1<sub>10</sub>
  - carry-save add c, s, and 0011: c = 0010 = 2<sub>10</sub>, s = 0110 = 6<sub>10</sub>
  - carry-propagate add c and s: 1000 = 8<sub>10</sub>

- In general, *carry-save* addition takes in 3 numbers and produces 2.
- Whereas, carry-propagate takes 2 and produces 1.
- With this technique, we can avoid carry propagation until the final addition.
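As a sketch of the idea, the following Python models a carry-save adder as a row of independent full adders and uses it to sum 3, 2, and 3. The function name `carry_save_add` is an assumption for illustration; the carry word is shifted left by one because each carry bit has twice the weight of its column.

```python
def carry_save_add(x, y, z):
    """3 numbers in, 2 numbers out: no carry ripples between columns."""
    s = x ^ y ^ z                              # per-column sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-column carries, saved
    return c, s


c1, s1 = carry_save_add(0b0011, 0b0010, 0)     # c = 0100, s = 0001
c2, s2 = carry_save_add(c1, s1, 0b0011)        # c = 0010, s = 0110
total = c2 + s2                                # single carry-propagate add
print(total)
```

Only the last line uses a carry-propagate addition; every earlier step has a delay of one full adder regardless of word width.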

## **Carry-save Circuits**



#### **Array Multiplier using Carry-save Addition**



## **Carry-save Addition**

CSA is associative and commutative. For example:

$$(((X_0 + X_1) + X_2) + X_3) = ((X_0 + X_1) + (X_2 + X_3))$$



- A balanced tree can be used to reduce the logic delay.
- This structure is the basis of the Wallace Tree Multiplier.
- Partial products are summed with the CSA tree. Fast CPA (ex: CLA) is used for final sum.
- Multiplier delay $\propto \log_{3/2} N + \log_2 N$
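The tree structure described above can be sketched as follows, assuming unsigned n-bit operands. Each pass groups the remaining rows in threes and replaces every group with two rows, so the row count shrinks by roughly 3/2 per level. The names `csa_tree_multiply` and `levels` are hypothetical, and Python's `+` stands in for the final fast carry-propagate adder.

```python
def csa(x, y, z):
    """One carry-save adder: three rows in, two rows out."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def csa_tree_multiply(a, b, n):
    """Sum the n partial products of a*b with a tree of carry-save adders."""
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3       # rows that form groups of 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(*rows[i:i + 3]))
        nxt.extend(rows[full:])                # leftovers pass to next level
        rows = nxt
        levels += 1
    return sum(rows), levels                   # final carry-propagate add


print(csa_tree_multiply(13, 11, 4))
```

For 4 partial products this takes 2 levels of carry-save adders, and for 8 it takes 4, consistent with the $\log_{3/2} N$ growth of the first delay term.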

#### **Constant Multiplication**

- Our discussion so far has assumed both the multiplicand (A) and the multiplier (B) can vary at runtime.
- What if one of the two is a constant?

$$Y = C * X$$

 "Constant Coefficient" multiplication comes up often in signal processing and other hardware. Ex:

$$y_i = \alpha y_{i-1} + x_i$$

(Block diagram: input $x_i$, output $y_i$.)

where  $\alpha$  is an application dependent constant that is hard-wired into the circuit.

How do we build an array-style (combinational) multiplier that takes advantage of the constancy of one of the operands?


## Multiplication by a Constant

- If the constant C in C\*X is a power of 2, then the multiplication is simply a shift of X.
- Ex: 4\*X



- What about division?
- What about multiplication by non-powers of 2?
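A small sketch of the power-of-2 case, including one possible answer to the division question. The helper names are hypothetical. It assumes that division by $2^k$ is done with a right shift, which rounds toward negative infinity; for negative signed values that differs from truncating division.

```python
def times_pow2(x, k):
    """Multiply by 2**k: a fixed left shift (wires only in hardware)."""
    return x << k


def div_pow2(x, k):
    """Divide by 2**k with a right shift; rounds toward negative infinity."""
    return x >> k


print(times_pow2(5, 2), div_pow2(13, 2), div_pow2(-7, 1))
```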

#### Multiplication by a Constant

• In general, a combination of fixed shifts and addition:
  - Ex: $6*X = 0110_2 * X = (2^2 + 2^1)*X$



Details:




## Multiplication by a Constant

• Another example: C = 23<sub>10</sub> = 010111



- In general, the number of additions equals the number of 1's in the constant minus one.
- Using carry-save adders (for all but one of these) helps reduce the delay and cost, but the number of carry-save adders is still the number of 1's in C minus 2.
- Is there a way to further reduce the number of adders (and thus the cost and delay)?
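The shift-and-add decomposition and the adder count can be checked with a short sketch. The function name `const_mult` is hypothetical; it assumes a non-negative constant and uses one shifted copy of X per 1 bit in C, which in hardware would be fixed wiring feeding the adders.

```python
def const_mult(x, c):
    """Compute c*x from fixed shifts and adds; also return the add count."""
    terms = [x << i for i in range(c.bit_length()) if (c >> i) & 1]
    if not terms:
        return 0, 0
    product, adds = terms[0], 0
    for t in terms[1:]:
        product += t       # one adder per additional 1 bit in c
        adds += 1
    return product, adds


print(const_mult(5, 6))    # 6  = 0110   -> 1 addition
print(const_mult(5, 23))   # 23 = 010111 -> 3 additions
```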


#### **Multiplication using Subtraction**

- Subtraction is ~ the same cost and delay as addition.
- Consider C\*X where C is the constant value 15<sub>10</sub> = 01111.

C\*X requires 3 additions.

We can "recode" 15

from $01111 = (2^3 + 2^2 + 2^1 + 2^0)$

to $1000\overline{1} = (2^4 - 2^0)$

where $\overline{1}$ means negative weight.

 Therefore, 15\*X can be implemented with only one subtractor.
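The two forms of 15\*X can be compared directly. This is only an illustrative check with hypothetical function names; the shifts would be fixed wiring in hardware.

```python
def times15_adds(x):
    """01111: three additions of shifted copies of x."""
    return (x << 3) + (x << 2) + (x << 1) + x


def times15_sub(x):
    """Recoded form (2^4 - 2^0): a single subtraction."""
    return (x << 4) - x


print(times15_adds(7), times15_sub(7))
```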




## Canonic Signed Digit Representation

- CSD represents numbers using 1, $\overline{1}$, & 0 with the least possible number of non-zero digits.
  - Strings of 2 or more non-zero digits are replaced.
  - Leads to a unique representation.
- To form the CSD representation might take 2 passes:
  - First pass: replace all occurrences of 2 or more 1's, e.g. $01111$ by $1000\overline{1}$.
  - Second pass: same as above, plus replace $0\overline{1}10$ by $00\overline{1}0$.
- Examples:

Can we further simplify the multiplier circuits?
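A sketch of CSD recoding in Python. Note that this is not the two-pass string replacement described above: it uses the standard right-to-left non-adjacent-form recurrence, which is assumed here to produce the same digits because the CSD form is unique. The function names are hypothetical, and digits are returned least-significant first with -1 standing for $\overline{1}$.

```python
def csd_digits(c):
    """Return the CSD digits of a non-negative integer, LSB first."""
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)    # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d             # clears the low bit, may carry upward
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def csd_multiply(x, c):
    """Compute c*x from the CSD digits; also count adders/subtractors."""
    product, nonzero = 0, 0
    for i, d in enumerate(csd_digits(c)):
        if d == 1:
            product += x << i
            nonzero += 1
        elif d == -1:
            product -= x << i
            nonzero += 1
    return product, max(nonzero - 1, 0)


print(csd_digits(15))          # 1000(-1) read MSB first
print(csd_multiply(9, 231))
```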

### "Constant Coefficient Multiplication" (KCM)

- CSD helps, but the multipliers are limited to shifts followed by adds.
  - CSD multiplier: $Y = 231*X = (2^8 - 2^5 + 2^3 - 2^0)*X$



- How about shift/add/shift/add ...?
  - KCM multiplier: $Y = 231*X = 7*33*X = (2^3 - 2^0)*(2^5 + 2^0)*X$



- No simple algorithm exists to determine the optimal KCM representation.
- Most use an exhaustive search method.
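The difference between the two structures for 231\*X can be sketched as follows, with hypothetical function names. The CSD form needs three adders/subtractors; the factored form reuses the intermediate result 7\*X and needs only two.

```python
def times231_csd(x):
    """Shifts followed by adds: 2^8 - 2^5 + 2^3 - 2^0 (3 add/sub units)."""
    return (x << 8) - (x << 5) + (x << 3) - x


def times231_kcm(x):
    """Shift/add/shift/add: 231 = 7 * 33 (2 add/sub units)."""
    t = (x << 3) - x           # 7*x  = (2^3 - 2^0) * x
    return (t << 5) + t        # 33*t = (2^5 + 2^0) * t


print(times231_csd(4), times231_kcm(4))
```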


## Fixed Shifters / Rotators

- "Fixed" shifters "hardwire" the shift amount into the circuit.
- Ex: verilog: X >> 2
  - (right shift X by 2 places)
- Fixed shift/rotator is nothing but wires! So what?
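To illustrate why a fixed shift or rotate costs no logic, the sketch below models a word as a list of bits (index 0 is the LSB) and builds the output purely by re-indexing the input. The helper names are assumptions for illustration; no gates or arithmetic are involved, only a choice of which input wire feeds each output.

```python
def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def fixed_shift_right(bits, k):
    """Output bit i is wired to input bit i+k; the top k outputs tie to 0."""
    n = len(bits)
    return [bits[i + k] if i + k < n else 0 for i in range(n)]


def fixed_rotate_right(bits, k):
    """Output bit i is wired to input bit (i+k) mod n."""
    n = len(bits)
    return [bits[(i + k) % n] for i in range(n)]


x = to_bits(0b10110101, 8)
print(bin(from_bits(fixed_shift_right(x, 2))))
print(bin(from_bits(fixed_rotate_right(x, 2))))
```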








## Variable Shifters / Rotators

- Example:  $X \gg S$ , where S is unknown when we synthesize the circuit.
- Uses: shift instruction in processors (ARM includes a shift on every instruction), floating-point arithmetic, division/multiplication by powers of 2, etc.
- One way to build this is a simple shift-register:
  - a) Load word, b) shift enable for S cycles, c) read word.



- Worst case delay O(N), not good for processor design.
- Can we do it in O(logN) time and fit it in one cycle?
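The shift-register approach above can be modeled as one single-place shift per clock cycle. This sketch, with the hypothetical name `shift_register_shift`, returns both the result and the cycle count to make the dependence of latency on S explicit.

```python
def shift_register_shift(x, s, n):
    """Load x, enable the shift for s cycles, then read the word."""
    reg = x & ((1 << n) - 1)   # a) load word
    cycles = 0
    for _ in range(s):         # b) shift enable for S cycles
        reg >>= 1              #    one place per clock
        cycles += 1
    return reg, cycles         # c) read word


print(shift_register_shift(0b10110101, 3, 8))
```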


## Log Shifter / Rotator

- Log(N) stages, each shifts (or not) by a power of 2 places, S=
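A behavioral sketch of the staged structure, assuming an n-bit word and a right shift or rotate. Stage k is a fixed shift by $2^k$ places followed by a 2-to-1 selection controlled by bit k of S. The function name and the `rotate` flag are illustrative assumptions.

```python
def log_shift_right(x, s, n, rotate=False):
    """Shift or rotate x right by s using one stage per bit of s."""
    mask = (1 << n) - 1
    x &= mask
    stage = 0
    while (1 << stage) < n:
        k = 1 << stage                         # fixed amount of this stage
        if rotate:
            moved = ((x >> k) | (x << (n - k))) & mask
        else:
            moved = x >> k
        if (s >> stage) & 1:                   # 2-to-1 mux: moved or unchanged
            x = moved
        stage += 1
    return x


print(log_shift_right(0b10110101, 3, 8))
print(log_shift_right(0b10110101, 3, 8, rotate=True))
```

For an 8-bit word the loop runs three times (shifts by 1, 2, and 4), so any shift amount from 0 to 7 passes through three mux levels.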



## **LUT Mapping of Log shifter**



Efficient with 2to1 multiplexors, for instance, 3LUTs.

Virtex5 has 6LUTs. Naturally makes 4to1 muxes:

Reorganize shifter to use 4to1 muxes.
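One way to picture the reorganization is a radix-4 version of the log shifter: each stage consumes two bits of S and selects among four fixed shift amounts, which matches a 4-to-1 mux. This is a sketch under that assumption, not necessarily the exact mapping shown on the slide; the function name is hypothetical.

```python
def log4_shift_right(x, s, n):
    """Right shift using 4-to-1 mux stages; returns (result, stage count)."""
    x &= (1 << n) - 1
    amount, stages = 1, 0
    while amount < n:
        sel = s & 3                            # two select bits per stage
        # 4-to-1 mux over four fixed shifts: 0, 1, 2, or 3 times `amount`
        x = (x, x >> amount, x >> (2 * amount), x >> (3 * amount))[sel]
        s >>= 2
        amount *= 4
        stages += 1
    return x, stages


print(log4_shift_right(0xBEEF, 9, 16))
```

For a 16-bit word this uses 2 mux levels instead of the 4 levels needed with 2-to-1 muxes.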



## "Improved" Shifter / Rotator

- How about this approach? Could it lead to even less delay?



- What is the delay of these big muxes?
- Look at a transistor-level implementation?


## **Barrel Shifter**



## **Connection Matrix**



Generally useful structure:

- N<sup>2</sup> control points.
- What other interesting functions can it do?


## **Cross-bar Switch**



- Nlog(N) control signals.
- Supports all interesting permutations
  - All one-to-one and one-to-many connections.
- Commonly used in communication hardware (switches, routers).
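A behavioral sketch of the switch, assuming each of the N outputs has its own N-to-1 selector driven by a $\lceil \log_2 N \rceil$-bit index. That assumption gives the N log(N) control-signal count above. The function names are hypothetical.

```python
def crossbar(inputs, select):
    """Output j is connected to inputs[select[j]]."""
    return [inputs[s] for s in select]


def control_bits(n):
    """N outputs, each with a ceil(log2 N)-bit select index."""
    return n * (n - 1).bit_length()


ports = ["a", "b", "c", "d"]
print(crossbar(ports, [2, 0, 3, 1]))   # one-to-one: a permutation
print(crossbar(ports, [0, 0, 0, 0]))   # one-to-many: broadcast of input 0
print(control_bits(4))
```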
