### CS 150 Digital Design

#### **Lecture 28 – Power and Energy**

2011-4-28

John Wawrzynek

today's lecturer: John Lazzaro

TAs: Michael Eastham and Austin Doupnik

www-inst.eecs.berkeley.edu/~cs150/



Sad fact: Computers turn electrical energy into heat. Computation is a byproduct.

#### **Energy and Performance**

Air or water carries heat away, or chip melts.



#### This is how electric tea pots work ...

Heats 1 gram of water 0.24 degree C

0.24 Calories per Second

1 Joule of Heat Energy per Second



#### Cooling an iPod nano ...



Like resistor on last slide, iPod relies on passive transfer of heat from case to the air.

Why? Users don't want fans in their pocket ...

To stay "cool to the touch" via passive cooling, the power budget is 5 W.

If iPod nano used 5W all the time, its battery would last 15 minutes ...

EECS 150 L28: Power and Energy

#### Powering an iPod nano (2005 edition)



Battery has 1.2 W-hour rating: Can supply 1.2 W of power for 1 hour.

1.2 W-hour / 5 W = 0.24 hours, about 15 minutes.

More W-hours require bigger battery and thus bigger "form factor" -- it wouldn't be "nano" anymore :-).

Real specs for iPod nano:
14 hours for music,
4 hours for slide shows.

85 mW for music. 300 mW for slides.
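A quick sanity check of the battery arithmetic on this slide, as a minimal Python sketch. The function names are illustrative, and the model assumes a constant power draw over the whole discharge, which real batteries only approximate.

```python
def battery_life_hours(capacity_wh: float, power_w: float) -> float:
    """Hours a battery of `capacity_wh` watt-hours lasts at a constant `power_w` watts."""
    return capacity_wh / power_w


def average_power_mw(capacity_wh: float, life_hours: float) -> float:
    """Average power draw (in mW) implied by a capacity and a quoted battery life."""
    return 1000.0 * capacity_wh / life_hours


if __name__ == "__main__":
    capacity = 1.2  # W-hours, the battery rating from the slide
    print(f"At 5 W: {60 * battery_life_hours(capacity, 5.0):.1f} minutes")
    print(f"Music (14 h): {average_power_mw(capacity, 14):.0f} mW")
    print(f"Slides (4 h): {average_power_mw(capacity, 4):.0f} mW")
```

Run as a script, this reports about 14.4 minutes at 5 W, and roughly 86 mW and 300 mW for the two quoted battery lives, in line with the rounded figures on the slide.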


#### Finding the (2005) iPod nano CPU ...



PP5020








Two 80 MHz CPUs. One CPU used for audio, one for slides.

Low-power ARM: roughly 1 mW per MHz ... variable clock, sleep modes.

85 mW system power realistic ...

#### Year-to-year: continuous improvements




Source: ifixit.com







#### How? Small IC packages, fewer parts

iPod nano 2006 -

#### iPod nano 2005





Source: arstechnica.com




#### Aluminum permits thinner case ...



#### **Year-to-year: continuous improvements**




Source: ifixit.com

life

# +

nearly the same depth

#### 2010 Nano



0.74 ounces

#### 2010 Shuffle



0.44 ounces



2010 Shuffle: "up to" 15 hours audio playback






0.39 W Hr

Sources: iFixit, Apple



0.19 W Hr

#### Notebooks ... now most of the PC market.

2006 Apple MacBook -- 5.2 lbs



12.8 in








#### Battery: Set by size and weight limits ...



Battery rating: 55 W-hour.

At 2.3 GHz, the Intel Core Duo CPU consumes 31 W running a heavy load - under 2 hours of battery life! And that is just for the CPU!

Almost full 1 inch depth. Width and height set by available space, weight.


At 1 GHz, CPU consumes 13 Watts. "Energy saver" option uses this mode...

#### MacBook Air ... fits in a manila envelope!





#### Lithium battery density and mass ...

#### **Energy densities table**

| Storage type                        | Specific energy (MJ/kg) | Energy density (MJ/L) |
|-------------------------------------|-------------------------|-----------------------|
| Uranium-235 used in nuclear weapons | 144,000,000             | 1,500,000,000         |
| Natural gas                         | 53.6                    | 0.0364                |
| Gasoline (petrol)                   | 46.4                    | 34.2                  |
| Diesel fuel/residential heating oil | 46.2                    | 37.3                  |
| Anthracite coal                     | 32.5                    | 72.4                  |
| Lithium-ion battery                 | 0.46 to 0.72            | 0.83 to 3.6           |

#### Thus the interest in fuel cells for portable electronics ...

Source: Machine Design magazine
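To put the table in notebook terms, the sketch below estimates how much lithium-ion cell mass a 55 W-hour battery implies, and how little gasoline holds the same energy. The function name is illustrative. The estimate ignores packaging mass and all conversion losses, so treat it as an order-of-magnitude comparison only.

```python
JOULES_PER_WATT_HOUR = 3600.0


def storage_mass_kg(capacity_wh: float, specific_energy_mj_per_kg: float) -> float:
    """Mass of storage medium needed to hold `capacity_wh` at the given specific energy."""
    energy_mj = capacity_wh * JOULES_PER_WATT_HOUR / 1e6
    return energy_mj / specific_energy_mj_per_kg


if __name__ == "__main__":
    capacity = 55.0  # W-hours, the notebook battery rating quoted earlier
    # Specific energies (MJ/kg) taken from the table.
    print(f"Li-ion, 0.46 MJ/kg: {storage_mass_kg(capacity, 0.46):.2f} kg")
    print(f"Li-ion, 0.72 MJ/kg: {storage_mass_kg(capacity, 0.72):.2f} kg")
    print(f"Gasoline, 46.4 MJ/kg: {1000 * storage_mass_kg(capacity, 46.4):.1f} g")
```

A few hundred grams of lithium-ion cells versus a few grams of gasoline is the gap that motivates the fuel-cell interest.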

#### The CPU is only part of power budget!

2004-era notebook running a full workload.



"Amdahl's Law for Power"

If our CPU took no power at all to run, that would only double battery life!
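The "Amdahl's Law for Power" argument can be written as a one-line formula. The sketch below is an illustration, not taken from the slide: the function name is made up, and the 50% CPU share is an assumption inferred from the "only double" remark.

```python
def battery_life_gain(component_fraction: float, reduction_factor: float) -> float:
    """Battery-life multiplier when one component's power is divided by `reduction_factor`.

    component_fraction: share of total system power drawn by that component (0..1).
    reduction_factor:   how many times less power the component draws afterwards;
                        use float("inf") for a component that takes no power at all.
    """
    remaining = (1.0 - component_fraction) + component_fraction / reduction_factor
    return 1.0 / remaining


if __name__ == "__main__":
    cpu_share = 0.5  # assumed: the CPU draws half of the system power
    print(battery_life_gain(cpu_share, float("inf")))  # a zero-power CPU
    print(battery_life_gain(cpu_share, 2.0))           # a CPU using half the power
```

With a 50% share, even a zero-power CPU only doubles battery life, and halving CPU power buys about a third more. The rest of the system sets the ceiling.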





#### **Servers: Total Cost of Ownership (TCO)**



Reliability: running computers hot makes them fail more often.

Machine rooms are expensive. Removing heat dictates how many servers to put in a machine room.

Electric bill adds up! Powering the servers + powering the air conditioners is a big part of TCO.


#### **Processors and Energy**



#### **Switching Energy: Fundamental Physics**

#### **Every logic transition dissipates energy.**



Strong result: Independent of technology.

How can we limit switching energy?

- (1) Slow down clock (fewer transitions). But we like speed ...
- (2) Reduce Vdd. But lowering Vdd limits the clock speed ...
- (3) Fewer circuits. But more transistors can do more work.
- (4) Reduce C per node. One reason why we scale processes.
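The four knobs above map onto a simple switching-power model, sketched here in Python. It uses the standard 1/2 C Vdd^2 energy per transition. The function names, the 1 fF node, the 1000-node block, and the 1 GHz toggle rate are made-up illustration values, not figures from the lecture.

```python
def transition_energy_j(c_farads: float, vdd_volts: float) -> float:
    """Energy dissipated by one logic transition on a node: 1/2 * C * Vdd^2."""
    return 0.5 * c_farads * vdd_volts ** 2


def switching_power_w(c_farads: float, vdd_volts: float,
                      transitions_per_s: float, num_nodes: int = 1) -> float:
    """Average switching power of `num_nodes` identical nodes."""
    return num_nodes * transition_energy_j(c_farads, vdd_volts) * transitions_per_s


if __name__ == "__main__":
    base = switching_power_w(1e-15, 1.0, 1e9, num_nodes=1000)
    options = {
        "(1) half the transition rate": switching_power_w(1e-15, 1.0, 0.5e9, 1000),
        "(2) Vdd scaled to 0.7x": switching_power_w(1e-15, 0.7, 1e9, 1000),
        "(3) half the nodes": switching_power_w(1e-15, 1.0, 1e9, 500),
        "(4) C scaled to 0.7x": switching_power_w(0.7e-15, 1.0, 1e9, 1000),
    }
    for name, power in options.items():
        print(f"{name}: {power / base:.2f}x baseline power")
```

Only option (2) enters quadratically, which is why supply voltage gets so much attention in the rest of the lecture.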


#### Scaling switching energy per gate ...



IC process scaling ("Moore's Law")



Due to reducing V and C (the length and width of the capacitors decrease, but the plate distance also gets smaller).

Recent slope more shallow because V is being scaled less aggressively.

From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.

UC Regents Spring 2011 © UCB

#### **Second Factor: Leakage Currents**

Even when a logic gate isn't switching, it burns power.



Isub: Even when this nFET is off, it passes an $I_{off}$ leakage current.

We can engineer any $I_{off}$ we like, but a lower $I_{off}$ also results in a lower $I_{on}$, and thus a lower maximum clock speed.

Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.


#### Intel's 2006 processor designs, leakage vs switching power





A lot of work was done to get a ratio this good ... 50/50 is common.

Bill Holt, Intel, Hot Chips 17

#### Engineering "On" Current at 25 nm ...



We can increase  $I_{on}$  by raising  $V_{dd}$  and/or lowering  $V_{t}$ .



#### Plot on a "Log" Scale to See "Off" Current



We can decrease $I_{off}$ by raising $V_{t}$ - but that lowers $I_{on}$.




#### Device engineers trade speed and power

We can reduce  $CV^2$  ( $P_{active}$ ) by lowering  $V_{dd}$ .

We can increase speed by raising  $V_{dd}$  and lowering  $V_{t}$ .

We can reduce leakage (P<sub>standby</sub>) by raising V<sub>t</sub>.





From: "Silicon Device Scaling to the Sub-10-nm Regime", Meikei Ieong, Bruce Doris, Jakub Kedzierski, Ken Rim, Min Yang.

#### Customize processes for product types ...



From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.


#### Intel: Comparing 2 CPU generations ...



Find enough tricks, and you can afford to raise Vdd a little so that you can raise the clock speed!

Clock speed unchanged.

Lower Vdd, lower C, but more leakage.


Design tricks: architecture & circuits.

#### Long-term possibility: New devices

Electrostatic mechanical relays at the nanoscale



Electromechanical Computing at 500°C with Silicon Carbide. Te-Hao Lee, Swarup Bhunia, Mehran Mehregany


#### Working inverter at 500 kHz ... for a while.





- + 10 fA leakage current
- + Works at 500 degrees C
- Fails after 1-10 days of 500 kHz toggles.
- Switching requires 6V V<sub>dd</sub>

Electromechanical Computing at 500°C with Silicon Carbide. Te-Hao Lee, Swarup Bhunia, Mehran Mehregany


#### Five low-power design techniques



Parallelism and pipelining



Power-down idle transistors



Slow down non-critical paths



**Clock gating** 



Thermal management



Design Technique #1 (of 5)

### **Trading Hardware for Power**

via Parallelism and Pipelining ...





#### And so, we can transform this:

```
Logic Block (supplied by Vdd)

Freq = 1
Vdd = 1
Throughput = 1
Power = 1
Area = 1
Pwr Den = 1
```

Block processes stereo audio. 1/2 of clocks for "left", 1/2 for "right".

#### Into this:

CV<sup>2</sup> power only



Ex: Top block processes audio channel 1, bottom block processes audio channel 2.



THIS MAGIC TRICK BROUGHT TO YOU BY CORY HALL

### Chandrakasan & Brodersen (UCB EECS)

| Architecture       | Power (normalized) | Area (normalized) | Voltage |
|--------------------|--------------------|-------------------|---------|
| Simple             | 1                  | 1                 | 5V      |
| Parallel           | 0.36               | 3.4               | 2.9V    |
| Pipelined          | 0.39               | 1.3               | 2.9V    |
| Pipelined-Parallel | 0.2                | 3.7               | 2.0V    |
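The normalized power and voltage figures above can be cross-checked against the CV² model. The sketch below assumes throughput is held constant across the four designs, so that power scales as (voltage ratio)² times an effective switched-capacitance factor. That factor is derived here from the table and is not a number quoted on the slide; the names are illustrative.

```python
BASELINE_VDD = 5.0  # volts, the "Simple" design

# architecture -> (supply voltage in V, reported normalized power)
DESIGNS = {
    "Simple": (5.0, 1.0),
    "Parallel": (2.9, 0.36),
    "Pipelined": (2.9, 0.39),
    "Pipelined-Parallel": (2.0, 0.2),
}


def voltage_only_power(vdd: float, baseline_vdd: float = BASELINE_VDD) -> float:
    """Normalized power if only the V^2 term changed (same C, same throughput)."""
    return (vdd / baseline_vdd) ** 2


def implied_overhead(vdd: float, reported_power: float) -> float:
    """Extra switched capacitance per operation implied by the reported power."""
    return reported_power / voltage_only_power(vdd)


if __name__ == "__main__":
    for name, (vdd, power) in DESIGNS.items():
        print(f"{name:20s} V^2 only: {voltage_only_power(vdd):.3f}  "
              f"reported: {power:.2f}  implied overhead: {implied_overhead(vdd, power):.2f}x")
```

Under this model, almost all of the saving comes from the lower supply voltage. The extra hardware shows up as a modest overhead factor on top of the V² term, and as the area increase in the table.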





 $Area = 636 \times 833 \mu^2$ 

Simple





Pipelined

From: "Minimizing Power Consumption in CMOS Circuits", Anantha P. Chandrakasan and Robert W. Brodersen.

### **Multiple Cores for Low Power**

Trade hardware for power, on a large scale ...



### Cell: The PS3 chip







**COMPUTER** ENTERTAINMENT

**TOSHIBA** 







#### **One Synergistic Processing Unit (SPU)**



SPU issues 2 inst/cycle (in order) to 7 execution units.

256 KB Local Store, 128 128-bit registers.

SPU fills Local Store using DMA to DRAM and network.


### A "Schmoo" plot for a Cell SPU ...

The lower Vdd, the less dynamic energy consumption.

$$\mathbf{E}_{0\to 1} = \frac{1}{2} \mathbf{C} \mathbf{V}_{dd}^2 \qquad \mathbf{E}_{1\to 0} = \frac{1}{2} \mathbf{C} \mathbf{V}_{dd}^2$$

The lower Vdd, the longer the minimum clock period, and so the lower the maximum clock frequency.



### Clock speed alone doesn't help E/op ...

But, lowering clock frequency while keeping voltage constant spreads the same amount of work over a longer time, so chip stays cooler ...

$$\mathbf{E}_{0\to 1} = \frac{1}{2} \mathbf{C} \mathbf{V}_{dd}^2 \qquad \mathbf{E}_{1\to 0} = \frac{1}{2} \mathbf{C} \mathbf{V}_{dd}^2$$




### Scaling V and f does lower energy/op

1 W to get 2.2 GHz performance. 26 C die temp.

7 W to reliably get 4.4 GHz performance. 47 C die temp.

If a program that needs a 4.4 GHz CPU can be recoded to use two 2.2 GHz CPUs ... big win.
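The size of the win follows from the two operating points quoted above (1 W at 2.2 GHz, 7 W at 4.4 GHz). A minimal sketch, assuming the same work is done per cycle at both points and that the program splits perfectly across two cores; both are idealizations, and the function name is illustrative.

```python
def energy_per_cycle_nj(power_w: float, freq_ghz: float) -> float:
    """Energy per clock cycle in nanojoules (W / GHz = nJ per cycle)."""
    return power_w / freq_ghz


if __name__ == "__main__":
    slow = energy_per_cycle_nj(1.0, 2.2)  # low-voltage operating point
    fast = energy_per_cycle_nj(7.0, 4.4)  # high-voltage operating point
    print(f"2.2 GHz: {slow:.2f} nJ/cycle")
    print(f"4.4 GHz: {fast:.2f} nJ/cycle ({fast / slow:.1f}x more per cycle)")

    # Same aggregate cycle rate: one core at 4.4 GHz vs. two cores at 2.2 GHz.
    print(f"One fast core: 7.0 W, two slow cores: {2 * 1.0:.1f} W")
```

Two slow cores deliver the same 4.4 G cycles per second for 2 W instead of 7 W, a 3.5x reduction, provided the software can actually keep both busy.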




### Intel's dual-core analysis ...

#### But only if your app(s) can put 2 cores to use!



From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.


#### How iPod nano 2005 puts its 2 cores to use ...



#### PP5020



Digital media management system-on-chip




#### **Dual ARM Processors**

- Dual 32-bit ARM7TDMI processors
- Up to 80 MHz processor operation per core with independent clock-skipping feature on COP
- Efficient cross-bar implementation providing zero wait state access to internal RAM
- Integrated 96KB of SRAM
- 8KB of unified cache per processor
- Six DMA channels

Two 80 MHz CPUs. Was used in several nano generations, with one CPU doing audio decoding, the other doing photos, etc.

Design Technique #2 (of 5)

### Powering down idle circuits



### Add "sleep" transistors to logic ...



Example: Floating point unit logic.

When running fixed-point instructions, put logic "to sleep".

+++ When "asleep", leakage power is dramatically reduced.

--- Presence of sleep transistors slows down the clock rate when the logic block is in use.



### Intel example: Sleeping cache blocks



From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.


### Fall 2009: Intel Mainstream Desktop





### Power Consumption (Load) Entire System





From: PC Perspective website


Design Technique #3 (of 5)

### Slow down "slack paths"



### Fact: Most logic on a chip is "too fast"



Most wires have hundreds of picoseconds to spare.





From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.


### Use several supply voltages on a chip ...



Why use multi-Vdd? We can reduce dynamic power by using low-power Vdd for logic off the critical path.

What if we can't do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the slow logic by using high-Vth transistors.

From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.


#### LOW POWER ARM 1136JF-S™ DESIGN

George Kuo, Anand Iyer Cadence Design Systems, Inc. San Jose, CA 95134, USA

Logical partition into 0.8V and 1.0V nets done manually to meet 350 MHz spec (90nm).

Level-shifter insertion and placement done automatically.

Dynamic power in 0.8V section cut 50% below baseline.

Leakage power in 1.0V section cut 70% below baseline.

- Vdd (1.0V) domain w/ VDDCORE3&4: ~100K std cells + 44 RAMs
- 3,300 level shifters
- Vdd (0.8V) domain w/ VDDCORE1&2: ~200K std cells

From a chapter in a new book on ASIC design by Chinnery and Keutzer (UCB).


Design Technique #4 (of 5)

### Gating clocks to save power



### On a CPU, where does the power go?



Half of the power goes to latches (Flip-Flops).

Most of the time, the latches don't change state.

So (gasp) gated clocks are a big win. But, done with CAD tools in a disciplined way.

### Synopsys Power Compiler can do this ...

"Up to 70% power savings at the block level, for applicable circuits" (Synopsys Data Sheet)



### Power Compiler also can do this ...



10-20% push-button power savings, using techniques like this one.



Design Technique #5 (of 5)

### **Thermal Management**



### Keep chip cool to minimize leakage power



Figure 3: I<sub>CCINTQ</sub> vs. Junction Temperature with Increase Relative to 25°C

From: "Optimizing Designs for Power Consumption through Changes to the FPGA Environment", WP285 (v1.0), February 14, 2008.

### Monitor die temperature, servo clock speed




### **Preceded by Intel: "Turbo Boost"**

| Processor         | Base clock | Processor cores | Max Turbo Boost bin upside (1C / 2C / 3C / 4C active) | Max Turbo Boost frequency, GHz (1C / 2C / 3C / 4C active) |
|-------------------|------------|-----------------|-------------------------------------------------------|-----------------------------------------------------------|
| Intel Core i7-870 | 2.93 GHz   | 4               | 5 / 4 / 2 / 2                                         | 3.6 / 3.46 / 3.2 / 3.2                                    |
| Intel Core i7-860 | 2.80 GHz   | 4               | 5 / 4 / 1 / 1                                         | 3.46 / 3.33 / 2.93 / 2.93                                 |
| Intel Core i5-750 | 2.66 GHz   | 4               | 4 / 4 / 1 / 1                                         | 3.2 / 3.2 / 2.8 / 2.8                                     |


Bin Size = 133 MHz
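The table entries are consistent with "base clock plus bins times bin size". The sketch below checks that reading against every listed frequency. The function names and the dictionary layout are illustrative; the numbers are copied from the table and the 133 MHz bin size.

```python
BIN_GHZ = 0.133  # bin size from the slide: 133 MHz

# processor -> (base clock in GHz, bins for 1/2/3/4 active cores, listed max GHz)
TABLE = {
    "Core i7-870": (2.93, (5, 4, 2, 2), (3.6, 3.46, 3.2, 3.2)),
    "Core i7-860": (2.80, (5, 4, 1, 1), (3.46, 3.33, 2.93, 2.93)),
    "Core i5-750": (2.66, (4, 4, 1, 1), (3.2, 3.2, 2.8, 2.8)),
}


def turbo_frequency_ghz(base_ghz: float, bins: int, bin_ghz: float = BIN_GHZ) -> float:
    """Maximum turbo frequency: base clock plus `bins` steps of one bin each."""
    return base_ghz + bins * bin_ghz


def max_table_error_ghz(table=TABLE) -> float:
    """Largest gap between the computed and the listed turbo frequencies."""
    return max(
        abs(turbo_frequency_ghz(base, b) - listed)
        for base, bins, listed_freqs in table.values()
        for b, listed in zip(bins, listed_freqs)
    )


if __name__ == "__main__":
    for name, (base, bins, _) in TABLE.items():
        freqs = ", ".join(f"{turbo_frequency_ghz(base, b):.2f}" for b in bins)
        print(f"{name}: {freqs} GHz for 1/2/3/4 active cores")
    print(f"Largest mismatch vs. table: {max_table_error_ghz():.3f} GHz")
```

Every entry agrees to within 0.01 GHz, which is just rounding in the published figures. Fewer active cores leave more thermal headroom, so more bins are allowed.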

### IBM Power 4: How does die heat up?



4 dies on a multi-chip module

2 CPUs per die





### 115 Watts: Concentrated in "hot spots"





66.8 C == 152 F; 82 C == 179.6 F


# **Power and Energy in FPGAs**





#### **FPGA: Xilinx Virtex-5 XC5VLX110T**







Colors represent different types of resources:

- Logic
- Block RAM
- DSP (ALUs)
- Clocking
- I/O
- Serial I/O + PCI

A routing fabric runs throughout the chip to wire everything together.





### Power Issue #1: Switching Fabric 'C'



Slices define regular connections to the switching fabric, and to slices in CLBs above and below it on the die.



The LX110T has 17,280 slices.

### How slices appear on the die ...

X0, X2, ... are lower CLB slices.

X1, X3, ... are upper CLB slices.

Y0, Y1, ... are CLB column positions.



Lower-left corner of the die.


### Simple model of FPGA interconnect ...

Why 'C' is so big: (1) each green dot is a transistor switch (2) path may not be shortest length (3) all wires are too long!





In a non-FPGA (ASIC) chip, this wire may have 10 times less capacitance, and thus use 10 times less power on each flip!


### Power Issue #2: Unused slice logic.



Unused logic that cannot be turned off wastes static power.

[Figure: SLICEM diagram (UG190_5_03_041006).]

### Power Issue #3: Clock power.

32 global clock wires go down the red column.

Any 10 may be sent to a clock region.

Much of the clock wire is a clock wire to nowhere. Big power waste.





### Power Issue #4: CAD speed tradeoffs ...

An example design from a Xilinx "white paper" ...

| 1 x 32.5k x 36-bit                            | Area                     | Fmax (MHz) | Total dynamic power (mW) | Dynamic RAM power (mW) |
|-----------------------------------------------|--------------------------|------------|--------------------------|------------------------|
| Speed implementation (CAD set for high speed) | 70 Slices, 73 RAMB16     | 139        | 135.6                    | 99.6                   |
| XST 9.2i-power (CAD set for low power)        | 751 Slices, 65 RAMB16    | 123        | 48                       | 2.4                    |



A small drop in clock speed, but a big power win. But the design size explodes (70 slices become 751), and the tools run slowly.
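The tradeoff can be quantified directly from the white-paper numbers above. A minimal sketch follows. It assumes the dynamic power figures were taken at each implementation's Fmax, which the slide does not state, so the per-cycle energy is indicative only. All names are illustrative.

```python
# Figures from the white-paper table: slices, Fmax (MHz), total dynamic power (mW).
SPEED = {"slices": 70, "fmax_mhz": 139.0, "dynamic_mw": 135.6}
LOW_POWER = {"slices": 751, "fmax_mhz": 123.0, "dynamic_mw": 48.0}


def energy_per_cycle_nj(run: dict) -> float:
    """Dynamic energy per clock cycle in nanojoules (mW / MHz = nJ)."""
    return run["dynamic_mw"] / run["fmax_mhz"]


def compare(fast: dict, low: dict) -> dict:
    """Ratios between the speed-optimized and the power-optimized implementation."""
    return {
        "power_ratio": fast["dynamic_mw"] / low["dynamic_mw"],
        "clock_drop_pct": 100.0 * (fast["fmax_mhz"] - low["fmax_mhz"]) / fast["fmax_mhz"],
        "slice_ratio": low["slices"] / fast["slices"],
        "energy_per_cycle_ratio": energy_per_cycle_nj(fast) / energy_per_cycle_nj(low),
    }


if __name__ == "__main__":
    for key, value in compare(SPEED, LOW_POWER).items():
        print(f"{key}: {value:.2f}")
```

Roughly 2.8x less dynamic power for an 11.5% slower clock, at the cost of more than 10x the slices.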


## The final stretch ...

| Week | Date     | Lecture                              | Homework due                            |
|------|----------|--------------------------------------|-----------------------------------------|
|      | Thu 4/28 | Lec #28: Course Wrap-up (now)        |                                         |
| 16   | Tue 5/3  | RRR Week, No Lecture                 | Cleanup & Optimizations, Early Checkoff |
| 16   | Thu 5/5  | RRR Week, No Lecture                 | Cleanup & Optimizations, Early Checkoff |
| 17   | Mon 5/9  | Final Exam is Monday 5/9, 11:30-2:30 | Final Project Checkoff                  |



Have a great summer!
