inst.eecs.berkeley.edu/~eecs251b

# EECS251B : Advanced Digital Circuits and Systems

# Lecture 21 – Low-Power Design

#### Borivoje Nikolić

Advancing science: Microsoft and Quantinuum demonstrate the most reliable logical qubits on record with an error rate 800x better than physical qubits

**April 3, 2024.** Today signifies a major achievement for the entire quantum ecosystem: Microsoft and Quantinuum demonstrated the most reliable logical qubits on record. By applying Microsoft's breakthrough qubit-virtualization system, with error diagnostics and correction, to Quantinuum's ion-trap hardware, we ran more than 14,000 individual experiments without a single error. Furthermore, we demonstrated more reliable quantum computation by performing error diagnostics and corrections on logical qubits without destroying them. This finally moves us out of the current noisy intermediate-scale quantum (NISQ) level to Level 2 Resilient quantum computing.



Quantinuum scientists making adjustments to a beam line array used to deliver laser pulses in H-Series quantum computers. Photo courtesy of Quantinuum.





#### Announcements

- Quiz 3 is today
- Homework 4 released
- Project
  - Preliminary design review next Tuesday
  - Starting at 9am, so everyone can present
  - 7 minutes per team
- Lab 5 due this week



# Architectural Power-Performance Tradeoffs

# **Optimal Processors**

- Processors used to be optimized for performance
  - Optimal logic depth was found to be 8-11 FO4 delays in superscalar processors
  - 1.8-3 FO4 in sequentials, rest in combinational logic
    - Kunkel, Smith, ISCA'86
    - Hrishikesh, Jouppi, Farkas, Burger, Keckler, Shivakumar, ISCA'02
    - Hartstein, Puzak, ISCA'02
    - Sprangle, Carmean, ISCA'02
- But those designs have very high power dissipation
  - Need to optimize for both performance and power/energy

# From System View: What is the Optimum?

- How do sensitivities relate to more traditional metrics:
  - Power per operation (MIPS/W, GOPS/W, TOPS/W)
  - Energy per operation (Joules per op)
  - Energy-delay product
- Can be reformatted as a goal of optimizing power x delay<sup>n</sup>
  - n = 0 minimize power per operation
  - n = 1 minimize energy per operation
  - n = 2 minimize energy-delay product
  - n = 3 minimize energy-delay<sup>2</sup> product
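To see how the exponent n changes which design looks best, the sketch below scores two design points with power × delay<sup>n</sup>. The two designs and their power and delay numbers are hypothetical, chosen only for illustration.

```python
# Hypothetical design points: name -> (power in W, delay per operation in ns).
# These numbers are illustrative assumptions, not data from the lecture.
designs = {"fast": (4.0, 1.0), "frugal": (1.0, 2.5)}


def metric(power, delay, n):
    """Composite cost power * delay^n (n = 1 is energy, n = 2 is EDP)."""
    return power * delay ** n


def best(n):
    """Return the name of the design with the lowest power * delay^n."""
    return min(designs, key=lambda name: metric(*designs[name], n))


for n in range(4):
    print(f"n = {n}: best design is {best(n)}")
```

With these assumed numbers the low-power design wins for n = 0 and n = 1, while the faster design wins once delay is weighted more heavily (n = 2 and n = 3).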

# **Optimization Problem**

- Set up optimization problem:
  - Maximize performance under energy constraints
  - Minimize energy under performance constraints
- Or minimize a composite function of $E^m D^n$
  - What are the right m and n?
  - $m = 1$, $n = 1$ is EDP, which improves at lower $V_{DD}$
  - $m = 1$, $n = 2$ is invariant to $V_{DD}$
    - $E \sim CV_{DD}^2$
    - $D \sim 1/V_{DD}$
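The dependence on $V_{DD}$ can be checked numerically. The sketch below uses only the first-order proportionalities $E \sim CV_{DD}^2$ and $D \sim 1/V_{DD}$, with constants normalized to 1 (an assumption), so the absolute values carry no meaning.

```python
def energy(vdd, c=1.0):
    """First-order switching energy, E ~ C * VDD^2 (normalized)."""
    return c * vdd ** 2


def delay(vdd):
    """First-order delay, D ~ 1 / VDD (normalized)."""
    return 1.0 / vdd


def cost(vdd, m, n):
    """Composite metric E^m * D^n at a given supply voltage."""
    return energy(vdd) ** m * delay(vdd) ** n


for vdd in (1.2, 1.0, 0.8):
    print(f"VDD = {vdd}: EDP = {cost(vdd, 1, 1):.3f}, ED^2 = {cost(vdd, 1, 2):.3f}")
```

Under this model EDP scales linearly with $V_{DD}$, so it keeps improving as the supply drops, while $ED^2$ stays constant.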

#### Hardware Intensity

- Introduced by Zyuban and Strenski in 2002.
- Measures where the design sits on the energy-delay curve



Slope of the optimal E-D curve at the chosen design point

## **Optimum Across Hierarchy Layers**



Zyuban et al, TComp'04

#### Optimal logic depth in pipelined processors is ~18 FO4; relatively flat in the 16-22 FO4 range

#### Architectural Tradeoffs

• H. Mair, ISSCC'20



#### **Energy-Delay Tradeoff of Modern Processors**

| Design parameter           | D1       | D2       | D3           | D4           |
|----------------------------|----------|----------|--------------|--------------|
| In-order vs out-of-order   | in-order | in-order | out-of-order | out-of-order |
| Issue width                | 1-issue  | 2-issue  | 2-issue      | 4-issue      |
| Cycle time (FO4)           | 27.5     | 16.9     | 17.2         | 16.3         |
| Branch pred size (entries) | 264      | 600      | 1024         | 870          |
| BTB size (entries)         | 64       | 90       | 554          | 1024         |
| I-cache size (KB)          | 21       | 32       | 32           | 32           |
| D-cache size (KB)          | 8        | 11       | 14           | 42           |
| Fetch latency              | 1.0      | 1.6      | 2.2          | 2.1          |
| Decode/Rename latency      | 1.0      | 1.7      | 2.4          | 3.0          |
| Retire latency             | N/A      | N/A      | 2.0          | 2.2          |
| Integer ALU latency        | 1.0      | 1.0      | 1.0          | 1.0          |
| FP ALU latency             | 3.0      | 4.0      | 3.9          | 4.1          |
| L1 D-cache latency         | 1.0      | 1.1      | 1.1          | 1.1          |
| ROB size                   | N/A      | N/A      | 22           | 32           |
| IW size                    | N/A      | N/A      | 11           | 9            |
| LSQ size                   | N/A      | N/A      | 16           | 16           |

 Table 3: Design Configuration Details For Selected Design Points.



Azizi, ISCA'10


## Architectural Tradeoffs: Tri-Gear

- HP: High performance (ARM Cortex A78, optimized for speed, 3.0GHz)
- BP: Balanced performance (ARM Cortex A78, optimized for power, 2.6GHz)
- HE: High efficiency (ARM A55, 2.0GHz)





# **Circuit-Level Tradeoffs**

#### Alpha-Power Based Delay Model



$$D = \sum_i t_{pi} = \sum_i \frac{K_{d} V_{DD}}{(V_{DD} - V_{Th})^{\alpha}} \left( 1 + \frac{W_{L,i}}{W_{in,i}} \right)$$



- Switching: $E_{Sw} = \alpha_{0 \to 1} (C_{L,i} + C_{int,i}) V_{DD}^{2}$
- Leakage: $E_{Lk} = W_{in} I_0 e^{-\frac{V_{Th}-\gamma V_{DD}}{nV_t}} V_{DD} D$
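The three model equations can be written as small functions to explore how supply and threshold move delay and energy. This is a minimal sketch: every default parameter value (`kd`, `alpha`, `activity`, `i0`, `gamma`, `n`, `vt`) is an assumed, normalized placeholder rather than a value from the lecture.

```python
import math


def stage_delay(vdd, vth, w_in, w_load, kd=1.0, alpha=1.3):
    """Alpha-power delay of one stage: Kd*VDD/(VDD-VTh)^alpha * (1 + W_L/W_in)."""
    return kd * vdd / (vdd - vth) ** alpha * (1.0 + w_load / w_in)


def switching_energy(vdd, c_load, c_int, activity=0.1):
    """E_Sw = alpha_{0->1} * (C_L + C_int) * VDD^2."""
    return activity * (c_load + c_int) * vdd ** 2


def leakage_energy(vdd, vth, w_in, d, i0=1.0, gamma=0.1, n=1.5, vt=0.026):
    """E_Lk = W_in * I0 * exp(-(VTh - gamma*VDD)/(n*Vt)) * VDD * D."""
    return w_in * i0 * math.exp(-(vth - gamma * vdd) / (n * vt)) * vdd * d


# Example: one stage driving a load four times its own width (assumed values).
d = stage_delay(vdd=1.0, vth=0.3, w_in=1.0, w_load=4.0)
print("delay   :", d)
print("E_sw    :", switching_energy(1.0, c_load=4.0, c_int=1.0))
print("E_leak  :", leakage_energy(1.0, 0.3, w_in=1.0, d=d))
```

Lowering `vdd` in this sketch reduces switching energy quadratically but increases delay, and lowering `vth` speeds the stage up at the cost of exponentially higher leakage.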

# Sizing, Supply, Threshold Optimization

- Transistor sizing can yield large power savings with small delay penalties
  - Gate sizing
  - Beta-ratio adjustments $\beta = W_p/W_n$
  - (Stack resizing)
- Supply voltage affects both active and leakage energy
- Threshold voltage affects primarily the leakage

#### Apply to Sizing of an Inverter Chain



Unconstrained energy: find min $D = \sum t_{pi}$

$$C_{gin,j} = \sqrt{C_{gin,j-1}C_{gin,j+1}} \qquad \qquad W_j = \sqrt{W_{j-1}W_{j+1}}$$

Constrained energy: find min $D$ under $E < E_{max}$, where $E = \sum E_i$
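For the unconstrained case, the geometric-mean condition means every stage sees the same fanout. A minimal sketch, assuming a fixed number of stages and ignoring parasitics (the function names are illustrative):

```python
def geometric_sizes(c_in, c_load, n_stages):
    """Input capacitance of each stage in the minimum-delay chain.

    Each stage is the geometric mean of its neighbours, which is the same
    as a constant per-stage fanout f = (c_load / c_in) ** (1 / n_stages).
    """
    fanout = (c_load / c_in) ** (1.0 / n_stages)
    return [c_in * fanout ** j for j in range(n_stages)]


def satisfies_geometric_mean(sizes, c_load, tol=1e-9):
    """Check C_j = sqrt(C_{j-1} * C_{j+1}) for every interior stage."""
    chain = list(sizes) + [c_load]
    return all(
        abs(chain[j] - (chain[j - 1] * chain[j + 1]) ** 0.5) < tol
        for j in range(1, len(chain) - 1)
    )


sizes = geometric_sizes(c_in=1.0, c_load=64.0, n_stages=3)
print(sizes)  # approximately [1, 4, 16]
print(satisfies_geometric_mean(sizes, 64.0))
```

Under an energy constraint the optimum departs from this uniform taper, which is what the constrained formulation addresses.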

EECS251B L21 LOW-POWER DESIGN

# **Constrained Optimization**

- Find min(D) subject to  $E = E_{max}$ 
  - Constrained function minimization
- E.g. Lagrange multipliers:

$$\Lambda(x) = D(x) + \lambda(E(x) - E_{\max})$$

Or dual:

$$K(x) = E(x) + \lambda (D(x) - D_{\max})$$

$$\frac{\partial \Lambda}{\partial x} = 0$$

• Can solve analytically for $x = W_i, V_{DD}, V_{Th}$
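As a small worked instance of the Lagrangian condition, consider a two-stage chain with a single free size $w$. The normalized models $D(w) = w + F/w$ and $E(w) = w + F$ (load $F$, unit-size first stage, fixed supply, no parasitics) are simplifying assumptions made for this sketch, not the lecture's full model.

```python
import math


def solve_two_stage(load, e_max):
    """Minimize D(w) = w + load/w subject to E(w) = w + load <= e_max.

    Returns (w, lam), where lam is the Lagrange multiplier obtained from
    dD/dw + lam * dE/dw = 0, i.e. lam = load / w**2 - 1.
    """
    w_free = math.sqrt(load)   # unconstrained optimum (dD/dw = 0)
    w_cap = e_max - load       # largest w the energy budget allows
    if w_cap <= 0:
        raise ValueError("energy budget is too small to drive the load")
    if w_cap >= w_free:
        return w_free, 0.0     # constraint inactive, multiplier is zero
    lam = load / w_cap ** 2 - 1.0
    return w_cap, lam


print(solve_two_stage(16.0, 18.0))   # tight budget: constraint is active
print(solve_two_stage(16.0, 100.0))  # loose budget: unconstrained optimum
```

In this sketch the multiplier equals $-\partial D/\partial E$ at the solution, so a large value signals that a little extra energy would buy a lot of delay.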



#### Inverter Chain: Sizing Optimization



- Variable taper achieves minimum energy
- Reduce number of stages at large d<sub>inc</sub>

#### Sensitivity to Sizing and Supply

• Gate sizing (W<sub>i</sub>)



Sensitivity is $\infty$ for equal $f_{eff}$ (the $D_{min}$ sizing)

• Supply voltage (V<sub>DD</sub>)

$$-\frac{\partial E_{sw}/\partial V_{DD}}{\partial D/\partial V_{DD}} = \frac{E_{sw}}{D} \cdot 2\,\frac{1 - x_v}{\alpha - 1 + x_v}$$

 $x_v = (V_{Th} + \Delta V_{Th})/V_{DD}$
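The right-hand side has units of energy per delay, so the sensitivity is read here as the ratio of $\partial E_{sw}/\partial V_{DD}$ to $\partial D/\partial V_{DD}$. The sketch below checks the closed form against finite differences, using $E_{sw} \sim CV_{DD}^2$ and the alpha-power delay with $\Delta V_{Th} = 0$. The values of `VTH` and `ALPHA` are assumed for illustration.

```python
VTH = 0.3     # assumed threshold voltage (V)
ALPHA = 1.3   # assumed velocity-saturation index


def delay(vdd):
    """Alpha-power delay, normalized: VDD / (VDD - VTh)^alpha."""
    return vdd / (vdd - VTH) ** ALPHA


def e_sw(vdd):
    """Switching energy, normalized: VDD^2."""
    return vdd ** 2


def sensitivity_numeric(vdd, h=1e-6):
    """-(dE/dVDD) / (dD/dVDD) using central differences."""
    d_e = (e_sw(vdd + h) - e_sw(vdd - h)) / (2 * h)
    d_d = (delay(vdd + h) - delay(vdd - h)) / (2 * h)
    return -d_e / d_d


def sensitivity_closed_form(vdd):
    """(E_sw / D) * 2 * (1 - x_v) / (alpha - 1 + x_v), with x_v = VTh / VDD."""
    x_v = VTH / vdd
    return e_sw(vdd) / delay(vdd) * 2 * (1 - x_v) / (ALPHA - 1 + x_v)


for vdd in (0.6, 0.8, 1.0):
    print(vdd, sensitivity_numeric(vdd), sensitivity_closed_form(vdd))
```

The two columns agree to numerical precision, and the sensitivity shrinks as $V_{DD}$ approaches $V_{Th}$, where further supply reduction costs a lot of delay for little energy.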





• Threshold voltage (V<sub>Th</sub>)

$$-\frac{\partial E/\partial \Delta V_{Th}}{\partial D/\partial \Delta V_{Th}} = P_{Lk} \left( \frac{V_{DD} - V_{Th} - \Delta V_{Th}}{\alpha n V_t} - 1 \right)$$

Low initial leakage ⇒ speedup comes for "free"
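The threshold sensitivity can be checked the same way, reading it as the ratio of the energy and delay derivatives with respect to $\Delta V_{Th}$ and taking $P_{Lk} = E_{Lk}/D$. Only leakage energy depends on the threshold in this model. All parameter values below are assumptions for illustration.

```python
import math

VDD, VTH = 1.0, 0.3                 # assumed supply and nominal threshold (V)
ALPHA, N, VT = 1.3, 1.5, 0.026      # assumed alpha, subthreshold slope factor, kT/q
GAMMA = 0.1                         # assumed DIBL coefficient


def delay(dvth):
    """Alpha-power delay with a threshold shift dvth (normalized)."""
    return VDD / (VDD - VTH - dvth) ** ALPHA


def e_lk(dvth):
    """Leakage energy over one delay period (W_in = I0 = 1, normalized)."""
    current = math.exp(-(VTH + dvth - GAMMA * VDD) / (N * VT))
    return current * VDD * delay(dvth)


def sensitivity_numeric(dvth, h=1e-6):
    """-(dE/d dvth) / (dD/d dvth) using central differences."""
    d_e = (e_lk(dvth + h) - e_lk(dvth - h)) / (2 * h)
    d_d = (delay(dvth + h) - delay(dvth - h)) / (2 * h)
    return -d_e / d_d


def sensitivity_closed_form(dvth):
    """P_Lk * ((VDD - VTh - dvth) / (alpha * n * Vt) - 1)."""
    p_lk = e_lk(dvth) / delay(dvth)
    return p_lk * ((VDD - VTH - dvth) / (ALPHA * N * VT) - 1.0)


for dvth in (-0.05, 0.0, 0.05):
    print(dvth, sensitivity_numeric(dvth), sensitivity_closed_form(dvth))
```

Because the sensitivity is proportional to $P_{Lk}$, a design that starts with low leakage has a small sensitivity, which is why lowering the threshold initially buys speed almost for free.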





# Scaling Supplies





- Strong function of voltage (V<sup>2</sup> dependence).
- Relatively independent of logic function and style.
- Power Delay Product Improves with lowering V<sub>DD</sub>.

Chandrakasan, JSSC'92

#### Lower V<sub>DD</sub> Increases Delay



Relatively independent of logic function and style.

#### Trade-off Between Power and Delay




#### Architecture Trade-off for Fixed-rate Processing: Reference Datapath



- Critical path delay $\Rightarrow$ T<sub>adder</sub> + T<sub>comparator</sub> (= 25 ns) $\Rightarrow$ $f_{ref} = 40$ MHz
- Total capacitance being switched = C<sub>ref</sub>
- $V_{DD} = V_{ref} = 5$ V
- Power for reference datapath = $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$

From [Chandrakasan92] (*IEEE JSSC*)

## Parallel Datapath



• The clock rate can be halved while keeping the same throughput $\Rightarrow f_{par} = f_{ref} / 2$

• $V_{par} = V_{ref} / 1.7$, $C_{par} = 2.15C_{ref}$

• $P_{par} = (2.15C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36 P_{ref}$

## **Pipelined Datapath**



- Critical path delay is less  $\Rightarrow$  max [T<sub>adder</sub>, T<sub>comparator</sub>]
- Keeping clock rate constant: $f_{pipe} = f_{ref}$
- Voltage can be dropped $\Rightarrow V_{pipe} = V_{ref} / 1.7$
- Capacitance slightly higher:  $C_{pipe} = 1.15C_{ref}$
- $P_{pipe} = (1.15C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.39 P_{ref}$

## A Simple Datapath: Summary

| Architecture type                                    | Voltage | Area | Power |
|------------------------------------------------------|---------|------|-------|
| Simple datapath<br>(no pipelining or<br>parallelism) | 5V      | 1    | 1     |
| Pipelined datapath                                   | 2.9V    | 1.3  | 0.39  |
| Parallel datapath                                    | 2.9V    | 3.4  | 0.36  |
| Pipeline-Parallel                                    | 2.0V    | 3.7  | 0.2   |
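The power column for the pipelined and parallel rows follows directly from $P = CV^2f$. The sketch below recomputes it using the 2.9 V entry from the table (the factor 1.7 quoted for the voltage reduction is a rounding of 5 V / 2.9 V). The pipeline-parallel row is left out because its capacitance factor is not given here.

```python
V_REF = 5.0  # reference supply voltage (V)


def relative_power(c_ratio, vdd, f_ratio):
    """P / P_ref for P = C * V^2 * f, with C and f given relative to the reference."""
    return c_ratio * (vdd / V_REF) ** 2 * f_ratio


p_pipe = relative_power(c_ratio=1.15, vdd=2.9, f_ratio=1.0)  # same clock rate
p_par = relative_power(c_ratio=2.15, vdd=2.9, f_ratio=0.5)   # half clock rate

print(f"pipelined: {p_pipe:.2f} x P_ref")  # 0.39
print(f"parallel : {p_par:.2f} x P_ref")   # 0.36
```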




# **Multiple Supplies**

# Multiple Supply Voltages

- Block-level supply assignment ("power domains" or "voltage islands")
  - Higher throughput/lower latency functions are implemented in higher  $V_{DD}$
  - Slower functions are implemented with lower  $V_{DD}$
  - Often called "Voltage islands"
  - Separate supply grids, level conversion performed at block boundaries
- Multiple supplies inside a block
  - Non-critical paths moved to lower supply voltage
  - Level conversion within the block
  - Physical design challenging
  - (Not used in practice)
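A rough feel for the benefit of moving logic to a lower supply comes from $E \sim CV_{DD}^2$ alone. The sketch below is a first-order estimate under stated assumptions: `frac_low` is the fraction of switched capacitance moved to the lower supply, and level-converter energy and leakage are ignored.

```python
def dual_supply_energy_ratio(frac_low, v_high, v_low):
    """Switching energy relative to running everything at v_high.

    Assumes E ~ C * V^2 and ignores level-conversion overhead and leakage.
    frac_low is the fraction of switched capacitance on the lower supply.
    """
    if not 0.0 <= frac_low <= 1.0:
        raise ValueError("frac_low must be between 0 and 1")
    return (1.0 - frac_low) + frac_low * (v_low / v_high) ** 2


# Example with assumed supplies of 1.1 V and 0.9 V.
for frac in (0.0, 0.5, 1.0):
    print(frac, dual_supply_energy_ratio(frac, v_high=1.1, v_low=0.9))
```

The saving is bounded by $(V_L/V_H)^2$ and grows linearly with the fraction of capacitance that can tolerate the slower supply.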

## **Power Domains**



Utilize Unified Power Format (UPF) to capture design intent. Common Power Format (CPF) is similar.

#### Power Domain Design Intent

top\_wrapper



#### https://vlsitutorials.com/upf-low-power-vlsi/


There are primarily 3 power domains –

• Logic inside aon\_wrapper [but not inside aon\_pgd\_wrapper] is always-on.

• Logic inside pgd\_wrapper can be power gated.

• Logic inside aon\_pgd\_wrapper can be power gated but won't be power gated when pgd\_wrapper is powered ON.

There are two voltage domains –

• The supply voltage to logic inside aon\_wrapper [but not inside aon\_pgd\_wrapper] and logic inside pgd\_wrapper is 0.9V.

• The supply voltage to logic inside aon\_pgd\_wrapper is 1.1V.

There are two registers – reg A and reg B. The state of reg A needs to be retained in power gated state. There are six signals sig1-sig6 coming to and from different logic blocks.

## **Unified Power Format**




#### **# Create Power Domains**

create\_power\_domain pd\_top -include\_scope
create\_power\_domain pd\_aon -elements {aon\_wrapper}
create\_power\_domain pd\_gated -elements {pgd\_wrapper}
create\_power\_domain pd\_gated\_aon -elements {{aon\_wrapper/aon\_pgd\_wrapper}}

There are primarily 3 power domains -

• Logic inside aon\_wrapper [but not inside aon\_pgd\_wrapper] is always-on.

- Logic inside pgd\_wrapper can be power gated.
- Logic inside aon\_pgd\_wrapper can be power gated but won't be power gated when pgd\_wrapper is powered ON.

#### **Unified Power Format**



#### **# Create Supply Ports**

create\_supply\_port VCCL -direction in -domain pd\_top
create\_supply\_port VCCH -direction in -domain pd\_top
create\_supply\_port GND -direction in -domain pd\_top

#### # Create Supply Nets

create\_supply\_net VCCL -domain pd\_top
create\_supply\_net VCCH -domain pd\_top
create\_supply\_net GND -domain pd\_top
create\_supply\_net VCCL -domain pd\_aon -reuse
create\_supply\_net GND -domain pd\_aon -reuse
create\_supply\_net VCCH -domain pd\_gated\_aon -reuse
create\_supply\_net VCCH\_gated -domain pd\_gated\_aon
create\_supply\_net GND -domain pd\_gated\_aon -reuse
create\_supply\_net VCCL -domain pd\_gated\_aon -reuse
create\_supply\_net VCCL -domain pd\_gated -reuse
create\_supply\_net VCCL\_gated -domain pd\_gated
create\_supply\_net GND -domain pd\_gated -reuse

There are two voltage domains -

• The supply voltage to logic inside aon\_wrapper [but not inside aon\_pgd\_wrapper] and logic inside pgd\_wrapper is 0.9V.

• The supply voltage to logic inside aon\_pgd\_wrapper is 1.1V.


## **Unified Power Format**



#### **# Connect Supply Nets with corresponding Ports**

connect\_supply\_net VCCL -ports VCCL
connect\_supply\_net VCCH -ports VCCH
connect\_supply\_net GND -ports GND

#### **# Establish Connections**

set\_domain\_supply\_net pd\_top -primary\_power\_net VCCL -primary\_ground\_net GND

set\_domain\_supply\_net pd\_aon -primary\_power\_net VCCL -primary\_ground\_net GND

set\_domain\_supply\_net pd\_gated\_aon -primary\_power\_net VCCH\_gated -primary\_ground\_net GND

set\_domain\_supply\_net pd\_gated -primary\_power\_net VCCL\_gated -primary\_ground\_net GND




#### **# Shut-Down Logic for pgd\_wrapper & aon\_pgd\_wrapper**

create\_power\_switch sw\_pgd\_wrapper \
-domain pd\_gated \
-input\_supply\_port "sw\_VCCL VCCL" \
-output\_supply\_port "sw\_VCCL\_gated VCCL\_gated" \
-control\_port "sw\_pgd\_en aon\_wrapper/pmu/pgd\_en" \
-on\_state "SW\_PGD\_ON sw\_VCCL {!sw\_pgd\_en}"
create\_power\_switch sw\_aon\_pgd\_wrapper \
-domain pd\_gated\_aon \
-input\_supply\_port "sw\_VCCH VCCH" \
-output\_supply\_port "sw\_VCCH\_gated VCCH\_gated" \
-control\_port "sw\_aon\_pgd\_en aon\_wrapper/pmu/aon\_pgd\_en" \
-on\_state "SW\_AONPGD\_ON sw\_VCCH {!sw\_aon\_pgd\_en}"




#### # Isolation strategy

. . .

set\_isolation isol\_clamp1\_sig\_from\_pgd \
-domain pd\_gated \
-isolation\_power\_net VCCL \
-isolation\_ground\_net GND \
-clamp\_value 1 \
-elements {pgd\_wrapper/sig2}
set\_isolation\_control isol\_clamp1\_sig\_from\_pgd \
-domain pd\_gated \
-isolation\_signal aon\_wrapper/pmu/isol\_pgd\_en \
-isolation\_sense low \
-location parent
set\_isolation isol\_clamp0\_sig\_from\_pgd \
-domain pd\_gated \
-isolation\_power\_net VCCL \
-isolation\_ground\_net GND \
-clamp\_value 0 \
-elements {pgd\_wrapper/sig4}
set\_isolation\_control isol\_clamp0\_sig\_from\_pgd \
-domain pd\_gated \
-isolation\_signal aon\_wrapper/pmu/isol\_pgd\_en \
-isolation\_sense low \
-location parent




### # Level Shifter strategy

set\_level\_shifter LtoH\_sig\_to\_aonpgd \
-domain pd\_gated\_aon \
-applies\_to inputs \
-rule low\_to\_high \
-location self
set\_level\_shifter HtoL\_sig\_from\_aonpgd \
-domain pd\_gated\_aon \
-applies\_to outputs \
-rule high\_to\_low \
-location self




#### **# Retention strategy**

set\_retention pgd\_retain \
-domain pd\_gated \
-retention\_power\_net VCCL \
-retention\_ground\_net GND \
-elements {pgd\_wrapper/regA}
set\_retention\_control pgd\_retain \
-domain pd\_gated \
-save\_signal {aon\_wrapper/pmu/ret\_en high} \
-restore\_signal {aon\_wrapper/pmu/ret\_en low}

There are two registers – reg A and reg B. The state of reg A needs to be retained in power gated state.


### # Create Power State Table

add\_port\_state VCCH \
-state {HighVoltage 1.1}
add\_port\_state VCCL \
-state {LowVoltage 0.9}
add\_port\_state sw\_aon\_pgd\_wrapper/sw\_VCCH\_gated \
-state {HighVoltage 1.1} \
-state {aonpgd\_off off}
add\_port\_state sw\_pgd\_wrapper/sw\_VCCL\_gated \
-state {LowVoltage 0.9} \
-state {pgd\_off off}

create\_pst pwr\_state\_table \
-supplies {VCCH VCCL VCCH\_gated VCCL\_gated}
add\_pst\_state PRE\_BOOT \
-pst pwr\_state\_table \
-state { HighVoltage LowVoltage aonpgd\_off pgd\_off}
add\_pst\_state AONPGD\_ON \
-pst pwr\_state\_table \
-state { HighVoltage LowVoltage HighVoltage pgd\_off}
add\_pst\_state PGD\_ON \
-pst pwr\_state\_table \
-state { HighVoltage LowVoltage aonpgd\_off LowVoltage}
add\_pst\_state ALL\_ON \
-pst pwr\_state\_table \
-state { HighVoltage LowVoltage HighVoltage LowVoltage}




# **Practical Examples**

- Intel Skylake (ISSCC'16)
  - Four power planes indicated by colors



# **Practical Examples**

• Intel 28-core Skylake-SP (ISSCC'18)



- Vcc: core supply (per core)
- Vccclm: Un-core supply
- Vccsa: System Agent supply
- Vccio: Infrastructure supply
- Vccsfr: PLL supply
- Vccddrd: DDR logic supply
- Vccddra: DDR I/O supply

• 9 primary VCC domains are partitioned into 35 VCC planes





# Level converter





# Multiple Supplies Within A Block

- Downsizing or lowering the supply on the critical path will lower the operating frequency
- Downsize (or lower the supply on) non-critical paths
  - Narrows down the path delay distribution
  - Increases impact of variations



Multiple Supplies in a Block



# Multiple Supplies in a Block

## CVS (Clustered Voltage Scaling)

## Layout:





## Usami'98


# Summary

- Power-performance tradeoffs
  - Sizing
  - Supplies
  - Thresholds
- Lowering supplies
- Multiple supply voltages

# Next Lecture

- Low-power design
  - Dynamic voltage scaling
  - Clock gating

