# **EE241B: Advanced Digital Circuits**

# **Lecture 19 – Supply Voltage**

# **Borivoje Nikolić**

April 2, AnandTech: Intel Details 10th Gen Comet Lake-H for 45 W



"This CPU can hit this frequency on two cores, when the system is not only within its secondary power limits but also Intel's Thermal Velocity Boost is enabled, which means there has to be additional thermal headroom in the system (and it has to be enabled by the OEM). This allows the CPU to go from 5.1 GHz to 5.3 GHz. Every Intel Thermal Velocity Boost enabled CPU requires OEM support in order to get those extra two bins on the single core frequency."

### Announcements

- Assignment 3 due today, April 2.
  - Quiz next Tuesday, end of class







- Module 5
  - Circuit-level power-performance tradeoffs
  - Reducing supply voltage



### **Architectural Tradeoffs**

• H. Mair, ISSCC'20









# 5.D Circuit-Level Tradeoffs





◆ Delay

$$D = \sum t_{pi} = \sum \frac{K_d V_{DD}}{\left(V_{DD} - V_{Th}\right)^{\alpha}} \left(1 + \frac{W_{L,i}}{W_{in,i}}\right)$$

◆ Switching energy

$$E_{Sw} = \alpha_{0 \to 1} \left( C_{L,i} + C_{\text{int},i} \right) V_{DD}^{2}$$



◆ Leakage

$$E_{Lk} = W_{in} I_0 e^{\frac{-(V_{Th} - \gamma V_{DD})}{n V_t}} V_{DD} D$$
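The three models above can be evaluated numerically for a single stage. The following is a minimal sketch; every parameter value (`k_d`, `alpha`, `i0`, `gamma`, `n`, `v_t`, the activity factor, and the capacitances) is an illustrative assumption, not a value from the lecture.

```python
import math

def stage_delay(vdd, vth, w_load, w_in, k_d=1.0, alpha=1.3):
    """Alpha-power-law delay of one stage (arbitrary time units)."""
    return k_d * vdd / (vdd - vth) ** alpha * (1.0 + w_load / w_in)

def switching_energy(vdd, c_load, c_int, activity=0.1):
    """Switching energy: activity * (C_L + C_int) * VDD^2."""
    return activity * (c_load + c_int) * vdd ** 2

def leakage_energy(vdd, vth, w_in, delay,
                   i0=1e-3, gamma=0.1, n=1.5, v_t=0.026):
    """Leakage energy over one delay period D, including the gamma*VDD (DIBL) term."""
    return w_in * i0 * math.exp(-(vth - gamma * vdd) / (n * v_t)) * vdd * delay

# Compare a nominal and a reduced supply (assumed values).
for vdd in (1.0, 0.7):
    d = stage_delay(vdd, vth=0.3, w_load=4.0, w_in=1.0)
    e_sw = switching_energy(vdd, c_load=4.0, c_int=1.0)
    e_lk = leakage_energy(vdd, vth=0.3, w_in=1.0, delay=d)
    print(f"VDD={vdd:.1f}  D={d:.3f}  E_sw={e_sw:.3f}  E_lk={e_lk:.3e}")
```

Under these assumptions, lowering the supply from 1.0 to 0.7 scales the switching energy by 0.49 while the delay grows, which is the tradeoff the rest of this module explores.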



### Sizing, Supply, Threshold Optimization

- Transistor sizing can yield large power savings with small delay penalties
  - Gate sizing
  - Beta-ratio adjustments, $\beta = W_p/W_n$
  - (Stack resizing)
- Supply voltage affects both active and leakage energy
- Threshold voltage affects primarily the leakage





### Apply to Sizing of an Inverter Chain



Unconstrained energy: find min $D = \sum t_{pi}$

$$C_{gin,j} = \sqrt{C_{gin,j-1}C_{gin,j+1}}$$

$$W_{j} = \sqrt{W_{j-1}W_{j+1}}$$

Constrained energy: find min $D$ under $E < E_{max}$
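The unconstrained condition says each stage is the geometric mean of its neighbors, which is satisfied by a constant taper between the first inverter and the load. A minimal sketch, assuming a fixed first-stage width, a fixed load expressed as an equivalent width, and a given number of stages (the function name is hypothetical):

```python
def geometric_sizing(w_first, w_load, n_stages):
    """Widths of an n-stage chain with constant taper from w_first up to w_load."""
    taper = (w_load / w_first) ** (1.0 / n_stages)
    return [w_first * taper ** j for j in range(n_stages)]

widths = geometric_sizing(w_first=1.0, w_load=64.0, n_stages=3)
chain = widths + [64.0]  # append the load so every stage has a right-hand neighbor

# Each interior stage should equal the geometric mean of its neighbors.
for j in range(1, len(chain) - 1):
    assert abs(chain[j] - (chain[j - 1] * chain[j + 1]) ** 0.5) < 1e-9

print([round(w, 3) for w in widths])  # close to [1, 4, 16] for this example
```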

# Inverter Chain: Sizing Optimization

# Sensitivity to Sizing and Supply

• Gate sizing $(W_j)$

$$-\frac{\partial E_{sw} / \partial W_{j}}{\partial D / \partial W_{j}} = \frac{e_{j}}{\tau_{nom} (f_{j} - f_{j-1})}$$

$\infty$ when all stages have equal effective fanout $f_{eff}$ (the $D_{min}$ sizing), since the denominator goes to zero

• Supply voltage $(V_{DD})$

$$-\frac{\partial E_{sw} / \partial V_{DD}}{\partial D / \partial V_{DD}} = \frac{E_{sw}}{D} \cdot 2\,\frac{1 - x_{v}}{\alpha - 1 + x_{v}}$$

$$x_v = (V_{Th} + \Delta V_{Th})/V_{DD}$$
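The supply-voltage sensitivity can be cross-checked against finite differences. This is a sketch under stated assumptions: $\Delta V_{Th} = 0$, delay following the alpha-power law $D \propto V_{DD}/(V_{DD}-V_{Th})^{\alpha}$, $E_{sw} \propto V_{DD}^2$, and illustrative values for `vth` and `alpha`.

```python
def delay(vdd, vth=0.3, alpha=1.3):
    """Alpha-power-law delay (arbitrary units)."""
    return vdd / (vdd - vth) ** alpha

def e_sw(vdd, c=1.0):
    """Switching energy, proportional to VDD^2."""
    return c * vdd ** 2

def sens_vdd_closed_form(vdd, vth=0.3, alpha=1.3):
    """Closed form: (E_sw / D) * 2 * (1 - x_v) / (alpha - 1 + x_v)."""
    x_v = vth / vdd
    return e_sw(vdd) / delay(vdd, vth, alpha) * 2 * (1 - x_v) / (alpha - 1 + x_v)

def sens_vdd_numeric(vdd, vth=0.3, alpha=1.3, h=1e-6):
    """-(dE/dVDD) / (dD/dVDD) using central differences."""
    d_e = (e_sw(vdd + h) - e_sw(vdd - h)) / (2 * h)
    d_d = (delay(vdd + h, vth, alpha) - delay(vdd - h, vth, alpha)) / (2 * h)
    return -d_e / d_d

for vdd in (1.0, 0.6):
    print(vdd, sens_vdd_closed_form(vdd), sens_vdd_numeric(vdd))
```

With these assumed parameters the sensitivity is larger at the higher supply, so a small delay concession near nominal $V_{DD}$ buys a comparatively large energy reduction.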

[Figure: plot of $Sens(V_{DD})$]

# Power / Energy Optimization Space

| Energy  | Design Time (constant throughput/latency) | Sleep Mode (constant throughput/latency) | Run Time (variable throughput/latency) |
|---------|-------------------------------------------|------------------------------------------|----------------------------------------|
| Active  | Logic design<br>Scaled V<sub>DD</sub><br>Trans. sizing<br>Multi-V<sub>DD</sub> | Clock gating | DFS, DVS |
| Leakage | Stack effects<br>Trans. sizing<br>Scaling V<sub>DD</sub><br>+ Multi-V<sub>Th</sub> | Sleep T's<br>Multi-V<sub>DD</sub><br>Variable V<sub>Th</sub><br>+ Input control | DVS<br>Variable V<sub>Th</sub> |

# Constrained Optimization

- Find min(D) subject to $E = E_{max}$
  - Constrained function minimization
- E.g. Lagrange multipliers

$$\Lambda(x) = D(x) + \lambda(E(x) - E_{\text{max}})$$

- Or dual:

$$K(x) = E(x) + \lambda(D(x) - D_{max})$$

- Set the gradient to zero:

$$\frac{\partial \Lambda}{\partial \mathbf{x}} = 0$$

- Can solve analytically for $x = W_{j}, V_{DD}, V_{Th}$
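As a short worked step, assuming $D$ and $E$ are differentiable and each $\partial D / \partial x_i$ is nonzero, applying the zero-gradient condition to $\Lambda(x) = D(x) + \lambda(E(x) - E_{max})$ for every tuning variable $x_i$ gives:

$$\frac{\partial \Lambda}{\partial x_i} = \frac{\partial D}{\partial x_i} + \lambda \frac{\partial E}{\partial x_i} = 0 \quad\Rightarrow\quad -\frac{\partial E / \partial x_i}{\partial D / \partial x_i} = \frac{1}{\lambda} \quad \text{for every } x_i$$

Under these assumptions, the energy-delay sensitivities of all tuning variables (sizing, supply, threshold) are equal at the optimum.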

# Inverter Chain: Sizing Optimization



$$W_{j} = \sqrt{\frac{W_{j-1}W_{j+1}}{1 + \lambda W_{j-1}}}$$

$$\lambda = -\frac{2KV_{DD}^{2}}{\tau S_{DD}}$$

- Variable taper achieves minimum energy
- Reduce number of stages at large $d_{inc}$
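The variable taper can be seen by iterating the sizing recurrence numerically. In this sketch the multiplier is treated as a free nonnegative knob called `mu` (a hypothetical name); mapping it onto the expression for $\lambda$ above is not attempted, and the widths and load are illustrative assumptions.

```python
def size_chain(w_first, w_load, n_stages, mu=0.0, sweeps=500):
    """Gauss-Seidel iteration of W_j = sqrt(W_{j-1} W_{j+1} / (1 + mu W_{j-1})).

    The first stage and the load (as an equivalent width) are held fixed.
    mu = 0 reduces to the unconstrained geometric-mean rule.
    """
    widths = [w_first] * n_stages + [w_load]
    for _ in range(sweeps):
        for j in range(1, n_stages):
            widths[j] = (widths[j - 1] * widths[j + 1]
                         / (1.0 + mu * widths[j - 1])) ** 0.5
    return widths[:-1]

w_min_delay = size_chain(1.0, 64.0, 3, mu=0.0)
w_low_energy = size_chain(1.0, 64.0, 3, mu=0.5)

# Per-stage fanout, including the final stage driving the load.
tapers = [b / a for a, b in zip(w_low_energy, w_low_energy[1:] + [64.0])]
print(w_min_delay, w_low_energy, tapers)
```

For `mu > 0` the total width (a proxy for switched capacitance) shrinks relative to the minimum-delay sizing, and the per-stage fanout grows toward the output, which is one way to read "variable taper".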

# Sensitivity to $V_{th}$

• Threshold voltage $(V_{Th})$

$$-\frac{\partial E / \partial (\Delta V_{Th})}{\partial D / \partial (\Delta V_{Th})} = P_{Lk} \left( \frac{V_{DD} - V_{Th} - \Delta V_{Th}}{\alpha n V_t} - 1 \right)$$

Low initial leakage ⇒ speedup comes for "free"
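The threshold sensitivity can be checked the same way as the supply sensitivity. This sketch assumes only the leakage term depends on $\Delta V_{Th}$, with $P_{Lk} \propto e^{-\Delta V_{Th}/(n V_t)}$, $E = P_{Lk} D$, and $D \propto 1/(V_{DD} - V_{Th} - \Delta V_{Th})^{\alpha}$; all constants are illustrative.

```python
import math

ALPHA, N, V_T = 1.3, 1.5, 0.026   # assumed model constants
VDD, VTH = 1.0, 0.3               # assumed operating point

def delay(d_vth):
    return VDD / (VDD - VTH - d_vth) ** ALPHA

def p_leak(d_vth, p0=1.0):
    """Leakage power, normalized to p0 at d_vth = 0."""
    return p0 * math.exp(-d_vth / (N * V_T))

def e_leak(d_vth):
    return p_leak(d_vth) * delay(d_vth)

def sens_vth_closed_form(d_vth):
    return p_leak(d_vth) * ((VDD - VTH - d_vth) / (ALPHA * N * V_T) - 1.0)

def sens_vth_numeric(d_vth, h=1e-6):
    d_e = (e_leak(d_vth + h) - e_leak(d_vth - h)) / (2 * h)
    d_d = (delay(d_vth + h) - delay(d_vth - h)) / (2 * h)
    return -d_e / d_d

for d_vth in (0.0, 0.1):
    print(d_vth, sens_vth_closed_form(d_vth), sens_vth_numeric(d_vth))
```

Because the sensitivity scales with $P_{Lk}$, a design that starts with low leakage sees a small energy cost per unit of delay gained by lowering the threshold.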



# **Energy-Performance Tradeoffs**

| Enable Time / Perf. Impact | Design Time | Run Time |
|----------------------------|-------------|----------|
| Near-zero perf. penalty    | Clock gating<br>Architectural switching reduction<br>Multi-V<sub>Th</sub> | Dynamic V<sub>DD</sub><br>Dynamic V<sub>Th</sub> |
| True tradeoffs             | Fine-granularity clock gating<br>V<sub>DD</sub>, V<sub>Th</sub> adjustments<br>Multi-V<sub>DD</sub><br>Sizing, logic styles<br>Stack forcing | Power gating |



# 5.E Scaling Supplies

## Power / Energy Optimization Space

| Energy  | Design Time (constant throughput/latency) | Sleep Mode (constant throughput/latency) | Run Time (variable throughput/latency) |
|---------|-------------------------------------------|------------------------------------------|----------------------------------------|
| Active  | Logic design<br>Scaled V<sub>DD</sub><br>Trans. sizing<br>Multi-V<sub>DD</sub> | Clock gating | DFS, DVS |
| Leakage | Stack effects<br>Trans. sizing<br>Scaling V<sub>DD</sub><br>+ Multi-V<sub>Th</sub> | Sleep T's<br>Multi-V<sub>DD</sub><br>Variable V<sub>Th</sub><br>+ Input control | DVS<br>Variable V<sub>Th</sub> |

# Supply Voltage Adjustment

- How to maintain throughput under reduced supply?
- Introducing more parallelism/pipelining
  - Area increase
  - Cost/power tradeoff
- Multiple voltage domains
  - Separate supply voltages for different blocks
  - Lower $V_{DD}$ for slower blocks
  - Cost of DC-DC converters
- Dynamic voltage scaling with variable throughput
- Reducing $V_{Th}$ to improve speed
  - Leakage issues





- Energy per operation (power-delay product) is a strong function of voltage (V<sup>2</sup> dependence).
- Relatively independent of logic function and style.
- Power-delay product improves with lowering V<sub>DD</sub>.

Chandrakasan, JSSC'92

# Reducing $V_{\rm DD}$

32nm process



# Lower V<sub>DD</sub> Increases Delay



• Relatively independent of logic function and style.
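To get a feel for how far the supply can drop for a given slowdown, the delay model can be inverted numerically. This is a sketch under assumed parameters (alpha-power law with `alpha = 2.0`, `vth = 0.7` V, nominal supply 5 V); it is not fitted to the measured curves.

```python
def delay(vdd, vth=0.7, alpha=2.0):
    """Alpha-power-law delay (arbitrary units); decreases as vdd rises above vth."""
    return vdd / (vdd - vth) ** alpha

def vdd_for_slowdown(slowdown, vdd_nom=5.0, vth=0.7, alpha=2.0, tol=1e-9):
    """Bisection for the supply at which delay = slowdown * delay(vdd_nom)."""
    target = slowdown * delay(vdd_nom, vth, alpha)
    lo, hi = vth + 1e-6, vdd_nom   # delay(lo) is huge, delay(hi) <= target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay(mid, vth, alpha) > target:
            lo = mid               # still too slow: raise the supply
        else:
            hi = mid
    return 0.5 * (lo + hi)

v_half_speed = vdd_for_slowdown(2.0)
print(f"Supply for 2x delay: {v_half_speed:.2f} V")  # about 3.1 V with these assumptions
```

A design that tolerates twice the delay per operation can therefore run at a substantially lower supply under this model.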

### Trade-off Between Power and Delay



### Two Types of Processing

- Fixed-rate processing (e.g. signal processing for multimedia or communications)
  - Stream-based computation
  - No advantage in obtaining throughput in excess of the real-time constraint
- Variable-rate or burst-mode computation (e.g. general purpose computation)
  - Mostly idle (or low-load) with bursts of computation
  - Faster is better

### Architecture Trade-off for Fixed-rate Processing: Reference Datapath



- Critical path delay $\Rightarrow T_{adder} + T_{comparator}$ (= 25 ns) $\Rightarrow f_{ref} = 40$ MHz
- Total capacitance being switched = $C_{ref}$
- $V_{DD} = V_{ref} = 5$ V
- Power for reference datapath = $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$

# Parallel Datapath



- Area = $1476 \times 1219\ \mu\text{m}^2$
- The clock rate can be reduced by half with the same throughput $\Rightarrow f_{par} = f_{ref}/2$
- $V_{par} = V_{ref}/1.7$, $C_{par} = 2.15 C_{ref}$
- $P_{par} = (2.15 C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36 P_{ref}$

# Pipelined Datapath



- Critical path delay is less $\Rightarrow \max[T_{adder}, T_{comparator}]$
- Keeping clock rate constant: $f_{pipe} = f_{ref}$
- Voltage can be dropped $\Rightarrow V_{pipe} = V_{ref}/1.7$
- Capacitance slightly higher: $C_{pipe} = 1.15 C_{ref}$
- $P_{pipe} = (1.15 C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.39 P_{ref}$



# A Simple Datapath: Summary

| Architecture type                              | Voltage | Area (normalized) | Power (normalized) |
|------------------------------------------------|---------|-------------------|--------------------|
| Simple datapath (no pipelining or parallelism) | 5 V     | 1                 | 1                  |
| Pipelined datapath                             | 2.9 V   | 1.3               | 0.39               |
| Parallel datapath                              | 2.9 V   | 3.4               | 0.36               |
| Pipeline-Parallel                              | 2.0 V   | 3.7               | 0.2                |
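The power column can be reproduced from $P = C V^2 f$ normalized to the reference datapath. In this sketch the capacitance ratios 1.15 and 2.15 and the voltages come from the slides; the capacitance ratio of 2.5 for the pipeline-parallel case is an assumption chosen to be consistent with the 0.2 entry, and the function name is hypothetical.

```python
def relative_power(cap_ratio, vdd, freq_ratio, v_ref=5.0):
    """P / P_ref = (C / C_ref) * (V / V_ref)^2 * (f / f_ref)."""
    return cap_ratio * (vdd / v_ref) ** 2 * freq_ratio

designs = {
    # name: (C / C_ref, supply in volts, f / f_ref)
    "simple":            (1.00, 5.0, 1.0),
    "pipelined":         (1.15, 2.9, 1.0),
    "parallel":          (2.15, 2.9, 0.5),
    "pipeline-parallel": (2.50, 2.0, 0.5),  # 2.5 is an assumed capacitance ratio
}

for name, (c, v, f) in designs.items():
    print(f"{name:18s} {relative_power(c, v, f):.2f}")
# simple 1.00, pipelined 0.39, parallel 0.36, pipeline-parallel 0.20
```

Using the rounded factor $V_{ref}/1.7$ instead of 2.9 V gives about 0.40 and 0.37 for the pipelined and parallel cases, so the tabulated 0.39 and 0.36 correspond to the 2.9 V supply.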

# Next Lecture

- Low-power design
  - Multiple supplies
  - Dynamic voltage scaling







