# Managing Physical Design Issues in ASIC Toolflows

6.375 Complex Digital Systems

Christopher Batten

February 21, 2006

- Logical Effort
- Physical Design Issues
  - Clock Distribution
  - Power Distribution
  - Wire Delay
  - Power Consumption
  - Capacitive Coupling

1. What is the issue?
2. How do custom designers address the issue?
3. How can we approximate these approaches in an ASIC toolflow?

6.375 Spring 2006 • L06 Managing Physical Design Issues in ASIC Toolflows • 2

## Which gate topology and transistor sizing is optimal?

## Review of the simple RC model for the CMOS inverter

*Figure: RC model of the CMOS inverter, with input $V_{in}$, output $V_{out}$, and effective resistance $R_{eff}$.*

$$\begin{aligned} R_{eff} &= R_{eff,N} = R_{eff,P} \\ C_{g} &= C_{g,N} + C_{g,P} \\ C_{d} &= C_{d,N} + C_{d,P} \end{aligned}$$

Ideally, given a gate topology, we would like to answer two questions in a lightweight and technology-independent way:

1. What is the optimal transistor sizing?
2. What is the optimal number of stages?


# A gate template is a gate with the same drive current as a minimum-sized inverter




## We begin by deriving an equation for unitless delay in terms of a template

Determine RC for an actual gate relative to the template

$$\mathbf{C}_{\mathsf{in}} \ = \ \boldsymbol{\alpha} \cdot \mathbf{C}_{\mathsf{in},\mathsf{T}} \quad \mathbf{C}_{\mathsf{p}} \ = \ \boldsymbol{\alpha} \cdot \mathbf{C}_{\mathsf{p},\mathsf{T}} \quad \mathbf{R}_{\mathsf{EFF}} \ = \ \frac{\mathbf{R}_{\mathsf{EFF},\mathsf{T}}}{\alpha}$$

### Derive absolute delay in terms of the template

$$\begin{split} d_{abs} &= K \cdot R_{EFF} \left( C_{out} + C_{p} \right) \\ &= K \cdot R_{EFF} \cdot C_{out} + K \cdot R_{EFF} \cdot C_{p} \\ &= K \cdot \frac{R_{EFF,T}}{\alpha} \cdot \alpha C_{in,T} \, \frac{C_{out}}{C_{in}} + K \cdot \frac{R_{EFF,T}}{\alpha} \cdot \alpha C_{p,T} \\ &= K \cdot R_{EFF,T} \cdot C_{in,T} \, \frac{C_{out}}{C_{in}} + K \cdot R_{EFF,T} \cdot C_{p,T} \end{split}$$

Normalize this delay to the delay of a minimum-sized inverter with no parasitics, $\tau = K \cdot R_{inv} \cdot C_{inv}$

$$d = \frac{d_{abs}}{\tau} = \frac{K \cdot R_{EFF,T} \cdot C_{in,T}}{K \cdot R_{inv} \cdot C_{inv}} \frac{C_{out}}{C_{in}} + \frac{K \cdot R_{EFF,T} \cdot C_{p,T}}{K \cdot R_{inv} \cdot C_{inv}} = \frac{R_{EFF,T} \cdot C_{in,T}}{R_{inv} \cdot C_{inv}} \times \frac{C_{out}}{C_{in}} + \frac{R_{EFF,T} \cdot C_{p,T}}{R_{inv} \cdot C_{inv}}$$

For our 0.18 µm technology, $\tau \approx 10$ ps


$$\begin{split} \textbf{d} &= \frac{\textbf{d}_{abs}}{\tau} = \frac{\textbf{R}_{\text{EFF,T}} \cdot \textbf{C}_{\text{in,T}}}{\textbf{R}_{\text{inv}} \cdot \textbf{C}_{\text{inv}}} \times \frac{\textbf{C}_{out}}{\textbf{C}_{\text{in}}} + \frac{\textbf{R}_{\text{EFF,T}} \cdot \textbf{C}_{\text{p,T}}}{\textbf{R}_{\text{inv}} \cdot \textbf{C}_{\text{inv}}}\\ \textbf{d} &= \frac{\textbf{d}_{abs}}{\tau} = \textbf{g} \times \textbf{h} + \textbf{p} \end{split}$$

- **Logical Effort** ($g$) compares the characteristic RC time constant of the gate to that of a minimum-sized inverter and is independent of actual transistor widths
- **Electrical Effort** ($h = C_{out}/C_{in}$) is the fanout of the gate and is a function of actual transistor widths
- **Parasitic Delay** ($p$) is relative to a minimum-sized inverter and is roughly independent of actual transistor widths
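As a rough illustration of the unitless delay equation, the sketch below evaluates $d = g \times h + p$ and scales it by $\tau$. The function names are hypothetical, and the inverter parasitic delay of 1 is a common textbook assumption rather than a value taken from this lecture.

```python
TAU_PS = 10.0  # tau of about 10 ps, the value quoted for the 0.18 um technology


def stage_delay(g: float, h: float, p: float) -> float:
    """Unitless stage delay d = g*h + p."""
    return g * h + p


def absolute_delay_ps(g: float, h: float, p: float, tau_ps: float = TAU_PS) -> float:
    """Absolute delay d_abs = d * tau, in picoseconds."""
    return stage_delay(g, h, p) * tau_ps


# Assumed example: an inverter has g = 1 by definition; p = 1 is an
# illustrative assumption. h = 4 means it drives four copies of itself.
fo4_units = stage_delay(g=1.0, h=4.0, p=1.0)
fo4_ps = absolute_delay_ps(g=1.0, h=4.0, p=1.0)
print(fo4_units, fo4_ps)
```

Keeping the delay unitless until the last step is what makes the comparison between gate topologies technology independent; only `TAU_PS` changes between processes.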


## Logical effort is simply the ratio of a gate's input capacitance to that of a minimum-sized inverter with the same current drive



# Examples illustrating unit-less delay of gates with equal drive strength




# Examples illustrating unit-less delay of gates with similar area



# Path delay (D) is just the sum of the stage delays

$$D = \sum_{i} d_{i} = \sum_{i} (g_{i} \times h_{i} + p_{i}) = \sum_{i} (g_{i} \times h_{i}) + \sum_{i} p_{i}$$

$$\begin{aligned} (4/3) \times (C/C) + (4/3) \times (4C/C) + 4 &= 10.67 \\ (4/3) \times (2C/C) + (4/3) \times (4C/2C) + 4 &= 9.33 \\ (4/3) \times (4C/C) + (4/3) \times (4C/4C) + 4 &= 10.67 \end{aligned}$$

# What is the optimal delay for any general two stage topology?

Form the unitless delay equation. The only free variable is $C_2$.

$$\begin{aligned} D &= (g_1h_1 + p_1) + (g_2h_2 + p_2) \\ &= \left(g_1\frac{C_2}{C_1} + p_1\right) + \left(g_2\frac{C_3}{C_2} + p_2\right) \end{aligned}$$

Minimize with respect to C<sub>2</sub>

$$\frac{\partial D}{\partial C_2} = \frac{g_1}{C_1} - \frac{g_2 C_3}{\left(C_2\right)^2} = 0$$

*Figure: two-stage path. Stage 1 ($g_1$, $p_1$) has input capacitance $C_1$; stage 2 ($g_2$, $p_2$) has input capacitance $C_2$ and drives load $C_3$.*

Minimal delay occurs when the stage efforts are equal

$$\frac{g_1}{C_1} = \frac{g_2C_3}{(C_2)^2}$$

Multiplying both sides by $C_2$:

$$g_1 \frac{C_2}{C_1} = g_2 \frac{C_3}{C_2}$$
$$g_1h_1 = g_2h_2$$
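The two-stage result can be checked numerically. The sketch below solves $\partial D / \partial C_2 = 0$ for $C_2$ and compares the delay at the optimum with two off-optimum choices. The function names are hypothetical; the values $g = 4/3$, $p = 2$, $C_1 = C$, $C_3 = 4C$ are chosen to mirror the numbers in the path-delay example, with $C$ normalized to 1.

```python
import math


def two_stage_delay(c2, c1, c3, g1, g2, p1, p2):
    """D = (g1*C2/C1 + p1) + (g2*C3/C2 + p2)."""
    return (g1 * c2 / c1 + p1) + (g2 * c3 / c2 + p2)


def optimal_c2(c1, c3, g1, g2):
    """Solve g1/C1 = g2*C3/C2**2 for C2."""
    return math.sqrt(g2 * c1 * c3 / g1)


# Assumed example values (capacitances in units of C)
g, p, c1, c3 = 4.0 / 3.0, 2.0, 1.0, 4.0

c2_best = optimal_c2(c1, c3, g, g)
for c2 in (1.0, c2_best, 4.0):
    print(c2, round(two_stage_delay(c2, c1, c3, g, g, p, p), 2))
```

At the optimum both stages carry the same effort, $g_1 h_1 = g_2 h_2$, and moving $C_2$ in either direction increases the total delay.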

## Key Result: Delay is minimized when effort is shared equally among stages

$$\mathsf{D} = \sum_{i} \mathsf{d}_{i} = \sum_{i} (\mathsf{g}_{i} \times \mathsf{h}_{i} + \mathsf{p}_{i}) = \sum_{i} (\mathsf{g}_{i} \times \mathsf{h}_{i}) + \sum_{i} \mathsf{p}_{i}$$




## We now generalize this result with some additional terminology

| Quantity                          | Definition                                                        | Notes                                                        |
|-----------------------------------|-------------------------------------------------------------------|--------------------------------------------------------------|
| Path delay                        | $D = \Sigma d_i = \Sigma g_i h_i + \Sigma p_i$                    | Sum of stage delays                                          |
| Path logical effort               | $G = \Pi g_i$                                                     | Product of stage logical efforts                             |
| Path electrical effort            | $H = \Pi h_i = C_{out}/C_{in}$                                    | Product of stage electrical efforts (internal C's cancel out) |
| Path effort                       | $F = \Pi f_{i} = \Pi (g_{i}h_{i}) = GH$                           | Product of stage efforts                                     |
| Optimal stage effort for N stages | $f_{OPT} = F^{1/N}$                                               | Optimal delay when $g_1h_1 = g_2h_2 = \dots = g_Nh_N$        |
| Optimal path delay                | $D_{OPT} = \Sigma f_{OPT} + \Sigma p_i = N F^{1/N} + \Sigma p_i$  |                                                              |


## **Steps for transistor sizing**

1. Calculate path effort
2. Calculate optimal path delay
3. Assign each stage equal effort
4. Work from $C_{out}$ backwards assigning $C_{in}$ values for each stage
5. Convert $C_{in}$ values into transistor sizes


## Finding the path effort and optimal delay (Steps 1 and 2)



## Finding actual transistor sizes for H=1 case (Steps 3-5)



$f_{OPT} = F^{1/N} = (3.33)^{1/2} = 1.82$

$C_{out}$ and $C_{in}$ are given in units of equivalent transistor gate width

The stage effort of the NOR gate must equal 1.82. We know its logical effort is 5/3, so we can find $C_2$:

$$(5/3)(6/C_2) = 1.82$$

$$C_2 \approx 5.5$$

Double-check that the stage effort of the first stage works out:

$$(2)(5.5/6) \approx 1.82$$
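The sizing steps can be sketched as a short routine. This is a minimal illustration, not the lecture's own tooling: the function names are hypothetical, and the example reuses the H = 1 numbers above (first-stage logical effort 2, NOR logical effort 5/3, $C_{in} = C_{out} = 6$). Step 5, converting capacitances into transistor widths, depends on the gate template and is left out.

```python
from math import prod


def size_path(g, c_in, c_out):
    """Return (f_opt, caps), where caps[i] is the input capacitance of stage i."""
    n = len(g)
    path_effort = prod(g) * (c_out / c_in)   # Step 1: F = G * H
    f_opt = path_effort ** (1.0 / n)         # Step 3: equal effort per stage
    caps = [0.0] * n
    load = c_out
    for i in reversed(range(n)):             # Step 4: work backwards from C_out
        caps[i] = g[i] * load / f_opt        # from f = g * (load / C_in)
        load = caps[i]
    return f_opt, caps


def optimal_delay(f_opt, p):
    """Step 2: D_opt = N * f_opt + sum of the stage parasitic delays."""
    return len(p) * f_opt + sum(p)


f_opt, caps = size_path(g=[2.0, 5.0 / 3.0], c_in=6.0, c_out=6.0)
print(f_opt, caps)
```

With unrounded arithmetic the stage effort comes out near 1.826 and the NOR input capacitance near 5.48, consistent with the rounded 1.82 and 5.5 on the slide. The first-stage capacitance recovered by working backwards equals the given $C_{in}$, which is a useful self-check.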

# Optimum number of stages for varying parasitic delays and stage effort



# How many stages of inverters are required to drive a large load?





No simple closed form solution, but we can examine this function numerically
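One way to examine the function numerically is to sweep the stage count for a chain of equal-effort inverters, where $D(N) = N \cdot F^{1/N} + N \cdot p_{inv}$. The sketch below is an illustration under assumed values: the path effort of 64 and $p_{inv} = 1$ are hypothetical, as are the function names.

```python
def inverter_chain_delay(path_effort: float, n_stages: int, p_inv: float = 1.0) -> float:
    """Delay of N equal-effort inverter stages: D = N * F**(1/N) + N * p_inv."""
    return n_stages * path_effort ** (1.0 / n_stages) + n_stages * p_inv


def best_stage_count(path_effort: float, p_inv: float = 1.0, max_stages: int = 20):
    """Return (N, D) minimizing the chain delay over N = 1..max_stages."""
    candidates = [(inverter_chain_delay(path_effort, n, p_inv), n)
                  for n in range(1, max_stages + 1)]
    delay, n = min(candidates)
    return n, delay


# Assumed example: path effort F = 64 with p_inv = 1
for n in range(1, 7):
    print(n, round(inverter_chain_delay(64.0, n), 2))
print(best_stage_count(64.0))
```

For these assumed values the minimum falls at three stages, and the neighboring stage counts are only slightly slower, which is the broad optimum the plots illustrate.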


# Optimum stage effort for varying parasitic delays



# A good rule-of-thumb is to target a stage effort around four



Minimum delay when:

- Stage effort = logical effort × electrical effort ≈ 3.4-3.8
- Some derivations use e = 2.718..., but this ignores parasitics
- Broad optimum: stage efforts of 2.4-6.0 are within 15-20% of the minimum
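The figures quoted above can be reproduced with a small root finder. Setting $dD/dN = 0$ for $D = N(F^{1/N} + p_{inv})$, with $\rho = F^{1/N}$, gives the standard condition $p_{inv} + \rho(1 - \ln \rho) = 0$. The sketch below solves it by bisection; the function name and the bracketing interval are assumptions made for illustration.

```python
import math


def optimum_stage_effort(p_inv: float, tol: float = 1e-9) -> float:
    """Solve p_inv + rho * (1 - ln(rho)) = 0 for rho by bisection."""
    def f(rho: float) -> float:
        return p_inv + rho * (1.0 - math.log(rho))

    # f is decreasing for rho > 1; f(e) = p_inv >= 0 and f(50) < 0
    # for any moderate p_inv, so the root lies in [e, 50].
    lo, hi = math.e, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for p_inv in (0.0, 0.5, 1.0, 2.0):
    print(p_inv, round(optimum_stage_effort(p_inv), 2))
```

With no parasitics the optimum is $e$; with $p_{inv} = 1$ it rises to about 3.6, which is why a target of roughly four is a reasonable rule of thumb.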

Fan-out-of-four (FO4) is a convenient design size (~5τ)

FO4 delay: delay of an inverter driving four copies of itself




# Managing Physical Design Issues in ASIC Toolflows

- Logical Effort
- Physical Design Issues
  - Clock Distribution
  - Power Distribution
  - Wire Delay
  - Power Consumption
  - Capacitive Coupling

1. What is the issue?
2. How do custom designers address the issue?
3. How can we approximate these approaches in an ASIC toolflow?

# Do the topologies in our original example have the optimum number of stages?




## Clock Distribution: The Issue Clock propagates across entire chip

Cannot really distribute clock instantaneously with a perfectly regular period




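## **Clock Distribution: The Issue** Two forms of variability

- Skew: spatial variation in clock arrival time between different points on the chip
- Jitter: temporal variation in the clock period at a given point on the chip
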

## **Clock Distribution: The Issue**

Why is minimizing skew and jitter hard?



## **Clock Distribution: Custom Approach** Clock grids lower skew but high power



## **Clock Distribution: Custom Approach** Trees have more skew but less power



## **Clock Distribution: Custom Approach** Other techniques

- Use latch-based design
  - Time borrowing helps reduce impact of clock uncertainty
  - Timing analysis can be more difficult
- Make logical partitioning match physical partitioning
  - Limits global communication where skew is usually the worst
  - Helps break distribution problem into smaller subproblems
- Use globally asynchronous, locally synchronous design
  - Divides design into synchronous regions which communicate through asynchronous channels
  - Requires overhead for inter-domain communication
- Use asynchronous design
  - Avoids clocks altogether
  - Incurs its own forms of control overhead

## **Clock Distribution: Custom Approach** Active deskewing circuits in Intel Itanium



## Clock Distribution: ASIC Approach Clock Tree Synthesis

- Modern back-end tools include clock tree synthesis
  - Creates balanced RC-trees
  - Uses special clock buffer standard cells
  - Can add clock shielding
  - Can exploit useful clock skew
- Automatic clock tree generation still results in significantly worse clock uncertainties compared to custom clock trees

# Example of clock tree synthesis using commercial ASIC back-end tools




## **Power Distribution: The Issue** Possible IR drop across power network



## **Power Distribution: The Issue** IR drop can be static or dynamic



## **Power Distribution: Custom Approach** Carefully tailor power network



Routed power distribution on two stacked layers of metal (one for VDD, one for GND). OK for low-cost, low-power designs with few layers of metal.



Power Grid. Interconnected vertical and horizontal power bars. Common on most high-performance designs. Often well over half of total metal on upper thicker layers used for VDD/GND.



Dedicated VDD/GND planes. Very expensive. Only used on Alpha 21264. Simplified circuit analysis. Dropped on subsequent Alphas.


## **Power Distribution: ASIC Approach** Power rings partition the power problem



# Example of power distribution network using commercial ASIC back-end tools




## **Power Distribution: ASIC Approach** Strapping and rings for standard cells




## Wire Delay: The Issue Large RC makes long wires slow



## Wire delay increases quadratically with wire length
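Both the resistance and the capacitance of a wire grow linearly with its length, so their product grows quadratically. The sketch below illustrates this and the effect of splitting a wire into repeated segments. It assumes an Elmore-style factor of 0.5 for a distributed RC line, and the per-millimeter resistance, capacitance, and repeater delay are hypothetical values, not figures from the lecture.

```python
def wire_delay_ps(length_mm: float, r_ohm_per_mm: float, c_pf_per_mm: float) -> float:
    """Distributed-RC estimate: 0.5 * (r * L) * (c * L); ohm * pF gives ps."""
    return 0.5 * (r_ohm_per_mm * length_mm) * (c_pf_per_mm * length_mm)


def repeated_wire_delay_ps(length_mm: float, r_ohm_per_mm: float, c_pf_per_mm: float,
                           segments: int, t_repeater_ps: float) -> float:
    """Wire split into equal segments, each followed by one repeater."""
    segment_mm = length_mm / segments
    per_segment = wire_delay_ps(segment_mm, r_ohm_per_mm, c_pf_per_mm) + t_repeater_ps
    return segments * per_segment


# Hypothetical wire parameters, for illustration only
R, C, T_REP = 100.0, 0.2, 50.0

print(wire_delay_ps(1.0, R, C), wire_delay_ps(2.0, R, C), wire_delay_ps(10.0, R, C))
print(repeated_wire_delay_ps(10.0, R, C, segments=5, t_repeater_ps=T_REP))
```

Doubling the length quadruples the unrepeated delay, while the repeated wire grows roughly linearly with length once the segment length is fixed. That trade is what motivates repeater insertion.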


## Wire Delay: Custom Approach Manual insertion of repeaters



## Wire Delay: Custom Approach Several issues with repeater insertion





- Repeater must connect to transistor layers
- Blocks other routes with vias that connect down
- Requires space on active layers for buffer transistors
- Repeaters often grouped in preallocated repeater boxes spread around chip, and thus repeater location might not give ideal spacing

## Wire Delay: Impact on RTL

- Make logical, physical partitioning match
  - Limits global communication
  - Helps simplify automatic buffer insertion
- Add extra pipeline stages for wire delay
  - The Pentium 4 (P4) pipeline included stages just for driving signals
  - Requires very early physical prototyping
- Use latency-insensitive methodology
  - Create macroblocks with registered interfaces
  - Enables pipelining wires late in design cycle

| Stage | Function        |
|-------|-----------------|
| 1-2   | TC Next IP      |
| 3-4   | TC Fetch        |
| 5     | Drive           |
| 6     | Alloc           |
| 7-8   | Rename          |
| 9     | Queue           |
| 10    | Schedule 1      |
| 11    | Schedule 2      |
| 12    | Schedule 3      |
| 13    | Dispatch 1      |
| 14    | Dispatch 2      |
| 15    | Register File 1 |
| 16    | Register File 2 |
| 17    | Execute         |
| 18    | Flags           |
| 19    | Branch Check    |
| 20    | Drive           |

## Wire Delay: ASIC Approach

- Front-end tools include rough wire-load models
  - Usually statistical in nature
  - Helps synthesis tool with technology mapping
- Back-end tools include better wire-load models
  - After trial placement can use Manhattan distance
  - Tool will automatically insert buffers where necessary


## **Power Consumption: The Issue** Power has been increasing rapidly

## **Power Consumption: The Issue** Why is it a problem?

- Power dissipation is limiting factor in many systems
  - Battery weight and life for portable devices
  - Packaging and cooling costs for tethered systems
  - Case temperature for laptop/wearable computers
  - Fan noise for media hubs
- Example 1: Cellphone
  - 3 W hard power limit; any more and customers complain
  - Battery life is a strong product differentiator
- Example 2: Internet data center
  - ~8,000 servers, ~2 megawatts
  - 25% of operational costs are in electricity bill for supplying power and running air-conditioning to remove heat

## **Power Consumption: The Issue** Main forms are dynamic and static power



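**Dynamic Power:** switching power used to charge up load capacitance

$$P_{dynamic} = \alpha F (1/2) C V_{DD}^2$$

where $\alpha$ is the switching activity, $F$ the clock frequency, $C$ the switched capacitance, and $V_{DD}$ the supply voltage.

**Static Power:** subthreshold leakage power when a transistor is "off"

$$P_{static} = V_{DD} I_{off}$$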
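To make the two power equations concrete, the sketch below evaluates them directly. The function names and all numeric inputs (activity, frequency, capacitance, supply voltage, off current) are hypothetical values chosen only to show how the terms scale.

```python
def dynamic_power_w(alpha: float, freq_hz: float, cap_f: float, vdd: float) -> float:
    """P_dynamic = alpha * F * (1/2) * C * Vdd**2, in watts."""
    return alpha * freq_hz * 0.5 * cap_f * vdd ** 2


def static_power_w(vdd: float, i_off_a: float) -> float:
    """P_static = Vdd * I_off, in watts."""
    return vdd * i_off_a


# Hypothetical operating point, for illustration only
base = dynamic_power_w(alpha=0.1, freq_hz=100e6, cap_f=1e-9, vdd=1.8)
half_vdd = dynamic_power_w(alpha=0.1, freq_hz=100e6, cap_f=1e-9, vdd=0.9)
half_freq = dynamic_power_w(alpha=0.1, freq_hz=50e6, cap_f=1e-9, vdd=1.8)

print(base, base / half_vdd, base / half_freq)
print(static_power_w(vdd=1.8, i_off_a=1e-3))
```

Halving the supply voltage cuts dynamic power by a factor of four, whereas halving the frequency only halves it. This is why supply voltage is the most powerful of the four knobs in the dynamic power equation.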


## **Power Consumption: Custom Approach**

$$P_{dynamic} = \alpha F (1/2) C V_{DD}^{2}$$

### **Reduce Activity**

- Clock gating so clock node of inactive logic doesn't switch
- Data gating so data nodes of inactive logic don't switch
- Bus encodings to minimize transitions
- Balance logic paths to avoid glitches during settling

### **Reduce Frequency**

- Doesn't save energy, just reduces rate at which it is consumed
- Lower power means less heat dissipation but must run longer


## **Power Consumption: Custom Approach**

 $P_{dynamic} = \alpha F (1/2) C V_{DD}^2$ 

### **Reduce Switched Capacitance**

- Careful transistor sizing (small transistors off critical path)
- Tighter layout (good floorplanning)
- Segmented tri-state bus structures

### **Reduce Supply Voltage**

- Need to lower frequency as well, which gives better-than-quadratic power savings
- Can lower statically for cells off critical path
- Can lower dynamically for just-in-time computation

## **Power Consumption: Custom Approach**



### **Reduce Supply Voltage**

In addition to dynamic power reduction, reducing Vdd can help reduce static power

### **Reduce Off Current**

- Increase length of transistors off critical path
- Use high-Vt cells off critical path (extra Vt increases fab costs)
- Use stacked devices
- Use power gating (i.e., switch off the power supply with a large transistor)

## **Power Consumption** Reducing activity with clock gating

- Don't clock flip-flop if not needed
- Avoids transitioning downstream logic
- Enable adds control logic complexity
- P4 has hundreds of gated clock domains
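A small behavioral model can illustrate why the enable is latched before it gates the clock. The sketch below assumes a common arrangement that the lecture's figure may or may not match exactly: a latch that is transparent while the clock is low, followed by an AND gate. The function name and the sample waveforms are hypothetical.

```python
def gated_clock_trace(clock, enable):
    """Model a latch-based clock gate, one sample per half clock cycle.

    Assumption: the latch passes `enable` while the clock is low and holds
    its value while the clock is high; the output is clock AND latched enable.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch follows enable during the low phase
        gated.append(clk & latched)
    return gated


clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [0, 0, 1, 1, 1, 0, 0, 0]   # enable falls while the clock is high

gated = gated_clock_trace(clock, enable)
print(gated)
print(sum(clock), sum(gated))       # high phases delivered: ungated vs gated
```

In this trace the enable drops in the middle of a high phase, yet the gated clock still delivers a full pulse because the latch holds its value until the clock goes low. Only two of the four clock pulses reach the flip-flops, which is the activity reduction that clock gating is after.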

*Figure: clock-gating circuit with signals Clock, Enable, Latched Enable, and Gated Clock.*



## **Power Consumption** Reducing activity with data gating



## **Power Consumption** Reducing supply voltage

- Both static and dynamic voltage scaling is possible
- Delay rises sharply as supply voltage approaches Vt

*Figure: energy and delay versus supply voltage.*

## **Power Consumption** Parallel architecture to reduce energy

### 8-bit adder/cmp

- 40 MHz at 5 V, area = 530 k$\mu^2$
- Base power P<sub>ref</sub>

### Two parallel interleaved adder/cmp units

- 20 MHz at 2.9 V, area = 1,800 k$\mu^2$ (3.4x)
- Power = 0.36 $P_{ref}$

### One pipelined adder/cmp unit

- 40 MHz at 2.9 V, area = 690 k$\mu^2$ (1.3x)
- Power = 0.39 P<sub>ref</sub>

### **Pipelined and parallel**

- 20MHz at 2.0V, area = 1,961 k $\mu^2$  (3.7x)
- Power = 0.2 P<sub>ref</sub>
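The power ratios above can be cross-checked against the dynamic power equation. Assuming power scales as $C V_{DD}^2 F$ and that the quoted frequencies are the clock rates of each unit, the sketch below back-computes the switched-capacitance ratio implied by each design point. The function name is hypothetical, and the results are derived from the slide's numbers rather than quoted from it.

```python
def implied_cap_ratio(power_ratio: float, vdd: float, freq_mhz: float,
                      vdd_ref: float = 5.0, freq_ref_mhz: float = 40.0) -> float:
    """Solve P/P_ref = (C/C_ref) * (V/V_ref)**2 * (F/F_ref) for C/C_ref."""
    return power_ratio / ((vdd / vdd_ref) ** 2 * (freq_mhz / freq_ref_mhz))


designs = {
    "parallel": (0.36, 2.9, 20.0),
    "pipelined": (0.39, 2.9, 40.0),
    "pipelined and parallel": (0.2, 2.0, 20.0),
}

for name, (power_ratio, vdd, freq_mhz) in designs.items():
    print(name, round(implied_cap_ratio(power_ratio, vdd, freq_mhz), 2))
```

Under these assumptions every alternative switches more capacitance than the reference design, yet still consumes less power, because the lower supply voltage enters quadratically.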



## **Power Consumption: ASIC Approach**

- Minimize activity
  - Automatic clock gating is possible if we write Verilog so tools can infer gating
  - Partition designs so minimal number of components activated to perform each operation
  - Floorplan units to reduce length of power-hungry global wires
- Use lowest voltage and slowest frequency necessary to reach target performance
  - Use pipelined and parallel architectures if possible
- Modern standard cell libraries include low-power cells, high-VT cells, and low-VT cells – tools can automatically replace non-critical cells to optimize for power


## Capacitive Coupling Custom vs ASIC Approach

### **Custom Approach**

- Avoid placing simultaneously switching signals next to each other for long parallel runs (use swizzling)
- Reroute signals which will be quiet during switching in between simultaneous switching signals
- Route signals close to power rails for capacitance ballast
- Extensive dynamic signal simulation

### **ASIC Approach**

- Automatic routers can specifically avoid long straight routes; sometimes this causes the router to avoid the "most direct" route
- Critical nets (such as the clock) can use automatic shielding
- Static timing tools help focus dynamic signal simulation
- Fixing a coupling problem can require a point change which itself might cause new problems

## **Capacitive Coupling: The Issue** Delay is a function of switching on neighbors



- Most of the wire capacitance is to neighboring wires
- If A switches, it injects voltage noise onto B, where the magnitude depends on the capacitive divider formed [ $C_{AB}/(C_{AB}+C_B)$ ]
  - If A switches in opposite direction while B switches, coupling capacitance effectively doubles (Miller effect)
  - If A switches in same direction while B switches, coupling capacitance disappears
- These effects can lead to large variance in possible delay of B driver, possibly factor of 5 or 6 between best and worst case
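The coupling effects listed above can be sketched with two small functions: one for the noise injected through the capacitive divider, and one for the effective capacitance the B driver sees depending on what neighbor A does. The function names and the capacitance and voltage values are hypothetical, picked so that most of B's capacitance is to its neighbor.

```python
def coupled_noise_v(v_swing: float, c_ab: float, c_b: float) -> float:
    """Noise on a quiet B when A switches: V * C_AB / (C_AB + C_B)."""
    return v_swing * c_ab / (c_ab + c_b)


def effective_cap(c_b: float, c_ab: float, neighbor: str) -> float:
    """Capacitance seen by B's driver for a given behavior of neighbor A."""
    factor = {
        "same": 0.0,       # A switches with B: coupling capacitance disappears
        "quiet": 1.0,      # A holds its value: coupling capacitance counts once
        "opposite": 2.0,   # A switches against B: Miller effect doubles it
    }[neighbor]
    return c_b + factor * c_ab


# Hypothetical values in fF and volts, for illustration only
C_B, C_AB, VDD = 20.0, 40.0, 1.8

print(coupled_noise_v(VDD, C_AB, C_B))
for case in ("same", "quiet", "opposite"):
    print(case, effective_cap(C_B, C_AB, case))
```

With these assumed values the load on B's driver ranges from 20 fF to 100 fF, a factor of five between the best and worst case, which is the kind of spread that makes coupling hard to bound with static analysis alone.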


## Take away points

- Logical effort is a useful tool for quickly determining transistor sizing and number of stages
- It is essential to consider physical design issues early and often in ASIC design
  - Physical prototyping enables designers to evaluate impact of physical design issues early in the design process with
  - Making logical partitioning match physical partitioning helps expose physical design tradeoffs at the RTL level

Next Lecture: Arvind will introduce using guarded atomic actions to describe hardware