# Microprocessor Design in the Nanoscale Era

### Stefan Rusu

Senior Principal Engineer, Intel Corporation

**IEEE Fellow** 

stefan.rusu@intel.com



©2012 Intel Corporation

## Agenda

- Microprocessor Design Trends
- Process Technology Directions
- Active Power Management
- Leakage Reduction Techniques
- Packaging and Thermal Modeling
- Future Directions and Summary



### **Microprocessor Evolution**





|             | 4004 Processor     | Westmere-EX Processor |
|-------------|--------------------|-----------------------|
| Year        | 1971               | 2011                  |
| Transistors | 2300               | 2.6 B                 |
| Process     | 10 µm              | 32 nm                 |
| Die area    | 12 mm <sup>2</sup> | 513 mm <sup>2</sup>   |

Die photos not to scale



## Scaling Trends



M. Bohr



# Client Processor Trend: Integrated Graphics

- Ivy Bridge 22nm client processor with monolithic integrated graphics
- Up to 4 dual-threaded cores and 8MB L3 cache
- Dual channel DDR3 memory controller at 1600MT/s
- Integrated PCIe interface (16 Gen3 + 4 Gen2 + 4 DMI lanes)
  - First client CPU to support PCIe Gen3
- Three independent displays
- 1.4B transistors in 160mm<sup>2</sup> die



S. Damaraju, ISSCC 2012



### Client Processor Trend: Integrated WiFi RF



- Sensitive RF circuits integrated with 32nm ATOM and PCH
- Integration of traditional III-V RF components
  - 21dBm power amp, 34dBm T/R switch, 3.5dB NF LNA

H. Lakdawala, ISSCC 2012



### Server Processor Trends: More Cores



Server core count increases every generation while staying within a flat power budget



### Server Processor Trends: More Cache



Cache size increases with every process generation



### Server Processor Power Trends





Stefan Rusu – July 2012

### Voltage Scaling Has Slowed Down








# 30 Years of MOSFET Scaling



| Parameter            | Earlier generation | Later generation |
|----------------------|--------------------|------------------|
| Gate length          | 1.0 μm             | 35 nm            |
| Gate oxide thickness | 35 nm              | 1.2 nm           |
| Operating voltage    | 4.0 V              | 1.2 V            |
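Dividing the two columns of the table makes the imbalance explicit: gate length and oxide thickness each shrank by roughly 30x, while the operating voltage fell only about 3.3x, in line with the slowdown in voltage scaling noted earlier. The snippet below just computes these ratios from the table values.

```python
# Values from the "30 Years of MOSFET Scaling" table, in common units.
old = {"gate_length_nm": 1000.0, "gate_oxide_nm": 35.0, "vdd_v": 4.0}
new = {"gate_length_nm": 35.0, "gate_oxide_nm": 1.2, "vdd_v": 1.2}

def scaling_factors(old, new):
    """Return old/new for each parameter, i.e. how many times it shrank."""
    return {k: old[k] / new[k] for k in old}

factors = scaling_factors(old, new)
for name, f in factors.items():
    print(f"{name}: {f:.1f}x smaller")
```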



### 90 nm Strained Silicon Transistors

### NMOS



SiN cap layer: tensile channel strain

### PMOS



SiGe source-drain: compressive channel strain

M. Bohr, ISSCC 2009



# 45 nm High-k + Metal Gate Transistors

### 65 nm Transistor



SiO<sub>2</sub> dielectric, polysilicon gate electrode

### 45 nm HK+MG



Hafnium-based dielectric, metal gate electrode

M. Bohr, ISSCC 2009



### HK/MG Gate Leakage Reduction




K. Mistry, IEDM 2007

### 6T SRAM Bit Cell Leakage Reduction






M. Bohr, ISSCC 2009

### **Traditional Planar Transistor**



Traditional 2-D planar transistors form a conducting channel in the silicon region under the gate electrode when in the "on" state

M. Bohr, 2011



### 22 nm Tri-Gate Transistor



3-D Tri-Gate transistors form conducting channels on three sides of a vertical fin structure, providing "fully depleted" operation



## **Transistor Scaling Trends**

### 32 nm Planar Transistors



### 22 nm Tri-Gate Transistors



M. Bohr, 2011



### Transistor Gate Delay



32nm planar transistors are 22% faster than 45nm planar transistors

D. Perlmutter, ISSCC 2012

![](_page_19_Picture_4.jpeg)


### **Transistor Gate Delay**

![](_page_20_Figure_1.jpeg)

22nm planar transistors would have been only 14% faster

![](_page_20_Picture_3.jpeg)

### Transistor Gate Delay

![](_page_21_Figure_1.jpeg)

22nm Tri-Gate transistors provide improved performance at high voltage and an unprecedented 37% speedup at low voltage

![](_page_21_Picture_3.jpeg)

## Intel Transistor Leadership

![](_page_22_Figure_1.jpeg)

![](_page_22_Picture_2.jpeg)

## Lithography Challenges

![](_page_23_Figure_1.jpeg)

193 nm enhancements enable the 22 nm generation
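Why enhancements are needed can be seen from the Rayleigh resolution criterion, half-pitch = k1·λ/NA. The sketch below assumes a 193 nm immersion scanner with NA = 1.35 and the theoretical single-exposure limit k1 = 0.25 (illustrative values, not from the slide), and shows how pitch splitting, one common enhancement, roughly halves the printable pitch.

```python
def min_half_pitch_nm(wavelength_nm, numerical_aperture, k1):
    """Rayleigh resolution criterion: half-pitch = k1 * wavelength / NA."""
    return k1 * wavelength_nm / numerical_aperture

# Assumed values: 193 nm immersion, NA = 1.35, k1 at the single-exposure limit.
single = min_half_pitch_nm(193.0, 1.35, 0.25)

# Pitch splitting prints alternate lines in separate exposures, so the final
# pitch can be roughly half of what one exposure resolves.
double = single / 2

print(f"single exposure: {single:.1f} nm half-pitch")
print(f"pitch split:     {double:.1f} nm half-pitch")
```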

![](_page_23_Picture_3.jpeg)


## Extreme Ultraviolet Lithography

- EUV lithography uses extremely short wavelength light
  - Visible light: 400 to 700 nm
  - DUV lithography: 193 and 248 nm
  - EUV lithography: 13 nm

![](_page_24_Picture_5.jpeg)

![](_page_24_Picture_6.jpeg)

![](_page_24_Picture_7.jpeg)

#### World's First EUV Mask

![](_page_24_Picture_9.jpeg)


### Layout Restrictions

### 65 nm Layout Style

![](_page_25_Picture_2.jpeg)

Bi-directional features, varied gate dimensions, varied pitches

### 32 nm Layout Style

![](_page_25_Picture_5.jpeg)

Uni-directional features, uniform gate dimension, gridded layout

M. Bohr, ISSCC 2009

![](_page_25_Picture_8.jpeg)

### 450 mm Wafers in the Era of Complex Scaling

The 450 mm transition must coordinate demand drivers, technical requirements and resources
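A larger wafer pays off mainly through more dies per wafer. The sketch below uses the common first-order gross-die estimate (it ignores scribe lines, edge exclusion and yield) with the 160 mm² Ivy Bridge die size quoted earlier in this deck. A 450 mm wafer has 2.25x the area of a 300 mm wafer, and yields slightly more than 2.25x the dies because edge losses are relatively smaller.

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """First-order estimate: wafer area / die area, minus an edge-loss
    term for the partial dies around the circumference."""
    d = wafer_diameter_mm
    return (math.pi * (d / 2) ** 2 / die_area_mm2
            - math.pi * d / math.sqrt(2 * die_area_mm2))

die_area = 160.0  # mm^2, Ivy Bridge die size from the client processor slide
dpw_300 = gross_dies_per_wafer(300, die_area)
dpw_450 = gross_dies_per_wafer(450, die_area)
print(f"300 mm: {dpw_300:.0f} dies, 450 mm: {dpw_450:.0f} dies")
print(f"area ratio {(450 / 300) ** 2:.2f}x, die ratio {dpw_450 / dpw_300:.2f}x")
```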

![](_page_26_Figure_1.jpeg)

![](_page_26_Picture_2.jpeg)


### **Process Variations**

![](_page_27_Picture_1.jpeg)

**Resist Thickness** 

Lens Aberrations

Random Placement of Dopant Atoms

![](_page_27_Picture_5.jpeg)
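Random dopant placement is a major source of local threshold-voltage mismatch. It is commonly described by Pelgrom's model, σ(ΔVt) = A_VT/√(W·L), which shows why mismatch grows as devices shrink. The matching coefficient A_VT = 1.5 mV·µm and the device sizes below are assumed illustrative values, not data from this deck.

```python
import math

def sigma_delta_vt_mv(a_vt_mv_um, width_um, length_um):
    """Pelgrom mismatch model: sigma(dVt) = A_VT / sqrt(W * L)."""
    return a_vt_mv_um / math.sqrt(width_um * length_um)

A_VT = 1.5  # mV*um, assumed matching coefficient
for w, l in [(1.0, 1.0), (0.1, 0.03), (0.05, 0.03)]:
    print(f"W={w} um, L={l} um: sigma(dVt) = {sigma_delta_vt_mv(A_VT, w, l):.1f} mV")
```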

# Voltage and Temperature Variations

- Voltage
  - Chip activity change
  - Current delivery (RLC)
  - Dynamic: ns to 10-100µs
  - Within-die variation
- Temperature
  - Activity & ambient change
  - Dynamic: 100-1000µs
  - Within-die variation
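The current-delivery entry can be made concrete with a first-order estimate of the droop when chip activity changes suddenly: an I·R term plus an L·di/dt term. The sketch ignores the on-die and package capacitance that in practice supplies the fastest part of the step, and all delivery-network values are assumptions.

```python
def first_droop_v(delta_i_a, rise_time_s, r_ohm, l_henry):
    """First-order supply droop for a load-current step:
    resistive drop (dI * R) plus inductive drop (L * dI / dt)."""
    return delta_i_a * r_ohm + l_henry * delta_i_a / rise_time_s

# Assumed delivery-network values (not from the slide):
# 50 A step in 5 ns through 0.5 mOhm and 5 pH.
droop = first_droop_v(delta_i_a=50.0, rise_time_s=5e-9, r_ohm=0.5e-3, l_henry=5e-12)
print(f"droop = {droop * 1000:.0f} mV")
```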

![](_page_28_Figure_10.jpeg)

![](_page_28_Picture_11.jpeg)

![](_page_28_Picture_12.jpeg)

### Impact on Design Methodology

![](_page_29_Figure_1.jpeg)

Major paradigm shift from deterministic design to probabilistic / statistical design
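Statistical design replaces a single worst-case delay with a delay distribution. A minimal Monte Carlo sketch, assuming a path of independent gates with normally distributed delays (real within-die variation also has correlated components), estimates the fraction of parts that meet a given clock period.

```python
import random

def timing_yield(n_stages, mean_ps, sigma_ps, clock_ps, trials=10000, seed=1):
    """Monte Carlo estimate of the fraction of sampled parts whose path delay,
    a sum of n_stages independent normally distributed gate delays, fits
    within the clock period."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        path_delay = sum(rng.gauss(mean_ps, sigma_ps) for _ in range(n_stages))
        if path_delay <= clock_ps:
            passed += 1
    return passed / trials

# Assumed example: 20 gates, 15 ps mean delay each, 10% sigma.
for period_ps in (300.0, 310.0, 320.0):
    print(f"{period_ps:.0f} ps clock: yield {timing_yield(20, 15.0, 1.5, period_ps):.3f}")
```

Designing to the mean path delay (300 ps) would fail about half of the parts in this model; the statistical view lets the designer choose a period that meets a target yield.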

![](_page_29_Picture_3.jpeg)


### SRAM Cell Size Scaling

![](_page_30_Figure_1.jpeg)

Memory density continues to double every 2 years
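The doubling follows from the 0.7x dimension scaling per generation cited in the summary: area scales with the square of the linear dimension, 0.7² ≈ 0.49, so roughly twice as many cells fit in the same area. A sketch of the arithmetic, assuming ideal scaling (real bit cells rarely shrink exactly 0.49x):

```python
linear_scale = 0.7               # per-generation dimension scaling
area_scale = linear_scale ** 2   # cell area shrinks to ~0.49x
density_gain = 1 / area_scale    # bits per mm^2 grow ~2x

cell_area = 1.0  # normalized starting cell area (hypothetical)
for gen in range(1, 4):
    cell_area *= area_scale
    print(f"generation {gen}: cell area {cell_area:.3f}, density {1 / cell_area:.1f}x")
```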

![](_page_30_Picture_3.jpeg)

### Interconnect Trends

![](_page_31_Figure_1.jpeg)

![](_page_31_Picture_2.jpeg)


# 22nm Interconnects

![](_page_32_Picture_1.jpeg)

![](_page_32_Picture_2.jpeg)

- M1 to M8 cross-section
- M1-M6 use ultra-low-k ILD and self-aligned vias providing 13-18% capacitance reduction

- Cross-section of integrated MIM capacitor
- Enables capacitance density of >20fF/µm<sup>2</sup>

C. Auth, VLSI Symposium 2012
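An integrated MIM capacitor acts as on-die decoupling: for a short current burst, C = I·t/ΔV gives the capacitance needed to keep the droop below ΔV. The sketch assumes a density of 20 fF/µm², the typical order of magnitude for on-die MIM capacitors, and an illustrative 10 A, 100 ps event with a 50 mV budget; all numbers are assumptions.

```python
def decap_area_mm2(current_a, duration_s, max_droop_v, density_ff_per_um2):
    """Area of on-die capacitance needed so that supplying current_a for
    duration_s from the capacitor alone drops its voltage by at most
    max_droop_v (C = I * t / dV)."""
    required_c_f = current_a * duration_s / max_droop_v
    density_f_per_mm2 = density_ff_per_um2 * 1e-15 * 1e6  # 1 mm^2 = 1e6 um^2
    return required_c_f / density_f_per_mm2

# Assumed: 10 A for 100 ps, at most 50 mV droop, 20 fF/um^2 MIM density.
area = decap_area_mm2(10.0, 100e-12, 0.05, 20.0)
print(f"required area: {area:.2f} mm^2")
```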

![](_page_32_Picture_8.jpeg)

### **On-chip Interconnect Trend**

![](_page_33_Figure_1.jpeg)

- Local interconnects scale with gate delay
- Global interconnects do not keep up with scaling
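A first-order sketch of why global wires fall behind: the 50% delay of a distributed RC line is about 0.38·R·C. Assuming wire width and thickness both shrink by 0.7x per generation while capacitance per unit length stays roughly constant, resistance per unit length roughly doubles, so a global wire whose length does not shrink gets about 2x slower each generation unless it is widened or broken up with repeaters. The starting wire parameters are assumptions.

```python
def wire_delay_ps(r_ohm_per_um, c_ff_per_um, length_um):
    """50% delay of a distributed RC line, ~0.38 * R_total * C_total."""
    r_total = r_ohm_per_um * length_um
    c_total_f = c_ff_per_um * length_um * 1e-15
    return 0.38 * r_total * c_total_f * 1e12  # seconds -> picoseconds

s = 0.7  # linear scaling factor per generation
# Assumed starting global wire: 0.1 ohm/um and 0.2 fF/um over 1 mm.
r0, c0, length = 0.1, 0.2, 1000.0
before = wire_delay_ps(r0, c0, length)
# Cross-section shrinks by s^2, so resistance per um grows by 1/s^2;
# capacitance per um is assumed constant and the global length is unchanged.
after = wire_delay_ps(r0 / s**2, c0, length)
print(f"{before:.1f} ps -> {after:.1f} ps ({after / before:.2f}x)")
```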

![](_page_33_Picture_4.jpeg)


![](_page_34_Picture_7.jpeg)

## Voltage and Frequency Scaling

![](_page_35_Figure_1.jpeg)

Voltage and frequency scaling together: cubic (V<sup>3</sup>) power reduction
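Dynamic power is P = C·V²·f. When frequency is lowered together with voltage (to first order, the achievable frequency tracks V), power falls with the cube of the scaling factor. A minimal sketch of that arithmetic:

```python
def dynamic_power_ratio(voltage_scale):
    """P = C * V^2 * f. With f scaled in proportion to V (first-order
    assumption), power scales with the cube of the voltage factor."""
    frequency_scale = voltage_scale
    return voltage_scale ** 2 * frequency_scale

for s in (1.0, 0.9, 0.8):
    print(f"V and f at {s:.0%}: power at {dynamic_power_ratio(s):.1%}")
```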

![](_page_35_Picture_3.jpeg)

### Memory and Register File (RF) Vmin Reduction

- Write Assist circuit temporarily drops the array supply node to make it easier to write into the bit-cell
- Both cache and register files use this technique to improve write Vmin in the 22nm Ivy Bridge processor
- 22nm transistor and circuit improvements enable Vmin reduction of >100mV for cache and 60mV for RF

S. Damaraju, ISSCC 2012

![](_page_36_Figure_4.jpeg)

![](_page_36_Picture_5.jpeg)

### NTV Pentium<sup>®</sup> Processor

![](_page_37_Figure_1.jpeg)

|                   | Ultra-low Power | Energy Efficient | High Performance |
|-------------------|-----------------|------------------|------------------|
| Supply voltage    | 280 mV          | 0.45 V           | 1.2 V            |
| Frequency         | 3 MHz           | 60 MHz           | 915 MHz          |
| Power             | 2 mW            | 10 mW            | 737 mW           |
| Energy efficiency | 1500 MIPS/W     | 5830 MIPS/W      | 1240 MIPS/W      |

| Parameter    | Value                  |
|--------------|------------------------|
| Technology   | 32nm High-K Metal Gate |
| Interconnect | 1 Poly, 9 Metal (Cu)   |
| Transistors  | Core: 6M               |
| Core Area    | 2 mm<sup>2</sup>       |
| Package      | 951-pin FCBGA11        |

S. Jain, ISSCC 2012
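The MIPS/W row of the NTV Pentium table can be roughly reproduced from the frequency and power rows if one assumes about one instruction per cycle (an assumption; the slide does not state IPC). The 60 MHz point comes out at 6000 rather than 5830 MIPS/W, presumably because the 10 mW figure is rounded.

```python
# Operating points from the NTV Pentium table: (supply V, MHz, mW).
points = {
    "Ultra-low Power": (0.28, 3.0, 2.0),
    "Energy Efficient": (0.45, 60.0, 10.0),
    "High Performance": (1.2, 915.0, 737.0),
}

def mips_per_watt(freq_mhz, power_mw, ipc=1.0):
    """Energy efficiency assuming `ipc` instructions per cycle (1 assumed)."""
    mips = freq_mhz * ipc
    return mips / (power_mw / 1000.0)

for name, (vdd, f, p) in points.items():
    print(f"{name:16s} {vdd:.2f} V: {mips_per_watt(f, p):7.0f} MIPS/W")
```

Efficiency peaks at the near-threshold point. A common explanation is that lowering the voltage cuts switching energy, but at 280 mV the frequency drops so far that leakage energy per operation dominates.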

![](_page_37_Picture_5.jpeg)

## **Clock Gating**

![](_page_38_Figure_1.jpeg)

- Save power by gating the clock when data activity is low
- Requires detailed logic validation
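Clock gating is usually implemented with a latch-based integrated clock-gating (ICG) cell rather than a bare AND gate. The behavioral sketch below (a generic textbook structure, not the specific circuit on the slide) models the cell, counts the clock edges delivered to the gated flops as a proxy for clock power, and shows why the latch is needed.

```python
def integrated_clock_gate(clk, enable):
    """Behavioral model of a latch-based clock gate: the enable is captured by a
    latch that is transparent only while clk is low, and the gated clock is
    clk AND latched_enable, so enable changes during the high phase cannot glitch."""
    latched = 0
    gated = []
    for c, en in zip(clk, enable):
        if c == 0:
            latched = en  # latch is transparent during the low phase
        gated.append(c & latched)
    return gated

def rising_edges(wave):
    """Count 0->1 transitions, a proxy for clock power in the gated load."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)

clk = [0, 1, 0, 1, 0, 1, 0, 1]      # one sample per half cycle
enable = [1, 1, 0, 0, 0, 1, 1, 1]   # low data activity in the middle
gclk = integrated_clock_gate(clk, enable)
print(gclk, rising_edges(clk), rising_edges(gclk))  # [0, 1, 0, 0, 0, 0, 0, 1] 4 2

# Enable pulsing while clk is high: a bare AND gate glitches, the ICG does not.
clk2, en2 = [0, 1, 1, 1, 0], [0, 0, 1, 0, 0]
naive = [c & e for c, e in zip(clk2, en2)]
print(naive, integrated_clock_gate(clk2, en2))  # [0, 0, 1, 0, 0] [0, 0, 0, 0, 0]
```

In this toy trace the gated load receives 2 of the 4 clock edges, halving its clock switching activity.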

![](_page_38_Picture_4.jpeg)

### Core Power Management

![](_page_39_Figure_1.jpeg)

![](_page_39_Picture_2.jpeg)

### Multiple Voltage Domains

![](_page_40_Figure_1.jpeg)

Multiple voltage domains minimize power consumption across the core and uncore areas

S. Rusu, ISSCC 2009

![](_page_40_Picture_4.jpeg)

## Multiple Clock Domains

![](_page_41_Figure_1.jpeg)

![](_page_41_Figure_2.jpeg)

Three primary clock domains: core, uncore, I/O. A total of 16 PLLs and 8 DLLs.

S. Rusu, ISSCC 2009




![](_page_42_Picture_7.jpeg)

### Subthreshold Leakage Trend

![](_page_43_Figure_1.jpeg)

![](_page_43_Picture_2.jpeg)

# Leakage Reduction Techniques

![](_page_44_Figure_1.jpeg)

![](_page_44_Picture_2.jpeg)

### Leakage is a Strong Function of Voltage

![](_page_45_Figure_1.jpeg)

Leakage decreases strongly with lower supply voltage
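Part of this strong voltage dependence comes from drain-induced barrier lowering (DIBL): a higher drain voltage lowers the effective threshold, and subthreshold current grows exponentially as the threshold drops. A first-order sketch, assuming a DIBL coefficient of 0.15 V/V and a subthreshold swing of 100 mV/decade (illustrative values, not from the slide), and ignoring gate leakage:

```python
def relative_leakage_power(vdd, vdd_ref=1.0, dibl_v_per_v=0.15, swing_mv_per_dec=100.0):
    """Subthreshold leakage power relative to operation at vdd_ref.
    DIBL lowers the threshold by dibl_v_per_v * Vdd, and subthreshold current
    changes 10x per swing_mv_per_dec of threshold change, so
    I_leak ~ 10 ** (dibl * Vdd / S). Leakage power is Vdd * I_leak."""
    vt_lowering_change_mv = dibl_v_per_v * (vdd - vdd_ref) * 1000.0
    current_ratio = 10 ** (vt_lowering_change_mv / swing_mv_per_dec)
    return (vdd / vdd_ref) * current_ratio

for v in (1.0, 0.9, 0.8, 0.7):
    print(f"Vdd = {v:.1f} V: leakage power x{relative_leakage_power(v):.2f}")
```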

![](_page_45_Picture_3.jpeg)

![](_page_46_Figure_0.jpeg)

S. Rusu, US Pat. 7,657,767

![](_page_46_Picture_2.jpeg)

### Leakage Shut-off Infrared Images

### 16MB part

![](_page_47_Figure_2.jpeg)

### 8MB part

![](_page_47_Picture_4.jpeg)

### 4MB part

![](_page_47_Figure_6.jpeg)

Leakage reduction ► 3W (8MB), 5W (4MB)

![](_page_47_Picture_9.jpeg)


## Cache Dynamic Shut-off

![](_page_48_Figure_1.jpeg)

#### Normal Operation

In the full-load state, all 16 ways are enabled (green)

#### Cache-by-Demand Operation

Under idle or low-load states, cache ways are dynamically flushed out and put in shut-off mode (red)
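The cache-by-demand behavior can be sketched as a small controller that maps load to a number of enabled ways and flushes each way before it is shut off. The class, its load-to-ways mapping and the two-way minimum are hypothetical; the slide does not specify the actual policy.

```python
class CacheWayManager:
    """Hypothetical cache-by-demand controller for a 16-way cache."""

    def __init__(self, total_ways=16, min_ways=2):
        self.total_ways = total_ways
        self.min_ways = min_ways
        self.enabled = total_ways   # normal operation: all ways enabled
        self.flushed_ways = []      # ways flushed and put in shut-off, in order

    def target_ways(self, load):
        """Map a load level in [0, 1] to a number of enabled ways."""
        load = min(max(load, 0.0), 1.0)
        return max(self.min_ways, round(load * self.total_ways))

    def update(self, load):
        target = self.target_ways(load)
        while self.enabled > target:
            # Highest-numbered way goes first; hardware would write back its
            # dirty lines before cutting the way's supply.
            self.enabled -= 1
            self.flushed_ways.append(self.enabled)
        if target > self.enabled:
            self.enabled = target   # re-enabled ways start out empty
        return self.enabled

mgr = CacheWayManager()
for load in (1.0, 0.25, 0.0, 1.0):
    print(f"load {load:.2f}: {mgr.update(load)} ways enabled")
```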

![](_page_48_Picture_5.jpeg)

![](_page_49_Figure_0.jpeg)

Three PMOS sleep transistor groups for sub-array leakage reduction

Y. Wang, ISSCC 2009

![](_page_49_Picture_3.jpeg)


### Cache Leakage Reduction Benefit

![](_page_50_Figure_1.jpeg)

Conditions: fast corner, 1.0V, 110°C

Leakage management circuit reduces sub-array leakage by 58%

![](_page_50_Picture_4.jpeg)

# Leakage Mitigation: Long-Le Transistors

![](_page_51_Figure_1.jpeg)

- All transistors can be either nominal or long-Le
- Most library cells are available in both flavors
- Long-Le transistors are ~10% slower, but have 3x lower leakage
- All paths with timing slack use long-Le transistors

S. Rusu, ISSCC 2006

The initial design uses only long-channel devices
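The long-Le policy (start with all cells long-Le, keep long-Le wherever there is slack) can be sketched as a greedy swap. The cell delays, leakage units, paths and clock period are hypothetical; only the ~10% delay and ~3x leakage factors come from the slide.

```python
LONG_DELAY_FACTOR = 1.10   # long-Le cells ~10% slower (from the slide)
LONG_LEAK_FACTOR = 1 / 3   # long-Le cells ~3x lower leakage (from the slide)

def assign_flavors(paths, nominal_delay, nominal_leak, clock_period):
    """Start with every cell long-Le, then greedily swap cells on failing
    paths to nominal until each path meets the clock period.
    Returns (set of nominal cells, total leakage). A path may still fail
    if even all-nominal cells are too slow; this sketch does not report it."""
    nominal = set()

    def path_delay(path):
        return sum(nominal_delay[c] * (1.0 if c in nominal else LONG_DELAY_FACTOR)
                   for c in path)

    for path in paths:
        # Swap the slowest cells first: they recover the most delay per swap.
        for cell in sorted(path, key=lambda c: nominal_delay[c], reverse=True):
            if path_delay(path) <= clock_period:
                break
            nominal.add(cell)

    leakage = sum(nominal_leak[c] * (1.0 if c in nominal else LONG_LEAK_FACTOR)
                  for c in nominal_leak)
    return nominal, leakage

# Hypothetical netlist: cell delays in ps, leakage in arbitrary units.
delays = {"a": 10.0, "b": 20.0, "c": 15.0, "d": 5.0}
leaks = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
nominal_cells, total_leak = assign_flavors([["a", "b", "c"], ["a", "d"]], delays, leaks, 48.0)
print(nominal_cells, total_leak)  # only "b" needs the nominal flavor
```

In this example only one of four cells needs the nominal flavor, so total leakage is half that of an all-nominal design.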

### Long-Le Transistors Usage Map

![](_page_52_Figure_1.jpeg)

Massive long-channel usage in uncore reduces leakage

![](_page_52_Picture_3.jpeg)

### Power & Leakage Breakdown

Nehalem-EX 45nm example

![](_page_53_Figure_2.jpeg)

S. Rusu, ISSCC 2009

![](_page_53_Picture_4.jpeg)

### Core and Cache Recovery Example

![](_page_54_Picture_1.jpeg)

Defective core and cache slices can be disabled in horizontal pairs

S. Rusu, ISSCC 2009

![](_page_54_Picture_4.jpeg)

# Minimize Leakage in Disabled Blocks

Disabled cores ► Power gated

![](_page_55_Figure_2.jpeg)

Disabled cache slices ► All major arrays in shut-off

![](_page_55_Figure_4.jpeg)

![](_page_55_Picture_5.jpeg)

### Core/Cache Recovery – Infrared Image

![](_page_56_Figure_1.jpeg)

All cores and cache slices are enabled

![](_page_56_Picture_3.jpeg)


S. Rusu, ISSCC 2009

### Core/Cache Recovery – Infrared Image

![](_page_57_Figure_1.jpeg)

Shut off 2 cores (top row) and 2 cache slices (bottom row). Disabled blocks are clock and power gated.

S. Rusu, ISSCC 2009

![](_page_57_Picture_4.jpeg)



![](_page_58_Picture_7.jpeg)

## Microprocessor Package Evolution

![](_page_59_Picture_1.jpeg)

![](_page_59_Picture_2.jpeg)

- 1971 4004 Processor
  - 16-pin ceramic package
  - Wire bond attach
  - 750 kHz I/O

- 2012 Xeon<sup>®</sup> E5 Processor
  - 2011-contact organic package
  - Flip-chip attach
  - 8.0 GHz I/O

![](_page_59_Picture_11.jpeg)

### **Power Density Models**

![](_page_60_Figure_1.jpeg)

With increasing power density and large on-die caches, detailed, non-uniform power models are required

![](_page_60_Picture_3.jpeg)

## **Thermal Modeling**

![](_page_61_Picture_1.jpeg)

### Simulated power density

### Infrared emission microscope measurement

D. Genossar and N. Shamir "Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization and Validation", Intel Technology Journal 5/2003

![](_page_61_Picture_5.jpeg)

# **Thermal Sensors**

- Multiple temperature sensors
  - One in each core hot spot
  - One in the die center
- Temperature information is available through PECI bus for system fan management

![](_page_62_Figure_3.jpeg)

![](_page_62_Figure_4.jpeg)

![](_page_62_Picture_5.jpeg)


### Power Management Unit

![](_page_63_Figure_1.jpeg)

PMU controls processor voltage and frequency based on compute loading and thermal data
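A minimal sketch of the kind of decision the PMU makes, assuming a hypothetical five-entry P-state table and a 95°C thermal limit; the actual PMU algorithm is not described on the slide.

```python
# Hypothetical P-state table (frequency in GHz, voltage in V), fastest first.
P_STATES = [(3.0, 1.10), (2.6, 1.00), (2.2, 0.92), (1.6, 0.80), (1.2, 0.75)]

def choose_p_state(utilization, temperature_c, t_limit_c=95.0, current_index=0):
    """Return an index into P_STATES from compute load and thermal data.
    Above the thermal limit, step one state slower than the current one;
    otherwise pick the slowest state whose frequency still covers the demand."""
    if temperature_c > t_limit_c:
        return min(current_index + 1, len(P_STATES) - 1)
    demand_ghz = utilization * P_STATES[0][0]
    for index in range(len(P_STATES) - 1, -1, -1):  # slowest state first
        if P_STATES[index][0] >= demand_ghz:
            return index
    return 0  # demand above the top state: run as fast as allowed

for util, temp in [(1.0, 60.0), (0.3, 60.0), (1.0, 100.0)]:
    f, v = P_STATES[choose_p_state(util, temp)]
    print(f"load {util:.0%}, {temp:.0f} C -> {f} GHz at {v} V")
```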

![](_page_63_Picture_3.jpeg)


![](_page_64_Picture_7.jpeg)

### **Future Directions**

![](_page_65_Picture_1.jpeg)

- 2D mesh network with multiple Voltage / Frequency islands
- Communication across islands achieved through FIFOs

Ogras (CMU), DAC 2007
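The slide does not say how the FIFOs are built. A common choice for crossing between independent clock domains is a dual-clock FIFO whose read and write pointers cross the boundary in Gray code, so a synchronizer that samples mid-transition sees either the old or the new pointer value, never a mix. A sketch of the pointer logic, assuming a hypothetical 8-entry FIFO:

```python
DEPTH_BITS = 3  # hypothetical 8-entry FIFO; pointers carry one extra wrap bit

def bin_to_gray(n):
    """Binary to Gray code: consecutive values differ in exactly one bit."""
    return n ^ (n >> 1)

def gray_to_bin(g):
    """Gray code back to binary (cumulative XOR of right shifts)."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def is_empty(read_gray, write_gray_synced):
    """Empty when the read pointer has caught up with the synchronized write pointer."""
    return read_gray == write_gray_synced

def is_full(write_gray, read_gray_synced):
    """Full when the write pointer is exactly one lap (2**DEPTH_BITS entries) ahead.
    In Gray code this means the two most significant bits differ and the rest match."""
    msb_mask = 0b11 << (DEPTH_BITS - 1)
    return write_gray == (read_gray_synced ^ msb_mask)

print(is_full(bin_to_gray(11), bin_to_gray(3)))  # True: 11 - 3 == 8 entries
```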

![](_page_65_Picture_5.jpeg)

### Fine Grain Power Management

![](_page_66_Figure_1.jpeg)

![](_page_66_Figure_2.jpeg)

25-core processor example:

![](_page_66_Figure_4.jpeg)

![](_page_66_Figure_5.jpeg)

![](_page_66_Picture_6.jpeg)

![](_page_66_Picture_7.jpeg)

## Summary

- Moore's Law has fueled the worldwide technology revolution for over 40 years and will continue for at least another decade
  - 0.7x transistor dimension scaling every two years
  - Tri-gate devices provide significant benefits
- Continued microprocessor performance improvement depends on our ability to manage active power and leakage
  - Clock and power gate unused or disabled blocks
  - Multiple voltage and clock domains
  - Dynamic voltage and frequency adjustment
- Core and cache recovery enables multiple product options
  - Disabled cores and cache slices are clock and power gated

![](_page_67_Picture_10.jpeg)