







### Multiple Issue

- Modern processors can issue and execute multiple instructions per clock cycle
- CPI < 1 (*superscalar*), so can use *Instructions Per Cycle* (IPC) instead
- e.g. a 4 GHz, 4-way multiple-issue processor can execute 16 billion instructions per second (IPS), with peak CPI = 0.25 and peak IPC = 4
  - But dependencies and structural hazards reduce this in practice
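
The peak figures above follow directly from clock rate and issue width. A minimal sketch of that arithmetic, assuming an ideal machine with no stalls (the function name `peak_throughput` is made up for illustration):

```python
def peak_throughput(clock_hz, issue_width):
    """Return (peak IPS, peak IPC, peak CPI) for an ideal multiple-issue core.

    Assumes every issue slot is filled on every cycle, which real code
    rarely achieves because of dependencies and structural hazards.
    """
    peak_ipc = issue_width            # instructions per cycle
    peak_cpi = 1 / issue_width        # cycles per instruction
    peak_ips = clock_hz * issue_width # instructions per second
    return peak_ips, peak_ipc, peak_cpi


ips, ipc, cpi = peak_throughput(4e9, 4)
print(f"peak IPS = {ips:.2e}, peak IPC = {ipc}, peak CPI = {cpi}")
```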

# Multiple Issue

- Static multiple issue
  - Compiler reorders independent/commutative instructions to be issued together
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines pipeline and chooses instructions to reorder/issue
  - CPU can resolve hazards at runtime
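
To make "issued together" concrete, here is a toy sketch of the hazard check a static scheduler might perform. It is an assumption-laden simplification: instructions are hypothetical `(dest, src1, src2)` tuples, the packer keeps program order rather than reordering, and it only looks at register dependencies within one issue group.

```python
def pack_issue_groups(instrs, width=2):
    """Greedily pack instructions, in program order, into issue groups.

    An instruction starts a new group when the current group is full or
    when it reads or writes a register already written in that group.
    """
    groups = [[]]
    written = set()  # registers written by the current group
    for dest, *srcs in instrs:
        hazard = dest in written or any(s in written for s in srcs)
        if len(groups[-1]) == width or hazard:
            groups.append([])
            written = set()
        groups[-1].append((dest, *srcs))
        written.add(dest)
    return groups


program = [
    ("t0", "s0", "s1"),
    ("t1", "s2", "s3"),  # independent of the first: same group
    ("t2", "t0", "t1"),  # needs t0 and t1: must wait for a later group
    ("t3", "s4", "s5"),  # independent of t2: shares its group
]
for cycle, group in enumerate(pack_issue_groups(program)):
    print(cycle, group)
```

With a 2-wide machine the four instructions fit in two issue groups. A real compiler would also reorder instructions to fill slots that this greedy pass leaves empty.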



# Pipeline Depth and Issue Width

- Intel Processors over Time

| Microprocessor    | Year | Clock Rate    | Pipeline<br>Stages | Issue<br>width | Cores | Power |
|-------------------|------|---------------|--------------------|----------------|-------|-------|
| i486              | 1989 | 25 MHz        | 5                  | 1              | 1     | 5W    |
| Pentium           | 1993 | 66 MHz        | 5                  | 2              | 1     | 10W   |
| Pentium Pro       | 1997 | 200 MHz       | 10                 | 3              | 1     | 29W   |
| P4 Willamette     | 2001 | 2000 MHz      | 22                 | 3              | 1     | 75W   |
| P4 Prescott       | 2004 | 3600 MHz      | 31                 | 3              | 1     | 103W  |
| Core 2 Conroe     | 2006 | 2930 MHz      | 12-14              | 4*             | 2     | 75W   |
| Core 2 Penryn     | 2008 | 2930 MHz      | 12-14              | 4*             | 4     | 95W   |
| Core i7 Westmere  | 2010 | 3460 MHz      | 14                 | 4*             | 6     | 130W  |
| Xeon Sandy Bridge | 2012 | 3100 MHz      | 14-19              | 4*             | 8     | 150W  |
| Xeon Ivy Bridge   | 2014 | 2800 MHz      | 14-19              | 4*             | 15    | 155W  |

Summer 2014 - Lecture 23 (8/31/2014)







# Why Do Dynamic Scheduling?

- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g. cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined by I/O
- Different implementations of an ISA have different latencies and hazards
  - Forward compatibility and optimizations

### Speculation

- "Guess" what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right and roll back if necessary
- Examples:
  - Speculate on branch outcome (Branch Prediction)
    - Roll back if path taken is different
  - Speculate on load
    - Load into an internal register before instruction to minimize time waiting for memory
- Can be done in hardware or by compiler
- Common to static and dynamic multiple issue

### Not a Simple Linear Pipeline

**3 major units operating in parallel:**

- Instruction fetch and issue unit
  - Issues instructions *in program order*
- Many parallel functional (execution) units
  - Each unit has an input buffer called a *Reservation Station*
  - Holds operands and records the operation
- Commit unit
  - Saves results from functional units in Reorder Buffers
  - Stores results once branch resolved so OK to execute
  - Commits results in program order

### Out-of-Order Execution (1/2)

Can also unroll loops in hardware

- 1) Fetch instructions in program order ($\leq 4$/clock)
- 2) Predict branches as taken/untaken
- 3) To avoid hazards on registers, rename registers using a set of internal registers (≈ 80 registers)
- 4) Collection of renamed instructions might execute in a window (≈ 60 instructions)
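
The renaming step can be sketched in a few lines. This is an illustrative model only, not how any particular Intel core does it: instructions are hypothetical `(dest, src1, src2)` tuples, physical registers are named `p0`, `p1`, ..., and registers are never freed (real hardware recycles them at commit).

```python
def rename(instrs, num_physical=80):
    """Give every register write a fresh internal (physical) register.

    Sources are rewritten to the most recent physical register holding
    that architectural value, so true (RAW) dependencies are preserved
    while name reuse (WAR/WAW) no longer forces an ordering.
    """
    mapping = {}   # architectural name -> current physical name
    next_free = 0
    renamed = []
    for dest, *srcs in instrs:
        # Read the mapping BEFORE updating it for this instruction's dest.
        new_srcs = [mapping.get(s, s) for s in srcs]
        if next_free >= num_physical:
            raise RuntimeError("out of physical registers (hardware would stall)")
        phys = f"p{next_free}"
        next_free += 1
        mapping[dest] = phys
        renamed.append((phys, *new_srcs))
    return renamed


program = [
    ("t0", "s0", "s1"),
    ("s0", "s2", "s3"),  # overwrites s0 while instruction 1 still reads it (WAR)
    ("t0", "s0", "t0"),  # reuses the name t0 (WAW with instruction 1)
]
for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the second instruction writes `p1` instead of `s0`, so it no longer has to wait for the first instruction to read the old `s0`.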

### Out-of-Order Execution (2/2)

- 5) Execute instructions with ready operands in 1 of multiple *functional units* (ALUs, FPUs, Ld/St)
- 6) Buffer results of executed instructions until predicted branches are resolved in *reorder buffer*
- 7) If branch was predicted correctly, *commit* results in program order
- 8) If branch was predicted incorrectly, discard all dependent results and restart from the correct PC
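
Steps 6 to 8 can be modelled with a small queue. The sketch below is a simplification under stated assumptions: the class `ReorderBuffer` and its method names are invented for illustration, instructions are identified by a string tag, and a misprediction discards every entry younger than the branch.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results finish in any order, commit in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def issue(self, tag):
        # Entries are allocated in program order at issue time.
        self.entries.append({"tag": tag, "done": False, "value": None})

    def complete(self, tag, value):
        # A functional unit finished; record the result but do not commit yet.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(tag)

    def commit(self):
        # Retire only from the head, and only while the head is finished.
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append((entry["tag"], entry["value"]))
        return committed

    def flush_younger_than(self, tag):
        # Misprediction: drop everything issued after the given instruction.
        tags = [entry["tag"] for entry in self.entries]
        keep = tags.index(tag) + 1
        while len(self.entries) > keep:
            self.entries.pop()


rob = ReorderBuffer()
for tag in ("i1", "i2", "i3"):
    rob.issue(tag)
rob.complete("i3", 30)   # finishes first, but is the youngest
print(rob.commit())      # nothing can commit: i1 is still pending
rob.complete("i1", 10)
print(rob.commit())      # i1 commits; i3 still waits behind i2
rob.complete("i2", 20)
print(rob.commit())      # i2 and i3 commit together, in program order
```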



# Out-Of-Order Intel

- All use out-of-order (O-O-O) execution since 2001

| Microprocessor       | Year | Clock Rate | Pipeline<br>Stages | Issue<br>width | Out-of-order/<br>Speculation | Cores | Power |
|----------------------|------|------------|--------------------|----------------|------------------------------|-------|-------|
| i486                 | 1989 | 25 MHz     | 5                  | 1              | No                           | 1     | 5W    |
| Pentium              | 1993 | 66 MHz     | 5                  | 2              | No                           | 1     | 10W   |
| Pentium Pro          | 1997 | 200 MHz    | 10                 | 3              | Yes                          | 1     | 29W   |
| P4 Willamette        | 2001 | 2000 MHz   | 22                 | 3              | Yes                          | 1     | 75W   |
| P4 Prescott          | 2004 | 3600 MHz   | 31                 | 3              | Yes                          | 1     | 103W  |
| Core 2 Conroe        | 2006 | 2930 MHz   | 12-14              | 4*             | Yes                          | 2     | 75W   |
| Core 2 Penryn        | 2008 | 2930 MHz   | 12-14              | 4*             | Yes                          | 4     | 95W   |
| Core i7<br>Westmere  | 2010 | 3460 MHz   | 14                 | 4*             | Yes                          | 6     | 130W  |
| Xeon Sandy<br>Bridge | 2012 | 3100 MHz   | 14-19              | 4*             | Yes                          | 8     | 150W  |
| Xeon Ivy Bridge      | 2014 | 2800 MHz   | 14-19              | 4*             | Yes                          | 15    | 155W  |







# Agenda

- Multiple Issue
- Administrivia
- Virtual Memory Introduction

# Administrivia

- HW5 due tonight
- Project 2 (Performance Optimization) due Sunday
- No lab today
  - TAs will be in lab to check off make-up labs
  - Highly encouraged to make up labs today if you're behind
  - Treated as Tuesday checkoff for lateness
- Project 3 (Pipelined Processor in Logisim) released Friday/Saturday






# Memory Hierarchy Requirements

- Allow multiple processes to simultaneously occupy memory and provide protection
  - Don't let programs read from or write to each other's memories
- Give each program the illusion that it has its own private address space
  - Suppose a program has base address 0x00400000; then different processes each think their code resides at the same address
  - Each program must have a different view of memory

### Virtual Memory

- Next level in the memory hierarchy
  - Provides illusion of very large main memory
  - Working set of "pages" residing in main memory (subset of all pages residing on disk)
- Main goal: Avoid reaching all the way back to disk as much as possible
- Additional goals:
  - Let OS share memory among many programs and protect them from each other
    - Each process thinks it has all the memory to itself



### VM Analogy (1/2)

- Trying to find a book in the UCB library system
- Book title is like virtual address (VA)
  - What you want/are requesting
- Book call number is like physical address (PA)
  - Where it is actually located
- Card catalogue is like a page table (PT)
  - Maps from book title to call number
  - Does not contain the actual data that you want
  - The catalogue itself takes up space in the library

### VM Analogy (2/2)

- Indication of current location within the library system is like valid bit
  - Valid if in current library (main memory) vs. invalid if in another branch (disk)
  - Found on the card in the card catalogue
- Availability/terms of use like access rights
  - What you are allowed to do with the book (ability to check out, duration, etc.)
  - Also found on the card in the card catalogue









### Page Table functionality:

- Incoming request is Virtual Address (VA), want Physical Address (PA)
- Physical Offset = Virtual Offset (page-aligned)
- So just swap Virtual Page Number (VPN) for Physical Page Number (PPN)

### Implementation?

- Use VPN as index into PT
- Store PPN and management bits (Valid, Access Rights)
- Does NOT store actual data (the data sits in PM)
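
The lookup described above can be sketched as follows. The details are assumptions chosen for illustration: 4 KiB pages (a 12-bit offset), a Python dict standing in for the table indexed by VPN, and a single `writable` flag standing in for the access rights.

```python
PAGE_OFFSET_BITS = 12  # assumption: 4 KiB pages


def translate(va, page_table, write=False):
    """Translate a virtual address to a physical address via the page table."""
    vpn = va >> PAGE_OFFSET_BITS                  # upper bits index the table
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)   # lower bits pass through unchanged
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        # Page is not in main memory; the OS would fetch it from disk.
        raise LookupError(f"page fault on VPN {vpn:#x}")
    if write and not entry["writable"]:
        raise PermissionError(f"write to read-only page, VPN {vpn:#x}")
    # Swap the VPN for the PPN and reattach the same offset.
    return (entry["ppn"] << PAGE_OFFSET_BITS) | offset


# The table holds only PPNs and management bits, never the data itself.
page_table = {
    0x400: {"valid": True, "writable": False, "ppn": 0x12},
    0x401: {"valid": False, "writable": True, "ppn": 0x00},
}
print(hex(translate(0x00400ABC, page_table)))
```

Here virtual address `0x00400ABC` splits into VPN `0x400` and offset `0xABC`, and the entry maps it to PPN `0x12`, giving physical address `0x12ABC`.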









| Caches                                                | Virtual Memory                       |
|-------------------------------------------------------|--------------------------------------|
| Block                                                 | Page                                 |
| Cache Miss                                            | Page Fault                           |
| Block Size: 32-64B                                    | Page Size: 4KiB-8KiB                 |
| Placement:<br>Direct Mapped,<br>N-way Set Associative | Fully Associative<br>(almost always) |
| Replacement:<br>LRU or Random                         | LRU                                  |
| Write Thru or Back                                    | Write Back                           |

















