Facebook gets OK to build 'Burger Shack' and 'BBQ House' at Menlo Park campus

The commission expressed enthusiasm for the project, particularly its potential to keep employees from hopping into their cars at lunch time. The BBQ and burger eateries will supplement two cafes on the campus.

Tsurusaka said he doesn’t expect workers to leave the campus if they have on-site food options. “The food for employees, it’s provided without charge,” he said. “So we would expect employees to stay on campus to take advantage of that.”

By Jason Green
Daily News Staff Writer
Posted: 10/07/2011

--

Review

• Flynn Taxonomy of Parallel Architectures
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
  – SISD: Single Instruction Single Data (unused)
  – MISD: Multiple Instruction Single Data
• Intel SSE SIMD Instructions
  – One instruction fetch that operates on multiple operands simultaneously
  – 64/128 bit XMM registers

--

Intel SSE Intrinsics

• Intrinsics are C functions and procedures for putting in assembly language, including SSE instructions
  – With intrinsics, can program using these instructions indirectly
  – One-to-one correspondence between SSE instructions and intrinsics

--

Example SSE Intrinsics

Intrinsics:                    Corresponding SSE instructions:
• Vector data type:            MOVAPD/aligned, packed double
   _m128d                      MOVAPD/aligned, packed double
• Load and store operations:   MOVUPD/unaligned, packed double
   _mm_load_pd                MOVUPD/unaligned, packed double
   _mm_store_pd                MOVUPD/unaligned, packed double
• Load and broadcast across vector
   _mm_load1_pd                MOVSD + shuffling/duplicating
• Arithmetic:
   _mm_add_pd                  ADDPD/add, packed double
   _mm_mul_pd                  MULPD/multiple, packed double
Example: 2 x 2 Matrix Multiply

Definition of Matrix Multiply:

\[ C_{ij} = (A \times B)_{ij} = \sum_{k=1}^{2} A_{ik} \cdot B_{kj} \]

\[
\begin{array}{c|cc}
A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{24} \\
\hline
1 & 0 & 2 \\
0 & 1 & 4 \\
\end{array}
\]

\[
\begin{array}{c|cc}
B_{11} & B_{12} & B_{14} \\
B_{21} & B_{22} & B_{23} \\
\hline
1 & 3 & 5 \\
2 & 4 & 6 \\
\end{array}
\]

\[
\begin{array}{c|cc}
C_{11} & C_{12} & C_{13} \\
C_{21} & C_{22} & C_{23} \\
\hline
1 & 2 & 3 \\
2 & 3 & 4 \\
\end{array}
\]

Example: 2 x 2 Matrix Multiply

• Initialization

\[
C_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\]

\[
C_2 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\]

• I = 1

\[
A \rightarrow A_{12}, A_{13}
\]

\[
\text{ymm_load_pd: Stored in memory in Column order}
\]

\[
B \rightarrow B_{12}, B_{13}
\]

\[
\text{ymm_load1_pd: SSE instruction that loads a double word and stores it in the high and low double words of the XMM register [duplicates value in both halves of XMM]}
\]

Example: 2 x 2 Matrix Multiply

• First iteration intermediate result

\[
C_1 = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}
\]

\[
C_2 = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}
\]

\[
\text{c1 = mm_add_pd(c1, mm_mul_pd(a,b1));}
\]

\[
\text{c2 = mm_add_pd(c2, mm_mul_pd(a,b2));}
\]

SSE instructions first do parallel multiplies and then parallel adds in XMM registers

• I = 2

\[
A \rightarrow A_{12}, A_{13}
\]

\[
\text{ymm_load_pd: Stored in memory in Column order}
\]

\[
B \rightarrow B_{12}, B_{13}
\]

\[
\text{ymm_load1_pd: SSE instruction that loads a double word and stores it in the high and low double words of the XMM register [duplicates value in both halves of XMM]}
\]
Example: 2 x 2 Matrix Multiply

- Second iteration intermediate result
  \[ C_1 = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \]
  \[ C_2 = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \]
  \[ c = _{\text{mm}}\text{add}\_{pd}(C_1, C_2), \text{mm}	ext{mul}\_{pd}(a_{12}, b_{11}); \]
  \[ c = _{\text{mm}}\text{add}\_{pd}(C_2, c_1), \text{mm}	ext{mul}\_{pd}(a_{11}, b_{21}); \]
  \[ c = _{\text{mm}}\text{add}\_{pd}(C_1, c_2), \text{mm}	ext{mul}\_{pd}(a_{22}, b_{12}); \]

- I = 2
  \[ A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \]
  \[ B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \]
  \[ _{\text{mm}}\text{load}_{pd} \text{ (stored in memory in Column order)} \]

Example: 2 x 2 Matrix Multiply (Part 1 of 2)

- SSE instructions that loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)

Example: 2 x 2 Matrix Multiply (Part 2 of 2)

- \[ \text{Inner loop from gcc} -O-S \]
  \[ L2: \text{movapd} \{ %rax,%rsi \}, %xmm1 \]
  \[ \text{movddup} \{ %rax \}, %xmm0 \]
  \[ \text{mulpd} \%xmm1, %xmm0 \]
  \[ \text{addpd} \%xmm1, %xmm0 \]
  \[ \text{movddup} \{ %rax \}, %xmm0 \]

You Are Here!

- Parallel Requests
  - Assigned to computer
  - e.g., Lookup, Ads

- Parallel Threads
  - Assigned to core
  - e.g., Search, "Data"

Review

- Intel SSE SIMD Instructions
  - One instruction fetch that operates on multiple operands simultaneously
- SSE Instructions in C
  - Can embed the SEE machine instructions directly into C programs through the use of intrinsics
Parallel Processing: Multi-Processor Systems (MIMD)

- **Multi-Processor (MIMD):** a computer system with at least 2 processors

1. Deliver high throughput for independent jobs via job-level parallelism
2. Improve the runtime of a single program that has been specially crafted to run on a multi-processor - a parallel processing program

Now use term *core* for processor (“Multi-core”) because “Multi-Processor Microprocessor” too redundant

**Multiprocessors and You**

- Only path to performance is parallelism
  - Clock rates flat or declining
  - SIMD: 2X width every 3-4 years
    - 128b wide now, 256b in 2011, 512b in 2014, 1024b in 2018?
  - MIMD: Add 2 cores every 2 years: 2, 4, 6, 8, 10, ...
- A key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increase – i.e., that scale
  - Scheduling, load balancing, time for synchronization, overhead for communication
- Will explore this further in labs and projects

**Parallel Performance Over Time**

<table>
<thead>
<tr>
<th>Year</th>
<th>Cores</th>
<th>SIMD bits /Core</th>
<th>Core * SIMD bits</th>
<th>Peak DP FLOPs</th>
</tr>
</thead>
<tbody>
<tr>
<td>2003</td>
<td>2</td>
<td>128</td>
<td>256</td>
<td>4</td>
</tr>
<tr>
<td>2005</td>
<td>4</td>
<td>128</td>
<td>512</td>
<td>8</td>
</tr>
<tr>
<td>2007</td>
<td>6</td>
<td>128</td>
<td>768</td>
<td>12</td>
</tr>
<tr>
<td>2009</td>
<td>8</td>
<td>128</td>
<td>1024</td>
<td>16</td>
</tr>
<tr>
<td>2011</td>
<td>10</td>
<td>256</td>
<td>2560</td>
<td>40</td>
</tr>
<tr>
<td>2013</td>
<td>12</td>
<td>256</td>
<td>3072</td>
<td>48</td>
</tr>
<tr>
<td>2015</td>
<td>14</td>
<td>512</td>
<td>7168</td>
<td>112</td>
</tr>
<tr>
<td>2017</td>
<td>16</td>
<td>512</td>
<td>8192</td>
<td>128</td>
</tr>
<tr>
<td>2019</td>
<td>18</td>
<td>1024</td>
<td>18432</td>
<td>288</td>
</tr>
<tr>
<td>2021</td>
<td>20</td>
<td>1024</td>
<td>20480</td>
<td>320</td>
</tr>
</tbody>
</table>

**Multi-Processor Key Questions**

- **Q1** – How do they share data?
- **Q2** – How do they coordinate?
- **Q3** – How many processors can be supported?

**Shared Memory Multi-Processor (SMP)**

- **Q1** – Single address space shared by all processors/cores
- **Q2** – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time
- All multicore computers today are SMP
Example: Sum Reduction

- Sum 100,000 numbers on 100 processor SMP
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition 1000 numbers per processor
  - Initial summation on each processor
    \[ \text{sum}[Pn] = 0; \]
    \[ \text{for } i = 1000^*Pn; \]
    \[ i < 1000^*(Pn+1); \]
    \[ i = i + 1 \]
    \[ \text{sum}[Pn] = \text{sum}[Pn] + A[i]; \]
- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then quarter...
  - Need to synchronize between reduction steps

```c
half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    \[ \text{sum}[0] = \text{sum}[0] + \text{sum}[\text{half-1}]; \]
    /* Conditional sum needed when half is odd; Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half) \text{sum}[Pn] = \text{sum}[Pn] + \text{sum}[Pn+half];
until (half == 1);
```

An Example with 10 Processors

- What if?
  - Processors 1 and 2 read Memory[1000] (value 20)
  - Processor 0 writes Memory[1000] with 40
Keeping Multiple Caches Coherent

- Architect’s job: shared memory => keep cache values coherent
- Idea: When any processor has cache miss or writes, notify other processors via interconnection network
  - If only reading, many processors can have copies
  - If a processor writes, invalidate all other copies
- Shared written result can “ping-pong” between caches

How Does HW Keep $Coherent$?

- Each cache tracks state of each block in cache:
  1. Shared: up-to-date data, other caches may have a copy
  2. Modified: up-to-date data, changed (dirty), no other cache has a copy, OK to write, memory out-of-date

2 Optional Performance Optimizations of Cache Coherency via new States

- Each cache tracks state of each block in cache:
  3. Exclusive: up-to-date data, no other cache has a copy, OK to write, memory up-to-date
  - Avoids writing to memory if block replaced
  - Supplies data on read instead of going to memory
  4. Owner: up-to-date data, other caches may have a copy (they must be in Shared state)
  - Only cache that supplies data on read instead of going to memory

Name of Common Cache Coherency Protocol: MOESI

- Memory access to cache is either
  - Modified (in cache)
  - Owned (in cache)
  - Exclusive (in cache)
  - Shared (in cache)
  - Invalid (not in cache)

So, In Conclusion...

- Sequential software is slow software
  - SIMD and MIMD only path to higher performance
- SSE Intrinsics allow SIMD instructions to be invoked from C programs
- Multiprocessor (Multicore) uses Shared Memory (single address space)
- Cache coherency implements shared memory even with multiple copies in multiple caches
  - More on this next time