# Review: Shared Memory Multiprocessor (SMP)

- Q1: Single address space shared by all processors/cores
- Q2: Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time
- All multicore computers today are SMP

CS61C L20 Thread Level Parallelism II, Garcia, Spring 2013

# Example: Sum Reduction

- Sum 100,000 numbers on 100 processor SMP
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition 1000 numbers per processor
  - Initial summation on each processor [Phase I]

        sum[Pn] = 0;
        for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
            sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums [Phase II]
  - Reduction: divide and conquer
  - Half the processors add pairs, then quarter, ...
  - Need to synchronize between reduction steps

# Three Key Questions about Multiprocessors

- Q3: How many processors can be supported?
- Key bottleneck in an SMP is the memory system
- Caches can effectively increase memory bandwidth/open the bottleneck
- But what happens to the memory being actively shared among the processors through the caches?

# Keeping Multiple Caches Coherent

- Architect's job: shared memory → keep cache values coherent
- Idea: When any processor has a cache miss or writes, notify other processors via interconnection network
  - If only reading, many processors can have copies
  - If a processor writes, invalidate all other copies
- Shared written result can "ping-pong" between caches

# How Does HW Keep \$ Coherent?
Each cache tracks state of each *block* in cache:

- Shared: up-to-date data, not allowed to write
  - Other caches may have a copy
  - Copy in memory is also up-to-date
- Modified: up-to-date, changed (dirty), OK to write
  - No other cache has a copy
  - Copy in memory is out-of-date, so must respond to read request
- Invalid: not really in the cache

# 2 Optional Performance Optimizations of Cache Coherency via New States

- Exclusive: up-to-date data, OK to write (change to Modified)
  - No other cache has a copy, copy in memory up-to-date
  - Avoids writing to memory if block replaced
  - Supplies data on read instead of going to memory
- Owner: up-to-date data, OK to write (if invalidate shared copies first, then change to Modified)
  - Other caches may have a copy (they must be in Shared state)
  - Copy in memory not up-to-date
  - So, owner must supply data on read instead of going to memory

http://youtu.be/Wd8qzqfPfdM

# Common Cache Coherency Protocol: MOESI (snoopy protocol)

- Each block in each cache is in one of the following states:
  - Modified (in cache)
  - Owned (in cache)
  - Exclusive (in cache)
  - Shared (in cache)
  - Invalid (not in cache)

Compatibility Matrix: allowed states for a given cache block in any pair of caches

[Figure: MOESI compatibility matrix]

# Cache Coherency and Block Size

- Suppose block size is 32 bytes
- Suppose Processor 0 is reading and writing variable X, and Processor 1 is reading and writing variable Y
- Suppose X is at location 4000 and Y at 4012
- What will happen?
- Effect called false sharing
- How can you prevent it?

# Dan's Laptop? `sysctl hw`

Be careful!
You can *change* some of these values with

    hw.ncpu: 2
    hw.byteorder: 1234
    hw.memsize: 8589934592
    hw.activecpu: 2
    hw.physicalcpu: 2
    hw.physicalcpu_max: 2
    hw.logicalcpu: 2
    hw.logicalcpu_max: 2
    hw.cputype: 7
    hw.cpusubtype: 4
    hw.cpu64bit_capable: 1
    hw.cpufamily: 2028621756
    hw.cacheconfig: 2 1 2 0 0 0 0 0 0 0
    hw.cachesize: 8321499136 32768 6291456 0 0 0 0 0 0 0
    hw.pagesize: 4096
    hw.busfrequency: 1064000000
    hw.busfrequency_min: 1064000000
    hw.busfrequency_max: 1064000000
    hw.cpufrequency_min: 3060000000
    hw.cpufrequency_max: 3060000000
    hw.cachelinesize: 64
    hw.l1icachesize: 32768
    hw.l1dcachesize: 32768
    hw.l2cachesize: 6291456
    hw.tbfrequency: 1000000000
    hw.packages: 1
    hw.optional.floatingpoint: 1
    hw.optional.mmx: 1
    hw.optional.sse: 1
    hw.optional.sse2: 1
    hw.optional.sse3: 1
    hw.optional.supplemer
    hw.optional.sse4_1: 1
    hw.optional.sse4_2: 0
    hw.optional.x86_64: 1
    hw.optional.aes: 0
    hw.optional.avx1_0: 0
    hw.optional.rdrand: 0
    hw.optional.f16c: 0
    hw.optional.enfstrg: 0
    hw.machine = x86_64

# And In Conclusion, ...

- Sequential software is slow software
  - SIMD and MIMD only path to higher performance
- Multiprocessor (Multicore) uses Shared Memory (single address space)
- Cache coherency implements shared memory even with multiple copies in multiple caches
  - False sharing a concern
- Next Time: OpenMP as simple parallel extension to C
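
As a recap of the Sum Reduction example and the false-sharing concern, the two phases could be sketched in C with POSIX threads roughly as follows. This is a minimal sketch, not the lecture's own code: the names `parallel_sum`, `worker`, `padded_sum_t` and `worker_arg_t` are hypothetical, `NUM_THREADS` is assumed to be a power of two (4 here instead of the 100 processors on the slide), and `CACHE_LINE` is assumed to be 64 bytes. Each partial sum is aligned to its own cache block so that threads updating neighbouring entries do not falsely share a block, and a barrier provides the synchronization between reduction steps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumption: power of two; the slide uses 100 processors */
#define CACHE_LINE 64   /* assumption: cache block size in bytes */

/* One partial sum per thread, aligned so each entry occupies its own
   cache block (avoids false sharing between neighbouring entries). */
typedef struct {
    _Alignas(CACHE_LINE) long value;
} padded_sum_t;

/* Per-thread arguments: processor ID (Pn), input array, and its length. */
typedef struct {
    int id;
    const long *a;
    size_t n;
} worker_arg_t;

static padded_sum_t sum[NUM_THREADS];
static pthread_barrier_t barrier;

static void *worker(void *p) {
    const worker_arg_t *arg = p;
    int pn = arg->id;
    size_t chunk = arg->n / NUM_THREADS;
    size_t begin = chunk * (size_t)pn;
    /* The last thread also takes any leftover elements. */
    size_t end = (pn == NUM_THREADS - 1) ? arg->n : begin + chunk;

    /* Phase I: each thread sums its own partition. */
    long local = 0;
    for (size_t i = begin; i < end; i++)
        local += arg->a[i];
    sum[pn].value = local;

    /* Phase II: tree reduction. Half the threads add pairs, then a
       quarter, and so on. The barrier synchronizes each step. */
    for (int half = NUM_THREADS / 2; half >= 1; half /= 2) {
        pthread_barrier_wait(&barrier);
        if (pn < half)
            sum[pn].value += sum[pn + half].value;
    }
    return NULL;
}

/* Sums a[0..n-1] using NUM_THREADS threads; the total ends up in sum[0]. */
long parallel_sum(const long *a, size_t n) {
    pthread_t threads[NUM_THREADS];
    worker_arg_t args[NUM_THREADS];

    int rc = pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    assert(rc == 0);
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t] = (worker_arg_t){ t, a, n };
        rc = pthread_create(&threads[t], NULL, worker, &args[t]);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&barrier);
    (void)rc;
    return sum[0].value;
}
```

Calling `parallel_sum(A, 100000)` on an array `A` of `long` values would return the total. Removing the `_Alignas` from `padded_sum_t` would leave the result unchanged but place several partial sums in one cache block, which is the false-sharing situation described above.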