# CS 61C: Great Ideas in Computer Architecture # The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 3/08/2013 Spring 2013 -- Lecture #19 #### Review of Last Lecture - Amdahl's Law limits benefits of parallelization - Request Level Parallelism - Handle multiple requests in parallel (e.g. web search) - MapReduce Data Level Parallelism - Framework to divide up data to be processed in parallel - Mapper outputs intermediate key-value pairs - Reducer "combines" intermediate values with same key /08/2013 Spring 2013 - Lecture #19 # Agenda • Flynn's Taxonomy • Administrivia • Data Level Parallelism and SIMD • Bonus: Loop Unrolling # Agenda • Flynn's Taxonomy Administrivia • Data Level Parallelism and SIMD Bonus: Loop Unrolling 3/08/2013 Spring 2013 -- Lecture #19 # Administrivia • HW3 due Sunday • Proj2 (MapReduce) to be released soon - Part 1 due 3/17 - Part 2 due 3/24 - Work in **partners**, preferably at least 1 knows Java • Midterms graded Collect after lecture today or from Lab TA next Spring 2013 - Lecture #19 week ## **SIMD Architectures** - Data-Level Parallelism (DLP): Executing one operation on multiple data streams - Example: Multiplying a coefficient vector by a data vector (e.g. in filtering) $$y[i] := c[i] \times x[i], 0 \le i < n$$ - Sources of performance improvement: - One instruction is fetched & decoded for entire operation - Multiplications are known to be independent - Pipelining/concurrency in memory access as well /08/2013 Spring 2013 — Lecture #19 # "Advanced Digital Media Boost" - To improve performance, Intel's SIMD instructions - Fetch one instruction, do the work of multiple instructions - MMX (MultiMedia eXtension, Pentium II processor family) - SSE (Streaming SIMD Extension, Pentium III and beyond) # for each f in array f = sqrt(f) for each f in array { load f to the floating-point register calculate the square root write the result from the register to memory } for every 4 members in array { load 4 members to the SSE register calculate 4 square roots in one operation write the result from the register to memory } SIMD ## **SSE Instruction Categories** for Multimedia Support Instruction category Unsigned add/subtract | Eight 8-bit or Four 16-bit Saturating add/subtract Eight 8-bit or Four 16-bit Max/min/minimum Eight 8-bit or Four 16-bit Average Eight 8-bit or Four 16-bit Shift right/left Eight 8-bit or Four 16-bit Intel processors are CISC (complicated instrs) • SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands Spring 2013 - Lecture #19 # Example: Image Converter (1/5) - Converts BMP (bitmap) image to a YUV (color space) image format: - Read individual pixels from the BMP image, convert pixels into YUV format - Can pack the pixels and operate on a set of pixels with a single instruction - Bitmap image consists of 8-bit monochrome pixels - By packing these pixel values in a 128-bit register, we can operate on 128/8 = 16 values at a time - Significant performance boost 3/08/2013 Spring 2013 -- Lecture #19 ## Example: Image Converter (2/5) - FMADDPS Multiply and add packed single precision floating point instruction ← CISC Instr! - One of the typical operations computed in transformations (e.g. DFT or FFT) $$P = \sum_{n=1}^{N} f(n) \times x(n)$$ 3/08/2013 Spring 2013 - Lecture #19 # Example: Image Converter (3/5) - FP numbers f(n) and x(n) in src1 and src2; p in dest; - C implementation for N = 4 (128 bits): ``` for (int i = 0; i < 4; i++) p = p + src1[i] * src2[i]; ``` 1) Regular x86 instructions for the inner loop: fmul [...] faddp [...] - Instructions executed: 4 \* 2 = 8 (x86) 3/08/2013 Spring 2013 -- Lecture #19 # Example: Image Converter (4/5) - FP numbers f(n) and x(n) in src1 and src2; p in dest; - C implementation for N = 4 (128 bits): ``` for (int i = 0; i < 4; i++) p = p + src1[i] * src2[i]; ``` 2) SSE2 instructions for the inner loop: ``` //xmm0=p, xmm1=src1[i], xmm2=src2[i] mulps %xmm1, %xmm2 // xmm2 * xmm1 -> xmm2 addps %xmm2, %xmm0 // xmm0 + xmm2 -> xmm0 - Instructions executed: 2 (SSE2) ``` 3/08/2013 Spring 2013 - Lecture #19 # Example: Image Converter (5/5) - FP numbers f(n) and x(n) in src1 and src2; p in dest; - C implementation for N = 4 (128 bits): ``` for (int i = 0; i < 4; i++) p = p + src1[i] * src2[i]; ``` 3) SSE5 accomplishes the same in **one** instruction: ``` fmaddps %xmm0, %xmm1, %xmm2, %xmm0 // xmm2 * xmm1 + xmm0 -> xmm0 // multiply xmm1 x xmm2 packed single, // then add product packed single to sum in xmm0 ``` 3/08/2013 Spring 2013 -- Lecture #19 ## Summary - Flynn Taxonomy of Parallel Architectures - SIMD: Single Instruction Multiple Data - MIMD: Multiple Instruction Multiple Data - SISD: Single Instruction Single Data - MISD: Multiple Instruction Single Data (unused) - Intel SSE SIMD Instructions - One instruction fetch that operates on multiple operands simultaneously - 128/64 bit XMM registers 3/08/2013 Spring 2013 - Lecture #19 # **BONUS SLIDES** You are responsible for the material contained on the following slides, though we may not have enough time to get to them in lecture. They have been prepared in a way that should be easily readable and the material will be touched upon in the following lecture. 3/08/2013 Spring 2013 -- Lecture #1 # Agenda - Flynn's Taxonomy - Administrivia - Data Level Parallelism and SIMD - Bonus: Loop Unrolling 3/08/2013 Spring 2013 -- Lecture #19 ### Data Level Parallelism and SIMD - SIMD wants adjacent values in memory that can be operated in parallel - Usually specified in programs as loops ``` for(i=0; i<1000; i++) x[i] = x[i] + s; ``` How can we reveal more data level parallelism than is available in a single iteration of a loop? Unroll the loop and adjust iteration rate 3/08/2013 Spring 2013 -- Lecture #19 # Looping in MIPS #### Assumptions: $\$s0 \rightarrow initial address (beginning of array)$ \$s1 → scalar value s $\$s2 \rightarrow termination address (end of array)$ ### Loop: ``` lw $t0,0($s0) addu $t0,$t0,$s1 ``` sw \$t0,0(\$s0) # store result addiu \$s0,\$s0,4 # move to next element bne \$s0,\$s2,Loop # repeat Loop if not done 3/08/2013 Spring 2013 – Lecture #19 34 ## **Loop Unrolled** Loop: lw \$t0,0(\$s0) addu \$t0,\$t0,\$s1 sw \$t0,0(\$s0) lw \$t1,4(\$s0) addu \$t1,\$t1,\$s1 sw \$t1,4(\$s0) lw \$t2,8(\$s0) addu \$t2,\$t2,\$s1 sw \$t2,8(\$s0) lw \$t3,12(\$s0) addu \$t3,\$t3,\$s1 sw \$t3,12(\$s0) addu \$t3,\$t3,\$s1 sw \$t3,2(\$s0) addu \$t3,\$t3,\$s1 #### NOTE: - 1. Using different registers eliminate stalls - Loop overhead encountered only once every 4 data iterations - 3. This unrolling works if loop\_limit mod 4 = 0 Spring 2013 -- Lecture #19 #### Loop Unrolled Scheduled Note: We just switched from integer instructions to single-precision FP instructions! Loop: lwc1 \$t0,0(\$s0) lwc1 \$t1.4(\$s0) 4 Loads side-by-side: lwc1 \$t2,8(\$s0) Could replace with 4 wide SIMD Load lwc1 \$t3,12(\$s0) add.s \$t0,\$t0,\$s1 add.s \$t1.\$t1.\$s1 4 Adds side-by-side: add.s \$t2.\$t2.\$s1 Could replace with 4 wide SIMD Add add.s \$t3,\$t3,\$s1 swc1 \$t0,0(\$s0) swc1 \$t1,4(\$s0) 4 Stores side-by-side: swc1 \$t2.8(\$s0) Could replace with 4 wide SIMD Store swc1 \$t3.12(\$s0) addiu \$s0.\$s0.16 bne \$s0,\$s2,Loop Spring 2013 - Lecture #19 # Loop Unrolling in C • Instead of compiler doing loop unrolling, could do it yourself in C: ``` for(i=0; i<1000; i++) x[i] = x[i] + s; Loop Unroll for(i=0; i<1000; i=i+4) { x[i] = x[i] + s; What is x[i+1] = x[i+1] + s; downside x[i+2] = x[i+2] + s; of doing x[i+3] = x[i+3] + s; this in C? } ``` # **Generalizing Loop Unrolling** - Take a loop of n iterations and perform a k-fold unrolling of the body of the loop: - First run the loop with k copies of the body floor(n/k) times - To finish leftovers, then run the loop with 1 copy of the body n mod k times - (Will revisit loop unrolling again when get to pipelining later in semester) 013 Spring 2013 - Lecture #19