# Wawrzynek & Weaver CS 61C Sp 2018 Great Ideas in Computer Architecture Final Exam | Print your name:(last) | | |----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------| | (1ast) | (1115t) | | · - | of Student Conduct and acknowledge that any reported to the Center for Student Conduct and am aware that Nick believes in retribution. | | Sign your name: | | | Print your class account login: cs61c | and SID: | | Your TA's name: | | | Number of exam of person to your left: | Number of exam of person to your right: | You may consult four sheets of notes (each double-sided). You may not consult other notes, textbooks, etc. Calculators, computers, and other electronic devices are not permitted. Please write your answers in the spaces provided in the test. You have 170 minutes. There are 13 questions, of varying credit (180 points total). The questions are of varying difficulty, so avoid spending too long on any one question. Parts of the exam will be graded automatically by scanning the **bubbles you fill in**, so please do your best to fill them in somewhat completely. Don't worry—if something goes wrong with the scanning, you'll have a chance to correct it during the regrade period. If you have a question, raise your hand, and when an instructor motions to you, come to them to ask the question. Do not turn this page until your instructor tells you to do so. | Question: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | Total | |-----------|----|----|----|----|---|----|----|----|----|----|----|----|----|-------| | Points: | 15 | 10 | 14 | 14 | 9 | 15 | 18 | 10 | 13 | 18 | 16 | 20 | 8 | 180 | # Problem 1 [MT1-1] Number Rep (15 points) Answer the following questions about number representation: (a) Unsigned Base 4 (i) What is the range that a 4 digit unsigned base 4 number can represent? Write the bounds in decimal. (ii) Convert 107<sub>10</sub> to unsigned base 4. \_\_\_\_ (b) Signed Base 4 (i) Suppose we wanted to use a bias in order to represent negative numbers in base4. If we are working with a 4 digit base 4 number, what should we choose as our bias? (Our bias should create equal amounts of negative and positive numbers for our range. If this is not possible, select a bias that will result in 1 more negative number than positive numbers). Express your answer in decimal. Bias = - \_\_\_\_\_ (ii) Suppose rather than using a bias notation, we decide to do the following. For each base 4 number, we will reserve the most significant digit to strictly be used as a sign bit. A digit value of 1 will indicate a negative number, and a digit value of 0 will indicate a positive number. Any other values will result in an invalid number. For instance: $0003_4 = +3$ $1003_4 = -3$ $2003_4 = Invalid$ How many valid numbers can we represent with a 4 digit base 4 number using this scheme? (c) Given the following function in C: int shifter(int x, int shift) { if (x > 0) { return x >> shift; return -1 \* (x >> shift);Given y is a negative integer, and that shifter(y, 2) outputs 4, what is the range of values of y? hint: -8 >> 1 = -4(d) Implement the function unsigned int base\_convert(unsigned int num, unsigned int base). This function takes in non-negative integers num and base. You are guaranteed the following: - base is an integer in the range [2, 10], no need to error check this - num is comprised of "digits" with a value between 0 and base - 1. - All values fit inside an unsigned int. Your job is to make it so the function returns the decimal value of num in base base. For example, base\_convert(30, 4) would return 12, since $30_4$ is $12_{10}$ . You may not use additional lines (do not put multiple lines on the same line via;) but you may not need all the lines provided. unsigned int base\_convert(unsigned int num, unsigned int base) { unsigned int value = \_\_\_\_\_, power = \_\_\_\_\_; while (\_\_\_\_\_) { You are working on an e-commerce platform. Internally, orders are tracked through a struct called order\_t. Your task is to write a function to allocate and initialize a new order. There's a catch though! This platform must be robust to errors, so you are required to return an error value from this function in addition to the newly allocated order. The possible errors are defined for you as preprocessor directives. - (a) Write new\_order: Fill in the following code. Keep in mind the following requirements: - You must return BAD\_ARG if any inputs are invalid. The criteria for valid arguments is: - Unit price should be positive (no negative prices) - An order cannot be for more than MAX\_ORDER items - Inputs must not cause your function to crash (or execute undefined behavior) - You must return NO\_MEM if there are any errors while allocating memory - The tax rate is always initialized to TAX\_RATE - If your function returns OK, then new points to a valid struct that has been initialized with the provided values. | | <pre>/* Allocate and initialize a new order */ int new_order(order_t **new, int quantity, double unit_price) { /* Validate Arguments */</pre> | |----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | | /* Allocate "new" */ | | | /* Initialize "new" */ | | | return OK; | | b) | Calling new_order: How would you use new_order() to allocate and initialize blue_monday with a quantity of 10 and a unit price of 3.50 in the example below? order_t *blue_monday; double total; | | | <pre>ret_t ret; /* Fill in the arguments to new_order here */</pre> | | | ret = new_order(,); | | | <pre>if (ret == OK) { printf("Total: %lf\n",</pre> | # Problem 3 /MT1-3/RISCY (14 points) The function RISCY is known to take in two arguments, in a0 and a1. (a) Fill in the blanks such that the code below executes properly and evokes ecall to print the value in register s1. You may assume that ecall is a function that takes in two arguments a0 and a1. When a0 is 1, it prints the value in register a1. | RISCY: | # Prologue | | |---------|--------------------------------------------------------------------|------| | | | | | | | | | | | | | | | | | | addi s0, x0, 1 | | | | add s1, x0, x0 | | | Loop: | addi a0, a0, 4 | | | | beq a1, s0, Ret | | | | lw t1, -4(a0) | | | | lw t2, 0(a0) | | | | sub t1, t1, t2 | | | | bge t1, x0, Cont | | | | neg t1, t1 | | | Cont: | blt t1, s1, next | | | | mv s1, t1 | | | next: 3 | # print value in s1 for debugging purpose. | | | | SW | | | | SW | | | | addi a0, x0, 1 | | | | mv a1, s1 | | | | ecall # ecall takes in a0(=1 for print) and a1(=register to print) | int) | | | lw | | | | lw | | | | addi s0, s0, 1 | | | Do+ | j Loop | | | Ret | mv a0, s1 | | | | # Epilogue | | | | | | | | | | | | | | | | | | | | | | | | | | | | ir ra | | | Convert the RIS<br>Assume mv and n | | | | | | |------------------------------------|------------|-------------|-----------------|--------------------|---------| | the fields below. | 0 1 | | 1 | v | v | | | | | | | | | imm[12,10:5] | rs2 | rs1 | func3 | imm[4:1,11 | opcoo | | | | | | | | | Translate RISCY | into C cod | o Vou may c | ur may not need | l all of the lines | nrovide | | below. You can | | | | | | | takes in one argu | | | | _ | | | void printint | (int x); | | | | | | | | | | | | | | | | | | | | RISC | CY ( | a0, | a1) { | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | <u></u> | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ``` Problem 4 [MT1-4] CALL ``` (14 points) ``` #include <stdio.h> int main() { int i, sum = 0; for (i = 100; i !=0; i--) sum = sum + i * i; printf ("The sum of sq from 100 .. 1 is %d\n", sum); } ``` Consider the following C code and assembly code: | Address | Assembly | |--------------|-------------------------------------------------| | 0x80 | str: .string "The sum of sq from 100 1 is %d\n" | | | .text | | | main: | | 0x00 | | | 0x04 | | | 0x08 | mv a1, x0 | | 0x0c | li t1, 100 | | 0x10 | j check | | | loop: | | 0x14 | mul t2, t1, t1 | | 0x18 | add a1, a1, t2 | | 0x1c | addi t1, t1, -1 | | | | | | check: | | 0x20 | bnez t1 loop | | 0x24 | la a0, str | | 0x28 | jal printf # from stdio | | 0x2c | mv a0, x0 | | 0x30<br>0x34 | | | 0x34<br>0x38 | ret | | 0230 | 160 | Figure 1: Assembly Code with Address (a) Please fill in all lines in the above assembly code. | (b) | · - | do-instructions are eudo-instruction. | re in the | given | assembly code? ( | Count each occur- | |-----|----------------------------------|---------------------------------------|-----------|---------|------------------------------------------|--------------------| | | O 4 | | | 0 7 | 7 | | | | O 5 | | | 0 8 | 3 | | | | O 6 | | | 0 9 | ) | | | (c) | Create the symb | ool table and relo | cation to | able. | | | | | | Label | | Addı | ress | | | | | main | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | l | | | l | | | | | | Instruction | Addres | SS | Dependency | | | | | la a0, str | | | str | | | | | | | | | | | | | | | | | | | | | | | | | | | (d) | | | | | cheir immediate va<br>ur answer in decir | | | (e) | The assembler dresses. | takes two passes | over the | e code | e to resolve PC-R | elative target ad- | | | O True | | | 0 1 | False | | | (f) | The absolute ta | rget addresses ca | n be rese | olved a | at the assembler s | tage. | | | O True | | | 0 1 | False | | | (g) | Interpreted code | e should run faste | er than t | he cor | mpiled code. | | | | O True | | | O 1 | False | | | (h) | The compiler stifile to compile. | ill needs to go thr | ough the | e linke | r stage even if we o | only have 1 source | | | O True | | | O 1 | False | | ## Problem 5 [MT2-1] Circuits and Timing (9 points) In this question, you will be working with a circuit that takes in three 8-bit inputs. For all parts, assume the delays below: $$t_{clk-to-q} = 3 \text{ps}, \quad t_{setup} = 4 \text{ps}, \quad t_{shifter} = 1 \text{ps}$$ $t_{adder} = 5 \text{ps}, \quad t_{multiplier} = 6 \text{ps}, \quad t_{subtracter} = 4 \text{ps}$ Furthermore, assume that the inputs A, B, and C take on their new values exactly at the rising edge of every clock cycle and that all registers are initialized to zero. Figure 2: Non-pipelined circuit (a) What is the maximum possible hold time that still ensures the correctness of the non-pipelined circuit in figure 1? (Select only one) O 1ps O 5ps O 3ps **O** 7ps O 4ps (b) What is the minimum possible clock period that still ensures the correctness of the non-pipelined circuit in figure 1? You may assume that for this question that all flip-flops have a 0ps hold time requirement. (Select only one) O 13ps O 20ps **O** 16ps O 23ps Now consider the pipelined version of the circuit (shown below). You will be using this circuit for the remaining part of the question. All delays remain the same. You may assume that the hold time is 0ps for the following questions. Figure 3: Pipelined circuit | | | Figure 3: Pipenne | ea ci | rcuit | |-----|---|---------------------------------------------------------------------|-------|------------------------------------------| | (c) | | at is the minimum clock period of the period circuit's correctness? | pipel | lined circuit in figure 2 that maintains | | | 0 | 10ps | 0 | 15ps | | | 0 | 13ps | 0 | 18ps | | (d) | | v long does it take to compute the out<br>k period is 11ps. | put : | for a given set of inputs? Assume the | | | 0 | 22ps | 0 | 42ps | | | 0 | 27ps | 0 | 47ps | | | 0 | 28ps | 0 | 50ps | | | 0 | 31ps | 0 | other: | # Problem 6 [M2-2] Read and Write (15 points) Recall in class we learned that we can optimize our CPU pipeline by having register writes then reads within the same cycle. Let's call this implementation **write-read**. Consider a new implementation where register reads happen before register writes within the same cycle. Let's call this implementation **read-write**. Now consider the following RISC-V code and answer the following questions about a 5-stage RISC-V pipeline. **Assume no forwarding and no branch prediction**. You are given that there needs to be at least one stall after line 4 for both implementations. | | loop | : | | |---|------|------------|---------| | 1 | slli | t0 | a1 2 | | 2 | or | t2 | al tl | | 3 | add | t0 | t0 a0 | | 4 | lw | t1 | 4 (t0) | | 5 | beq | t1 | x0 loop | | 6 | addi | t2 | t2 5 | | 7 | sw | t2 | 8(t0) | | 8 | add | <b>a</b> 0 | t2 x0 | (a) Consider the code above and the **write-read** implementation. Which lines should be followed by a stall to guarantee correctness? (You are given that there needs to be at least one stall after line 4). For example, if an instruction on line A causes an instruction on line B to stall, bubble A. | 0 | 1 | 0 | 5 | |---|---|---|---| | 0 | 2 | 0 | 6 | | 0 | 3 | 0 | 7 | | | 4 | 0 | 8 | (b) Still considering the **write-read** implementation, how many stalls are needed before the instruction on line 5 executes (do not include any stalls that occur after this point)? | (c) | | v consider the <b>read-write</b> implementation instruction on line 5 executes (do not not)? | | • | |-----|-----|----------------------------------------------------------------------------------------------|------|-----------------------------| | | | | | | | (d) | bot | eall that you are given that there need h implementations. What type of hazer line 4? | | | | | 0 | Structural | 0 | Control | | | 0 | Data | | | | (e) | For | write-read, how many stalls do we n | ieed | between lines 4 and 5? | | | 0 | 1 | 0 | 3 | | | 0 | 2 | 0 | 4 | | (f) | For | read-write, how many nops do we no | eed | between lines 4 and 5? | | | 0 | 1 | 0 | 3 | | | 0 | 2 | 0 | 4 | | (g) | | re decide to reorder instructions, which op after line 4? Choose the line number | | _ | | | 0 | 1 | 0 | 5 | | | 0 | 2 | 0 | 6 | | | 0 | 3 | 0 | 7 | | | 0 | 4 | 0 | None - no reordering needed | | | | | | | ## Problem 7 /M2-3/ \$\$\$ (18 points) Assume we have a single L1 data cache having the following characteristics: - 4 KiB cache size - 16 byte blocks - Direct Mapped Assume the following piece of code is run in a 32-bit address space with sizeof(int) = 4: (a) Calculate the number of tag, index, and offset bits for this L1 cache. | Tag: | Index: | Offset: | |------|--------|---------| | | | | - (b) Now, what is the hit rate for Loop 1 in the data cache? Assume that we start with a cold cache from the start of main(). - (c) What is the hit rate for Loop 2 given that the cache is NOT reset after Loop 1? - (d) Assume that accessing memory takes 100 cycles, accessing data that is in the cache takes 5 cycles, Also assume for this part that Loop 1's hit rate is 60% and Loop 2's hit rate is 75%, which may or may not be the correct hit rates. What is the average memory access time (AMAT) in cycles for (Please reduce fractions): - (i) Loop 1: Final Exam Page 14 of 26 CS61C – SP 18 | ( | ii) Overall (an expression of REDUCED fractions is alright. You may use "T1" as the Loop 1 AMAT and "T2" as the Loop 2 AMAT in your calculation of this value): | |-------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | | we add in a L2 cache with the following characteristics: 16 KiB cache size 16 byte blocks Fully Associative | | We re | -run main() with cold L1 and L2 caches. | | (e) ' | What would be the local MISS rate of the L2 cache for Loop 1? | | | What would be the local MISS rate of the L2 cache for Loop 2? Assume the caches re NOT reset after Loop 1. Also, don't take into account the miss rate for Loop when calculating Loop 2. | (ii) Loop 2: | Problem | 8 | M2-4 | Datapath | |-------------------|---|------|-----------| | I I O O I O I I I | _ | 1 | - acapaci | (10 points) Recall the standard 5-stage, single cycle datapath contains stages for Instruction Fetch, Decode, Execute (ALU), Memory, and Write-back. Datapath designers are interested in reducing the phases necessary for execution such that instead of accessing both the Execute (ALU) phase and the Memory phase, instructions access either one or the other, but not both. This would create a 4-stage, single cycle datapath with the following stages: Instruction Fetch, Decode, Execute OR Memory, and Write-back. | Instr Fetch | Instr Decode | Execute (ALU) | Memory | Write-back | |-------------|--------------|---------------|--------|------------| | 100ps | 150ps | 200ps | 350 ps | 150ps | | (a) | Given the table above and the described | d datapath | above, | what is | s the time | it takes | |-----|--------------------------------------------|-------------|--------|---------|------------|----------| | | for a single instruction that utilizes all | l stages to | execut | e on th | e typical | 5-stage, | | | single cycle datapath? | | | | | | | (b) | What is the time it | takes for | a single | instruction | that | utilizes | all sta | ages to | execute | |-----|---------------------|-----------|------------|-------------|------|----------|---------|---------|---------| | | on the new 4-stage, | single cy | rcle dataj | path? | | | | | | | (c) | If the designers go | ahead with this | modification, | which instr | ructions w | ill NOT i | func- | |-----|---------------------|------------------|---------------|-------------|-------------|-----------|-------| | | tion correctly? W | hy? Please limit | your answer t | o two sente | ences or le | ess. | | | | | | | | | | | | (d) | Propose a program-level modification that will fix the issue. Do modification to the datapath. Please describe your modification is | | |-----|-------------------------------------------------------------------------------------------------------------------------------------|--| | | or less. | | | | | | | | | | | | | | Final Exam Page 16 of 26 CS61C – SP 18 | | For 16-bit half-precision floating point, how many different valid representation are there for NaN? | |---|--------------------------------------------------------------------------------------------------------| | | What is the smallest non-infinite number it can represent? You can leave you answer as an expression. | | | What is the largest non-infinite number it can represent? You can leave your an swer as an expression. | | , | How many floating point numbers are in the interval $[1, 2)$ (including 1 but excluding 2)? | IEEE 754-2008 introduces half precision, which is a binary floating-point representation that uses 16 bits: 1 sign bit, 5 exponent bits (with a bias of 15) and 11 significand bits. This format uses the same rules for special numbers that IEEE754 uses. Considering (13 points) Problem 9 | F-1 | Floating Point SIMD within a Register (SWAR) in RISCV: You are planning to obfuscate some messages before they get released to the world. Instead of doing it properly (via encryption), you want a simpler implementation. Your first idea is to add 1 to each character, e.g. turning "aabb" into "bbcc"? If we apply this method to "a quick brown fox jumps over the lazy dog", it becomes "b!rvjdl!cspxo!gpy!kvnqt!pwfs!uif!mb{z!eph"! It looks promising. And the implementation is plain and simple (both in C and RISCV): ``` void obfuscate(char* d, size_t n) { obfuscate: for (int i = 0; i < n; i++) { begz a1, END d[i] += 1; add t0, x0, x0 LOOP: } lb t1, 0(a0) addi t1, t1, 1 sb t1, 0(a0) addi t0, t0, 1 addi a0, a0, 1 blt t0, a1, LOOP END: ret ``` Things look great so far. Then you realize that you learned all kinds of crazy techniques to speedup a function in CS61C and this looks very similar to the many SIMD examples you have seen. Although RISCV does have SIMD instructions via a vector extension set, we want to implement our own version of RISCV SIMD. Our idea is to pack multiple characters into a single 32-bit integer. In fact, we do not even need to load and pack the data: four characters have the same width of an integer. Assume d is word aligned and that all input characters in the message are less than 254. Also assume n is the number of characters in the message and that register a0 holds the value of d and register a1 holds the value of n. (a) Complete the following implementation for a vectorized version of obfuscate: void obfuscate\_vec(char\* d, size\_t n) { ``` for (int i = 0; i < ____; ____) { *( (int*) (d + i) ) += INC; } /* handle tail cases */ for (int i = ____; i < n; i++) { d[i] += 1; }</pre> ``` (b) Refer to the constant INC in the code above. What should the value of INC be such that obfuscate\_vec works correctly? Write your answer in hexadecimal. ``` int INC = _____ ``` } (c) **Loop Unrolling:** You can optimize this procedure further! Loop unrolling is supposed to reduce the number of branch instructions. Complete the following: ``` void obfuscate_vec_unroll(char* d, size_t n){ obfuscate_vec_unrolled: begz a1, END int i; for (i = 0; i < _____; ____) { add t2, a1, x0 *((int*) (d + i)) += INC; *((int*) (d + i + 4)) += INC; srl t2, t2, ___ } sl1 t2, t2, _ /* handle tail cases */ beqz t2, TAIL add t0, x0, x0 for (int i = _; i < n; i++) { li t3, INC d[i] += 1; LOOP_VEC: } add t1, t1, t3 addi a0, a0, __ addi t1, t1, ___ addi a0, a0, addi t0, t0, blt t0, t2, LOOP_VEC TAIL: add t0, t2, x0 LOOP_TAIL: lbu t1, 0(a0) addi t1, t1, 1 sb t1, 0(a0) addi t0, t0, 1 addi a0, a0, 1 blt t0, a1, LOOP_TAIL END: ret ``` (d) Given a message of length n characters, how many instructions are needed after loop unrolling? Express your answer in terms of n, such as 3n + 4. In addition, what is the speed up when n is approaching infinity in comparison to the **original** non-optimized function obfuscate? Count pseudo-instructions as 1 instruction. You do not need to simplify your expressions. | # of Instructions: | S | Speedup: | |--------------------------|---|----------| | // OI IIID OI GC OIOIID. | | ,pecaap: | (e) You decide to further improve the code with thread parallelism using 4 threads! Fill in proper OpenMP directive to the blank below: void obfuscate\_vec\_unroll(char\* d, size\_t n) { ``` // insert OpenMP directive here for (int i = 0; i < <code from part(c)>; <code from part(c)>) { *((int*) (d + i)) += INC; *((int*) (d + i + 4)) += INC; } /* handle tail cases */ /* Location A */ for (int i = <code from part(c)>; i < n; i++) { d[i] += 1; }</pre> ``` Someone tells you to add #pragma omp parallel at Location A in the code above. If you do this, which statement is true about the second for loop (the tail case)? O Always Incorrect O Always Correct, slower than serial O Sometimes Correct - O Always Correct, faster than serial - (f) Denote the speedup when n is approaching infinity of obfuscate\_vec\_unroll from part (d) as "S". Suppose the overhead of running OpenMP is negligible in comparison to the rest of the code, and we can run four threads, what is the maximum speed up compared to the original non optimized function obfuscate? - (g) WSC and Amdahl's Law: The above program now runs in the cloud with many machines. obfuscation\_vec\_unroll is 90% of all execution (AFTER applying SWAR, unrolling, and OpenMP), and obfuscation\_vec\_unroll can be parallelized across machines. - (i) If we run obfuscate\_vec\_unroll on a cluster of 16 machines, what is the speedup? You may leave your answer as an expression. - (ii) What is the maximum possible speedup we can achieve if we have an unlimited number of machines? Final Exam Page 20 of 26 CS61C – SP 18 ## Problem 11 | F-3 | Pikachu Learns Spark (16 points) We are given the entire dataset of every Pokémon and we want to find the mean of all Pokémon id numbers by type. Some Pokémon have dual types so that Pokémon's id number will contribute to the average total of both types. For example, Kyurem is both a dragon and an ice type so his id number will contribute to both type's sum when considering the average. Fill in the blanks for the Python code below. Use the following Spark Python functions when necessary: map, flatMap, reduce, reduceByKey. | Sample input (pokemon_id, pokemon_name, pokemon_types): 646 Kyurem Dragon Ice 25 Pikachu Electric 257 Blaziken Fire Fighting | | |------------------------------------------------------------------------------------------------------------------------------------------------------|----| | Sample output (Type, Number): (Dragon, 587) (Electric, 412) | | | <pre>def parseLine(line): tokens = line.split(" ") types = tokens[2:] results = [] for type in types: results.append((</pre> | )) | | <pre>def reduceFunc(v1, v2):</pre> | | | return | | | <pre>def average(k, v):</pre> | | | return | | | <pre>pokemonData = sc.parallelize(pokemon)</pre> | | | <pre>out = pokemonData.flatMap()</pre> | | | ( | ) | | ( | ) | | Der<br>of ca | nan<br>achii<br>the j | ng in computer syste | part of a process' mes. If we think of he? Assume a mac | ma | (20 points) nory on disk) is yet another example in memory as a cache for disk, what e with 64 bit addresses, 16KB pages, | |--------------|-----------------------|-------------------------------------------------|---------------------------------------------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------| | (a) | Ass | sociativity? | | | | | | 0 | Direct Mapped | | 0 | Fully Associative | | | 0 | N-Way Set Associa | tive | | | | (b) | Blo | ock size: | | | | | (c) | of t | the most significant b<br>of the field. For exa | it of the field and N<br>mple, if the tag cor | l is<br>nsis | form $[N:M]$ where N is the bit number the bit number of the least significant sts of the first 4 least-significant bits, able to paging, you may write "N/A". | | | Tag | g bits: | Index bits: | | Offset bits: | | (d) | Wr | ite policy? | | | | | | 0 | Write Through | | 0 | Write Back | | (e) | All | ocation policy? | | | | | | 0 | Write Allocate | | 0 | Write No Allocate | | a m<br>wha | yste<br>t to | ry constant called T<br>) and that arr will a | . You may assume lways have enough | th | sterious summation function. It uses<br>at T is defined (but you don't know<br>ments (the function will never access<br>with the following properties: | | • | 64 | bit addresses | | • 4 | byte words | | • | 4K | iB pages | | • 4 | GiB of main memory | | • | 1M | iB fully-associative of | eache with 64 byte | blo | cks | | • | 2 e | ntry fully associative | TLB | | | | • | 4 le | evel page table with | 8 byte entries | | | | • | Th | e OS uses LRU when | paging to disk | | | ``` #define NITER 10*1024*1024 #define T ??? // see below int MysterySum(int *arr) { int i = 0; int sum = 0; for(; i < NITER / 2; i++)</pre> int p = (i \% T)*4096; int b = i \% 4096; sum += arr[p + b]; } /* Timer starts here*/ for(; i < NITIR; i++) {</pre> int p = (i \% T)*4096; int b = i \% 4096; sum += arr[p + b]; /* Timer ends here */ return sum; } ``` #### (f) Performance of T Rank the following values of T based on how fast the second loop only executes (assuming the first loop has already ran). You should state whether pairs of values are < or =. For example, you should write 1 < 2 if T=1 causes the second loop to run strictly slower than T=2. Likewise, you could write 8=2 if 8 is about as fast as 2. T = 1, 2, 3, 4 #### (g) System Design What system parameter would you change in order to maximize system performance for T=27. You must mark only one of the following (pick the one with the largest performance gain): | 0 | Address Size | 0 | Cache Block Size | |---|------------------|---|-----------------------| | 0 | Page Size | 0 | TLB Capacity | | 0 | Word Size | 0 | TLB Associativity | | 0 | Main Memory Size | 0 | Page Table Depth | | 0 | Cache Capacity | 0 | Page Table Entry Size | ## (h) Page Table Walk Given the list of virtual addresses, find the corresponding physical addresses. For each address, you must also note whether the access was a TLB hit, Page Table hit, or Page Fault (by writing yes/no for each). If the access is a page fault, you should leave the PPN and PA fields blank. Do not add this entry to the TLB. Our virtual memory space has 16-byte pages and maintains a fully-associative, two-entry TLB with LRU replacement. The page table system is hierarchical and has two levels. The two most-significant bits of the VPN index the L1 table, and the two least-significant bits of the VPN index the L2 table. | Virtual Address | Virtual Page<br>Number | Physical Page<br>Number | Physical<br>Address | TLB Hit, Page<br>Table Hit, Page<br>Fault? | |-----------------|------------------------|-------------------------|---------------------|--------------------------------------------| | 0x10 | | | | | | 0x5C | | | | | | 0x39 | | | | | | 0x1F | | | | | | Page Table Base Register | 0x00 | |--------------------------|------| | | | #### Memory: | Address | Contents | | | |---------|----------|--|--| | 0x00 | 0x20 | | | | 0x04 | | | | | 0x08 | 0x10 | | | | 0x0C | | | | | 0x10 | | | | | 0x14 | 0x1C | | | | 0x18 | 0x28 | | | | 0x1C | | | | | 0x20 | | | | | 0x24 | 0x12 | | | | 0x28 | 0x09 | | | | 0x2C | 0x5C | | | #### TLB: | VPN | PPN | | |-----|-----|--| | | | | | | | | | Ansv | wer 1 | the following questions | | | | | | |------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|--------------------------------------|--|--|--| | (a) | | You have a computer that, well, stinks. It goes down on average 6 times a day and it takes 1 hour to get working again. What is the current system's availability? | | | | | | | | 0 | 0.5 | 0 | 0.7 | | | | | | 0 | 0.6 | 0 | 0.8 | | | | | (b) | Assume you have the computer from part (a) when the manufacturer offers you a deal. <b>a:</b> A new computer that only crashes 4 times per day or <b>b:</b> support that car reduce the time to fix to 6 minutes. Which one should you choose? | | | | | | | | | 0 | a | 0 | b | | | | | (c) | You have a processor that has a clock rate of 2GHz, a time to poll of 200 cycles for I/O, and you need to poll I/O at 100 Hz. If you use polling, what is the <b>percentag</b> of time you will need to spend polling? | | | | | | | | | 0 | 1% | 0 | 0.01% | | | | | | 0 | 0.1% | 0 | 0.001% | | | | | (d) | If t<br>Wh | he data comes in very infrequently do $y$ ? | o yo | u want to use interrupts or polling? | | | | | | 0 | interrupts | 0 | polling | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (8 points) Problem 13 [F-5] Potpourri Figure 4: Good Luck And Don't F\*\*\* It Up