## **MOESI Cache Coherency**

| state     | Cache up | Memory up | Others have | Can respond to | Can write without |
|-----------|----------|-----------|-------------|----------------|-------------------|
|           | to date? | to date?  | a copy?     | other's reads? | changing state?   |
| Modified  | Yes      | No        | No          | Yes, Required  | Yes               |
| Owned     | Yes      | Maybe     | Maybe       | Yes, Optional  | No                |
| Exclusive | Yes      | Yes       | No          | Yes, Optional  | No                |
| Shared    | Yes      | Maybe     | Maybe       | No             | No                |
| Invalid   | No       | Maybe     | Maybe       | No             | No                |

With the MOESI concurrency protocol implemented, accesses to cache accesses appear serializable. This means that the result of the parallel cache accesses appear the same as if there were done in serial from one processor in some ordering.

1. Consider the following access pattern on a two-processor system with a direct-mapped, write-back cache with one cache block and a two cache block memory. Assume the MOESI protocol is used, with write-back caches, write-allocate, and invalidation of other caches on write (instead of updating the value in the other caches).

|      |                   |                |                | Memory @ 0  | Memory @ 1  |
|------|-------------------|----------------|----------------|-------------|-------------|
| Time | After Operation   | P1 cache state | P2 cache state | up to date? | up to date? |
| 0    | P1: read block 1  | Exclusive (1)  | Invalid        | YES         | YES         |
| 1    | P2: read block 1  |                |                |             |             |
| 2    | P1: write block 1 |                |                |             |             |
| 3    | P2: write block 1 |                |                |             |             |
| 4    | P1: read block 0  |                |                |             |             |
| 5    | P2: read block 0  |                |                |             |             |
| 6    | P1: write block 0 |                |                |             |             |
| 7    | P2: read block 0  |                |                |             |             |
| 8    | P2: write block 0 |                |                |             |             |
| 9    | P1: read block 0  |                |                |             |             |

2. Consider if we run the following two loops in parallel (as two threads on two processors).

```
for(int i = 0; i < N; i += 2) array[i] += 1; for(int j = 1; j < N; j += 2) array[j] += 2;
```

| would we expect more,  | less, or the same | number of cache     | misses than if w    | e were to ru | n this serially |
|------------------------|-------------------|---------------------|---------------------|--------------|-----------------|
| (assume each processor | has its own cache | and all data is inv | valid to start with | n)?          |                 |

## Concurrency

1. Consider the following function:

- 2. A reader-writer lock is a lock which can either be obtained exclusively by one thread (a "write lock") or shared by an arbitrary number of threads (who shore a "read lock"). Consider implementing a reader-writer lock by choosing the following values for the lock (which will be one MIPS word):
  - 0: Unlocked
  - Positive number: read-locked; lock value is number of readers
  - -1: write-locked

Write MIPS assembly implementation of write\_lock, write\_unlock, read\_lock, and read\_unlock. When the lock cannot be obtained, have your functions loop until it becomes free.

## Summary of general speed-up techniques

- Data-Level parallelism / SIMD: compute multiple results at a time
- Thread-level parallelism / OpenMP: have multiple threads doing computations at a time
- I/D Cache locality (e.g. loop ordering, etc.): Maximize cache hits for higher speed
- Loop unrolling: minimizes for loop overheads
- Cache Blocking: increase cache usage for higher performance
- Code optimization (mostly compiler's job): interweave independent instructions to avoid CPU stalls (waiting for the results from the previous instruction)