Stanford CS149 I Parallel Computing I 2023 I Lecture 12 - Memory Consistency

Cache Coherence

  • Coherence in a memory system with multiple caches sharing locations from a single address space means that, for any particular address, all reads and writes made by all processors can be placed in one sequential order, and that order is consistent with the program order of each thread. (59s)
  • In a memory system with multiple caches, there are two invariants that need to be maintained to implement coherence: the single writer multiple reader invariant and the data value invariant. (1m26s)
  • The Modified/Shared/Invalid (MSI) write-back invalidation protocol maintains cache coherence by ensuring that at most one cache in the system holds a given line in the Modified state at any time, and that memory is up to date whenever a line is in the Shared state. (5m19s)
  • When a cache line is in a modified state, it signifies that it is the only valid copy in the system. (10m24s)
  • When a processor holds a cache line in the Modified state and snoops a bus read-exclusive request, it supplies the most recent copy of the data, flushes the line to memory with a bus write-back, and transitions to the Invalid state. (15m14s)
  • If Processor 1 issues a read for a cache line that Processor 3 holds in the Modified state, Processor 1 retrieves the up-to-date data from Processor 3's cache rather than from memory. (18m0s)
  • The transition from the Modified state to the Shared state on a snooped read is essential for maintaining the single-writer multiple-reader invariant. (21m58s)
  • The MSI (Modified, Shared, Invalid) protocol maintains cache coherence by ensuring that only one processor at a time can hold a cache line in the Modified state, and only that processor may write to the line (see the state-machine sketch after this list). (26m28s)
  • The MSI protocol can be optimized with an Exclusive state, signifying that a cache line is clean and held by only one cache. (27m51s)
  • Reading a cache line not shared by other processors brings it into the Exclusive state, enabling upgrades from Exclusive to Modified without bus transactions, improving efficiency. (28m50s)
  • Directory-based coherence, in which a directory stores information about which processors hold a particular cache line in their caches, is more scalable than bus-based snooping (see the directory-entry sketch after this list). (31m29s)
  • Directory-based systems allow for scalable interconnects, such as networks or rings, which eliminate the need for broadcasting and serializing transactions. (32m6s)
  • Cache coherence can lead to an increased number of cache misses at different levels of the memory hierarchy, especially in NUMA systems where access to remote memory (DRAM) is slower than access to local memory. (36m1s)
  • NUMA (Non-Uniform Memory Access) impacts application performance as data access times vary based on the data's location relative to the processor. Accessing data locally is faster than accessing it remotely. (37m55s)
  • The average memory access time in parallel programs running on multiprocessors is generally higher than in sequential programs due to the increased latency of accesses that miss to lower levels of the memory hierarchy (see the formula after this list). (39m39s)
  • False sharing, which occurs when data intended for different threads resides on the same cache line, can significantly hurt performance. In the lecture's example, moving per-thread variables into a padded struct so each thread's data sits on its own cache line improves runtime from 14.2 seconds to 4.7 seconds on a four-core system (see the padded-struct sketch after this list). (43m30s)
  • As cache line size increases, true sharing misses decrease due to better spatial locality in truly shared data. (45m11s)
  • Conversely, false sharing misses increase as cache line size increases. (45m30s)
  • The standard cache line size in most processors is 64 bytes. (51m44s)
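
To make the MSI transitions summarized above concrete, here is a minimal sketch of the per-line state machine. The event names follow textbook convention (PrRd/PrWr for local processor reads and writes, BusRd/BusRdX for snooped bus requests) and are an assumption; this is an illustration, not code from the lecture.

```cpp
#include <cstdio>

// Simplified MSI state machine for a single cache line in a single cache.
// Event names are textbook conventions (assumed, not from the lecture):
//   PrRd / PrWr    = read / write issued by the local processor
//   BusRd / BusRdX = read / read-exclusive requests snooped from other caches
enum class State { Modified, Shared, Invalid };
enum class Event { PrRd, PrWr, BusRd, BusRdX };

// Returns the next state; sets `flush` when the dirty line must be
// written back on the bus (the bus write-back discussed at 15m14s).
State next_state(State s, Event e, bool& flush) {
    flush = false;
    switch (s) {
        case State::Invalid:
            if (e == Event::PrRd) return State::Shared;     // issue BusRd
            if (e == Event::PrWr) return State::Modified;   // issue BusRdX
            return State::Invalid;                          // no copy: ignore snoops
        case State::Shared:
            if (e == Event::PrWr)   return State::Modified; // upgrade via BusRdX
            if (e == Event::BusRdX) return State::Invalid;  // another cache will write
            return State::Shared;                           // PrRd / BusRd: unchanged
        case State::Modified:
            if (e == Event::BusRd)  { flush = true; return State::Shared; }  // writer -> readers
            if (e == Event::BusRdX) { flush = true; return State::Invalid; } // hand over ownership
            return State::Modified;                         // local hits stay Modified
    }
    return s;  // unreachable; silences compiler warnings
}

int main() {
    bool flush = false;
    State s = State::Invalid;
    s = next_state(s, Event::PrWr, flush);   // Invalid -> Modified (BusRdX)
    s = next_state(s, Event::BusRd, flush);  // Modified -> Shared, flush == true
    std::printf("in Shared? %d, flushed? %d\n", s == State::Shared, flush);
}
```

The Modified-to-Shared arc on a snooped BusRd is exactly the transition highlighted at (21m58s): the former writer flushes the dirty data and keeps a read-only copy.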
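
For the directory-based scheme at (31m29s) and (32m6s), the key data structure is a per-line directory entry at the line's home node, recording which processors hold copies so that invalidations can be sent point-to-point instead of broadcast. A minimal sketch of the bookkeeping, with the entry layout and processor count assumed for illustration:

```cpp
#include <bitset>

constexpr int kNumProcessors = 64;  // assumed machine size

// One directory entry per cache line at its home node. The presence
// bits record which processors may hold a copy; `dirty` means exactly
// one of them owns the line in the Modified state.
struct DirectoryEntry {
    std::bitset<kNumProcessors> sharers;
    bool dirty = false;
};

// Read miss from processor p: fetch the data (from the owner if dirty,
// else from memory -- message traffic elided) and record p as a sharer.
void on_read_miss(DirectoryEntry& e, int p) {
    e.sharers.set(p);
    e.dirty = false;  // any previous owner writes back and drops to Shared
}

// Write miss from processor p: invalidate every other sharer with a
// point-to-point message (elided), then record p as the sole dirty owner.
void on_write_miss(DirectoryEntry& e, int p) {
    e.sharers.reset();
    e.sharers.set(p);
    e.dirty = true;
}

int main() {
    DirectoryEntry e;
    on_read_miss(e, 0);   // P0 reads:  sharers = {0}
    on_read_miss(e, 3);   // P3 reads:  sharers = {0, 3}
    on_write_miss(e, 3);  // P3 writes: sharers = {3}, dirty
}
```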
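
A common way to quantify the point at (39m39s) is the average memory access time recurrence, applied per level of the hierarchy (a standard formula, not taken verbatim from the lecture): AMAT = hit time + miss rate × miss penalty, where the miss penalty of one level is the AMAT of the next level down. Coherence misses, including false sharing misses, raise the miss rate, and in a NUMA system the remote-DRAM miss penalty exceeds the local one, so both effects push AMAT higher for parallel programs.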
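
The false-sharing fix at (43m30s) is commonly written as padding each thread's variables out to a full cache line. A minimal sketch, assuming a 64-byte line and four worker threads (names and counts are illustrative, not the lecture's code):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // typical line size, per the notes above

// Without padding, adjacent counters share a cache line, and every
// increment by one thread invalidates that line in the other cores'
// caches (false sharing). alignas + padding gives each counter its own line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
    char pad[kCacheLine - sizeof(long)];
};

int main() {
    constexpr int kThreads = 4;
    std::vector<PaddedCounter> counters(kThreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t] {
            for (long i = 0; i < 100000000L; ++i)
                counters[t].value++;  // each thread touches only its own line
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```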

Memory Consistency

  • Memory consistency concerns the apparent ordering of reads and writes to different addresses by different processors, while coherence ensures that all processors eventually observe the same order of writes to any single address. (56m3s)
  • Cache coherence is only needed in systems with caches, but a memory consistency model is needed even in systems without caches. (57m6s)
  • Modern multiprocessor systems reorder memory accesses to improve performance, which has implications for systems programmers and compiler writers. (57m32s)
  • There are four types of memory ordering: read to read, read to write, write to read, and write to write. (59m37s)
  • Sequential consistency, as defined by Leslie Lamport, ensures that all operations are executed in a sequential order, as if operating on a single shared memory, and that operations within a thread occur in program order. (1h4m41s)
  • Sequentially consistent systems maintain all four memory orderings: read to read, read to write, write to read, and write to write. (1h5m21s)
  • Sequential Consistency (SC) is intuitive for programmers but can hinder performance optimization. (1h7m32s)
  • Write buffers can improve performance by letting reads proceed past buffered writes, but this relaxes the write-to-read ordering and violates SC (see the litmus test after this list). (1h9m22s)
  • Total Store Ordering (TSO) and Processor Consistency (PC) are weaker memory models that relax the write-to-read ordering to enhance performance, hiding write latency behind subsequent reads. (1h11m40s)
  • Coherency is needed due to caches, while consistency is needed to understand the meaning of programs. (1h14m40s)
  • Fully relaxed consistency models, which may relax all four orderings, are used in ARM processors, such as those found in cell phones. (1h15m47s)
  • Data races, which can lead to unintended behavior, occur with unsynchronized access to shared data, necessitating synchronization mechanisms such as fences and locks (see the sketch after this list). (1h17m22s)
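
The write-buffer behavior at (1h9m22s) is usually demonstrated with the classic Dekker-style litmus test: under SC at least one thread must observe the other's write, but once the write-to-read ordering is relaxed (as with TSO write buffers), both loads can return 0. A sketch using C++ atomics, where memory_order_relaxed stands in for the hardware's weaker ordering (illustrative, not the lecture's code):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};
int r1 = 0, r2 = 0;

int main() {
    std::thread t1([] {
        A.store(1, std::memory_order_relaxed);   // write may wait in a buffer
        r1 = B.load(std::memory_order_relaxed);  // read may bypass that write
    });
    std::thread t2([] {
        B.store(1, std::memory_order_relaxed);
        r2 = A.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    // Under sequential consistency, (r1, r2) == (0, 0) is impossible:
    // whichever store executes first is seen by the other thread's load.
    // With write->read reordering (e.g., TSO write buffers), (0, 0) can occur.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```

Switching both operations to std::memory_order_seq_cst (the default) restores the SC outcome, at the cost of the hardware draining the write buffer before the subsequent read.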
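
For the data-race point at (1h17m22s), the two standard remedies are a lock that serializes conflicting accesses and ordered atomics (or explicit fences) that publish data safely. A minimal sketch of both, with the counter and flag names assumed for illustration:

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Remedy 1: a lock serializes conflicting accesses, eliminating the race.
std::mutex m;
long shared_counter = 0;

void locked_increment() {
    std::lock_guard<std::mutex> guard(m);
    ++shared_counter;
}

// Remedy 2: release/acquire ordering publishes `payload` through a flag;
// once the consumer sees the flag set, it is guaranteed to see the payload.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain write ...
    ready.store(true, std::memory_order_release);  // ... made visible by the flag
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    // Reading payload here is race-free and yields 42.
}

int main() {
    std::thread a(producer), b(consumer), c(locked_increment);
    a.join(); b.join(); c.join();
}
```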
