Stanford CS149 I Parallel Computing I 2023 I Lecture 11 - Cache Coherence

Spark: Distributed Computing Framework

  • Spark aims to enable efficient distributed computing for applications that heavily utilize intermediate data, such as iterative algorithms and queries. (1m33s)
  • Resilient Distributed Datasets (RDDs) are the key abstraction in Spark, representing read-only, ordered collections of records. (2m56s)
  • Spark optimizes for locality through techniques like fusion and tiling to enhance parallel performance. (4m1s)
  • Transformations can have wide dependencies that require communication between nodes, for example a group-by-key operation (contrasted with narrow dependencies in the first sketch after this list). (7m41s)
  • If two RDDs are partitioned with the same hash partitioner, the runtime system can detect narrow dependencies and perform fusion, reducing communication overhead (see the co-partitioning sketch after this list). (11m2s)
  • Spark maintains fault tolerance through lineage, a log of the transformations applied to each RDD, which lets lost partitions be recreated by replaying those transformations from data in persistent storage. (12m11s)
  • If a node crashes during computation, the system recovers by rerunning the log of operations to recreate lost data partitions. This leverages data replication in HDFS and the coarse-grained nature of the operation log. (15m45s)
  • Spark offers significant performance improvements, often orders of magnitude faster than Hadoop, due to its use of in-memory processing and reduced reliance on disk access for iterative tasks. (18m30s)
  • Spark's ecosystem includes domain-specific frameworks like Spark SQL for database processing, Spark MLlib for machine learning, and Spark GraphX for graph operations, all built upon the underlying RDD abstraction. (20m30s)
  • Spark can be extended beyond transformations and actions to perform tasks like graph analysis and database operations. (23m26s)
  • If the data fits in the memory of a single machine, using a distributed system like Spark is not recommended: its overheads mean a single system will be more efficient. (26m27s)
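
The dependency structure above can be made concrete with a short PySpark sketch. This is an illustrative example, not code from the lecture; the input file name and key format are invented.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "lineage-demo")

lines = sc.textFile("events.txt")                  # RDD backed by stable storage
pairs = lines.map(lambda l: (l.split(",")[0], 1))  # narrow dependency: stays within a partition
groups = pairs.groupByKey()                        # wide dependency: shuffles data across nodes

groups.persist()        # keep the result in memory for reuse across actions
print(groups.count())   # action: forces evaluation of the lineage above

# If a node holding partitions of `groups` crashes, Spark replays the logged
# transformations (textFile -> map -> groupByKey) for the lost partitions only.
```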
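And a hedged sketch of the co-partitioning point: when two keyed RDDs share the same hash partitioner, a join between them has only narrow dependencies, so no extra shuffle is needed. The data here is invented for illustration.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "copartition-demo")

a = sc.parallelize([("x", 1), ("y", 2)]).partitionBy(8)  # hash-partition into 8 pieces
b = sc.parallelize([("x", 9), ("y", 8)]).partitionBy(8)  # same partitioner, same width

# Matching keys already live in matching partitions, so the runtime can fuse
# the join with the partitioned inputs instead of shuffling across nodes.
joined = a.join(b)
print(joined.collect())  # e.g. [('x', (1, 9)), ('y', (2, 8))]
```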

Cache Memory: Exploiting Locality

  • Modern processors dedicate a significant portion of their chip area to cache (30% or more) to enhance performance by exploiting locality. (28m31s)
  • Cache lines are used to exploit spatial locality in program behavior, which is the tendency to access memory locations that are close together. (29m58s)
  • The Intel Skylake chip, found in the myth machines, has a 32-kilobyte L1 data cache that is 8-way set associative: each memory address maps to exactly one set, and its line can be placed in any of that set's eight ways (the address breakdown is sketched after this list). (34m59s)
  • Conflict misses occur when too many addresses map to the same set: limited associativity forces an eviction from that set even though other sets in the cache still have room. (36m33s)
  • Set-associative caches are organized into sets, which can be visualized as buckets, with each set containing a fixed number of cache lines. (37m26s)
  • As cache size increases, the set associativity does not always increase proportionally. (38m40s)
  • A cache line consists of data and metadata. The metadata includes a tag, which stores the memory address of the data, and a dirty bit, indicating whether the data in the cache has been modified. (39m31s)
  • In a write-back cache, data is written to the cache first and only later written to the main memory. In contrast, a write-through cache writes data to both the cache and main memory simultaneously. (41m31s)
  • Write-allocate, used with write-back caches, fetches the entire cache line from main memory upon a write miss and then writes the data. (43m39s)
  • Write-back caches write data to the cache and set a dirty bit, deferring the memory update until eviction, while write-through caches write data to both the cache and memory simultaneously, eliminating the need for a dirty bit (a toy write-back set is sketched after this list). (44m40s)
  • The tag in a cache represents the address of a cache line, not just a single variable, allowing for spatial locality. (49m25s)
  • Higher set associativity in a cache reduces miss rates but increases the complexity of cache lookups. (50m35s)
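
To make the set-associative lookup concrete, here is a small worked example of how an address splits into tag, set index, and byte offset, using the Skylake figures above plus a 64-byte cache line (the line size is an assumption, though it matches common Intel parts):

```python
LINE_BYTES = 64
WAYS = 8
CACHE_BYTES = 32 * 1024
NUM_SETS = CACHE_BYTES // LINE_BYTES // WAYS   # 32768 / 64 / 8 = 64 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 6 bits of byte offset
INDEX_BITS = NUM_SETS.bit_length() - 1         # 6 bits of set index

def split_address(addr: int):
    """Return (tag, set index, byte offset) for an address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Addresses 4 KB apart map to the same set and can conflict even while
# the rest of the cache is empty (a conflict miss once 8 such lines collide).
for addr in (0x1000, 0x2000, 0x1040):
    print(hex(addr), "->", split_address(addr))
```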
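The write-back/dirty-bit behavior can also be sketched. This toy model of a single 2-way set is purely illustrative; real caches track tags and replacement state in hardware:

```python
memory = {}          # address -> value, stands in for main memory

class WriteBackSet:
    def __init__(self, ways=2):
        self.ways = ways
        self.lines = {}  # tag -> {"value": ..., "dirty": bool}

    def write(self, tag, value):
        if tag not in self.lines and len(self.lines) == self.ways:
            self._evict()
        # write-allocate: the line is brought in (from `memory`) on a miss
        self.lines.setdefault(tag, {"value": memory.get(tag), "dirty": False})
        self.lines[tag]["value"] = value
        self.lines[tag]["dirty"] = True   # memory is now stale

    def _evict(self):
        tag, line = self.lines.popitem()
        if line["dirty"]:
            memory[tag] = line["value"]   # write back only dirty data

s = WriteBackSet()
s.write(0xA, 1)
s.write(0xB, 2)
print(memory)        # {} -- nothing has been written through to memory yet
s.write(0xC, 3)      # evicts a dirty line, flushing it to memory
print(memory)        # one of the earlier writes is now visible in memory
```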

Cache Coherence: Maintaining Data Consistency

  • When multiple processors with individual caches are connected to a main memory, a memory coherence problem can occur. (52m52s)
  • If multiple processors read and write the same memory address, inconsistencies can arise because an update in one cache might not be reflected in other caches or in main memory (the stale-read sketch after this list walks through one such case). (56m1s)
  • Determining the "last" value written to a memory address becomes complex in a multi-processor system because simultaneous writes from different processors need a clear order to ensure data consistency. (59m0s)
  • For any given memory location, accesses must appear in a single serial order to maintain coherence: all processors see writes to that address in the same order, that order is consistent with each processor's program order, and every read returns the value of the most recent write in the serial order. (1h0m51s)
  • Two key invariants are crucial for cache coherence: the single writer multiple reader invariant, which allows only one processor to modify a cache line at a time while permitting multiple readers, and the data value invariant, which ensures that all readers see the last written value during a read-only epoch. (1h2m1s)
  • While software-based solutions using virtual memory pages are possible, they are slow and prone to issues like false sharing. Hardware-based solutions operating at the cache line level, such as snooping and directory-based systems, offer a more efficient approach to maintaining cache coherence. (1h6m47s)
  • A single shared cache for multiple processors can lead to a performance bottleneck due to bandwidth limitations and interference between processors. (1h7m52s)
  • Shared caches can exhibit constructive interference, where one processor effectively prefetches data another will soon use, and destructive interference, where one processor evicts data another still needs, causing extra misses. (1h8m46s)
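
A tiny simulation makes the coherence problem tangible: with private, non-communicating caches, one processor's write is invisible to another. Everything here is illustrative; no real protocol is modeled:

```python
memory = {"x": 0}

class IncoherentCache:
    def __init__(self):
        self.data = {}            # private copy, never invalidated

    def load(self, addr):
        if addr not in self.data:
            self.data[addr] = memory[addr]   # fill from memory on a miss
        return self.data[addr]

    def store(self, addr, value):
        self.data[addr] = value   # write-back: memory is not updated yet

p1, p2 = IncoherentCache(), IncoherentCache()

print(p1.load("x"))   # 0 -- both caches now hold a copy of x
print(p2.load("x"))   # 0
p1.store("x", 42)     # P1 updates only its private copy
print(p2.load("x"))   # still 0: P2 reads a stale cached value
```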

Snooping Cache Coherence: Using a Bus

  • Snooping cache coherence schemes use an interconnect, such as a bus, to connect processors and their caches, allowing them to communicate and maintain data consistency. (1h11m2s)
  • Buses have two important properties for cache coherence: they allow only one transaction at a time, providing serialization, and they broadcast all transactions to all connected devices. (1h14m12s)
  • A bus in computer architecture functions as a broadcast medium: every transaction placed on it is observed by all attached processors and caches. (1h15m0s)
  • The MSI protocol, used in cache coherence, utilizes three states: Modified, Shared, and Invalid. (1h18m27s)
  • The MSI protocol ensures exclusive access for write operations by combining processor operations (processor read, processor write) with bus transactions (bus read, bus read exclusive, bus write-back); a toy version of the resulting state machine is sketched below. (1h19m14s)
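
As referenced above, here is a toy sketch of MSI snooping, with the bus modeled as a list that broadcasts each transaction to every other cache. It simplifies heavily (no data values, no write-back transaction) and is not the lecture's full state diagram:

```python
M, S, I = "Modified", "Shared", "Invalid"

class MSICache:
    def __init__(self, bus):
        self.state = {}           # addr -> M / S / I
        self.bus = bus
        bus.append(self)          # attach to the shared broadcast bus

    def read(self, addr):
        if self.state.get(addr, I) == I:
            self._broadcast("BusRd", addr)   # another owner drops M to S
            self.state[addr] = S
        return self.state[addr]

    def write(self, addr):
        if self.state.get(addr, I) != M:
            self._broadcast("BusRdX", addr)  # gain exclusive ownership
            self.state[addr] = M
        return self.state[addr]

    def _broadcast(self, txn, addr):
        for cache in self.bus:               # the bus serializes and
            if cache is not self:            # broadcasts each transaction
                cache.snoop(txn, addr)

    def snoop(self, txn, addr):
        if txn == "BusRdX":
            self.state[addr] = I             # another writer: invalidate
        elif txn == "BusRd" and self.state.get(addr) == M:
            self.state[addr] = S             # flush implied; demote to Shared

bus = []
c0, c1 = MSICache(bus), MSICache(bus)
c0.write(0x40)                          # c0 is the single writer (Modified)
print(c1.read(0x40))                    # BusRd demotes c0; c1 becomes Shared
print(c0.state[0x40], c1.state[0x40])   # Shared Shared: multiple readers
```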