Stanford CS149 I Parallel Computing I 2023 I Lecture 17 - Transactional Memory 2

29 Sep 2024 (9 months ago)

Transactional Memory Implementations

Transactional memory systems can be implemented in a variety of ways by combining different data versioning policies (eager or lazy) and conflict detection policies (optimistic or pessimistic). (3m10s)
Software transactional memory implementations often use a hybrid approach, employing eager optimistic for reads and pessimistic for writes. (3m41s)

Software transactional memory systems rely on software barriers, which are calls to the transactional memory system inserted into the original code to track reads and writes, although many of these calls can be optimized away. (5m26s)
When code is executed both inside and outside of transactions, two copies are required: one with instrumentation and one without. (6m58s)
Transactional memory systems require data structures to track state and ensure correct operation. These structures include a per-thread TransDescriptor (containing information about undo logs, write buffers, conflict detection, read sets, and write sets) and a per-data-element TransactionRecord (containing metadata about data access). (8m7s)
There are two ways to map data to the TransactionRecord: per-object (lower overhead, potential for optimization, but possibility of false conflicts) and per-data-element (reduced false conflicts, potentially improved concurrency, but increased overhead). (9m47s)

The Intel developed transactional memory algorithm, McRT, uses eager versioning with optimistic reads and pessimistic writes. (14m16s)
McRT is based on versioning or timestamping to determine what data is shared by multiple transactions. (14m40s)
The STM read in McRT is optimistic in the sense that even though you try and validate the data you are about to read, once you have done that you are golden until some other transaction decides to commit. (17m22s)
The STM write is pessimistic in that you now have to check and grab. (20m58s)

Transaction processing involves acquiring a lock before writing data, making it pessimistic for writes but optimistic for reads as it doesn't check for readers during the write operation. (21m44s)
Reset validation involves checking if data is locked or if its version is greater than the local timestamp; if either is true, the transaction aborts. (22m30s)
When committing a Software Transactional Memory (STM), the timestamp is incremented by two: one to mark the data as no longer locked and the other to update the version number. (24m21s)

X1 commits its transaction, incrementing the timestamp of 'bar' from 5 to 7. (30m16s)
X2 waits for X1 to commit its transaction. (32m42s)
X2's transaction aborts because its read set validation fails. (33m25s)

Software Transactional Memory (STM) systems can experience overhead due to barriers, but decomposing and optimizing these barriers can lead to performance improvements. (35m22s)
A study on hashmap and tree map data structures showed that using STM without compiler optimization resulted in 70% to 80% overhead on a single processor. (37m14s)
Software Transactional Memory (STM) can experience significant overhead due to barriers and commit operations, as demonstrated in examples like "vacation" and "K means". (41m55s)

Hardware Transactional Memory (HTM) leverages existing hardware components like caches and coherency protocols to mitigate the overhead associated with STM. (43m18s)
HTM implementations typically utilize caches for data versioning (similar to an undo log) and enhance coherency protocols for conflict detection, eliminating the need for explicit barriers. (43m34s)
To enhance metadata on cache lines for transactional memory, two bits are added: an R bit (read state) and a W bit (write state). (44m36s)
During the commit phase of a transaction, if an upgrade to a cache line is observed and that line is part of the transaction's read set (R bit set), a conflict is detected. (50m31s)
If an upgrade to a cache line is observed during the commit phase, but the line is not present in the transaction's cache, no conflict exists. (51m30s)
The hardware system being discussed uses a lazy optimistic policy for data versioning and conflict detection. (53m6s)
In this system, writes are buffered and only published when a transaction commits. At the commit point, the written data is compared with the read sets of other transactions to detect conflicts. (53m31s)
The system does not use exclusive states for cache lines during transaction execution; isolation is maintained until the commit point. Reads are performed with standard read requests, not read-exclusive. (54m22s)

Transactional memory support in Intel architectures faced challenges, including software development issues and frequent transaction aborts. (1h7m9s)
Security vulnerabilities discovered in Intel's transactional memory implementation led to its exploitation and hindered its adoption. (1h7m45s)
Despite past setbacks, Intel's roadmap still includes plans for transactional memory, although no concrete implementations have emerged. (1h7m56s)
Intel's latest implementations do not include the instructions to support transactional memory that are defined in their instruction set architecture (ISA) manual. (1h8m16s)

While many companies have experimented with transactional memory, its commercial use is limited because achieving good performance requires software implementation, compiler support, and code cloning, which can reduce system robustness. (1h11m51s)
Transactional memory is most effectively used in commercial systems when localized to specific data structures or components within a larger parallel system, often implemented in software rather than hardware. (1h12m57s)

Real-world applications often exhibit a mix of parallelism types, including thread-based and SIMD, as well as predictable and unpredictable memory access patterns. (1h15m15s)
Modern processors, such as the Intel Core i7 (Skylake architecture), are heterogeneous, incorporating CPU cores, integrated GPUs, and other specialized components. (1h16m27s)
While heterogeneity allows for more efficient resource and energy use, it introduces programming challenges, requiring different programming models for CPU cores (e.g., multithreading), GPUs (e.g., CUDA), and other specialized components. (1h17m29s)