Stanford CS149 | Lecture 6 - Performance Optimization II: Locality, Communication, and Contention

17 Sep 2024

Early Multi-Threaded Chips

Data Access Speed in Modern Systems

  • In modern computer systems, data access speed can vary greatly depending on the location of the data, especially in large systems like GPUs. (6m56s)

Message Passing: A Computing Model

  • Message passing is a computing model where different threads or computers operate in their own address spaces and exchange information by explicitly sending messages, similar to communication on the internet. (8m1s)
  • In contrast to shared memory systems where all threads have direct access to the same addresses, message passing requires explicit communication to copy data between address spaces. (11m4s)
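A minimal sketch of this model, using MPI in place of the lecture's generic send/receive primitives (the communicator, tag, and payload here are illustrative assumptions, not the lecture's code):

```c
// Minimal MPI sketch: two processes with separate address spaces exchange a
// value only by explicit messages.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        // Copy the data out of rank 0's address space into a message for rank 1.
        MPI_Send(&value, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Receive the message into rank 1's own copy of the variable.
        MPI_Recv(&value, 1, MPI_INT, /*src=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```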

Distributed Memory Systems

  • A distributed memory system is described where each processor has its own memory and data is exchanged between them using a network. (12m31s)
  • In this system, a grid of data is divided and stored across multiple processors, with each processor responsible for a portion of the grid. (13m30s)
  • To update a grid element, a processor needs data from its neighbors, which may reside in different processors' memory; to facilitate this, processors over-allocate memory to store copies of data from their neighbors, referred to as "ghost rows" or "ghost cells". (15m7s)
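The ghost-row idea might look roughly like the sketch below, assuming a row-block partitioning of a grid N elements wide; MPI's combined send/receive stands in for the lecture's pseudocode, and the grid width, element type, and function name are assumptions:

```c
// Sketch of ghost-row exchange for a row-block partitioned grid. Each rank's
// local buffer holds rows_per_proc interior rows plus two extra "ghost" rows
// (row 0 and row rows_per_proc + 1) that cache copies of the neighbors' edges.
#include <mpi.h>

#define N 1024  /* assumed grid width */

void exchange_ghost_rows(float* local, int rows_per_proc, int rank, int nprocs) {
    float* top_ghost    = &local[0 * N];
    float* first_row    = &local[1 * N];
    float* last_row     = &local[rows_per_proc * N];
    float* bottom_ghost = &local[(rows_per_proc + 1) * N];

    if (rank > 0) {
        // Send my first interior row up; receive the upper neighbor's last row
        // into my top ghost row.
        MPI_Sendrecv(first_row, N, MPI_FLOAT, rank - 1, 0,
                     top_ghost, N, MPI_FLOAT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    if (rank < nprocs - 1) {
        // Send my last interior row down; receive the lower neighbor's first
        // row into my bottom ghost row.
        MPI_Sendrecv(last_row, N, MPI_FLOAT, rank + 1, 0,
                     bottom_ghost, N, MPI_FLOAT, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```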

Message Passing Code Example

  • In this message passing code example, each thread (except thread zero) sends a partial sum of local diff values to thread zero. (21m27s)
  • Thread zero waits to receive all the partial sums, calculates if the computation is complete (signified by the boolean variable done), and sends the done value back to all other threads. (21m36s)
  • Synchronization is achieved through this specific communication pattern of sending and receiving messages, eliminating the need for locks or barriers. (22m39s)
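A hedged reconstruction of that pattern, with MPI standing in for the lecture's send/recv pseudocode (the function name, tolerance parameter, and element count are assumptions):

```c
// Convergence check: every rank sends its local partial sum of diff to rank 0,
// which decides "done" and sends the answer back to all other ranks.
#include <mpi.h>

int check_done(float my_diff, long total_elems, float tolerance) {
    int rank, nprocs, done = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank != 0) {
        MPI_Send(&my_diff, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(&done, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        float total = my_diff;
        for (int src = 1; src < nprocs; src++) {
            float partial;
            MPI_Recv(&partial, 1, MPI_FLOAT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += partial;
        }
        done = (total / total_elems) < tolerance;  // convergence test
        for (int dst = 1; dst < nprocs; dst++)
            MPI_Send(&done, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
    }
    // The send/receive pattern itself orders the ranks: no lock or barrier needed.
    return done;
}
```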

Blocking Send Operations

  • A blocking send operation involves copying data from the sender's address space, transmitting it over the network, and waiting for an acknowledgment from the receiver before returning. (24m30s)
  • If all threads in a program attempt to send data to the thread behind them using blocking sends, a deadlock will occur. (28m5s)
  • A potential solution to this deadlock is to pair threads up and have one thread in each pair send data while the other receives data. (29m6s)
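One way that pairing can be expressed, sketched below under the assumption that each thread sends to the thread numbered just below it and receives from the one just above: even ranks send first and then receive, odd ranks do the opposite, so every blocking send has a matching receive already posted.

```c
// Pairing fix for the blocking-send deadlock (MPI used for illustration).
#include <mpi.h>

void exchange_with_neighbors(float* send_buf, float* recv_buf, int count) {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank % 2 == 0) {
        // Even ranks: send to the thread behind, then receive from the one ahead.
        if (rank > 0)
            MPI_Send(send_buf, count, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD);
        if (rank < nprocs - 1)
            MPI_Recv(recv_buf, count, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {
        // Odd ranks: receive first, then send, so the even sender never blocks forever.
        if (rank < nprocs - 1)
            MPI_Recv(recv_buf, count, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        if (rank > 0)
            MPI_Send(send_buf, count, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD);
    }
}
```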

Asynchronous Send and Receive Functions

  • Asynchronous send functions return immediately after being called. The function will provide a handle to check if the message has been sent. (30m43s)
  • Asynchronous receive functions return immediately with a handle. The handle can be used to check if the message has been received. (32m31s)
  • There is no guarantee that messages sent asynchronously will be received in the same order they were sent unless the message passing API being used has a specific configuration or flag set. (34m59s)
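A sketch of this handle-based style using MPI's nonblocking calls: MPI_Isend and MPI_Irecv return immediately with a request handle, and MPI_Wait (or MPI_Test) later checks or completes the transfer. The buffer names and surrounding function are assumptions:

```c
// Asynchronous send/receive with handles, overlapping communication and computation.
#include <mpi.h>

void async_exchange(float* send_buf, float* recv_buf, int count,
                    int dest, int src) {
    MPI_Request send_req, recv_req;

    // Both calls return immediately; the transfers proceed in the background.
    MPI_Irecv(recv_buf, count, MPI_FLOAT, src, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(send_buf, count, MPI_FLOAT, dest, 0, MPI_COMM_WORLD, &send_req);

    /* ... overlap independent computation here ... */

    // Block only when the received data is needed (and before reusing send_buf).
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
}
```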

Message Passing: IDs and Communication

  • Messages in message passing systems carry IDs and can be sent, received, or awaited; a receive can wait for a message from a specific sender or one carrying a specific ID (see the sketch after this list). (36m17s)
  • Communication in computing can occur between various components, including cores, memory, and even different computers. It can involve data movement between processors, registers, caches, and different levels of memory hierarchy. (36m40s)
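In MPI these message IDs correspond to tags; a receive can name a particular sender and tag or accept any of them, as sketched below (the buffer and count are placeholders):

```c
// Selecting messages by sender or by ID (MPI calls message IDs "tags").
#include <mpi.h>

void receive_examples(float* buf, int count) {
    // Only accept a message from rank 3 carrying tag (ID) 7.
    MPI_Recv(buf, count, MPI_FLOAT, 3, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Accept a message from any sender with any tag, then inspect its metadata.
    MPI_Status status;
    MPI_Recv(buf, count, MPI_FLOAT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    int sender = status.MPI_SOURCE;  // which rank the message came from
    int id     = status.MPI_TAG;     // which message ID it carried
    (void)sender; (void)id;
}
```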

Message Passing vs. Shared Memory

  • Message passing can be used as an alternative to shared address space communication for exchanging data between threads. It can be particularly useful in distributed systems without shared memory and can simplify concurrency reasoning. (39m27s)
  • Shared memory does not impose the same discipline on how threads communicate, which can leave a program's communication implicit and make performance tuning more challenging. (41m54s)

High-Performance Networking and Data Copying

  • High-performance networking implementations can reduce data copying in message passing by allowing the network interface card (NIC) to directly access and transmit data from memory. (43m2s)

Arithmetic Intensity and Performance

  • Arithmetic intensity, the ratio of math operations to data read, is a crucial factor in determining performance when latency is not a primary concern. (45m18s)
  • Arithmetic intensity is the ratio of computation to communication. Higher arithmetic intensity is preferable as it leads to better performance. (49m14s)
  • Distributing work as square tiles rather than as blocks of rows yields higher arithmetic intensity: among regions of equal area, a square has the smallest perimeter, so the boundary data that must be communicated is minimized relative to the work done inside the region. (49m28s)
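A worked comparison under the usual assumptions (an N×N grid split among P processors, communication proportional to the boundary elements fetched from neighbors, computation proportional to the elements owned):

```latex
% Row blocks: each processor owns N/P rows, so its boundary is two full rows.
\text{Row blocks:}\quad
\frac{\text{computation}}{\text{communication}} \propto \frac{N^2/P}{2N} = \frac{N}{2P}

% Square tiles: each processor owns an (N/sqrt(P)) x (N/sqrt(P)) block with four edges.
\text{Square tiles:}\quad
\frac{\text{computation}}{\text{communication}} \propto \frac{N^2/P}{4N/\sqrt{P}} = \frac{N}{4\sqrt{P}}
```

For P > 4, N/(4√P) exceeds N/(2P), so the tiled assignment communicates less per unit of computation; this is the area-to-perimeter argument in numbers.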

Caches and Communication

  • Caches also shape communication and performance. In the grid solver example, with a cache line holding four elements and a cache of six lines, computing the next (red) element while moving horizontally along a row incurs no additional cache misses, because the needed lines were already loaded. (53m25s)
  • Arithmetic intensity here is the ratio of work done to memory traffic generated: for every four elements of output, three new cache lines are loaded. (55m10s)
  • Inherent communication can be reduced by changing the assignment of work to processors. (55m51s)
  • Cache blocking is a technique for reducing artifactual communication by changing the order in which an array is traversed, so data brought into the cache is reused before it is evicted (see the first sketch after this list). (58m3s)
  • Arithmetic intensity can also be improved by fusing work into a single loop iteration, performing all calculations on an element in one pass and reducing the total number of loads and stores (see the second sketch after this list). (1h1m16s)
  • Deep learning compilers optimize code by transforming vector and tensor operations to improve data locality and reduce memory accesses. (1h2m57s)
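Two small sketches of the locality ideas above (illustrative, not the lecture's code): a blocked traversal that reuses each loaded cache line before it is evicted, and a fused loop that does all per-element work in one pass. The matrix size, tile size, and function names are assumptions.

```c
#include <stddef.h>

#define N 4096
#define B 64   /* assumed tile size, chosen so a BxB tile fits in cache */

// (1) Cache blocking: visit the array in small tiles; the cache lines loaded
//     for one tile are reused B times before moving on.
void blocked_transpose(const float* in, float* out) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)
                for (size_t j = jj; j < jj + B; j++)
                    out[j * N + i] = in[i * N + j];
}

// Unfused: two passes over x, so two rounds of loads and stores.
void scale_then_offset(float* x, float a, float b, size_t n) {
    for (size_t i = 0; i < n; i++) x[i] = a * x[i];
    for (size_t i = 0; i < n; i++) x[i] = x[i] + b;
}

// (2) Fused: one pass, same arithmetic, roughly half the memory traffic.
void scale_and_offset(float* x, float a, float b, size_t n) {
    for (size_t i = 0; i < n; i++) x[i] = a * x[i] + b;
}
```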

Contention and Performance Bottlenecks

  • Contention for a shared resource, like everyone lining up at once for office hours, can create a performance bottleneck even when the amount of work and travel time per request is unchanged. (1h5m21s)
  • In the same analogy, appointment-driven scheduling spreads the requests out, so each student spends only about 10 minutes rather than waiting behind everyone else in a queue. (1h5m54s)
  • Replicating tasks among workers can help avoid contention for memory access, which can slow down performance. (1h6m11s)

Roofline Graphs and Performance Analysis

  • Roofline graphs can be used to visualize a program's performance in terms of operational intensity (flops per byte) and achieved performance (gigaflops), helping to determine if a program is compute-bound or memory-bound. (1h11m14s)
  • For programs with an arithmetic intensity of about 1 or higher, the achieved performance of the X2 processor levels off at its compute peak: such programs are compute-bound rather than bandwidth-bound on that machine. (1h11m36s)
  • The X4 processor, with four times the compute of the X2 processor, achieves approximately four times higher peak performance. (1h14m29s)
  • Increasing a machine's parallel compute capability only translates into higher delivered performance if applications perform enough arithmetic per byte of data moved to use the added compute. (1h15m8s)
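The standard roofline bound behind these observations, with the lecture's specific X2/X4 peak numbers left unspecified:

```latex
% Attainable performance is capped by either peak compute or bandwidth times intensity.
\text{Attainable FLOP/s} \;=\; \min\!\left( P_{\text{peak}},\; I \cdot B_{\text{mem}} \right),
\qquad I = \text{arithmetic intensity (FLOPs per byte)}
```

To the left of the ridge point I = P_peak / B_mem a program is bandwidth-bound and extra compute units go unused; to the right it is compute-bound, which is why the X4 machine only shows its roughly 4x advantage on programs with sufficient arithmetic intensity.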
