Stanford CS149 I Lecture 6 - Performance Optimization II: Locality, Communication, and Contention

17 Sep 2024 (4 months ago)
Stanford CS149 I Lecture 6 - Performance Optimization II: Locality, Communication, and Contention

Early Multi-Threaded Chips

Data Access Speed in Modern Systems

  • In modern computer systems, data access speed can vary greatly depending on the location of the data, especially in large systems like GPUs. (6m56s)

Message Passing: A Computing Model

  • Message passing is a computing model where different threads or computers operate in their own address spaces and exchange information by explicitly sending messages, similar to communication on the internet. (8m1s)
  • In contrast to shared memory systems where all threads have direct access to the same addresses, message passing requires explicit communication to copy data between address spaces. (11m4s)

Distributed Memory Systems

  • A distributed memory system is described where each processor has its own memory and data is exchanged between them using a network. (12m31s)
  • In this system, a grid of data is divided and stored across multiple processors, with each processor responsible for a portion of the grid. (13m30s)
  • To update a grid element, a processor needs data from its neighbors, which may reside in different processors' memory; to facilitate this, processors over-allocate memory to store copies of data from their neighbors, referred to as "ghost rows" or "ghost cells". (15m7s)

Message Passing Code Example

  • In this message passing code example, each thread (except thread zero) sends a partial sum of local diff values to thread zero. (21m27s)
  • Thread zero waits to receive all the partial sums, calculates if the computation is complete (signified by the boolean variable done), and sends the done value back to all other threads. (21m36s)
  • Synchronization is achieved through this specific communication pattern of sending and receiving messages, eliminating the need for locks or barriers. (22m39s)

Blocking Send Operations

  • A blocking send operation involves copying data from the sender's address space, transmitting it over the network, and waiting for an acknowledgment from the receiver before returning. (24m30s)
  • If all threads in a program attempt to send data to the thread behind them using blocking sends, a deadlock will occur. (28m5s)
  • A potential solution to this deadlock is to pair threads up and have one thread in each pair send data while the other receives data. (29m6s)

Asynchronous Send and Receive Functions

  • Asynchronous send functions return immediately after being called. The function will provide a handle to check if the message has been sent. (30m43s)
  • Asynchronous receive functions return immediately with a handle. The handle can be used to check if the message has been received. (32m31s)
  • There is no guarantee that messages sent asynchronously will be received in the same order they were sent unless the message passing API being used has a specific configuration or flag set. (34m59s)

Message Passing: IDs and Communication

  • Messages in message passing systems have IDs and can be sent, received, or awaited. Receiving messages can involve waiting for messages from specific senders or with specific IDs. (36m17s)
  • Communication in computing can occur between various components, including cores, memory, and even different computers. It can involve data movement between processors, registers, caches, and different levels of memory hierarchy. (36m40s)

Message Passing vs. Shared Memory

  • Message passing can be used as an alternative to shared address space communication for exchanging data between threads. It can be particularly useful in distributed systems without shared memory and can simplify concurrency reasoning. (39m27s)
  • Shared memory does not force discipline in program design, potentially making performance tuning more challenging. (41m54s)

High-Performance Networking and Data Copying

  • High-performance networking implementations can reduce data copying in message passing by allowing the network interface card (NIC) to directly access and transmit data from memory. (43m2s)

Arithmetic Intensity and Performance

  • Arithmetic intensity, the ratio of math operations to data read, is a crucial factor in determining performance when latency is not a primary concern. (45m18s)
  • Arithmetic intensity is the ratio of computation to communication. Higher arithmetic intensity is preferable as it leads to better performance. (49m14s)
  • Distributing work in a tiled format rather than chunks of rows leads to higher arithmetic intensity. This is because a square shape has the highest ratio of area to perimeter. (49m28s)

Caches and Communication

  • Caches can impact communication and performance. In the grid solver example, with a cache line size of four elements and a cache of six lines, moving horizontally after computing a red dot results in no cache misses. (53m25s)
  • Arithmetic intensity is the ratio of work to bandwidth. For every four elements of output, three new cache lines are loaded. (55m10s)
  • Inherent communication can be reduced by changing the assignment of work to processors. (55m51s)
  • Cache blocking is a technique for reducing artifactual communication by changing the order in which iteration over an array occurs. (58m3s)
  • Arithmetic intensity can be improved by rewriting code to perform all calculations within a loop iteration, reducing the total number of loads and stores. (1h1m16s)
  • Deep learning compilers optimize code by transforming vector and tensor operations to improve data locality and reduce memory accesses. (1h2m57s)

Contention and Performance Bottlenecks

  • Contention for shared resources, such as waiting in line during office hours, can lead to performance bottlenecks even when the amount of work and travel time is the same for each request. (1h5m21s)
  • Appointment-driven scheduling for office hours is more efficient for students, taking only 10 minutes of their time. (1h5m54s)
  • Replicating tasks among workers can help avoid contention for memory access, which can slow down performance. (1h6m11s)

Roofline Graphs and Performance Analysis

  • Roofline graphs can be used to visualize a program's performance in terms of operational intensity (flops per byte) and achieved performance (gigaflops), helping to determine if a program is compute-bound or memory-bound. (1h11m14s)
  • For programs with an arithmetic intensity of 1 or higher, the performance of the X2 processor remains constant. (1h11m36s)
  • The X4 processor, with four times the compute of the X2 processor, achieves approximately four times higher peak performance. (1h14m29s)
  • Increasing the parallel capability of a computer only results in peak performance if applications possess sufficient mathematical operations to utilize the added compute. (1h15m8s)

Overwhelmed by Endless Content?