Stanford CS149 | Parallel Computing | 2023 | Lecture 4 - Parallel Programming Basics
15 Sep 2024
Parallel Programming Concepts
- There are four key concepts to understand: multi-core, SIMD, multi-threading, and superscalar execution. (30s)
- An ispc function differs from a normal C function in that calling it runs multiple copies (program instances) of the same function, each with a different program index (programIndex). (4m35s)
ispc Function Execution
- Each program instance runs through the function body as ordinary sequential code, and the call returns to the caller only when all instances have completed. (5m35s)
- Gang size is determined at compile time and influences the width of the vector instructions used. (6m21s)
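A minimal ispc sketch of this abstraction (the function name and the doubling operation are illustrative, not the lecture's code): calling the exported function from C looks like a single ordinary call, but it logically runs a gang of programCount instances, each seeing a different programIndex, and returns only when every instance is done.

```
// Hedged illustrative sketch, not the lecture's code.
export void double_all(uniform int N, uniform float x[])
{
    // Each program instance handles the elements whose index is congruent
    // to its programIndex modulo programCount (the gang size).
    for (uniform int i = 0; i < N; i += programCount)
    {
        int idx = i + programIndex;   // "varying" value: differs per instance
        if (idx < N)                  // guard in case N is not a multiple of programCount
            x[idx] *= 2.0f;
    }
}
```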
ispc Compiler Optimization
- An implementation of the ispc_sinx function could theoretically use a for loop to iterate through program instances, but the ispc compiler utilizes vector instructions for efficiency. (7m56s)
- The ispc compiler does not spawn threads, and the gang size can impact performance. (14m13s)
Work Assignment in ispc
- The speaker presents two different programs that achieve the same outcome but with different mappings of program instances to work, highlighting the programmer's control over work assignment. (12m21s)
- The first program is preferred because it accesses consecutive indices in memory, which are likely to be on the same cache line, making it more efficient. (13m10s)
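A hedged ispc sketch contrasting the two mappings of program instances to elements (the per-element work here is a placeholder multiply rather than the lecture's sinx computation, and both versions assume N is a multiple of programCount):

```
// Interleaved assignment (the preferred version): instance p handles elements
// p, p + programCount, p + 2*programCount, ... The gang touches programCount
// consecutive addresses at a time, which maps to one contiguous vector
// load/store, likely within a single cache line.
export void scale_interleaved(uniform int N, uniform float x[], uniform float y[])
{
    for (uniform int i = 0; i < N; i += programCount)
        y[i + programIndex] = 2.0f * x[i + programIndex];
}

// Blocked assignment: instance p handles a contiguous chunk of N/programCount
// elements. The instances in a gang now touch addresses far apart, so each
// gang-wide access becomes a gather/scatter rather than one contiguous load.
export void scale_blocked(uniform int N, uniform float x[], uniform float y[])
{
    uniform int span = N / programCount;   // assumes N divisible by programCount
    for (uniform int i = 0; i < span; i++)
        y[programIndex * span + i] = 2.0f * x[programIndex * span + i];
}
```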
High-Level Programming Constructs
- The foreach construct in ispc is a high-level mechanism that lets the compiler decide how to map iterations onto program instances for good parallelism, potentially yielding better performance than a manually specified mapping (see the sketch after this list). (17m29s)
- Programmers should utilize higher-level programming constructs whenever possible, as they often provide efficient solutions. If necessary, lower-level optimizations can be explored. (18m46s)
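Referring back to the foreach bullet above, a hedged sketch of the same kind of kernel written with foreach (illustrative function name and per-element work): the programmer only asserts that the N iterations are independent, and the ispc compiler chooses the mapping onto program instances.

```
// Hedged illustrative sketch, not the lecture's code.
export void double_all_foreach(uniform int N, uniform float x[])
{
    foreach (i = 0 ... N)
    {
        x[i] *= 2.0f;   // placeholder per-element work
    }
}
```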
Parallelizing Code with Independent Iterations
- The ispc compiler excels at parallelizing code with independent iterations, such as applying the absolute value to every element of an array. However, it is crucial to ensure the program's logic remains valid when iterations are executed out of order. (20m49s)
- Attempting to compute a sum using a per-program-instance variable inside a foreach loop in ispc leads to incorrect results and a compilation error: each program instance has its own copy of the sum variable, and returning a value from multiple instances is not supported. (23m49s)
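A hedged sketch of the failing pattern just described (illustrative function name; the exact compiler diagnostic may differ):

```
// Hedged illustrative sketch of the problematic code.
export uniform float sum_broken(uniform int N, uniform float x[])
{
    float sum = 0.0f;            // varying: one private copy per program instance
    foreach (i = 0 ... N)
    {
        sum += x[i];             // each instance accumulates only its own elements
    }
    return sum;                  // rejected: a varying value cannot be returned
}                                // where a single uniform result is required
```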
Race Conditions in Parallel Programming
- There is a potential race condition when multiple program instances attempt to simultaneously update a single sum variable using the += operator. (24m38s)
- The ispc library provides a specialized cross-instance function to safely compute the sum of values from multiple program instances, addressing the race condition by accumulating partial sums locally and then performing a final reduction. (25m21s)
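A hedged sketch of the corrected pattern using the ispc standard library cross-instance reduction reduce_add() (the function name sum_all is illustrative):

```
// Hedged illustrative sketch: per-instance partial sums plus a final
// cross-lane reduction.
export uniform float sum_all(uniform int N, uniform float x[])
{
    float partial = 0.0f;        // varying: private partial sum per instance
    foreach (i = 0 ... N)
    {
        partial += x[i];         // no races: each instance touches only its copy
    }
    return reduce_add(partial);  // combine the partial sums into one uniform value
}
```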
SIMD Vector Instructions
- The ispc compiler can transform code with cross-lane operations into a program that utilizes SIMD vector instructions, enabling parallel execution on a single core. (26m2s)
- ispc can generate fast code for operations on vectors, such as adding tensors or mapping a lambda onto a collection. (31m12s)
- A C program with three tests, where 'do work' is a function that simulates a task, demonstrates the performance differences between sequential execution, spawning a thread per task, and using a thread pool. (34m24s)
- Spawning a thread per task is significantly slower than sequential execution or using a thread pool, especially when the task is computationally inexpensive. (37m28s)
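A hedged C sketch in the spirit of that experiment (the do_work body, the task count, and the timing code are illustrative stand-ins, not the lecture's program; the thread-pool variant is only described in a comment):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_TASKS 1000

static void do_work(int task_id) {
    // Placeholder for a cheap task; volatile keeps the loop from being optimized away.
    volatile double x = 0.0;
    for (int i = 0; i < 1000; i++) x += task_id * 0.5;
}

static void *thread_fn(void *arg) {
    do_work((int)(long)arg);
    return NULL;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    // Test 1: run all tasks sequentially on one thread.
    double t0 = seconds();
    for (int i = 0; i < NUM_TASKS; i++) do_work(i);
    printf("sequential:      %.6f s\n", seconds() - t0);

    // Test 2: spawn one thread per task -- pays thread create/join cost per task,
    // which dominates when do_work is cheap.
    t0 = seconds();
    pthread_t threads[NUM_TASKS];
    for (int i = 0; i < NUM_TASKS; i++)
        pthread_create(&threads[i], NULL, thread_fn, (void *)(long)i);
    for (int i = 0; i < NUM_TASKS; i++)
        pthread_join(threads[i], NULL);
    printf("thread per task: %.6f s\n", seconds() - t0);

    // Test 3 (thread pool, omitted here): create a few long-lived worker threads
    // once, then hand them tasks from a queue, avoiding per-task creation cost.
    return 0;
}
```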
ispc Thread Pool
- The ispc task system creates a thread pool sized appropriately for the machine and divides the launched tasks amongst the pool. (38m2s)
- There is a point at which the benefit of parallelizing a function into small tasks is outweighed by the overhead (such as context switching) of managing them. (38m52s)
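A hedged ispc sketch of the task mechanism (the function names, the chunking scheme, and the launch count of 16 are illustrative): each launched task processes one chunk of the array, and the ispc runtime hands tasks to its pre-created pool of worker threads rather than spawning a thread per task.

```
// Hedged illustrative sketch of ispc tasks.
task void square_chunk(uniform int N, uniform float x[], uniform float y[])
{
    uniform int span = N / taskCount;                 // assumes N divisible by taskCount
    foreach (i = taskIndex * span ... (taskIndex + 1) * span)
    {
        y[i] = x[i] * x[i];
    }
}

export void square_with_tasks(uniform int N, uniform float x[], uniform float y[])
{
    launch[16] square_chunk(N, x, y);                 // 16 tasks; the runtime's thread
}                                                     // pool picks them up as workers go idle
```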
Parallel Programming Principles
- Programmers are generally responsible for decomposing a problem into smaller pieces that can be performed in parallel. (42m32s)
- By Amdahl's Law, if a fraction s of a program is inherently sequential, then even with infinite processors the maximum speedup is limited to 1/s. For example, if 1% of a program is sequential (s = 0.01), the maximum speedup is 100x, even with infinite processors. (47m42s)
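In symbols (standard statement of Amdahl's Law, with s the sequential fraction and p the processor count):

```latex
\[
  \text{speedup}(p) \;\le\; \frac{1}{\,s + \frac{1-s}{p}\,},
  \qquad
  \lim_{p \to \infty} \text{speedup}(p) = \frac{1}{s}
  \quad\text{(e.g., } s = 0.01 \Rightarrow \text{max speedup} = 100\times\text{)}.
\]
```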
Impact of Sequential Code on Large Systems
- On systems with a large number of processors (e.g., supercomputers), it becomes crucial to minimize the sequential portion of code. If the sequential portion is not sufficiently small, the potential benefits of using such a large system diminish significantly. (48m17s)
- Work Assignment in Parallel Programming: A key aspect of parallel programming is assigning work to different processing units (cores, processors, etc.). This assignment can be done manually by the programmer or automatically by the programming language, compiler, or runtime system. The goal is to keep all processing units busy and maximize efficiency. (48m51s)
ispc Task Assignment
- In ispc, tasks are decomposed and assigned to threads for work. The system manages the assignment of tasks to idle threads, optimizing performance. (50m5s)
- Synchronization is crucial in parallel programming to ensure that tasks are assigned to only one worker and workers receive the next task in order. (51m20s)
Mapping Threads to Hardware
- Mapping threads or program instances to hardware execution contexts is a key aspect of parallel programming. This mapping can be handled by the operating system or, in the case of ispc, by the compiler. (51m52s)
Parallelizing the Red-Black Gauss-Seidel Algorithm
- It is not possible to parallelize the code by running separate chunks concurrently because the values in one chunk depend on the values calculated in previous chunks. (56m35s)
- One way to parallelize the code is to use a checkerboard pattern, updating all the red cells in parallel based on the black cells, and then updating all the black cells in parallel based on the red cells. (59m10s)
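A hedged C sketch of the checkerboard scheme (the grid layout with boundary padding and the five-point averaging stencil are assumptions consistent with the grid solver discussed in lecture, not verbatim code). Cells of the same color never neighbor each other, so all cells of one color can be updated in parallel.

```c
// Hedged illustrative sketch of one red-black iteration on an (n+2) x (n+2)
// grid A with boundary padding; color: 0 = red, 1 = black.
void sweep_color(float *A, int n, int color)
{
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if ((i + j) % 2 == color)   // update only cells of this color
                A[i*(n+2) + j] = 0.2f * (A[i*(n+2) + j] +
                                         A[(i-1)*(n+2) + j] + A[(i+1)*(n+2) + j] +
                                         A[i*(n+2) + j - 1] + A[i*(n+2) + j + 1]);
}

void red_black_iteration(float *A, int n)
{
    sweep_color(A, n, 0);   // red phase: reads only black neighbors, all updates independent
    sweep_color(A, n, 1);   // black phase: reads the freshly updated red neighbors
}
```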
Work Assignment in the Red-Black Algorithm
- When parallelizing the checkerboard pattern, there are two assignment options: dividing the grid rows across processors in a blocked way or in an interleaved way. (1h1m13s)
- If a blocked assignment of array elements is used in the red-black Gauss-Seidel algorithm, less data needs to be exchanged between processors because neighboring processors already have the information they need. (1h2m4s)
Parallel Implementation of the Red-Black Algorithm
- A parallel implementation of the red-black Gauss-Seidel algorithm can be written using a "for all" loop, where each iteration of the loop can be executed in parallel. (1h3m56s)
- In a lower-level implementation of the algorithm, each thread in a shared address space can be responsible for updating a block of rows, and synchronization constructs like locks and barriers can be used to ensure correctness. (1h5m46s)
Synchronization in Parallel Programming
- Updating a shared variable, such as diff in this example, requires using locks to ensure atomicity and prevent race conditions. (1h8m8s)
- Instead of locking within the innermost loop, it is more efficient to create a private per-thread copy of the variable (my_diff), update it without locks, and then accumulate the results into the global variable (diff) at the end of each iteration. (1h12m9s)
Barriers in Parallel Programming
- Barriers are synchronization mechanisms that force all threads to wait at a specific point in the code until all threads have reached that point. This code uses three barriers to ensure that all threads have finished their work before accumulating the results and checking for termination. (1h12m54s)
- All threads must complete their updates to diff before any thread checks for system convergence; this ensures the diff value read is correct. (1h14m34s)
- A barrier is also needed to prevent threads from clearing the diff value before other threads have checked it. (1h15m20s)
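A hedged C sketch of the pattern discussed above (the names diff and my_diff follow the lecture's discussion; the grid layout, stencil, blocked row assignment, and exact barrier placement are illustrative assumptions, not the lecture's verbatim code). Each thread owns a block of rows, accumulates its error into a private my_diff, takes the lock once per sweep, and the three barriers separate the phases of each iteration.

```c
#include <pthread.h>
#include <math.h>
#include <stdbool.h>

#define TOLERANCE 1e-3f

extern int n, num_threads;          // grid size and thread count (set elsewhere)
extern float *A;                    // (n+2) x (n+2) grid with boundary padding

static float diff;                  // shared accumulator for this sweep's error
static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;   // initialized elsewhere with num_threads

void *solve(void *arg)
{
    int tid = (int)(long)arg;
    int rows_per_thread = n / num_threads;        // assumes n divisible by num_threads
    int row_start = 1 + tid * rows_per_thread;
    int row_end   = row_start + rows_per_thread;  // exclusive
    bool done = false;

    while (!done) {
        float my_diff = 0.0f;                     // private partial sum: updated without locks
        if (tid == 0) diff = 0.0f;                // one thread resets the shared accumulator
        pthread_barrier_wait(&barrier);           // barrier 1: diff is reset before anyone adds to it

        for (int i = row_start; i < row_end; i++) {
            for (int j = 1; j <= n; j++) {
                float prev = A[i*(n+2) + j];
                A[i*(n+2) + j] = 0.2f * (prev +
                                         A[(i-1)*(n+2) + j] + A[(i+1)*(n+2) + j] +
                                         A[i*(n+2) + j - 1] + A[i*(n+2) + j + 1]);
                my_diff += fabsf(A[i*(n+2) + j] - prev);
            }
        }

        pthread_mutex_lock(&diff_lock);           // one locked update per sweep,
        diff += my_diff;                          // not one per inner-loop iteration
        pthread_mutex_unlock(&diff_lock);

        pthread_barrier_wait(&barrier);           // barrier 2: all partial sums are in diff
        if (diff / (n * n) < TOLERANCE)           // every thread reads the same final diff
            done = true;                          // and reaches the same decision
        pthread_barrier_wait(&barrier);           // barrier 3: no thread resets diff for the
    }                                             // next sweep until everyone has checked it
    return NULL;
}
```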
Contention Reduction in Parallel Programming
- Using a single global diff variable introduces contention. Replicating the diff variable, similar to using local copies (my_diff), can remove this contention and potentially reduce the number of barriers needed. (1h16m49s)