Stanford CS149 I 2023 I Lecture 5 - Performance Optimization I: Work Distribution and Scheduling

16 Sep 2024

Program Optimization

  • The running example is a numerical grid solver: each iteration updates cells based on their neighbors and accumulates a measure of change (diff), repeating until the solution converges. (17s)
  • The solver uses three barriers per iteration: one ensures all threads have finished accumulating their partial sums into diff before it is checked, one ensures all threads have checked the convergence value before diff is reset, and one prevents updates from starting before the reset and check complete. (55s)
  • The solver can be optimized by replacing the single diff variable with an array of three diff values, one per in-flight iteration, which removes these dependencies and reduces the three barriers to one (see the sketch below). (3m59s)
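A minimal C++20 sketch of the three-slot diff trick (update_my_rows is a placeholder for the per-thread grid update, and the constants are illustrative, not the lecture's exact code). Rotating among three slots lets every thread reset the next iteration's accumulator before the single barrier without racing against reads of the current one:

```cpp
#include <barrier>
#include <mutex>

constexpr int   NUM_THREADS = 8;      // illustrative
constexpr float TOLERANCE   = 1e-5f;  // illustrative

float update_my_rows(int tid);        // placeholder: update this thread's rows,
                                      // return its partial diff

float          diff[3] = {0.f, 0.f, 0.f};  // one accumulator per in-flight iteration
std::mutex     diff_lock;
std::barrier<> bar(NUM_THREADS);

void solver_thread(int tid, int n) {
    int index = 0;                    // which diff slot this iteration uses
    while (true) {
        float my_diff = update_my_rows(tid);
        {
            std::lock_guard<std::mutex> g(diff_lock);
            diff[index] += my_diff;   // accumulate partial sums
        }
        diff[(index + 1) % 3] = 0.f;  // reset the NEXT slot; no one accumulates
                                      // into it until after the barrier
        bar.arrive_and_wait();        // the single barrier per iteration
        if (diff[index] / (n * n) < TOLERANCE)  // all threads read, none write
            break;
        index = (index + 1) % 3;
    }
}
```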

Work Balancing in Parallel Programming

  • Balancing work in parallel programming is crucial for performance optimization, ensuring that no single core is overloaded while others remain idle. (8m20s)
  • A simple approach to balancing work is static assignment, where the workload is divided evenly among the available cores or threads before execution. (8m48s)

Static Assignment

  • Static assignment can be effective when the workload is uniform, such as dividing an image into rows where each row has a similar computational cost. (10m52s)
  • Static assignment is a suitable approach when the cost of work chunks is predictable or when the average cost of work is consistent. (13m10s)
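A minimal sketch of blocked static assignment (shade_row stands in for the per-row work; dividing into contiguous bands of rows is one of several reasonable blocking choices):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

void shade_row(int row);  // placeholder: the per-row computation

// Each thread gets a contiguous band of rows, chosen before execution
// and never rebalanced: the essence of static assignment.
void run_static(int num_rows, int num_threads) {
    std::vector<std::thread> workers;
    int rows_per_thread = (num_rows + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([=] {
            int begin = t * rows_per_thread;
            int end   = std::min(num_rows, begin + rows_per_thread);
            for (int r = begin; r < end; r++)
                shade_row(r);
        });
    }
    for (auto& w : workers) w.join();
}
```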

Semi-Static Assignment

  • Semi-static assignment involves assigning work to workers upfront based on the current state, maintaining that assignment for a period, and then potentially adjusting it if the workload distribution changes. (16m10s)

Dynamic Assignment

  • Dynamic assignment is employed when the time required for a task is unpredictable, making it challenging to pre-assign work effectively. (16m50s)
  • The code example demonstrates a work assignment strategy in which threads dynamically grab work from a shared array using a counter and a lock, ensuring each thread gets a unique piece of work (see the sketch after this list). (19m17s)
  • The work itself is defined as computing the primality of a set of numbers, with each number's primality check representing a single unit of work. (20m28s)
  • Dividing work into smaller, more numerous pieces and placing them in a work queue allows for better work distribution and avoids potential imbalances compared to pre-assigning fixed chunks of work. (23m50s)
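A C++ approximation of that shared-counter pattern (test_primality and the input array are placeholders; the lecture's version is C-style, but the structure is the same):

```cpp
#include <mutex>
#include <vector>

bool test_primality(int v);  // placeholder: the unit of useful work

int        counter = 0;      // index of the next unassigned piece of work
std::mutex counter_lock;

// Run by every worker thread. std::vector<char> rather than
// std::vector<bool>, so concurrent writes to distinct elements are safe.
void worker(const std::vector<int>& x, std::vector<char>& is_prime) {
    while (true) {
        int i;
        {
            std::lock_guard<std::mutex> g(counter_lock);
            i = counter++;               // grab a unique index
        }
        if (i >= (int)x.size()) break;   // no work left
        is_prime[i] = test_primality(x[i]);
    }
}
```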

Cost of Dynamic Allocation

  • The cost of dynamic allocation in a program can be evaluated by comparing the time spent on useful work (e.g., testing primality) versus overhead (e.g., synchronization). (24m21s)
  • To optimize a program, one can measure where the time goes: total execution time, time spent in specific functions (e.g., testPrimality), and time spent in synchronization primitives such as locks (see the timing sketch after this list). (26m22s)
  • If a significant portion of the execution time is spent on overhead, alternative strategies like static assignment can be explored to reduce overhead and potentially improve performance. (29m10s)
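One way to take those measurements, sketched with std::chrono around the counter loop above (variable names and the printed breakdown are illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <vector>

bool test_primality(int v);  // placeholder, as before

int        counter = 0;
std::mutex counter_lock;

void timed_worker(const std::vector<int>& x, std::vector<char>& is_prime) {
    using clk = std::chrono::steady_clock;
    double sync_s = 0.0, work_s = 0.0;
    auto start = clk::now();
    while (true) {
        auto t0 = clk::now();            // time the synchronization...
        int i;
        {
            std::lock_guard<std::mutex> g(counter_lock);
            i = counter++;
        }
        sync_s += std::chrono::duration<double>(clk::now() - t0).count();
        if (i >= (int)x.size()) break;
        auto t1 = clk::now();            // ...separately from the useful work
        is_prime[i] = test_primality(x[i]);
        work_s += std::chrono::duration<double>(clk::now() - t1).count();
    }
    double total_s = std::chrono::duration<double>(clk::now() - start).count();
    std::printf("total %.3fs  work %.3fs  sync %.3fs (%.1f%% overhead)\n",
                total_s, work_s, sync_s, 100.0 * sync_s / total_s);
}
```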

Optimizing Dynamic Allocation

  • To avoid excessive overhead from having too many small tasks, increase the work granularity by grabbing several pieces of data at once (see the chunked sketch after this list). (30m44s)
  • Dynamic scheduling can still produce uneven distribution if the expensive tasks sit at the end of the queue; sorting tasks by cost and assigning the largest ones first helps mitigate this. (32m49s)
  • To alleviate contention on a single shared work queue, give each thread its own local queue (analogous to making local copies of a shared variable), distributing the work across the queues. (34m25s)
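A chunked variant of the counter loop under the same placeholder assumptions; GRANULARITY is an illustrative constant controlling how many indices each lock acquisition claims:

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

constexpr int GRANULARITY = 10;  // pieces of work claimed per lock acquisition

bool test_primality(int v);      // placeholder, as before

int        counter = 0;
std::mutex counter_lock;

void chunked_worker(const std::vector<int>& x, std::vector<char>& is_prime) {
    int n = (int)x.size();
    while (true) {
        int begin;
        {
            std::lock_guard<std::mutex> g(counter_lock);
            begin = counter;
            counter += GRANULARITY;      // claim a whole chunk at once
        }
        if (begin >= n) break;
        int end = std::min(n, begin + GRANULARITY);
        for (int i = begin; i < end; i++)
            is_prime[i] = test_primality(x[i]);  // amortize the lock cost
    }
}
```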

Task Dependencies

  • A common scheduling problem in various systems involves managing tasks with dependencies, where the completion of one task might enable the execution of others. (36m9s)
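One common way to encode such dependencies (an illustrative structure, not the lecture's code): each task carries a count of unfinished prerequisites, and finishing a task may make its dependents runnable:

```cpp
#include <atomic>
#include <functional>
#include <vector>

struct Task {
    std::function<void()> run;
    std::atomic<int>      deps_remaining{0};  // prerequisites not yet finished
    std::vector<Task*>    dependents;         // tasks waiting on this one
};

void enqueue_ready(Task* t);  // placeholder: hand a runnable task to the workers

// Called when a task completes: releasing the last dependency of a
// dependent makes that dependent eligible to run.
void on_task_finished(Task* t) {
    for (Task* d : t->dependents)
        if (d->deps_remaining.fetch_sub(1) == 1)  // previous value 1 -> now 0
            enqueue_ready(d);
}
```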

Parallel Programming Approaches

  • Aside from data parallelism, another approach to parallel programming involves explicitly creating worker threads and assigning them tasks, as opposed to focusing on independent data operations. (38m37s)

Quick Sort and Parallelism

  • Quick sort, a divide and conquer algorithm, offers potential parallelism in its recursive calls, allowing the left and right partitions of an array to be sorted concurrently. (40m52s)

Cilk Programming System

  • Cilk is a programming extension, available in a number of C/C++ compilers, that makes it easier to write divide-and-conquer parallel programs. (42m6s)
  • The keyword cilk_spawn placed in front of a function call invokes the function just like a normal call, but the caller may continue executing asynchronously with the callee. (42m28s)
  • The keyword cilk_sync is a barrier for spawned calls: it guarantees that every call spawned by the current function has completed and returned before execution continues past it. (43m45s)
  • Cilk is a programming model that lets programmers reveal work that can be done concurrently, without dictating how that work is scheduled. (47m38s)
  • One valid implementation is to spawn a new thread for each cilk_spawn and join it at the corresponding cilk_sync. (49m6s)
  • Another valid implementation can ignore the spawn and sync keywords entirely and run everything sequentially (see the quicksort sketch below). (53m41s)
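A sketch of parallel quicksort using the Cilk keywords (spelled cilk_spawn/cilk_sync as in OpenCilk and Cilk Plus; the partition helper and the sequential cutoff are placeholders):

```cpp
#include <algorithm>
#include <cilk/cilk.h>

int partition_around_pivot(int* a, int n);  // placeholder: returns pivot's final index

void quick_sort(int* a, int n) {
    if (n <= 32) {                   // cutoff value is illustrative
        std::sort(a, a + n);         // small problems: just sort sequentially
        return;
    }
    int mid = partition_around_pivot(a, n);
    cilk_spawn quick_sort(a, mid);   // caller may continue while this runs
    quick_sort(a + mid + 1, n - mid - 1);
    cilk_sync;                       // both halves complete before returning
}
```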

Threadpool Implementation

  • A threadpool implementation launches a predetermined number of threads at the start of the program and has them repeatedly pull tasks from a work queue (see the sketch below). (54m17s)
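A minimal sketch of that idea in C++ (the interface and shutdown protocol are illustrative):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(int n) {             // launch all workers up front
        for (int i = 0; i < n; i++)
            workers_.emplace_back([this] { worker_loop(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> g(m_); queue_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void worker_loop() {
        while (true) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // run outside the lock so others can dequeue
        }
    }
    std::vector<std::thread>          workers_;
    std::queue<std::function<void()>> queue_;
    std::mutex                        m_;
    std::condition_variable           cv_;
    bool                              done_ = false;
};
```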

Scheduling Tasks in a Multi-threaded Environment

  • Two approaches to scheduling tasks in a multi-threaded environment are discussed: running the continuation first (queueing the function call for later stealing) and running the child first (queueing the continuation for later stealing). (58m11s)
  • An example with a for loop spawning multiple asynchronous calls to a function foo illustrates the effects of both scheduling approaches on work distribution (see the annotated loop below). (58m35s)
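The loop in question, annotated with what each policy queues (foo is an arbitrary function):

```cpp
#include <cilk/cilk.h>

void foo(int i);  // arbitrary per-iteration work

void spawn_loop(int N) {
    for (int i = 0; i < N; i++)
        cilk_spawn foo(i);  // child-first: run foo(i) now and queue the rest
                            // of the loop (the continuation) for stealing.
                            // Continuation-first: queue foo(i) and keep
                            // looping, piling all N calls into this
                            // thread's queue.
    cilk_sync;              // wait for every spawned foo(i)
}
```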

Run Child First Scheme

  • In the run-child-first scheme, thread zero executes the first call and places the continuation (the remaining loop iterations) into its work queue; other threads can then steal it, so the continuation bounces between threads. (1h1m10s)
  • When the run-child-first scheme is applied to a recursive algorithm like quicksort, the tasks in a thread's queue vary in size: smaller tasks sit at the bottom of the queue, larger tasks at the top. (1h3m6s)
  • It is advantageous for idle threads to steal work from the top of the queue for two reasons: stealing larger tasks reduces the need for frequent synchronization, and it allows thread zero to primarily manage the bottom of the queue, promoting data locality. (1h5m15s)

Work-Stealing Scheduler

  • When using a work-stealing scheduler, threads operate independently on subtasks, pulling work from their local queues without communication, leading to increased parallelism as they recursively generate more work. (1h6m1s)
  • In this model, local threads push and pop work from the bottom of their queues, while remote threads steal large chunks of work from the top, minimizing interference. (1h6m14s)
  • An idle thread can steal work from any queue with available tasks; randomly selecting a queue with work is theoretically optimal, as targeting the busiest queue can lead to contention and potential inefficiencies. (1h8m26s)
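A simplified per-thread deque showing that access pattern (a production work-stealing deque such as Chase-Lev avoids locking on the owner's fast path; this mutex-based sketch only illustrates which end each party uses):

```cpp
#include <deque>
#include <mutex>

template <typename Task>
class WorkDeque {
public:
    void push_bottom(Task t) {        // owner adds newly created work
        std::lock_guard<std::mutex> g(m_);
        dq_.push_back(std::move(t));
    }
    bool pop_bottom(Task& out) {      // owner takes the most recent (smallest) task
        std::lock_guard<std::mutex> g(m_);
        if (dq_.empty()) return false;
        out = std::move(dq_.back());
        dq_.pop_back();
        return true;
    }
    bool steal_top(Task& out) {       // thief takes the oldest (largest) task
        std::lock_guard<std::mutex> g(m_);
        if (dq_.empty()) return false;
        out = std::move(dq_.front());
        dq_.pop_front();
        return true;
    }
private:
    std::deque<Task> dq_;
    std::mutex       m_;
};
```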

Greedy Join Scheduling

  • When a thread completes a unit of work (a "block") that was stolen from another thread, it decrements a reference count associated with that block. (1h15m56s)
  • The thread that finishes the last piece of work for a given block continues execution from where the block left off (see the sketch below). (1h16m3s)
  • In Cilk, threads that reach a join with work still outstanding do not sit idle; they keep trying to steal work, a policy referred to as "greedy join scheduling". (1h16m13s)
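An illustrative sketch of that join bookkeeping (the names and structure are hypothetical): the sync point carries a count of outstanding spawned children, and whichever thread finishes the last one resumes the continuation:

```cpp
#include <atomic>

struct SyncBlock {
    std::atomic<int> outstanding;        // spawned children not yet completed
};

void resume_continuation(SyncBlock* b);  // placeholder: run the code after the sync

void on_child_done(SyncBlock* b) {
    // fetch_sub returns the previous value, so seeing 1 means this
    // thread finished the block's last outstanding child.
    if (b->outstanding.fetch_sub(1) == 1)
        resume_continuation(b);
}
```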

Assignment and Next Meeting

  • Students are expected to implement a concept like this in their assignment; the same concept is common in distributed systems. (1h17m20s)
  • The next meeting will be on Thursday. (1h17m32s)
