Stanford CS149 I Parallel Computing I 2023 I Lecture 7 - GPU architecture and CUDA Programming
18 Sep 2024
Early GPU Development and Purpose
- NVIDIA and AMD were originally producing chips for gaming purposes. (41s)
- Early GPUs were designed to solve the problem of rendering images from a mathematical description of a scene. (2m37s)
GPU Architecture and Processing Power
- Modern GPUs can render complex scenes in real time, such as the Nanite demo from Epic. (3m39s)
- Triangle meshes are used to represent surfaces in computer graphics, and GPUs compute where each mesh vertex projects onto the screen. (4m20s)
- In the early 2000s, shading languages were developed whose programs run once for every pixel on screen, allowing for complex material simulations. (6m12s)
- GPUs began incorporating multicore and SIMD architectures to handle the increasing demand for processing power in computer graphics, particularly for rendering high-resolution images at high frame rates. (7m30s)
Early Attempts at General-Purpose Computing
- Around 2004, researchers began exploring the use of GPUs for general-purpose computing, recognizing their potential for parallel processing beyond graphics rendering. (10m24s)
- Early attempts to utilize GPUs for general-purpose computing involved "hacks" such as rendering triangles to trigger function calls that performed non-graphical computations, treating the graphics pipeline as a means to execute parallel code. (9m43s)
Introduction of CUDA
- Before CUDA was introduced in 2007, the only way to program a GPU, even for general-purpose computing, was through graphics APIs: primarily using OpenGL to draw triangles and compute pixel colors. (13m41s)
- CUDA introduced a new abstraction, "compute mode," which allows developers to execute custom functions (kernels) on the GPU in an SPMD (Single Program, Multiple Data) fashion, similar to ISPC, by launching multiple copies of the kernel. (15m2s)
CUDA Programming Concepts
- When programming in CUDA, developers can conceptualize the execution of code in a similar way to ISPC, where a function is run multiple times with n instances. (18m37s)
- In CUDA, thread IDs are multi-dimensional, often represented as a grid, such as 4x3, for convenience in graphics and tensor operations. (19m17s)
- CUDA organizes instances into blocks, with each block containing a specific number of threads, and multiple thread blocks can be created simultaneously. (19m34s)
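The thread and block indexing described above can be sketched as a small element-wise matrix addition. This is a minimal sketch, not the lecture's exact code: the 12x6 matrix size, the kernel name matrixAdd, and the host helper runMatrixAdd are assumptions chosen for illustration, and the 4x3 block shape follows the example in the notes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int NX = 12;  // matrix width (illustrative assumption)
constexpr int NY = 6;   // matrix height (illustrative assumption)

// Kernel: every CUDA thread runs this function once and handles one element.
__global__ void matrixAdd(const float* A, const float* B, float* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column owned by this thread
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row owned by this thread
    if (i < NX && j < NY) {                         // guard against surplus threads
        int idx = j * NX + i;
        C[idx] = A[idx] + B[idx];
    }
}

// Hypothetical host helper: launches the kernel and verifies every element.
bool runMatrixAdd() {
    const int n = NX * NY;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hA(n), hB(n), hC(n, 0.0f);
    for (int k = 0; k < n; ++k) { hA[k] = float(k); hB[k] = 2.0f * float(k); }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(4, 3);  // 12 threads per block, arranged 4x3
    dim3 numBlocks((NX + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (NY + threadsPerBlock.y - 1) / threadsPerBlock.y);
    matrixAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);  // one bulk launch

    // This blocking copy waits for the kernel to finish before reading dC.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int k = 0; k < n; ++k) {
        if (hC[k] != 3.0f * float(k)) return false;  // k + 2k, exact in float here
    }
    return true;
}

int main() {
    printf("matrixAdd %s\n", runMatrixAdd() ? "ok" : "failed");
    return 0;
}
```

The bounds check in the kernel is not strictly needed when the block shape divides the matrix evenly, as it does here, but it keeps the kernel safe if the sizes change.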
CUDA Memory Model and Communication
- CUDA threads execute on the GPU. The call to the matrixAddDoubleB kernel is the point at which computation starts on the GPU. (24m53s)
- In the CUDA memory model, the host code (main) runs on the CPU with its own address space. The GPU has a separate memory address space that is accessed by CUDA threads. (26m9s)
- cudaMemcpy moves data between CPU memory and GPU memory. This is a slow copy because it moves bytes over the PCIe bus. (28m22s)
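The separation between host and device address spaces can be illustrated with a round trip: allocate on the device, copy in, run a kernel, copy back. This is a sketch under assumptions: the kernel addOne and the helper roundTrip are hypothetical names, and only cudaMalloc, cudaMemcpy, and cudaFree are real CUDA runtime calls.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: operates on device memory only.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns true if the data survives the round trip.
bool roundTrip(int n) {
    std::vector<float> host(n, 1.0f);  // lives in the CPU address space
    const size_t bytes = n * sizeof(float);

    float* device = nullptr;           // will point into the GPU address space
    if (cudaMalloc(&device, bytes) != cudaSuccess) return false;
    // The host must not dereference `device`; it is only meaningful on the GPU.

    cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice);  // CPU -> GPU

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n
    addOne<<<blocks, threads>>>(device, n);

    // GPU -> CPU; this blocking copy also waits for the kernel to complete.
    cudaError_t err = cudaMemcpy(host.data(), device, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device);
    if (err != cudaSuccess) return false;

    for (float v : host) {
        if (v != 2.0f) return false;
    }
    return true;
}

int main() {
    printf("round trip %s\n", roundTrip(1000) ? "ok" : "failed");
    return 0;
}
```

Because each copy crosses the bus, a common design choice is to keep data resident on the GPU across several kernel launches rather than copying it back and forth between them.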
CUDA Thread Management
- CUDA threads are not created individually, but rather in bulk launches, similar to a hypothetical C++ API that allows the creation of multiple threads at once. (31m16s)
- CUDA utilizes a memory model where CPU threads and GPU CUDA threads have distinct address spaces. (32m50s)
- CUDA introduces the concept of thread blocks, where each block has its own address space accessible only to threads within that block. (33m0s)
Shared Memory in CUDA
- A program can be written to launch thread blocks with a size of 128, where each block computes 128 outputs and requires 130 inputs. (37m1s)
- Each block allocates one shared array of 130 elements; the __shared__ keyword indicates a per-thread-block allocation, so the array is shared by all threads in the block rather than created once per thread. (37m18s)
- The threads cooperatively load data into the shared array, with a barrier (__syncthreads) ensuring all data is loaded before computation. (39m20s)
- Shared variables are backed by storage comparable to a high-performance L1 cache, enabling fast access for threads within the same block. (39m56s)
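The cooperative load and barrier described above might look like the following sketch. The notes do not say what is computed from the 130 inputs, so the three-point average used here is an assumption (it is one computation with that shape: output i reads inputs i, i+1, and i+2). The names average3 and runAverage3 are hypothetical.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLK 128

// Each block produces 128 outputs from 130 inputs staged in shared memory.
// Assumes the output count is a multiple of 128 and the input has 2 extra elements.
__global__ void average3(const float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];  // one array per thread block

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperative load: every thread loads one element,
    // and the first two threads load the two extra elements.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }

    __syncthreads();  // barrier: all 130 elements are loaded past this point

    float result = 0.0f;
    for (int i = 0; i < 3; ++i) {
        result += support[threadIdx.x + i];
    }
    output[index] = result / 3.0f;
}

// Hypothetical host helper: with input[i] = i, output[i] should be i + 1.
bool runAverage3() {
    const int N = 1024;  // number of outputs, a multiple of THREADS_PER_BLK
    std::vector<float> hIn(N + 2), hOut(N, 0.0f);
    for (int i = 0; i < N + 2; ++i) hIn[i] = float(i);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, (N + 2) * sizeof(float));
    cudaMalloc(&dOut, N * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), (N + 2) * sizeof(float), cudaMemcpyHostToDevice);

    average3<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(dIn, dOut);

    cudaError_t err =
        cudaMemcpy(hOut.data(), dOut, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < N; ++i) {
        if (std::fabs(hOut[i] - float(i + 1)) > 1e-3f) return false;
    }
    return true;
}

int main() {
    printf("average3 %s\n", runAverage3() ? "ok" : "failed");
    return 0;
}
```

Staging the inputs in shared memory means each input element is read from global memory about once per block instead of up to three times, which is the motivation for the per-block array.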
CUDA Thread Block Management
- The number of threads launched in a CUDA program is up to the programmer. (43m35s)
- CUDA threads are organized into blocks, and the number of threads in a block cannot exceed the capacity of a single core. (43m46s)
- CUDA, unlike ISPC, provides synchronization constructs such as per-thread-block barriers and atomic operations. (45m34s)
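Both synchronization constructs can appear in one kernel. The sketch below sums an integer array: threads within a block combine their values in shared memory using barriers, then one thread per block adds the block's partial sum to a global total with atomicAdd. The kernel blockSum and helper runBlockSum are hypothetical names, and the tree-shaped reduction is a common pattern assumed here for illustration, not something the notes prescribe.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLK 128  // must be a power of two for the loop below

__global__ void blockSum(const int* in, int n, int* total) {
    __shared__ int partial[THREADS_PER_BLK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;  // pad the last block with zeros
    __syncthreads();                              // all loads are visible

    // Pairwise reduction within the block. The barrier sits outside the `if`
    // so that every thread in the block reaches it on every iteration.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Blocks finish in no particular order, so the global update is atomic.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host helper: sums n ones on the GPU, or returns -1 on error.
int runBlockSum(int n) {
    std::vector<int> hIn(n, 1);
    int *dIn = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));  // the total starts at zero

    int blocks = (n + THREADS_PER_BLK - 1) / THREADS_PER_BLK;
    blockSum<<<blocks, THREADS_PER_BLK>>>(dIn, n, dTotal);

    int hTotal = -1;
    cudaError_t err = cudaMemcpy(&hTotal, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dTotal);
    return (err == cudaSuccess) ? hTotal : -1;
}

int main() {
    printf("sum = %d\n", runBlockSum(1000));
    return 0;
}
```

Integer addition is used so the result does not depend on the order in which blocks reach the atomicAdd.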
CUDA Compilation and Execution
- When a CUDA program is compiled, metadata about its execution requirements, such as threads per block and shared memory allocation, is generated. (48m13s)
- A GPU contains multiple cores that execute thread blocks, distributing work similarly to a thread pool. (49m17s)
GPU Core Architecture
- A Volta V100 GPU core, or sub-core, consists of a fetch and decode unit, 16 arithmetic logic units (ALUs) for SIMD operations, and storage for 128 threads' execution contexts. (50m52s)
- On GPUs, SIMD execution occurs implicitly when 32 consecutive threads, called a warp, execute the same instruction, unlike CPUs where the compiler generates SIMD instructions. (53m37s)
Warp Execution and Scheduling
- A warp comprises 32 threads and executes instructions in a SIMD (Single Instruction, Multiple Data) manner, meaning all threads in a warp execute the same instruction on different data. (55m38s)
- A warp scheduler issues instructions to different units (floating point, integer, etc.) every other clock cycle to maximize efficiency. (56m11s)
- Each of the 32 threads in a warp has its own program counter (PC), but SIMD execution only occurs when all threads in a warp have the same PC value. (58m8s)
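The mapping from threads to warps can be made visible with a small kernel that records, for each thread in a one-dimensional block, which warp it falls in and its position (lane) inside that warp. This is a sketch: warpSize is a built-in CUDA device variable, while the kernel recordWarpLayout and the helper checkWarpLayout are hypothetical names. It relies on consecutive thread IDs within a block being grouped into the same warp.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Records the warp index and lane of each thread in a 1D block.
__global__ void recordWarpLayout(int* warpOut, int* laneOut) {
    int t = threadIdx.x;
    warpOut[t] = t / warpSize;  // consecutive threads share a warp
    laneOut[t] = t % warpSize;  // position of the thread within its warp
}

// Hypothetical host helper: checks the layout against the device's warp size.
bool checkWarpLayout() {
    const int threads = 128;  // one block, i.e. 4 warps if the warp size is 32

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    int *dWarp = nullptr, *dLane = nullptr;
    cudaMalloc(&dWarp, threads * sizeof(int));
    cudaMalloc(&dLane, threads * sizeof(int));

    recordWarpLayout<<<1, threads>>>(dWarp, dLane);

    std::vector<int> hWarp(threads), hLane(threads);
    cudaError_t e1 = cudaMemcpy(hWarp.data(), dWarp, threads * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hLane.data(), dLane, threads * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaFree(dWarp); cudaFree(dLane);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int t = 0; t < threads; ++t) {
        if (hWarp[t] != t / prop.warpSize) return false;
        if (hLane[t] != t % prop.warpSize) return false;
    }
    return true;
}

int main() {
    printf("warp layout %s\n", checkWarpLayout() ? "ok" : "failed");
    return 0;
}
```

With a warp size of 32, thread 37 of a block lands in warp 1 at lane 5, which is why branching on values that differ between neighbouring threads tends to split a warp across code paths.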
Thread Blocks and Warps
- A thread block is a programming concept, while a warp is a hardware implementation detail; programmers create thread blocks, and NVIDIA GPUs use warps to execute them. (1h0m13s)
- A single core in the GPU holds execution contexts for 64 warps, which translates to 2,048 CUDA threads (64 warps * 32 threads/warp). (1h0m49s)
- A single Streaming Multiprocessor (SM) can be thought of as four separate cores that share the same shared memory storage. (1h1m33s)
- Each SM can handle 2,048 CUDA threads with 128 kilobytes of shared storage. (1h1m46s)
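The per-SM limits quoted above are for one specific GPU; the CUDA runtime can report the corresponding figures for whatever device is installed. The sketch below uses cudaGetDeviceProperties and fields of cudaDeviceProp; the helper residentWarpsPerSM is a hypothetical name, and the printed values depend on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: how many warps' execution contexts fit on one SM,
// computed as threads per SM divided by threads per warp. Returns -1 on error.
int residentWarpsPerSM(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.maxThreadsPerMultiProcessor / prop.warpSize;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    printf("device:               %s\n", prop.name);
    printf("SM count:             %d\n", prop.multiProcessorCount);
    printf("warp size:            %d threads\n", prop.warpSize);
    printf("max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    printf("resident warps / SM:  %d\n", residentWarpsPerSM(0));
    printf("shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("max threads / block:  %d\n", prop.maxThreadsPerBlock);
    printf("shared mem / block:   %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

On a device matching the figures in these notes, the warp count would follow the same arithmetic: 2,048 threads per SM divided by 32 threads per warp gives 64 warps.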
Block Size Flexibility and Optimization
- NVIDIA GPUs allow for flexibility in block size, so code written for older architectures with smaller block sizes can still run on newer architectures with larger block sizes. (1h6m19s)
- NVIDIA prefers automatic scheduling of SIMD operations, potentially changing block size and reordering for optimization. (1h6m48s)
- To run a thread block with 128 threads and 512 bytes of shared storage, a core reserves execution contexts for the 128 threads and carves the 512 bytes out of its shared memory, then distributes the threads across its slices for concurrent execution. (1h7m31s)
Shared Memory Constraints
- A thread block can fail to be scheduled on a core even when enough execution contexts are free; in that case the limiting resource is shared memory. (1h12m6s)
- NVIDIA GPUs will not schedule a thread block whose shared memory requirement exceeds what is currently available on the core. (1h12m17s)
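The effect of shared memory on how many blocks fit on one SM can be queried rather than guessed. The sketch below uses the runtime call cudaOccupancyMaxActiveBlocksPerMultiprocessor with a kernel that takes its shared memory size at launch time; the kernel scratchKernel, the helper blocksPerSM, and the two sizes compared (512 bytes and 32 KB) are assumptions for illustration. The kernel is only inspected here, never launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel whose shared memory size is chosen at launch (dynamic shared memory).
__global__ void scratchKernel(float* out) {
    extern __shared__ float scratch[];
    scratch[threadIdx.x] = float(threadIdx.x);
    __syncthreads();
    // Read a value written by a different thread in the same block.
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical helper: how many blocks of this kernel can be resident on one SM
// for a given block size and shared memory request. Returns -1 on error.
int blocksPerSM(int blockSize, size_t sharedBytes) {
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, scratchKernel, blockSize, sharedBytes);
    return (err == cudaSuccess) ? numBlocks : -1;
}

int main() {
    const int blockSize = 128;
    int small = blocksPerSM(blockSize, 512);        // small shared allocation
    int large = blocksPerSM(blockSize, 32 * 1024);  // large shared allocation
    printf("resident blocks per SM with 512 B shared:  %d\n", small);
    printf("resident blocks per SM with 32 KB shared:  %d\n", large);
    return 0;
}
```

With the small request the count is typically bounded by execution contexts, while with the large request shared memory becomes the bottleneck, so the second number is expected to be no larger than the first.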
Partial Thread Block Execution and Synchronization
- A GPU never runs only part of a thread block: if some threads of a block were running while others were not yet scheduled, synchronization mechanisms such as barriers could deadlock. (1h15m34s)
- Using atomic operations on global memory across multiple thread blocks is acceptable in CUDA. (1h17m20s)
- One thread block can update a variable in memory while another thread block waits for that update to occur. (1h17m53s)
- Thread blocks can interact, but assumptions cannot be made about the order in which operations are performed. (1h18m26s)
- Threads within a thread block can be assumed to run concurrently, allowing for the use of barriers for synchronization. (1h18m34s)