Stanford CS149 I Parallel Computing I 2023 I Lecture 7 - GPU architecture and CUDA Programming

18 Sep 2024

Early GPU Development and Purpose

  • NVIDIA and AMD were originally producing chips for gaming purposes. (41s)
  • Early GPUs were designed to solve the problem of rendering images from a mathematical description of a scene. (2m37s)

GPU Architecture and Processing Power

  • Modern GPUs are able to render complex scenes in real time, such as Epic's Nanite demo. (3m39s)
  • Triangle meshes are used to represent surfaces in computer graphics, and GPUs use math to determine where these vertices project onto the screen. (4m20s)
  • In the early 2000s, shading languages were developed that run a program for every pixel on screen, allowing for complex material simulations. (6m12s)
  • GPUs began incorporating multicore and SIMD architectures to handle the increasing demand for processing power in computer graphics, particularly for rendering high-resolution images at high frame rates. (7m30s)

Early Attempts at General-Purpose Computing

  • Around 2004, researchers began exploring the use of GPUs for general-purpose computing, recognizing their potential for parallel processing beyond graphics rendering. (10m24s)
  • Early attempts to utilize GPUs for general-purpose computing involved "hacks" such as rendering triangles to trigger function calls that performed non-graphical computations, treating the graphics pipeline as a means to execute parallel code. (9m43s)

Introduction of CUDA

  • Prior to the introduction of CUDA in 2007, interacting with GPUs for general-purpose computing was limited to graphics-specific tasks, primarily using OpenGL to draw triangles and compute pixel colors. (13m41s)
  • CUDA introduced a new abstraction, "compute mode," which allows developers to execute custom functions (kernels) on the GPU in an SPMD (Single Program, Multiple Data) fashion, similar to ISPC, by launching multiple copies of the kernel. (15m2s)
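
A minimal sketch of what this compute-mode abstraction looks like in practice (the kernel name and sizes are illustrative, not the lecture's code): the __global__ function is the kernel, and the <<<...>>> launch creates many instances of it at once, SPMD-style.

    // Each launched instance ("CUDA thread") runs this same function on a
    // different element, much like an ISPC program instance.
    __global__ void scaleKernel(float* x, float a, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index of this instance
        if (i < N)
            x[i] = a * x[i];
    }

    int main() {
        const int N = 1024;
        float* deviceX;
        cudaMalloc(&deviceX, N * sizeof(float));

        // Bulk launch: 8 thread blocks of 128 threads = 1024 kernel instances.
        scaleKernel<<<N / 128, 128>>>(deviceX, 2.0f, N);
        cudaDeviceSynchronize();

        cudaFree(deviceX);
        return 0;
    }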

CUDA Programming Concepts

  • When programming in CUDA, developers can conceptualize the execution of code in a similar way to ISPC, where a function is run multiple times with n instances. (18m37s)
  • In CUDA, thread IDs are multi-dimensional, often represented as a grid, such as 4x3, for convenience in graphics and tensor operations. (19m17s)
  • CUDA organizes instances into blocks, with each block containing a specific number of threads, and multiple thread blocks can be created simultaneously. (19m34s)
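
A sketch of how these multi-dimensional IDs and blocks appear in code; the 4x3 block shape echoes the grid mentioned above, but the kernel name and array sizes are illustrative:

    // Each CUDA thread derives a unique 2D coordinate from its block ID and
    // its thread ID within the block.
    __global__ void indexDemo(int* out, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = y * width + x;   // row-major linear index
    }

    int main() {
        const int width = 16, height = 9;
        int* deviceOut;
        cudaMalloc(&deviceOut, width * height * sizeof(int));

        dim3 threadsPerBlock(4, 3);   // a 4x3 block of threads
        dim3 numBlocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
        indexDemo<<<numBlocks, threadsPerBlock>>>(deviceOut, width, height);
        cudaDeviceSynchronize();

        cudaFree(deviceOut);
        return 0;
    }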

CUDA Memory Model and Communication

  • The launched CUDA threads themselves execute on the GPU; the kernel call (matrixAddDoubleB in the lecture's example) is the point at which execution on the GPU begins. (24m53s)
  • In the CUDA memory model, the host program (main) runs on the CPU in its own address space, while the GPU has its own memory address space that is accessed by CUDA threads. (26m9s)
  • cudaMemcpy moves data from CPU memory to GPU memory; this is a slow copy, since it moves bytes over the PCIe bus. (28m22s)
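
A sketch of the host-side flow these bullets describe, assuming a simple element-wise add; variable and kernel names are illustrative:

    #include <cstdlib>

    __global__ void matrixAdd(const float* A, const float* B, float* C, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    int main() {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        // Buffers in the CPU's address space.
        float* hostA = (float*)malloc(bytes);
        float* hostB = (float*)malloc(bytes);
        float* hostC = (float*)malloc(bytes);
        for (int i = 0; i < N; i++) { hostA[i] = 1.0f; hostB[i] = 2.0f; }

        // Buffers in the GPU's address space.
        float *devA, *devB, *devC;
        cudaMalloc(&devA, bytes);
        cudaMalloc(&devB, bytes);
        cudaMalloc(&devC, bytes);

        // Slow copies over the PCIe bus: CPU memory -> GPU memory.
        cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

        // Kernel launch: this is where execution on the GPU begins.
        matrixAdd<<<(N + 127) / 128, 128>>>(devA, devB, devC, N);

        // Copy the result back: GPU memory -> CPU memory.
        cudaMemcpy(hostC, devC, bytes, cudaMemcpyDeviceToHost);

        cudaFree(devA); cudaFree(devB); cudaFree(devC);
        free(hostA); free(hostB); free(hostC);
        return 0;
    }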

CUDA Thread Management

  • CUDA threads are not created individually, but rather in bulk launches, similar to a hypothetical C++ API that allows the creation of multiple threads at once. (31m16s)
  • CUDA utilizes a memory model where CPU threads and GPU CUDA threads have distinct address spaces. (32m50s)
  • CUDA introduces the concept of thread blocks, where each block has its own address space accessible only to threads within that block. (33m0s)
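
A compact sketch (illustrative names, not the lecture's code) of the distinct address spaces a CUDA thread sees: per-thread local variables, the per-block __shared__ space, and GPU global memory, none of which are addressable through ordinary CPU pointers:

    __global__ void scopes(int* globalArray) {        // globalArray points into GPU global memory
        int perThread = threadIdx.x;                  // local variable: private to this CUDA thread

        __shared__ int perBlock[128];                 // one array per thread block, visible only
        perBlock[threadIdx.x] = perThread;            // to the threads within that block
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        globalArray[i] = perBlock[blockDim.x - 1 - threadIdx.x];   // read a block-mate's value
    }

    // Launched from the host with 128-thread blocks, e.g.:
    //   scopes<<<numBlocks, 128>>>(deviceArray);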

Shared Memory in CUDA

  • A program can be written to launch thread blocks with a size of 128, where each block computes 128 outputs and requires 130 inputs. (37m1s)
  • The kernel declares a shared array of 130 elements with the __shared__ keyword, which indicates a single per-thread-block allocation (one array per block, not one per thread). (37m18s)
  • The threads cooperatively load data into the shared array, with a barrier (__syncthreads()) ensuring all data is loaded before computation begins; see the sketch after this list. (39m20s)
  • Shared variables are backed by storage comparable to a high-performance L1 cache, enabling fast access for threads within the same block. (39m56s)
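
A sketch along the lines of the program described in this section, assuming a simple three-point average (the lecture's exact arithmetic may differ); the input array is assumed to hold N + 2 elements:

    #define THREADS_PER_BLK 128

    __global__ void convolve(const float* input, float* output) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // One 130-element array per thread block, held in fast on-chip storage.
        __shared__ float support[THREADS_PER_BLK + 2];

        // Threads cooperatively load the block's 130 inputs into shared memory.
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2)
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

        __syncthreads();   // barrier: all loads complete before any thread computes

        // Each of the 128 threads produces one of the block's 128 outputs.
        output[index] = (support[threadIdx.x] +
                         support[threadIdx.x + 1] +
                         support[threadIdx.x + 2]) / 3.0f;
    }

    // Launched from the host, e.g.:
    //   convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(devInput, devOutput);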

CUDA Thread Block Management

  • The number of threads launched in a CUDA program is up to the programmer. (43m35s)
  • CUDA threads are organized into blocks, and the number of threads in a block cannot exceed the capacity of a single core. (43m46s)
  • CUDA, unlike ISPC, provides synchronization constructs such as per-thread-block barriers and atomic operations. (45m34s)

CUDA Compilation and Execution

  • When a CUDA program is compiled, metadata about its execution requirements, such as threads per block and shared memory allocation, is generated. (48m13s)
  • A GPU contains multiple cores that execute thread blocks, distributing work similarly to a thread pool. (49m17s)

GPU Core Architecture

  • A Volta V100 GPU core, or sub-core, consists of a fetch and decode unit, 16 arithmetic logic units (ALUs) for SIMD operations, and storage for 128 threads' execution contexts. (50m52s)
  • On GPUs, SIMD execution occurs implicitly when 32 consecutive threads, called a warp, execute the same instruction, unlike CPUs where the compiler generates SIMD instructions. (53m37s)

Warp Execution and Scheduling

  • A warp comprises 32 threads and executes instructions in a SIMD (Single Instruction, Multiple Data) manner, meaning all threads in a warp execute the same instruction on different data. (55m38s)
  • A warp scheduler issues instructions to different units (floating point, integer, etc.) every other clock cycle to maximize efficiency. (56m11s)
  • Each of the 32 threads in a warp has its own program counter (PC), but SIMD execution only occurs when all threads in a warp have the same PC value. (58m8s)
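
A small sketch (not from the lecture) of divergence within a warp: when the 32 threads take different branches their PCs differ, so the hardware runs the two paths one after the other with some lanes masked off, and 32-wide SIMD execution resumes once the paths reconverge:

    __global__ void divergenceDemo(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Lanes 0-15 and 16-31 of each warp now have different PC values, so
        // the warp executes the two branches serially with lanes masked off.
        if (threadIdx.x % 32 < 16)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;

        // After the if/else all 32 threads share the same PC again, and the
        // warp proceeds in lockstep at full SIMD width.
        data[i] = data[i] - 1.0f;
    }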

Thread Blocks and Warps

  • A thread block is a programming concept, while a warp is a hardware implementation detail; programmers create thread blocks, and NVIDIA GPUs use warps to execute them. (1h0m13s)

Streaming Multiprocessors (SMs)

  • A single core in the GPU contains 64 warps, which translates to execution contexts for 2048 CUDA threads (64 warps × 32 threads/warp). (1h0m49s)
  • A single Streaming Multiprocessor (SM) can be thought of as four separate cores that share the same shared memory storage. (1h1m33s)
  • Each SM can handle 2048 CUDA threads and has 128 kilobytes of shared storage. (1h1m46s)

Block Size Flexibility and Optimization

  • NVIDIA GPUs allow for flexibility in block size, so code written for older architectures with smaller block sizes can still run on newer architectures with larger block sizes. (1h6m19s)
  • NVIDIA prefers automatic scheduling of SIMD operations, leaving the hardware free to change block sizing and reorder work for optimization. (1h6m48s)
  • When a core runs a thread block requiring 128 threads and 520 bytes of shared storage (the 130-float array from the earlier example), it allocates execution contexts for the threads and carves the block's allocation out of the core's shared memory, distributing the threads across the core's sub-core slices for concurrent execution. (1h7m31s)

Shared Memory Constraints

  • A thread block can fail to be scheduled even when enough execution contexts are free; in that case the limiting resource is shared memory. (1h12m6s)
  • NVIDIA GPUs will not schedule thread blocks that exceed the available shared memory resources; a worked example follows below. (1h12m17s)
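
As a worked example using the numbers above (and assuming blocks of 128 threads that each need roughly 520 bytes of shared memory, as in the earlier convolution-style program): the execution contexts allow at most 2048 / 128 = 16 resident blocks per SM, while 128 KB of shared storage would accommodate far more (about 250 such blocks), so thread contexts are the binding limit. If each block instead declared, say, 32 KB of shared memory, only 128 KB / 32 KB = 4 blocks could be resident at once, and a fifth block would not be scheduled until one of those finished.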

Partial Thread Block Execution and Synchronization

  • CUDA programs cannot execute partial thread blocks, because doing so could lead to deadlocks, especially when synchronization mechanisms like barriers are used. (1h15m34s)
  • Using atomic operations on global memory across multiple thread blocks is acceptable in CUDA; see the sketch after this list. (1h17m20s)
  • One thread block can update a variable in global memory while another thread block waits for that update to occur. (1h17m53s)
  • Thread blocks can interact, but assumptions cannot be made about the order in which operations are performed. (1h18m26s)
  • Threads within a thread block can be assumed to run concurrently, allowing for the use of barriers for synchronization. (1h18m34s)
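
A sketch of the kind of cross-block interaction described above, using an atomic add on a counter in global memory (the counter and kernel are hypothetical, not the lecture's code):

    __device__ int blocksDone = 0;   // lives in GPU global memory, visible to all thread blocks

    __global__ void countFinishedBlocks(float* data, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            data[i] *= 2.0f;

        __syncthreads();   // safe: all threads of this block run concurrently

        // One thread per block atomically bumps the global counter. Atomics across
        // thread blocks are allowed, but no assumptions can be made about the order
        // in which different blocks perform this update.
        if (threadIdx.x == 0)
            atomicAdd(&blocksDone, 1);
    }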
