Pitfalls of Unified Memory Models in GPUs

04 Nov 2024

Introduction and Motivation

  • The presenter starts by asking the audience about their knowledge of GPUs and then shifts to a tangent about two cars, an MG MGB GT and a modern Volvo, to illustrate the difference in complexity and repairability between old and new technology (19s).
  • The presenter notes that despite the advancements in technology, the way humans interact with these vehicles has not changed much, and this idea will be relevant throughout the talk (1m9s).
  • The presenter shares their personal experience of starting a new job and being tasked with making a program faster, despite having no prior knowledge of GPUs or CUDA (1m45s).
  • The presenter's initial attempt at writing a CUDA program resulted in a simple function that sometimes worked and sometimes didn't, depending on the hardware and software (2m31s).

Initial GPU Programming Challenges

  • The presenter is puzzled by the inconsistent behavior of the program and sets out to answer two questions: why the program sometimes works and why the otherwise performant program is slow (3m12s).
  • The presenter briefly touches on the differences between CPUs and GPUs, noting that a CPU typically runs a number of relatively independent threads, but does not fully elaborate on the point (3m32s).

Understanding GPUs and Concurrency

  • Writing code for GPUs is declarative, where the programmer specifies what they want to achieve, and the hardware decides how to execute it, often implicitly making the code concurrent (3m54s).
  • A CPU is like an office where multiple people work on relatively independent tasks, whereas a GPU is like a factory with specialized equipment for specific tasks, making one thing very well (4m23s).
  • In a factory analogy, storage space contains raw materials, similar to memory in computing, and dedicated space for each person or team can be thought of as shared workspaces (4m49s).
  • Implicit concurrency in GPUs is achieved by specifying a small program that divides up a range and expects it to run in parallel, with each thread processing a portion of the range (5m41s).
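
The talk does not show the kernel itself; a minimal sketch of that pattern, using a hypothetical grid-stride kernel (names and launch configuration are illustrative, not from the talk):

```cuda
// Hypothetical kernel: each thread handles a strided slice of the range,
// so any grid size ends up covering all n elements.
__global__ void scale(float* data, float factor, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        data[i] *= factor;
    }
}

// The programmer only states the shape of the work; the hardware decides
// how the 256 blocks of 256 threads are actually scheduled.
// scale<<<256, 256>>>(d_data, 2.0f, n);
```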

Streams and Data Processing

  • The concept of a moving assembly line, introduced by Henry Ford, revolutionized manufacturing by having workers stay fixed while the pieces moved, and this idea is preferred in computing when possible (6m37s).
  • In computing, the moving assembly line concept is preferred when hardware can perform specific tasks with no dependency between each other, and this is encapsulated via the notion of a stream (7m7s).
  • A stream is an ordered sequence of operations, a concept that is important for understanding how GPUs process data (7m9s).
  • Streams let the GPU handle tasks in a consistent way: each stream executes a logically consistent set of operations in order, which lets the hardware be used more efficiently when tasks spend most of their time reading from and writing to memory (7m12s).
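
The talk does not walk through stream code; a hedged sketch of the idea, with two independent copy-plus-kernel pipelines queued on separate streams (buffer and kernel names are hypothetical):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Work queued on one stream runs in order; work on different streams is
// free to overlap, letting copies and compute proceed like an assembly line.
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
stageA<<<blocks, threads, 0, s0>>>(d_a, n);

cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
stageB<<<blocks, threads, 0, s1>>>(d_b, n);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
```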

GPU Performance and Big O Notation

  • A high computation to data ratio is ideal for GPUs, as seen in the example of matrix multiplication, where the computation time is O(n^3) and the data is O(n^2), resulting in a significant speedup when using GPUs (8m15s).
  • However, the Big O notation can be misleading, as asymptotically better algorithms may not be faster on GPUs due to the importance of constants in these settings (8m57s).
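
As a rough worked example (my numbers, not the talk's): a naive n × n single-precision matrix multiply does about 2n³ floating-point operations over about 3n² elements, so the ratio of computation to data grows linearly with n:

```latex
% Arithmetic intensity of a naive n x n fp32 matrix multiply (illustrative)
\frac{\text{work}}{\text{data}}
  \approx \frac{2n^{3}\ \text{FLOPs}}{12n^{2}\ \text{bytes}}
  = \frac{n}{6}\ \text{FLOPs/byte}
  \approx 683\ \text{FLOPs/byte at } n = 4096
```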

Memory Management in GPUs

  • The goal is to write a mem copy function that can handle any type of copy efficiently, regardless of whether the memory is on the CPU or GPU (9m17s).
  • There are three kinds of memory to consider when copying: memory allocated with cudaMalloc, which is given priority and stays where it is; memory allocated with cudaMallocHost, which is pinned in system RAM; and memory that comes from other sources (9m51s).
  • Memory from cudaMalloc is special: it is always given priority and will never migrate, so moving it to a different location requires an explicit, manual decision (10m9s).
  • Unified memory models in GPUs allow for direct access to system memory by the graphics card, making it a desirable feature (10m44s).
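
A hedged sketch of the three allocation flavours as they appear in the CUDA API; the mapping onto the talk's three categories is my reading of it:

```cuda
float *d_buf, *h_pinned, *managed;
size_t bytes = n * sizeof(float);

// Device memory: lives in GPU RAM and is never migrated or evicted on its own.
cudaMalloc(&d_buf, bytes);

// Pinned (page-locked) host memory: lives in system RAM and is directly
// reachable by the GPU over the bus.
cudaMallocHost(&h_pinned, bytes);

// Managed / unified memory: one pointer for both sides; the physical
// backing may move between system RAM and device RAM as it is touched.
cudaMallocManaged(&managed, bytes);
```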

Unified Memory Model

  • Managed memory or unified memory is a type of memory allocation where the physical storage may be either on a hardware device (GPU) or in system memory, and its location can change depending on the program's needs (11m4s).
  • This type of memory allocation is implemented through a series of system calls: opening a file descriptor for a special device provided by Nvidia and mapping a large slab of memory into the address space via mmap, with corresponding calls when the memory is deallocated (12m5s).
  • The Nvidia-provided device serves as an interface between system memory and device memory, allowing for the allocation and deallocation of memory (12m11s).
  • When allocating memory, the program specifies the address where the memory should exist, ensuring that the pointers are the same across both the system and the device (13m10s).
  • Deallocating memory involves remapping the memory and freeing what's left over, which can be seen in the Nvidia System Profiler (13m38s).
  • The Nvidia System Profiler is a useful tool for understanding what happens when using unified memory models, showing the program's memory allocation and deallocation calls (14m3s).
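
A minimal sketch of the "same pointer on both sides" behaviour being described (the increment kernel is hypothetical); all of the migration happens behind the scenes via the driver machinery the profiler exposes:

```cuda
float* data = nullptr;
cudaMallocManaged(&data, n * sizeof(float));    // one pointer, valid everywhere

for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // first touched on the CPU

increment<<<256, 256>>>(data, n);               // same pointer used on the GPU
cudaDeviceSynchronize();

float first = data[0];                          // read back on the CPU again
cudaFree(data);
```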

System Calls and Page Faults

  • Using tools like strace can help reveal the underlying system calls and memory-allocation work involved in unified memory models (11m49s).
  • When a program runs, it immediately encounters a page fault due to the way the driver enforces memory allocation, resulting in the allocation of physical storage and handling of the page fault, which can take a significant amount of time (14m29s).
  • The time it takes to handle the page fault can be substantial, and in some cases, it can take up a great deal of the program's running time, especially when allocating more memory (15m12s).
  • The hardware tries to help by allocating more memory each time a page fault occurs, resulting in an irregular pattern of page faults with increasing spacing between them (15m32s).
  • This pattern repeats, but after a certain point, the hardware seems to forget that the program is running into page faults and stops helping, resulting in a continued pattern of page faults (15m59s).

Data Movement and Coherency

  • The physical location of data in unified memory models is invisible to a program and may be changed at any time, even if the program is accessing the data through a shared pointer (16m27s).
  • This is rare in programming: the memory can physically move in a way that is entirely opaque to the program yet matters significantly for performance, unlike caches, where data is merely copied and the original stays put (16m57s).
  • The access to data through a virtual address will remain valid and coherent from any processor, regardless of locality, which is necessary due to the complexity of the system (17m23s).
  • The requirement for accesses to remain valid has significant implications for the performance of programs using unified memory models (17m45s).

Memory Copy Performance Issues

  • Existing functions, such as cudaMemcpy, can be used to copy data in CUDA, which may be a solution to the issues encountered with unified memory models (17m57s).
  • A memory copy operation is performed on the CPU, which should be fast since the pointers are accessible via the CPU, but it falls into a pitfall, resulting in poor performance (18m5s).
  • The performance issue is illustrated in a graph, showing a significant decrease in throughput as the size of the data being copied increases, with a drop from over 600 megabytes per second to around 0.085 megabytes per second (18m25s).
  • The reason for this poor performance is due to CPU page faults, which occur when the CPU tries to access a page of memory that is not currently available, resulting in a request to the GPU to allocate space (19m44s).
  • The page faults are caused by the interaction between the device (graphics card) and the CPU, which is limited by the page size of the CPU, resulting in a large number of page faults (around 37,000) (20m32s).
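
A sketch of the copy that falls into this pitfall, assuming both buffers are managed and the source's pages currently live on the GPU (the populate kernel is hypothetical):

```cuda
float *src, *dst;
cudaMallocManaged(&src, bytes);
cudaMallocManaged(&dst, bytes);

populate<<<blocks, threads>>>(src, n);  // pages of src now reside on the GPU
cudaDeviceSynchronize();

// Every 4 KiB page touched here faults on the CPU and has to be migrated
// back from the card before that page of the copy can proceed.
memcpy(dst, src, bytes);
```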

Page Faults and Performance Impact

  • Each page fault results in a system call, which is expensive and leads to poor performance; each fault covers only a 4-kilobyte page, which might be improved by increasing the page size (21m15s).
  • The page faults occur regularly and repeatedly, resulting in a significant impact on performance, as shown in a zoomed-in view of the profile (22m0s).
  • The GPU's hardware attempts to help with memory management by transferring more memory than requested, resulting in a pattern of page faults and speculative prefetches that repeat and are unequally spaced (22m17s).
  • Each red line represents a page fault for 4K, while purple lines represent a speculative prefetch, with the width of each block growing as the number of page faults increases (22m53s).
  • The hardware handles this process automatically, without the need for user intervention, and the amount of memory transferred can range from 64 kilobytes to 2 GB (23m2s).

CUDA Managed Memory and Prefetching

  • Using CUDA's managed memory with the copy direction explicitly specified as device-to-CPU can be slightly faster, but the throughput curve is still stepwise because of the prefetches (23m41s).
  • Copying from the device to the host results in a faster line, reaching up to 10 GB per second, due to the GPU's capacity for dealing with larger pages up to 2 megabytes (24m15s).

Using CUDA Functions for Memory Management

  • It is recommended to use CUDA functions instead of standard functions when writing code that needs to deal with memory management generically, as they can provide faster performance (24m40s).
  • Using CUDA functions can result in a speedup of roughly twice as fast as standard functions, even in the slowest case, with a simple one-line change (24m50s).
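
The one-line change, as I understand it, is to hand the copy to the runtime instead of doing it with the C library:

```cuda
// Before: a plain CPU memcpy faults page by page on managed memory.
memcpy(dst, src, bytes);

// After: cudaMemcpy knows both pointers are managed; cudaMemcpyDefault lets
// the runtime infer the direction and drive the transfer itself.
cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
```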

Memory Migration and Performance

  • The cost of memory management comes from handling page faults and physical memory mapping, leading to the question of whether moving memory to the same location before copying would be more efficient (25m21s).
  • CUDA allows for memory migration, which can potentially improve performance by eliminating the need for copying between GPU and CPU (25m41s).
  • Unified memory models in GPUs allow for memory to be moved between devices, and the size of the memory to be specified, enabling the splitting of arrays across multiple devices, including the CPU and GPU, which can lead to a different type of parallelism (26m13s).
  • This parallelism allows for updates to be made on an array on the GPU and CPU simultaneously, thanks to the guarantees provided by the unified memory model (26m45s).
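
A hedged sketch of the migration API being described, cudaMemPrefetchAsync, including the case where one managed array is split between the GPU and the CPU so both can update their own half:

```cuda
int device = 0;
cudaGetDevice(&device);

// Migrate the whole array onto the GPU ahead of a kernel launch.
cudaMemPrefetchAsync(data, n * sizeof(float), device, stream);

// Or split it: first half to the GPU, second half to system memory, so the
// GPU and CPU can each work on their own half concurrently (n assumed even).
size_t half = (n / 2) * sizeof(float);
cudaMemPrefetchAsync(data,         half, device,          stream);
cudaMemPrefetchAsync(data + n / 2, half, cudaCpuDeviceId, stream);
```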

Prefetching and Performance Bottlenecks

  • Prefetching the source pointer to the device can result in a significant improvement in performance, with a flat line of around 16 GB per second, but this performance drops drastically when the destination array can no longer fit in memory (27m1s).
  • The performance drop is due to the occurrence of page faults, which happen when the system tries to write to or read something that is not already in memory (27m38s).
  • Prefetching both pointers ahead of time can result in a huge amount of performance, but this also leads to a significant drop in performance when neither pointer fits in memory anymore (28m3s).
  • The system's ability to manage memory proactively is compromised when both pointers are prefetched, leading to poor performance (28m34s).
  • The system's decision on what to evict from memory can be flawed, leading to poor performance, as it may evict important data (29m19s).

Managing Memory Accesses

  • A significant issue with unified memory models in GPUs is the constant reading of data from memory that is not necessary, resulting in continued page faults for reads and writes, indicating the need for more careful management of memory copies (29m38s).
  • A solution to this issue is to manage memory accesses more efficiently, as shown in a profile from a running program where memory accesses are optimized to reduce bandwidth consumption and page faults (30m3s).
  • Memory allocated with cudaMalloc is never evicted, even when device memory is full, so allocating more card memory through the cudaMalloc API can shift the graph's performance cliff arbitrarily to the left (30m45s).
  • Moving large amounts of data, such as 512 megabytes, repeatedly can cause slow bandwidth due to constant eviction of necessary data (30m56s).

Optimizing Memory Copies

  • To manage memory copies more carefully, it is necessary to define a prefetch size, create multiple streams so that operations can be queued properly, and calculate the number of prefetches needed, as sketched in the code after this list (31m23s).
  • A loop can be used to start the copy process, prefetching the previous data and sending it back to the CPU, and then prefetching the blocks needed from system memory back onto the card (31m49s).
  • The CPU needs to be involved in remapping data back into its own memory, and this can be done by explicitly sending the data back to the CPU and using a different stream for prefetching (32m30s).
  • Using a different stream for prefetching may not significantly improve performance, but it is still a better approach (33m1s).
  • By combining various prefetches, sending data back to its original location, and managing memory, performance can be significantly improved, with the custom code potentially beating standard CUDA functions for certain tasks (33m56s).
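
A rough sketch of the chunked copy described above, assuming both src and dst are managed allocations larger than device memory; the chunk size and exact stream usage are my own choices, not the talk's code:

```cuda
const size_t kChunk = 64ull << 20;                    // 64 MiB prefetch size (illustrative)
const size_t chunks = (bytes + kChunk - 1) / kChunk;

cudaStream_t copyStream, prefetchStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&prefetchStream);

int device = 0;
cudaGetDevice(&device);

for (size_t c = 0; c < chunks; ++c) {
    size_t off = c * kChunk;
    size_t len = (bytes - off < kChunk) ? bytes - off : kChunk;
    char*  s   = reinterpret_cast<char*>(src) + off;
    char*  d   = reinterpret_cast<char*>(dst) + off;

    // Pull the next blocks of both arrays onto the card before touching them.
    cudaMemPrefetchAsync(s, len, device, prefetchStream);
    cudaMemPrefetchAsync(d, len, device, prefetchStream);
    cudaStreamSynchronize(prefetchStream);

    // Copy this chunk on the device.
    cudaMemcpyAsync(d, s, len, cudaMemcpyDefault, copyStream);

    // Explicitly send the finished chunk back to system memory so the driver
    // does not have to evict it (or something more useful) at a worse moment.
    cudaMemPrefetchAsync(s, len, cudaCpuDeviceId, copyStream);
    cudaMemPrefetchAsync(d, len, cudaCpuDeviceId, copyStream);
}
cudaStreamSynchronize(copyStream);
```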

Complexity of Unified Memory

  • The use of a single pointer across devices and system memory was initially thought to simplify things, but it has proven to be more complicated than expected (34m43s).
  • The complexity of managing memory and pointers is due in part to the fact that computers are still being programmed with concepts and tools designed 50 years ago, such as those used for the PDP-1 (35m27s).
  • The idea of changing programming languages to express hardware more explicitly was discussed in the 1990s, but it was met with resistance, and the tools used today are still largely the same (35m46s).
  • It is impossible for a compiler to statically determine whether a function or code will work due to the inability to succinctly express the properties of a pointer, making it difficult to ensure safe and reliable operation (36m7s).
  • The hardware is capable of handling complex memory operations, but the lack of explicit expression of pointer properties makes it difficult to write safe and reliable code (36m30s).
  • The resistance to change in programming tools and concepts has not been beneficial, and it is time to reconsider the way computers are programmed to better match the capabilities of modern hardware (36m43s).

Profiling and Code Simplicity

  • To understand performance issues in code, it's essential to profile the code using different tools and try to understand why things are happening the way they are, as there is always a reason for it, no matter how complex it may seem (36m50s).
  • When writing code, it's crucial to consider simplifying things to make it easier for others to understand, and to prioritize performance when in doubt (37m39s).

Unified Memory: Benefits and Drawbacks

  • Unified memory models in GPUs, sold by companies like Nvidia and AMD, can make it easier to port old CPU applications to GPUs, but they may not be worth using in performance-critical environments (38m22s).
  • Unified memory models can be useful for migration or compatibility reasons, or when more memory is needed than is available on the GPU, but they may not be better than manually handling memory copies (38m48s).
  • In some cases, the overhead of unified memory models may not be related to page faults, but rather to coherency issues, where the device hardware needs to ensure a consistent view of memory across the system (39m40s).
  • Even if the same page is requested multiple times on the same device, the hardware may still incur overhead to ensure coherency, and counting page faults may not be enough to overcome these issues (40m11s).
