Stanford CS149 I Parallel Computing I 2023 I Lecture 10 - Efficiently Evaluating DNNs on GPUs

21 Sep 2024

Rendering Overlapping Circles

  • Assignment 3 involves rendering images of potentially overlapping, semi-transparent circles, with the order of rendering impacting the final image. (1m18s)
  • A naive parallelization approach, where each circle is rendered in parallel, will produce an incorrect result due to the order dependency caused by transparency. (4m24s)

Parallel Algorithm Design

  • The challenge lies in designing a parallel algorithm that maintains the correct rendering order, potentially by identifying which circles overlap for each pixel. (5m23s)
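
The order dependency comes from semi-transparent ("over") compositing, which is not commutative. A minimal Python sketch (hypothetical colors and alphas, not the assignment's starter code) showing that swapping the compositing order changes the pixel:

```python
# "Over" compositing of semi-transparent colors is order dependent, so
# rendering overlapping circles in an arbitrary parallel order is incorrect.

def composite_over(dst_rgb, src_rgb, src_alpha):
    """Blend a semi-transparent source color over the current pixel color."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))

pixel = (1.0, 1.0, 1.0)                       # white background
red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)

red_then_blue = composite_over(composite_over(pixel, red, 0.5), blue, 0.5)
blue_then_red = composite_over(composite_over(pixel, blue, 0.5), red, 0.5)
print(red_then_blue)   # (0.5, 0.25, 0.75)
print(blue_then_red)   # (0.75, 0.25, 0.5) -- different result: order matters
```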

Deep Neural Networks: Structure and Operations

  • Deep neural networks can be understood as circuits or functions composed of interconnected neurons. (9m12s)
  • Each neuron computes a dot product between its input vector and a set of weights, adds a bias, and applies a non-linear function (such as ReLU, i.e., max(0, x), referred to simply as "Max" in the lecture). (9m34s)
  • These neurons are organized in layers, where the outputs of one layer become the inputs for the next. (10m51s)
  • Layers can be fully connected (every output from layer i connects to every input of layer i+1) or convolutional (using sliding windows of inputs). (11m2s)
  • Deep Neural Networks (DNNs) can be understood as matrix and vector operations, simplifying to dense matrix algebra. (12m23s)
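
As a rough NumPy sketch of this view (layer sizes chosen arbitrarily), one fully connected layer with a ReLU is just a matrix-vector product, a bias add, and an elementwise max:

```python
import numpy as np

# One fully connected layer: each output neuron takes a dot product of the
# input vector with its weight row, adds a bias, and applies ReLU (max with 0).
rng = np.random.default_rng(0)
x = rng.standard_normal(128)          # layer input (outputs of previous layer)
W = rng.standard_normal((64, 128))    # one weight row per output neuron
b = rng.standard_normal(64)           # one bias per output neuron

y = np.maximum(W @ x + b, 0.0)        # layer output, fed to the next layer
print(y.shape)                        # (64,)
```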

Convolution in Deep Neural Networks

  • Convolution, a key DNN operation, processes input data using weighted combinations of neighboring elements, exemplified by image blurring through averaging surrounding pixel values. (13m19s)
  • Convolution with learned weights, as in ImageNet, enables feature detection by emphasizing or suppressing specific input elements, illustrated by horizontal or vertical gradient detection. (15m20s)
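
A minimal sketch of a 3x3 convolution on a grayscale image (the blur and gradient weights below are standard illustrative kernels, not necessarily the lecture's exact ones):

```python
import numpy as np

def conv3x3(image, weights):
    """Naive 3x3 convolution: each output pixel is a weighted sum of its
    3x3 neighborhood in the input (no padding, so the output shrinks by 2)."""
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for y in range(H - 2):
        for x in range(W - 2):
            out[y, x] = np.sum(image[y:y+3, x:x+3] * weights)
    return out

blur = np.full((3, 3), 1.0 / 9.0)             # average of the 3x3 neighborhood
grad_x = np.array([[-1.0, 0.0, 1.0],          # emphasizes horizontal
                   [-2.0, 0.0, 2.0],          # intensity changes
                   [-1.0, 0.0, 1.0]])

img = np.random.default_rng(0).random((32, 32))
print(conv3x3(img, blur).shape, conv3x3(img, grad_x).shape)   # (30, 30) (30, 30)
```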

Deep Neural Network Architectures

  • Deep Neural Networks (DNNs) like ResNet, Unet, and Inception, are composed of numerous convolutional layers. These layers perform convolutions on input images to generate output images, forming the primary computational workload of DNNs. (18m37s)
  • ResNet and Inception architectures are designed to be more efficient than earlier convolutional neural networks (CNNs) by reducing the required memory and floating-point operations. (19m22s)
  • MobileNet, designed for mobile phones, exemplifies the ongoing efforts to create efficient architectures. It features a specific arrangement of filters and layers with varying sizes and output dimensions, showcasing the intricate design choices involved in optimizing DNNs for resource-constrained devices. (20m57s)

Deep Neural Network Efficiency and Optimization

  • Deep neural networks (DNNs) have become more accurate over time, but also more computationally expensive. (23m17s)
  • While DNN accuracy has plateaued in recent years, the number of weights and the size of filters have decreased, reducing memory and computation requirements. (24m10s)

Matrix Multiplication in Deep Neural Networks

  • Convolutions, a key component of DNNs, can be expressed as matrix multiplications, which can be efficiently implemented in libraries like NumPy. (28m25s)
  • To implement a convolution as a matrix-vector product, input pixels are copied into a matrix with one row per output pixel (width times height rows) and nine columns, one per element of the 3x3 filter's neighborhood; see the sketch after this list. (29m14s)
  • To perform multiple convolutions with multiple filters, the weights of each convolution are stacked as columns in the matrix, resulting in a matrix-matrix product. (30m46s)
  • If the input tensor has multiple channels, the matrices involved in the computation become much larger, with dimensions determined by the number of channels and filter sizes. (32m1s)
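
A sketch of this "im2col" layout for a single-channel input and 3x3 filters (names and shapes are illustrative):

```python
import numpy as np

def im2col_3x3(image):
    """Copy each output pixel's 3x3 input neighborhood into one row of a
    matrix: (H-2)*(W-2) rows, 9 columns (one per filter tap)."""
    H, W = image.shape
    rows = []
    for y in range(H - 2):
        for x in range(W - 2):
            rows.append(image[y:y+3, x:x+3].reshape(9))
    return np.stack(rows)                        # (num_output_pixels, 9)

img = np.random.default_rng(0).random((32, 32))
X = im2col_3x3(img)                              # (900, 9)

w = np.random.default_rng(1).random(9)           # one 3x3 filter
single_filter_out = X @ w                        # matrix-vector product

F = np.random.default_rng(2).random((9, 16))     # 16 filters stacked as columns
multi_filter_out = X @ F                         # matrix-matrix product
print(single_filter_out.shape, multi_filter_out.shape)   # (900,) (900, 16)
```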

Matrix Multiplication Optimization Techniques

  • Matrix multiplication can be expressed hierarchically in terms of submatrix multiplications on blocks. (38m22s)
  • Arithmetic intensity can be improved by performing matrix multiplication on blocks, with larger block sizes leading to higher arithmetic intensity, up to the limit of cache size. (40m0s)
  • Implementing matrix multiplication on large matrices without blocking will result in bandwidth limitations, while blocking can significantly improve performance. (41m21s)
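
A sketch of the blocked formulation (the block size is a tunable assumption; a real implementation would size it to the cache or to GPU shared memory):

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Multiply A (MxK) by B (KxN) as a sum of block x block submatrix
    products. Each submatrix product performs ~block^3 multiply-adds while
    loading ~2*block^2 elements, so arithmetic intensity grows with block
    size until the blocks no longer fit in cache."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, block):
        for j in range(0, N, block):
            for k in range(0, K, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block])
    return C

A = np.random.default_rng(0).random((256, 192))
B = np.random.default_rng(1).random((192, 320))
assert np.allclose(blocked_matmul(A, B), A @ B)   # same result as unblocked
```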

Memory Management and Optimization

  • CPUs use hardware-managed caches that hold individual cache lines from the address space, so the cached data is not necessarily contiguous. (41m42s)
  • GPUs provide CUDA shared memory, a software-managed scratchpad into which threads explicitly load and store contiguous blocks of data from the address space. (43m2s)
  • SIMD instructions can be used to optimize matrix multiplication by performing operations on multiple data elements simultaneously, but require careful consideration of data layout and instruction dependencies. (45m24s)
  • Different blocking strategies favor different block sizes, and different layers of a neural network (with different matrix shapes) may benefit from different matrix multiplication implementations. (47m39s)

Implicit Matrix Multiplication

  • One problem with explicit matrix multiplication in deep neural networks is that input data must be duplicated many times to build the matrices (each pixel appears in many filter windows), which inflates memory use, especially during backpropagation. (48m45s)
  • Implicit matrix multiplication avoids materializing these large matrices by computing, on demand, where each required element lives in the original input tensor; this costs extra index arithmetic but saves memory (see the sketch after this list). (50m23s)
  • To achieve optimal performance on GPUs, large matrices are necessary, which is why small batch sizes in machine learning can lead to reduced performance. (53m33s)
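
A sketch of the implicit approach for the single-channel 3x3 case: instead of building the duplicated matrix, compute on demand which input pixel each (row, column) entry of that virtual matrix corresponds to (illustrative code, not cuDNN's actual implementation):

```python
import numpy as np

def implicit_conv3x3(image, w):
    """Same math as im2col followed by a matrix-vector product, but the
    'matrix element' for (output pixel r, filter tap c) is fetched by
    computing an index into the original image -- no 9x data duplication."""
    H, W = image.shape
    out_h, out_w = H - 2, W - 2
    out = np.zeros(out_h * out_w)
    for r in range(out_h * out_w):          # row of the virtual im2col matrix
        oy, ox = divmod(r, out_w)           # which output pixel this row is
        for c in range(9):                  # column = filter tap index
            fy, fx = divmod(c, 3)
            out[r] += image[oy + fy, ox + fx] * w[c]   # extra index math,
    return out.reshape(out_h, out_w)                   # but no extra memory

img = np.random.default_rng(0).random((8, 8))
w = np.random.default_rng(1).random(9)
print(implicit_conv3x3(img, w).shape)       # (6, 6)
```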

Deep Learning Libraries and Optimization

  • While manual optimization of convolution operations is possible, deep learning libraries like cuDNN and oneAPI offer pre-optimized implementations for various layer types, including the computationally intensive conv2D layer. (57m50s)
  • NVIDIA's cuDNN library provides a range of algorithms and parameters for convolution operations, allowing for fine-tuning and optimization based on specific input tensors and desired performance trade-offs. (58m33s)
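
One common way to exercise this tuning from Python is through PyTorch's cuDNN autotuning flag (an illustrative sketch; the lecture discusses the cuDNN interface itself):

```python
import torch

# Ask cuDNN to benchmark its candidate convolution algorithms (implicit GEMM,
# FFT-based, Winograd, ...) for this layer's input shapes and cache the fastest.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")
y = conv(x)        # first call runs the autotuner; later calls reuse its choice
print(y.shape)     # torch.Size([32, 128, 56, 56])
```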

Implicit GEMM and Operation Fusion

  • Implicit GEMM is the default algorithm for evaluating convolutions in CNNs: it treats the convolution as a large general matrix multiplication (GEMM) without explicitly creating the matrices. (59m9s)
  • CNNs often involve multiple layers performed sequentially, leading to frequent data movement between memory and processing units. This data movement can create a bottleneck, especially for operations like scaling, bias addition, and max pooling, which are bandwidth-bound. (1h0m52s)
  • Fusing operations like scaling, bias addition, and max pooling with matrix multiplication can significantly reduce data movement and improve performance. This fusion can be achieved by performing these operations inline within the matrix multiplication process. (1h1m58s)
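
A conceptual sketch of fusion (plain NumPy loops, not a real GPU kernel): the bias add and ReLU are applied to each output block right after it is computed, instead of in separate bandwidth-bound passes over the full output:

```python
import numpy as np

def fused_matmul_bias_relu(A, B, bias, block=64):
    """Blocked matmul whose epilogue (bias add + ReLU) runs on each output
    block while it is still 'resident', rather than re-reading the whole
    output matrix from memory in separate elementwise passes."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, block):
        for j in range(0, N, block):
            acc = np.zeros((min(block, M - i), min(block, N - j)))
            for k in range(0, K, block):
                acc += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
            # Fused epilogue: bias and ReLU before the block is written back.
            C[i:i+block, j:j+block] = np.maximum(acc + bias[j:j+block], 0.0)
    return C

A = np.random.default_rng(0).random((128, 96))
B = np.random.default_rng(1).random((96, 128))
bias = np.random.default_rng(2).random(128)
assert np.allclose(fused_matmul_bias_relu(A, B, bias),
                   np.maximum(A @ B + bias, 0.0))
```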

Attention Operation Optimization

  • The attention operation in a neural network involves tensors Q, K, and V, representing queries, keys, and values. Q has dimensions M by D, while K and V each have dimensions N by D. (1h5m10s)
  • The attention calculation first multiplies Q by K transpose, producing an M by N score matrix; a softmax is then applied to each row, which involves rescaling the elements based on the maximum value in the row. For long sequences this intermediate matrix is large and poses computational challenges. (1h6m15s)
  • A technique to improve efficiency involves factoring the softmax calculation and processing the matrix block by block, reducing the memory footprint and enabling more efficient computation. (1h9m45s)
  • The softmax can be computed in chunks by maintaining a running maximum and a running sum of exponentials, which allows the first matrix multiplication, the softmax, and the final product with V to be fused. This reduces the memory footprint from order N squared to order block-size squared, so the working set fits in on-chip memory and much longer sequences can be processed. (1h10m44s)
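
A sketch of the blocked softmax trick for one row of the score matrix (a full fused attention kernel would also multiply each block of probabilities by its block of V immediately and rescale the partial output, so nothing of length N is ever stored):

```python
import numpy as np

def blocked_softmax_row(scores, block=4):
    """Softmax over one row of attention scores, processed block by block,
    keeping only a running maximum and a running sum of exponentials.
    Previously accumulated values are rescaled whenever the maximum grows."""
    running_max, running_sum = -np.inf, 0.0
    chunks = []
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]
        new_max = max(running_max, chunk.max())
        # Rescale the sum accumulated so far to the new maximum, then add in
        # this block's contribution.
        running_sum = (running_sum * np.exp(running_max - new_max)
                       + np.exp(chunk - new_max).sum())
        running_max = new_max
        chunks.append(chunk)   # a fused kernel would consume the block here
    return np.concatenate([np.exp(c - running_max) for c in chunks]) / running_sum

row = np.random.default_rng(0).standard_normal(16)
ref = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
assert np.allclose(blocked_softmax_row(row), ref)
```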

Automated Optimization Frameworks

  • Optimizations like fusing batch normalization or resizing and padding into matrix multiplication were initially done by hand, but are now automated by frameworks like JAX, which analyze tensor loop nests to generate optimized, fused code. (1h12m41s)
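
For example, a hedged JAX sketch: under jax.jit, the XLA compiler traces the whole function and may fuse the elementwise bias add and ReLU into the surrounding computation instead of materializing intermediates (whether a particular fusion happens is up to the compiler):

```python
import jax
import jax.numpy as jnp

@jax.jit
def layer(x, W, b):
    # XLA sees the full computation and can fuse the bias add and ReLU with
    # the matmul's epilogue rather than writing intermediate arrays to memory.
    return jnp.maximum(x @ W + b, 0.0)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 128))
W = jax.random.normal(key, (128, 64))
b = jax.random.normal(key, (64,))
print(layer(x, W, b).shape)   # (32, 64)
```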

GPUs and Deep Neural Network Computation

  • GPUs are suitable for deep neural network computations due to their high parallelism, arithmetic intensity, and single instruction, multiple data (SIMD) capabilities, making them efficient for matrix multiplication operations. (1h15m27s)
  • GPUs are general purpose processors, but their architecture can be suboptimal for Deep Neural Network (DNN) evaluation because they are designed to amortize non-math work over large math operations. (1h16m4s)
  • Architects include SIMD (Single Instruction, Multiple Data) execution in processors to amortize non-math work, such as instruction stream control and data access, across many math operations that execute the same instruction. (1h16m53s)
  • Nvidia's Tensor Cores are specialized processing units designed for efficient matrix multiplication, offering significantly higher computational throughput for DNN tasks compared to general-purpose CUDA cores. (1h18m50s)
