Stanford CS149 | Parallel Computing | 2023 | Lecture 15 - Domain Specific Programming Languages

Higher-Level Abstractions in Programming

  • The goal of the current lecture is to introduce higher-level abstractions in programming, moving beyond the low-level systems like C++, CUDA, and ISPC that have been the focus of the course so far. (1m59s)
  • The lecture uses two case studies to illustrate these concepts: Halide, a programming language used for image processing on Google Android phones and in Instagram, and Liszt, a research language developed at Stanford for physics simulations on meshes. (2m17s)

Domain-Specific Languages (DSLs)

  • Domain-specific languages (DSLs) prioritize productivity and performance, even if it means sacrificing flexibility in what can be expressed. (6m18s)
  • Examples of DSLs include PyTorch for tensor operations, SQL for database queries, MATLAB for numerical computing, and OpenGL for graphics programming. (6m52s)

Halide: A DSL for Image Processing

  • Halide is a DSL for image processing, designed for performance and used in applications such as the Google Android camera app and Instagram's image filters. (11m6s)
  • Halide is a functional programming language that allows developers to define image processing operations as expressions, similar to mathematical formulas. (37m54s)
  • A Halide program can be represented as a directed acyclic graph (DAG), where each function is a node and dependencies between functions are represented as edges. (39m56s)
  • The Halide compiler can translate a Halide program into C code, where each function becomes a loop that iterates over the image pixels. (43m26s)
  • Halide separates the algorithm from its schedule: by modifying scheduling directives, users can explore optimization choices and generate different versions of the compiled code. (45m9s)
  • Halide's scheduling directives provide primitives for specifying iteration patterns, such as row major, column major, blocked, vectorized, and thread parallel, enabling users to express various permutations for code optimization. (47m22s)
  • In Halide, out.tile(x, y, xi, yi, 256, 32) instructs the compiler to compute the output in tiles of 256 by 32, naming the tile loops x and y and the inner loops within each tile xi and yi (a fuller sketch follows this list). (48m19s)
  • In Halide, the algorithm is specified once, and transformations on its loop nests are expressed separately as the schedule. (50m0s)
  • Halide enables the programmer to specify optimization strategies, such as vectorization and parallelization, at a high level. (50m32s)
  • Halide's functional nature and lack of pointer chasing provide the compiler with knowledge of dependencies, allowing for loop nest manipulation and optimization without compromising program correctness. (54m32s)
  • Halide is aimed at programmers who already understand performance optimization; it lets them explore different optimization strategies efficiently. (56m10s)
  • At Google, a small team of around three programmers specialized in writing Halide schedules, highlighting the complexity of the task. (56m36s)
  • Automated search techniques, similar to those used in game-playing AI systems, have been successfully applied to generate efficient Halide schedules automatically, achieving performance comparable to that of human experts. (57m45s)
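
The following sketch illustrates the algorithm/schedule separation in Halide's C++-embedded syntax. The two-stage blur is the canonical example from the Halide project; the particular schedule shown (tiling, vectorization, parallelism) is illustrative rather than the exact code from the lecture.

```cpp
// Halide's core idea: the algorithm says WHAT to compute; the schedule
// says HOW (loop order, tiling, vectorization, parallelism).
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // Algorithm: a 3x1 horizontal blur followed by a 1x3 vertical blur,
    // written as pure functions over pixel coordinates.
    Func blur_x("blur_x"), blur_y("blur_y");
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: compute the output in 256x32 tiles, vectorize the inner
    // loop, run tile rows in parallel, and compute blur_x on demand
    // within each tile (trading some recomputation for locality).
    blur_y.tile(x, y, xi, yi, 256, 32)
          .vectorize(xi, 8)
          .parallel(y);
    blur_x.compute_at(blur_y, x)
          .vectorize(x, 8);
    return 0;
}
```

Changing only the scheduling calls (for example, dropping the tile, or computing blur_x with compute_root) yields a differently structured loop nest from the same algorithm, which is exactly the exploration the lecture describes.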

A Halide-Like DSL for FPGA Circuits

  • Researchers at Stanford created a language similar to Halide that generates FPGA circuits directly, bypassing the need for a programmable processor. (1h1m27s)

DSLs and Intelligent Optimizations

  • Domain-specific languages (DSLs) like TensorFlow, PyTorch, and SQL, while limited in scope, allow for intelligent optimizations because the compiler understands the specific data structures and operations used. (1h2m30s)
  • A project at Stanford developed a DSL for simulating jet engines using finite element analysis, where the compiler could optimize operations on meshes without knowing the specific mesh representation; this is the Liszt project described below. (1h3m54s)

Liszt: A DSL for Physics Simulations

  • Liszt is a domain-specific language used by physicists to describe operations on meshes, where data such as temperature is stored on vertices. (1h7m49s)
  • Unlike compilers for general-purpose languages such as C++ or Java, the Liszt compiler can automatically identify the parallelism, data locality, and synchronization requirements in the code. (1h8m22s)
  • When compiling for a cluster of computers, the Liszt compiler distributes the mesh across the nodes, manages data exchange between nodes, and maintains ghost cells for data consistency. (1h10m31s)
  • When compiling for a GPU, the compiler can assign a CUDA thread per edge in the mesh and use atomics or alternative strategies to handle multiple threads writing to the same vertex. (1h11m50s)
  • A conflict graph is built in which each mesh edge becomes a node, and two nodes are connected when their mesh edges share a vertex. Coloring this graph so that no two adjacent nodes receive the same color partitions the work: all edges of a given color can be processed in one parallel for loop without locks or atomics (see the sketch after this list). (1h12m35s)
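
A minimal sketch of this coloring strategy is below, using a hypothetical toy mesh; it is not the Liszt implementation, and the greedy coloring shown is just one simple way to produce the color classes.

```cpp
#include <algorithm>
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

int main() {
    // Hypothetical toy mesh: each mesh edge is a pair of vertex ids.
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,3},{3,0},{0,2}};
    int E = (int)edges.size();

    // Conflict graph: two mesh edges are adjacent if they share a vertex,
    // i.e., processing them concurrently could write the same vertex.
    std::vector<std::vector<int>> adj(E);
    for (int a = 0; a < E; ++a)
        for (int b = a + 1; b < E; ++b) {
            auto [u1, v1] = edges[a];
            auto [u2, v2] = edges[b];
            if (u1 == u2 || u1 == v2 || v1 == u2 || v1 == v2) {
                adj[a].push_back(b);
                adj[b].push_back(a);
            }
        }

    // Greedy coloring: each edge takes the smallest color unused by its
    // neighbors, so no two conflicting edges share a color.
    std::vector<int> color(E, -1);
    int numColors = 0;
    for (int a = 0; a < E; ++a) {
        std::set<int> used;
        for (int b : adj[a])
            if (color[b] >= 0) used.insert(color[b]);
        int c = 0;
        while (used.count(c)) ++c;
        color[a] = c;
        numColors = std::max(numColors, c + 1);
    }

    // Each color class is one lock-free parallel loop: no two edges in
    // the same class touch the same vertex.
    for (int c = 0; c < numColors; ++c)
        for (int e = 0; e < E; ++e)        // e.g., a parallel_for per color
            if (color[e] == c)
                printf("color %d: edge (%d,%d)\n", c,
                       edges[e].first, edges[e].second);
    return 0;
}
```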

The Power of DSLs

  • Domain-specific languages (DSLs) are effective when the structure of their primitives matches how users naturally think: physicists think in terms of operations on meshes, image-processing practitioners in terms of operations on pixels, and machine learning practitioners in terms of operations on tensors. (1h15m21s)
  • Successful DSLs often have a small number of primitives that can be composed, enabling users to leverage them in ways not initially anticipated by the creators. This composability contributes to the longevity and wider adoption of these systems. (1h17m42s)

The Unexpected Power of LLMs

  • Large language models (LLMs) were initially designed for a specific purpose: generating text. (1h18m26s)
  • Users are discovering novel and unanticipated applications for LLMs, surpassing the expectations of their creators at companies like OpenAI. (1h18m32s)
  • The ability of users to demonstrate new uses for a system highlights its power and potential. (1h18m36s)

Image Blurring Optimization

  • The running example is a C implementation of image blurring using a 2D convolution filter. (12m6s)
  • This 2D convolution filter can be separated into two 1D convolutions, which significantly reduces the number of operations, especially for larger filter sizes (see the sketch after this list). (15m15s)
  • While this optimization reduces computational work, it increases memory footprint due to the need for a temporary buffer. (17m26s)
  • The implementation being discussed processes the image in chunks to reduce memory footprint and improve cache utilization. (22m51s)
  • The algorithm blurs three rows horizontally at a time, stores them in a temporary buffer, and then blurs those rows vertically to produce a single row of the final output. (22m19s)
  • While this approach reduces memory allocation and improves cache hits, it introduces redundant work: each row of the temporary buffer is needed by three output rows, so it is discarded and recomputed up to three times. (23m54s)
  • Efficiency can be improved with a rolling buffer that slides down the image, keeping intermediate rows resident until they are no longer needed rather than recomputing them. (25m39s)
  • Instead of processing one output row at a time, processing multiple rows simultaneously (e.g., 10 output rows, which need 12 intermediate rows for a 3-row vertical filter) reduces the overhead, since only the two boundary rows per chunk are recomputed. (29m0s)
  • Increasing the chunk size, which represents the number of rows processed together, can improve performance but only if the chunk fits in the cache. (30m1s)
  • Computations can be parallelized by dividing the output into chunks and processing each chunk independently. If necessary, these chunks can be further divided horizontally to increase parallelization. (31m24s)
  • An alternative parallelization approach involves using a sliding window technique, where a smaller temporary buffer processes the data in sections, and parallelism is achieved by dividing the output into columns. (32m19s)
  • Halide can automatically handle boundary conditions, such as when applying a blur effect to an image, which simplifies the coding process. (37m21s)
  • Halide can generate code that separates boundary condition calculations from the main image processing operations, potentially improving efficiency. (37m31s)
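
To make the arithmetic-versus-memory trade-off concrete, here is a minimal C++ sketch of the naive 2D blur and the separated two-pass version; it is illustrative, not the lecture's exact code, and boundary pixels are skipped for brevity.

```cpp
#include <cstddef>
#include <vector>

// Naive 3x3 box blur: ~9 loads and 9 adds per output pixel.
void blur2d(const std::vector<float>& in, std::vector<float>& out,
            size_t W, size_t H) {
    for (size_t y = 1; y + 1 < H; ++y)
        for (size_t x = 1; x + 1 < W; ++x) {
            float s = 0.f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    s += in[(y + dy) * W + (x + dx)];
            out[y * W + x] = s / 9.f;
        }
}

// Separated version: ~3 + 3 operations per pixel, at the cost of a
// full-size temporary buffer. The chunked and sliding-window schedules
// discussed above exist precisely to shrink this temporary.
void blurSeparable(const std::vector<float>& in, std::vector<float>& out,
                   size_t W, size_t H) {
    std::vector<float> tmp(W * H);        // extra memory footprint
    for (size_t y = 0; y < H; ++y)        // horizontal 1D pass
        for (size_t x = 1; x + 1 < W; ++x)
            tmp[y * W + x] = (in[y * W + x - 1] + in[y * W + x] +
                              in[y * W + x + 1]) / 3.f;
    for (size_t y = 1; y + 1 < H; ++y)    // vertical 1D pass
        for (size_t x = 1; x + 1 < W; ++x)
            out[y * W + x] = (tmp[(y - 1) * W + x] + tmp[y * W + x] +
                              tmp[(y + 1) * W + x]) / 3.f;
}
```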

Remaining Lectures

  • The remaining lectures in the course will cover topics such as transactional memory, specialized hardware, and energy efficiency. These lectures will be more conceptual, with exam questions focused on general understanding rather than detailed problem-solving. (48s)
