Stanford CS149 I 2023 I Lecture 9 - Distributed Data-Parallel Computing Using Spark
20 Sep 2024 (3 months ago)
Warehouse-Scale Computers
- Warehouse-sized computers, pioneered by Luis Barroso, are essentially clusters of commodity PCs connected together with ethernet, offering a scalable and relatively inexpensive computing solution. (4m59s)
- Warehouse-scale computers were initially built with cheap components, but to improve application development, they now utilize expensive, high-bandwidth networks similar to those in supercomputers. (7m17s)
- A typical rack in such a system consists of 20-40 servers, each with a multi-core CPU, connected via a top-of-rack switch, with the number of servers limited by power capacity. (8m40s)
- A notable bottleneck in these systems is the network bandwidth, which can be significantly lower than local disk access speeds, especially between racks. (12m25s)
- Network bandwidth in warehouse-scale computers has significantly improved, reaching speeds comparable to solid-state drives (around 2 gigabytes per second). This means accessing data from a remote node's disk can be as fast as accessing it locally. (14m18s)
Distributed File Systems
- Distributed file systems like Google File System (GFS) and Hadoop Distributed File System (HDFS) are used to store data persistently in a distributed system. These systems are designed for large files and primarily handle appending data and reading it, with infrequent in-place updates. (19m4s)
- A distributed file system is introduced, which divides large files into chunks or blocks, typically ranging from 64 to 256 megabytes in size. (20m18s)
- These blocks are replicated and distributed across multiple racks to prevent data loss in case of failures. (20m48s)
- A master node manages the metadata and directory of the replicas, providing a global namespace for the distributed file system. (21m1s)
Data-Parallel Programming
- Data parallel programming ideas can be used to program distributed computers, which are computers composed of multiple separate operating system instances. (53s)
- One of the main reasons to use a cluster of machines is to gain IO bandwidth, especially when processing huge amounts of data (hundreds of terabytes). (3m47s)
- Communication between nodes with different operating systems in a distributed system is achieved through message passing, which involves sending data between threads in different address spaces. (14m57s)
- Message Passing Interface (MPI) is discussed as a method for distributed data-parallel computing, but it is noted for its complexity and lack of inherent fault tolerance. (26m43s)
- The map function, used in data-parallel operations, is highlighted for its ease of parallelization due to the lack of dependencies between elements and its side-effect-free nature, which ensures that the input remains unchanged. (28m37s)
MapReduce
- The map-reduce programming model is explained, with the mapper function processing individual log file lines to identify entries from mobile clients and update a result map, while the reducer function aggregates values associated with unique keys to produce a final sum. (31m9s)
- MapReduce jobs can be used to count word occurrences in a file, with one map task per block of the input file. (34m46s)
- The parallelism in reducer tasks comes from associating a certain number of keys to each reducer task. (36m17s)
- To ensure a correct reduction when parallelizing across keys, all keys of the same value must be sent to the same reducer task. (37m36s)
- A hash function based on key value can be used to determine where data should be sent for reduction. (42m36s)
- A job scheduler can exploit data locality by running mapper jobs close to input blocks and reducer jobs near where most of the data is located. (46m22s)
- If a mapper node fails, the system can recover by running the tasks on a different node with access to the replicated data blocks. (47m20s)
- MapReduce can handle slow or failing machines by replicating tasks and using the results from the first completed task, while terminating the others. (49m6s)
- MapReduce is easy to understand and implement, but has limitations in its programming model, only allowing for a linear arrangement of map and reduce functions. (52m8s)
- MapReduce is inefficient for iterative algorithms, like PageRank, and for handling ad hoc queries on large datasets, as both scenarios require frequent access to the distributed file system. (53m17s)
Spark
- A 2011 paper argued that data locality in data centers was becoming irrelevant for two reasons: network bandwidth was increasing, and most big data application working sets could fit in memory. (55m32s)
- A potential problem with using memory instead of storage systems for intermediate data is the loss of computation in case of power failure. (58m41s)
- Spark addresses the problem of fault-tolerant, in-memory, distributed computing by introducing the concept of a resilient distributed dataset (RDD). (1h1m23s)
- Resilient Distributed Datasets (RDDs) are immutable, ordered collections of records, created through transformations from other RDDs or persistent storage. (1h1m46s)
- RDDs can be created from data in a distributed file system and transformed using operations like filter, map, and reduceByKey. (1h2m21s)
- The
persist
command can be used to keep an RDD in memory for efficient reuse in subsequent operations. (1h6m3s)
- Resilient Distributed Datasets (RDDs) can be implemented more efficiently by considering dependencies and applying optimizations like fusion and tiling. (1h13m20s)
- Narrow dependencies exist when an RDD partition depends on only one other partition, allowing for automatic fusion by the system. (1h16m7s)
- Wide dependencies occur when an RDD partition depends on multiple other partitions, such as in a group by key operation, making automatic fusion more challenging. (1h16m28s)
Upcoming Lectures
- The next lecture will cover efficient implementation of DNN and will be presented by Kavon. (1h17m26s)
- The following Tuesday's lecture will finish the Spark discussion and begin covering cache coherency. (1h17m38s)