Stanford CS149 | 2023 | Lecture 9 - Distributed Data-Parallel Computing Using Spark
Warehouse-Scale Computers
- Warehouse-scale computers, pioneered by Luiz Barroso, are essentially clusters of commodity PCs connected together with Ethernet, offering a scalable and relatively inexpensive computing solution.
- Warehouse-scale computers were initially built with cheap components, but to improve application development, they now utilize expensive, high-bandwidth networks similar to those in supercomputers.
- A typical rack in such a system consists of 20-40 servers, each with a multi-core CPU, connected via a top-of-rack switch, with the number of servers limited by power capacity.
- A notable bottleneck in these systems is the network bandwidth, which can be significantly lower than local disk access speeds, especially between racks.
- Network bandwidth in warehouse-scale computers has significantly improved, reaching speeds comparable to solid-state drives (around 2 gigabytes per second). This means accessing data from a remote node's disk can be as fast as accessing it locally.
Distributed File Systems
- Distributed file systems like Google File System (GFS) and Hadoop Distributed File System (HDFS) are used to store data persistently in a distributed system. These systems are designed for large files and primarily handle appending data and reading it, with infrequent in-place updates.
- A distributed file system is introduced, which divides large files into chunks or blocks, typically ranging from 64 to 256 megabytes in size.
- These blocks are replicated and distributed across multiple racks to prevent data loss in case of failures.
- A master node manages the metadata and directory of the replicas, providing a global namespace for the distributed file system.
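As a rough sketch, rack-aware replica placement might look like the following (the `place_replicas` helper and its rotating policy are hypothetical; real systems such as HDFS use more elaborate placement rules):

```python
def place_replicas(block_id, racks, replication=3):
    # Hypothetical policy: put each replica on a different rack,
    # rotating by block id, so a single rack failure loses at most
    # one copy of any block and load spreads across racks.
    n = len(racks)
    assert replication <= n, "need a distinct rack per replica"
    return [racks[(block_id + i) % n] for i in range(replication)]

# Block 0 of a file lands on racks A, B, C; block 1 on B, C, D; etc.
placement = place_replicas(1, ["A", "B", "C", "D"])
```

The master node would record placements like this in its metadata so clients can locate any replica of any block.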
Data-Parallel Programming
- Data parallel programming ideas can be used to program distributed computers, which are computers composed of multiple separate operating system instances.
- One of the main reasons to use a cluster of machines is to gain IO bandwidth, especially when processing huge amounts of data (hundreds of terabytes).
- Communication between nodes running separate operating system instances in a distributed system is achieved through message passing, which involves sending data between threads in different address spaces.
- Message Passing Interface (MPI) is discussed as a method for distributed data-parallel computing, but it is noted for its complexity and lack of inherent fault tolerance.
- The map function, used in data-parallel operations, is highlighted for its ease of parallelization due to the lack of dependencies between elements and its side-effect-free nature, which ensures that the input remains unchanged.
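In Python this is just the built-in `map`: each element is processed independently and the input collection is left untouched, which is exactly what makes the operation trivially parallelizable:

```python
def square(x):
    # Side-effect free: reads its input, returns a new value.
    return x * x

inputs = [1, 2, 3, 4]
outputs = list(map(square, inputs))
# The calls have no dependencies on one another, so they could run
# in any order, or in parallel, without changing the result.
```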
MapReduce
- The map-reduce programming model is explained, with the mapper function processing individual log file lines to identify entries from mobile clients and update a result map, while the reducer function aggregates values associated with unique keys to produce a final sum.
- MapReduce jobs can be used to count word occurrences in a file, with one map task per block of the input file.
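A single-machine sketch of the word-count job (the real framework would run `mapper` once per input block in parallel and shuffle the emitted pairs to parallel reducers):

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word on one line of the block.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum every count emitted for one word.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                  # map phase: one task per block
        for word, one in mapper(line):
            groups[word].append(one)    # shuffle: gather values by key
    return dict(reducer(w, cs) for w, cs in groups.items())  # reduce phase
```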
- The parallelism in the reduce phase comes from assigning a subset of the keys to each reducer task.
- To ensure a correct reduction when parallelizing across keys, all key-value pairs with the same key must be sent to the same reducer task.
- A hash function based on key value can be used to determine where data should be sent for reduction.
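A minimal version of that routing rule (the `reducer_for` helper name is hypothetical):

```python
def reducer_for(key, num_reducers):
    # Every pair with the same key hashes to the same reducer task,
    # so that one task sees all of that key's values.
    return hash(key) % num_reducers
```

Because the mapping is a pure function of the key, every mapper computes the same destination for a given key without any coordination.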
- A job scheduler can exploit data locality by running mapper jobs close to input blocks and reducer jobs near where most of the data is located.
- If a mapper node fails, the system can recover by running the tasks on a different node with access to the replicated data blocks.
- MapReduce can handle slow or failing machines by replicating tasks and using the results from the first completed task, while terminating the others.
- MapReduce is easy to understand and implement, but has limitations in its programming model, only allowing for a linear arrangement of map and reduce functions.
- MapReduce is inefficient for iterative algorithms, like PageRank, and for handling ad hoc queries on large datasets, as both scenarios require frequent access to the distributed file system.
Spark
- A 2011 paper argued that data locality in data centers was becoming irrelevant for two reasons: network bandwidth was increasing, and most big data application working sets could fit in memory.
- A potential problem with using memory instead of storage systems for intermediate data is the loss of computation in case of power failure.
- Spark addresses the problem of fault-tolerant, in-memory, distributed computing by introducing the concept of a resilient distributed dataset (RDD).
- Resilient Distributed Datasets (RDDs) are immutable, ordered collections of records, created through transformations from other RDDs or persistent storage.
- RDDs can be created from data in a distributed file system and transformed using operations like filter, map, and reduceByKey.
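The semantics of such a pipeline can be mimicked in plain Python (the real code would call `filter`, `map`, and `reduceByKey` on a PySpark RDD loaded from the distributed file system; the log lines here are made up for illustration):

```python
from itertools import groupby

lines = ["ERROR disk full", "INFO heartbeat",
         "ERROR disk full", "ERROR net down"]

# like rdd.filter(lambda l: l.startswith("ERROR"))
errors = (l for l in lines if l.startswith("ERROR"))

# like rdd.map(lambda l: (l.split()[1], 1))
pairs = ((l.split()[1], 1) for l in errors)

# like rdd.reduceByKey(lambda a, b: a + b):
# group the pairs by key, then sum each group's values
counts = {k: sum(v for _, v in g)
          for k, g in groupby(sorted(pairs), key=lambda p: p[0])}
```

Unlike this eager version, Spark builds the chain of transformations lazily and only computes it when an action (such as `count` or `collect`) is invoked.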
- The `persist` command can be used to keep an RDD in memory for efficient reuse in subsequent operations.
- Resilient Distributed Datasets (RDDs) can be implemented more efficiently by considering dependencies and applying optimizations like fusion and tiling.
- Narrow dependencies exist when an RDD partition depends on only one other partition, allowing for automatic fusion by the system.
- Wide dependencies occur when an RDD partition depends on multiple other partitions, such as in a group by key operation, making automatic fusion more challenging.
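A tiny illustration of why narrow dependencies fuse: a filter followed by a map can run as one pass over each partition with no intermediate collection, whereas a wide dependency like group-by-key would need data from every partition before it could proceed:

```python
partitions = [[1, 2, 3], [4, 5, 6]]

def fused_pass(partition):
    # filter(is-even) and map(*10) fused into a single loop over one
    # partition; no other partition's data is needed (narrow dependency).
    return [x * 10 for x in partition if x % 2 == 0]

# Each partition is processed independently, so this loop parallelizes.
result = [fused_pass(p) for p in partitions]
```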