Stanford CS149 | 2023 | Lecture 9 - Distributed Data-Parallel Computing Using Spark
Warehouse-Scale Computers
- Warehouse-scale computers, pioneered by Luiz Barroso, are essentially clusters of commodity PCs connected together with Ethernet, offering a scalable and relatively inexpensive computing solution.
- Warehouse-scale computers were initially built with cheap components, but to improve application development, they now utilize expensive, high-bandwidth networks similar to those in supercomputers.
- A typical rack in such a system consists of 20-40 servers, each with a multi-core CPU, connected via a top-of-rack switch, with the number of servers limited by power capacity.
- A notable bottleneck in these systems is the network bandwidth, which can be significantly lower than local disk access speeds, especially between racks.
- Network bandwidth in warehouse-scale computers has significantly improved, reaching speeds comparable to solid-state drives (around 2 gigabytes per second). This means accessing data from a remote node's disk can be as fast as accessing it locally.
Distributed File Systems
- Distributed file systems like Google File System (GFS) and Hadoop Distributed File System (HDFS) are used to store data persistently in a distributed system. These systems are designed for large files and primarily handle appending data and reading it, with infrequent in-place updates.
- A distributed file system is introduced, which divides large files into chunks or blocks, typically ranging from 64 to 256 megabytes in size.
- These blocks are replicated and distributed across multiple racks to prevent data loss in case of failures.
- A master node manages the metadata and directory of the replicas, providing a global namespace for the distributed file system.
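As a rough sketch, rack-aware replica placement might look like the following (the `place_replicas` helper and its rotating policy are hypothetical; real systems such as HDFS use more elaborate placement rules):

```python
def place_replicas(block_id, racks, replication=3):
    # Hypothetical policy: put each replica on a different rack,
    # rotating by block id, so a single rack failure loses at most
    # one copy of any block and load spreads across racks.
    n = len(racks)
    assert replication <= n, "need a distinct rack per replica"
    return [racks[(block_id + i) % n] for i in range(replication)]

# Block 0 of a file lands on racks A, B, C; block 1 on B, C, D; etc.
placement = place_replicas(1, ["A", "B", "C", "D"])
```

The master node would record placements like this in its metadata so clients can locate any replica of any block.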
Data-Parallel Programming
- Data parallel programming ideas can be used to program distributed computers, which are computers composed of multiple separate operating system instances.
- One of the main reasons to use a cluster of machines is to gain IO bandwidth, especially when processing huge amounts of data (hundreds of terabytes).
- Communication between nodes running separate operating system instances in a distributed system is achieved through message passing, which involves sending data between threads in different address spaces.
- Message Passing Interface (MPI) is discussed as a method for distributed data-parallel computing, but it is noted for its complexity and lack of inherent fault tolerance.
- The map function, used in data-parallel operations, is highlighted for its ease of parallelization due to the lack of dependencies between elements and its side-effect-free nature, which ensures that the input remains unchanged.
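In Python this is just the built-in `map`: each element is processed independently and the input collection is left untouched, which is exactly what makes the operation trivially parallelizable:

```python
def square(x):
    # Side-effect free: reads its input, returns a new value.
    return x * x

inputs = [1, 2, 3, 4]
outputs = list(map(square, inputs))
# The calls have no dependencies on one another, so they could run
# in any order, or in parallel, without changing the result.
```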
MapReduce
- The map-reduce programming model is explained, with the mapper function processing individual log file lines to identify entries from mobile clients and update a result map, while the reducer function aggregates values associated with unique keys to produce a final sum.
- MapReduce jobs can be used to count word occurrences in a file, with one map task per block of the input file.
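A single-machine sketch of the word-count job (the real framework would run `mapper` once per input block in parallel and shuffle the emitted pairs to parallel reducers):

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word on one line of the block.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum every count emitted for one word.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                  # map phase: one task per block
        for word, one in mapper(line):
            groups[word].append(one)    # shuffle: gather values by key
    return dict(reducer(w, cs) for w, cs in groups.items())  # reduce phase
```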
- The parallelism in the reduce phase comes from assigning a subset of the keys to each reducer task.
- To ensure a correct reduction when parallelizing across keys, all key-value pairs with the same key must be sent to the same reducer task.
- A hash function based on key value can be used to determine where data should be sent for reduction.
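A minimal version of that routing rule (the `reducer_for` helper name is hypothetical):

```python
def reducer_for(key, num_reducers):
    # Every pair with the same key hashes to the same reducer task,
    # so that one task sees all of that key's values.
    return hash(key) % num_reducers
```

Because the mapping is a pure function of the key, every mapper computes the same destination for a given key without any coordination.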
- A job scheduler can exploit data locality by running mapper jobs close to input blocks and reducer jobs near where most of the data is located.
- If a mapper node fails, the system can recover by running the tasks on a different node with access to the replicated data blocks.
- MapReduce can handle slow or failing machines by replicating tasks and using the results from the first completed task, while terminating the others.
- MapReduce is easy to understand and implement, but has limitations in its programming model, only allowing for a linear arrangement of map and reduce functions.
- MapReduce is inefficient for iterative algorithms, like PageRank, and for handling ad hoc queries on large datasets, as both scenarios require frequent access to the distributed file system.
Spark
- A 2011 paper argued that data locality in data centers was becoming irrelevant for two reasons: network bandwidth was increasing, and most big data application working sets could fit in memory.
- A potential problem with using memory instead of storage systems for intermediate data is the loss of computation in case of power failure.
- Spark addresses the problem of fault-tolerant, in-memory, distributed computing by introducing the concept of a resilient distributed dataset (RDD).
- Resilient Distributed Datasets (RDDs) are immutable, ordered collections of records, created through transformations from other RDDs or persistent storage.
- RDDs can be created from data in a distributed file system and transformed using operations like filter, map, and reduceByKey.
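The semantics of such a pipeline can be mimicked in plain Python (the real code would call `filter`, `map`, and `reduceByKey` on a PySpark RDD loaded from the distributed file system; the log lines here are made up for illustration):

```python
from itertools import groupby

lines = ["ERROR disk full", "INFO heartbeat",
         "ERROR disk full", "ERROR net down"]

# like rdd.filter(lambda l: l.startswith("ERROR"))
errors = (l for l in lines if l.startswith("ERROR"))

# like rdd.map(lambda l: (l.split()[1], 1))
pairs = ((l.split()[1], 1) for l in errors)

# like rdd.reduceByKey(lambda a, b: a + b):
# group the pairs by key, then sum each group's values
counts = {k: sum(v for _, v in g)
          for k, g in groupby(sorted(pairs), key=lambda p: p[0])}
```

Unlike this eager version, Spark builds the chain of transformations lazily and only computes it when an action (such as `count` or `collect`) is invoked.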
- The `persist` command can be used to keep an RDD in memory for efficient reuse in subsequent operations.
- Resilient Distributed Datasets (RDDs) can be implemented more efficiently by considering dependencies and applying optimizations like fusion and tiling.
- Narrow dependencies exist when an RDD partition depends on only one other partition, allowing for automatic fusion by the system.
- Wide dependencies occur when an RDD partition depends on multiple other partitions, such as in a group by key operation, making automatic fusion more challenging.
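A tiny illustration of why narrow dependencies fuse: a filter followed by a map can run as one pass over each partition with no intermediate collection, whereas a wide dependency like group-by-key would need data from every partition before it could proceed:

```python
partitions = [[1, 2, 3], [4, 5, 6]]

def fused_pass(partition):
    # filter(is-even) and map(*10) fused into a single loop over one
    # partition; no other partition's data is needed (narrow dependency).
    return [x * 10 for x in partition if x % 2 == 0]

# Each partition is processed independently, so this loop parallelizes.
result = [fused_pass(p) for p in partitions]
```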