Stanford CS149 I 2023 I Lecture 9 - Distributed Data-Parallel Computing Using Spark

Warehouse-Scale Computers

  • Warehouse-scale computers, pioneered by Luiz Barroso, are essentially clusters of commodity PCs connected with Ethernet, offering a scalable and relatively inexpensive computing solution. (4m59s)
  • Warehouse-scale computers were initially built with cheap components, but to improve application development, they now utilize expensive, high-bandwidth networks similar to those in supercomputers. (7m17s)
  • A typical rack in such a system consists of 20-40 servers, each with a multi-core CPU, connected via a top-of-rack switch, with the number of servers limited by power capacity. (8m40s)
  • A notable bottleneck in these systems is the network bandwidth, which can be significantly lower than local disk access speeds, especially between racks. (12m25s)
  • Network bandwidth in warehouse-scale computers has significantly improved, reaching speeds comparable to solid-state drives (around 2 gigabytes per second). This means accessing data from a remote node's disk can be as fast as accessing it locally. (14m18s)

Distributed File Systems

  • Distributed file systems like Google File System (GFS) and Hadoop Distributed File System (HDFS) are used to store data persistently in a distributed system. These systems are designed for large files and primarily handle appending data and reading it, with infrequent in-place updates. (19m4s)
  • These file systems divide large files into chunks, or blocks, typically 64 to 256 megabytes in size. (20m18s)
  • These blocks are replicated and distributed across multiple racks to prevent data loss in case of failures. (20m48s)
  • A master node manages the metadata and directory of the replicas, providing a global namespace for the distributed file system. (21m1s)
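As a rough sketch of that directory structure, and not actual GFS/HDFS code, the master's metadata can be thought of as a map from each block of a file to the nodes holding its replicas. The names, block size, and replication factor below are illustrative assumptions.

```scala
// Hypothetical sketch of a distributed file system master's metadata.
// Not GFS/HDFS code; block size and replication factor are assumptions.
object MasterMetadataSketch {
  type NodeId  = String
  type BlockId = (String, Int)            // (file path, block index)

  val blockSizeBytes = 64L * 1024 * 1024  // e.g., 64 MB blocks
  val replication    = 3                  // each block stored on 3 nodes

  // Global namespace: (file, block index) -> nodes holding replicas of that block
  val blockLocations = scala.collection.mutable.Map[BlockId, Seq[NodeId]]()

  // A client asks the master where block i of a file lives,
  // then contacts one of the returned nodes directly for the data.
  def locate(path: String, blockIndex: Int): Seq[NodeId] =
    blockLocations.getOrElse((path, blockIndex), Seq.empty)
}
```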

Data-Parallel Programming

  • Data-parallel programming ideas can be used to program distributed computers, i.e., clusters of machines in which each node runs its own operating system instance. (53s)
  • One of the main reasons to use a cluster of machines is to gain I/O bandwidth, especially when processing huge amounts of data (hundreds of terabytes). (3m47s)
  • Because nodes in a distributed system run separate operating system instances and do not share an address space, communication between them is achieved through message passing, i.e., sending data between threads in different address spaces. (14m57s)
  • Message Passing Interface (MPI) is discussed as a method for distributed data-parallel computing, but it is noted for its complexity and lack of inherent fault tolerance. (26m43s)
  • The map function, used in data-parallel operations, is highlighted for its ease of parallelization due to the lack of dependencies between elements and its side-effect-free nature, which ensures that the input remains unchanged. (28m37s)
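As a small illustration of the last point, here is a generic Scala sketch (not code from the lecture) of why map parallelizes trivially: the function is pure, so splitting the input into chunks and processing them separately gives the same result as a sequential pass.

```scala
object MapIndependenceSketch {
  // map applies a pure function independently to each element; the input is unchanged.
  def f(line: String): Int = line.length   // side-effect free: touches no shared state

  def main(args: Array[String]): Unit = {
    val lines = Vector("GET /a mobile", "GET /b desktop", "GET /c mobile")

    val sequential = lines.map(f)

    // Splitting the input and mapping each half separately (as two workers would)
    // yields the same result, because no element's output depends on another's.
    val (left, right) = lines.splitAt(lines.length / 2)
    val twoWorkers = left.map(f) ++ right.map(f)

    assert(sequential == twoWorkers)
    println(sequential)
  }
}
```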

MapReduce

  • The map-reduce programming model is explained, with the mapper function processing individual log file lines to identify entries from mobile clients and update a result map, while the reducer function aggregates values associated with unique keys to produce a final sum. (31m9s)
  • MapReduce jobs can be used to count word occurrences in a file, with one map task per block of the input file; a word-count sketch follows this list. (34m46s)
  • The parallelism in the reduce phase comes from assigning a subset of the keys to each reducer task. (36m17s)
  • To ensure a correct reduction when parallelizing across keys, all key-value pairs with the same key must be sent to the same reducer task. (37m36s)
  • A hash function applied to the key can be used to determine which reducer task the data should be sent to. (42m36s)
  • A job scheduler can exploit data locality by running mapper jobs close to input blocks and reducer jobs near where most of the data is located. (46m22s)
  • If a mapper node fails, the system can recover by running the tasks on a different node with access to the replicated data blocks. (47m20s)
  • MapReduce can handle slow or failing machines by replicating tasks and using the results from the first completed task, while terminating the others. (49m6s)
  • MapReduce is easy to understand and implement, but has limitations in its programming model, only allowing for a linear arrangement of map and reduce functions. (52m8s)
  • MapReduce is inefficient for iterative algorithms, like PageRank, and for handling ad hoc queries on large datasets, as both scenarios require frequent access to the distributed file system. (53m17s)
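Below is a minimal word-count sketch of the map, hash-partitioned shuffle, and reduce phases described in the bullets above. It is plain Scala simulating the model on a single machine, not actual Hadoop MapReduce API code; the block contents and numReducers value are illustrative.

```scala
// Word count in the MapReduce style, simulated in plain Scala.
object WordCountSketch {
  // Map phase: each mapper takes one block of lines and emits (word, 1) pairs.
  def mapper(blockLines: Seq[String]): Seq[(String, Int)] =
    blockLines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w, 1))

  // Shuffle: a hash of the key decides which reducer receives a pair,
  // so all pairs with the same key land on the same reducer.
  def partition(key: String, numReducers: Int): Int =
    math.floorMod(key.hashCode, numReducers)

  // Reduce phase: each reducer sums the counts for the keys it owns.
  def reducer(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val blocks = Seq(Seq("the quick brown fox", "the lazy dog"),
                     Seq("the fox jumps over the lazy dog"))
    val numReducers = 2

    val mapped   = blocks.flatMap(mapper)                        // one map task per block
    val shuffled = mapped.groupBy { case (w, _) => partition(w, numReducers) }
    val counts   = shuffled.values.map(reducer).reduce(_ ++ _)   // reducers own disjoint keys

    println(counts)
  }
}
```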

Spark

  • A 2011 paper argued that data locality in data centers was becoming irrelevant for two reasons: network bandwidth was increasing, and most big data application working sets could fit in memory. (55m32s)
  • A potential problem with using memory instead of storage systems for intermediate data is the loss of computation in case of power failure. (58m41s)
  • Spark addresses the problem of fault-tolerant, in-memory, distributed computing by introducing the concept of a resilient distributed dataset (RDD). (1h1m23s)
  • Resilient Distributed Datasets (RDDs) are immutable, ordered collections of records, created through transformations from other RDDs or persistent storage. (1h1m46s)
  • RDDs can be created from data in a distributed file system and transformed using operations like filter, map, and reduceByKey; a sketch follows this list. (1h2m21s)
  • The persist command can be used to keep an RDD in memory for efficient reuse in subsequent operations. (1h6m3s)
  • Resilient Distributed Datasets (RDDs) can be implemented more efficiently by considering dependencies and applying optimizations like fusion and tiling. (1h13m20s)
  • Narrow dependencies exist when each partition of an RDD depends on at most one partition of its parent, allowing the system to automatically fuse the operations. (1h16m7s)
  • Wide dependencies occur when an RDD partition depends on multiple parent partitions, such as in a groupByKey operation, making automatic fusion more challenging. (1h16m28s)
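A minimal Spark (Scala) sketch of the RDD pipeline discussed above, assuming a hypothetical HDFS path and log format. It shows where narrow dependencies (filter, map) and a wide dependency (reduceByKey) arise, and how persist keeps results in memory for reuse; it is not the exact code from the lecture.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

    // RDD created from a distributed file system (path is an assumption)
    val lines = sc.textFile("hdfs://logs/access.log")

    // filter and map are narrow dependencies: each output partition depends on
    // exactly one input partition, so the system can fuse them into one pass.
    val mobileHits = lines
      .filter(line => line.contains("mobile"))
      .map(line => (line.split(" ")(0), 1))   // (client, 1); field index is illustrative

    // reduceByKey introduces a wide dependency: each output partition may need
    // data from many input partitions, so the records must be shuffled across nodes.
    val counts = mobileHits.reduceByKey(_ + _)

    // persist keeps the RDD in memory so later actions reuse it
    // instead of recomputing the whole lineage from the file system.
    counts.persist()

    println(counts.count())            // first action materializes and caches counts
    counts.take(10).foreach(println)   // reuses the cached partitions

    sc.stop()
  }
}
```

If a partition of a persisted RDD is lost, Spark can recompute just that partition from the recorded lineage of transformations, which is how RDDs remain fault tolerant without replicating intermediate data.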

Upcoming Lectures

  • The next lecture will cover the efficient implementation of DNNs and will be presented by Kavon. (1h17m26s)
  • The following Tuesday's lecture will finish the Spark discussion and begin covering cache coherency. (1h17m38s)
