Distributed System VII

这一篇介绍Popular Systems, 包括最近比较火的MapReduce/Hadoop，Spark和Distributed Machine Learning. 不是重点，随便写写的一个。

Data Intensive Computing & MapReduce

讲High-performance computing(HPC), Cluster computing和MapReduce.

HPC

HPC Machine	Programming Model	HPC Operation	Fault Tolerance

Machine: computer nodes, networks, storage servers
Programming model: programs described at low level, rely on small number of software packages
Operation: long-livedprocesses, make use of spatial locality, hold all program data in memory, high bandwidth communication
Fault Tolerance: checkpoint, restore, rerformance scaling

MapReduce

MapReduce, 我不喜欢的一个技术。对大数据通过Map, Shuffle, Reduce进行处理。Map过程输出<key, value>键值对到Reduce过程，Reduce过程对key相同的键值对归并到一台机器处理。Map过程全部结束后Reduce过程才会开始，每个过程的时间由最慢的机器决定，所有的数据以及中间结果全部存在硬盘中。

MapReduce API

Requirements
- Programmer must supply Mapper & Reducer classes
Mapper
- Steps through file one line at a time
- Code generates sequence of <key, value> pairs
- Default types for keys & values are strings
Reducer
- Given key + iterator that generates sequence of values
- Generate one or more <key, value> pairs

Implementation

Built on Top of Parallel File System
- Google: GFS, Hadoop: HDFS
- Provides global naming
Breaks work into tasks
- Master schedules tasks on workers dynamically
- Typically #tasks » #processors
Net Effect
- Input: Set of files in reliable file system
- Output: Set of files in reliable file system

Operations

Mappingg	Shuffling	Reducing

Dnamically map input file blocks onto mappers, each generates key/value pairs from blocks, each writes R files on local file system	Each Reducer handles 1/R possible key values, fetches file from M mappers, sorts all entries to group values by keys	Each Reducer executes reducer function for each key, writes output values to parallel file system

Characters
- Computation broken into many,short-lived tasks
- Use disk storage to hold intermediate results
Strengths
- Great flexibility in placement,scheduling, and load balancing
- Can access large data sets
Weaknesses
- Higher overhead
- Lower raw performance
Fault Tolerance: data integrity/recovering from failure
- Assume reliable file system
- Detect failed worker: heartbeat mechanism
- Reschedule failed task
Stragglers (Tasks that take long time to execute)
- Might be bug, flaky hardware, or poor partitioning
- When done with most tasks, reschedule any remaining executing tasks

Spark and Distributed ML

Spark介绍，和MapReduce对比。主要介绍SparkRDD，运用in memory的计算和数据共享的方式，因此比起MapReduce在速度上有很大的优势。具体的是将计算分解成为一系列计算，如果在某一步发生了错误可以根据之前的计算和结果恢复。 RDD数据是只读的，也就是说不可更改。因为数据全部存在内存，在进行iteration的计算时有很大的优势。

RDD and Lineage

RDD: Resilient Distributed Datasets

Provide interface based on coarse-grained operations
Efficient fault recovery using lineage
- Each operation is applied to many elements
- Low each operation
- Recompute lost partitions on failure

Operations

Transformations: create new RDD from existing one
Actions: return value to caller
Persist RDD to memory

Consistency & Fault Recovery

Immutable datasets: recreate/checkpoints
High overhead: copying data (no mutate-in-place)
Needs lots of memory (might not be able to run your workload)

MapReduce Vs Spark

	MapReduce	Spark RDD
Framework
Typical Applications	Sequence steps, each requires map/reduce; Series of data transformations; Iterating until convergence	Streaming analytics, Iterative ML algorithms, Graph/social data
Strengths	Simple for users; very general; good for large-scale data analysis; highly expressive and scalable	Overcome disk I/O bottleneck, fast in memory operations
Limitations	No locality of data; each map/reduce must complete befor next; Huge I/O to disks and over network	Storage for web applications; incremental web crawlers; data can’t fit in memory

Distributed ML

介绍了分布式机器学习方面的一些困境，总觉得最后结论是ML不适合分布式来着…失忆∠( ᐛ 」∠)＿

Trade-off: stale state & throughput

Distributed variants highly complex and not always faster!
Challenges: communication overhead and stragglers
Key ideas: P2P+selective communication, bounded-delay BSP

参考资料:

MapReduce paper

Word Count

Matricx multiplication

RDD paper

FEATURED TAGS

Java Hashset Linkedhashset BFS Array DFS Hashmap Design ICC DP Stack Python Database BinarySearch ArrayList SegmentTree BIT TopologicalSort Bit BST TreeSet TreeMap Trie String LinkedList KMP D&C Greedy Math Backtracking Tree UnionFind TwoPointers C/C++ GC Web Sort Graph DS DL PP ML Sampling Minimax Heap

Distributed System VII

15640 DS7 - Popular Systems