This overview explores the Map/Reduce framework within database management systems for large-scale data processing. It discusses key concepts, including the architecture of distributed file systems (DFS) and the role of commodity hardware in cluster computing. The implementation of Map/Reduce, pioneered by Google, allows for efficient parallel processing of massive datasets across thousands of servers. Key phases such as Map and Reduce are outlined, alongside challenges like straggler tasks and tuning parameters for optimal performance. Furthermore, we touch on declarative languages like PIG Latin and HiveQL that enhance querying capabilities.
CS 540 Database Management Systems Map/Reduce
Cluster Computing • Large number of commodity servers, connected by a high-speed, commodity network • A rack holds a small number of servers • A data center holds many racks • Massive parallelism: 100s, 1,000s, or 10,000s of servers, with jobs running for many hours • Failure becomes a fact of life: if the mean time between failures of a single server is 1 year (about 8,760 hours), then 10,000 servers suffer roughly one failure per hour
Distributed File System (DFS) • Stores large files, on the order of TBs or PBs • Each file is partitioned into chunks, e.g. 64 MB • Each chunk is replicated multiple times over different racks for fault tolerance (see the sketch below) • DFS implementations: Google's DFS (GFS), Hadoop's DFS (HDFS)
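To make chunking and replication concrete, here is a minimal Python sketch (not GFS/HDFS code: the 64 MB chunk size comes from the slide, while 3-way replication and the rack names are illustrative assumptions):

  from itertools import cycle

  CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, as on the slide
  REPLICATION = 3                  # a common default; an assumption here

  def chunk_offsets(file_size: int, chunk_size: int = CHUNK_SIZE):
      """Yield (offset, length) pairs covering the whole file."""
      for offset in range(0, file_size, chunk_size):
          yield offset, min(chunk_size, file_size - offset)

  def place_replicas(num_chunks: int, racks: list[str], k: int = REPLICATION):
      """Round-robin each chunk's k replicas over distinct racks."""
      rack_cycle = cycle(range(len(racks)))
      placement = {}
      for chunk_id in range(num_chunks):
          start = next(rack_cycle)
          placement[chunk_id] = [racks[(start + i) % len(racks)] for i in range(k)]
      return placement

  # Example: a 1 TiB file -> 16384 chunks, each replicated on 3 distinct racks.
  chunks = list(chunk_offsets(2**40))
  print(len(chunks), place_replicas(2, ["rack0", "rack1", "rack2", "rack3"]))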
Map/Reduce • Google researchers introduced the Map/Reduce framework in a paper published in 2004 • A high-level programming model and implementation for large-scale parallel data processing • Apache Hadoop is an open-source variant of Map/Reduce
Map/Reduce Programs • Read and process a lot of data • Map: extract some relevant information from each tuple • Shuffle and sort the intermediate tuples • Reduce: aggregate the information over a bag of tuples; summarize, filter, transform • Write the results
Data Model • A file is a bag of (key, value) pairs, as in key/value stores • A map/reduce program • Input: a bag of (input_key, value) pairs • Output: a bag of (output_key, value) pairs • Input and output may have different keys
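As a concrete rendering of these signatures, here is a minimal Python sketch (the type names are illustrative, not part of any real framework API):

  from typing import Callable, Iterable, TypeVar

  K1, V1, K2, V2, V3 = (TypeVar(n) for n in ("K1", "V1", "K2", "V2", "V3"))

  # map:    one (input_key, value) pair -> a bag of (intermediate_key, value) pairs
  MapFn = Callable[[K1, V1], Iterable[tuple[K2, V2]]]
  # reduce: one (intermediate_key, bag of values) -> a bag of output values
  ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V3]]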
Map Step • User provides the MAP function • Input: (input key, value) • Output: bag of (intermediate key, value) • System applies the map function in parallel to all (input key, value) pairs in the input file.
Reduce Step • User provides the REDUCE function • Input: (intermediate key, bag of values) • Output: bag of output values • System groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function
Example • Counting the number of occurrences of each word in a large collection of documents:

  map(String key, String value) {
    // key: document id
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");
  }

  reduce(String key, Iterator values) {
    // key: a word
    // values: a bag of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
  }
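The same word count as a small runnable Python sketch; the driver loop that simulates the shuffle/sort step is illustrative, since a real framework distributes this work across the cluster:

  from collections import defaultdict

  def map_fn(doc_id, contents):
      # Emit (word, 1) for every word in the document.
      for word in contents.split():
          yield word, 1

  def reduce_fn(word, counts):
      # Sum all partial counts for one word.
      yield word, sum(counts)

  def run_job(documents):
      # Shuffle/sort: group intermediate pairs by key.
      groups = defaultdict(list)
      for doc_id, contents in documents.items():
          for key, value in map_fn(doc_id, contents):
              groups[key].append(value)
      # Reduce each (key, bag of values) group.
      return [out for key in sorted(groups)
                  for out in reduce_fn(key, groups[key])]

  docs = {"d1": "to be or not to be", "d2": "to do is to be"}
  print(run_job(docs))
  # [('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1), ('to', 4)]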
Schedule • [Figure: scheduling of Map and Reduce tasks across workers]
Map/Reduce Phases • [Figure: Map tasks read input from HDFS and write intermediate results to local storage; Reduce tasks read that local data and write final output to HDFS]
Map/Reduce Job versus Task • A Map/Reduce job: one single "query", e.g. count the words in all docs; more complex programs may consist of multiple jobs • A Map (or Reduce) task: a group of instantiations of the map (or reduce) function, scheduled on a single worker
Implementation • Master node: partitions the input file into M splits, by key; assigns workers (= servers) to the M map tasks; keeps track of their progress • Map workers write their output to local disk, partitioned into R regions • Master assigns workers to the R reduce tasks • Reduce workers read the regions from the map workers' local disks
Implementation • Master pings workers periodically; if a worker is down, the master reassigns its task to another worker • Straggler: a server that takes an unusually long time to complete one of the last tasks, e.g. because the cluster scheduler has assigned other tasks to the server, or a bad disk forces frequent correctable errors • Stragglers are a main reason for slowdown • Map/Reduce solution: pre-emptive backup execution of the last few remaining in-progress tasks (see the sketch below)
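A minimal Python sketch of backup execution (the task name and delays are made up; a real scheduler runs the two copies on different servers and kills the losing copy, whereas this toy version simply waits for it to finish on shutdown):

  import time
  from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

  def run_task(task_id, delay):
      time.sleep(delay)            # a straggler has a large delay
      return task_id, delay

  def with_backup(executor, task_id, primary_delay, backup_delay):
      # Launch a second copy of the task and take whichever finishes first.
      futures = {executor.submit(run_task, task_id, primary_delay),
                 executor.submit(run_task, task_id, backup_delay)}
      done, not_done = wait(futures, return_when=FIRST_COMPLETED)
      for f in not_done:
          f.cancel()               # best effort; a running thread cannot be killed
      return next(iter(done)).result()

  with ThreadPoolExecutor(max_workers=4) as ex:
      # The primary copy straggles (5 s); the backup finishes in 0.1 s.
      print(with_backup(ex, "reduce-7", primary_delay=5.0, backup_delay=0.1))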
Tuning • It is very difficult • Choice of M and R: larger is better for load balancing • Limitation: the master needs O(M × R) memory • Typical choice: M = number of input chunks; R much smaller, rule of thumb R = 1.5 × number of servers • Over 100 other parameters (partition function, sort factor, …); around 50 of them affect running time • Active research area
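A hedged Python sketch of these rules of thumb (the constants come from the slides; the input size and cluster size are made-up example values):

  CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks, as in the DFS slide

  def choose_m_r(input_bytes: int, num_servers: int):
      m = -(-input_bytes // CHUNK_SIZE)  # ceiling: one map task per chunk
      r = int(1.5 * num_servers)         # rule of thumb from the slide
      return m, r

  # Example: 1 TiB of input on a 200-server cluster.
  m, r = choose_m_r(2**40, 200)
  print(m, r)   # 16384 map tasks, 300 reduce tasks
  # The master tracks O(M x R) state: here 16384 * 300, about 4.9M entries.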
Map/Reduce Discussion • Advantage: hides scheduling and parallelization details • Disadvantage: very limited queries • Difficult to write more complex tasks: thousands of lines of code, debugging is not easy • Usually need multiple map/reduce operations • Solution: declarative query languages • PIG Latin (Yahoo!) • HiveQL (Facebook) • …
Declarative Languages over Map/Reduce • PIG Latin (Yahoo!): new language, similar to relational algebra; open source • HiveQL (Facebook): SQL-like language; open source • Big-Query (Google): SQL on Map/Reduce; proprietary
SQL Operations • Map ≈ Group By; Reduce ≈ Aggregate • How to compute the join of R(A,B) and S(B,C): • Map: group R by R.B and S by S.B • Input: a tuple R(a,b) or a tuple S(b,c) • Output: (b, R(a,b)) or (b, S(b,c)) • Reduce: • Input: (b, {R(a1,b), R(a2,b), ..., S(b,c1), S(b,c2), ...}) • Output: {R(a1,b), R(a2,b), ...} × {S(b,c1), S(b,c2), ...} • This relies on the MR framework for partitioning; we can do better (covered in the next course) • See the sketch below
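A runnable Python sketch of this reduce-side join (the function names and the in-memory driver are illustrative; a real job runs distributed on a cluster):

  from collections import defaultdict

  # Map tags each tuple with its relation and keys it by B.
  def map_fn(relation, tup):
      if relation == "R":
          a, b = tup
          yield b, ("R", a)
      else:                      # relation == "S"
          b, c = tup
          yield b, ("S", c)

  # Reduce crosses the R-side and S-side bags for one join key b.
  def reduce_fn(b, tagged_values):
      r_side = [a for tag, a in tagged_values if tag == "R"]
      s_side = [c for tag, c in tagged_values if tag == "S"]
      for a in r_side:
          for c in s_side:
              yield (a, b, c)

  def join(R, S):
      groups = defaultdict(list)    # simulated shuffle: group by key b
      for rel, tuples in (("R", R), ("S", S)):
          for t in tuples:
              for key, value in map_fn(rel, t):
                  groups[key].append(value)
      return [out for b, vals in groups.items() for out in reduce_fn(b, vals)]

  R = [(1, "x"), (2, "x"), (3, "y")]
  S = [("x", 10), ("y", 20), ("y", 30)]
  print(join(R, S))
  # [(1, 'x', 10), (2, 'x', 10), (3, 'y', 20), (3, 'y', 30)]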
More Operations • This simple algorithm relies on the MR framework for partitioning • The framework does not know it is computing a join, so we can do better • Other algorithms: multi-way joins; computational algorithms such as matrix multiplication • Covered in the next course: big data analytics
Parallel DBMS versus Map/Reduce • Parallel DBMS: • Relational data model and schema • Declarative query language: SQL • Many pre-defined operators; can easily combine operators into complex queries • Query optimization, indexing, and physical tuning • Pipelines data from one operator to the next • Does more than just running queries: updates and transactions, constraints, security, …
Parallel DBMS versus Map/Reduce • Map/Reduce: • Data model is a file of (key, value) pairs • No need to transform and load data • Easy to write user-defined operators • Can easily add more nodes to the cluster • Intra-query fault tolerance, because intermediate results are stored on disk • Handles problems such as stragglers • More scalable, but needs more nodes
Lessons • Usability, usability, usability! • The main cause of Map/Reduce's popularity: it is easy for developers to use • There is still a lot of room for improvement • Sometimes we have to re-build a framework from scratch: Map/Reduce's designers did not extend parallel databases