
MapReduce


Presentation Transcript


  1. MapReduce • Computer Engineering Department, Distributed Systems Course • Asst. Prof. Dr. Ahmet Sayar • Kocaeli University - Fall 2013

  2. What does Scalable Mean? • Operationally • In the past: works even if data does not fit in main memory • Now: can make use of 1000s of cheap computers • Algorithmically • In the past: if you have N data items, you must do no more than N^m operations, for some constant m – polynomial-time algorithms • Now: if you have N data items, you should do no more than N^m / k operations, for some large k • Polynomial-time algorithms must be parallelized • Soon: if you have N data items, you should do no more than N log N operations

  3. Example: Find matching DNA Sequences • Given a set of sequences • Find all sequences equal to “GATTACGATTATTA”

  4. Sequential (Linear) search • Time = 0 • Match? NO

  5. Sequential (Linear) search • 40 Records, 40 comparisons • N Records, N comparisons • The algorithmic complexity is order N: O(N)

  6. What if the Sequences Are Sorted? • GATATTTTAAGC < GATTACGATTATTA • No match – keep searching in the other half… • Binary search: O(log N)
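A minimal sketch (not from the slides) contrasting the two searches in Python: a linear scan compares against every record, while `bisect` over a sorted copy needs only O(log N) comparisons. The toy sequence values are made up for illustration.

```python
import bisect

TARGET = "GATTACGATTATTA"
records = ["GATATTTTAAGC", "GATTACGATTATTA", "TTTACGATTA"]  # toy data

# Linear search: compare against every record -> O(N) comparisons.
linear_hits = [i for i, seq in enumerate(records) if seq == TARGET]

# Binary search on a sorted copy -> O(log N) comparisons per lookup.
sorted_records = sorted(records)
pos = bisect.bisect_left(sorted_records, TARGET)
found = pos < len(sorted_records) and sorted_records[pos] == TARGET

print(linear_hits, found)
```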

  7. New Task: Read Trimming • Given a set of DNA sequences • Trim the final n bps of each sequence • Generate a new dataset • Can we use an index? • No, we have to touch every record no matter what • O(N) • Can we do any better?
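There is no index to exploit here: every read must be rewritten. A minimal sketch of the trimming step, assuming reads are plain strings and n is the number of base pairs to drop from the end.

```python
def trim_read(read: str, n: int) -> str:
    """Drop the final n base pairs of one read."""
    return read[:-n] if n > 0 else read

reads = ["GATTACGATTATTA", "GATATTTTAAGC"]    # toy input dataset
trimmed = [trim_read(r, n=3) for r in reads]  # touches every record: O(N)
print(trimmed)
```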

  8. Parallelization O(?)
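The answer the slide is driving at: the per-record work is independent, so k workers bring the wall-clock cost down to roughly O(N/k). A hedged sketch using Python's multiprocessing pool; the data and the trim function are made up for illustration.

```python
from multiprocessing import Pool
from functools import partial

def trim_read(read: str, n: int) -> str:
    """Drop the final n base pairs of one read."""
    return read[:-n] if n > 0 else read

if __name__ == "__main__":
    reads = ["GATTACGATTATTA", "GATATTTTAAGC", "TTTACGATTA"]  # toy data
    with Pool(processes=4) as pool:                           # k = 4 workers
        trimmed = pool.map(partial(trim_read, n=3), reads)    # ~O(N/k) wall clock
    print(trimmed)
```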

  9. New Task: Convert 405K TIFF Images to PNG

  10. Another Example: Computing Word Frequency of Every Word in a Single document

  11. There is a pattern here … • A function that maps a read to a trimmed read • A function that maps a TIFF image to a PNG image • A function that maps a document to its most common word • A function that maps a document to a histogram of word frequencies.
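All of these tasks share one shape: apply some per-record function independently to every record. A minimal sketch of that shared skeleton; the per-record functions below are simplified stand-ins for the ones named above.

```python
def map_over_records(records, fn):
    """The shared skeleton: apply fn independently to every record."""
    return [fn(r) for r in records]

# Simplified stand-ins for the per-record functions named on the slide.
trim        = lambda read: read[:-3]                  # read -> trimmed read
word_counts = lambda doc: {w: doc.split().count(w)    # doc  -> word-frequency histogram
                           for w in set(doc.split())}

print(map_over_records(["GATTACGATTATTA"], trim))
print(map_over_records(["to be or not to be"], word_counts))
```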

  12. Compute Word Frequency Across all Documents

  13. (word, count)

  14. MAP REDUCE • How to split things into pieces • How to write map and reduce

  15. Map Reduce • Programming model • Google: paper published in 2004 (OSDI) • Free variant: Hadoop – Java – Apache • MapReduce: a high-level programming model and implementation for large-scale data processing.

  16. Example: Upper-case Mapper in ML
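The original slide shows this mapper in ML; here is a Python analogue under the course's (key, value) model. Each input pair maps to exactly one output pair whose value is upper-cased; the pair format is an assumption.

```python
def uppercase_mapper(pair):
    """(key, value) -> [(key, VALUE)]: exactly one output pair per input pair."""
    key, value = pair
    return [(key, value.upper())]

print(uppercase_mapper(("doc1", "hello mapreduce")))
# [('doc1', 'HELLO MAPREDUCE')]
```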

  17. Example: Explode Mapper
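An explode mapper fans one input pair out into many: here, one output pair per character of the value (a Python sketch, assuming string values, in place of the slide's ML code).

```python
def explode_mapper(pair):
    """(key, "abc") -> [(key, 'a'), (key, 'b'), (key, 'c')]"""
    key, value = pair
    return [(key, ch) for ch in value]

print(explode_mapper(("doc1", "abc")))
```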

  18. Example: Filter Mapper
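A filter mapper emits the input pair unchanged when a predicate holds and emits nothing otherwise, so the output can be smaller than the input. A sketch with a made-up predicate (keep values of even length).

```python
def filter_mapper(pair, keep=lambda v: len(v) % 2 == 0):
    """Emit the pair only if keep(value) is true; otherwise emit nothing."""
    key, value = pair
    return [(key, value)] if keep(value) else []

print(filter_mapper(("doc1", "noon")))   # kept
print(filter_mapper(("doc2", "cat")))    # dropped -> []
```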

  19. Example: Chaining Keyspaces • Output key is int

  20. Data Model • Files • A File = a bag of (key, value) pairs • A map-reduce program: • Input: a bag of (inputkey, value) pairs • Output: a bag of (outputkey, value) pairs

  21. Step 1: Map Phase • User provides the Map function: • Input: (input key, value) • Output: bag of (intermediate key, value) • System applies the map function in parallel to all (input key, value) pairs in the input file

  22. Step 2: Reduce Phase • User provides the Reduce function • Input: (intermediate key, bag of values) • Output: bag of output values • The system groups all pairs with the same intermediate key and passes the bag of values to the reduce function

  23. Reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • Reduce() combines those intermediate values into one or more final values for that same output key • (in practice, usually only one final value per key)
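Putting slides 21–23 together, here is a toy, single-machine version of the whole pipeline to make the data flow concrete. map_fn and reduce_fn follow the signatures described above; the shuffle is just an in-memory group-by-key, whereas the real system does this in parallel across machines with fault tolerance.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy, single-machine MapReduce: map, group by key, reduce."""
    # Map phase: apply map_fn to every (input key, value) pair.
    intermediate = []
    for pair in inputs:
        intermediate.extend(map_fn(pair))

    # Shuffle: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: combine each key's bag of values into final pairs.
    outputs = []
    for key, values in groups.items():
        outputs.extend(reduce_fn(key, values))
    return outputs
```

The later examples (word count, the word-length histogram, the inverted index, joins) are just different map_fn/reduce_fn pairs plugged into this skeleton.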

  24. Example: Sum Reducer
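A sum reducer simply adds up the bag of values for a key; a minimal sketch matching the reduce signature above.

```python
def sum_reducer(key, values):
    """(key, [v1, v2, ...]) -> [(key, v1 + v2 + ...)]"""
    return [(key, sum(values))]

print(sum_reducer("pancakes", [1, 1, 1]))   # [('pancakes', 3)]
```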

  25. In summary • Input and output: each a set of key/value pairs • Programmer specifies two functions • Map(in_key, in_value) -> list(out_key, intermediate_value) • Processes an input key/value pair • Produces a set of intermediate pairs • Reduce(out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one) • Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

  26. Example: What does this do? • Word count application of map reduce
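What the slide's code does: word count. Each mapper emits (word, 1) for every word in its document, and the reducer sums the ones per word. A self-contained Python sketch of the two functions plus a tiny in-memory driver (an assumption for illustration, not Hadoop code).

```python
from collections import defaultdict

def wc_map(pair):                       # (doc_id, text) -> [(word, 1), ...]
    _, text = pair
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):            # (word, [1, 1, ...]) -> [(word, total)]
    return [(word, sum(counts))]

docs = [("d1", "to be or not to be"), ("d2", "to do")]
groups = defaultdict(list)
for doc in docs:
    for word, one in wc_map(doc):       # map phase
        groups[word].append(one)        # shuffle: group by word
result = [out for w, cs in groups.items() for out in wc_reduce(w, cs)]
print(result)                           # e.g. [('to', 3), ('be', 2), ...]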

  27. Example: Word Length Histogram

  28. Example: Word Length Histogram • Big = Yellow = 10+letters • Medium = Red = 5..9 letters • Small = Blue = 2..4 letters • Tiny = Pink = 1 letter
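Same pattern, with the word's length bucket as the intermediate key. The bucket boundaries follow the slide (1 letter = tiny, 2..4 = small, 5..9 = medium, 10+ = big); the function names are made up, and a sum reducer like the one above finishes the job.

```python
def bucket(word):
    n = len(word)
    if n >= 10: return "big"      # 10+ letters
    if n >= 5:  return "medium"   # 5..9 letters
    if n >= 2:  return "small"    # 2..4 letters
    return "tiny"                 # 1 letter

def hist_map(pair):               # (doc_id, text) -> [(bucket, 1), ...]
    _, text = pair
    return [(bucket(w), 1) for w in text.split()]

def hist_reduce(bucket_name, counts):
    return [(bucket_name, sum(counts))]
```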

  29. More Examples: Building an Inverted Index • Input • Tweet1, (“I love pancakes for breakfast”) • Tweet2, (“I dislike pancakes”) • Tweet3, (“what should I eat for breakfast”) • Tweet4, (“I love to eat”) • Desired output • “pancakes”, (tweet1, tweet2) • “breakfast”, (tweet1, tweet3) • “eat”, (tweet3, tweet4) • “love”, (tweet1, tweet4)
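For the inverted index, the mapper inverts each (tweet_id, text) pair into (word, tweet_id) pairs and the reducer collects the tweet ids per word; the shuffle's group-by-key does most of the work. A sketch over the tweets above, with a tiny in-memory driver standing in for the shuffle.

```python
from collections import defaultdict

def index_map(pair):                    # (tweet_id, text) -> [(word, tweet_id), ...]
    tweet_id, text = pair
    return [(word.lower(), tweet_id) for word in text.split()]

def index_reduce(word, tweet_ids):      # (word, [ids]) -> [(word, sorted unique ids)]
    return [(word, sorted(set(tweet_ids)))]

tweets = [("tweet1", "I love pancakes for breakfast"),
          ("tweet2", "I dislike pancakes"),
          ("tweet3", "what should I eat for breakfast"),
          ("tweet4", "I love to eat")]
groups = defaultdict(list)
for t in tweets:
    for word, tid in index_map(t):
        groups[word].append(tid)
index = dict(out for w, ids in groups.items() for out in index_reduce(w, ids))
print(index["pancakes"], index["breakfast"], index["love"])
```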

  30. More Examples: Relational Joins

  31. Relational Join MapReduce: Before Map Phase

  32. Relational Join MapReduce: Map Phase

  33. Relational Join MapReduce: Reduce Phase

  34. Relational Join in MapReduce, Another Example • (figure: MAP and REDUCE phases)
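The standard reduce-side join: mappers over both relations emit (join key, (relation name, tuple)), the shuffle brings every tuple with the same key to one reducer, and the reducer pairs up rows from the two sides. A hedged sketch joining two toy relations (made up here) on dept_id.

```python
from collections import defaultdict
from itertools import product

def join_map(relation_name, row, key_column):
    """Tag each row with its relation and emit it under the join key."""
    return [(row[key_column], (relation_name, row))]

def join_reduce(key, tagged_rows):
    """Pair every Employee row with every Department row sharing the key."""
    left  = [r for name, r in tagged_rows if name == "Employee"]
    right = [r for name, r in tagged_rows if name == "Department"]
    return [(key, (l, r)) for l, r in product(left, right)]

employees   = [{"name": "Sue", "dept_id": 1}, {"name": "Bob", "dept_id": 2}]
departments = [{"dept_id": 1, "dept": "CS"}, {"dept_id": 2, "dept": "EE"}]

groups = defaultdict(list)
for row in employees:
    for k, v in join_map("Employee", row, "dept_id"):
        groups[k].append(v)
for row in departments:
    for k, v in join_map("Department", row, "dept_id"):
        groups[k].append(v)
joined = [out for k, rows in groups.items() for out in join_reduce(k, rows)]
print(joined)
```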

  35. Simple Social Network Analysis: Count Friends • (figure: MAP and SHUFFLE phases)
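Counting friends is word count over an edge list: for each friendship edge the mapper emits (person, 1) for both endpoints (assuming an undirected friend graph), and a sum reducer gives each person's friend count. A small sketch with made-up edges.

```python
from collections import defaultdict

def friends_map(edge):               # ("Ann", "Bob") -> [("Ann", 1), ("Bob", 1)]
    a, b = edge
    return [(a, 1), (b, 1)]

def friends_reduce(person, ones):
    return [(person, sum(ones))]

edges = [("Ann", "Bob"), ("Ann", "Cem"), ("Bob", "Cem"), ("Cem", "Dua")]
groups = defaultdict(list)
for e in edges:
    for person, one in friends_map(e):
        groups[person].append(one)
print([out for p, ones in groups.items() for out in friends_reduce(p, ones)])
# e.g. [('Ann', 2), ('Bob', 2), ('Cem', 3), ('Dua', 1)]
```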

  36. Taxonomy of Parallel Architectures

  37. Cluster Computing • Large number of commodity servers, connected by a high-speed commodity network • A rack holds a small number of servers • A data center holds many racks • Massive parallelism • 100s or 1000s of servers • Many hours • Failure • If the mean time between failures of a single server is 1 year (≈ 8,760 hours) • Then a cluster of 1,000 servers sees about 1,000 / 8,760 ≈ 0.11 failures per hour – roughly one failure every 9 hours, about 3 per day

  38. Distributed File System (DFS) • For very large files: TBs, PBs • Each file is partitioned into chunks, typically 64MB • Each chunk is replicated several times (>2), on different racks, for fault tolerance • Implementations: • Google’s DFS: GFS, proprietary • Hadoop’s DFS: HDFS, open source
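Back-of-the-envelope arithmetic for these numbers (a sketch, not from the slides): a 1 TB file split into 64 MB chunks, each stored 3 times.

```python
TB, MB = 1024**4, 1024**2

file_size   = 1 * TB
chunk_size  = 64 * MB
replication = 3

chunks   = file_size // chunk_size          # 16,384 chunks to track and place
raw_used = file_size * replication / TB     # 3.0 TB of raw disk consumed

print(chunks, raw_used)
```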

  39. HDFS: Motivation • Based on Google’s GFS • Redundant storage of massive amounts of data on cheap and unreliable computers • Why not use an existing file system? • Different workload and design priorities • Handles much bigger dataset sizes than other file systems

  40. Assumptions • High component failure rates • Inexpensive commodity components fail all the time • Modest number of HUGE files • Just a few million • Each is 100MB or larger; multi-GB files typical • Files are write-once, mostly appended to • Perhaps concurrently • Large streaming reads • High sustained throughput favored over low latency

  41. HDFS Design Decisions • Files are stored as blocks • Much larger size than in most filesystems (default is 64MB) • Reliability through replication • Each block replicated across 3+ DataNodes • Single master (NameNode) coordinates access and metadata • Simple centralized management • No data caching • Little benefit due to large data sets, streaming reads • Familiar interface, but customized API • Simplify the problem; focus on distributed apps

  42. Based on GFS Architecture

  43. References • https://class.coursera.org/datasci-001/lecture • https://www.youtube.com/watch?v=xWgdny19yQ4
