
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters


Presentation Transcript


  1. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Presented by: Mohammed Ali Alawi Shehab

  2. Outline • Introduction • Heterogeneous & Homogeneous Databases • Map-Reduce • Map-Reduce-Merge • Optimizations • Enhancements • Conclusions

  3. Introduction • Search engines process and manage vast amounts of data collected from the WWW. • To keep DBMS costs down, they usually build large clusters of shared-nothing commodity nodes. • Example: the Google File System (GFS). • Hadoop is an open-source implementation. • Google, Yahoo!, Facebook, Amazon, and others are users.

  4. Heterogeneous & Homogeneous Databases • Homogeneous: the nodes use the same technology at each location. • Heterogeneous: the nodes may use different and incompatible technologies at each location. • Technology examples: operating system, data structures, database application.

  5. Map-Reduce • Map: a function that processes input key/value pairs and produces a list of intermediate key/value pairs. • Reduce: a function that merges all intermediate pairs associated with the same key and then generates the outputs. • The Map-Reduce framework is best at handling homogeneous datasets. • Processing multiple heterogeneous datasets does not quite fit into the Map-Reduce framework.
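A minimal sketch of the two user-supplied functions, using word counting as the classic illustration (the function and variable names are assumptions for illustration, not code from the slides or the paper):

    # Word-count sketch of the Map-Reduce programming model. In a real
    # framework (e.g. Hadoop) splitting, shuffling, and sorting by key are
    # handled by the runtime; here a tiny driver stands in for it.
    from collections import defaultdict

    def map_fn(doc_id, text):
        # Emit an intermediate (word, 1) pair for every word in the document.
        return [(word, 1) for word in text.split()]

    def reduce_fn(word, counts):
        # Merge all intermediate values that share the same key.
        return (word, sum(counts))

    def run_map_reduce(documents):
        groups = defaultdict(list)          # stands in for the shuffle phase
        for doc_id, text in documents.items():
            for key, value in map_fn(doc_id, text):
                groups[key].append(value)
        return [reduce_fn(key, values) for key, values in groups.items()]

    print(run_map_reduce({1: "map reduce merge", 2: "map reduce"}))
    # [('map', 2), ('reduce', 2), ('merge', 1)]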

  6. Map-Reduce (cont.) • A search engine stores: • crawled URLs in a crawler database, • inverted indexes in an index database, • click or execution logs in log databases. • These databases are huge and distributed over a large cluster of nodes.

  7. www.google.jo

  8. www.google.com.sa

  9. Map-Reduce Diagram

  10. Map-Reduce Features and Principles • Low cost: runs on commodity nodes instead of high-performance symmetric multiprocessing (SMP) machines. • Scalable RAIN cluster. • Fault-tolerant yet easy to administer: data is replicated and backup tasks are launched; new nodes can be plugged in at any time. • High throughput.

  11. Map-Reduce Features and Principles (cont.) • Shared-disk storage yet shared-nothing computing: Map and Reduce tasks share an integrated GFS that makes thousands of disks behave like one. • Distributed partitioning/sorting framework: a partition function distributes the mapper outputs to the reducers by key.
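A minimal sketch of such a partition function (an illustrative assumption; Hadoop's default HashPartitioner does the analogous thing with the key's hashCode):

    import hashlib

    def partition(key, num_reducers):
        # Stable hash so every mapper routes pairs with the same key to the
        # same reducer, which then sees all values for the keys it owns.
        digest = hashlib.md5(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % num_reducers

    # With 4 reducers, every ("url-123", ...) pair lands on the same reducer.
    print(partition("url-123", 4))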

  12. Map-Reduce-Merge • The Map and Reduce functions are left unchanged. • Merge: a new phase that joins the reduced outputs. • This makes it more efficient and easier to process data relationships among heterogeneous datasets.
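A minimal sketch of a merge step that joins the reduced outputs of two lineages on a shared key (the dictionaries, field names, and bonus semantics are illustrative assumptions, loosely in the spirit of the paper's employee/department example):

    def merge(reduced_a, reduced_b):
        # reduced_a: {dept_id: total_bonus} -- output of the first lineage's reducers
        # reduced_b: {dept_id: adjustment}  -- output of the second lineage's reducers
        merged = []
        for dept_id, total_bonus in reduced_a.items():
            if dept_id in reduced_b:                     # join on the shared key
                merged.append((dept_id, total_bonus * reduced_b[dept_id]))
        return merged

    # The two inputs come from two independent Map-Reduce runs over
    # heterogeneous datasets; merge relates them without re-mapping.
    print(merge({"D1": 100.0, "D2": 50.0}, {"D1": 0.5, "D2": 2.0}))
    # [('D1', 50.0), ('D2', 100.0)]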

  13. Map-Reduce-Merge Diagram

  14. Map-Reduce vs. Map-Reduce-Merge
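For reference, a sketch of the two models' signatures in the spirit of the paper's formulation (α, β, γ denote dataset lineages):

    Map-Reduce
        map:    (k1, v1)                    -> [(k2, v2)]
        reduce: (k2, [v2])                  -> [v3]

    Map-Reduce-Merge
        map:    (k1, v1)α                   -> [(k2, v2)]α
        reduce: (k2, [v2])α                 -> (k2, [v3])α
        merge:  ((k2, [v3])α, (k3, [v4])β)  -> [(k4, v5)]γ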

  15. Example: Mapping, Reducing, Merging

  16. Optimizations • Optimal Reduce-Merge Connections: • The remote reads between mappers and reducers number R · (M_A + M_B), where M_A and M_B are the mapper counts for datasets A and B and R is the reducer count, assuming the two datasets are the same size. • If the datasets on A and B are not the same size, the remote reads are R + R. • The remote reads between reducers and mergers are 2R, since each merger reads from only one reducer per dataset.
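As a rough illustration of these counts (numbers chosen only for the example): with M_A = M_B = 100 mappers and R = 10 reducers per dataset, the mapper-to-reducer phase issues 10 · (100 + 100) = 2,000 remote reads, while the reducer-to-merger phase issues only 2 · 10 = 20.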

  17. Enhancements • A library of reusable Merge-phase types. • A configurable workflow.

  18. Conclusions • Map-Reduce and GFS represent a rethinking of data processing. • This “simplified” philosophy drives down hardware and software costs. • Map-Reduce does not directly support joins of heterogeneous datasets, so a Merge phase is added.

  19. Thanks for listening
