
Joins in MapReduce


Presentation Transcript


  1. Joins in MapReduce. Shamik Bose, Department of Computer Science, Florida State University

  2. Motivation
     • MapReduce is a widely used framework for big data analytics
     • Legacy systems perform poorly on very large amounts of data
     • Joins are computationally expensive, yet unavoidable in applications

  3. Types of joins
     • Reduce-side join
     • Map-side join
     • Broadcast join
     • Fuzzy join

  4. Reduce-side join: map operation
     • The join takes place on the reduce side
     • Map is used to pre-process the data
     • Reads one tuple at a time
     • The join key (column) is the key emitted by the map function
     • The rest of the tuple is the value
     • The key and value are both tagged with the name of the parent dataset
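The map step described above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code; the function name `tag_map`, the tags `"D0"`/`"D1"`, and the sample tuples are assumptions made for the example.

```python
from typing import Iterator, Tuple

def tag_map(dataset_tag: str, record: tuple, key_index: int = 0
            ) -> Iterator[Tuple[Tuple[object, str], Tuple[str, tuple]]]:
    """Emit ((join_key, tag), (tag, rest_of_tuple)) for one input tuple.

    Hypothetical record layout: the join column sits at `key_index`.
    """
    join_key = record[key_index]
    rest = record[:key_index] + record[key_index + 1:]
    # Both the key and the value carry the tag of the parent dataset.
    yield (join_key, dataset_tag), (dataset_tag, rest)

users = [(1, "alice"), (2, "bob")]               # dataset 0 (illustrative)
orders = [(1, "book"), (1, "pen"), (3, "ink")]   # dataset 1 (illustrative)

mapped = [kv for r in users for kv in tag_map("D0", r)]
mapped += [kv for r in orders for kv in tag_map("D1", r)]
```

Each tuple is read once and turned into a single tagged key/value pair; no joining happens at this stage.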

  5. [Figure: tuples of Dataset 0 and Dataset 1, each labeled with the tag for its dataset]

  6. Reduce-side join: partitioning and grouping
     • The default partitioner is overridden
     • Partitioning is done only on the key, not on the tag
     • Tuples with the same key go to the same reducer
     • The grouping function is also overridden
     • Ensures that keys differing only in their tags are treated as the same key by the reducer
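A rough Python sketch of partitioning and grouping on the join key alone is given below. The names `partition` and `group_for_reducer` are hypothetical, and sorting by `(key, tag)` before grouping is an assumption of this sketch (a common secondary-sort arrangement), not something the slides specify.

```python
import zlib
from itertools import groupby

def partition(composite_key, num_reducers):
    """Choose a reducer from the join key only; the tag is ignored."""
    join_key, _tag = composite_key
    # crc32 is used because it is stable across runs, unlike the built-in hash() of str.
    return zlib.crc32(repr(join_key).encode()) % num_reducers

def group_for_reducer(pairs):
    """Sort by (key, tag), an assumption of this sketch, but group on the key alone."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for join_key, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield join_key, [value for _composite, value in group]

pairs = [
    ((1, "D1"), ("D1", ("book",))),
    ((1, "D0"), ("D0", ("alice",))),
    ((2, "D0"), ("D0", ("bob",))),
]
groups = dict(group_for_reducer(pairs))
```

Because the tag never enters the hash, `(1, "D0")` and `(1, "D1")` always land on the same reducer and end up in one key group.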

  8. Reduce-side join: reduce operation
     • The reducer invokes the reduce() function for each key group
     • Tuples of the dataset that arrives first are buffered
     • Each reduce function joins the values from the buffer with the data from the stream
     • The tag values are required to make sure that values from the same dataset are not joined with each other
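The buffering behaviour of the reduce step might look like the following Python sketch. It assumes that all values of the first dataset reach the reducer before any value of the second one (for example because the values are ordered by tag); `reduce_join` and the sample group are illustrative names and data.

```python
def reduce_join(join_key, tagged_values):
    """Join the values of one key group.

    `tagged_values` is an iterable of (tag, rest_of_tuple) pairs. The dataset
    whose tag is seen first is buffered; later values are streamed against it.
    Assumption: values of the first dataset all arrive before the others.
    """
    buffer = []
    first_tag = None
    for tag, rest in tagged_values:
        if first_tag is None:
            first_tag = tag
        if tag == first_tag:
            # Same dataset as the buffer: store it, never join it with itself.
            buffer.append(rest)
        else:
            for buffered in buffer:
                yield (join_key,) + buffered + rest

group = [("D0", ("alice",)), ("D1", ("book",)), ("D1", ("pen",))]
result = list(reduce_join(1, group))
```

The tag comparison is what prevents two tuples of the same dataset from being paired with each other.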

  9. Reduce-side join for multi-way joins
     • Can be carried out in two ways
        • One-shot join
        • Cascade join
     • One-shot join is similar to the reduce-side join, with a few changes
        • The list of tables is passed as an argument to the job
        • The map and grouping phases are similar
        • The reducer dynamically creates buffers to hold all but the last dataset
        • To reduce the chance of memory overflow, buffers are periodically written to disk
     • Cascade join
        • Iterative version of the reduce-side join
        • Requires setting up multiple jobs, one for each two-way join
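The cascade variant can be illustrated with a short Python sketch in which each call to `two_way_join` stands in for one complete two-way join job. Both function names, the join-on-first-column convention, and the sample tables are assumptions for the example.

```python
from collections import defaultdict
from functools import reduce

def two_way_join(left, right):
    """Stand-in for one two-way join job; joins on the first column."""
    by_key = defaultdict(list)
    for row in left:
        by_key[row[0]].append(row)
    return [l + r[1:] for r in right for l in by_key.get(r[0], [])]

def cascade_join(tables):
    """Join the tables pairwise, feeding each result into the next job."""
    return reduce(two_way_join, tables)

a = [(1, "alice"), (2, "bob")]
b = [(1, "book"), (2, "pen")]
c = [(1, "paid"), (3, "open")]
joined = cascade_join([a, b, c])
```

Joining n tables this way takes n - 1 passes, and every intermediate result has to be materialized before the next pass starts.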

  10. Map-side join: overview
     • Alternative to the reduce-side join
     • Requirements on the datasets
        • Datasets must be sorted with the same comparator
        • Datasets must be partitioned using the same partitioner
        • The number of partitions must be the same in all datasets

  11. Map-side join: operation
     • All constraints can be satisfied by simple Hadoop jobs
     • Datasets are passed through IdentityMapper() and IdentityReducer()
        • This does not pre-process the data
        • It only ensures that the data conforms to the constraints
     • The join takes place as follows
        • Each mapper considers one dataset partition
        • The corresponding partition from the other dataset is scanned
        • If matches for the join key are found, they are joined
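One way to picture the per-partition join is a merge join over two sorted partitions, sketched below in Python. The merge strategy, the function names, and the partitioning by `key % 2` in the sample data are assumptions of this sketch; the slides only state that the corresponding partition is scanned for matches.

```python
def merge_join(left, right):
    """Join two partitions that are sorted on the join key (first column)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r[1:])
                i += 1
            j = j_end
    return out

def map_side_join(left_parts, right_parts):
    """Partition p of one dataset is matched only with partition p of the other."""
    assert len(left_parts) == len(right_parts), "partition counts must match"
    return [row for lp, rp in zip(left_parts, right_parts)
            for row in merge_join(lp, rp)]

# Both datasets partitioned by key % 2 and sorted within each partition.
left_parts = [[(2, "bob"), (4, "dan")], [(1, "alice"), (3, "carol")]]
right_parts = [[(2, "pen"), (2, "ink")], [(1, "book"), (5, "cup")]]
joined = map_side_join(left_parts, right_parts)
```

The sketch only works because both inputs satisfy the three constraints: same sort order, same partitioner, same number of partitions.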

  12. Broadcast join: overview
     • One of the datasets should be small enough to fit in main memory
        • e.g. a list of items by a manufacturer (S) joined against sales records (R)
        • S << R
     • Also called an in-memory join
     • The overhead of transferring data from mappers to reducers can be avoided
     • The small dataset is replicated on every machine

  13. Broadcast join: operation
     • The Hadoop options -files or -archives are used to send the small dataset to each machine when the job is invoked
     • The map() function is called for each tuple from the larger dataset
     • Each <key,value> pair is matched against the smaller dataset
     • If matches are found, the joined tuples are emitted as output
     • A further optimization is to store the local (small) dataset in a hash table
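The hash-table optimization can be sketched in Python as follows. This is a single-process illustration under the assumption that the small dataset S is already available locally; `build_hash_table`, `broadcast_map`, and the sample records are hypothetical.

```python
from collections import defaultdict

def build_hash_table(small_dataset):
    """Index the small dataset S by join key, once per mapper."""
    table = defaultdict(list)
    for key, value in small_dataset:
        table[key].append(value)
    return table

def broadcast_map(record, table):
    """Called for each tuple of the large dataset R; probes the in-memory table."""
    key, value = record
    for small_value in table.get(key, ()):
        yield key, (small_value, value)

S = [("m1", "widget"), ("m2", "gadget")]             # small dataset (illustrative)
R = [("m1", 10), ("m3", 5), ("m1", 7), ("m2", 2)]    # large dataset (illustrative)

table = build_hash_table(S)
joined = [out for rec in R for out in broadcast_map(rec, table)]
```

Each probe is a constant-time lookup on average, so R is streamed through once and no data has to be shuffled to reducers.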

  14. Map-side vs. reduce-side vs. broadcast [comparison table not preserved in the transcript]

  15. Fuzzy joins
     • Most joins in relational databases are equi-joins
     • For unclustered data, similarity joins are also necessary
     • All elements from a dataset that are within a similarity threshold are returned
     • Quite a few predicates are available
        • Hamming distance
        • Edit distance
        • Jaccard distance
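Two of the listed predicates, edit distance and Jaccard distance, can be written in a few lines of Python. The definitions below are the standard ones (Levenshtein distance and one minus the Jaccard similarity of two sets); treating two empty sets as distance 0 is a convention chosen for this sketch.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

A similarity join would then keep every pair whose distance under the chosen predicate is at most the threshold.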

  16. Hamming distance
     • For a Hamming distance algorithm, the problem statement is as follows: given a set $S$ of $b$-bit strings and a threshold $d$, find the set $\{(s_1, s_2) \mid s_1, s_2 \in S,\ HD(s_1, s_2) \le d\}$
     • $HD(s_1, s_2)$ is the number of positions at which the two strings differ
     • E.g. ‘cat’ and ‘cot’ have a Hamming distance of 1
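The problem statement above can be made concrete with a brute-force Python sketch. It compares every pair on a single machine and is only meant to pin down the definition; it is not one of the MapReduce algorithms from the cited papers, and `fuzzy_self_join` is a hypothetical name.

```python
from itertools import combinations

def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def fuzzy_self_join(strings, d):
    """All unordered pairs within Hamming distance d (naive, quadratic)."""
    return [(a, b) for a, b in combinations(strings, 2)
            if hamming_distance(a, b) <= d]

S = ["0000", "0001", "0111", "1111"]   # illustrative 4-bit strings
pairs = fuzzy_self_join(S, 1)
```

The quadratic number of comparisons is what makes a distributed formulation attractive for large sets.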

  17. Applications of fuzzy joins
     • Recommendation engines
     • Collaborative filtering
     • Clustering algorithms

  18. References
     • Jairam Chandar, Join Algorithms using Map-Reduce, University of Edinburgh
     • Foto N. Afrati, Jeffrey D. Ullman, Optimizing Joins in a Map-Reduce Environment, EDBT 2010
     • Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya Parameswaran, Jeffrey D. Ullman, Fuzzy Joins Using MapReduce

  19. Questions?
