Incremental Recomputations in MapReduce

Thomas Jörg, University of Kaiserslautern

Presentation Transcript


  1. Incremental Recomputations in MapReduce. Thomas Jörg, University of Kaiserslautern

  2. Motivation (diagram): MapReduce Program; Base data; Result data; Bigtable / HBase

  3. Motivation (diagram): View Definition; Base data; Materialized view

  4. Motivation (diagram): incremental MapReduce Program; MapReduce Program; Base data; Result data; Bigtable / HBase

  5. Agenda • Related Work • Case study • Incremental view maintenance • Summary Delta Algorithm • Conclusion and future work

  6. Related Work
     • Caching intermediate results
       • DryadInc
       • Incoop
     • Incremental programming models
       • Google Percolator
       • Continuous bulk processing (CBP)
     References:
       L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
       P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
       D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
       D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010

  7. Challenges
     • Programming model
       • SQL / relational algebra vs. MapReduce
     • Efficient access paths
       • No secondary indexes in HBase
     • Support for transactions
       • Only single-row transactions in HBase

  8. Case Study
     • Word histograms
     • Reverse web-link graphs
     • Term-vectors per host
     • Count of URL access frequency
     • Inverted indexes
     Reference:
       J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

  9. Computing Reverse Web-Link Graphs (figure: a collection of HTML documents, each shown as <html> ... </html>)

  10. Sample Web-Link Graph
     a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
     b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>

  11. Computing Reverse Web-Link Graphs
     Input:
       a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
       b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
     Map:
       a.htm emits (b.htm, a.htm) and (b.htm, a.htm)
       b.htm emits (a.htm, b.htm) and (b.htm, b.htm)
     Shuffle: pairs are grouped by link target
     Reduce:
       (b.htm, {a.htm, b.htm})
       (a.htm, {b.htm})
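
The map, shuffle, and reduce steps on this slide can be sketched in a few lines of Python. This is only an in-memory illustration, not the author's Hadoop code: the function names (`map_page`, `shuffle`, `reduce_sources`, `reverse_link_graph`) and the regular expression used to pull out link targets are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: links look like <a href="..."> as in the sample pages.
HREF = re.compile(r'<a\s+href="([^"]+)"')

def map_page(source_url, html):
    """Map: emit one (link target, link source) pair per hyperlink."""
    for target in HREF.findall(html):
        yield target, source_url

def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the link target)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sources(target, sources):
    """Reduce: collapse the sources of one target into a set."""
    return target, set(sources)

def reverse_link_graph(pages):
    pairs = [pair for url, html in pages.items() for pair in map_page(url, html)]
    return dict(reduce_sources(t, s) for t, s in shuffle(pairs).items())

pages = {
    "a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
    "b.htm": '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
}
graph = reverse_link_graph(pages)
```

Because the reducer builds a set, the two links from a.htm to b.htm collapse into one entry, so the result alone cannot tell whether removing one link removes the edge.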

  12. Summary Delta Algorithm
     View definition:
       CREATE VIEW Parts AS
         SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
         FROM Orders
         GROUP BY partID
     Summary delta:
       SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
       FROM (
         (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
          FROM Orders_Insertions
          GROUP BY partID)
         UNION ALL
         (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
          FROM Orders_Deletions
          GROUP BY partID)
       )
       GROUP BY partID
     References:
       I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
       W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
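
To see the summary delta query at work, the sketch below runs it against an in-memory SQLite database and then installs the result into a copy of the Parts view. Several things here are assumptions for illustration only: the sample rows, the column types, the helper names `compute_summary_delta` and `install_delta`, and the rule that a group disappears once its tuple count reaches zero. The parentheses around the two inner SELECTs are dropped because SQLite does not accept them in a compound query.

```python
import sqlite3

# Summary delta query from the slide, adapted to SQLite syntax.
SUMMARY_DELTA_SQL = """
SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
    SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
    FROM Orders_Insertions GROUP BY partID
    UNION ALL
    SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
    FROM Orders_Deletions GROUP BY partID
)
GROUP BY partID
"""

def compute_summary_delta(conn):
    """Return one (partID, revenue change, tuple-count change) row per part."""
    return conn.execute(SUMMARY_DELTA_SQL).fetchall()

def install_delta(view, delta):
    """Add the delta to the view; drop groups whose tuple count reaches zero."""
    for part_id, revenue, tplcnt in delta:
        old_revenue, old_count = view.get(part_id, (0, 0))
        new_count = old_count + tplcnt
        if new_count == 0:
            view.pop(part_id, None)
        else:
            view[part_id] = (old_revenue + revenue, new_count)
    return view

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders_Insertions (partID TEXT, qty INTEGER, price INTEGER);
CREATE TABLE Orders_Deletions  (partID TEXT, qty INTEGER, price INTEGER);
INSERT INTO Orders_Insertions VALUES ('p1', 2, 10), ('p2', 1, 5);
INSERT INTO Orders_Deletions  VALUES ('p1', 1, 10), ('p3', 4, 1);
""")

# Hypothetical current contents of Parts: partID -> (revenue, tplcnt).
parts_view = {"p1": (30, 2), "p3": (4, 1)}

delta = compute_summary_delta(conn)
updated = install_delta(dict(parts_view), delta)
```

The tuple count is what makes deletions safe to apply: without `tplcnt`, a revenue of zero could not be distinguished from a group that no longer has any orders.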

  13. Computing Reverse Web-Link Graphs
     Input:
       a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
       b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
     Map:
       a.htm emits (b.htm, a.htm) and (b.htm, a.htm)
       b.htm emits (a.htm, b.htm) and (b.htm, b.htm)
     Shuffle: pairs are grouped by link target
     Reduce:
       (b.htm, {a.htm, b.htm})
       (a.htm, {b.htm})

  14. Achieving Self-Maintainability
     Input:
       a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
       b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
     Map:
       a.htm emits (b.htm, [a.htm, 1]) and (b.htm, [a.htm, 1])
       b.htm emits (a.htm, [b.htm, 1]) and (b.htm, [b.htm, 1])
     Shuffle: pairs are grouped by link target
     Reduce:
       (b.htm, {[a.htm, 2], [b.htm, 1]})
       (a.htm, {[b.htm, 1]})
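
A minimal Python sketch of the counted variant follows. It mirrors the slide: each link is emitted with a multiplicity of 1 and the reducer sums multiplicities per source. The names `map_page`, `reduce_counts`, and `build_view`, and the link-matching regular expression, are assumptions for this example rather than anything taken from the talk.

```python
import re
from collections import Counter, defaultdict

# Assumption: links look like <a href="..."> as in the sample pages.
HREF = re.compile(r'<a\s+href="([^"]+)"')

def map_page(source_url, html):
    """Map: emit (link target, [link source, 1]) for every hyperlink."""
    for target in HREF.findall(html):
        yield target, (source_url, 1)

def reduce_counts(values):
    """Reduce: sum the multiplicities of each source for one target."""
    counts = Counter()
    for source, n in values:
        counts[source] += n
    return dict(counts)

def build_view(pages):
    groups = defaultdict(list)  # shuffle: group values by link target
    for url, html in pages.items():
        for target, value in map_page(url, html):
            groups[target].append(value)
    return {target: reduce_counts(values) for target, values in groups.items()}

pages = {
    "a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
    "b.htm": '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>',
}
view = build_view(pages)
```

Keeping the count (a.htm links to b.htm twice) is what later allows a single removed link to be subtracted without re-reading the base documents.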

  15. Sample Web-Link Graph
     a.htm (before update): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
     a.htm (after update): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>
     b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>

  16. Summary Delta Algorithm in MapReduce
     Input:
       a.htm (deleted): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
       a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>
     Map:
       a.htm (deleted) emits (b.htm, [a.htm, -1]) and (b.htm, [a.htm, -1])
       a.htm (inserted) emits (b.htm, [a.htm, +1]) and (a.htm, [a.htm, +1])
     Shuffle: pairs are grouped by link target
     Reduce:
       (b.htm, {[a.htm, -1]})
       (a.htm, {[a.htm, +1]})
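
The delta computation on this slide can be sketched as follows: deleted page versions are mapped with multiplicity -1, inserted versions with +1, and the reducer keeps only non-zero net changes. The helper names (`map_delta`, `reduce_delta`, `compute_view_delta`, `apply_delta`) and the way the delta is installed into an in-memory dictionary are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumption: links look like <a href="..."> as in the sample pages.
HREF = re.compile(r'<a\s+href="([^"]+)"')

def map_delta(source_url, html, sign):
    """Map: emit (target, [source, sign]); sign is -1 for deleted, +1 for inserted."""
    for target in HREF.findall(html):
        yield target, (source_url, sign)

def reduce_delta(values):
    """Reduce: net change per source, dropping sources whose changes cancel out."""
    totals = defaultdict(int)
    for source, n in values:
        totals[source] += n
    return {source: n for source, n in totals.items() if n != 0}

def compute_view_delta(deleted, inserted):
    groups = defaultdict(list)  # shuffle: group values by link target
    for sign, docs in ((-1, deleted), (+1, inserted)):
        for url, html in docs.items():
            for target, value in map_delta(url, html, sign):
                groups[target].append(value)
    delta = {target: reduce_delta(values) for target, values in groups.items()}
    return {target: changes for target, changes in delta.items() if changes}

def apply_delta(view, delta):
    """Install the delta into a counted reverse web-link view (in place)."""
    for target, changes in delta.items():
        row = view.setdefault(target, {})
        for source, n in changes.items():
            row[source] = row.get(source, 0) + n
            if row[source] == 0:
                del row[source]
        if not row:
            del view[target]
    return view

deleted = {"a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>'}
inserted = {"a.htm": '<html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>'}

delta = compute_view_delta(deleted, inserted)
view = {"b.htm": {"a.htm": 2, "b.htm": 1}, "a.htm": {"b.htm": 1}}
apply_delta(view, delta)
```

After installation the view matches what a full recomputation over the updated a.htm and the unchanged b.htm would produce, although b.htm was never read.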

  17. Delta Installation Approaches
     Increment Installation (diagram): Base deltas; MapReduce; Materialized view
     Overwrite Installation (diagram): Materialized view; Base deltas; MapReduce; Materialized view
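
The slide gives only diagram labels, so the sketch below is one plausible reading of the two approaches and should be treated as an assumption throughout. In this reading, increment installation computes net changes from the base deltas alone and adds them to the stored values, while overwrite installation reads the current view together with the base deltas and writes back final values. The function names and the word-count data are hypothetical.

```python
def increment_installation(view, base_deltas):
    """Assumed reading: the job sees only base deltas and emits net changes,
    which are then added to the stored values."""
    net_changes = {}
    for key, change in base_deltas:
        net_changes[key] = net_changes.get(key, 0) + change
    for key, change in net_changes.items():
        # Stands in for a store-side increment operation.
        view[key] = view.get(key, 0) + change
    return view

def overwrite_installation(view, base_deltas):
    """Assumed reading: the job reads the current view and the base deltas,
    computes final values, and overwrites the stored values."""
    new_values = {}
    for key, change in base_deltas:
        start = new_values.get(key, view.get(key, 0))
        new_values[key] = start + change
    for key, value in new_values.items():
        # Stands in for a plain write that replaces the old value.
        view[key] = value
    return view

# Hypothetical word-count view and base deltas.
current_view = {"map": 3, "reduce": 1}
base_deltas = [("map", +1), ("reduce", -1), ("delta", +1), ("map", +1)]

by_increment = increment_installation(dict(current_view), base_deltas)
by_overwrite = overwrite_installation(dict(current_view), base_deltas)
```

Both paths end in the same view here; under this reading the difference lies in whether the job has to read the materialized view as an additional input.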

  18. Case Study – Lessons Learned
     • Numerical aggregation
       • Word histogram
       • URL access frequency
     • Set aggregation
       • Reverse web-link graph
       • Inverted index
     • Multiset aggregation
       • Term-vector per host

  19. General Solution
     • Self-maintainable aggregates
     • Computed in three steps
       • Translation
       • Grouping
       • Aggregation
         • Commutative and associative binary function
         • Inverse elements
         • Abelian group
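
The three steps and the Abelian-group requirement can be sketched generically in Python. Everything named here (`AbelianGroup`, `INT_ADD`, `aggregate`, `maintain`, `translate_words`) is an assumption introduced for the example; the slide names only the concepts. The point is that inverse elements let deletions be folded into an existing view without touching unchanged base records.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class AbelianGroup:
    identity: Any
    combine: Callable[[Any, Any], Any]  # commutative and associative
    inverse: Callable[[Any], Any]       # combine(x, inverse(x)) == identity

# Integers under addition, used here for counting.
INT_ADD = AbelianGroup(0, lambda a, b: a + b, lambda a: -a)

def aggregate(records, translate, group):
    """Translation, grouping, and aggregation over a collection of records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in translate(record):   # translation
            groups[key].append(value)          # grouping
    result = {}
    for key, values in groups.items():         # aggregation
        acc = group.identity
        for value in values:
            acc = group.combine(acc, value)
        result[key] = acc
    return result

def maintain(view, inserted, deleted, translate, group):
    """Update a view from inserted and deleted records only."""
    updated = dict(view)
    for key, value in aggregate(inserted, translate, group).items():
        updated[key] = group.combine(updated.get(key, group.identity), value)
    for key, value in aggregate(deleted, translate, group).items():
        updated[key] = group.combine(updated.get(key, group.identity),
                                     group.inverse(value))
    return {k: v for k, v in updated.items() if v != group.identity}

def translate_words(document):
    """Hypothetical translation function: one (word, 1) pair per word."""
    return [(word, 1) for word in document.split()]

old_base = ["a b a", "b c"]
new_base = ["b c", "a c"]  # "a b a" deleted, "a c" inserted
view = aggregate(old_base, translate_words, INT_ADD)
maintained = maintain(view, ["a c"], ["a b a"], translate_words, INT_ADD)
recomputed = aggregate(new_base, translate_words, INT_ADD)
```

Swapping in a different group and translation function changes the aggregate being maintained while `aggregate` and `maintain` stay the same.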

  20. Case Study – Lessons Learned
     • Numerical aggregation
       • Word histogram
       • URL access frequency
     • Set aggregation
       • Reverse web-link graph
       • Inverted index
     • Multiset aggregation
       • Term-vector per host
     Example (word histogram):
       Translation function: translate web pages into (word, 1)
       Aggregation function: Abelian group (integers, +)
     Example (reverse web-link graph):
       Translation function: translate web pages into (link target, link source)
       Aggregation function: Abelian group (power-multiset of URLs, multiset union)

  21. Evaluation (chart): y-axis: elapsed time [min]; x-axis: updates in base documents [%]

  22. Conclusion & Future Work
     • View maintenance in MapReduce
       • Case study
       • Summary delta algorithm
       • Self-maintainable aggregations
     • Future work
       • Broader class of MapReduce programs
       • High-level MapReduce languages, e.g. Jaql or Pig Latin
