
In-situ MapReduce for Log Processing



Presentation Transcript


  1. In-situ MapReduce for Log Processing Yunho Kim, Database Lab, Department of Computer Science, College of Engineering

  2. Index • Introduction • Design overview • Lossy MapReduce processing • Prototype • Evaluation • Conclusion

  3. 1. Introduction • Log • Click log • System and network log • Application log • E-commerce and credit card company • Infrastructure provider

  4. 1. Introduction • Store-first-query-later: logs from many servers are shipped to a centralized compute cluster [diagram: log servers feeding a centralized compute cluster]

  5. 1. Introduction • Two drawbacks • Scale and timeliness • Sacrifice availability or return incomplete results

  6. 1. Introduction • Strict consistency

  7. 1. Introduction • Systematic method [diagram: log servers feeding a centralized compute cluster]

  8. 1. Introduction • "In-situ" MapReduce (iMR) architecture • Move analysis to the servers • MapReduce for continuous data • Ability to trade fidelity for latency [diagram: analysis runs on the log servers themselves]

  9. 1. Introduction • Differs from a dedicated Hadoop cluster, in which nodes share a distributed file system

  10. 1. Introduction • Continuous MapReduce model • Lossy MapReduce processing • Architectural lessons • Best-effort distributed stream processor, Mortar • Sub-windows or panes • Impact of failures on result fidelity and latency • Load cancellation and shedding policies

  11. 2. Design overview • iMR - complement, not replace • Scalable • Responsive • Available • Efficient • Compatible

  12. 2. Design overview • Identical MapReduce job in iMR • Map • Reduce • iMR jobs emit a stream of results computed over continuous input.
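A minimal sketch of what an unmodified map/reduce pair for such a job might look like, here counting error codes in a log stream (the function names, log format, and single-machine driver are illustrative assumptions, not the iMR API):

```python
from collections import defaultdict

def map_fn(log_line):
    # Hypothetical map: emit (error_code, 1) for each error entry.
    if "ERROR" in log_line:
        code = log_line.split()[-1]
        yield (code, 1)

def reduce_fn(key, values):
    # Sum the partial counts for one key.
    return (key, sum(values))

def run_job(log_lines):
    # Toy single-machine driver; iMR would run the same map/reduce
    # continuously over windows of the log stream.
    groups = defaultdict(list)
    for line in log_lines:
        for k, v in map_fn(line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

The point of slide 12 is that the job definition itself is unchanged; only the execution model (continuous windows instead of a finite input) differs.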

  13. 2. Design overview • Aggregation trees for efficiency • Distribute processing load • Reduce network traffic

  14. 2. Design overview • Sliding windows • Range of data over time [diagram: log entries feeding Map/Combine, then Reduce]

  15. 2. Design overview • Problem • Overlapping data between successive windows • Wastes CPU and network [diagram: overlapping data reprocessed by Map/Combine]

  16. 2. Design overview • Eliminate redundant work • Panes (sub-windows) • Root combines panes to produce the window • Saves CPU & network resources [diagram: panes P1..P4 computed once at Map/Combine, merged into one window at the Reduce]
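The pane idea on slide 16 can be sketched as follows, assuming a simple sum aggregate and a window that slides by one pane (a toy model, not the prototype's implementation):

```python
from collections import deque

def pane_sum(entries):
    # Each pane's entries are combined exactly once (map/combine at a leaf).
    return sum(entries)

def sliding_windows(pane_stream, panes_per_window):
    # The root keeps only per-pane partial results and merges them into
    # each full window, so overlapping data is never reprocessed.
    recent = deque(maxlen=panes_per_window)
    for pane in pane_stream:
        recent.append(pane_sum(pane))
        if len(recent) == panes_per_window:
            yield sum(recent)  # merge panes P1..Pn into one window result
```

Because successive windows share all but one pane, recomputing only the newest pane and re-merging the cached partials is what saves the CPU and network the slide refers to.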

  17. 3. Lossy MapReduce processing • Data loss may occur • Node or network failures • Consequence of result latency requirements • Data loss is unavoidable to ensure timeliness • How to represent and calculate result quality to allow users to interpret partial results? • How to use this metric to trade result fidelity for improved result latency?

  18. 3. Lossy MapReduce processing • Completeness metric C2 • Distribution of log data across • Space (log server nodes) • Time (the window range) • Root maintains C2 like a scoreboard over space × time.

  19. 3. Lossy MapReduce processing • Area (A) with earliest results • The most freedom to decrease latency • Appropriate for uniformly distributed events • Area (A) with random sampling • Less freedom to decrease latency • Appropriate even for non-uniform data • Spatial completeness (X, 100%) • Useful when events are local to a node • Temporal completeness (100%, Y) • Useful for correlating events across servers
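The scoreboard view of C2 on slides 18 and 19 can be sketched as a grid of (node, pane) cells; the fraction of cells that have arrived gives overall completeness, and the spatial variant asks how many nodes are fully represented (function names and the set-of-pairs encoding are assumptions for illustration):

```python
def c2_completeness(received, nodes, panes):
    # C2 scoreboard: one cell per (log server, pane) in the window.
    # `received` is the set of (node, pane) pairs present at the root.
    total = len(nodes) * len(panes)
    filled = sum((n, p) in received for n in nodes for p in panes)
    return filled / total

def spatial_completeness(received, nodes, panes):
    # (X, 100%)-style view: fraction of nodes whose *entire* time
    # range arrived, useful when events are local to a node.
    complete = [n for n in nodes if all((n, p) in received for p in panes)]
    return len(complete) / len(nodes)
```

A (100%, Y) temporal variant would transpose the same check, requiring every node's data for a given pane before counting that pane as complete.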

  20. 3. Lossy MapReduce processing • Result eviction: trading fidelity for availability • Latency eviction • Return incomplete results to meet the deadline • Fidelity eviction • Evict when the results meet the quality requirement
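The two eviction rules above reduce to a small decision, sketched here with hypothetical parameter names (the real system's timing and quality interfaces are not shown in the slides):

```python
def should_evict(now, deadline, completeness, target_fidelity):
    # Latency eviction: the deadline arrived, return whatever we have.
    if now >= deadline:
        return True
    # Fidelity eviction: quality target already met, stop waiting.
    if completeness >= target_fidelity:
        return True
    return False
```

Either trigger releases the window's partial result; the C2 score attached to it lets the user interpret what they received.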

  21. 3. Lossy MapReduce processing • Load cancellation and shedding • Load cancellation • Internal nodes don't waste cycles creating or merging panes that will never be used. • Load shedding • Prevents wasted effort when individual nodes are heavily loaded.
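One plausible reading of load shedding is sketched below: a loaded node estimates how many pending panes it can still finish before the deadline and drops the rest up front, rather than half-processing panes that would arrive too late (the rate estimate and function signature are assumptions, not the paper's policy):

```python
def shed_panes(pending_panes, processing_rate, deadline, now):
    # Estimate how many panes fit in the remaining time budget,
    # keep those, and drop the rest immediately.
    budget = int((deadline - now) * processing_rate)
    keep = pending_panes[:budget]
    dropped = pending_panes[budget:]
    return keep, dropped
```

Load cancellation is the tree-wide analogue: once the root evicts a window, internal nodes stop building panes that could only have fed that window.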

  22. 4. Prototype • Builds upon Mortar • A distributed stream processing system • Extended to support • MapReduce API • Pane-based processing • Fault tolerance mechanisms

  23. 5. Evaluation • HDFS log analysis • Users must decide whether the fidelity/latency tradeoff is acceptable

  24. 5. Evaluation • In-situ performance • Hadoop can improve job throughput while iMR still delivers useful results.

  25. 6. Conclusion • Log analysis moves from dedicated clusters to the data sources • Continuous in-situ processing • C2 framework • Trading fidelity for availability
