Big Data Analysis Technology

Big Data Analysis Technology. University of Paderborn L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English) Summer semester 2013 June 12, 2013 Tobias Hardes (6687549) – Tobias.Hardes@gmail.com. Table of content. Introduction Definitions Background Example

Presentation Transcript


  1. Big Data Analysis Technology University of Paderborn L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English) Summer semester 2013 June 12, 2013 Tobias Hardes (6687549) – Tobias.Hardes@gmail.com

  2. Table of content • Introduction • Definitions • Background • Example • Related Work • Research • Main Approaches • Association Rule Mining • MapReduce Framework • Conclusion

  3. 4 Big keywords

  4. Big Data vs. Business Intelligence • How can we predict cancer early enough to treat it successfully? • How can I make significant profit on the stock market next month? • Which is the most profitable branch of our supermarket? • In a specific country? • During a specific period of time? docs.oracle.com

  5. Background home.web.cern.ch

  6. Big Science – The LHC • 600 million times per second, particles collide within the Large Hadron Collider (LHC) • Each collision generates new particles • Particles decay in complex ways • Each collision is detected • The CERN Data Center reconstructs each collision event • 15 petabytes of data are stored every year • The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data home.web.cern.ch

  7. Data Stream Analysis • Just in time analysis of data. • Sensor networks • Analysis for a certain time (last 30 seconds) http://venturebeat.com

  8. Complex event processing (CEP) • Provides queries for streams • Usage of “Event Processing Languages” (EPL) • select avg(price) from StockTickEvent.win:time(30 sec) • Tumbling window: slide = window size • Sliding window: slide < window size https://forge.fi-ware.eu
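
The EPL query above averages the prices seen during the last 30 seconds. A minimal Python sketch of the same idea follows; the class name `TimeWindowAverage`, the `(timestamp, price)` event format and the sample events are assumptions made for illustration and are not part of any CEP engine.

```python
from collections import deque


class TimeWindowAverage:
    """Average of the values seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, price) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, price):
        """Register one event and return the current windowed average."""
        self.events.append((timestamp, price))
        self.total += price
        # Evict events that have fallen out of the time window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_price = self.events.popleft()
            self.total -= old_price
        return self.total / len(self.events)


# Hypothetical stock ticks: (seconds since start, price)
window = TimeWindowAverage(window_seconds=30)
ticks = [(0, 10.0), (10, 20.0), (25, 30.0), (45, 40.0)]
averages = [window.add(t, p) for t, p in ticks]
print(averages)  # [10.0, 15.0, 20.0, 35.0]
```

Keeping a running total means each event is added once and removed once, so the per-event cost stays small even for long windows.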

  9. Complex Event Processing - Areas of application • Just-in-time analysis → complexity of algorithms • CEP is used with Twitter: • Identify emotional states of users • Sarcasm?

  10. Related Work

  11. Big Data in companies

  12. Principles • Statistics • Probability theory • Machine learning • Data Mining • Association rule learning • Cluster analysis • Classification

  13. Association Rule Mining – Cluster analysis • Association Rule Mining • Is soda purchased with bananas? • Relationships between items • Find associations, correlations or causal structures • Apriori algorithm • Frequent Pattern (FP)-Growth algorithm

  14. Cluster analysis – Classification Cluster Analysis • Classification of similar objects into classes • Classes are defined during the clustering • k-Means • K-Means++
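
As a rough illustration of how k-Means defines the classes during clustering, here is a minimal sketch of the usual assignment and update loop. The function name `kmeans`, the sample points and the hand-picked starting centroids are assumptions; k-Means++ differs in that it chooses the starting centroids with a seeding procedure, which is not shown here.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update steps until nothing changes."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```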

  15. Research and future work • Performance, performance, performance… • Passes of the data source • Parallelization • NP-hard problems • …. • Accuracy • Optimized solutions

  16. Example • Apriori algorithm: n+1 database scans • FP-Growth algorithm: 2 database scans

  17. Distributed computing – Motivation • Complex computational tasks • Several terabytes of data • Limited hardware resources • Google's MapReduce framework Prof. Dr. Erich Ehses (FH Köln)

  18. Main approaches http://ultraoilforpets.com

  19. Structure • Association rule mining • Apriori algorithm • FP-Growth algorithm • Google's MapReduce

  20. Association rule mining • Identify items that are related to other items • Example: Analysis of baskets in an online shop or in a supermarket http://img.deusm.com/

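  21. Terminology • A stream or a database with n elements: S • Item set: A (a set of items) • Frequency of occurrence of an item set: Φ(A) • Association rule: A → B, where A and B are disjoint item sets • Support: supp(A → B) = Φ(A ∪ B) / n • Confidence: conf(A → B) = Φ(A ∪ B) / Φ(A)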

  22. Example • Rule: “If a basket contains cheese and chocolate, then it also contains bread” • 6 of 60 transactions contain cheese and chocolate • 3 of the 6 transactions contain bread
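
The numbers of this example can be turned into support and confidence values. The sketch below assumes the common definitions (support is the share of all transactions that contain every item of the rule, confidence is the share of transactions containing the antecedent that also contain the consequent); the function and variable names are chosen for illustration only.

```python
def support(count_rule, n_transactions):
    """Fraction of all transactions that contain every item of the rule."""
    return count_rule / n_transactions


def confidence(count_rule, count_antecedent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return count_rule / count_antecedent


n = 60            # transactions in total
cheese_choc = 6   # transactions containing cheese and chocolate
with_bread = 3    # of those, transactions that also contain bread

rule_support = support(with_bread, n)                  # 3 / 60
rule_confidence = confidence(with_bread, cheese_choc)  # 3 / 6
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
# support = 5%, confidence = 50%
```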

  23. Common approach • Split the problem into two tasks: • Generation of frequent item sets • Find item sets that satisfy a minimum support value • Generation of rules • Find high-confidence rules using the frequent item sets

  24. Apriori algorithm – Frequent item set • Input: • Minimum support: min_sup • Data source: S

  25. Apriori – Frequent item sets (I) • Generation of frequent item sets: min_sup = 2 • [Figure: lattice of item sets over the items A, B, C and D, starting from the empty set {}, annotated with support counts] https://www.mev.de/

  26. Apriori – Frequent item sets (II) • Generation of frequent item sets: min_sup = 2 • L1: A 2, B 4, C 4, D 3 • Candidates L2: AB 1, AC 2, AD 2, BC 3, BD 2, CD 2 • Candidates L3: ACD 2, BCD 1 https://www.mev.de/
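
The level-wise generation of frequent item sets can be sketched as follows. This is a simplified illustration, not the presenter's implementation: candidates are joined by pairwise unions instead of a prefix-based join, and the five transactions are a hypothetical data set chosen so that the single items get the counts A 2, B 4, C 4, D 3 with min_sup = 2.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent item sets."""
    def count(candidates):
        # One scan of the data source per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})  # L1
    frequent = dict(level)
    k = 2
    while level:
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical transactions (assumption, see the paragraph above)
transactions = [set("ABCD"), set("ACD"), set("BC"), set("BC"), set("BD")]
frequent = apriori(transactions, min_sup=2)
print(frequent[frozenset("ACD")])  # 2
```

With this data AB is dropped at level 2, so ABC and ABD are pruned without being counted, and BCD is counted once and rejected at level 3.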

  27. Apriori Algorithm – Rule generation • Uses frequent item sets to extract high-confidence rules • Based on the same principle as the item set generation • Done for all frequent item sets Lk

  28. Example: Rule generation
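
A minimal sketch of rule generation from frequent item sets is given below. It enumerates every split of an item set by brute force rather than using level-wise pruning, and it assumes that the input contains every subset of each frequent item set, which holds for Apriori output. The name `generate_rules`, the threshold 0.8 and the support counts are hypothetical.

```python
from itertools import combinations


def generate_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for all rules meeting min_conf.

    `frequent` maps each frequent item set (frozenset) to its support count.
    """
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # Confidence of antecedent -> rest: count(item set) / count(antecedent)
                conf = count / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical support counts for the item set ACD and all of its subsets
frequent = {
    frozenset("A"): 2, frozenset("C"): 4, frozenset("D"): 3,
    frozenset("AC"): 2, frozenset("AD"): 2, frozenset("CD"): 2,
    frozenset("ACD"): 2,
}
rules = generate_rules(frequent, min_conf=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```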

  29. Summary Apriori algorithm • n+1 scans of the database • Expensive generation of the candidate item sets • Implements level-wise search using the frequent item set property • Easy to implement • Some opportunities for specialized optimizations

  30. FP-Growth algorithm • Used for databases • Features: • Requires 2 scans of the database • Uses a special data structure – The FP-Tree • Build the FP-Tree • Extract frequent item sets • Compression of the database • Divide this database and apply data mining

  31. Construct FP-Tree
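
The two database scans used to build the FP-Tree can be sketched as follows. This is a simplified illustration: the header table with node links that a full FP-Growth implementation uses for mining is omitted, ties between equally frequent items are broken alphabetically (an assumption), and the class `Node`, the function `build_fp_tree` and the transactions are hypothetical.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> Node


def build_fp_tree(transactions, min_sup):
    # Scan 1: count how often each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, n in counts.items() if n >= min_sup}
    root = Node(None)
    # Scan 2: insert each transaction, most frequent items first,
    # so that transactions with common items share a prefix path.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
            child.count += 1
            node = child
    return root


# Hypothetical transactions over the items a to e
transactions = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"],
                ["a", "d", "e"], ["a", "b", "c"]]
root = build_fp_tree(transactions, min_sup=2)
print(root.children["a"].count)  # 4
```

Four of the five transactions share the prefix `a`, so they are stored under a single child of the root; this sharing is what compresses the database.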

  32. Extract frequent item sets (I) • Bottom-up strategy • Start with node “e” • Then look for “de” • Each path is processed recursively • Solutions are merged

  33. Extract frequent item sets (II) • Is e frequent? • Is de frequent? • … • Is ce frequent? • …. • Is be frequent? • …. • Is ae frequent? • ….. • Using subproblems to identify frequent item sets • Φ(e) = 3 – Assume the minimum support was set to 2

  34. Extract frequent item sets (III) • Update the support count along the prefix path • Remove node e • Check the frequency of the paths • Find item sets with de, ce, ae or be

  35. Apriori vs. FP-Growth • FP-Growth has some advantages • Two scans of the database • No expensive computation of candidates • Compressed data structure • Easier to parallelize W. Zhang, H. Liao, and N. Zhao, “Research on the fp growth algorithm about association rule mining

  36. MapReduce • Map and Reduce functions are expressed by a developer • map(key, val) • Emits new key-value pairs • reduce(key, values) • Emits an arbitrary output • Usually a key with one value

  37. MapReduce – Word count
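
Word count is the customary first MapReduce example. The sketch below simulates the map, shuffle and reduce steps in a single Python process; it is not Hadoop or Google code, and `map_fn`, `reduce_fn`, `run_mapreduce` and the two sample documents are names and data assumed for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document text. Emit (word, 1) per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """key: a word, values: all counts emitted for it. Emit the total."""
    yield key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)  # shuffle: group by key
    # Reduce phase: one reduce_fn call per distinct key.
    results = {}
    for key in sorted(intermediate):
        for out_key, out_value in reduce_fn(key, intermediate[key]):
            results[out_key] = out_value
    return results


docs = [("doc1", "big data is big"), ("doc2", "data analysis")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)  # {'analysis': 1, 'big': 2, 'data': 2, 'is': 1}
```

In a real cluster the map calls and the reduce calls run on different workers, and the grouping by key is done by the framework during the shuffle.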

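  38. [Figure: MapReduce execution overview] • The user program forks (1) the master and the workers • The master assigns (2) map and reduce tasks to the workers • Map workers read (3) the input files and write (4) intermediate files locally • Reduce workers (one each for the blue, red and yellow keys) fetch the intermediate data via RPC (5) and write (6) the output files • Control returns (7) to the user program • Stages: input files, map phase, intermediate files, shuffle, reduce phase, output files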

  39. Conclusion: MapReduce (I) • MapReduce is designed as a batch processing framework • Not intended for ad-hoc analysis • Used for very large data sets • Used for time-intensive computations • Open source implementation: Apache Hadoop http://hadoop.apache.org/

  40. Conclusion

  41. Conclusion (I) • Big Data is important for research and in daily business • Different approaches • Data Stream analysis • Complex event processing • Rule Mining • Apriori algorithm • FP-Growth algorithm

  42. Conclusion (II) • Clustering • K-Means • K-Means++ • Distributed computing • MapReduce • Performance / Runtime • Multiple minutes • Hours • Days… • Online analytical processing for Big Data?

  43. Thank you for your attention

  44. Appendix

  45. Big Data definitions • Every day, we create 2.5 quintillion bytes of …. . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data. (IBM Corporate) • Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (Gartner Inc.) • “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. (McKinsey & Company)

  47. Complex Event Processing – Windows • Sliding window (slide < window size): slides in time, buffers the last x elements • Tumbling window (slide = window size): moves as much as the window size
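
The difference between the two window types comes down to the relation between slide and window size. The following sketch uses count-based windows over a list to make this visible; the helper name `windows` and the sample events are assumptions for illustration.

```python
def windows(events, size, slide):
    """Yield successive windows of `size` events, advancing by `slide` events."""
    for start in range(0, len(events) - size + 1, slide):
        yield events[start:start + size]


events = [1, 2, 3, 4, 5, 6]
tumbling = list(windows(events, size=3, slide=3))  # slide == size: no overlap
sliding = list(windows(events, size=3, slide=1))   # slide < size: windows overlap

print(tumbling)  # [[1, 2, 3], [4, 5, 6]]
print(sliding)   # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```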

  48. MapReduce vs. BigQuery

  49. Apriori Algorithm (Pseudocode) • for ( … • for each … do • for each … do • end for • end for • if … then • end if • end for • return
