slide1

Big Data Analysis Technology

University of Paderborn

L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)

Summer semester 2013

June 12, 2013

Tobias Hardes (6687549) – Tobias.Hardes@gmail.com

slide2

Table of contents

  • Introduction
    • Definitions
  • Background
    • Example
  • Related Work
    • Research
  • Main Approaches
    • Association Rule Mining
    • MapReduce Framework
  • Conclusion
big data vs business intelligence
Big Data vs. Business Intelligence
  • How can we predict cancer early enough to treat it successfully?
  • How can I make significant profit on the stock market next month?

docs.oracle.com

  • Which is the most profitable branch of our supermarket?
    • In a specific country?
    • During a specific period of time?
background

Background

home.web.cern.ch

slide6

Big Science – The LHC

  • 600 million times per second, particles collide within the Large Hadron Collider (LHC)
  • Each collision generates new particles
  • The particles decay in complex ways
  • Each collision is detected
  • The CERN Data Center reconstructs these collision events
  • 15 petabytes of data are stored every year
  • The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data

home.web.cern.ch

data stream analysis
Data Stream Analysis
  • Just-in-time analysis of data
    • Sensor networks
  • Analysis over a certain time window (e.g., the last 30 seconds)

http://venturebeat.com

complex event processing cep
Complex Event Processing (CEP)
  • Provides queries for streams
  • Usage of "Event Processing Languages" (EPL)
  • select avg(price) from StockTickEvent.win:time(30 sec)

[Figure: tumbling window (slide = window size) vs. sliding window (slide < window size)]

https://forge.fi-ware.eu

complex event processing areas of application
Complex Event Processing - Areas of application
  • Just-in-time analysis → the complexity of the algorithms matters
  • CEP is used with Twitter:
    • Identify emotional states of users
    • Sarcasm?
principles
Principles
  • Statistics
  • Probability theory
  • Machine learning
  • Data Mining
    • Association rule learning
    • Cluster analysis
    • Classification
association rule mining cluster analysis
Association Rule Mining – Cluster Analysis

Association Rule Mining

Is soda purchased with bananas?

  • Relationships between items
  • Find associations, correlations or causal structures
  • Apriori algorithm
  • Frequent Pattern (FP)-Growth algorithm
cluster analysis classification
Cluster analysis – Classification

Cluster Analysis

  • Classification of similar objects into classes
  • Classes are defined during the clustering
  • K-Means
  • K-Means++
research and future work
Research and future work
  • Performance, performance, performance…
    • Passes of the data source
    • Parallelization
    • NP-hard problems
    • ….
  • Accuracy
    • Optimized solutions
example
Example
  • Apriori algorithm: n+1 database scans
  • FP-Growth algorithm: 2 database scans
distributed computing motivation
Distributed computing – Motivation
  • Complex computational tasks
  • Several terabytes of data
  • Limited hardware resources
  • Google's MapReduce framework

Prof. Dr. Erich Ehses (FH Köln)

main approaches

Main approaches

http://ultraoilforpets.com

structure
Structure
  • Association rule mining
    • Apriori algorithm
    • FP-Growth algorithm
  • Google's MapReduce
association rule mining
Association rule mining
  • Identify items that are related to other items
  • Example: Analysis of baskets in an online shop or in a supermarket

http://img.deusm.com/

terminology
Terminology
  • A stream or a database with n elements: S
  • Item set: A ⊆ I, where I is the set of all items
  • Frequency of occurrence of an item set: Φ(A)
  • Association rule: A ⇒ B
  • Support: s(A ⇒ B) = Φ(A ∪ B) / n
  • Confidence: c(A ⇒ B) = Φ(A ∪ B) / Φ(A)
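A minimal Python sketch of these definitions, assuming transactions are represented as sets of items (the data below is made up for illustration):

# Sketch of the support/confidence definitions above; transactions are illustrative.
transactions = [
    {"cheese", "chocolate", "bread"},
    {"cheese", "chocolate"},
    {"bread", "milk"},
    {"cheese", "bread"},
]

def frequency(item_set, transactions):
    # Phi(A): number of transactions containing every item of A
    return sum(1 for t in transactions if item_set <= t)

def support(a, b, transactions):
    # s(A => B) = Phi(A u B) / n
    return frequency(a | b, transactions) / len(transactions)

def confidence(a, b, transactions):
    # c(A => B) = Phi(A u B) / Phi(A)
    return frequency(a | b, transactions) / frequency(a, transactions)

print(support({"cheese", "chocolate"}, {"bread"}, transactions))     # 0.25
print(confidence({"cheese", "chocolate"}, {"bread"}, transactions))  # 0.5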
example1
Example
  • Rule: "If a basket contains cheese and chocolate, then it also contains bread"
  • 6 of 60 transactions contain cheese and chocolate
  • 3 of these 6 transactions also contain bread → confidence = 3/6 = 50%, support = 3/60 = 5%
common approach
Common approach
  • Decompose the problem into two tasks:
  • Generation of frequent item sets
    • Find item sets that satisfy a minimum support value
  • Generation of rules
    • Find high-confidence rules using the frequent item sets
aprio algorithm frequent item set
Apriori algorithm – Frequent item sets
  • Input:
    • Minimum support: min_sup
    • Data source: S
apriori frequent item sets i
Apriori – Frequent item sets (I)
  • Generation of frequent item sets : min_sup = 2

[Figure: item-set lattice with support counts for the example database (min_sup = 2)]

apriori frequent item sets ii
Apriori – Frequent item sets (II)
  • Generation of frequent item sets : min_sup = 2

[Figure: level-wise generation — frequent 1-item sets L1 (A, B, C, D with their support counts), candidate 2-item sets and L2, candidate 3-item sets (ACD, BCD) and L3]

apriori algorithm rule generation
Apriori Algorithm – Rule generation
  • Uses frequent item sets to extract high-confidence rules
  • Based on the same principle as the item set generation
  • Done for all frequent item sets Lk (see the sketch below)
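A rough Python sketch of this step, assuming the frequent item sets and their support counts are already available from the generation phase (the helper name and example data are illustrative):

from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    # frequent_counts: dict mapping frozenset (frequent item set) -> support count.
    # By the Apriori property every subset of a frequent item set is also in the dict.
    rules = []
    for item_set, count in frequent_counts.items():
        if len(item_set) < 2:
            continue
        for r in range(1, len(item_set)):
            for antecedent in map(frozenset, combinations(item_set, r)):
                conf = count / frequent_counts[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(item_set - antecedent), conf))
    return rules

counts = {frozenset({"a"}): 4, frozenset({"b"}): 3, frozenset({"a", "b"}): 2}
print(generate_rules(counts, min_conf=0.5))  # a => b (conf 0.5) and b => a (conf ~0.67)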
summary apriori algorithm
Summary Apriori algorithm
  • n+1 scans of the database
  • Expensive generation of the candidate item sets
  • Implements level-wise search using the frequent-item-set property
  • Easy to implement
  • Some opportunities for specialized optimizations
fp growth algorithm
FP-Growth algorithm
  • Used for databases
  • Features:
    • Requires 2 scans of the database
    • Uses a special data structure – The FP-Tree
    • Build the FP-Tree
    • Extract frequent item sets
  • Compression of the database
  • Divide this database and apply data mining
extract frequent itemsets i
Extract frequent item sets (I)
  • Bottom-up strategy
  • Start with node "e"
  • Then look for "de"
  • Each path is processed recursively
  • Solutions are merged
extract frequent itemsets ii
Extract frequent item sets (II)
  • Is e frequent?
    • Is de frequent?
    • Is ce frequent?
      • ….
    • Is be frequent?
      • ….
    • Is ae frequent?
      • …..
  • Using subproblems to identify frequent itemsets

Φ(e) = 3 – Assume the minimum support was set to 2

extract frequent itemsets iii
Extract frequent item sets (III)

Update the support count along the prefix path

Remove Node e

Check the frequency of the paths

Find item sets with de, ce, ae or be

apriori vs fp growth
Apriori vs. FP-Growth
  • FP-Growth has some advantages:
    • Two scans of the database
    • No expensive computation of candidates
    • Compressed data structure
    • Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, "Research on the FP Growth Algorithm about Association Rule Mining"

mapreduce
MapReduce
  • Map and Reduce functions are expressed by a developer (see the sketch below)
  • map(key, val)
    • Emits new key-value pairs
  • reduce(key, values)
    • Emits an arbitrary output
    • Usually a key with one value
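A single-process Python sketch of the programming model (not the actual framework API); the word-count functions are the usual illustrative example:

from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document text -> emit (word, 1) pairs
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word -> emit one total
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:              # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)      # shuffle: group values by key
    output = {}
    for k, vs in intermediate.items():     # reduce phase
        for out_k, out_v in reduce_fn(k, vs):
            output[out_k] = out_v
    return output

print(run_mapreduce([("doc1", "big data big analysis")], map_fn, reduce_fn))
# {'big': 2, 'data': 1, 'analysis': 1}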
slide38

[Figure: MapReduce execution overview — (1) the user program forks a master and worker processes, (2) the master assigns map and reduce tasks to workers, (3) map workers read the input files, (4) intermediate results are written to local disk, (5) reduce workers fetch them via RPC (shuffle), (6) each reduce worker writes an output file, (7) control returns to the user program]

conclusion mapreduce i
Conclusion: MapReduce (I)
  • MapReduce is designed as a batch processing framework
  • Not intended for ad-hoc analysis
  • Used for very large data sets
  • Used for time-intensive computations
  • Open-source implementation: Apache Hadoop

http://hadoop.apache.org/

conclusion i
Conclusion (I)
  • Big Data is important for research and in daily business
  • Different approaches
  • Data Stream analysis
    • Complex event processing
  • Rule Mining
    • Apriori algorithm
    • FP-Growth algorithm
conclusion ii
Conclusion (II)
  • Clustering
    • K-Means
    • K-Means++
  • Distributed computing
    • MapReduce
  • Performance / Runtime
    • Multiple minutes
    • Hours
    • Days…
    • Online analytical processing for Big Data?
big data definitions
Big Data definitions

"Every day, we create 2.5 quintillion bytes of … . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM Corporation)

"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.)

"Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (McKinsey & Company)


complex event processing windows
Complex Event Processing – Windows

Tumbling Window (slide = window size)
  • Moves by as much as the window size

Sliding Window (slide < window size)
  • Slides in time
  • Buffers the last x elements
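A count-based Python sketch of the two window types (the EPL example earlier uses a time-based window; here element counts stand in for time, and the data is illustrative):

from collections import deque

def sliding_windows(stream, size):
    # Slide < window size: the window moves one element at a time and overlaps
    window = deque(maxlen=size)
    for event in stream:
        window.append(event)
        yield list(window)

def tumbling_windows(stream, size):
    # Slide = window size: the window moves by its own size, no overlap
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield window
            window = []

prices = [10, 12, 11, 13, 9, 14]
print([sum(w) / len(w) for w in sliding_windows(prices, 3)])   # running averages
print([sum(w) / len(w) for w in tumbling_windows(prices, 3)])  # one average per batch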

apriori algorithm pseudocode
Apriori Algorithm (Pseudocode)
  • L1 = {frequent 1-item sets}
  • for (k = 2; Lk−1 ≠ ∅; k++) do
    • Ck = aprioriGen(Lk−1)
    • for each transaction t ∈ S do
      • for each candidate c ∈ Ck do
        • if c ⊆ t then Φ(c) = Φ(c) + 1
        • end if
      • end for
    • end for
    • Lk = {c ∈ Ck | Φ(c) ≥ min_sup}
  • end for
  • return ∪k Lk
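A compact Python sketch of the level-wise search above; the candidate generation (join and prune) is done inline rather than in a separate aprioriGen helper, and the small transaction database is made up for illustration:

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns all frequent item sets as {frozenset: support count}
    counts = {}
    for t in transactions:                       # L1: frequent single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Join two frequent (k-1)-item sets, prune candidates with an infrequent subset
        keys = list(current)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k and all(
                    frozenset(sub) in current for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass over the database per level (hence n+1 scans in total)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

transactions = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
print(apriori(transactions, min_sup=2))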
slide54

Distributed computing of Big Data

  • CERN's Worldwide LHC Computing Grid (WLCG) was launched in 2002
  • Stores, distributes and analyses the 15 petabytes of data
  • 140 centres across 35 countries
apriori algorithm join
Apriori Algorithm – aprioriGen: Join
  • Generate as few candidate item sets as possible, while making sure not to lose any that turn out to be frequent
  • Assume that the items are ordered (alphabetically)
  • If {a1, a2, …, ak−1} = {b1, b2, …, bk−1} and ak < bk, then {a1, a2, …, ak, bk} is a candidate (k+1)-item set (sketch below)
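A Python sketch of this join under the ordering assumption, combined with the usual prune step (dropping candidates that contain an infrequent k-subset); the function name is taken from the slides, the rest is illustrative:

from itertools import combinations

def apriori_gen(frequent_k, k):
    # frequent_k: set of frozensets, each of size k. Returns candidate (k+1)-item sets.
    ordered = [tuple(sorted(s)) for s in frequent_k]
    candidates = set()
    for a in ordered:                                   # join step
        for b in ordered:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(frozenset(a + (b[-1],)))
    return {                                            # prune step
        c for c in candidates
        if all(frozenset(sub) in frequent_k for sub in combinations(c, k))
    }

l2 = {frozenset({"b", "c"}), frozenset({"b", "e"}), frozenset({"c", "e"})}
print(apriori_gen(l2, 2))  # {frozenset({'b', 'c', 'e'})}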
big data vs business intelligence1
Big Data vs. Business Intelligence

Business Intelligence
  • Transformed data
  • Historical view
  • Easy to process and to analyse
  • Used for reporting:
    • Which is the most profitable branch of our supermarket?
    • Which postcodes suffered the most dropped calls in July?

Big Data
  • Large and complex data sets
  • Temporal, historical, …
  • Difficult to process and to analyse
  • Used for deep analysis and reporting:
    • How can we predict cancer early enough to treat it successfully?
    • How can I make significant profit on the stock market next month?
improvement approaches
Improvement approaches
  • Selection of start-up parameters for algorithms
  • Reducing the number of passes over the database
  • Sampling the database
  • Adding extra constraints for patterns
  • Parallelization
example fa dmfi
Example: FA-DMFI
  • Algorithm for discovering frequent item sets
  • Read the database once
    • Compress it into a matrix
    • Frequent item sets are generated by cover relations
    • Further costly computations are avoided
k means algorithm
K-Means algorithm
  • Select k entities as the initial centroids.
  • (Re)Assign all entities to their closest centroids.
  • Recompute the centroid of each newly assembled cluster.
  • Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached (see the sketch below)
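A plain Python sketch of these four steps on 2-D points (data and parameters are illustrative; practical implementations add smarter initialization, see K-Means++):

import random

def k_means(points, k, max_iter=100):
    centroids = random.sample(points, k)                       # 1. pick initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x, y in points:                                    # 2. assign to closest centroid
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        new_centroids = [                                      # 3. recompute each centroid
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                         # 4. stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.2), (8.0, 8.0), (9.0, 8.5)]
print(k_means(points, k=2))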
solving approaches
Solving approaches
  • K-Means clustering is NP-hard
  • Optimization methods to handle NP-hard problems (K-Means clustering)
examples
Examples
  • Apriori algorithm: n+1 database scans
  • FP-Growth algorithm: 2 database scans
  • K-Means: Exponential runtime
  • K-Means++: Improve startup parameters
google s bigquery
Google's BigQuery
  • Upload: upload the data set to Google Storage
  • Process: import the data into tables
  • Analyse: run queries

http://glenn-packer.net/

the apriori algorithm
The Apriori algorithm
  • The best-known algorithm for rule mining
  • Based on a simple principle:
    • "If an item set is frequent, then all subsets of this item set are also frequent"
  • Input:
    • Minimum confidence: min_conf
    • Minimum support: min_sup
    • Data source: S
apriori algorithm apriorigen
Apriori Algorithm – aprioriGen
  • Generates candidate item sets that might be larger
  • Join: generation of the candidate item sets
  • Prune: elimination of candidates that contain an infrequent subset
apriori algorithm rule generation example
Apriori Algorithm – Rule generation – Example
  • {Butter, milk, bread} ⇒ {cheese}
  • {Butter, meat, bread} ⇒ {cola}
  • {Butter, bread} ⇒ {cheese, cola}
how to improve the apriori algorithm
How to improve the Apriori algorithm
  • Hash-based item set counting: a k-item set whose corresponding hashing bucket count is below the threshold cannot be frequent
  • Sampling: mining on a subset of the given data
  • Dynamic item set counting: add new candidate item sets only when all of their subsets are estimated to be frequent
construction of fp tree
Construction of the FP-Tree
  • Compressed representation of the database
  • First scan
    • Get the support of every item and sort the items by their support count
  • Second scan
    • Each transaction is mapped to a path
    • Compression is done if overlapping paths are detected
    • Generate links between same nodes
  • Each node has a counter → number of mapped transactions
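A minimal Python sketch of the two scans described above (the header-table links between equal items are omitted for brevity; names and data are illustrative):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    support = defaultdict(int)
    for t in transactions:                 # first scan: support of every item
        for item in t:
            support[item] += 1
    frequent = {i for i, c in support.items() if c >= min_sup}
    root = FPNode(None, None)
    for t in transactions:                 # second scan: map each transaction to a path
        items = sorted((i for i in t if i in frequent), key=lambda i: -support[i])
        node = root
        for item in items:
            if item in node.children:      # overlapping path: compress by counting up
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, support

tree, support = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "b", "c"}], min_sup=2)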