slide1

Big Data Analysis Technology

University of Paderborn

L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)

Summer semester 2013

June 12, 2013

Tobias Hardes (6687549) – Tobias.Hardes@gmail.com

slide2

Table of contents

  • Introduction
    • Definitions
  • Background
    • Example
  • Related Work
    • Research
  • Main Approaches
    • Association Rule Mining
    • MapReduce Framework
  • Conclusion
big data vs business intelligence
Big Data vs. Business Intelligence
  • How can we predict cancer early enough to treat it successfully?
  • How can I make significant profit on the stock market next month?

docs.oracle.com

  • Which is the most profitable branch of our supermarket?
    • In a specific country?
    • During a specific period of time?
background

Background

home.web.cern.ch

slide6

Big Science – The LHC

  • 600 million times per second, particles collide within the Large Hadron Collider (LHC)
  • Each collision generates new particles
  • The particles decay in complex ways
  • Each collision is detected
  • The CERN Data Center reconstructs these collision events
  • 15 petabytes of data are stored every year
  • The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data

home.web.cern.ch

data stream analysis
Data Stream Analysis
  • Just-in-time analysis of data
    • Sensor networks
  • Analysis over a certain time window (e.g., the last 30 seconds)

http://venturebeat.com

complex event processing cep
Complex Event Processing (CEP)
  • Provides queries for streams
  • Usage of "Event Processing Languages" (EPL)
  • select avg(price) from StockTickEvent.win:time(30 sec)

[Figure: tumbling window (slide = window size) vs. sliding window (slide < window size)]

https://forge.fi-ware.eu

complex event processing areas of application
Complex Event Processing - Areas of application
  • Just-in-time analysis → the complexity of the algorithms matters
  • CEP is used with Twitter:
    • Identify emotional states of users
    • Sarcasm?
principles
Principles
  • Statistics
  • Probability theory
  • Machine learning
  • Data Mining
    • Association rule learning
    • Cluster analysis
    • Classification
association rule mining cluster analysis
Association Rule Mining – Cluster Analysis

Association Rule Mining

Is soda purchased with bananas?

  • Relationships between items
  • Find associations, correlations or causal structures
  • Apriori algorithm
  • Frequent Pattern (FP)-Growth algorithm
cluster analysis classification
Cluster analysis – Classification

Cluster Analysis

  • Classification of similar objects into classes
  • Classes are defined during the clustering
  • K-Means
  • K-Means++
research and future work
Research and future work
  • Performance, performance, performance…
    • Passes of the data source
    • Parallelization
    • NP-hard problems
    • ….
  • Accuracy
    • Optimized solutions
example
Example
  • Apriori algorithm: n+1 database scans
  • FP-Growth algorithm: 2 database scans
distributed computing motivation
Distributed computing – Motivation
  • Complex computational tasks
  • Several terabytes of data
  • Limited hardware resources
  • Google's MapReduce framework

Prof. Dr. Erich Ehses (FH Köln)

main approaches

Main approaches

http://ultraoilforpets.com

structure
Structure
  • Association rule mining
    • Apriori algorithm
    • FP-Growth algorithm
  • Google's MapReduce
association rule mining
Association rule mining
  • Identify items that are related to other items
  • Example: Analysis of baskets in an online shop or in a supermarket

http://img.deusm.com/

terminology
Terminology
  • A stream or a database with n elements: S
  • Item set: A ⊆ I, where I is the set of all items
  • Frequency of occurrence of an item set: Φ(A)
  • Association rule: A ⇒ B
  • Support: s(A ⇒ B) = Φ(A ∪ B) / n
  • Confidence: c(A ⇒ B) = Φ(A ∪ B) / Φ(A)
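A minimal Python sketch of these definitions, assuming transactions are represented as sets of items (the data below is made up for illustration):

# Sketch of the support/confidence definitions above; transactions are illustrative.
transactions = [
    {"cheese", "chocolate", "bread"},
    {"cheese", "chocolate"},
    {"bread", "milk"},
    {"cheese", "bread"},
]

def frequency(item_set, transactions):
    # Phi(A): number of transactions containing every item of A
    return sum(1 for t in transactions if item_set <= t)

def support(a, b, transactions):
    # s(A => B) = Phi(A u B) / n
    return frequency(a | b, transactions) / len(transactions)

def confidence(a, b, transactions):
    # c(A => B) = Phi(A u B) / Phi(A)
    return frequency(a | b, transactions) / frequency(a, transactions)

print(support({"cheese", "chocolate"}, {"bread"}, transactions))     # 0.25
print(confidence({"cheese", "chocolate"}, {"bread"}, transactions))  # 0.5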
example1
Example
  • Rule: "If a basket contains cheese and chocolate, then it also contains bread"
  • 6 of 60 transactions contain cheese and chocolate
  • 3 of these 6 transactions also contain bread → confidence = 3/6 = 50%, support = 3/60 = 5%
common approach
Common approach
  • Decompose the problem into two tasks:
  • Generation of frequent item sets
    • Find item sets that satisfy a minimum support value
  • Generation of rules
    • Find high-confidence rules using the frequent item sets
aprio algorithm frequent item set
Apriori algorithm – Frequent item sets
  • Input:
    • Minimum support: min_sup
    • Data source: S
apriori frequent item sets i
Apriori – Frequent item sets (I)
  • Generation of frequent item sets : min_sup = 2

[Figure: item-set lattice with support counts for the example database (min_sup = 2)]

apriori frequent item sets ii
Apriori – Frequent item sets (II)
  • Generation of frequent item sets : min_sup = 2

[Figure: level-wise generation — frequent 1-item sets L1 (A, B, C, D with their support counts), candidate 2-item sets and L2, candidate 3-item sets (ACD, BCD) and L3]

apriori algorithm rule generation
Apriori Algorithm – Rule generation
  • Uses frequent item sets to extract high-confidence rules
  • Based on the same principle as the item set generation
  • Done for all frequent item sets Lk (see the sketch below)
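A rough Python sketch of this step, assuming the frequent item sets and their support counts are already available from the generation phase (the helper name and example data are illustrative):

from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    # frequent_counts: dict mapping frozenset (frequent item set) -> support count.
    # By the Apriori property every subset of a frequent item set is also in the dict.
    rules = []
    for item_set, count in frequent_counts.items():
        if len(item_set) < 2:
            continue
        for r in range(1, len(item_set)):
            for antecedent in map(frozenset, combinations(item_set, r)):
                conf = count / frequent_counts[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(item_set - antecedent), conf))
    return rules

counts = {frozenset({"a"}): 4, frozenset({"b"}): 3, frozenset({"a", "b"}): 2}
print(generate_rules(counts, min_conf=0.5))  # a => b (conf 0.5) and b => a (conf ~0.67)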
summary apriori algorithm
Summary Apriori algorithm
  • n+1 scans of the database
  • Expensive generation of the candidate item sets
  • Implements level-wise search using the frequent-item-set property
  • Easy to implement
  • Some opportunities for specialized optimizations
fp growth algorithm
FP-Growth algorithm
  • Used for databases
  • Features:
    • Requires 2 scans of the database
    • Uses a special data structure – The FP-Tree
    • Build the FP-Tree
    • Extract frequent item sets
  • Compression of the database
  • Divide this database and apply data mining
extract frequent itemsets i
Extract frequent item sets (I)
  • Bottom-up strategy
  • Start with node "e"
  • Then look for "de"
  • Each path is processed recursively
  • Solutions are merged
extract frequent itemsets ii
Extract frequent item sets (II)
  • Is e frequent?
    • Is de frequent?
    • Is ce frequent?
      • ….
    • Is be frequent?
      • ….
    • Is ae frequent?
      • …..
  • Using subproblems to identify frequent itemsets

Φ(e) = 3 – Assume the minimum support was set to 2

extract frequent itemsets iii
Extract frequent item sets (III)

Update the support count along the prefix path

Remove Node e

Check the frequency of the paths

Find item sets with de, ce, ae or be

apriori vs fp growth
Apriori vs. FP-Growth
  • FP-Growth has some advantages:
    • Two scans of the database
    • No expensive computation of candidates
    • Compressed data structure
    • Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, "Research on the FP Growth Algorithm about Association Rule Mining"

mapreduce
MapReduce
  • Map and Reduce functions are expressed by a developer (see the sketch below)
  • map(key, val)
    • Emits new key-value pairs
  • reduce(key, values)
    • Emits an arbitrary output
    • Usually a key with one value
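A single-process Python sketch of the programming model (not the actual framework API); the word-count functions are the usual illustrative example:

from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document text -> emit (word, 1) pairs
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word -> emit one total
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:              # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)      # shuffle: group values by key
    output = {}
    for k, vs in intermediate.items():     # reduce phase
        for out_k, out_v in reduce_fn(k, vs):
            output[out_k] = out_v
    return output

print(run_mapreduce([("doc1", "big data big analysis")], map_fn, reduce_fn))
# {'big': 2, 'data': 1, 'analysis': 1}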
slide38

[Figure: MapReduce execution overview — (1) the user program forks a master and worker processes, (2) the master assigns map and reduce tasks to workers, (3) map workers read the input files, (4) intermediate results are written to local disk, (5) reduce workers fetch them via RPC (shuffle), (6) each reduce worker writes an output file, (7) control returns to the user program]

conclusion mapreduce i
Conclusion: MapReduce (I)
  • MapReduce is designed as a batch processing framework
  • Not intended for ad-hoc analysis
  • Used for very large data sets
  • Used for time-intensive computations
  • Open-source implementation: Apache Hadoop

http://hadoop.apache.org/

conclusion i
Conclusion (I)
  • Big Data is important for research and in daily business
  • Different approaches
  • Data Stream analysis
    • Complex event processing
  • Rule Mining
    • Apriori algorithm
    • FP-Growth algorithm
conclusion ii
Conclusion (II)
  • Clustering
    • K-Means
    • K-Means++
  • Distributed computing
    • MapReduce
  • Performance / Runtime
    • Multiple minutes
    • Hours
    • Days…
    • Online analytical processing for Big Data?
big data definitions
Big Data definitions

"Every day, we create 2.5 quintillion bytes of … . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM Corporation)

"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.)

"Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (McKinsey & Company)


complex event processing windows
Complex Event Processing – Windows

Tumbling Window (slide = window size)
  • Moves by as much as the window size

Sliding Window (slide < window size)
  • Slides in time
  • Buffers the last x elements
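A count-based Python sketch of the two window types (the EPL example earlier uses a time-based window; here element counts stand in for time, and the data is illustrative):

from collections import deque

def sliding_windows(stream, size):
    # Slide < window size: the window moves one element at a time and overlaps
    window = deque(maxlen=size)
    for event in stream:
        window.append(event)
        yield list(window)

def tumbling_windows(stream, size):
    # Slide = window size: the window moves by its own size, no overlap
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield window
            window = []

prices = [10, 12, 11, 13, 9, 14]
print([sum(w) / len(w) for w in sliding_windows(prices, 3)])   # running averages
print([sum(w) / len(w) for w in tumbling_windows(prices, 3)])  # one average per batch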

apriori algorithm pseudocode
Apriori Algorithm (Pseudocode)
  • L1 = {frequent 1-item sets}
  • for (k = 2; Lk−1 ≠ ∅; k++) do
    • Ck = aprioriGen(Lk−1)
    • for each transaction t ∈ S do
      • for each candidate c ∈ Ck do
        • if c ⊆ t then Φ(c) = Φ(c) + 1
        • end if
      • end for
    • end for
    • Lk = {c ∈ Ck | Φ(c) ≥ min_sup}
  • end for
  • return ∪k Lk
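A compact Python sketch of the level-wise search above; the candidate generation (join and prune) is done inline rather than in a separate aprioriGen helper, and the small transaction database is made up for illustration:

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns all frequent item sets as {frozenset: support count}
    counts = {}
    for t in transactions:                       # L1: frequent single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Join two frequent (k-1)-item sets, prune candidates with an infrequent subset
        keys = list(current)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k and all(
                    frozenset(sub) in current for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass over the database per level (hence n+1 scans in total)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

transactions = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
print(apriori(transactions, min_sup=2))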
slide54

Distributed computing of Big Data

  • CERN's Worldwide LHC Computing Grid (WLCG) was launched in 2002
  • Stores, distributes and analyses the 15 petabytes of data
  • 140 centres across 35 countries
apriori algorithm join
Apriori Algorithm – aprioriGen: Join
  • Generate as few candidate item sets as possible, while making sure not to lose any that turn out to be frequent
  • Assume that the items are ordered (alphabetically)
  • If {a1, a2, …, ak−1} = {b1, b2, …, bk−1} and ak < bk, then {a1, a2, …, ak, bk} is a candidate (k+1)-item set (sketch below)
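A Python sketch of this join under the ordering assumption, combined with the usual prune step (dropping candidates that contain an infrequent k-subset); the function name is taken from the slides, the rest is illustrative:

from itertools import combinations

def apriori_gen(frequent_k, k):
    # frequent_k: set of frozensets, each of size k. Returns candidate (k+1)-item sets.
    ordered = [tuple(sorted(s)) for s in frequent_k]
    candidates = set()
    for a in ordered:                                   # join step
        for b in ordered:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(frozenset(a + (b[-1],)))
    return {                                            # prune step
        c for c in candidates
        if all(frozenset(sub) in frequent_k for sub in combinations(c, k))
    }

l2 = {frozenset({"b", "c"}), frozenset({"b", "e"}), frozenset({"c", "e"})}
print(apriori_gen(l2, 2))  # {frozenset({'b', 'c', 'e'})}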
big data vs business intelligence1
Big Data vs. Business Intelligence

Business Intelligence
  • Transformed data
  • Historical view
  • Easy to process and to analyse
  • Used for reporting:
    • Which is the most profitable branch of our supermarket?
    • Which postcodes suffered the most dropped calls in July?

Big Data
  • Large and complex data sets
  • Temporal, historical, …
  • Difficult to process and to analyse
  • Used for deep analysis and reporting:
    • How can we predict cancer early enough to treat it successfully?
    • How can I make significant profit on the stock market next month?
improvement approaches
Improvement approaches
  • Selection of start-up parameters for algorithms
  • Reducing the number of passes over the database
  • Sampling the database
  • Adding extra constraints for patterns
  • Parallelization
example fa dmfi
Example: FA-DMFI
  • Algorithm for discovering frequent item sets
  • Read the database once
    • Compress it into a matrix
    • Frequent item sets are generated by cover relations
    • Further costly computations are avoided
k means algorithm
K-Means algorithm
  • Select k entities as the initial centroids.
  • (Re)Assign all entities to their closest centroids.
  • Recompute the centroid of each newly assembled cluster.
  • Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached (see the sketch below)
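A plain Python sketch of these four steps on 2-D points (data and parameters are illustrative; practical implementations add smarter initialization, see K-Means++):

import random

def k_means(points, k, max_iter=100):
    centroids = random.sample(points, k)                       # 1. pick initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x, y in points:                                    # 2. assign to closest centroid
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        new_centroids = [                                      # 3. recompute each centroid
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                         # 4. stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.2), (8.0, 8.0), (9.0, 8.5)]
print(k_means(points, k=2))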
solving approaches
Solving approaches
  • K-Means clustering is NP-hard
  • Optimization methods to handle NP-hard problems (K-Means clustering)
examples
Examples
  • Apriori algorithm: n+1 database scans
  • FP-Growth algorithm: 2 database scans
  • K-Means: Exponential runtime
  • K-Means++: Improve startup parameters
google s bigquery
Google's BigQuery
  • Upload: upload the data set to Google Storage
  • Process: import the data into tables
  • Analyse: run queries

http://glenn-packer.net/

the apriori algorithm
The Apriori algorithm
  • The best-known algorithm for rule mining
  • Based on a simple principle:
    • "If an item set is frequent, then all subsets of this item set are also frequent"
  • Input:
    • Minimum confidence: min_conf
    • Minimum support: min_sup
    • Data source: S
apriori algorithm apriorigen
Apriori Algorithm – aprioriGen
  • Generates candidate item sets that might be larger
  • Join: generation of the candidate item sets
  • Prune: elimination of candidates that contain an infrequent subset
apriori algorithm rule generation example
Apriori Algorithm – Rule generation – Example
  • {Butter, milk, bread} ⇒ {cheese}
  • {Butter, meat, bread} ⇒ {cola}
  • {Butter, bread} ⇒ {cheese, cola}
how to improve the apriori algorithm
How to improve the Apriori algorithm
  • Hash-based item set counting: a k-item set whose corresponding hashing bucket count is below the threshold cannot be frequent
  • Sampling: mining on a subset of the given data
  • Dynamic item set counting: add new candidate item sets only when all of their subsets are estimated to be frequent
construction of fp tree
Construction of the FP-Tree
  • Compressed representation of the database
  • First scan
    • Get the support of every item and sort the items by their support count
  • Second scan
    • Each transaction is mapped to a path
    • Compression is done if overlapping paths are detected
    • Generate links between same nodes
  • Each node has a counter → number of mapped transactions
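A minimal Python sketch of the two scans described above (the header-table links between equal items are omitted for brevity; names and data are illustrative):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    support = defaultdict(int)
    for t in transactions:                 # first scan: support of every item
        for item in t:
            support[item] += 1
    frequent = {i for i, c in support.items() if c >= min_sup}
    root = FPNode(None, None)
    for t in transactions:                 # second scan: map each transaction to a path
        items = sorted((i for i in t if i in frequent), key=lambda i: -support[i])
        node = root
        for item in items:
            if item in node.children:      # overlapping path: compress by counting up
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, support

tree, support = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "b", "c"}], min_sup=2)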