Data Mining and Scalability

Lauren Massa-Lochridge

Nikolay Kojuharov

Hoa Nguyen

Quoc Le


Outline

  • Data Mining Overview

  • Scalability Challenges & Approaches

  • Overview – Association Rules

  • Case Study – BIRCH: An Efficient Data Clustering Method for VLDB

  • Case Study – Scientific Data Mining

  • Q&A


DATA MINING


Data Mining: Rationale

  • Data size

    • Data in databases is estimated to double every year.

    • The number of people who look at the data stays constant.

  • Complexity

    • The analysis is complex.

    • The characteristics and relationships are often unexpected and unintuitive.

  • Knowledge discovery tools and algorithms are needed to make sense of, and use, the data


Data Mining: Rationale (cont’d)

  • As of 2003, France Telecom had the largest decision-support database (~30 TB); AT&T was second with a 26 TB database.

  • Some of the largest databases on the Web, as of 2003, include

    • Alexa (www.alexa.com) internet archive: 7 years of data, 500 TB

    • Internet Archive (www.archive.org): ~300 TB

    • Google: over 4 billion pages; many, many TB

  • Applications

    • Business – analyze inventory, predict customer acceptance, etc.

    • Science – find correlation between genes and diseases, pollution and global warming, etc.

    • Government – uncover terrorist networks, predict flu pandemic, etc.

Adapted from: Data Mining, and Knowledge Discovery: An Introduction, http://www.kdnuggets.com/dmcourse/other_lectures/intro-to-data-mining-notes.htm


Data Mining: Definition

  • Semi-automatic discovery of patterns, changes, anomalies, rules, and statistically significant structures and events in data.

  • Nontrivial extraction of implicit, previously unknown, and potentially useful information from data

  • Data mining is often done on targeted, preprocessed, transformed data.

    • Targeted: data fusion, sampling.

    • Preprocessed: Noise removal, feature selection, normalization.

    • Transformed: Dimension reduction.


Data Mining: Evolution

Adapted from: An Introduction to Data Mining, http://www.thearling.com/text/dmwhite/dmwhite.htm


Data Mining: Approaches

  • Clustering - identify natural groupings within the data.

  • Classification - learn a function to map a data item into one of several predefined classes.

  • Summarization – describe groups, summary statistics, etc.

  • Association – identify data items that occur frequently together.

  • Prediction – predict values or distribution of missing data.

  • Time-series analysis – analyze data to find periodicity, trends, deviations.


SCALING


Scalability & Performance

  • Scaling and performance are often considered together in Data Mining. The problem of scalability in DM is not only how to process such large sets of data, but how to do it within a useful timeframe.

  • Many of the issues of scalability in DM and DBMS are similar to scaling performance issues for Data Management in general.

  • Dr. Gregory Piatetsky-Shapiro and Prof. Gary Parker identify the main issue for clustering algorithms as how to evaluate the quality of a potential grouping: methods range from manual, visual inspection to a variety of mathematical measures that maximize the similarity of items within a cluster and the differences between clusters.


Common DM Scaling Problem

Algorithms generally:

  • Operate on data with the assumption of in-memory processing of the entire data set

  • Operate under the assumption that KIWI ("kill it with iron", i.e., adding hardware) will be used to address I/O and other performance scaling issues

  • Or simply do not address scalability within resource constraints at all


Data Mining: Scalability

  • Large Datasets

    • Use scalable I/O architecture - minimize I/O, make it fit, make it fast.

    • Reduce data - aggregation, dimensional reduction, compression, discretization.

  • Complex Algorithms

    • Reduce algorithm complexity

    • Exploit parallelism, use specialized hardware

  • Complex Results

    • Effective visualization

    • Increase understanding, trustworthiness


Scalable I/O Architecture

  • Shared memory parallel computers: local + global memory. Locking is used to synchronize.

  • Distributed memory parallel computers: Message Passing/ Remote Memory Operations.

  • Parallel disks: records are grouped into blocks of B records (the unit of transfer); D blocks can be read or written at once.

  • Primitives: Scatter, Gather, Reduction

  • Data parallelism or Task parallelism
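As a toy illustration of the scatter/gather/reduction primitives with data parallelism, here is a Python sketch (the function names are our own, not from the slides) that scatters transactions across worker processes, gathers partial item counts, and reduces them; `multiprocessing` requires the driver code to sit under an `if __name__ == "__main__":` guard on some platforms:

```python
from multiprocessing import Pool

def partial_count(chunk):
    """Work done by one worker on its scattered chunk of transactions."""
    counts = {}
    for transaction in chunk:
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return counts

def parallel_item_counts(transactions, workers=4):
    """Scatter the data, gather per-worker counts, then reduce them."""
    chunks = [transactions[i::workers] for i in range(workers)]  # scatter
    with Pool(workers) as pool:
        partials = pool.map(partial_count, chunks)               # gather
    totals = {}                                                  # reduction
    for counts in partials:
        for item, c in counts.items():
            totals[item] = totals.get(item, 0) + c
    return totals
```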


Scaling – General Approaches

  • Question: How can we tackle memory constraints and efficiency?

    • Statistics: Manipulate the data so it fits into memory – sampling, feature selection, partitioning, summarization (see the sampling sketch after this list).

    • Database: Reduce the time to access out-of-memory data – specialized data structures, block reads, parallel block reads.

    • High-performance computing: Use several processors.

    • Data mining implementation: Efficient DM primitives, pre-computation.

    • Misc.: Reduce the amount of data – discretization, compression, transformation.
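For the statistics-style approach, one standard way to make a data stream fit in memory is reservoir sampling; a minimal Python sketch (algorithm R), assuming any iterable stream:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)   # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```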


SCALING in ASSOCIATION RULES


Association rules

  • Beer-Diapers example.

  • Basket data analysis, cross-marketing, sales campaigns, Web-log analysis, etc.

  • Introduced for the first time in 1993

    • Mining Association Rules between Sets of Items in Large Databases

      by R. Agrawal et al., SIGMOD 93 Conference.


AR Applications

X ==> Y:

  • What are the interesting applications?

    • find all rules with “bagels” as X

      • what should be shelved together with bagels?

      • what would be impacted if we stopped selling bagels?

    • find all rules with “Diet Coke” as Y

      • what should the store do to promote Diet Coke?

    • find all rules relating any items in Aisles 1 and 2

      • shelf planning to see if the two aisles are related


AR: Input & Output

  • Input:

    • a database of sales “transactions”

    • parameters:

      • minimal support: say 50%

      • minimal confidence: say 100%

  • Output:

    • rule 1: {2, 3} ==> {5}: s =?, c =?

    • rule 2: {3, 5} ==> {2}

    • more … …

TID | Items
----|-----------
100 | 1, 3, 4
200 | 2, 3, 5
300 | 1, 2, 3, 5
400 | 2, 5
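Using this table, the support and confidence of the example rules can be computed directly; a small Python sketch with our own helper functions (not from the slides):

```python
# The example transaction database from the table above.
transactions = [
    {1, 3, 4},     # TID 100
    {2, 3, 5},     # TID 200
    {1, 2, 3, 5},  # TID 300
    {2, 5},        # TID 400
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(lhs ==> rhs) = support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({2, 3} | {5}, transactions))    # 0.5 -> rule 1 has s = 50%
print(confidence({2, 3}, {5}, transactions))  # 1.0 -> rule 1 has c = 100%
print(confidence({3, 5}, {2}, transactions))  # 1.0 -> rule 2: s = 50%, c = 100%
```

Both example rules therefore meet the thresholds above: s = 50%, c = 100%.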


Apriori: A Candidate Generation-and-Test Approach

  • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

  • Method:

    • Initially, scan the DB once to get frequent 1-itemsets

    • Generate length (k+1) candidate itemsets from length k frequent itemsets

    • Test the candidates against DB

    • Terminate when no frequent or candidate set can be generated
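A minimal, unoptimized Python sketch of this generate-and-test loop (the names are our own; real implementations count candidates with hash trees or prefix trees rather than rescanning naively):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Minimal Apriori sketch: returns {frequent itemset: support count}.

    transactions: list of sets; min_count: absolute minimum support count.
    """
    # Scan the DB once for frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join step: build (k+1)-candidates from frequent k-itemsets,
        # pruning any candidate with an infrequent k-subset (Apriori principle).
        candidates = set()
        for a in frequent:
            for b in frequent:
                u = a | b
                if len(u) == k + 1 and all(
                    frozenset(s) in frequent for s in combinations(u, k)
                ):
                    candidates.add(u)
        # Test step: count the surviving candidates against the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

# On the example table with 50% minimum support (min_count = 2),
# apriori([{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}], 2)
# yields frequent itemsets such as {2,3,5} with count 2.
```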


Scaling Attempts

  • Count Distribution: distribute transaction data among processors and count the transactions in parallel → scales linearly with the number of transactions.

  • Savasere et al. (VLDB '95): Partition the data and scan twice (local, then global).

  • Toivonen (VLDB '96): Sampling, with verification against the border of the closure.

  • Brin et al. (SIGMOD '97): Dynamic itemset counting.

  • Pei & Han (SIGMOD '00): Compact description (FP-tree), no candidate generation; scales up with partition-based projection.


SCALABLE DATA CLUSTERING: THE BIRCH APPROACH


BIRCH Approach

  • Informal definition: "data clustering identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the data set."

  • BIRCH belongs to the category of hierarchical clustering algorithms that use a distance measure; k-means is another example of a distance-based method.

  • Approach: "statistical identification of clusters, i.e. densely populated regions in a multi-dimensional dataset, given the desired number of clusters K, a dataset of N points, and a distance-based measurement."

  • Problem: other approaches (distance-based, hierarchical, etc.) are all similar in terms of scaling and resource utilization.


BIRCH Novelty

  • First algorithm proposed in the database area that filters out "noise", i.e. outliers

  • Prior work does not adequately address large data sets with minimization of I/O cost

  • Prior work does not address issues of data set fit to memory

  • Prior work does not address resource utilization or resource constraints in scalability and performance


Database / DM Oriented Constraints

  • Resource utilization means maximizing the usage of available resources, as opposed to just working within resource constraints, which does not necessarily optimize utilization.

  • Resource utilization is important in DM scaling, or in any case where the data sets are very large.

  • A single BIRCH scan of the data set yields at minimum a "good enough" clustering.

  • One or more additional passes are optional and, depending on the constraints of a particular system and application, can be used to improve the quality beyond "good enough".


Database Oriented Constraints

  • Database-oriented constraints are what differentiate BIRCH from more general DM algorithms:

  • Limited acceptable response time

  • Resource utilization – optimize, don't just work within available resources – necessary for very large data sets

  • Fit to available memory

  • Minimize I/O costs

  • Need I/O cost linear in size of data set


Features of BIRCH Solution:

  • Locality of reference: each unit clustering decision made without scanning all data points for all existing clusters.

  • Clustering decision: measurements reflect natural "closeness" of points

  • Locality enables cluster summaries to be incrementally maintained and updated during the clustering process

  • Optional removal of outliers:

    • Cluster equals dense region of points.

    • Outlier equals point in sparse region.


More Features of BIRCH Solution:

  • Optimal memory resource usage → utilization within resource constraints.

  • Finest possible subclusters, given memory, I/O, and time constraints:

    • Finest clusters given memory implies best accuracy achievable (another type of optimal utilization).

  • Minimize I/O costs:

    • implies efficiency and required response time.


More Features of BIRCH Solution

  • Running time linearly scalable (in the size of the data set).

  • Optionally, incremental scanning of the data set, i.e. the entire data set does not have to be scanned in advance, and increments are adjustable.

  • Scans the complete data set only once (other methods scan multiple times).


Background (Single cluster)

  • Given N d-dimensional data points {X_i}:

    • Centroid: $X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$

    • Radius (average distance of member points to the centroid): $R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$

    • Diameter (average pairwise distance within the cluster): $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$


Background (two clusters)

Given the centroids X_0 and Y_0:

  • The centroid Euclidean distance: $D_0 = \left( (X_0 - Y_0)^2 \right)^{1/2}$

  • The centroid Manhattan distance: $D_1 = \sum_{i=1}^{d} |X_0^{(i)} - Y_0^{(i)}|$


Background (two clusters)

  • Average inter-cluster distance, for clusters {X_i}, i = 1..N_1, and {X_j}, j = N_1+1..N_1+N_2:

    $D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (X_i - X_j)^2}{N_1 N_2} \right)^{1/2}$

  • Average intra-cluster distance (the diameter of the merged cluster):

    $D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (X_i - X_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$
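As a sanity check on these definitions, a small NumPy sketch (the helper names are our own) that computes each measure directly from (N, d) arrays of points:

```python
import numpy as np

def centroid(X):
    """Centroid of an (N, d) array of points."""
    return X.mean(axis=0)

def radius(X):
    """Average distance of member points to the centroid (R)."""
    return np.sqrt(((X - centroid(X)) ** 2).sum(axis=1).mean())

def diameter(X):
    """Average pairwise distance within the cluster (D)."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]        # all pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=2).sum() / (n * (n - 1)))

def d0(X, Y):
    """Centroid Euclidean distance."""
    return np.linalg.norm(centroid(X) - centroid(Y))

def d1(X, Y):
    """Centroid Manhattan distance."""
    return np.abs(centroid(X) - centroid(Y)).sum()

def d2(X, Y):
    """Average inter-cluster distance."""
    diffs = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2).mean())

def d3(X, Y):
    """Average intra-cluster distance of the merged cluster."""
    return diameter(np.vstack([X, Y]))
```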


Clustering Feature

  • CF = (N, LS, SS)

    • N = |C|, the number of data points

    • LS = $\sum_{i=1}^{N} X_i$, the linear sum of the N data points

    • SS = $\sum_{i=1}^{N} X_i^2$, the square sum of the N data points

    → a summarization of the cluster


CF Additive Theorem

  • Assume CF_1 = (N_1, LS_1, SS_1) and CF_2 = (N_2, LS_2, SS_2). Then the CF of the merged cluster is simply CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2).

  • The information stored in CFs is sufficient to compute:

    • Centroids

    • Measures of the compactness of clusters (radius, diameter)

    • Distance measures between clusters
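A minimal Python sketch of a CF vector (class and method names are our own) showing the additive merge and how centroid, radius, and diameter fall out of (N, LS, SS) alone:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CF:
    """Clustering Feature: (N, LS, SS) summary of a set of points."""
    n: int
    ls: np.ndarray   # linear sum of the points, a d-dimensional vector
    ss: float        # sum of squared norms of the points

    @staticmethod
    def of(points):
        """Build a CF from an (N, d) array of points."""
        return CF(len(points), points.sum(axis=0), float((points ** 2).sum()))

    def __add__(self, other):
        # Additive theorem: merging two clusters just adds their CFs.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, derived from the CF components alone.
        return np.sqrt(self.ss / self.n - (self.centroid() ** 2).sum())

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)); defined for N > 1.
        return np.sqrt(2 * (self.n * self.ss - (self.ls ** 2).sum())
                       / (self.n * (self.n - 1)))
```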


CF-Tree

  • height-balanced tree

  • two parameters:

    • branching factor

      • B : An internal node contains at most B entries [CFi, childi]

      • L : A leaf node contains at most L entries [CFi]

    • threshold T

      • The diameter of all entries in a leaf node is at most T

  • Leaf nodes are connected via prev and next pointers → efficient for data scans
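Reusing the CF sketch above, a simplified leaf-insertion step might look like the following (our own simplification: real BIRCH also splits overfull leaves, updates the CF entries along the path from the root, and merges nodes):

```python
import numpy as np

def insert_into_leaf(leaf, point, T, L):
    """Try to absorb `point` into a leaf (a list of CF entries).

    T: diameter threshold; L: max entries per leaf.
    Returns True if absorbed, False if the caller must split the leaf.
    """
    x = CF.of(point[None, :])
    if leaf:
        # Pick the closest existing entry by centroid distance (D0).
        i = min(range(len(leaf)),
                key=lambda j: np.linalg.norm(leaf[j].centroid() - point))
        merged = leaf[i] + x
        if merged.diameter() <= T:   # threshold test on the merged entry
            leaf[i] = merged
            return True
    if len(leaf) < L:
        leaf.append(x)               # start a new entry for this point
        return True
    return False                     # leaf is full: split needed
```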


CF Tree: Example

[Figure omitted: an example CF tree]


BIRCH Algorithm Scaling Details

CF / CF tree structures are used to optimize clusters for memory & I/O:

  • P: page size (a page of memory)

  • Tree size is a function of T; a larger T → a smaller CF tree

  • Each node is required to fit in a memory page of size P → nodes are dynamically split to fit, or merged for optimal utilization

  • P can be varied on the system or in the algorithm for performance tuning and scaling


BIRCH Algorithm Steps


Phase 1

1. Start a CF tree t1 with an initial threshold T.

2. Continue scanning the data and inserting points into t1.

3. If memory runs out before the scan finishes:

   • Increase T.

   • Rebuild a smaller CF tree t2 with the new T from the leaf entries of t1: if a leaf entry is a potential outlier, write it to disk; otherwise re-insert it into t2.

   • Set t1 ← t2, and if disk space for outliers runs out, re-absorb the potential outliers into t1.

   • Resume scanning.

4. When the scan of the data finishes, re-absorb the potential outliers into t1 (see the control-flow sketch below).
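The same flow as code: a control-flow-only Python sketch in which `CFTree`, `choose_larger_threshold`, `looks_like_outlier`, `disk_usage`, and `reabsorb` are hypothetical helpers standing in for the real machinery:

```python
def birch_phase1(data_stream, T0, memory_budget, disk_budget):
    """Sketch of BIRCH Phase 1: build an in-memory CF tree, growing the
    threshold T whenever memory runs out and parking potential outliers
    on disk."""
    T = T0
    t1 = CFTree(threshold=T)
    outliers = []                                   # stands in for disk storage
    for point in data_stream:
        t1.insert(point)
        if t1.size() > memory_budget:               # out of memory
            T = choose_larger_threshold(t1)         # increase T
            t2 = CFTree(threshold=T)                # rebuild with the new T
            for entry in t1.leaf_entries():
                if looks_like_outlier(entry):
                    outliers.append(entry)          # write potential outlier to disk
                else:
                    t2.insert_entry(entry)          # otherwise use it
            t1 = t2                                 # t1 <- t2
            if disk_usage(outliers) > disk_budget:  # out of disk space
                reabsorb(t1, outliers)
    reabsorb(t1, outliers)                          # finished scanning the data
    return t1
```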


Analysis

  • I/O cost: roughly linear in the size of the data set – one scan of the data, plus writing potential outliers to disk and re-reading them; the number of tree rebuilds is bounded by about $\log_2(N/N_0)$.

    Where

    • N: number of data points

    • M: memory size

    • P: page size

    • d: dimension

    • N_0: number of data points loaded into memory with threshold T_0


SCIENTIFIC DATA MINING


Terminology

  • Data mining: The semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures of data.

  • Pattern recognition: The discovery and characterization of patterns

  • Pattern: An ordering with an underlying structure

  • Feature: Extractable measurement or attribute


Scientific data mining

Figure 1. Key steps in scientific data mining


Data mining is essential

  • Scientific data sets are very complex

    • Multi-sensor, multi-resolution, multi-spectral data

    • High-dimensional data

    • Mesh data from simulations

    • Data contaminated with noise

      • Sensor noise, clouds, atmospheric turbulence, …


Data mining is essential

  • Massive data sets

    • Advances in technology allow us to collect ever-increasing amounts of scientific data (in experiments, observations, and simulations)

      • Astronomy data sets with tens of millions of galaxies

      • Sloan Digital Sky Survey: assuming a pixel size of about 0.25 arcsec, the whole sky is ~10 terapixels (at 2 bytes/pixel, roughly 20 TB)

    • Collection of data made possible by advances in:

      • Sensors (telescopes, satellites, …)

      • Computers and storage (faster, parallel, …)

      → We need fast and accurate data analysis techniques to realize the full potential of our enhanced data-collecting ability; manual techniques are infeasible


Data mining in astronomy

  • FIRST: detecting radio-emitting stars

    • Dataset: 100 GB of image data (1996)

    • 16K image maps, 7.1 MB each


Data mining in astronomy

  • Example result: found 20K radio-emitting stars among 400K entries


Mining climate data (Univ. of Minnesota)

Research Goal:

  • Find global climate patterns of interest to Earth Scientists

[Figure: Average Monthly Temperature]

A key interest is finding connections between the ocean / atmosphere and the land.

  • Global snapshots of values for a number of variables on land surfaces or water.

  • Span a range of 10 to 50 years.


Mining climate data (Univ. of Minnesota)

  • EOS satellites provide high-resolution measurements

    • Finer spatial grids

      • An 8 km × 8 km grid produces 10,848,672 data points

      • A 1 km × 1 km grid produces 694,315,008 data points (64× as many)

    • More frequent measurements

    • Multiple instruments

      • Generate terabytes of data per day

      → SCALABILITY

  • EOS = Earth Observing System (e.g., the Terra and Aqua satellites)


Questions and Answers!

