Data Mining and Scalability

Lauren Massa-Lochridge

Nikolay Kojuharov

Hoa Nguyen

Quoc Le

- Data Mining Overview
- Scalability Challenges & Approaches
- Overview – Association Rules
- Case Study – BIRCH: An Efficient Data Clustering Method for VLDB
- Case Study – Scientific Data Mining
- Q&A

- Data size
- Data in databases is estimated to double every year.
- Number of people who look at the data stays constant

- Complexity
- The analysis is complex.
- The characteristics and relationships are often unexpected and unintuitive.

- Knowledge discovery tools and algorithms are needed to make sense and use of data

- As of 2003, France Telecom had the largest decision-support database (~30 TB); AT&T was second with a 26 TB database.
- Some of the largest databases on the Web, as of 2003, include
- Alexa (www.alexa.com) internet archive: 7 years of data, 500 TB
- Internet Archive (www.archive.org): ~300 TB
- Google: over 4 billion pages, many, many TB

- Applications
- Business – analyze inventory, predict customer acceptance, etc.
- Science – find correlation between genes and diseases, pollution and global warming, etc.
- Government – uncover terrorist networks, predict flu pandemic, etc.

Adapted from: Data Mining, and Knowledge Discovery: An Introduction, http://www.kdnuggets.com/dmcourse/other_lectures/intro-to-data-mining-notes.htm

- Semi-automatic discovery of patterns, changes, anomalies, rules, and statistically significant structures and events in data.
- Nontrivial extraction of implicit, previously unknown, and potentially useful information from data
- Data mining is often done on targeted, preprocessed, transformed data.
- Targeted: data fusion, sampling.
- Preprocessed: Noise removal, feature selection, normalization.
- Transformed: Dimension reduction.

Adapted from: An Introduction to Data Mining, http://www.thearling.com/text/dmwhite/dmwhite.htm

- Clustering - identify natural groupings within the data.
- Classification - learn a function to map a data item into one of several predefined classes.
- Summarization – describe groups, summary statistics, etc.
- Association – identify data items that occur frequently together.
- Prediction – predict values or distribution of missing data.
- Time-series analysis – analyze data to find periodicity, trends, deviations.

- Scaling and performance are often considered together in data mining (DM). The problem of scalability in DM is not only how to process such large sets of data, but how to do it within a useful timeframe.
- Many of the scalability issues in DM and DBMS are similar to performance-scaling issues in data management in general.
- Dr. Gregory Piatetsky-Shapiro and Prof. Gary Parker (P&P) identify the main issue for clustering algorithms, as an approach to DM, as follows: "The main issue in clustering is how to evaluate the quality of potential grouping. There are many methods, ranging from manual, visual inspection to a variety of mathematical measures that [maximize] the similarity of items within the cluster and maximize the difference between the clusters."

Algorithms generally:

- Operate on data with assumption of in-memory processing of entire data set
- Operate under the assumption that KIWI ("Kill It With Iron", i.e. adding hardware) will be used to address I/O and other performance scaling issues
- Or just don't address scalability within resource constraints at all

- Large Datasets
- Use scalable I/O architecture - minimize I/O, make it fit, make it fast.
- Reduce data - aggregation, dimensional reduction, compression, discretization.

- Complex Algorithms
- Reduce algorithm complexity
- Exploit parallelism, use specialized hardware

- Complex Results
- Effective visualization
- Increase understanding, trustworthiness

- Shared memory parallel computers: local + global memory. Locking is used to synchronize.
- Distributed memory parallel computers: Message Passing/ Remote Memory Operations.
- Parallel disk model: a block of B records is the unit of transfer, and D blocks can be read or written at once.
- Primitives: Scatter, Gather, Reduction
- Data parallelism or Task parallelism

- Question: How can we tackle memory constraints and efficiency?
- Statistics: Manipulate the data so that it fits into memory – sampling, feature selection, partitioning, summarization.
- Database: Reduce the time to access out-of-memory data – specialized data structures, block reads, parallel block reads.
- High Performance Computing: Use several processors.
- Data Mining Implementation: Efficient DM primitives, pre-computation.
- Misc.: Reduce the amount of data – discretization, compression, transformation.
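
As one illustration of the "sampling" approach listed above, the sketch below draws a fixed-size uniform sample from a data stream in a single pass, so the full data set never has to fit in memory. It uses reservoir sampling, which is a swapped-in technique and not something the slides prescribe; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length.

    Only k items are held in memory at any time (reservoir sampling, Algorithm R).
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

A mining algorithm that assumes in-memory data could then be run on `reservoir_sample(records, k)` instead of on the full data set, at the price of approximate results.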

- Beer-Diapers example.
- Basket data analysis, cross-marketing, sales campaigns, Web-log analysis, etc.
- Introduced for the first time in 1993: "Mining Association Rules between Sets of Items in Large Databases" by R. Agrawal et al., SIGMOD 93 Conference.

Given rules of the form X ==> Y:

- What are the interesting applications?
  - Find all rules with “bagels” as X?
    - What should be shelved together with bagels?
    - What would be impacted if the store stops selling bagels?
  - Find all rules with “Diet Coke” as Y?
    - What should the store do to promote Diet Coke?
  - Find all rules relating any items in Aisles 1 and 2?
    - Shelf planning to see if the two aisles are related

- Input:
- a database of sales “transactions”
- parameters:
- minimal support: say 50%
- minimal confidence: say 100%

- Output:
- rule 1: {2, 3} ==> {5}: s =?, c =?
- rule 2: {3, 5} ==> {2}
- more … …

| TID | Items      |
|-----|------------|
| 100 | 1, 3, 4    |
| 200 | 2, 3, 5    |
| 300 | 1, 2, 3, 5 |
| 400 | 2, 5       |
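
The open "s = ?, c = ?" values can be worked out directly from the four transactions. The sketch below assumes the usual definitions: the support of X ==> Y is the fraction of transactions containing every item of X and Y, and the confidence is the fraction of transactions containing X that also contain Y. The helper names `support` and `confidence` are illustrative, not from the slides.

```python
from typing import Dict, FrozenSet, Iterable

# The example database, read as TID -> set of items.
TRANSACTIONS: Dict[int, FrozenSet[int]] = {
    100: frozenset({1, 3, 4}),
    200: frozenset({2, 3, 5}),
    300: frozenset({1, 2, 3, 5}),
    400: frozenset({2, 5}),
}


def support(itemset: Iterable[int], db: Dict[int, FrozenSet[int]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= row for row in db.values()) / len(db)


def confidence(lhs: Iterable[int], rhs: Iterable[int], db: Dict[int, FrozenSet[int]]) -> float:
    """Fraction of the transactions containing `lhs` that also contain `rhs`."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support(lhs | rhs, db) / support(lhs, db)


if __name__ == "__main__":
    # rule 1: {2, 3} ==> {5}
    print(support({2, 3, 5}, TRANSACTIONS), confidence({2, 3}, {5}, TRANSACTIONS))
    # rule 2: {3, 5} ==> {2}
    print(support({2, 3, 5}, TRANSACTIONS), confidence({3, 5}, {2}, TRANSACTIONS))
```

Both rules are supported by transactions 200 and 300, giving s = 50% and c = 100%, so both meet the thresholds stated in the input.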

- Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
- Method:
- Initially, scan DB once to get frequent 1-itemset
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against DB
- Terminate when no frequent or candidate set can be generated
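
The method above can be sketched in a few lines. This is a minimal, unoptimized illustration that keeps the whole database in memory and rescans it for every candidate level; the function name `apriori` and its signature are assumptions, not the authors' code.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[int]], min_support: float) -> Dict[FrozenSet[int], int]:
    """Return every frequent itemset mapped to its absolute support count."""
    rows: List[FrozenSet[int]] = [frozenset(t) for t in transactions]
    min_count = min_support * len(rows)

    def count(candidates):
        # One pass over the database per candidate level.
        return {c: sum(c <= r for r in rows) for c in candidates}

    # Initial scan: frequent 1-itemsets.
    singletons = {frozenset([i]) for r in rows for i in r}
    frequent = {c: n for c, n in count(singletons).items() if n >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            # Prune step (Apriori principle): every k-subset must be frequent.
            if len(union) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(union, k)
            ):
                candidates.add(union)
        # Test the surviving candidates against the database.
        frequent = {c: n for c, n in count(candidates).items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result
```

For example, `apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 0.5)` considers {2, 3, 5} as the only 3-item candidate, because candidates such as {1, 2, 3} are pruned as soon as their subset {1, 2} is found to be infrequent.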

- Count Distribution: distribute the transaction data among processors and count the transactions in parallel. Scales linearly with the number of transactions.
- Savasere et al. (VLDB95): Partition data and scan twice (local, global).
- Toivonen (VLDB96): Sampling – verification of closure borders.
- Brin et al. (SIGMOD97): Dynamic itemset counting.
- Pei & Han (SIGMOD00): Compact description (FP-tree), no candidate generation. Scales up with partition-based projection.
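
The Count Distribution idea in the first bullet can be sketched with the scatter and reduction primitives: each worker counts candidates on its own partition, and the partial counts are then summed. The names `local_counts` and `count_distribution` are hypothetical, and a thread pool stands in for real processors here; in CPython this shows the structure of the computation, not an actual speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import FrozenSet, Iterable, List, Sequence


def local_counts(partition: Sequence[FrozenSet[int]], candidates: Sequence[FrozenSet[int]]) -> Counter:
    """Count, within one partition, how many transactions contain each candidate."""
    counts: Counter = Counter()
    for row in partition:
        for cand in candidates:
            if cand <= row:
                counts[cand] += 1
    return counts


def count_distribution(
    transactions: Iterable[Iterable[int]],
    candidates: Sequence[FrozenSet[int]],
    workers: int = 2,
) -> Counter:
    rows: List[FrozenSet[int]] = [frozenset(t) for t in transactions]
    # Scatter: round-robin split of the transactions across workers.
    partitions = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda p: local_counts(p, candidates), partitions))
    # Reduction: global count = sum of the per-partition counts.
    return reduce(lambda a, b: a + b, partials, Counter())
```

Every worker holds the full candidate set but only its share of the transactions, which is why the work per worker grows linearly with the number of transactions it is given.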

- Informal definition: "data clustering identifies the sparse and the crowded places and hence discovers the overall distribution patterns of the data set."
- BIRCH belongs to the category of hierarchical clustering algorithms that rely on a distance measure; K-Means is another example of a distance-based clustering method.
- Approach: "statistical identification of clusters, i.e. densely populated regions in multi-dimensional dataset, given the desired number of clusters K and a dataset of N points, and a distance based measurement".
- Problem: the other approaches (distance-based, hierarchical, etc.) are all similar in terms of scaling and resource utilization.

- First algorithm proposed in the database area that filters out “noise”, i.e. outliers
- Prior work does not adequately address large data sets with minimization of I/O cost
- Prior work does not address issues of data set fit to memory
- Prior work does not address resource utilization or resource constraints in scalability and performance

- Resource utilization means maximizing the use of available resources, as opposed to merely working within resource constraints, which does not necessarily optimize utilization.
- Resource utilization is important in DM scaling, or in any case where the data sets are very large.
- A single BIRCH scan of the data set yields at least a "good enough" clustering.
- One or more additional passes are optional; depending on the constraints of a particular system and application, they can be used to improve the quality beyond "good enough".

- Database Oriented Constraints are what differentiates BIRCH from more general DM algorithms
- Limited acceptable response time
- Resource utilization – optimize, not just work within the resources available – necessary for very large data sets
- Fit to available memory
- Minimize I/O costs
- Need I/O cost linear in size of data set

- Locality of reference: each unit clustering decision made without scanning all data points for all existing clusters.
- Clustering decision: measurements reflect natural "closeness" of points
- Locality enables clusters to be incrementally maintained and updated during the clustering process
- Optional removal of outliers:
- Cluster equals dense region of points.
- Outlier equals point in sparse region.

- Optimal memory resource usage -> utilization within resource constraints.
- Finest possible sub clusters, given memory resource and I/O/time constraints:
- Finest clusters given memory implies best accuracy achievable (another type of optimal utilization).

- Minimize I/O costs:
- implies efficiency and required response time.

- Running time linearly scalable (in size of data set).
- Optionally, incremental scan of the data set, i.e. the entire data set does not have to be scanned in advance, and the increments are adjustable.
- Only scans complete data set once (others scan multiple times)

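- Given N d-dimensional data points in a cluster: $\{X_i\}$, $i = 1, \dots, N$
- “Centroid”: $X_0 = \frac{1}{N} \sum_{i=1}^{N} X_i$
- “Radius”: $R = \left( \frac{1}{N} \sum_{i=1}^{N} \lVert X_i - X_0 \rVert^2 \right)^{1/2}$
- “Diameter”: $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \lVert X_i - X_j \rVert^2}{N(N-1)} \right)^{1/2}$

Given the centroids $X_0$ and $Y_0$ of two clusters $\{X_i\}$, $i = 1, \dots, N_1$, and $\{Y_j\}$, $j = 1, \dots, N_2$:

- The centroid Euclidean distance D0: $D_0 = \left( \lVert X_0 - Y_0 \rVert^2 \right)^{1/2}$
- The centroid Manhattan distance D1: $D_1 = \sum_{k=1}^{d} \lvert X_0^{(k)} - Y_0^{(k)} \rvert$

- Average inter-cluster distance D2: $D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \lVert X_i - Y_j \rVert^2}{N_1 N_2} \right)^{1/2}$

- Average intra-cluster distance D3 (the diameter of the merged cluster $\{Z_k\} = \{X_i\} \cup \{Y_j\}$): $D_3 = \left( \frac{\sum_{k=1}^{N_1+N_2} \sum_{l=1}^{N_1+N_2} \lVert Z_k - Z_l \rVert^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$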

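- Clustering Feature: CF = (N, LS, SS)
- N = |C|: number of data points in the cluster
- LS = $\sum_{i=1}^{N} X_i$: linear sum of the N data points
- SS = $\sum_{i=1}^{N} \lVert X_i \rVert^2$: square sum of the N data points
- A CF is a summarization of a cluster.

- Assume CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CFs of two disjoint clusters. The CF of the merged cluster is CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
- Information stored in CFs is sufficient to compute:
- Centroids
- Measures for the compactness of clusters
- Distance measure for clusters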
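
To make the claim concrete, here is a minimal sketch of a CF triple that supports merging and derives the centroid, radius and diameter from (N, LS, SS) alone, without revisiting the data points. The class name `CF` and its methods are illustrative; SS is stored as a scalar (the sum of squared norms), and the identities used are R² = SS/N − ‖LS/N‖² and D² = (2·N·SS − 2·‖LS‖²) / (N(N − 1)).

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    n: int                   # N: number of points summarized
    ls: Tuple[float, ...]    # LS: per-dimension linear sum
    ss: float                # SS: sum of squared norms

    @staticmethod
    def from_point(x: Sequence[float]) -> "CF":
        return CF(1, tuple(float(v) for v in x), float(sum(v * v for v in x)))

    def __add__(self, other: "CF") -> "CF":
        # Additivity: merging two disjoint clusters just adds their CFs.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(v / self.n for v in self.ls)

    def radius(self) -> float:
        c2 = sum(v * v for v in self.centroid())
        # max(..., 0.0) guards against tiny negative values from rounding.
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self) -> float:
        if self.n < 2:
            return 0.0
        ls2 = sum(v * v for v in self.ls)
        return math.sqrt(max((2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1)), 0.0))
```

For the two points (0, 0) and (2, 0), `CF.from_point((0, 0)) + CF.from_point((2, 0))` gives N = 2, LS = (2, 0), SS = 4, from which the centroid (1, 0), radius 1 and diameter 2 follow.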

- Height-balanced tree
- Two parameters:
  - Branching factor
    - B: an internal node contains at most B entries [CFi, childi]
    - L: a leaf node contains at most L entries [CFi]
  - Threshold T
    - The diameter of each entry in a leaf node is at most T
- Leaf nodes are connected via prev and next pointers, which makes data scans efficient

CF / CF Tree used to optimize clusters for memory & I/O:

- P, page size (page of memory)
- Tree size a function of T, larger T -> smaller CF Tree
- Require node to fit in memory page size P –> split to fit, or merge for optimal utilization - dynamically
- P can be varied on the system or in the algorithm for performance tuning and scaling

- Start CF tree t1 with initial threshold T.
- Continue scanning data and inserting into t1. Result?
  - Finished scanning data: re-absorb potential outliers into t1.
  - Out of memory:
    - Increase T.
    - Rebuild CF tree t2 with the new T from CF tree t1. If a leaf entry is a potential outlier, write it to disk; otherwise use it.
    - t1 <- t2 (t2 replaces t1). Result?
      - Out of disk space: re-absorb potential outliers into t1.
      - Otherwise: continue scanning.
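
The scan-and-rebuild loop can be sketched in heavily simplified form. The code below is not BIRCH itself: it keeps a flat list of leaf entries instead of a CF tree, treats "out of memory" as exceeding a hypothetical `max_entries` limit, multiplies T by an assumed `growth` factor, and omits outlier handling entirely. It only illustrates the loop of inserting, hitting the limit, raising T and rebuilding from the existing entries without rescanning the data.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[int, Tuple[float, ...], float]  # (N, LS, SS)


def merge(a: Entry, b: Entry) -> Entry:
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def diameter(cf: Entry) -> float:
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt(max((2 * n * ss - 2 * sum(v * v for v in ls)) / (n * (n - 1)), 0.0))


def centroid_distance(a: Entry, b: Entry) -> float:
    return math.dist([v / a[0] for v in a[1]], [v / b[0] for v in b[1]])


def insert(entries: List[Entry], cf: Entry, threshold: float) -> None:
    """Absorb cf into the closest entry if the merged diameter stays within T."""
    if entries:
        i = min(range(len(entries)), key=lambda j: centroid_distance(entries[j], cf))
        merged = merge(entries[i], cf)
        if diameter(merged) <= threshold:
            entries[i] = merged
            return
    entries.append(cf)


def build(points: Sequence[Sequence[float]], threshold: float,
          max_entries: int, growth: float = 2.0) -> Tuple[List[Entry], float]:
    if threshold <= 0 or max_entries < 1 or growth <= 1:
        raise ValueError("need threshold > 0, max_entries >= 1, growth > 1")
    entries: List[Entry] = []
    for p in points:
        cf = (1, tuple(float(v) for v in p), float(sum(v * v for v in p)))
        insert(entries, cf, threshold)
        while len(entries) > max_entries:      # stand-in for "out of memory"
            threshold *= growth                # increase T
            old, entries = entries, []         # rebuild from the old entries only
            for entry in old:
                insert(entries, entry, threshold)
    return entries, threshold
```

A larger T lets more points be absorbed per entry, so each rebuild produces a coarser but smaller summary, which matches the "larger T -> smaller CF Tree" relationship stated earlier.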

- I/O cost:
Where

- N: Number of data points
- M: Memory size
- P: Page size
- d: dimension
- N0: Number of data points loaded into memory with threshold T0

- Data mining: The semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures of data.
- Pattern recognition: The discovery and characterization of patterns
- Pattern: An ordering with an underlying structure
- Feature: Extractable measurement or attribute

Figure 1. Key steps in scientific data mining

- Scientific data sets are very complex
- Multi-sensor, multi-resolution, multi-spectral data
- High-dimensional data
- Mesh data from simulations
- Data contaminated with noise
- Sensor noise, clouds, atmospheric turbulence, …

- Massive data sets
- Advances in technology allow us to collect ever-increasing amounts of scientific data (in experiments, observations, and simulations)
- Astronomy data sets with tens of millions of galaxies
- Sloan Digital Sky Survey: assuming a pixel size of about 0.25”, the whole sky is 10 terapixels (2 bytes/pixel and 1TeraByte)

- Collection of data made possible by advances in:
- Sensors (telescopes, satellites, …)
- Computers and storage (faster, parallel, …)

We need fast and accurate data analysis techniques to realize the full potential of our enhanced data-collecting ability; manual techniques are not feasible at this scale.

- FIRST: detecting radio-emitting stars
- Dataset: 100 GB of image data (1996)
- Image maps: 16K image maps, 7.1 MB each
- Result: 20K radio-emitting stars found from 400K entries

Research Goal:

- Find global climate patterns of interest to Earth Scientists

Average Monthly Temperature

A key interest is finding connections between the ocean / atmosphere and the land.

- Global snapshots of values for a number of variables on land surfaces or water.
- Span a range of 10 to 50 years.

- EOS (Earth Observing System) satellites, e.g., Terra and Aqua, provide high-resolution measurements
- Finer spatial grids
- 8 km × 8 km grid produces 10,848,672 data points
- 1 km × 1 km grid produces 694,315,008 data points

- More frequent measurements
- Multiple instruments
- Generates terabytes of data per day: a scalability challenge
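
The two grid sizes are consistent with each other: refining each cell edge from 8 km to 1 km multiplies the number of cells by 8² = 64. A quick check, with illustrative variable names:

```python
# Point counts quoted for the two grid resolutions.
points_8km = 10_848_672
points_1km = 694_315_008

# Each 8 km x 8 km cell splits into 8 * 8 cells of 1 km x 1 km.
linear_refinement = 8
scale = linear_refinement ** 2

predicted_1km = points_8km * scale
print(scale, predicted_1km, predicted_1km == points_1km)
```

Data volume therefore grows with the square of the linear resolution, before the additional growth from more frequent measurements and multiple instruments.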