Vipin Kumar, Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002

High Performance Data Mining

Vipin Kumar

Army High Performance Computing Research Center

Department of Computer Science

University of Minnesota http://www.cs.umn.edu/~kumar

Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF


Overview

  • Introduction to Data Mining (What, Why, and How?)

  • Issues and Challenges in Designing Parallel Data Mining Algorithms

  • Case Study: Discovery of Patterns in Global Climate Data using Data Mining

  • Summary


What is Data Mining?

  • Many Definitions

    • Non-trivial extraction of implicit, previously unknown and potentially useful information from data

    • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


What is (not) Data Mining?

  • What is not Data Mining?

    • Look up phone number in phone directory

    • Query a Web search engine for information about “Amazon”

  • What is Data Mining?

    • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)

    • Group together similar documents returned by search engine according to their context (e.g., Amazon rainforest, Amazon.com)


Why Mine Data? Commercial Viewpoint

  • Lots of data is being collected and warehoused

    • Web data, e-commerce

    • purchases at department/grocery stores

    • Bank/Credit Card transactions

  • Computers have become cheaper and more powerful

  • Competitive Pressure is Strong

    • Provide better, customized services for an edge (e.g. in Customer Relationship Management)


Why Mine Data? Scientific Viewpoint

  • Data collected and stored at enormous speeds (GB/hour)

    • remote sensors on a satellite

    • telescopes scanning the skies

    • microarrays generating gene expression data

    • scientific simulations generating terabytes of data

  • Traditional techniques infeasible for raw data

  • Data mining may help scientists

    • in classifying and segmenting data

    • in Hypothesis Formation


Mining Large Data Sets - Motivation

  • There is often information “hidden” in the data that is not readily evident

  • Human analysts may take weeks to discover useful information

  • Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts. From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”]


Origins of Data Mining

  • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

  • Traditional techniques may be unsuitable due to

    • Enormity of data

    • High dimensionality of data

    • Heterogeneous, distributed nature of data

[Figure: Data Mining drawing on Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]


Role of Parallel and Distributed Computing

[Figure: Data Mining drawing on Statistics/AI, Machine Learning/Pattern Recognition, High Performance Computing, and Database systems]

  • Many algorithms require more than O(n) computation time

  • High Performance Computing (HPC) is often critical for scalability to large data sets

  • Sequential computers have limited memory

    • This may require multiple, expensive I/O passes over the data

  • Data may be distributed

    • due to privacy reasons

    • physically dispersed over many different geographic locations


Data Mining Tasks...

[Figure: the tasks applied to a data set: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]


Predictive Modeling

  • Find a model for class attribute as a function of the values of other attributes

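[Figure: Model for predicting tax evasion. A classifier is learned from training records with two categorical attributes, one continuous attribute, and a class label. The resulting decision tree tests marital status (Married) and income thresholds (80K, 100K) and predicts YES or NO.]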


Predictive Modeling: Applications

  • Targeted Marketing

  • Customer Attrition/Churn

  • Classifying Galaxies

    • Class: Stages of Formation (Early, Intermediate, Late)

    • Attributes: Image features, characteristics of light waves received, etc.

    • Sky Survey Data Size:

      • 72 million stars, 20 million galaxies

      • Object Catalog: 9 GB

      • Image Database: 150 GB

Courtesy: http://aps.umn.edu


Clustering

  • Given a set of data points, find groupings such that

    • Data points in one cluster are more similar to one another

    • Data points in separate clusters are less similar to one another
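
One common way to produce such groupings is k-means, which the talk later applies to climate data. The following is a minimal sketch, assuming 2-D points, Euclidean distance, and hand-picked initial centroids; the function name `kmeans` and the sample points are illustrative and not taken from the talk.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
```

With these inputs the first three points end up in one cluster and the last three in the other. Real data would need a principled choice of the number of clusters and of the initial centroids.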


Clustering: Applications

  • Market Segmentation

  • Gene expression clustering

  • Document Clustering


Association Rule Discovery

  • Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record

  • Applications

    • Marketing and Sales Promotion

    • Supermarket shelf management

    • Inventory Management

Rules Discovered:

{Milk} --> {Coke} (s=0.6, c=0.75)

{Diaper, Milk} --> {Beer} (s=0.4, c=0.67)
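
The values s (support) and c (confidence) can be computed directly from a transaction list. The sketch below is illustrative only: the five transactions are an assumption (the transcript does not include the transaction table), chosen so that the two rules above come out with the quoted values, and the helper names `support` and `confidence` are hypothetical.

```python
# Assumed market-basket data; not part of the transcript.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(sorted(lhs), "-->", sorted(rhs), f"s={s:.2f}, c={c:.2f}")
```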


Deviation/Anomaly Detection

  • Detect significant deviations from normal behavior

  • Applications:

    • Credit Card Fraud Detection

    • Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day


General Issues and Challenges in Parallel Data Mining

  • Dense vs. Sparse

  • Structured versus Unstructured

  • Static vs. Dynamic

  • Data mining computations tend to be unstructured, sparse and dynamic.


Specific Issues and Challenges in Parallel Data Mining

  • Disk I/O

    • Data is often too large to fit in main memory

    • Spatial locality is critical

  • Hash Tables

    • Many efficient data mining algorithms require fast access to large hash tables.


Constructing a Decision Tree

[Figure: two candidate splits of the same 10 training records]

  • Split on Marital Status

    • Single/Divorced: Pay: 3, Evade: 3

    • Married: Pay: 4, Evade: 0

  • Split on Refund

    • Yes: Pay: 3, Evade: 0

    • No: Pay: 4, Evade: 3

Key Computation (class counts per attribute value):

              Pay   Evade
  Refund       3      0
  No Refund    4      3
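
The class counts are all that is needed to score a candidate split. The slide does not name the impurity measure, so the sketch below assumes the Gini index as one common choice; the function names are illustrative, and the counts are the Pay/Evade counts shown for this slide.

```python
def gini(counts):
    """Gini impurity of one node, given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini impurity of a split; each partition is (Pay, Evade)."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# (Pay, Evade) counts per branch, as given on the slide.
refund_split = [(3, 0), (4, 3)]    # Refund: Yes, Refund: No
marital_split = [(3, 3), (4, 0)]   # Single/Divorced, Married

print(round(gini_split(refund_split), 3))
print(round(gini_split(marital_split), 3))
```

Under this assumed measure the lower value marks the purer split, so Marital Status would be preferred over Refund for these counts.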


Constructing a Decision Tree

[Figure: the training records divided into two subsets, Refund: Yes and Refund: No]


Constructing a Decision Tree in Parallel

  • Partitioning of data only

    • global reduction per node is required

    • large number of classification tree nodes gives high communication cost

[Figure: a table of n records with m categorical attributes, and the class-count table for Refund (Pay 3, Evade 0) and No Refund (Pay 4, Evade 3)]
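
The data-partitioning scheme can be simulated in a few lines: each processor builds class counts on its own records, and a global reduction sums them for the tree node being expanded. This is a single-process sketch with hypothetical records and function names; a real implementation would replace `global_reduction` with a message-passing all-reduce.

```python
from collections import Counter

def local_counts(records, attribute):
    """Class histogram for one attribute on one processor's partition."""
    counts = Counter()
    for record in records:
        counts[(record[attribute], record["class"])] += 1
    return counts

def global_reduction(per_processor_counts):
    """Stand-in for an all-reduce: sum the local histograms."""
    total = Counter()
    for counts in per_processor_counts:
        total.update(counts)
    return total

# Hypothetical records, split across two simulated processors.
partitions = [
    [{"Refund": "Yes", "class": "Pay"}, {"Refund": "No", "class": "Evade"},
     {"Refund": "No", "class": "Pay"}, {"Refund": "Yes", "class": "Pay"},
     {"Refund": "No", "class": "Pay"}],
    [{"Refund": "No", "class": "Evade"}, {"Refund": "Yes", "class": "Pay"},
     {"Refund": "No", "class": "Pay"}, {"Refund": "No", "class": "Evade"},
     {"Refund": "No", "class": "Pay"}],
]
merged = global_reduction(local_counts(p, "Refund") for p in partitions)
print(merged)
```

One such reduction is needed for every tree node, which is why a large number of nodes drives up the communication cost.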


Constructing a Decision Tree in Parallel

[Figure: a tree with 10,000 training records at the root, split into nodes of 7,000 and 3,000 records, which are split again into nodes of 2,000 and 5,000 records and of 2,000 and 1,000 records]

  • Partitioning of classification tree nodes

    • natural concurrency

    • load imbalance as the amount of work associated with each node varies

    • child nodes use the same data as used by parent node

      • loss of locality

      • high data movement cost


Challenges in Constructing Parallel Classifier

  • Partitioning of data only

    • large number of classification tree nodes gives high communication cost

  • Partitioning of classification tree nodes

    • natural concurrency

    • load imbalance as the amount of work associated with each node varies

    • child nodes use the same data as used by parent node

      • loss of locality

      • high data movement cost

  • Hybrid algorithms: partition both data and tree


Experimental Results (Srivastava, Han, Kumar, and Singh, 1999)

  • Data set

    • function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96)

    • 2 class labels, 3 categorical and 6 continuous attributes

  • IBM SP2 with 128 processors

    • 66.7 MHz CPU with 256 MB real memory

    • AIX version 4

    • high performance switch


Speedup Comparison of the Three Parallel Algorithms

[Figure: speedup plots for 0.8 million examples and 1.6 million examples]


Splitting Criterion Verification in the Hybrid Algorithm

[Figure: plots for 0.8 million examples on 8 processors and 1.6 million examples on 16 processors]




Hash Table Access

  • Some efficient decision tree algorithms require random access to large data structures.

  • Example: SPRINT (Shafer, Agrawal, Mehta)

Storing the entire hash table on one processor makes the algorithm unscalable.


ScalParC (Joshi, Karypis, Kumar, 1998)

  • ScalParC is a scalable parallel decision tree construction algorithm

    • Scales to large number of processors

    • Scales to large training sets

  • ScalParC is memory efficient

    • The hash-table is distributed among the processors

  • ScalParC performs minimum amount of communication


This Design is Inspired by...

  • Communication Structure of Parallel Sparse Matrix-Vector Algorithms.


Parallel Runtime (Joshi, Karypis, Kumar, 1998)

[Figure: parallel runtime on a 128-processor Cray T3D]


Computing Association Patterns

1. Market-basket transactions

2. Find item combinations (itemsets) that occur frequently in data

3. Generate association rules


Computing Associations Requires Exponential Computation

[Figure: lattice of all itemsets over the items a, b, c, d]

{a} {b} {c} {d}

{a,b} {a,c} {a,d} {b,c} {b,d} {c,d}

{a,b,c} {a,b,d} {a,c,d} {b,c,d}

{a,b,c,d}

Given m items, there are 2^m - 1 possible item combinations
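
The count can be checked by enumerating every non-empty subset of the four items. This is a small illustrative sketch; the helper name `all_itemsets` is hypothetical.

```python
from itertools import combinations

def all_itemsets(items):
    """Every non-empty subset of `items`, smallest itemsets first."""
    return [set(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)]

items = ["a", "b", "c", "d"]
itemsets = all_itemsets(items)
print(len(itemsets), 2 ** len(items) - 1)
```

Both printed numbers are 15 for m = 4. The list doubles with each added item, which is why exhaustive enumeration quickly becomes infeasible.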


Handling Exponential Complexity

  • Given n transactions and m different items:

    • number of possible association rules:

    • computation complexity:

  • Systematic search for all patterns, based on support constraint [Agrawal & Srikant]:

    • If {A,B} has support at least a, then both A and B have support at least a.

    • If either A or B has support less than a, then {A,B} has support less than a.

    • Use patterns of n-1 items to find patterns of n items.


Illustrating Apriori Principle (Agrawal and Srikant, 1994)

Minimum Support = 3

[Figure: candidate tables for Items (1-itemset candidates), Pairs (2-itemset candidates), and Triplets (3-itemset candidates)]

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates

With support-based pruning: 6 + 6 + 1 = 13 candidates
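
The pruned candidate counts can be reproduced with a short level-wise search. The sketch below is an assumption-laden illustration: the five transactions over six items are not in the transcript and were chosen so that a minimum support count of 3 yields 6, 6, and 1 candidates at the three levels; the function name `apriori` is illustrative and this is not the optimized algorithm from the cited paper.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search. Returns frequent itemsets and candidates per level."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, candidates_per_level = [], []
    k = 1
    while candidates:
        candidates_per_level.append(len(candidates))
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_count}
        frequent.extend(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune every candidate that
        # has an infrequent (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent, candidates_per_level

# Assumed data: six distinct items, five transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent, candidates_per_level = apriori(transactions, min_count=3)
print(candidates_per_level, sum(candidates_per_level))
```

Only itemsets whose subsets are all frequent are ever counted, so the search touches 13 candidates instead of the 41 subsets of size up to three.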


Counting Candidates

  • Frequent Itemsets are found by counting candidates.

  • Simple way:

    • Search for each candidate in each transaction. Expensive!!!

[Figure: each of the N transactions is matched against each of the M candidates]
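
The cost of the simple approach is easy to see in code: every transaction is tested against every candidate. This is a brute-force sketch with hypothetical data and names, shown only to make the number of comparisons explicit.

```python
def count_candidates(transactions, candidates):
    """Brute force: one subset test per (transaction, candidate) pair."""
    counts = {c: 0 for c in candidates}
    comparisons = 0
    for t in transactions:
        for c in candidates:
            comparisons += 1
            if c <= t:          # candidate is contained in the transaction
                counts[c] += 1
    return counts, comparisons

# Hypothetical inputs.
transactions = [{"Bread", "Milk"}, {"Bread", "Diaper"}, {"Milk", "Diaper", "Bread"}]
candidates = [frozenset({"Bread", "Milk"}), frozenset({"Milk", "Diaper"})]
counts, comparisons = count_candidates(transactions, candidates)
print(counts, comparisons)
```

The comparison count is always the number of transactions times the number of candidates. The hash tree mentioned in the parallel formulation is one way to avoid testing every pair.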


Parallel Formulation of Association Rules (Han, Karypis, and Kumar, 2000)

  • Need:

    • Huge Transaction Datasets (10s of TB)

    • Large Number of Candidates.

  • How?

    • Partition the Transaction Database among processors

      • communication needed for global counts

      • local memory on each processor should be large enough to store the entire hash tree

    • Partition the Candidates among processors

      • redundant I/O for transactions

    • Partition both Candidates and Transaction Database
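
The first two partitioning strategies can be contrasted in a single-process simulation. In this sketch the processors are simulated by slicing lists, the data and function names are hypothetical, and the global exchange is a plain sum; it is not the algorithm from the cited paper.

```python
from collections import Counter

def count(transactions, candidates):
    """Support count of every candidate over the given transactions."""
    return Counter({c: sum(1 for t in transactions if c <= t)
                    for c in candidates})

def partition_transactions(transactions, candidates, p):
    """Each simulated processor counts all candidates on its share of the data."""
    local = [count(transactions[r::p], candidates) for r in range(p)]
    total = Counter()
    for counts in local:        # stand-in for the global count exchange
        total.update(counts)
    return total

def partition_candidates(transactions, candidates, p):
    """Each simulated processor counts its share of candidates on all the data."""
    total = Counter()
    for r in range(p):          # every processor rereads every transaction
        total.update(count(transactions, candidates[r::p]))
    return total

# Hypothetical inputs.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
by_data = partition_transactions(transactions, candidates, p=2)
by_candidates = partition_candidates(transactions, candidates, p=2)
print(dict(by_data))
```

Both strategies produce the same counts. They differ in what is replicated: the first keeps every candidate on every processor, the second makes every processor scan the whole transaction database.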


Parallel Association Rules: Scaleup Results (100K, 0.25%) (Han, Karypis, and Kumar, 2000)


Parallel Association Rules: Response Time (np=64, 50K) (Han, Karypis, and Kumar, 2000)


Discovery of Patterns in the Global Climate System

Research Goals:

  • Find global climate patterns of interest to Earth Scientists

  • Global snapshots of values for a number of variables on land surfaces or water.

  • Monthly over a range of 10 to 50 years.

# grid points: 67K Land, 40K Ocean

Current data size range: 20 – 400 MB


Importance of Global Climate Patterns and NPP

  • Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants.

  • Keeping track of NPP is important because it includes the food source of humans and all other organisms.

  • NPP is impacted by global climate patterns.

Image from http://www.pmel.noaa.gov/co2/gif/globcar.png


Patterns of Interest

  • Zone Formation

    • Find regions of the land or ocean which have similar behavior.

  • Associations

    • Find relations between climate events and land cover.

  • Teleconnections

    • Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth.

    • Example: El Niño is associated with droughts in Australia and Southern Africa and heavy rainfall along the western coast of South America.

Sea Surface Temperature Anomalies off Peru (ANOM 1+2)


Clustering of Raw NPP and Raw SST (Num clusters = 2)


K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2)

Land Cluster Cohesion:

North = 0.78

South = 0.59

Ocean Cluster Cohesion:

North = 0.77

South = 0.80


© V. Kumar, Discovery of Patterns in the Global Climate System using Data Mining

Ocean Climate Indices: Connecting the Ocean and the Land

  • An OCI is a time series of temperature or pressure

    • Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP)

  • OCIs are important because

    • They distill climate variability at a regional or global scale into a single time series.

    • They are related to well-known climate phenomena such as El Niño.


Ocean Climate Indices – ANOM 1+2

  • ANOM 1+2 is associated with El Niño and La Niña.

  • Defined as the Sea Surface Temperature (SST) anomalies in a region off the coast of Peru

  • El Niño is associated with

    • Droughts in Australia and Southern Africa

    • Heavy rainfall along the western coast of South America

    • Milder winters in the Midwest

El Nino Events


Connection of ANOM 1+2 to Land Temp

OCIs capture teleconnections, i.e., the simultaneous variation in climate and related processes over widely separated points on the Earth.


Ocean Climate Indices - NAO

[Figure: map marking Iceland and the Azores]

  • The North Atlantic Oscillation (NAO) is associated with climate variation in Europe and North America.

  • Defined as the normalized pressure difference between Ponta Delgada, Azores and Stykkisholmur, Iceland.

  • Associated with warm and wet winters in Europe and cold and dry winters in northern Canada and Greenland

  • The eastern US experiences mild and wet winter conditions.


Connection of NAO to Land Temp


Discovery of Ocean Climate Indices

  • Use clustering to find areas of the oceans that have high density, i.e., relatively homogeneous behavior.

    • Cluster centroids are potential OCIs.

    • For SLP, pairs of cluster centroids are potential OCIs.

  • Evaluate the “influence” of potential OCIs on land points.

  • Determine if the potential OCI matches a known OCI.

  • For potential OCIs that are not well-known, conduct further evaluation.

    • Are there land points that have higher correlation for the potential OCI than for known indices?
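
The evaluation step compares, for each land point, how strongly its time series correlates with a candidate index versus a known one. The sketch below assumes plain Pearson correlation on short made-up series; the talk also mentions area-weighted correlation, which is not reproduced here, and the names `pearson` and `best_index` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_index(land_series, indices):
    """Name and correlation of the index most correlated (in magnitude)."""
    scored = ((name, pearson(land_series, s)) for name, s in indices.items())
    return max(scored, key=lambda item: abs(item[1]))

# Made-up series for one land point and two indices.
land = [2, 4, 6, 8, 10]
indices = {
    "candidate": [1, 2, 3, 4, 5],   # e.g. a cluster centroid
    "known": [2, 1, 4, 3, 5],       # e.g. an established index
}
name, r = best_index(land, indices)
print(name, round(r, 2))
```

A land point where the candidate wins, as in this toy case, is the kind of evidence the last bullet asks for.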


SNN Clustering - Advantages

  • Finding clusters of different shapes and sizes, especially in the presence of noise, is a difficult clustering problem.

  • SNN clustering

    • Handles problems of varying density, shape and size.

    • Is resistant to noise.

      • Earth Science data is noisy

    • SNN clustering finds the number of clusters automatically.

  • Requires O(n^2) time

    • Need to calculate the pairwise similarity matrix

    • This is a highly parallel operation.
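
The pairwise step can be sketched as follows, reading SNN as shared nearest neighbor: the similarity of two points is the number of k-nearest neighbors they have in common. This is a simplified illustration with hypothetical points and names. It omits refinements used by full SNN clustering, such as requiring that the two points appear in each other's neighbor lists.

```python
import math

def knn_lists(points, k):
    """Indices of each point's k nearest neighbors, excluding itself."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors

def snn_similarity(points, k):
    """n x n matrix of shared-neighbor counts; needs O(n^2) pair evaluations."""
    nn = knn_lists(points, k)
    n = len(points)
    return [[len(nn[i] & nn[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]

# Hypothetical points: a tight square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
sim = snn_similarity(points, k=2)
for row in sim:
    print(row)
```

Each matrix entry depends only on two neighbor lists, so rows can be computed independently, which is what makes the operation highly parallel.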


SST Clusters


SST Clusters that Correspond to El Niño Climate Indices

[Figure: SST clusters 75, 78, 67, and 94, shown with the El Niño regions defined by Earth scientists]

SNN clusters of SST that are highly correlated with El Niño indices, ~ 0.93 correlation.


SST Clusters Highly Correlated to Known Indices…

Examples of some SST clusters that are highly correlated to known OCIs and have high area-weighted correlation with land temperature. These indices have a significant correlation with El Niño indices.


SST Clusters Highly Correlated to Known Indices…

However, there are areas (yellow) where these clusters correlate better.


SST Cluster Moderately Correlated to Known Indices


Mining Associations in Earth Science Data (Tan, Steinbach, Kumar, Potter, Klooster, Torregrosa, 2001)

  • First, transform Earth Science data into transactions.

  • Find patterns using association discovery algorithms.

1 FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)

2 FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)

3 FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)

4 FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)

75 FPAR-HI ==> NPP-HI (support count = 216924, confidence = 55.7%)


Example of Interesting Association Rules

FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)

Shrubland areas


Land Cover Types

[Figure: map of land cover types, with shrublands among the legend entries]


Example of Interesting Association Rules…

[Figure: support count by land cover type]

  • Temp-Hi ==> NPP-Hi tends to occur in the forest and cropland regions in the northern hemisphere (Forests (33.5%), Grassland (8.7%), Cropland (24.5%), Desert (0.4%))


Need for Parallel Computing

  • Satellites are providing measurements of finer granularity.

    • Finer spatial grids

      • 1 by 1 grid produces 64,800 data points

      • 0.1 by 0.1 grid produces 6,480,000 data points

    • More frequent measurements

      • Daily measurements multiply monthly data by a factor of 30

  • Looking at weather instead of climate requires finer resolution

    • Detection of movement of fronts

Earth Observing System - EOS AM 1
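
The grid sizes quoted on this slide follow from a global latitude/longitude grid, assuming the resolutions are in degrees (180 degrees of latitude by 360 degrees of longitude). A quick check, with the illustrative helper name `grid_points`:

```python
def grid_points(resolution_deg):
    """Number of cells in a global latitude/longitude grid."""
    return round(180 / resolution_deg) * round(360 / resolution_deg)

coarse = grid_points(1.0)
fine = grid_points(0.1)
print(coarse, fine, fine // coarse)
```

A tenfold finer resolution gives 100 times as many points. Moving from monthly to daily measurements multiplies the data by roughly a further factor of 30.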


Need for Parallel Computing

  • SNN clustering analyses require O(n^2) comparisons.

    • Evaluate correlation of every ocean point with every land point.

  • Association rule algorithms can also be very compute intensive.

    • Potentially very much greater than O(n^2)

  • The amount of memory required for clustering and association rule algorithms can exceed the 4 GB of traditional sequential servers.


Conclusion

  • Data mining techniques are increasingly being used for discovering useful and previously unknown information from data

  • HPC holds the promise of making data mining applicable for massive datasets

  • HPC challenges

    • Parallelization of existing data mining algorithms

    • Development of novel parallel/distributed formulations

    • Efficient implementation of hash tables in parallel/distributed environment


Bibliography

  • Large-Scale Parallel Data Mining. Mohammed J. Zaki and Ching-Tien Ho (editors), Springer-Verlag, 2000.

  • Advances in Distributed and Parallel Knowledge Discovery. Hillol Kargupta and Philip Chan (editors), AAAI Press/MIT Press, 2000.

  • Data Mining for Scientific and Engineering Applications. Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, and Raju Namburu, Kluwer Academic Publishers, October 2001.

  • “Parallel Formulations of Decision-Tree Classification Algorithms,” Anurag Srivastava, Eui-Hong (Sam) Han, Vipin Kumar, and Vineet Singh, Data Mining and Knowledge Discovery: An International Journal, vol. 3, no. 3, pp 237-261, September 1999.

  • “Data Mining for the Discovery of Ocean Climate Indices”, Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven Klooster, Workshop on Mining Scientific Data, SDM 2002


Bibliography …

  • “ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets.” Mahesh V. Joshi, George Karypis and Vipin Kumar, Proc. of 1998 International Parallel Processing Symposium, April 1998.

  • “Scalable Parallel Data Mining for Association Rules,” Eui-Hong (Sam) Han, George Karypis and Vipin Kumar, IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 3, May/June 2000.

  • “Finding Spatio-Temporal Patterns in Earth Science Data,” Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Steven Klooster, Christopher Potter, Alicia Torregrosa, KDD 2001 Workshop on Temporal Data Mining.

  • “Fast algorithms for mining association rules,” R. Agrawal and R. Srikant, In Proc. of the 20th International Conference Very Large Data Bases, pages 487--499. Morgan Kaufmann, 1994.

