
Presentation Transcript


  1. SDSC Summer Institute 2004 TUTORIAL: Data Mining for Scientific Applications. Peter Shin, Hector Jasso, San Diego Supercomputer Center, UCSD

  2. Overview • Introduction to data mining • Definitions, concepts, applications • Machine learning methods for KDD • Supervised learning – classification • Unsupervised learning – clustering • Cyberinfrastructure for data mining • SDSC/NPACI resources – hardware and software • Survey of Applications at SKIDL • Break • Hands-on tutorial with IBM Intelligent Miner and SKIDLkit • Customer targeting • Microarray analysis (leukemia dataset)

  3. Data Mining Definition The search for interesting patterns and models, in large data collections, using statistical and machine learning methods, and high-performance computational infrastructure. Key point: applications are data-driven and compute-intensive

  4. Analysis Levels and Infrastructure • Informal methods – graphs, plots, visualizations, exploratory data analysis (yes – Excel is a data mining tool) • Advanced query processing and OLAP – e.g., National Virtual Observatory (NVO) • Machine learning (compute-intensive statistical methods) • Supervised – classification, prediction • Unsupervised – clustering • Computational infrastructure needed at all levels – collections management, information integration, high-performance database systems, web services, grid services, scientific workflows, the global IT grid

  5. The Case for Data Mining: Data Reality • Deluge from new sources • Remote sensing • Microarray processing • Wireless communication • Simulation models • Instrumentation – microscopes, telescopes • Digital publishing • Federation of collections • “5 exabytes (5 million terabytes) of new information was created in 2002” (source: UC Berkeley researchers Peter Lyman and Hal Varian) • This is the result of a recent paradigm shift: from hypothesis-driven data collection to data mining • Data destination: Legacy archives and independent collection activities

  6. Knowledge Discovery Process [Layered stack, bottom to top: Data Collection → Processing/Cleansing/Corrections → Management/Federation/Warehousing → Analysis/Modeling → Presentation/Visualization → Knowledge → Application/Decision Support] “Data is not information; information is not knowledge; knowledge is not wisdom.” Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, July 2004.

  7. Characteristics of Data Mining Applications • Data: • Lots of data, numerous sources • Noisy – missing values, outliers, interference • Heterogeneous – mixed types, mixed media • Complex – scale, resolution, temporal, spatial dimensions • Relatively little domain theory, few quantitative causal models • Lack of valid ground truth • Advice: don’t choose problems that have all these characteristics …

  8. Scientific vs. Commercial Data Mining Goals: • Science – Theories: Need for insight and theory-based models, interpretable model structures, generate domain rules or causal structures, support for theory development • Commercial – Profits: black boxes OK Types of data: • Science – Images, sensors, simulations • Commercial - Transaction data • Both - Spatial and temporal dimensions, heterogeneous Trend – Common IT (information technology) tools fit both enterprises • Database systems (Oracle, DB2, etc), integration tools (Information Integrator), web services (Blue Titan, .NET) • This is good!

  9. Introduction to Machine Learning • Basic machine learning theory • Concepts and feature vectors • Supervised and unsupervised learning • Model development • training and testing methodology • model validation • overfitting • confusion matrices • Survey of algorithms • Decision tree classification • k-means clustering • Hierarchical clustering • Bayesian networks and probabilistic inference • Support vector machines

  10. Basic Machine Learning Theory Basic inductive learning hypothesis: • Given a large number of observations, we can approximate the rule that describes how the data was generated, and thus build a model (using some algorithm) No Free Lunch Theorem: • There is no ultimate algorithm: in the absence of prior information about the problem, there is no reason to prefer one learning algorithm over another. Conclusion: • There is no problem-independent “best” learning system. Formal theory and algorithms are not enough. • Machine learning is an empirical subject.

  11. Concepts are described as feature vectors Example: vehicles • Has wheels • Runs on gasoline • Carries people • Flies • Weighs less than 500 pounds Boolean feature vectors for vehicles • car254 [ 1 1 1 0 0 ] • motorcycle14 [ 1 1 1 0 1 ] • airplane132 [ 1 1 1 1 0 ]

  12. Easy to generalize to complex data types: • Number of wheels • Fuel type • Carrying capacity • Flies • Weight car254 [ 4, gas, 6, 0, 2000 ] motorcycle14 [ 2, gas, 2, 0, 400 ] airplane132 [ 10, jetfuel, 110, 1, 35000 ] Most machine learning algorithms expect feature vectors, stored in text files or databases Suggestions: • Identify the target concept • Organize your data to fit the feature vector representation • Design your database schemas to support generation of data in this format
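As a purely illustrative sketch of this representation, here is how the vehicle examples might be encoded in Python with NumPy. The fuel-type mapping is an assumption made for the example, since categorical features must be made numeric before most learners can use them:

```python
import numpy as np

# Hypothetical numeric encoding of the vehicle examples above;
# categorical fuel types are mapped to numbers: gas=0, jetfuel=1.
FUEL = {"gas": 0, "jetfuel": 1}

# Feature order: [wheels, fuel, capacity, flies, weight_lbs]
car254       = np.array([ 4, FUEL["gas"],       6, 0,  2000], dtype=float)
motorcycle14 = np.array([ 2, FUEL["gas"],       2, 0,   400], dtype=float)
airplane132  = np.array([10, FUEL["jetfuel"], 110, 1, 35000], dtype=float)

# Stack into the (n_samples, n_features) matrix most ML libraries expect.
X = np.vstack([car254, motorcycle14, airplane132])
print(X.shape)  # (3, 5)
```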

  13. Supervised vs. Unsupervised Learning Supervised – Each feature vector belongs to a class (label). Labels are given externally, and algorithms learn to predict the label of new samples/observations. Unsupervised – Finds structure in the data, by clustering similar elements together. No previous knowledge of classes needed.

  14. Model development • Training and testing • Model validation • Hold-out validation (2/3, 1/3 splits) • Cross validation, simple and n-fold (reuse) • Bootstrap validation (sample with replacement) • Jackknife validation (leave one out) • When possible, hide a subset of the data until train-test is complete. [Workflow: Train → Test → Apply]
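A minimal sketch of hold-out and n-fold cross-validation, using scikit-learn on synthetic data (scikit-learn is not one of the tools covered in this tutorial; it is used here only to make the methodology concrete):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))             # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels

# Hold-out validation: train on 2/3 of the data, test on the held-out 1/3.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("hold-out accuracy:", model.score(X_te, y_te))

# n-fold cross-validation (n=5): every sample is tested exactly once.
scores = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
    scores.append(m.score(X[te], y[te]))
print("5-fold mean accuracy:", np.mean(scores))
```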

  15.-16. Avoid overfitting [Plot, shown in two animation steps: training and test error versus tree depth; training error keeps falling as the tree grows, while test error bottoms out at the optimal depth and then rises into the overfitting regime]
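A sketch of how a train-versus-test curve like this can be generated, again with scikit-learn and synthetic, deliberately noisy data (both are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(int)
y[rng.random(400) < 0.2] ^= 1   # 20% label noise, so deep trees can overfit

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=1)
for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    # Train accuracy keeps rising with depth; test accuracy peaks at a
    # modest depth and then degrades -- the overfitting regime.
    print(depth, round(t.score(X_tr, y_tr), 2), round(t.score(X_te, y_te), 2))
```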

  17. Confusion matrices

                        Predicted Negative   Predicted Positive
      Actual Negative          124                   15
      Actual Positive            8                   84

  • Accuracy = (124 + 84) / (124 + 15 + 8 + 84): “proportion of predictions correct”
  • True positive rate = 84 / (8 + 84): “proportion of positive cases correctly identified”
  • False positive rate = 15 / (124 + 15): “proportion of negative cases incorrectly classified as positive”
  • True negative rate = 124 / (124 + 15): “proportion of negative cases correctly identified”
  • False negative rate = 8 / (8 + 84): “proportion of positive cases incorrectly classified as negative”
  • Precision = 84 / (15 + 84): “proportion of predicted positive cases that were correct”
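The metrics follow directly from the four cell counts; a few lines of Python verify the arithmetic:

```python
# Cell counts from the confusion matrix above (rows = actual, cols = predicted).
TN, FP = 124, 15   # actual negatives
FN, TP = 8, 84     # actual positives

accuracy  = (TN + TP) / (TN + FP + FN + TP)  # 208/231 ~ 0.90
tpr       = TP / (FN + TP)                   # recall/sensitivity, 84/92 ~ 0.91
fpr       = FP / (TN + FP)                   # 15/139 ~ 0.11
tnr       = TN / (TN + FP)                   # specificity, 124/139 ~ 0.89
fnr       = FN / (FN + TP)                   # 8/92 ~ 0.09
precision = TP / (FP + TP)                   # 84/99 ~ 0.85
print(accuracy, tpr, fpr, tnr, fnr, precision)
```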

  18. Classification – Decision Tree [Table: training examples of Annual Precipitation vs. Ecosystem]

  19. [Decision tree, first split: Precipitation > 63? with YES/NO branches]

  20. [Decision tree grown one level: root split Precipitation > 63?; the NO branch splits again on Precipitation > 5?]

  21. Learned Model: If (Precip > 63) then “Forest”, else if (Precip > 5) then “Prairie”, else “Desert”. [Confusion matrix: actual vs. predicted Desert/Forest/Prairie counts, all on the diagonal] Classification accuracy on training data is 100%
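The learned model is just a nested rule; written out as a Python function (for illustration only):

```python
def classify(precip):
    """The decision tree learned on the slide, written as nested rules."""
    if precip > 63:
        return "Forest"
    elif precip > 5:
        return "Prairie"
    else:
        return "Desert"

print([classify(p) for p in (2, 40, 80)])  # ['Desert', 'Prairie', 'Forest']
```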

  22. Testing Set Results. Learned model applied to the test data: If (Precip > 63) then Forest, else if (Precip > 5) then Prairie, else Desert. [Confusion matrix: actual vs. predicted Desert/Forest/Prairie counts, now with off-diagonal errors] Result: Accuracy 67%. The model shows overfitting and generalizes poorly

  23. Pruning to improve generalization. Pruned Decision Tree: If (Precip < 60) then “Desert”, else [P(Forest) = .75] & [P(Prairie) = .25]. [Tree diagram: single split on Precipitation < 60?]

  24. Decision Trees Summary • Simple to understand • Works with mixed data types • Heuristic search sensitive to local minima • Models non-linear functions • Handles classification and regression • Many successful applications • Readily available tools
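As a sketch of the readily available tooling, here is the precipitation-to-ecosystem example fit with scikit-learn's decision tree, a stand-in for the tools named in this tutorial. The training data below is invented to mirror the slides' thresholds only approximately, and the max_depth cap is one simple pruning-style guard against the overfitting shown earlier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: annual precipitation -> ecosystem label.
precip = np.array([[1], [3], [10], [40], [55], [70], [90], [120]])
eco = np.array(["Desert", "Desert", "Prairie", "Prairie",
                "Prairie", "Forest", "Forest", "Forest"])

# max_depth caps tree growth, a simple guard against overfitting
# (libraries also offer post-pruning, e.g. cost-complexity pruning).
tree = DecisionTreeClassifier(max_depth=2).fit(precip, eco)
print(export_text(tree, feature_names=["precip"]))
```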

  25. Overview of Clustering • Definition: • Clustering is the discovery of classes • Unlabeled examples => unsupervised learning • Survey of Applications • Grouping of web-visit data, clustering of genes according to their expression values, grouping of customers into distinct profiles • Survey of Methods • k-means clustering • Hierarchical clustering • Expectation Maximization (EM) algorithm • Gaussian mixture modeling • Cluster analysis • Concept (class) discovery • Data compression/summarization • Bootstrapping knowledge

  26.-32. Clustering – k-Means [Animation over seven slides: k-means iterations on a Precipitation vs. Temperature scatter plot]

  33.-34. Clustering – k-Means

      Cluster   Temperature   Precipitation
      C1        70 – 85       0 – 25
      C2        35 – 60       25 – 55
      C3        50 – 80       50 – 80

  35. Clustering – k-Means [Table: the clusters above with an added Ecosystem column assigning a label to each cluster]

  36. Using k-means • Requires a priori knowledge of ‘k’ • The final outcome depends on the initial placement of the k means, so repeated runs can give inconsistent results • Sensitive to outliers, which can skew the means of their clusters • Favors spherical clusters – clusters may not match domain boundaries • Requires real-valued features
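A minimal k-means sketch with scikit-learn on synthetic (temperature, precipitation) data, loosely matching the cluster table above. Note that k is fixed up front, and n_init reruns the algorithm from several random initializations to reduce the inconsistency just described:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Synthetic (temperature, precipitation) points around three centers,
# loosely matching the cluster table above.
centers = np.array([[78, 12], [48, 40], [65, 65]])
X = np.vstack([c + rng.normal(scale=4, size=(50, 2)) for c in centers])

# k must be chosen up front; n_init reruns k-means from several random
# initializations and keeps the best, reducing initialization sensitivity.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(1))
```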

  37. Cyberinfrastructure for Data Mining • Resources – hardware and software (analysis tools and middleware) • Policies – allocating resources to the scientific community. Challenges to the traditional supercomputer model. Requirements for interactive and real-time analysis resources.

  38. NSF TeraGrid: Building Integrated National CyberInfrastructure • Prototype for CyberInfrastructure • Ubiquitous computational resources • Plug-in compatibility • National Reach: • SDSC, NCSA, CIT, ANL, PSC • High Performance Network: • 40 Gb/s backbone, 30 Gb/s to each site • Over 20 Teraflops compute power • Over 1 PB Online Storage • 8.9 PB Archival Storage

  39.-40. SDSC is a Data-Intensive Center [figure slides]

  41. SDSC Machine Room Data Architecture • 0.5 PB disk • 6 PB archive • 1 GB/s disk-to-tape • Optimized support for DB2/Oracle [Architecture diagram: WAN (30 Gb/s) and LAN (multiple GbE, TCP/IP); Blue Horizon, a 1,152-processor IBM SP at 1.7 teraflops; Power 4 database and HPSS nodes; Sun F15K; 4 TF Linux cluster; SAN (2 Gb/s SCSI, SCSI/IP or FC/IP, 30 MB/s per drive, 200 MB/s per controller); 100 TB FC GPFS disk; 400 TB FC disk cache; 50 TB local disk; database engine, data miner, and vis engine nodes; silos and tape, 6 PB, 32 tape drives, 1 GB/s disk to tape; HPSS with over 600 TB data stored] Philosophy: enable the SDSC configuration to serve the grid as a Data Center

  42. SDSC IBM Regatta - DataStar • 100+ TB Disk • Numerous fast CPUs • 64 GB of RAM per node • DB2 v8.x ESE • IBM Intelligent Miner • SAS Enterprise Miner • Platform for high-performance database, data mining, comparative IT studies …

  43. Data Mining Tools used at SDSC • SAS Enterprise Miner (Protein crystallization - JCSG) • IBM Intelligent Miner (Protein crystallization - JCSG, Corn Yield – Michigan State University, Security logs - SDSC) • CART (Protein crystallization - JCSG) • Matlab SVM package (TeraBridge health monitoring – UCSD Structural Engineering Department, North Temperate Lakes Monitoring - LTER) • PyML (Text Mining – NSDL, Hyperspectral data - LTER) • SKIDLkit by SDSC (Microarray analysis – UCSD Cancer Center, Hyperspectral data - LTER) • SVMlight (Hyperspectral data, LTER) • LSI by Telecordia (Text Mining – NSDL) • CoClustering by Fair Isaac (Text Mining – NSDL) • Matlab Bayes Net package • WEKA

  44. SKIDLkit • Toolkit for feature selection and classification • Filter methods • Wrapper methods • Data normalization • Feature selection • Support Vector Machine & Naïve Bayesian Clustering • http://daks.sdsc.edu/skidl • Will use it in the hands-on demo…
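SKIDLkit itself is not reproduced here; as a rough analogue of its recipe (normalization, filter-style feature selection, then an SVM classifier), here is a sketch using scikit-learn on synthetic microarray-like data. All names and numbers below are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 500))    # e.g. 60 samples x 500 gene expression values
y = rng.integers(0, 2, size=60)   # two classes (e.g. tumor vs. normal)
X[y == 1, :10] += 1.0             # plant 10 informative features

# Normalization, filter-style feature selection (univariate F-test),
# then an SVM classifier.
clf = make_pipeline(StandardScaler(),
                    SelectKBest(f_classif, k=10),
                    SVC(kernel="linear"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```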

  45. Survey of Applications at SDSC • Sensor networks for bridge monitoring (with Structural Engineering Dept., UCSD) • Text mining the NSDL (National Science Digital Library) collection • Hyperspectral remote sensing data for groundcover classification (with Long Term Ecological Research Network - LTER) • Microarray analysis for tumor detection (with UCSD Cancer Center)

  46. Sensor Networks for Bridge Monitoring • Task: detection & classification • Identify damaged piers based on the data stream of acceleration measurements. • Determine which sensors are key to determining bridge health. • Multi-resolution analysis for rational resource management. • Testbed: • Humboldt Bay Bridge with 8 piers. • Assumptions: • Damage only happens at the lower end of each pier (location of plastic hinge) • At most one pier is damaged at a time.

  47. Text Mining the NSDL. Processing pipeline: variously formatted documents → strip formatting → pick out content words using “stop lists” → stemming → generate term-document matrix (word count, term weighting; discard words that appear in every document or in only one) → various retrieval schemes (LSI, classification, or clustering modules). Query: for a list of words, get the documents with the highest score.
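A compressed sketch of the middle of this pipeline (stop-word removal, term weighting, a term-document matrix, and a word-list query), using scikit-learn as a stand-in for the modules named above; stemming is omitted for brevity and the documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["clustering of gene expression data",
        "support vector machines for text classification",
        "text mining of digital library metadata"]

# Stop-word removal and term weighting in one step; max_df drops terms
# that appear in (nearly) every document, min_df drops very rare ones.
vec = TfidfVectorizer(stop_words="english", max_df=0.9, min_df=1)
tdm = vec.fit_transform(docs)     # document-term matrix, one row per doc
print(tdm.shape)

# Query: for a list of words, score each document and rank.
q = vec.transform(["text classification"])
print((tdm @ q.T).toarray().ravel())  # highest score = best match
```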

  48. Hyperspectral Image Classification • Characteristics of the data • Over 200 bands • Small number of samples from a labor-intensive collection process • Collaboration with the Long Term Ecological Research Network • Tasks: • Classify the vegetation (e.g. Juniper tree, Sage, etc.) • Identify key bands • Detect spatio-temporal patterns
