Presentation Transcript


  1. From Feature Construction, to Simple but Effective Modeling, to Domain Transfer Wei Fan IBM T.J.Watson www.cs.columbia.edu/~wfan www.weifan.info weifan@us.ibm.com, wei.fan@gmail.com

  2. Feature Vector • Most data mining and machine learning models assume the following structured data: (x1, x2, ..., xk) -> y • where the xi's are independent variables and y is the dependent variable • y drawn from a discrete set: classification • y drawn from a continuous range: regression • (a minimal example follows)
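As a minimal illustration (a toy example, not from the slides), the structured form above can be written out directly; the numbers are made up in the style of the iris data:

```python
import numpy as np

# Each row is one example (x1, x2, x3); each column is an independent variable.
X = np.array([[5.1, 3.5, 1.4],
              [6.7, 3.1, 4.7],
              [6.3, 2.9, 5.6]])

# Dependent variable y: a discrete label set gives a classification task ...
y_class = np.array(["setosa", "versicolor", "virginica"])
# ... while a continuous target gives a regression task.
y_reg = np.array([0.2, 1.5, 1.8])
```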

  3. Frequent Pattern-Based Feature Construction • Data not in pre-defined feature vectors • Transactions • Biological sequences • Graph databases • Frequent patterns are good candidates for discriminative features • So, how do we mine them?

  4. A discovered pattern • FP: a sub-graph shared by several compounds (NSC 4960, NSC 699181, NSC 40773, NSC 164863, NSC 191370) • (Figure: chemical structures with the common sub-graph highlighted; example borrowed from a George Karypis presentation)

  5. Computational Issues • A pattern is measured by its "frequency" or support, e.g., frequent subgraphs with sup ≥ 10% • Cannot enumerate patterns at sup = 10% without first enumerating all patterns with support above 10% • Random sampling does not work since it is not exhaustive • NP-hard problem

  6. Conventional Procedure: Two-Step Batch Method (Feature Construction followed by Selection) • Flow: DataSet -> mine -> Frequent Patterns -> select -> Mined Discriminative Patterns (e.g., F1, F2, F4) -> represent -> binary feature table (Data1: 1 1 0, Data2: 1 0 1, Data3: 1 1 0, Data4: 0 0 1, ...) -> any classifier you can name (DT, SVM, LR, NN, ...) • Steps: 1. Mine frequent patterns (support > sup); 2. Select the most discriminative patterns; 3. Represent data in the feature space using those patterns; 4. Build classification models • (A code sketch of this pipeline follows)
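Below is a rough sketch of this two-step batch method; the toy transactions, thresholds, and helper names (mine_frequent, info_gain) are illustrative assumptions, not the authors' implementation:

```python
from itertools import combinations
from collections import Counter
import math

# Toy transaction data with binary class labels (illustrative only).
transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}, {"c", "d"}]]
labels = [1, 1, 1, 0, 0]

def mine_frequent(transactions, min_sup=0.4, max_len=2):
    """Step 1: enumerate all itemsets with support >= min_sup (naive Apriori-style scan)."""
    items = sorted(set().union(*transactions))
    frequent = []
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in transactions) / len(transactions)
            if sup >= min_sup:
                frequent.append(frozenset(cand))
    return frequent

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values()) if n else 0.0

def info_gain(pattern, transactions, labels):
    """Step 2 helper: information gain of the binary feature 'pattern occurs in transaction'."""
    covered = [y for t, y in zip(transactions, labels) if pattern <= t]
    rest = [y for t, y in zip(transactions, labels) if not pattern <= t]
    n = len(labels)
    return entropy(labels) - len(covered) / n * entropy(covered) - len(rest) / n * entropy(rest)

# Step 2: keep the most discriminative frequent patterns.
patterns = sorted(mine_frequent(transactions),
                  key=lambda p: info_gain(p, transactions, labels), reverse=True)[:3]

# Step 3: represent each transaction as a binary vector over the selected patterns;
# Step 4: any classifier (DT, SVM, LR, ...) can then be trained on X and labels.
X = [[int(p <= t) for p in patterns] for t in transactions]
```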

  7. Two Problems • Mine step: (1) exponential/combinatorial explosion of candidate patterns; (2) patterns are not considered at all if min-support isn't small enough • (Figure: DataSet -> mine -> Frequent Patterns 1-7, ...)

  8. Two Problems (cont.) • Select step: issue of discriminative power: (3) InfoGain is evaluated against the complete dataset, NOT on subsets of examples; (4) correlation among patterns is not directly evaluated on their joint predictability • (Figure: Frequent Patterns -> select -> Mined Discriminative Patterns 1, 2, 4)

  9. Direct Mining & Selection via Model-based Search Tree • Basic flow (divide-and-conquer based frequent pattern mining): at the root, mine frequent patterns on the node's data with a local support threshold (P = 20%) and select the most discriminative feature based on InfoGain; split the data on that feature (Y/N) and repeat at each child node (nodes 1-7, ...); stop a branch when only few data remain • Because support is applied locally, the effective global support can be as small as 10 × 20% / 10000 = 0.02% • The result is both a feature miner and a classifier: a compact set of highly discriminative patterns • (A recursive sketch follows)
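A rough sketch of that basic flow, reusing the mine_frequent and info_gain helpers from the two-step sketch above; the parameter names and stopping rule are illustrative assumptions:

```python
def mbt(transactions, labels, local_sup=0.2, min_node=4, selected=None):
    """Model-based search tree sketch: mine & select locally, then divide and conquer."""
    if selected is None:
        selected = []
    if len(transactions) < min_node or len(set(labels)) == 1:
        return selected                      # few data (or a pure node): stop this branch
    # Mine & select on *this node's* examples only; keep only patterns that actually split the node.
    candidates = [p for p in mine_frequent(transactions, min_sup=local_sup)
                  if 0 < sum(p <= t for t in transactions) < len(transactions)]
    if not candidates:
        return selected
    best = max(candidates, key=lambda p: info_gain(p, transactions, labels))
    selected.append(best)
    # Divide: split on presence/absence of the selected pattern, then recurse on both parts.
    for keep in (True, False):
        part = [(t, y) for t, y in zip(transactions, labels) if (best <= t) == keep]
        if part:
            ts, ys = zip(*part)
            mbt(list(ts), list(ys), local_sup, min_node, selected)
    return selected                          # compact set of highly discriminative patterns
```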

  10. Analyses (I) • Scalability of pattern enumeration • Upper bound (Theorem 1) • "Scale-down" ratio • Bound on the number of returned features

  11. Analyses (II) • Subspace pattern selection: original set vs. subset • Non-overfitting • Optimality under exhaustive search

  12. Experimental Studies: Itemset Mining (I) • Scalability comparison • (Figure: the model-based search tree flow from slide 9 shown next to the scalability results)

  13. Experimental Studies: Itemset Mining (II) • Accuracy of mined itemsets: 4 wins, 1 loss, but with a much smaller number of patterns

  14. Experimental Studies: Itemset Mining (III) • Convergence

  15. Experimental Studies: Graph Mining (I) • 9 NCI anti-cancer screen datasets • The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov • Active (positive) class: around 1%-8.3% • 2 AIDS anti-viral screen datasets • URL: http://dtp.nci.nih.gov • H1: CM+CA, 3.5% active • H2: CA, 1% active

  16. Experimental Studies: Graph Mining (II) • Scalability • (Figure: the model-based search tree flow from slide 9 shown next to the scalability results on the graph datasets)

  17. Experimental Studies: Graph Mining (III) • AUC and accuracy: AUC 11 wins; accuracy 10 wins, 1 loss

  18. Experimental Studies: Graph Mining (IV) • AUC of MbT and DT • MbT vs. benchmarks: 7 wins, 4 losses

  19. Summary • Model-based Search Tree • Integrated feature mining and construction • Dynamic support: can mine patterns with extremely small support • Both a feature constructor and a classifier • Not limited to one type of frequent pattern: plug-and-play • Experiment results: itemset mining, graph mining • New: found a DNA sequence not previously reported, but one that can be explained biologically • Code and datasets available for download

  20. How to train models? • List of methods: logistic regression, probit models, naïve Bayes, kernel methods, linear regression, RBF, mixture models • Even though the true distribution is unknown, we still assume the data is generated by some known function, and estimate the parameters inside that function via training data or cross-validation on the training data • Once the structure is fixed, learning becomes optimization to minimize errors: quadratic loss, exponential loss, slack variables • There will probably always be mistakes unless the chosen model indeed generates the distribution and the data is sufficient to estimate its parameters • But what if you don't know which model to choose, or use the wrong one? • (Figure: Models 1-6 fit to some unknown distribution)

  21. How to train models II • List of methods: decision trees, RIPPER rule learner, CBA (association rules), clustering-based methods, ... • Not quite sure of the exact function, so use a family of "free-form" functions given some "preference criteria" • Preference criteria: the simplest hypothesis that fits the data is the best • Heuristics: info gain, Gini index, Kearns-Mansour, etc.; pruning: MDL pruning, reduced-error pruning, cost-based pruning (the Gini index is sketched below) • Truth: none of these purity-check functions guarantees accuracy on unseen test data; they only try to build a smaller model • There will probably always be mistakes unless the training data is sufficiently large and the free-form function/criteria are appropriate
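For concreteness, a small sketch of one of the purity heuristics named above, the Gini index (entropy/info gain was sketched earlier); the function name and toy labels are illustrative:

```python
from collections import Counter

def gini(ys):
    """Gini index of a label multiset; 0 means a pure node, and splits are chosen to reduce it."""
    n = len(ys)
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

# A perfectly mixed two-class node is the worst case (0.5); a pure node scores 0.0.
print(gini(["+", "+", "-", "-"]), gini(["+", "+", "+", "+"]))  # 0.5 0.0
```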

  22. Can Data Speak for Themselves? • Make no assumption about the true model, neither a parametric form nor a free form • "Encode" the data in some rather "neutral" representation • Think of it like encoding numbers in the computer's binary representation: some numbers can never be represented exactly, but overall it is accurate enough • Main challenges: avoid "rote learning" (do not remember all the details) and generalize • "Evenly" representing "numbers" corresponds to "evenly" encoding the "data"

  23. Potential Advantages • If the accuracy is quite good, then the method is quite "automatic and easy" to use • No brainer: DM can be everybody's tool

  24. Encoding Data for Major Problems • Classification: given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ...), where the label is whether the transaction is a fraud or non-fraud • Label: a set of discrete values • Classifier: predict whether a transaction is a fraud or non-fraud • Probability estimation: similar to the above setting, but estimate the probability that a transaction is a fraud • Difference: no truth is given, i.e., no true probability • Regression: given a set of valued data items, such as (zipcode, capital gain, education, ...), the value of interest is annual gross income • Target value: continuous • Several other on-going problems

  25. Encoding Data in Decision Trees • (Figure: scatter plots of the iris data, petal width vs. petal length, with the setosa, versicolor, and virginica regions carved out differently by different trees) • Think of each tree as a way to "encode" the training data • Why a tree? A decision tree records some common characteristics of the data, but not every piece of trivial detail • Obviously, each tree encodes the data differently • Subjective criteria that prefer some encodings over others are always ad hoc • So do not prefer anything: just encode randomly • Minimize the difference by producing multiple encodings and then "averaging" them

  26. Random Decision Tree to Encode Data (classification, regression, probability estimation) • At each node, an un-used feature is chosen randomly • A discrete feature is un-used if it has never been chosen previously on the decision path from the root to the current node • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen

  27. Continued • We stop when one of the following happens: • a node becomes too small (<= 3 examples), or • the total height of the tree exceeds some limit, such as the total number of features • (A construction sketch follows)
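A rough construction sketch following the two slides above; the data layout, the feature-type encoding, and all names are illustrative assumptions rather than the authors' code:

```python
import random
from collections import Counter

def build_rdt(X, y, feature_types, used=frozenset(), depth=0):
    """X: list of example vectors; y: labels; feature_types[i] in {'discrete', 'continuous'}."""
    # Stop when the node becomes too small or the height exceeds the number of features.
    if len(X) <= 3 or depth >= len(feature_types):
        return {"counts": Counter(y)}                  # leaf: keep node statistics for P(y|x)
    # A discrete feature is un-used if not chosen earlier on this path;
    # a continuous feature may be chosen again, each time with a fresh random threshold.
    candidates = [i for i, t in enumerate(feature_types)
                  if t == "continuous" or i not in used]
    if not candidates:
        return {"counts": Counter(y)}
    f = random.choice(candidates)                      # chosen randomly: no purity check at all
    if feature_types[f] == "continuous":
        thr = random.uniform(min(x[f] for x in X), max(x[f] for x in X))
        go_left = [x[f] < thr for x in X]
        used_next = used
    else:
        thr = None
        go_left = [x[f] == 0 for x in X]               # assumes binary {0,1} discrete features
        used_next = used | {f}
    left = [(x, c) for x, c, g in zip(X, y, go_left) if g]
    right = [(x, c) for x, c, g in zip(X, y, go_left) if not g]
    if not left or not right:
        return {"counts": Counter(y)}                  # degenerate split: make it a leaf
    lx, ly = zip(*left)
    rx, ry = zip(*right)
    return {"feature": f, "threshold": thr,
            "left": build_rdt(list(lx), list(ly), feature_types, used_next, depth + 1),
            "right": build_rdt(list(rx), list(ry), feature_types, used_next, depth + 1)}
```

For regression, a leaf would instead store the average target value of the examples that reach it, as on slide 30.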

  28. Illustration of RDT • Features: B1: {0,1}, B2: {0,1}, B3: continuous • At the root, B1 is chosen randomly; further down the tree, B2 and B3 are chosen randomly, and each time B3 is picked a new random threshold is drawn (e.g., 0.3, then 0.6) • (Figure: the example tree built over B1, B2, and B3)

  29. Classification • (Figure: the iris tree: Petal.Length < 2.45 gives setosa (50/0/0); otherwise Petal.Width < 1.75 gives versicolor (0/49/5), else virginica (0/1/45)) • At the leaf with class counts 0/49/5 (54 examples), the posterior estimates are P(setosa|x,θ) = 0, P(versicolor|x,θ) = 49/54, P(virginica|x,θ) = 5/54

  30. Regression • (Figure: the same tree structure, but leaves store values such as Height = 10 in, 12 in, 15 in) • A leaf predicts the average value of all examples in that leaf node, e.g., 15 in

  31. Prediction • Simply average over multiple trees • (A small sketch follows)
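Continuing the build_rdt sketch above, prediction by averaging might look like the following (an illustrative sketch, not the authors' code):

```python
def tree_posterior(node, x):
    """P(y | x, theta) for one random tree, read off the leaf's node statistics."""
    if "counts" in node:                               # leaf
        total = sum(node["counts"].values())
        return {c: n / total for c, n in node["counts"].items()}
    f, thr = node["feature"], node["threshold"]
    go_left = x[f] < thr if thr is not None else x[f] == 0
    return tree_posterior(node["left" if go_left else "right"], x)

def rdt_predict(trees, x):
    """Average P(y | x, theta_i) over all trees; the argmax gives the class label."""
    classes = {c for t in trees for c in tree_posterior(t, x)}
    return {c: sum(tree_posterior(t, x).get(c, 0.0) for t in trees) / len(trees)
            for c in classes}
```

For regression, the same traversal would return each leaf's average target value, and those per-tree values would be averaged in the same way.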

  32. Potential Advantages • Training can be very efficient, particularly for very large datasets • No cross-validation-based estimation of parameters, as some parametric methods require • Natural multi-class probability • Natural multi-label classification and probability estimation • Imposes very little about the structure of the model

  33. Reasons • The true distribution P(y|X) is never known (is it an elephant?) • Every random tree is not a random guess of this P(y|X): its structure is random, but its "node statistics" are not • Every random tree is consistent with the training data • Each tree is quite strong, not weak • In other words, if the distribution is the same, each random tree itself is a rather decent model

  34. Expected Error Reduction • Proven for quadratic loss, such as: probability estimation, (P(y|X) - P(y|X, θ))², and regression, (y - f(x))² • General theorem: the "expected quadratic loss" of RDT (and any other model averaging) is less than that of any combined model chosen "at random" • (See the inequality below)
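A compact way to state that claim, using the standard convexity (Jensen's inequality) argument; this is a paraphrase rather than the slide's exact theorem:

```latex
% For any target t (a true probability P(y|X) or a regression value y) and
% per-model predictions f_\theta(x), convexity of z \mapsto z^2 gives
\bigl( t - \mathbb{E}_{\theta}[\, f_{\theta}(x) \,] \bigr)^{2}
  \;\le\; \mathbb{E}_{\theta}\bigl[ \bigl( t - f_{\theta}(x) \bigr)^{2} \bigr],
% i.e., the squared error of the averaged model never exceeds the expected
% squared error of a single model drawn at random from the ensemble.
```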

  35. Theorem Summary

  36. Number of trees • Sampling theory: the random decision trees can be thought of as samples from a large (infinite, when continuous features exist) population of trees • Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance; in most cases, 10 trees are usually enough • (See the bound below)
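Under the simplifying assumption that the per-tree estimates behave roughly like independent, identically distributed samples around the ensemble mean (an idealization, since all trees share the same training data), the usual sampling-theory bound is:

```latex
% If each tree's estimate has variance \sigma^2 around the ensemble mean, then
\operatorname{Var}\!\left( \frac{1}{N} \sum_{i=1}^{N} P(y \mid x, \theta_i) \right)
  \approx \frac{\sigma^{2}}{N},
% which shrinks quickly: N = 10 is often adequate, and 30 to 50 trees leave
% little residual variance.
```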

  37. Variance Reduction

  38. Optimal Decision Boundary from Tony Liu’s thesis (supervised by Kai Ming Ting)

  39. RDT looks like the optimal boundary

  40. Regression Decision Boundary (GUIDE) • Properties • Broken and discontinuous • Some points are far from the truth • Some wrong ups and downs

  41. RDT Computed Function • Properties • Smooth and continuous • Close to the true function • All ups and downs caught

  42. Hidden Variable

  43. Hidden Variable • Limitations of GUIDE • Need to decide grouping variables and independent variables: a non-trivial task • If all variables are categorical, GUIDE becomes a single CART regression tree • Strong assumptions and greedy search can sometimes lead to very unexpected results

  44. It grows like …

  45. ICDM’08 Cup Crown Winner • Nuclear ban monitoring • The RDT-based approach is the highest award winner

  46. Ozone Level Prediction (ICDM’06 Best Application Paper) • Daily summary maps of two datasets from Texas Commission on Environmental Quality (TCEQ)

  47. SVM: 1-hr criteria CV

  48. AdaBoost: 1-hr criteria CV

  49. SVM: 8-hr criteria CV

  50. AdaBoost: 8-hr criteria CV
