## Multi-Label Feature Selection for Graph Classification


**Multi-Label Feature Selection for Graph Classification**
Xiangnan Kong, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago

**Outline**

- Introduction
- Multi-Label Feature Selection for Graph Classification
- Experiments
- Conclusion

**Introduction: Graph Data**

- Conventional data mining and machine learning approaches assume data are represented as feature vectors, e.g. (x1, x2, …, xd) → y.
- In many real applications, data are not directly represented as feature vectors, but as graphs with complex structures, e.g. G(V, E, l) → y.
- Examples: chemical compounds, program flows, XML documents.

**Introduction: Graph Classification**

- Graph classification: construct a classification model for graph data.
- Example: drug activity prediction — given a set of chemical compounds labeled with their activities against one type of disease or virus, predict active / inactive for a testing compound.
- [Figure: training graphs labeled + / −, and an unlabeled testing graph.]

**Graph Classification using Subgraph Features**

- Key question: how to find a set of subgraph features in order to effectively perform graph classification?
- Pipeline: graph objects → feature vectors → classifiers. Each graph G_i is encoded as a binary vector whose j-th entry is 1 if G_i contains subgraph pattern g_j, and 0 otherwise.
- [Figure: subgraph patterns g1, g2, g3 mined from chemical compounds; graphs G1, G2 encoded as 0/1 feature vectors and fed to a classifier.]

**Existing Methods for Subgraph Feature Selection**

- Feature selection for graph classification: find a set of useful subgraph features for classification.
- Existing methods select discriminative subgraph features, but focus on single-label settings: they assume one graph can only have one label (e.g. + / − for lung cancer).
- However, in many real applications, one graph can have multiple labels.
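The pipeline just described — encode each graph as a binary vector over subgraph patterns, then train an ordinary classifier — can be sketched as follows. This is a deliberately simplified illustration: graphs are plain edge sets and "contains" is an edge-subset test, whereas real systems test subgraph isomorphism against patterns mined by tools such as gSpan; all graphs and patterns below are made-up toy values.

```python
# Sketch: graphs -> binary subgraph-feature vectors (x_j = 1 iff the graph
# contains pattern g_j). Simplification: graphs are sets of labeled bonds,
# and containment is an edge-subset test, not true subgraph isomorphism.

def feature_vector(graph_edges, patterns):
    """Binary feature vector: one entry per subgraph pattern."""
    return [1 if pattern <= graph_edges else 0 for pattern in patterns]

# Toy chemical graphs as sets of labeled bonds (hypothetical examples).
G1 = {("C", "C"), ("C", "O"), ("C", "N")}
G2 = {("C", "C"), ("C", "O")}

# Toy subgraph patterns g1, g2.
patterns = [{("C", "O")}, {("C", "N")}]

X = [feature_vector(g, patterns) for g in [G1, G2]]
print(X)  # [[1, 1], [1, 0]] -- rows are graphs, columns are patterns
```

The resulting 0/1 matrix `X` can be handed to any conventional classifier; the open question the deck addresses is which patterns deserve a column.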
**Multi-Label Graphs**

- Example: anti-cancer drug prediction — one compound graph can simultaneously be labeled +Breast Cancer, −Lung Cancer, and +Melanoma.
- Other applications:
  - XML document classification (one document → multiple tags)
  - Program flow error detection (one program → multiple types of errors)
  - Kinase inhibitor discovery (one chemical → multiple types of kinases)
  - …

**Multi-Label Feature Selection for Graph Classification**

- Goal: find useful subgraph features for graphs with multiple labels, score them with an evaluation criterion F(p), and feed them to a multi-label classifier.

**Two Key Questions to Address**

- Evaluation: how to evaluate a set of subgraph features using the multiple labels of the graphs? (effective)
- Search space pruning: how to prune the subgraph search space using the multiple labels of the graphs? (efficient)

**What is a Good Feature?**

- Dependence maximization: maximize the dependence between the subgraph features and the multiple labels of the graphs.
- Assumption: graphs with similar label sets should have similar subgraph features.

**Dependence Measure**

- The Hilbert–Schmidt Independence Criterion (HSIC) [Gretton et al. 05] evaluates the dependence between input features and label vectors in kernel space, and its empirical estimate is easy to calculate:
  HSIC = tr(K_S H L H)
  - K_S: kernel matrix for graphs, built from the common subgraph features in S; K_S[i, j] measures the similarity between graphs i and j on the subgraph features (in S) they share.
  - L: kernel matrix for the label vectors in {0, 1}^Q; L[i, j] measures the similarity between the label sets of graphs i and j.
  - H = I − 11ᵀ/n: the centering matrix.

**Optimization → gHSIC Criterion**

- Objective: maximize the dependence (HSIC) between the selected features and the labels.
- The objective decomposes into a sum over all selected features, giving each candidate subgraph its own gHSIC score: with f_i ∈ {0, 1}^n the indicator vector of the i-th subgraph feature (which graphs contain it), its score is f_iᵀ H L H f_i.
- Intuition: a subgraph whose presence pattern aligns with the label structure scores high (good feature); one that ignores it scores low (bad feature).
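The gHSIC score can be illustrated numerically. This is a minimal NumPy sketch, not the authors' code: the label matrix and feature indicator are made-up toy values, and the "positive-part" upper bound shown is one valid anti-monotonic bound in the spirit of gHSIC-UB, not necessarily the paper's exact form.

```python
import numpy as np

def ghsic_score(f, M):
    """gHSIC score of one subgraph feature: f^T (H L H) f, with f in {0,1}^n."""
    return float(f @ M @ f)

n = 4                                                  # number of graphs
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], float)  # toy label matrix (Q = 2)
L = Y @ Y.T                                            # linear kernel on label vectors
H = np.eye(n) - np.ones((n, n)) / n                    # centering matrix H = I - 11^T/n
M = H @ L @ H

f = np.array([1.0, 1.0, 0.0, 0.0])  # indicator: which graphs contain this subgraph
score = ghsic_score(f, M)           # here: 1.0

# Any supergraph occurs in a subset of these graphs, so its indicator f' <= f
# elementwise, and f'^T M f' <= sum of the positive entries of M on f's support.
# This bound is anti-monotonic under pattern growth, which is what lets a
# branch-and-bound search prune whole subtrees.
support = f.astype(bool)
upper_bound = np.maximum(M[np.ix_(support, support)], 0.0).sum()
assert score <= upper_bound + 1e-9
```

If the best score found so far already meets or exceeds `upper_bound`, every extension of this pattern can be skipped — the pruning principle the deck turns to next.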
The remaining challenge is efficiency: pruning the subgraph search space using the multiple labels of the graphs.

**Finding a Needle in a Haystack**

- gSpan [Yan & Han, ICDM'02] is an efficient algorithm that enumerates all frequent subgraph patterns (frequency ≥ min_support) by growing a pattern search tree from the empty pattern ⊥ through 0-edge, 1-edge, 2-edge, … patterns, cutting off infrequent branches.
- But there are far too many frequent subgraph patterns; we want the most useful one(s) according to the multiple labels.
- How to find the best node(s) in this tree without searching all the nodes? Branch and bound, to prune the search space.

**gHSIC Upper Bound**

- gHSIC-UB: an upper bound on the gHSIC scores of all supergraphs of the current subgraph (with f_i the indicator vector of the i-th subgraph feature).
- The bound is anti-monotonic with subgraph frequency, which enables pruning.

**Pruning Principle**

- Keep the best subgraph (and its gHSIC score) found so far.
- At the current node of the pattern search tree, compute the upper bound on the scores of all patterns in its subtree.
- If best score ≥ upper bound, the entire subtree can be pruned.

**Experiment Setup**

Four methods are compared:

- Multi-label feature selection + multi-label classification: gMLC [this paper] + BoosTexter [Schapire & Singer 00]
- Multi-label feature selection + binary classification: gMLC [this paper] + BR-SVM [Boutell et al. 04] (binary relevance)
- Single-label feature selection + binary classification: BR (binary relevance) + information gain + SVM
- Top-k frequent subgraphs + multi-label classification: gSpan [Yan & Han 02] + BoosTexter [Schapire & Singer 00]

**Data Sets**

Three multi-label graph classification tasks:

- Anti-cancer activity prediction
- Toxicology prediction of chemical compounds
- Kinase inhibitor prediction

**Evaluation**

Multi-label metrics [Elisseeff & Weston, NIPS'02]:

- Ranking loss ↓: average number of label pairs ranked incorrectly (the smaller the better).
- Average precision ↑: average fraction of correct labels among the top-ranked labels (the larger the better).
- Protocol: 10 times 10-fold cross-validation.

**Experiment Results**

- [Figure: ranking loss and 1 − AvePrec on the Anti-Cancer, PTC, and Kinase Inhibition datasets, versus the number of selected features.]
- Our approach with a multi-label classifier (multi-label FS + multi-label classifier) performed best on the NCI and PTC datasets, compared with single-label FS + single-label classifiers, multi-label FS + single-label classifiers, and unsupervised FS + multi-label classifier.

**Pruning Results**

- [Figure: running time in seconds and number of subgraphs explored on the anti-cancer dataset, with and without gHSIC pruning (lower is better).]
- gHSIC pruning reduces both the running time and the number of subgraphs explored compared with unpruned search.

**Conclusions**

- Multi-label feature selection for graph classification:
  - evaluating subgraph features using the multiple labels of the graphs (effective);
  - branch-and-bound pruning of the search space using the multiple labels of the graphs (efficient).

Thank you!
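For concreteness, the two multi-label metrics reported in the experiments can be sketched as follows. This is a minimal illustration of the Elisseeff & Weston-style definitions as commonly stated, not the authors' evaluation code; ties are counted as ranking errors here, and the scores and label sets are made-up toy values.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs ranked incorrectly,
    i.e. the relevant label is not scored strictly above the irrelevant one.
    scores: per-label scores; relevant: set of true-label indices."""
    irrelevant = [j for j in range(len(scores)) if j not in relevant]
    if not relevant or not irrelevant:
        return 0.0
    bad = sum(1 for r in relevant for j in irrelevant if scores[r] <= scores[j])
    return bad / (len(relevant) * len(irrelevant))

def average_precision(scores, relevant):
    """For each relevant label, the fraction of labels ranked at or above it
    that are relevant; averaged over the relevant labels."""
    if not relevant:
        return 0.0  # undefined for empty label sets; sketch choice
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = {j: k + 1 for k, j in enumerate(order)}  # 1-based ranks, best first
    total = 0.0
    for r in relevant:
        above = sum(1 for r2 in relevant if rank[r2] <= rank[r])
        total += above / rank[r]
    return total / len(relevant)

scores = [0.9, 0.2, 0.7, 0.1]   # predicted scores for 4 labels
relevant = {0, 2}                # true label set
print(ranking_loss(scores, relevant))       # 0.0 -- smaller is better
print(average_precision(scores, relevant))  # 1.0 -- larger is better
```

Both relevant labels here outrank both irrelevant ones, so the ranking loss is 0 and the average precision is 1 — the ideal case against which the reported curves are read.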