
KDD-2001 Cup The Genomics Challenge

KDD-2001 Cup The Genomics Challenge. Advisor : Dr. Hsu Graduate : Min-Hong Lin IDSL seminar. Outline. Motivation Objective KDD Cup 2001 Report Task1:Thrombin Result Task2:Predicting Function Task3:Localization Conclusions Personal Opinion. Motivation.


Presentation Transcript


  1. KDD-2001 Cup: The Genomics Challenge • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar

  2. Outline • Motivation • Objective • KDD Cup 2001 Report • Task 1: Thrombin Result • Task 2: Predicting Function • Task 3: Localization • Conclusions • Personal Opinion IDSL

  3. Motivation • Interest in mining biological databases is growing rapidly. • Bioinformatics datasets are typically under-determined: • a very large number of features (complex domain) • a small number of instances (high cost per data point)

  4. Objective • KDD Cup 2001 focused on mining biological databases. Its tasks related to • drug design • genomics

  5. Dataset 1: Prediction of Molecular Bioactivity for Drug Design (Binding to Thrombin) • Dataset provided by DuPont Pharmaceuticals • Activity of compounds binding to thrombin • Training data (library of compounds): • 1909 known molecules (42 actively binding thrombin) • 139,351 binary features describing the 3-D structure of each compound • Test data: 634 new compounds with unknown capacity to bind thrombin

  6. Dataset 2: Prediction of Gene/Protein Function and Localization • Yeast genome dataset • Data on protein-protein interactions from the MIPS database (Munich Information Centre for Protein Sequences) • The genes that encode 6449 yeast proteins are already known, but only 52% of these proteins have been characterized. • Relational dataset • Gene information • Interaction information • Goal: predict the function and localization of unknown proteins

  7. Statistics: I. Participation • 136 groups participated (200 submissions) • Almost a 5-fold increase over previous years • More than half of the entries came from the commercial sector

  8. Statistics: II. Data Mining Software • Mostly custom software was used • Especially for Task 1, where the number of features was too large for most commercial systems • This gap points to a need for commercial tools that can cope with bioinformatics datasets

  9. Statistics: III. Algorithms • Feature selection was used in almost 70% of the entries for Task 1 • Ensemble classifiers based on more than one algorithm were used extensively • Decision trees were among the most commonly used algorithms, along with Naïve Bayes and k-NN • Cross-validation was used to deal with the small dataset size

  10. KDD-2001 Cup Winners • Task 1: Jie Cheng, CIBC (Canadian Imperial Bank of Commerce) • Task 2: Mark-A. Krogel, Magdeburg Univ. • Task 3: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo

  11. Task 1: Thrombin Result • Objective • Prediction of molecular bioactivity for drug design: binding to thrombin • Data • Training: 1909 cases (42 positive), 139,351 binary features • Test: 634 cases • Challenge • Highly imbalanced, high-dimensional data; the test set follows a different distribution from the training set • Approach • Bayesian network predictive model

  12. Bayesian Network • A Bayesian network $B = \langle N, A, \Theta \rangle$ is a directed acyclic graph (DAG) $\langle N, A \rangle$ with parameters $\Theta$ • Each node $n \in N$ represents a domain variable • Each arc $a \in A$ between nodes represents a probabilistic dependency • The dependencies are quantified using a conditional probability distribution (CP table) $\theta_i \in \Theta$ for each node $n_i$ • A major advantage of BNs is that the network structure represents the inter-relationships among the dataset attributes.

  13. Bayesian network structure of ‘Adult’ data • Two ways to view it: • It represents the joint probability distribution of the attributes • It encodes the conditional independence relationships among the nodes
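
The first view can be written out as the standard Bayesian network factorization. This is a textbook identity rather than something stated on the slide; the notation $\mathrm{pa}(n_i)$ for the parents of node $n_i$ in the DAG is introduced here for illustration.

```latex
P(n_1, \ldots, n_k) \;=\; \prod_{i=1}^{k} P\bigl(n_i \mid \mathrm{pa}(n_i)\bigr)
```

Each factor on the right corresponds to one CP table, so a sparse graph means a compact representation of the joint distribution.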

  14. Our approach to the Thrombin data • Pre-processing: feature subset selection using mutual information (200 of 139,351 features kept)
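
A minimal sketch of mutual-information filtering, assuming binary features stored as Python lists and a binary activity label. The function names and the toy data are hypothetical, and this is not the winning entry's actual code; it only illustrates ranking features by mutual information with the label and keeping the top k.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def select_top_k(columns, labels, k):
    """Indices of the k columns with the highest mutual information with labels."""
    scored = [(mutual_information(col, labels), j) for j, col in enumerate(columns)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [j for _, j in scored[:k]]


# Toy data (hypothetical): three binary features over six compounds.
labels = [1, 1, 0, 0, 0, 0]
columns = [
    [1, 1, 0, 0, 0, 0],  # identical to the label
    [1, 0, 0, 0, 1, 0],  # weakly related to the label
    [0, 0, 0, 0, 0, 0],  # constant, carries no information
]
top = select_top_k(columns, labels, k=2)
```

On the real data the same ranking would be applied to all 139,351 columns with k = 200.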

  15. Learning and evaluating BN models • The BN PowerPredictor system allows users to control the complexity of the learned network by adjusting a threshold value. • The system allows users to choose between two commonly used performance measures: • The prediction accuracy • The area under the ROC curve (AUC) • Five candidate models were generated from the preprocessed training data set • Each of the five candidates had between 2 and 12 features.

  16. Learning and evaluating BN models (continued) • [Figure: network diagram with node labels Activity, 10695, 91839, 16794, 79651] • We used each candidate to classify the training set and measured its AUC score • We then picked the simplest model that had a “decent” AUC score
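
The selection step can be sketched as follows. This is an illustration only: AUC is computed with the standard pairwise-ranking definition, and the `min_auc` threshold and the candidate list are hypothetical stand-ins for what the slide calls a “decent” score.

```python
def auc_score(scores, labels):
    """AUC as the probability that a random positive case is scored above
    a random negative case; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pick_simplest(candidates, min_auc):
    """candidates: list of (n_features, auc). Return the candidate with the
    fewest features whose AUC reaches min_auc, or None if none qualifies."""
    decent = [c for c in candidates if c[1] >= min_auc]
    return min(decent, key=lambda c: c[0]) if decent else None


# Toy scores (hypothetical): two actives among six cases.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
auc = auc_score(scores, labels)

# Hypothetical candidates: (number of features, training AUC).
chosen = pick_simplest([(2, 0.80), (5, 0.91), (12, 0.93)], min_auc=0.9)
```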

  17. Classifying the test set • Using the chosen model, we computed the posterior probability of each instance in the test dataset. • We then had to decide on a cut point for classifying the test cases as either active or inactive • 8 possible cut points to choose from: 32, 71, 72, 74, 75, 215, 223, 550 • We decided to classify 223 cases as active
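
One plausible reading of the cut-point step, sketched below: a model over a handful of binary features produces only a few distinct posterior values, so the natural cut points are the counts of cases at or above each distinct value. This interpretation, the function names, and the toy posteriors are assumptions, not details given on the slide.

```python
def candidate_cut_points(posteriors):
    """For each distinct posterior value, the number of cases scoring at least
    that value. Each count is one possible 'classify the top k as active' cut."""
    distinct = sorted(set(posteriors), reverse=True)
    return [sum(1 for p in posteriors if p >= v) for v in distinct]


def classify_top_k(posteriors, k):
    """Label the k cases with the highest posterior as active (1), the rest inactive (0)."""
    order = sorted(range(len(posteriors)), key=lambda i: -posteriors[i])
    active = set(order[:k])
    return [1 if i in active else 0 for i in range(len(posteriors))]


# Toy posteriors (hypothetical): eight test cases, four distinct values.
posteriors = [0.9, 0.9, 0.6, 0.6, 0.6, 0.2, 0.2, 0.05]
cuts = candidate_cut_points(posteriors)
predicted = classify_top_k(posteriors, k=cuts[1])
```

Choosing among the candidate cuts remains a judgment call when, as here, no labeled validation data from the test distribution is available.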

  18. Analyzing the result • Accuracy: 0.711 • Weighted accuracy: 0.684 • [Figure: ROC curve, sensitivity vs. 1-specificity]
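
The two scores can be illustrated with a short sketch. The slide does not define weighted accuracy; the code assumes it is the mean of the accuracy on active cases and the accuracy on inactive cases, which keeps the majority class from dominating on imbalanced data. The toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (assumed definition, see lead-in)."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(1 for t, p in pos if p == 1) / len(pos)  # accuracy on actives
    tnr = sum(1 for t, p in neg if p == 0) / len(neg)  # accuracy on inactives
    return (tpr + tnr) / 2


# Toy labels (hypothetical): two actives among eight cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
wacc = weighted_accuracy(y_true, y_pred)
```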

  19. Conclusions • The combination of information-gain-based feature filtering and Bayesian-net-based feature selection is a novel, effective approach for analyzing high-dimensional data. • We gained awareness of the overfitting problem when out-of-sample validation is impossible, especially when the sample size is small. • When a well-defined cost function is not available, one should carefully choose performance measures that are independent of the cost function, such as AUC.

  20. Task 2: Gene/Protein Function Prediction • RELAGGS (a multirelational learning algorithm) was developed at Magdeburg University • RELAGGS is intended to deal with relational data • RELAGGS had previously been tested on relational datasets from financial domains

  21. Preprocessing with SQL • General: renormalize into multiple tables as a natural representation of the data • The genes_relation table contained 862 training examples and 381 test examples. • Specific to KDD Cup Tasks 2/3: consider only interactions with high correlations, assume transitivity, and make symmetry explicit
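
The three interaction-specific steps can be sketched in Python rather than SQL. Several details are assumptions made for illustration: the threshold value, the use of the absolute correlation, and the choice to apply transitivity for a single round only (neighbors of neighbors). The slide does not specify any of these.

```python
def preprocess_interactions(rows, min_abs_corr):
    """rows: iterable of (gene_a, gene_b, correlation).
    Returns a set of directed pairs after filtering, symmetrizing, and
    one round of transitive extension."""
    pairs = set()
    for a, b, corr in rows:
        if a != b and abs(corr) >= min_abs_corr:  # keep only strong interactions
            pairs.add((a, b))
            pairs.add((b, a))  # make symmetry explicit

    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)

    extended = set(pairs)
    for a, nbrs in neighbors.items():
        for b in nbrs:
            for c in neighbors.get(b, ()):
                if c != a:
                    extended.add((a, c))  # a-b and b-c imply a-c
    return extended


# Toy interaction rows (hypothetical gene identifiers and correlations).
rows = [("G1", "G2", 0.9), ("G2", "G3", 0.8), ("G3", "G4", 0.1)]
result = preprocess_interactions(rows, min_abs_corr=0.5)
```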

  22. Preprocessing with RELAGGS • RELAGGS takes a description of the tables as input • It uses the foreign link information to compute join definitions • It automatically transforms multiple tables into a single table with the help of aggregate functions • The result is passed to a propositional learner such as C4.5 or SVMlight
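
The idea of collapsing a one-to-many relation into a single table with aggregate functions can be sketched as below. This is not RELAGGS itself; the function name, the column names, and the choice of aggregates (count, min, max, mean) are hypothetical and only illustrate the general technique.

```python
from collections import defaultdict


def propositionalize(genes, interactions):
    """genes: dict gene_id -> dict of single-valued attributes.
    interactions: list of (gene_id, partner_id, correlation).
    Returns one flat row per gene with aggregated interaction columns."""
    by_gene = defaultdict(list)
    for gene_id, _partner, corr in interactions:
        by_gene[gene_id].append(corr)

    table = {}
    for gene_id, attrs in genes.items():
        values = by_gene.get(gene_id, [])
        row = dict(attrs)
        row["n_interactions"] = len(values)
        row["corr_min"] = min(values) if values else None
        row["corr_max"] = max(values) if values else None
        row["corr_avg"] = sum(values) / len(values) if values else None
        table[gene_id] = row
    return table


# Toy tables (hypothetical identifiers and values).
genes = {"G1": {"essential": "yes"}, "G2": {"essential": "no"}}
interactions = [("G1", "G2", 0.5), ("G1", "G3", 1.0)]
flat = propositionalize(genes, interactions)
```

Each row of the flat table has a fixed set of columns, which is what a propositional learner expects.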

  23. Data Mining with SVMlight • An SVMlight run on the RELAGGS output produced model files from the training genes and prediction files for the test genes

  24. Postprocessing with SQL • The predictions for single functions and localizations had to be integrated into a final solution

  25. Conclusion • From 10-fold cross-validation: • Accuracies: 92.9% on Task 2, 72.5% on Task 3 • From the Cup organizers: • Accuracies: 93.6% on Task 2 (rank 1), 69.8% on Task 3 (rank 4)

  26. Task 3: Localization • Task • Predict the localization of a given gene in a cell among 15 distinct positions • Data • Relation table with six categorical attributes: Essential, Class, Complex, Phenotype, Motif, Chromosome Number • Interaction matrix listing all the interactions between genes • Training: 862 genes • Test: 381 genes

  27. Characteristics of the Dataset • Dataset 2 has three interesting features: • It contains many missing values • The domain of the objective attribute contains 15 non-ordered values • It is a mixture of two types of data

  28. Coping with Missing Values • Class, Complex, and Motif are highly correlated with localization • With regard to the binary interaction relationship: • Genes that interact with the gene in focus are usually located in the same part of the cell • We compensated for the missing information by using the three attributes and the binary interaction relationship

  29. Different Test Approaches • We applied three independent approaches to the data analysis: • Decision trees with correlated association rules • AdaBoost (boosting correlated association rules) • Nearest neighbor method • The nearest neighbor method worked best on the training dataset

  30. Nearest Neighbor Analysis • Attribute Agreement of Records • Records $r_1$ and $r_2$ in $R$ are said to agree on attribute $f_i$ if $r_1[f_i]$ and $r_2[f_i]$ share some common elements: • $r_1[f_i] \cap r_2[f_i] \neq \emptyset$
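
A minimal sketch of the agreement test, assuming each record is a dict that maps an attribute name to a set of values. The representation and the treatment of a missing attribute as an empty set (which never agrees with anything) are assumptions made for illustration.

```python
def agree(r1, r2, attribute):
    """True if the two records share at least one value on the attribute."""
    return bool(set(r1.get(attribute, ())) & set(r2.get(attribute, ())))


# Toy records (hypothetical attribute values).
gene_a = {"Complex": {"c1", "c2"}, "Motif": {"m7"}}
gene_b = {"Complex": {"c2"}}
same_complex = agree(gene_a, gene_b, "Complex")
same_motif = agree(gene_a, gene_b, "Motif")
```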

  31. Binary interaction relation

  32. Neighbors • [Figure: Gene1, Gene2, Gene3, Gene4] • Two records are neighbors if they agree with respect to certain attributes.

  33. Nearest Neighbor Assignment by Prioritizing Attributes • A single attribute may not be sufficient for accurate prediction in cases where the number of neighbors is large • Ex: Complex -> Class • (G235065 agrees with G234126, located in the cytoplasm)

  34. Computing the nearest neighborhood • We denote the final answer $N_m$ as $NN(r, R_{train}, [g_1, \ldots, g_m])$

  35. Classification by Nearest Neighborhood Analysis • Let $obj$ be an objective attribute, such as Localization, and let $D_{obj}$ be its domain. • The objective value $r[obj]$ of $r$ is the majority objective value among the nearest neighbors in $NN(r, R_{train}, [g_1, \ldots, g_m])$
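
The prioritized neighborhood and the majority vote can be sketched as follows. The slides do not spell out how the neighborhood is narrowed, so the rule used here is an assumption: start from all training records, filter by each attribute in priority order, and keep the previous set whenever filtering would leave no neighbors. Records are dicts mapping attribute names to sets of values, with the objective attribute stored as a plain string; all names and toy values are hypothetical.

```python
from collections import Counter


def agree(r1, r2, attribute):
    """True if the two records share at least one value on the attribute."""
    return bool(set(r1.get(attribute, ())) & set(r2.get(attribute, ())))


def nearest_neighborhood(r, train, priority):
    """Narrow the training set attribute by attribute, in priority order."""
    current = list(train)
    for attribute in priority:
        narrowed = [t for t in current if agree(r, t, attribute)]
        if narrowed:  # assumed fallback: skip attributes that leave no neighbors
            current = narrowed
    return current


def classify(r, train, priority, obj="Localization"):
    """Majority objective value among the nearest neighbors of r."""
    votes = Counter(t[obj] for t in nearest_neighborhood(r, train, priority))
    return votes.most_common(1)[0][0]


# Toy training records (hypothetical values).
train = [
    {"Complex": {"c1"}, "Class": {"k1"}, "Localization": "nucleus"},
    {"Complex": {"c1"}, "Class": {"k2"}, "Localization": "cytoplasm"},
    {"Complex": {"c1"}, "Class": {"k2"}, "Localization": "cytoplasm"},
    {"Complex": {"c2"}, "Class": {"k1"}, "Localization": "nucleus"},
]
query = {"Complex": {"c1"}, "Class": {"k1"}}
by_complex_only = classify(query, train, ["Complex"])
by_complex_then_class = classify(query, train, ["Complex", "Class"])
```

In the toy data, Complex alone leaves three neighbors and the vote goes to the majority among them, while adding Class as a second attribute narrows the set to the single closest record.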

  36. Computing Optimal Priority • Theorem: It is NP-hard to compute the $[g_1, \ldots, g_m]$ that optimizes $accuracy(R_{test}, R_{train}, [g_1, \ldots, g_m])$ • A branch-and-bound search technique is used to solve the optimization problem. • Proof

  37. Experimental Results • The priority list $P_{max}$ = [Complex, Class, Interaction, Motif] • $Accuracy(S_{train}, S_{train}, P_{max})$ = 79% • $Accuracy(S_{test}, S_{train}, P_{max})$ = 72%

  38. IDSL

  39. Conclusions • Lessons for mining biological databases • It is very surprising that protein interaction information was not more useful in Tasks 2 and 3 • A second lesson concerns the need to interact with the laboratory • General lessons for data mining • Bayes nets should not be rejected out of hand for pure classification tasks • Propositionalization is often a good approach to a relational learning task • Open issues: the need for improved human-computer interaction and the question of how to handle a changing distribution over data

  40. Personal Opinion • Current tools and approaches do not adequately address the Genomics Challenge • Handling missing values was the most elaborate and time-consuming step
