
GGS Lecture: Knowledge discovery in large datasets


Presentation Transcript


  1. GGS Lecture: Knowledge discovery in large datasets Yvan Saeys yvan.saeys@ugent.be

  2. Overview • Emergence of large datasets • Dealing with large datasets • Dimension reduction techniques • Case study: knowledge discovery for splice site prediction • Computer exercise

  3. Emergence of large datasets • Examples: image processing, text mining, spam filtering, biological sequence analysis, micro-array data • Complexity of datasets: • Many instances (examples) • Many features (characteristics) • Many dependencies between features (correlations)

  4. Examples of large datasets • Micro-array data: • Colon cancer dataset (2000 genes, 22 samples) • Leukemia (7129 genes, 72 cell-lines) • Gene prediction data: • Homo sapiens splice site data (e.g. Genie): 5788 sequences, 90 bp • Text mining: • Hundreds of thousands of instances (documents), thousands of features

  5. Dealing with complex data • Data pre-processing • Dimensionality reduction: • Instance selection • Feature transformation/selection • Data analysis • Clustering • Classification • Requires methods that are fast and able to deal with large amounts of data

  6. Dimensionality reduction • Instance selection: • Remove identical/inconsistent/incomplete instances (e.g. reduction of homologous genes in gene prediction tasks) • Feature transformation/selection: • Projection techniques (e.g. principal component analysis) • Compression techniques (e.g. minimum description length) • Feature selection techniques

  7. Principal component analysis (PCA) • Transforms the original features of the data to a new set of variables (the principal components) that summarize the features of the data • Usually only the first 2 or 3 PCs are then used to visualize the data • Example: clustering gene expression data
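A minimal sketch of this kind of PCA-based visualization, using scikit-learn on a placeholder gene-by-timepoint matrix (random numbers standing in for real expression values, sized like the sporulation example on the next slide):

```python
# Minimal PCA sketch with scikit-learn; the expression matrix here is
# random placeholder data, not the actual sporulation dataset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(447, 7))          # 447 genes x 7 timepoints (placeholder values)

pca = PCA(n_components=2)              # keep only the first 2 principal components
scores = pca.fit_transform(X)          # each gene becomes a point in 2-D PC space

print(pca.explained_variance_ratio_)   # fraction of variance captured by PC1, PC2
print(scores[:5])                      # 2-D coordinates of the first 5 genes
```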

  8. PCA example • Principal component analysis for clustering gene expression data for sporulation in yeast (Yeung and Ruzzo, Bioinformatics 17(9), 2001) • 447 genes, 7 timepoints

  9. Feature selection techniques • In contrast to projection or compression, the original features are not changed • For classification purposes: • Goal = to find a “minimal” subset of features with “best” classification performance

  10. Feature selection for Bioinformatics • In many cases, the underlying biological process that is modeled is not yet fully understood • Which features to include? • Include as many features as possible, and hope the “relevant” ones are included • Then apply feature selection techniques to identify the relevant features • Visualization, learn something from your data (data → knowledge)

  11. Benefits of feature selection • Attain good or even better classification performance using a small subset of features • Provide more cost-effective classifiers • Fewer features to take into account → faster classifiers • Fewer features to store → smaller datasets • Gain more insight into the processes that generated the data

  12. Feature selection techniques • Filter approach • Wrapper approach • Embedded approach • [Diagram: in the filter approach, feature subset selection (FSS) is applied independently before the classification model; in the wrapper approach, an FSS search method repeatedly evaluates the classification model; in the embedded approach, the FSS is derived from the classification model's parameters]

  13. Filter methods • Independent of the classification model • Uses only the dataset of annotated examples • A relevance measure is calculated for each feature: • e.g. feature-class entropy • Kullback-Leibler divergence (cross-entropy) • Information gain, gain ratio • Features with a value lower than some threshold t will be removed

  14. Filter method: example • Feature-class entropy • Measures the “uncertainty” about the class when observing feature i • Example:
     f1 f2 f3 f4 | class
      1  0  1  1 |   1
      0  1  1  0 |   1
      1  0  1  0 |   1
      0  1  0  1 |   1
      1  0  0  0 |   0
      0  0  1  0 |   0
      1  1  0  1 |   0
      0  1  0  1 |   0
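As an illustration of such a filter score, the sketch below computes the feature-class entropy H(class | feature) for each column of the toy table above (the table layout itself is reconstructed from the slide); lower conditional entropy means the feature tells us more about the class:

```python
import numpy as np

# Toy table from the slide: 4 binary features, binary class
X = np.array([
    [1, 0, 1, 1], [0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1],   # class 1 examples
    [1, 0, 0, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 1, 0, 1],   # class 0 examples
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

def entropy(labels):
    """Shannon entropy (in bits) of a label vector."""
    counts = np.bincount(labels)
    p = counts[counts > 0] / len(labels)
    return -(p * np.log2(p)).sum()

def feature_class_entropy(x, y):
    """H(class | feature): expected class entropy after observing feature x."""
    h = 0.0
    for v in np.unique(x):
        mask = x == v
        h += mask.mean() * entropy(y[mask])
    return h

for i in range(X.shape[1]):
    print(f"f{i+1}: H(class|f{i+1}) = {feature_class_entropy(X[:, i], y):.3f}")
```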

  15. Wrapper method • Specific to a classification algorithm • The search for a good feature subset is guided by a search algorithm (e.g. greedy forward or backward) • The algorithm uses the evaluation of the classifier as a guide to find good feature subsets • Examples: sequential forward or backward search, simulated annealing, genetic algorithms

  16. Wrapper method: example • Sequential backward elimination • Starts with the set of all features • Iteratively discards the feature whose removal results in the best classification performance

  17. Wrapper method: example (2) • Full feature set: f1,f2,f3,f4 • Step 1 – remove one feature: {f2,f3,f4} = 0.7, {f1,f3,f4} = 0.8, {f1,f2,f4} = 0.1, {f1,f2,f3} = 0.75 → keep {f1,f3,f4} • Step 2: {f3,f4} = 0.85, {f1,f4} = 0.1, {f1,f3} = 0.8 → keep {f3,f4} • Step 3: {f4} = 0.2, {f3} = 0.7 → no improvement over 0.85, so stop with {f3,f4}
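The sketch below implements the same greedy backward search on synthetic data, using the cross-validated accuracy of a k-nearest-neighbour classifier as a stand-in for whatever classifier the wrapper is built around (the dataset and classifier are illustrative assumptions, not part of the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def backward_elimination(X, y, estimator, min_features=1):
    """Greedy sequential backward elimination guided by cross-validated accuracy."""
    remaining = list(range(X.shape[1]))
    best_score = cross_val_score(estimator, X[:, remaining], y, cv=3).mean()
    while len(remaining) > min_features:
        # Score every subset obtained by dropping one of the remaining features
        trials = []
        for f in remaining:
            subset = [g for g in remaining if g != f]
            score = cross_val_score(estimator, X[:, subset], y, cv=3).mean()
            trials.append((score, f))
        score, worst = max(trials)      # the removal that keeps accuracy highest
        if score < best_score:          # stop when removing any feature hurts
            break
        best_score, remaining = score, [g for g in remaining if g != worst]
    return remaining, best_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
features, score = backward_elimination(X, y, KNeighborsClassifier())
print("selected features:", features, "CV accuracy:", round(score, 3))
```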

  18. Embedded methods • Specific to a classification algorithm • Model parameters are directly used to derive feature weights • Examples: • Weighted Naïve Bayes Method (WNBM) • Weighted Linear Support Vector Machine (WLSVM)
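The exact WNBM/WLSVM formulations are not given here; the sketch below only illustrates the general embedded idea behind a weighted linear SVM: fit a linear SVM and read feature importances directly from the magnitudes of the model's weights (synthetic data, scikit-learn's LinearSVC as the model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

svm = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
weights = np.abs(svm.coef_).ravel()      # one weight per feature, taken from the model

ranking = np.argsort(weights)[::-1]      # features sorted by |weight|, most important first
for f in ranking[:5]:
    print(f"feature {f}: |w| = {weights[f]:.3f}")
```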

  19. Case study: knowledge discovery for splice site prediction • Splice site prediction: • Correctly identify the borders of introns and exons in genes (splice sites) • Important for gene prediction • Split up into 2 tasks: • Donor prediction (exon -> intron) • Acceptor prediction (intron -> exon)

  20. Splice site prediction • Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence • Donor sites: … [ GT … • Acceptor sites: … AG ] … • Classification problem: distinguish between true GT, AG and false GT, AG

  21. Splice site prediction: features • Position dependent features • e.g. an A on position 1, a C on position 17, … • Position independent features • e.g. subsequence “TCG” occurs, “GAG” occurs, … • Example local context (positions numbered 1, 2, 3, …, 17, …, 28): atcgatcagtatcgat GT ctgagctatgag

  22. Example: acceptor prediction • Local context of 100 nucleotides around the splice site • 100 position dependent features • 400 binary features (A=1000, T=0100, C=0010, G=0001) • 2×64 binary features, representing the occurrence of 3-mers • Total: 528 binary features • Color coding of feature importance (scale from 0 to 1)
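One plausible way to build such a 528-dimensional binary encoding is sketched below; the exact layout (which 50 nt count as upstream vs downstream, and how the two sets of 64 3-mer features are split) is an assumption, since the slide does not spell it out:

```python
from itertools import product

NUC = "ATCG"                                              # one-hot order: A=1000, T=0100, C=0010, G=0001
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]   # the 64 possible 3-mers

def encode_context(upstream, downstream):
    """Encode a 50 nt upstream + 50 nt downstream local context as 528 binary features.

    Assumed layout: 400 position-dependent bits (4 per position) followed by
    2 x 64 bits marking which 3-mers occur upstream / downstream of the site.
    """
    assert len(upstream) == len(downstream) == 50
    features = []
    for base in upstream + downstream:               # 100 positions x 4 bits = 400
        features.extend(int(base == n) for n in NUC)
    for half in (upstream, downstream):              # 2 x 64 3-mer occurrence bits
        present = {half[i:i + 3] for i in range(len(half) - 2)}
        features.extend(int(k in present) for k in KMERS)
    return features

up = "ATCGATCAGT" * 5     # 50 nt upstream context (made-up example sequence)
down = "CTGAGCTATG" * 5   # 50 nt downstream context (made-up example sequence)
print(len(encode_context(up, down)))   # 528
```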

  23. Donor prediction: 528 features • [Figure: color-coded feature importance for the 50 upstream and 50 downstream positions (A, T, C, G at each position) and the 3-mer occurrence features, ordered from higher to lower importance]

  24. Acceptor prediction: 528 features • [Figure: color-coded feature importance, laid out as for donor prediction; among the highest-ranked features are the 3-mers AAT, TAA, AGA, AGG, AGT, TAG and CAG]

  25. How to decide on a splice site? • Classification models • PWM (position weight matrix) • Collection of (conditional) probabilities • Linear discriminant analysis • Hyperplane decision function in a high-dimensional space • Classification tree • Decision is made by traversing a tree structure • Decision nodes • Leaf nodes • Easy to interpret by a human
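As a hedged sketch of the simplest of these models, the code below builds a toy PWM from a handful of made-up aligned true sites and scores candidates by their summed log-odds against a uniform background (the sequences and the pseudocount are illustrative only):

```python
import math

TRUE_SITES = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGG", "GAGGTAAGT"]   # toy aligned donor sites
BACKGROUND = 0.25                                                   # uniform background probability
NUC = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Per-position nucleotide probabilities estimated from aligned true sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {n: pseudocount for n in NUC}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({n: counts[n] / total for n in NUC})
    return pwm

def score(pwm, seq):
    """Log-odds score of a candidate site against the uniform background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, seq))

pwm = build_pwm(TRUE_SITES)
print(round(score(pwm, "AAGGTAAGT"), 2))   # a true-site-like candidate scores high
print(round(score(pwm, "TTTCTCTTT"), 2))   # a dissimilar candidate scores low
```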

  26. Classification tree • Algorithm: 1. Choose the “best” attribute by a given selection measure 2. Extend the tree by adding a new branch for each attribute value 3. Sort the training examples to the leaf nodes 4. If the examples are unambiguously classified, then stop; else repeat steps 1–4 for the leaf nodes 5. Prune unstable leaf nodes • Training examples (class Flu; attributes Temperature, Headache): e1 = (normal, no, yes), e2 = (high, yes, yes), e3 = (very high, yes, yes), e4 = (normal, no, no), e5 = (high, no, no), e6 = (very high, no, no) • [Figure: tree with root node Temperature; branches normal → {e1, e4}, high → {e2, e5}, very high → {e3, e6}, each further split on Headache]
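A minimal sketch of fitting such a tree on the reconstructed toy table with scikit-learn (one-hot encoding and the entropy criterion are implementation choices, not necessarily what the lecture used); note that e1 and e4 share the same attribute values but different classes, so one leaf stays ambiguous, which is the situation the pruning step addresses:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set from the slide: class Flu, attributes Temperature and Headache
data = pd.DataFrame({
    "Temperature": ["normal", "high", "very high", "normal", "high", "very high"],
    "Headache":    ["no", "yes", "yes", "no", "no", "no"],
    "Flu":         ["yes", "yes", "yes", "no", "no", "no"],
})

X = pd.get_dummies(data[["Temperature", "Headache"]])   # one-hot encode the attributes
y = data["Flu"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # human-readable decision rules
```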

  27. Acceptor prediction • Original dataset: 353 binary features • Reduce this set to 15 features (e.g. using a filter technique) • A decision tree built from 353 features is hard to visualize; one built from 15 features is easy to visualize
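One way to perform such a reduction is a univariate filter such as scikit-learn's SelectKBest with a mutual-information score, sketched below on placeholder binary data (the real acceptor dataset is not used here):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 353))        # placeholder binary dataset
y = rng.integers(0, 2, size=1000)               # placeholder class labels

selector = SelectKBest(score_func=mutual_info_classif, k=15)
X_small = selector.fit_transform(X, y)

print(X_small.shape)                            # (1000, 15)
print(np.flatnonzero(selector.get_support()))   # indices of the 15 retained features
```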

  28. 352 binary features • [Figure: decision tree built from the full binary feature set]

  29. 15 binary features • [Figure: decision tree built from the 15 selected features]

  30. Computer exercise • Feature selection for classification of human acceptor splice sites • Use the WEKA machine learning toolkit for knowledge discovery in acceptor sites • Download the files from http://www.psb.ugent.be/~yvsae/GGSlecture.html
