Exploring Alternative Splicing Features using Support Vector Machines

Jing Xia 1, Doina Caragea 1, Susan J. Brown 2. Sections: Background & Motivation; Problem & Feature Construction; Experiments Design & Results; Conclusions and Future Work.

Presentation Transcript


  1. Exploring Alternative Splicing Features using Support Vector Machines. Jing Xia (1), Doina Caragea (1), Susan J. Brown (2). (1) Computing and Information Sciences, Kansas State University, USA; (2) Bioinformatics Center, Kansas State University, USA. Jan 16, 2008.

  2. Outline: (1) Background & Motivation; (2) Problem & Feature Construction (Problem Definition, Data Set, Feature Construction); (3) Experiments Design & Results (Experimental Design, Experimental Results); (4) Conclusions and Future Work (Conclusion).

  3. Alternative Splicing. Splicing is an important step during gene expression. Because the splicing process is variable (alternative splicing), one gene can give rise to many proteins. Gene expression: from genes to proteins. [Figure: DNA (TSS, ATG, 5'UTR, exons and introns bounded by GT...AG, 3'UTR) is transcribed into capped pre-mRNA (introns bounded by GU...AG), spliced into mRNA (AUG), and translated into protein.]

  4. Alternative Splicing. Splicing is an important step during gene expression. Because the splicing process is variable (alternative splicing), one gene can give rise to many proteins. [Figure: gene, pre-mRNA, alternative splicing, transcript isoforms, proteins: one gene to many proteins.]

  5. Patterns of Alternative Splicing. An exon is either a constitutively spliced exon (CSE) or an alternatively spliced exon (ASE). Patterns: exon skipping (most frequent); alternative 5' splice sites; alternative 3' splice sites; intron retention; mutually exclusive exons. Here, the focus is on predicting ASEs and CSEs with an SVM. [Figure: exon1 to exon4, each labeled ASE or CSE.]

  6. Alternative splicing. Wet-lab experiments for finding alternative splicing (AS) are time consuming. Traditionally, AS is found by aligning ESTs to the genome, which is limited by the amount of EST data available for that genome.

  7. Identifying Alternative Splicing in the genome. Wet-lab experiments for finding alternative splicing (AS) are time consuming. Traditionally, AS is found by aligning ESTs to the genome, which is limited by the amount of EST data available for that genome. [Figure: transcripts aligned to genomic DNA, showing an alternative 3' exon and exon skipping.]

  8. Identifying Alternative Splicing in the genome. Wet-lab experiments for finding alternative splicing (AS) are time consuming. Traditionally, AS is found by aligning ESTs to the genome, which is limited by the amount of EST data available for that genome. Instead, use machine learning algorithms to predict AS at the genome level.

  9. Problem Definition. Given an exon, can we predict whether it is an alternatively spliced exon (ASE) or a constitutively spliced exon (CSE)? [Figure: exon1 to exon4, three labeled CSE and one labeled ASE.]

  10. Problem Addressed and Our Approach. Problem: predict alternatively spliced exons (ASE) versus constitutively spliced exons (CSE) using a Support Vector Machine (SVM). Task: a two-class (ASE and CSE) classification problem. Need: a training data set containing labeled examples (ASE and CSE). Learning: train the classifier on the training data. Application: predict unknown ASEs. This requires features to represent ASEs and CSEs.

  15. Data Set. A published data set from the model organism C. elegans (worm). It includes alternatively spliced exons (ASE) and constitutively spliced exons (CSE): 487 ASEs and 2531 CSEs, with 100-base local sequences around the splice sites. Example records (sequence, then label): GTACTATAGCGTGCTG....ACCGTTCGTACTCGCT ASE; ATACTATAGCGTCTTG....ACCGATCGTACACGCT ASE; GTACTATAGCGTCTTG....ACCGATCGTACTCGCT CSE. [Figure: windows from −100 to +100 around each splice site (GT, AG) of the exon.]

  18. Data Set. Previous work: motifs captured and identified by kernels (G. Rätsch et al.); length of exons and flanking introns (Sorek et al.). Our work: exploit more biologically significant features; use several additional approaches to derive features.

  20. Feature List. Several features known to be biologically important: strength of splice sites (SSS); motif features, namely intronic splicing regulators (ISR), motifs derived from local sequences (MAST), and exonic splicing enhancers (ESE); a reduced set of motif features based on the locations of motifs on the secondary structure (MAST-R); optimal folding energy (OPE); basic sequence features (BSF).

  21. SSS: Strength of Splice Site. We consider all splice sites and score each one as $\text{score} = \sum_i \log \frac{F(X_i)}{F(X)}$, where $X \in \{A, U, G, C\}$, with $i \in \{-3, \dots, +7\}$ for 3' splice sites (3'ss) and $i \in \{-26, \dots, +2\}$ for 5' splice sites (5'ss). [Figure: aligned example sites flanking the exon (AGGTAAGT / CGAG, AGGTAAGT / CGAG, AGGTAGGT / GGAG, AGGTTAGT / CGAG, AGGTAAGT / CCAG), with the windows −26 to +2 and −3 to +7 marked at the 5' ss and 3' ss.]
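
The score above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that F(X_i) is the frequency of nucleotide X at window position i estimated from aligned sites, that F(X) is a background frequency (uniform here), and it adds a pseudocount to avoid log(0). The function names and the toy sites (the slide's examples written as RNA) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

def position_frequencies(sites, pseudocount=1.0):
    """Estimate F(X_i): per-position nucleotide frequencies from aligned,
    equal-length splice-site windows (pseudocount is an assumption)."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({x: (counts[x] + pseudocount) / total for x in ALPHABET})
    return freqs

def site_score(site, freqs, background):
    """Sum over window positions of log(F(X_i) / F(X))."""
    return sum(math.log(freqs[i][x] / background[x]) for i, x in enumerate(site))

# Toy aligned windows (hypothetical training set, written with U instead of T).
sites = ["AGGUAAGU", "AGGUAAGU", "AGGUAGGU", "AGGUUAGU", "AGGUAAGU"]
background = {x: 0.25 for x in ALPHABET}  # assumed uniform background F(X)
freqs = position_frequencies(sites)

strong = site_score("AGGUAAGU", freqs, background)  # matches the consensus
weak = site_score("CCCUUUCC", freqs, background)    # far from the consensus
```

A window close to the consensus receives a higher score than one far from it, which is the sense in which the score measures splice-site strength.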

  22. Motif: a sequence pattern that occurs repeatedly in a group of sequences. Intronic splicing regulators (ISR): identified in Kabat et al. MAST: motifs derived by MEME from the [−100, +100] sequences. Exonic splicing enhancers (ESE): based on two assumptions. [Figure: illustration of ISRs dispersed among the sequences flanking an exon.]

  23. Motif: a sequence pattern that occurs repeatedly in a group of sequences. Intronic splicing regulators (ISR): identified in Kabat et al. MAST: motifs derived by MEME from the [−100, +100] sequences. Exonic splicing enhancers (ESE): based on two assumptions. Example: a 20-base motif derived from sequences around splice sites.

  24. Motif: a sequence pattern that occurs repeatedly in a group of sequences. Intronic splicing regulators (ISR): identified in Kabat et al. MAST: motifs derived by MEME from the [−100, +100] sequences. Exonic splicing enhancers (ESE): based on two assumptions: (1) they are more frequent in exons than in introns; (2) they are more frequent in exons with weak splice sites than in exons with strong splice sites. [Figure: ISR, MAST, and ESE motifs dispersed among exons and introns.]
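
The two ESE assumptions can be illustrated with a simple k-mer comparison. This is only a sketch under stated assumptions: it compares raw relative frequencies with a strict "greater than" test, whereas a real analysis would likely use a statistical test; the function names, the choice of k, and the toy sequences are all hypothetical.

```python
from collections import Counter

def kmer_frequencies(seqs, k):
    """Relative frequency of every k-mer over a collection of sequences."""
    counts = Counter()
    total = 0
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            total += 1
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def enriched(foreground, background, k):
    """k-mers that are relatively more frequent in foreground than in background."""
    fg = kmer_frequencies(foreground, k)
    bg = kmer_frequencies(background, k)
    return {kmer for kmer, f in fg.items() if f > bg.get(kmer, 0.0)}

def candidate_ese(exons, introns, weak_site_exons, strong_site_exons, k=6):
    """k-mers satisfying both assumptions: exon > intron and weak-site > strong-site."""
    return (enriched(exons, introns, k)
            & enriched(weak_site_exons, strong_site_exons, k))

# Toy data with k=3 so the example stays readable (hypothetical sequences).
exons = ["GAAGAAGAA", "GAAGAACCC"]
introns = ["UUUUUUUUU", "CCCUUUCCC"]
weak_site_exons = ["GAAGAAGAA"]
strong_site_exons = ["GAAGAACCC"]
candidates = candidate_ese(exons, introns, weak_site_exons, strong_site_exons, k=3)
```

The presence or count of each retained motif in an exon can then serve as one feature for the classifier.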

  25. Pre-mRNA secondary structures influence exon recognition. A motif can be located in different structural contexts (loop or stem); example sequence: AUCCAUGGGCCGGAUGUGACGGUAGUAGGGUAUACGUCACAUAGGCUUCCUCUCAUGA. Secondary structure: derived from Mfold and used to filter motifs. Optimal folding energy: the stability of the RNA secondary structure.
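
One way to picture filtering motifs by structural context is sketched below. It assumes the predicted structure is available in dot-bracket notation ('.' for unpaired bases, parentheses for paired bases), which is a common interchange format but not necessarily what the authors used; the threshold, the function names, and the toy hairpin are hypothetical, and the slides do not say whether loop or stem occurrences were kept.

```python
def motif_positions(sequence, motif):
    """Start indices of all (possibly overlapping) occurrences of motif."""
    positions = []
    i = sequence.find(motif)
    while i != -1:
        positions.append(i)
        i = sequence.find(motif, i + 1)
    return positions

def unpaired_fraction(structure, start, length):
    """Fraction of bases in the window that are unpaired ('.')."""
    window = structure[start:start + length]
    return window.count(".") / len(window)

def keep_motif_sites(sequence, structure, motif, min_unpaired=0.5):
    """Keep motif occurrences lying mostly in unpaired (loop) regions.
    min_unpaired is an illustrative threshold, not a value from the slides."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return [p for p in motif_positions(sequence, motif)
            if unpaired_fraction(structure, p, len(motif)) >= min_unpaired]

# Toy hairpin: a 3-base stem closing a 4-base loop (hypothetical example).
sequence = "GGGAAAUCCC"
structure = "(((....)))"
loop_sites = keep_motif_sites(sequence, structure, "AAAU")
stem_sites = keep_motif_sites(sequence, structure, "GGG")
```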

  27. GC content (G and C ratio) $= \frac{G+C}{A+U+G+C}$, a characteristic of the sequence. Sequence length: length of exons and length of the exons' flanking introns. Frames of stop codons. Summary of features: motif features; secondary structure; strength of splice sites; sequence features.
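
The basic sequence features are simple enough to compute directly. The sketch below is an illustration only: it reads "frames of stop codons" as "which of the three reading frames contain a stop codon", which is an assumption, and the function and key names are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def gc_content(seq):
    """(G + C) / (A + U + G + C) for an RNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def frames_with_stop(seq):
    """Reading frames (0, 1, 2) in which at least one stop codon occurs."""
    frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames

def basic_features(exon, upstream_intron, downstream_intron):
    """Collect the basic sequence features for one exon into a dict."""
    return {
        "exon_length": len(exon),
        "upstream_intron_length": len(upstream_intron),
        "downstream_intron_length": len(downstream_intron),
        "gc_content": gc_content(exon),
        "stop_frames": frames_with_stop(exon),
    }

# Toy exon with flanking introns (hypothetical sequences).
features = basic_features("AUGUAAGG", "GUAAAAAAAAAG", "GUUUUAG")
```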

  28. Experimental Design. Use the previously defined features as SVM input, combining different features to represent ASEs and CSEs. Tune the SVM parameters (kernel: linear, RBF, ...; cost C) with 5-fold cross-validation, choose the parameters with the best cross-validation (CV) accuracy, and test the trained SVM on the testing ASEs and CSEs. [Figure: five data splits (split1 to split5), each divided 80% / 20%.]
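
The cross-validation and parameter-selection loop can be sketched without committing to a particular SVM library. The code below is an assumption-laden illustration: it stratifies the folds by class (the slides do not say whether the folds were stratified), and `fit_and_score` is a hypothetical callback standing in for "train an SVM with this parameter on the training indices and return its accuracy on the held-out indices".

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Partition example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)  # deal indices round-robin
    return folds

def select_parameter(labels, params, fit_and_score, k=5):
    """Return the parameter with the best mean held-out score over k folds."""
    folds = stratified_folds(labels, k)
    mean_scores = {}
    for p in params:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(labels)) if i not in held]
            scores.append(fit_and_score(p, train, held_out))
        mean_scores[p] = sum(scores) / k
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Class sizes from the data set slide: 487 ASEs and 2531 CSEs.
labels = ["ASE"] * 487 + ["CSE"] * 2531
folds = stratified_folds(labels)

# Dummy scorer (hypothetical) that simply prefers C = 0.1, for demonstration.
best_c, scores = select_parameter(
    labels, [0.01, 0.05, 0.1, 1.0],
    lambda c, train, test: -abs(c - 0.1),
)
```

With five folds, each held-out fold contains about 20% of the examples and the remaining 80% are used for training, matching the proportions on the slide.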

  33. Experimental results. Results of alternatively spliced exon classification. All features, including ISR motifs, are used.

      Split    C     Cross-validation score    Test score
                     fp 1%     AUC %           fp 1%     AUC %
      Split1   0.05  32.45     86.55           56.48     90.05
      Split2   0.05  39.33     88.32           52.04     89.04
      Split3   0.1   37.56     87.76           38.71     87.97
      Split4   0.01  40.86     89.02           37.63     84.42
      Split5   0.1   36.48     87.50           35.79     85.69
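
The two metrics in the table can be computed from classifier scores as sketched below. This is a minimal illustration, assuming that "fp 1%" denotes the true positive rate achieved at a false positive rate of at most 1% (the slides do not define it) and that AUC is the area under the ROC curve; the function names and toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), lowering the threshold from the highest score.
    labels: 1 for ASE (positive), 0 for CSE (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Examples with tied scores cross the threshold together.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def tpr_at_fpr(points, max_fpr=0.01):
    """Highest TPR among ROC points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

# Toy decision values and labels (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
tpr_1pct = tpr_at_fpr(points, 0.01)
```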

  34. Experimental results. [Figure: ROC curves (x-axis: False Positive Rate) for Mixed-Feas (85.55%) and Base-Feas (78.78%).] Comparison of ROC curves obtained using basic features only and basic features plus other mixed features (except conserved ISR motifs). Models trained using 5-fold CV with C = 1.

  35. Experimental results. AUC score comparison between data sets with secondary structural features and data sets without secondary structural features.

  36. Motif Evaluation. Intersection between motifs derived from sequences and intronic splicing regulators.

  37. Motif Evaluation. Conserved ESE in metazoans (animals), human, and mouse.

  38. Motif Evaluation. Comparison with A. thaliana.

  39. Conclusions. Alternative splicing (AS) events can be found using transcripts. Machine learning can be used effectively to predict AS events. We identified features that are informative in predicting AS. We explored comparatively comprehensive feature sets from a biological point of view.

  43. Future Work. Apply this approach to a specific organism. Identify motifs more accurately. Refine the relationships between features (secondary structure and motifs). Learn other types of AS events (not only skipped exons). Adapted from "Detection of Alternative Splicing Events Using Machine Learning".

  44. Thank you for your attention! Questions? Related work: RASE, http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE. Acknowledgements: data set from Dr. Rätsch's FML group, http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE/altsplicedexonsplits.tar.gz; Dr. Caragea's MLB group, http://people.cis.ksu.edu/~dcaragea/mlb; Dr. Brown's Bioinformatics Center at KSU, http://bioinformatics.ksu.edu.
