
Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays


Presentation Transcript


  1. Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays J. Tobler, M. Molla, J. Shavlik University of Wisconsin-Madison M. Molla, E. Nuwaysir, R. Green Nimblegen Systems Inc.

  2. Oligonucleotide Microarrays [Figure: chip schematic labeling the probes and the surface] • Specific probes synthesized at a known spot on the chip’s surface • Probes complementary to the RNA of genes to be measured • Typical gene (1 kb+) is MUCH longer than a typical probe (24 bases)

  3. Probes: Good vs. Bad [Figure: hybridization schematic contrasting a good probe with a bad probe; blue = probe, red = sample]

  4. Probe-Picking Method Needed • Hybridization characteristics differ between probes • Probe set represents very small subset of gene • Accurate measurement of expression requires good probe set

  5. Related Work • Use known hybridization characteristics (Lockhart et al. 1996) • Melting point (Tm) predictions (Kurata and Suyama 1999; Li and Stormo 2001) • Stable secondary structure (Kurata and Suyama 1999)

  6. Our Approach • Apply established machine-learning algorithms • Train on categorized examples • Test on examples with category hidden • Choose features to represent probes • Categorize probes as good or bad

  7. The Features
  • fracA, fracC, fracG, fracT: the fraction of A, C, G, or T in the 24-mer
  • fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT, fracTA, fracTC, fracTG, fracTT: the fraction of each of these dimers in the 24-mer
  • n1, n2, …, n24: the particular nucleotide (A, C, G, or T) at the specified position in the 24-mer
  • d1, d2, …, d23: the particular dimer (AA, AC, …, TT) at the specified position in the 24-mer
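A minimal sketch of computing this feature set for one probe; whether the dimer fractions are taken over the 23 overlapping dimers (assumed here) is not stated in the transcript:

```python
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]

def probe_features(probe):
    """Compute the slide-7 feature set for a 24-mer probe sequence."""
    probe = probe.upper()
    n = len(probe)  # 24 in the paper
    feats = {}
    # Composition features: fraction of each base in the probe.
    for b in BASES:
        feats[f"frac{b}"] = probe.count(b) / n
    # Fraction of each dimer, over the 23 overlapping dimers (assumption).
    dimers = [probe[i:i + 2] for i in range(n - 1)]
    for d in DIMERS:
        feats[f"frac{d}"] = dimers.count(d) / len(dimers)
    # Positional features: the base and dimer at each position.
    for i, b in enumerate(probe, start=1):
        feats[f"n{i}"] = b
    for i, d in enumerate(dimers, start=1):
        feats[f"d{i}"] = d
    return feats

# Example: one probe from the tiling example on the next slide.
print(probe_features("CATCGATCGTAATCGTACCGGTCA")["fracG"])
```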

  8. The Data • Tilings of 8 genes (from E. coli & B. subtilis) • Every possible probe (~10,000 probes) • Genes known to be expressed in sample
  Gene Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
  Complement:    CATCGATCGTAATCGTACCGGTCAGTAC…
  Probe 1: CATCGATCGTAATCGTACCGGTCA
  Probe 2: ATCGATCGTAATCGTACCGGTCAG
  Probe 3: TCGATCGTAATCGTACCGGTCAGT
  … …
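A small sketch of generating such a tiling, assuming (as in the slide's example) that probes are consecutive 24-mers of the base-by-base complement of the gene, shifted one position at a time:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def tile_probes(gene_seq, probe_len=24):
    """Return every possible probe_len-mer probe tiling the gene.

    Probes are taken from the base-by-base complement of the gene
    sequence, one starting position at a time, as on the slide.
    """
    comp = gene_seq.upper().translate(COMPLEMENT)
    return [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]

probes = tile_probes("GTAGCTAGCATTAGCATGGCCAGTCATG")
print(probes[0])  # CATCGATCGTAATCGTACCGGTCA
print(probes[1])  # ATCGATCGTAATCGTACCGGTCAG
```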

  9. Our Microarray

  10. Defining our Categories • Low intensity = BAD probes (45%) • Mid intensity = not used in training set (23%) • High intensity = GOOD probes (32%) [Histogram: frequency vs. normalized probe intensity, with tick marks at 0, .05, .15, and 1.0]
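A minimal sketch of assigning these categories from normalized intensities; the 0.05 and 0.15 cut-offs are an assumption read off the histogram's tick marks, not values stated in the transcript:

```python
def label_probe(norm_intensity, low_cut=0.05, high_cut=0.15):
    """Assign a training label from a probe's normalized intensity.

    The cut-off values here are assumptions taken from the slide's axis
    ticks; whatever cut-offs are used, ~45% of probes end up BAD, ~32%
    GOOD, and the remaining ~23% mid-intensity probes are dropped.
    """
    if norm_intensity < low_cut:
        return "BAD"
    if norm_intensity >= high_cut:
        return "GOOD"
    return None  # mid-intensity: not used in the training set
```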

  11. The Machine Learning Techniques • Naïve Bayes (Mitchell 1997) • Neural Networks (Rumelhart et al. 1995) • Decision Trees (Quinlan 1996) • Can interpret predictions of each learner probabilistically

  12. Naïve Bayes • Assumes conditional independence between features • Make judgments about test set examples based on conditional probability estimates made on training set

  13. Naïve Bayes For each example in the test set, evaluate the following:
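The equation image from this slide is not preserved in the transcript. The standard Naïve Bayes decision rule such a slide would show, reconstructed here rather than copied from the slide, is:

```latex
\hat{c} \;=\; \arg\max_{c \,\in\, \{\mathrm{good},\,\mathrm{bad}\}}
              \; P(c) \prod_{i} P(f_i \mid c)
```

where the f_i are the probe's feature values (slide 7) and the probabilities are estimated from the training set, with features assumed conditionally independent given the class (slide 12).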

  14. Neural Network (1-of-n encoding with probe length = 3) [Figure: feed-forward network for the example probe sequence "CAG"; each position has four input units (A1 C1 G1 T1, A2 C2 G2 T2, A3 C3 G3 T3) connected by weights to a Good-or-Bad output, with activation flowing forward and error flowing back]
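A minimal sketch of the 1-of-n input encoding the figure illustrates, using a length-3 probe as on the slide; NumPy is an assumption, and the original network implementation is not shown in the transcript:

```python
import numpy as np

BASES = "ACGT"

def one_of_n_encode(probe):
    """1-of-n encode a probe: four input units (A, C, G, T) per position."""
    vec = np.zeros(len(probe) * len(BASES))
    for pos, base in enumerate(probe.upper()):
        vec[pos * len(BASES) + BASES.index(base)] = 1.0
    return vec

print(one_of_n_encode("CAG"))
# [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.]  -> C at position 1, A at 2, G at 3
```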

  15. Decision Tree • Automatically builds a tree of rules [Figure: example tree splitting on features such as fracC, fracG, fracT, fracTC, fracAC, and n14, with Low/High (or A, C, G, T) branches leading to Good Probe or Bad Probe leaves]

  16. Decision Tree The information gain of a feature, F, is:
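The formula image for this slide is also missing from the transcript. The standard definition of information gain (as in Mitchell 1997 / Quinlan 1996), reconstructed here rather than copied from the slide, is:

```latex
\mathrm{Gain}(S, F) \;=\; \mathrm{Entropy}(S) \;-\;
    \sum_{v \in \mathrm{Values}(F)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)
```

where S is the set of training probes and S_v is the subset on which feature F takes value v; the tree greedily splits on the feature with the highest gain.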

  17. Information Gain per Feature [Charts: normalized information gain for the probe-composition features, and for the base-position and dimer-position features plotted by position]

  18. Cross-Validation • Leave-one-out testing: for each gene (of the 8), train on all but this gene, test on this gene, record the result, then forget what was learned • Average results across the 8 test genes
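A sketch of the leave-one-gene-out loop described above; the function names and data layout are illustrative assumptions, not code from the paper:

```python
def leave_one_gene_out(genes, train_fn, test_fn):
    """Leave-one-gene-out cross-validation, as described on the slide.

    `genes` maps a gene name to its (features, labels) data;
    `train_fn` builds a fresh model, `test_fn` returns a score.
    """
    scores = []
    for held_out in genes:
        train_data = [d for g, d in genes.items() if g != held_out]
        model = train_fn(train_data)                    # train on the other 7 genes
        scores.append(test_fn(model, genes[held_out]))  # test on the held-out gene
    return sum(scores) / len(scores)                    # average across the 8 test genes
```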

  19. Typical Probe-Intensity Prediction Across Short Region [Plot: actual normalized probe intensity vs. starting nucleotide position of the 24-mer probe]

  20. Typical Probe-Intensity Prediction Across Short Region [Plot: actual normalized probe intensity vs. starting nucleotide position of the 24-mer probe, overlaid with the neural network, Naïve Bayes, and decision tree predictions]

  21. Probe-Picking Results [Plot: number of probes selected with intensity >= 90th percentile vs. number of probes selected, with the perfect selector shown for reference]

  22. Probe-Picking Results [Plot: number of probes selected with intensity >= 90th percentile vs. number of probes selected, for the perfect selector, neural network, Naïve Bayes, decision tree, and primer-melting-point methods]
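A sketch of how the y-axis on these result plots could be computed; this is an assumed reading of the evaluation (rank probes by each method's predicted score, then count top-k picks whose measured intensity reaches the 90th percentile), not code from the paper:

```python
import numpy as np

def picking_curve(scores, intensities):
    """For each k, count the top-k probes (by predicted score) whose
    actual intensity is at or above the 90th percentile of all probes."""
    scores = np.asarray(scores)
    intensities = np.asarray(intensities)
    threshold = np.percentile(intensities, 90)
    order = np.argsort(-scores)                # best-scoring probes first
    hits = (intensities[order] >= threshold).astype(int)
    return np.cumsum(hits)                     # entry k-1 = hits among top k
```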

  23. Current and Future Directions • Consider more features • Folding patterns • Melting point • Feature selection • Evaluate specificity along with sensitivity • i.e., consider false positives • Evaluate probe selection + gene calling • Try more ML techniques • SVMs, ensembles, …

  24. Take-Home Message • Machine learning does a good job on this part of the probe-selection problem • Easy to collect a large number of training examples • Easily measured features work well • Intelligent probe selection can increase microarray accuracy and efficiency

  25. Acknowledgements • NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array. • Darryl Roy for helping create the training data. • Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.

  26. Thanks
