
Interpreting Microarray Expression Data Using Text Annotating the Genes




  1. Interpreting Microarray Expression Data Using Text Annotating the Genes Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik University of Wisconsin – Madison

  2. The Basic Task Given Microarray Expression Data & Text Annotations of Genes Generate Model of Expression

  3. Motivation • Lots of Data Available on the Internet • Microarray Expression Data • Text Annotations of Genes • Maybe we can Make the Scientist’s Job Easier • Generate a Model of Expression Automatically • Easier First Step for the Human

  4. Microarray Expression Data • Each spot represents a gene in E. coli • Colors Indicate Up- or Down-Regulation Under Antibiotic Shock • For our Purposes, 3 Classes • Up-Regulated • Down-Regulated • No-Change

  5. Microarray Expression Data From “Genome-Wide Expression in Escherichia coli K-12”, Blattner et al., 1999

  6. Our Microarray Experiment • 4290 genes • 574 up-regulated • 333 down-regulated • 2747 un-regulated • 636 not enough signal

  7. Text Annotations of Genes • The text from a sample SwissProt entry (b1382) • The “description” field HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION • The “keyword” field HYPOTHETICAL PROTEIN

  8. Sample Rules From a Model for Up-Regulation • IF • The annotation contains FLAGELLAR AND does NOT contain HYPOTHETICAL OR • The annotation contains BIOSYNTHESIS • THEN • The gene is up-regulated
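
The sample rules on this slide can be read as a simple predicate over the words of an annotation. The sketch below is only an illustration: the function name `is_up_regulated` and the whitespace tokenization are assumptions, not part of the original work.

```python
def is_up_regulated(annotation: str) -> bool:
    """Apply the two sample rules from the slide to one annotation string."""
    # Assumption: words are separated by whitespace and matching is case-insensitive.
    words = set(annotation.upper().split())
    # Rule 1: contains FLAGELLAR and does NOT contain HYPOTHETICAL.
    rule_1 = "FLAGELLAR" in words and "HYPOTHETICAL" not in words
    # Rule 2: contains BIOSYNTHESIS.
    rule_2 = "BIOSYNTHESIS" in words
    # The rules are joined by OR: either one is enough to predict up-regulation.
    return rule_1 or rule_2
```

Applied to the SwissProt description shown for b1382 (HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION), neither rule fires, so these two rules alone would not predict up-regulation for that gene.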

  9. Why use Machine Learning? • Concerned with machines learning from available data • Informed by text data, the learner can make a first-pass model for the scientist

  10. Desired Properties of a Model • Accurate • Measure with cross validation • Comprehensible • Measure with model size • Stable to Small Changes in the Data • Measure with random subsampling

  11. Approaches • Naïve Bayes • Statistical method • Uses all of the words (present or absent) • PFOIL • Covering algorithm • Chooses words to use one at a time

  12. Naïve Bayes For each word wi, there are two likelihood ratios (lr): lr(wi present) = p(wi present | up) / p(wi present | down) lr(wi absent) = p(wi absent | up) / p(wi absent | down) For each annotation, the lrs are combined to form a lr for a gene: lr(gene) = ∏i lr(wi X), where X is either present or absent.
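
The per-word likelihood ratios can be sketched as follows, assuming the per-gene ratio is the product of the per-word ratios (the usual Naïve Bayes independence assumption). The function names, the representation of annotations as sets of words, and the smoothing constant `alpha` are all assumptions added here so that no probability is exactly zero; the slides do not say how zero counts were handled.

```python
from math import prod

def word_likelihood_ratios(up_docs, down_docs, vocabulary, alpha=1.0):
    """Return lr(w present) and lr(w absent) for every word in the vocabulary.

    up_docs / down_docs are lists of word sets, one set per gene annotation.
    alpha is a hypothetical smoothing constant (not taken from the slides).
    """
    ratios = {}
    for w in vocabulary:
        p_up = (sum(w in d for d in up_docs) + alpha) / (len(up_docs) + 2 * alpha)
        p_down = (sum(w in d for d in down_docs) + alpha) / (len(down_docs) + 2 * alpha)
        ratios[w] = {
            "present": p_up / p_down,
            "absent": (1 - p_up) / (1 - p_down),
        }
    return ratios

def gene_likelihood_ratio(annotation_words, ratios):
    """Multiply the ratio for each vocabulary word, using 'present' or 'absent'."""
    return prod(
        r["present"] if w in annotation_words else r["absent"]
        for w, r in ratios.items()
    )
```

A ratio above 1 favors up-regulation and a ratio below 1 favors down-regulation. Because every vocabulary word contributes, present or absent, the resulting model is not a short list of rules.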

  13. PFOIL • Learn rules from data • Produces multiple if-then rules from data • Builds rules by adding one word at a time • Easy to interpret models
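
The idea of building rules one word at a time can be sketched with a generic greedy covering loop. This is not the PFOIL implementation used in the work: it scores candidate literals with a simple "positives kept minus negatives kept" count rather than an information-gain measure, and the names `satisfies` and `learn_rules` are hypothetical.

```python
def satisfies(doc, rule):
    """A rule is a list of (word, must_be_present) literals joined by AND."""
    return all((word in doc) == present for word, present in rule)

def learn_rules(positives, negatives, max_rules=5):
    """Greedy covering: learn rules until every positive example is covered."""
    vocabulary = sorted(set().union(*positives, *negatives))
    rules, uncovered = [], list(positives)
    while uncovered and len(rules) < max_rules:
        rule, pos, neg = [], list(uncovered), list(negatives)
        while neg:  # keep adding literals until no negative example matches
            candidates = [(w, p) for w in vocabulary for p in (True, False)
                          if (w, p) not in rule]
            if not candidates:
                break

            def score(literal):
                kept_pos = sum(satisfies(d, [literal]) for d in pos)
                kept_neg = sum(satisfies(d, [literal]) for d in neg)
                return (kept_pos - kept_neg, kept_pos)

            best = max(candidates, key=score)
            new_pos = [d for d in pos if satisfies(d, [best])]
            new_neg = [d for d in neg if satisfies(d, [best])]
            if not new_pos or len(new_neg) == len(neg):
                break  # the best literal does not help; give up on this rule
            rule.append(best)
            pos, neg = new_pos, new_neg
        if not rule or neg:
            break  # could not build a rule that excludes all negatives
        rules.append(rule)
        uncovered = [d for d in uncovered if not satisfies(d, rule)]
    return rules
```

Each returned rule reads as an IF clause (words that must or must not appear), and the rules together are joined by OR, which matches the shape of the sample rules shown earlier in the deck.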

  14. Accuracy/Comprehensibility Tradeoff

  15. Stabilized PFOIL • Repeatedly run PFOIL on randomly sampled subsets • For each word, count the number of models it appears in • Restrict PFOIL to only those words that appear in a minimum of m models • Rerun PFOIL with only those words
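
The word-selection step above can be sketched independently of the rule learner. In this sketch `learn` is a hypothetical callable that trains a model on a subset and returns the words the model uses; the parameter names, the default sampling fraction, and the default number of runs are assumptions, not values from the slides.

```python
import random
from collections import Counter

def stable_words(examples, learn, runs=20, fraction=0.8, min_models=10, seed=0):
    """Return the words that appear in at least `min_models` of `runs` models.

    examples: list of training examples (any representation `learn` accepts).
    learn: hypothetical callable mapping a list of examples to the words used
           by the model trained on them.
    """
    rng = random.Random(seed)
    counts = Counter()
    subset_size = max(1, int(fraction * len(examples)))
    for _ in range(runs):
        subset = rng.sample(examples, subset_size)  # sample without replacement
        counts.update(set(learn(subset)))           # count each word once per model
    return {word for word, count in counts.items() if count >= min_models}
```

The final step, rerunning the learner on the full data with its vocabulary restricted to the returned words, is left to the caller.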

  16. Stability Measure After running the algorithm N times to generate N rule sets: stability = ( Σ wi∈U count(wi) ) / ( N × |U| ) Where: U = the set of words appearing in any rule set count(wi) = number of rule sets containing word wi
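
A measure built from these quantities can be sketched as below, assuming stability is the average fraction of the N rule sets in which each word of U appears, that is, the sum of count(wi) over U divided by N × |U|. That reading is an assumption based on the definitions of U and count(wi) on this slide; the function name `stability` is hypothetical.

```python
from collections import Counter

def stability(rule_set_words):
    """rule_set_words: one collection of words per learned rule set (N in total)."""
    n = len(rule_set_words)
    # count(w): number of rule sets that contain word w (each set counted once).
    counts = Counter(w for words in rule_set_words for w in set(words))
    if n == 0 or not counts:
        return 0.0
    # The keys of `counts` form U, the words appearing in any rule set.
    return sum(counts.values()) / (n * len(counts))
```

Under this reading the value is 1.0 when every run uses exactly the same words and falls toward 1/N when no word is shared between runs.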

  17. Accuracy/Stability Tradeoff

  18. Discussion • Not very severe tradeoffs in Accuracy • vs. stability • vs. comprehensibility • PFOIL not as good at characterizing data • suggests not many dependencies • need for “softer” rules

  19. Future Directions • M of N rules • Permutation Test • More Sources of Text Data

  20. Take-Home Message • This is just a first step toward an aid for understanding expression data • Make expression models based on text instead of DNA sequence.

  21. Acknowledgements • This research was funded by the following grants: NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.
