Query-driven search methods for large microarray databases

Presentation Transcript

  1. Query-driven search methods for large microarray databases • Matt Hibbs • Troyanskaya Laboratory for BioInformatics and Functional Genomics

  2. Broad Goals/Challenges • Characterize the function of proteins • Learn the mechanisms of gene expression and regulation under many conditions • Growing amounts of data facilitate this goal • Noise, heterogeneity, and biases in available data must be addressed

  3. Specific Goals • Large collection of S.cerevisiae microarray data • From > 80 publications • Totaling ~2400 conditions • Divided into ~130 “datasets” • How can such a large amount of data be leveraged? • What can we learn? Or not learn? • Accessibility, usefulness to community

  4. Outline • Microarray methodology • Analysis concerns • Functional Biases • Improved Approaches • Preliminary Conclusions

  5. Outline • Microarray methodology • Analysis concerns • Functional Biases • Improved Approaches • Preliminary Conclusions

  6. Central Dogma • Transcription factors (TF) recruit or repress polymerase • Transcription • DNA → mRNA • Translation • mRNA → Proteins • Proteins do work • (Diagram labels: TF, DNA, Polymerase, mRNA, Ribosome, Proteins)

  7. Molecular Measurements • Measurements of protein abundance in a variety of conditions can suggest function • Difficult to measure accurately on a large scale • One step removed: measure abundance of mRNA transcripts as a proxy • Much easier to measure on a large scale • Several competing technologies reaching maturity

  8. Basic Microarray Methodology • Step 1: Prepare cDNA spots • Step 2: Add mRNA to slide for hybridization • Reference mRNA: add green dye • Test mRNA: add red dye • Hybridize • Step 3: Scan hybridized array

  9. Microarray Outputs • Measure amounts of green and red dye on each spot • Represent level of expression as a log ratio between these amounts • (Figure: raw image from Spellman et al., 1998)
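The log-ratio representation on this slide can be sketched in a few lines. This is a minimal illustration, not the processing used in the talk: the base-2 logarithm, the pseudocount, and the choice of red as the numerator are all assumptions, and the intensities are made up.

```python
import math

def log_ratio(red, green, pseudocount=1.0):
    """Return log2(red / green) for one spot, or None if a channel is missing.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    finite when a channel reads zero.
    """
    if red is None or green is None:
        return None
    return math.log2((red + pseudocount) / (green + pseudocount))

# Hypothetical spot intensities (red = test sample, green = reference).
print(log_ratio(799.0, 199.0))  # positive: test channel is brighter
print(log_ratio(199.0, 799.0))  # negative: reference channel is brighter
print(log_ratio(None, 500.0))   # None: missing measurement
```

Positive values then mean higher expression in the test sample, negative values mean lower, and equal intensities map to zero.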

  10. Microarray Outputs • Log ratios in data matrix • Missing values present • Potentially high levels of noise • (Figure: data matrix with axes labeled Experiments and Genes)

  11. Additional Technology • Two-color (homemade, Agilent) • Process just described, with 2 labeled samples undergoing competitive hybridization • Single-color (Affymetrix) • Highly calibrated hybridization spots • Match and Mis-match spots for each oligo • Other techniques/tricks • Randomized layouts, barcode arrays, tiling arrays, etc.

  12. Outline • Microarray methodology • Analysis concerns • Functional Biases • Improved Approaches • Preliminary Conclusions

  13. Noise Sources • Transcriptional noise • mRNA transcripts not a direct reflection of protein levels • Process of isolating mRNA can stress cells • Especially true of older protocols/data • Chemical noise • Fluorescent labels sensitive to environment • Operator noise • High variation between scientists running the same experiment

  14. Missing Values • Several choices: • Ignore missing values • Remove genes with missing values • Impute missing values • KNN-Impute • Replace missing values with a weighted average of the K-nearest neighbors • Used for analysis presented later
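The KNN-Impute idea (replace a missing value with a weighted average of the K nearest neighbors) can be sketched as follows. This is a simplified illustration, not the published KNNimpute implementation: the distance (root-mean-square difference over shared conditions), the inverse-distance weights, and the toy matrix are assumptions.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with a distance-weighted average of the k nearest genes.

    matrix is a list of rows (genes); each row is a list of log ratios
    in which None marks a missing value.
    """
    def distance(a, b):
        # Root-mean-square difference over conditions observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors must have condition j observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            )
            neighbors = [(d, v) for d, v in candidates[:k] if d < math.inf]
            if not neighbors:
                continue  # nothing to borrow from; leave the value missing
            # Closer genes get larger weights (inverse distance, assumed here).
            weights = [1.0 / (d + 1e-6) for d, _ in neighbors]
            total = sum(w * v for w, (_, v) in zip(weights, neighbors))
            filled[i][j] = total / sum(weights)
    return filled

# Toy matrix: gene 0 is missing its third condition.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [-2.0, -1.0, 0.0],
]
print(knn_impute(toy, k=2)[0])  # third value is filled from the two closest genes
```

The distant fourth gene does not contribute with k=2, which is the point of restricting the average to nearest neighbors.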

  15. Normalization • “Bright” arrays • Whole arrays often normalized by average intensity • Two-color • Choice of reference population can affect measurements • Avoid divide by zero errors • Affymetrix • Convert hybridization values to log ratios • Divide by average value • Log transform
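The single-color steps listed here (divide by an average, then log transform) might look like the sketch below. The slide does not say which average is used (per array or per gene), so the function simply averages whatever list it is given; the intensity floor that guards against zeros and the base-2 logarithm are also assumptions.

```python
import math

def to_log_ratios(intensities, floor=1.0):
    """Convert raw single-channel intensities to log2 ratios against their mean.

    Values below `floor` are raised to it first so that neither the
    division nor the logarithm sees a zero (an assumed safeguard).
    """
    clipped = [max(v, floor) for v in intensities]
    mean = sum(clipped) / len(clipped)
    return [math.log2(v / mean) for v in clipped]

# Hypothetical intensities for one probe across four arrays.
print(to_log_ratios([100.0, 200.0, 400.0, 100.0]))
```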

  16. Clustering Analysis • Distance metrics • Euclidean • Pearson • Spearman • … • Algorithms • Hierarchical • K-means • SOM • …
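For reference, the three named distance metrics can be written out directly. This is a plain standard-library sketch with made-up profiles; in practice a numerical library would normally be used instead.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotone but non-linear relationship between two hypothetical genes.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 4.0, 9.0, 16.0]
print(euclidean(gene_a, gene_b))
print(pearson(gene_a, gene_b))   # below 1: the relationship is not linear
print(spearman(gene_a, gene_b))  # about 1: the ordering agrees exactly
```

Correlations measure similarity, so a clustering algorithm that expects distances typically uses something like 1 minus the correlation.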

  17. Megaclustering • Combining data from multiple sources can cause problems • Normalization differences • Technology differences • Noise biases • Requires unified pre-processing and smart application of statistics

  18. Apples to Apples • Pearson correlation distributions not always normal • Large dependence on number of conditions • (Figure: histograms of Pearson correlation coefficients for a 40-condition dataset and a 6-condition dataset)

  19. Apples to Apples • Fisher’s Z-score transform normalizes the distributions • Z = ln[(1 + r) / (1 - r)] / 2, where r = Pearson corr. coeff. • (Figure: histograms of Z-scores for a 40-condition dataset and a 6-condition dataset)
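A minimal sketch of the transform in its standard form, Z = ln[(1 + r) / (1 - r)] / 2, which is the inverse hyperbolic tangent of r. The clipping constant is an assumption added here so that r = ±1 does not produce an infinite value.

```python
import math

def fisher_z(r, clip=0.999999):
    """Fisher's Z transform of a Pearson correlation coefficient r."""
    r = max(-clip, min(clip, r))  # keep the logarithm finite at r = +/-1
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

for r in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(r, fisher_z(r))
```

The transform is symmetric around zero and stretches values near ±1, which is what pulls the skewed correlation histograms toward a bell shape.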

  20. Evaluation Measurements • Gene Ontology (GO) • Hierarchical organization of biological processes, molecular functions, and cellular components • Cross-organism structure, organism-specific annotations • Closest available approximation of a “gold standard” • True Positives and False Positives can be defined from the ontology • Node size, depth, expert voting used for cutoffs

  21. Precision / Recall • Calculate and sort distances between all pairs of genes • Determine a cutoff, all pairs below cutoff are predicted “true,” above “false” • Given these predictions, can calculate precision and recall • Precision = TP / (TP + FP) • Recall = TP / TotalPositives • Slide the cutoff from smallest to largest distance to create a curve of precision / recall pairs • Ramp down from few, high confidence predictions to many, low confidence predictions
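The sliding-cutoff procedure can be sketched as below. The input format (a list of distance and gold-standard label pairs) and the example values are assumptions made for illustration; ties in distance are not treated specially.

```python
def precision_recall_curve(pairs):
    """Return (precision, recall) at every cutoff, from smallest distance up.

    pairs is a list of (distance, is_positive) tuples, one per gene pair,
    where is_positive comes from the gold standard.
    """
    total_positives = sum(1 for _, positive in pairs if positive)
    if total_positives == 0:
        raise ValueError("gold standard contains no positive pairs")
    curve = []
    tp = fp = 0
    # Each step moves the cutoff past one more pair, predicting it "true".
    for _, positive in sorted(pairs, key=lambda p: p[0]):
        if positive:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / total_positives))
    return curve

# Hypothetical gene pairs: (distance, annotated to the same GO term?)
example = [(0.1, True), (0.2, True), (0.3, False), (0.4, True), (0.9, False)]
for precision, recall in precision_recall_curve(example):
    print(round(precision, 3), round(recall, 3))
```

The first points correspond to the few, high-confidence predictions; the last point predicts every pair "true", so recall reaches 1 while precision falls to the overall fraction of positives.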

  22. Example Precision/Recall of various data types

  23. Outline • Microarray methodology • Analysis concerns • Functional Biases • Improved Approaches • Preliminary Conclusions

  24. Functional Biases • Microarray experiments often targeted at a particular process, pathway, or function • However, several “global” signals are often present • Ribosomal response • General Stress Response • Some datasets do contain more targeted “local” signals as well

  25. Ribosome Bias Precision/Recall of various data types

  26. Ribosome Bias Precision/Recall excluding Ribosome Biogenesis

  27. Process-specific P/R • Can generate PR-curves on a per-GO term basis • TPs are pairs of genes annotated to term • FPs are pairs with one gene in term, with smallest common ancestor in very large term • Normalize by size of GO term • Results for individual data sets can expose functional biases

  28. Per-dataset Biases Typical Results

  29. Per-dataset Biases Poor Results

  30. Per-dataset Biases Diverse Results

  31. Z-test for significance • Difference between pair-wise distances for all genes in a term vs. background
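The slide does not give the exact test statistic, so the following is only one plausible reading: a one-sample Z-test that compares the mean similarity among gene pairs in a term against the mean and standard deviation of all background pairs. The one-sided direction (term pairs more similar than background) and the toy numbers are assumptions.

```python
import math
import statistics

def z_test(term_values, background_values):
    """Return (z, one-sided p-value) for term pairs versus the background.

    Inputs are pairwise similarity scores (for example Z-transformed
    correlations); larger means more similar.
    """
    n = len(term_values)
    mu = statistics.fmean(background_values)
    sigma = statistics.pstdev(background_values)
    z = (statistics.fmean(term_values) - mu) / (sigma / math.sqrt(n))
    # Upper-tail probability of a standard normal variable.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

# Hypothetical scores: background pairs versus pairs within one GO term.
background = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
term_pairs = [1.0, 1.0, 1.0, 1.0]
z, p = z_test(term_pairs, background)
print(z, p)
```

Because the number of pairs in a large term is big, even modest shifts in the mean can give extremely small p-values.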

  32. A Global View • Z-test P-values • Columns: datasets • Rows: GO terms • Red at a cutoff of 10^-10

  33. A Global View

  34. A Global View

  35. A Local View

  36. A Local View

  37. Outline • Microarray methodology • Analysis concerns • Functional Biases • Improved Approaches • Preliminary Conclusions

  38. Bi-clustering • Traditional clustering will be driven by “global” signals and ignore “local” signals • Bi-clustering identifies groups of genes and conditions rather than just genes • (Figure panels: Bi-clustering, Traditional clustering)

  39. Bi-clustering goals/issues • Better capture biological reality • Genes only cooperate in certain conditions • Genes can have multiple functions • Datasets have functional biases • Computationally difficult problem • Reducible to bi-clique finding • NP-complete • Heuristics, simplifications, approximations • e.g. δ-biclusters, SAMBA, PISA

  40. Bi-clustering goals/issues • Microarray noise can lead to spurious output • As compendiums increase in size, patterns by chance increase • Datasets have “smallest logical groupings” • Restrict co-expression to these groups • Long running times + large result sets • Difficult to validate results • Scientifically frustrating

  41. Query-driven approach • Allow users to specify a starting point for search • Leverages expert knowledge of domain • Known to be useful in other contexts • bioPIXIE • Identify conditions/datasets of interest based on the set of query genes • Expand query set to include additional related genes in these conditions

  42. Query-driven approach • Reduces problem complexity to allow for real-time results • Fast results allow for user-driven refinement of search criteria • Extensible to larger data compendiums and more complex organisms • Locality sensitive hashing • Pre-processing

  43. Query Weighting • Identify data conditions in which the query genes are related • Average correlation, distance, etc. • Signal to Noise ratio of query • Centroid significance • Find additional genes related to the query • Correlation, distance, etc. weighted by identified condition sets

  44. Simple Scheme • Weighted by correlation of query

  45. Simple Scheme • Results: weighted sum of correlation to query • (Figure: results ordered by decreasing correlation)
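One way to read the simple scheme is sketched below: each dataset is weighted by how well the query genes correlate with each other in it, and every other gene is scored by the weighted sum of its average correlation to the query. The details are assumptions, not the talk's implementation: negative weights are clipped to zero, the data layout (one dict of gene name to profile per dataset) is invented, and the gene names are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def dataset_weight(dataset, query):
    """Average pairwise correlation among query genes, clipped at zero."""
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    average = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(0.0, average)

def rank_genes(datasets, query):
    """Rank non-query genes by the weighted sum of correlation to the query."""
    scores = {}
    for dataset in datasets:
        weight = dataset_weight(dataset, query)
        if weight == 0.0:
            continue  # the query is not coherent here; skip the dataset
        present = [g for g in query if g in dataset]
        for gene, profile in dataset.items():
            if gene in query:
                continue
            corr = sum(pearson(profile, dataset[q]) for q in present) / len(present)
            scores[gene] = scores.get(gene, 0.0) + weight * corr
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Two hypothetical datasets; the query genes agree only in the first.
dataset_1 = {"Q1": [1, 2, 3, 4], "Q2": [2, 4, 6, 8.5], "A": [1, 2, 3, 5], "B": [4, 3, 2, 1]}
dataset_2 = {"Q1": [1, 2, 3, 4], "Q2": [4, 3, 2, 1], "A": [4, 1, 3, 2], "B": [1, 2, 3, 4]}
ranked = rank_genes([dataset_1, dataset_2], query={"Q1", "Q2"})
for gene, score in ranked:
    print(gene, round(score, 3))
```

In this toy example the second dataset receives zero weight because the query genes are anti-correlated there, so the ranking is driven entirely by the first dataset.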

  46. Ongoing Work • Compare query weighting schemes • UI challenges • Scalability concerns • Indexing, Locality Sensitive Hashing • Human data • Assess biological usefulness

  47. Preliminary Conclusions • Noise, functional biases, collection sizes require consideration in microarray analysis • Evaluation metrics can be influenced by biases creating misleading results • Query-driven approaches show promise • Targeted search • Computational feasibility / Real-time results • Extensibility

  48. Acknowledgements • Olga Troyanskaya • Chad Myers • Curtis Huttenhower • Kai Li and lab • Botstein and Kruglyak labs • Kara Dolinski, Maitreya Dunham Jessy