
Mining microarray expression data by literature profiling




  1. Mining microarray expression data by literature profiling Damien Chaussabel and Alan Sher National Institutes of Health

  2. Why do we need automated literature profiling? • Very large datasets and complex experiments • Qualified individuals cannot manually mine all of the available literature • Divergent naming schemes have evolved

  3. Literature Profiling • “We describe how a literature-derived term frequency database can be generated and mined through the analysis of patterns of occurrences of a restricted subset of relevant terms”

  4. Step 1: Literature Indexing • Articles related to the genes in the list are searched for in the Medline database • For each gene, the search results are downloaded, and the abstracts are extracted and saved as a new file for text analysis
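
The abstract-extraction part of this step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the search results were saved in PubMed's tagged MEDLINE text format (abstract in the `AB  - ` field, continuation lines indented by six spaces), which the slides do not specify. The function name `extract_abstracts` and the sample records are made up.

```python
def extract_abstracts(medline_text):
    """Collect the AB (abstract) field of each record in MEDLINE tagged text."""
    abstracts, current, in_abstract = [], [], False
    for line in medline_text.splitlines():
        # Continuation lines of a field are indented by six spaces.
        if in_abstract and line.startswith("      "):
            current.append(line.strip())
            continue
        # Any other line ends the abstract currently being collected.
        if in_abstract:
            abstracts.append(" ".join(current))
            in_abstract = False
        if line.startswith("AB  - "):
            current = [line[6:].strip()]
            in_abstract = True
    if in_abstract:
        abstracts.append(" ".join(current))
    return abstracts


# Made-up records, for illustration only.
sample = (
    "PMID- 00000001\n"
    "TI  - A made-up title.\n"
    "AB  - First line of a made-up abstract\n"
    "      continued on a second line.\n"
    "AU  - Doe J\n"
    "\n"
    "PMID- 00000002\n"
    "TI  - Another made-up title.\n"
    "AB  - A second made-up abstract.\n"
)
abstracts = extract_abstracts(sample)
```

Running this once per gene and writing each list to its own file would give the per-gene abstract files the slide describes.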

  5. Step 2: Text Analysis • Determine the occurrence of each unique word by analyzing the Medline entries (this study had 4,000 of them) • This yields three kinds of terms: those found frequently for most genes (“cell”, “because”, “is”, “the”), those found rarely, and those found frequently but only for a few genes • The third kind is the useful one.
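
A minimal sketch of the word-occurrence calculation is shown below. It assumes that "occurrence" means the fraction of a gene's abstracts in which a word appears at least once; the slides do not define it precisely. The function name, the tokenizer, and the toy abstracts are all illustrative.

```python
import re
from collections import Counter


def term_occurrence(abstracts_by_gene):
    """Return {gene: {term: fraction of that gene's abstracts containing the term}}."""
    profile = {}
    for gene, abstracts in abstracts_by_gene.items():
        counts = Counter()
        for text in abstracts:
            # Use a set so each term is counted at most once per abstract.
            counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
        n = len(abstracts)
        profile[gene] = {term: c / n for term, c in counts.items()} if n else {}
    return profile


# Made-up abstracts, for illustration only.
abstracts_by_gene = {
    "GENE_A": ["The chemokine is induced in the cell.",
               "Chemokine signalling in the cell."],
    "GENE_B": ["The kinase is active in the cell.",
               "The cell expresses a kinase."],
}
occ = term_occurrence(abstracts_by_gene)
```

In this toy data "cell" appears in every abstract of both genes (the first kind of term), while "chemokine" is frequent for `GENE_A` only, which is the pattern the third kind of term is meant to capture.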

  6. Step 3: Filter, Filter, Filter • 1. Remove commonly found terms • 2. Set a cutoff for distance from baseline • 3. Eliminate terms that apply to only 1 gene • 4. Raise the threshold to increase specificity • 5. Lower the threshold to increase sensitivity
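
The filtering steps could look roughly like the sketch below. It is an illustration under assumptions, not the published procedure: `baseline` is a hypothetical dictionary giving each term's background occurrence, "distance from baseline" is taken to be a simple difference, and the default `cutoff` and `min_genes` values are arbitrary.

```python
def filter_terms(occ, baseline, cutoff=0.25, min_genes=2):
    """Keep terms that exceed their baseline by `cutoff` in at least `min_genes` genes.

    occ:      {gene: {term: occurrence}}
    baseline: {term: background occurrence} (hypothetical input)
    Returns   {term: set of genes in which the term passed the cutoff}.
    """
    hits = {}
    for gene, terms in occ.items():
        for term, freq in terms.items():
            # Common terms sit close to their baseline and are dropped here.
            if freq - baseline.get(term, 0.0) >= cutoff:
                hits.setdefault(term, set()).add(gene)
    # Drop terms that pass the cutoff for too few genes.
    return {term: genes for term, genes in hits.items() if len(genes) >= min_genes}


# Made-up occurrence values, for illustration only.
occ = {
    "GENE_A": {"cell": 0.90, "chemokine": 0.60, "rare": 0.05},
    "GENE_B": {"cell": 0.95, "chemokine": 0.50, "kinase": 0.70},
    "GENE_C": {"cell": 0.90, "apoptosis": 0.40},
}
baseline = {"cell": 0.90, "chemokine": 0.02, "kinase": 0.05,
            "apoptosis": 0.03, "rare": 0.01}

kept = filter_terms(occ, baseline)                  # stricter: shared terms only
loose = filter_terms(occ, baseline, min_genes=1)    # looser: single-gene terms too
```

With these toy numbers only "chemokine" survives the default settings, while relaxing `min_genes` to 1 also keeps "kinase" and "apoptosis", which shows the trade-off between specificity and sensitivity on a small scale.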

  7. Step 4: Clustering Analysis • Tools originally designed for microarray clustering can be applied to literature mining to create “literature profiles” • An array of term-occurrence values vs. individual genes is created • Relationships are assessed by hierarchical clustering (done using Cluster/TreeView, available free from the Eisen lab website)
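
To make the clustering idea concrete, here is a small pure-Python sketch of agglomerative clustering on a gene-by-term occurrence matrix. It is a stand-in for the Cluster/Treeview software named on the slide, not a reimplementation of it; the choice of average linkage and of 1 minus the Pearson correlation as the distance is an assumption, and the gene names and values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumption: treat a flat profile as uncorrelated
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Merge the two closest clusters repeatedly; return (a, b, distance) steps."""
    def dist(a, b):
        pairs = [(g, h) for g in a for h in b]
        total = sum(pearson_distance(profiles[g], profiles[h]) for g, h in pairs)
        return total / len(pairs)

    clusters = [(gene,) for gene in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Rows are genes; columns are occurrences of ["chemokine", "apoptosis", "antigen"].
profiles = {
    "GENE_A": [0.8, 0.1, 0.0],
    "GENE_B": [0.7, 0.2, 0.1],
    "GENE_C": [0.0, 0.1, 0.9],
    "GENE_D": [0.1, 0.0, 0.8],
}
merges = average_linkage(profiles)
```

The list of merge steps is the information a dendrogram viewer draws: genes with similar literature profiles are joined early, and dissimilar groups are joined last.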

  8. Results • In this study, the groups identified were related to immune response (as was hoped) • Genes for transcription factors that control inflammation and apoptosis fell into the first cluster • Chemokines were all grouped into the second group • MHC-I antigen-presenting pathway genes occupied the third grouping

  9. Assumption • “The basis for analyzing expression patterns is the assumption that genes under common transcriptional control are involved in similar processes.”

  10. Notes about parameter setting • For genes with a large number of abstracts, a 25% cutoff may be too high, but for genes with only five abstracts, it may be too low. • Optimizing cutoffs: cut-off = t + (k/n), where t is the minimum threshold, k is a constant, and n is the number of abstracts retrieved for a given gene (t and k are set arbitrarily)
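
The cutoff formula is easy to try out numerically. The sketch below implements cut-off = t + (k/n) directly; the default values of `t` and `k` are arbitrary placeholders chosen for illustration, not the values used in the study, and occurrences are expressed as fractions rather than percentages.

```python
def adaptive_cutoff(n_abstracts, t=0.10, k=1.0):
    """Return t + k/n: a cutoff that relaxes as more abstracts are available."""
    if n_abstracts <= 0:
        raise ValueError("n_abstracts must be positive")
    return t + k / n_abstracts


few = adaptive_cutoff(5)      # gene with only five abstracts
many = adaptive_cutoff(100)   # gene with many abstracts
```

With these placeholder values a gene with five abstracts gets a cutoff of about 0.30, while a gene with 100 abstracts gets about 0.11, so sparsely documented genes need a term to appear in a larger share of their abstracts before it counts.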

  11. Benefits • Independent of user bias and can be used to identify promising findings in unbiased way • Provides investigators with leads for further in-depth investigation of the literature

  12. Limitations • Hindered by the need to retrieve the relevant literature reliably for each gene included in the analysis (manual editing is often required) • Can only be used to direct further investigation (it produces false positives and false negatives)

  13. Why this paper’s method is significant… • Few groups have tried to overcome the inability of scientists to manually mine all the literature in a high-throughput fashion • This technique differs from others because it is based on term occurrence rather than gene name co-citation frequencies

  14. Some text-mining software: • Omniviz (www.omniviz.com) • Eisen Lab (http://rana.lbl.gov/index.htm)

  15. Possible use: • Could identify functions of unknown genes using ‘guilt by association’

  16. Where to find more information: • Profiles generated from this paper can be downloaded and explored using the clustergram browser Treeview available online at no charge (rana.lbl.gov/index.htm).
