Mining microarray expression data by literature profiling

Damien Chaussabel and Alan Sher

National Institutes of Health

Why do we need automated literature profiling?
  • Very large datasets and complex experiments
  • Inability of qualified individuals to manually mine available literature
  • Divergent naming schemes have evolved
Literature Profiling
  • “We describe how a literature-derived term frequency database can be generated and mined through the analysis of patterns of occurrences of a restricted subset of relevant terms”
Step 1: Literature Indexing
  • The Medline database is searched for articles related to each gene in the list
  • For each gene, the search results are downloaded, and the abstracts are extracted and saved as a new file for text analysis
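
As a minimal sketch of the extraction step, assuming the search results were saved in PubMed's MEDLINE text format (where the abstract sits in the `AB  - ` field and continuation lines are indented), the abstracts could be pulled out as follows. The function name and the sample records are hypothetical; the slides do not say what tooling was used.

```python
def extract_abstracts(medline_text):
    """Return the abstract (AB field) of each record in MEDLINE-format text."""
    abstracts = []
    current = None  # pieces of the abstract being read, or None outside an AB field
    for line in medline_text.splitlines():
        if line.startswith("AB  - "):
            if current is not None:
                abstracts.append(" ".join(current))
            current = [line[6:].strip()]
        elif current is not None and line.startswith("      "):
            current.append(line.strip())  # indented continuation of the abstract
        elif current is not None:
            abstracts.append(" ".join(current))  # any other line ends the field
            current = None
    if current is not None:
        abstracts.append(" ".join(current))
    return abstracts


# Made-up records, for illustration only.
sample = (
    "PMID- 0000001\n"
    "TI  - A made-up title\n"
    "AB  - Interferon induces chemokine expression\n"
    "      in macrophages.\n"
    "AU  - Doe J\n"
    "\n"
    "PMID- 0000002\n"
    "AB  - Apoptosis is regulated by a transcription factor.\n"
)
abstracts = extract_abstracts(sample)
```

Records without an `AB` field simply contribute nothing, so the number of abstracts per gene can be smaller than the number of search hits.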
Step 2: Text Analysis
  • Determine the word occurrence for each unique word by analysis of the Medline entries (this study had 4,000 of them)
  • This yields three kinds of terms: those found very frequently overall (“cell”, “because”, “is”, “the”), those found rarely, and those found frequently but only for a few genes.
  • The third kind is the useful one.
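
A rough sketch of the counting step, assuming "occurrence" means the fraction of a gene's abstracts that contain a term at least once (the slides do not define it precisely). The tokenizer, the function name, and the gene labels and sentences are all hypothetical.

```python
import re
from collections import Counter


def term_occurrence(abstracts):
    """Fraction of abstracts in which each term appears at least once."""
    counts = Counter()
    for text in abstracts:
        # set() so a term repeated within one abstract is counted only once
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    n = len(abstracts)
    return {term: c / n for term, c in counts.items()}


# Made-up abstracts for two hypothetical genes.
abstracts_by_gene = {
    "GENE_A": ["The chemokine is induced.", "Chemokine binds the receptor."],
    "GENE_B": ["The kinase is active.", "The cell cycle is arrested."],
}
occurrence = {g: term_occurrence(a) for g, a in abstracts_by_gene.items()}
```

Here "the" scores 1.0 for both genes (the first kind of term), while "chemokine" scores 1.0 for GENE_A and is absent for GENE_B, which is the pattern the third kind of term shows.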
Step 3: Filter, Filter, Filter
  • 1. Remove commonly found terms
  • 2. Set a cutoff for distance from baseline
  • 3. Eliminate terms that apply to only one gene
  • 4. Increasing the threshold increases specificity
  • 5. Decreasing the threshold increases sensitivity
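
The first three filters could be sketched as below. This is an illustration under assumptions, not the authors' procedure: the baseline is taken to be a term's mean occurrence across all genes, and `max_baseline`, `cutoff` and `min_genes` are hypothetical parameter names with arbitrary defaults. The input values are made up.

```python
def filter_terms(occurrence, max_baseline=0.5, cutoff=0.25, min_genes=2):
    """Keep terms that stand out from the baseline for at least `min_genes` genes."""
    genes = list(occurrence)
    terms = {t for profile in occurrence.values() for t in profile}
    kept = []
    for term in sorted(terms):
        values = [occurrence[g].get(term, 0.0) for g in genes]
        baseline = sum(values) / len(values)  # assumed baseline: mean over genes
        if baseline > max_baseline:
            continue  # filter 1: commonly found term
        # filter 2: count genes whose occurrence exceeds the baseline by the cutoff
        hits = sum(1 for v in values if v - baseline >= cutoff)
        if hits >= min_genes:  # filter 3: drop terms tied to a single gene
            kept.append(term)
    return kept


# Made-up occurrence fractions per gene.
occurrence = {
    "G1": {"the": 1.0, "chemokine": 0.8, "zebra": 0.6},
    "G2": {"the": 1.0, "chemokine": 0.75},
    "G3": {"the": 0.9, "kinase": 0.1},
    "G4": {"the": 1.0},
}
kept = filter_terms(occurrence)
stricter = filter_terms(occurrence, cutoff=0.4)
```

With the default cutoff only "chemokine" survives; raising the cutoff to 0.4 removes it as well, which illustrates the specificity and sensitivity trade-off in points 4 and 5.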
Step 4: Clustering Analysis
  • Tools originally designed for microarray clustering can be applied to literature mining to create “literature profiles”
  • Array of term-occurrence values vs. individual genes is created
  • Relationships are assessed by hierarchical clustering (done using Cluster and TreeView, available free from the Eisen lab website)
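
To make the clustering step concrete, here is a small pure-Python sketch of agglomerative clustering over a gene-by-term matrix. It is not the Cluster program: the choice of Euclidean distance with average linkage is an assumption, and the gene labels and occurrence values are made up.

```python
from itertools import combinations
from math import dist


def linkage(profiles, members_a, members_b):
    """Average Euclidean distance between all cross-cluster gene pairs."""
    pairs = [(x, y) for x in members_a for y in members_b]
    return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)


def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = {name: [name] for name in profiles}  # cluster label -> member genes
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda p: linkage(profiles, clusters[p[0]], clusters[p[1]]),
        )
        merges.append((a, b, linkage(profiles, clusters[a], clusters[b])))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Rows: hypothetical genes. Columns: occurrence of three hypothetical terms.
profiles = {
    "chemokine_A": [0.9, 0.1, 0.0],
    "chemokine_B": [0.8, 0.2, 0.0],
    "tf_A": [0.1, 0.9, 0.3],
    "tf_B": [0.0, 0.8, 0.4],
}
merges = average_linkage(profiles)
```

Genes with similar term profiles are joined first, so the order of merges gives the tree that a viewer such as TreeView would draw.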
Results
  • In this study, the groups identified were related to immune response (as was hoped)
  • Genes for transcription factors that control inflammation and apoptosis fell into the first cluster
  • Chemokines were all placed in the second cluster
  • MHC-I antigen-presenting pathway genes occupied the third cluster
Assumption
  • “The basis for analyzing expression patterns is the assumption that genes under common transcriptional control are involved in similar processes.”
Notes about parameter setting
  • For genes with a larger number of abstracts, a 25% cutoff may be too high, but for genes with only five abstracts, it may be too low.
  • Optimizing cutoffs:

cut-off = t + (k/n)

where t is the minimum threshold, k is a constant, and n is the number of abstracts retrieved for a given gene (t and k are chosen arbitrarily)
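
A short worked example of the formula, with occurrence expressed as a fraction. The values of t and k below are arbitrary placeholders chosen for illustration, not values from the study.

```python
def cutoff(n_abstracts, t=0.1, k=1.0):
    """cut-off = t + (k / n); t and k here are arbitrary illustrative values."""
    if n_abstracts <= 0:
        raise ValueError("a gene needs at least one abstract")
    return t + k / n_abstracts


few = cutoff(5)     # gene with only five abstracts: about 0.30
many = cutoff(100)  # gene with many abstracts: about 0.11
```

The cutoff falls toward t as n grows, so well-studied genes get a lower bar and sparsely studied genes a higher one, which addresses the problem with a fixed 25% cutoff.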

Benefits
  • Independent of user bias, so it can be used to identify promising findings in an unbiased way
  • Provides investigators with leads for further in-depth investigation of the literature
Limitations
  • Hindered by the need to retrieve the relevant literature reliably for each gene included in the analysis (manual editing is often required)
  • Can only be used to direct further investigation (it produces false positives and false negatives)
Why this paper’s method is significant…
  • Few groups have tried to overcome the inability of scientists to manually mine all the literature in a high-throughput fashion
  • This technique differs from others because it is based on term occurrence rather than gene name co-citation frequencies
Some text-mining software:
  • Omniviz (www.omniviz.com)
  • Eisen Lab (http://rana.lbl.gov/index.htm)
Possible use:
  • Could identify functions of unknown genes using ‘guilt by association’
Where to find more information:
  • Profiles generated from this paper can be downloaded and explored using the clustergram browser Treeview available online at no charge (rana.lbl.gov/index.htm).