
Literature Retrieval and Mining




  1. Literature Retrieval and Mining Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

  2. Outline • Introduction to PubMed • PubMed Related Articles • Search engines and Google features • H-index

  3. PubMed • PubMed → NCBI → NLM → NIH • Biomedical literature database • > 21M citations from 4800 journals since 1948 • Entrez is the retrieval system • PubMed entry • Citation (paper) published (recent papers may be indexed upon electronic publication) • Citation indexed in PubMed with a PubMed ID assigned • Citation indexed with MeSH (Medical Subject Headings, like keywords) terms • For direct full article access: http://www.ncbi.nlm.nih.gov.ezp1.harvard.edu/sites/entrez?holding=hulib

  4. PubMed Articles

  5. Search by Author / Journal / Date • By author: • Lastname FirstMiddleInitial [au]: Liu JS • First author [1au], last author [lastau] • Full name [fau]: Jun S Liu • By journal title: [ta] • Full journal title or MEDLINE abbreviation: PNAS • Use the index to look up a journal title, e.g. “proceedings of the national” • By date: • yyyy/mm/dd [dp] • Date range (:) or “last x days/months/years” • Example: Liu JS [au] AND "last 3 years" [dp]

  6. Search Syntax and Tags • Boolean: AND, OR, NOT • Field tags • First author [1AU], author [AU], full author [FAU], 1st author affiliation [AD] • MeSH term [MH] • Title [TI], title/abstract [TIAB], text words [TW] • Publication date [DP] • Publication type [PT], journal title [TA], Language [LA] • Sorted by: Date, author, journal • Details: View and edit actual query
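
The field tags and Boolean operators above can also be assembled programmatically. The following is a minimal sketch, not part of the original slides: the helper names (`tagged`, `build_query`, `esearch_url`) are hypothetical, and the endpoint is NCBI's public E-utilities ESearch service, which the slides do not mention. The URL is only built, not sent.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities ESearch endpoint (not named on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a PubMed field tag, e.g. tagged('Liu JS', 'au') -> 'Liu JS[au]'."""
    return f"{term}[{tag}]"


def build_query(*clauses, op="AND"):
    """Join already-tagged clauses with one Boolean operator (AND, OR, NOT)."""
    return f" {op} ".join(clauses)


def esearch_url(query, retmax=20):
    """Build (but do not send) an ESearch request URL for a PubMed query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Author AND journal AND publication-date range, using tags from the slide.
query = build_query(
    tagged("Liu JS", "au"),
    tagged("PNAS", "ta"),
    tagged("2005/01/01:2010/12/31", "dp"),
)
print(query)
print(esearch_url(query))
```

The same query string can be pasted directly into the PubMed search box; the URL form is only useful for scripted retrieval.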

  7. Advanced Search and PubMed Display • Advanced search • Index: refine search before display • History: keep most recent 100 queries for 8 hours, e.g. #5 AND #3 • Displayed by: • Summary • Abstract • Send To: text, file, email, Clipboard, RSS! • More on PubMed: • http://www.nlm.nih.gov/bsd/disted/pubmed.html

  8. Getting Information from Text

  9. Literature Mining Terms • Corpus: Collection of documents. E.g. all papers in PubMed • Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper • Document frequency: Number of documents a word appears in. E.g. 1234x papers have the word “transcription” • Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789x times in all PubMed-indexed papers • Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an • Stemming: Group together different variations of the same word. E.g. activate vs. activated vs. activating
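
The terms above can be made concrete on a toy corpus. This is a minimal sketch under stated assumptions: the three "documents", the stop-word list, and the `crude_stem` suffix stripper are all invented for illustration; a real pipeline would use an established stemmer and a much larger stop-word list.

```python
from collections import Counter

# Toy corpus and stop-word list, invented for illustration only.
corpus = [
    "the polymerase is activated",
    "transcription activates the polymerase",
    "transcription is activating transcription",
]
STOP_WORDS = {"the", "is", "to", "an"}


def crude_stem(word):
    """Very rough suffix stripping; real stemmers are far more careful."""
    for suffix in ("ating", "ated", "ates", "ate"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def tokenize(doc):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [crude_stem(w) for w in doc.lower().split() if w not in STOP_WORDS]


docs = [tokenize(d) for d in corpus]

# Term frequency: one counter per document.
term_freq = [Counter(d) for d in docs]
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for d in docs for term in set(d))
# Collection frequency: total occurrences of each term across the corpus.
coll_freq = Counter(term for d in docs for term in d)

print(term_freq[2]["transcription"], doc_freq["transcription"], coll_freq["transcription"])
```

Note how stemming collapses "activated", "activates", and "activating" into one term, so its document frequency covers all three documents.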

  10. Documents Represented as Vectors • A document is summarized as a vector of word counts. • Each dimension contains the number of times a word appears. • Similarity between two documents can be calculated by comparing their vectors • “Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.” → acid 2, amino 2, analysis 1, comparison 1, control 1, environments 3, […], our 1
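
A word-count vector for the example sentence can be built in a few lines. This is a sketch, not the lecture's own code: the function names are hypothetical, and the vocabulary is simply the sorted set of words in this one sentence, whereas a real system would fix one vocabulary across the whole corpus.

```python
import re
from collections import Counter

sentence = (
    "Our analysis includes comparison of amino acid environments "
    "with random control environments as well as with each of the "
    "other amino acid environments."
)


def word_counts(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order (0 for absent words)."""
    return [counts[word] for word in vocabulary]


counts = word_counts(sentence)
vocabulary = sorted(counts)           # one dimension per distinct word
vector = to_vector(counts, vocabulary)

for word, n in zip(vocabulary, vector):
    print(word, n)
```

Two documents become comparable once both are projected onto the same vocabulary with `to_vector`.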

  11. Comparing Two Documents • Intuitive comparison between two papers → correlation coefficient of their word occurrence vectors • Correlation measures the strength of the linear relationship between two random variables • Example in R:
      a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
      b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
      c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
      cor(a, b)   # 0.985615, correlated
      cor(b, c)   # -0.110328, not correlated
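
For readers who want to see what the correlation call computes, here is a minimal Python sketch of the Pearson correlation coefficient applied to the same three vectors. The function name `pearson` is an assumption of this sketch; R's `cor` uses the Pearson method by default.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))  # 0.985615
print(round(pearson(b, c), 6))  # -0.110328
```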

  12. Term Weighting Considerations • Give different terms different weight • Global weight • Document frequency

  13. Term Weighting Considerations • Give different terms different weight • Global weight • Document frequency: Fewer documents, more weight. E.g. progesterone vs gene • Local weight • Term frequency

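  14. Term Weighting Considerations • Give different terms different weight • Global weight • Document frequency: log(N / df), where N is the number of documents in the corpus and df is the term's document frequency • Local weight • Term frequency: 1 + log(tf), where tf is the term's count in the document • Document length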
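
The two formulas on this slide translate directly into code. This is a sketch under stated assumptions: natural logarithms are used (the slide does not give a base), a term that does not occur gets local weight 0, the corpus size and document frequencies below are made-up numbers, and document-length normalization is left out because the slide gives no formula for it.

```python
from math import log


def global_weight(n_docs, df):
    """Global (inverse document frequency) weight: log(N / df)."""
    return log(n_docs / df)


def local_weight(tf):
    """Local (term frequency) weight: 1 + log(tf), or 0 if the term is absent."""
    return 1 + log(tf) if tf > 0 else 0.0


# Hypothetical numbers: a rare term versus a common one in a 1000-paper corpus.
N = 1000
print("progesterone:", global_weight(N, df=10))
print("gene:        ", global_weight(N, df=800))

# The log dampens repeated mentions: 41 occurrences count far less than 41x.
print(local_weight(1), local_weight(41))
```

The rare term receives the larger global weight, matching the "fewer documents, more weight" rule from the previous slide.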

  15. Related Citations • Similarity between two documents: sum over all terms of (local wt1 × local wt2 × global wt) • Pre-computed related articles for each citation • Rank ordered by relevance, then date • How to evaluate: • Tradeoff between precision and recall • Precision = # relevant hits / # hits • Recall = # relevant hits / # relevant • Often # relevant is arbitrary, or sampled
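
The similarity score and the two evaluation measures can be sketched as follows. This is an illustration only: it plugs in the 1 + log(tf) and log(N / df) weights from the term-weighting slide, the function names and toy inputs are hypothetical, and PubMed's production related-citations algorithm is more elaborate than this.

```python
from math import log


def similarity(tf1, tf2, df, n_docs):
    """Sum over shared terms of local wt1 * local wt2 * global wt.

    tf1, tf2: term -> count (> 0) for each document.
    df: term -> document frequency in the corpus; n_docs: corpus size.
    Terms missing from either document contribute nothing.
    """
    score = 0.0
    for term in tf1.keys() & tf2.keys():
        local1 = 1 + log(tf1[term])
        local2 = 1 + log(tf2[term])
        score += local1 * local2 * log(n_docs / df[term])
    return score


def precision_recall(hits, relevant):
    """Precision = relevant hits / hits; recall = relevant hits / relevant."""
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 returned citations, 3 truly relevant ones, 2 in common.
print(precision_recall(hits=[1, 2, 3, 4], relevant=[2, 4, 6]))
```

Returning more hits tends to raise recall and lower precision, which is the tradeoff the slide refers to.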

  16. Jane • http://www.biosemantics.org/jane/ • Have you recently written a paper, but you're not sure to which journal you should submit it? Or maybe you want to find relevant articles to cite in your paper? Or are you an editor, and do you need to find reviewers for a particular paper? Maybe you are a reviewer, and wonder whether the authors’ ‘novel’ approach or finding is really novel? Jane can help!

  17. Search Engine Components • The crawler: visits all websites, traverses all links • The index • Checks keywords and full text • Ignores stop words: e.g. is, at, from… • Paid inclusion: not Google • Google looks at semantics and logic • The search engine software: how to rank • Location (nearer the front ranks higher) and frequency (more occurrences rank higher) • Off-the-page factors: how many pages link to this one • Clickthrough measurement: lower the rank of search results that are not clicked
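
The index and ranking components can be illustrated with a toy inverted index. This is a deliberately simplified sketch: the pages, inbound-link counts, and the scoring formula (frequency plus an "earliness" bonus plus link count) are all invented for illustration and do not describe how any real search engine ranks results.

```python
from collections import defaultdict

STOP_WORDS = {"is", "at", "from", "the", "a", "of"}

# Hypothetical crawled pages and inbound-link counts.
pages = {
    "a.example": "proteomics is the study of proteins",
    "b.example": "a tutorial from the lab proteomics data proteomics tools",
    "c.example": "genomics at scale",
}
inbound_links = {"a.example": 5, "b.example": 1}


def build_index(pages):
    """Map each non-stop word to {url: [positions where it occurs]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue
            index[word].setdefault(url, []).append(position)
    return index


def search(index, keyword, inbound_links):
    """Rank pages containing the keyword by a made-up combined score."""
    postings = index.get(keyword.lower(), {})

    def score(url):
        positions = postings[url]
        frequency = len(positions)             # more occurrences rank higher
        earliness = 1 / (1 + positions[0])     # earlier first occurrence ranks higher
        return frequency + earliness + inbound_links.get(url, 0)

    return sorted(postings, key=score, reverse=True)


index = build_index(pages)
print(search(index, "proteomics", inbound_links))
```

Here the page with more inbound links outranks the page that merely repeats the keyword, a small-scale version of the off-the-page factor on the slide.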

  18. Other Useful Google Features • http://www.google.com/intl/en/help/features.html • Conversion: 88 cm in inches or 10000 yen in USD • Time: time Beijing • Definition: define proteomics • Local search: CVS 02115 • Travel info: united 134 • Site search: training grant site:hsph.harvard.edu • Who links to this: link:www.ncbi.nlm.nih.gov • Filetype: comparative genomics filetype:ppt

  19. H-index • Hirsch, PNAS 2005 • Simultaneously measures productivity and impact of a scientist • A scholar with an index of h has published h papers, each of which has been cited by others at least h times • Check citations at scholar.google.com
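
The definition above can be computed directly from a list of per-paper citation counts. This is a minimal sketch; the function name and the citation counts are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([25, 8, 5, 3, 3]))  # 3
```

The second example shows that one very highly cited paper does not raise the index on its own; the measure rewards a body of consistently cited work.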

  20. H-index • For physical sciences: ~12 for a tenured associate professor, ~18 for a full professor, ~45 for a member of the National Academy of Sciences • A few biases: • Ignores where the author appears on a paper and the total number of authors • Favors older, living scientists and sustained productivity • Ignores context (e.g. citations of wrong results)

  21. Google Scholar • Scholar.google.com • Set preference to access open source and Harvard Lib: http://scholar.google.com/scholar_preferences?hl=en • Where is a paper cited • Get free pdf without subscription • Get H-index • “My Citations” • Manually update papers after the initial creation

  22. Want to Learn a Comp Bio Topic? • PubMed: • Recent reviews in good journals • “Related articles” • Nature Biotechnology CompBio track • PLoS Computational Biology Collection • http://collections.plos.org/ploscompbiol/index.php • 10 Simple Rules, Educational • Search Google for the topic plus “lecture notes” or “tutorial”, with filetype:ppt or filetype:pdf • Search http://www.wikipedia.org/ and Google definition • Try http://CompBio.pbwiki.com

  23. Acknowledgement • Russ Altman • Soumya Raychaudhuri • John Quackenbush • Jeff Chang
