Literature Retrieval and Mining Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Outline • Introduction to PubMed • PubMed Related Articles • Search engines and Google features • H index
PubMed • PubMed is run by NCBI, part of NLM at NIH • Biomedical literature database • More than 21M citations from 4800 journals since 1948 • Entrez is the retrieval system • Life cycle of a PubMed entry • Citation (paper) is published (recent papers may be indexed upon epub) • Citation is indexed in PubMed and assigned a PubMed ID • Citation is indexed with MeSH (Medical Subject Headings, like keywords) terms • For direct full article access: http://www.ncbi.nlm.nih.gov.ezp1.harvard.edu/sites/entrez?holding=hulib
Search by Author / Journal / Date • By author: • Lastname FirstMiddleInitial [au]: Liu JS • First author [1au], last author [lastau] • Full name [fau]: Jun S Liu • By journal title: [ta] • Full journal title or MEDLINE abbreviation: PNAS • Use the Index to look up a journal title, e.g. “proceedings of the national” • By date: • yyyy/mm/dd [dp] • Date range (:) or “last x days/months/years” • Example: Jun S Liu [au] AND "last 3 years" [dp]
Search Syntax and Tags • Boolean: AND, OR, NOT • Field tags • First author [1AU], author [AU], full author [FAU], 1st author affiliation [AD] • MeSH term [MH] • Title [TI], title/abstract [TIAB], text words [TW] • Publication date [DP] • Publication type [PT], journal title [TA], Language [LA] • Sorted by: Date, author, journal • Details: View and edit actual query
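The field tags and Boolean operators above can be combined programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint (not mentioned on the slides) as the target; `tagged` and `build_query` are hypothetical helper names, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a PubMed field tag, quoting multi-word terms."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{tag}]"


def build_query(author=None, journal=None, title_word=None, date=None):
    """Join the supplied fields with Boolean AND (hypothetical helper)."""
    parts = []
    if author:
        parts.append(tagged(author, "au"))      # author
    if journal:
        parts.append(tagged(journal, "ta"))     # journal title
    if title_word:
        parts.append(tagged(title_word, "ti"))  # title
    if date:
        parts.append(tagged(date, "dp"))        # publication date
    return " AND ".join(parts)


query = build_query(author="Liu JS", journal="PNAS", date="last 3 years")
# URL-encode the query; db, term and retmax are ESearch parameters.
url = ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": query, "retmax": 20})
print(query)
print(url)
```

The same query string can be pasted into the PubMed search box, where the Details view shows how it is actually interpreted.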
Advanced Search and PubMed Display • Advanced search • Index: refine search before display • History: keep most recent 100 queries for 8 hours, e.g. #5 AND #3 • Displayed by: • Summary • Abstract • Send To: text, file, email, Clipboard, RSS! • More on PubMed: • http://www.nlm.nih.gov/bsd/disted/pubmed.html
Literature Mining Terms • Corpus: Collection of documents. E.g. all papers in PubMed • Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper • Document frequency: Number of documents a word appears in. E.g. 1234x papers have the word “transcription” • Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789X times in all PubMed-indexed papers • Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an • Stemming: Grouping together different variations of the same word. E.g. activate vs. activated vs. activating
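The terms above can be made concrete on a toy corpus. This is a minimal sketch: the three sentences, the stop-word list, and the suffix-stripping `crude_stem` function are all illustrative assumptions, and a real pipeline would use an established stemmer (such as Porter's) instead.

```python
import re
from collections import Counter

STOP_WORDS = {"to", "is", "an", "the", "a", "of"}  # tiny illustrative list


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


# Hypothetical three-document corpus.
corpus = [
    "The polymerase is activated to start transcription",
    "Activating an enhancer can activate transcription",
    "The polymerase binds the promoter",
]

# Term frequency: one Counter of stemmed terms per document.
term_frequency = [Counter(terms(doc)) for doc in corpus]
# Document frequency: number of documents containing each term.
document_frequency = Counter(t for counts in term_frequency for t in counts)
# Collection frequency: total occurrences across the whole corpus.
collection_frequency = sum(term_frequency, Counter())

stem = crude_stem("activated")
print(stem, term_frequency[1][stem],
      document_frequency[stem], collection_frequency[stem])
```

Because "activate", "activated" and "activating" share one stem, their counts are pooled in all three statistics.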
Documents Represented as Vectors • A document is summarized as a vector of word counts. • Each dimension contains the number of times a word appears. • Can calculate similarity between two documents by comparing their vectors • “Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.” acid 2 amino 2 analysis 1 comparison 1 control 1 environments 3 […] our 1
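A word-count vector for the example sentence can be built in a few lines. This is a minimal sketch using a simple letters-only tokenizer (an assumption; stop-word removal and stemming are skipped), with `to_vector` as a hypothetical helper that lays counts out over a fixed vocabulary.

```python
import re
from collections import Counter

sentence = ("Our analysis includes comparison of amino acid environments "
            "with random control environments as well as with each of the "
            "other amino acid environments.")

# Lowercase and keep runs of letters as words, then count them.
counts = Counter(re.findall(r"[a-z]+", sentence.lower()))


def to_vector(word_counts, vocabulary):
    """Return counts in vocabulary order; absent words count as 0."""
    return [word_counts[word] for word in vocabulary]


vocabulary = sorted(counts)  # one dimension per distinct word
vector = to_vector(counts, vocabulary)

for word, n in zip(vocabulary, vector):
    print(word, n)
```

To compare two documents, both should be projected onto the same vocabulary (for example the union of their words) so the dimensions line up.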
Comparing Two Documents • Intuitive comparison between two papers: the correlation coefficient of their word occurrence vectors • Correlation measures the strength of the linear relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b)  # 0.985615, correlated
cor(b, c)  # -0.110328, not correlated
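For readers working outside R, the same comparison can be sketched in Python. The `pearson` function below is a plain standard-library implementation of the Pearson correlation coefficient (the default method of R's `cor`), applied to the three vectors from the slide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))  # a and b: strongly correlated
print(round(pearson(b, c), 6))  # b and c: essentially uncorrelated
```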
Term Weighting Considerations • Give different terms different weights • Global weight • Document frequency: the fewer documents a term appears in, the more weight it gets. E.g. progesterone vs. gene • Weight: log(N / df), where N is the number of documents in the corpus and df is the term's document frequency • Local weight • Term frequency: 1 + log(tf) • Document length
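The two formulas can be tried out numerically. This is a minimal sketch: the corpus size and document frequencies are made-up numbers chosen only to contrast a rare term with a common one, natural logarithms are an assumption (the slide does not give a base), and document-length adjustment is not modeled.

```python
import math


def global_weight(n_docs, df):
    """Global weight log(N / df): rarer terms get larger weights."""
    return math.log(n_docs / df)


def local_weight(tf):
    """Local weight 1 + log(tf); 0 when the term does not occur."""
    return 1 + math.log(tf) if tf > 0 else 0.0


N = 1000                                     # hypothetical corpus size
doc_freq = {"progesterone": 10, "gene": 500}  # hypothetical document frequencies

for term, df in doc_freq.items():
    print(term, round(global_weight(N, df), 3))

# The logarithm dampens term frequency: 41 occurrences do not
# count 41 times as much as a single occurrence.
print(local_weight(1), round(local_weight(41), 3))
```

The rare term ("progesterone") ends up with a much larger global weight than the common one ("gene"), which is the behavior the slide describes.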
Related Citations • Similarity between two documents: sum over all terms of (local wt1 × local wt2 × global wt) • Pre-computed related articles for each citation • Rank ordered by relevance, then date • How to evaluate: • Tradeoff between precision and recall • Precision = # relevant hits / # hits • Recall = # relevant hits / # relevant documents • Often # relevant is arbitrary, or sampled
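The similarity score and the two evaluation measures can be sketched as follows. This is an illustration only, assuming the local weight 1 + log(tf) and global weight log(N / df) from the term weighting slides; it is not PubMed's production algorithm, and the counts and document IDs are made up.

```python
import math


def similarity(tf1, tf2, df, n_docs):
    """Sum over shared terms of local wt1 x local wt2 x global wt."""
    score = 0.0
    # Terms missing from either document contribute nothing.
    for term in tf1.keys() & tf2.keys():
        local1 = 1 + math.log(tf1[term])
        local2 = 1 + math.log(tf2[term])
        score += local1 * local2 * math.log(n_docs / df[term])
    return score


def precision_recall(hits, relevant):
    """Precision = relevant hits / hits; recall = relevant hits / relevant."""
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical term counts for two papers and corpus-wide document frequencies.
paper1 = {"progesterone": 3, "gene": 5}
paper2 = {"progesterone": 1, "gene": 2, "receptor": 4}
df = {"progesterone": 10, "gene": 500, "receptor": 50}
print(round(similarity(paper1, paper2, df, n_docs=1000), 3))

# Four returned hits, of which two are among the three relevant documents.
print(precision_recall(hits=[1, 2, 3, 4], relevant=[2, 4, 6]))
```

Returning more hits tends to raise recall and lower precision, which is the tradeoff named on the slide.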
Jane • http://www.biosemantics.org/jane/ • Have you recently written a paper, but you're not sure to which journal you should submit it? Or maybe you want to find relevant articles to cite in your paper? Or are you an editor, and do you need to find reviewers for a particular paper? Maybe you are a reviewer, and wonder whether the authors’ ‘novel’ approach or finding is really novel? Jane can help!
Search Engine Components • The crawler: visits all websites, traverses all links • The index • Checks keywords and full text • Ignores stop words, e.g. is, at, from… • Paid inclusion: not used by Google • Google looks at semantics and logic • The search engine software: how to rank • Location (nearer the front) and frequency (more occurrences) • Off-the-page factors: how many pages link to this one • Clickthrough measurement: lower the rank of search results that are not clicked
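The index component can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions: the three pages and the stop-word list are made up, `build_index` and `search` are hypothetical helpers, and ranking (location, frequency, links, clickthrough) is deliberately left out.

```python
import re
from collections import defaultdict

STOP_WORDS = {"is", "at", "from", "the", "a", "of"}  # illustrative list

# Hypothetical crawled pages: identifier -> full text.
pages = {
    "page1": "Proteomics is the study of proteins",
    "page2": "Comparative genomics tutorial from the lab",
    "page3": "Proteomics tutorial",
}


def tokenize(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def build_index(pages):
    """Map each remaining word to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every non-stop-word query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_index(pages)
print(sorted(search(index, "proteomics tutorial")))
```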
Other Useful Google Features • http://www.google.com/intl/en/help/features.html • Conversion: 88 cm in inches or 10000 yen in USD • Time: time Beijing • Definition: define proteomics • Local search: CVS 02115 • Travel info: united 134 • Site search: training grant site:hsph.harvard.edu • Who links to this: link:www.ncbi.nlm.nih.gov • Filetype: comparative genomics filetype:ppt
H-index • Hirsch, PNAS 2005 • Simultaneously measures the productivity and impact of a scientist • A scholar with an index of h has published h papers, each of which has been cited by others at least h times • Check citations at scholar.google.com
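The definition translates directly into code: sort the citation counts from highest to lowest and find the largest rank h at which the paper in position h still has at least h citations. The citation lists below are made-up examples.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # counts only decrease from here on
    return h


# Hypothetical citation counts for two scholars with five papers each.
print(h_index([10, 8, 5, 4, 3]))
print(h_index([25, 8, 5, 3, 3]))
```

In the second list, one very highly cited paper does not raise the index beyond what the rest of the papers support, which shows how the measure balances impact against productivity.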
H-index • For physical sciences: ~12 for a tenured associate professor, ~18 for a full professor, ~45 for a member of the National Academy of Sciences • A few biases: • Ignores where the author appears on a paper and the total number of authors • Favors older, living scientists and sustained productivity • Ignores context (e.g. citations of wrong results)
Google Scholar • scholar.google.com • Set preferences to access open source and Harvard Lib: http://scholar.google.com/scholar_preferences?hl=en • See where a paper is cited • Get a free PDF without a subscription • Get H-index • “My Citations” • Manually update papers after the initial creation
Want to Learn a Comp Bio Topic? • PubMed: • Recent reviews in good journals • “related articles” • Nature Biotechnology CompBio track • PLoS Computational Biology Collection • http://collections.plos.org/ploscompbiol/index.php • 10 Simple rules, Educational • Search Google for the topic plus “lecture notes” or “tutorial”, with filetype:ppt or filetype:pdf • Search http://www.wikipedia.org/ and Google definition • Try http://CompBio.pbwiki.com
Acknowledgement • Russ Altman • Soumya Raychaudhuri • John Quackenbush • Jeff Chang