Literature Retrieval and Mining

Xiaole Shirley Liu • STAT115, STAT215, BIO298, BIST520

Outline • Introduction to PubMed • PubMed Related Articles • Search engines and Google features • H index

PubMed • PubMed is run by NCBI, which is part of NLM, which is part of NIH • Biomedical literature database • > 21M citations from 4800 journals since 1948 • Entrez is the retrieval system • Life of a PubMed entry: • Citation (paper) is published (recent papers may be indexed upon epub) • Citation is indexed in PubMed and assigned a PubMed ID • Citation is indexed with MeSH (Medical Subject Headings, like keywords) terms • For direct full article access: http://www.ncbi.nlm.nih.gov.ezp1.harvard.edu/sites/entrez?holding=hulib

PubMed Articles

Search by Author / Journal / Date • By author: • Lastname FirstMiddleInitial [au]: Liu JS • First author [1au], last author [lastau] • Full name [fau]: Jun S Liu • By journal title: [ta] • Full journal title or MEDLINE abbreviation: PNAS • Use the Index to look up a journal title, e.g. “proceedings of the national” • By date: • yyyy/mm/dd [dp] • Date range (:) or “last x days/months/years” • Example: Jun S Liu [au] AND "last 3 years" [dp]

Search Syntax and Tags • Boolean: AND, OR, NOT • Field tags • First author [1AU], author [AU], full author [FAU], 1st author affiliation [AD] • MeSH term [MH] • Title [TI], title/abstract [TIAB], text words [TW] • Publication date [DP] • Publication type [PT], journal title [TA], Language [LA] • Sorted by: Date, author, journal • Details: View and edit actual query
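A minimal sketch of how the field tags and Boolean operators above combine into one query string. The helper names are hypothetical, and the ESearch endpoint of NCBI's E-utilities is an assumption that the slides do not mention. The code only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not named on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a PubMed field tag such as au, ta or dp to a search term."""
    return f"{term}[{tag}]"


def and_query(*clauses):
    """Combine clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


def esearch_url(query, retmax=20):
    """Return an ESearch URL for the query (URL-encoded, nothing is fetched)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


query = and_query(
    tagged("Liu JS", "au"),           # author
    tagged("PNAS", "ta"),             # journal title abbreviation
    tagged('"last 3 years"', "dp"),   # relative publication date
)
print(query)
print(esearch_url(query))
```

Pasting the printed query into the PubMed search box should behave like the tagged searches on the slide; the Details view shows how PubMed actually interprets it.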

Advanced Search and PubMed Display • Advanced search • Index: refine search before display • History: keep most recent 100 queries for 8 hours, e.g. #5 AND #3 • Displayed by: • Summary • Abstract • Send To: text, file, email, Clipboard, RSS! • More on PubMed: • http://www.nlm.nih.gov/bsd/disted/pubmed.html

Getting Information from Text

Literature Mining Terms • Corpus: Collection of documents. E.g. all papers in PubMed • Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper • Document frequency: Number of documents a word appears in. E.g. 1234x papers have the word “transcription” • Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789X times in all PubMed-indexed papers • Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an • Stemming: Group together different variations of the same word. E.g. activate vs. activated vs. activating
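The terms above can be made concrete on a toy corpus. This is a sketch under stated assumptions: the three documents, the stop-word list and the suffix-stripping `crude_stem` function are all made up for illustration, and a real pipeline would use an established stemmer (e.g. Porter's) instead.

```python
import re
from collections import Counter

# Toy stop-word list and corpus (assumptions for illustration only).
STOP_WORDS = {"to", "is", "an", "the", "of", "and"}
DOCS = [
    "The polymerase is activated to start transcription, "
    "and transcription activates the gene",
    "Activating transcription of an unrelated gene",
    "An activated polymerase",
]


def crude_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def term_frequency(doc):
    """Number of times each term appears in one document."""
    return Counter(tokenize(doc))


def document_frequency(docs):
    """Number of documents each term appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def collection_frequency(docs):
    """Total number of times each term appears in the whole corpus."""
    cf = Counter()
    for doc in docs:
        cf.update(tokenize(doc))
    return cf


print(term_frequency(DOCS[0]))
print(document_frequency(DOCS))
print(collection_frequency(DOCS))
```

Note that "activated", "activates" and "activating" all collapse to the same stem, so they are counted as one term.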

Documents Represented as Vectors • A document is summarized as a vector of word counts. • Each dimension contains the number of times a word appears. • Can calculate similarity between two documents by comparing their vectors • Example: “Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”
Word counts: acid 2, amino 2, analysis 1, comparison 1, control 1, environments 3, […], our 1
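A short sketch of the word-count vector for the example sentence on this slide. The function names and the small vocabulary in the usage line are illustrative assumptions; the point is that each document becomes a list of counts over a shared, ordered vocabulary.

```python
import re
from collections import Counter

SENTENCE = (
    "Our analysis includes comparison of amino acid environments with "
    "random control environments as well as with each of the other "
    "amino acid environments."
)


def word_counts(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vector(counts, vocabulary):
    """Project the counts onto a fixed vocabulary; absent words count as 0."""
    return [counts[word] for word in vocabulary]


counts = word_counts(SENTENCE)
for word in sorted(counts):
    print(word, counts[word])

# Two documents are comparable once they share the same vocabulary order.
print(to_vector(counts, ["acid", "amino", "gene"]))
```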

Comparing Two Documents • Intuitive comparison between two papers: the correlation coefficient of their word occurrence vectors • Correlation measures the strength of the linear relationship between two random variables
    a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
    b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
    c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
    cor(a, b)   # 0.985615: correlated
    cor(b, c)   # -0.110328: not correlated
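For readers who want to see what the R function cor computes here, the sketch below spells out the Pearson correlation coefficient in plain Python using the same three vectors. The function name `pearson` is an assumption for illustration, not part of the slides.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))  # the slide reports 0.985615
print(round(pearson(b, c), 6))  # the slide reports -0.110328
```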

Term Weighting Considerations • Give different terms different weights • Global weight • Document frequency: log(N / df), where N is the number of documents in the corpus and df is the term's document frequency. Fewer documents, more weight. E.g. progesterone vs. gene • Local weight • Term frequency: 1 + log(tf) • Document length
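A minimal sketch of the two weighting formulas on this slide, log(N / df) and 1 + log(tf). The corpus size and document frequencies in the usage lines are made-up numbers, and the document-length adjustment is left out because the slide gives no formula for it.

```python
from math import log


def global_weight(n_docs, df):
    """Global weight log(N / df): rarer terms get more weight."""
    return log(n_docs / df)


def local_weight(tf):
    """Local weight 1 + log(tf); a term absent from the document gets 0."""
    return 1 + log(tf) if tf > 0 else 0.0


# Hypothetical corpus of 1,000,000 papers (numbers are illustrative only).
N = 1_000_000
print(global_weight(N, 2_000))     # a rare term, e.g. "progesterone"
print(global_weight(N, 400_000))   # a common term, e.g. "gene"
print(local_weight(1), local_weight(41))
```

The logarithm damps the local weight: a word used 41 times counts for more than a word used once, but not 41 times more.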

Related Citations • Similarity between two documents: sum over all terms of (local wt1 × local wt2 × global wt) • Pre-computed related articles for each citation • Rank ordered by relevance, then date • How to evaluate: • Tradeoff between precision and recall • Precision = # relevant hits / # hits • Recall = # relevant hits / # relevant • Often # relevant is arbitrary, or sampled
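The similarity score and the two evaluation measures above can be sketched as follows. This only illustrates the formulas as written on the slide, not PubMed's production algorithm; the function names, weights and document IDs are hypothetical.

```python
def similarity(local1, local2, global_wt):
    """Sum over shared terms of local wt1 x local wt2 x global wt.

    Terms missing from either document have local weight 0 and drop out.
    """
    shared = local1.keys() & local2.keys()
    return sum(local1[t] * local2[t] * global_wt.get(t, 0.0) for t in shared)


def precision_recall(hits, relevant):
    """Precision = relevant hits / hits; recall = relevant hits / relevant."""
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical term weights for two documents.
doc1 = {"polymerase": 2.0, "gene": 1.0}
doc2 = {"polymerase": 1.0, "transcription": 3.0}
weights = {"polymerase": 0.5, "gene": 2.0, "transcription": 1.0}
print(similarity(doc1, doc2, weights))

# Hypothetical result list of 4 hits; 3 documents are truly relevant.
print(precision_recall(hits=[1, 2, 3, 4], relevant=[2, 4, 6]))
```

Returning more hits can only raise recall but usually lowers precision, which is the tradeoff the slide refers to.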

Jane • http://www.biosemantics.org/jane/ • Have you recently written a paper, but you're not sure to which journal you should submit it? Or maybe you want to find relevant articles to cite in your paper? Or are you an editor, and do you need to find reviewers for a particular paper? Maybe you are a reviewer, and wonder whether the authors’ ‘novel’ approach or finding is really novel? Jane can help!

Search Engine Components • The crawler: visits all websites, traverses all links • The index • Checks keywords and full text • Ignores stop words: e.g. is, at, from… • Paid inclusion: not Google • Google looks at semantics and logic • The search engine software: how to rank • Location (nearer the front) and frequency (more occurrences) • Off-the-page factors: how many pages link to this one • Clickthrough measurement: lower the rank of search results that are not clicked
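A toy sketch of the index and ranking ideas on this slide: an inverted index that skips stop words, with results ordered by term frequency and then by inbound link count. The pages, link counts and ranking rule are invented for illustration and say nothing about how any real search engine ranks results.

```python
import re
from collections import defaultdict

STOP_WORDS = {"is", "at", "from", "the", "a", "of"}

# Hypothetical pages and inbound-link counts.
PAGES = {
    "a.example": "Genomics is the study of genomes",
    "b.example": "Comparative genomics from genomics data",
    "c.example": "Proteomics at a glance",
}
INBOUND_LINKS = {"a.example": 5, "b.example": 1}


def build_index(pages):
    """Map each non-stop word to {page: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, word, inbound_links):
    """Rank pages by term frequency, breaking ties by inbound links."""
    postings = index.get(word.lower(), {})
    return sorted(
        postings,
        key=lambda url: (postings[url], inbound_links.get(url, 0)),
        reverse=True,
    )


index = build_index(PAGES)
print(search(index, "genomics", INBOUND_LINKS))
```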

Other Useful Google Features • http://www.google.com/intl/en/help/features.html • Conversion: 88 cm in inches or 10000 yen in USD • Time: time Beijing • Definition: define proteomics • Local search: CVS 02115 • Travel info: united 134 • Site search: training grant site:hsph.harvard.edu • Who links to this: link:www.ncbi.nlm.nih.gov • Filetype: comparative genomics filetype:ppt

H-index • Hirsch, PNAS 2005 • Simultaneously measures productivity and impact of a scientist • A scholar with an index of h has published h papers, each of which has been cited by others at least h times • Check citations at scholar.google.com
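The definition above translates directly into a few lines of code. This is a minimal sketch; the function name and the citation counts in the usage line are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Five hypothetical papers with these citation counts.
print(h_index([10, 8, 5, 4, 3]))
```

Sorting from most to least cited means the loop can stop at the first paper whose citation count falls below its rank.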

H-index • Typical values in the physical sciences: ~12 for a tenured associate professor, ~18 for a full professor, ~45 for a member of the National Academy of Sciences • A few biases: • Ignores where the author appears on a paper and the total number of authors • Favors older, living scientists and sustained productivity • Ignores context (e.g. citations of wrong results)

Google Scholar • scholar.google.com • Set preferences to access open source and Harvard Lib: http://scholar.google.com/scholar_preferences?hl=en • See where a paper is cited • Get a free PDF without a subscription • Get the H-index • “My Citations” • Manually update papers after the initial creation

Want to Learn a Comp Bio Topic? • PubMed: • Recent reviews in good journals • “Related articles” • Nature Biotechnology CompBio track • PLoS Computational Biology Collection • http://collections.plos.org/ploscompbiol/index.php • 10 Simple Rules, Educational • Search Google for the topic plus “lecture notes” or “tutorial”, with filetype:ppt or filetype:pdf • Search http://www.wikipedia.org/ and Google definition • Try http://CompBio.pbwiki.com

Acknowledgements • Russ Altman • Soumya Raychaudhuri • John Quackenbush • Jeff Chang
