Detecting the Knowledge Structure of Bioinformatics with Text Mining and Citation Analysis

Min Song, PhD, Associate Professor, Department of Library and Information Science, Yonsei University

Outline • Introduction and Background • Research Problem • Methods • Data Processing • Topic Modeling • Citation Analysis • Identification of Important Articles by PageRank • Visualization • Results & Discussion • Summary & Future Work

Introduction • Bioinformatics has grown into a cross-disciplinary field and spread into new areas of the life sciences • 400,000 biological researchers worldwide • Sequencing industry projected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011) • Increasing number of biological databases, including PubMed and PubMed Central • Understanding the trends in and the structure of Bioinformatics is increasingly important • Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2010)

Research Problem • Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996) • Problems of Current Approaches • Current bibliometric analysis relies primarily on Thomson's Web of Science, which leads to the following problems: • Citation data are processed manually • Coverage is incomplete • Only citation analysis is used • Big data cannot be handled

Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Goal • Detect the trends in and the structure of the field of Bioinformatics • We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics using text mining and automated citation analysis • Mining PubMed Central full text with • topic modeling • word co-occurrence • named entity recognition • MeSH • Novel author co-citation analysis • Visualization

What is PubMed Central? • PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature • Provides free and unrestricted access (XML format) • Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein) • Launched in February 2000 • 383 journals, 1,512,652 articles, 4.3m unique visitors in April 2008

Citation Analysis • Citation Graphs • Link-based algorithms • HITS • PageRank
[Figure: quantifying similarities between documents. Labels: text-based, citation-based, Boolean input vectors, cosine, bibliographic coupling (BC), co-citation, combine, representative publications]
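The citation-based similarities named on this slide can be illustrated with a minimal sketch. It assumes each paper is a Boolean vector over the references it cites, stored as a Python set, and uses cosine similarity for both bibliographic coupling and co-citation. The toy data and the names `references`, `cited_by`, and `cosine_boolean` are hypothetical and are not taken from the slides.

```python
import math

def cosine_boolean(a, b):
    """Cosine similarity between two Boolean vectors stored as sets of 'on' positions."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical toy data: citing paper -> set of references it cites.
references = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3", "R4"},
    "P3": {"R5"},
}

# Invert the mapping: cited item -> set of papers that cite it.
cited_by = {}
for paper, refs in references.items():
    for ref in refs:
        cited_by.setdefault(ref, set()).add(paper)

# Bibliographic coupling: two papers are similar if they share references.
bc_p1_p2 = cosine_boolean(references["P1"], references["P2"])

# Co-citation: two cited items are similar if the same papers cite both.
cocit_r2_r3 = cosine_boolean(cited_by["R2"], cited_by["R3"])
```

Storing the Boolean vectors as sets keeps the sketch short; for a collection of the size described in this talk a sparse matrix would be the more practical choice.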

Methods – Data Collections Total 20,869 articles from 47 Journals

Overall Procedure of Our Approach

MeSH = Medical Subject Headings

Word Co-occurrence Analysis and MeSH Term Frequency • Identifying important concepts by word co-occurrence • The most widely used measure of co-occurrence is mutual information (MI) • We use the log-likelihood ratio (LLR) because it is more appropriate than MI when the data mix high-frequency and low-frequency bigrams • Identifying important concepts by MeSH term • Count the MeSH terms assigned to each article • MeSH terms are not assigned to PubMed Central articles • Map each PubMed Central article to its PubMed record, then extract the MeSH terms
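As an illustration of LLR scoring for bigrams, here is a minimal sketch. It assumes the common formulation of the log-likelihood ratio as the G² statistic over a 2x2 contingency table of bigram counts; the slides do not say which variant was used. The function names `llr` and `rank_bigrams` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: count of (w1, w2); k12: (w1, not w2);
    k21: (not w1, w2);      k22: (not w1, not w2).
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total

def rank_bigrams(tokens):
    """Score every adjacent word pair by LLR and sort from highest to lowest."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter()   # how often a word starts a bigram
    second = Counter()  # how often a word ends a bigram
    for (a, b), count in bigrams.items():
        first[a] += count
        second[b] += count
    n = sum(bigrams.values())
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("gene expression data gene expression profile "
          "protein structure gene expression").split()
ranked = rank_bigrams(tokens)
```

A table whose cells match the independence expectation scores zero, and the score grows as the observed counts move away from it. The statistic does not distinguish attraction from repulsion, so a real pipeline would likely also filter on whether the pair occurs more often than expected.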

Topic Modeling • Topic Modeling by LDA • We explore the salient topics in the core literature of Bioinformatics • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), to generate the topic model • LDA is a generative model in which sets of observations are explained by unobserved groups, which accounts for the similarity of documents in the collection • In LDA, each document is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection
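The generative view of LDA can be sketched in a few lines. The code below only simulates the generative story (draw a topic mixture per document, then a topic and a word for every position); it does not perform the inference step that a topic-modeling tool would run on the PubMed Central articles. The vocabulary, the hyperparameter values, and the function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution using normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    # Per-document mixture over latent topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["gene", "protein", "sequence", "citation", "journal", "author"]
num_topics = 2

# Each topic is a discrete distribution over the vocabulary.
topics = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

theta, doc = generate_document(topics, vocab, alpha=[0.5] * num_topics,
                               length=8, rng=rng)
```

Fitting LDA to real text runs this process in reverse: given only the words, it estimates the topic distributions and the per-document mixtures.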

Detection of Organization and Country • We apply a Named Entity Recognition (NER) technique to identify countries and organizations in the text

Citation Analysis • Build a Citation Network from the Datasets • 990,000 citation nodes from about 20,000 papers • Apply the PageRank algorithm to the network to identify the important articles Citation Network (Complexity and Social Networks, 2012)

PageRank - definition • u: a web page • F_u: set of pages u points to • B_u: set of pages that point to u • N_u = |F_u|: the number of links from u • c: a factor used for normalization • R(u) = c \sum_{v \in B_u} R(v) / N_v • The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges • The definition corresponds to the probability distribution of a random walk on the web graph
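The iterative computation described on this slide can be sketched as follows. Each node passes its current rank in equal shares to the nodes it points to, and the process repeats until the ranks stop changing. The damping factor of 0.85 and the even redistribution of rank from nodes with no outgoing links are common conventions assumed here so that the iteration converges on a citation graph; the slides do not state which settings were used. The graph and node names are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    links maps each node to the collection of nodes it points to
    (for a citation network: paper -> papers it cites).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution; any starting ranks would do.
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n
                    for u in nodes}
        for v in nodes:
            targets = links.get(v, ())
            if targets:
                share = damping * rank[v] / len(targets)  # R(v) / N_v
                for u in targets:
                    new_rank[u] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Hypothetical toy citation network: paper -> papers it cites.
citations = {
    "P1": ["Classic"],
    "P2": ["Classic"],
    "P3": ["Classic", "P1"],
    "Classic": [],
}
ranks = pagerank(citations)
top_paper = max(ranks, key=ranks.get)
```

In this toy network the paper cited by all the others ends up with the highest rank, which is the behavior used to single out important articles.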

Results and Discussion • Term Co-location Analysis Keywords with High Ranked Word Co-occurrence

Results and Discussion • Term Relationship based on Latent Semantic Indexing

Results and Discussion (Cont’d) Top Ranked Word Pairs by LLR

Results and Discussion (Cont’d) • Of the 20,869 documents, 19,954 have a corresponding MEDLINE record (95.6% matching). Of these 19,954 documents, 8,412 have MeSH terms (42.2%)

Results and Discussion (Cont’d) • Topic Modeling

Results and Discussion (Cont’d) Relationship between a paper and its citation

Results and Discussion (Cont’d) Publication productivity by year

Results and Discussion (Cont’d) Relationship between an author and the number of citations received

Results and Discussion (Cont’d) • Important Articles Identified by PageRank

Results and Discussion (Cont’d) Research productivity by country

Results and Discussion (Cont’d) Research Productivity by Institute

Results and Discussion (Cont’d) Visualization of Citation Graph

Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 367, KDD '07. ACM, San Jose, CA, August 2007.

Summary and Future Work • We analyzed the field of Bioinformatics with text mining techniques and citation analysis • We proposed several novel approaches to detect the knowledge structure of the field • We found that Bioinformatics has grown very fast and that collaboration among authors spreads widely across disciplines • We also found that Bradford's law does not apply to Bioinformatics. Further analysis is needed to explain why Bioinformatics is a unique field to which Bradford's law is not applicable • Fine-tune the visualization • Compare with Web of Science data

References • Nagarajan M., Mohamed Idhris L., Chellappandi P., Kumaravel J.P.S. and Premalatha V. Information Use by Scholars in Bioinformatics: A Bibliometric View, 2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011) • Church, K., and Hanks, P., Word Association Norms, Mutual Information and Lexicography, Computational Linguistics, Vol 16:1, pp. 22-29, (1991). • Patra, S.K., Mishra, S. (2006), Bibliometric study of bioinformatics literature, Scientometrics, 67: 477–489. • Zhao, D. (2006) Towards All-Author Co-Citation Analysis, Information Processing and Management, 42: 1578-1591 • Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007. • Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005. • Brusic, V. (2007) The growth of bioinformatics, Briefings in Bioinformatics, 8(2): 69-70

References • Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D., Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE transactions on information technology in biomedicine : a publication of the IEEE Engineering in Medicine and Biology Society 2007, 11(3): 237-243 • Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95. • Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79: 109-129. • Blei, D., Ng A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022, January 2003. • Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society of Information Science and Technology 2011, doi: 10.1002/asi.21707

Questions? • Thank you!
