
Detecting the Knowledge Structure of Bioinformatics with Text Mining and Citation Analysis


Presentation Transcript


  1. Detecting the Knowledge Structure of Bioinformatics with Text Mining and Citation Analysis Min Song, PhD Associate Professor Department of Library and Information Science Yonsei University

  2. Outline • Introduction and Background • Research Problem • Methods • Data Processing • Topic Modeling • Citation Analysis • Identification of Important Articles by PageRank • Visualization • Results & Discussion • Summary & Future Work

  3. Introduction • Bioinformatics has grown into a cross-disciplinary field and proliferated into new areas of the life sciences • 400,000 biological researchers worldwide • Sequencing industry projected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011) • Increasing number of biological databases, including PubMed and PubMed Central • Understanding the trends in and the structure of Bioinformatics is increasingly important • Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2010)

  4. Research Problem • Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996) • Problems of Current Approaches • Current bibliometric analysis relies primarily on Thomson’s Web of Science product, which results in the following problems: • Citation data is processed manually • Coverage is incomplete • Only citation analysis is used • Big data cannot be handled

  5. Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

  6. Goal • Detecting the trends in and the structure of the field of Bioinformatics • We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics using text mining techniques and automated citation analysis • Mining PubMed Central full text with • topic modeling • word co-occurrence • named entity recognition • MeSH • Novel author co-citation analysis • Visualization

  7. What is PubMed Central? • PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature • Provides free and unrestricted access (XML format) • Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein) • Launched in February 2000 • 383 journals, 1,512,652 articles, 4.3m unique visitors in April 2008

  8. Citation Analysis • Citation Graphs • Link-based algorithms • HITS • PageRank [Diagram labels: Documents; Text-based; Citation-based; Boolean Input Vectors; Quantify Similarities; Cosine; Bibliographic coupling (BC); Co-citation; Combine; Representative Publications]
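
The similarity measures named on this slide (cosine, bibliographic coupling, co-citation) can be illustrated with a minimal Python sketch. The paper and reference identifiers below are made up, and the slides do not say how the authors computed these quantities, so this is only one plausible reading of the standard definitions.

```python
import math
from collections import Counter
from itertools import combinations

def bibliographic_coupling(references):
    """For each pair of citing papers, count the references they share."""
    return {(a, b): len(set(references[a]) & set(references[b]))
            for a, b in combinations(sorted(references), 2)}

def co_citation(references):
    """For each pair of cited works, count how many papers cite both."""
    counts = Counter()
    for cited in references.values():
        counts.update(combinations(sorted(set(cited)), 2))
    return counts

def cosine(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Hypothetical toy data: each paper maps to the works it cites.
references = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3"],
    "P3": ["R3", "R4"],
}
coupling = bibliographic_coupling(references)
cocited = co_citation(references)
```

Here `coupling[("P1", "P2")]` counts the two references that P1 and P2 share, while `cocited[("R2", "R3")]` counts the two papers that cite both R2 and R3. Boolean citation vectors (1 if a paper cites a work, 0 otherwise) can be passed to `cosine` to normalize for reference-list length.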

  9. Methods – Data Collections Total 20,869 articles from 47 Journals

  10. Overall Procedure of Our Approach

  11. MeSH = Medical Subject Headings

  12. Word co-occurrence analysis and MeSH term frequency • Important concept identification by word co-occurrence • The most widely used measure of co-occurrence is mutual information (MI) • We use the log-likelihood ratio (LLR) because it is more appropriate than MI for a mixture of high-frequency and low-frequency bigrams • Important concept identification by MeSH term • Count the MeSH terms assigned to each article • MeSH terms are not assigned to PubMed Central articles directly • Map each PubMed Central article to its PubMed record and then extract the MeSH terms
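
As a rough illustration of LLR scoring for bigrams, the sketch below uses the common 2x2 contingency-table form of the log-likelihood ratio (often attributed to Dunning). The slides do not give the exact formulation or tokenization the authors used, so the function names and the toy token list are assumptions.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    # Cells with a zero count contribute nothing to the sum.
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)

def rank_bigrams(tokens):
    """Score every adjacent word pair in a token list by LLR, highest first."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(w1 for w1, _ in bigrams.elements())
    second = Counter(w2 for _, w2 in bigrams.elements())
    total = sum(bigrams.values())
    scored = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11    # w1 followed by a word other than w2
        k21 = second[w2] - k11   # w2 preceded by a word other than w1
        k22 = total - k11 - k12 - k21
        scored[(w1, w2)] = llr(k11, k12, k21, k22)
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy input, already tokenized and lowercased.
tokens = "gene expression data gene expression analysis protein structure".split()
ranked = rank_bigrams(tokens)
```

A table whose cells match what independence would predict scores 0, and pairs that co-occur more often than their individual frequencies suggest score higher, which is why the repeated pair ("gene", "expression") ranks first in the toy input.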

  13. Topic Modeling • Topic Modeling by LDA • We explore the salient topics in the core literature of Bioinformatics • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), for topic model generation • LDA is a generative model in which sets of observations are explained by unobserved groups, which accounts for the similarity of documents in the collection • In LDA, each document is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection
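
To make the "random mixture over latent topics" idea concrete, here is a small Python sketch of LDA's generative story only. It does not fit a model to data, which is what the study would have done with an inference procedure the slides do not name. The vocabulary, topic distributions, and prior values are invented for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following LDA's generative story.

    topics: one word distribution per topic (each sums to 1 over vocabulary).
    alpha:  Dirichlet prior over topic proportions (one value per topic).
    """
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocabulary, weights=topics[z])[0]      # pick a word from it
        words.append(w)
    return theta, words

# Hypothetical toy model: two topics over a six-word vocabulary.
vocabulary = ["gene", "protein", "sequence", "citation", "journal", "author"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "biology" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "bibliometrics" topic
]
rng = random.Random(0)
theta, words = generate_document(topics, vocabulary, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting LDA reverses this process: given only the observed words, it estimates the topic distributions and each document's mixture `theta`.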

  14. Detection of Organization and Country • We apply a Named Entity Recognition (NER) technique to identify countries and organizations in the text

  15. Citation Analysis • Build a Citation Network from the Datasets • 990,000 citation nodes from about 20,000 papers • Apply the PageRank algorithm to the network to identify the important articles Citation Network (Complexity and Social Networks, 2012)

  16. PageRank - definition • u: a web page • Fu: set of pages u points to • Bu: set of pages that point to u • Nu=|Fu|: the number of links from u • c: a factor used for normalization • R(u): the rank of page u, defined as R(u) = c · Σ_{v ∈ Bu} R(v) / Nv • The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges. • The definition corresponds to the probability distribution of a random walk on the web graph.
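
The iterate-until-convergence procedure described on this slide can be sketched in Python as follows. This implements the simplified recursive definition, using the slide's symbols, with ranks rescaled to sum to 1 at each step. The graph, the tolerance, and the function name are illustrative assumptions, and practical implementations usually add a damping factor that this slide does not mention.

```python
def pagerank(links, tol=1e-10, max_iter=1000):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u) until ranks stop changing.

    links maps each node u to the list F_u of nodes it points to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    rank = {u: 1.0 / len(nodes) for u in nodes}  # uniform starting ranks
    for _ in range(max_iter):
        new = {u: 0.0 for u in nodes}
        for v, targets in links.items():
            if targets:
                share = rank[v] / len(targets)   # R(v) / N_v
                for u in targets:
                    new[u] += share
        total = sum(new.values())
        if total == 0:
            return rank                          # graph has no links at all
        # The normalization factor c rescales the ranks to sum to 1.
        new = {u: r / total for u, r in new.items()}
        if sum(abs(new[u] - rank[u]) for u in nodes) < tol:
            return new
        rank = new
    return rank

# Hypothetical toy citation graph: A cites B and C, B cites C, C cites A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph A and C end up with equal rank and B with half as much, since B receives only half of A's rank while C receives the other half plus all of B's.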

  17. Results and Discussion • Term Co-location Analysis Keywords with High Ranked Word Co-occurrence

  18. Results and Discussion • Term Relationship based on Latent Semantic Indexing

  19. Results and Discussion (Cont’d) Top Ranked Word Pairs by LLR

  20. Results and Discussion (Cont’d) • Of the 20,869 documents, 19,954 have corresponding MEDLINE records (95.6% matched). Of these 19,954 documents, 8,412 have MeSH terms (42.2%)

  21. Results and Discussion (Cont’d) • Topic Modeling

  22. Results and Discussion (Cont’d) • Topic Modeling

  23. Results and Discussion (Cont’d) Relationship between a paper and its citation

  24. Results and Discussion (Cont’d) Publication productivity by year

  25. Results and Discussion (Cont’d) Relationship between an author and the number of citations received

  26. Results and Discussion (Cont’d) • Important Articles Identified by PageRank

  27. Results and Discussion (Cont’d) Research productivity by country

  28. Results and Discussion (Cont’d) Research Productivity by Institute

  29. Results and Discussion (Cont’d) Visualization of Citation Graph

  30. Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 367, KDD '07. ACM, San Jose, CA, August 2007.

  31. Summary and Future Work • We have analyzed the field of Bioinformatics with text mining techniques and citation analysis • We proposed several novel approaches to detect the knowledge structure of Bioinformatics • We found that Bioinformatics has grown very fast and that collaboration among authors spreads widely across disciplines • We also found that Bradford's law does not hold for Bioinformatics. Further analysis is required to explain why Bioinformatics is a unique field to which Bradford's law is not applicable • Fine-tune visualization • Compare with Web of Science data

  32. References • Nagarajan M., Mohamed Idhris L., Chellappandi P., Kumaravel J.P.S. and Premalatha V. Information Use by Scholars in Bioinformatics: A Bibliometric View, 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) • Church, K., and Hanks, P., Word Association Norms, Mutual Information and Lexicography, Computational Linguistics, Vol 16:1, pp. 22-29, (1991). • Patra, S K, Mishra S. (2006), Bibliometric study of bioinformatics literature, Scientometrics, 67 : 477–489. • Zhao, D. (2006) Towards All-Author Co-Citation Analysis, Information Processing and Management, 42: 1578-1591 • Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007. • Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005. • Brusic, V. (2007) The growth of bioinformatics, Briefings in Bioinformatics. VOL 8. NO 2. 69-70

  33. References • Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D., Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE transactions on information technology in biomedicine : a publication of the IEEE Engineering in Medicine and Biology Society 2007, 11(3): 237-243 • Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95. • Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79:109-129. • Blei, D., Ng A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003. • Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society of Information Science and Technology 2011, doi: 10.1002/asi.21707

  34. Questions? • Thank you!
