Overview of Text Data Mining

CS591-CXZ Text Data Mining Seminar
Sept. 1, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

Most Data are Unstructured (Text) or Semi-Structured…
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
• …
The more data we have, the more likely we are to find patterns in it.
(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)

Text Management Applications
• Access: select information
• Mining: create knowledge
• Organization: add structure/annotations

Elements of Text Info Management Technologies
[Diagram; labels: Text; Natural Language Content Analysis; Search; Filtering; Categorization; Clustering; Extraction; Mining; Summarization; Visualization; Information Access; Information Organization; Knowledge Acquisition; Retrieval Applications; Mining Applications]

What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
(Slide from Rebecca Hwa’s “Intro to Text Mining”)

Text Mining vs. NLP, IR, DM…
• How does it relate to data mining in general?
• How does it relate to computational linguistics?
• How does it relate to information retrieval?
(Adapted from Rebecca Hwa’s “Intro to Text Mining”)

Challenges in Text Mining
• Data collection is “free text”
• Data is not well-organized
   • Semi-structured or unstructured
• Natural language text contains ambiguities on many levels
   • Lexical, syntactic, semantic, and pragmatic
• Learning techniques for processing text typically need annotated training examples
   • Consider bootstrapping techniques
• What to mine?
(Adapted from Rebecca Hwa’s “Intro to Text Mining”)
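The bootstrapping idea mentioned above can be illustrated with self-training, which is one simple form of it: start from a few annotated examples, label the unlabeled texts the current model is most confident about, and retrain. The sketch below is a toy under stated assumptions, not the method from the slide: the word-count scorer, the helper names (train, predict, bootstrap), the min_margin threshold, and the example sentences are all hypothetical.

```python
from collections import Counter

def train(labeled):
    """Count how often each word appears under each label."""
    counts = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def predict(counts, text):
    """Return (label, margin): the best label and its lead over the runner-up."""
    scores = {label: sum(c[w] for w in text.split()) for label, c in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
    return ranked[0][0], margin

def bootstrap(seed, unlabeled, min_margin=2, max_rounds=10):
    """Self-training loop: repeatedly adopt confident predictions as new labels."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        counts = train(labeled)
        scored = [(t, *predict(counts, t)) for t in pool]
        confident = [(t, lab) for t, lab, m in scored if m >= min_margin]
        if not confident:
            break  # nothing left that the model is sure about
        labeled.extend(confident)
        added = {t for t, _ in confident}
        pool = [t for t in pool if t not in added]
    return labeled, pool

# Hypothetical seed annotations and unlabeled texts (made up for illustration).
seed = [("great product love it", "pos"), ("terrible product hate it", "neg")]
unlabeled = [
    "love this great phone",
    "hate this terrible phone",
    "this phone is wonderful love",
    "awful phone terrible",
    "the phone arrived",
]

labeled, pool = bootstrap(seed, unlabeled, min_margin=2)
for text, label in labeled:
    print(label, "|", text)
print("still unlabeled:", pool)
```

The margin threshold is the main design choice: a low value grows the training set quickly but lets early mistakes reinforce themselves, while a high value leaves more text unlabeled.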

Applications of Text Mining
• Direct applications
   • Domain-dependent (Bioinformatics, Business Intelligence, etc.)
   • Data-dependent (WWW, literature, email, customer reviews, etc.)
• Indirect applications
   • Assist information access
   • Assist information organization

Text Mining for Hypertext Creation
[Diagram; labels: A general topic; Concept map; Subtopic 1, ..., Subtopic i, ..., Subtopic M; Doc 1, Doc 2, ..., Doc N; Hypertext]

Types of Links
• Term-Term links
• Doc-Term links
• Term-Doc links
• Doc-Doc links
[Diagram; labels: A general topic; Subtopic 1, ..., Subtopic i, ..., Subtopic M; Doc 1, Doc 2, ..., Doc N]
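The four link types named on this slide (term-term, doc-term, term-doc, doc-doc) can be sketched on a toy corpus. This is only one plausible reading, and every concrete choice here is an assumption rather than something the slide specifies: the three example documents, whitespace tokenization, document-level co-occurrence for term-term links, and Jaccard overlap for doc-doc links.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; the documents are made up for illustration.
docs = {
    "doc1": "text mining finds patterns in text",
    "doc2": "information retrieval finds relevant text",
    "doc3": "protein names in biomedical literature",
}

# Doc-term links: the terms each document contains, with counts.
doc_term = {d: Counter(text.split()) for d, text in docs.items()}

# Term-doc links: an inverted index from each term to the documents containing it.
term_doc = {}
for d, counts in doc_term.items():
    for term in counts:
        term_doc.setdefault(term, set()).add(d)

# Term-term links: number of documents in which two terms co-occur.
term_term = Counter()
for counts in doc_term.values():
    for a, b in combinations(sorted(counts), 2):
        term_term[(a, b)] += 1

# Doc-doc links: Jaccard overlap between the term sets of two documents.
def jaccard(a, b):
    sa, sb = set(doc_term[a]), set(doc_term[b])
    return len(sa & sb) / len(sa | sb)

doc_doc = {(a, b): jaccard(a, b) for a, b in combinations(sorted(docs), 2)}

print(term_doc["text"])
print(term_term[("finds", "text")])
print(doc_doc)
```

In practice the co-occurrence counts and overlap scores would usually be weighted (for example by term frequency) and thresholded before being turned into hypertext links.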

Examples of Linkages in Text

Related Areas/Conferences
• Natural Language Processing (NLP): ACL, EMNLP, COLING
• Information Retrieval: SIGIR, CIKM
• Machine Learning: ICML, NIPS, UAI
• Data Mining & Knowledge Discovery: SIGKDD
• World Wide Web: WWW
• Bioinformatics: ISMB, PSB

Candidate Papers – SIGKDD 04
• Probabilistic Author-Topic Models for Information Discovery. Authors: Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths
• Mining Reference Tables for Automatic Text Segmentation. Authors: Eugene Agichtein, Venkatesh Ganti
• Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods. Authors: William Cohen, Sunita Sarawagi
• Mining and Summarizing Customer Reviews. Authors: Minqing Hu, Bing Liu
• Cluster-based Concept Invention for Statistical Relational Learning. Authors: Alexandrin Popescul, Lyle Ungar

Candidate Papers – WWW 04
• Unsupervised Learning of Soft Patterns for Generating Definitions from Online News (page 90). H. Cui, M.-Y. Kan, T.-S. Chua (National University of Singapore). WWW 2004
• Web-Scale Information Extraction in KnowItAll (Preliminary Results) (page 100). O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates (University of Washington). WWW 2004
• LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora (page 184). C.-C. Huang, S.-L. Chuang (Academia Sinica); L.-F. Chien (Academia Sinica, National Taiwan University). WWW 2004
• Towards the Self-Annotating Web (page 462). P. Cimiano, S. Handschuh (University of Karlsruhe); S. Staab (University of Karlsruhe, Ontoprise GmbH)
• Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty (page 482). E. Gabrilovich (Technion, Microsoft Research); S. Dumais, E. Horvitz (Microsoft Research)
• A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (page 658). K. Kummamuru, R. Lotlikar, S. Roy (IBM India Research Lab); K. Singal (IIT-Guwahati); R. Krishnapuram (IBM India Research Lab). WWW 2004

Candidate Papers – PSB & ISMB 03/04
• Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity, O. Tuason, L. Chen, H. Liu, J.A. Blake, and C. Friedman; Pacific Symposium on Biocomputing 9:238-249(2004).
• Playing Biology's Name Game: Identifying Protein Names in Scientific Text, D. Hanisch, J. Fluck, HT. Mevissen, R. Zimmer; Pacific Symposium on Biocomputing 8:403-414(2003).
• Mining Terminological Knowledge in Large Biomedical Corpora, H. Liu, C. Friedman; Pacific Symposium on Biocomputing 8:415-426(2003).
• A Biological Named Entity Recognizer, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker; Pacific Symposium on Biocomputing 8:
• A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, A.S. Schwartz, M.A. Hearst; Pacific Symposium on Biocomputing 8:451-462(2003).
• Evaluation of Text Data Mining for Database Curation: Lessons Learned from the KDD Challenge Cup, Alexander Yeh, Lynette Hirschman, Alexander Morgan; ISMB 2003
• Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein; ISMB 2003
• Mining MEDLINE for Implicit Links between Dietary Substances and Diseases, Padmini Srinivasan (University of Iowa), Bisharah Libbus (National Library of Medicine); ISMB 2004
• Protein Names Precisely Peeled Off Free Text, Sven Mika (Columbia University), Burkhard Rost (CUBIC/C2B2/NESG, Dept Biochemistry and Molecular Biophysics, Columbia University); ISMB 2004
