Overview of Text Data Mining (CS591-CXZ Text Data Mining Seminar), Sept. 1, 2004. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign. Email, insurance claims, news articles, web pages, patent portfolios, …, customer complaint letters, contracts.


Overview of Text Data Mining

(CS591-CXZ Text Data Mining Seminar)

Sept. 1, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Most Data are Unstructured (Text) or Semi-Structured…
  • Insurance claims
  • News articles
  • Web pages
  • Patent portfolios
  • Customer complaint letters
  • Transcripts of phone calls with customers
  • Technical documents

The more data we have, the more likely we can find patterns in data

(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)

Text Management Applications

(Diagram not preserved; surviving label: Create Knowledge)

Elements of Text Info Management Technologies

(Diagram not preserved; surviving label: Natural Language Content Analysis)

What Is Text Mining?

“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)
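One of the simplest ways to look for "associations among entities" is to count how often two terms appear in the same document. The sketch below is only an illustration of that idea and is not taken from the slides: the toy corpus, the whitespace tokenization, and the function name `cooccurrence_counts` are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands in for one document.
docs = [
    "aspirin reduces fever",
    "aspirin reduces pain",
    "ibuprofen reduces pain",
]

def cooccurrence_counts(documents):
    """Count, for each unordered pair of distinct tokens,
    the number of documents in which both tokens occur."""
    counts = Counter()
    for text in documents:
        # Deduplicate and sort so each pair has one canonical form.
        tokens = sorted(set(text.lower().split()))
        counts.update(combinations(tokens, 2))
    return counts

pair_counts = cooccurrence_counts(docs)

# List pairs from most to least frequently co-occurring.
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Raw counts favor frequent words, so a practical system would likely normalize them (for example with an association measure) and work on recognized entities rather than raw tokens.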

Text Mining vs. NLP, IR, DM…
  • How does it relate to data mining in general?
  • How does it relate to computational linguistics?
  • How does it relate to information retrieval?

(Adapted from Rebecca Hwa’s “Intro to Text Mining”)

Challenges in Text Mining
  • Data collection is “free text”
    • Data is not well-organized
      • Semi-structured or unstructured
    • Natural language text contains ambiguities on many levels
      • Lexical, syntactic, semantic, and pragmatic
    • Learning techniques for processing text typically need annotated training examples
      • Consider bootstrapping techniques
  • What to mine?

(Adapted from Rebecca Hwa’s “Intro to Text Mining”)
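The slide suggests bootstrapping as a response to the shortage of annotated training examples. A minimal sketch of one such scheme follows, assuming a lexicon-based labeler that starts from a few seed words per class, labels only documents that match exactly one class, and then absorbs their words. The function name `bootstrap_lexicons`, the seed words, and the toy documents are hypothetical; the slides do not prescribe any particular method.

```python
def bootstrap_lexicons(seed_lexicons, unlabeled_docs, max_rounds=10):
    """Grow per-class word lists from seeds by repeatedly labeling
    documents that match exactly one class (an assumed, simplified rule)."""
    lexicons = {label: set(words) for label, words in seed_lexicons.items()}
    labels = {}  # document index -> assigned class
    for _ in range(max_rounds):
        changed = False
        for index, text in enumerate(unlabeled_docs):
            if index in labels:
                continue
            tokens = set(text.lower().split())
            matching = [label for label, words in lexicons.items()
                        if tokens & words]
            # Only accept unambiguous documents; skip those matching
            # no class or more than one class.
            if len(matching) == 1:
                labels[index] = matching[0]
                lexicons[matching[0]] |= tokens
                changed = True
        if not changed:
            break
    return lexicons, labels

# Hypothetical seeds and unlabeled documents.
seeds = {"pos": {"good"}, "neg": {"bad"}}
docs = ["great superb", "good great", "awful poor", "bad awful", "good bad"]

lexicons, labels = bootstrap_lexicons(seeds, docs)
print(lexicons)
print(labels)
```

The last document matches both classes and is left unlabeled, which is the conservative behavior assumed here. Bootstrapping of this kind can drift when an early mistake adds misleading words, so real systems usually add confidence thresholds or manual checks.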

Applications of Text Mining
  • Direct applications
    • Domain-dependent (Bioinformatics, Business Intelligence, etc.)
    • Data-dependent (WWW, literature, email, customer reviews, etc.)
  • Indirect applications
    • Assist information access
    • Assist information organization
Text Mining for Hypertext Creation

(Diagram not preserved; surviving labels: A general topic; Concept map; Subtopic 1, …, Subtopic i, …, Subtopic M; Doc 1, Doc 2, …, Doc N)

Type of Links

(Diagram not preserved; surviving labels: Term-Term Links; Doc-Term Links; Term-Doc Links; Doc-Doc Links; A general topic; Subtopic 1, …, Subtopic i, …, Subtopic M; Doc 1, Doc 2, …, Doc N)
Related Areas/Conferences
  • Natural Language Processing (NLP): ACL, EMNLP, COLING
  • Information Retrieval: SIGIR, CIKM
  • Machine Learning: ICML, NIPS, UAI
  • Data Mining & Knowledge Discovery: SIGKDD
  • World Wide Web: WWW
  • Bioinformatics: ISMB, PSB
Candidate Papers – SIGKDD 04
  • Probabilistic Author-Topic Models for Information Discovery Authors: Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths
  • Mining Reference Tables for Automatic Text Segmentation Authors: Eugene Agichtein, Venkatesh Ganti
  • Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods Authors: William Cohen, Sunita Sarawagi
  • Mining and Summarizing Customer Reviews Authors: Minqing Hu, Bing Liu
  • Cluster-based Concept Invention for Statistical Relational Learning Authors: Alexandrin Popescul, Lyle Ungar
Candidate Papers – WWW 04
  • Unsupervised Learning of Soft Patterns for Generating Definitions from Online News (page 90). H. Cui, M.-Y. Kan, T.-S. Chua, National University of Singapore. WWW 2004
  • Web-Scale Information Extraction in KnowItAll (Preliminary Results) (page 100). O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, University of Washington. WWW 2004
  • LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora (page 184). C.-C. Huang, S.-L. Chuang, Academia Sinica; L.-F. Chien, Academia Sinica, National Taiwan University. WWW 2004
  • Towards the Self-Annotating Web (page 462). P. Cimiano, S. Handschuh, University of Karlsruhe; S. Staab, University of Karlsruhe, Ontoprise GmbH
  • Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty (page 482). E. Gabrilovich, Technion, Microsoft Research; S. Dumais, E. Horvitz, Microsoft Research
  • A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (page 658). K. Kummamuru, R. Lotlikar, S. Roy, IBM India Research Lab; K. Singal, IIT-Guwahati; R. Krishnapuram, IBM India Research Lab. WWW 2004
Candidate Papers – PSB & ISMB 03/04
  • Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity, O. Tuason, L. Chen, H. Liu, J.A. Blake, and C. Friedman; Pacific Symposium on Biocomputing 9:238-249 (2004).
  • Playing Biology's Name Game: Identifying Protein Names in Scientific Text, D. Hanisch, J. Fluck, HT. Mevissen, R. Zimmer; Pacific Symposium on Biocomputing 8:403-414 (2003).
  • Mining Terminological Knowledge in Large Biomedical Corpora, H. Liu, C. Friedman; Pacific Symposium on Biocomputing 8:415-426 (2003).
  • A Biological Named Entity Recognizer, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker; Pacific Symposium on Biocomputing 8:
  • A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, A.S. Schwartz, M.A. Hearst; Pacific Symposium on Biocomputing 8:451-462 (2003).
  • Evaluation of Text Data Mining for Database Curation: Lessons Learned from the KDD Challenge Cup, Alexander Yeh, Lynette Hirschman, Alexander Morgan; ISMB 2003
  • Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein; ISMB 2003
  • Mining MEDLINE for Implicit Links between Dietary Substances and Diseases, Padmini Srinivasan (University of Iowa), Bisharah Libbus (National Library of Medicine); ISMB 2004
  • Protein Names Precisely Peeled Off Free Text, Sven Mika (Columbia University), Burkhard Rost (CUBIC/C2B2/NESG, Dept Biochemistry and Molecular Biophysics, Columbia University); ISMB 2004