NLP for Text Mining

NLP for Text Mining: Towards systems capable of extracting semantic information from texts • Presentation by: Tiago Vaz Maia

Introduction • Keyword models of text are very poor (e.g. search with Google). • There is great advantage to a system that ‘understands’ texts, at some level. • Need for semantic understanding.

CRYSTAL (UMass) • “CRYSTAL: Inducing a Conceptual Dictionary”, S. Soderland, D. Fisher, J. Aseltine, W. Lehnert. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995.

Information Extraction Systems • Generally domain-specific (e.g. medical records, financial news). • Work by having “a dictionary of linguistic patterns that can be used to identify references to relevant information in a text”. Op. Cit.

A ‘Concept-Node’ Definition
CN-type: Sign or Symptom
Subtype: Absent
Extract from Direct Object
Active voice verb
Subject constraints:
    words include “PATIENT”
    head class: <Patient or Disabled Group>
Verb constraints:
    words include “DENIES”
Direct Object constraints:
    head class: <Sign or Symptom>
(Op. Cit.)
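One way to picture a concept-node is as a small record of constraints that is tested against a parsed clause. The sketch below is only an illustration, not CRYSTAL's actual data structure: the class names, the dictionary layout of a clause, and the example sentence "The patient denies any chest pain" are all assumptions. It also simplifies in two ways: the voice constraint is left out, and a head class must match exactly rather than through the semantic hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Constraint:
    """Constraints on one constituent (hypothetical layout)."""
    words: tuple = ()                  # words that must all be present
    head_class: Optional[str] = None   # required semantic class of the head


@dataclass
class ConceptNode:
    cn_type: str
    subtype: str
    extract_from: str                  # constituent holding the target phrase
    constraints: dict = field(default_factory=dict)

    def matches(self, clause):
        """clause maps constituent name -> {"words": [...], "head_class": str}."""
        for name, constraint in self.constraints.items():
            part = clause.get(name)
            if part is None:
                return False
            if not set(constraint.words) <= set(part["words"]):
                return False
            if (constraint.head_class is not None
                    and part["head_class"] != constraint.head_class):
                return False
        return True

    def extract(self, clause):
        """Return the extracted phrase with its CN type, or None if no match."""
        if not self.matches(clause):
            return None
        phrase = " ".join(clause[self.extract_from]["words"])
        return {"cn_type": self.cn_type, "subtype": self.subtype, "phrase": phrase}


# The definition from the slide, written in this assumed representation.
denies_cn = ConceptNode(
    cn_type="Sign or Symptom",
    subtype="Absent",
    extract_from="direct_object",
    constraints={
        "subject": Constraint(words=("PATIENT",),
                              head_class="Patient or Disabled Group"),
        "verb": Constraint(words=("DENIES",)),
        "direct_object": Constraint(head_class="Sign or Symptom"),
    },
)

# Hypothetical parsed clause for "The patient denies any chest pain".
clause = {
    "subject": {"words": ["THE", "PATIENT"],
                "head_class": "Patient or Disabled Group"},
    "verb": {"words": ["DENIES"], "head_class": None},
    "direct_object": {"words": ["ANY", "CHEST", "PAIN"],
                      "head_class": "Sign or Symptom"},
}
result = denies_cn.extract(clause)
```

Here `result` records "ANY CHEST PAIN" as an absent sign or symptom; a clause whose verb is not "DENIES" would yield `None`.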

Rationale for CRYSTAL • Building domain-specific dictionaries is time-consuming (knowledge-engineering bottleneck). • CRYSTAL builds such dictionaries automatically, from annotated texts (supervised learning).

Annotation of Texts • E.g. “Unremarkable with the exception of mild shortness of breath and chronically swollen ankles”. • Domain expert marks “shortness of breath” and “swollen ankles” with CN type “sign or symptom” and subtype “present”. (Example from Op. Cit.)

CRYSTAL’s Output • A dictionary of information extraction rules (i.e. concept-nodes) specific to the domain. • These rules should be general enough to apply to other texts in the same domain.

Algorithms • Next time… • Five minutes go by very fast!

Conclusions • Domain-specific information extraction systems capable of semantic understanding are within reach of current technology. • CRYSTAL makes such systems scalable and easily portable by automating the process of dictionary construction.

CRYSTAL’s Dictionary Induction Algorithms Tiago V. Maia

Review: Information Extraction • Information Extraction Systems extract useful information from free text. • This information can, for example, be stored in a database, where it can be data-mined, etc. • E.g., from hospital discharge reports we may want to extract:

Review: Information Extraction • Name of patient. • Diagnosis. • Symptoms. • Prescribed treatments. • Etc.

Review: Dictionary of Concept-Nodes • In the UMass system extraction rules are stored in concept-nodes. • The set of all concept-nodes is called a dictionary.

A ‘Concept-Node’ Definition
CN-type: Sign or Symptom
Subtype: Absent
Extract from Direct Object
Active voice verb
Subject constraints:
    words include “PATIENT”
    head class: <Patient or Disabled Group>
Verb constraints:
    words include “DENIES”
Direct Object constraints:
    head class: <Sign or Symptom>
(Op. Cit.)

Review: CRYSTAL • CRYSTAL automatically induces a domain-specific dictionary, from an annotated training corpus.

Learning Algorithm • Two steps: • 1. Create one concept-node per positive training instance. • 2. Gradually merge different concept-nodes to achieve a more general and compact representation.

Constructing Initial CN’s • E.g. of step 1: • Sentence: “Unremarkable with the exception of mild shortness of breath and chronically swollen ankles”. • Annotation: “shortness of breath” and “swollen ankles” are marked with CN type “sign or symptom”, subtype “present”.

Initial CN Definition
CN-type: Sign or Symptom
Subtype: Present
Extract from Prep. Phrase “WITH”
Verb = <NULL>
Subject constraints:
    words include “UNREMARKABLE”
Prep. Phrase constraints:
    words include “THE EXCEPTION OF MILD SHORTNESS OF BREATH AND CHRONICALLY SWOLLEN ANKLES”
    head class: <Sign or Symptom>
(Op. Cit.)

Need for Induction • Initial concept-nodes are too specific to be useful for any texts other than the training corpus. • One needs an inductive step, capable of constructing more general definitions → Step 2.

Inducing General CN’s • Main idea: Merge sufficiently similar CN’s, until doing more merges starts generating too many errors. • How do we merge similar CN’s? • The goal is to obtain a general CN that ‘covers’ both CN’s and provides a good generalization for unseen cases.

Merging CN’s • The unification of two CN’s is found by relaxing the constraints in such a way that they cover both nodes. • Word constraints: Intersection of the word constraints from each CN. E.g., verb constraints: • “vehemently denies” • “denies categorically” • → “denies”
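The word-constraint step can be sketched in a few lines, assuming each constraint is held as a list of lower-cased words (the function name is hypothetical). Only words required by both CN's survive the merge.

```python
def merge_word_constraints(words_a, words_b):
    """Keep the words required by both constraints, in the order of words_a."""
    shared = set(words_b)
    return [word for word in words_a if word in shared]


merged = merge_word_constraints(["vehemently", "denies"],
                                ["denies", "categorically"])
```

For the slide's example, `merged` contains the single word "denies".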

Merging CN’s • Semantic class constraints: Found by moving up the semantic hierarchy. E.g., prep. phrase constraints: • head class <Sign or Symptom> • head class <Lab or Test Result> • → head class <Finding>, if in the semantic hierarchy we have:

Semantic Hierarchy
Finding
    Sign or Symptom
    Lab or Test Result
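Moving up the hierarchy amounts to finding the closest class that is an ancestor of both head classes. The following is a minimal sketch under the assumption that the hierarchy is stored as a child-to-parent mapping; the `PARENT` table holds only the three classes from this slide, and the function names are hypothetical.

```python
# Child -> parent links for the fragment of the hierarchy shown on the slide.
PARENT = {
    "Sign or Symptom": "Finding",
    "Lab or Test Result": "Finding",
    "Finding": None,
}


def ancestors(semantic_class):
    """Return the class itself followed by its ancestors, nearest first."""
    chain = []
    while semantic_class is not None:
        chain.append(semantic_class)
        semantic_class = PARENT.get(semantic_class)
    return chain


def merge_class_constraints(class_a, class_b):
    """Return the most specific class covering both, or None if there is none."""
    covering_a = set(ancestors(class_a))
    for candidate in ancestors(class_b):
        if candidate in covering_a:
            return candidate
    return None  # no common ancestor: the class constraint would be dropped


merged_class = merge_class_constraints("Sign or Symptom", "Lab or Test Result")
```

With the two classes from the example, `merged_class` is "Finding"; merging a class with itself leaves it unchanged.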

Evaluating Merges • Every merged CN is tested against the training corpus. If its error rate is above a certain threshold, it is discarded. • The system continues merging CN’s until no more can be merged without resulting in a CN whose error rate exceeds the pre-specified tolerance.
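The merge-and-test loop can be illustrated with a deliberately simplified sketch. It is not CRYSTAL's actual procedure: here a CN is reduced to a set of required words, a merge is a plain intersection, pairs are tried in a fixed order rather than by similarity, and the tiny corpus is invented for the example. The error rate of a CN is taken to be the fraction of training instances it covers that are not positive.

```python
from itertools import combinations


def covers(cn, instance_words):
    """A CN covers an instance if all of its required words are present."""
    return cn <= instance_words


def error_rate(cn, corpus):
    """Fraction of covered training instances that are negative."""
    covered = [is_positive for words, is_positive in corpus if covers(cn, words)]
    if not covered:
        return 0.0
    return covered.count(False) / len(covered)


def induce(initial_cns, corpus, tolerance):
    """Merge CN's until no merge stays within the error tolerance."""
    cns = set(initial_cns)
    merged_something = True
    while merged_something:
        merged_something = False
        for a, b in combinations(sorted(cns, key=sorted), 2):
            candidate = a & b              # relax: keep shared words only
            if not candidate:
                continue                   # nothing in common, skip
            if error_rate(candidate, corpus) <= tolerance:
                cns = (cns - {a, b}) | {candidate}
                merged_something = True
                break                      # restart with the new CN set
    return cns


# Hypothetical annotated corpus: (words of the clause, is it a positive instance?)
corpus = [
    (frozenset({"patient", "vehemently", "denies", "pain"}), True),
    (frozenset({"patient", "denies", "categorically", "nausea"}), True),
    (frozenset({"patient", "reports", "pain"}), False),
]
initial = [words for words, is_positive in corpus if is_positive]
dictionary = induce(initial, corpus, tolerance=0.0)
```

The two initial CN's merge into one that requires "patient" and "denies"; it covers both positive instances and not the negative one, so it passes the zero-error tolerance and the loop stops with a single CN.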

Results • MUC-3: the dictionary was built using 1500 hours of work by two advanced graduate students and one post-doc. • MUC-4: using AutoSlog, a precursor of CRYSTAL, the dictionary was built using 8 hours of work by a first-year graduate student! • Both dictionaries presented roughly the same functionality.

Conclusions • Automated induction of domain-specific information extraction dictionaries is a very good alternative to hand-coding. • Knowledge engineering effort is drastically reduced, allowing for widespread real-world applications.

Combining Information Extraction and Data Mining Tiago V. Maia

“Using Information Extraction to Aid the Discovery of Prediction Rules from Text”, U.Y. Nahm and R.J. Mooney. In: KDD-2000 Workshop on Text Mining.

An Approach to Text Mining: Text → IE → DB → KDD → Rules

The Application • Step 1. Starting from free-text job postings in a newsgroup, build a database of jobs. • Step 2. Mine that database, to find interesting rules.

Sample Job Posting • Sample job posting: • “Leading Internet Provider using cutting edge web technology in Austin is accepting applications for a Senior Software Developer. The candidate must have 5 years of software development, which includes coding in C/C++ and experience with databases (Oracle, Sybase, Informix, etc.)…”

Sample Job Record • Title: Senior Software Developer • Salary: $70-85K • City: Austin • Language: Perl, C, Javascript, Java, C++ • Platform: Windows • Application: Oracle, Informix, Sybase • Area: RDBMS, Internet, Intranet, E-commerce • Required years of experience: 5 • Required degree: BS

Sample Extracted Rule • “If a computer-related job requires knowledge of Java and graphics then it also requires knowledge of Photoshop”

Information Extraction • Uses RAPIER: a system similar to CRYSTAL, that also constructs the extraction rules automatically, from an annotated training corpus.

Rule Induction • The induced rules predict the value in a database field, given the values in the rest of the record. • Each slot-value pair is treated as a distinct binary feature, e.g., Oracle ∈ Application. • An example of an induced rule: HTML ∈ Language ∧ Windows NT ∈ Platform ∧ Active Server Pages ∈ Application → Database ∈ Area
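A minimal sketch of the binary representation, assuming each extracted record is a mapping from slot name to a list of values. Every (value, slot) pair becomes one feature that is true for the record, and a rule fires when all the features in its antecedent are true. The record below is invented for illustration, and the function names are hypothetical.

```python
def to_binary_features(record):
    """Return the set of (value, slot) pairs that are true for this record."""
    return {(value, slot) for slot, values in record.items() for value in values}


def rule_fires(antecedent, features):
    """A rule fires when every feature in its antecedent is present."""
    return antecedent <= features


# Hypothetical job record, in the style of the sample record.
record = {
    "Language": ["HTML", "Java"],
    "Platform": ["Windows NT"],
    "Application": ["Active Server Pages"],
    "Area": ["Database"],
}
features = to_binary_features(record)

# Rule: HTML in Language, Windows NT in Platform and Active Server Pages
# in Application together imply Database in Area.
antecedent = {
    ("HTML", "Language"),
    ("Windows NT", "Platform"),
    ("Active Server Pages", "Application"),
}
consequent = ("Database", "Area")

fires = rule_fires(antecedent, features)
holds = fires and consequent in features
```

For this record the rule fires and its prediction holds, since "Database" is among the values of the Area slot.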

Algorithms for Rule Induction • Uses C4.5. • Decision trees are learned using the binary representation for slot-value pairs, and pruned to yield the rules.

Conclusions • Text mining can be achieved by information extraction followed by the application of standard KDD techniques to the resulting structured database. • Both IE and KDD are well understood, and their combination should yield practical real-world systems.

Learning Probabilistic Relational Models • Getoor, L., Friedman, N., Koller, D. and Pfeffer, A. Invited contribution to the book Relational Data Mining, Dzeroski, S. and Lavrac, N. (Eds.), Springer-Verlag, 2001.

Applicability: Text → IE → DB → KDD → JPD

Probabilistic Relational Models • Probabilistic relational models are a marriage of: 1. Probabilistic graphical models, and 2. Relational models.

Why a Probabilistic Model? • Probabilistic graphical models (e.g. Bayesian networks) have proven very successful for representing statistical patterns in data. • Algorithms have been developed for learning such models from data. • Because what is learnt is a joint probability distribution, these models are not restricted to answering questions about specific attributes.

Why a Relational Model? • Typically, Bayesian networks (or graphical models in general) use a representation that consists of several variables (e.g. height, weight, etc.) but has no structure. • The most common way of structuring data is in a relational form (e.g. relational databases). • Data structured in this way consists of individuals, their attributes, and relations between individuals (e.g. database of students, classes, etc.).

Probabilistic Relational Models • Probabilistic relational models are a natural way to represent and learn statistical regularities in structured information. • Moreover, because they are close to relational databases, they are ideal for data mining.

Introduction to Bayes Nets • The problem with using joint probability distributions is their exponential character in the general case. • E.g., assume that there are four random variables: • Student Intelligence, Course Difficulty, Student Understands Material, Student Grade. • Assume Student Intelligence, Course Difficulty and Student Understands Material have three possible values: {low, medium, high}.

Exponential Complexity of JPDs • Further assume that the Student Grade has six possible values: {A, B, C, D, E, F}. • Then, to have a joint probability distribution, we need to specify (or learn) 3 × 3 × 3 × 6 = 162 values. • Imagine if we had hundreds of variables...
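The count is simply the product of the number of values of each variable, which is why it grows exponentially with the number of variables. The short sketch below reproduces the arithmetic; the helper `joint_table_size` is a hypothetical name, and it assumes every variable has the same number of values.

```python
from math import prod

# Number of possible values for each variable in the example.
cardinalities = {
    "Student Intelligence": 3,
    "Course Difficulty": 3,
    "Student Understands Material": 3,
    "Student Grade": 6,
}

# One entry per combination of values.
joint_size = prod(cardinalities.values())


def joint_table_size(n_variables, values_per_variable=3):
    """Entries in a full joint table when all variables have the same size."""
    return values_per_variable ** n_variables


hundred_variable_size = joint_table_size(100)
```

`joint_size` is 162, as on the slide, while a hundred three-valued variables would already need 3^100 entries, a number with 48 digits.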

Independence Assumptions in Bayes Nets • Bayes nets help because they exploit the fact that each variable typically depends directly only on a small number of other variables. • In our example, we have:

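[Figure: Example of a Bayes Net. Nodes: Intelligence, Difficulty, Understands Material, Grade. Edges: Intelligence → Understands Material, Difficulty → Understands Material, Understands Material → Grade.]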

Conditional Independence • That is, for example, the difficulty of the test is independent of the intelligence of the student. • Also, importantly, the grade is independent of the intelligence of the student and of the difficulty of the test, given the student’s understanding of the material → conditional independence.

Conditional Independence • Formally, we have: • Every node X is independent of every other node that does not descend from X, given the values of X’s parents.
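As a worked illustration, assume the example network has Intelligence (I) and Difficulty (D) as parents of Understands Material (U), and U as the only parent of Grade (G). That structure is read off the independence statements in the text, and the single-letter symbols are introduced here. Under that assumption the joint distribution factorizes into one small table per node:

```latex
P(I, D, U, G) = P(I)\, P(D)\, P(U \mid I, D)\, P(G \mid U)

% Table entries needed, with three values for I, D, U and six for G:
\underbrace{3}_{P(I)} + \underbrace{3}_{P(D)}
  + \underbrace{3 \cdot 3 \cdot 3}_{P(U \mid I, D)}
  + \underbrace{3 \cdot 6}_{P(G \mid U)} = 51
```

That is 51 table entries instead of the 162 needed for the full joint table, and the gap widens quickly as more variables are added.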
