
Information Extraction


Presentation Transcript


  1. Information Extraction Martin Ester, Simon Fraser University, School of Computing Science, CMPT 884, Spring 2009

  2. Information Extraction • Outline • Introduction: motivation, applications, issues • Entity extraction: hand-coded, machine learning • Relation extraction: supervised, partially supervised • Entity resolution: string similarity, finding similar pairs, creating groups • Future research • [Feldman 2006] [Agichtein & Sarawagi 2006]

  3. Introduction • Motivation • 80% of all human-generated data is natural language text • search engines return whole documents, requiring the user to read the documents and manually extract the relevant information (entities, facts, . . .), which is very time-consuming • need for automatic extraction of such information from collections of natural language text documents → information extraction (IE)

  4. Introduction • Definitions • Entity: an object of interest such as a person or organization. • Attribute: a property of an entity such as its name, alias, descriptor, or type. • Relation: a relationship between two or more entities, such as Position of a Person in a Company. • Event: an activity involving several entities, such as a terrorist act, an aircraft crash, a management change, or a new product introduction.

  5. Introduction • Example [figure not included in transcript]

  6. Introduction • Applications • question answering: Who is the president of the US? Where was Martin Luther born? • automatic creation of databases, e.g., a database of protein localizations or of adverse reactions to a drug • opinion mining: analyzing online product reviews to get user feedback

  7. Introduction • Challenges • Complexity of natural language: e.g., identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese / Japanese • Ambiguity of natural language: e.g., homonyms • Diversity of natural language: many ways of expressing a given piece of information, e.g., synonyms • Diversity of writing styles: e.g., scientific papers, newspaper articles, maintenance reports, emails, . . .

  8. Introduction • Challenges • names are hard to discover – impossible to enumerate – new candidates are generated all the time – hard to provide syntactic rules • types of proper names – people – companies – products – genes – . . .

  9. Introduction • Architecture of an IE System [figure omitted: pipeline of local analysis followed by discourse (global) analysis]

  10. Introduction • Knowledge Engineering Approach • Extraction rules are hand-crafted by linguists in cooperation with domain experts. • Most of the work is done by inspecting a set of relevant documents. • Development of the rule set is very time-consuming. • Requires substantial CS and domain expertise. • Rule sets are domain-specific and do not transfer to other domains. • The knowledge engineering (KE) approach often achieves higher accuracy than the machine learning approach.

  11. Introduction • Machine Learning Approach • Automatically learn a model ("rules") from an annotated training corpus. • Techniques based on pure statistics and little linguistic knowledge. • No CS expertise required when building the model. • However, creating the annotated corpus is very laborious, since a very large number of training examples is needed. • Transfer to other domains is easier than with the KE approach. • Accuracy of the machine learning (ML) approach is typically lower.

  12. Introduction • Topics Not Covered • co-reference resolution: e.g., a noun phrase or pronoun referencing an entity mentioned in another sentence • event extraction: an event has a type, an actor, a time, . . . • sentiment detection: a given statement (opinion) is classified as positive / negative

  13. Entity Extraction • Lexical Analysis • breaking up the input document into individual words = tokens • token: a sequence of characters treated as a unit • punctuation marks are also considered tokens, e.g., "," (comma) • often, regular expressions are used to define the format of tokens (see the sketch below)
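A minimal regex tokenizer sketch in Python (illustrative, not from the slides; the token classes and patterns are assumptions):

    import re

    # one alternative per token class; earlier alternatives win
    TOKEN = re.compile(r"""
        [A-Za-z]+(?:'[a-z]+)?    # words, optionally with an apostrophe (don't)
      | \d+(?:\.\d+)?            # integer or decimal numbers
      | [.,;:!?]                 # punctuation marks as separate tokens
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN.findall(text)

    print(tokenize("Dr. Haas joined IBM in 1981."))
    # ['Dr', '.', 'Haas', 'joined', 'IBM', 'in', '1981', '.']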

  14. Entity Extraction • Syntactic Analysis • part-of-speech tagging [Charniak 1997]: marking up the tokens in a text as corresponding to a particular part of speech (POS), based on both their definition and their context • coarse POS tags: e.g., N, V, A, Aux, . . . • finer POS tags: – PRP: personal pronouns (you, me, she, he, them, him, her, . . .) – PRP$: possessive pronouns (my, our, her, his, . . .) – NN: singular common nouns (sky, door, theorem, . . .) – NNS: plural common nouns (doors, theorems, women, . . .) – NNP: singular proper names (Fifi, IBM, Canada, . . .) – NNPS: plural proper names (Americas, Carolinas, . . .)

  15. Entity Extraction • Syntactic Analysis • Words often have more than one POS, e.g., back: • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word. • e.g., input: the lead paint is unsafe • output: the/Det lead/N paint/N is/V unsafe/Adj
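For illustration, an off-the-shelf tagger does this in a few lines; a sketch using NLTK (NLTK is not mentioned in the slides, and resource names can vary across NLTK versions):

    import nltk

    nltk.download('punkt')                         # tokenizer model
    nltk.download('averaged_perceptron_tagger')    # tagger model

    tokens = nltk.word_tokenize("The lead paint is unsafe")
    print(nltk.pos_tag(tokens))   # Penn Treebank tags
    # e.g. [('The', 'DT'), ('lead', 'NN'), ('paint', 'NN'), ('is', 'VBZ'), ('unsafe', 'JJ')]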

  16. Entity Extraction • Knowledge Engineering Approach [Chaudhuri 2005] • hand-coded rules are often relatively straightforward • easy to incorporate domain knowledge • require substantial CS expertise • example rule: <token>INITIAL</token> <token>DOT</token> <token>CAPSWORD</token> <token>CAPSWORD</token> finds person names with a salutation and two capitalized words, e.g., Dr. Laura Haas (see the sketch below)
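One way to realize this rule as a regular expression in Python (a sketch, not the original rule language; the salutation list standing in for INITIAL is an assumption):

    import re

    # INITIAL DOT CAPSWORD CAPSWORD: salutation, period, two capitalized words
    PERSON = re.compile(
        r"\b(?:Dr|Mr|Mrs|Ms|Prof)"   # INITIAL: assumed salutation list
        r"\."                        # DOT
        r"\s+[A-Z][a-z]+"            # first CAPSWORD
        r"\s+[A-Z][a-z]+"            # second CAPSWORD
    )

    m = PERSON.search("We met Dr. Laura Haas at the conference.")
    print(m.group(0) if m else None)   # Dr. Laura Haas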

  17. Entity Extraction • Knowledge Engineering Approach • a more complex example: conference names
    my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
    my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
    my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
    my $confTypes="(?:Conference|Workshop|Symposium)";
    my $words="(?:[A-Z]\\w+\\s*)"; # a word starting with a capital letter, ending with 0 or more spaces
    my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
    my $connectors="(?:on|of)";
    my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # abbreviations like "(SIGMOD'06)"
    my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";
    . . .

  18. Entity Extraction • Machine Learning Approach • Named entity extraction can be viewed as a sequence classification problem: classify each word as belonging to one of the named entity classes or to the no-name class. • The class label of a sequence element depends on those of its neighbors. • One of the most popular techniques for classifying sequences is the Hidden Markov Model (HMM). • Another popular ML method for entity extraction: Conditional Random Fields [Lafferty et al 2001]. • Requires a large enough labeled (annotated) training dataset.

  19. Entity Extraction • Hidden Markov Models [Rabiner 1989] • An HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions. • The automaton models a probabilistic generative process. • In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by that state, and repeating this transition/emission cycle until a designated final state is reached. • Very successful in many sequence classification tasks.

  20. Entity Extraction • Example • HMM for addresses [figure omitted]

  21. Entity Extraction • Hidden Markov Models • T = length of the sequence of observations (training set) • N = number of states in the model • q_t = the actual state at time t • S = {S_1, ..., S_N} (finite set of possible states) • V = {v_1, ..., v_M} (finite set of observation symbols) • π = {π_i}, π_i = P(q_1 = S_i): starting probabilities • A = {a_ij}, a_ij = P(q_{t+1} = S_j | q_t = S_i): transition probabilities • B = {b_i(O_t)} = {P(O_t | q_t = S_i)}: emission probabilities • λ = (π, A, B): hidden Markov model

  22. Entity Extraction • Hidden Markov Models • How to compute P(O | λ), the probability of an observation sequence given the HMM? → forward-backward algorithm • How to find the λ that maximizes P(O | λ)? This is the task of the training phase. → Baum-Welch algorithm • How to find the most likely state trajectory given λ and O? This is the task of the test phase. → Viterbi algorithm (see the sketch below)
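A compact Viterbi sketch in Python (illustrative; the toy model at the bottom is made up):

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Most likely state sequence; pi[i] start, A[i, j] transition, B[i, k] emission."""
        N, T = len(pi), len(obs)
        delta = np.zeros((T, N))              # best path probability ending in state i at time t
        psi = np.zeros((T, N), dtype=int)     # back-pointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A    # (prev state, next state)
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        path = [int(delta[T - 1].argmax())]   # backtrack from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]

    pi = np.array([0.6, 0.4])                  # 2 hidden states
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])     # 2 observation symbols
    print(viterbi(pi, A, B, [0, 0, 1]))        # [0, 0, 1]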

  23. Relation Extraction • Example • extracted (Organization, Location) pairs: Microsoft / Redmond, Apple Computer / Cupertino, Nike / Portland • source sentences: • "Microsoft's central headquarters in Redmond is home to almost every product group and division." • "Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for 'thinking a little too different.'" • "Apple's programmers 'think different' on a 'campus' in Cupertino, Cal. Nike employees 'just do it' at what the company refers to as its 'World Campus,' near Portland, Ore."

  24. Relation Extraction • Introduction • No single source contains all the relations • Each relation appears on many web pages • There are repeated patterns in the way relations are represented on web pages → exploit redundancy • Components of a relation appear "close" together → use the context of an occurrence of the relation to determine patterns • a pattern consists of constants (tokens) and variables (placeholders for entities) • tuple: instance / occurrence of a relation

  25. Relation Extraction • Introduction • Typically requires entity extraction (tagging) as preprocessing • Knowledge engineering approach – patterns defined over lexical items: "<company> located in <location>" – patterns defined over parsed text: "((Obj <company>) (Verb located) (*) (Subj <location>))" • Machine learning approach – learn rules/patterns from examples – partially supervised: bootstrap from example tuples [Agichtein & Gravano 2000, Etzioni et al 2004]
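A lexical pattern like "<company> located in <location>" can be approximated by a regular expression over entity-tagged text; a sketch in Python, where the inline <ORG>/<LOC> tag format is an assumption:

    import re

    # assumes entity extraction has wrapped mentions as <ORG>...</ORG> and <LOC>...</LOC>
    LOCATED_IN = re.compile(
        r"<ORG>(?P<company>.+?)</ORG>\s+(?:is\s+)?located\s+in\s+<LOC>(?P<location>.+?)</LOC>"
    )

    text = "<ORG>Microsoft</ORG> is located in <LOC>Redmond</LOC>."
    for m in LOCATED_IN.finditer(text):
        print((m.group("company"), m.group("location")))   # ('Microsoft', 'Redmond')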

  26. Relation Extraction • Snowball [Agichtein & Gravano 2000] • Exploit the duality between patterns and tuples: – find tuples that match a set of patterns – find patterns that match a lot of tuples → bootstrapping approach • [figure omitted: bootstrapping loop with stages: Initial Seed Tuples, Tag Entities, Occurrences of Seed Tuples, Generate Extraction Patterns, Generate New Seed Tuples, Augment Table]

  27. Relation Extraction • Snowball • how to represent patterns of occurrences? • occurrences of the initial seed tuples: • "Computer servers at Microsoft's headquarters in Redmond . . ." • "In mid-afternoon trading, shares of Redmond-based Microsoft fell . . ." • "The Armonk-based IBM introduced a new line . . ." • "The combined company will operate from Boeing's headquarters in Seattle." • "Intel, Santa Clara, cut prices of its Pentium processor."

  28. Relation Extraction • Patterns • an (extraction) pattern has the format <left, tag1, middle, tag2, right>, where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms • patterns derived directly from occurrences are too specific • example: "ORGANIZATION 's central headquarters in LOCATION is home to . . ." yields left = {}, tag1 = ORGANIZATION, middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}, tag2 = LOCATION, right = {<is 0.75>, <home 0.75>} (see the sketch below)
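A sketch of this representation and a simple degree-of-match function in Python (illustrative; Snowball's actual match computation has normalization details omitted here):

    pattern = (
        {},                                                            # left
        "ORGANIZATION",                                                # tag1
        {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},   # middle
        "LOCATION",                                                    # tag2
        {"is": 0.75, "home": 0.75},                                    # right
    )

    def dot(u, v):
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    def match(p, o):
        """Sum of term-vector dot products, zero if the entity tags disagree."""
        if (p[1], p[3]) != (o[1], o[3]):
            return 0.0
        return dot(p[0], o[0]) + dot(p[2], o[2]) + dot(p[4], o[4])

    occ = ({}, "ORGANIZATION", {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {})
    print(round(match(pattern, occ), 3))   # 1.05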

  29. Relation Extraction • Pattern Clusters • cluster the patterns; cluster centroids define the final patterns • Cluster 1 (ORGANIZATION . . . LOCATION): <{<servers 0.75> <at 0.75>}, ORGANIZATION, {<'s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>}, LOCATION, {}> and <{<operate 0.75> <from 0.75>}, ORGANIZATION, {<'s 0.7> <headquarters 0.7> <in 0.7>}, LOCATION, {}> • Cluster 2 (LOCATION-based ORGANIZATION): <{<shares 0.75> <of 0.75>}, LOCATION, {<- 0.75> <based 0.75>}, ORGANIZATION, {<fell 1>}> and <{<the 1>}, LOCATION, {<- 0.75> <based 0.75>}, ORGANIZATION, {<introduced 0.75> <a 0.75>}>

  30. Relation Extraction • Evaluation of Patterns • How good are the new extraction patterns? • Measure their performance through their accuracy against the initial seed tuples (ground truth). • extraction with pattern "ORGANIZATION, LOCATION": • "Boeing, Seattle, said . . ." → positive • "Intel, Santa Clara, cut prices . . ." → positive • "invest in Microsoft, New York-based analyst Jane Smith said" → negative

  31. Relation Extraction • Evaluation of Patterns • Trust only patterns with high "support" and "confidence", i.e., patterns that produce many correct (positive) tuples and only a few false (negative) tuples. • conf(p) = pos(p) / (pos(p) + neg(p)), where p denotes a pattern and pos(p), neg(p) denote the numbers of positive and negative tuples produced

  32. Relation Extraction • Evaluation of Tuples • Trust only tuples that match many patterns. • Suppose candidate tuple t matches patterns p1 and p2. What is the probability that t is a valid tuple? • Assume matches of different patterns are independent events. • Pr[t matches p1 and t is not valid] = 1 − conf(p1) • Pr[t matches p2 and t is not valid] = 1 − conf(p2) • Pr[t matches {p1, p2} and t is not valid] = (1 − conf(p1))(1 − conf(p2)) • Pr[t matches {p1, p2} and t is valid] = 1 − (1 − conf(p1))(1 − conf(p2)) • In general, if tuple t matches a set of patterns P: conf(t) = 1 − ∏_{p ∈ P} (1 − conf(p)) (see the sketch below)
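This noisy-or style score is a one-liner; a sketch in Python:

    def tuple_conf(pattern_confs):
        """conf(t) = 1 - product over matched patterns p of (1 - conf(p))."""
        not_valid = 1.0
        for c in pattern_confs:
            not_valid *= 1.0 - c
        return 1.0 - not_valid

    print(round(tuple_conf([0.8, 0.5]), 3))   # 0.9: two mediocre patterns yield a confident tuple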

  33. Relation Extraction • Snowball Algorithm • 1. Start with a seed set R of tuples • 2. Generate a set P of patterns from R; compute support and confidence for each pattern in P; discard patterns with low support or confidence • 3. Generate a new set T of tuples matching the patterns P; compute the confidence of each tuple in T; add to R the tuples t in T with conf(t) > threshold • 4. Go back to step 2 (see the sketch below)
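A schematic of the loop in Python, reusing tuple_conf from the sketch above (only the control flow is real; pattern generation and matching are stubbed out as plug-in functions, since they depend on the corpus and the pattern representation):

    def snowball(seeds, gen_patterns, match_tuples, tau=0.8, min_conf=0.5, iters=3):
        R = set(seeds)                                          # step 1: seed tuples
        for _ in range(iters):
            # step 2: patterns (with confidences) from occurrences of R; drop weak ones
            P = [(p, c) for p, c in gen_patterns(R) if c >= min_conf]
            # step 3: score candidate tuples by the patterns they match
            for t, confs in match_tuples(P).items():
                if tuple_conf(confs) > tau:                     # conf(t) from slide 32
                    R.add(t)                                    # step 4: iterate
        return R

    # degenerate stubs, just to exercise the control flow
    gen = lambda R: [("<ORG> 's headquarters in <LOC>", 0.9)]
    match = lambda P: {("Boeing", "Seattle"): [c for _, c in P]}
    print(snowball({("Microsoft", "Redmond")}, gen, match))
    # {('Microsoft', 'Redmond'), ('Boeing', 'Seattle')}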

  34. Relation Extraction • Discussion • the bootstrapping approach requires only a relatively small number of training tuples (semi-supervised) • it is effective for binary, 1:1 relations • the bootstrapping approach has been adopted by lots of subsequent work • pattern evaluation is heuristic and has no theory behind it → Statistical Snowball, WWW 09 • what about n-ary relations? • what about 1:m relations?

  35. Entity Resolution • Introduction [figure not included in transcript]

  36. Entity Resolution • Introduction • Entity resolution: map entity mentions to the corresponding entities (entities stored in a database or ontology) • Challenges: – large lists with multiple noisy mentions of the same entity – no single attribute to order or cluster likely duplicates while separating them from similar but different entities – need to rely on fuzzy and computationally expensive string similarity functions

  37. Entity Resolution • Introduction • Typical approach: – define a string similarity function (numeric attributes are easy to compare; string attributes are hard and need approximate matching) – find similar pairs of entities – create groups from duplicate entity pairs (clustering)

  38. Entity Resolution • String Similarity • Token-based: Jaccard and TF-IDF cosine similarities → suitable for large documents • Character-based: edit distance and variants like Levenshtein, Jaro-Winkler, Soundex → suitable for short strings with spelling mistakes • Hybrids

  39. Entity Resolution • Token-Based String Similarity • Tokens/words: 'AT&T Corporation' → 'AT&T', 'Corporation' • Similarity: various measures of the overlap of two token sets S, T • Jaccard(S, T) = |S∩T| / |S∪T| • Example: S = 'AT&T Corporation' → {'AT&T', 'Corporation'}, T = 'AT&T Corp' → {'AT&T', 'Corp'}; Jaccard(S, T) = 1/3 • Variants: weights attached to each token (see the sketch below)
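A one-function sketch in Python (whitespace tokenization assumed):

    def jaccard(s, t):
        """Jaccard similarity |S intersect T| / |S union T| over whitespace tokens."""
        S, T = set(s.split()), set(t.split())
        return len(S & T) / len(S | T)

    print(jaccard("AT&T Corporation", "AT&T Corp"))   # 0.333... = 1/3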

  40. Entity Resolution • Token-Based String Similarity • Sets are transformed to vectors with each term as a dimension • Cosine similarity: dot product of the two vectors, each normalized to unit length → cosine of the angle between them • Term weight = TF-IDF: log(tf + 1) * log(idf), where • tf: frequency of the term in document d • idf: number of documents / number of documents containing the term • → rare terms are more important (see the sketch below)
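A minimal TF-IDF cosine sketch in Python using the weighting above (the three-document corpus is a made-up toy):

    import math
    from collections import Counter

    docs = ["AT&T Corporation", "AT&T Corp", "IBM Corporation"]

    def tfidf(doc, docs):
        """term -> log(tf + 1) * log(N / df)."""
        tf, N = Counter(doc.split()), len(docs)
        return {t: math.log(f + 1) * math.log(N / sum(t in d.split() for d in docs))
                for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    print(round(cosine(tfidf(docs[0], docs), tfidf(docs[1], docs)), 3))   # 0.245 on this toy corpus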

  41. Entity Resolution • Token-Based String Similarity • Widely used in traditional IR • Example: 'AT&T Corporation', 'AT&T Corp', or 'AT&T Inc' • low weights for 'Corporation', 'Corp', 'Inc'; higher weight for 'AT&T'

  42. Entity Resolution • Character-Based String Similarity • Given two strings S, T, edit(S, T) is the minimum-cost sequence of operations transforming S into T. • Character operations: I (insert), D (delete), R (replace). • Example: edit(Error, Eror) = 1, edit(great, grate) = 2 • A dynamic programming algorithm computes edit(S, T) (see the sketch below). • Several variants (gaps, weights); some variants become NP-complete. • Varying costs of operations can be learned. • Suitable for common typing mistakes in short strings.
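The standard dynamic programming solution, as a Python sketch with unit costs:

    def edit(s, t):
        """Levenshtein distance in O(|s| * |t|); d[i][j] = edit(s[:i], t[:j])."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                                    # delete all of s[:i]
        for j in range(n + 1):
            d[0][j] = j                                    # insert all of t[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1    # replace (R) or free match
                d[i][j] = min(d[i - 1][j] + 1,             # delete (D)
                              d[i][j - 1] + 1,             # insert (I)
                              d[i - 1][j - 1] + cost)
        return d[m][n]

    print(edit("Error", "Eror"), edit("great", "grate"))   # 1 2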

  43. Entity Resolution • Find Duplicate Pairs • Input: a large list of entities with string attributes • Output: all pairs (S, T) of entities that satisfy a similarity criterion such as • Jaccard(S, T) > 0.7 • EditDistance(S, T) < k • Naive method: compute the similarity score for each pair of records • I/O- and CPU-intensive, not scalable to millions of entities • Goal: reduce the O(n^2) cost to O(n*w), where w << n • i.e., reduce the number of pairs on which the similarity is computed

  44. Entity Resolution • Find Duplicate Pairs • Method: filter and refinement • Use an inexpensive filter to discard as many pairs as possible, e.g., EditDistance(s,t) ≤ d → |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) - (d-1)*q - 1 • q-gram: substring of q consecutive characters, e.g., the 3-grams of 'AT&T Corporation' are {'AT&', 'T&T', '&T ', 'T C', ' Co', 'Cor', 'orp', 'rpo', 'por', 'ora', 'rat', 'ati', 'tio', 'ion'} • If a pair (s, t) does not satisfy the filter, it cannot satisfy the similarity criterion: |q-grams(s) ∩ q-grams(t)| < max(|s|,|t|) - (d-1)*q - 1 → EditDistance(s,t) > d (see the sketch below)
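A sketch of this count filter in Python (the sentinel padding of the strings follows the count-filter literature and is an implementation assumption here):

    from collections import Counter

    def qgrams(s, q=3):
        """Multiset of q-grams of s, padded so that the bound below is valid."""
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

    def may_match(s, t, d, q=3):
        """Cheap necessary condition for EditDistance(s, t) <= d; False means prune the pair."""
        common = sum((qgrams(s, q) & qgrams(t, q)).values())
        return common >= max(len(s), len(t)) - (d - 1) * q - 1

    print(may_match("AT&T Corp", "AT&T Corp.", d=1))         # True: survives, refine with edit()
    print(may_match("AT&T Corporation", "AT&T Corp", d=2))   # False: pruned without running edit()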

  45. Entity Resolution • Find Duplicate Pairs • The filter does not have to be applied to all pairs of entities: use an index to retrieve the subset of entities that share q-grams • Compute the expensive similarity function, e.g., EditDistance(s,t), only for the pairs that survive the filter step

  46. Entity Resolution • Create Groups of Duplicates • Given pairs of duplicate entities • Group them such that each group corresponds to one entity • Many clustering algorithms have been applied • The number of clusters is hard to specify in advance • Ground truth may be available for some entity pairs → semi-supervised clustering (a simple baseline is sketched below)
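A minimal baseline sketch in Python (not one of the tuned clustering methods the slides refer to): treat duplicate pairs as edges and take connected components via union-find.

    def group_duplicates(pairs):
        """Group entities linked, directly or transitively, by duplicate pairs."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for a, b in pairs:
            parent[find(a)] = find(b)           # union the two components

        groups = {}
        for x in list(parent):
            groups.setdefault(find(x), set()).add(x)
        return list(groups.values())

    pairs = [("AT&T Corp", "AT&T Inc"), ("AT&T Inc", "AT&T Corporation"), ("IBM", "IBM Corp")]
    print(group_duplicates(pairs))   # two groups: the three AT&T variants, and the two IBM variants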

  47. Entity Resolution • Create Groups of Duplicates • Agglomerative clustering: repeatedly merge the closest clusters • The definition of cluster closeness is subject to tuning: average / max / min similarity • Efficient implementations are possible using special data structures

  48. Entity Resolution • Challenges • Collective entity resolution: consider relationships between entities and propagate resolution decisions along these relationships → use Markov Logic Networks [Parag & Domingos 2005] • Mapping to existing background knowledge: an ontology of real-world entities may be given → map entities / clusters of entities to ontology entries → k-nearest-neighbor methods

  49. Information Extraction • References • Eugene Agichtein, Luis Gravano: Snowball: Extracting Relations from Large Plain-Text Collections, ACM DL, 2000 • Eugene Agichtein, Sunita Sarawagi: Scalable Information Extraction and Integration, Tutorial, KDD 2006 • Eugene Charniak: Statistical Techniques for Natural Language Parsing, AI Magazine 18(4), 1997 • S. Chaudhuri, R. Ramakrishnan, G. Weikum: Integrating DB and IR Technologies: What Is the Sound of One Hand Clapping?, CIDR 2005 • Ronen Feldman: Information Extraction: Theory and Practice, Tutorial, ICML 2006

  50. Information Extraction • References • John Lafferty, Andrew McCallum, Fernando Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001 • L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc. IEEE 77(2), 1989
