
Information Extraction, Data Mining & Joint Inference



Presentation Transcript


  1. Information Extraction, Data Mining & Joint Inference Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Khashayar Rohanemanesh, Ben Wellner, Karl Schultz, Michael Hay, Michael Wick, David Mimno.

  2. My Research Building models that mine actionable knowledge from unstructured text.

  3. Extracting Job Openings from the Web
  foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.html
  OtherCompanyJobs: foodscience.com-Job1

  4. A Portal for Job Openings

  5. Job Openings: Category = High Tech Keyword = Java Location = U.S.

  6. Data Mining the Extracted Job Information

  7. IE from Chinese Documents regarding Weather
  Department of Terrestrial System, Chinese Academy of Sciences
  200k+ documents, several millennia old:
  - Qing Dynasty archives
  - memos
  - newspaper articles
  - diaries

  8. What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + clustering + association.
  October 14, 2002, 4:00 a.m. PT
  For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...
  Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation.

  9. What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Build of slide 8: same news passage and extracted segments, with the next step of the pipeline highlighted.)

  10. What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Same passage and segments again, highlighting a further step.)

  11. What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Same passage as slide 8; this final build shows the extracted table:)
  NAME               TITLE     ORGANIZATION
  Bill Gates         CEO       Microsoft
  Bill Veghte        VP        Microsoft
  Richard Stallman   founder   Free Soft..

  12. From Text to Actionable Knowledge. (Pipeline figure:) Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).

  13. A Natural Language Processing Pipeline. (Figure: a stack of stages, bottom to top:) POS tagging → Chunking → Parsing → Entity Recognition → Semantic Role Labeling → Anaphora Resolution → Pragmatics. Errors cascade & accumulate.

  14. Unified Natural Language Processing. (The same stack of stages:) POS tagging → Chunking → Parsing → Entity Recognition → Semantic Role Labeling → Anaphora Resolution → Pragmatics, now with unified, joint inference.

  15. Solution: Uncertainty Info. (The pipeline of slide 12 again:) Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support), now passing uncertainty info and emerging patterns between IE and data mining.

  16. Solution: Unified Model. Discriminatively-trained undirected graphical models: Conditional Random Fields [Lafferty, McCallum, Pereira]; Conditional PRMs [Koller...], [Jensen...], [Getoor...], [Domingos...]. Complex inference and learning: just what we researchers like to sink our teeth into! (The pipeline of slide 12, with the database replaced by a single probabilistic model:) Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Probabilistic Model → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).

  17. Scientific Questions
  • What model structures will capture salient dependencies?
  • Will joint inference actually improve accuracy?
  • How to do inference in these large graphical models?
  • How to do parameter estimation efficiently in these models, which are built from multiple large components?
  • How to do structure discovery in these models?

  18. Scientific Questions (repeated from the previous slide).

  19. Methods of Inference
  • Exact: exhaustively explore all interpretations; feasible when the graphical model has low “tree-width”.
  • Variational: represent the distribution in a simpler model that is close.
  • Monte Carlo: randomly (but cleverly) sample to explore interpretations.
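  A concrete instance of the first family: sum-product belief propagation, which later slides apply (as loopy BP) to the joint models. This is the standard message update for a pairwise model, not a formula from the slides; it is exact when the graph has low tree-width and approximate when run on loopy graphs:

  $$m_{i \to j}(x_j) = \sum_{x_i} \psi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)$$

  Beliefs are then read off as $b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)$.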

  20. Outline
  • Examples of IE and Data Mining
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
    • Joint Labeling of Cascaded Sequences (Belief Propagation)
    • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
    • Joint Co-reference Resolution (Graph Partitioning)
    • Joint Segmentation and Co-ref (Sparse BP)
    • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Demo: Rexa, a Web portal for researchers

  21. Hidden Markov Models. HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ... (Figure: the finite state model and the graphical model, with state transitions s_{t-1} → s_t → s_{t+1}, each state emitting an observation o_t.) Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.
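  Concretely, the generative factorization the figure depicts (the standard HMM form, with $P(s_1 \mid s_0)$ read as the initial-state distribution):

  $$P(s_{1:T}, o_{1:T}) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$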

  22. IE with Hidden Markov Models. Given a sequence of observations: "Yesterday Ron Parr spoke this example sentence." and a trained HMM with states: person name, location name, background. Find the most likely state sequence (Viterbi) for "Yesterday Ron Parr spoke this example sentence." Any words said to be generated by the designated “person name” state are extracted as a person name: Person name: Ron Parr
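  A minimal runnable sketch of this decoding step. The three states come from the slide; the start, transition, and emission numbers (and the tiny name vocabulary standing in for a trained emission model) are invented for illustration:

  import math

  states = ["background", "person name", "location name"]
  start = {"background": 0.8, "person name": 0.1, "location name": 0.1}
  trans = {
      "background":    {"background": 0.8, "person name": 0.1, "location name": 0.1},
      "person name":   {"background": 0.6, "person name": 0.4, "location name": 0.0},
      "location name": {"background": 0.6, "person name": 0.0, "location name": 0.4},
  }
  name_words = {"Ron", "Parr"}  # toy stand-in for a trained emission model

  def emit(state, word):
      if state == "person name":
          return 0.4 if word in name_words else 0.001
      return 0.001 if word in name_words else 0.1

  def viterbi(words):
      # delta[t][s]: log-prob of the best state path ending in state s at time t
      delta = [{s: math.log(start[s] * emit(s, words[0])) for s in states}]
      back = []
      for word in words[1:]:
          prev, row, ptr = delta[-1], {}, {}
          for s in states:
              best = max(states, key=lambda r: prev[r] + math.log(trans[r][s] or 1e-12))
              row[s] = prev[best] + math.log(trans[best][s] or 1e-12) + math.log(emit(s, word))
              ptr[s] = best
          delta.append(row)
          back.append(ptr)
      # Trace the best final state back to the start.
      seq = [max(states, key=lambda s: delta[-1][s])]
      for ptr in reversed(back):
          seq.append(ptr[seq[-1]])
      return list(reversed(seq))

  words = "Yesterday Ron Parr spoke this example sentence".split()
  tags = viterbi(words)
  print([w for w, t in zip(words, tags) if t == "person name"])  # ['Ron', 'Parr']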

  23. We Want More than an Atomic View of Words. Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are “and Associates”. (Figure: the word “Wisniewski” in a state sequence, simultaneously firing “part of noun phrase”, “ends in -ski”, “is ‘Wisniewski’”.)
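  A sketch of what such overlapping features look like in code. The feature names follow the slide; the toy gazetteer is an assumption, and the WordNet/layout features are omitted:

  CITY_NAMES = {"Amherst", "Boston", "Springfield"}  # toy gazetteer

  def word_features(words, t):
      w = words[t]
      candidates = {
          "identity=" + w.lower(): True,
          "ends_in_-ski": w.endswith("ski"),
          "is_capitalized": w[:1].isupper(),
          "in_city_list": w in CITY_NAMES,
          "next_two=and_Associates": " ".join(words[t + 1:t + 3]) == "and Associates",
      }
      # Keep only the features that fire; several fire at once for "Wisniewski".
      return sorted(name for name, fires in candidates.items() if fires)

  print(word_features("He met Wisniewski and Associates".split(), 2))
  # ['ends_in_-ski', 'identity=wisniewski', 'is_capitalized', 'next_two=and_Associates']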

  24. Problems with a Richer Representation and a Joint Model. These arbitrary features are not independent:
  • Multiple levels of granularity (chars, words, phrases)
  • Multiple dependent modalities (words, formatting, layout)
  • Past & future
  Two choices:
  (1) Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes); a big problem when combining evidence, as in Viterbi!
  (2) Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
  (Figure: two sequence models over states s and observations o, one ignoring and one modeling the dependencies among observations.)

  25. Conditional Sequence Models
  • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
  • Can examine features, but not responsible for generating them.
  • Don’t have to explicitly model their dependencies.
  • Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
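  Written out, this training criterion is the conditional log-likelihood of the labeled data (the Gaussian prior term shown is the regularizer standard for these models, not something stated on the slide):

  $$\Lambda^* = \arg\max_\Lambda \sum_i \log P_\Lambda\big(s^{(i)} \mid o^{(i)}\big) - \sum_k \frac{\lambda_k^2}{2\sigma^2}$$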

  26. From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
  Joint: $P(s, o) = \prod_t P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$
  Conditional: $P(s \mid o) = \frac{1}{Z(o)} \prod_t \Phi_s(s_{t-1}, s_t)\, \Phi_o(o_t, s_t)$,
  where $\Phi(\cdot) = \exp\big(\sum_k \lambda_k f_k(\cdot)\big)$.
  (A super-special case of Conditional Random Fields.) Set parameters by maximum likelihood, using an optimization method on the gradient $\partial L / \partial \lambda_k$.

  27. (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]. Undirected graphical model, trained to maximize the conditional probability of the output (sequence) given the input (sequence):
  $P(y \mid x) = \frac{1}{Z(x)} \prod_t \exp\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\big)$,
  where $Z(x)$ is the input-specific normalizer.
  Wide-spread interest, positive experimental results in many applications: noun phrase, named entity [HLT’03], [CoNLL’03]; protein structure prediction [ICML’04]; IE from bioinformatics text [Bioinformatics ’04]; Asian word segmentation [COLING’04], [ACL’04]; IE from research papers [HLT’04]; object classification in images [CVPR ’04].
  (Figure: finite state model / graphical model, with FSM states as the output sequence ... OTHER PERSON OTHER ORG TITLE ... over the input sequence “said Jones a Microsoft VP”.)

  28. Table Extraction from Government Reports
  Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
  An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
             Milk Cows and Production of Milk and Milkfat: United States, 1993-95
  --------------------------------------------------------------------------------
        :            :        Production of Milk and Milkfat 2/
        :   Number   :-------------------------------------------------------
   Year :     of     :   Per Milk Cow    :  Percentage   :       Total
        :Milk Cows 1/:-------------------: of Fat in All :------------------
        :            :   Milk  : Milkfat : Milk Produced :  Milk  : Milkfat
  --------------------------------------------------------------------------------
        : 1,000 Head     --- Pounds ---       Percent       Million Pounds
   1993 :    9,589      15,704      575        3.66        150,582   5,514.4
   1994 :    9,500      16,175      592        3.66        153,664   5,623.7
   1995 :    9,461      16,451      602        3.66        155,644   5,694.3
  --------------------------------------------------------------------------------
  1/ Average number during year, excluding heifers not yet fresh.
  2/ Excludes milk sucked by calves.

  29. Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
  100+ documents from www.fedstats.gov (such as the milk report on the previous slide).
  CRF labels (12 in all):
  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ...
  Features:
  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5+ consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
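  A sketch of how the per-line layout features above might be computed; the feature names come from the slide, while the thresholds and the alignment test are assumptions:

  def line_features(lines, i):
      line = lines[i]
      prev = lines[i - 1] if i > 0 else ""
      n = max(len(line), 1)

      def space_cols(s):
          # Column positions occupied by whitespace, for the alignment test.
          return {j for j, c in enumerate(s) if c == " "}

      return {
          "pct_digit_chars": sum(c.isdigit() for c in line) / n,
          "pct_alpha_chars": sum(c.isalpha() for c in line) / n,
          "indented": line.startswith(" "),
          "contains_5_spaces": " " * 5 in line,
          # Crude stand-in for "whitespace aligns with previous line".
          "aligns_with_prev": len(space_cols(line) & space_cols(prev)) >= 10,
      }

  sample = [
      " 1993 :    9,589      15,704      575        3.66        150,582   5,514.4",
      " 1994 :    9,500      16,175      592        3.66        153,664   5,623.7",
  ]
  print(line_features(sample, 1))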

  30. Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
                     Line labels (% correct)   Table segments (F1)
  HMM                65 %                      64 %
  Stateless MaxEnt   85 %                      -
  CRF                95 %                      92 %

  31. IE from Research Papers [McCallum et al ‘99]

  32. IE from Research Papers, field-level F1:
  Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
  The CRF cuts error by roughly 40% relative to the SVM (from 10.3 to 6.1 F1 points of error).

  33. Outline
  • The Need for IE and Data Mining
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
    • Joint Labeling of Cascaded Sequences (Belief Propagation)
    • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
    • Joint Co-reference Resolution (Graph Partitioning)
    • Probability + First-order Logic, Co-ref on Entities (MCMC)
    • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

  34. 1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004]. (Figure: stacked chains over English words: part-of-speech, noun-phrase boundaries, named-entity tag.)

  35. (Build of the previous slide: the same stacked chains.)

  36. (Same stacked chains.) But errors cascade: the pipeline must be perfect at every stage to do well.

  37. 1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004]. Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data. Inference: Loopy Belief Propagation.

  38. Outline (repeated; next: Joint Labeling of Distant Entities).

  39. 2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. “... Senator Joe Green said today ... . Green ran for ...” In a linear chain, the dependency among similar, distant mentions is ignored.

  40. 2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. “... Senator Joe Green said today ... . Green ran for ...” A skip edge connects the two mentions of “Green” so they are labeled jointly. 14% reduction in error on the most-repeated field in email seminar announcements. Inference: tree-reparameterized BP [Wainwright et al, 2002]. See also [Finkel, et al, 2005].
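  A sketch of how such skip edges can be constructed, connecting repeated capitalized words so their labels are predicted jointly (the capitalization heuristic follows the spirit of the skip-chain model; the details here are assumptions):

  from collections import defaultdict

  def skip_edges(tokens):
      # Group the positions of each capitalized token.
      positions = defaultdict(list)
      for i, tok in enumerate(tokens):
          if tok[:1].isupper():
              positions[tok].append(i)
      # Add one skip edge between every pair of occurrences of the same word.
      return [(i, j)
              for occs in positions.values()
              for a, i in enumerate(occs)
              for j in occs[a + 1:]]

  tokens = "Senator Joe Green said today ... Green ran for ...".split()
  print(skip_edges(tokens))  # [(2, 6)]: the two mentions of "Green"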

  41. Outline (repeated; next: Joint Co-reference Resolution).

  42. 3. Joint co-reference among all pairs: Affinity Matrix CRF. “Entity resolution”, “object correspondence”. (Figure: mentions Mr. Hill, Dana Hill, Amy Hall, Dana, she, with pairwise affinities.) ~25% reduction in error on co-reference of proper nouns in newswire. Inference: correlational clustering / graph partitioning [McCallum, Wellner, IJCAI WS 2003, NIPS 2004], [Bansal, Blum, Chawla, 2002].

  43. Coreference Resolution. AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty".
  Input: a news article with named-entity "mentions" tagged, e.g. “Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ...”
  Output: the number of entities, N = 3:
  #1 Secretary of State Colin Powell, he, Mr. Powell, Powell
  #2 Condoleezza Rice, she, Rice
  #3 President Bush, Bush

  44. Inside the Traditional Solution: a pair-wise affinity metric. Mention (3) “Powell” vs. mention (4) “Mr Powell”: coreferent (Y/N)?
  Fires?  Feature                                          Weight
  N       Two words in common                                29
  Y       One word in common                                 13
  Y       "Normalized" mentions are string identical         39
  Y       Capitalized word in common                         17
  Y       > 50% character tri-gram overlap                   19
  N       < 25% character tri-gram overlap                  -34
  Y       In same sentence                                    9
  Y       Within two sentences                                8
  N       Further than 3 sentences apart                     -1
  Y       "Hobbs Distance" < 3                               11
  N       Number of entities in between two mentions = 0     12
  N       Number of entities in between two mentions > 4     -3
  Y       Font matches                                        1
  Y       Default                                           -19
  OVERALL SCORE = 98 > threshold = 0
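  A sketch of this scoring rule: sum the weights of the features that fire for a mention pair and compare to the threshold. The weights are taken from the slide, but only a few of the features are implemented here and the normalization is an assumption, so the toy score (50) differs from the slide's 98:

  WEIGHTS = {
      "two_words_in_common": 29,
      "one_word_in_common": 13,
      "string_identical_normalized": 39,
      "capitalized_word_in_common": 17,
      "default": -19,
  }

  def normalize(mention):
      # Assumed normalization: drop honorifics, lower-case the rest.
      return [w.lower() for w in mention.split() if w not in {"Mr", "Mr.", "Ms", "Ms."}]

  def fired_features(m1, m2):
      w1, w2 = set(normalize(m1)), set(normalize(m2))
      feats = {"default"}
      if len(w1 & w2) >= 2:
          feats.add("two_words_in_common")
      elif len(w1 & w2) == 1:
          feats.add("one_word_in_common")
      if normalize(m1) == normalize(m2):
          feats.add("string_identical_normalized")
      caps1 = {w for w in m1.split() if w[:1].isupper()}
      caps2 = {w for w in m2.split() if w[:1].isupper()}
      if caps1 & caps2:
          feats.add("capitalized_word_in_common")
      return feats

  def affinity(m1, m2, threshold=0):
      score = sum(WEIGHTS[f] for f in fired_features(m1, m2))
      return score, score > threshold

  print(affinity("Powell", "Mr Powell"))  # (50, True) under this reduced metric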

  45. Entity Resolution. (Figure: five “mentions”: Mr. Hill, Dana Hill, Amy Hall, Dana, she.)

  46. Entity Resolution. (Figure: the same mentions grouped into two candidate “entities”.)

  47. Entity Resolution. (Figure: the same mentions under a different two-“entity” grouping.)

  48. Entity Resolution. (Figure: the same mentions grouped into three “entities”.)

  49. The Problem: independent pairwise affinity with connected components. Affinity measures are noisy and imperfect, yet pair-wise merging decisions are being made independently from each other. They should be made jointly. (Figure: the mention graph over Mr. Hill, Dana Hill, Amy Hall, Dana, she, with edges labeled C for coreferent and N for not.)

  50. CRF for Co-reference [McCallum & Wellner, 2003, ICML]. Make pair-wise merging decisions jointly by:
  - calculating a joint probability,
  - including all edge weights,
  - adding dependence on consistent triangles.
  (Figure: the same mention graph, now with the C/N edge decisions made jointly.)
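  A minimal statement of the model these bullets describe, in the standard form of the affinity-matrix CRF (the notation here is mine, not from the slides): with binary variables $y_{ij}$ indicating whether mentions $x_i$ and $x_j$ corefer,

  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$$

  The first sum includes all edge weights; the second enforces the dependence on consistent triangles (penalizing, e.g., $y_{ij} = y_{jk} = 1$ with $y_{ik} = 0$), so that MAP inference becomes the correlational-clustering graph partitioning of slide 42.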
