
Scalable Information Extraction


Presentation Transcript


  1. Scalable Information Extraction Eugene Agichtein

  2. Example: Angina Treatments
   • Structured databases (e.g., drug information, WHO drug adverse effects DB, etc.)
   • Medical reference and literature
   • Web search results

  3. Research Goal
   Accurate, intuitive, and efficient access to knowledge in unstructured sources. Approaches:
   • Information Retrieval: retrieve the relevant documents or passages; question answering
   • Human Reading: construct domain-specific “verticals” (e.g., MedLine)
   • Machine Reading: extract entities and relationships; build a network of relationships (Semantic Web)

  4. Semantic Relationships “Buried” in Unstructured Text
   Example (RecommendedTreatment): “… A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris …”
   Sources: web pages, newsgroups, web logs; text databases (PubMed, CiteSeer, etc.); newspaper archives
   Classic extraction tasks (Message Understanding Conferences): corporate mergers, succession, location; terrorist attacks

  5. What a Structured Representation Can Do for You
   • Allow precise and efficient querying
   • Return answers instead of documents
   • Support powerful query constructs
   • Allow data integration with (structured) RDBMSs
   • Provide useful content for the Semantic Web

  6. Challenges in Information Extraction
   • Portability: reduce the effort needed to tune for new domains and tasks (MUC systems took experts 8-12 weeks to tune)
   • Scalability, efficiency, access: enable information extraction over large collections (1 sec/document × 5 billion docs = 158 CPU-years)
   • Approach: learn from data (“bootstrapping”)
     • Snowball: partially supervised information extraction
     • Querying large text databases for efficient information extraction

  7. Outline
   • Snowball: partially supervised information extraction (overview and key results)
   • Effective retrieval algorithms for information extraction (in detail)
   • Current: mining user behavior for web search
   • Future work

  8. The Snowball System: Overview
   [Diagram: the Snowball extraction loop over iterations 1, 2, 3]

  9. Snowball: Getting User Input [ACM DL 2000]
   [Diagram: the Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples]
   User input:
   • a handful of example instances
   • integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.

  10. Snowball: Finding Example Occurrences
   Can use any full-text search engine. Example matches:
   • “Computer servers at Microsoft’s headquarters in Redmond…”
   • “In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp…”
   • “The Armonk-based IBM introduced a new line…”
   • “Change of guard at IBM Corporation’s headquarters near Armonk, NY…”

  11. Snowball: Tagging Entities
   Named entity taggers can recognize Dates, People, Locations, Organizations, … (e.g., MITRE’s Alembic, IBM’s Talent, LingPipe). In the example occurrences, Microsoft, Microsoft Corp, and IBM Corporation are tagged as Organizations; Redmond, Redmond, WA, Armonk, and Armonk, NY are tagged as Locations.

  12. Snowball: Extraction Patterns
   Example context: “Computer servers at Microsoft’s headquarters in Redmond…”
   • General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2
   • Acceptor instantiations:
     • String match (accepts the exact string “’s headquarters in”)
     • Vector-space (~ vector [(’s, 0.5), (headquarters, 0.5), (in, 0.5)])
     • Classifier (estimate P(T = valid | ’s, headquarters, in))
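To make the acceptor model concrete, here is a minimal Python sketch of the string-match and vector-space acceptors; the term weights and the 0.5 similarity threshold are illustrative assumptions, not values from the paper.

```python
import math

def string_acceptor(middle_terms):
    """Hard match: accept exactly the string "'s headquarters in"."""
    return 1.0 if " ".join(middle_terms) == "'s headquarters in" else 0.0

def cosine(u, v):
    """Cosine similarity between sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def vector_acceptor(context_vec, pattern_vec, threshold=0.5):
    """Soft match: similarity of context to pattern, zero below threshold."""
    sim = cosine(context_vec, pattern_vec)
    return sim if sim >= threshold else 0.0

# Middle context of "... Microsoft's headquarters in Redmond ..."
ctx = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
pat = {"'s": 0.71, "headquarters": 0.71}
print(string_acceptor(["'s", "headquarters", "in"]))   # 1.0
print(round(vector_acceptor(ctx, pat), 2))             # 0.82
```

The vector-space acceptor degrades gracefully: contexts that share most, but not all, terms with a pattern still match with a proportionally lower score.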

  13. Snowball: Generating Patterns
   1. Represent occurrences as vectors of tags and terms:
      • ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
      • ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
      • LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
      • LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
   2. Cluster similar occurrences.

  14. Snowball: Generating Patterns (continued)
   3. Create patterns as filtered cluster centroids:
      • ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION
      • LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
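A minimal sketch of the single-pass clustering step, reusing cosine() from the acceptor sketch above; the 0.6 similarity threshold is an illustrative assumption. Filtering low-weight terms out of each centroid (e.g., the rare “in”/“near” terms) then yields patterns like the ones on this slide.

```python
def cluster_occurrences(occurrences, sim_threshold=0.6):
    """Single-pass clustering of middle-context vectors (dicts)."""
    clusters = []  # each cluster: {"centroid": vec, "members": [vec, ...]}
    for occ in occurrences:
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(occ, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= sim_threshold:
            best["members"].append(occ)
            members, n = best["members"], len(best["members"])
            terms = {t for m in members for t in m}
            # recompute the centroid as the average of the member vectors;
            # filtering its low-weight terms gives the slide's patterns
            best["centroid"] = {t: sum(m.get(t, 0.0) for m in members) / n
                                for t in terms}
        else:
            clusters.append({"centroid": dict(occ), "members": [occ]})
    return clusters

occs = [{"'s": 0.57, "headquarters": 0.57, "in": 0.57},
        {"'s": 0.57, "headquarters": 0.57, "near": 0.57},
        {"-": 0.71, "based": 0.71},
        {"-": 0.71, "based": 0.71}]
print(len(cluster_occurrences(occs)))  # 2 clusters, as on the slide
```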

  15. Snowball: Extracting New Tuples
   Match tagged text fragments against the patterns. Example: “Google’s new headquarters in Mountain View are…”
   Context: ORGANIZATION {<’s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}
   • P1: ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION, Match = 0.8
   • P2: LOCATION {<located 0.71>, <in 0.71>} ORGANIZATION, Match = 0.4
   • P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION, Match = 0

  16. Snowball: Evaluating Patterns
   Automatically estimate pattern confidence against the current seed tuples: Conf(P4) = Positive / Total = 2/3 = 0.66
   P4: ORGANIZATION {<, 1>} LOCATION
   • “IBM, Armonk, reported…”: positive
   • “Intel, Santa Clara, introduced…”: positive
   • “‘Bet on Microsoft’, New York-based analyst Jane Smith said…”: negative (conflicts with the seed tuple <Microsoft, Redmond>)

  17. Snowball: Evaluating Tuples
   Automatically evaluate tuple confidence:
   Conf(T) = 1 − ∏i (1 − Conf(Pi) × Match(Ci, Pi))
   where the product ranges over the patterns Pi that extracted T from contexts Ci. A tuple has high confidence if it was generated by several high-confidence patterns with good match scores.
   Example: T = <3COM, Santa Clara>, extracted by P3 = LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION (Conf 0.95, Match 0.8) and by P4 (Conf 0.66, Match 0.4):
   Conf(T) = 1 − (1 − 0.95 × 0.8)(1 − 0.66 × 0.4) ≈ 0.83
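A one-function sketch of this combination rule, checked against the slide's numbers.

```python
def tuple_confidence(evidence):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(C_i, P_i))."""
    remaining = 1.0
    for pattern_conf, match in evidence:
        remaining *= 1.0 - pattern_conf * match
    return 1.0 - remaining

# The slide's example: P3 (Conf 0.95, Match 0.8) and P4 (Conf 0.66, Match 0.4)
print(round(tuple_confidence([(0.95, 0.8), (0.66, 0.4)]), 2))  # 0.82 (~0.83)
```

The rule behaves like a noisy-or: each additional extracting pattern can only raise a tuple's confidence, and no single pattern can push it past 1.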

  18. Snowball: Evaluating Tuples
   Keep only high-confidence tuples for the next iteration.

  19. Snowball: Evaluating Tuples
   Start the next iteration with the expanded example set. Iterate until no new tuples are extracted.

  20. Pattern-Tuple Duality
   • A “good” tuple:
     • is extracted by “good” patterns
     • tuple weight reflects goodness
   • A “good” pattern:
     • is generated by “good” tuples
     • extracts “good” new tuples
     • pattern weight reflects goodness
   • Edge weight: match/similarity of the tuple context to the pattern

  21. How to Set Node Weights
   • Constraint violation (from before):
     • Conf(P) = log(Pos) × Pos / (Pos + Neg)
     • Conf(T) = 1 − ∏i (1 − Conf(Pi) × Match(Ci, Pi))
   • HITS [Hassan et al., EMNLP 2006]:
     • Conf(P) = ∑ Conf(T) over the tuples T that P extracts
     • Conf(T) = ∑ Conf(P) over the patterns P that extract T
   • URNS [Downey et al., IJCAI 2005]
   • EM-Spy [Agichtein, SDM 2006]:
     • treat unknown tuples as negative
     • compute Conf(P), Conf(T)
     • iterate
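A minimal sketch of the HITS-style mutual reinforcement, assuming the pattern-tuple graph is given as a dict from each pattern to the tuples it extracted; the per-iteration normalization is a standard stabilizing choice, not necessarily the one used in the cited paper.

```python
def hits_confidences(extracts, iterations=20):
    """extracts: dict mapping each pattern to the tuples it extracted."""
    patterns = list(extracts)
    tuples_ = {t for ts in extracts.values() for t in ts}
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples_}
    for _ in range(iterations):
        # a pattern is good if it extracts good tuples
        for p in patterns:
            conf_p[p] = sum(conf_t[t] for t in extracts[p])
        # a tuple is good if good patterns extract it
        for t in tuples_:
            conf_t[t] = sum(conf_p[p] for p in patterns if t in extracts[p])
        # normalize so the scores stay bounded across iterations
        zp, zt = max(conf_p.values()), max(conf_t.values())
        conf_p = {p: v / zp for p, v in conf_p.items()}
        conf_t = {t: v / zt for t, v in conf_t.items()}
    return conf_p, conf_t

cp, ct = hits_confidences({"P1": ["t1", "t2"], "P2": ["t2", "t3"], "P3": ["t3"]})
print(ct["t2"] > ct["t1"])  # True: t2 is supported by two patterns
```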

  22. Snowball: EM-based Pattern Evaluation

  23. Evaluating Patterns and Tuples: Expectation Maximization
   EM-Spy algorithm:
   • “Hide” the labels of some seed tuples (the “spies”)
   • Iterate the EM algorithm to convergence on tuple/pattern confidence values
   • Set the threshold t so that 90% of the spy tuples score above t
   • Re-initialize Snowball using the new seed tuples
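A sketch of the spy-based threshold step, assuming EM has already assigned confidence scores to the hidden seed tuples; the 90% rule follows the slide.

```python
def spy_threshold(spy_confidences, keep_fraction=0.9):
    """Choose t so that keep_fraction of the spy tuples score at least t."""
    ranked = sorted(spy_confidences, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[k - 1]

# Confidences that EM assigned to the hidden ("spy") seed tuples
spies = [0.95, 0.92, 0.90, 0.88, 0.85, 0.80, 0.75, 0.70, 0.60, 0.30]
print(spy_threshold(spies))  # 0.6: nine of the ten spies score >= 0.6
```

The intuition: the spies are known-good tuples, so a threshold that keeps most of them should also keep most genuinely correct unknown tuples.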

  24. Adapting Snowball for New Relations
   • Large parameter space:
     • initial seed tuples (randomly chosen, multiple runs)
     • acceptor features: words, stems, n-grams, phrases, punctuation, POS
     • feature selection techniques: OR, NB, Freq, “support”, combinations
     • feature weights: TF*IDF, TF, TF*NB, NB
     • pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
   • Automatically estimate parameter values:
     • estimate operating parameters based on occurrences of seed tuples
     • run cross-validation on hold-out sets of seed tuples for optimal performance
     • discard seed occurrences that have no close “neighbors”

  25. Example Task 1: DiseaseOutbreaks [SDM 2006]
   Proteus: 0.409; Snowball: 0.415

  26. Example Task 2: Bioinformatics, a.k.a. Mining the “Bibliome” [ISMB 2003]
   “APO-1, also known as DR6…”; “MEK4, also called SEK1…”
   • 100,000+ gene and protein synonyms extracted from 50,000+ journal articles
   • Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)

  27. Snowball Used in Various Domains
   • News: NYT, WSJ, AP [DL ’00, SDM ’06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
   • Medical literature: PDRHealth, Micromedex, … [Thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
   • Biological literature: GeneWays corpus [ISMB ’03]: gene and protein synonyms

  28. Limits of Bootstrapping for Extraction [CIKM 2005]
   • A task is “easy” when its context term distributions diverge from the background distribution (e.g., distinctive contexts such as “President George W. Bush’s three-day visit to India”)
   • Quantify the divergence as relative entropy (Kullback-Leibler divergence)
   • After calibration, the metric predicts whether bootstrapping is likely to work
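A small sketch of the relative-entropy metric: the KL divergence between the term distribution of extraction contexts and the background collection distribution. The additive smoothing is an assumption to keep the quantity finite, not a detail from the paper.

```python
import math

def kl_divergence(context_counts, background_counts, eps=1e-6):
    """D(context || background) over the shared vocabulary, in bits."""
    vocab = set(context_counts) | set(background_counts)
    cn = sum(context_counts.values()) + eps * len(vocab)
    bn = sum(background_counts.values()) + eps * len(vocab)
    d = 0.0
    for term in vocab:
        p = (context_counts.get(term, 0) + eps) / cn
        q = (background_counts.get(term, 0) + eps) / bn
        d += p * math.log2(p / q)
    return d

# Contexts of an "easy" relation use terms that are rare in the background
context = {"headquarters": 8, "in": 5, "based": 6}
background = {"in": 900, "the": 1500, "said": 400, "based": 20, "headquarters": 10}
print(kl_divergence(context, background) > 0)  # True: distributions diverge
```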

  29. Few Relations Cover Common Questions [SIGIR 2005]
   25 relations cover > 50% of question types; 5 relations cover > 55% of question instances

  30. Outline
   • Snowball, a domain-independent, partially supervised information extraction system
   • Retrieval algorithms for scalable information extraction
   • Current: mining user behavior for web search
   • Future work

  31. Extracting a Relation From a Large Text Database
   • Brute-force approach: feed all documents to the information extraction system; expensive for large collections
   • Only a tiny fraction of the documents is typically useful
   • Many databases are not crawlable
   • Often a search interface is available, with an existing keyword index
   • How do we identify the “useful” documents?

  32. Accessing Text DBs via Search Engines
   Search engines impose limitations:
   • a limit on documents retrieved per query
   • support for only simple keywords and phrases
   • “stopwords” (e.g., “a”, “is”) are ignored

  33. QXtract: Querying Text Databases for Robust Scalable Information EXtraction
   Problem: learn keyword queries to retrieve “promising” documents.
   Pipeline: user-provided seed tuples → query generation → queries → promising documents → information extraction system → extracted relation

  34. Learning Queries to Retrieve Promising Documents
   1. Get a document sample with “likely negative” and “likely positive” examples (seed sampling).
   2. Label the sample documents, using the information extraction system as an “oracle.”
   3. Train classifiers to “recognize” useful documents.
   4. Generate queries from the classifier models/rules (see the sketch below).
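A condensed sketch of steps 2-4, with a hypothetical extract() callable standing in for the extraction oracle; the simple frequency-ratio term scoring is a stand-in for the Okapi/SVM/Ripper learners shown on the next slides.

```python
from collections import Counter

def learn_queries(sample_docs, extract, num_queries=5):
    """Label docs with the extraction oracle, then score and rank terms."""
    pos, neg = Counter(), Counter()
    for doc in sample_docs:
        counts = pos if extract(doc) else neg     # oracle labeling
        counts.update(set(doc.lower().split()))   # document frequency
    # prefer terms frequent in useful documents and rare in useless ones
    scores = {t: pos[t] / (1 + neg[t]) for t in pos}
    return sorted(scores, key=scores.get, reverse=True)[:num_queries]

docs = ["An ebola epidemic was reported in Zaire",
        "New products were exported far and wide"]
oracle = lambda d: [("ebola", "Zaire")] if "epidemic" in d else []
print(learn_queries(docs, oracle, num_queries=2))  # e.g., ['epidemic', 'reported']
```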

  35. Training Classifiers to Recognize “Useful” Documents
   Document features: words. The oracle labels the sample: D1 +, D2 +, D3 −, D4 −. Three learners are trained on the labeled sample:
   • Okapi (IR): a relevance-weighted term list
   • SVM: a weighted term vector
   • Ripper: rules, e.g., disease AND reported => USEFUL
   Learned terms include disease, reported, epidemic, infected, virus, products, exported, used, far.

  36. Generating Queries from Classifiers
   Each learner’s model is converted into keyword queries:
   • the Ripper rule disease AND reported => USEFUL becomes the query “disease AND reported”
   • term-based models yield queries such as “epidemic virus” and “virus infected”
   • QCombined merges the queries from all classifiers, e.g., “disease and reported”, “epidemic virus”

  37. SIGMOD 2003 Demonstration

  38. Tuples: A Simple Querying Strategy
   • Convert known tuples into queries, e.g., “Ebola” AND “Zaire”
   • Retrieve the matching documents
   • Extract new tuples from the retrieved documents, and iterate
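A minimal sketch of this loop, with hypothetical search() and extract() callables standing in for the search engine and the extraction system.

```python
def tuples_strategy(seed_tuples, search, extract, max_queries=1000):
    """Query for each known tuple; queue any newly extracted tuples."""
    known = set(seed_tuples)
    frontier = list(seed_tuples)
    queries = 0
    while frontier and queries < max_queries:
        t = frontier.pop()
        queries += 1
        for doc in search(t):          # e.g., the query "Ebola" AND "Zaire"
            for new_t in extract(doc):
                if new_t not in known:
                    known.add(new_t)
                    frontier.append(new_t)
    return known  # recall stalls once no queried tuple reaches new ones
```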

  39. Comparison of Document Access Methods
   • QXtract: extracted 60% of the relation after retrieving only 10% of the documents in a 135,000-article newspaper database
   • Tuples strategy: recall at most 46%

  40. How to Choose the Best Strategy?
   • Tuples: simple, no training, but limited recall
   • QXtract: robust, but has training and query overhead
   • Scan: no overhead, but must process all documents

  41. Predicting Recall of the Tuples Strategy [WebDB 2003]
   Starting from different seed tuples can lead to success or failure. Can we predict whether Tuples will succeed?

  42. Abstract the Problem: Querying Graph
   Model tuples and documents as a bipartite graph: an edge from tuple t to document d means the query for t retrieves d; an edge from d to t means d contains t. Note: only the top K documents are returned for each query.
   • A query such as <Violence, U.S.> retrieves many documents that contain no tuples
   • Searching for an extracted tuple may not retrieve its source document

  43. Information Reachability Graph
   Collapse the querying graph onto tuples: if t1 retrieves document d1, which contains t2, add the edge t1 → t2. In the example, t2, t3, and t4 are “reachable” from t1.

  44. Connected Components
   • Core: tuples that retrieve other tuples and themselves (strongly connected)
   • Out: reachable tuples that do not retrieve tuples in the Core
   • In: tuples that retrieve other tuples but are not themselves reachable
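One compact way to compute these components, assuming the reachability graph is available as an edge list; this uses the standard strongly-connected-component decomposition (via networkx), not code from the paper.

```python
import networkx as nx

def core_in_out(edges):
    """edges: (t_from, t_to) pairs of the tuple reachability graph."""
    g = nx.DiGraph(edges)
    core = max(nx.strongly_connected_components(g), key=len)
    rep = next(iter(core))                 # any representative of the Core
    out = nx.descendants(g, rep) - core    # reachable from the Core
    in_ = nx.ancestors(g, rep) - core      # retrieve the Core, not reachable
    reachability = (len(core) + len(out)) / g.number_of_nodes()
    return core, in_, out, reachability

edges = [("t1", "t2"), ("t2", "t1"), ("t1", "t3"), ("t4", "t1")]
print(core_in_out(edges))  # Core {t1, t2}, In {t4}, Out {t3}, reachability 0.75
```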

  45. Sizes of Connected Components
   How many tuples are in the largest Core + Out?
   • Conjecture: the degree distribution in reachability graphs follows a power law
   • Then the reachability graph has at most one giant component
   • Define Reachability as the fraction of tuples in the largest Core + Out

  46. NYT Reachability Graph: Outdegree Distribution
   [Plots: outdegree distributions for MaxResults = 10 and MaxResults = 50; both match a power-law distribution]

  47. NYT: Component Size Distribution
   [Plots: component sizes, split into “reachable” and not “reachable”, for MaxResults = 10 (CG / |T| = 0.297) and MaxResults = 50 (CG / |T| = 0.620)]

  48. Connected Components Visualization (DiseaseOutbreaks, New York Times 1995)

  49. Estimating Reachability
   In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1, provided the power-law exponent β < 3.457 [Chung and Lu, Annals of Combinatorics, 2002].
   • Estimate: Reachability ≈ |CG| / |T|
   • Depends only on d (the average outdegree)

  50. Estimating Reachability: Algorithm
   • Pick some random tuples
   • Use the tuples to query the database
   • Extract tuples from the matching documents to compute reachability-graph edges
   • Estimate the average outdegree d (d = 1.5 in the slide’s example)
   • Estimate reachability from d using the results of Chung and Lu [Annals of Combinatorics, 2002]
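A sketch of the sampling procedure, with hypothetical query_db() and extract_tuples() callables; the closing comment only notes the d > 1 criterion rather than reproducing the full Chung-Lu estimate.

```python
import random

def estimate_avg_outdegree(all_tuples, query_db, extract_tuples,
                           sample_size=20):
    """Estimate d, the average outdegree of the reachability graph."""
    sample = random.sample(list(all_tuples), min(sample_size, len(all_tuples)))
    edges = 0
    for t in sample:
        reached = set()
        for doc in query_db(t):           # top-K documents for t's query
            reached.update(extract_tuples(doc))
        reached.discard(t)                # self-loops do not count
        edges += len(reached)
    return edges / len(sample)

# d > 1 (with power-law exponent < 3.457) predicts a giant component,
# i.e., high recall for the Tuples strategy on this database.
```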
