Discovery and Reconciliation of Entity Type Taxonomies Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/~soumen
Searching with types and entities
• Answer types
  • How far is it from Rome to Paris?
  • type=distance#n#1 NEAR words={Rome, Paris}
• Restrictions on match conditions
  • How many movies did 20th Century Fox release in 1923?
  • …by 1950, the forest had only 1923 foxes left…
  • type={number#n#1, hasDigit} NEAR … year=1923 organization="20th Century Fox"
• Corpus = set of docs, doc = token sequence, tokens connected to lexical networks
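A minimal sketch of the answer-typing test behind a query like type=distance#n#1, assuming NLTK's WordNet interface; the helper name satisfies_type is illustrative, not from the talk:

```python
# Sketch: test whether a candidate token satisfies an answer type by
# walking WordNet's hypernym hierarchy. The helper name is illustrative.
from nltk.corpus import wordnet as wn

def satisfies_type(word: str, type_synset: str) -> bool:
    """True if some noun sense of `word` is the type synset itself
    or has it as a (transitive) hypernym."""
    target = wn.synset(type_synset)
    for sense in wn.synsets(word, pos=wn.NOUN):
        if sense == target or target in sense.closure(lambda s: s.hypernyms()):
            return True
    return False

# e.g. "dog" lies under animal.n.01; the same test, with distance.n.01
# as the target, would screen candidate answers for the Rome-Paris query
print(satisfies_type("dog", "animal.n.01"))
```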
Searching personal info networks
[Figure: a graph over personal data. Email1 ("…your ECML submission titled XYZ has been accepted…") and Email2 ("Could I get a preprint of your recent ECML paper?") connect through EmailTo, EmailDate, and LastMod edges to a canonical node "A. U. Thor" and to the PDF file XYZ (A. Thor, Abstract: …) that we want to quickly find]
• No clean schema, data changes rapidly
• Lots of generic "graph proximity" information
Building blocks
• Structured "seeds" of type info
  • WordNet, Wikipedia, OpenCYC, …
• Semi-structured sources
  • List of faculty members in a department
  • Catalog of products at an ecommerce site
• Unstructured open-domain text
  • Email bodies, text of papers, blog text, Web pages, …
• Discovery and extension of type attachments
  • Hearst patterns, list extraction, NE taggers
• Reconciling federated type systems
  • Schema and data integration
• Query execution engines using type catalogs
Hearst patterns and enhancements
• Hearst, 1992; KnowItAll (Etzioni+ 2004)
  • T such as x, x and other Ts, x or other Ts, T x, x is a T, x is the only T, …
• C-PANKOW (Cimiano and Staab 2005)
• Suitable for unformatted natural language
• Generally high precision, low recall
• If few possible Ts, use a named entity tagger
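A minimal sketch of the classic "T such as x" pattern as a regular expression; the regex is illustrative and far cruder than the systems cited above:

```python
# Sketch: extract (instance, type) pairs with the Hearst pattern
# "T such as x, y and z". The regex is an illustrative simplification.
import re

PATTERN = re.compile(
    r"(?P<type>\w+)\s+such as\s+"
    r"(?P<insts>\w+(?:\s*,\s*\w+)*(?:\s*(?:,|\band\b|\bor\b)\s*\w+)?)",
    re.IGNORECASE,
)

def hearst_pairs(text: str):
    for m in PATTERN.finditer(text):
        for inst in re.split(r"\s*(?:,|\band\b|\bor\b)\s*", m.group("insts")):
            if inst:
                yield inst, m.group("type")

print(list(hearst_pairs("countries such as France, Italy and Spain")))
# -> [('France', 'countries'), ('Italy', 'countries'), ('Spain', 'countries')]
```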
Set extraction
• Each node of a graph is a word
• Edge connects words w_i and w_j if these words occur together in more than k docs
  • Use Apriori-style searches to enumerate
  • Edge weight depends on #docs
• Given a set of words Q, set up a PageRank
  • Random surfer on word graph
  • W.p. d, jump to some element of Q
  • W.p. 1−d, walk to a neighbor
• Present nodes (words) with largest PageRank
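A minimal sketch of this random-surfer scoring as personalized PageRank by power iteration; the toy graph and parameter values are assumptions, not from the talk:

```python
# Sketch: personalized PageRank on a word co-occurrence graph.
# W.p. d the surfer jumps back to the query set Q; w.p. 1-d it walks
# to a neighbor in proportion to edge weight.
from collections import defaultdict

def set_expansion(edges, Q, d=0.15, iters=50):
    """edges: {(wi, wj): weight}; Q: seed words. Returns ranked words."""
    nbrs = defaultdict(dict)
    for (wi, wj), w in edges.items():
        nbrs[wi][wj] = w
        nbrs[wj][wi] = w
    words = list(nbrs)
    score = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        nxt = {w: (d / len(Q) if w in Q else 0.0) for w in words}
        for w in words:
            out = sum(nbrs[w].values())
            for v, wt in nbrs[w].items():
                nxt[v] += (1 - d) * score[w] * wt / out
        score = nxt
    return sorted(score, key=score.get, reverse=True)

edges = {("cat", "dog"): 5, ("dog", "wolf"): 3, ("cat", "wolf"): 2,
         ("stock", "bond"): 4}
print(set_expansion(edges, Q={"cat"})[:3])  # the animal words rank first
```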
List extraction
• Given a current set of candidate Ts
  • Limit to candidates having high confidence
  • Select a random subset of k=4 candidates
• Generate a query from the selected candidates
• Download response documents
• Look for lists containing candidate mentions
• Extract more instances from the lists found
• Boosts extraction rate by a factor of 2–5
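A minimal sketch of this loop; `search` and `find_html_lists` are hypothetical helpers (a real system would call a search API and an HTML parser), and only the control flow follows the slide:

```python
# Sketch of the list-extraction loop. `search` and `find_html_lists`
# are hypothetical helpers passed in by the caller; only the control
# flow here mirrors the slide.
import random

def expand_by_lists(candidates, confidence, search, find_html_lists,
                    k=4, min_conf=0.8):
    seeds = [c for c in candidates if confidence[c] >= min_conf]
    picked = random.sample(seeds, k)
    query = " ".join(f'"{c}"' for c in picked)
    new_instances = set()
    for page in search(query):
        for items in find_html_lists(page):           # <ul>/<ol>/<table> items
            if sum(c in items for c in picked) >= 2:  # list mentions our seeds
                new_instances.update(items)
    return new_instances - set(candidates)
```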
Wrapping, scraping, tagging
• HTML formatting clues
  • Help extract records and fields
• Extensive work in the DB, KDD, ML communities
[Figure: an extracted record, node P167 is-a Paper, has-author Gerhard Pass and Gregor Heinrich, has-title "Investigating word correlation…"]
Reconciling type systems
• WordNet: small and precise
• Wikipedia: much larger, less controlled
• Collect into a common is-a database
Mapping between taxonomies
• Each type has a set of instances
  • Assoc Prof: K. Burn, R. Cook
  • Synset: lemmas from leaf instances
  • Wikipedia concept: list of instances
  • Yahoo topic: set of example Web pages
• Goal: establish connections between types
  • Connections could be "soft" or probabilistic
Cross-training
• Set of labels or types B, partly related but not identical to type set A
  • A=Dmoz topics, B=Yahoo topics
  • A=Personal bookmark topics, B=Yahoo topics
• Training docs come in two flavors now
  • Fully labeled with A and B labels (rare)
  • Half-labeled with either an A or a B label (sets DA and DB)
• Can B make classification for A more accurate (and vice versa)?
• Inductive transfer, multi-task learning (Sarawagi+ 2003)
Motivation
• Symmetric taxonomy mapping
  • Ecommerce catalogs: A=distributor, B=retailer
  • Web directories: A=Dmoz, B=Yahoo
• Incomplete taxonomies, small training sets
  • Bookmark taxonomy vs. Yahoo
• Cartesian label spaces
[Figure: two topic trees (Top→Sports→{Baseball, Cricket} and Top→Regional→{UK, USA}) crossed into a grid of label pairs, each cell holding a label-pair-conditioned term distribution]
Labels as features
• A-label known, estimate B-label
• Suppose we have an A+B-labeled training set
  • Discrete-valued "label column"
  • Most text classifiers cannot balance importance of very heterogeneous features
• Do not have fully-labeled data
  • Must guess (use soft scores instead of 0/1)
[Figure: term feature values plus the encoded label column form an augmented feature vector, fed to a classifier for the target label]
SVM-CT: Cross-trained SVM
[Figure: docs having only A-labels (DA–DB) train a one-vs-rest SVM ensemble for A, which returns |A| scores per test doc (signed distance from the separator); docs having only B-labels (DB–DA) have their term features augmented with these scores, or with the A-label coded as a vector of +1 and −1 when it is known, and train the one-vs-rest SVM ensemble for B (the target label set); the stages alternate, giving ensembles S(A,0), S(B,1), S(A,1), S(B,2), S(A,2), …]
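A minimal sketch of one cross-training stage, assuming scikit-learn (the slides do not prescribe an implementation): the A ensemble's |A| signed scores are appended to the term features of the B-labeled docs.

```python
# Sketch of one SVM-CT stage: train S(A,0) on A-labeled docs, append
# its |A| signed scores to the term features of B-labeled docs, and
# train S(B,1) on the augmented vectors. The random data is made up.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

np.random.seed(0)
X_A, y_A = np.random.rand(40, 300), np.random.randint(0, 5, 40)  # A-labeled docs
X_B, y_B = np.random.rand(30, 300), np.random.randint(0, 4, 30)  # B-labeled docs

svm_A = OneVsRestClassifier(LinearSVC()).fit(X_A, y_A)      # S(A,0)
scores = svm_A.decision_function(X_B)                       # |A| scores per doc
X_B_aug = np.hstack([X_B, scores])                          # augmented features
svm_B = OneVsRestClassifier(LinearSVC()).fit(X_B_aug, y_B)  # S(B,1)
```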
SVM-CT anecdotes
• Discriminant reveals relations between A and B
  • One-to-one, many-to-one, related, antagonistic
• However, accuracy gains are meager
[Figure: label-feature weights from the discriminant, split into positive and negative]
EM1D: Info from unlabeled docs
• Use training docs to induce an initial classifier for taxonomy B, say
• Repeat until classifier satisfactory
  • Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B
  • Reweigh d by factor Pr(β|d) and add to training set for label β
  • Retrain classifier
• EM1D: expectation maximization with one label set B (Nigam et al.)
• Ignores labels from the other taxonomy A
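A minimal sketch of EM1D in the style of Nigam et al., with multinomial naive Bayes implemented directly in numpy; the smoothing and all names are illustrative, and every class is assumed to appear among the labeled docs:

```python
# Sketch of EM1D: seed naive Bayes on the labeled docs, then alternate
# E-step (soft-label the unlabeled docs) and M-step (retrain with each
# unlabeled doc reweighed by its posterior). X arrays hold term counts.
import numpy as np

def em1d(X_lab, y_lab, X_unlab, n_classes, n_iters=10):
    n_terms = X_lab.shape[1]
    resp_lab = np.eye(n_classes)[y_lab]            # one-hot hard labels
    X = np.vstack([X_lab, X_unlab])

    def m_step(resp):
        prior = resp.sum(axis=0) / resp.sum()
        counts = resp.T @ X                        # class-term counts
        theta = (1 + counts) / (n_terms + counts.sum(axis=1, keepdims=True))
        return prior, theta

    # seed on labeled docs only (unlabeled docs get zero weight at first)
    prior, theta = m_step(np.vstack([resp_lab,
                                     np.zeros((len(X_unlab), n_classes))]))
    for _ in range(n_iters):
        # E-step: Pr(beta | d) for each unlabeled doc
        logp = np.log(prior) + X_unlab @ np.log(theta).T
        resp_unlab = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp_unlab /= resp_unlab.sum(axis=1, keepdims=True)
        # M-step: retrain with unlabeled docs reweighed by Pr(beta | d)
        prior, theta = m_step(np.vstack([resp_lab, resp_unlab]))
    return prior, theta
```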
Stratified EM1D
• Target labels = B
• B-labeled docs (DB–DA) are labeled training instances
• Consider A-labeled docs (DA–DB) with label α
  • These are unlabeled for taxonomy B
• Run EM1D for each row α
• Test instance has known α
  • Invoke the semi-supervised model for row α to classify B-topics
[Figure: rows indexed by A topics; the docs in DA–DB labeled α, α′, … form one row each, alongside the B-labeled docs in DB–DA]
EM2D: Cartesian product EM
• Initialize with fully labeled docs, which go to a specific (α,β) cell
• Smear each half-labeled training doc across its label row or column
  • Uniform smear could be bad
  • Use a naïve Bayes classifier to seed
• Parameters extended from EM1D
  • π_{α,β}: prior probability for label pair (α,β)
  • θ_{α,β,t}: multinomial term probability for (α,β)
[Figure: a grid with labels in A as rows and labels in B as columns; an A-labeled doc smears across one row, a B-labeled doc down one column]
EM2D updates
• E-step for an A-labeled document d with label α (n(d,t) = count of term t in d):
  Pr(β | d, α) ∝ π_{α,β} ∏_t θ_{α,β,t}^{n(d,t)}, normalized over β
• M-step: updated class-pair priors and class-pair-conditioned term stats:
  π_{α,β} ∝ Σ_d Pr(α,β | d)
  θ_{α,β,t} ∝ 1 + Σ_d Pr(α,β | d)·n(d,t), normalized over t (Laplace smoothing)
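A minimal numpy sketch of one such round over the |A| × |B| grid, under the usual multinomial naive Bayes assumptions; the Laplace smoothing and all names are illustrative:

```python
# Sketch of one EM2D round. R holds the soft assignment Pr(alpha, beta | d)
# for every doc; an A-labeled doc's mass is confined to its row, a
# B-labeled doc's to its column, a fully labeled doc's to one cell.
import numpy as np

def em2d_round(X, R, row_of, col_of, nA, nB):
    """X: (n_docs, n_terms) term counts; R: (n_docs, nA, nB) soft cells.
    row_of[d] = alpha if doc d is A-labeled else None; likewise col_of."""
    n_docs, n_terms = X.shape
    # M-step: class-pair priors and smoothed term probabilities
    pi = R.sum(axis=0) / R.sum()                         # (nA, nB)
    counts = np.einsum("dab,dt->abt", R, X)              # (nA, nB, n_terms)
    theta = (1 + counts) / (n_terms + counts.sum(axis=2, keepdims=True))
    # E-step: recompute cell posteriors, masked to the known row/column
    logp = np.log(pi)[None] + np.einsum("dt,abt->dab", X, np.log(theta))
    for d in range(n_docs):
        mask = np.zeros((nA, nB), bool)
        a, b = row_of[d], col_of[d]
        if a is not None and b is not None: mask[a, b] = True  # fully labeled
        elif a is not None:                 mask[a, :] = True  # smear row
        elif b is not None:                 mask[:, b] = True  # smear column
        else:                               mask[:, :] = True  # unlabeled
        p = np.where(mask, logp[d], -np.inf)
        p = np.exp(p - p.max())
        R[d] = p / p.sum()
    return pi, theta, R
```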
Applying EM2D to a test doc
• Mapping a B-labeled test doc d to an A label (e-commerce catalogs)
  • Given β, find argmax_α Pr(α,β | d)
• Classifying a document d with no labels to an A label
  • Aggregation: for each α, compute Σ_β Pr(α,β | d), pick the best α
  • Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α,β* | d)
• EM pitfalls: damping factor, early stopping
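The three readings of the grid posterior, in a short sketch; P is the (|A|, |B|) matrix of Pr(α,β | d) from EM2D, and the function names are illustrative:

```python
# Sketch: reading an A-label off the cell posterior P = Pr(alpha, beta | d).
import numpy as np

def map_with_known_beta(P, beta):             # B-labeled test doc
    return int(np.argmax(P[:, beta]))

def classify_aggregate(P):                    # sum out beta, pick best alpha
    return int(np.argmax(P.sum(axis=1)))

def classify_guess(P, b_scores):              # EM2D-G: guess beta* first
    beta_star = int(np.argmax(b_scores))
    return int(np.argmax(P[:, beta_star]))
```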
Experiments
• Selected 5 Dmoz and Yahoo subtree pairs
• Compare EM2D against
  • Naïve Bayes, best #features and smoothing
  • EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
  • Stratified EM1D
• Mapping a test doc with an A-label to a B-label, or vice versa
• Classifying a zero-labeled test doc
• Accuracy = fraction with correct labels
Accuracy benefits in mapping
[Figure: accuracy bars; improvement over NB is 30% at best, 10% on average]
• EM1D and NB are close, because training set sizes for each taxonomy are not too small
• EM2D > Stratified EM1D > NB
• 2D transfer of model info seems important
Asymmetric setting
• Few (only 300) bookmarked URLs (taxonomy B, target)
• Many Yahoo URLs, larger number of classes (taxonomy A)
• Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle population skew
Zero-labeled test documents
• EM1D improves accuracy only for 12 train docs
• EM2D with guessing improves beyond EM1D
  • In fact, better than aggregating scores to 1d
• Choice of the unlabeled:labeled damping ratio L may be important to get benefits
Robustness to initialization
[Figure: accuracy under naïve Bayes smear vs. uniform smear]
• Seeding choices: hard (best class), NB scores, uniform
• Smear a fraction uniformly, the rest by NB scores
• EM2D is robust to a wide range of smear fractions
• Fully uniform smearing can fail (local optima)
Handling constraints
• Type systems are often hierarchical
• Neighborhood constraints
  • Two nodes match if their children also match
  • Two nodes match if their parents match and some of their descendants also match
  • If all children of node X match node Y, then X also matches Y
  • If a node in the neighborhood of node X matches ASSOCIATE-PROFESSOR, then the chance that X matches PROFESSOR is increased
• Three-stage approach (Doan+ 2002) …
Three-stage reconciliation
• Use EM2D to build a distribution estimator
  • For every type pair A, B, find, over the domain of entities, Pr(A,B), Pr(A,!B), Pr(!A,B), Pr(!A,!B)
• From these, compute a local similarity measure
  • E.g. Jaccard: sim(A,B) = Pr(A,B) / (Pr(A,B) + Pr(A,!B) + Pr(!A,B))
• Perform relaxation labeling to find a good mapping (Chakrabarti+ 1998, Doan+ 2002)
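A minimal sketch of the Jaccard step over a shared pool of entities; estimating the joint probabilities is the hard part (EM2D on the earlier slides), so plain membership sets stand in here:

```python
# Sketch: Jaccard similarity between a type A from one taxonomy and a
# type B from another, estimated over a shared entity pool.
def jaccard(instances_A: set, instances_B: set, universe: set) -> float:
    p_ab     = len(instances_A & instances_B) / len(universe)  # Pr(A, B)
    p_a_notb = len(instances_A - instances_B) / len(universe)  # Pr(A, !B)
    p_nota_b = len(instances_B - instances_A) / len(universe)  # Pr(!A, B)
    return p_ab / (p_ab + p_a_notb + p_nota_b)

faculty = {"K. Burn", "R. Cook", "A. U. Thor"}
assoc_prof = {"K. Burn", "R. Cook"}
print(jaccard(assoc_prof, faculty, universe=faculty | {"XYZ"}))  # ~0.67
```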
Relaxation labeling
• f maps each label X in O1 to a label L in O2
• Markovian assumption:
  Pr(f(X) = L | everything known) ≈ Pr(f(X) = L | features of the immediate neighborhood of X)
• Initialize some reasonable f and reassign via Bayes rule until f stabilizes
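A minimal sketch of the fixed-point loop under an illustrative neighborhood score; the real conditional in Doan+ 2002 is learned, so `local_sim` and the agreement weight are stand-ins:

```python
# Sketch of relaxation labeling: start from the local similarities, then
# repeatedly reassign each O1 node the O2 label its neighborhood supports
# best, until the mapping f stabilizes. All names are illustrative.
def relaxation_label(nodes_O1, labels_O2, nbrs_O1, nbrs_O2, local_sim, w=0.5):
    f = {x: max(labels_O2, key=lambda L: local_sim(x, L)) for x in nodes_O1}
    for _ in range(100):                 # cap iterations; usually stabilizes fast
        changed = False
        for x in nodes_O1:
            def score(L):
                # neighborhood agreement: X mapping to L is more likely
                # if X's neighbors currently map to L's neighbors
                agree = sum(1 for y in nbrs_O1[x] if f[y] in nbrs_O2[L])
                return local_sim(x, L) + w * agree
            best = max(labels_O2, key=score)
            if best != f[x]:
                f[x], changed = best, True
        if not changed:
            break
    return f
```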
Summary
To realize the semantic Web vision we must
• Assemble type schemas from diverse sources
• Mine type instances automatically
• Annotate large corpora efficiently
• Build indices integrating text and annotations
• Support schema-agnostic query languages
  • When did you last type XQuery into Google?
• Design high-performance query execution engines
  • New family of ranking functions