Discovery and Reconciliation of Entity Type Taxonomies Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/~soumen
Searching with types and entities
• Answer types
  • How far is it from Rome to Paris?
  • type=distance#n#1 NEAR words={Rome, Paris}
• Restrictions on match conditions
  • How many movies did 20th Century Fox release in 1923?
  • …by 1950, the forest had only 1923 foxes left…
  • type={number#n#1, hasDigit} NEAR … year=1923 organization="20th Century Fox"
• Corpus = set of docs, doc = token sequence, tokens connected to lexical networks
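A minimal sketch of the answer-typing test behind a query like type=distance#n#1, assuming NLTK's WordNet interface; the helper name satisfies_type is illustrative, not from the talk:

```python
# Sketch: test whether a candidate token satisfies an answer type by
# walking WordNet's hypernym hierarchy. The helper name is illustrative.
from nltk.corpus import wordnet as wn

def satisfies_type(word: str, type_synset: str) -> bool:
    """True if some noun sense of `word` is the type synset itself
    or has it as a (transitive) hypernym."""
    target = wn.synset(type_synset)
    for sense in wn.synsets(word, pos=wn.NOUN):
        if sense == target or target in sense.closure(lambda s: s.hypernyms()):
            return True
    return False

# e.g. "dog" lies under animal.n.01; the same test, with distance.n.01
# as the target, would screen candidate answers for the Rome-Paris query
print(satisfies_type("dog", "animal.n.01"))
```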
Searching personal info networks
[Figure: a graph over personal data. Email1 ("…your ECML submission titled XYZ has been accepted…") and Email2 ("Could I get a preprint of your recent ECML paper?") connect through EmailTo, EmailDate, and LastMod edges to a canonical node "A. U. Thor" and to the PDF file XYZ (A. Thor, Abstract: …) that we want to quickly find]
• No clean schema, data changes rapidly
• Lots of generic "graph proximity" information
Building blocks
• Structured "seeds" of type info
  • WordNet, Wikipedia, OpenCYC, …
• Semi-structured sources
  • List of faculty members in a department
  • Catalog of products at an ecommerce site
• Unstructured open-domain text
  • Email bodies, text of papers, blog text, Web pages, …
• Discovery and extension of type attachments
  • Hearst patterns, list extraction, NE taggers
• Reconciling federated type systems
  • Schema and data integration
• Query execution engines using type catalogs
Hearst patterns and enhancements
• Hearst, 1992; KnowItAll (Etzioni+ 2004)
  • T such as x, x and other Ts, x or other Ts, T x, x is a T, x is the only T, …
• C-PANKOW (Cimiano and Staab 2005)
• Suitable for unformatted natural language
• Generally high precision, low recall
• If few possible Ts, use a named entity tagger
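A minimal sketch of the classic "T such as x" pattern as a regular expression; the regex is illustrative and far cruder than the systems cited above:

```python
# Sketch: extract (instance, type) pairs with the Hearst pattern
# "T such as x, y and z". The regex is an illustrative simplification.
import re

PATTERN = re.compile(
    r"(?P<type>\w+)\s+such as\s+"
    r"(?P<insts>\w+(?:\s*,\s*\w+)*(?:\s*(?:,|\band\b|\bor\b)\s*\w+)?)",
    re.IGNORECASE,
)

def hearst_pairs(text: str):
    for m in PATTERN.finditer(text):
        for inst in re.split(r"\s*(?:,|\band\b|\bor\b)\s*", m.group("insts")):
            if inst:
                yield inst, m.group("type")

print(list(hearst_pairs("countries such as France, Italy and Spain")))
# -> [('France', 'countries'), ('Italy', 'countries'), ('Spain', 'countries')]
```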
Set extraction
• Each node of a graph is a word
• Edge connects words w_i and w_j if these words occur together in more than k docs
  • Use Apriori-style searches to enumerate
  • Edge weight depends on #docs
• Given a set of words Q, set up a PageRank
  • Random surfer on word graph
  • W.p. d, jump to some element of Q
  • W.p. 1−d, walk to a neighbor
• Present nodes (words) with largest PageRank
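A minimal sketch of this random-surfer scoring as personalized PageRank by power iteration; the toy graph and parameter values are assumptions, not from the talk:

```python
# Sketch: personalized PageRank on a word co-occurrence graph.
# W.p. d the surfer jumps back to the query set Q; w.p. 1-d it walks
# to a neighbor in proportion to edge weight.
from collections import defaultdict

def set_expansion(edges, Q, d=0.15, iters=50):
    """edges: {(wi, wj): weight}; Q: seed words. Returns ranked words."""
    nbrs = defaultdict(dict)
    for (wi, wj), w in edges.items():
        nbrs[wi][wj] = w
        nbrs[wj][wi] = w
    words = list(nbrs)
    score = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        nxt = {w: (d / len(Q) if w in Q else 0.0) for w in words}
        for w in words:
            out = sum(nbrs[w].values())
            for v, wt in nbrs[w].items():
                nxt[v] += (1 - d) * score[w] * wt / out
        score = nxt
    return sorted(score, key=score.get, reverse=True)

edges = {("cat", "dog"): 5, ("dog", "wolf"): 3, ("cat", "wolf"): 2,
         ("stock", "bond"): 4}
print(set_expansion(edges, Q={"cat"})[:3])  # the animal words rank first
```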
List extraction
• Given a current set of candidate Ts
  • Limit to candidates having high confidence
  • Select a random subset of k=4 candidates
• Generate a query from the selected candidates
• Download response documents
• Look for lists containing candidate mentions
• Extract more instances from the lists found
• Boosts extraction rate by a factor of 2–5
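A minimal sketch of this loop; `search` and `find_html_lists` are hypothetical helpers (a real system would call a search API and an HTML parser), and only the control flow follows the slide:

```python
# Sketch of the list-extraction loop. `search` and `find_html_lists`
# are hypothetical helpers passed in by the caller; only the control
# flow here mirrors the slide.
import random

def expand_by_lists(candidates, confidence, search, find_html_lists,
                    k=4, min_conf=0.8):
    seeds = [c for c in candidates if confidence[c] >= min_conf]
    picked = random.sample(seeds, k)
    query = " ".join(f'"{c}"' for c in picked)
    new_instances = set()
    for page in search(query):
        for items in find_html_lists(page):           # <ul>/<ol>/<table> items
            if sum(c in items for c in picked) >= 2:  # list mentions our seeds
                new_instances.update(items)
    return new_instances - set(candidates)
```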
Wrapping, scraping, tagging
• HTML formatting clues
  • Help extract records and fields
• Extensive work in the DB, KDD, ML communities
[Figure: an extracted record, node P167 is-a Paper, has-author Gerhard Pass and Gregor Heinrich, has-title "Investigating word correlation…"]
Reconciling type systems
• WordNet: small and precise
• Wikipedia: much larger, less controlled
• Collect into a common is-a database
Mapping between taxonomies
• Each type has a set of instances
  • Assoc Prof: K. Burn, R. Cook
  • Synset: lemmas from leaf instances
  • Wikipedia concept: list of instances
  • Yahoo topic: set of example Web pages
• Goal: establish connections between types
  • Connections could be "soft" or probabilistic
Cross-training
• Set of labels or types B, partly related but not identical to type set A
  • A=Dmoz topics, B=Yahoo topics
  • A=Personal bookmark topics, B=Yahoo topics
• Training docs come in two flavors now
  • Fully labeled with A and B labels (rare)
  • Half-labeled with either an A or a B label (sets DA and DB)
• Can B make classification for A more accurate (and vice versa)?
• Inductive transfer, multi-task learning (Sarawagi+ 2003)
Motivation
• Symmetric taxonomy mapping
  • Ecommerce catalogs: A=distributor, B=retailer
  • Web directories: A=Dmoz, B=Yahoo
• Incomplete taxonomies, small training sets
  • Bookmark taxonomy vs. Yahoo
• Cartesian label spaces
[Figure: two topic trees (Top→Sports→{Baseball, Cricket} and Top→Regional→{UK, USA}) crossed into a grid of label pairs, each cell holding a label-pair-conditioned term distribution]
Labels as features
• A-label known, estimate B-label
• Suppose we have an A+B-labeled training set
  • Discrete-valued "label column"
  • Most text classifiers cannot balance importance of very heterogeneous features
• Do not have fully-labeled data
  • Must guess (use soft scores instead of 0/1)
[Figure: term feature values plus the encoded label column form an augmented feature vector, fed to a classifier for the target label]
SVM-CT: Cross-trained SVM
[Figure: docs having only A-labels (DA–DB) train a one-vs-rest SVM ensemble for A, which returns |A| scores per test doc (signed distance from the separator); docs having only B-labels (DB–DA) have their term features augmented with these scores, or with the A-label coded as a vector of +1 and −1 when it is known, and train the one-vs-rest SVM ensemble for B (the target label set); the stages alternate, giving ensembles S(A,0), S(B,1), S(A,1), S(B,2), S(A,2), …]
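A minimal sketch of one cross-training stage, assuming scikit-learn (the slides do not prescribe an implementation): the A ensemble's |A| signed scores are appended to the term features of the B-labeled docs.

```python
# Sketch of one SVM-CT stage: train S(A,0) on A-labeled docs, append
# its |A| signed scores to the term features of B-labeled docs, and
# train S(B,1) on the augmented vectors. The random data is made up.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

np.random.seed(0)
X_A, y_A = np.random.rand(40, 300), np.random.randint(0, 5, 40)  # A-labeled docs
X_B, y_B = np.random.rand(30, 300), np.random.randint(0, 4, 30)  # B-labeled docs

svm_A = OneVsRestClassifier(LinearSVC()).fit(X_A, y_A)      # S(A,0)
scores = svm_A.decision_function(X_B)                       # |A| scores per doc
X_B_aug = np.hstack([X_B, scores])                          # augmented features
svm_B = OneVsRestClassifier(LinearSVC()).fit(X_B_aug, y_B)  # S(B,1)
```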
SVM-CT anecdotes
• Discriminant reveals relations between A and B
  • One-to-one, many-to-one, related, antagonistic
• However, accuracy gains are meager
[Figure: label-feature weights from the discriminant, split into positive and negative]
EM1D: Info from unlabeled docs
• Use training docs to induce an initial classifier for taxonomy B, say
• Repeat until classifier satisfactory
  • Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B
  • Reweigh d by factor Pr(β|d) and add to training set for label β
  • Retrain classifier
• EM1D: expectation maximization with one label set B (Nigam et al.)
• Ignores labels from the other taxonomy A
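A minimal sketch of EM1D in the style of Nigam et al., with multinomial naive Bayes implemented directly in numpy; the smoothing and all names are illustrative, and every class is assumed to appear among the labeled docs:

```python
# Sketch of EM1D: seed naive Bayes on the labeled docs, then alternate
# E-step (soft-label the unlabeled docs) and M-step (retrain with each
# unlabeled doc reweighed by its posterior). X arrays hold term counts.
import numpy as np

def em1d(X_lab, y_lab, X_unlab, n_classes, n_iters=10):
    n_terms = X_lab.shape[1]
    resp_lab = np.eye(n_classes)[y_lab]            # one-hot hard labels
    X = np.vstack([X_lab, X_unlab])

    def m_step(resp):
        prior = resp.sum(axis=0) / resp.sum()
        counts = resp.T @ X                        # class-term counts
        theta = (1 + counts) / (n_terms + counts.sum(axis=1, keepdims=True))
        return prior, theta

    # seed on labeled docs only (unlabeled docs get zero weight at first)
    prior, theta = m_step(np.vstack([resp_lab,
                                     np.zeros((len(X_unlab), n_classes))]))
    for _ in range(n_iters):
        # E-step: Pr(beta | d) for each unlabeled doc
        logp = np.log(prior) + X_unlab @ np.log(theta).T
        resp_unlab = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp_unlab /= resp_unlab.sum(axis=1, keepdims=True)
        # M-step: retrain with unlabeled docs reweighed by Pr(beta | d)
        prior, theta = m_step(np.vstack([resp_lab, resp_unlab]))
    return prior, theta
```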
Stratified EM1D
• Target labels = B
• B-labeled docs (DB–DA) are labeled training instances
• Consider A-labeled docs (DA–DB) with label α
  • These are unlabeled for taxonomy B
• Run EM1D for each row α
• Test instance has known α
  • Invoke the semi-supervised model for row α to classify B-topics
[Figure: rows indexed by A topics; the docs in DA–DB labeled α, α′, … form one row each, alongside the B-labeled docs in DB–DA]
EM2D: Cartesian product EM
• Initialize with fully labeled docs, which go to a specific (α,β) cell
• Smear each half-labeled training doc across its label row or column
  • Uniform smear could be bad
  • Use a naïve Bayes classifier to seed
• Parameters extended from EM1D
  • π_{α,β}: prior probability for label pair (α,β)
  • θ_{α,β,t}: multinomial term probability for (α,β)
[Figure: a grid with labels in A as rows and labels in B as columns; an A-labeled doc smears across one row, a B-labeled doc down one column]
EM2D updates
• E-step for an A-labeled document d with label α (n(d,t) = count of term t in d):
  Pr(β | d, α) ∝ π_{α,β} ∏_t θ_{α,β,t}^{n(d,t)}, normalized over β
• M-step: updated class-pair priors and class-pair-conditioned term stats:
  π_{α,β} ∝ Σ_d Pr(α,β | d)
  θ_{α,β,t} ∝ 1 + Σ_d Pr(α,β | d)·n(d,t), normalized over t (Laplace smoothing)
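A minimal numpy sketch of one such round over the |A| × |B| grid, under the usual multinomial naive Bayes assumptions; the Laplace smoothing and all names are illustrative:

```python
# Sketch of one EM2D round. R holds the soft assignment Pr(alpha, beta | d)
# for every doc; an A-labeled doc's mass is confined to its row, a
# B-labeled doc's to its column, a fully labeled doc's to one cell.
import numpy as np

def em2d_round(X, R, row_of, col_of, nA, nB):
    """X: (n_docs, n_terms) term counts; R: (n_docs, nA, nB) soft cells.
    row_of[d] = alpha if doc d is A-labeled else None; likewise col_of."""
    n_docs, n_terms = X.shape
    # M-step: class-pair priors and smoothed term probabilities
    pi = R.sum(axis=0) / R.sum()                         # (nA, nB)
    counts = np.einsum("dab,dt->abt", R, X)              # (nA, nB, n_terms)
    theta = (1 + counts) / (n_terms + counts.sum(axis=2, keepdims=True))
    # E-step: recompute cell posteriors, masked to the known row/column
    logp = np.log(pi)[None] + np.einsum("dt,abt->dab", X, np.log(theta))
    for d in range(n_docs):
        mask = np.zeros((nA, nB), bool)
        a, b = row_of[d], col_of[d]
        if a is not None and b is not None: mask[a, b] = True  # fully labeled
        elif a is not None:                 mask[a, :] = True  # smear row
        elif b is not None:                 mask[:, b] = True  # smear column
        else:                               mask[:, :] = True  # unlabeled
        p = np.where(mask, logp[d], -np.inf)
        p = np.exp(p - p.max())
        R[d] = p / p.sum()
    return pi, theta, R
```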
Applying EM2D to a test doc
• Mapping a B-labeled test doc d to an A label (e-commerce catalogs)
  • Given β, find argmax_α Pr(α,β | d)
• Classifying a document d with no labels to an A label
  • Aggregation: for each α, compute Σ_β Pr(α,β | d), pick the best α
  • Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α,β* | d)
• EM pitfalls: damping factor, early stopping
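The three readings of the grid posterior, in a short sketch; P is the (|A|, |B|) matrix of Pr(α,β | d) from EM2D, and the function names are illustrative:

```python
# Sketch: reading an A-label off the cell posterior P = Pr(alpha, beta | d).
import numpy as np

def map_with_known_beta(P, beta):             # B-labeled test doc
    return int(np.argmax(P[:, beta]))

def classify_aggregate(P):                    # sum out beta, pick best alpha
    return int(np.argmax(P.sum(axis=1)))

def classify_guess(P, b_scores):              # EM2D-G: guess beta* first
    beta_star = int(np.argmax(b_scores))
    return int(np.argmax(P[:, beta_star]))
```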
Experiments
• Selected 5 Dmoz and Yahoo subtree pairs
• Compare EM2D against
  • Naïve Bayes, best #features and smoothing
  • EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
  • Stratified EM1D
• Mapping a test doc with an A-label to a B-label, or vice versa
• Classifying a zero-labeled test doc
• Accuracy = fraction with correct labels
Accuracy benefits in mapping
[Figure: accuracy bars; improvement over NB is 30% at best, 10% on average]
• EM1D and NB are close, because training set sizes for each taxonomy are not too small
• EM2D > Stratified EM1D > NB
• 2D transfer of model info seems important
Asymmetric setting
• Few (only 300) bookmarked URLs (taxonomy B, target)
• Many Yahoo URLs, larger number of classes (taxonomy A)
• Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle population skew
Zero-labeled test documents
• EM1D improves accuracy only for 12 train docs
• EM2D with guessing improves beyond EM1D
  • In fact, better than aggregating scores to 1d
• Choice of the unlabeled:labeled damping ratio L may be important to get benefits
Robustness to initialization
[Figure: accuracy under naïve Bayes smear vs. uniform smear]
• Seeding choices: hard (best class), NB scores, uniform
• Smear a fraction uniformly, the rest by NB scores
• EM2D is robust to a wide range of smear fractions
• Fully uniform smearing can fail (local optima)
Handling constraints
• Type systems are often hierarchical
• Neighborhood constraints
  • Two nodes match if their children also match
  • Two nodes match if their parents match and some of their descendants also match
  • If all children of node X match node Y, then X also matches Y
  • If a node in the neighborhood of node X matches ASSOCIATE-PROFESSOR, then the chance that X matches PROFESSOR is increased
• Three-stage approach (Doan+ 2002) …
Three-stage reconciliation
• Use EM2D to build a distribution estimator
  • For every type pair A, B, find, over the domain of entities, Pr(A,B), Pr(A,!B), Pr(!A,B), Pr(!A,!B)
• From these, compute a local similarity measure
  • E.g. Jaccard: sim(A,B) = Pr(A,B) / (Pr(A,B) + Pr(A,!B) + Pr(!A,B))
• Perform relaxation labeling to find a good mapping (Chakrabarti+ 1998, Doan+ 2002)
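A minimal sketch of the Jaccard step over a shared pool of entities; estimating the joint probabilities is the hard part (EM2D on the earlier slides), so plain membership sets stand in here:

```python
# Sketch: Jaccard similarity between a type A from one taxonomy and a
# type B from another, estimated over a shared entity pool.
def jaccard(instances_A: set, instances_B: set, universe: set) -> float:
    p_ab     = len(instances_A & instances_B) / len(universe)  # Pr(A, B)
    p_a_notb = len(instances_A - instances_B) / len(universe)  # Pr(A, !B)
    p_nota_b = len(instances_B - instances_A) / len(universe)  # Pr(!A, B)
    return p_ab / (p_ab + p_a_notb + p_nota_b)

faculty = {"K. Burn", "R. Cook", "A. U. Thor"}
assoc_prof = {"K. Burn", "R. Cook"}
print(jaccard(assoc_prof, faculty, universe=faculty | {"XYZ"}))  # ~0.67
```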
Relaxation labeling
• f maps each label X in O1 to a label L in O2
• Markovian assumption:
  Pr(f(X) = L | everything known) ≈ Pr(f(X) = L | features of the immediate neighborhood of X)
• Initialize some reasonable f and reassign via Bayes rule until f stabilizes
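A minimal sketch of the fixed-point loop under an illustrative neighborhood score; the real conditional in Doan+ 2002 is learned, so `local_sim` and the agreement weight are stand-ins:

```python
# Sketch of relaxation labeling: start from the local similarities, then
# repeatedly reassign each O1 node the O2 label its neighborhood supports
# best, until the mapping f stabilizes. All names are illustrative.
def relaxation_label(nodes_O1, labels_O2, nbrs_O1, nbrs_O2, local_sim, w=0.5):
    f = {x: max(labels_O2, key=lambda L: local_sim(x, L)) for x in nodes_O1}
    for _ in range(100):                 # cap iterations; usually stabilizes fast
        changed = False
        for x in nodes_O1:
            def score(L):
                # neighborhood agreement: X mapping to L is more likely
                # if X's neighbors currently map to L's neighbors
                agree = sum(1 for y in nbrs_O1[x] if f[y] in nbrs_O2[L])
                return local_sim(x, L) + w * agree
            best = max(labels_O2, key=score)
            if best != f[x]:
                f[x], changed = best, True
        if not changed:
            break
    return f
```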
Summary
To realize the semantic Web vision we must
• Assemble type schemas from diverse sources
• Mine type instances automatically
• Annotate large corpora efficiently
• Build indices integrating text and annotations
• Support schema-agnostic query languages
  • When did you last type XQuery into Google?
• Design high-performance query execution engines
  • New family of ranking functions