
Breaking through the syntax barrier: Searching with entities and relations





Presentation Transcript


  1. Breaking through the syntax barrier: Searching with entities and relations Soumen Chakrabarti, IIT Bombay, www.cse.iitb.ac.in/~soumen

  2. Wish upon a textbox, 1996 Your information need here Chakrabarti 2004

  3. “A rising tide of data lifts all algorithms” Wish upon a textbox, 1998 Your information need here

  4. Wish upon a textbox, post-IPO • Now indexing 4,285,199,774 pages • Same interface, therefore same 2-word queries • Mind-reading wizard-in-black-box saves the day Your information need (still) here

  5. If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music). Udi Manber, A9.com, plenary speech, WWW 2004

  6. Examples of “great music”… • The commercial angle: they’ll call you • Want to buy X, find reviews and prices • Cheap tickets to Y • Noun-phrase + Pagerank saves the day • Find info about diabetes • Find the homepage of Tom Mitchell • Searching vertical portals: garbage control • Searching Citeseer or IMDB through Google • Someone out there had the same problem • +adaptec +aha2940uw +lilo +bios

  7. … and not-so-great music • Which produces better responses? • Opera fails to connect to secure IMAP tunneled through SSH • opera connect imap ssh tunnel • Unable to express many details of information need • Opera the email client, not a kind of music • The problem is with Opera, not ssh, imap, applet • “Secure” is an attribute of imap, but may not juxtapose

  8. Why telegraphic queries fail • Information need relates to entities and relationships in the real world • But the search engine gets only strings • Risk over-/under-specified queries • Never know true recall • No time to deal with poor precision • Query word distribution dramatically different from corpus distribution • Query is inherently incomplete • Fix some info, look for other

  9. A map of structure in the query (bag-of-words; Boolean/proximity; SQL/XQuery) vs. structure in the corpus (“free-format text”; HTML; XML; entity+relations) • SQL/XQuery on structured data: defined many real problems away; “solved” apart from performance issues; so far largely for power-users, can be generated by programs (1) • Bag-of-words and Boolean/proximity on free text: the domain of IR; time to analyze queries much more deeply (2) • Broad, deep and ad-hoc schema and data integration: very difficult! • No schema: either map to a schema or support queries on uninterpreted graphs (3); defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications

  10. Past the syntax barrier: early steps • Taking the question apart: the question has known parts and unknown “slots”; query-dependent information extraction (1) • Compiling relations from the Web: is-instance-of (is-a), is-subclass-of, is-part-of, has-attribute (2) • Graph models for textual data: searching graphs with keywords and twigs; global probabilistic graph labeling (3)

  11. Part 1: Working harder on the question

  12. Atypes and ground constants • Specialize given domain to a token related to ground constants in the query • What animal is Winnie the Pooh? • instance-of(“animal”) NEAR “Winnie the Pooh” • When was television invented? • instance-of(“time”) NEAR “television” NEAR synonym(“invented”) • FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question) • Ground constants: Winnie the Pooh, television • Atypes: animal, time

  13. Taking the question apart • Atype: the type of the entity that is an answer to the question • Problem: don’t want to compile a classification hierarchy of entities • Laborious, can’t keep up • Offline rather than question-driven • Instead • Set up a very large basis of features • “Project” question and corpus to basis

  14. Scoring tokens for correct Atypes • FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question) • No fixed question or answer type system • Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question) [Diagram: IE-style surface feature extractors, IS-A feature extractors, and other extractors map the question and passage to a question feature vector and a snippet feature vector; a joint distribution over answer tokens is learned.]

  15. Features for Atype matching • Question features: 1, 2, 3-token sequences starting with standard wh-words • where, when, who, how_X, … • Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, … • Passage IS-A features: all generalizations of all noun senses of token • Use WordNet: horse → equid → ungulate → hoofed mammal → placental mammal → animal → … → entity • These are node IDs (“synsets”) in WordNet, not strings
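The IS-A feature idea on this slide can be sketched in a few lines. A hand-coded hypernym chain stands in for WordNet here (a real system would emit synset IDs such as animal#n#1 via a WordNet API, not strings); the chain below is the slide's own horse example.

```python
# Hand-coded hypernym chain standing in for WordNet (illustrative only).
HYPERNYMS = {
    "horse": ["equid", "ungulate", "hoofed_mammal",
              "placental_mammal", "animal", "entity"],
}

def isa_features(token):
    """One indicator feature per generalization of the token."""
    return {"is-a(%s)" % h for h in HYPERNYMS.get(token.lower(), [])}

feats = isa_features("horse")
```

A “what animal …” question can then reward any passage token whose is-a(animal) feature fires, without a hand-built answer-type hierarchy.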

  16. Supervised learning setup • Get top 300 passages from IR engine • “Promising but negative” instances • Crude approximation to active learning • For each token invoke feature extractors • Question vector xq, passage vector xp • How to represent combined vector x? • Label = 1 if token is in answer span, 0 otherwise • Questions and answers from logs

  17. Joint feature-vector design • Obvious “linear” juxtaposition x = (xq, xp) • Does not expose pairwise dependencies • “Quadratic” form x = xq ⊗ xp • All pairwise products of elements • Model has a parameter for every pair • Can discount for redundancy in pair info • If xq (resp. xp) is fixed, what xp (resp. xq) will yield the largest Pr(Y=1|x)? [Figure labels: how_far → region#n#3, when → entity#n#1, what_city, …]
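The contrast between the linear juxtaposition and the quadratic form can be made concrete with a small sketch (toy vectors, not the paper's actual feature spaces):

```python
def linear_join(xq, xp):
    """'Linear' juxtaposition x = (xq, xp): concatenation exposes no
    pairwise (question feature, passage feature) dependencies."""
    return list(xq) + list(xp)

def quadratic_join(xq, xp):
    """'Quadratic' form: all pairwise products, giving the model one
    parameter per (question feature, passage feature) pair."""
    return [q * p for q in xq for p in xp]
```

With indicator features, a quadratic entry is 1 exactly when a particular question feature co-occurs with a particular passage feature, which is what lets the learner couple, say, a “when…” question with hasDigit tokens.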

  18. Classification accuracy • Pairing more accurate than linear model • Are the estimated w parameters meaningful? • Given question, can return most favorable answer feature weights

  19. Parameter anecdotes • Surface and WordNet features complement each other • General concepts get negative params: use in predictive annotation • Learning is symmetric (Q↔A)

  20. Taking the question apart • Atype: the type of the entity that is an answer to the question • Ground constants: Which question words are likely to appear (almost) unchanged in an answer passage? • Arises in Web search sessions too • Opera login fails • problem with login Opera email • Opera login accept password • Opera account authentication • …

  21. Features to identify ground constants • Local and global features • POS of word and of adjacent words (POS@-1, POS@0, POS@1), case info, proximity to wh-word • Suppose word is associated with synset set S • NumSense: size of S (is word very polysemous?) • NumLemma: average #lemmas describing s ∈ S (are there many aliases?) • Model as a sequential learning problem • Each token has local context and global features • Label: does token appear near answer?
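A minimal feature extractor in the spirit of this slide, skipping POS tags (which need a tagger) and using a hand-supplied sense-count map in place of WordNet's NumSense statistic; the helper name and the decision to default unknown words to one sense are my assumptions:

```python
WH_WORDS = {"what", "which", "who", "when", "where", "how"}

def ground_constant_features(tokens, i, num_senses):
    """Local + global features for judging whether tokens[i] is likely
    a ground constant. num_senses: hand-supplied word -> #senses map
    standing in for WordNet's NumSense."""
    w = tokens[i]
    wh_dist = min((abs(i - j) for j, t in enumerate(tokens)
                   if t.lower() in WH_WORDS), default=len(tokens))
    return {
        "hasCap": w[:1].isupper(),                 # case information
        "whDist": wh_dist,                         # proximity to the wh-word
        "numSense": num_senses.get(w.lower(), 1),  # high polysemy => weak anchor
    }

toks = "Which researcher built the WHIRL system ?".split()
feats = ground_constant_features(toks, 4, {"system": 9})  # token "WHIRL"
```

“WHIRL” is capitalized, far from the wh-word, and monosemous, so a learned model would tend to keep it verbatim in the answer query.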

  22. Ground constants: sample results • Global features (IDF, NumSense, NumLemma) essential for accuracy • Best F1 accuracy with local features alone: 71-73% • With local and global features: 81% • Decision trees better than logistic regression • F1 = 81% as against LR F1 = 75% • Intuitive decision branches

  23. Summary of the Atype strategy • “Basis” of atypes A; a ∈ A could be a synset, surface pattern, or feature of a parse tree • Question q “projected” to vector (w_a : a ∈ A) in atype space via a learned conditional model • If q is “when…” or “how long…”, w_hasDigit and w_time_period#n#1 are large, w_region#n#1 is small • Each corpus token t has associated indicator features a(t) for every a • hasDigit(3,000) = 1, is-a(region#n#1)(Japan) = 1

  24. Reward proximity to ground constants • A token t is a candidate answer if its atype indicator features match the projection of the question to “atype space” • H_q(t): reward tokens appearing “near” ground constants matched from the question • Order tokens by decreasing score • Example passage: “…the armadillo, found in Texas, is covered with strong horny plates”
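The candidate-and-proximity scoring can be sketched as below. The exponential decay is my illustrative stand-in for the slide's H_q (its exact form is not given here), and for demonstration the atype matcher is allowed to fire on two tokens:

```python
def rank_candidates(tokens, atype_match, ground_positions, decay=0.5):
    """Score each atype-matching token by proximity to ground-constant
    matches; exponential decay is an assumed, illustrative H_q."""
    scores = {}
    for i, t in enumerate(tokens):
        if atype_match(t):
            scores[t] = sum(decay ** abs(i - g) for g in ground_positions)
    return sorted(scores, key=scores.get, reverse=True)

# The slide's passage; "Texas" (position 4) is the matched ground constant.
passage = "the armadillo found in Texas is covered with strong horny plates".split()
ranked = rank_candidates(passage, lambda t: t in {"armadillo", "plates"}, [4])
```

“armadillo” sits three tokens from “Texas” while “plates” sits six away, so it wins.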

  25. Evaluation: Mean reciprocal rank (MRR) • n_q = smallest rank among answer passages • MRR = (1/|Q|) Σ_{q∈Q} (1/n_q) • Dropping a passage from #1 to #2 is as bad as dropping it from #2 to not reporting it at all • Experiment setup: 300 top IR-score passages • If Pr(Y=1|token) < threshold, reject token • If all tokens rejected, reject passage • Points below diagonal are good
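The MRR formula on this slide is short enough to compute directly (representing “not reported at all” as None is my convention):

```python
def mrr(ranks):
    """Mean reciprocal rank over queries; None means the answer was
    never reported and contributes 0 to the sum."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)
```

Per query, dropping from rank 1 to rank 2 loses 1 - 1/2 = 1/2, exactly the loss of dropping from rank 2 to unreported (1/2 - 0), which is the slide's point.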

  26. Sample results • Accept all tokens ⇒ IR baseline MRR • Moderate acceptance threshold ⇒ non-answer passages eliminated, improves answer ranks • High threshold ⇒ true answers eliminated • Another answer with poor rank, or rank = ∞ • Additional benefits from proximity filtering

  27. Part 2: Compiling fragments of soft schema

  28. Who provides is-a info? • Compiled KBs: WordNet, CYC • Automatic “soft” compilations • Google sets • KnowItAll • BioText • Can use as evidence in scoring answers

  29. Extracting is-instance-of info • Which researcher built the WHIRL system? • WordNet may not know Cohen IS-A researcher • Google has over 4.2 billion pages • “william cohen” on 86100 (p1 = 86.1k/4.2B) • researcher on 4.55M (p2 = 4.55M/4.2B) • +researcher +“william cohen” on 1730: 18.55x more frequent than expected if independent • Pointwise mutual information (PMI) • Can add high-precision, low-recall patterns • “cities such as New York” (26600 hits) • “professor Michael Jordan” (101 hits)
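The slide's arithmetic checks out and is easy to reproduce; the hit counts and the 4.2-billion-page total are the slide's own figures:

```python
import math

N          = 4_200_000_000   # pages indexed (the slide's round figure)
hits_name  = 86_100          # "william cohen"
hits_class = 4_550_000       # researcher
hits_both  = 1_730           # +researcher +"william cohen"

expected = hits_name * hits_class / N   # co-occurrences if independent (~93)
lift     = hits_both / expected         # ~18.55x more than expected
pmi      = math.log(lift)               # pointwise mutual information
```

A lift well above 1 (positive PMI) is the evidence that “william cohen” IS-A researcher, even though WordNet does not know it.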

  30. Bootstrapping lists of instances • Hearst 1992, Brin 1997, Etzioni 2004 • A “propose-validate” approach • Using existing patterns, generate queries • For each web page w returned • Extract potential fact e and assign confidence score • Add fact to database if it has high enough score • Example patterns • NP1 {,} {such as|and other|including} NPList2 • NP1 is a NP2, NP1 is the NP2 of NP3 • the NP1 of NP2 is NP3 • Start with NP1 = researcher etc.
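The “propose” half of propose-validate can be sketched with one of the slide's patterns, NP1 such as NPList2. The regex below is toy NP chunking (single capitalized words only); a real system would use a shallow parser and then validate each proposed fact with a confidence score before keeping it:

```python
import re

# One Hearst-style pattern from the slide: NP1 such as NPList2.
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w+(?:, | and )?)+)")

def propose_facts(text):
    """Propose instanceOf facts; validation/scoring is omitted here."""
    facts = []
    for cls, np_list in PATTERN.findall(text):
        for inst in re.split(r", | and ", np_list):
            facts.append(("instanceOf", cls, inst))
    return facts

facts = propose_facts("cities such as Mumbai, Boston and Paris")
```

Newly extracted instances can then be substituted back into the patterns to generate further queries, which is the bootstrapping loop.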

  31. System details • The importance of shallow linguistics working together with statistical tests • China is a (country)_NP in Asia • Garth Brooks is a (country_ADJ (singer)_N)_NP • Unary relation example: NP1 such as NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) ⇒ instanceOf(Class1, head(each(NPList2))) • “Head” of phrase

  32. Compilation performance • Recall-vs-precision exposes size and difficulty of domain • “US state” is easy • “Country” is difficult • To improve signal-to-noise (STN) ratio, stop when confidence score is lower than threshold • Substantially improves recall-vs-precision

  33. Exploiting is-a info for ranking • Use PMI scores as additional features • Challenge: make frugal use of expensive inverted index probes [Diagram: as before, question and passage feature extractors feed a learned joint distribution over answer tokens, now with WN IS-A feature extractors and PMI scores from search-engine probes as extra inputs.]

  34. The query-structure vs. corpus-structure map again, locating Part 3 • SQL/XQuery: defined many real problems away; “solved” apart from performance issues; largely for power-users, can be generated by programs (1) • Bag-of-words, Boolean/proximity: domain of IR; time to analyze queries much more deeply (2) • Broad, deep and ad-hoc schema and data integration: very difficult! • No schema: either map to a schema or support queries on uninterpreted graphs • 3a: schema-free search on labeled graphs • 3b: labeling graphs • Defining, labeling, extracting and ranking answers are the major issues

  35. Mining graphs: applications abound [Diagram: Email1 (“…your ECML submission titled XYZ has been accepted…”) and Email2 (“Could I get a preprint of your recent ECML paper?”) link via EmailTo, EmailDate and Time edges to a canonical node for A. U. Thor and, via a LastMod edge, to a PDF file “XYZ, A. Thor, Abstract: …” — the file we want to quickly find.] • No clean schema, data changes rapidly • Lots of generic “graph proximity” information

  36. Part 3a: Searching graphs

  37. Some useful graph query styles • Find single node with specified type strongly activated by given nodes • paper NEAR { author=“A. U. Thor” conference=“ECML” time=“now” } • May not know exact types of activators • Find connected subgraph that best explains how/why two or more nodes are related • connect {“Faloutsos” “Roussopoulos”} • Reward short paths, parallel paths, few distractions • Query is a small graph with match clauses

  38. Measuring activation of single nodes • movie (near “richard gere” “julia roberts”) • http://www.cse.iitb.ac.in/banks/
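One standard way to score “activation” of a node by a set of query nodes is a random walk with restart; this is a generic proximity proxy, not BANKS's exact algorithm, and the toy graph and node names below are invented for illustration:

```python
def random_walk_with_restart(adj, sources, alpha=0.15, iters=50):
    """Power-iterate a walk that restarts at the query's activator
    nodes with probability alpha; the stationary score measures graph
    proximity to the sources."""
    nodes = list(adj)
    restart = {v: (1.0 / len(sources) if v in sources else 0.0) for v in nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {v: alpha * restart[v] for v in nodes}
        for u in nodes:
            out = adj[u] or nodes          # dangling node: jump anywhere (toy)
            share = (1 - alpha) * p[u] / len(out)
            for v in out:
                nxt[v] += share
        p = nxt
    return p

# Toy graph: two actors both link to one movie, one actor to another.
graph = {
    "richard_gere":  ["runaway_bride", "chicago"],
    "julia_roberts": ["runaway_bride"],
    "runaway_bride": [],
    "chicago":       [],
}
scores = random_walk_with_restart(graph, {"richard_gere", "julia_roberts"})
```

The movie reachable from both activators accumulates more probability mass than the one reachable from only one, so it is ranked first among the movie nodes.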

  39. Reporting connecting subgraphs • http://www.cse.iitb.ac.in/banks/

  40. Schema-free twig queries • Query is a small graph, very light on types • Node type (dblpauthor, vldbjournal) • Simple predicates to qualify nodes (string, number) • Match to data graph • Precisely defined matches for local predicates • Reward for proximity

  41. Sample twig result • VLDB journal article, 2001 • Gerhard Weikum’s home page • Max-Planck Institute home page

  42. Main issues • What style of queries to support? • No end-user will type XQuery directly • But bag-of-words is too weak • How to rank result nodes and subgraphs? • Reward for text match, graph proximity, … • “Iceberg queries”: report top few after a lot of computation? • Can ranking be learnt as with question answering? • How to reuse and retarget supervision?

  43. Part 3b: Probabilistic graph labeling

  44. Identifying node types • How do we know a node has a given type? • Find person near “Pagerank” • Find student near (homepage of Tom Mitchell) • Find {verb, ####, “television”} close together • Nodes in our graph have • Visible attributes: “hobbies”, “my advisor is”, “start-of-sentence” • Invisible labels: “student”, “verb” • Neighbors: “HREF”, “linear juxtaposition” • Goal: given some node labels, guess all others

  45. Example: Hypertext classification • Want to assign labels to Web pages • Text on a single page may be too little or misleading • Page is not an isolated instance by itself • Problem setup • Web graph G=(V,E) • Node u ∈ V is a page having text u_T • Edges in E are hyperlinks • Some nodes are labeled • Make collective guess at missing labels • Probabilistic model? Benefits?

  46. Graph labeling model • Seek a labeling f of all unlabeled nodes so as to maximize the labeling’s probability (objective given as a formula in the slide figure; for the moment we don’t need to worry about its exact form) • Let V_K be the nodes with known labels and f(V_K) their known label assignments • Let N(v) be the neighbors of v and N_K(v) ⊆ N(v) be neighbors with known labels • Markov assumption: f(v) is conditionally independent of the rest of G given f(N(v))

  47. Markov graph labeling • [Formula in slide figure: the probability of labeling a specific node v, given edges, text and the partial labeling, sums over all possible labelings of the unknown neighbors of v] • Markov assumption: the label of v does not depend on unknown labels outside the set N_U(v) • Circularity between f(v) and f(N_U(v)) • Some form of iterative Gibbs sampling or MCMC

  48. Iterative labeling in pictures • c = class, t = text, N = neighbors • Text-only model: Pr[c|t] • Using neighbors’ labels to update my estimates: Pr[c|t, c(N)] • Repeat until histograms stabilize • Local optima possible?

  49. Iterative labeling: formula • [Formula in slide figure: label estimates of v in the next iteration take the expectation over N_U(v) labelings, with the joint distribution over neighbors approximated as a product of marginals] • Sum over all possible N_U(v) labelings still too expensive to compute • In practice, prune to most likely configurations • By the Markov assumption, finally we need a distribution coupling f(v), v_T (the text on v) and f(N(v))
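A toy version of the iterative scheme makes the circularity concrete. This sketch is not the paper's exact model: it blends each unlabeled node's text-only distribution Pr[c|t] with an average of its neighbors' current label histograms (a cruder simplification than the product-of-marginals approximation above), and the blend weight w is an invented knob:

```python
def iterative_label(adj, text_prior, known, iters=20, w=0.5):
    """Toy relaxation labeling for Pr[c|t, c(N)]: repeatedly mix each
    unlabeled node's text prior with its neighbors' current label
    distributions; known labels stay clamped."""
    classes = sorted({c for d in text_prior.values() for c in d})
    cur = {v: dict(known[v]) if v in known else dict(text_prior[v]) for v in adj}
    for _ in range(iters):
        nxt = {}
        for v in adj:
            if v in known:
                nxt[v] = cur[v]
                continue
            nbr = {c: sum(cur[u].get(c, 0.0) for u in adj[v]) / max(len(adj[v]), 1)
                   for c in classes}
            mix = {c: (1 - w) * text_prior[v].get(c, 0.0) + w * nbr[c]
                   for c in classes}
            z = sum(mix.values()) or 1.0   # renormalize to a distribution
            nxt[v] = {c: s / z for c, s in mix.items()}
        cur = nxt
    return cur

# Chain a-b-c: a is labeled "ml"; b and c have uninformative text.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
prior = {"a": {"ml": 1.0, "db": 0.0},
         "b": {"ml": 0.5, "db": 0.5},
         "c": {"ml": 0.5, "db": 0.5}}
labels = iterative_label(adj, prior, known={"a": prior["a"]})
```

The known label propagates along the chain: after a few iterations both b and c lean toward “ml” even though their text alone was uninformative, illustrating why collective labeling beats the text-only model.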

  50. Labeling results • 9600 patents from 12 classes marked by USPTO • Patents have text and cite other patents • Expand test patent to include neighborhood • ‘Forget’ fraction of neighbors’ classes • Good for other data sets as well
