
Breaking through the syntax barrier: Searching with entities and relations


Presentation Transcript


  1. Breaking through the syntax barrier: Searching with entities and relations
  Soumen Chakrabarti
  IIT Bombay
  www.cse.iitb.ac.in/~soumen

  2. Wish upon a textbox, 1996
  • Your information need here

  3. Wish upon a textbox, 1998
  • “A rising tide of data lifts all algorithms”
  • Your information need here

  4. Wish upon a textbox, post-IPO
  • Now indexing 4,285,199,774 pages
  • Same interface, therefore same 2-word queries
  • Mind-reading wizard-in-black-box saves the day
  • Your information need (still) here

  5. If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music).
  Udi Manber, A9.com, plenary speech, WWW 2004

  6. Examples of “great music”…
  • The commercial angle: they’ll call you
    • Want to buy X, find reviews and prices
    • Cheap tickets to Y
  • Noun-phrase + Pagerank saves the day
    • Find info about diabetes
    • Find the homepage of Tom Mitchell
  • Searching vertical portals: garbage control
    • Searching Citeseer or IMDB through Google
  • Someone out there had the same problem
    • +adaptec +aha2940uw +lilo +bios

  7. … and not-so-great music
  • Which produces better responses?
    • Opera fails to connect to secure IMAP tunneled through SSH
    • opera connect imap ssh tunnel
  • Unable to express many details of information need
    • Opera the email client, not a kind of music
    • The problem is with Opera, not ssh, imap, applet
    • “Secure” is an attribute of imap, but may not juxtapose

  8. Why telegraphic queries fail
  • Information need relates to entities and relationships in the real world
  • But the search engine gets only strings
  • Risk of over- or under-specified queries
    • Never know true recall
    • No time to deal with poor precision
  • Query word distribution dramatically different from corpus distribution
  • Query is inherently incomplete
    • Fix some known info, look for unknown info

  9. Past the syntax barrier: early steps
  • Taking the question apart
    • Question has known parts and unknown “slots”
    • Query-dependent information extraction (IE)
  • Compiling relations from the Web
    • is-instance-of (is-a), is-subclass-of
    • is-part-of, has-attribute
  • Exploit patterns that happen “naturally”…
  • …rather than architect knowledge bases and precise inference mechanisms
  • Can only go so far, but Web search standards are low to begin with…

  10. Part 1: Working harder on the question

  11. Atypes and ground constants
  • Specialize given domain to a token related to ground constants in the query
  • What animal is Winnie the Pooh?
    • instance-of(“animal”) NEAR “Winnie the Pooh”
  • When was television invented?
    • instance-of(“time”) NEAR “television” NEAR synonym(“invented”)
  • FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question) (a small sketch of this decomposition follows below)
  • Ground constants: Winnie the Pooh, television
  • Atypes: animal, time
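
A minimal sketch of the decomposition as a data structure; the class and field names here are illustrative, not from the talk.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DecomposedQuestion:
        """Known parts of a question; the unknown slot x is what we search for."""
        atype: str                   # expected answer type, e.g. "animal" or "time"
        ground_constants: List[str]  # question words expected to survive near the answer

    # The slide's two examples, decomposed by hand:
    q1 = DecomposedQuestion(atype="animal", ground_constants=["Winnie the Pooh"])
    q2 = DecomposedQuestion(atype="time", ground_constants=["television", "invented"])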

  12. Taking the question apart
  • Atype: the type of the entity that is an answer to the question
  • Problem: don’t want to compile a classification hierarchy of entities
    • Laborious, can’t keep up
    • Offline rather than question-driven
  • Instead
    • Set up a very large basis of features
    • “Project” question and corpus to basis

  13. Scoring tokens for correct Atypes
  • FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question)
  • No fixed question or answer type system
  • Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question)
  [Architecture diagram: the question feeds IE-style surface feature extractors producing a question feature vector; answer tokens in the passage feed IE-style surface feature extractors, IS-A feature extractors, and other extractors producing a snippet feature vector; a joint distribution is learned over the two.]

  14. Features for Atype matching
  • Question features: 1-, 2-, 3-token sequences starting with standard wh-words
    • where, when, who, how_X, …
  • Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, …
  • Passage IS-A features: all generalizations of all noun senses of token (sketched below)
    • Use WordNet: horse → equid → ungulate (hoofed mammal) → placental mammal → animal → … → entity
    • These are node IDs (“synsets”) in WordNet, not strings
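
A minimal sketch of the two passage feature families. NLTK's WordNet interface is an assumption (the slide names WordNet but no library), and the hasXx and isAbbrev tests are guesses at the slide's exact definitions.

    from nltk.corpus import wordnet as wn

    def surface_features(token):
        """A few of the slide's surface shape features."""
        return {
            "hasCap": token[:1].isupper(),
            "hasXx": any(c.isupper() for c in token[1:]),  # guess: internal capital
            "hasDigit": any(c.isdigit() for c in token),
            "isAllDigit": token.isdigit(),
            "isAbbrev": token.isupper() and 1 < len(token) <= 5,  # guess
        }

    def isa_features(token):
        """All generalizations of all noun senses of the token, returned as
        WordNet synset IDs rather than strings."""
        feats = set()
        for synset in wn.synsets(token, pos=wn.NOUN):
            feats.add(synset.name())
            # closure() walks the hypernym chain up to 'entity.n.01'
            for hyper in synset.closure(lambda s: s.hypernyms()):
                feats.add(hyper.name())
        return feats

    # isa_features("horse") includes 'equine.n.01', 'ungulate.n.01',
    # 'placental.n.01', 'animal.n.01', ..., 'entity.n.01'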

  15. Supervised learning setup
  • Get top 300 passages from IR engine
    • “Promising but negative” instances
    • Crude approximation to active learning
  • For each token invoke feature extractors (see the sketch below)
    • Question vector xq, passage vector xp
    • How to represent combined vector x?
  • Label = 1 if token is in answer span, 0 otherwise
  • Questions and answers from logs
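
A minimal sketch of assembling labeled training instances; the stub extractors and data shapes are assumptions standing in for the system's real ones.

    def question_features(question):
        # Stub: wh-word-anchored 1-, 2-, 3-token prefixes, as on slide 14.
        toks = question.lower().split()
        return {" ".join(toks[:n]) for n in (1, 2, 3)}

    def passage_features(token):
        # Stub: two of the surface shape features from slide 14.
        return {"hasCap": token[:1].isupper(),
                "hasDigit": any(c.isdigit() for c in token)}

    def make_instances(question, passages, answer_spans):
        """One instance per token of each retrieved passage; label = 1 iff
        position (passage id, token offset) lies in a gold answer span."""
        xq = question_features(question)
        out = []
        for pid, tokens in enumerate(passages):   # e.g. top 300 from the IR engine
            for pos, tok in enumerate(tokens):
                label = 1 if (pid, pos) in answer_spans else 0
                out.append((xq, passage_features(tok), label))
        return out

    # Usage: make_instances("what animal is winnie the pooh",
    #                       [["The", "bear", "Winnie", "..."]], {(0, 1)})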

  16. Joint feature-vector design
  [Figure: example question features (how_far, when, what_city) paired with atype synsets such as region#n#3 and entity#n#1]
  • Obvious “linear” juxtaposition x = (xq, xp)
    • Does not expose pairwise dependencies
  • “Quadratic” form x = xq ⊗ xp (sketched below)
    • All pairwise products of elements
    • Model has a parameter for every pair
    • Can discount for redundancy in pair info
  • If xq (xp) is fixed, what xp (xq) will yield the largest Pr(Y=1|x)?
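
A minimal NumPy sketch of the two joint-vector designs; the library choice and the tiny dimensions are illustrative only.

    import numpy as np

    xq = np.array([1.0, 0.0, 1.0])        # question feature vector
    xp = np.array([0.0, 1.0])             # passage-token feature vector

    x_linear = np.concatenate([xq, xp])   # "linear" juxtaposition x = (xq, xp)
    x_quad = np.outer(xq, xp).ravel()     # "quadratic" form: all pairwise products

    # With a logistic model Pr(Y=1|x) = sigmoid(w . x) over the quadratic
    # form, w has one entry per (question feature, passage feature) pair.
    # Reshaping w into a matrix W makes the slide's question concrete:
    # for fixed xq, the xp maximizing Pr(Y=1|x) maximizes xq @ W @ xp.
    W = np.zeros((xq.size, xp.size))      # learned pair weights (placeholder)
    score = xq @ W @ xp                   # identical to w . x_quad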

  17. Classification accuracy
  • Pairing more accurate than linear model
  • Are the estimated w parameters meaningful?
  • Given a question, can return the most favorable answer feature weights

  18. Parameter anecdotes
  • Surface and WordNet features complement each other
  • General concepts get negative params: use in predictive annotation
  • Learning is symmetric (Q ↔ A)

  19. Taking the question apart
  • Atype: the type of the entity that is an answer to the question
  • Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
  • Arises in Web search sessions too
    • Opera login fails
    • problem with login Opera email
    • Opera login accept password
    • Opera account authentication
    • …

  20. Features to identify ground constants
  • Local and global features
    • POS of word, POS of adjacent words (POS@-1, POS@0, POS@1), case info, proximity to wh-word
  • Suppose the word is associated with synset set S (sketched below)
    • NumSense: size of S (is the word very polysemous?)
    • NumLemma: average number of lemmas describing each s ∈ S (are there many aliases?)
  • Model as a sequential learning problem
    • Each token has local context and global features
    • Label: does token appear near answer?
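
A minimal sketch of the two global WordNet features; NLTK is again an assumed interface.

    from nltk.corpus import wordnet as wn

    def num_sense(word):
        """NumSense: number of synsets containing the word (polysemy)."""
        return len(wn.synsets(word))

    def num_lemma(word):
        """NumLemma: average number of lemmas per synset of the word
        (roughly, how many aliases each sense has)."""
        synsets = wn.synsets(word)
        if not synsets:
            return 0.0
        return sum(len(s.lemmas()) for s in synsets) / len(synsets)

    # Intuition from the slide: a very polysemous, heavily aliased word is a
    # poor ground constant; it is unlikely to appear unchanged in the answer.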

  21. Ground constants: sample results
  • Global features (IDF, NumSense, NumLemma) essential for accuracy
    • Best F1 accuracy with local features alone: 71-73%
    • With local and global features: 81%
  • Decision trees better than logistic regression
    • F1 = 81% as against LR F1 = 75%
    • Intuitive decision branches

  22. Summary of the Atype strategy
  • “Basis” of atypes A; each a ∈ A could be a synset, a surface pattern, or a feature of a parse tree
  • Question q “projected” to vector (w_a : a ∈ A) in atype space via a learned conditional model
    • If q is “when…” or “how long…”, w_hasDigit and w_time_period#n#1 are large, w_region#n#1 is small
  • Each corpus token t has associated indicator features a(t) for every a
    • hasDigit(3,000) = is-a(region#n#1)(Japan) = 1

  23. Reward proximity to ground constants
  • A token t is a candidate answer if its atype indicator features match the projection of the question to “atype space”
  • Hq(t): reward tokens appearing “near” ground constants matched from the question
  • Order tokens by decreasing combined score, e.g. Hq(t) · Σ_{a ∈ A} w_a · a(t) (sketched below)
  [Example passage: “…the armadillo, found in Texas, is covered with strong horny plates”]
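
A minimal sketch of the reconstructed ordering. The multiplicative combination and the inverse-distance form of Hq(t) are assumptions; the slide only says to reward tokens appearing near matched ground constants.

    def proximity_reward(pos, ground_positions, decay=0.5):
        """H_q(t): larger when token offset `pos` is near a matched ground
        constant; a simple inverse-distance stand-in."""
        if not ground_positions:
            return 0.0
        d = min(abs(pos - g) for g in ground_positions)
        return 1.0 / (1.0 + decay * d)

    def token_score(token_features, atype_weights, pos, ground_positions):
        """Order tokens by decreasing H_q(t) * sum over a of w_a * a(t)."""
        match = sum(atype_weights.get(a, 0.0) for a in token_features)
        return proximity_reward(pos, ground_positions) * match

    # "...the armadillo, found in Texas, ..." with 'Texas' matched at offset 5:
    w = {"is-a(animal#n#1)": 2.0, "hasDigit": -0.5}
    feats = {"is-a(animal#n#1)", "hasCap"}        # features of token 'armadillo'
    print(token_score(feats, w, pos=2, ground_positions=[5]))  # 2.0 * 0.4 = 0.8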

  24. Evaluation: mean reciprocal rank (MRR)
  • n_q = smallest rank among answer passages
  • MRR = (1/|Q|) Σ_{q ∈ Q} (1/n_q) (computed in the sketch below)
  • Dropping a passage from rank 1 to rank 2 is as bad as dropping it from rank 2 to not reporting it at all
  • Experiment setup:
    • 300 top IR-score passages
    • If Pr(Y=1|token) < threshold, reject the token
    • If all tokens are rejected, reject the passage
    • Points below the diagonal are good
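
A minimal sketch of the MRR computation; treating a question with no reported answer as rank infinity (contribution 0) follows the slide's equivalence.

    def mean_reciprocal_rank(ranks):
        """`ranks`: for each question q, the smallest rank n_q of a correct
        answer passage, or None if none was returned (rank = infinity)."""
        total = sum(1.0 / r for r in ranks if r is not None)
        return total / len(ranks)

    # Rank 1 -> 2 costs 1 - 1/2 = 0.5 per question; rank 2 -> missing costs
    # 1/2 - 0 = 0.5, the equivalence noted above.
    print(mean_reciprocal_rank([1, 2, None]))  # (1 + 0.5 + 0) / 3 = 0.5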

  25. Sample results
  • Accept all tokens → IR baseline MRR
  • Moderate acceptance threshold → non-answer passages eliminated, improves answer ranks
  • High threshold → true answers eliminated
    • Another answer with poor rank, or rank = ∞
  • Additional benefits from proximity filtering

  26. Part 2: Compiling fragments of soft schema

  27. Who provides is-a info?
  • Compiled KBs: WordNet, CYC
  • Automatic “soft” compilations
    • Google sets
    • KnowItAll
    • BioText
  • Can use as evidence in scoring answers

  28. Extracting is-instance-of info
  • Which researcher built the WHIRL system?
    • WordNet may not know Cohen IS-A researcher
  • Google has over 4.2 billion pages
    • “william cohen” on 86,100 pages (p1 = 86.1K/4.2B)
    • researcher on 4.55M pages (p2 = 4.55M/4.2B)
    • +researcher +“william cohen” on 1730 pages: 18.55x more frequent than expected if independent
  • Pointwise mutual information (PMI); the arithmetic is worked out below
  • Can add high-precision, low-recall patterns
    • “cities such as New York” (26,600 hits)
    • “professor Michael Jordan” (101 hits)
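
The PMI arithmetic on the slide's hit counts, as a worked sketch (counts as on the slide; base-e log is my choice, the slide only gives the lift).

    import math

    N = 4.2e9          # pages indexed
    n1 = 86_100        # hits for "william cohen"
    n2 = 4_550_000     # hits for researcher
    n12 = 1_730        # hits for the conjunction of both

    lift = (n12 / N) / ((n1 / N) * (n2 / N))  # observed / expected-if-independent
    pmi = math.log(lift)                      # pointwise mutual information

    print(round(lift, 2))  # 18.55, matching the slide
    print(round(pmi, 2))   # 2.92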

  29. Bootstrapping lists of instances
  • Hearst 1992, Brin 1997, Etzioni 2004
  • A “propose-validate” approach
    • Using existing patterns, generate queries
    • For each web page w returned
      • Extract potential fact e and assign a confidence score
      • Add fact to database if it has a high enough score
  • Example patterns (a toy extractor follows below)
    • NP1 {,} {such as | and other | including} NPList2
    • NP1 is a NP2; NP1 is the NP2 of NP3
    • the NP1 of NP2 is NP3
  • Start with NP1 = researcher etc.
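
A toy regex sketch of the first pattern over plain strings; real systems match noun-phrase chunks, so the capitalized-word groups below are a crude stand-in, and the validation (confidence-scoring) step is omitted.

    import re

    # "NP1 {,} {such as|and other|including} NPList2", with NPList2
    # approximated as a comma-separated list of capitalized words.
    PATTERN = re.compile(
        r"(\w[\w ]*?),? (?:such as|and other|including) "
        r"((?:[A-Z]\w+(?: [A-Z]\w+)*)(?:, (?:[A-Z]\w+(?: [A-Z]\w+)*))*)"
    )

    def propose_facts(text):
        """Yield (class, instance) candidates for later validation."""
        for m in PATTERN.finditer(text):
            cls = m.group(1).split()[-1]          # crude head of NP1
            for inst in m.group(2).split(", "):
                yield (cls, inst)

    print(list(propose_facts("cities such as New York, Boston")))
    # [('cities', 'New York'), ('cities', 'Boston')]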

  30. System details
  • The importance of shallow linguistics working together with statistical tests
    • China is a (country)NP in Asia
    • Garth Brooks is a ((country)ADJ (singer)N)NP
  • Unary relation example
    • NP1 such as NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) ⇒ instanceOf(Class1, head(each(NPList2)))
    • where head() denotes the head of a phrase

  31. Compilation performance
  • Recall-vs-precision exposes size and difficulty of domain
    • “US state” is easy
    • “Country” is difficult
  • To improve the signal-to-noise (STN) ratio, stop when the confidence score falls below a threshold
    • Substantially improves recall-vs-precision

  32. Exploiting is-a info for ranking
  [Architecture diagram: as on slide 13, question and passage feed IE-style surface feature extractors and WordNet IS-A feature extractors into question and snippet feature vectors over answer tokens; PMI scores from search-engine probes enter the learned joint distribution as an extra input.]
  • Use PMI scores as additional features
  • Challenge: make frugal use of expensive inverted index probes

  33. Concluding messages
  • Work much harder on questions
    • Break down into what’s known, what’s not
    • Find fragments of structure when possible
    • Exploit user profiles and sessions
  • Perform limited pre-structuring of the corpus
    • Difficult to anticipate all needs and applications
    • Extract graph structure where possible (e.g. is-a)
    • Do not insist on a specific schema
  • Design indices and ranking strategies for matching strings and semantic annotations
    • “Tip of the iceberg” under very complex ranking functions
