

Knowledge Harvesting from Text and Web Sources. Part 2: Search and Ranking of Knowledge. Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/





Presentation Transcript


  1. Knowledge Harvesting from Text and Web Sources Part 2: Search and Ranking of Knowledge Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/

  2. The Web Speaks to Us Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009 • Web 2012 contains more DB-style data than ever • getting better at making structured content explicit: entities, classes (types), relationships • but no (hope for) schema here!

  3. Structure Now! IE-enriched Web pages with embedded entities and facts (e.g. Paris, Nicolas Sarkozy, Bob Dylan, Joan Baez, France, Grammy, Champs Elysees, Carla Bruni); knowledge bases with facts from Web and IE witnesses: Actor / Award: Christoph Waltz Oscar, Sandra Bullock Oscar, Sandra Bullock Golden Raspberry … Company / CEO: Google Eric Schmidt, Yahoo Overture, Facebook FriendFeed, Software AG IDS Scheer … Movie / Reported Revenue: Avatar $ 2,718,444,933, The Reader $ 108,709,522 …

  4. Distributed Structure: Linking Open Data 30 billion triples, 500 million links http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

  5. Distributed Structure: Linking Open Data yago/wordnet: Artist109812338 rdfs:subclassOf rdfs:subclassOf yago/wordnet:Actor109765278 rdf:type yago/wikicategory:ItalianComposer rdf:type imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone prop:actedIn prop: composedMusicFor imdb.com/title/tt0361748/ dbpprop:citizenOf dbpedia.org/resource/Rome owl:sameAs owl:sameAs rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301 owl:sameAs geonames.org/3169070/roma Coord N 41° 54' 10'' E 12° 29' 2''

  6. Entity Search http://entitycube.research.microsoft.com/

  7. Semantic Search with Entities, Classes, Relationships properties of entity: US president when Barack Obama was born? Nobel laureate who outlived two world wars and all his children? instances of classes: Politicians who are also scientists? European composers who won the Oscar? Chinese female astronauts? relationships: FIFA 2010 finalists who played in a Champions League final? German football clubs that won against Real Madrid? multiple entities: Commonalities & relationships among Li Lianjie, Steve Jobs, Carla Bruni, Plato, Andres Iniesta? applications: Enzymes that inhibit HIV? Antidepressants that interfere with blood-pressure drugs? German philosophers influenced by William of Ockham?

  8. Quiz Time Rank the following numbers: 25×10^9 # RDF triples in LOD; 600×10^6 # Wikipedia links; 12×10^6 # followers of Lady Gaga; 10^15 # ants on the Earth; 80×10^9 # Yuan by Mark Zuckerberg; 3×10^9 # Google queries per day

  9. Outline ✓ Motivation • Searching for Entities & Relations • Informative Ranking • Efficient Query Processing • User Interface • Wrap-up ...

  10. RDF: Structure, Diversity, No Schema SPO triples (statements, facts): (EnnioMorricone, bornIn, Rome) (Rome, locatedIn, Italy) (JavierNavarrete, birthPlace, Teruel) (Teruel, locatedIn, Spain) (EnnioMorricone, composed, l'Arena) (JavierNavarrete, composerOf, aTale) (uri1, hasName, EnnioMorricone) (uri1, bornIn, uri2) (uri2, hasName, Rome) (uri2, locatedIn, uri3) … • SPO triples: Subject – Property/Predicate – Object/Value • pay-as-you-go: schema-agnostic or schema later • RDF triples form a fine-grained ER graph • popular for Linked Data, comp. biology (UniProt, KEGG, etc.) • open-source engines: Jena, Sesame, RDF-3X, etc.

  11. Facts about Facts facts: 1: (EnnioMorricone, composed, l'Arena) 2: (JavierNavarrete, composerOf, aTale) 3: (Berlin, capitalOf, Germany) 4: (Madonna, marriedTo, GuyRitchie) 5: (NicolasSarkozy, marriedTo, CarlaBruni) temporal facts: 6: (1, inYear, 1968) 7: (2, inYear, 2006) 8: (3, validFrom, 1990) 9: (4, validFrom, 22-Dec-2000) 10: (4, validUntil, Nov-2008) 11: (5, validFrom, 2-Feb-2008) provenance: 12: (1, witness, http://www.last.fm/music/Ennio+Morricone/) 13: (1, confidence, 0.9) 14: (4, witness, http://en.wikipedia.org/wiki/Guy_Ritchie) 15: (4, witness, http://en.wikipedia.org/wiki/Madonna_(entertainer)) 16: (10, witness, http://www.intouchweekly.com/2007/12/post_1.php) 17: (10, confidence, 0.1) • temporal annotations, witnesses/sources, confidence, etc. • can refer to reified facts via fact identifiers • (approx. equiv. to RDF quadruples: Col × Sub × Prop × Obj)

  12. Distributed Structure: Linking Open Data yago/wordnet: Artist109812338 rdfs:subclassOf rdfs:subclassOf yago/wordnet:Actor109765278 rdf:type yago/wikicategory:ItalianComposer rdf:type imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone prop:actedIn prop: composedMusicFor imdb.com/title/tt0361748/ dbpprop:citizenOf dbpedia.org/resource/Rome owl:sameAs owl:sameAs rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301 owl:sameAs geonames.org/3169070/roma Coord N 41° 54' 10'' E 12° 29' 2''

  13. SPARQL Query Language SPJ combinations of triple patterns (triples with S, P, O replaced by variables) Select ?p, ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a . ?a Name AcademyAward . } Semantics: return all bindings to variables that match all triple patterns (subgraphs in the RDF graph that are isomorphic to the query graph) + filter predicates, duplicate handling, RDFS types, etc. Select Distinct ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a . ?a Name ?n . ?p bornOn ?b . Filter (?b > 1945) . Filter (regex(?n, "Academy")) . }
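
Such triple-pattern queries can be tried out directly on a small RDF graph. A minimal sketch with the Python rdflib library; the example namespace and toy triples are assumptions made for illustration, not part of the slide:

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")   # hypothetical namespace for the toy data
    g = Graph()

    # a few toy triples mirroring the slide's Composer example
    g.add((EX.EnnioMorricone, EX.instanceOf, EX.Composer))
    g.add((EX.EnnioMorricone, EX.bornIn, EX.Rome))
    g.add((EX.Rome, EX.inCountry, EX.Italy))
    g.add((EX.Italy, EX.locatedIn, EX.Europe))

    # SPJ combination of triple patterns, as in the slide
    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?p ?c WHERE {
      ?p ex:instanceOf ex:Composer .
      ?p ex:bornIn ?t .
      ?t ex:inCountry ?c .
      ?c ex:locatedIn ex:Europe .
    }
    """
    for row in g.query(query):
        print(row.p, row.c)   # all variable bindings that match every pattern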

  14. Querying the Structured Web Structure but no schema: SPARQL well suited for flexible subgraph matching. Wildcards for properties (relaxed joins): Select ?p, ?c Where { ?p instanceOf Composer . ?p ?r1 ?t . ?t ?r2 ?c . ?c isa Country . ?c locatedIn Europe . } Extension: transitive paths [K. Anyanwu et al.: WWW'07] Select ?p, ?c Where { ?p instanceOf Composer . ?p ??r ?c . ?c isa Country . ?c locatedIn Europe . PathFilter(cost(??r) < 5) . PathFilter(containsAny(??r, ?t)) . ?t isa City . } Extension: regular expressions [G. Kasneci et al.: ICDE'08] Select ?p, ?c Where { ?p instanceOf Composer . ?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }

  15. Querying Facts & Text Problem: not everything is triplified • Consider witnesses/sources (provenance meta-facts) • Allow text predicates with each triple pattern (à la XQ-FT) Semantics: triples match structural predicates, witnesses match text predicates. European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces? Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a . ?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }

  16. Querying Facts & Text Problem: not everything is triplified • Consider witnesses/sources (provenance meta-facts) • Allow text predicates with each triple pattern (à la XQ-FT) Grouping of keywords or phrases boosts expressiveness. French politicians married to Italian singers? Select ?p1, ?p2 Where { ?p1 instanceOf politician [France] . ?p2 instanceOf singer [Italy] . ?p1 marriedTo ?p2 . } Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . } CS researchers whose advisors worked on the Manhattan project? Select ?r, ?a Where { ?r instOf researcher [computer science] . ?a workedOn ?x [Manhattan project] . ?r hasAdvisor ?a . } Select ?r, ?a Where { ?r ?p1 ?o1 [computer science] . ?a ?p2 ?o2 [Manhattan project] . ?r ?p3 ?a . }

  17. Relatedness Queries Schema-agnostic keyword search (on RDF, ER graph, relational DB) becomes a special case. Relationship between Jet Li, Steve Jobs, Carla Bruni? Select ??p1, ??p2, ??p3 Where { LiLianjie ??p1 SteveJobs . SteveJobs ??p2 CarlaBruni . CarlaBruni ??p3 LiLianjie . } Select ??p1, ??p2, ??p3 Where { ?e1 ?r1 ?c1 ["Li Lianjie"] . ?e2 ?r2 ?c2 ["Steve Jobs"] . ?e3 ?r3 ?c3 ["Carla Bruni"] . ?e1 ??p1 ?e2 . ?e2 ??p2 ?e3 . ?e3 ??p3 ?e1 . }

  18. Querying Temporal Facts [Y. Wang et al.: EDBT'10] [O. Udrea et al.: TOCL'10] Problem: not all facts hold forever (e.g. CEOs, spouses, …) • Consider temporal scopes of reified facts • Extend SPARQL with temporal predicates 1: (BayernMunich, hasWon, ChampionsLeague) 2: (BorussiaDortmund, hasWon, ChampionsLeague) 3: (1, validOn, 23May2001) 4: (1, validOn, 15May1974) 5: (2, validOn, 28May1997) 6: (OttmarHitzfeld, manages, BayernMunich) 7: (6, validSince, 1Jul1998) 8: (6, validUntil, 30Jun2004) When did a German soccer club win the Champions League? Select ?c, ?t Where { ?c isa soccerClub . ?c inCountry Germany . ?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . } Managers of German clubs who won the Champions League? Select ?m Where { ?c isa soccerClub . ?c inCountry Germany . ?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . ?id2: ?m manages ?c . ?id2 validSince ?s . ?id2 validUntil ?u . [?s,?u] overlaps [?t,?t] . }

  19. Querying with Vague Temporal Scope Problem: user's temporal interest is often imprecise • Consider temporal phrases as text conditions • Allow approximate matching • and rank results wisely German Champions League winners in the nineties? Select ?c Where { ?c isa soccerClub . ?c inCountry Germany . ?c hasWon ChampionsLeague [nineties] . } Soccer final winners in summer 2001? Select ?c Where { ?c isa soccerClub . ?id: ?c matchAgainst ?o [final] . ?id winner ?c ["summer 2001"] . }

  20. Keyword Search on Graphs [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS, NAGA, …] Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges. Example: Conferences (CId, Title, Location, Year) Journals (JId, Title) CPublications (PId, Title, CId) JPublications (PId, Title, Vol, No, Year) Authors (PId, Person) Editors (CId, Person) Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95 Result is a connected tree with nodes that contain as many query keywords as possible. Ranking: with nodeScore based on tf*idf or prob. IR and edgeScore reflecting importance of relationships (or confidence, authority, etc.) Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

  21. Best Result: Group Steiner Tree Result is a connected tree with nodes that contain as many query keywords as possible. • Group Steiner tree: • match individual keywords to terminal nodes, grouped by keyword • compute the tree that connects at least one terminal node per keyword • and has the best total edge weight (example graph with terminal nodes labeled w, x, y, z, for query: x w y z)
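
A minimal greedy sketch of this idea (an approximation, not the exact algorithm of the cited systems): for each candidate root, connect every keyword group via its cheapest shortest path and keep the root whose resulting edge set is lightest. The graph, weights, and keyword groups below are illustrative assumptions:

    import networkx as nx

    def greedy_group_steiner(G, groups, weight="weight"):
        """Approximate group Steiner tree: try every root, connect each keyword
        group by its cheapest shortest path, keep the lightest union of edges."""
        best_cost, best_edges = float("inf"), None
        for root in G.nodes:
            dist, paths = nx.single_source_dijkstra(G, root, weight=weight)
            edges, ok = set(), True
            for terminals in groups:                       # one group per keyword
                reachable = [n for n in terminals if n in dist]
                if not reachable:
                    ok = False                             # root cannot cover this keyword
                    break
                t = min(reachable, key=dist.get)           # cheapest terminal of the group
                path = paths[t]
                edges |= {frozenset(e) for e in zip(path, path[1:])}
            if not ok:
                continue
            cost = sum(G.edges[tuple(e)][weight] for e in edges)
            if cost < best_cost:
                best_cost, best_edges = cost, edges
        return best_cost, best_edges

    # toy graph; groups list the terminal nodes per keyword
    G = nx.Graph()
    G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1), ("a", "d", 2), ("d", "e", 1)])
    groups = [{"a"}, {"c"}, {"e"}]
    print(greedy_group_steiner(G, groups))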

  22. Outline ✓ Motivation ✓ Searching for Entities & Relations • Informative Ranking • Efficient Query Processing • User Interface • Wrap-up ...

  23. Ranking Criteria • Confidence: Prefer results that are likely correct • accuracy of info extraction • trust in sources (authenticity, authority) e.g. bornIn (Jim Gray, San Francisco) from "Jim Gray was born in San Francisco" (en.wikipedia.org) vs. livesIn (Michael Jackson, Tibet) from "Fans believe Jacko hides in Tibet" (www.michaeljacksonsightings.com) • Informativeness: Prefer results with salient facts • Statistical estimation from: frequency in answer, frequency on Web, frequency in query log e.g. q: Einstein isa ? → Einstein isa scientist vs. Einstein isa vegetarian; q: ?x isa vegetarian → Einstein isa vegetarian vs. Whocares isa vegetarian • Diversity: Prefer variety of facts (E won …, E discovered …, E played … rather than E won …, E won …, E won …) • Conciseness: Prefer results that are tightly connected • size of answer graph • cost of Steiner tree (e.g. Einstein won NobelPrize, Bohr won NobelPrize vs. Einstein isa vegetarian, Cruise isa vegetarian, Cruise born 1962, Bohr died 1962)

  24. Ranking Approaches • Confidence: Prefer results that are likely correct • accuracy of info extraction • trust in sources (authenticity, authority) → empirical accuracy of IE, PR/HITS-style estimate of trust, combined into: max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) } • Informativeness: Prefer results with salient facts • Statistical LM with estimations from: frequency in answer, frequency in corpus (e.g. Web), frequency in query log → PR/HITS-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …] or IR models: tf*idf … [K. Chang et al., …] → Statistical Language Models • Diversity: Prefer variety of facts → Statistical Language Models • Conciseness: Prefer results that are tightly connected • size of answer graph • cost of Steiner tree → graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S. Chakrabarti et al., B. Kimelfeld et al., G. Kasneci et al., …]
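
A toy sketch of the confidence combination named above, max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }; the trust and accuracy numbers are made up for illustration:

    # per-source trust (e.g. from a PageRank/HITS-style analysis) -- illustrative values
    trust = {"en.wikipedia.org": 0.9, "michaeljacksonsightings.com": 0.2}

    # witnesses per fact, each with the extractor's empirical accuracy for that extraction
    witnesses = {
        ("Jim Gray", "bornIn", "San Francisco"): [("en.wikipedia.org", 0.8)],
        ("Michael Jackson", "livesIn", "Tibet"): [("michaeljacksonsightings.com", 0.7)],
    }

    def confidence(fact):
        # best witness wins: max over sources of extraction accuracy times source trust
        return max(acc * trust[src] for src, acc in witnesses[fact])

    for f in witnesses:
        print(f, round(confidence(f), 2))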

  25. Example: RDF-Express [S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09] ?a1 isMarriedTo ?a2 . ?a1 actedIn ?m . ?a2 actedIn ?m . http://www.mpi-inf.mpg.de/yago-naga/rdf-express/

  26. Example: RDF-Express ?a1 isMarriedTo ?a2 {Bollywood} . ?a1 actedIn ?m . ?a2 actedIn ?m .

  27. Example: RDF-Express ?a1 isMarriedTo ?a2 . ?a1 actedIn ?m {thriller} . ?a2 actedIn ?m {love} .

  28. Example: RDF-Express [S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09] ?a1 directed ?m . ?a2 actedIn ?m . ?a1 hasWonPrize Academy_Award . ?a2 type wordnet_musician .

  29. Information Retrieval Basics [Textbooks by R. Baeza-Yates, B. Croft, Manning/Raghavan/Schuetze, …] server farm with 10,000's of computers, distributed/replicated data in a high-performance file system, massive parallelism for query processing. Pipeline: crawl (strategies for crawl schedule and priority queue for crawl frontier) → extract & clean (handle dynamic pages, detect duplicates, detect spam) → index (build and analyze Web graph, index all tokens or word stems) → search (fast top-k queries, query logging, auto-completion) → rank (scoring function over many data and context criteria) → present (GUI, user guidance, personalization)

  30. Vector Space Model for Content Relevance Ranking Ranking by descending relevance Similarity metric: Search engine Query (set of weighted features) Documents are feature vectors (bags of words)

  31. Vector Space Model for Content Relevance Ranking Ranking by descending relevance Similarity metric: Search engine Query (set of weighted features) Documents are feature vectors (bags of words) tf*idf formula: term frequency * inverse document frequency
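
A minimal sketch of the bag-of-words vector space model with tf*idf weighting and cosine similarity as the ranking metric (standard formulation with a smoothed idf; the toy documents are made up):

    import math
    from collections import Counter

    docs = ["god does not play dice",
            "god rolls dice where you cannot see them",
            "ir does play"]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))        # document frequencies

    def tfidf(text):
        tf = Counter(text.split())
        # smoothed idf so that unseen query terms do not divide by zero
        return {t: tf[t] * math.log((N + 1) / (df.get(t, 0) + 1)) for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    query = tfidf("god dice")
    ranking = sorted(range(N), key=lambda i: cosine(query, tfidf(docs[i])), reverse=True)
    print(ranking)   # document indices by descending relevance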

  32. Statistical Language Models (LM's) [Maron/Kuhns 1960, Ponte/Croft 1998, Hiemstra 1998, Lafferty/Zhai 2001] "God does not play dice" (Albert Einstein) "IR does" (anonymous) d1 ? LM(θ1) q "God rolls dice in places where you can't see them" (Stephen Hawking) d2 ? LM(θ2) • each doc di has an LM: a generative prob. distribution with parameters θi • query q viewed as a sample from LM(θ1), LM(θ2), … • estimate the likelihood P[q | LM(θi)] that q is a sample of the LM of doc di (q is "generated by" di) • rank by descending likelihoods (best "explanation" of q)

  33. LM: Doc as Model, Query as Sample estimate the likelihood of observing the query "A A B C E E": P[q | M], where the model M is a multinomial prob. distribution and the document d "A A A A B B C C C D E E E E E" is the sample of M used for parameter estimation

  34. LM: Need for Smoothing estimate the likelihood of observing a query (here with terms A, B, C, D, E, F): P[q | M], with model M estimated from document d "A A A A B B C C C D E E E E E"; query terms that never occur in d (here F) would get zero probability under plain MLE, so use a background corpus and/or smoothing: Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …
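
A sketch of the three smoothing variants named on the slide, applied to a unigram document LM (textbook formulas; the toy document, background collection, and parameter values are assumptions):

    from collections import Counter

    doc    = "A A A A B B C C C D E E E E E".split()
    corpus = doc + "A B C D E F F".split()            # toy background collection

    tf, dlen = Counter(doc), len(doc)
    cf, clen = Counter(corpus), len(corpus)
    V = len(cf)                                        # vocabulary size

    def p_laplace(t):                                  # add-one smoothing
        return (tf[t] + 1) / (dlen + V)

    def p_jelinek_mercer(t, lam=0.7):                  # mixture with background LM
        return lam * tf[t] / dlen + (1 - lam) * cf[t] / clen

    def p_dirichlet(t, mu=20):                         # Dirichlet-prior smoothing
        return (tf[t] + mu * cf[t] / clen) / (dlen + mu)

    for t in ["A", "D", "F"]:                          # F is unseen in the document
        print(t, p_laplace(t), p_jelinek_mercer(t), p_dirichlet(t))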

  35. Some LM Basics independence assumption; simple MLE: overfitting; mixture model for smoothing; P[q] est. from log or corpus; tf*idf family; KL divergence (Kullback-Leibler div.), aka relative entropy • Precompute per-keyword scores • Store in postings of the inverted index • Score aggregation for (top-k) multi-keyword queries → efficient implementation
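
In standard notation, these bullets correspond to the following textbook formulas (a reconstruction; the slide's own equations are not in the transcript):

    \[ P[q \mid d] \;=\; \prod_{t \in q} P[t \mid \theta_d] \qquad \text{(independence assumption)} \]
    \[ \hat{P}_{\mathrm{MLE}}[t \mid \theta_d] \;=\; \frac{tf(t,d)}{|d|} \;\;\text{(simple MLE: overfits)}, \qquad
       P[t \mid \theta_d] \;=\; \lambda\,\frac{tf(t,d)}{|d|} + (1-\lambda)\,P[t \mid C] \;\;\text{(mixture model)} \]
    \[ KL(\theta_q \,\|\, \theta_d) \;=\; \sum_t P[t \mid \theta_q]\,\log\frac{P[t \mid \theta_q]}{P[t \mid \theta_d]} \qquad \text{(Kullback-Leibler divergence)} \]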

  36. IR as LM Estimation user likes doc (R) given that it has features d and the user poses query q: P[R | d,q] connects probabilistic IR and statistical LMs; the query likelihood P[q | d] determines the top-k query result; the plain MLE of P[term | d] would simply be the term's tf in d

  37. Multi-Bernoulli vs. Multinomial LM Multi-Bernoulli: P[q | d] = ∏_j P[j | d]^{X_j(q)} (1 − P[j | d])^{1 − X_j(q)} with X_j(q) = 1 if j ∈ q, 0 otherwise Multinomial: P[q | d] ∝ ∏_j P[j | d]^{f_j(q)} with f_j(q) = f(j) = frequency of j in q; the multinomial LM is more expressive and usually preferred

  38. LM Scoring by Kullback-Leibler Divergence rank by negative cross-entropy / negative KL divergence of q and d

  39. Jelinek-Mercer Smoothing Idea: use a linear combination of the doc LM with a background LM (corpus, common language); could also consider the query log as background LM for the query • parameter tuning of λ by cross-validation with held-out data: • divide the set of relevant (d,q) pairs into n partitions • build LM on the pairs from n−1 partitions • choose λ to maximize precision (or recall or F1) on the nth partition • iterate with different choices of the nth partition and average
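
A sketch of this λ-tuning procedure (n-fold cross-validation over relevant (d, q) pairs); score_fn and eval_fn are hypothetical placeholders for "build the smoothed ranker on the training pairs" and "measure precision on the held-out pairs":

    import random

    def tune_lambda(pairs, score_fn, eval_fn, candidates=(0.1, 0.3, 0.5, 0.7, 0.9), n=5):
        """pairs: list of relevant (doc, query) pairs.
        score_fn(lam, train) -> ranker built with smoothing weight lam;
        eval_fn(ranker, heldout) -> precision (or recall or F1)."""
        random.shuffle(pairs)
        folds = [pairs[i::n] for i in range(n)]
        chosen = []
        for i in range(n):
            heldout = folds[i]
            train = [p for j, fold in enumerate(folds) if j != i for p in fold]
            # pick the lambda that maximizes the quality measure on the held-out fold
            best = max(candidates, key=lambda lam: eval_fn(score_fn(lam, train), heldout))
            chosen.append(best)
        return sum(chosen) / n        # average the per-fold choices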

  40. Jelinek-Mercer Smoothing: Relationship to tf*idf with absolute frequencies tf, df: relative tf ~ relative idf

  41. Dirichlet-Prior Smoothing Posterior distr. of θ with a Dirichlet distribution as prior, with term frequencies f in document d; MAP (Maximum A Posteriori) estimate for θ with α_i set to μ·P[t_i | C] + 1 for the Dirichlet hypergenerator and μ > 1 set to a multiple of the average document length, with Dirichlet(α) (the Dirichlet is the conjugate prior for the parameters of a multinomial distribution: a Dirichlet prior implies a Dirichlet posterior, only with different parameters)
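
The smoothed estimate that this MAP derivation yields, in standard notation (a reconstruction of the usual Dirichlet-prior formula, since the slide's equations are not in the transcript):

    \[ P_\mu[t \mid d] \;=\; \frac{f(t,d) \;+\; \mu\,P[t \mid C]}{|d| \;+\; \mu},
       \qquad \alpha_t = \mu\,P[t \mid C] + 1, \quad \mu > 1 \]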

  42. Dirichlet-Prior Smoothing: Relationship to Jelinek-Mercer Smoothing tf_j from the corpus with MLEs P[j | d], P[j | C], where α_1 = μ·P[t_1 | C], ..., α_m = μ·P[t_m | C] are the parameters of the underlying Dirichlet distribution, with constant μ > 1 typically set to a multiple of the average document length

  43. Entity Search with LM Ranking [Z. Nie et al.: WWW'07, H. Fang et al.: ECIR'07, P. Serdyukov et al.: ECIR'08, …] query: keywords → answer: entities LM(entity e) = prob. distr. of words seen in the context of e, weighted by confidence. Example contexts: David Beckham – played for ManU, Real, LA Galaxy; champions league; England lost match against France; married to spice girl … Zizou – champions league 2002 Real Madrid won final ... Zinedine Zidane – best player France world cup 1998 ... query q: "French player who won world championship" candidate entities: e1: David Beckham e2: Ruud van Nistelroy e3: Ronaldinho e4: Zinedine Zidane e5: FC Barcelona

  44. LM's: from Entities to Facts Document / Entity LM's: LM for doc/entity: prob. distr. of words; LM for query: (prob. distr. of) words; LM's rich for docs/entities, super-sparse for queries → richer query LM with query expansion, etc. Triple LM's: LM for facts: (degenerate prob. distr. of) triples; LM for queries: (degenerate prob. distr. of) triple patterns; LM's: apples and oranges → • expand query variables by S, P, O values from the DB/KB • enhance with witness statistics • the query LM then is a prob. distr. of triples!

  45. LM's for Triples and Triple Patterns [G. Kasneci et al.: ICDE'08; S. Elbassuoni et al.: CIKM'09, ESWC'11] triples (facts f) with witness statistics: f1: Beckham p ManchesterU 200 f2: Beckham p RealMadrid 300 f3: Beckham p LAGalaxy 20 f4: Beckham p ACMilan 30 f5: Kaka p ACMilan 300 f6: Kaka p RealMadrid 150 f7: Zidane p ASCannes 20 f8: Zidane p Juventus 200 f9: Zidane p RealMadrid 350 f10: Tidjani p ASCannes 10 f11: Messi p FCBarcelona 400 f12: Henry p Arsenal 200 f13: Henry p FCBarcelona 150 f14: Ribery p BayernMunich 100 f15: Drogba p Chelsea 150 f16: Casillas p RealMadrid 20 (total: 2600) triple patterns (queries q): q: Beckham p ?y → LM(q) + smoothing: Beckham p ManU 200/550, Beckham p Real 300/550, Beckham p Galaxy 20/550, Beckham p Milan 30/550 q: ?x p ASCannes → Zidane p ASCannes 20/30, Tidjani p ASCannes 10/30 q: ?x p ?y → Messi p FCBarcelona 400/2600, Zidane p RealMadrid 350/2600, Kaka p ACMilan 300/2600, … q: Cruyff ?r FCBarcelona → Cruyff playedFor FCBarca 200/500, Cruyff playedAgainst FCBarca 50/500, Cruyff coached FCBarca 250/500 LM(q): {t → P[t | t matches q] ~ #witnesses(t)} LM(answer f): {t → P[t | t matches f] ~ 1 for f} smooth all LM's, rank results by ascending KL(LM(q) | LM(f))
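
A toy sketch of this scheme: the query LM spreads probability mass over matching triples in proportion to their witness counts, each answer LM concentrates on a single triple, both are smoothed, and answers are ranked by ascending KL divergence. The counts follow the slide's Beckham example; the smoothing constant is an assumption:

    import math

    witness = {                                   # (S, P, O) -> #witnesses
        ("Beckham", "p", "ManchesterU"): 200,
        ("Beckham", "p", "RealMadrid"): 300,
        ("Beckham", "p", "LAGalaxy"): 20,
        ("Beckham", "p", "ACMilan"): 30,
    }

    def matches(triple, pattern):
        return all(p.startswith("?") or p == t for t, p in zip(triple, pattern))

    def query_lm(pattern, eps=1e-3):
        m = {t: c for t, c in witness.items() if matches(t, pattern)}
        total = sum(m.values())
        # witness-proportional mass on matches, tiny smoothing mass everywhere else
        return {t: (1 - eps) * m.get(t, 0) / total + eps / len(witness) for t in witness}

    def answer_lm(fact, eps=1e-3):
        return {t: (1 - eps) * (t == fact) + eps / len(witness) for t in witness}

    def kl(p, q):
        return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

    q = query_lm(("Beckham", "p", "?y"))
    for f in sorted(witness, key=lambda f: kl(q, answer_lm(f))):   # ascending KL
        print(f, round(kl(q, answer_lm(f)), 3))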

  46. LM's for Composite Queries q: Select ?x, ?c Where { France ml ?x . ?x p ?c . ?c in UK . } queries q with subqueries q1 … qn; results are n-tuples of triples t1 … tn, e.g. P[F ml Henry, Henry p Arsenal, Arsenal in UK], P[F ml Drogba, Drogba p Chelsea, Chelsea in UK] LM(q): P[q1…qn] = ∏_i P[qi] LM(answer): P[t1…tn] = ∏_i P[ti] KL(LM(q) | LM(answer)) = Σ_i KL(LM(qi) | LM(ti)) witness statistics: f1: Beckham p ManU 200 f7: Zidane p ASCannes 20 f8: Zidane p Juventus 200 f9: Zidane p RealMadrid 300 f10: Tidjani p ASCannes 10 f12: Henry p Arsenal 200 f13: Henry p FCBarca 150 f14: Ribery p Bayern 100 f15: Drogba p Chelsea 150 f21: F ml Zidane 200 f22: F ml Tidjani 20 f23: F ml Henry 200 f24: F ml Ribery 200 f25: F ml Drogba 30 f26: IC ml Drogba 100 f27: ALG ml Zidane 50 f31: ManU in UK 200 f32: Arsenal in UK 160 f33: Chelsea in UK 140

  47. LM's for Keyword-Augmented Queries q: Select ?x, ?c Where { France ml ?x [goalgetter, "top scorer"] . ?x p ?c . ?c in UK [champion, "cup winner", double] . } subqueries qi with keywords w1 … wm; results are still n-tuples of triples ti LM(qi): P[triple ti | w1 … wm] = Σ_k α·P[ti | wk] + (1−α)·P[ti] LM(answer fi) analogous KL(LM(q) | LM(answer)) = Σ_i KL(LM(qi) | LM(fi)) result ranking prefers (n-tuples of) triples whose witnesses score high on the subquery keywords

  48. LM's for Temporal Phrases [K. Berberich et al.: ECIR'10] Problem: user's temporal interest is imprecise ("mid 90s", "July 1994", "last century", "1998", "summer 1990") q: Select ?c Where { ?c instOf nationalTeam . ?c hasWon WorldCup [nineties] . } normalize the temporal expression v in the query, e.g. "nineties" → v = [1Jan90, 31Dec99]; extract temporal expressions xj from witnesses & normalize, e.g. "mid 90s" → xj = [1Jan93, 31Dec97] P[q | t] ~ … ∏_j P[v | xj] ~ … ∏_j Σ { P[[b,e] | xj] | interval [b,e] ∈ v } ~ … ∏_j overlap(v, xj) / union(v, xj) e.g. P[v="nineties" | xj="mid 90s"] = 1/2, P["nineties" | "summer 1990"] = 1/30, P["nineties" | "last century"] = 1/10 • enhanced ranking • efficiently computable • plugs into doc/entity/triple LM's
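
A small sketch of the overlap/union matching used here, with intervals held as date pairs; the interval boundaries reproduce the slide's examples, except "summer 1990", whose boundaries are an assumption:

    from datetime import date

    def overlap_score(v, x):
        """~ overlap(v, x) / union(v, x), as on the slide, measured in days."""
        (vb, ve), (xb, xe) = v, x
        vb, ve, xb, xe = vb.toordinal(), ve.toordinal(), xb.toordinal(), xe.toordinal()
        inter = max(0, min(ve, xe) - max(vb, xb) + 1)
        union = max(ve, xe) - min(vb, xb) + 1
        return inter / union

    nineties = (date(1990, 1, 1), date(1999, 12, 31))
    mid_90s  = (date(1993, 1, 1), date(1997, 12, 31))
    summer90 = (date(1990, 6, 21), date(1990, 9, 22))    # assumed boundaries

    print(overlap_score(nineties, mid_90s))    # ~ 0.5, cf. the slide's 1/2
    print(overlap_score(nineties, summer90))   # small, same order as the slide's 1/30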

  49. Query Relaxation q: … Where { France ml ?x . ?x p ?c . ?c in UK . } relaxations: q(4): … Where { IC ml ?x . ?x p ?c . ?c in UK . } q(3): … Where { France ?r ?x . ?x p ?c . ?c in UK . } q(2): … Where { ?x ml ?x . ?x p ?c . ?c in UK . } q(1): … Where { France ml ?x . ?x p ?c . ?c in ?y . } example answers: [F resOf Drogba, Drogba p Chelsea, Chelsea in UK] [IC ml Drogba, Drogba p Chelsea, Chelsea in UK] [F ml Zidane, Zidane p Real, Real in ESP] LM(q*) = α0·LM(q) + α1·LM(q(1)) + α2·LM(q(2)) + … replace e in q by e(i) in q(i): precompute P := LM(e ?p ?o) and Q := LM(e(i) ?p ?o), set αi ~ 1/2 (KL(P|Q) + KL(Q|P)) replace r in q by r(i) in q(i) → LM(?s r(i) ?o) replace e in q by ?x in q(i) → LM(?x r ?o) … the LM's of e, r, ... are prob. distributions over triples! witness statistics: f21: F ml Zidane 200 f22: F ml Tidjani 20 f23: F ml Henry 200 f24: F ml Ribery 200 f26: IC ml Drogba 100 f27: ALG ml Zidane 50 f1: Beckham p ManU 200 f7: Zidane p ASCannes 20 f9: Zidane p Real 300 f10: Tidjani p ASCannes 10 f12: Henry p Arsenal 200 f15: Drogba p Chelsea 150 f31: ManU in UK 200 f32: Arsenal in UK 160 f33: Chelsea in UK 140

  50. Result Personalization [S. Elbassuoni et al.: PersDB'08] same answer for everyone? Personal histories of queries & clicked facts → LM(user u): prob. distr. of triples! LM(q|u) = β·LM(q) + (1−β)·LM(u), then business as usual u1 [classical music] → q: ?p from Europe . ?p hasWonAcademyAward u2 [romantic comedy] → q: ?p from Europe . ?p hasWonAcademyAward u3 [from Africa] → q: ?p isa SoccerPlayer . ?p hasWon ?a Open issue: "insightful" results (new to the user)
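
A minimal sketch of this mixture, LM(q|u) = β·LM(q) + (1−β)·LM(u), with both LMs as distributions over triples; the triples and probabilities are illustrative:

    def personalize(lm_q, lm_u, beta=0.7):
        """Interpolate the query LM with the user's history LM (both dicts: triple -> prob)."""
        triples = set(lm_q) | set(lm_u)
        return {t: beta * lm_q.get(t, 0.0) + (1 - beta) * lm_u.get(t, 0.0) for t in triples}

    lm_q = {("Morricone", "hasWonAcademyAward", "2007"): 0.5,      # from the query alone
            ("Waltz", "hasWonAcademyAward", "2010"): 0.5}
    lm_u = {("Morricone", "composed", "filmMusic"): 1.0}           # [classical music] user history
    print(personalize(lm_q, lm_u))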
