
ESTER: Efficient Search on Text, Entities, and Relations


Presentation Transcript


  1. ESTER: Efficient Search on Text, Entities, and Relations
Holger Bast, Alexandru Chitea, Fabian Suchanek, Ingmar Weber
Presented by Krupakar Reddy Salguti

  2. Keyword Search vs. Semantic Search
• Keyword search
  • Query: john lennon
  • Answer: documents containing the words john and lennon
• Semantic search
  • Query: musician
  • Answer: documents containing an instance of musician
• Combined search
  • Query: beatles musician
  • Answer: documents containing the word beatles and an instance of musician

  3. Semantic Search: Challenges + Our System
1. Entity recognition
  • approach 1: let users annotate (semantic web)
  • approach 2: annotate (semi-)automatically
  • our system: uses Wikipedia links + learns from them
2. Query processing
  • build a space-efficient index
  • which enables fast query answers
  • our system: as compact and fast as a standard full-text engine
3. User interface
  • easy to use
  • yet powerful query capabilities
  • our system: standard interface with interactive suggestions

  4. In the Rest of this Talk …
• Efficiency
  • three simple ideas (which all fail)
  • our approach (which works)
• Queries supported
  • essentially all SPARQL queries, and
  • seamless integration with ordinary full-text search
• Experiments
  • efficiency (great)
  • quality (not so great yet)
• Conclusions
  • lots of interesting + challenging open problems

  5. Efficiency: Simple Idea 1
• Add “semantic tags” to the document
  • e.g., add the special word tag:musician before every occurrence of a musician in a document
• Problem 1: Index blowup (see the sketch after this list)
  • e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)
• Problem 2: Limited querying capabilities
  • e.g., could not produce a list of the musicians that occur in documents that also contain the word beatles
  • in particular, could not do all SPARQL queries (more on that later)
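
To make the index blowup concrete, here is a minimal Python sketch (not from the paper): the class list is abbreviated and the document is made up, but it shows how one entity occurrence turns into many extra index entries.

    # Sketch of "Simple Idea 1": prepend one tag word per class before
    # every entity occurrence. The data is made up; the real system
    # would take entities and classes from YAGO.
    CLASSES = {
        "john_lennon": ["musician", "singer", "composer", "artist",
                        "vegetarian", "person", "pacifist"],  # really 28 classes
    }

    def tag_document(tokens):
        """Insert tag:<class> words before each recognized entity occurrence."""
        out = []
        for token in tokens:
            out.extend(f"tag:{cls}" for cls in CLASSES.get(token, []))
            out.append(token)
        return out

    doc = ["john_lennon", "of", "the", "beatles"]
    tagged = tag_document(doc)
    print(len(tagged), "index entries instead of", len(doc))  # 11 instead of 4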

  6. Efficiency: Simple Idea 2
• Query expansion
  • e.g., replace the query word musician by the disjunction musician:aaron_copland OR … OR musician:zarah_leander (7,593 musicians in Wikipedia)
• Problem: Inefficient query processing
  • one intersection per element of the disjunction is needed (see the sketch below)
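
A sketch of the expansion, with a made-up three-element musician list standing in for Wikipedia's 7,593:

    # Sketch of "Simple Idea 2": expand the class word into a disjunction
    # over all of its instances. Only three musicians are listed here;
    # with all 7,593 the engine would have to intersect the posting list
    # of "beatles" with each disjunct separately.
    MUSICIANS = ["aaron_copland", "john_lennon", "zarah_leander"]

    def expand(class_word, instances):
        """Replace a class word by the OR of all its instances."""
        return " OR ".join(f"{class_word}:{name}" for name in instances)

    print("beatles", expand("musician", MUSICIANS))
    # beatles musician:aaron_copland OR musician:john_lennon OR musician:zarah_leander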

  7. Efficiency: Simple Idea 3
• Use a database
  • map semantic queries to SQL queries on suitably constructed tables (see the sketch below)
  • that’s what the Artificial-Intelligence / Semantic-Web people usually do
• Problem: Inefficient + lack of control
  • a search engine built on top of an off-the-shelf database is orders of magnitude slower, or uses orders of magnitude more space, or both
  • very limited control over efficiency aspects
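
A sketch of the database route, using sqlite3 and two hypothetical tables (occurs and is_a); the schema is illustrative, not from the paper.

    # Sketch of "Simple Idea 3": the combined query "beatles musician"
    # mapped to SQL over two hypothetical tables. Schema and data are
    # made up for illustration.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE occurs (word TEXT, doc INTEGER);
        CREATE TABLE is_a   (entity TEXT, class TEXT);
        INSERT INTO occurs VALUES ('beatles', 1), ('john_lennon', 1);
        INSERT INTO is_a   VALUES ('john_lennon', 'musician');
    """)
    # documents containing the word beatles and an instance of musician
    rows = con.execute("""
        SELECT DISTINCT o1.doc
        FROM occurs o1
        JOIN occurs o2 ON o1.doc = o2.doc
        JOIN is_a   m  ON o2.word = m.entity AND m.class = 'musician'
        WHERE o1.word = 'beatles'
    """).fetchall()
    print(rows)  # [(1,)]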

  8. Efficiency: Our Approach
• Two basic operations
  • prefix search of a special kind
  • join
• An index data structure
  • which supports these two operations efficiently
• Artificial words in the documents
  • such that a large class of semantic queries reduces to a combination of (few of) these operations

  9. Processing the query “beatles musician”
• Document text, with an artificial word added after the entity occurrence:
  … legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
• The fact “John Lennon is a musician, singer, …” is encoded as artificial words at fixed positions in the John Lennon document:
  position 0: entity:john_lennon   position 1: relation:is_a   position 2: class:musician, class:singer, …
• Two prefix queries:
  • beatles entity:*  →  entity:john_lennon, entity:1964, entity:liverpool, etc.
  • entity:* . relation:is_a . class:musician  →  entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
• One join  →  entity:john_lennon, etc.
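
A minimal end-to-end sketch of this slide, assuming a toy index of (doc, position, word) postings; the data follows the example above but is otherwise made up.

    # Toy index of (doc, position, word) postings, artificial words included.
    INDEX = [
        # the "Gitanes" document (doc 1), abbreviated
        (1, 7, "john"), (1, 8, "lennon"), (1, 8, "entity:john_lennon"),
        (1, 11, "beatles"),
        # fact encoding in the John Lennon document (doc 2)
        (2, 0, "entity:john_lennon"), (2, 1, "relation:is_a"),
        (2, 2, "class:musician"), (2, 2, "class:singer"),
    ]

    def prefix_search(prefix, in_docs=None):
        """All postings whose word matches the prefix, optionally doc-filtered."""
        return [(d, p, w) for d, p, w in INDEX
                if w.startswith(prefix) and (in_docs is None or d in in_docs)]

    # prefix query 1: entities in documents that contain the word "beatles"
    beatles_docs = {d for d, _, w in INDEX if w == "beatles"}
    candidates = {w for _, _, w in prefix_search("entity:", beatles_docs)}

    # prefix query 2: entities with an is_a class:musician fact
    # (entity at position 0, relation at 1, class at 2)
    musician_docs = {d for d, p, w in INDEX if w == "class:musician" and p == 2}
    musicians = {w for d, p, w in prefix_search("entity:", musician_docs) if p == 0}

    # one join: intersect the two entity lists
    print(sorted(candidates & musicians))  # ['entity:john_lennon']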

  10. The problem with entity:* (figure repeats the example from slide 9)
• Problem: entity:* has a huge number of occurrences
  • ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
  • prefix search is efficient only for prefixes covering up to ≈ 1% of all occurrences (explanation follows)
• Solution: frontier classes (a minimal lookup sketch follows below)
  • classes at an “appropriate” level in the hierarchy
  • e.g.: artist, believer, worker, vegetable, animal, …
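
A sketch of the frontier-class lookup, with a made-up five-class hierarchy; the real hierarchy comes from YAGO/WordNet.

    # Sketch: map a queried class to its frontier class by walking up
    # the hierarchy, so queries use e.g. artist:* instead of entity:*.
    # The tiny hierarchy is made up.
    PARENT = {"musician": "artist", "actor": "artist",
              "artist": "person", "person": "entity"}
    FRONTIER = {"artist", "believer", "worker", "vegetable", "animal"}

    def frontier_class(cls):
        """Walk up the class hierarchy until a frontier class is reached."""
        while cls not in FRONTIER:
            cls = PARENT[cls]  # assumes every path passes a frontier class
        return cls

    print(frontier_class("musician"))  # artist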

  11. The slide-9 example revisited with frontier-class words
• Document text now uses frontier-class words:
  … legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …
• Fact encoding in the John Lennon document:
  position 0: artist:john_lennon, believer:john_lennon   position 1: relation:is_a   position 2: class:musician, …
• First figure out: musician → artist (easy)
• Two prefix queries:
  • beatles artist:*  →  artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
  • artist:* . relation:is_a . class:musician  →  artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
• One join  →  artist:john_lennon, etc.

  12. The HYB Index [Bast/Weber, SIGIR’06]
• Maintains lists for word ranges (not words)
  • e.g., one block for the range abl-abt holds the occurrences of able, ablaze, abnormal, abroad, …
• Looks like this for person:*
  • one block for the word range person:* holds the occurrences of person:graham_greene, person:john_lennon, person:ringo_starr, person:john_lennon, …

  13. Maintains lists for word ranges (not words)
• Provably efficient
  • no more space than an inverted index (on the same data)
  • each query = scan of a moderate number of (compressed) items
• Extremely versatile
  • can do all kinds of things an inverted index cannot do (efficiently)
  • autocompletion, faceted search, query expansion, error correction, select and join, … (a block-scan sketch follows below)
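
A minimal sketch of the block idea behind HYB, assuming made-up blocks and word-range bounds (the real index also compresses the blocks, and a prefix may span several of them).

    # Postings grouped into blocks by word range: a prefix query scans
    # one block instead of merging many per-word lists. For simplicity,
    # a prefix is assumed to fall entirely within one block.
    import bisect

    BLOCK_BOUNDS = ["abl", "person:"]  # block i covers [bound_i, bound_{i+1})
    BLOCKS = [
        [("able", 3), ("ablaze", 7), ("abnormal", 2), ("abroad", 9)],
        [("person:graham_greene", 4), ("person:john_lennon", 1),
         ("person:john_lennon", 8), ("person:ringo_starr", 4)],
    ]

    def prefix_query(prefix):
        """Scan the single block containing the prefix, filter matches."""
        i = bisect.bisect_right(BLOCK_BOUNDS, prefix) - 1
        return [(word, doc) for word, doc in BLOCKS[i] if word.startswith(prefix)]

    print(prefix_query("person:john"))
    # [('person:john_lennon', 1), ('person:john_lennon', 8)]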

  14. Queries we can handle
• We prove the following theorem:
  • Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
• Example (musicians born in the same year as John Lennon):
    SELECT ?who WHERE {
      ?who is_a Musician .
      ?who born_in_year ?when .
      John_Lennon born_in_year ?when
    }
• ESTER achieves seamless integration with full-text search
  • SPARQL has no means for dealing with full-text search
  • XQuery can handle full-text search, but is not really suitable for semantic search
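
A sketch of the reduction on the example query: each triple pattern becomes one (prefix-search-like) scan over the encoded facts, and shared variables become joins. The facts are made up.

    # Each triple pattern ~ one scan over the artificial-word fact
    # encoding; shared variables become joins. Toy facts below.
    FACTS = [  # (subject, relation, object)
        ("john_lennon", "is_a", "musician"),
        ("john_lennon", "born_in_year", "1940"),
        ("ringo_starr", "is_a", "musician"),
        ("ringo_starr", "born_in_year", "1940"),
        ("johann_sebastian_bach", "born_in_year", "1685"),
    ]

    def match(s=None, r=None, o=None):
        """One triple pattern; None acts as a variable."""
        return [(fs, fr, fo) for fs, fr, fo in FACTS
                if s in (None, fs) and r in (None, fr) and o in (None, fo)]

    # ?who is_a Musician
    who = {s for s, _, _ in match(r="is_a", o="musician")}
    # John_Lennon born_in_year ?when
    when = {o for _, _, o in match(s="john_lennon", r="born_in_year")}
    # ?who born_in_year ?when : join on both shared variables
    result = {s for s, _, o in match(r="born_in_year") if s in who and o in when}
    print(sorted(result))  # ['john_lennon', 'ringo_starr']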

  15. Experiments: Corpus, Ontology, Index
• Corpus: English Wikipedia (XML dump from Nov. 2006)
  • ≈ 8 GB raw XML
  • ≈ 2.8 million documents
  • ≈ 1 billion words
• Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07)
  • ≈ 2.5 million facts
  • derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)
• Our index
  • ≈ 1.5 billion words (original + artificial)
  • ≈ 3.3 GB total index size; the ontology-only part is a mere 100 MB

  16. Experiments: Efficiency — What Baseline?
• SPARQL engines
  • can’t do text search
  • and slow for ontology-only queries too (on Wikipedia: seconds)
• XQuery engines
  • extremely slow for text search (on Wikipedia: minutes)
  • and slow for ontology-only queries too (on Wikipedia: seconds)
• Other prototypes that do semantic + full-text search
  • efficiency is hardly considered
  • e.g., the system of Castells/Fernandez/Vallet (TKDE’07): “… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …”
  • our system: ≈ 100 ms on 2.8 million documents and 2.5 million facts

  17. Experiments: Efficiency — Stress Test 1
• Compare to an ontology-only system
  • the YAGO engine from WWW’07
• Onto Simple: when was [person] born [1,000 queries]
• Onto Advanced: list all people from [profession] [1,000 queries]
• Onto Hard: when did people die who were born in the same year as [person] [1,000 queries]
• Note: the comparison is very unfair for our system (chart labels: 4 GB index for ESTER vs. 100 MB ontology-only index for the YAGO engine)
(results chart omitted from this transcript)

  18. Experiments: Efficiency — Stress Test 2
• Compare to a text-only search engine
  • the state-of-the-art system from SIGIR’06
• Onto+Text Easy: counties in [US state] [50 queries]
• Onto+Text Hard: computer scientists [nationality] [50 queries]
• Full-text query: e.g., german computer scientists
  • note: hardly finds relevant documents
• Note: the comparison is extremely unfair for our system
(results chart omitted from this transcript)

  19. Experiments: Quality — Entity Recognition
• Use Wikipedia links as hints
  • “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …”
  • “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …”
• Learn the other (missing) links
  • use the words in the neighborhood as features (see the sketch below)
• Accuracy (results table omitted from this transcript)
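
A toy version of the learning step, with made-up training snippets and a simple Naive Bayes scorer standing in for whatever classifier the actual system uses.

    # Wikipedia link anchors give labeled examples (surface form to
    # entity); words around each anchor serve as features for
    # disambiguating unlinked mentions. All snippets are made up.
    from collections import Counter, defaultdict

    TRAIN = [  # (context words around the anchor "Lennon", linked entity)
        (["beatles", "paul", "mccartney", "song"], "John_Lennon"),
        (["guitar", "imagine", "beatles"], "John_Lennon"),
        (["town", "michigan", "terminus", "highway"], "Lennon,_Michigan"),
    ]

    counts = defaultdict(Counter)
    for words, entity in TRAIN:
        counts[entity].update(words)

    def disambiguate(context):
        """Pick the entity whose training contexts best match (add-one smoothing)."""
        def score(entity):
            c = counts[entity]
            total = sum(c.values()) + len(c)
            prod = 1.0
            for w in context:
                prod *= (c[w] + 1) / total
            return prod
        return max(counts, key=score)

    print(disambiguate(["the", "beatles", "song"]))      # John_Lennon
    print(disambiguate(["south", "of", "the", "town"]))  # Lennon,_Michigan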

  20. Experiments: Quality — Relevance
• 2 query sets
  • People associated with [american university] [100 queries]
  • Counties of [american state] [50 queries]
• Ground truth
  • Wikipedia has corresponding lists, e.g., List of Carnegie Mellon University People
• Precision and recall (results table omitted from this transcript; a computation sketch follows below)
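
For reference, the measurement itself is the standard set-based computation; the two sets below are made up.

    # Compare the entities the system returns for a query against the
    # corresponding Wikipedia list as ground truth.
    def precision_recall(retrieved, relevant):
        """Standard set-based precision and recall."""
        hits = len(retrieved & relevant)
        return hits / len(retrieved), hits / len(relevant)

    retrieved = {"allegheny", "butler", "erie", "fulton"}     # system output
    relevant = {"allegheny", "butler", "erie", "lancaster"}   # Wikipedia list

    p, r = precision_recall(retrieved, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75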

  21. Conclusions
• Semantic retrieval system ESTER
  • fast and scalable via reduction to prefix search and join
  • can handle all basic SPARQL queries
  • seamless integration with full-text search
  • standard user interface with (semantic) suggestions
• Lots of interesting and challenging open problems
  • simultaneous ranking of entities and documents
  • proper snippet generation and highlighting
  • search result quality
• Source: www.mpi-inf.mpg.de/~bast/slides/xxx.ppt
