
ESTER: Efficient Search on Text, Entities, and Relations

Holger Bast, Alexandru Chitea, Fabian Suchanek, Ingmar Weber
Presented by Krupakar Reddy Salguti

Keyword Search vs. Semantic Search
  • Keyword search
    • Query: john lennon
    • Answer: documents containing the words john and lennon
  • Semantic search
    • Query: musician
    • Answer: documents containing an instance of musician
  • Combined search
    • Query: beatles musician
    • Answer: documents containing the word beatles and an instance of musician
Semantic Search: Challenges + Our System

1. Entity recognition

  • approach 1: let users annotate (semantic web)
  • approach 2: annotate (semi-)automatically
  • our system: uses Wikipedia links + learns from them

2. Query Processing

  • build a space-efficient index
  • which enables fast query answers
  • our system: as compact and fast as a standard full-text engine

3. User Interface

  • easy to use
  • yet powerful query capabilities
  • our system: standard interface with interactive suggestions
In the Rest of this Talk …
  • Efficiency
    • three simple ideas (which all fail)
    • our approach (which works)
  • Queries supported
    • essentially all SPARQL queries, and
    • seamless integration with ordinary full-text search
  • Experiments
    • efficiency (great)
    • quality (not so great yet)
  • Conclusions
    • lots of interesting + challenging open problems
Efficiency: Simple Idea 1
  • Add “semantic tags” to the document
    • e.g., add the special word tag:musician before every occurrence of a musician in a document
  • Problem 1: Index blowup
    • e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)
  • Problem 2: Limited querying capabilities
    • e.g., could not produce list of musicians that occur in documents that also contain the word beatles
    • in particular, could not do all SPARQL queries (more on that later)
Efficiency: Simple Idea 2
  • Query Expansion
    • e.g., replace query word musician by disjunction

musician:aaron_copland OR … OR musician:zarah_leander

(7,593 musicians in Wikipedia)

  • Problem: Inefficient query processing
    • one intersection per element of the disjunction needed
Efficiency: Simple Idea 3
  • Use a database
    • map semantic queries to SQL queries on suitably constructed tables
    • that’s what the Artificial-Intelligence / Semantic-Web people usually do
  • Problem: Inefficient + Lack of control
    • building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both
    • very limited control regarding efficiency aspects
Efficiency: Our Approach
  • Two basic operations
    • prefix search of a special kind
    • join
  • An index data structure
    • which supports these two operations efficiently
  • Artificial words in the documents
    • such that a large class of semantic queries reduces to a combination of (few of) these operations
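Below is a minimal Python sketch of these two operations over a toy in-memory index. The names (prefix_search, join) and the data layout are illustrative assumptions for exposition, not ESTER's actual data structures (the real index is the HYB index described later).

    # Toy positional index: word -> list of (doc_id, position) postings.
    index = {}

    def prefix_search(prefix):
        """All (word, doc_id, position) triples for words starting with
        `prefix`; e.g. the prefix 'entity:' matches every entity word."""
        return [(w, d, p)
                for w, postings in index.items() if w.startswith(prefix)
                for d, p in postings]

    def join(left, right):
        """Join two prefix-search results on the matching word, i.e.
        keep the entities that appear in both result lists."""
        right_words = {w for w, _, _ in right}
        return sorted({w for w, _, _ in left if w in right_words})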
Processing the query “beatles musician”

[Slide illustration, reconstructed]

Document text, with an artificial entity word added after each recognized occurrence:

    … legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

Ontology facts as artificial words with positions (a special document for John Lennon):

    0 entity:john_lennon   1 relation:is_a   2 class:musician   2 class:singer   …

Two prefix queries:

    beatles entity:*                            →  entity:john_lennon, entity:1964, entity:liverpool, etc.
    entity:* . relation:is_a . class:musician   →  entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.

One join of the two result lists:

    entity:john_lennon, etc.
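As a self-contained toy run of this decomposition (the postings are made up to mirror the slide, and the helper name is illustrative):

    # Postings are (doc_id, position) pairs; doc 17 is the text document,
    # doc 900 the special ontology document for John Lennon.
    postings = {
        "beatles":            [(17, 8)],
        "entity:john_lennon": [(17, 5), (900, 0)],
        "entity:liverpool":   [(17, 12)],
        "relation:is_a":      [(900, 1)],
        "class:musician":     [(900, 2)],
    }

    def prefix(p):  # all (word, doc, pos) triples for words starting with p
        return [(w, d, q) for w, ps in postings.items() if w.startswith(p)
                for d, q in ps]

    # Prefix query 1: entities in documents that also contain "beatles".
    beatles_docs = {d for d, _ in postings["beatles"]}
    q1 = {w for w, d, _ in prefix("entity:") if d in beatles_docs}

    # Prefix query 2: entities with the fact "is_a musician", encoded by
    # adjacent positions: entity at p, relation at p+1, class at p+2.
    rel = set(postings["relation:is_a"])
    cls = set(postings["class:musician"])
    q2 = {w for w, d, p in prefix("entity:")
          if (d, p + 1) in rel and (d, p + 2) in cls}

    # One join: intersect the two entity lists.
    print(sorted(q1 & q2))  # -> ['entity:john_lennon']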

[Slide illustration, reconstructed: the same annotated document and ontology facts as before]

    … legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
    0 entity:john_lennon   1 relation:is_a   2 class:musician   2 class:singer   …

Prefix queries:

    beatles entity:*
    entity:* . relation:is_a . class:musician

  • Problem: entity:* has a huge number of occurrences
    • ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
    • prefix search efficient only for up to ≈ 1% (explanation follows)
  • Solution: frontier classes
    • classes at “appropriate” level in the hierarchy
    • e.g.: artist, believer, worker, vegetable, animal, …

[Slide illustration, reconstructed: annotation with frontier classes instead of entity:*]

Document text:

    … legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

Ontology facts for John Lennon:

    0 artist:john_lennon   0 believer:john_lennon   1 relation:is_a   2 class:musician   …

First figure out: musician → artist (easy)

Two prefix queries:

    beatles artist:*                            →  artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
    artist:* . relation:is_a . class:musician   →  artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.

One join:

    artist:john_lennon, etc.
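A minimal sketch of the frontier-class lookup, assuming a toy class hierarchy; the data and the function name are illustrative:

    # Toy class hierarchy (child -> parent) and a set of frontier classes.
    parent = {
        "singer": "musician",
        "musician": "artist",
        "artist": "person",
    }
    frontier = {"artist", "believer", "worker", "vegetable", "animal"}

    def frontier_class(cls):
        """Walk up the hierarchy until a frontier class is reached
        (assumes every class has a frontier ancestor)."""
        while cls not in frontier:
            cls = parent[cls]
        return cls

    print(frontier_class("musician"))  # -> 'artist'
    # The query then scans the much shorter list artist:* instead of entity:*.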

The HYB Index [Bast/Weber,SIGIR’06]
  • Maintains lists for word ranges (not words)

    e.g., the words ablaze, able, abnormal, abroad are all stored in one list for the word range abl-abt

  • Looks like this for person:*

    one list for the range person:*, containing person:graham_greene, person:john_lennon, person:ringo_starr, person:john_lennon, …


  • Provably efficient
    • no more space than an inverted index (on the same data)
    • each query = scan of a moderate number of (compressed) items
  • Extremely versatile
    • can do all kinds of things an inverted index cannot do (efficiently)
    • autocompletion, faceted search, query expansion, error correction, select and join, …
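A toy sketch of the HYB idea: postings are stored per word range (block), not per word, so a prefix query is answered by scanning the few blocks whose range can contain matches. The blocks and their boundaries below are made up; the real index chooses boundaries to balance space and scan cost.

    # Each block: (low_word, high_word, list of (word, doc_id) items).
    blocks = [
        ("ablaze", "abroad", [("ablaze", 3), ("able", 1),
                              ("abnormal", 7), ("abroad", 2)]),
        ("person:a", "person:z", [("person:graham_greene", 4),
                                  ("person:john_lennon", 1),
                                  ("person:john_lennon", 9),
                                  ("person:ringo_starr", 5)]),
    ]

    def prefix_query(prefix):
        """Scan every block whose word range overlaps the prefix range
        and filter the matching items: one sequential scan per block."""
        hits = []
        for low, high, items in blocks:
            if low <= prefix + "\uffff" and prefix <= high:
                hits += [(w, d) for w, d in items if w.startswith(prefix)]
        return hits

    print(prefix_query("person:"))  # all person:* postings, from one block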
Queries we can handle
  • We prove the following theorem:
    • Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations

SELECT ?who WHERE {
  ?who is_a Musician .
  ?who born_in_year ?when .
  John_Lennon born_in_year ?when
}

  • ESTER achieves seamless integration with full-text search
    • SPARQL has no means for dealing with full-text search
    • XQuery can handle full-text search, but is not really suitable for semantic search
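A schematic sketch of the reduction behind the theorem: each of the m triple patterns costs at most one prefix search plus one join, hence at most 2m operations. The operation strings below are illustrative, not ESTER's actual query plans.

    # The example query: one (subject, relation, object) triple per edge.
    query = [
        ("?who", "is_a", "Musician"),
        ("?who", "born_in_year", "?when"),
        ("John_Lennon", "born_in_year", "?when"),
    ]

    def plan(edges):
        """Emit a schematic operation list: one prefix search fetches the
        candidate bindings of an edge, one join merges them with the
        bindings accumulated from the previous edges."""
        ops = []
        for s, r, o in edges:
            ops.append(f"prefix-search: {s} . relation:{r.lower()} . {o}")
            ops.append(f"join: merge bindings on shared variables of ({s}, {o})")
        return ops

    for op in plan(query):
        print(op)  # 6 operations = 2m for the m = 3 edges of the example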
Experiments: Corpus, Ontology, Index
  • Corpus: English Wikipedia (XML dump from Nov. 2006)

≈ 8 GB raw XML

≈ 2.8 million documents

≈ 1 billion words

  • Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07)

≈ 2.5 million facts

derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)

  • Our Index

≈ 1.5 billion words (original + artificial)

≈ 3.3 GB total index size; ontology-only is a mere 100 MB

Experiments: Efficiency — What Baseline?
  • SPARQL engines
    • can’t do text search
    • and slow for ontology-only too (on Wikipedia: seconds)
  • XQuery engines
    • extremely slow for text search (on Wikipedia: minutes)
    • and slow for ontology-only too (on Wikipedia: seconds)
  • Other prototypes which do semantic + full-text search
    • efficiency is hardly considered
    • e.g., the system of Castells/Fernandez/Vallet (TKDE’07)

“… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …”

    • our system: ~100 ms on 2.8 million documents and an ontology with 2.5 million facts
Experiments: Efficiency — Stress Test 1
  • Compare to ontology-only system
    • the YAGO engine from WWW’07
    • Onto Simple: when was [person] born [1000 queries]
    • Onto Advanced: list all people from [profession] [1000 queries]
    • Onto Hard: when did people die who were born in the same year as [person] [1000 queries]
  • Note: comparison very unfair (for our system)

[Chart comparing the two systems' query times: 4 GB index vs. 100 MB index]

Experiments: Efficiency — Stress Test 2
  • Compare to text-only search engine
    • state-of-the-art system from SIGIR’06
    • Onto+Text Easy: counties in [US state] [50 queries]
    • Onto+Text Hard: computer scientists [nationality] [50 queries]
    • Full-text query: e.g., german computer scientists. Note: hardly finds relevant documents
  • Note: comparison extremely unfair (for our system)
Experiments: Quality — Entity Recognition
  • Use Wikipedia links as hints
    • “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …”
    • “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …”
  • Learn other links
    • use words in neighborhood as features
  • Accuracy
Experiments: Quality — Relevance
  • 2 Query Sets
    • People associated with [American university] [100 queries]
    • Counties of [American state] [50 queries]
  • Ground truth
    • Wikipedia has corresponding lists

e.g., List of Carnegie Mellon University people

  • Precision and Recall
Conclusions
  • Semantic Retrieval System ESTER
    • fast and scalable via reduction to prefix search and join
    • can handle all basic SPARQL queries
    • seamless integration with full-text search
    • standard user interface with (semantic) suggestions
  • Lots of interesting and challenging problems
    • simultaneous ranking of entities and documents
    • proper snippet generation and highlighting
    • search result quality

Source: www.mpi-inf.mpg.de/~bast/slides/xxx.ppt