Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar

Efficient Search in Semi-structured Data Spaces


General Approach

Model & Ranking (Probabilistic, Language Models, Authority, …)

Efficient algorithms

Evaluation of result quality

Evaluation of execution cost

Problem (Information need on some data collection)

MMCI Retreat, Braunshausen


Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left


Text Retrieval

Problem: Find the best documents d from a large collection that match a query {t1,…,tn}

Modeling and ranking: Define a score for documents

  • Importance of t in the collection (the less frequent, the better)

  • Importance of t for document d (the more frequent, the better)

  • Linear combination for query scores

tf(d,t): frequency of term t in doc d

df(t): #docs containing t
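
The scoring formula itself did not survive in this transcript. The following is a minimal sketch of one common way to combine the two criteria above, assuming the weighting tf(d,t) · log(N / df(t)) summed over the query terms. That weighting, the function names, and the toy documents are assumptions for illustration, not necessarily the formula used in the talk.

```python
import math
from collections import Counter

def build_stats(docs):
    """Return per-document term frequencies tf and document frequencies df."""
    tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    return tf, df

def score(d, query, tf, df, n_docs):
    """Sum over query terms of tf(d,t) * log(N / df(t)) (assumed weighting)."""
    total = 0.0
    for t in query:
        if df[t] > 0:
            total += tf[d][t] * math.log(n_docs / df[t])
    return total

# Toy collection (made up for this example).
docs = {
    "d1": "french pianist plays french music",
    "d2": "pianist biography",
    "d3": "french cooking",
}
tf, df = build_stats(docs)
query = ["french", "pianist"]
ranking = sorted(docs, key=lambda d: score(d, query, tf, df, len(docs)), reverse=True)
```

In this toy collection d1 ranks first because it matches both query terms and contains one of them twice.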


What about efficiency?

  • Cannot compute this from scratch for each query (>> 10^10 documents)

  • Solution:

    • Precompute per-term scores for each document

    • For each term, store list of (d,score(d,t)) on disk

    • When query arrives:

      • combine entries from lists

      • sort results

      • return top-k

        (merge-then-sort algorithm)
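
The merge-then-sort steps above can be sketched as follows. This is a minimal in-memory illustration; the dictionary `index` stands in for the per-term lists on disk, and all names and scores are made up for the example.

```python
import heapq
from collections import defaultdict

def merge_then_sort(index, query, k):
    """Read every list for the query terms completely, add up the
    per-document scores, and return the k best (doc, score) pairs."""
    totals = defaultdict(float)
    for term in query:
        for doc, s in index.get(term, []):
            totals[doc] += s
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Hypothetical precomputed lists of (doc, score(d,t)) per term.
index = {
    "t1": [("d1", 0.9), ("d2", 0.4), ("d3", 0.1)],
    "t2": [("d2", 0.8), ("d1", 0.2), ("d4", 0.1)],
}
top2 = merge_then_sort(index, ["t1", "t2"], 2)
```

Every entry of every list is read before the first result can be returned, which is the cost the following slides try to avoid.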


Family of Threshold Algorithms

Example index list (decreasing score): T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01

But:

Lists can be very long (millions of entries)

→ Simple merge-then-sort algorithm too expensive

Observation:

„Good“ results have high scores

  • Order lists by decreasing scores

  • Have „intelligent“ algorithm with different list access modes and early stopping
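
As an illustration of sorted access, random access, and early stopping, here is a minimal sketch of the basic Threshold Algorithm (TA) of Fagin, Lotem and Naor. It is the textbook variant, not one of the tuned variants evaluated in this talk; the in-memory dictionaries used for random access and the example lists are assumptions.

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one list per query term of (doc, score), sorted by descending score.
    Returns (top-k [(doc, score)], #sorted accesses, #random accesses)."""
    lookup = [dict(lst) for lst in lists]   # stands in for random access by doc
    seen = {}                               # doc -> fully aggregated score
    sa = ra = 0
    depth = 0
    max_depth = max(len(lst) for lst in lists)
    while depth < max_depth:
        last = []                           # score last read from each list
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                last.append(0.0)            # exhausted list contributes nothing
                continue
            doc, s = lst[depth]
            sa += 1
            last.append(s)
            if doc not in seen:
                total = s
                for j, table in enumerate(lookup):
                    if j != i:
                        ra += 1             # look up the doc in every other list
                        total += table.get(doc, 0.0)
                seen[doc] = total
        depth += 1
        threshold = sum(last)               # best score any unseen doc can reach
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            break                           # early stop: top-k cannot change
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1]), sa, ra

lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)],
    [("b", 0.9), ("a", 0.7), ("d", 0.2), ("c", 0.1)],
]
top, sa, ra = threshold_algorithm(lists, 1)
```

On this example the scan stops at depth 2 of 4, once the best score seen is at least the sum of the scores at the current scan positions.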


Experiments with TREC Benchmark

  • TREC Terabyte collection: ~24 million docs from .gov domain, ~420 GB (unpacked) size (we now have one with 10^9 docs, 5 TB compressed size)

  • 50 keyword queries from TREC Terabyte 2005

  • Performance measures:

    • Number of sequential and random accesses

    • Weighted cost: #SA + C · #RA

    • Wall-clock runtime
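
The weighted cost measure can be written down directly. In this small sketch the default C = 1000 is taken from the axis label of the cost plot in this deck; the function name and the access counts are made up.

```python
def weighted_cost(num_sa, num_ra, c=1000):
    """Abstract cost #SA + C * #RA, where C is the cost of one random
    access relative to one sequential (sorted) access."""
    return num_sa + c * num_ra

# Hypothetical access counts for two query executions.
scan_only = weighted_cost(num_sa=50_000, num_ra=0)
with_lookups = weighted_cost(num_sa=2_000, num_ra=30)
```

With C = 1000, a few dozen random accesses already outweigh thousands of sequential ones, so an algorithm has to spend random accesses carefully.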


Experiments: (TA and) CA on TREC

[Figure: two plots over k = 10, 50, 100, 200, 500. One shows average abstract cost (#SA + 1000 × #RA, axis up to 4,000,000), the other average wall-clock running time in milliseconds (axis up to 250). Series: merge-then-sort, State-of-the-art-1, State-of-the-art-2, OURS, and lower bound.]

  • Lower bound: for each query [VLDB06, with H. Bast]

    • compute top-k results R and final min-k

    • find minimum over all combinations of scan depths that see R

      • SA cost + RA cost for candidates with bestscore > min-k

    • considers blocks of entries for tractability

You can safely ignore this part


Beyond Exact Top-K Results

  • Improve performance by considering approximate results with probabilistic guarantees [VLDB04]

    • drop candidate when probability for being top-k result is < ε

    • estimate probabilities from per-list score distributions

    • reasonable improvement in performance (stop earlier)

    • probabilistic guarantee: E[relative recall @ k] = 1 − ε

  • Maximize result quality within fixed budget for execution cost (number of accesses, time) [ICDE09]

    • adaptive scheduling: initially prefer high scores, later high score drops

    • experimental results close to optimal (offline) results
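
The candidate-dropping rule can be illustrated with a small sketch. It assumes that the unseen score in each list a candidate has not yet been found in follows a discrete histogram, and that lists are independent; the histogram, the function names, and the numbers are hypothetical, and the estimator in the cited work may differ.

```python
from collections import defaultdict

def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete score variables."""
    out = defaultdict(float)
    for score_a, p_a in dist_a.items():
        for score_b, p_b in dist_b.items():
            out[round(score_a + score_b, 6)] += p_a * p_b
    return dict(out)

def prob_in_top_k(worstscore, missing_dists, min_k):
    """P(worstscore + sum of unseen scores > min_k), assuming independent lists."""
    total = {0.0: 1.0}
    for dist in missing_dists:
        total = convolve(total, dist)
    return sum(p for s, p in total.items() if worstscore + s > min_k)

def should_drop(worstscore, missing_dists, min_k, epsilon):
    """Drop the candidate when its chance of reaching the top-k is below epsilon."""
    return prob_in_top_k(worstscore, missing_dists, min_k) < epsilon

# Hypothetical histogram of the scores still unread in a list (score -> probability).
hist = {0.0: 0.9, 0.2: 0.08, 0.5: 0.02}
p_low = prob_in_top_k(0.6, [hist, hist], min_k=1.25)
p_high = prob_in_top_k(0.8, [hist, hist], min_k=1.25)
```

With ε = 0.01 the first candidate would be dropped and the second kept, so the scan can stop tracking unlikely candidates early at the price of a small expected loss in recall.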


Even More Heuristics: Proximity

  • Observation: [SPIRE07]

    „Good“ results have term matches close together

    → add second type of list: for each term pair, include documents with close occurrences of the terms, ordered by distance-based score

[Figure: example index lists in descending score order. Term lists TL(french) and TL(pianist) hold (document: score) entries such as A:9.3 and F:9.1; the combined list CL(french, pianist) holds B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7).]


Query Processing

Observation:

very small prefixes of the lists yield good results

  • Prune and reorganize index lists: keep a prefix of each score-ordered list and store it in ascending did (document id) order

  • Query processing: merge join over the pruned lists, then sort, then return top-k results

  • Parameters tuned through exhaustive search in the parameter space (4h on 80-core Hadoop cluster)

  • Resulting index approx. as large as the collection

[Figure: lists TL(french), TL(pianist), and CL(french, pianist), shown first in descending score order and then pruned and reordered by ascending did.]
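
The query-processing pipeline on this slide (prune and reorganize, merge join, sort, top-k) can be sketched as follows. For simplicity every list holds plain (did, score) pairs, whereas the combined lists on the slide carry score triples; the prefix length, the function names, and the data are assumptions for illustration.

```python
import heapq
from itertools import groupby

def prune_and_reorganize(score_sorted, prefix_len):
    """Keep the best prefix_len entries of a score-ordered list,
    stored by ascending document id."""
    return sorted(score_sorted[:prefix_len])

def merge_join_top_k(pruned_lists, k):
    """Merge the did-ordered lists in one pass, sum the scores per
    document, then sort and return the k best."""
    merged = heapq.merge(*pruned_lists)     # ascending did across all lists
    totals = [(did, sum(s for _, s in group))
              for did, group in groupby(merged, key=lambda entry: entry[0])]
    return heapq.nlargest(k, totals, key=lambda item: item[1])

# Hypothetical score-ordered lists of (did, score).
list_1 = [(7, 9.3), (3, 7.2), (5, 5.0), (2, 4.5)]
list_2 = [(6, 9.1), (2, 8.6), (7, 5.9), (4, 4.6)]
pruned = [prune_and_reorganize(lst, 3) for lst in (list_1, list_2)]
top2 = merge_join_top_k(pruned, 2)
```

Because the pruned lists are short and ordered by document id, the join is a single sequential pass without random accesses. The price is approximation: document 2 loses its entry from the first list here and therefore drops out of the top 2.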


Evaluation at INEX 2009

  • Standard benchmark for XML retrieval

  • 2.6 million XML documents with semantic annotation from YAGO

  • 113 human-defined queries, 75 come with list of relevant results

  • Explicit efficiency task


Runtime vs. Quality at INEX 2009


Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left


Querying Social Tagging Networks

[Figure: social tagging network in which users annotate items with tag sets such as {travel, norway}, {travel, vldb}, {travel, mexico}, {travel, trip}, {travel, icde}, {travel}, {harry, potter}, and {probability, data mining, foundations}.]


Information Need 1: Globally Popular

Query: harry potter

[Figure: the same social tagging network, asking which of two candidate items to return for the query.]

  • Most frequently tagged items are „best“

  • Tags by all users equally important


Information Need 2: Similar Users

Query: travel

[Figure: the same social tagging network, asking which of two candidate items to return for the query.]

  • Tags by users with similar tags/items („brothers in spirit“) are more important


Information Need 3: Trusted Friends

Query: probability

[Figure: the same social tagging network, extended by items tagged {probability, selling}, asking which of two candidate items to return for the query.]

  • Tags by closely related and well-known users are more important


Towards Social-Aware Social Search

  • Search results may depend on

    • Global popularity of items

    • Spiritual context of the querying user (users with similar books and/or tags)

    • Social context of the querying user (known and trusted friends)

    • Combinations

  • Users can have different importance („friendship strengths“) in different searches

    → importance of a user is a convex combination of the three weights (with params α, β)
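
A convex combination of the three weights can be sketched as below. The exact parameterization is not given in this transcript; the sketch assumes that α weights the social component, β the spiritual component, and 1 − α − β the global component. The function name and the sample weights are hypothetical.

```python
def user_importance(global_w, social_w, spiritual_w, alpha, beta):
    """Assumed form: alpha * social + beta * spiritual + (1 - alpha - beta) * global."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative with alpha + beta <= 1")
    return alpha * social_w + beta * spiritual_w + (1 - alpha - beta) * global_w

# Hypothetical weights of one user u' as seen from the querying user u.
pure_global = user_importance(0.1, 0.8, 0.3, alpha=0.0, beta=0.0)
mixed = user_importance(0.1, 0.8, 0.3, alpha=0.2, beta=0.5)
```

Setting α = β = 0 reproduces a purely global search, while α = 1 or β = 1 makes the search depend only on friends or only on similar users.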


Prototype [VLDB/SIGIR 2008 demo]

results of global search for „dragon“


Prototype [VLDB/SIGIR 2008 demo]

results of social search for „dragon“


Preliminary User Study

LibraryThing user study: [Data Engineering Bulletin, June 2008]

  • 6 LibraryThing users with reasonably large library and friend sets

  • 49 queries like „mystery magic“, „wizard“, „yakuza“

  • Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friend links

  • Measured NDCG[10] (weighted precision@10)

[Figure: NDCG[10] for varying α (social) and β (spiritual).]

  • Result quality generally very high

  • Combination of spiritual and social friends significantly better than pure global search


Algorithmic Overview

  • Input: query q={t1…tn} for user u, α, β

  • Output: k items with highest scores

[Figure: example with a user and the query „harry potter“.]


Can we reuse Threshold Algorithms here?

No, scores are specific to the querying user and parameter setting!

[Figure: one precomputed score list per combination of user, tag (harry, travel), and parameter setting, for example (α=0.2, β=0.5), (α=0.0, β=0.8), (α=1.0, β=0.0), (α=0.0, β=1.0), (α=0.5, β=0.5).]

Number of lists to precompute would explode! (#tags × #users × parameter space)


Top-K in Social Networks: ContextMerge

[SIGIR 2008]

Precomputed lists:

  • ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t)↓

  • USERITEMS(u′,t): pairs <i, tf_u′(i,t)>, unsorted

  • FRIENDS(u): pairs <u′, F(u,u′)>, sorted by F(u,u′)↓

[Figure: example lists ITEMS(harry), USERITEMS(u′, harry), and FRIENDS(u), annotated with the note „already exist in systems“.]
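
To show how the three precomputed lists fit together, here is a naive sketch that evaluates a user-specific score exhaustively at query time. It is not ContextMerge itself, which interleaves accesses to these lists and terminates early. The mixing formula (1 − α) · tf(i,t) + α · Σ F(u,u′) · tf_u′(i,t), the function name, and all data are assumptions for illustration.

```python
from collections import defaultdict

def social_scores(tag, user, items, useritems, friends, alpha):
    """Score each item as (1 - alpha) * tf(i,t) plus alpha times the sum over
    friends u' of F(u,u') * tf_u'(i,t). The mixing formula is an assumption."""
    scores = defaultdict(float)
    for item, tf in items.get(tag, []):                 # ITEMS(t)
        scores[item] += (1 - alpha) * tf
    for friend, strength in friends.get(user, []):      # FRIENDS(u)
        for item, tf_friend in useritems.get((friend, tag), []):  # USERITEMS(u',t)
            scores[item] += alpha * strength * tf_friend
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical lists.
items = {"harry": [("book1", 40), ("book2", 30)]}
useritems = {
    ("bob", "harry"): [("book2", 1)],
    ("carol", "harry"): [("book2", 1), ("book3", 1)],
}
friends = {"alice": [("bob", 0.6), ("carol", 0.4)]}
global_ranking = social_scores("harry", "alice", items, useritems, friends, alpha=0.0)
social_ranking = social_scores("harry", "alice", items, useritems, friends, alpha=1.0)
```

The two rankings differ: with α = 0 the globally most tagged item wins, with α = 1 the item tagged by both friends wins. A real system would also normalize the global and the friend-based frequencies to comparable scales, which this sketch omits.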


Experimental Evaluation: Efficiency

  • Testbed: 3 large crawls of real social networks

    • Flickr: 10 million pictures, ~50,000 users

    • Del.icio.us: ~175,000 bookmarks, ~12,000 users

    • LibraryThing: ~6.5 million books, ~10,000 users

  • Queries:

    • 150 frequent tag pairs

    • for each query pick user with „enough“ results & friends

  • Abstract cost measure ≈ disk load

  • Baseline: full merge + sort


Experimental Evaluation: Efficiency (β=0)

[Figure: results for varying α.]

2-8 times better than baseline


Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left


WisNetGrid: Semantic Search for D-Grid

  • D-Grid: German science grid providing computing and storage resources

  • Many topic-specific communities: Astro-, Text-, Medi-, Interlog-, Wiss- (Science-), Finance-, …

  • Two services missing so far (among others):

    • Integrated search over all data sources

    • Extraction of facts from data and fact-based search

  → WisNetGrid: BMBF project with 10 national partners


Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left


Whatever is left

  • Everlast: Distributed Web Archiving (with A. Anand, S. Bedathur, MPI-INF)

  • IR on knowledge graphs (with S. Elbassuoni, M. Ramanath, G. Weikum, MPI-INF)

  • Summarization of knowledge about entities (with M. Sydow, U Warsaw, PL)

  • Assessments for XML IR with Amazon Mechanical Turk (with O. Alonso, Microsoft, and M. Theobald, MPI-INF)

  • INEX Efficiency Task (with M. Theobald, MPI-INF, and A. Trotman, U Otago, NZ)
