Efficient Search in Semi-structured Data Spaces

Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar


General Approach

  • Problem (information need on some data collection)

  • Model & Ranking (probabilistic, language models, authority, …)

  • Efficient algorithms

  • Evaluation of result quality

  • Evaluation of execution cost

MMCI Retreat, Braunshausen


Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left

Text Retrieval

Problem: Find the best documents d from a large collection that match a query {t1,…,tn}

Modeling and ranking: Define a score for documents

  • Importance of t in the collection (the less frequent, the better)

  • Importance of t for document d (the more frequent, the better)

  • Linear combination for query scores

tf(d,t): frequency of tag t for doc d

df(t): #docs tagged with t
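
The scoring formula itself did not survive in this transcript. As a minimal sketch, assuming a plain tf·idf weighting with idf(t) = log(N / df(t)) (an assumption for illustration, not necessarily the model used in the talk), the per-term scores and their unweighted sum over the query could look like this. The helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def build_stats(docs):
    """docs: dict doc_id -> list of terms. Returns (tf, df, number of docs)."""
    tf = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())      # each doc counts once per distinct term
    return tf, df, len(docs)

def term_score(tf, df, n_docs, d, t):
    # Assumed weighting: tf(d,t) * log(N / df(t)); zero if t does not occur in d.
    if tf[d][t] == 0:
        return 0.0
    return tf[d][t] * math.log(n_docs / df[t])

def query_score(tf, df, n_docs, d, query):
    # Linear combination (here: unweighted sum) of the per-term scores.
    return sum(term_score(tf, df, n_docs, d, t) for t in query)

# Hypothetical toy collection.
docs = {
    "A": ["french", "pianist", "pianist"],
    "B": ["french", "composer"],
    "C": ["german", "pianist"],
}
tf, df, n = build_stats(docs)
query = ["french", "pianist"]
ranking = sorted(docs, key=lambda d: query_score(tf, df, n, d, query), reverse=True)
print(ranking[0])
```

Document A matches both terms and contains one of them twice, so it ranks first under this weighting.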

What about efficiency?

  • Cannot compute this from scratch for each query (>> 10^10 documents)

  • Solution:

    • Precompute per-term scores for each document

    • For each term, store list of (d,score(d,t)) on disk

    • When query arrives:

      • combine entries from lists

      • sort results

      • return top-k

        (merge-then-sort algorithm)
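
The merge-then-sort baseline from the list above can be sketched as follows. The in-memory dictionary stands in for the on-disk lists, and the index contents are hypothetical toy values.

```python
import heapq
from collections import defaultdict

def merge_then_sort(index, query, k):
    """index: dict term -> list of (doc_id, score) pairs.
    Reads every entry of every query-term list, then returns the k best (doc_id, total)."""
    totals = defaultdict(float)
    for term in query:
        for doc_id, score in index.get(term, []):
            totals[doc_id] += score            # combine entries from the lists
    # sort by total score and return the top-k
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Hypothetical precomputed per-term lists.
index = {
    "french":  [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)],
    "pianist": [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)],
}
top2 = merge_then_sort(index, ["french", "pianist"], k=2)
print([doc for doc, _ in top2])
```

Every list is read completely regardless of k, which is what makes this baseline expensive for long lists.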

Family of Threshold Algorithms

[Figure: example index list in decreasing score order: T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01]

But:

Lists can be very long (millions of entries)

Simple merge-then-sort algorithm is therefore too expensive

Observation:

„Good“ results have high scores

  • Order lists by decreasing scores

  • Have an „intelligent“ algorithm with different list access modes and early stopping
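
One well-known member of this family is Fagin's Threshold Algorithm (TA). The sketch below is a textbook-style version, not the specific variant developed by the authors: it reads the score-ordered lists round-robin (sorted accesses), looks up missing scores directly (random accesses), and stops once the k-th best total reaches the threshold. The list contents are hypothetical.

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one list of (doc_id, score) per query term, sorted by descending score.
    Returns (top-k [(doc_id, total)], #sorted accesses, #random accesses)."""
    lookup = [dict(lst) for lst in lists]      # random-access structure: doc_id -> score
    seen = set()
    top = []                                   # min-heap of (total, doc_id), at most k entries
    sa = ra = 0
    depth = 0
    max_depth = max(len(lst) for lst in lists)
    while depth < max_depth:
        last = []                              # score at the current scan position of each list
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                last.append(0.0)               # exhausted list: unseen docs score 0 here
                continue
            doc, score = lst[depth]
            sa += 1
            last.append(score)
            if doc in seen:
                continue
            seen.add(doc)
            total = score
            for j, other in enumerate(lookup):
                if j != i:
                    ra += 1                    # random access for the missing score
                    total += other.get(doc, 0.0)
            if len(top) < k:
                heapq.heappush(top, (total, doc))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, doc))
        depth += 1
        threshold = sum(last)                  # best possible total of any unseen doc
        if len(top) == k and top[0][0] >= threshold:
            break                              # early stopping
    result = sorted(top, reverse=True)
    return [(doc, total) for total, doc in result], sa, ra

tl_a = [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)]
tl_b = [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)]
top, sa, ra = threshold_algorithm([tl_a, tl_b], k=1)
print(top[0][0], sa, ra)
```

On this toy input the scan stops after 6 of the 8 list entries, with document A as the top result. The trade-off is the extra random accesses, which is why the cost measures in the experiments weight them separately.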

Experiments with TREC Benchmark

  • TREC Terabyte collection: ~24 million docs from the .gov domain, ~420 GB (unpacked) size (we now have one with 10^9 docs, 5 TB compressed size)

  • 50 keyword queries from TREC Terabyte 2005

  • Performance measures:

    • Number of sequential and random accesses

    • Weighted cost: #SA + C · #RA

    • Wall-clock runtime

Experiments: (TA and) CA on TREC

[Figure: two plots over k = 10, 50, 100, 200, 500. Panel „average abstract cost“: y-axis average cost (#SA + 1000 × #RA). Panel „average wallclock runtime“: y-axis average running time (milliseconds). Curves shown: merge-then-sort, State-of-the-art-1, State-of-the-art-2, OURS, and the lower bound.]

  • Lower bound: for each query [VLDB06, with H. Bast]

    • compute top-k results R and final min-k

    • find minimum over all combinations of scan depths that see R

      • SA cost + RA cost for candidates with bestscore > min-k

    • considers blocks of entries for tractability

You can safely ignore this part

Beyond Exact Top-K Results

  • Improve performance by considering approximate results with probabilistic guarantees

    • drop candidate when probability for being top-k result is <ε

    • estimate probabilities from per-list score distributions

    • reasonable improvement in performance (stop earlier)

    • probabilistic guarantee: E[relative recall @ k] = 1 - ε

  • Maximize result quality within fixed budget for execution cost (number of accesses, time)

    • adaptive scheduling: initially prefer high scores, later high score drops

    • Experimental results close to optimal (offline) results
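
The pruning rule in the first bullet group (drop a candidate when its probability of reaching the top-k is below ε) can be illustrated with a small sketch. It assumes that per-list score distributions are available as discrete histograms and that scores in different lists are independent. Both are simplifying assumptions made here, and the estimators in [VLDB04] may differ. All function names and numbers are hypothetical.

```python
from collections import defaultdict

def truncate(pmf, high):
    """Condition a discrete score distribution {score: prob} on score <= high,
    where high is the score at the current scan position of that list."""
    kept = {s: p for s, p in pmf.items() if s <= high}
    mass = sum(kept.values())
    return {s: p / mass for s, p in kept.items()} if mass > 0 else {0.0: 1.0}

def convolve(a, b):
    """Distribution of the sum of two independent discrete scores."""
    out = defaultdict(float)
    for sa, pa in a.items():
        for sb, pb in b.items():
            out[sa + sb] += pa * pb
    return dict(out)

def prob_top_k(worst, missing, min_k):
    """P(worst + scores still to come > min_k).
    worst: score collected so far; missing: (pmf, high) per list not yet seen for the candidate."""
    total = {0.0: 1.0}
    for pmf, high in missing:
        total = convolve(total, truncate(pmf, high))
    return sum(p for s, p in total.items() if worst + s > min_k)

def should_drop(worst, missing, min_k, epsilon):
    return prob_top_k(worst, missing, min_k) < epsilon

# Hypothetical histogram: most documents score 0 in this list.
pmf = {0.0: 0.9, 0.5: 0.08, 1.0: 0.02}
p = prob_top_k(worst=0.6, missing=[(pmf, 0.5)], min_k=1.0)
print(round(p, 4), should_drop(0.6, [(pmf, 0.5)], 1.0, epsilon=0.1))
```

With larger ε more candidates are dropped earlier, trading result quality for fewer accesses.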

[VLDB04]

[ICDE09]

Even More Heuristics: Proximity

  • Observation: [SPIRE07]

    „Good“ results have term matches close together

  • Consequence: add a second type of list: for each term pair, include documents with close occurrences of the terms, ordered by distance-based score

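[Figure: example index lists in descending score order. The two term lists TL(pianist) and TL(french) hold the entries A:9.3, T:7.2, E:5.0, B:4.5 in one list and F:9.1, B:8.6, A:5.9, D:4.6 in the other. The combined list CL(french, pianist) holds B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7).]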
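
How a term-pair list could be built is not spelled out on the slide. The following sketch assumes positional postings (word offsets per term and document) and a distance-based score that sums 1/d² over occurrence pairs within a window. The scoring function, the window size, and all names are assumptions for illustration.

```python
def proximity_score(pos_a, pos_b, window=10):
    """Assumed distance-based score: sum of 1/d^2 over occurrence pairs at distance d <= window."""
    score = 0.0
    for pa in pos_a:
        for pb in pos_b:
            d = abs(pa - pb)
            if 0 < d <= window:
                score += 1.0 / (d * d)
    return score

def build_combined_list(positions, term_a, term_b, window=10):
    """positions: dict doc_id -> dict term -> list of word offsets.
    Returns [(doc_id, score)] for docs with close co-occurrences, in descending score order."""
    entries = []
    for doc_id, by_term in positions.items():
        s = proximity_score(by_term.get(term_a, []), by_term.get(term_b, []), window)
        if s > 0:
            entries.append((doc_id, s))
    entries.sort(key=lambda e: e[1], reverse=True)
    return entries

# Hypothetical positional postings.
positions = {
    "B": {"french": [3], "pianist": [4]},        # adjacent terms
    "F": {"french": [1], "pianist": [5, 40]},    # one close pair, one far pair
    "G": {"french": [2]},                        # only one of the terms
    "T": {"french": [0], "pianist": [50]},       # terms too far apart
}
cl = build_combined_list(positions, "french", "pianist")
print(cl)
```

Documents without a close co-occurrence never enter the pair list, which keeps it much shorter than the term lists.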

Query Processing

Observation:

very small prefixes of the lists yield good results

[Figure: the score-ordered lists TL(french), TL(pianist) and CL(french, pianist) are pruned to short prefixes and reorganized in ascending document-id (did) order; a merge join over the reorganized lists followed by a sort yields the top-k results.]

  • Parameters tuned through exhaustive search in the parameter space (4h on 80-core Hadoop cluster)

  • Resulting index approx. as large as the collection
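
The prune, reorganize, merge-join, sort pipeline can be sketched as follows. The fixed prefix length is a hypothetical stand-in for the tuned parameters, the combined list is reduced to a single proximity score per document, and the list contents are toy values.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def prune_and_reorganize(score_sorted, prefix_len):
    """Keep the prefix_len best entries and store them in ascending doc-id order."""
    return sorted(score_sorted[:prefix_len], key=itemgetter(0))

def merge_join_top_k(pruned_lists, k):
    """pruned_lists: lists of (doc_id, score) in ascending doc-id order."""
    merged = heapq.merge(*pruned_lists, key=itemgetter(0))   # one sequential pass over all lists
    totals = [(doc, sum(score for _, score in group))
              for doc, group in groupby(merged, key=itemgetter(0))]
    return heapq.nlargest(k, totals, key=itemgetter(1))      # final sort for the top-k

# Hypothetical score-ordered lists.
tl_1 = [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)]
tl_2 = [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)]
cl = [("B", 3.0), ("F", 0.7), ("T", 0.5), ("G", 0.2)]

pruned = [prune_and_reorganize(lst, prefix_len=3) for lst in (tl_1, tl_2, cl)]
top2 = merge_join_top_k(pruned, k=2)
print([doc for doc, _ in top2])
```

Because entries beyond the prefix are cut off (here B's entry in the second list), the totals are approximate. The observation above is what makes this acceptable: short prefixes already contain most of the good results.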


Evaluation at INEX 2009

  • Standard benchmark for XML retrieval

  • 2.6 million XML documents with semantic annotation from YAGO

  • 113 human-defined queries, 75 come with list of relevant results

  • Explicit efficiency task

Runtime vs. Quality at INEX 2009

Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left

Querying Social Tagging Networks

[Figure: a social tagging network. Users annotate items with tags such as „travel“, „norway“, „vldb“, „icde“, „mexico“, „trip“, „harry“, „potter“, „probability“, „data mining“ and „foundations“.]

Information Need 1: Globally Popular

[Figure: the tagging network with the query „harry potter“ and the question which of the tagged items to return („or ?“).]

Most frequently tagged items are „best“; tags by all users are equally important

Information Need 2: Similar Users

[Figure: the tagging network with the query „travel“ and the question which of the tagged items to return („or ?“).]

Tags by users with similar tags/items („brothers in spirit“) are more important

Information Need 3: Trusted Friends

[Figure: the tagging network, extended by items tagged „probability“ and „selling“, with the query „probability“ and the question which of the tagged items to return („or ?“).]

Tags by closely related and well-known users are more important

Towards Social-Aware Social Search

  • Search results may depend on

    • Global popularity of items

    • Spiritual context of the querying user (users with similar books and/or tags)

    • Social context of the querying user (known and trusted friends)

    • Combinations

  • Users can have different importance („friendship strengths“) in different searches: the importance of a user is a convex combination of the three weights (with params α, β)
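
A minimal sketch of such a convex combination, assuming α weights the social strength, β the spiritual strength, and the remaining 1 - α - β a uniform global weight of 1/#users. The uniform global weight, the summed scoring function, and all names are assumptions for illustration.

```python
def friendship(u, v, social, spiritual, n_users, alpha, beta):
    """Convex combination of three weights; requires alpha, beta >= 0 and alpha + beta <= 1.
    social / spiritual: dict (u, v) -> strength in [0, 1]; global weight assumed uniform."""
    assert 0 <= alpha and 0 <= beta and alpha + beta <= 1
    return (alpha * social.get((u, v), 0.0)
            + beta * spiritual.get((u, v), 0.0)
            + (1 - alpha - beta) / n_users)

def social_score(u, item, tag, user_tf, users, social, spiritual, alpha, beta):
    """Assumed scoring: sum over users v of F(u, v) * tf_v(item, tag)."""
    return sum(friendship(u, v, social, spiritual, len(users), alpha, beta)
               * user_tf.get((v, item, tag), 0)
               for v in users)

# Hypothetical network: v is a friend of u, w has a similar library to u.
users = ["u", "v", "w"]
social = {("u", "v"): 1.0}
spiritual = {("u", "w"): 1.0}
user_tf = {("v", "book1", "harry"): 1, ("w", "book2", "harry"): 1}

for alpha, beta in [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]:
    scores = {b: social_score("u", b, "harry", user_tf, users, social, spiritual, alpha, beta)
              for b in ("book1", "book2")}
    print(alpha, beta, scores)
```

With α=1 only the friend's book scores, with β=1 only the book of the similar user, and with α=β=0 both books score the same (pure global popularity).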

Prototype [VLDB/SIGIR 2008 demo]

results of global search for „dragon“

Prototype [VLDB/SIGIR 2008 demo]

results of social search for „dragon“

Preliminary User Study

LibraryThing user study: [Data Engineering Bulletin, June 2008]

  • 6 LibraryThing users with reasonably large library and friend sets

  • 49 queries like „mystery magic“, „wizard“, „yakuza“

  • Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friend links

  • Measured NDCG[10] (weighted precision@10)

[Figure: results for different settings of α (social) and β (spiritual)]

  • Result quality generally very high

  • Combination of spiritual and social friends significantly better than pure global search
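
For readers unfamiliar with the measure, NDCG at a cutoff can be computed as below. This uses one common formulation (linear gains, log2 rank discount). The exact variant used in the study is not stated in the slides, so treat the details as assumptions.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain of the first k results (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """gains: graded relevance of the returned results, in rank order.
    The ideal ranking is taken over the same judged results (a simplifying assumption)."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

print(ndcg([2, 1, 0]), ndcg([0, 2]))
```

A ranking that lists the most relevant results first scores 1.0. Pushing a relevant result down the list lowers the value according to the rank discount.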

Algorithmic Overview

  • Input: query q={t1,…,tn} for user u, parameters α, β

  • Output: k items with highest scores

[Figure: example input, a user plus the query „harry potter“]

Can we reuse Threshold Algorithms here?

No, scores are specific to the querying user and parameter setting!

[Figure: precomputed score lists for the tags „harry“ and „travel“. Each combination of user and parameter setting needs its own list with its own scores, e.g. harry (α=0.2, β=0.5), harry (α=0.0, β=0.8), harry (α=1.0, β=0.0), harry (α=0.0, β=1.0), harry (α=0.5, β=0.5) for each of several users.]

Number of lists to precompute would explode! (#tags × #users × parameter space)

Top-K in Social Networks: ContextMerge

[SIGIR 2008]

Precomputed lists:

  • ITEMS(t): pairs <i,tf(i,t)>, sorted by tf(i,t)↓

  • USERITEMS(u‘,t): pairs <i,tf_u‘(i,t)>, unsorted

  • FRIENDS(u): pairs <u‘,F(u,u‘)>, sorted by F(u,u‘)↓
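
To show how these three lists can support early stopping, here is a heavily simplified sketch in the spirit of ContextMerge. It is not the published algorithm: it handles a single tag, scores an item only by the sum of F(u,u‘) · tf_u‘(i,t) over friends, ignores the α/β mixing, and assumes tf(i,t) in ITEMS(t) is the sum of all per-user frequencies. Friends are processed by descending strength, and the remaining tag mass of an item bounds what later friends can still add.

```python
import heapq

def context_merge_sketch(items, user_items, friends, k):
    """items: [(item, tf)] sorted by tf descending (tf = sum of all per-user tfs)
    user_items: dict friend -> [(item, tf_friend)], unsorted
    friends: [(friend, strength)] sorted by strength descending
    Returns (top-k [(item, score lower bound)], number of friends processed)."""
    global_tf = dict(items)
    worst = {}                        # item -> score accumulated from processed friends
    seen_tf = {}                      # item -> tf mass already attributed to processed friends
    processed = 0
    for idx, (friend, strength) in enumerate(friends):
        for item, tf in user_items.get(friend, []):
            worst[item] = worst.get(item, 0.0) + strength * tf
            seen_tf[item] = seen_tf.get(item, 0) + tf
        processed += 1
        next_strength = friends[idx + 1][1] if idx + 1 < len(friends) else 0.0
        top = heapq.nlargest(k, worst.items(), key=lambda e: e[1])
        if len(top) < k:
            continue
        min_k = top[-1][1]
        in_top = {item for item, _ in top}
        # Touched items: the remaining tf mass can at best come from the next friend.
        best_touched = max((worst[i] + next_strength * (global_tf[i] - seen_tf[i])
                            for i in worst if i not in in_top), default=0.0)
        # Untouched items: bounded by the first untouched entry of ITEMS (sorted by tf).
        best_untouched = next((next_strength * tf for i, tf in items if i not in worst), 0.0)
        if min_k >= max(best_touched, best_untouched):
            break                     # no other item can still overtake the current top-k
    return heapq.nlargest(k, worst.items(), key=lambda e: e[1]), processed

# Hypothetical lists for the tag "harry".
items = [("b3", 2), ("b1", 2), ("b2", 1)]
user_items = {"v": [("b1", 1), ("b2", 1)], "w": [("b1", 1)], "x": [("b3", 1)]}
friends = [("v", 0.12), ("w", 0.10), ("x", 0.085)]
top, processed = context_merge_sketch(items, user_items, friends, k=1)
print(top[0][0], processed)
```

On this toy input the top-1 item is fixed after two of the three friends. No list specific to the user or the parameter setting has to be precomputed, which is the point of this design.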

[Figure: example lists. ITEMS(harry) with tf values 32, 26, 47; USERITEMS(u‘, harry) with value 1; FRIENDS(u) with strengths 0.085, 0.12, 0.10. Annotation: „already exist in systems“.]

Experimental Evaluation: Efficiency

  • Testbed: 3 large crawls of real social networks

    • Flickr: 10 million pictures, ~50,000 users

    • Del.icio.us: ~175,000 bookmarks, ~12,000 users

    • LibraryThing: ~6.5 million books, ~10,000 users

  • Queries:

    • 150 frequent tag pairs

    • for each query pick user with „enough“ results & friends

  • Abstract cost measure ≈ disk load

  • Baseline: full merge + sort

Experimental Evaluation: Efficiency (β=0)

2-8 times better than baseline

[Figure: results plotted over α]

Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left

WisNetGrid: Semantic Search for D-Grid

  • D-Grid: German science grid providing computing and storage resources

  • Many topic-specific communities: Astro-, Text-, Medi-, Interlog-, Wiss- (Science-), Finance-, …

  • Two services missing so far (among others):

    • Integrated search over all data sources

    • Extraction of facts from data and fact-based search

      WisNetGrid: BMBF project with 10 national partners

Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left

Whatever is left

  • Everlast: Distributed Web Archiving (with A. Anand, S. Bedathur, MPI-INF)

  • IR on knowledge graphs (with S. Elbassuoni, M. Ramanath, G. Weikum, MPI-INF)

  • Summarization of knowledge about entities (with M. Sydow, U Warsaw, PL)

  • Assessments for XML IR with Amazon Mechanical Turk (with O. Alonso, Microsoft, and M. Theobald, MPI-INF)

  • INEX Efficiency Task (with M. Theobald, MPI-INF, and A. Trotman, U Otago, NZ)
