Efficient Search in Semi-structured Data Spaces

Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar


General Approach

Model & Ranking (Probabilistic, Language Models, Authority, …)

Efficient algorithms

Evaluation of result quality

Evaluation of execution cost

Problem (Information need on some data collection)

MMCI Retreat, Braunshausen


Selected Projects

  • Efficient Information Retrieval

  • Social Networks

  • Distributed Knowledge Management

  • Whatever is left


Text Retrieval

Problem: Find the best documents d from a large collection that match a query {t1,…,tn}

Modeling and ranking: Define score for documents

Importance of t in the collection (the less frequent, the better)

Importance of t for document d (the more frequent, the better)

Linear combination for query scores

tf(d,t): frequency of tag t for doc d

df(t): #docs tagged with t
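
The scoring formula itself did not survive on this slide. As a minimal sketch of the idea, assuming a plain tf·idf weighting (the logarithmic form, the function names, and the unweighted sum are illustrative assumptions, not the authors' exact model), the two importance factors and their linear combination could look like this:

```python
import math

def term_score(tf: int, df: int, num_docs: int) -> float:
    """Per-term score: grows with tf(d,t), shrinks with df(t).

    Assumption: tf * log(N / df); the original slide's formula is not available.
    """
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(num_docs / df)

def query_score(doc_tf: dict, df: dict, num_docs: int, query: list) -> float:
    """Query score as an unweighted linear combination of per-term scores."""
    return sum(term_score(doc_tf.get(t, 0), df.get(t, 0), num_docs) for t in query)

# Hypothetical document and collection statistics.
doc_tf = {"pianist": 2, "french": 1}
df = {"pianist": 10, "french": 100}
print(query_score(doc_tf, df, 1000, ["french", "pianist"]))
```

With equal tf, the rarer term ("pianist" here) contributes more to the score than the frequent one.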


What about efficiency?

  • Cannot compute this from scratch for each query (>> 10^10 documents)

  • Solution:

    • Precompute per-term scores for each document

    • For each term, store list of (d,score(d,t)) on disk

    • When query arrives:

      • combine entries from lists

      • sort results

      • return top-k

        (merge-then-sort algorithm)
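
The merge-then-sort steps above can be sketched in a few lines. This is an in-memory illustration only: the `index` dictionary stands in for the on-disk lists, and its contents are hypothetical.

```python
from collections import defaultdict

def merge_then_sort(index, query, k):
    """Read every entry of every query term's list, then sort and cut at k."""
    totals = defaultdict(float)
    for term in query:
        for doc, score in index.get(term, []):   # combine entries from lists
            totals[doc] += score
    # sort results by descending score (doc id breaks ties), return top-k
    ranked = sorted(totals.items(), key=lambda e: (-e[1], e[0]))
    return ranked[:k]

# Hypothetical precomputed lists of (d, score(d,t)) per term.
index = {
    "french": [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)],
    "pianist": [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)],
}
print(merge_then_sort(index, ["french", "pianist"], 2))
```

Note that every list is read to the end regardless of k, which is the cost the following slides try to avoid.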


Family of Threshold Algorithms

Example index list, in order of decreasing score: T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01

But:

Lists can be very long (millions of entries)

→ Simple merge-then-sort algorithm too expensive

Observation:

"Good" results have high scores

  • Order lists by decreasing scores

  • Have "intelligent" algorithm with different list access modes and early stopping
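
To make "different list access modes and early stopping" concrete, here is a minimal sketch of the classic Threshold Algorithm (TA) with summation as the aggregation function. It is one textbook member of the family, not necessarily the variant evaluated in this talk; the in-memory dictionaries that simulate random accesses are an assumption for illustration.

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one list per term of (doc, score) pairs, sorted by decreasing score.

    Returns (top-k list, scan depth at which the algorithm stopped).
    """
    lookup = [dict(lst) for lst in lists]      # simulates random access by doc id
    best = {}                                  # doc -> full aggregated score
    max_len = max((len(lst) for lst in lists), default=0)
    depth = 0
    while depth < max_len:
        threshold = 0.0
        for lst in lists:
            if depth < len(lst):
                doc, score = lst[depth]        # sorted access
                threshold += score
                if doc not in best:            # random accesses for the full score
                    best[doc] = sum(m.get(doc, 0.0) for m in lookup)
        depth += 1
        top = heapq.nlargest(k, best.items(), key=lambda e: e[1])
        # Early stopping: no unseen document can score above the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, best.items(), key=lambda e: e[1]), depth

french = [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)]
pianist = [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)]
print(threshold_algorithm([french, pianist], 1))
```

On these hypothetical lists the top-1 result is confirmed after scanning three of the four entries per list, because the sum of the scores at the current scan positions has dropped below the best full score seen so far.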


Experiments with TREC Benchmark

  • TREC Terabyte collection: ~24 million docs from .gov domain, ~420 GB (unpacked) size (we now have one with 10^9 docs, 5 TB compressed size)

  • 50 keyword queries from TREC Terabyte 2005

  • Performance measures:

    • Number of sequential and random accesses

    • Weighted cost: #SA + C · #RA

    • Wall-clock runtime
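
As a small worked example of the weighted cost measure, the sketch below uses C = 1000, the value that appears on the cost axis of the plots later in the talk; the access counts are hypothetical.

```python
def weighted_cost(num_sa: int, num_ra: int, c: int = 1000) -> int:
    """Abstract cost: #SA + C * #RA (one random access costs as much as C sequential ones)."""
    return num_sa + c * num_ra

# A run with few random accesses can beat a run with far fewer sequential accesses.
print(weighted_cost(5000, 3))    # 5000 + 1000 * 3
print(weighted_cost(1000, 20))   # 1000 + 1000 * 20
```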


Experiments: (TA and) CA on TREC

[Figure: two plots over k = 10, 50, 100, 200, 500. Panel "average abstract cost": y-axis average cost (#SA + 1000 x #RA), scale up to 4,000,000. Panel "average wallclock runtime": y-axis average running time (milliseconds), scale up to 250. Series shown: merge-then-sort, State-of-the-art-1, State-of-the-art-2, OURS, lower bound.]

  • Lower bound: for each query [VLDB06, with H. Bast]

    • compute top-k results R and final min-k

    • find minimum over all combinations of scan depths that see R

      • SA cost + RA cost for candidates with bestscore > min-k

    • considers blocks of entries for tractability

You can safely ignore this part


Beyond Exact Top-K Results

  • Improve performance by considering approximate results with probabilistic guarantees [VLDB04]

    • drop candidate when probability for being top-k result is < ε

    • estimate probabilities from per-list score distributions

    • reasonable improvement in performance (stop earlier)

    • probabilistic guarantee: E[relative recall @ k] = 1 - ε

  • Maximize result quality within fixed budget for execution cost (number of accesses, time) [ICDE09]

    • adaptive scheduling: initially prefer high scores, later high score drops

    • Experimental results close to optimal (offline) results
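
The candidate-dropping rule can be illustrated with a rough sketch. It assumes, purely for illustration, that the unseen score in each remaining list is uniformly distributed between 0 and that list's current high score, and it estimates the probability by seeded Monte Carlo sampling; the talk only says that probabilities are estimated from per-list score distributions, so both choices and all names here are hypothetical.

```python
import random

def prob_reaches_topk(worstscore, highs, min_k, samples=10_000, seed=0):
    """Estimate P(worstscore + unseen scores > min_k).

    highs: current upper bounds of the lists in which the candidate is still unseen.
    Assumption: each unseen score is uniform on [0, high].
    """
    rng = random.Random(seed)
    gap = min_k - worstscore
    hits = sum(
        1 for _ in range(samples)
        if sum(rng.uniform(0.0, h) for h in highs) > gap
    )
    return hits / samples

def should_drop(worstscore, highs, min_k, eps):
    """Drop the candidate when its estimated top-k probability falls below eps."""
    return prob_reaches_topk(worstscore, highs, min_k) < eps

print(should_drop(1.0, [0.5, 0.5], 3.0, eps=0.1))  # cannot close a gap of 2.0
print(should_drop(2.9, [0.5, 0.5], 3.0, eps=0.1))  # almost certainly closes a gap of 0.1
```

A larger eps prunes more aggressively and lets the scan stop earlier, at the price of lower expected recall.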


Even More Heuristics: Proximity

  • Observation: [SPIRE07]

    "Good" results have term matches close together

    → add second type of list: for each term pair, include documents with close occurrences of the terms, ordered by distance-based score

Example index lists (each in descending score order):

TL(french): A:9.3, T:7.2, E:5.0, B:4.5

TL(pianist): F:9.1, B:8.6, A:5.9, D:4.6

CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7)
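
A distance-based score for a term pair can be sketched as follows. The inverse-square weighting and the `max_dist` window are assumptions chosen for illustration; the slide does not give the exact scoring function.

```python
def proximity_score(pos_a, pos_b, max_dist=10):
    """Sum 1/dist^2 over all occurrence pairs of two terms within max_dist positions.

    pos_a, pos_b: word positions of the two terms inside one document.
    """
    score = 0.0
    for pa in pos_a:
        for pb in pos_b:
            dist = abs(pa - pb)
            if 0 < dist <= max_dist:
                score += 1.0 / dist ** 2
    return score

def build_combined_list(doc_positions, max_dist=10):
    """Build a pair list: (doc, proximity score) sorted by descending score."""
    entries = [
        (doc, proximity_score(a, b, max_dist)) for doc, (a, b) in doc_positions.items()
    ]
    return sorted((e for e in entries if e[1] > 0), key=lambda e: -e[1])

# Hypothetical positions of "french" and "pianist" in three documents.
docs = {"B": ([4], [5]), "F": ([2], [5]), "X": ([1], [90])}
print(build_combined_list(docs))
```

Documents whose occurrences are far apart (such as "X" here) never enter the pair list, which keeps that list short.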


Query Processing

Observation:

very small prefixes of the lists yield good results

Processing pipeline: prune and reorganize index lists → merge join → sort → top-k results

Index lists in descending score order:

TL(french): A:9.3, T:7.2, E:5.0, B:4.5

TL(pianist): F:9.1, B:8.6, A:5.9, D:4.6

CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7)

Pruned list prefixes, reorganized in ascending did order:

TL(french): A:9.3, B:4.5, E:5.0, T:7.2

TL(pianist): A:5.9, B:8.6, D:4.6, F:9.1

CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), G:(0.2,2.0,1.7), T:(0.5,3.0,7.2)

  • Parameters tuned through exhaustive search in the parameter space (4h on 80-core Hadoop cluster)

  • Resulting index approx. as large as the collection
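
The merge join over did-ordered list prefixes can be sketched as below. Two points are assumptions: the CL tuples are read as (proximity score, pianist score, french score), which is inferred from the matching values in the example lists, and the final score is taken to be the plain sum of both term scores and the proximity score.

```python
import heapq
from itertools import groupby

def merge_join_topk(tl_french, tl_pianist, cl, k):
    """All inputs are sorted by ascending document id (did).

    tl_*: [(did, term score)], cl: [(did, (proximity, pianist score, french score))].
    """
    tagged = [
        [(did, "french", s) for did, s in tl_french],
        [(did, "pianist", s) for did, s in tl_pianist],
        [(did, "pair", t) for did, t in cl],
    ]
    stream = heapq.merge(*tagged, key=lambda e: e[0])    # one pass, did order
    results = []
    for did, group in groupby(stream, key=lambda e: e[0]):
        french = pianist = None
        prox = 0.0
        for _, source, payload in group:
            if source == "french":
                french = payload
            elif source == "pianist":
                pianist = payload
            else:
                prox, cl_pianist, cl_french = payload
                # Term scores stored in the pair list fill in for pruned TL entries.
                french = cl_french if french is None else french
                pianist = cl_pianist if pianist is None else pianist
        results.append((did, (french or 0.0) + (pianist or 0.0) + prox))
    return heapq.nlargest(k, results, key=lambda e: e[1])  # sort step, cut at k

tl_french = [("A", 9.3), ("B", 4.5), ("E", 5.0), ("T", 7.2)]
tl_pianist = [("A", 5.9), ("B", 8.6), ("D", 4.6), ("F", 9.1)]
cl = [("B", (3.0, 8.6, 4.5)), ("F", (0.7, 9.1, 1.5)),
      ("G", (0.2, 2.0, 1.7)), ("T", (0.5, 3.0, 7.2))]
print(merge_join_topk(tl_french, tl_pianist, cl, 3))
```

Because the pruned prefixes are short and sorted by did, the join needs only sequential reads and no random accesses.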


Evaluation at INEX 2009

  • Standard benchmark for XML retrieval

  • 2.6 million XML documents with semantic annotation from YAGO

  • 113 human-defined queries, 75 come with list of relevant results

  • Explicit efficiency task


Runtime vs. Quality at INEX 2009


Querying Social Tagging Networks

[Figure: social tagging network in which users annotate items with tags such as "travel norway", "travel vldb", "travel mexico", "travel trip", "travel icde", "travel", "harry potter", and "probability data mining foundations"]


Information Need 1: Globally Popular

[Figure: social tagging network with the query "harry potter"; which of two candidate items should be returned?]

Most frequently tagged items are "best"

Tags by all users are equally important


Information Need 2: Similar Users

[Figure: social tagging network with the query "travel"; which of two candidate items should be returned?]

Tags by users with similar tags/items ("brothers in spirit") are more important


Information Need 3: Trusted Friends

[Figure: social tagging network, now also containing items tagged "probability selling", with the query "probability"; which of two candidate items should be returned?]

Tags by closely related and well-known users are more important


Towards Social-Aware Social Search

  • Search results may depend on

    • Global popularity of items

    • Spiritual context of the querying user (users with similar books and/or tags)

    • Social context of the querying user (known and trusted friends)

    • Combinations

  • Users can have different importance ("friendship strengths") in different searches

    → importance of a user is a convex combination of the three weights (with params α, β)
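
The convex combination can be written out as a short sketch. Which parameter weights which context is an assumption here: α is taken to weight the social context and β the spiritual context, with the remaining 1 - α - β going to the global weight; the function and argument names are hypothetical.

```python
def user_importance(global_w, spiritual_w, social_w, alpha, beta):
    """Importance of another user u' for the querying user u.

    Assumed form: alpha * social + beta * spiritual + (1 - alpha - beta) * global.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return alpha * social_w + beta * spiritual_w + (1 - alpha - beta) * global_w

# Hypothetical weights of one user u' from the point of view of u.
print(user_importance(0.1, 0.4, 0.8, alpha=0.0, beta=0.0))  # pure global search
print(user_importance(0.1, 0.4, 0.8, alpha=1.0, beta=0.0))  # pure social search
print(user_importance(0.1, 0.4, 0.8, alpha=0.5, beta=0.5))  # social + spiritual mix
```

Setting α = β = 0 reduces the model to global popularity, so the three information needs are special cases of one parameterized score.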


Prototype [VLDB/SIGIR 2008 demo]

[Screenshot: results of global search for "dragon"]

[Screenshot: results of social search for "dragon"]


Preliminary User Study

LibraryThing user study: [Data Engineering Bulletin, June 2008]

  • 6 LibraryThing users with reasonably large library and friend sets

  • 49 queries like "mystery magic", "wizard", "yakuza"

  • Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friend links

  • Measured NDCG[10] (weighted precision@10)

[Figure: result quality plot with axes α (social) and β (spiritual)]

  • Result quality generally very high

  • Combination of spiritual and social friends significantly better than pure global search


Algorithmic Overview

  • Input: query q={t1…tn} for user u, α, β

  • Output: k items with highest scores

Example: user u + "harry potter"


Can we reuse Threshold Algorithms here?

No, scores are specific to the querying user and the parameter setting!

[Figure: separate precomputed score lists for the tags "harry" and "travel", one per querying user and per (α, β) setting such as (0.2, 0.5), (0.0, 0.8), (1.0, 0.0), (0.0, 1.0), and (0.5, 0.5), each list holding different scores for the same items]

Number of lists to precompute would explode! (#tags × #users × parameter space)


Top-K in Social Networks: ContextMerge

[SIGIR 2008]

Precomputed lists:

  • ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t)↓

  • USERITEMS(u', t): pairs <i, tf_u'(i,t)>, unsorted

  • FRIENDS(u): pairs <u', F(u,u')>, sorted by F(u,u')↓

Annotation on the slide: already exist in systems

[Figure: example lists. ITEMS(harry) with tf values 47, 32, 26; USERITEMS(u', harry) with tf value 1; FRIENDS(u) with friendship strengths 0.12, 0.10, 0.085]
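
To show how the three list types fit together, here is a naive evaluation of a user-specific score from FRIENDS and USERITEMS. This is not ContextMerge itself, which avoids reading the lists in full; it is the exhaustive computation that a top-k algorithm would try to shortcut. The scoring model (friendship strength times per-user tag frequency, summed over friends) and all data are assumptions for illustration.

```python
from collections import defaultdict

def naive_social_top_k(friends, useritems, tag, k):
    """friends: FRIENDS(u) as [(u', F(u,u'))], sorted by descending strength.
    useritems: {(u', tag): [(item, tf_u'(item, tag))]}, i.e. USERITEMS(u', t).
    """
    scores = defaultdict(float)
    for other, strength in friends:
        for item, tf in useritems.get((other, tag), []):
            scores[item] += strength * tf      # assumed weighted tag frequency
    return sorted(scores.items(), key=lambda e: (-e[1], e[0]))[:k]

# Hypothetical users and items; strengths mirror the example values on the slide.
friends = [("u2", 0.12), ("u3", 0.10), ("u4", 0.085)]
useritems = {
    ("u2", "harry"): [("book1", 1)],
    ("u3", "harry"): [("book2", 1), ("book1", 1)],
    ("u4", "harry"): [("book3", 1)],
}
print(naive_social_top_k(friends, useritems, "harry", 2))
```

Since FRIENDS(u) is sorted by descending strength, the contribution of later friends is bounded by the current strength, which is what makes threshold-style early stopping possible in principle.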


Experimental Evaluation: Efficiency

  • Testbed: 3 large crawls of real social networks

    • Flickr: 10 million pictures, ~50,000 users

    • Del.icio.us: ~175,000 bookmarks, ~12,000 users

    • LibraryThing: ~6.5 million books, ~10,000 users

  • Queries:

    • 150 frequent tag pairs

    • for each query pick user with "enough" results & friends

  • Abstract cost measure ≈ disk load

  • Baseline: full merge + sort


Experimental Evaluation: Efficiency (β=0)

[Figure: cost compared to the baseline, plotted over α]

2-8 times better than baseline


WisNetGrid: Semantic Search for D-Grid

  • D-Grid: German science grid providing computing and storage resources

  • Many topic-specific communities: Astro-, Text-, Medi-, Interlog-, Wiss- (Science-), Finance-, …

  • Two services missing so far (among others):

    • Integrated search over all data sources

    • Extraction of facts from data and fact-based search

→ WisNetGrid: BMBF project with 10 national partners


Whatever is left

  • Everlast: Distributed Web Archiving (with A. Anand, S. Bedathur, MPI-INF)

  • IR on knowledge graphs (with S. Elbassuoni, M. Ramanath, G. Weikum, MPI-INF)

  • Summarization of knowledge about entities (with M. Sydow, U Warsaw, PL)

  • Assessments for XML IR with Amazon Mechanical Turk (with O. Alonso, Microsoft, and M. Theobald, MPI-INF)

  • INEX Efficiency Task (with M. Theobald, MPI-INF, and A. Trotman, U Otago, NZ)
