Efficient Search in Semi-structured Data Spaces

Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar

General Approach
• Problem (information need on some data collection)
• Model & Ranking (probabilistic, language models, authority, …)
• Efficient algorithms
• Evaluation of result quality
• Evaluation of execution cost
MMCI Retreat, Braunshausen

Selected Projects
• Efficient Information Retrieval
• Social Networks
• Distributed Knowledge Management
• Whatever is left

Text Retrieval
Problem: Find the best documents d from a large collection that match a query {t1,…,tn}
Modeling and ranking: Define score for documents
• Importance of t in the collection (the less frequent, the better)
• Importance of t for document d (the more frequent, the better)
• Linear combination for query scores
tf(d,t): frequency of tag t for doc d
df(t): #docs tagged with t
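As a minimal sketch of such a score, assuming a standard tf·idf weighting with idf = log(N / df(t)) (an assumption; the exact formula used on the slide may differ), a query score can be written as a sum of per-term scores. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def build_stats(docs):
    """docs: dict doc_id -> list of terms. Returns (tf, df, number of docs)."""
    tf = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each doc counts once per distinct term
    return tf, df, len(docs)

def term_score(tf, df, n_docs, d, t):
    """Assumed weighting: tf(d,t) * log(N / df(t)); 0 if t does not occur in d."""
    if tf[d][t] == 0:
        return 0.0
    return tf[d][t] * math.log(n_docs / df[t])

def query_score(tf, df, n_docs, d, query):
    """Query score as a plain sum of the per-term scores."""
    return sum(term_score(tf, df, n_docs, d, t) for t in query)

# Hypothetical toy collection
docs = {
    "A": ["french", "pianist", "pianist"],
    "B": ["french", "cook"],
    "C": ["german", "pianist"],
}
tf, df, n = build_stats(docs)
query = ["french", "pianist"]
ranking = sorted(docs, key=lambda d: query_score(tf, df, n, d, query), reverse=True)
```

Document A matches both query terms and therefore ranks first in this toy collection.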

What about efficiency?
• Cannot compute this from scratch for each query (>> 10^10 documents)
• Solution:
  • Precompute per-term scores for each document
  • For each term, store list of (d, score(d,t)) on disk
• When query arrives:
  • combine entries from lists
  • sort results
  • return top-k
(merge-then-sort algorithm)
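The merge-then-sort steps can be sketched as follows, assuming the precomputed lists fit in memory as a dictionary from term to (d, score) pairs. The index contents and the function name are hypothetical.

```python
import heapq
from collections import defaultdict

def merge_then_sort(index, query, k):
    """Read every list of the query completely, add up scores per document,
    then return the k documents with the highest total score."""
    totals = defaultdict(float)
    for t in query:
        for d, s in index.get(t, []):
            totals[d] += s
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

# Hypothetical precomputed per-term lists of (d, score(d,t))
index = {
    "french": [("A", 0.9), ("B", 0.4)],
    "pianist": [("B", 0.8), ("C", 0.7), ("A", 0.2)],
}
top2 = merge_then_sort(index, ["french", "pianist"], k=2)
```

Every entry of every query list is read, which is what makes this approach expensive for long lists.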

Family of Threshold Algorithms
But: Lists can be very long (millions of entries) → Simple merge-then-sort algorithm too expensive
Observation: "Good" results have high scores
• Order lists by decreasing scores
• Have "intelligent" algorithm with different list access modes and early stopping
Example list, ordered by decreasing score: T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01
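A minimal sketch of one member of this family, Fagin's threshold algorithm (TA) with sum aggregation, assuming in-memory lists and a dictionary lookup standing in for random accesses. The function name and the toy lists are hypothetical; the sketch also counts sorted accesses (SA) and random accesses (RA).

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one list of (doc, score) per query term, sorted by descending score.
    Returns (top-k as [(score, doc)], #sorted accesses, #random accesses)."""
    lookup = [dict(l) for l in lists]  # stands in for random access by doc id
    seen = set()
    top = []                           # min-heap holding the current top-k
    sa = ra = 0
    max_depth = max(len(l) for l in lists)
    for depth in range(max_depth):
        threshold = 0.0                # best score any unseen doc can still reach
        for i, l in enumerate(lists):
            if depth >= len(l):
                continue               # exhausted list contributes 0
            doc, s = l[depth]
            sa += 1
            threshold += s
            if doc in seen:
                continue
            seen.add(doc)
            total = s
            for j, other in enumerate(lookup):
                if j != i:
                    ra += 1            # look up the doc's score in the other lists
                    total += other.get(doc, 0.0)
            heapq.heappush(top, (total, doc))
            if len(top) > k:
                heapq.heappop(top)     # drop the weakest candidate
        # Early stopping: the k-th best score already beats every unseen doc
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), sa, ra

# Hypothetical score-ordered lists
l1 = [("A", 0.9), ("B", 0.4), ("C", 0.1)]
l2 = [("B", 0.8), ("C", 0.7), ("A", 0.2)]
result, sorted_accesses, random_accesses = threshold_algorithm([l1, l2], k=1)
```

In this toy run the scan stops after two entries per list instead of reading all six entries.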

Experiments with TREC Benchmark
• TREC Terabyte collection: ~24 million docs from .gov domain, ~420GB (unpacked) size (we now have one with 10^9 docs, 5TB compressed size)
• 50 keyword queries from TREC Terabyte 2005
• Performance measures:
  • Number of sequential and random accesses
  • Weighted cost: #SA + C · #RA
  • Wall-clock runtime

Experiments: (TA and) CA on TREC
[Two plots over k = 10, 50, 100, 200, 500: average abstract cost (#SA + 1000 x #RA) and average wall-clock running time (milliseconds); series shown include merge-then-sort, State-of-the-art-1, State-of-the-art-2, OURS, and the lower bound]
• Lower bound: for each query [VLDB06, with H. Bast]
  • compute top-k results R and final mink
  • find minimum over all combinations of scan depths that see R
  • SA cost + RA cost for candidates with bestscore > mink
  • considers blocks of entries for tractability
You can safely ignore this part

Beyond Exact Top-K Results
• Improve performance by considering approximate results with probabilistic guarantees
  • drop candidate when probability for being top-k result is < ε
  • estimate probabilities from per-list score distributions
  • reasonable improvement in performance (stop earlier)
  • probabilistic guarantee: E[relative recall @ k] = 1 - ε
• Maximize result quality within fixed budget for execution cost (number of accesses, time)
  • adaptive scheduling: initially prefer high scores, later high score drops
  • Experimental results close to optimal (offline) results
[VLDB04] [ICDE09]

Even More Heuristics: Proximity
• Observation: [SPIRE07] "Good" results have term matches close together
→ add second type of list: for each term pair, include documents with close occurrences of the terms, ordered by distance-based score
Example lists (descending score):
TL(french): A:9.3, T:7.2, E:5.0, B:4.5
TL(pianist): F:9.1, B:8.6, A:5.9, D:4.6
CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7)

Query Processing
• Index lists in descending score order:
  TL(french): A:9.3, T:7.2, E:5.0, B:4.5
  TL(pianist): F:9.1, B:8.6, A:5.9, D:4.6
  CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7)
• Prune and reorganize index lists (ascending did):
  TL(french): A:9.3, B:4.5, E:5.0, T:7.2
  TL(pianist): A:5.9, B:8.6, D:4.6, F:9.1
  CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), G:(0.2,2.0,1.7), T:(0.5,3.0,7.2)
• Merge join the reorganized lists, then sort to obtain the top-k results
• Observation: very small prefixes of the lists yield good results
• Parameters tuned through exhaustive search in the parameter space (4h on 80-core Hadoop cluster)
• Resulting index approx. as large as the collection
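The merge join over did-ordered lists can be sketched as below. This is a simplified illustration, not the exact scoring of the presented system: it assumes a document's final score is the sum of its term-list scores plus the first component of its combined-list tuple (taken here to be the proximity score). The function name is hypothetical; the list contents are the example values from the slide.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_join_top_k(term_lists, pair_list, k):
    """All inputs are sorted by ascending document id (did).
    term_lists: lists of (did, score); pair_list: list of (did, (prox, s1, s2))."""
    streams = [list(tl) for tl in term_lists]
    # Assumption: only the proximity component of the tuple is added here
    streams.append([(did, prox) for did, (prox, _s1, _s2) in pair_list])
    merged = heapq.merge(*streams, key=itemgetter(0))  # one pass, did order
    totals = [
        (sum(score for _, score in group), did)
        for did, group in groupby(merged, key=itemgetter(0))
    ]
    best = heapq.nlargest(k, totals)                   # final sort for top-k
    return [(did, round(score, 2)) for score, did in best]

tl_french = [("A", 9.3), ("B", 4.5), ("E", 5.0), ("T", 7.2)]
tl_pianist = [("A", 5.9), ("B", 8.6), ("D", 4.6), ("F", 9.1)]
cl_french_pianist = [
    ("B", (3.0, 8.6, 4.5)),
    ("F", (0.7, 9.1, 1.5)),
    ("G", (0.2, 2.0, 1.7)),
    ("T", (0.5, 3.0, 7.2)),
]
top3 = merge_join_top_k([tl_french, tl_pianist], cl_french_pianist, k=3)
```

Because the pruned lists share one sort order, a single sequential pass over each list suffices and no random accesses are needed.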

Evaluation at INEX 2009
• Standard benchmark for XML retrieval
• 2.6 million XML documents with semantic annotation from YAGO
• 113 human-defined queries, 75 come with list of relevant results
• Explicit efficiency task

Runtime vs. Quality at INEX 2009
[Figure: runtime vs. result quality]

Querying Social Tagging Networks
[Figure: example social tagging network; tags shown include travel, norway, vldb, mexico, trip, icde, harry potter, probability, data mining, foundations]

Information Need 1: Globally Popular
[Figure: example tagging network; query: harry potter; which of two items is the better result?]
• Most frequently tagged items are "best"
• Tags by all users equally important

Information Need 2: Similar Users
[Figure: example tagging network; query: travel; which of two items is the better result?]
• Tags by users with similar tags/items ("brothers in spirit") more important

Information Need 3: Trusted Friends
[Figure: example tagging network with additional "probability" and "selling" tags; query: probability; which of two items is the better result?]
• Tags by closely related and well-known users more important

Towards Social-Aware Social Search
• Search results may depend on
  • Global popularity of items
  • Spiritual context of the querying user (users with similar books and/or tags)
  • Social context of the querying user (known and trusted friends)
  • Combinations
• Users can have different importance ("friendship strengths") in different searches
• Importance of user is convex combination of the three weights (with params α, β)
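A minimal sketch of such a convex combination. It assumes that α weights the social component, β the spiritual component, and the remaining 1 - α - β the global component; this assignment and the function name are assumptions for illustration.

```python
def user_importance(global_w, spiritual_w, social_w, alpha, beta):
    """Convex combination of the three weights of one user u' for a querying user u.
    Assumed roles: alpha -> social, beta -> spiritual, (1 - alpha - beta) -> global."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("need alpha >= 0, beta >= 0 and alpha + beta <= 1")
    return alpha * social_w + beta * spiritual_w + (1 - alpha - beta) * global_w

# Hypothetical weights of one user u' as seen by the querying user
pure_social = user_importance(0.1, 0.6, 0.9, alpha=1.0, beta=0.0)
pure_global = user_importance(0.1, 0.6, 0.9, alpha=0.0, beta=0.0)
mixed = user_importance(0.1, 0.6, 0.9, alpha=0.2, beta=0.5)
```

Setting α = β = 0 falls back to purely global search, while α = 1 considers only the social context.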

Prototype [VLDB/SIGIR 2008 demo]
[Screenshot: results of global search for "dragon"]

Prototype [VLDB/SIGIR 2008 demo]
[Screenshot: results of social search for "dragon"]

Preliminary User Study
LibraryThing user study: [Data Engineering Bulletin, June 2008]
• 6 LibraryThing users with reasonably large library and friend sets
• 49 queries like "mystery magic", "wizard", "yakuza"
• Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friend links
• Measured NDCG[10] (weighted precision@10) for combinations of β (spiritual) and α (social)
• Result quality generally very high
• Combination of spiritual and social friends significantly better than pure global search
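For reference, a minimal sketch of NDCG at cutoff 10, using one common variant of the measure (gain divided by log2(rank + 1)) and normalizing against the ideal reordering of the same result list. Whether the study used exactly this variant is an assumption; the function names are hypothetical.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain of the first k results (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k=10):
    """DCG divided by the DCG of the ideal ordering of the same gains.
    Simplification: the ideal ranking uses only the judged results in this list."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance judgments for one ranked result list
perfect = ndcg([3, 2, 1, 0])
swapped = ndcg([1, 3, 2, 0])
```

A ranking that places the most relevant items first reaches 1.0; swapping relevant items to lower ranks lowers the value.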

Algorithmic Overview
• Input: query q={t1…tn} for user u, α, β
• Output: k items with highest scores
[Figure: querying user plus query "harry potter" yields a ranked list of items]

Can we reuse Threshold Algorithms here?
No, scores specific to querying user and parameter setting!
[Figure: score lists for the tags "harry" and "travel"; for "harry" there is one list per user and parameter setting, e.g. (α=0.2, β=0.5), (α=0.0, β=0.8), (α=1.0, β=0.0), (α=0.0, β=1.0), (α=0.5, β=0.5), each with different scores]
Number of lists to precompute would explode! (#tags × #users × parameter space)

Top-K in Social Networks: ContextMerge [SIGIR 2008]
Precomputed lists (already exist in systems):
• ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t)↓
• USERITEMS(u', t): pairs <i, tf_u'(i,t)>, unsorted
• FRIENDS(u): pairs <u', F(u,u')>, sorted by F(u,u')↓
[Figure: example lists ITEMS(harry), USERITEMS(u', harry), and FRIENDS(u)]
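To illustrate how the three list types fit together, here is a naive sketch that reads all lists completely. It is not ContextMerge itself, which stops early. The scoring is an assumption: each friend u' contributes F(u,u') · tf_u'(i,t), and the global tf(i,t) is scaled by a hypothetical `global_weight` parameter. All data and names are made up.

```python
import heapq
from collections import defaultdict

def naive_social_top_k(items, user_items, friends, tag, user, k, global_weight=0.0):
    """items: tag -> [(item, tf)]; user_items: (user, tag) -> [(item, tf_u)];
    friends: user -> [(friend, strength)]. Full scan of all lists, no early stopping."""
    scores = defaultdict(float)
    for item, tf in items.get(tag, []):              # ITEMS(t)
        scores[item] += global_weight * tf
    for friend, strength in friends.get(user, []):   # FRIENDS(u)
        for item, tf_u in user_items.get((friend, tag), []):  # USERITEMS(u', t)
            scores[item] += strength * tf_u
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical lists
items = {"harry": [("book1", 3), ("book2", 2), ("book3", 1)]}
user_items = {
    ("bob", "harry"): [("book2", 1)],
    ("carol", "harry"): [("book1", 1), ("book2", 1)],
}
friends = {"alice": [("bob", 0.5), ("carol", 0.25)]}

social_only = naive_social_top_k(items, user_items, friends, "harry", "alice", k=2)
with_global = naive_social_top_k(items, user_items, friends, "harry", "alice", k=2,
                                 global_weight=1.0)
```

In the toy data the best item changes with the weighting: friends favour book2, while the global counts favour book1.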

Experimental Evaluation: Efficiency
• Testbed: 3 large crawls of real social networks
  • Flickr: 10 million pictures, ~50,000 users
  • Del.icio.us: ~175,000 bookmarks, ~12,000 users
  • LibraryThing: ~6.5 million books, ~10,000 users
• Queries:
  • 150 frequent tag pairs
  • for each query pick user with "enough" results & friends
• Abstract cost measure ≈ disk load
• Baseline: full merge + sort

Experimental Evaluation: Efficiency (β=0)
[Plot over varying α]
• 2-8 times better than baseline

WisNetGrid: Semantic Search for D-Grid
• D-Grid: German science grid providing computing and storage resources
• Many topic-specific communities: Astro-, Text-, Medi-, Interlog-, Wiss- (Science-), Finance-, …
• Two services missing so far (among others):
  • Integrated search over all data sources
  • Extraction of facts from data and fact-based search
WisNetGrid: BMBF project with 10 national partners

Whatever is left
• Everlast: Distributed Web Archiving (with A. Anand, S. Bedathur, MPI-INF)
• IR on knowledge graphs (with S. Elbassuoni, M. Ramanath, G. Weikum, MPI-INF)
• Summarization of knowledge about entities (with M. Sydow, U Warsaw, PL)
• Assessments for XML IR with Amazon Mechanical Turk (with O. Alonso, Microsoft, and M. Theobald, MPI-INF)
• INEX Efficiency Task (with M. Theobald, MPI-INF, and A. Trotman, U Otago, NZ)
