Efficient Search in Semi-structured Data Spaces

Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar. Efficient Search in Semi-structured Data Spaces.


Presentation Transcript


  1. Efficient Search in Semi-structured Data Spaces
     Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar

  2. General Approach
     • Model & ranking (probabilistic, language models, authority, …)
     • Efficient algorithms
     • Evaluation of result quality
     • Evaluation of execution cost
     • Problem (information need on some data collection)
     MMCI Retreat, Braunshausen

  3. Selected Projects
     • Efficient Information Retrieval
     • Social Networks
     • Distributed Knowledge Management
     • Whatever is left

  4. Text Retrieval
     Problem: find the best documents d from a large collection that match a query {t1,…,tn}
     Modeling and ranking: define a score for documents
     • Importance of t in the collection (the less frequent, the better)
     • Importance of t for document d (the more frequent, the better)
     • Linear combination for query scores
     Notation: tf(d,t): frequency of term t in document d; df(t): number of documents containing t
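The scoring formula itself did not survive in the transcript. The following is a minimal sketch of the two ingredients named on the slide, assuming a plain tf·idf weighting, score(d,t) = tf(d,t) · log(N/df(t)), summed over the query terms. The exact formula, the function names and the toy documents are assumptions for illustration, not the authors' model.

```python
import math
from collections import Counter

def build_stats(docs):
    """docs: dict doc_id -> list of terms. Returns (tf, df, number of docs)."""
    tf = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())          # each doc counts once per term
    return tf, df, len(docs)

def score(doc_id, query, tf, df, n_docs):
    """Sum of tf(d,t) * log(N / df(t)) over the query terms (assumed formula)."""
    total = 0.0
    for t in query:
        if df[t] == 0:                    # term absent from the collection
            continue
        idf = math.log(n_docs / df[t])    # rarer terms weigh more
        total += tf[doc_id][t] * idf      # frequent-in-document terms weigh more
    return total

# Hypothetical toy collection.
docs = {
    "d1": ["french", "pianist", "pianist"],
    "d2": ["french", "cuisine"],
    "d3": ["german", "composer"],
}
tf, df, n = build_stats(docs)
query = ["french", "pianist"]
ranking = sorted(docs, key=lambda d: score(d, query, tf, df, n), reverse=True)
print(ranking)
```

Document d1 ranks first because it contains both query terms and the rarer one twice.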

  5. What about efficiency?
     • Cannot compute this from scratch for each query (>>10^10 documents)
     • Solution:
       • Precompute per-term scores for each document
       • For each term, store list of (d, score(d,t)) on disk
     • When query arrives:
       • combine entries from lists
       • sort results
       • return top-k
       (merge-then-sort algorithm)
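The merge-then-sort steps above can be sketched as follows. This is an in-memory illustration under stated assumptions: the index is a dict of precomputed (doc, score) lists rather than disk-resident lists, scores are aggregated by summation, and the function name and toy values are hypothetical.

```python
import heapq
from collections import defaultdict

def merge_then_sort(index, query, k):
    """Read every entry of every query-term list, sum scores per doc, return top-k."""
    acc = defaultdict(float)
    for t in query:
        for doc, s in index.get(t, []):   # full scan of the list for term t
            acc[doc] += s                 # combine entries from the lists
    # nlargest sorts only as much as needed to return the k best documents
    return heapq.nlargest(k, acc.items(), key=lambda e: e[1])

# Hypothetical precomputed per-term lists.
index = {
    "french":  [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)],
    "pianist": [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)],
}
top2 = merge_then_sort(index, ["french", "pianist"], 2)
print(top2)
```

Every list entry is touched exactly once, so the cost grows with the total list length regardless of k.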

  6. Family of Threshold Algorithms
     But: lists can be very long (millions of entries) ⇒ simple merge-then-sort algorithm too expensive
     Observation: "good" results have high scores
     • Order lists by decreasing scores
     • Have an "intelligent" algorithm with different list access modes and early stopping
     Example list (decreasing score): T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01
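To make "different list access modes and early stopping" concrete, here is a sketch of the textbook threshold algorithm (TA): round-robin sorted accesses, random accesses to complete each newly seen document, and a stop as soon as the k-th best score reaches the sum of the scores last seen in each list. It is the classic variant, not necessarily the algorithm evaluated in this talk. Summation as the aggregation function, the names and the toy lists are assumptions.

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one score-descending list of (doc, score) per query term.
    Returns (top-k, number of sorted accesses, number of random accesses)."""
    lookup = [dict(l) for l in lists]            # stands in for random access
    seen = {}                                    # doc -> fully aggregated score
    sa = ra = 0
    max_depth = max((len(l) for l in lists), default=0)
    for depth in range(max_depth):
        last = []                                # score at the scan position of each list
        for i, l in enumerate(lists):
            if depth >= len(l):
                last.append(0.0)                 # exhausted list contributes nothing
                continue
            doc, s = l[depth]
            sa += 1
            last.append(s)
            if doc not in seen:
                total = s
                for j, other in enumerate(lookup):
                    if j != i:                   # fetch the missing scores
                        ra += 1
                        total += other.get(doc, 0.0)
                seen[doc] = total
        top = heapq.nlargest(k, seen.items(), key=lambda e: e[1])
        threshold = sum(last)                    # best score any unseen doc can reach
        if len(top) == k and top[-1][1] >= threshold:
            return top, sa, ra                   # early stop
    return heapq.nlargest(k, seen.items(), key=lambda e: e[1]), sa, ra

# Hypothetical toy lists.
french = [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)]
pianist = [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)]
top, n_sa, n_ra = threshold_algorithm([french, pianist], 1)
print(top, n_sa, n_ra)
```

On the toy lists the scan stops after three rounds, without reading the last entry of either list.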

  7. Experiments with TREC Benchmark
     • TREC Terabyte collection: ~24 million docs from the .gov domain, ~420GB (unpacked) size (we now have one with 10^9 docs, 5TB compressed size)
     • 50 keyword queries from TREC Terabyte 2005
     • Performance measures:
       • Number of sequential and random accesses
       • Weighted cost: #SA + C · #RA
       • Wall-clock runtime
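The weighted cost measure is simple arithmetic. A small sketch follows, with C = 1000 as the default because that is the value on the cost axis of the results slide. The function name and the access counts in the example are hypothetical.

```python
def weighted_cost(num_sa, num_ra, c=1000):
    """Abstract cost #SA + C * #RA: one random access counts as C sequential ones."""
    return num_sa + c * num_ra

# Hypothetical plans: a long scan without random accesses vs. a short scan with a few.
scan_only = weighted_cost(num_sa=2_000_000, num_ra=0)
mixed = weighted_cost(num_sa=50_000, num_ra=300)
print(scan_only, mixed)
```

With C this large, a plan only pays off if each random access saves on the order of a thousand sequential accesses.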

  8. Experiments: (TA and) CA on TREC
     [Figure, two panels over k ∈ {10, 50, 100, 200, 500}: average abstract cost (#SA + 1000 × #RA, axis up to 4,000,000) and average wall-clock runtime (milliseconds, axis up to 250); series: merge-then-sort, State-of-the-art-1, State-of-the-art-2, OURS, lower bound]
     Lower bound, for each query [VLDB06, with H. Bast]:
     • compute top-k results R and final min-k
     • find minimum over all combinations of scan depths that see R
     • SA cost + RA cost for candidates with bestscore > min-k
     • considers blocks of entries for tractability
     (You can safely ignore this part)

  9. Beyond Exact Top-K Results
     • Improve performance by considering approximate results with probabilistic guarantees
       • drop candidate when probability for being top-k result is < ε
       • estimate probabilities from per-list score distributions
       • reasonable improvement in performance (stop earlier)
       • probabilistic guarantee: E[relative recall @ k] = 1 − ε
     • Maximize result quality within fixed budget for execution cost (number of accesses, time)
       • adaptive scheduling: initially prefer high scores, later high score drops
       • Experimental results close to optimal (offline) results
     [VLDB04] [ICDE09]
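The pruning test can be illustrated with a small sketch. It assumes that each list in which a candidate has not yet been seen is summarized by a discrete histogram (score value → probability) and that lists are independent. The histograms are convolved to get the distribution of the still-missing score, and the candidate is dropped when the probability of exceeding the current min-k score falls below ε. The histogram representation, the independence assumption and all names are illustrative assumptions, not the estimator of the cited papers.

```python
from collections import defaultdict

def convolve(p, q):
    """Distribution of X + Y for independent discrete X ~ p and Y ~ q."""
    out = defaultdict(float)
    for x, px in p.items():
        for y, py in q.items():
            out[round(x + y, 6)] += px * py   # rounding merges equal sums
    return dict(out)

def prob_exceeds(partial_score, missing_hists, min_k):
    """P(partial score + scores from the unseen lists > min_k)."""
    dist = {0.0: 1.0}
    for h in missing_hists:
        dist = convolve(dist, h)
    return sum(p for v, p in dist.items() if partial_score + v > min_k)

def should_drop(partial_score, missing_hists, min_k, eps):
    """Drop the candidate when its chance of entering the top-k is below eps."""
    return prob_exceeds(partial_score, missing_hists, min_k) < eps

# Hypothetical histogram of one unseen list: most documents score 0 there.
hist = {0.0: 0.9, 0.5: 0.09, 1.0: 0.01}
p = prob_exceeds(partial_score=0.4, missing_hists=[hist], min_k=1.2)
print(p, should_drop(0.4, [hist], 1.2, eps=0.05))
```

Here only the rare score 1.0 would lift the candidate above min-k, so it is dropped for ε = 0.05 but kept for a stricter ε = 0.005.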

  10. Even More Heuristics: Proximity
     • Observation [SPIRE07]: "good" results have term matches close together
       ⇒ add a second type of list: for each term pair, include documents with close occurrences of the terms, ordered by distance-based score
     Example index lists (descending score):
     • TL(pianist): A:9.3, T:7.2, E:5.0, B:4.5
     • TL(french): F:9.1, B:8.6, A:5.9, D:4.6
     • CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7)
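The transcript does not give the distance-based score. As a sketch of how a combined list for a term pair could be built, assume that every pair of occurrences of the two terms within a fixed window contributes 1/distance². The contribution function, the window size, the names and the toy positions are all assumptions.

```python
def pair_proximity(pos_a, pos_b, window=10):
    """Sum 1/dist^2 over all occurrence pairs at most `window` tokens apart."""
    score = 0.0
    for a in pos_a:
        for b in pos_b:
            dist = abs(a - b)
            if 0 < dist <= window:
                score += 1.0 / dist ** 2
    return score

def build_combined_list(positions, t1, t2, window=10):
    """positions: doc -> term -> token positions. Returns (doc, score), best first."""
    entries = []
    for doc, terms in positions.items():
        s = pair_proximity(terms.get(t1, []), terms.get(t2, []), window)
        if s > 0:                      # keep only docs with close occurrences
            entries.append((doc, s))
    return sorted(entries, key=lambda e: e[1], reverse=True)

# Hypothetical token positions.
positions = {
    "B": {"french": [3], "pianist": [4]},     # adjacent terms
    "T": {"french": [1], "pianist": [6]},     # five tokens apart
    "X": {"french": [1]},                     # second term missing
}
cl = build_combined_list(positions, "french", "pianist")
print(cl)
```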

  11. Query Processing
     Pipeline: index lists (descending score) → prune and reorganize → lists in ascending did → merge join → sort → top-k results
     Observation: very small prefixes of the lists yield good results
     Index lists, descending score:
     • TL(pianist): A:9.3, T:7.2, E:5.0, B:4.5
     • TL(french): F:9.1, B:8.6, A:5.9, D:4.6
     • CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7)
     Pruned and reorganized lists, ascending did:
     • TL(pianist): A:9.3, B:4.5, E:5.0, T:7.2
     • TL(french): A:5.9, B:8.6, D:4.6, F:9.1
     • CL(french, pianist): B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), G:(0.2,2.0,1.7), T:(0.5,3.0,7.2)
     • Parameters tuned through exhaustive search in the parameter space (4h on an 80-core Hadoop cluster)
     • Resulting index approx. as large as the collection
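The merge-join-then-sort step can be sketched on the slide's example values. The sketch assumes that a combined-list tuple reads (pair score, french score, pianist score), that a document's final score is the sum of its pair score and its two term scores, and that a term score appearing in both a term list and a combined list counts once. The tuple layout, the aggregation and the names are assumptions.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_join_top_k(lists, k):
    """lists: doc-id-ordered lists of (doc, {component: score}). Returns top-k."""
    merged = heapq.merge(*lists, key=itemgetter(0))       # single pass in doc-id order
    results = []
    for doc, group in groupby(merged, key=itemgetter(0)):
        components = {}
        for _, comp in group:
            components.update(comp)                       # same component counted once
        results.append((doc, sum(components.values())))
    return heapq.nlargest(k, results, key=itemgetter(1))  # final sort for top-k

tl_pianist = [("A", {"pianist": 9.3}), ("B", {"pianist": 4.5}),
              ("E", {"pianist": 5.0}), ("T", {"pianist": 7.2})]
tl_french = [("A", {"french": 5.9}), ("B", {"french": 8.6}),
             ("D", {"french": 4.6}), ("F", {"french": 9.1})]
cl = [("B", {"pair": 3.0, "french": 8.6, "pianist": 4.5}),
      ("F", {"pair": 0.7, "french": 9.1, "pianist": 1.5}),
      ("G", {"pair": 0.2, "french": 2.0, "pianist": 1.7}),
      ("T", {"pair": 0.5, "french": 3.0, "pianist": 7.2})]

top3 = merge_join_top_k([tl_pianist, tl_french, cl], 3)
print(top3)
```

Because all lists are ordered by document id, one sequential pass over the short pruned lists suffices and no random accesses are needed.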

  12. Evaluation at INEX 2009
     • Standard benchmark for XML retrieval
     • 2.6 million XML documents with semantic annotation from YAGO
     • 113 human-defined queries, 75 come with list of relevant results
     • Explicit efficiency task

  13. Runtime vs. Quality at INEX 2009

  14. Selected Projects
     • Efficient Information Retrieval
     • Social Networks
     • Distributed Knowledge Management
     • Whatever is left

  15. Querying Social Tagging Networks
     [Figure: social tagging network; users annotate items with tags such as "travel", "norway", "vldb", "mexico", "trip", "icde", "harry potter", "probability", "data mining", "foundations"]

  16. Information Need 1: Globally Popular
     [Figure: the tagging network with the query "harry potter" and a choice between candidate items]
     Most frequently tagged items are "best"; tags by all users are equally important

  17. Information Need 2: Similar Users
     [Figure: the tagging network with the query "travel" and a choice between candidate items]

  18. Information Need 2: Similar Users
     [Figure: the tagging network with the query "travel" and a choice between candidate items]
     Tags by users with similar tags/items ("brothers in spirit") are more important

  19. Information Need 3: Trusted Friends
     [Figure: the tagging network, now with additional items tagged "probability" and "selling", the query "probability" and a choice between candidate items]

  20. Information Need 3: Trusted Friends
     [Figure: the tagging network, now with additional items tagged "probability" and "selling", the query "probability" and a choice between candidate items]
     Tags by closely related and well-known users are more important

  21. Towards Social-Aware Social Search
     • Search results may depend on
       • Global popularity of items
       • Spiritual context of the querying user (users with similar books and/or tags)
       • Social context of the querying user (known and trusted friends)
       • Combinations
     • Users can have different importance ("friendship strengths") in different searches
     Importance of a user is a convex combination of the three weights (with params α, β)
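The convex combination can be written out as a short sketch. The transcript does not spell out the formula, so the assignment of parameters is an assumption: α weights the social component (the user-study slide pairs α with "social"), β weights the spiritual component, and the remainder 1 − α − β goes to the global component. The function name and the example weights are hypothetical.

```python
def user_importance(global_w, spiritual_w, social_w, alpha, beta):
    """Convex combination of the three per-user weights (assumed parameterization)."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("need alpha >= 0, beta >= 0 and alpha + beta <= 1")
    return alpha * social_w + beta * spiritual_w + (1 - alpha - beta) * global_w

# Hypothetical weights of one user u' from the point of view of the querying user u.
weights = dict(global_w=0.1, spiritual_w=0.5, social_w=0.9)

pure_global = user_importance(**weights, alpha=0.0, beta=0.0)
pure_social = user_importance(**weights, alpha=1.0, beta=0.0)
mixed = user_importance(**weights, alpha=0.2, beta=0.5)
print(pure_global, pure_social, mixed)
```

Setting α = β = 0 recovers purely global search, while α = 1 considers trusted friends only.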

  22. Prototype [VLDB/SIGIR 2008 demo]
     [Screenshot: results of global search for "dragon"]

  23. Prototype [VLDB/SIGIR 2008 demo]
     [Screenshot: results of social search for "dragon"]

  24. Preliminary User Study
     LibraryThing user study: [Data Engineering Bulletin, June 2008]
     • 6 LibraryThing users with reasonably large library and friend sets
     • 49 queries like "mystery magic", "wizard", "yakuza"
     • Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friend links
     • Measured NDCG[10] (weighted precision@10)
     [Table: NDCG[10] for different settings of β (spiritual) and α (social)]
     • Result quality generally very high
     • Combination of spiritual and social friends significantly better than pure global search
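For readers unfamiliar with the measure, here is a sketch of NDCG at cutoff 10. It uses one common variant, gain / log2(rank + 1), and normalizes by the ideal ordering of the judged results in the list itself. Which variant the study used is not stated in the transcript, so both choices are assumptions, as are the names and the example judgments.

```python
import math

def dcg(gains, k=10):
    """Discounted cumulative gain of the first k results (rank 1 is undiscounted)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """DCG divided by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments (2 = very relevant, 1 = relevant, 0 = not relevant).
good_ranking = [2, 2, 1, 0, 1, 0, 0, 0, 0, 0]
poor_ranking = [0, 0, 1, 0, 1, 0, 0, 2, 0, 2]
print(ndcg(good_ranking), ndcg(poor_ranking))
```

Both lists contain the same results, but the first one places the relevant ones near the top and therefore scores higher.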

  25. Algorithmic Overview
     • Input: query q = {t1, …, tn} for user u, parameters α, β
     • Output: k items with highest scores
     [Figure: example query "harry potter"]

  26. Can we reuse Threshold Algorithms here?
     No, scores are specific to the querying user and the parameter setting!
     [Figure: precomputed score lists for the tags "harry" and "travel"; the list for "harry" differs for each querying user and for each parameter setting, e.g. (α=0.2, β=0.5), (α=0.0, β=0.8), (α=1.0, β=0.0), (α=0.0, β=1.0), (α=0.5, β=0.5)]
     Number of lists to precompute would explode! (#tags × #users × parameter space)

  27. Top-K in Social Networks: ContextMerge [SIGIR 2008]
     Precomputed lists (already exist in systems):
     • ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t) ↓
     • USERITEMS(u', t): pairs <i, tf_u'(i,t)>, unsorted
     • FRIENDS(u): pairs <u', F(u,u')>, sorted by F(u,u') ↓
     [Figure: example lists ITEMS(harry) with values 32, 26, 47, …; USERITEMS(u', harry) with value 1; FRIENDS(u) with values 0.085, 0.12, 0.10, …]
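To show how the FRIENDS and USERITEMS lists fit together, here is a naive evaluation sketch. It is not ContextMerge itself, which adds early termination. The sketch assumes that the user-specific score of an item is the sum over users u' of F(u,u') · tf_u'(i,t). That aggregation, the in-memory layout and all names and values are assumptions.

```python
import heapq
from collections import defaultdict

def social_top_k(friends, useritems, tag, k):
    """friends: FRIENDS(u) as list of (user, weight), best friend first.
    useritems: (user, tag) -> list of (item, tf), i.e. USERITEMS(u', t)."""
    scores = defaultdict(float)
    for other, weight in friends:                        # walk FRIENDS(u)
        for item, tf in useritems.get((other, tag), []):
            scores[item] += weight * tf                  # friend-weighted tag frequency
    return heapq.nlargest(k, scores.items(), key=lambda e: e[1])

# Hypothetical data.
friends = [("ann", 0.12), ("bob", 0.10), ("cy", 0.085)]
useritems = {
    ("ann", "harry"): [("book1", 1)],
    ("bob", "harry"): [("book2", 1), ("book1", 1)],
    ("cy", "travel"): [("guide", 1)],
}
top = social_top_k(friends, useritems, "harry", 2)
print(top)
```

Since FRIENDS(u) is sorted by decreasing weight, an algorithm with early stopping could bound the contribution of friends not yet visited instead of reading the whole list as this sketch does.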

  28. Experimental Evaluation: Efficiency
     • Testbed: 3 large crawls of real social networks
       • Flickr: 10 million pictures, ~50,000 users
       • Del.icio.us: ~175,000 bookmarks, ~12,000 users
       • LibraryThing: ~6.5 million books, ~10,000 users
     • Queries:
       • 150 frequent tag pairs
       • for each query pick user with "enough" results & friends
     • Abstract cost measure ≈ disk load
     • Baseline: full merge + sort

  29. Experimental Evaluation: Efficiency (β=0)
     [Figure: plot over α]
     2-8 times better than baseline

  30. Selected Projects
     • Efficient Information Retrieval
     • Social Networks
     • Distributed Knowledge Management
     • Whatever is left

  31. WisNetGrid: Semantic Search for D-Grid
     • D-Grid: German science grid providing computing and storage resources
     • Many topic-specific communities: Astro-, Text-, Medi-, Interlog-, Wiss- (Science-), Finance-, …
     • Two services missing so far (among others):
       • Integrated search over all data sources
       • Extraction of facts from data and fact-based search
     WisNetGrid: BMBF project with 10 national partners

  32. Selected Projects
     • Efficient Information Retrieval
     • Social Networks
     • Distributed Knowledge Management
     • Whatever is left

  33. Whatever is left
     • Everlast: Distributed Web Archiving (with A. Anand, S. Bedathur, MPI-INF)
     • IR on knowledge graphs (with S. Elbassuoni, M. Ramanath, G. Weikum, MPI-INF)
     • Summarization of knowledge about entities (with M. Sydow, U Warsaw, PL)
     • Assessments for XML IR with Amazon Mechanical Turk (with O. Alonso, Microsoft, and M. Theobald, MPI-INF)
     • INEX Efficiency Task (with M. Theobald, MPI-INF, and A. Trotman, U Otago, NZ)
