
SPARK: Top-k Keyword Query in Relational Database






Presentation Transcript


  1. SPARK: Top-k Keyword Query in Relational Database Wei Wang, University of New South Wales, Australia

  2. Outline • Demo & Introduction • Ranking • Query Evaluation • Conclusions

  3. Demo

  4. Demo …

  5. SPARK I • Searching, Probing & Ranking Top-k Results • Thesis project (2004 – 2005) with Nino Svonja • Taste of Research Summer Scholarship (2005) • Finally, CISRA prize winner • http://www.computing.unsw.edu.au/softwareengineering.php

  6. SPARK II • Continued as a research project with PhD student Yi Luo • 2005 – 2006 • SIGMOD 2007 paper • Still under active development

  7. A Motivating Example

  8. A Motivating Example … • Top-3 results in our system

  9. Improving the Effectiveness • Three factors contribute to the final score of a search result (joined tuple tree): • the (modified) IR ranking score • the completeness factor • the size normalization factor

  10. Preliminaries • Data Model • Relation-based • Query Model • Joined tuple trees (JTTs) • Sophisticated ranking • address one flaw in previous approaches • unify AND and OR semantics • alternative size normalization

  11. Problems with DISCOVER2

  12. Virtual Document • Combine tf contributions before tf normalization / attenuation.
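The effect of combining tf contributions before attenuation can be sketched as follows (a minimal illustration; the double-log attenuation form matches the bound used later in the talk, and both function names are hypothetical):

```python
from math import log

def attenuate(tf):
    # Standard tf attenuation: 1 + ln(1 + ln(tf)); 0 for tf = 0.
    return 1.0 + log(1.0 + log(tf)) if tf > 0 else 0.0

def virtual_doc_tf(tuple_tfs):
    # Sum the raw tf contributions of all tuples in the joined tuple
    # tree, treating the JTT as a single virtual document.
    return sum(tuple_tfs)

# Attenuating the combined tf gives a smaller (less inflated) value
# than attenuating each tuple's tf separately and summing:
combined = attenuate(virtual_doc_tf([2, 3]))    # attenuate(5)
per_tuple = attenuate(2) + attenuate(3)
```

Because attenuation is concave, the per-tuple sum overstates the contribution of a keyword scattered across tuples, which is the flaw the virtual-document view avoids.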

  13. Virtual Document Collection • Collection: 3 results • idf_netvista = ln(4/3) • idf_maxtor = ln(4/2) • Estimate idf: • idf_netvista = • idf_maxtor = • Estimate avdl = avdl_C + avdl_P

  14. Completeness Factor (L2 distance) • For “short queries”, users prefer results matching more keywords • Derive the completeness factor based on the extended Boolean model • Measure the Lp distance to the ideal position (1, 1) • [slide figure: example results (c2 ⋈ p2) and (c1 ⋈ p1) plotted on netvista/maxtor axes against the ideal position (1, 1), at L2 distances 0.5, 1, and 1.41]
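The distance measure above can be sketched directly (a minimal sketch assuming each keyword contributes a normalized match score in [0, 1] and the ideal result sits at (1, …, 1); the function name is illustrative):

```python
def lp_distance_to_ideal(coords, p=2.0):
    # coords: one normalized match score per query keyword, in [0, 1].
    # The ideal result sits at (1, 1, ..., 1); smaller distance = more
    # complete, so the completeness factor decreases with this value.
    return sum((1.0 - c) ** p for c in coords) ** (1.0 / p)

# A result partially matching both of two keywords is closer to the
# ideal point than one fully matching only one of them:
d_one_full = lp_distance_to_ideal([1.0, 0.0])   # = 1.0
d_both_half = lp_distance_to_ideal([0.5, 0.5])  # ≈ 0.707
```

Larger p pushes the measure toward strict AND semantics, while p = 1 behaves like OR; this is how the extended Boolean model unifies the two.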

  15. Size Normalization • Results in large CNs tend to have more matches to the keywords • score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN_nf|) • Empirically, s1 = 0.15 and s2 = 1 / (|Q| + 1) work well
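The size-normalization formula translates directly to code (function name hypothetical; the default for s2 follows the slide's empirical choice):

```python
def size_norm(cn_size, cn_nonfree, num_keywords, s1=0.15, s2=None):
    # score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN_nf|)
    # cn_size: |CN|, number of tuples in the candidate network;
    # cn_nonfree: |CN_nf|, number of non-free (keyword-matching) tuples;
    # num_keywords: |Q|, used for the empirical default s2 = 1/(|Q|+1).
    if s2 is None:
        s2 = 1.0 / (num_keywords + 1)
    return (1.0 + s1 - s1 * cn_size) * (1.0 + s2 - s2 * cn_nonfree)
```

A single-tuple result gets score_c = 1, and the factor shrinks as the candidate network grows, penalizing large joins that match keywords only by accumulation.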

  16. Putting ‘em Together • score(JTT) = score_a * score_b * score_c • a: IR score of the virtual document • b: completeness factor • c: size normalization factor

  17. Comparing Top-1 Results • DBLP; Query = “nikos clique”

  18. #Rel and R-Rank Results • DBLP; 18 queries; Union of top-20 results • Mondial; 35 queries; Union of top-20 results

  19. Query Processing 3 Steps • Generate candidate tuples in every relation in the schema (using full-text indexes)

  20. Query Processing … 3 Steps • Generate candidate tuples in every relation in the schema (using full-text indexes) • Enumerate all possible Candidate Networks (CN)

  21. Query Processing … 3 Steps • Generate candidate tuples in every relation in the schema (using full-text indexes) • Enumerate all possible Candidate Networks (CN) • Execute the CNs • Most algorithms differ here. • The key is how to optimize for top-k retrieval

  22. Monotonic Scoring Function • Execute a CN (CN: P^Q ⋈ C^Q); assume idf_netvista > idf_maxtor and k = 1 • [slide figure: sorted candidate lists P = {P1, P2} and C = {C1, C2}; with DISCOVER2’s monotonic scoring function, score(c2 ⋈ p2) < score(c1 ⋈ p1) is guaranteed, so sorted access can stop early]

  23. Non-Monotonic Scoring Function • Execute a CN (CN: P^Q ⋈ C^Q); assume idf_netvista > idf_maxtor and k = 1 • [slide figure: under SPARK’s non-monotonic scoring function, the relative order of c1 ⋈ p1 and c2 ⋈ p2 is no longer guaranteed, so sorted access alone cannot stop early] • SPARK must: • Re-establish the early stopping criterion • Check candidates in an optimal order

  24. Upper Bounding Function • Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function • Details: • sumidf = Σ_w idf_w • watf(t) = (1/sumidf) * Σ_w (tf_w(t) * idf_w) • A = sumidf * (1 + ln(1 + ln(Σ_t watf(t)))) • B = sumidf * Σ_t watf(t) • Then score_a ≤ uscore_a = (1/(1-s)) * min(A, B), which is monotonic w.r.t. watf(t) • score_b and score_c are constants given the CN, so score ≤ uscore
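A sketch of the bound, assuming the formulas above; the guard for small Σ_t watf(t) (where the double-log form is not real-valued) is an added safety check, not from the slide:

```python
from math import log

def uscore_a(watfs, sum_idf, s=0.2):
    # watfs: watf(t) for each tuple t in the candidate join;
    # sum_idf: Σ_w idf_w over the query keywords;
    # s: the slack constant from the scoring function.
    total = sum(watfs)
    if total <= 0:
        return 0.0
    B = sum_idf * total
    if total >= 1.0:
        # Double-log attenuation bound; A and B both grow with each
        # watf(t), so min(A, B) remains monotonic.
        A = sum_idf * (1.0 + log(1.0 + log(total)))
        bound = min(A, B)
    else:
        # For small totals the linear bound B is the valid one.
        bound = B
    return bound / (1.0 - s)
```

Monotonicity in watf(t) is what makes the bound usable with sorted access: increasing any tuple's contribution can only raise uscore_a.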

  25. Early Stopping Criterion • Execute a CN (CN: P^Q ⋈ C^Q); assume idf_netvista > idf_maxtor and k = 1 • [slide figure: SPARK examines candidates in decreasing uscore order; once the real score of the current best result ≥ the uscore of every unseen candidate, it stops] • Re-establish the early stopping criterion ✓ • Check candidates in an optimal order

  26. Query Processing … • Execute the CNs (CN: P^Q ⋈ C^Q) [VLDB 03] • {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores • Score(P_i ⋈ C_j) = Score(P_i) + Score(C_j) • Operations (a parametric SQL query is sent to the DBMS for each): • [P1, P1] ⋈ [C1, C1] • C.get_next() → [P1, P1] ⋈ C2 • P.get_next() → P2 ⋈ [C1, C2] • P.get_next() → P3 ⋈ [C1, C2] • …

  27. Skyline Sweeping Algorithm • Execute the CNs (CN: P^Q ⋈ C^Q) • Dominance: uscore(<P_i, C_j>) > uscore(<P_{i+1}, C_j>) and uscore(<P_i, C_j>) > uscore(<P_i, C_{j+1}>) • Priority queue contents as pairs are popped: • <P1, C1> • <P2, C1>, <P1, C2> • <P3, C1>, <P1, C2>, <P2, C2> • <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2> • … • Re-establish the early stopping criterion ✓ • Check candidates in an optimal order (sort of)
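The priority-queue traversal can be sketched as follows (an illustrative reconstruction, not the paper's exact algorithm; a simple additive bound stands in for uscore, and all names are hypothetical):

```python
import heapq

def skyline_sweep(P, C, real_score, k=1):
    """P, C: per-tuple upper-bound scores, sorted descending.
    real_score(p, c): actual (possibly non-monotonic) score of a join.
    Returns the top-k real scores, visiting pairs in decreasing
    upper-bound order."""
    def ubound(i, j):
        # Monotonic upper bound on the pair (additive for illustration).
        return P[i] + C[j]

    heap = [(-ubound(0, 0), 0, 0)]   # min-heap on negated bound
    seen = {(0, 0)}
    topk = []
    while heap:
        neg_ub, i, j = heapq.heappop(heap)
        # Early stop: no unexamined pair can beat the current k-th score.
        if len(topk) == k and -neg_ub <= topk[-1]:
            break
        topk = sorted(topk + [real_score(P[i], C[j])], reverse=True)[:k]
        # Dominance: (i, j) bounds both (i+1, j) and (i, j+1), so each
        # successor is pushed only after its dominating pair is popped.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(P) and nj < len(C) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-ubound(ni, nj), ni, nj))
    return topk
```

Lazy expansion keeps the queue small: only the "skyline" of undominated pairs is materialized at any point, instead of the full |P| × |C| grid.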

  28. Block Pipeline Algorithm • Inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions: • many candidates with high uscores return much lower real scores • unnecessary (expensive) checking • cannot stop early • Idea: • partition the space into blocks and derive a tighter upper bound (bscore) for each partition • be “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)
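A much-simplified sketch of the partitioning idea (all names and the tuple representation are hypothetical; the actual bscore applies the upper-bounding function within each block, whereas here absent keywords simply contribute no idf):

```python
from collections import defaultdict

def partition_by_signature(candidates):
    # candidates: (tuple_id, keywords) pairs, where keywords is the set
    # of query keywords the tuple contains. Tuples sharing a signature
    # form one block.
    blocks = defaultdict(list)
    for tid, kws in candidates:
        blocks[frozenset(kws)].append(tid)
    return blocks

def block_bound(signatures, idf):
    # Simplified bscore for a combination of blocks: within a block
    # combination we know exactly which keywords can appear, so
    # keywords absent from every block contribute nothing, giving a
    # tighter bound than a signature-agnostic uscore.
    covered = frozenset().union(*signatures)
    return sum(idf[w] for w in covered)
```

Because bscore is tight per block, a high-uscore candidate whose block cannot actually cover the rare keywords is deferred instead of being checked eagerly.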

  29. Block Pipeline Algorithm … • Execute a CN (CN: P^Q ⋈ C^Q); assume idf_n > idf_m and k = 1 • [slide figure: P and C candidates partitioned by keyword signature, (n:1, m:0) vs. (n:0, m:1); block-level bounds (2.74, 2.63, 2.41, 2.38, 1.05) let the algorithm stop as soon as the best real score reaches the largest remaining bscore] • Re-establish the early stopping criterion ✓ • Check candidates in an optimal order ✓

  30. Efficiency • DBLP • ~0.9M tuples in total • k = 10 • PC: 1.8 GHz CPU, 512 MB RAM

  31. Efficiency … • DBLP, DQ13

  32. Conclusions • A system that performs effective & efficient keyword search on relational databases • Meaningful query results with appropriate rankings • Second-level response time for a ~10M-tuple DB (IMDB data) on a commodity PC

  33. Q&A Thank you.
