CS345


zarita

Presentation Transcript


  1. CS345 Compact Skeletons

  2. Compact Skeletons • Assume tuple components are scattered across a website • We have a tagger that can tag all tuple components on the website • Assume no noise for now • Goal: reconstruct the relation

  3. Compact Skeletons Relation Skeleton Data Graph Website

  4. Welcome to Big Corp! Join our team. The following jobs are open: Job #12345 Job #12346 Send resumes to: Jobs are available in these departments: R&D Corporate Job Title: Salary: Must know Java….. 1200 Jose Blvd, CA 94123 Programmer 100K Dept (D) Title (T) Salary (S) Address (A)

  5. Welcome to Big Corp! Join our team. The following jobs are open: Job #12345 Job #12346 Send resumes to: Jobs are available in these departments: R&D Corporate Dept (D) Title (T) Salary (S) Address (A) Job Title: Salary: Must know Java….. 1200 Jose Blvd, CA 94123 Programmer 100K

  6. Programmer 100K R & D Corporate 1200 Jose Blvd

  7. Data graph nodes: Corporate, 400 7th Ave, CEO, Admin Asst, 60K, R & D, 1200 Jose Blvd, Programmer, 100K, CTO, 150K • Relation (T, S, D, A): (Programmer, 100K, R & D, 1200 Jose Blvd); (CTO, 150K, R & D, 1200 Jose Blvd); (Admin Asst, 60K, Corporate, 400 7th Ave); (CEO, null, Corporate, 400 7th Ave)

  8. Data graph nodes: R & D, 1200 Jose Blvd, Programmer, 100K, CTO, 150K, Corporate, CEO, Admin Asst, 60K • Relation (T, S, D, A): (Programmer, 100K, R & D, 1200 Jose Blvd); (CTO, 150K, R & D, 1200 Jose Blvd); (Admin Asst, 60K, Corporate, 1200 Jose Blvd); (CEO, null, Corporate, 1200 Jose Blvd)

  9. Relation Skeleton Data Graph Website

  10. Skeletons • Labeled trees • Transformation from data graphs to relations D D A A T S T S

  11. 150K CTO Overlays R & D D 1200 Jose Blvd A T S Programmer 100K

  12. 150K CTO T S D A Programmer 100K R &D 1200 Jose Blvd Overlays R & D D 1200 Jose Blvd A T S Programmer 100K

  13. T S D A Programmer 100K R &D 1200 Jose Blvd Overlays R & D D 1200 Jose Blvd A T S 150K CTO Programmer 100K CTO 150K R &D 1200 Jose Blvd
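The overlay construction shown on the slides above can be sketched in code. This is a minimal Python sketch, not the authors' algorithm. It assumes the data graph is a tree, that each label appears once in the skeleton, and that an overlay maps every skeleton node to a data node with the same label while preserving parent-child edges. The function name `overlays` and the dictionary encoding are illustrative choices that do not come from the slides.

```python
from itertools import product

def overlays(s_node, d_node, s_children, d_children, label):
    """Yield overlays as dicts {skeleton label: data node}, mapping s_node to d_node."""
    per_child = []
    for sc in s_children.get(s_node, ()):
        # All ways to overlay the skeleton subtree at sc on a data child labeled sc.
        matches = [m
                   for dc in d_children.get(d_node, ())
                   if label[dc] == sc
                   for m in overlays(sc, dc, s_children, d_children, label)]
        per_child.append(matches)
    # Combine one choice per skeleton child; an empty list means no full overlay.
    for combo in product(*per_child):
        row = {s_node: d_node}
        for part in combo:
            row.update(part)
        yield row

# Skeleton D -> {A, T}, T -> S (an assumed shape for this sketch).
s_children = {"D": ["A", "T"], "T": ["S"]}
# Data graph for the R & D department, using the values from the slides.
d_children = {"R & D": ["1200 Jose Blvd", "Programmer", "CTO"],
              "Programmer": ["100K"], "CTO": ["150K"]}
label = {"R & D": "D", "1200 Jose Blvd": "A", "Programmer": "T",
         "CTO": "T", "100K": "S", "150K": "S"}

# One tuple per overlay, in column order (T, S, D, A).
relation = [tuple(o[c] for c in ("T", "S", "D", "A"))
            for o in overlays("D", "R & D", s_children, d_children, label)]
```

Each overlay contributes one tuple to the relation, so the two title/salary pairs under R & D yield two rows that share the department and address.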

  14. Overlays R & D D A 1200 Jose Blvd T S Programmer 100K CTO 150K

  15. T S D A Programmer 150K R &D 1200 Jose Blvd Overlays R & D D A 1200 Jose Blvd T S Programmer 100K CTO 150K

  16. T S D A Programmer 150K R &D 1200 Jose Blvd Overlays R & D D A 1200 Jose Blvd T S Programmer 100K CTO 150K CTO 100K R & D 1200 Jose Blvd

  17. Inconsistent Overlays R & D D A 1200 Jose Blvd T S Programmer 100K CTO 150K

  18. Inconsistent Overlays R & D D A 1200 Jose Blvd T S Programmer 100K CTO 150K

  19. Compact Skeletons • A skeleton is compact if all of its overlays are consistent • A skeleton is perfect if each node and edge of the data graph is covered by at least one overlay • Given a data graph G, does G have a perfect compact skeleton (PCS)? • Not always • But if one exists, it is unique

  20. PCS Algorithm R & D 1200 Jose Blvd 150K Programmer 100K CTO

  21. PCS Algorithm • Work bottom-up: • Compute node signatures • Place nodes in equivalence classes based on signature • Construct skeleton from equivalence classes • (Figure: nodes labeled D, A, S, T, T, S)
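The three bottom-up steps can be illustrated with a short Python sketch. The slides do not define a node signature, so the definition used here is an assumption: a node's signature is its label together with the set of its children's signatures. The sketch also assumes a tree-shaped data graph, and the names `classify` and `skeleton_edges` are hypothetical.

```python
from collections import defaultdict

def classify(root, children, label):
    """Post-order pass: compute signatures and group nodes by signature."""
    classes = defaultdict(list)

    def visit(node):
        # Assumed signature: (label, sorted set of child signatures).
        kid_sigs = {visit(c) for c in children.get(node, ())}
        sig = (label[node], tuple(sorted(kid_sigs)))
        classes[sig].append(node)
        return sig

    root_sig = visit(root)
    return root_sig, dict(classes)

def skeleton_edges(sig):
    """Read (parent label, child label) skeleton edges off a signature."""
    lab, kids = sig
    edges = set()
    for kid in kids:
        edges.add((lab, kid[0]))
        edges |= skeleton_edges(kid)
    return edges

# R & D example from the slides: department, address, two title/salary pairs.
children = {"R & D": ["1200 Jose Blvd", "Programmer", "CTO"],
            "Programmer": ["100K"], "CTO": ["150K"]}
label = {"R & D": "D", "1200 Jose Blvd": "A", "Programmer": "T",
         "CTO": "T", "100K": "S", "150K": "S"}

root_sig, classes = classify("R & D", children, label)
edges = skeleton_edges(root_sig)
```

Under this assumed signature, Programmer and CTO fall into the same equivalence class, as do 100K and 150K, so the skeleton has one node per label.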

  22. D A T S PCS Algorithm D A S T T S

  23. Incomplete information Corporate D A 400 7th Ave CEO T S Admin Asst 60K

  24. T S D A Admin Asst 60K Corporate 400 7th Ave Incomplete information Corporate D A 400 7th Ave CEO T S Admin Asst 60K

  25. T S D A Admin Asst 60K Corporate 400 7th Ave CEO Corporate 400 7th Ave Incomplete information Corporate D A 400 7th Ave CEO T S Admin Asst 60K

  26. Partial Compact Skeletons • For data graphs with incomplete information, we allow partial overlays • Partial overlays produce nulls in the relation • If consistent partial overlays can cover every node and edge of the graph, we have a partially perfect compact skeleton (PPCS)
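A partial overlay can be sketched as an overlay that leaves a skeleton subtree unmapped when the data graph has no matching child. The Python below is a hedged illustration under that assumption (tree-shaped data graph, one skeleton node per label); the name `partial_overlays` and the encoding are hypothetical, and consistency checking is not covered.

```python
from itertools import product

def partial_overlays(s_node, d_node, s_children, d_children, label):
    """Yield dicts {skeleton label: data node}; unmapped labels are simply absent."""
    per_child = []
    for sc in s_children.get(s_node, ()):
        matches = [m
                   for dc in d_children.get(d_node, ())
                   if label[dc] == sc
                   for m in partial_overlays(sc, dc, s_children, d_children, label)]
        # No matching data child: keep going with this subtree left unmapped.
        per_child.append(matches or [{}])
    for combo in product(*per_child):
        row = {s_node: d_node}
        for part in combo:
            row.update(part)
        yield row

# Skeleton D -> {A, T}, T -> S (assumed shape).
s_children = {"D": ["A", "T"], "T": ["S"]}
# Corporate example from the slides: the CEO node has no salary child.
d_children = {"Corporate": ["400 7th Ave", "CEO", "Admin Asst"],
              "Admin Asst": ["60K"]}
label = {"Corporate": "D", "400 7th Ave": "A", "CEO": "T",
         "Admin Asst": "T", "60K": "S"}

# dict.get returns None for unmapped labels, which plays the role of null.
rows = [tuple(o.get(c) for c in ("T", "S", "D", "A"))
        for o in partial_overlays("D", "Corporate", s_children, d_children, label)]
```

The CEO row comes out with a null salary, matching the relation shown for the Corporate department.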

  27. Tuple subsumption • Tuple t subsumes tuple u if t and u agree on every component of u that is not null
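The subsumption test translates directly into code. In this minimal Python sketch, nulls are represented by `None`; the helper `drop_subsumed`, which discards tuples subsumed by another tuple, is an illustrative use and is not stated on the slides. The 200K salary in the usage lines is a made-up value.

```python
def subsumes(t, u):
    """True if t agrees with u on every component of u that is not null (None)."""
    if len(t) != len(u):
        raise ValueError("tuples must have the same arity")
    return all(uc is None or tc == uc for tc, uc in zip(t, u))

def drop_subsumed(rows):
    """Keep only tuples that are not subsumed by a different tuple."""
    unique = list(dict.fromkeys(rows))  # remove exact duplicates, keep order
    return [u for u in unique
            if not any(t != u and subsumes(t, u) for t in unique)]

# Hypothetical rows in column order (T, S, D, A).
full = ("CEO", "200K", "Corporate", "400 7th Ave")
partial = ("CEO", None, "Corporate", "400 7th Ave")
kept = drop_subsumed([partial, full])
```

Two distinct tuples can never subsume each other, so `drop_subsumed` does not remove both members of a pair.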

  28. Noisy Data Graphs • Real-life websites are noisy • False positives, e.g., does "MS" mean a degree, a state, or Microsoft? • Non-skeleton links, e.g., featured products

  29. Data graph for a retail website • Labels: C: Category, I: Item, P: Price, A: Availability • Skeleton K1 over labels C, I, P, A • Data nodes: c1, c2, c3, i1, i2, i3, i4, p1, p3, p4, a1, a2, a3, a4 • For simplicity, assume all nodes have a label

  30. Coverage of a skeleton C Skeleton K1 I P A c3 c2 c1 i2 i4 i1 a4 i3 p1 a1 p3 p4 a2 a3

  31. Coverage of a skeleton C Skeleton K1 Coverage = 28 I P A c3 c2 c1 i2 i4 i1 a4 i3 p1 a1 p3 p4 a2 a3

  32. Coverage of a skeleton C Skeleton K1 Coverage = 28 I P A c3 c2 c1 i2 i4 i1 a4 i3 C I Skeleton K2 Coverage = 12 P p1 a1 p3 p4 a2 a3 A

  33. Skeletons for Noisy Data Graphs • Problem: find the skeleton K with optimal coverage, called the best-fit skeleton (BFS) • This problem is NP-complete

  34. Greedy Heuristic for BFS r c3 c2 c1 i2 i4 i1 a4 i3 p1 a1 p3 p4 a2 a3

  35. Greedy Heuristic for BFS R C I P A R C C C I I I A I P A P P A A

  36. R B A C D Greedy skeleton R B A C C C D D D D

  37. R B A C C C D D D D R B A C D Greedy skeleton Coverage = 9

  38. R B A C D Greedy skeleton Coverage = 9 R B A C C C D D D D R B A C D Optimal skeleton Coverage = 15

  39. Weighted Greedy Heuristic • Simple Greedy heuristic uses parent counts • “Memory-less” • Weighted Greedy heuristic takes into account past selections to improve simple greedy selection • Computes “benefit” of each decision at every stage
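The slides say only that the simple greedy heuristic "uses parent counts". One plausible reading, sketched below in Python, is that each label picks as its skeleton parent the label that most often appears as a parent of its nodes in the data graph. The function name `greedy_parent_choice` and the edge list are hypothetical, and the weighted variant's benefit computation is not attempted because the slides do not define it.

```python
from collections import Counter, defaultdict

def greedy_parent_choice(edges, label):
    """For each child label, pick the most frequent parent label (memory-less)."""
    counts = defaultdict(Counter)
    for parent, child in edges:
        counts[label[child]][label[parent]] += 1
    return {child_lab: c.most_common(1)[0][0] for child_lab, c in counts.items()}

# Hypothetical noisy retail graph: r is the root, c* categories, i* items,
# p* prices, a* availabilities. The last two edges are non-skeleton links.
edges = [("r", "c1"), ("r", "c2"), ("r", "c3"),
         ("c1", "i1"), ("c1", "i2"), ("c2", "i3"), ("c3", "i4"),
         ("i1", "p1"), ("i1", "a1"), ("i3", "p3"), ("i3", "a3"),
         ("i4", "p4"), ("i2", "a2"),
         ("c3", "a4"), ("i1", "i2")]
nodes = {n for edge in edges for n in edge}
label = {n: n[0].upper() for n in nodes}  # label = first letter of the node id

parent_of = greedy_parent_choice(edges, label)
```

Because each label decides independently of the others, nothing in this sketch stops the chosen edges from forming a poor overall skeleton, which is the weakness the weighted heuristic addresses by considering past selections.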

  40. R Weighted Greedy B A C C C D D D D R B A C C D D Greedy skeleton Coverage = 9

  41. benefit ( A C ) = 4 R Weighted Greedy B A C C C D D D D R B A C C D D Greedy skeleton Coverage = 9

  42. benefit ( A C ) = 4 benefit (B C ) = 10 R Weighted Greedy B A C C C D D D D R B A C C D D Greedy skeleton Coverage = 9

  43. R A R Weighted Greedy B A C C C D D D D R B B A C C D D Greedy skeleton Coverage = 9

  44. R B A C D Greedy skeleton Coverage = 9 R B A C C C D D D D R B A C D Weighted greedy skeleton Coverage = 15

  45. Summary Relation Compact Skeleton Data Graph Website
