
Global Detection of Complex Copying Relationships Between Sources

Xin Luna Dong, AT&T Labs-Research. Joint work with Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Presented at VLDB 2010. The talk starts from the observation that Web technologies make information propagation much easier, so false information can be propagated as well.

Presentation Transcript


  1. Global Detection of Complex Copying Relationships Between Sources. Xin Luna Dong, AT&T Labs-Research. Joint work with Laure Berti-Equille, Yifan Hu, Divesh Srivastava @VLDB’2010

  2. Information Propagation Becomes Much Easier with the Web Technologies

  3. False Information Can Be Propagated Posted by Andrew Breitbart In his blog …

  4. • "We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles." (Barack Obama) • "The Internet needs a way to help people separate rumor from real science." (Tim Berners-Lee)

  5. Large-Scale Copying on Structured Data (Copying of AbeBooks Data). Data collected from AbeBooks [Yin et al., 2007]

  6. Observation I. Intuitively Meaningful Clusters According to the Copying Relationships

  7. Observation I. Intuitively Meaningful Clusters According to the Copying Relationships

  8. Observation II. Complex Copying Relationships: co-copying

  9. Observation II. Complex Copying Relationships: multi-source copying, transitive copying

  10. Understanding Complex Copying Relationships. Benefits: • Business purpose: data are valuable • In-depth data analysis: information dissemination • Improve data integration: truth discovery, entity resolution, schema mapping, query optimization. Current techniques make local decisions [Dong et al., 09a][Dong et al., 09b][Blanco et al., 10] • They cannot distinguish co-copying, transitive copying, and direct copying from multiple sources

  11. Our Contributions. More accurate decisions on copying direction (important for global detection): • Glean information from completeness and formatting • Consider correlated copying: e.g., a source copying the name of a book can also copy its author list. Global detection of copying: • Discovering co-copying and transitive copying

  12. Outline • Motivation and contributions • Problem definition and techniques • Experimental results • Related work and conclusions • Techniques • Intuitions

  13. Problem Definition: Input. Objects: each is a real-world entity, described by a set of attributes • Each attribute is associated with a true value. Sources: each provides data for a subset of objects. [Figure: input table annotated with missing values, incorrect values, and different formats]

  14. Problem Definition: Output. For each pair S1, S2, decide the probability of S1 copying directly from S2 • A copier copies all or a subset of data • A copier can add values and verify/modify copied values (independent contribution) • A copier can re-format copied values (still considered as copied). [Figure: example sources S1, S2, S3, S4]

  15. Intuitions for Local Copying Detection. Pr(Ф(S1)|S1->S2) >> Pr(Ф(S1)|S1┴S2) => S1->S2 • Overlap on unpopular values => copying • Changes in quality of different parts of data => copying direction • [VLDB’09] Consider correctness of data

  16. Correctness of Data as Evidence for Copying [Figure: example sources S1, S2, S3, S4]

  17. Intuitions for Local Copying Detection. Pr(Ф(S1)|S1->S2) >> Pr(Ф(S1)|S1┴S2) => S1->S2 • Overlap on unpopular values => copying • Changes in quality of different parts of data => copying direction • [VLDB’09] Consider correctness of data • Consider additional evidence

  18. Formatting as Evidence for Copying [Figure: example sources S1, S2, S3, S4, annotated with sub-values and different formats]

  19. Intuitions for Local Copying Detection. Pr(Ф(S1)|S1->S2) >> Pr(Ф(S1)|S1┴S2) => S1->S2 • Overlap on unpopular values => copying • Changes in quality of different parts of data => copying direction • [VLDB’09] Consider correctness of data • Consider additional evidence • Consider correlated copying

  20. Correlated Copying [Figure: two panels, each with 17 same values and 8 different values; one panel is labeled "Copying"] • S: two sources providing the same value • D: two sources providing different values

  21. Intuitions for Local Copying Detection. Pr(Ф(S1)|S1->S2) >> Pr(Ф(S1)|S1┴S2) => S1->S2 • Overlap on unpopular values => copying • Changes in quality of different parts of data => copying direction • [VLDB’09] Consider correctness of data • Consider additional evidence • Consider correlated copying
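
The decision rule on this slide can be sketched as a Bayesian comparison of two hypotheses. This is a minimal illustration and not the authors' model: the function name `copy_posterior`, the prior, the `copy_rate` parameter, and the toy likelihoods are all assumptions made for the example. The toy model assumes an independent source provides a value with probability equal to that value's popularity, while a copier takes it from S2 with probability `copy_rate` and otherwise behaves independently.

```python
import math

def copy_posterior(shared_popularities, prior=0.2, copy_rate=0.8):
    """Posterior probability that S1 copies from S2, given the values they share.

    shared_popularities: for each shared value, the (assumed) probability that
    an independent source would provide that value.
    prior, copy_rate: hypothetical parameters chosen only for illustration.
    """
    # Work with log-odds so that many items do not underflow.
    log_odds = math.log(prior / (1.0 - prior))
    for p in shared_popularities:
        like_copy = copy_rate + (1.0 - copy_rate) * p   # Pr(observation | S1->S2)
        like_indep = p                                  # Pr(observation | S1 independent of S2)
        log_odds += math.log(like_copy / like_indep)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Five shared unpopular values are strong evidence; five popular ones are weak.
unpopular = copy_posterior([0.01] * 5)
popular = copy_posterior([0.9] * 5)
print(unpopular, popular)
```

Under these assumptions, sharing unpopular values pushes the posterior close to 1, whereas sharing popular values barely moves it from the prior, which is the "overlap on unpopular values" intuition.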

  22. Experimental Results for Local Copying Detection on Synthetic Data

  23. Outline • Motivation and contributions • Problem definition and techniques • Experimental results • Related work and conclusions • Techniques • Intuitions

  24. Multi-Source Copying? Co-copying? Transitive Copying? Three scenarios over sources S1, S2, S3 (V81-V100 are popular values): • Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130} • Co-copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V70} • Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100}

  25. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios as on slide 24.) Local copying detection results [figure]

  26. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios as on slide 24.) - Looking at the copying probabilities?

  27. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios as on slide 24.) Every pair of sources is labeled with copying probability 1 in all three scenarios. X Looking at the copying probabilities? - Counting shared values?

  28. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios as on slide 24.) Shared-value counts are 50, 50, and 30 in all three scenarios. X Looking at the copying probabilities? X Counting shared values? - Comparing the set of shared values?

  29. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios as on slide 24.) Shared value sets: • Multi-source copying: S1-S3 share V1-V50, S1-S2 share V51-V100, S2-S3 share V101-V130 • Co-copying: S1-S2 share V1-V50, S1-S3 share V21-V70, S2-S3 share V21-V50 • Transitive copying: S1-S2 share V1-V50, S1-S3 share V21-V50 and V81-V100, S2-S3 share V21-V50. X Looking at the copying probabilities? X Counting shared values? - Comparing the set of shared values?

  30. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios and shared value sets as on slides 24 and 29.) In both the co-copying and the transitive copying scenarios, V21-V50 are shared by 3 sources. X Looking at the copying probabilities? X Counting shared values? X Comparing the set of shared values? We need to reason for each data item in a principled way!
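
The three scenarios on these slides can be checked with plain set operations. The sketch below is only an illustration: the helper names `vals`, `shared_counts`, and `shared_by_all` are made up for the example, and the assignment of value sets to S2 and S3 in the co-copying and transitive scenarios is one reading of the flattened diagram.

```python
from itertools import combinations

def vals(*ranges):
    """Build a set of value ids from inclusive (lo, hi) ranges, e.g. (1, 100) for V1-V100."""
    return {v for lo, hi in ranges for v in range(lo, hi + 1)}

# Value sets as read from the slide (S2/S3 assignment is an assumption).
scenarios = {
    "multi-source": {"S1": vals((1, 100)), "S2": vals((51, 130)), "S3": vals((1, 50), (101, 130))},
    "co-copying":   {"S1": vals((1, 100)), "S2": vals((1, 50)),   "S3": vals((21, 70))},
    "transitive":   {"S1": vals((1, 100)), "S2": vals((1, 50)),   "S3": vals((21, 50), (81, 100))},
}

def shared_counts(sources):
    """Number of values shared by each unordered pair of sources."""
    return {(a, b): len(sources[a] & sources[b])
            for a, b in combinations(sorted(sources), 2)}

def shared_by_all(sources):
    """Values provided by every source in the scenario."""
    return set.intersection(*sources.values())

for name, sources in scenarios.items():
    print(name, shared_counts(sources), len(shared_by_all(sources)))
```

All three scenarios yield the same pairwise counts (50, 50, 30), so counting shared values cannot tell them apart; only the per-item view shows that V21-V50 are shared by all three sources in the co-copying and transitive cases and by no triple in the multi-source case.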

  31. Global Copying Detection. First find a set of copyings R that significantly influence the rest of the copyings • How to find such R? Then adjust the copying probability for the rest of the copyings: P(S1->S2|R) • How to compute P(S1->S2|R)?

  32. Computing P(S1S2|R) Pr(Ф(S1)|S1S2) >> Pr(Ф(S1)|S1S2) S1S2 Replace Pr(Ф(S1)|S1S2) everywhere with Pr(Ф(S1)|S1S2, R) For each O.A, consider sources associated with S1 in R • Sf(O.A)—sources providing the same value in the same format on O.A as S1 • Sv(O.A)—sources providing the same value in a different format on O.A as S1 • Pf/Pv – Probability that S1 does not copy O.A from any source in Sf(O.A)/Sv(O.A) • Pr(ФO.A(S1)|S1->S2, R)=(1-PfPv)+PfPv Pr(ФO.A (S1)|S1S2)

  33. Multi-Source Copying? Co-copying? Transitive Copying? (Scenarios as on slide 24.) [Figure: the three scenarios with shared-value edge labels; some edges are marked ? or X] • Multi-source copying: R={S3->S1}, Pr(Ф(S3)) = Pr(Ф(S3)|R) for V101-V130 • Co-copying: R={S3->S1}, Pr(Ф(S3)) << Pr(Ф(S3)|R) for V21-V50 • Transitive copying: R={S3->S2}, Pr(Ф(S3)) << Pr(Ф(S3)|R) for V21-V50; Pr(Ф(S3)) is high for V81-V100

  34. Finding R • R: the most influential copying relationships • Maximize [objective not preserved in the transcript] • Finding R is NP-complete (reduction from the HITTING SET problem) • We need a fast greedy algorithm

  35. Greedy Algorithm for Finding R. Goal: Maximize [objective not preserved in the transcript]. Intuitions: • For each source, find the most "influential" sources from which it copies • Order the original sources by their accumulated influence on others, and iteratively add each corresponding copying to R unless one of the following holds • Prune copyings that have less accumulated influence on others than being affected by others • Prune copyings that can be significantly influenced by the already selected copyings. E.g., P(S4->S1)-P(S4->S1|S4->S3)=.8, P(S4->S2)-P(S4->S2|S4->S3)=.8, P(S4->S3)-P(S4->S3|S4->S1)=.5, P(S4->S3)-P(S4->S3|S4->S2)=.5 [Figure: sources S1, S2, S3, S4; two edges are marked X]. Accumulated influence: .8+.8=1.6
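
The intuitions on this slide can be sketched on its own example numbers. This is a simplified illustration, not the authors' algorithm: the function names, the data layout, and the `significant` threshold are assumptions, and the ordering is done per copying relationship rather than per original source.

```python
from collections import defaultdict

# influence[(a, b)] = P(b) - P(b | a): how much selecting copying a lowers copying b.
# A copying is written (copier, original); the numbers are the slide's example.
influence = {
    (("S4", "S3"), ("S4", "S1")): 0.8,
    (("S4", "S3"), ("S4", "S2")): 0.8,
    (("S4", "S1"), ("S4", "S3")): 0.5,
    (("S4", "S2"), ("S4", "S3")): 0.5,
}

def accumulate(influence):
    """Total influence each copying exerts on others and receives from others."""
    exerted, received = defaultdict(float), defaultdict(float)
    for (src, dst), drop in influence.items():
        exerted[src] += drop
        received[dst] += drop
    return exerted, received

def greedy_select(influence, significant=0.5):
    """Pick influential copyings greedily (significant is a hypothetical threshold)."""
    exerted, received = accumulate(influence)
    candidates = set(exerted) | set(received)
    selected = []
    # Most influential first; ties broken by name for determinism.
    for c in sorted(candidates, key=lambda c: (-exerted[c], c)):
        if exerted[c] < received[c]:
            continue  # influences others less than it is affected by them
        if any(influence.get((r, c), 0.0) >= significant for r in selected):
            continue  # significantly influenced by an already selected copying
        selected.append(c)
    return selected

exerted, received = accumulate(influence)
R = greedy_select(influence)
print(exerted[("S4", "S3")], R)
```

On the slide's numbers, S4->S3 has accumulated influence .8 + .8 = 1.6 and is selected, while S4->S1 and S4->S2 each exert .5 but receive .8 and are pruned.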

  36. Experimental Results for Global Detection on Synthetic Data. Sensitivity: percentage of copying relationships that are identified with the correct direction. Specificity: percentage of non-copying pairs that are identified as such
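
One way to compute these two measures is sketched below. The function name, the representation of a copying as a (copier, original) pair, and the choice to evaluate specificity over unordered source pairs are assumptions made for the example; the source names and sets are invented.

```python
from itertools import combinations

def sensitivity_specificity(sources, true_copying, detected):
    """true_copying / detected: sets of directed (copier, original) pairs."""
    true_copying, detected = set(true_copying), set(detected)
    # Sensitivity: a true copying counts only if detected with the same direction.
    sensitivity = (len(true_copying & detected) / len(true_copying)
                   if true_copying else float("nan"))
    # Specificity: over unordered pairs with no true copying in either direction.
    related = {frozenset(p) for p in true_copying}
    flagged = {frozenset(p) for p in detected}
    non_copying = [frozenset(p) for p in combinations(sources, 2)
                   if frozenset(p) not in related]
    clean = sum(1 for p in non_copying if p not in flagged)
    specificity = clean / len(non_copying) if non_copying else float("nan")
    return sensitivity, specificity

sources = ["S1", "S2", "S3", "S4"]
truth = {("S2", "S1"), ("S3", "S2")}
found = {("S2", "S1"), ("S2", "S3"), ("S4", "S1")}  # one wrong direction, one false alarm
sens, spec = sensitivity_specificity(sources, truth, found)
print(sens, spec)
```

In this made-up example the reversed pair (S2, S3) does not count toward sensitivity, and the false alarm on S4 and S1 lowers specificity.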

  37. Outline • Motivation and contributions • Problem definition and techniques • Experimental results • Related work and conclusions • Techniques • Intuitions

  38. Experimental Setup. Dataset: weather data • 18 weather websites • 30 major USA cities • collected every 45 minutes for a day • 33 collections, so 990 objects • 28 distinct attributes. Challenges: • No true/false notion, only popularity • Frequent updates: up-to-date data may not have been copied at crawling time • Complete data and standard formatting: lack of evidence from completeness and formatting
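
The object count on this slide follows from simple arithmetic. The sketch below shows one reading that reproduces the slide's numbers; that the 33 collections come from crawling at both ends of a 24-hour window is an assumption, since the slide does not say how 33 was reached.

```python
MINUTES_PER_DAY = 24 * 60
INTERVAL_MINUTES = 45
CITIES = 30

intervals = MINUTES_PER_DAY // INTERVAL_MINUTES  # full 45-minute intervals in a day
collections = intervals + 1                      # assumption: a crawl at the start and at the end
objects = CITIES * collections                   # one object per (city, collection)

print(intervals, collections, objects)
```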

  39. Golden Standard

  40. Silver Standard

  41. Results of Global Detection

  42. Results of Local Detection

  43. Experimental Results. Measure: precision, recall, F-measure • C: real copying; D: detected copying • Enriched improves over Corr when the true/false notion applies • Table annotations: transitive/co-copying not removed; ignoring evidence from correlated copying
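
With C the set of real copyings and D the set of detected copyings, the three measures can be computed as below. The slide's formulas are not preserved in the transcript, so this sketch assumes the standard definitions (precision = |C∩D|/|D|, recall = |C∩D|/|C|, F-measure = their harmonic mean); the function name and the example pairs are invented.

```python
def precision_recall_f(real, detected):
    """real (C) and detected (D) are sets of directed (copier, original) pairs."""
    real, detected = set(real), set(detected)
    hits = len(real & detected)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(real) if real else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

C = {("S2", "S1"), ("S3", "S1"), ("S4", "S3")}
D = {("S2", "S1"), ("S3", "S1"), ("S4", "S1"), ("S4", "S2")}
p, r, f = precision_recall_f(C, D)
print(p, r, f)
```

In this made-up example S4 really copies from S3, but the detector reports S4->S1 and S4->S2 instead, which is the kind of transitive or co-copying error that lowers precision when it is not removed.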

  44. Related Work. Copying detection: • Texts/programs [Schleimer et al., 03][Buneman, 71] • Videos [Law-To et al., 07] • Structured sources: [Dong et al., 09a][Dong et al., 09b] make local decisions; [Blanco et al., 10] assume a copier must copy all attribute values of an object. Data provenance [Buneman et al., PODS’08]: • Focus on effective presentation and retrieval • Assume knowledge of provenance/lineage

  45. Conclusions and Future Work. Conclusions: • Improve previous techniques for pairwise copying detection by plugging in different types of copying evidence and by considering correlations between copying • Global detection for eliminating co-copying and transitive copying. Ongoing and future work: • Categorization and summarization of the copied instances • Visualization of copying relationships [VLDB’10 demo]

  46. Global Detection of Complex Copying Relationships Between Sources http://www2.research.att.com/~yifanhu/SourceCopying/
