1 / 48

DATA FUSION

DATA FUSION. Resolving Inconsistencies at Schema, Tuple and Value Level Naveen Rajamoorthy Nachiappan Chidambaram Arun karthikeyan Palaniswamy Sriramakrishnan Soundarrajan. Need for Data Fusion. To compare different Data Sets Example: Shopping Agents Disaster Management System.

ismail
Download Presentation

DATA FUSION

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DATA FUSION Resolving Inconsistencies at Schema, Tuple and Value Level NaveenRajamoorthy Nachiappan Chidambaram ArunkarthikeyanPalaniswamy SriramakrishnanSoundarrajan

  2. Need for Data Fusion • To compare different Data Sets • Example: • Shopping Agents • Disaster Management System

  3. GOALS OF DATA INTEGRATION • Completeness - amount of data (number of attributes and tuples) - achieved by adding more data sources • Conciseness - number of unique objects - number of unique attributes of the objects - achieved by reducing schematic inconsistencies by schema mapping • Correctness - validity of data - achieved by performing duplicate detection and data fusion Data Sources Schema Mapping Duplicate Detection Data Fusion

  4. Humboldt Merger(HumMer) • Fusing data from heterogeneous sources. • All Steps are performed at run-time. • Data Cleaning • Maximum Flexibility

  5. Components of Data Fusion • Heterogeneous and Dirty data • Three Steps • Schema Matching and Data Transformation • Duplicate Detection • Data Fusion

  6. Application Data Fusion Resolve inconsistencies at value level Step 3 Step 2 Duplicate Detection Resolve inconsistencies at tuple level Step 1 Resolve inconsistencies at schema level Schema Matching Data Sources Three Steps in Data Fusion

  7. Schema Matching and Data Transformation

  8. Schema Matching • Process of resolving schematic heterogeneity. 1. DUMAS Schema Matching Algorithm (Duplicate-based Matching of Schemas ) 2. TF IDF Similarity (term frequency–inverse document frequency)

  9. Example Consider the relation R(A,B,C,D,E) and S(B’,F,E,’G)

  10. ExampleConsider the relation R(A,B,C,D,E) and S(B’,F,E,’G)

  11. Example

  12. Example

  13. Schema Matching • Overlap of R and S schema

  14. Transformation • Preferred schema • Names of attributes are renamed or determined. • sourceID attribute is added to all tables in the schema.

  15. Duplicate Detection

  16. EXAMPLE <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> Source A Source B <publication> <title> Database Systems: The Complete Book </title> <author> Molina & Ullman</author> <year> 1990 </year> </publication>

  17. SCHEMA MAPPING <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> </title> <Authors> <author> </author> <author> </author> </Authors> <year> </year> </pub> Source A Source B <publication> <title> Database Systems: The Complete Book </title> <Author> Molina & Ullman</Author> <year> 1990 </year> </publication>

  18. DATA TRANSFORMATION <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> Source A <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> XQuery <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> Source B XQuery

  19. DUPLICATE DETECTION AND FUSION <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> <year> 1990 </year> </pub>

  20. QUESTION • Give the correct order in which integration needs to be carried out: A) Data Transformation -> Schema Mapping -> Duplicate detection ->Fusion B) Duplicate detection -> Data Transformation -> Schema Mapping -> Fusion C)Schema Mapping -> Data Transformation -> Duplicate detection ->Fusion D) Data Transformation -> Schema Mapping -> Fusion -> Duplicate detection

  21. Duplicate Detection • Problem • Given one or more data sets, find all sets of objects that represent the same real-world entity. • Difficulties • Duplicates are not identical • Similarity measures – Levenshtein, Jaccard, etc. • Large volume, cannot compare all pairs • Partitioning strategies – Sorted neighborhood, Blocking, etc.

  22. PARTITIONING STRATEGIES • General Strategy • Sorted Neighborhood Method

  23. GENERAL STRATEGY Compare each record with every other record and calculate distance measure. Assuming there are n records in database then we need to compute n(n-1)/2 distance measures. Number of records, n = 6 Number of Distance measures to be computed = 10 If there are say, 100000 records, Then, Number of Distance Measures to be computed = 5*10^8 calculations EXPENSIVE

  24. SORTED NEIGHBORHOOD METHOD • Using Sorted Neighborhood method we can reduce the number of potential duplicate pairs. • Different fields are identified as key. • The database is sorted using this key. • After sorting a window of fixed size slides over the sorted database and duplicate records are identified. • The technique generates O(wN) pairs where w is window size and N is the total number of records in database.

  25. DUPLICATE DETECTION WITH DESCRIPTIONS Criteria For Attribute Selection: Attributes that are: related to the currently considered object Child elements having a Foreign key constraint over the attributes of the parent table. (ii) useable by our similarity measure Attribute City corresponding to attribute Zip code cannot be used to calculate similarity measure (iii) likely to distinguish duplicates from non-duplicates. Attribute for Denomination is unlikely to distinguish duplicate records

  26. DUPLICATE DETECTION WITH DESCRIPTIONS Description: • Consider attributes from other tables that have a foreign key relationship with the existing tables. • For efficiency, only direct child attributes are considered, i.e. no descendants reached by following more than one reference are discarded. Let tables T1 and T2 be the two matched tables, and let {T1,1, . . . , T1,k} and {T2,1, . . . , T2,m} be their respective children tables. Then, every pair of tables (T1,i, T2,j), 1<=i<=k, i<=j<=m is matched. Thus Actor(Movie),Actress(Movie) and Actors(Film) can also be used for Duplicate Detection

  27. Example Table 1 Table 2 ID in Table 1 is a foreign key for Country ID in Table 2 From Sim(Country) in Table 1 we understand row 1 and 3 are duplicates (row 1 = row 3) Now on using the attribute City in child table, Table 2 for Duplicate Detection we come to the conclusion that row 1 = row 2 = row 3 in table 1. i.e: USA = United States = US

  28. Detection From Similarity Measure Source 1 Source1 x Source2 Partitioning Similarity measure θ1>sim<θ2 Source 2 Sure Duplicates sim > θ2 Possibile Duplicates sim < θ1 Non-Duplicates

  29. Data Fusion • Objective Given a duplicate, create a single object-representation while resolving conflicting data values. Simple Example: Min(price) Max_length(author) Concat(Month,Year) Source 1 ID Source 2

  30. TYPES OF DATA CONFLICT • Uncertainty • NULL value vs. non-NULL value • “Easy” case • Uncertainty Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity Causes: Missing information, such as null values in a source or a completely missing attribute in a source • Contradiction It is a conflict between two or more different non-null values that are all used to describe the same property of the same entity. Causes: Contradiction is caused by different sources providing different values for the same attribute of a real-world entity. • Contradiction • Non-NULL value vs. (different) non-NULL value

  31. NULL TYPES • unknown • There is a value, but I do not know it. • E.g.: Unknown date-of-birth • notapplicable • There is no meaningful value. • E.g.: Spouse for singles • withheld • There is a value, but we are not authorized to see it. • E.g.: Private phone line

  32. Question • ________ refers to “Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity” • A. Contradiction • B. Uncertainty • C. Resolution • D. Ignorance

  33. Classification of Functions conflict resolution strategies conflict ignorance conflict avoidance conflict resolution Escalate instance based metadata based instance based metadata based Coalesce ChooseDepending Concat Choose deciding mediating deciding mediating CommonAncestor MostRecent MostAbstract MostSpecific MIN, MAX Random Vote AVG, SUM

  34. Conflict Resolution Functions

  35. Data Fusion Goals Assume 2 sources, Source 1(A,B,C) and Source 2(A,B,D) a, b, - a, b, -, - Identical tuples   a, b, -, - a, b, - a, b, -, - a, b, c a, b, c, - Subsumed tuples   a, b, c, - a, b, - a, b, -, - a, b, c a, b, c, -   a, b, c, d Complementing tuples a, b, d a, b, -, d a, b, c a, b, c, -   a, f(b,e), c, d Conflicting tuples a, e, d a, e, -, d

  36. Relational Operators – Overview • Identical tuples (duplicates) • UNION, OUTER UNION • Subsumed tuples(uncertainty) • MINIMUM UNION • Complementing tuples (uncertainty) • COMPLEMENT UNION, MERGE • Conflicting tuples (contradiction) • MATCH, GROUP, FUSE

  37. UNION ( SELECT Title AS Name,Author,ISBN FROM R)UNION( SELECT Name,Author,ID AS ISBNFROM S) UNION

  38. MINIMUM UNION = Select A,B,C,D AS NULL FROM R UNION ALL SELECT A,B, C AS NULL,D FROM S A tuple t1 subsumes a tuple t2, if it has same schema, has less NULL-values, and coincides in all non-NULL-values.  +

  39. FULL DISJUNCTION = |⋈| SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B;

  40. MERGE AND PRIORITIZED MERGE SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B; |⋈ = =  = = ⋈|

  41. FUSE BY SELECT Name, RESOLVE(Age, max), RESOLVE(Address,choose(EE_Students)) FUSE FROM EE_Students,CS_Students FUSE BY (Name) RESULT

  42. FUSE BY IMDB FILMBUFF RESULT SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director,Concat), RESOLVE(Rating), FUSE FROM IMDB, Filmdienst FUSE BY (ID) ON ORDER Year DESC

  43. Question Match The Following a, b, -, - Complementing tuples a 1 a, b, -, - a, b, c, - Conflicting tuples 2 b a, b, -, - a, b, c, - Subsumed tuples 3 c a, b, -, d a, b, c, - d Identical tuples 4 a, e, -, d

  44. Hummer Screenshot

  45. Hummer Screenshot

  46. Hummer Screenshot

  47. Hummer Screenshot

  48. REFERENCES • http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion_vldb.pdf • http://vldb.idi.ntnu.no/program/slides/demo/s1251-bilke.pdf • http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion-3step.pdf • http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Modena05.pdf • http://vldb2009.org/files/DataFusionFinal.pdf • http://disi.unitn.it/~p2p/RelatedWork/Matching/dublicatesICDE05.pdf

More Related