
Managing Probabilistic Duplicates in Databases



Presentation Transcript


  1. Managing Probabilistic Duplicates in Databases. Mohamed M. Hafez. Supervisors: Prof. Osman Hegazy Mohamed, Prof. Ali Hamid El-Bastawissy

  2. Agenda • Introduction (current) • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work

  3. Introduction: Data Integration Components [diagram]. A user interface issues a fusion query against a global unified schema; the data integration system (mediator) maps it onto the local schemas of the sources, applying cleansing, to return a consistent and unambiguous answer.

  4. Introduction (cont.): Three common levels of inconsistencies [diagram]. Starting from the data sources: Step 1: Schema Matching, Step 2: Duplicate Detection, Step 3: Data Fusion, feeding the application.

  5. Agenda • Introduction • Motivation (current) • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work

  6. Motivation • Data is now everywhere, and in multiple versions (old vs. new). • Different social networks provide different kinds of data. • We focus on English textual data. • How can data fusion solve real-life problems? Passport(s) inspection, news verification, and building a knowledge-base network.

  7. Motivation (cont.): What are the JobTitle, Email, and City of Mohamed Hafez?

  8. Agenda • Introduction • Motivation • Data Conflict Problem (current) • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work

  9. Data Conflict Problem: SELECT JobTitle, Email, City FROM GS.Employee WHERE Name LIKE 'Mohamed%Hafez'. The sources return conflicting values: which one to choose?

  10. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions (current) • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work

  11. Data Fusion Existing Solutions: Classification of the Conflict Handling Strategies [chart comparing existing systems such as Fusionplex across attributes like SSN, Name, and Address, including NULL handling].

  12. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution (current) • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work

  13. Proposed Solution (cont.): Classification of the Conflict Handling Strategies [chart]. Proposed techniques: full automation with no user intervention. Assumptions: non-federated data sources; no duplicates within any single data source.

  14. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency (current) • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work

  15. Data Fusion based on Data Dependency: SELECT JobTitle, Email FROM GS.Employee. Two scores are used: GAS and LAS.

  16. Data Fusion based on Data Dependency (cont.): 1st Score (GAS), Global Attributes Scoring. Let ATT be the attribute for which we want to calculate GAS, let D be the list of values in ATT, and let CLUS(Card) denote the number of records in the cluster. We define the GAS for a given attribute ATT in a given record from these quantities (see the sketch below).
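
The GAS equation itself was an image on the slide and did not survive extraction. A plausible reconstruction (an assumption, consistent with slide 17, where the conflict-free attribute scores 1) is the relative frequency of the record's value within its duplicate cluster:

$$\mathrm{GAS}(ATT, r) = \frac{\mathrm{Count}(v_r, D)}{\mathrm{CLUS(Card)}}$$

where $v_r$ is the value of ATT in record $r$; if every record in the cluster agrees on ATT, GAS = 1.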

  17. Data Fusion based on Data Dependency (cont.): 1st Score (GAS) [example table; the highlighted attribute is the only one with no conflict].

  18. Data Fusion based on Data Dependency (cont.): Attributes Dependency Score. Let C1 and C2 be two attributes in the same data source DS, let D1 and D2 be the lists of values in C1 and C2 respectively, and let DS(Card) denote the number of records in the data source. The score of the attributes dependency D(C1, C2) between C1 and C2 is defined over the value pairs (D1(k), D2(k)) with Count(D1(k), D2(k)) > 1 (see the sketch below).
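
Again the defining equation was an image. Given the Count > 1 condition, one plausible reading (an assumption, not necessarily the thesis's exact formula) is the fraction of records whose (C1, C2) value pair repeats within the source:

$$D(C_1, C_2) = \frac{\sum_{k:\ \mathrm{Count}(D_1(k), D_2(k)) > 1} \mathrm{Count}\big(D_1(k), D_2(k)\big)}{DS(\mathrm{Card})}$$

so attributes whose values consistently co-occur score close to 1, while unrelated attributes score near 0.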

  19. Data Fusion based on Data Dependency (cont.): Local Detectors. For each data source, the local detectors are its least dependent attribute plus the intersection of the attributes that appear, with a dependency score greater than zero, in all of the contributing dependency lists. Example: FCI {Department, City}, Banks {Employer, City}, PrivateSector {EmployerName, City}.

  20. Data Fusion based on Data Dependency (cont.): Unified Detectors. We define the unified detector set, over all contributing data sources, to be the intersection of all the sets of local detectors (see the sketch below). Local Detectors: FCI {Department, City}, Banks {Employer, City}, PrivateSector {EmployerName, City}. Unified Detectors: {City}.
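
A minimal sketch of the detector intersection from slides 19-20, assuming schema matching has already aligned attribute names across sources; the dictionary uses the slides' example sets:

```python
# Local detector sets per data source, as given on slides 19-20.
local_detectors = {
    "FCI": {"Department", "City"},
    "Banks": {"Employer", "City"},
    "PrivateSector": {"EmployerName", "City"},
}

# Unified detectors = intersection of all local detector sets (slide 20).
unified_detectors = set.intersection(*local_detectors.values())
print(unified_detectors)  # {'City'}
```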

  21. Data Fusion based on Data Dependency (cont.): 2nd Score (LAS), Local Attributes Scoring. Let UN_DET be the list of unified detectors and UN_DET(Card) denote the number of unified detectors over all contributing data sources. We define the LAS for a given attribute ATT from these (see the sketch below).
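
The LAS equation is likewise missing from the transcript. A plausible form (an assumption) is the average dependency of ATT on the unified detectors, reusing the dependency score D from slide 18:

$$\mathrm{LAS}(ATT) = \frac{\sum_{det \in UN\_DET} D(det, ATT)}{UN\_DET(\mathrm{Card})}$$

so an attribute strongly determined by the detectors scores high locally.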

  22. Data Fusion based on Data Dependency (cont.): 2nd Score (LAS) [example table].

  23. Data Fusion based on Data Dependency (cont.): Combined Score (TAS), Total Attributes Scoring, from the 1st Score (GAS) and 2nd Score (LAS). Let ATT be the attribute for which we want to calculate TAS, and let LAS and GAS be its Local Attributes Scoring and Global Attributes Scoring. We define the TAS for a given attribute ATT in a given record from these two scores, where GAS(ATT) ≠ 1, and TAS(ATT) = 1 otherwise (see the sketch below).
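
The combining equation did not survive extraction; the prose only fixes the boundary case TAS = 1 whenever GAS = 1 (no conflict). A simple convex combination is one plausible reconstruction (an assumption, since the exact weighting is not recoverable from the transcript):

$$\mathrm{TAS}(ATT) = \begin{cases} 1 & \mathrm{GAS}(ATT) = 1 \\ \alpha\,\mathrm{GAS}(ATT) + (1 - \alpha)\,\mathrm{LAS}(ATT) & \text{otherwise} \end{cases}$$

with $\alpha \in [0, 1]$ balancing cluster-level agreement against detector-based dependency.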

  24. Data Fusion based on Data Dependency (cont.): 1st Score (GAS), 2nd Score (LAS), Combined Score (TAS) [example table].

  25. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain (current) • Evaluation & Results • Conclusion & Future Work

  26. Data Fusion based on Information Gain. In the literature, information gain is used in signal processing (compression, source and channel coding). Which attribute gives less access to JobTitle values: City or Department? Department >> City. The lower the information gain, the better the partitioning attribute serves as a detector for the target attribute.

  27. Data Fusion based on Information Gain (cont.): Attribute Entropy. Let ATT be the attribute whose value entropy we want to calculate, and let C be the list of classes of values in ATT. Let Ci(Card) denote the number of values in class i, ATT(Card) the number of values in attribute ATT, and Pi the probability of choosing a value of class i from the values of ATT. The attribute entropy of ATT is defined from the Pi (see below). Example values: "Teaching Assistant" (3 times), "Assistant Lecturer" (3 times), "Associate Professor" (3 times) and "Professor" (1 time).
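
The missing equation is the standard Shannon entropy, which is exactly what the prose definition describes:

$$\mathrm{Info}(ATT) = -\sum_{i} P_i \log_2 P_i, \qquad P_i = \frac{C_i(\mathrm{Card})}{ATT(\mathrm{Card})}$$

For the JobTitle example (10 values: three classes of 3 and one class of 1), $P = (0.3, 0.3, 0.3, 0.1)$ and $\mathrm{Info}(JobTitle) = -3(0.3 \log_2 0.3) - 0.1 \log_2 0.1 \approx 1.563 + 0.332 \approx 1.90$ bits.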

  28. Data Fusion based on Information Gain (cont.): Expected Information. Let ATT be the attribute to be partitioned and P_ATT the attribute used to partition it. Let C be the list of classes of ATT values, Ci(Card) the number of values in class i, and ATT(Card) the number of values in ATT. The expected information needed to partition attribute ATT according to attribute P_ATT is defined below.
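
This is the standard expected-information (weighted-entropy) term from decision-tree induction, which fits the prose: partition the values of ATT by the co-occurring value of P_ATT, then average the partition entropies weighted by partition size:

$$\mathrm{Info}_{P\_ATT}(ATT) = \sum_{j} \frac{|ATT_j|}{ATT(\mathrm{Card})}\,\mathrm{Info}(ATT_j)$$

where $ATT_j$ is the sublist of ATT values whose records share the j-th value of P_ATT.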

  29. Data Fusion based on Information Gain (cont.): Information Gain of one attribute partitioning another. Let ATT be the attribute to be partitioned and P_ATT the partitioning attribute. Let Info(ATT) be the entropy of ATT, and Info_(P_ATT)(ATT) the expected information needed to partition ATT according to P_ATT. The gain of such a partitioning is defined below; in the running example it evaluates to 0.41. The lower the information gain, the better the splitter serves as a detector for the split attribute.
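
The definition matching the prose is the usual difference of the two quantities just introduced:

$$\mathrm{Gain}(P\_ATT \text{ on } ATT) = \mathrm{Info}(ATT) - \mathrm{Info}_{P\_ATT}(ATT)$$

Note the selection criterion is inverted relative to decision-tree induction: a low gain means P_ATT's partitions reveal little about the target's values ("less access", per slide 26), which is the property the thesis wants in a detector.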

  30. Data Fusion based on Information Gain (cont.): Local Detectors. For each query-requested attribute, the local detectors are the attribute with the lowest information gain in that attribute's detector list, plus the intersection of all the contributing detector lists, restricted to attributes whose information gain is not equal to the basic entropy of the requested attribute (a sketch follows). Example: Local Detectors FCI {City, Department, Name}, Banks {City, Profession}, PrivateSector {City}; Unified Detectors {City}.
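
A minimal sketch of entropy, expected information, and gain-based detector ranking from slides 27-30; the function and variable names are illustrative, not from the thesis implementation:

```python
from collections import Counter
from math import log2

def info(values):
    # Attribute entropy (slide 27): -sum of p_i * log2(p_i) over value classes.
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def gain(target, splitter):
    # Info(target) minus the expected information after partitioning the
    # target's values by the splitter's values (slides 28-29).
    n = len(target)
    groups = {}
    for t, s in zip(target, splitter):
        groups.setdefault(s, []).append(t)
    expected = sum(len(g) / n * info(g) for g in groups.values())
    return info(target) - expected

def best_local_detector(rows, target_att):
    # Lowest gain is preferred, per slide 29's criterion.
    target = [r[target_att] for r in rows]
    others = [a for a in rows[0] if a != target_att]
    return min(others, key=lambda a: gain(target, [r[a] for r in rows]))
```

On the slides' example this would rank City below Department in gain, matching the choice of City as the detector for JobTitle.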

  31. Data Fusion based on Information Gain (cont.): 2nd Score (DGS), Dependency Gain Scoring. Let ATT be the target attribute to be partitioned and {P_ATT} the list of unified detector attributes. Let Gain(P_ATT on ATT) be the information gained by partitioning ATT based on {P_ATT}, defined as on slide 29; in the running example it is 0.41.

  32. Data Fusion based on Information Gain (cont.): Combined Score (TAPS), Total Attributes Preference Scoring, from the 1st Score (GAS) and 2nd Score (DGS). Let ATT be the attribute for which we want to calculate TAPS, and let DGS and GAS be its Dependency Gain Scoring and Global Attributes Scoring. We define the TAPS for a given attribute ATT in a given record from ΔDGS(ATT) = -(DGS(ATT) - next higher DGS value) / next higher DGS value and ΔGAS(ATT) = (GAS(ATT) - next lowest GAS value) / GAS(ATT), where GAS(ATT) ≠ 1; and TAPS(ATT) = 0 otherwise (see below).
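
In LaTeX, the two normalized deltas given in the prose are:

$$\Delta\mathrm{DGS}(ATT) = -\frac{\mathrm{DGS}(ATT) - \mathrm{DGS}_{\text{next higher}}}{\mathrm{DGS}_{\text{next higher}}}, \qquad \Delta\mathrm{GAS}(ATT) = \frac{\mathrm{GAS}(ATT) - \mathrm{GAS}_{\text{next lowest}}}{\mathrm{GAS}(ATT)}$$

The equation combining them into TAPS was an image and is not recoverable from the transcript; a simple sum, $\mathrm{TAPS}(ATT) = \Delta\mathrm{DGS}(ATT) + \Delta\mathrm{GAS}(ATT)$ for GAS(ATT) ≠ 1 and 0 otherwise, is one plausible reconstruction (an assumption).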

  33. Data Fusion based on Information Gain (cont.): 2nd Score (DGS), 1st Score (GAS), Combined Score (TAPS) [example table].

  34. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results (current) • Conclusion & Future Work

  35. Evaluation & Results: Simulation Environment • Data sources (No_DS): 2, 3, 5, 10, 20 • No. of attributes (No_ATT): 3, 5, 7, 10, 50 • No. of distinct values (No_DVAL): [5,15], [45,55], [95,105], [495,505], [995,1005] • No. of records (No_REC): 50, 100, 300, 500, 1000 • Runs per parameter set: 10 • Simulation input parameter sets: 5 × 5 × 5 × 5 = 625 • Total simulation runs: 625 × 10 = 6,250 • Technology used: Microsoft Visual Studio .NET (C#), Microsoft SQL Server.

  36. Evaluation & Results (cont.): Partial Matching, cases where Data Dependency is better than Information Gain [table: superior successful parameter sets].

  37. Evaluation & Results (cont.): Partial Matching, cases where Data Dependency is better than Information Gain [table: superior successful parameter sets, continued].

  38. Evaluation & Results (cont.): Partial Matching, cases where Information Gain is better than Data Dependency [table: superior successful parameter sets].

  39. Evaluation & Results (cont.): Partial Matching, cases where Information Gain is better than Data Dependency [table: superior successful parameter sets, continued].

  40. Evaluation & Results (cont.): Partial Matching, cases where both techniques succeed and give the same result, No_ATT = 50 and No_DS = 2 [chart of superior perfect successful parameter sets over No_REC (50/100, 300/500, 1000) and No_DVAL ([5,15] through [995,1005])].

  41. Evaluation & Results (cont.): Partial Matching, cases where both techniques succeed and give the same result, No_ATT = 50 [chart: No_DS = 3; No_REC in {50, 100, 500}; No_DVAL from [5,15] through [995,1005]; superior perfect successful parameter sets].

  42. Evaluation & Results (cont.): Partial Matching, cases where both techniques succeed and give the same result, No_ATT = 50 [chart: No_DS in {5, 10}; No_REC in {50, 100}; No_DVAL = [5,15]; superior perfect successful parameter sets].

  43. Evaluation & Results (cont.): Partial Matching, cases where both techniques are unsuccessful [table: total superior failure parameter sets].

  44. Evaluation & Results (cont.): Full Matching, cases where Data Dependency is better than Information Gain.

  45. Evaluation & Results (cont.): Full Matching, cases where Information Gain is better than Data Dependency [table: Information Gain superior successful parameter sets with DD_Full_MATCH > 0%].

  46. Evaluation & Results (cont.): Full Matching, cases where both techniques succeed and give the same result.

  47. Evaluation & Results (cont.): Full Matching, cases where both techniques are unsuccessful.

  48. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work (current)

  49. Conclusion & Future Work: Conclusion • The larger the number of attributes (No_ATT) in each data source, the higher the partial matching scores for both techniques; the opposite holds for a lower No_ATT. • Increasing the number of records (No_REC), the number of data sources (No_DS), or the number of distinct values (No_DVAL) leads to lower matching for both techniques. • Data Dependency performs better when the average dependency between the attributes' data (AvgDataDep) is very high. It also scores high matching when the ratio between No_REC and No_DVAL is high, meaning value pairs have a higher probability of repeating.

  50. Conclusion & Future Work: Conclusion (cont.) • Information Gain performs better when AvgDataDep is relatively low and the ratio between No_REC and No_DVAL is relatively low; this is because Information Gain takes any value pair into account even if it appears only once, and is based on partitioning rather than dependency. • Both techniques behave the same when AvgDataDep is in the middle (neither low nor high) and the ratio is also in the middle. • It has also been observed that both techniques score very high matching, sometimes 100% partial matching, when all simulation input parameters are at their minimum values. • Both techniques fail when the ratio between No_REC and No_DVAL is very skewed, i.e., No_DVAL is very large and No_REC is small; such a ratio makes it very difficult for either technique to achieve better matching results.
