
Fast & Automatic Data Fusion

Mohamed M. Hafez M. AbdElRahman. Course Instructor: Dr. Anshumali Shrivastava. Rice University, October 26, 2015.


Presentation Transcript


  1. Fast & Automatic Data Fusion. Mohamed M. Hafez M. AbdElRahman. Course Instructor: Dr. Anshumali Shrivastava. Rice University, October 26, 2015.

  2. Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on: Data Dependency, Information Gain • Evaluation & Results • Future Work

  3. Introduction: Data Integration Components. [Figure labels: User Interface; Query; Consistent and Unambiguous Answer; Data Integration System (Mediator); Global Unified Schema; Fusion; Cleansing; Local Schema (three sources)]

  4. Introduction (cont.): Common Three Levels of Inconsistencies. [Figure, bottom to top: Data Sources; Step 1: Schema Matching; Step 2: Duplicate Detection; Step 3: Data Fusion; Application]

  5. Agenda (current section: Motivation)

  6. Motivation • Data is now everywhere and exists in multiple versions (old vs. new). • Different social networks provide different kinds of data. • We focus on English textual data. • How can data fusion solve real-life problems? Passport(s) inspection; news verification; building a knowledge base network.

  7. Motivation (cont.) What are the JobTitle, Email, and City of Mohamed Hafez (me)?

  8. Agenda (current section: Data Conflict Problem)

  9. Data Conflict Problem. Query: SELECT JobTitle, Email, City FROM GS.Employee WHERE Name LIKE 'Mohamed%Hafez'. Which one to choose?
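
The conflict raised by this query can be illustrated with a minimal Python sketch. It assumes three sources return rows for the same person; the source names (S1 to S3), the placeholder values, and the helper `find_conflicts` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical rows for one person, one per data source (placeholder values only).
rows = [
    {"source": "S1", "JobTitle": "title_A", "Email": "mail_1", "City": "city_X"},
    {"source": "S2", "JobTitle": "title_B", "Email": "mail_1", "City": "city_X"},
    {"source": "S3", "JobTitle": "title_A", "Email": None, "City": "city_Y"},
]


def find_conflicts(rows, attributes):
    """Return attributes that have more than one distinct non-null value."""
    values = defaultdict(set)
    for row in rows:
        for attr in attributes:
            if row.get(attr) is not None:
                values[attr].add(row[attr])
    return {attr: vals for attr, vals in values.items() if len(vals) > 1}


conflicts = find_conflicts(rows, ["JobTitle", "Email", "City"])
for attr, vals in sorted(conflicts.items()):
    print(attr, sorted(vals))
```

Here JobTitle and City are in conflict, while Email is not: the missing value in S3 is an uncertainty rather than a contradiction. Choosing among the conflicting values is the fusion step that the rest of the talk addresses.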

  10. Agenda (current section: Data Fusion Existing Solutions)

  11. Data Fusion Existing Solutions: Classification of the Conflict Handling Strategies. [Figure labels: SSN; Name; Address; NULL; Fusionplex; values 2k to 10k]

  12. Agenda (current section: Proposed Solution)

  13. Proposed Solution (cont.): Classification of the Conflict Handling Strategies. Proposed techniques: full automation with no user intervention. Assumptions: non-federated data sources; no duplicates within data sources.

  14. Agenda (current section: Data Fusion based on Data Dependency)

  15. Data Fusion based on Data Dependency. Query: SELECT JobTitle, Email FROM GS.Employee. Two scores are used: GAS and LAS.

  16. Data Fusion based on Data Dependency (cont.) 1st Score (GAS)

  17. Data Fusion based on Data Dependency (cont.) 1st Score (GAS). The only one with no conflict.

  18. Data Fusion based on Data Dependency (cont.) 1st Score (GAS); 2nd Score (LAS); Combined Score (TAS)

  19. Agenda (current section: Data Fusion based on Information Gain)

  20. Data Fusion based on Information Gain. Example: “Teaching Assistant” (3 times), “Assistant Lecturer” (3 times), “Associate Professor” (3 times), and “Professor” (1 time).
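
Information gain is built on entropy, so as a worked illustration the Shannon entropy of the value counts on this slide can be computed directly. This is a sketch using the textbook definition; the slides do not show their exact formula, so the use of base-2 logarithms and the function name `entropy` are assumptions.

```python
import math

# Value counts taken from the slide (10 occurrences in total).
job_title_counts = {
    "Teaching Assistant": 3,
    "Assistant Lecturer": 3,
    "Associate Professor": 3,
    "Professor": 1,
}


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


h = entropy(job_title_counts.values())
print(round(h, 4))  # 1.8955
```

The result is just below the 2-bit maximum for four values, which reflects how evenly the three most frequent values are spread.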

  21. Data Fusion based on Information Gain (cont.)

  22. Data Fusion based on Information Gain (cont.) The lower the information gain, the better the splitter to be selected as a detector for the split attribute.
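
For reference, the textbook information gain of a splitter attribute with respect to a target attribute can be sketched as below. Whether the measure on this slide matches the textbook definition is an assumption, since the slide formulas are not in the transcript; the records and the names `entropy` and `information_gain` are hypothetical. The sketch only reports the values and does not apply a selection rule.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits of a list of observed values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain(records, splitter, target):
    """Entropy of target minus the weighted entropy of target within each splitter group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[splitter]].append(record[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in records]) - remainder


# Hypothetical records with placeholder values.
records = [
    {"Email": "e1", "City": "c1", "JobTitle": "t1"},
    {"Email": "e1", "City": "c2", "JobTitle": "t1"},
    {"Email": "e2", "City": "c1", "JobTitle": "t2"},
    {"Email": "e2", "City": "c2", "JobTitle": "t2"},
]

ig_email = information_gain(records, "Email", "JobTitle")
ig_city = information_gain(records, "City", "JobTitle")
print(ig_email, ig_city)
```

In this toy data Email determines JobTitle completely (gain of 1 bit), while City carries no information about it (gain of 0).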

  23. Data Fusion based on Information Gain (cont.) 1st Score (GAS); 2nd Score (DGS); Combined Score (TAPS)

  24. Agenda (current section: Evaluation & Results)

  25. Evaluation & Results: Simulation Environment
     • Data Sources: 2, 3, 5, 10, 20
     • No. of Attributes: 3, 5, 7, 10, 50
     • No. of Distinct Values: [5,15], [45,55], [95,105], [495,505], [995,1005]
     • No. of Records: 50, 100, 300, 500, 1000
     • No. of Runs for each parameter set: 10
     • No. of Simulation Input Parameter Sets: 5 * 5 * 5 * 5 = 625
     • Total Number of Simulation Runs: 625 * 10 = 6250
     • Slide annotation: "Should be >= 25"
     • Evaluation Criteria: Partial Matching vs. Full Matching
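
The run counts on this slide follow from taking every combination of the four input parameters. A short Python sketch of that enumeration is given below; the variable names are illustrative, and the simulation itself is not reproduced.

```python
from itertools import product

# Parameter values as listed on the slide.
data_sources = [2, 3, 5, 10, 20]
attributes = [3, 5, 7, 10, 50]
distinct_value_ranges = [(5, 15), (45, 55), (95, 105), (495, 505), (995, 1005)]
records = [50, 100, 300, 500, 1000]
runs_per_set = 10

# Every combination of the four parameters is one parameter set.
parameter_sets = list(product(data_sources, attributes, distinct_value_ranges, records))
total_runs = len(parameter_sets) * runs_per_set

print(len(parameter_sets), total_runs)  # 625 6250
```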

  26. Evaluation & Results: Evaluation Summary
     • Information Gain performs better when AvgDataDep is relatively low and the ratio between No_REC and No_DVAL is relatively low. This is because Information Gain takes every pair into account, even one that appears only once, and is based on partitioning rather than dependency.
     • Both techniques behave the same when AvgDataDep is in the middle (neither low nor high) and the ratio is also in the middle.
     • Both techniques score very high matching, and sometimes 100% partial matching, when all simulation input parameters are at their minimum values.
     • Both techniques fail when No_DVAL is very large and No_REC is small; with such a ratio it is difficult for either technique to achieve good matching results.

  27. Agenda (current section: Future Work)

  28. Future Work • Compare the techniques against GAS alone and show the results (in the final presentation). • Hashing: speed up the proposed techniques.

  29. THANK YOU

  30. Fast & Automatic Data Fusion. Mohamed M. Hafez M. AbdElRahman. Course Instructor: Dr. Anshumali Shrivastava. Rice University, October 26, 2015.
