Deduplication fusion
Deduplication & Fusion

Robert Ventura Simon

[email protected]


Index

  • Introduction

  • Process

  • Successful stories

  • Architecture

  • Demo


Introduction: Benefits

Identification of suspected duplicate records within a database

Merging of data from several databases with different formats, detecting duplicate records along the way

Validation tools for the detected similarities


Introduction: Deduplication

  • Configuration

  • Automatic execution

  • Validation of results

  • Personalized export

Introduction: Fusion

  • Configuration

  • Automatic execution

  • Validation of results

  • Personalized export

Introduction: Features


Index

  • Introduction

  • Process

    • Deduplication

    • Fusion

  • Successful stories

  • Architecture

  • Demo


Deduplication: Configurations

  • Input data file format: CSV

  • Select the relevant columns for linking records

  • Assign a type to each column so that the most suitable automatic filters are applied

Process pipeline (diagram): CSV → Configurations → Execution → Validation → Exportation → CSV


Deduplication: Configurations

  • Comparison type: exact value, text-based estimation, numerical estimation

  • Percentage weight of each column in the similarity computation

Example weighting (figure): 100% = 30% + 35% + 35%
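The per-column percentages can be read as weights in a weighted sum of column similarities. The following is a minimal sketch under that reading, not the tool's actual implementation: the column names are hypothetical, and Python's `difflib.SequenceMatcher` stands in for the text estimator, which the slides do not specify.

```python
from difflib import SequenceMatcher

# Hypothetical columns; the weights mirror the 30% / 35% / 35% split on the slide.
WEIGHTS = {"name": 0.30, "surname1": 0.35, "surname2": 0.35}


def text_similarity(a: str, b: str) -> float:
    """Stand-in text estimator returning a score between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-column similarities; weights are expected to sum to 1."""
    return sum(
        weight * text_similarity(rec_a[col], rec_b[col])
        for col, weight in weights.items()
    )
```

With these weights, two records that agree on both surnames but differ completely in the name column would score about 0.70.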


Deduplication: Configurations

  • Use filters to normalize values

  • Automatic and specific filters are available for values such as names, dates, addresses, etc.

(Figure: filters applied)


Deduplication: Configurations

  • Filter editing (create new filters, delete or update existing ones)

  • Use of dictionaries: name-converter dictionary (e.g. Pepe → Jose)
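A normalization filter combined with a name-converter dictionary could look like the sketch below. The function names and the dictionary contents are assumptions made for illustration (only the Pepe/Jose pair comes from the slide); the real filters are not described in the presentation.

```python
import unicodedata

# Hypothetical name-converter dictionary: variant -> canonical form.
NAME_DICTIONARY = {"pepe": "jose"}


def strip_accents(value: str) -> str:
    """Remove combining accent marks, e.g. 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(value: str, dictionary: dict = NAME_DICTIONARY) -> str:
    """Lower-case, strip accents, collapse whitespace, then apply the dictionary."""
    tokens = strip_accents(value).lower().split()
    return " ".join(dictionary.get(token, token) for token in tokens)
```

Applying the filter before comparison means that "Pepe GARCÍA" and "José Garcia" are compared as the same normalized string.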


Deduplication: Configurations

  • Similarity is computed with a Record Linkage algorithm. Parameters:

    • Sliding window size: the number of records each record is compared to

    • Sorting columns: the columns used to order the records

    • Similarity acceptance threshold
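The three parameters suggest a sorted, sliding-window comparison. The sketch below assumes that each record is compared with the `window` records that follow it in sort order, and it uses `difflib.SequenceMatcher` as a placeholder similarity; both choices are assumptions, since the slides do not give the exact algorithm.

```python
from difflib import SequenceMatcher


def similarity(rec_a: dict, rec_b: dict, columns: list) -> float:
    """Placeholder similarity: mean text ratio over the compared columns."""
    ratios = [SequenceMatcher(None, rec_a[c], rec_b[c]).ratio() for c in columns]
    return sum(ratios) / len(ratios)


def sliding_window_pairs(records, sort_columns, compare_columns, window=2, threshold=0.5):
    """Sort the records, then compare each one with the `window` records after it.

    Returns (index_a, index_b, score) for every pair scoring above the threshold.
    """
    order = sorted(
        range(len(records)),
        key=lambda i: [records[i][c] for c in sort_columns],
    )
    matches = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            score = similarity(records[i], records[j], compare_columns)
            if score > threshold:
                matches.append((i, j, score))
    return matches
```

The window keeps the number of comparisons roughly proportional to the number of records times the window size, at the cost of missing duplicates that sort far apart.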


Deduplication: Execution

  • Order by Surname 1

  • Sliding window = 2

  • Similarities are detected within the window, each with a similarity degree

  • List of detected similarities

  • List of detected similarities whose percentage exceeds the 50% threshold


Deduplication: Validation

  • Validation of results (only those above the threshold)

  • View by similarity or by group

  • Bulk validation

  • Validation can be shared among several supervisors
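One way to obtain the "by group" view is to treat accepted similarities as links and collect the records that are connected directly or transitively. The sketch below uses a small union-find for this; whether the tool builds its groups this way is an assumption.

```python
def group_matches(pairs):
    """Group record ids linked directly or transitively by an accepted similarity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])
```

For example, the pairs (1, 2) and (2, 3) end up in one group of three records, which a supervisor could then validate in a single step.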


Deduplication: Exportation

  • Select output format



Fusion: Configurations

  • Input data file format: CSV

  • Select the relevant columns for linking records

  • Define the relation between columns from different data sources (only when merging)

  • Assign a type to each column so that the most suitable automatic filters are applied
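Relating columns across sources amounts to mapping each source's column names onto a shared schema before comparison. A minimal sketch, assuming hypothetical source names and column headers:

```python
import csv
import io

# Hypothetical mapping from each source's column names to shared names.
COLUMN_MAP = {
    "source_a": {"Nombre": "name", "Ciudad": "city"},
    "source_b": {"customer_name": "name", "town": "city"},
}


def load_records(csv_text: str, source: str, column_map: dict = COLUMN_MAP) -> list:
    """Read CSV text and rename the mapped columns to the shared schema."""
    mapping = column_map[source]
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"source": source, **{mapping[col]: row[col] for col in mapping}}
        for row in reader
    ]
```

Once both sources use the same column names, the same similarity computation can be applied to records from either file.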


Fusion: Configurations

  • Comparison type: exact value, text-based estimation, numerical estimation

  • Percentage weight of each column in the similarity computation

Example weighting (figure): 100% = 80% + 20%


Fusion: Configurations

  • Specific percentage for records with null-valued columns

  • Use filters to standardize values

  • Automatic and specific filters are available for values such as names, dates, addresses, etc.
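The slide does not explain how the null-value percentage is applied. One possible reading is that a column contributes a fixed score whenever either value is missing; the sketch below follows that reading. The `NULL_SCORE` value, the column names and the use of `difflib.SequenceMatcher` are all assumptions.

```python
from difflib import SequenceMatcher

NULL_SCORE = 0.5  # hypothetical score used when either value is missing


def column_similarity(a, b, null_score: float = NULL_SCORE) -> float:
    """Return the fixed null score if a value is missing, else a text ratio."""
    if not a or not b:
        return null_score
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights: dict,
                      null_score: float = NULL_SCORE) -> float:
    """Weighted sum of column similarities with explicit null handling."""
    return sum(
        weight * column_similarity(rec_a.get(col), rec_b.get(col), null_score)
        for col, weight in weights.items()
    )
```

With weights of 0.8 and 0.2, a pair that matches on the first column and has a missing value in the second would score 0.8 + 0.2 × 0.5 = 0.9 under this reading.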


Fusion: Configurations

  • Edit filters (create new filters, delete or update existing ones)

  • Use of dictionaries: name-converter dictionary (e.g. BCN → BARCELONA)


Fusion: Configurations

  • Similarity is computed with a Record Linkage algorithm. Parameters:

    • Sliding window size: the number of records each record is compared to

    • Sorting columns: the columns used to order the records

    • Similarity acceptance threshold


Fusion: Execution

  • Order by City

  • Sliding window = 2

  • Similarities are detected within the window, each with a similarity degree

  • List of detected similarities

  • List of detected similarities whose percentage exceeds the 50% threshold


Fusion: Validation

  • Validation of results (only those above the threshold)

  • View by similarity or by group

  • Bulk validation

  • Validation can be shared among several supervisors


Fusion: Exportation

  • Output format

    • Select values for every similarity
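Selecting values for a similarity can be pictured as choosing, column by column, which source record supplies the exported value. A minimal sketch; the `choices` structure and the fallback rule for unchosen columns are assumptions, not documented behaviour.

```python
def merge_pair(rec_a: dict, rec_b: dict, choices: dict) -> dict:
    """Build one output record; `choices` maps a column name to "a" or "b".

    Columns without an explicit choice keep the value from `rec_a` unless it is
    empty, in which case the value from `rec_b` is used (an assumed default).
    """
    merged = {}
    for col in sorted(set(rec_a) | set(rec_b)):
        pick = choices.get(col)
        if pick == "a":
            merged[col] = rec_a.get(col)
        elif pick == "b":
            merged[col] = rec_b.get(col)
        else:
            merged[col] = rec_a.get(col) or rec_b.get(col)
    return merged
```

The merged dictionaries can then be written out with `csv.DictWriter` to produce the CSV output.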



Successful stories: Health Service

Who? Health Service

Objective: Detect repeated health ID cards

Solution: Detect repeated records in the database and delete them (deduplication with DAURUM)

Result: Health ID card database cleaned of repetitions


Successful stories: Beer manufacturer

Who? Beer manufacturer

Objective: Detect dealers that deliver to centers not previously assigned to them

Solution: Identify duplicates in each dealer's delivery database and delete them (deduplication with DAURUM); detect deliveries to centers shared between different dealers (fusion with DAURUM)

Result: Master database cleaned of repetitions, and dealers with wrong deliveries detected



Architecture

  • Struts 2: Model-View-Controller

  • Hibernate: database manipulation



    Demo


    • Thanks for your attention

    • Any questions?

    SPARSITY-TECHNOLOGIES

    Jordi Girona, 1-3, Edifici K2M 08034 Barcelona

    [email protected]

    http://www.sparsity-technologies.com

    DAMA-UPC. DATA MANAGEMENT (UPC), Departament d'Arquitectura de Computadors, Edifici C6-S103, Campus Nord. Jordi Girona, 1-3. 08034 - Barcelona. www.dama.upc.edu

