
Deduplication & Fusion

Robert Ventura Simon

[email protected]


Index


  • Introduction

  • Process

  • Successful stories

  • Architecture

  • Demo




Introduction: Benefits

Identification of suspected duplicate records within a database

Merging of data from several databases with different formats, detecting duplicate records along the way

Validation tools for the detected similarities


Introduction: Deduplication

Configuration

Automatic execution

Validation of results

Personalized export




Introduction: Fusion

Configuration

Automatic execution

Validation of results

Personalized export




Introduction: Features


Index

  • Introduction

  • Process

    • Deduplication

    • Fusion

  • Successful stories

  • Architecture

  • Demo


    Deduplication: Configurations

    • Input data file format: CSV

    • Select the relevant columns used to link records

    • Assign a type to each column so that the most suitable automatic filters can be applied

    Process stages: CSV → Configurations → Execution → Validation → Exportation → CSV


    Deduplication: Configurations

    • Comparison type: exact value, text-based estimation, or numerical estimation

    • Percentage weight of each column in the similarity computation

    • Example weights: 30% + 35% + 35% = 100%
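The weighting described above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the column names, the use of `difflib` for text estimation, and the relative-difference formula for numerical estimation are all assumptions made for the example. Only the three comparison types and the 30/35/35 weights come from the slide.

```python
from difflib import SequenceMatcher


def exact(a: str, b: str) -> float:
    """Exact-value comparison: 1.0 when equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def text_estimate(a: str, b: str) -> float:
    """Text estimation using difflib's ratio (an assumed stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()


def numeric_estimate(a: str, b: str) -> float:
    """Numerical estimation: 1.0 when equal, falling to 0.0 as values diverge."""
    x, y = float(a), float(b)
    largest = max(abs(x), abs(y))
    return 1.0 if largest == 0 else max(0.0, 1.0 - abs(x - y) / largest)


# Hypothetical configuration: column -> (comparison function, weight in percent).
COLUMNS = {
    "name": (text_estimate, 30),
    "surname": (text_estimate, 35),
    "birth_year": (numeric_estimate, 35),
}


def similarity(rec_a: dict, rec_b: dict, columns: dict = COLUMNS) -> float:
    """Return the weighted similarity of two records as a percentage (0-100)."""
    total = sum(weight for _, weight in columns.values())
    if total != 100:
        raise ValueError("column weights must add up to 100%")
    return sum(
        weight * compare(rec_a[col], rec_b[col])
        for col, (compare, weight) in columns.items()
    )
```

Two records that agree on every column score 100, and a column that does not match at all removes its full weight from the total.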


    Deduplication: Configurations

    • Use filters to normalize values

    • Automatic and specific filters are available for values such as names, dates, addresses, etc.

    [Figure: filters applied]


    Deduplication: Configurations

    • Editing of filters (create new filters, delete or update existing ones)

    • Use of dictionaries: name-converter dictionary (e.g. Pepe → Jose)
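As a rough illustration of a normalizing filter combined with a name-converter dictionary, consider the sketch below. The function names and the exact normalization steps (accent removal, lowercasing, whitespace collapsing) are assumptions. The slides only state that filters normalize values and that a dictionary can convert names such as Pepe to Jose.

```python
import re
import unicodedata

# Hypothetical name-converter dictionary: variant -> canonical form.
NAME_DICTIONARY = {"pepe": "jose"}


def strip_accents(value: str) -> str:
    """Remove diacritics, e.g. 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_text(value: str) -> str:
    """Generic filter: trim, drop accents, lowercase, collapse whitespace."""
    return re.sub(r"\s+", " ", strip_accents(value).strip().lower())


def apply_dictionary(value: str, dictionary: dict = NAME_DICTIONARY) -> str:
    """Replace each word that has a dictionary entry with its canonical form."""
    return " ".join(dictionary.get(word, word) for word in value.split(" "))


def name_filter(value: str) -> str:
    """Name filter: generic normalization followed by the dictionary lookup."""
    return apply_dictionary(normalize_text(value))
```

With this sketch, `name_filter("  Pepe   García ")` and `name_filter("José Garcia")` both yield `"jose garcia"`, so the two spellings compare as equal afterwards.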


    Deduplication: Configurations

    • Similarity computation algorithm called Record Linkage. Parameters:

      • Sliding window size: the number of records each record is compared with.

      • Sorting columns: the columns used to order the records.

      • Similarity acceptance threshold
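The three parameters above map naturally onto a sorted, sliding-window comparison. The following is a minimal sketch under stated assumptions, not the product's algorithm. The function names are hypothetical, the similarity is a plain average of `difflib` text ratios, and "window" is read as the number of following records each record is compared with.

```python
from difflib import SequenceMatcher


def similarity(rec_a: dict, rec_b: dict, columns: list) -> float:
    """Stand-in similarity: mean text ratio over the columns, as a percentage."""
    ratios = [SequenceMatcher(None, rec_a[c], rec_b[c]).ratio() for c in columns]
    return 100.0 * sum(ratios) / len(ratios)


def link_records(records, sort_columns, compare_columns, window=2, threshold=50.0):
    """Return (index_a, index_b, similarity) for pairs above the threshold.

    Records are sorted by sort_columns; each one is then compared only with
    the next `window` records in that order.
    """
    order = sorted(
        range(len(records)),
        key=lambda i: tuple(records[i][c] for c in sort_columns),
    )
    matches = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            score = similarity(records[i], records[j], compare_columns)
            if score > threshold:
                matches.append((min(i, j), max(i, j), score))
    return matches
```

Sorting first places likely duplicates next to each other, so a small window keeps the number of comparisons roughly proportional to the number of records instead of comparing every pair.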


    Deduplication: Execution

    • Order by Surname 1

    • Sliding window = 2


    Deduplication: Execution

    • Similarities detected

    [Figure: sliding window = 2, showing the detected similarities and their similarity degree]




    Deduplication: Execution

    • List of detected similarities


    Deduplication: Execution

    • List of detected similarities whose percentage exceeds the 50% threshold


    Deduplication: Validation

    • Validation of results (including only those above the threshold)

    • View by similarity or by group

    • Bulk validation

    • Validation can be shared among several supervisors
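The slides do not say how groups are formed for the "by group" view. One plausible reading, sketched here purely as an assumption, is that records connected through accepted similarity pairs belong to the same group. The function name and the union-find approach are illustrative choices.

```python
def group_matches(pairs, record_count):
    """Group record indices connected by accepted pairs (union-find)."""
    parent = list(range(record_count))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one for stable group ids.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in range(record_count):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one record are suspected duplicates.
    return [members for members in groups.values() if len(members) > 1]
```

Under this reading, a supervisor could accept or reject a whole group at once, which is one way bulk validation might work.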


    Deduplication: Exportation

    • Select output format




    Fusion: Configurations

    • Input data file format: CSV

    • Select the relevant columns used to link records

    • Define the relation between columns from the different data sources (only when merging)

    • Assign a type to each column so that the most suitable automatic filters can be applied
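Relating columns from different sources amounts to mapping each source's column names onto a common schema before comparison. The sketch below shows one way to do that with the standard `csv` module. The column names, the mappings, and the `load_source` helper are hypothetical.

```python
import csv
import io

# Hypothetical mappings: source column name -> common column name.
MAPPING_A = {"Nom": "name", "Ciutat": "city"}
MAPPING_B = {"customer_name": "name", "town": "city"}


def load_source(csv_text: str, mapping: dict, source: str) -> list:
    """Read CSV text and rename the mapped columns to the common schema."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Unmapped columns are dropped; only related columns are kept.
        record = {common: row[original] for original, common in mapping.items()}
        record["source"] = source  # remember where each record came from
        rows.append(record)
    return rows
```

After both files are loaded this way, the two lists can be concatenated and compared as one record set, with the `source` field still identifying the origin of each record.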


    Fusion: Configurations

    • Comparison type: exact value, text-based estimation, or numerical estimation

    • Percentage weight of each column in the similarity computation

    • Example weights: 80% + 20% = 100%


    Fusion: Configurations

    • Specific percentage for records with null-valued columns

    • Use filters to standardize values

    • Automatic and specific filters are available for values such as names, dates, addresses, etc.
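The slide does not spell out how the null-value percentage is applied. One possible interpretation, shown here strictly as an assumption, is that a column with a missing value on either side receives a fixed, configurable score instead of being compared. The function name and the 0.5 default are placeholders.

```python
def column_score(value_a, value_b, compare, null_score=0.5):
    """Score one column in the range 0.0-1.0.

    When either value is missing (None or empty), return the configured
    null_score instead of comparing. The 0.5 default is an arbitrary
    placeholder, not a value taken from the slides.
    """
    if value_a in (None, "") or value_b in (None, ""):
        return null_score
    return compare(value_a, value_b)
```

A low `null_score` treats missing data as evidence against a match, while a higher one keeps incomplete records eligible for manual validation.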


    Fusion: Configurations

    • Editing of filters (create new filters, delete or update existing ones)

    • Use of dictionaries: name-converter dictionary (e.g. BCN → BARCELONA)


    Fusion: Configurations

    • Similarity computation algorithm called Record Linkage. Parameters:

      • Sliding window size: the number of records each record is compared with.

      • Sorting columns: the columns used to order the records.

      • Similarity acceptance threshold


    Fusion: Execution

    • Order by City

    • Sliding window = 2


    Fusion: Execution

    • Similarities detected

    [Figure: sliding window = 2, showing the detected similarities and their similarity degree]




    Fusion: Execution

    • List of detected similarities


    Fusion: Execution

    • List of detected similarities whose percentage exceeds the 50% threshold


    Fusion: Validation

    • Validation of results (including only those above the threshold)

    • View by similarity or by group

    • Bulk validation

    • Validation can be shared among several supervisors


    Fusion: Exportation

    • Output format

      • Select values for every similarity
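Selecting values for every similarity can be pictured as choosing, column by column, which of the matched records supplies the output value. The sketch below is one hypothetical way to express that choice and write the result as CSV. The `fuse` and `export_csv` helpers and the sample data are assumptions.

```python
import csv
import io


def fuse(group: list, choose: dict) -> dict:
    """Build one output record from a group of matched records.

    `choose` maps each output column to the index, within the group, of the
    record whose value was selected for that column.
    """
    return {column: group[index][column] for column, index in choose.items()}


def export_csv(records: list, columns: list) -> str:
    """Serialize the fused records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For example, given the matched records `{"name": "Anna Puig", "city": "BCN"}` and `{"name": "A. Puig", "city": "Barcelona"}`, choosing the name from the first and the city from the second produces a single record with the most complete values.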




    Successful stories: Health Service

    Who? Health Service

    Objective: Detect repeated health ID cards

    Solution: Detect repeated records in the database and delete them (deduplication with DAURUM)

    Result: Health ID card database cleaned of repetitions


    Successful stories: Beer manufacturer

    Who? Beer manufacturer

    Objective: Detect dealers that deliver to centers not previously assigned to them

    Solution: Identify duplicates in each dealer's delivery database and delete them (deduplication with DAURUM); detect deliveries to centers shared between different dealers (fusion with DAURUM)

    Result: Master database cleaned of repetitions, and dealers with wrong deliveries detected




    Architecture

    • Struts 2: Model-View-Controller

    • Hibernate: Database manipulation




    Demo



    • Thanks for your attention

    • Any questions?

    SPARSITY-TECHNOLOGIES

    Jordi Girona, 1-3, Edifici K2M 08034 Barcelona

    [email protected]

    http://www.sparsity-technologies.com

    DAMA-UPC. DATA MANAGEMENT (UPC), Departament d'Arquitectura de Computadors, Edifici C6-S103. Campus Nord. Jordi Girona, 1-3. 08034 - Barcelona. www.dama.upc.edu

