Deduplication fusion

Deduplication & Fusion - PowerPoint PPT Presentation



Deduplication & Fusion. Robert Ventura Simon [email protected]. Index: Introduction, Process, Successful stories, Architecture, Demo.

Presentation Transcript
Deduplication & Fusion

Robert Ventura Simon

[email protected]


Index

  • Introduction

  • Process

  • Successful stories

  • Architecture

  • Demo



Introduction: Benefits

Identification of suspected duplicated records inside a database

Merging of data from several databases with different formats, detecting duplicated records in the process

Validation tools for the detected similarities


Introduction: Deduplication


Configuration

Automatic execution

Validation of results

Personalized export


Introduction: Fusion

Configuration

Automatic execution

Validation of results

Personalized export



Introduction: Features


Index

  • Introduction

  • Process

    • Deduplication

    • Fusion

  • Successful stories

  • Architecture

  • Demo


  • Deduplication: Configurations

    • Input data file format: CSV

    • Select the relevant columns used to link records

    • Assign a type to each column so that the most suitable automatic filters can be applied

    Process stages: CSV -> Configurations -> Execution -> Validation -> Exportation -> CSV
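The configuration step above (CSV input, selecting the linking columns, assigning a type to each) can be sketched as follows. This is only an illustration: the column names, the type labels and the `load_records` helper are hypothetical and are not taken from the tool described in the slides.

```python
import csv
import io

# Hypothetical configuration: the columns used to link records and the
# type assigned to each one (the type would later select the filters).
COLUMN_TYPES = {"name": "name", "surname1": "name", "birth_date": "date"}


def load_records(csv_text, column_types):
    """Read CSV text and keep only the configured linking columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: row.get(col, "") for col in column_types} for row in reader]


sample = (
    "id,name,surname1,birth_date,phone\n"
    "1,Pepe,Garcia,1980-01-02,555\n"
    "2,Jose,Garcia,1980-01-02,556\n"
)
records = load_records(sample, COLUMN_TYPES)
```

Columns that are not listed in the configuration (here `id` and `phone`) are simply ignored when linking.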


    • Comparison type: exact value, text estimation, numerical estimation

    • Weight (as a percentage) of each column in the similarity computation

    • Example weighting: 100% = 30% + 35% + 35%
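A minimal sketch of how per-column comparison types and percentage weights could combine into one similarity score. The three comparators and the column names are assumptions made for illustration (the text comparator uses Python's `difflib`, which is not necessarily what the tool uses); only the 30% / 35% / 35% split comes from the slide.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Exact value comparison: 1.0 when equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def text_estimate(a, b):
    """Text estimation: ratio of matching characters, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def numeric_estimate(a, b):
    """Numerical estimation: 1.0 minus the relative difference, floored at 0.0."""
    a, b = float(a), float(b)
    largest = max(abs(a), abs(b))
    return 1.0 if largest == 0 else max(0.0, 1.0 - abs(a - b) / largest)


# (column, comparator, weight); hypothetical columns, weights sum to 100%.
COLUMNS = [
    ("name", text_estimate, 0.30),
    ("surname1", text_estimate, 0.35),
    ("card_id", exact, 0.35),
]


def similarity(r1, r2, columns=COLUMNS):
    """Weighted sum of the per-column scores."""
    return sum(weight * compare(r1[col], r2[col]) for col, compare, weight in columns)
```

Two identical records score 1.0 (up to floating-point rounding), and a record pair that differs only in `card_id` loses exactly that column's 35%.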


    • Use filters to normalize values

    • Automatic and type-specific filters are available for values such as names, dates and addresses

    • Filters applied


    • Editing of filters (create new filters, delete or update existing ones)

    • Use of dictionaries: name-converter dictionary (e.g. Pepe -> Jose)

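The normalization filters and the name-converter dictionary described above can be illustrated with a short sketch. The filter shown (accent removal, lower-casing, whitespace collapsing) and the function names are assumptions; only the Pepe -> Jose entry comes from the slide.

```python
import unicodedata

# Name-converter dictionary; the single entry is the example from the slide.
NAME_DICTIONARY = {"pepe": "jose"}


def strip_accents(value):
    """Remove diacritics so that 'José' and 'Jose' compare as equal."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(value, dictionary=NAME_DICTIONARY):
    """Hypothetical name filter: strip accents, lower-case, collapse spaces,
    then replace each token that has a dictionary entry."""
    tokens = strip_accents(value).lower().split()
    return " ".join(dictionary.get(token, token) for token in tokens)
```

After this filter, "Pepe" and "José" both become `jose`, so an exact or text comparison treats them as the same name.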


    • Similarity computation algorithm: Record Linkage. Parameters:

      • Sliding window size: the number of records each record is compared to.

      • Sorting columns: the columns used to order the records.

      • Similarity acceptance threshold

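The three parameters above can be sketched as a sliding-window comparison over sorted records: sort by the chosen columns, compare each record with the next `window` records, and keep the pairs whose score exceeds the threshold. This is a sketch under assumptions, not the tool's implementation; the function names and the simple averaged text score are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(r1, r2, columns):
    """Unweighted average of per-column text similarity (illustrative only)."""
    scores = [SequenceMatcher(None, r1[c], r2[c]).ratio() for c in columns]
    return sum(scores) / len(scores)


def sliding_window_linkage(records, sort_columns, window, threshold, compare_columns):
    """Return (index_a, index_b, score) for pairs scoring above the threshold."""
    # Order record indices by the sorting columns.
    ordered = sorted(
        range(len(records)),
        key=lambda i: tuple(records[i][c] for c in sort_columns),
    )
    pairs = []
    for pos, i in enumerate(ordered):
        # Compare each record only with the next `window` records in sorted order.
        for j in ordered[pos + 1 : pos + 1 + window]:
            score = text_similarity(records[i], records[j], compare_columns)
            if score > threshold:
                pairs.append((min(i, j), max(i, j), score))
    return pairs


people = [
    {"name": "Pepe", "surname1": "Garcia"},
    {"name": "Anna", "surname1": "Puig"},
    {"name": "Pepe", "surname1": "Garcia"},
    {"name": "Marc", "surname1": "Soler"},
]
found = sliding_window_linkage(
    people, sort_columns=["surname1"], window=2,
    threshold=0.5, compare_columns=["name", "surname1"],
)
```

The window keeps the number of comparisons roughly proportional to the number of records times the window size, at the cost of missing duplicates that sort far apart.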


    • Order by Surname 1

    • Sliding window = 2



    • Similarities detected within the window (window = 2), each with its similarity degree


    • List of detected similarities



    • List of detected similarities with a similarity percentage above the 50% threshold


    • Validation of results (including only those above the threshold)

    • Visualize by similarity/by group

    • Bulk validation

    • Share validation between several supervisors

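Viewing results "by group" implies collecting pairwise similarities into groups of records that refer to the same entity. One common way to do this, assumed here for illustration and not confirmed by the slides, is to treat each accepted pair as an edge and take connected components with a union-find structure.

```python
def group_similarities(pairs):
    """Group record indices connected by accepted similarity pairs."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

For example, the pairs (0, 2), (2, 5) and (1, 3) yield the groups [0, 2, 5] and [1, 3], which a supervisor could then validate one group at a time.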


    • Select output format


  • Fusion: Configurations

    • Input data file format: CSV

    • Select the relevant columns used to link records

    • Map related columns across the different data sources (only when merging)

    • Assign a type to each column so that the most suitable automatic filters can be applied

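Relating columns from different data sources amounts to renaming each source's columns to a shared schema before records are compared. A minimal sketch follows; the column names, the `_source` marker and the helper are hypothetical.

```python
# Hypothetical relation between the columns of source B and the common schema.
COLUMN_MAP_B = {"client": "name", "town": "city"}


def to_common_schema(record, column_map, source):
    """Rename mapped columns and tag the record with its source of origin."""
    unified = {column_map.get(col, col): value for col, value in record.items()}
    unified["_source"] = source
    return unified


row_b = to_common_schema({"client": "Bar Sol", "town": "BCN"}, COLUMN_MAP_B, "B")
```

Keeping the source tag on each record makes it possible to report later which databases a detected similarity spans.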


    • Comparison type: exact value, text estimation, numerical estimation

    • Weight (as a percentage) of each column in the similarity computation

    • Example weighting: 100% = 80% + 20%


    • Specific percentage for records with null-valued columns

    • Use filters to standardize values

    • Automatic and type-specific filters are available for values such as names, dates and addresses

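The slides do not say how the percentage for null-valued columns is applied. One possible reading, shown here purely as an assumption, is that a column contributes a fixed configured score whenever either record has no value for it, instead of being compared. The 80% / 20% weights reuse the example above; the column names and `null_score` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def is_null(value):
    """Treat None and blank strings as null."""
    return value is None or value.strip() == ""


def column_score(a, b, null_score=0.5):
    """Use the configured score when a value is missing (assumed behaviour)."""
    if is_null(a) or is_null(b):
        return null_score
    return SequenceMatcher(None, a, b).ratio()


def similarity(r1, r2, weights, null_score=0.5):
    """Weighted sum of column scores, with the null rule applied per column."""
    return sum(
        weight * column_score(r1.get(col), r2.get(col), null_score)
        for col, weight in weights.items()
    )


WEIGHTS = {"name": 0.80, "city": 0.20}
```

With a null score of 50%, two records with the same name and one missing city score about 0.9 rather than being penalized as if the cities differed.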


    • Edit filters (create new filters, delete or update existing ones)

    • Use of dictionaries: name-converter dictionary (e.g. BCN -> BARCELONA)


    • Similarity computation algorithm: Record Linkage. Parameters:

      • Sliding window size: the number of records each record is compared to.

      • Sorting columns: the columns used to order the records.

      • Similarity acceptance threshold



    • Order by City

    • Sliding window = 2



    • Similarities detected within the window (window = 2), each with its similarity degree


    • List of detected similarities



    • List of detected similarities with a similarity percentage above the 50% threshold


    • Validation of results (including only those above the threshold)

    • Visualize by similarity/by group

    • Bulk validation

    • Share validation between several supervisors



    • Output format

      • Select values for every similarity

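In the tool, the values kept for each similarity appear to be selected by the user. As a rough illustration of what such a selection produces, the sketch below merges one group of similar records by an automatic rule instead: for each column, keep the first non-empty value, preferring sources in a given order. The rule, the `_source` field and the function name are all assumptions.

```python
def merge_group(records, source_priority):
    """Merge similar records into one, taking for each column the first
    non-empty value from the highest-priority source."""
    rank = {source: i for i, source in enumerate(source_priority)}
    ordered = sorted(records, key=lambda r: rank.get(r["_source"], len(rank)))
    merged = {}
    for record in ordered:
        for col, value in record.items():
            if col == "_source":
                continue
            if value not in (None, "") and col not in merged:
                merged[col] = value
    return merged


group = [
    {"_source": "B", "name": "Bar Sol", "city": "BARCELONA"},
    {"_source": "A", "name": "Bar El Sol", "city": ""},
]
fused = merge_group(group, source_priority=["A", "B"])
```

Here the name comes from source A, while the city falls back to source B because A has no value for it.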


    Successful stories: Health Service

    Who? Health Service

    Objective: Detect repeated health ID cards

    Solution: Detect repeated records in the database and delete them (deduplication with DAURUM)

    Result: Health ID card database cleaned of repetitions


    Successful stories: Beer manufacturer

    Who? Beer manufacturer

    Objective: Detect dealers that deliver to centers not previously assigned to them

    Solution:

    • Identify duplicates in each dealer's delivery database and delete them (deduplication with DAURUM)

    • Detect deliveries to centers shared between different dealers (fusion with DAURUM)

    Result: Master database clean of repetitions, and detection of dealers with wrong deliveries


    Architecture

    • Struts 2: Model-View-Controller web framework

    • Hibernate: object-relational mapping for database access


    Demo



    SPARSITY-TECHNOLOGIES

    Jordi Girona, 1-3, Edifici K2M 08034 Barcelona

    [email protected]

    http://www.sparsity-technologies.com

    DAMA-UPC. Data Management (UPC), Departament d'Arquitectura de Computadors, Edifici C6-S103, Campus Nord, Jordi Girona, 1-3, 08034 Barcelona. www.dama.upc.edu

