Business intelligence fundamentals data cleansing
1 / 17

Business Intelligence Fundamentals: Data Cleansing - PowerPoint PPT Presentation

  • Uploaded on

Business Intelligence Fundamentals: Data Cleansing. Ola Ekdahl IT Mentors. Introduction to data cleansing SSIS tasks Data Profiling Fuzzy Lookup Fuzzy Grouping. Agenda. Introduction to Data Cleansing.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Business Intelligence Fundamentals: Data Cleansing' - tokala

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Business intelligence fundamentals data cleansing

Business Intelligence Fundamentals: Data Cleansing

Ola Ekdahl

IT Mentors

Data Cleansing

Introduction to data cleansing

SSIS tasks

  • Data Profiling

  • Fuzzy Lookup

  • Fuzzy Grouping


Introduction to data cleansing

Data Cleansing

Introduction to Data Cleansing

Data cleansing or data scrubbing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant etc. parts of the data and then replacing, modifying or deleting this dirty data.

After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by different data dictionary definitions of similar entities in different stores, may have been caused by user entry errors, or may have been corrupted in transmission or storage.

Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.

The actual process of data cleansing may involve removing typos or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

Data profiling

Data Cleansing

The Data Profiling task computes various profiles that help you become familiar with a data source and identify problems in the data that have to be fixed.

After using the task to compute data profiles and save them in a file, you can use the stand-alone Data Profile Viewer to review the profile output.

Data Profiling

Fuzzy lookup

Data Cleansing

Fuzzy Lookup enables you to match input records with clean, standardized records in a reference table.

Fuzzy Lookup returns the closest match and indicates the quality of the match.

Fuzzy Lookup

Fuzzy grouping

Data Cleansing

Fuzzy Grouping enables you to identify groups of records in a table where each record in the group potentially corresponds to the same real-world entity.

The grouping is resilient to commonly observed errors in real data, because records in each group may not be identical to each other but are very similar to each other.

Fuzzy Grouping

Data profiling profile requests

Data Cleansing

Data Profiling – Profile Requests

Data profiling profile requests1

Data Cleansing

Data Profiling – Profile Requests

Fuzzy lookup eti and q grams

Data Cleansing

  • Error-Tolerant Index

    • Fuzzy Lookup uses the Error-Tolerant Index (ETI) to find matching rows in the reference table.

    • Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs.

    • In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors.

Fuzzy Lookup – ETI and Q-Grams

Fuzzy lookup runtime

Data Cleansing

The task takes an input row and tries to find the best match or matches in the reference table as efficiently as possible.

By default, this is done by using the ETI to find candidate reference records that share tokens or q-grams in common with the input.

The best candidates are retrieved from the reference table and a more careful comparison is made between the two records.

Once there are no more candidates that could be better than any match found so far, Fuzzy Lookup stops and moves on to the next input row.

Fuzzy Lookup – Runtime

Fuzzy grouping details

Data Cleansing

Fuzzy Grouping uses Fuzzy Lookup under the covers to perform the grouping.

Fuzzy Grouping passes its tokenization string intact to Fuzzy Lookup.

At run-time, Fuzzy Grouping uses to Fuzzy Lookup to build a temporary ETI against the input data and uses it to determine which input rows are close to each other.

Depending on the number of results it gets back and the resulting similarities between records, it generates groups.

Fuzzy Grouping – Details

Fuzzy grouping details1

Data Cleansing

Fuzzy Grouping – Details

Setup considerations

Data Cleansing

Use the more lightweight DTExec.exe rather than the full SSIS Designer to execute packages in production.

Drop unused columns in your pipeline because they require memory.

For recurring Fuzzy Lookup tasks in which the reference table is considerably larger than the typical input table, you should consider pre-computing the index.

By default, Fuzzy Lookup will load the ETI and reference table into available memory before starting to process rows. If you only have a few rows to process in a particular run, you can reduce this time by setting the WarmCaches property to False.

Setup Considerations


Data Cleansing

The number of rows and columns has the greatest impact on performance. The more data you have, the more resources Fuzzy Lookup and Fuzzy Grouping require. The figures in the following sections show specific data for various scenarios.

The average number of tokens per string column on which a fuzzy match is performed also has an impact on performance. Fuzzy transforms are not meant for document retrieval. For longer fields (greater than 20 tokens), it might be more efficient to use the SQL Server full-text indexing features.


Data Cleansing

Fuzzy Lookup and Grouping

Data Profiler

Thank You

Data Cleansing

More recordings available at: