Business Intelligence Fundamentals: Data Cleansing. Ola Ekdahl IT Mentors. Introduction to data cleansing SSIS tasks Data Profiling Fuzzy Lookup Fuzzy Grouping. Agenda. Introduction to Data Cleansing.
Introduction to data cleansing
Data cleansing or data scrubbing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant etc. parts of the data and then replacing, modifying or deleting this dirty data.
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by different data dictionary definitions of similar entities in different stores, may have been caused by user entry errors, or may have been corrupted in transmission or storage.
Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.
The actual process of data cleansing may involve removing typos or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
The Data Profiling task computes various profiles that help you become familiar with a data source and identify problems in the data that have to be fixed.
After using the task to compute data profiles and save them in a file, you can use the stand-alone Data Profile Viewer to review the profile output.Data Profiling
Fuzzy Grouping enables you to identify groups of records in a table where each record in the group potentially corresponds to the same real-world entity.
The grouping is resilient to commonly observed errors in real data, because records in each group may not be identical to each other but are very similar to each other.Fuzzy Grouping
The task takes an input row and tries to find the best match or matches in the reference table as efficiently as possible.
By default, this is done by using the ETI to find candidate reference records that share tokens or q-grams in common with the input.
The best candidates are retrieved from the reference table and a more careful comparison is made between the two records.
Once there are no more candidates that could be better than any match found so far, Fuzzy Lookup stops and moves on to the next input row.Fuzzy Lookup – Runtime
Fuzzy Grouping uses Fuzzy Lookup under the covers to perform the grouping.
Fuzzy Grouping passes its tokenization string intact to Fuzzy Lookup.
At run-time, Fuzzy Grouping uses to Fuzzy Lookup to build a temporary ETI against the input data and uses it to determine which input rows are close to each other.
Depending on the number of results it gets back and the resulting similarities between records, it generates groups.Fuzzy Grouping – Details
Use the more lightweight DTExec.exe rather than the full SSIS Designer to execute packages in production.
Drop unused columns in your pipeline because they require memory.
For recurring Fuzzy Lookup tasks in which the reference table is considerably larger than the typical input table, you should consider pre-computing the index.
By default, Fuzzy Lookup will load the ETI and reference table into available memory before starting to process rows. If you only have a few rows to process in a particular run, you can reduce this time by setting the WarmCaches property to False.Setup Considerations
The number of rows and columns has the greatest impact on performance. The more data you have, the more resources Fuzzy Lookup and Fuzzy Grouping require. The figures in the following sections show specific data for various scenarios.
The average number of tokens per string column on which a fuzzy match is performed also has an impact on performance. Fuzzy transforms are not meant for document retrieval. For longer fields (greater than 20 tokens), it might be more efficient to use the SQL Server full-text indexing features.Measurements
Fuzzy Lookup and Grouping
More recordings available at: