data cleanup unlock the potential at a corporate scale
Skip this Video
Download Presentation
Data Cleanup: Unlock the potential at a corporate scale

Loading in 2 Seconds...

play fullscreen
1 / 27

Data Cleanup: Unlock the potential at a corporate scale - PowerPoint PPT Presentation

  • Uploaded on

Data Cleanup: Unlock the potential at a corporate scale. Thibault Dambrine. IT professional for 25 years Network Designer ETL Data Warehouse Analyst Interface Specialist ERP Developer Data Quality Experience:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' Data Cleanup: Unlock the potential at a corporate scale' - henrik

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
thibault dambrine
  • IT professional for 25 years
    • Network Designer
    • ETL Data Warehouse Analyst
    • Interface Specialist
    • ERP Developer
  • Data Quality Experience:
    • Tasked to work on the pre-conversion data cleanup project during the Shell JDE to SAP transition
  • Premise:

“What would a Company-wide Data Quality Initiative look like?”

  • Base:
    • Experience setting up a data cleanup team, prior to a JDE to SAP data conversion
    • Realizing the potential for increasing the data value within the corporation
defining data quality
Defining Data Quality
  • Intrinsic Data Quality:
    • Accuracy, Completeness, Uniqueness
    • Reliability, Security and Accessibility
    • Contextual Data Quality:
    • Timeliness, Relevance
    • Inter-operability, consistency of identifiers
  • Accessibility and Representational Data Quality
    • Ease of understanding
    • Consistency of identifiers
    • Consistency in structures
quantifying the cost of quality the 1 10 100 rule additional cost less competitive business


Quantifying the cost of Quality: The 1-10-100 Rule - Additional cost = Less Competitive Business


Prevention Cost


Correction Cost

Failure Cost

The 1-10-100 Quality Cost Rule

the 1 10 100 rule a data quality example
The 1-10-100 Rule A Data Quality Example:

The 1-10-100 Quality Rule Applied to mailing Data:

It costs:

  • $1.00 to verify data at data entry time
  • $10.00 to clean the data after the fact
  • $100.00 to mop up errors caused by bad data
    • Packages mailed in the wrong address
    • Lost revenue
    • Lost customers
    • Bad (sloppy) reputation
    • Additional carrying cost for bad data
data quality the up side
Data Quality: The Up Side!


  • Consistent data inquiry results build confidence in information systems
  • Tractability across Business and IT domains
  • Consistent data identifiers
    • promotes internal cross-department reporting
    • Consistency
    • Confidence in results


  • Removing redundant or near-redundant data
  • Maximizes re-use of data
  • Reduces the amount of data being processed
  • Reduces errors


Consistently good quality data is data you can count on!

a data cleanup initiative
A Data Cleanup Initiative
  • Where to start?
  • Who will enforce such quality initiatives?
  • How will the data quality be maintained on on-going basis?
dqb task 1 identify sponsor and data quality boss
DQB Task 1: Identify Sponsor and Data Quality Boss

To Identify sponsor:

  • Communicate clear understanding of the cost of bad data
  • Use the 1-10-100 rule
  • Initiative has to be backed with
    • Money
    • Authority
    • Responsibility
identify data quality boss dqb
Identify Data Quality Boss – DQB
  • Must be knowledgeable on data quality
  • Must be knowledgeable on the Business
  • Will be responsible for data quality
  • Will have authority to make changes


Responsibility without authority will not work

dqb task 2 identify data sets
DQB Task 2: Identify Data Sets
  • Identify/inventory high-level data sets e.g.
    • CMDB
    • ERP
      • Master Data e.g.
        • Customer Master
        • Item Master
      • Transaction Data e.g.
        • PO’s & Invoices
        • Inventory movements
  • Assign data sets to departments, potential lists of Data owners
  • Note: The final data owners may not be the one initially penciled in at this stage
dqb task 3 identify business side data owners
DQB Task 3: Identify Business Side Data Owners
  • Data Owners will effectively be the local, more granular, Data Quality Bosses. Again, they will need to
    • Be responsible for the data at their level
    • Have authority to request changes at their level
    • Have bottom up knowledge of the data, understand what “should be there”
  • Setup meetings with every department, in line with the Data Sets identified, with aim of coming up with a set of Data Owners
    • Have a presentation ready
    • Look for individuals who have been in the Business for a long time, who are well respected, who understand the data, the dependencies, and know who to talk to, to get answers, from the bottom up
dqb task 4 request from data owners the data quality specification or dqs
DQB Task 4: Request from Data Owners the “Data Quality Specification” or DQS
  • DQS is a document that spells out the data quality rules e.g.
    • No duplicates or near-duplicates
    • Data older than x years should be purged or archived
    • Data Dependencies such as no detail without a header or no invoice without a PO
    • No duplicates
    • Consistency e.g. data format
    • Quality audit e.g. postal code matches address
    • More…
  • Note: Some rules will apply in all DQS Documents
  • There is value in sharing, reviewing and updating the DQS over time.
  • Data quality issues are not always apparent until a first cut of data is cleaned up
dqs part of task 4 also look for
DQS (part of Task 4 ) Also look for:
  • Data Islands
    • Lack of consistent identifiers inhibit a single view of the big picture
  • Data Opportunity
    • Could correlated data sets be more useful to the Business?
  • Data Surprises
    • Misplaced Data
    • Information buried in free-form fields

DQB Task 5: Build IT Data Quality Team

  • Data Quality
    • Cannot be a “side job” or a part-time task
    • Must be staffed with individuals who understand data. Best candidates
      • Proficient in SQL, data extract techniques
      • Understand ETL tools and techniques
      • Are detailed-oriented
      • Experience: Data Warehouse staff is good fishing grounds for such individuals
mid presentation recap all the ingredients are now in place the real work can start
Mid-Presentation Recap: All the Ingredients are now in placeThe real work can start!
  • Name Data Quality Sponsor & Data Quality Boss (DQB)
  • Identify Data Sets
  • Data Quality Owners
  • Data Quality Specifications (the DQ Roadmaps)
  • IT Data Cleansing Team
introducing the data quality cycle
Introducing: the Data Quality Cycle
  • We now have
    • a sponsor
    • Identified data sets and data owners
      • They have produced Data Quality Specifications
    • An IT Team ready to work on the first Data Quality measurements, based on the DQS
  • Next step: Initiate the cleanup
    • Not a single iteration but one that will be repeated in a cyclic fashion
data quality cycle corporate version
Data Quality Cycle - Corporate Version

Analyze Data

Improve Data

Continuous Improvement

Monitor Progress

Formalize Schedule

Make Progress VISIBLE

step 1 identify bad data bad data definition does not adhere to dqs
Step 1: Identify Bad DataBad Data Definition: Does not adhere to DQS
  • Coordinate meetings to translate DQS documents into a suite of repeatable data cleansing procedures
  • Very important that these procedures should be repeatable, schedulable on regular basis
  • Initial Focus: Identify Bad Data
    • Bad data(does not adhere to DQS),
    • Inconsistent data
    • Old Data
    • Note: DQS will spell out rules for “old” and “inconsistent”
  • Ensure results are reported in a format readable by Management at executive level. This initiative has to be VISIBLE
step 2 data cleansing
Step 2: Data Cleansing
  • Data Cleansing can be done in two ways:
    • Automated, IT based cleanup
    • Business-based, manual cleanup
  • Once the bad data is identified, determine who must do what
    • Business-based,
      • manual cleanup appropriate for more subtle tasks, e.g. to determine which of two duplicates identified should be kept. These tasks may require additional research, phone calls etc.
    • Automated,
      • IT based cleanup good for simpler tasks e.g. making telephone number formats consistent
      • Can be also sub-contracted to specialized data quality companies
it based data cleansing and outsource considerations
IT-based Data Cleansing and Outsource Considerations
  • Data cleansing may take valuable time from the Business, which is not available – Data Cleanup effort may suffer as a result
  • Not all data cleansing is a simple SQL
  • Not all data is most confidential

When considering data cleansing tasks, look at all possible options

  • Outsourcing some data cleansing tasks may be more economical than doing it all in-house
step 3 measure progress
Step 3: Measure Progress
  • All programs, procedures written with the aim of identifying data quality issues should
    • Be stored, like any other programming assets
    • Be repeatable and be schedulable
    • Provide aggregate measures to describe the data cleanup status e.g.
      • X duplicates
      • Y old records
      • Z invoices without PO
    • Progress
      • Has to be measured in a published dashboard
      • Has to be visible by the entire organization to provide a sense of value
step 4 data hygiene schedule the cleanup review tasks ensure results are visible
Step 4: Data Hygiene: * Schedule the Cleanup/Review Tasks * Ensure results are visible
  • Bad data is created EVERY DAY
  • Data quality is an on-going effort
  • Establish, publish a schedule, part of the dashboard
  • Ensure there is visibility and accountability to ensure the levels of bad data
    • are going down with time
    • Or are kept at a minimal level
step 5 sharpen the saw
Step 5: Sharpen the Saw

Once the data cleanup cycle is established

  • Review Results
  • Review DQS documents periodically (setup schedule)
  • Get Business input
    • Improve process
    • Give input on improvements to be made
  • Ask the Business to come up with performance improvement measures born from the Data Quality initiative

Two sets of Five activities best define the Data Quality

The Foundation Setup

  • Identify Data Quality Sponsor & Data Quality Boss (DQB)
  • Identify Data Sets
  • Identify Business Side Data Owners
  • Define Data Quality Specifications (DQS) - the DQ Roadmaps
  • Appoint IT Data Cleansing Team

The Data Quality Cycle – Ongoing

  • Identify Bad Data
  • Initiate Data Cleansing
  • Measure Progress
  • Initiate Data Hygiene, Data Cleanup Cycle
    • Schedule Cleanup/Reviews
    • Ensure progress visible
  • Sharpen the Saw
  • How to improve Data Quality
  • Predefined data quality rule definitions

  • Creating Effective Business Rules: Interview with Graham Witt
  • Gartner Magic Quadrant on Data Quality Tools – “Demand for data quality tools remains strong”