Using Metadata to Drive Data Quality: Hunting the Data Dust Bunnies

John Murphy
Apex Solutions, Inc.
NoCOUG, 11-13-2003

Presentation Outline
  • The Cost - It’s always funny when it’s someone else…
  • Quality - Quality Principles and The Knowledge Worker
  • Data Quality
  • Data Development
  • Metadata Management
  • Profile and Baseline Statistics
  • Vendor Tools
  • Wrap-up
  • Some light reading
1. The Cost…

The Cost…

“Quality is free. What’s expensive is finding out how to do it right the first time.” – Philip Crosby

  • A major credit service was ordered to pay $25M for releasing hundreds of one bank’s customer names to another bank because a confidentiality indicator was not set.
  • The stock of a major health care insurer dropped 40% after analysts reported the insurer could not determine which “members” were active, paying policy holders: $3.7 billion in value lost in 72 hours.
  • The sale/merger of a major cable provider was delayed 9 months while the target company determined how many households it had under contract. Three separate processes calculated three values with an 80% discrepancy.
The Cost…
  • The US Attorney General determined that 14% of all health care dollars, approximately $23 billion, were the result of fraud or inaccurate billing.
  • The DMA estimates that more than 20% of the customer information housed in its members’ databases is inaccurate or unusable.
  • A major Telco has over 350 CUSTOMER tables, with customer data repeated as many as 40 times and 22 separate systems capable of generating or modifying it.
  • A major communications company could not invoice 2.6% of its customers because the addresses provided were non-deliverable. Total cost > $85M annually.
  • A State Government annually sends out 300,000 motor vehicle registration notices with up to 20% undeliverable addresses.
  • A B2B office supply company calculated that it saves an average of 70 cents per line item through web sales, based on data entry validation at the source of order entry.
The Cost

(Chart, source: TDWI, 2002)

The Regulatory Challenges
  • No more “Corporate” data
    • New Privacy Regulations
      • Direct Marketing Access
      • Telemarketing Access
      • Opt-In / Opt-Out
    • New Customer Managed Data Regulations
      • HIPAA
    • New Security Regulations
      • Insurance Black List Validation
      • Bank Transfer Validation
    • Business Management
      • Certification of Financial Statements
      • Sarbanes-Oxley

These have teeth…

Data Quality Process
  • There is a formal process to quantitatively evaluate the quality of corporate data assets.
  • The process outlined here is based on Larry English’s Total data Quality Management (TdQM)
    • Audit the current data resources
    • Assess the Quality
    • Measure Non-quality Costs
    • Reengineer and Cleanse data assets
    • Update Data Quality Process
Data Quality Process
  • Develop a Data Quality Process to Quantitatively evaluate the Quality of Corporate Data Assets.
  • Establish the Metadata Repository
  • Implement Data Development and Standardization Process
  • Profile and Baseline your Data
  • Use the Metadata to Improve your Data Quality
  • Revise The Data Quality Process
Customer Satisfaction Profile
  • Determine who the most consistent users of specific data entities are. Select a sample set of attributes to be reviewed.
  • Publish the metadata report for the selected attributes.
  • Select representatives from the various business areas and knowledge workers to review the selected attributes and metadata.
  • Distribute the questionnaires, retrieve and score the results.
  • Report on and distribute the results.
Quality Assessment Results

(Three successive charts score metadata against an acceptability threshold: first showing problem metadata, then improving scores, then “Got it Right!”)
Data Standardization Process

(Diagram: Data Development and Approval Process. Data requirements enter as a Proposal Package and pass through Technical Review, Functional Review and Issue Resolution; issues cycle back until resolved, and approved elements feed the Integrated Data Model (Data Elements) held in the MDR. Roles involved: Data Architect, Data Administrator, Stewards.)
Data Standardization Process
  • Proposal Package – Data Model, Descriptive Information, Organization Information, Integration Information, Tool Specific Information
  • Technical Review – Model Compliance, Metadata Complete and accurate
  • Functional Review – Integration with Enterprise Model
  • Issue Resolution – Maintenance and Management
  • Total Process < 30 days
  • All based on an integrated, web-accessible application. Results are integrated into the Enterprise Metadata Repository.
Data Standardization
  • Getting to a single view of the truth
  • Getting to a corporate owned process of data and information management.

(Diagram: describing existing data assets and the addition of new business needs both feed the standardization process.)

Data Development Process
  • There is a formal process for development, certification, modification and retirement of data.
    • Data Requirements lead directly to Physical Data Structures.
    • Data Products lead directly to Information Products.
  • The Data Standards Evaluation Guide
    • Enterprise level not subject area specific
      • I can use “customer” throughout the organization
      • I can use “Quarterly Earnings” throughout the organization
    • All the data objects have a common minimal set of attributes dependent upon their type.
      • All data elements have a name, business name, data type, length, size or precision, collection of domain values etc.
    • There are clear unambiguous examples of the data use
    • The data standards are followed by all development and management teams.
    • The same data standards are used to evaluate internally derived data as well as vendor acquired data.
Data Standardization Process
  • Standardization is the basis of modeling – Why Model?
    • Find out what you are doing so you can do it better
    • Discover data
    • Identify sharing partners for processes and data
    • Build framework for database that supports business
    • Establish data stewards
    • Identify and eliminate redundant processes and data
  • Check out ICAM / IDEF…
An Example Process Model

(IDEF0 diagram, activity A0 “Conduct Procurement”, showing: Statutes, Regulations & Policies; Funding; Acquisition Guidance; Requirement Package; Solicitation Announcement; Industry Resource Data; Purchase; Communication from Contractor; Notification to Vendor; Proposed Programs & Procurement Issues; Purchase Performance Analysis; performed by the Company Support Team and Purchase Officer.)
The Data Model…
  • Data Model – A description of the organization of data in a manner that reflects the information structure of an enterprise
  • Logical Data Model – User perspective of enterprise information, independent of the target database or database management system
  • Entity – Person, place, thing or concept
  • Attribute – Detailed descriptive information associated with an entity
  • Relation – A business rule applied to one or more entities
  • Element – A named identifier for each of the entities and attributes that are to be represented in a database

Rules to Entities and Attributes
  • There is more than one state.
  • Each state may contain multiple cities.
  • Each city is always associated with a state.
  • Each city has a population.
  • Each city may maintain multiple roads.
  • Each road has a repair status.
  • Each state has a motto.
  • Each state adopts a state bird.
  • Each state bird has a color.

STATE

  STATE Code
  CITY Name
  CITY POPULATION Quantity
  CITY ROAD Name
  CITY ROAD REPAIR Status
  STATE MOTTO Text
  STATE BIRD Name
  STATE BIRD COLOR Name

Resulting Populated Tables (3NF)

CITY ROAD

  State Code | City Name   | City Road Name | City Road Repair Status
  -----------+-------------+----------------+------------------------
  VA         | Alexandria  | Route 1        | 2
  VA         | Alexandria  | Franconia      | 1
  MD         | Annapolis   | Franklin       | 1
  MD         | Baltimore   | Broadway       | 3
  AZ         | Tucson      | Houghton       | 2
  AZ         | Tucson      | Broadway      | 2
  IL         | Springfield | Main           | 3
  MA         | Springfield | Concord        | 1

STATE CITY

  State Code | City Name   | City Pop.
  -----------+-------------+----------
  VA         | Alexandria  | 200K
  MD         | Annapolis   | 50K
  MD         | Baltimore   | 1500K
  AZ         | Tucson      | 200K
  IL         | Springfield | 40K
  MA         | Springfield | 45K

STATE

  State Code | State Motto | State Bird Name
  -----------+-------------+----------------
  VA         | “ ”         | Cardinal
  MD         | “ ”         | Oriole
  AZ         | “ ”         | Cactus Wren
  IL         | “ ”         | Cardinal
  MA         | “ ”         | Chickadee

STATE BIRD

  State Bird Name | State Bird Color
  ----------------+-----------------
  Cardinal        | Red
  Oriole          | Black
  Cactus Wren     | Brown
  Chickadee       | Brown
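The 3NF result can be sanity-checked in any relational engine. A minimal sketch using Python’s built-in sqlite3 (entity and column names follow the slides; the schema details are assumptions):

```python
import sqlite3

# Build the 3NF example schema in an in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE state_bird (state_bird_name TEXT PRIMARY KEY,
                         state_bird_color TEXT);
CREATE TABLE state      (state_code TEXT PRIMARY KEY,
                         state_motto TEXT,
                         state_bird_name TEXT REFERENCES state_bird);
CREATE TABLE state_city (state_code TEXT REFERENCES state,
                         city_name TEXT,
                         city_pop  TEXT,
                         PRIMARY KEY (state_code, city_name));
""")
con.executemany("INSERT INTO state_bird VALUES (?, ?)",
                [("Cardinal", "Red"), ("Oriole", "Black"),
                 ("Cactus Wren", "Brown"), ("Chickadee", "Brown")])
con.executemany("INSERT INTO state VALUES (?, ?, ?)",
                [("VA", "", "Cardinal"), ("MD", "", "Oriole"),
                 ("AZ", "", "Cactus Wren"), ("IL", "", "Cardinal"),
                 ("MA", "", "Chickadee")])
con.executemany("INSERT INTO state_city VALUES (?, ?, ?)",
                [("VA", "Alexandria", "200K"), ("MD", "Annapolis", "50K"),
                 ("MD", "Baltimore", "1500K"), ("AZ", "Tucson", "200K"),
                 ("IL", "Springfield", "40K"), ("MA", "Springfield", "45K")])

# The relationships let a query span entities, e.g. the bird color
# associated with each Arizona city.
rows = con.execute("""
    SELECT c.city_name, b.state_bird_color
    FROM state_city c
    JOIN state s USING (state_code)
    JOIN state_bird b ON b.state_bird_name = s.state_bird_name
    WHERE c.state_code = 'AZ'
""").fetchall()
print(rows)  # Tucson pairs with the Cactus Wren's color
```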

Example Entity, Attributes, Relationships

State Model (entity diagram):

  STATE: STATE Code, STATE MOTTO Text, STATE BIRD Name (FK)
  STATE CITY: STATE Code (FK), CITY Name, CITY POPULATION Quantity
  CITY ROAD: STATE Code (FK), CITY Name (FK), CITY ROAD Name, CITY ROAD REPAIR Status
  STATE BIRD: STATE BIRD Name, STATE BIRD COLOR Name

  Relationships: STATE Contains STATE CITY; STATE CITY Maintains CITY ROAD; STATE BIRD Adopted by / Adopts STATE; STATE BIRD Becomes Road Kill on / Kills CITY ROAD.

Data Standardization
  • Data Element Standardization -The process of documenting, reviewing, and approving unique names, definitions, characteristics, and representations of data elements according to established procedures and conventions.

(Diagram: Standard Data Element Structure. An element name combines a required Prime Word (exactly 1) with optional Modifiers (0–n), a Property with optional Modifiers (0–n), and a Class Word; the Property, its Modifiers and the Class Word form the Generic Element.)

The Generic Element

The Generic Element - The part of a data element that establishes a structure and limits the allowable set of values of a data element. Generic elements classify the domains of data elements. Generic elements may have specific or general domains.

Examples – Code, Amount, Weight, Identifier

Domains – The range of values associated with an element. General Domains can be infinite ranges as with an ID number or Fixed as with a State Code.

Standardized Data Element

EXAMPLE

  Element Name:             Person Eye Color Code
  Access Name:              PR-EY-CLR-CD
  Definition Text:          The code that represents the natural pigmentation of a person’s iris
  Authority Reference Text: U.S. Code title 10, chapter 55
  Steward Name:             USD (P&R)
  Domain values:
    BK . . . Black
    BL . . . Blue
    BR . . . Brown
    GR . . . Green
    GY . . . Gray
    HZ . . . Hazel
    VI . . . Violet
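A standardized element like this can also be carried as a machine-readable metadata record and used to validate incoming values. A minimal sketch (the record layout is an assumption, not a published schema):

```python
# Hypothetical metadata record for the element shown above.
person_eye_color_code = {
    "element_name": "Person Eye Color Code",
    "access_name": "PR-EY-CLR-CD",
    "definition": "The code that represents the natural pigmentation "
                  "of a person's iris",
    "domain": {"BK": "Black", "BL": "Blue", "BR": "Brown",
               "GR": "Green", "GY": "Gray", "HZ": "Hazel", "VI": "Violet"},
}

def validate(value, element):
    """Return True when a value falls inside the element's domain."""
    return value in element["domain"]

print(validate("HZ", person_eye_color_code))  # True
print(validate("XX", person_eye_color_code))  # False
```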

Standards
  • Name Standards
    • Comply with format
    • Single concept, clear, accurate and self-explanatory
    • According to functional requirements, not physical considerations
    • Upper and lower case alphabetic characters, hyphens (-) and spaces ( )
    • No abbreviations or acronyms, conjunctions, plurals, articles, verbs, or class words used as modifiers or prime words
  • Definition Standards
    • What the data is, not HOW, WHERE or WHEN it is used, or WHO uses it
    • Add meaning to the name
    • One interpretation; no multiple-purpose phrases, unfamiliar technical jargon, abbreviations or acronyms
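The character-level rules above are mechanical enough to check automatically; the semantic rules (single concept, no acronyms) still need human review. A sketch, assuming names use only letters, hyphens and spaces:

```python
import re

# Allowed: upper/lower case letters, with single hyphens or spaces
# between words, per the name standard above.
NAME_PATTERN = re.compile(r"^[A-Za-z]+(?:[ -][A-Za-z]+)*$")

def name_format_ok(name: str) -> bool:
    """Check the purely lexical part of the name standard."""
    return bool(NAME_PATTERN.match(name))

print(name_format_ok("Person Eye Color Code"))  # True
print(name_format_ok("PR_EY_CLR_CD"))           # False (underscores)
```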
Integration of the Data Through Metadata

(Diagram: Data Integration by Subject Area; three overlapping subject areas, Subject Area 1–3, joined through shared metadata.)

Data Model Integration
  • Brings together (joins) two or more approved Data Model views
  • Adds to the scope and usability of the Corporate Data Model (EDM)
  • Continues to support the activities of the department that the individual models were intended to support
  • Enables the sharing of information between the functional areas or components which the Data Models support
Enterprise Data Model

Use of the Enterprise Data Model

(Diagram: the Enterprise Data Model supplies standard metadata and schemas to system models, and is consumed through component views and models and functional views and models. Example entities: SECURITY-LEVEL, ORGANIZATION, ORGANIZATION-SECURITY-LEVEL, PERSON-SECURITY-LEVEL.)
5. Metadata Management

Data in Context!

Mr. End User Sets Context For His Data.

Metadata

Metadata is the data about data… Huh?

Metadata is the descriptive information used to set the context and limits around a specific piece of data.

  • The metadata lets data become discrete and understandable by all communities that come in contact with a data element.
  • Metadata is the intersection of certain facts about data that lets the data become unique.
  • It makes data unique, understood and unambiguous.
  • The accumulation of metadata creates a piece of data: the more characteristics about the data you have, the more unique and discrete the data can be.
Relevant Metadata
  • Technical - Information on the physical warehouse and data.
  • Operational / Business - Rules on the data and content
  • Administrative - Security, Group identification etc.
  • The meta model is the standard content defining the attributes of any given data element in any one of these models. The content should address the needs of each community that comes in contact with the data element; the meta model components make the element unique to each community and sub-community.
Acquiring the Metadata
  • Data Modeling Tools – API and Extract to Repository
  • Reverse Engineered RDBMS – Export Extract
  • ETL Tools – Data mapping, Source to Target Mapping
  • Scheduling Tools – Refresh Rates and Schedules
  • Business Intelligence Tools – Retrieval Use
  • Current Data Dictionary
Technical Metadata
  • Physical Descriptive Qualities
    • Standardized Name
    • Mnemonic
    • Data type
    • Length
    • Precision
    • Data definition
    • Unit of Measure
    • Associated Domain Values
    • Transformation Rules
    • Derivation Rule
    • Primary and Alternate Source
    • Entity Association
    • Security And Stability Control
Administrative and Operational Metadata
  • Relates the Business perspective to the end user and Manages Content
    • Retention period
    • Update frequency
    • Primary and Optional Sources
    • Steward for Element
    • Associated Process Model
    • Modification History
    • Associated Requirement Document
    • Business relations
    • Aggregation Rules
    • Subject area oriented to ensure understanding by the end user
The Simple Metamodel

(Diagram: a simple metamodel relating Subject Area, Entity, Attribute, Relationship, Entity Alias, Attribute Alias, Attribute Default, Encoding / Lookup Tables, Source System, Individual and Individual View.)
The Common Meta Model

Based on Tannenbaum

Required Data Element Technical Metadata
  • Name
  • Mnemonic
  • Definition
  • Data value source list text
  • Decimal place count quantity
  • Authority reference text
  • Domain definition text
  • Domain value identifiers
  • Domain value definition text
  • High and low range identifiers
  • Maximum character count quantity
  • Proposed attribute functional data steward
  • Functional area identification code
  • Unit measure name
  • Data type name
  • Security classification code
  • Creation Date
Use of the Enterprise Tools

(Diagram: the Enterprise Data Repository (EDR) combines the Enterprise Data Model (EDM), holding entities, attributes and relationships (business rules), with the Enterprise Data Dictionary System (EDDS), holding prime words, data elements and metadata. Together they feed migration and new information systems: database tables, columns, rows, the database dictionary, and associations and table joins.)
6. Profile and Baseline Data

Audit Data
  • Establish Table Statistics
    • Total Size in bytes including Indexes
    • When was it last refreshed
    • Is referential Integrity applied
  • Establish Row Statistics
    • How many rows
    • How many Columns / Table
  • Establish Column Statistics
    • How Many Unique Values
    • How many Null Values
    • How Many Values outside defined domain
    • If a Key value, how many duplicates
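Each of these column statistics maps onto a simple aggregate query. A sketch using Python’s sqlite3 with an illustrative table and domain (note the NULL count deliberately uses COUNT(*) with an IS NULL filter, since COUNT over a column ignores NULLs):

```python
import sqlite3

# Illustrative table: a status column with a NULL and a bad value.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (status TEXT)")
con.executemany("INSERT INTO account VALUES (?)",
                [("A",), ("A",), ("C",), (None,), ("X",)])

# How many unique (non-NULL) values.
uniques = con.execute(
    "SELECT COUNT(DISTINCT status) FROM account").fetchone()[0]
# How many NULL values.
nulls = con.execute(
    "SELECT COUNT(*) FROM account WHERE status IS NULL").fetchone()[0]
# How many values outside the defined domain ('A' and 'C' assumed).
out_of_domain = con.execute(
    "SELECT COUNT(*) FROM account "
    "WHERE status NOT IN ('A', 'C')").fetchone()[0]

print(uniques, nulls, out_of_domain)  # 3 1 1
```

NOT IN silently skips NULL rows, which is why the NULL count is tracked as its own statistic.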
Some Simple Statistics
  • In Oracle, run ANALYZE against tables, partitions, indexes and clusters. It allows you to set the sample size as a specific percentage of the total size or as a specific number of rows. The default is a 1064-row sample.
  • Example: 5.7 million rows in the Transaction table
  • “analyze table transaction estimate statistics;”
    • Statistics are estimated using a 1064-row sample
  • “analyze table transaction estimate statistics sample 20 percent;”
    • Statistics are estimated using 1.14 million rows

Statistics are stored in several views.

Getting Statistics

Get the statistics:

SQL> select table_name, num_rows
     from user_tables
     where num_rows is not null;

TABLE_NAME     NUM_ROWS
-----------    --------
Transaction     5790230
Account         1290211
Product             308
Location           2187
Vendors            4203

Alternatively, you can use a count by table:

SQL> select count(*) from transaction;

  COUNT(*)
----------
   5790230

Getting Statistics

To determine unique counts of a column in a table:

SQL> SELECT COUNT(DISTINCT [COLUMN]) FROM [TABLE];

To determine the number of NULL values in a column (count the rows with COUNT(*), since a COUNT over the column itself ignores NULLs):

SQL> SELECT COUNT(*) FROM [TABLE]
     WHERE [COLUMN] IS NULL;

To determine if there are values outside a domain range:

SQL> SELECT COUNT(*) FROM [TABLE]
     WHERE [COLUMN] NOT IN (‘val1’,’val2’,’val3’);

Getting Usage Statistics
  • What tables are being used?
  • With audit on, audit data is loaded to DBA_AUDIT_OBJECT
    • Create a table with columns for object_name, owner and hits.
    • Insert the data from DBA_AUDIT_OBJECT into your new table.
    • Clear out the data in DBA_AUDIT_OBJECT
    • Write the following report:

col obj_name form a30
col owner form a20
col hits form 99,990

select obj_name, owner, hits from aud_summary;

OBJ_NAME        OWNER        HITS
-----------     -------      ----------
Region          Finance           1,929
Transaction     Sales        18,916,344
Account         Sales         4,918,201

Baseline Statistics
  • Based on the statistics collected:
    • Use these as a baseline and save them to the metadata repository as operational metadata
    • Compare with planned statistics generated with Knowledge Workers
    • Generate and publish reports covering the data
  • Regenerate the statistics on a fixed-period basis.
    • Compare and track over time
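Comparing a fresh profile against the saved baseline can be as simple as a relative-change check per statistic. A sketch with an assumed 10% tolerance and made-up numbers:

```python
def drift_report(baseline, current, tolerance=0.10):
    """Flag statistics that moved more than `tolerance` (relative)
    since the baseline was saved to the metadata repository."""
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name, 0)
        change = abs(now - base) / base if base else float("inf")
        if change > tolerance:
            flagged[name] = (base, now, round(change, 3))
    return flagged

baseline = {"row_count": 5_790_230, "null_status": 1_200, "dupes": 40}
current  = {"row_count": 5_810_000, "null_status": 9_500, "dupes": 41}
print(drift_report(baseline, current))
# Only null_status exceeds the tolerance: investigate the load upstream.
```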
Quality Assessment
  • Establish Baseline KPI for Data Quality
  • Perform Statistics on Sample Sets
  • Compare results

Time to develop and implement corrective measures, and push process change upstream.

Total Error Tracking over Time
  • Set Error reduction Schedule
  • Track Errors over time
  • Note when new systems or impacting processes are added
Performance Assessment

(Chart: performance statistics for the monthly DW load)

Daily Performance Statistics

(Charts: normal daily performance cycles (daily on-demand reporting, batch reporting) versus end-of-month cycles, where EOM loads impact normal daily reporting and add CPU / swap / cache / DB overhead)
Monitoring Your Data
  • Source Statistics – Data Distribution, Row Counts etc.
  • Schedule Exceptions – Load Abends, Unavailable source or target
  • System Statistics – Configuration Stats, System Performance logs
  • Change Control – Model / Process change History
  • Test Criteria And Scenarios – Scripts, data statistics, test performance
  • Meta Model – Metadata (domain values, operational metadata etc.)
  • Load Statistics – Value Distribution, row counts, load rates
  • Test / Production Data Statistics – Data Distribution, Row counts, model revisions, refresh history
  • Query Performance – End user query performance and statistics
  • End User Access – Who’s accessing, when they access, what query / service is requested, and what business area they are associated with
  • Web Logs – Monitor External user access and performance
  • End User Feedback – Comments, complaints and whines.
Monitoring Your Data

(Diagram: a typical analytical environment with monitor points at each stage)
A Reason for Metadata

(Diagram: source systems feed staging, the ODS and the data warehouse, which feed data marts, analytics, end user access and information distribution; metadata underpins every stage.)
Metadata and Monitoring
  • The Metadata provides an objective, criteria-based evaluation of the data from a quality / integrity standpoint.
  • The Metadata provides standards for data use and quality assurance at all levels from the enterprise to the individual.
  • The Metadata ensures continuity in the data independently from the applications and users accessing it. Applications come and go but data is forever…
  • Metadata forces us to understand the data that we are using prior to its use.
  • Metadata promotes corporate development and retention of data assets.
The Leaky Pipe…
  • Gets Worse Every Day
  • Must Plug the Holes NOW
  • Easy ROI Justification

(Diagram: existing processes and systems leak value through increased processing costs, inability to relate customer data, poor exception management, lost confidence in analytical systems, inability to react to time-to-market pressures, and unclear definitions of the business, all draining toward decreased profits.)

7. Vendor Tools and Metadata

Vendor Metadata
  • CASE Tools – ERWin, Designer 2000, Power Designer
    • Technical Metadata
  • RDBMS – Oracle, Informix, DB2
    • Technical Metadata
    • Operational Statistics – Row Counts, Domain Value Deviation, Utilization Rates, Security
  • ETL
    • Transformation Mappings
    • Exceptions Management
    • Recency
  • BI
    • Utilization
  • ERP
    • Source of Record
Current Metadata Management
  • Reflects Data After the fact
  • Most are only current state views, no history
  • No data development and Standardization Process
  • No standards for Definitions
8. Wrap-up

Wrap up
  • Use metadata as part of your data quality effort
    • Incomplete Metadata is a pay me now or pay me later proposition
  • Develop statistics around the data distribution, refresh strategy, access etc.
    • Know what your data looks like. Know when it changes.
  • Use your metadata to answer the who, what, when, where and why about your data.
    • Tie your Data Quality Management (DQM) to your Total Quality Management (TQM) to create a TdQM program.
  • Understand the data distribution in the production environment.
    • Understand the statistics about your data.
  • Publish statistics to common repository
    • Share your data quality standards and reports about the statistics.
Summary - Implement
  • Implement Validation Routines at data collection points.
  • Implement ETL and Data Quality Tools to automate the continuous detection, cleansing, and monitoring of key files and data flows.
  • Implement Data Quality Checks. Implement data quality checks or audits at reception points or within ETL processes. Stringent checks should be done at source systems and a data integration hub.
  • Consolidate Data Collection Points to minimize divergent data entry practices.
  • Consolidate Shared Data. Use a data warehouse or ODS to physically consolidate data used by multiple applications.
  • Minimize System Interfaces by (1) backfilling a data warehouse behind multiple independent data marts, (2) merging multiple operational systems or data warehouses, (3) consolidating multiple non-integrated legacy systems by implementing packaged enterprise application software, and/or (4) implementing a data integration hub (see next).
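A validation routine at a data collection point is typically a small set of row-level rules applied before the record is accepted. A sketch (the fields and rules are illustrative, not from the presentation):

```python
def check_row(row):
    """Return a list of rule violations for one inbound record."""
    errors = []
    if not row.get("customer_id"):
        errors.append("missing customer_id")
    if row.get("state") not in {"VA", "MD", "AZ", "IL", "MA"}:
        errors.append("state outside domain")
    if row.get("balance", 0) < 0:
        errors.append("negative balance")
    return errors

rows = [
    {"customer_id": "C1", "state": "VA", "balance": 10.0},
    {"customer_id": "",   "state": "ZZ", "balance": -5.0},
]
# Reject any row with at least one violation; keep the reasons for audit.
rejects = [(r, check_row(r)) for r in rows if check_row(r)]
print(len(rejects))  # 1 (the second row fails all three rules)
```

The same rule set can run at the source system and again inside the ETL flow or integration hub, so bad records are caught as far upstream as possible.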
Summary - Implement
  • Implement a Data Integration Hub which can minimize system interfaces and provide a single source of clean, integrated data for multiple applications. This hub uses a variety of middleware (e.g. message queues, object request brokers) and transformation processes (ETL, data quality audits) to prepare and distribute data for use by multiple applications.
  • Implement a Meta Data Repository. Create a repository for managing metadata gleaned from all enterprise systems. The repository should provide a single place for systems analysts and business users to look up definitions of data elements, reports, and business views; trace the lineage of data elements from sources to targets; identify data owners and custodians; and examine data quality reports. In addition, enterprise applications, such as a data integration hub or ETL tools, can use this metadata to determine how to clean, transform, or process data in their workflows.
Some Light Reading…
  • Metadata Solutions by Adrienne Tannenbaum
  • Improving Data Warehouse and Business Information Quality by Larry English
  • The DoD 8320 series of standards for data creation and management
  • Data Stores, Data Warehousing and the Zachman Framework by W.H. Inmon, John Zachman and John Geiger
  • Common Warehouse Metamodel (CWM) Specification
Working with complete attributes…

A vital piece of previously omitted metadata adversely impacts the outcome of the game…

John Murphy – 303-670-8401
jmurphy@apexsolutions.com

Suzanne Riddell – 303-216-9491
suzanne@apexsolutions.com

Touch Points Impact

(Diagram: the same data lives in multiple locations (operational systems, repositories, the data warehouse, data marts), with multiple touch points adding, updating and retrieving it.)
Quality Assessment Content
  • Project Information
    • Identifier, Name, Manager, Start Date, End Date
  • Project Metrics
    • Reused Data Object Count
    • New Data Object Count
    • Objects Modified
Metadata Strategy
  • Build Data Quality Process
    • Establish Data Quality Steering Committee
    • Establish Data Stewards
    • Establish Metadata Management Process
    • Establish Data Development and Certification Process
  • Audit existing metadata resources
    • Data models
    • ETL Applications
    • RDBMS Schemas
    • Collect and Certify existing metadata
  • Develop Meta Model
    • Determine key metadata sources and alternate sources
  • Develop Metadata Repository and Access Strategy
    • Implement Meta Model
    • Populate with available as-is metadata
  • Define Gaps in the Metadata
Using Metadata For Quality
  • Develop The Data Quality Process
  • Implement Data Development and Standardization Process
  • Establish the Metadata Repository
  • Profile and Baseline your Data
  • Use the Metadata to Improve your Data Quality
  • Revise The Data Quality Process
Statistical Analysis
  • Determine your Sample Size
    • Size needs to be statistically significant
    • If in doubt use a true random 1%
    • Repeat complete process several times to gain confidence and repeatability
    • Example:
      • N=((Confidence Level x Est. Standard Deviation) / Bound)^2
      • N=((2.575x.330)/.11)^2
      • N=60 Rows
    • Use as large a meaningful sample set as possible.
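The worked example above can be checked directly (2.575 is the z-value for roughly 99% confidence):

```python
import math

def sample_size(z, std_dev, bound):
    """N = ((z * sigma) / bound)^2, rounded up to whole rows."""
    return math.ceil(((z * std_dev) / bound) ** 2)

print(sample_size(2.575, 0.330, 0.11))  # 60 rows, as on the slide
```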
What Causes Data Warehouses to Fail
  • Failing to understand the purpose of data warehousing
  • Failing to understand who are the real “customers” of the data warehouse
  • Assuming the source data is “OK” because the operational systems seem to work just fine
  • Not developing an enterprise-focused information architecture, even if only developing a departmental data mart
  • Focusing on performance over information quality in data warehousing
  • Not solving the information quality problems at the source
  • Inappropriate “Ownership” of data correction/cleanup processes
  • Not developing effective audit and control processes for the data Extract, Correct, Transform and Load (ECTL) processes
  • Misuse of information quality software in the data warehousing processes
  • Failing to exploit this opportunity to “correct” some of the wrongs created by the previous 40 years of bad habits
Metadata Tool Vendors
  • Data Advantage – www.dataadvantage.com
  • CA Platinum - www.ca.com
  • Arkidata – www.arkidata.com
  • Sagent- www.sagent.com
  • Dataflux – www.dataflux.com
  • DataMentors – www.datamentors.com
  • Vality – www.vality.com
  • Evoke – www.evokesoft.com
Quality – What it is and is not
  • Data and Information Quality is the ability to consistently meet the customers’ expectations and to adapt to those expectations as they change.
  • Quality is a process, not an end point.
  • Quality is understanding the impact of change and the ability to proactively adapt.
  • Quality is building adaptable / survivable processes. The less I have to change while keeping my Knowledge Workers’ satisfaction high, the more successful I’ll be.
  • Data and Information Quality is not data cleansing or transformations. By then it’s too late.
  • Quality impacts the costs associated with Scrap and Rework – Just Like Manufacturing!
The Quality Leaders
  • The Quality Leaders
    • W. Edwards Deming – 14 Points of Quality, moving from do it fast to do it right
    • Philip Crosby – 14-Step Quality Program: determine what is to be delivered, then the timeline
    • Malcolm Baldrige – determination of excellence, commitment to change
    • Masaaki Imai – Kaizen, continuous process improvement
  • Quality Frameworks
    • Six Sigma – A statistically repeatable approach
    • Lean Thinking – Simplify to eliminate waste
    • ISO 9000 – Quality measurement process
Quality Tools
  • Six Sigma – A statistically repeatable approach
    • Define - Once a project has been selected by management, the team identifies the problem, defines the requirements and sets an improvement goal.
    • Measure - Used to validate the problem, refine the goal, then establish a baseline to track results.
    • Analyze – Identify the potential root causes and validate a hypothesis for corrective action.
    • Improve – Develop solutions to root causes, test the solutions and measure the impact of the corrective action.
    • Control - Establish standard methods and correct problems as needed. In other words, the corrective action should become the new requirement but additional problems may occur that will have to be adjusted for.
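The Measure step above hinges on establishing a numeric baseline. A common Six Sigma baseline metric is defects per million opportunities (DPMO); the sketch below computes it for a hypothetical data-quality audit (record counts and defect figures are illustrative).

```python
# Sketch of the Measure step: establish a baseline defect rate as
# defects per million opportunities (DPMO). Figures are illustrative.

def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities."""
    return defects / (units * opportunities_per_unit) * 1_000_000

# e.g. 45 bad field values found in 1,000 records, 30 checked fields each
baseline = dpmo(defects=45, units=1_000, opportunities_per_unit=30)
print(baseline)  # 1500.0 DPMO
```

Re-running the same calculation after the Improve step shows whether the corrective action actually moved the baseline.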
quality principles the knowledge worker
Quality Principles – The Knowledge Worker
  • IT has a reason to exist: the Knowledge Worker
  • At Toyota the Knowledge Worker is the “Honored Guest”
  • It’s all for the knowledge worker. How well do you know them?
    • Who are your knowledge workers?
    • What data do they need?
    • When do they use your data?
    • Where do they access it from?
    • Why do they need it to do their job?
  • Do your KWs feel like Honored Guests or cows in the pasture?
  • Building a Profile of the Knowledge Workers
    • Classes of Knowledge Workers
      • Farmers, Explorers, Inventors
    • Determine the distribution of the Knowledge Workers
    • Determine their use profile
user groups by data retrieval needs
User Groups By Data Retrieval Needs
  • Grazers (80%) – Push Reporting
  • Explorers (15%) – Push with Drill Down
  • Inventors (5%) – Any, All and then Some
quality shared it and users
Quality Shared – IT and Users
  • Shared Ownership of the data
    • What data do I have?
    • How do I care for it?
    • What do I want to do with it?
    • Where do I / my process add value?
  • Start with a target
    • Build the car while you're driving
    • Everyone owns the process
    • Everyone participates
    • Breakdown the barriers
the barriers to quality
The Barriers to Quality
  • Knowledge Workers Gripes about IT
    • IT can’t figure out how to get my data on time – I’ll do it in Access
    • IT has multiple calculations for the same values – I’ll correct them by hand
    • It takes IT forever to build that table for me. I’ll do it in Excel
  • IT Gripes about Knowledge Workers
    • KW won’t make the time to give us an answer
    • What KWs said last month isn't the same as this month
    • They are unrealistic in their expectations
    • We can’t decide that, it’s their decision
    • I don’t think they can understand a data model
quality tools
Quality Tools
  • Lean Thinking – Simplify to Eliminate Waste
    • Value - Defining what the customer wants. Any characteristic of the product or service that doesn't align with the customers' perception of value is an opportunity to streamline.
    • Value Stream - The value stream is the vehicle for delivering value to the customer. It is the entire chain of processes that develop, produce and deliver the desired outcome. Lean Enterprise tries to streamline the process at every step of the way.
    • Flow - Sequencing the value stream (process flow) in such a manner as to eliminate any part of the process that doesn't add value.
    • Pull - This is the concept of producing only what is needed, when it's needed. This tries to avoid the stockpiling of products by producing or providing only what the customer wants, when they want it.
    • Perfection-The commitment to continually pursue the ideal means creating value while eliminating waste.
total quality data management
Total Quality data Management
  • TQdM as a data quality standard process from Larry English
  • Process 1- Assess the Data Definition Information Architecture Quality
    • In – Starting Point
    • Out – Technical Data Definition Quality
    • Out – Information Groups
    • Out – Information Architecture
    • Out – Customer Satisfaction
  • Process 2 - Assess the Information Quality
    • In – Technical Data Definition Quality Assessment
    • Out – Information Value and Cost Chain
    • Out – Information Quality Reports
  • Process 3 – Measure Non-quality Information Costs
    • In – Outputs from Process 2
    • Out – Information Value and Cost Analysis
total quality data management95
Total Quality data Management
  • Process 4 –Re-engineer Data and Data Clean-up
    • In – Outputs from Process 3
    • Out – Data Defect identification
    • Out – Cleansed Data to Data Warehouse and Marts
  • Process 5 – Improve Information Process Quality
    • In – Production data, Raw and Clean
    • Out – Identified opportunities for Quality Improvement
  • Process 6 – Establish Information Quality Environment
    • In – All quality issues from Process 1 to 5
    • Out – Management of Process 1 to 5

These processes collect much of the metadata already in existence.
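The assessment in Process 2 can be sketched in code. Below is a minimal, hypothetical information quality report measuring two common dimensions per field: completeness (how many rows have a value) and domain conformance (how many values fall in the valid set). The records and valid domain are invented for illustration.

```python
# Sketch of a Process-2-style information quality assessment:
# completeness and domain conformance per field.
# Records and the valid domain are hypothetical.

VALID_STATE = {"CA", "CO", "NY"}

records = [
    {"name": "Acme", "state": "CA"},
    {"name": "",     "state": "XX"},
    {"name": "Apex", "state": "CO"},
]

def quality_report(rows, field, valid=None):
    """Completeness over all rows; conformance over the filled values."""
    filled = [r[field] for r in rows if r[field]]
    report = {"completeness": len(filled) / len(rows)}
    if valid is not None:
        report["conformance"] = sum(v in valid for v in filled) / len(filled)
    return report

print(quality_report(records, "name"))                # 2 of 3 names filled
print(quality_report(records, "state", VALID_STATE))  # 3 filled, "XX" invalid
```

Run against production data, reports like this feed Process 3 (costing the non-quality) and Process 4 (targeting the clean-up).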

information quality improvement process
Information Quality Improvement Process
  • P1 – Assess Data Definition and Information Architecture
  • P2 – Assess Information Quality
  • P3 – Measure Non-Quality Information Costs
  • P4 – Re-Engineer and Cleanse Data
  • P5 – Improve Information Process Quality
  • P6 – Establish Information Quality Environment (manages P1 through P5)