Data profiling the path to more accurate data
Download
1 / 45

Data Profiling - PowerPoint PPT Presentation


  • 822 Views
  • Updated On :

Data Profiling The Path to More Accurate Data. Jack Olson CTO, Evoke Software [email protected] What is Data Profiling. What is Data Profiling. The use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Data Profiling' - Jimmy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Data profiling the path to more accurate data l.jpg

Data Profiling The Path to More Accurate Data

Jack Olson

CTO, Evoke Software

[email protected]


What is data profiling l.jpg

What is Data Profiling

What is Data Profiling

The use of analytical techniques on data for the purpose of developing a thorough knowledge of its

content, structure and quality.

A process of developing information about data

instead of information from data.


Slide3 l.jpg

What is Data Profiling

Information About Data: (Data Profiling)

30% of entries in SUPPLIER_ID are blank

the range of values in UNIT_PRICE is 5.99 to 4599.99

there are 14 ORDER_HEADER rows with no ORDER_DETAIL rows

Information FROM Data: (not Data Profiling)

Texas auto buyers buy more Cadillacs per capita than any other state

The average mortgage amount increased last year by 6%

10% of last year's customers did not buy anything this year


Slide4 l.jpg

Examples of Data Content Issues

  • understanding explicit column properties

    • datatype

    • length

    • range

    • permitted values

    • null

    • unique

    • patterns

  • empty columns

  • inconsistency in value representation

  • representation of missing values

  • change points in granularity

  • unused data values

  • over-used data values


Slide5 l.jpg

Examples of Data Structure Issues

  • understanding table structure

    • primary key

    • derived columns

    • foreign keys

      • inclusive rule

      • one-one, one-many rule

  • denormalized column sets within table

  • overloaded columns (columns having more than one data fact recorded in them)

  • duplicate data columns

  • synonym column pairs

  • homonym column pairs


Slide6 l.jpg

Examples of Data Quality Issues

  • inaccurate data values

  • inconsistent representation of same value

  • missing values

  • violation of structure rules

    • non-unique primary keys

    • primary key/foreign key pair orphans

    • synonym discrepancies

  • violation of simple data rules

    • date ordering

  • violation of business data rules


Slide7 l.jpg

Data Profiling Model

metadata

accurate &

inaccurate

accurate

metadata

data

profiling

facts about

inaccurate

data

data

accurate &

inaccurate


Slide8 l.jpg

Data Profiling Challenge

> Metadata does not exist

> Metadata is incomplete

> Metadata is inaccurate

> Metadata is difficult to gather

The quality of metadata

is generally much worse

than the quality of the data

> Data deviates from expectations

> Data has inaccuracies

> Data inconsistent with metadata

The quality of the data

is generally unknown


Slide9 l.jpg

The Need For Data Profiling

  • Ad hoc methods of gathering the information fail

    • too much effort required

    • lack of adequate tools causes failure to expose important information

    • lack of organized methodology leads to missing important information

    • not enough time to complete

  • Complete and accurate data profiles are critical to success of many projects

    • data quality assessment

    • data migration

    • data integration

    • data warehousing

    • crm


Slide10 l.jpg

Data Profile Uses

  • data quality assessments

  • justifications for data quality remedies

  • designing data movement processes

    • target system design

    • transformations needed

    • data cleansing required

    • data purging/fixing required

  • designing data migration processes

  • designing data integration processes

  • writing new programs to access data


Slide11 l.jpg

Elements of a Data Profiling Environment

Metadata

no matter how bad

Data

must include real data

A Methodology

a rigorous, repeatable process

A Repository

a place to store information as it develops

Discovery Processes

to find rules you don't know

Validation Processes

to find violations of rules you do know

data analyst

to execute the process

business analysts

to interpret the findings and make decisions



Slide13 l.jpg

Data Profiling Steps

Gather metadata

Identify and Prepare Data for Profiling

Value Analysis

Structure Analysis

Single Object Data Rule Analysis

Multiple Object Data Rule Analysis


Slide14 l.jpg

Gather Metadata

  • Copy Information From Metadata Repositories

    • enterprise repositories

    • data dictionaries

  • Extract Data from Data Definitions

    • COBOL Copybooks

    • PL/1 Data

    • RDBMS Catalogs

  • Extract Rules from Program Logic

  • Extract Rules from Screen Definitions

  • Extract Rules from User Procedures

  • Consult with Data Management Staff

    • DBA

    • Data Architect

    • Business Analysts


Slide15 l.jpg

Identify and Prepare Data

  • Need Data in Tabular Form (First Normal or better)

    • must resolve REDEFINE structures

    • must resolve OCCURS structures

    • must resolve overloaded attributes

      • bit level assignments

      • combination values

      • keywords

      • multiple tokens

  • Need access to data

    • direct

    • indirect through offload to staging database or flat files

    • honor security requirements

  • May need to sample

    • extremely large data sources need to be broken down

    • get only more recent data


Slide16 l.jpg

Value Analysis

  • Attribute Properties

    • Business meaning

    • Datatype

    • Length/precision

    • List of valid values

    • Range of valid values

    • Null/empty rule

    • Unique rule

    • Consecutive rule

    • Formatting rules

    • Encoding patterns

documented properties

discovered properties

discovery

analysis

data

accurate properties

validation

invalid values


Slide17 l.jpg

Value Analysis - Discovery

  • Discovery software can compute most properties for columns

    • some is easy - min/max lengths

    • some is complex - datatype, patterns

  • Cannot identify local conventions for NULL

  • Cannot tell you the Business Meaning

  • Computes "what is" not "what is possible" or "what is valid"


Slide18 l.jpg

Value Analysis - Analysis

  • Compare documented properties with discovered properties

  • Investigate discrepancies

    • caused by inaccurate data

    • caused by inaccurate metadata

  • Determine correct property list

  • Document Accurate Property List

collaborative process

data analyst

business analyst

subject matter expert

dba

data architect


Slide19 l.jpg

Value Analysis - Property Extensions

  • Find Other Facts About Content

    • empty columns (no values)

    • low usage columns (high percentage of blanks or zeros)

    • inconsistent representation of values

    • unused data value

    • methods used to indicate "no value"

      • null, blank, ?, don't know, did not provide

    • indications of change points

      • excessive number of blanks

      • excessive frequencies on some values


Slide20 l.jpg

Value Analysis - Visual Inspection

  • Visually Inspect data values

    • value/frequency list

    • boundary points (high/low values), (short/long values)

    • values with special characters

    • random walk through data

  • Determine if consistent with business meaning

  • Determine if quality errors exist that cannot be extracted from property list

    • names

      • DONALD DUCK, CHICAOG, BLAKC

    • Frequencies

      • 50% color BLUE


Slide21 l.jpg

Value Analysis -Documentation

  • You must document everything you find in the repository

    • documented properties

    • discovered properties

    • accurate properties

    • characteristics found within valid values

    • data quality issues

      • bad practices

      • values found in violation of properties

      • values found in error but within property definitions

  • This information is needed for project execution

  • This information will be helpful to next project that uses this data source


Slide22 l.jpg

Structure Analysis

  • Structural Properties

    • Functional dependencies

    • Primary keys

    • Denormalization

    • Derived values

    • Synonyms

    • Foreign keys

    • Duplicate Data

    • Third-normal form model

documented structure

discovered structure

discovery

analysis

data

accurate structure

validation

invalid structure


Slide23 l.jpg

Structure Analysis - Discovery

  • Discovery software is very complex

  • Discovery software can find many things for you

    • functional dependencies

    • primary keys

    • derived columns

    • denormalization sets

  • Discovery software can aid in identifying others

    • foreign keys

    • synonyms

    • duplicate data

  • Data inaccuracies at value level can reduce discovery software's effectiveness


Slide24 l.jpg

Structure Analysis - Functional Dependency Analysis

ORDER_NUMBER determines CUSTOMER-ID

ORDER_NUMBER, LINE_NUMBER determines PRICE

PART_NUMBER determines DESCRIPTION

QTY,UNIT_PRICE determines AMOUNT

CUSTOMER_CATEGORY determines DISCOUNT_AMT

  • Functional dependencies exist when the value of a column is constant for each value of one or more other columns

  • Functional dependencies identify

    • primary keys

    • denormalized sets and their sub-key

    • derived columns

    • coincidences


Slide25 l.jpg

Structure Analysis - Primary Key Analysis

  • Primary keys are functional dependencies that cover all columns in a table

  • Usually obvious

  • May have multiple candidates

    • encoded key (customer_id)

    • natural key (customer name, address)

  • Testing natural keys may uncover data quality problems

    • two employees with same social security number

    • same part under two part-numbers

  • Single column primary keys are easy to find

    • cardinality equals number of rows

  • Multiple column primary keys are hard to find without functional dependency discovery


Slide26 l.jpg

Structure Analysis - Denormalization

  • Many operational systems are denormalized

    • done for performance reasons

    • two types

      • inner denormalization

        • PART_NUMBER > DESCRIPTION, UNIT_OF_MEASURE

      • outer denormalization

        • ORDER_NO, CUST_NAME, LINE_NO > QTY, PRICE

      • outers often encountered on unloaded data

  • Almost never documented

  • Functional Dependency Discovery will identify them

    • unless data inaccuracies hide them

    • unless the sub-key is not included


Slide27 l.jpg

Structure Analysis - Derived Columns

  • Relationship exists between two or more columns

    • computation

      • PRICE = QTY * UNIT_PRICE

    • business rule

      • CUSTOMER_TYPE > DISCOUNT

  • Functional dependency discovery will find them

  • Usually not documented in metadata

    • documentation may exist in program logic

    • documentation may exist in application procedures


Slide28 l.jpg

Structure Analysis - Synonyms

  • Synonyms are column pairs that connect data in one table to data in another table

    • may be same table

      • employee-id - manager-id

    • may be more than one column to each side of pair

      • building,room - building,meeting_room

  • Discovery software can find candidates

    • cannot find exclusive relationships (split tables)

    • cannot find pairs with different data representation

  • Synonyms may reflect

    • Foreign key pairs

    • Duplicate data

    • Shared Domain sets

    • Coincidences


Slide29 l.jpg

Structure Analysis- Primary-Key/Foreign-Key Analysis

  • Foreign keys are columns that connect data in one row to the primary key of another row of the same or a different table. Match is done on column value.

  • Often not documented or enforced

  • Can identify some by names

    • employee-id, emp-id

  • Can identify some through value overlap analysis

    • a discovery technique that finds pairs of columns across tables that have a high degree of value overlap

  • Need to establish rules for each pair

    • which is primary/ which is foreign

    • inclusive rule (yes or no)

    • one-to-one or one-to-many


Slide30 l.jpg

Structure Analysis - Duplicate Data

  • Duplicate data is a synonym pair that

    • does not define a primary/foreign key pair

    • each side is dependent on a keys that are primary/foreign key pairs

      • INVENTORY.DESCRIPTION - ORDER_DETAIL.DESCRIPTION

      • INVENTORY.PARTID - ORDER-DETAIL.PART-ID

  • Generally used for performance improvements

  • May be false indicator of duplication

    • CUSTOMER.DISCOUNT - ORDER.DISCOUNT

    • CUSTOMER.CUSTOMERID - ORDER.CUSTOMER-ID


Slide31 l.jpg

Structure Analysis - Third-Normal-Form Model

  • Software can generate TNF from functional dependencies, primary key, and synonym information

  • Generation of TNF model will help validate conclusions on structural issues


Slide32 l.jpg

Single Object Data Rule Analysis

  • Single Object Data Rules

    • Physical Rules

    • Business Rules

documented rules

analysis

validation

An Object is data that documents a single instance of a business object.

examples:

order

personnel record

inventory record

data

accurate rules

validation

invalid objects


Slide33 l.jpg

Single Object Data Rule - Analysis

  • Rules involve more than 1 column

  • Discovery is not very useful

    • ordering of date columns

    • patterns of blanks, nulls

  • Validate twice

    • first to determine if rule holds to aid in establishing the rule

    • second to find data quality violations

  • Violations do not identify the specific offending value(s)

  • Violations do not necessarily mean an incorrect value

    • may be a process violation

    • may be an authorized exception


Slide34 l.jpg

Single Object Data Rule - Physical Rules

  • Rules that are inherent in the data definitions

    • DELIVERY_DATE > ORDER_DATE

    • HIRE_DATE > BIRTHDATE

    • SHIPPED_AMT >= ORDER_AMT

    • If RETURN_FLAG = NO THEN RETURN_DATE = NULL

    • IF MARITAL_STATUS = 'S' THEN SPOUSE NAME = BLANK

  • Analyst can develop them with common sense examination of business meaning of columns

  • Rules and results need to be reviewed by business analysts


Slide35 l.jpg

Single Object Data Rule - Business Rules

  • Rules that require business knowledge

    • If PAYTYPE='PARTTIME' THEN SALARY_TYPE='HOURLY'

    • If CUSTOMER_TYPE='RETAIL' THEN DISCOUNT=0

    • If PRODUCT_TYPE='ENGINE' THEN QTY < 100

  • Business analysts must develop

  • Analysis should determine hardness of the rule

    • must always be true

    • violations may exist and be correct


Slide36 l.jpg

Single Object Data Rule - Final Validation

  • All data rules should be documented in the repository

  • All results of validation tests should be documented in the repository

  • Rules should be classified

    • violation level (none, small, moderate, large)

    • importance to business if not true (no impact, moderate, large)

    • monitoring status (currently monitored, not monitored)

  • Can analyze further to find inaccurate data values within violations

    • some are obvious from inspection

    • clustering analysis may find source of errors


Slide37 l.jpg

Multiple Object Data Rule Analysis

documented rules

  • Multiple Object Data Rules

    • Same object rules

    • Different object rules

    • Computed values

analysis

validation

data

accurate rules

validation

invalid sets of objects


Slide38 l.jpg

Multiple Object Data Rule Analysis

  • Discovery is not possible

  • Validate twice

    • first to determine if rule holds to aid in establishing the rule

    • second to find data quality violations

  • Violations do not identify the specific offending value(s)

    • set within which offending values exist is much larger than in the case of single object data rules

  • Violations do not necessarily mean an incorrect value

    • may be a process violation

    • may be an authorized exception


Slide39 l.jpg

Multiple Object Data Rule Analysis - Same Object Rules

  • Rules that cross multiple occurrences of the same data object type

    • CHECKOUT_DATES CANNOT OVERLAP FOR SAME ASSET_ID

    • IF TWO PRODUCTS SHARE SAME BIN-NUMBER THEN ONE MUST BE A SUBSTITUTE PART_NO FOR THE OTHER

    • IF A PRODUCT IS PROVIDED BY MORE THAN ONE SUPPLIER-ID THEN ONE MUST BE IDENTIFIED AS THE PRIMARY SUPPLIER


Slide40 l.jpg

Multiple Object Data Rule Analysis - Different Object Rules

  • Rules that deal with interaction between different object types

    • PRODUCT.SHIPPING_MODE = ORDER.SHIPPING_MODE

    • PRODUCT.TYPE_CODE MUST BE VALID FOR SUPPLIER.VALID_TYPES

    • EMPLOYEES MUST NOT BE CONVICTED FELONS


Slide41 l.jpg

Multiple Object Data Rule Analysis - Value Tests

  • Computations that may indicate problems in the data

    • value/frequency pairs

    • comparison of order numbers/$amounts by quarter

    • average sales per quarter

  • Not true rules in that they do not clearly define boundary between accurate and inaccurate

  • Looking for values that are clearly out of bounds (reasonable tests)

  • Good for finding incomplete sets of data

  • Good for finding bad practices


Slide42 l.jpg

END OF DATA PROFILING

  • Repository contains complete metadata descriptions

    • Column Properties

    • Structure Definitions

    • Data Rules

  • Repository contains data on content

  • Repository contains identification of inaccurate data

    • column property violations

    • structure rule violations

    • data rule violations


Slide43 l.jpg

Data Profiling Value

$

$

$

$

$

$

$

________________

Complete Projects

Faster and at

Lower Cost

________________

Better Quality

Systems

Delivered Earlier

________________

Better able to

Respond to New

Business

Opportunities

35% savings in

cost and time


Slide44 l.jpg

Data Profiling

project data analysts

business analysts

data quality assurance analysts

Data Profiling should be a foundation technology

for all their activities.


Slide45 l.jpg

Bibliography

Jack Olson, Data Profiling: The Accuracy Dimension,

Morgan Kaufmann, 2002

Jack Olson, "Data Profiling: The First Step in Creating

a Data Warehouse", Enterprise Systems Journal,

Vol 14 No 5, May 1999, pp34-36

Jack Olson, "Data Profiling, The Key to Success in

Integration Projects", EAI Journal, Vol 4 No 2,

February 2002, pp 22-26

Philip Russom, "Data Profiling FAQ", Intelligent Enterprise,

August 1, 2002


ad