
Data Profiling The Path to More Accurate Data




  1. Data Profiling: The Path to More Accurate Data
Jack Olson, CTO, Evoke Software
jolson@evokesoft.com

  2. What is Data Profiling?
The use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality. A process of developing information about data instead of information from data.

  3. What is Data Profiling?
Information About Data (Data Profiling):
• 30% of entries in SUPPLIER_ID are blank
• the range of values in UNIT_PRICE is 5.99 to 4599.99
• there are 14 ORDER_HEADER rows with no ORDER_DETAIL rows
Information FROM Data (not Data Profiling):
• Texas auto buyers buy more Cadillacs per capita than buyers in any other state
• the average mortgage amount increased last year by 6%
• 10% of last year's customers did not buy anything this year
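The "information about data" examples above are simple aggregate facts computed directly from column contents. A minimal sketch of computing two such profiling facts, using made-up sample values (the column names and data are illustrative, not from a real system):

```python
# Hypothetical sample values for two columns; names mirror the slide's examples.
supplier_ids = ["S001", "", "S002", "", "S003", "S001", "", "S004", "S002", ""]
unit_prices = [5.99, 129.00, 4599.99, 24.50, 310.75]

# Profiling fact 1: percentage of blank entries in SUPPLIER_ID.
blank_pct = 100.0 * sum(1 for v in supplier_ids if v == "") / len(supplier_ids)

# Profiling fact 2: the observed range of UNIT_PRICE.
price_range = (min(unit_prices), max(unit_prices))

print(f"{blank_pct:.0f}% of entries in SUPPLIER_ID are blank")        # 40%
print(f"UNIT_PRICE ranges from {price_range[0]} to {price_range[1]}")  # 5.99 to 4599.99
```

Note these statements describe the data itself, not the business facts the data records.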

  4. Examples of Data Content Issues
• understanding explicit column properties: datatype, length, range, permitted values, null, unique, patterns
• empty columns
• inconsistency in value representation
• representation of missing values
• change points in granularity
• unused data values
• over-used data values

  5. Examples of Data Structure Issues
• understanding table structure: primary key, derived columns, foreign keys (inclusive rule; one-one, one-many rule)
• denormalized column sets within a table
• overloaded columns (columns having more than one data fact recorded in them)
• duplicate data columns
• synonym column pairs
• homonym column pairs

  6. Examples of Data Quality Issues
• inaccurate data values
• inconsistent representation of the same value
• missing values
• violation of structure rules: non-unique primary keys, primary key/foreign key pair orphans, synonym discrepancies
• violation of simple data rules (e.g. date ordering)
• violation of business data rules

  7. Data Profiling Model
[Diagram: metadata (accurate & inaccurate) and data (accurate & inaccurate) flow into data profiling, which produces accurate metadata and facts about inaccurate data]

  8. Data Profiling Challenge
• Metadata does not exist
• Metadata is incomplete
• Metadata is inaccurate
• Metadata is difficult to gather
The quality of metadata is generally much worse than the quality of the data.
• Data deviates from expectations
• Data has inaccuracies
• Data is inconsistent with metadata
The quality of the data is generally unknown.

  9. The Need For Data Profiling
• Ad hoc methods of gathering the information fail: too much effort required; lack of adequate tools causes failure to expose important information; lack of an organized methodology leads to missing important information; not enough time to complete
• Complete and accurate data profiles are critical to the success of many projects: data quality assessment, data migration, data integration, data warehousing, CRM

  10. Data Profile Uses
• data quality assessments
• justifications for data quality remedies
• designing data movement processes: target system design, transformations needed, data cleansing required, data purging/fixing required
• designing data migration processes
• designing data integration processes
• writing new programs to access data

  11. Elements of a Data Profiling Environment
• Metadata, no matter how bad
• Data, which must include real data
• A Methodology: a rigorous, repeatable process
• A Repository: a place to store information as it develops
• Discovery Processes to find rules you don't know
• Validation Processes to find violations of rules you do know
• A data analyst to execute the process
• Business analysts to interpret the findings and make decisions

  12. THE DATA PROFILING PROCESS

  13. Data Profiling Steps
• Gather Metadata
• Identify and Prepare Data for Profiling
• Value Analysis
• Structure Analysis
• Single Object Data Rule Analysis
• Multiple Object Data Rule Analysis

  14. Gather Metadata
• Copy information from metadata repositories: enterprise repositories, data dictionaries
• Extract data from data definitions: COBOL copybooks, PL/1 data declarations, RDBMS catalogs
• Extract rules from program logic
• Extract rules from screen definitions
• Extract rules from user procedures
• Consult with data management staff: DBAs, data architects, business analysts

  15. Identify and Prepare Data
• Need data in tabular form (first normal form or better): must resolve REDEFINES structures, OCCURS structures, and overloaded attributes (bit-level assignments, combination values, keywords, multiple tokens)
• Need access to data: direct, or indirect through offload to a staging database or flat files; honor security requirements
• May need to sample: extremely large data sources need to be broken down; get only more recent data
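Resolving an OCCURS structure means turning each repeating group into its own first-normal-form row. A minimal sketch, with a COBOL-style record modeled as a Python dict (the field names and the `flatten` helper are illustrative, not part of any real tool):

```python
# A record with a repeating group (OCCURS in the copybook), as a Python dict.
order_record = {
    "ORDER_NO": "A100",
    "CUST_ID": "C7",
    "LINES": [  # the repeating group: one entry per order line
        {"LINE_NO": 1, "PART": "P1", "QTY": 2},
        {"LINE_NO": 2, "PART": "P2", "QTY": 5},
    ],
}

def flatten(record):
    """Resolve the repeating group: emit one tabular row per occurrence,
    repeating the header fields on each row."""
    header = {k: v for k, v in record.items() if k != "LINES"}
    return [{**header, **line} for line in record["LINES"]]

rows = flatten(order_record)
# rows[0] -> {'ORDER_NO': 'A100', 'CUST_ID': 'C7', 'LINE_NO': 1, 'PART': 'P1', 'QTY': 2}
```

The resulting rows are in first normal form and can be loaded into a staging table for profiling.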

  16. Value Analysis
Attribute Properties:
• Business meaning
• Datatype
• Length/precision
• List of valid values
• Range of valid values
• Null/empty rule
• Unique rule
• Consecutive rule
• Formatting rules
• Encoding patterns
[Diagram: discovery on the data yields discovered properties; analysis reconciles them with documented properties into accurate properties; validation against accurate properties exposes invalid values]

  17. Value Analysis - Discovery
• Discovery software can compute most properties for columns: some are easy (min/max lengths), some are complex (datatype, patterns)
• It cannot identify local conventions for NULL
• It cannot tell you the business meaning
• It computes "what is", not "what is possible" or "what is valid"
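A minimal sketch of the easy end of property discovery: computing lengths, blank percentage, and a crude character-class pattern from a column's actual values. The sample values and the pattern convention (digits become 9, letters become A) are illustrative assumptions:

```python
import re

def discover_properties(values):
    """Compute 'what is' properties of a column from its actual values."""
    nonblank = [v for v in values if v.strip()]
    return {
        "min_len": min(len(v) for v in nonblank),
        "max_len": max(len(v) for v in nonblank),
        "blank_pct": 100.0 * (len(values) - len(nonblank)) / len(values),
        "all_numeric": all(v.isdigit() for v in nonblank),
        # Crude discovered patterns: digits -> 9, letters -> A.
        "patterns": {re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in nonblank},
    }

props = discover_properties(["TX-750", "CA-12", "", "NY-9041"])
# props["patterns"] -> {'AA-999', 'AA-99', 'AA-9999'}
```

Note the result describes only what is in the data; deciding what is valid still requires the analysis step with business input.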

  18. Value Analysis - Analysis
• Compare documented properties with discovered properties
• Investigate discrepancies: caused by inaccurate data or by inaccurate metadata
• Determine the correct property list
• Document the accurate property list
A collaborative process: data analyst, business analyst, subject matter expert, DBA, data architect

  19. Value Analysis - Property Extensions
Find other facts about content:
• empty columns (no values)
• low-usage columns (high percentage of blanks or zeros)
• inconsistent representation of values
• unused data values
• methods used to indicate "no value": null, blank, ?, "don't know", "did not provide"
• indications of change points: excessive number of blanks, excessive frequencies on some values

  20. Value Analysis - Visual Inspection
• Visually inspect data values: value/frequency list, boundary points (high/low values, short/long values), values with special characters, a random walk through the data
• Determine if the values are consistent with the business meaning
• Determine if quality errors exist that cannot be extracted from the property list: names such as DONALD DUCK, CHICAOG, BLAKC; suspicious frequencies, e.g. 50% of rows with color BLUE
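The value/frequency list used for inspection is a plain frequency count sorted by occurrence. A minimal sketch with made-up values, showing how over-used values and near-duplicate spellings stand out:

```python
from collections import Counter

# Hypothetical color values; "BLU" and "BLUE?" are misspellings to be spotted.
colors = ["BLUE"] * 5 + ["RED", "GREEN", "BLUE?", "BLU"]

freq = Counter(colors).most_common()
for value, count in freq:
    print(f"{value!r}: {count} ({100.0 * count / len(colors):.0f}%)")
# 'BLUE' shows up first with 5 occurrences (56%), flagging a suspicious frequency.
```

An analyst scanning this list would question both the 56% concentration on BLUE and the stray variants.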

  21. Value Analysis - Documentation
• You must document everything you find in the repository: documented properties, discovered properties, accurate properties, characteristics found within valid values, data quality issues, bad practices, values found in violation of properties, and values found in error but within property definitions
• This information is needed for project execution
• This information will be helpful to the next project that uses this data source

  22. Structure Analysis
Structural Properties:
• Functional dependencies
• Primary keys
• Denormalization
• Derived values
• Synonyms
• Foreign keys
• Duplicate data
• Third-normal-form model
[Diagram: discovery on the data yields discovered structure; analysis reconciles it with documented structure into accurate structure; validation against accurate structure exposes invalid structure]

  23. Structure Analysis - Discovery
• Discovery software is very complex
• Discovery software can find many things for you: functional dependencies, primary keys, derived columns, denormalization sets
• Discovery software can aid in identifying others: foreign keys, synonyms, duplicate data
• Data inaccuracies at the value level can reduce discovery software's effectiveness

  24. Structure Analysis - Functional Dependency Analysis
Examples:
• ORDER_NUMBER determines CUSTOMER_ID
• ORDER_NUMBER, LINE_NUMBER determines PRICE
• PART_NUMBER determines DESCRIPTION
• QTY, UNIT_PRICE determines AMOUNT
• CUSTOMER_CATEGORY determines DISCOUNT_AMT
• Functional dependencies exist when the value of a column is constant for each value of one or more other columns
• Functional dependencies identify: primary keys, denormalized sets and their sub-keys, derived columns, coincidences
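Testing a candidate functional dependency over actual rows is straightforward: check that each left-hand-side value maps to exactly one right-hand-side value. A minimal sketch with hypothetical rows (not the slide's real data):

```python
def functionally_determines(rows, lhs, rhs):
    """True if each value of the lhs column(s) maps to a single rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, two different rhs values
        seen[key] = row[rhs]
    return True

rows = [
    {"PART_NUMBER": "P1", "DESCRIPTION": "Bolt", "QTY": 2},
    {"PART_NUMBER": "P1", "DESCRIPTION": "Bolt", "QTY": 9},
    {"PART_NUMBER": "P2", "DESCRIPTION": "Nut",  "QTY": 2},
]
print(functionally_determines(rows, ["PART_NUMBER"], "DESCRIPTION"))  # True
print(functionally_determines(rows, ["PART_NUMBER"], "QTY"))          # False
```

Real discovery software searches the space of candidate dependencies far more cleverly, but each candidate is ultimately verified by a check like this.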

  25. Structure Analysis - Primary Key Analysis
• Primary keys are functional dependencies that cover all columns in a table
• Usually obvious
• May have multiple candidates: an encoded key (customer_id) or a natural key (customer name, address)
• Testing natural keys may uncover data quality problems: two employees with the same social security number, the same part under two part numbers
• Single-column primary keys are easy to find: cardinality equals the number of rows
• Multiple-column primary keys are hard to find without functional dependency discovery
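The cardinality test for single-column keys can be sketched directly: a column is a candidate when its distinct-value count equals the row count. The rows below are illustrative, including a duplicate SSN of the kind this test surfaces:

```python
def primary_key_candidates(rows, columns):
    """Single-column PK candidates: distinct count equals the row count."""
    return [c for c in columns
            if len({row[c] for row in rows}) == len(rows)]

rows = [
    {"EMP_ID": 1, "SSN": "111", "DEPT": "A"},
    {"EMP_ID": 2, "SSN": "222", "DEPT": "A"},
    {"EMP_ID": 3, "SSN": "222", "DEPT": "B"},  # duplicate SSN: a quality problem
]
print(primary_key_candidates(rows, ["EMP_ID", "SSN", "DEPT"]))  # ['EMP_ID']
```

SSN fails the test here, which is exactly the kind of natural-key quality problem the slide describes.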

  26. Structure Analysis - Denormalization
• Many operational systems are denormalized, usually for performance reasons
• Two types: inner denormalization (PART_NUMBER > DESCRIPTION, UNIT_OF_MEASURE) and outer denormalization (ORDER_NO, CUST_NAME, LINE_NO > QTY, PRICE), with outers often encountered in unloaded data
• Almost never documented
• Functional dependency discovery will identify them, unless data inaccuracies hide them or the sub-key is not included

  27. Structure Analysis - Derived Columns
• A relationship exists between two or more columns: a computation (PRICE = QTY * UNIT_PRICE) or a business rule (CUSTOMER_TYPE > DISCOUNT)
• Functional dependency discovery will find them
• Usually not documented in metadata; documentation may exist in program logic or application procedures

  28. Structure Analysis - Synonyms
• Synonyms are column pairs that connect data in one table to data in another table: may be the same table (employee-id - manager-id), and may involve more than one column on each side of the pair (building, room - building, meeting_room)
• Discovery software can find candidates, but cannot find exclusive relationships (split tables) or pairs with different data representations
• Synonyms may reflect: foreign key pairs, duplicate data, shared domain sets, coincidences

  29. Structure Analysis - Primary-Key/Foreign-Key Analysis
• Foreign keys are columns that connect data in one row to the primary key of another row of the same or a different table; the match is done on column value
• Often not documented or enforced
• Can identify some by names (employee-id, emp-id)
• Can identify some through value overlap analysis: a discovery technique that finds pairs of columns across tables with a high degree of value overlap
• Need to establish rules for each pair: which is primary and which is foreign, the inclusive rule (yes or no), one-to-one or one-to-many
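Value overlap analysis can be sketched as a set intersection: the fraction of distinct foreign-side values that also appear on the primary side. A high fraction suggests a key pair; anything below 1.0 marks orphans when the inclusive rule should hold. The column values below are illustrative:

```python
def value_overlap(foreign_values, primary_values):
    """Fraction of distinct foreign-side values found on the primary side."""
    f, p = set(foreign_values), set(primary_values)
    return len(f & p) / len(f)

order_cust_ids = ["C1", "C2", "C2", "C9"]   # C9 is an orphan
customer_ids = ["C1", "C2", "C3"]
overlap = value_overlap(order_cust_ids, customer_ids)
print(f"{overlap:.2f}")  # 0.67: high overlap, but with orphan values to investigate
```

Discovery software applies this test across many column pairs and reports the candidates with high overlap for the analyst to confirm.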

  30. Structure Analysis - Duplicate Data
• Duplicate data is a synonym pair that does not itself define a primary/foreign key pair, but whose sides each depend on keys that are a primary/foreign key pair
• Example: INVENTORY.DESCRIPTION - ORDER_DETAIL.DESCRIPTION, where INVENTORY.PART_ID - ORDER_DETAIL.PART_ID is the key pair
• Generally used for performance improvements
• May be a false indicator of duplication: CUSTOMER.DISCOUNT - ORDER.DISCOUNT, where CUSTOMER.CUSTOMER_ID - ORDER.CUSTOMER_ID is the key pair

  31. Structure Analysis - Third-Normal-Form Model
• Software can generate a third-normal-form (TNF) model from functional dependency, primary key, and synonym information
• Generating the TNF model will help validate conclusions on structural issues

  32. Single Object Data Rule Analysis
An object is data that documents a single instance of a business object; examples: an order, a personnel record, an inventory record.
• Single Object Data Rules: physical rules and business rules
[Diagram: documented rules are refined through analysis and validation into accurate rules; validation against accurate rules exposes invalid objects]

  33. Single Object Data Rule - Analysis
• Rules involve more than one column
• Discovery is not very useful, apart from ordering of date columns and patterns of blanks and nulls
• Validate twice: first to determine whether the rule holds, to aid in establishing the rule; second to find data quality violations
• Violations do not identify the specific offending value(s)
• Violations do not necessarily mean an incorrect value: they may be a process violation or an authorized exception

  34. Single Object Data Rule - Physical Rules
Rules that are inherent in the data definitions:
• DELIVERY_DATE > ORDER_DATE
• HIRE_DATE > BIRTHDATE
• SHIPPED_AMT >= ORDER_AMT
• IF RETURN_FLAG = NO THEN RETURN_DATE = NULL
• IF MARITAL_STATUS = 'S' THEN SPOUSE_NAME = BLANK
An analyst can develop them with a common-sense examination of the business meaning of the columns. Rules and results need to be reviewed by business analysts.
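Validating rules like these amounts to running each rule as a predicate over every row and collecting the rows that fail. A minimal sketch using the slide's first rule and hypothetical order rows:

```python
from datetime import date

# Hypothetical order rows; the second violates the date-ordering rule.
orders = [
    {"ID": 1, "ORDER_DATE": date(2002, 3, 1), "DELIVERY_DATE": date(2002, 3, 9)},
    {"ID": 2, "ORDER_DATE": date(2002, 4, 5), "DELIVERY_DATE": date(2002, 4, 1)},
]

# Each rule is a named predicate over a single row (a single object).
rules = {
    "DELIVERY_DATE > ORDER_DATE": lambda r: r["DELIVERY_DATE"] > r["ORDER_DATE"],
}

violations = [(name, r["ID"]) for name, test in rules.items()
              for r in orders if not test(r)]
print(violations)  # [('DELIVERY_DATE > ORDER_DATE', 2)]
```

As the preceding slide notes, a violation flags the row, not the specific offending value: here we cannot tell from the rule alone whether ORDER_DATE or DELIVERY_DATE is wrong.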

  35. Single Object Data Rule - Business Rules
Rules that require business knowledge:
• IF PAYTYPE = 'PARTTIME' THEN SALARY_TYPE = 'HOURLY'
• IF CUSTOMER_TYPE = 'RETAIL' THEN DISCOUNT = 0
• IF PRODUCT_TYPE = 'ENGINE' THEN QTY < 100
Business analysts must develop them. Analysis should determine the hardness of each rule: must it always be true, or may violations exist and be correct?

  36. Single Object Data Rule - Final Validation
• All data rules should be documented in the repository
• All results of validation tests should be documented in the repository
• Rules should be classified by: violation level (none, small, moderate, large), importance to the business if not true (no impact, moderate, large), and monitoring status (currently monitored, not monitored)
• Can analyze further to find inaccurate data values within violations: some are obvious from inspection; clustering analysis may find the source of errors

  37. Multiple Object Data Rule Analysis
• Multiple Object Data Rules: same-object rules, different-object rules, computed values
[Diagram: documented rules are refined through analysis and validation into accurate rules; validation against accurate rules exposes invalid sets of objects]

  38. Multiple Object Data Rule Analysis
• Discovery is not possible
• Validate twice: first to determine whether the rule holds, to aid in establishing the rule; second to find data quality violations
• Violations do not identify the specific offending value(s); the set within which offending values exist is much larger than in the case of single object data rules
• Violations do not necessarily mean an incorrect value: they may be a process violation or an authorized exception

  39. Multiple Object Data Rule Analysis - Same Object Rules
Rules that cross multiple occurrences of the same data object type:
• CHECKOUT_DATES CANNOT OVERLAP FOR THE SAME ASSET_ID
• IF TWO PRODUCTS SHARE THE SAME BIN_NUMBER THEN ONE MUST BE A SUBSTITUTE PART_NO FOR THE OTHER
• IF A PRODUCT IS PROVIDED BY MORE THAN ONE SUPPLIER_ID THEN ONE MUST BE IDENTIFIED AS THE PRIMARY SUPPLIER
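The first rule above compares pairs of rows rather than a single row, which is what makes same-object rules harder to validate. A minimal sketch of the overlap check over hypothetical checkout rows:

```python
from datetime import date
from itertools import combinations

# Hypothetical checkout records; the first two overlap for the same asset.
checkouts = [
    {"ASSET_ID": "A1", "OUT": date(2002, 1, 1), "IN": date(2002, 1, 10)},
    {"ASSET_ID": "A1", "OUT": date(2002, 1, 8), "IN": date(2002, 1, 15)},
    {"ASSET_ID": "A2", "OUT": date(2002, 1, 1), "IN": date(2002, 1, 5)},
]

def overlapping_pairs(rows):
    """Index pairs of checkouts of the same asset whose date ranges overlap."""
    return [(a, b) for a, b in combinations(range(len(rows)), 2)
            if rows[a]["ASSET_ID"] == rows[b]["ASSET_ID"]
            and rows[a]["OUT"] < rows[b]["IN"]
            and rows[b]["OUT"] < rows[a]["IN"]]

print(overlapping_pairs(checkouts))  # [(0, 1)]
```

The violation identifies a pair of rows, not a single offending value, matching the earlier caveat that the offending set is larger for multiple-object rules.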

  40. Multiple Object Data Rule Analysis - Different Object Rules
Rules that deal with interaction between different object types:
• PRODUCT.SHIPPING_MODE = ORDER.SHIPPING_MODE
• PRODUCT.TYPE_CODE MUST BE VALID FOR SUPPLIER.VALID_TYPES
• EMPLOYEES MUST NOT BE CONVICTED FELONS

  41. Multiple Object Data Rule Analysis - Value Tests
• Computations that may indicate problems in the data: value/frequency pairs, comparison of order counts and dollar amounts by quarter, average sales per quarter
• Not true rules, in that they do not clearly define the boundary between accurate and inaccurate
• Looking for values that are clearly out of bounds (reasonableness tests)
• Good for finding incomplete sets of data
• Good for finding bad practices
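A minimal sketch of such a reasonableness test: group hypothetical order amounts by quarter and flag quarters whose average is far from the overall average. The data and the 50% threshold are illustrative assumptions, not a hard rule:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (quarter, order_amount) pairs; Q3 looks incomplete.
orders = [("Q1", 100), ("Q1", 120), ("Q2", 110), ("Q2", 95), ("Q3", 4)]

by_quarter = defaultdict(list)
for quarter, amount in orders:
    by_quarter[quarter].append(amount)

averages = {q: mean(v) for q, v in by_quarter.items()}
overall = mean(a for _, a in orders)

# Flag quarters whose average deviates from the overall by more than 50%:
# a reasonableness test, not a boundary between accurate and inaccurate.
suspect = [q for q, avg in averages.items() if abs(avg - overall) > 0.5 * overall]
print(suspect)  # ['Q3']
```

A flagged quarter is a prompt to investigate (incomplete load? changed practice?), not proof of inaccurate values.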

  42. END OF DATA PROFILING
• The repository contains complete metadata descriptions: column properties, structure definitions, data rules
• The repository contains data on content
• The repository contains identification of inaccurate data: column property violations, structure rule violations, data rule violations

  43. Data Profiling Value
• Complete projects faster and at lower cost
• Deliver better quality systems earlier
• Respond better to new business opportunities
• 35% savings in cost and time

  44. Data Profiling Project
• data analysts
• business analysts
• data quality assurance analysts
Data Profiling should be a foundation technology for all their activities.

  45. Bibliography
• Jack Olson, Data Profiling: The Accuracy Dimension, Morgan Kaufmann, 2002.
• Jack Olson, "Data Profiling: The First Step in Creating a Data Warehouse", Enterprise Systems Journal, Vol. 14, No. 5, May 1999, pp. 34-36.
• Jack Olson, "Data Profiling, The Key to Success in Integration Projects", EAI Journal, Vol. 4, No. 2, February 2002, pp. 22-26.
• Philip Russom, "Data Profiling FAQ", Intelligent Enterprise, August 1, 2002.
