Data Quality
Presentation Transcript
Data Quality Class 2 David Loshin
Goals • Cost of low data quality • Mapping the information chain • Data Quality impacts • Economic measures • Impact domains • Building the Data Quality ROI Model
Goals 2 • Data Cleansing Project • Goal of the application • Components of the application
Cost of Low Data Quality • Data quality is measured using anecdotes • “Hazy” feeling of wrongness • Desire to gauge the true cost of poor data quality
5 Steps • Map the Information Chain • Categorize costs associated with low data quality • Identify and estimate actual effect • Determine cost of fixing problem • Calculate Return on Investment (ROI)
Evidence of Economic Impact • Frequent service interruptions and system failures • Drop in productivity vs. volume • High employee turnover • High new business/continued business ratio • Increased customer service requirements • Customer Attrition
The Information Chain • Data flow model • Processing stages • Communication/data transfer
The Information Chain 2 • Data Supply • Data Acquisition • Data Creation • Data Processing • Data Packaging • Decision Making • Decision Implementation • Data Delivery • Data Consumption
Information Chain 3 • Information chain = data flow graph • Processing stages are vertices in graph • Directed message-passing channels = directed edges • Examples
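As a rough illustration of the graph view described on this slide, the sketch below stores processing stages as vertices and channels as directed edges in an adjacency list. The stage names are taken from the slides, but the particular edges are assumptions made for the example, and `build_chain` and `downstream` are hypothetical helper names.

```python
# Edges are (source stage, target stage). The wiring is illustrative only;
# a real information chain would come from mapping the actual data flow.
EDGES = [
    ("Data Supply", "Data Acquisition"),
    ("Data Acquisition", "Data Processing"),
    ("Data Processing", "Data Packaging"),
    ("Data Packaging", "Data Delivery"),
    ("Data Delivery", "Data Consumption"),
]

def build_chain(edge_list):
    """Build an adjacency list: stage -> list of stages it sends data to."""
    graph = {}
    for source, target in edge_list:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])
    return graph

def downstream(graph, stage):
    """Return every stage reachable from `stage` along directed edges."""
    seen, stack = set(), [stage]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

chain = build_chain(EDGES)
affected = downstream(chain, "Data Packaging")
```

A traversal like `downstream` gives one way to see which later stages a data quality problem at a given stage could reach.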
Impacts of Low Data Quality • Hard impacts: can be estimated and/or measured • Soft impacts: hard to measure, but clearly evident
Hard Impacts • Customer attrition • Costs attributed to error detection • Costs attributed to error rework • Costs attributed to prevention of errors • Costs associated with customer service • Costs associated with fixing customer problems • Costs associated with enterprisewide data inconsistency • Costs attributable to delays in processing
Soft Impacts • Difficulty in decision making • Time delays in operation • Organizational mistrust • Lowered ability to effectively compete • Data ownership conflicts • Lowered employee satisfaction
Economic Measures • Cost Increase • Revenue Decrease • Cost Decrease • Revenue Increase • Delay • Speedup • Increase Satisfaction • Decrease Satisfaction
Impact Domains • Operational • Tactical/Strategic
Operational Impacts • Detection • Correction • Rollback • Rework • Prevention • Warranty • Reduction • Attrition • Blockading
Tactical/Strategic Impacts • Delays • Preemption • Idling • Increased Difficulty • Lost opportunities • Organizational mistrust • Alignment • Acquisition overhead • Decay • Infrastructure
Putting it Together • Map the information chain • Conduct interviews to locate data quality problems • Annotate information chain with location of data quality problems • Identify impact domains for each problem • Characterize economic impact (=cost!) • Aggregate totals
ROI Model • Create a spreadsheet with assigned costs • Add in costs of improvements • Determine best return on investment
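The spreadsheet described here can be approximated in a few lines of code. The sketch below is only an illustration: the problem names and dollar figures are invented, the `fraction_recovered` field is an added assumption, and the ROI formula used, (savings - cost) / cost, is one common definition rather than one given in the slides.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    annual_impact: float       # estimated yearly cost of the problem (invented)
    fix_cost: float            # estimated cost of the improvement (invented)
    fraction_recovered: float  # assumed share of the impact the fix removes

def roi(problem):
    """ROI as (savings - cost) / cost; an assumed, commonly used definition."""
    savings = problem.annual_impact * problem.fraction_recovered
    return (savings - problem.fix_cost) / problem.fix_cost

problems = [
    Problem("duplicate customer records", 120_000.0, 40_000.0, 0.5),
    Problem("invalid addresses", 80_000.0, 10_000.0, 0.75),
]

# Aggregate the assigned costs, then rank improvements by return.
total_impact = sum(p.annual_impact for p in problems)
ranked = sorted(problems, key=roi, reverse=True)
```

Ranking by `roi` puts the improvement with the best return first, which mirrors the last bullet of the slide.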
Data Cleansing Project • Write an application to cleanse data • Record Parsing • Metadata cleansing • Data standardization • Data correction • Data enhancement
Record Parsing • Data element types • first names • last names • honorifics • titles • street names • directions • business words • etc.
Data Domains • Data types • Subclassed data types = domains • Mappings between domains
Data Domains 2 • Data type = char(2) • 676 possible non-punctuation members • Data Domain: US State abbreviations • 62 possible members • Subclassed data domain: “New England” • {“ME”, “NH”, “VT”, “MA”, “CT”, “RI”}
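The narrowing from data type to domain to subclassed domain can be sketched with sets. In the sketch below, `US_STATE_SAMPLE` is a deliberately partial, illustrative list rather than the full domain, and the helper name `is_char2` is hypothetical.

```python
import string

# A char(2) value built only from letters: 26 * 26 combinations.
ALPHA_PAIRS = len(string.ascii_uppercase) ** 2

# Illustrative subset only; the full domain of state abbreviations is larger.
US_STATE_SAMPLE = frozenset(
    {"ME", "NH", "VT", "MA", "CT", "RI", "NY", "NJ", "CA", "TX"}
)

# Subclassed domain from the slide.
NEW_ENGLAND = frozenset({"ME", "NH", "VT", "MA", "CT", "RI"})

def is_char2(value):
    """Check the base data type: a string of exactly two characters."""
    return isinstance(value, str) and len(value) == 2

# "ZZ" satisfies the data type but belongs to neither domain.
zz_is_valid_type = is_char2("ZZ")
zz_is_state = "ZZ" in US_STATE_SAMPLE
```

Each level is a subset of the one before it, so a value can pass the type check and still fail the domain check.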
Data Domains 3 • Enumerated domains • All values are explicit • Rule-based domains • Domain definition is generative
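One way to picture the difference between the two kinds of domain: an enumerated domain is a stored set of values, while a rule-based domain is a membership test. The five-digit rule below is an assumed example, not one taken from the slides, and the function names are hypothetical.

```python
import re

# Enumerated domain: every member is listed explicitly.
NEW_ENGLAND = frozenset({"ME", "NH", "VT", "MA", "CT", "RI"})

# Rule-based domain: members are described by a rule instead of listed.
# The five-digit pattern is an illustrative assumption.
FIVE_DIGIT_CODE = re.compile(r"[0-9]{5}")

def in_enumerated(value, members=NEW_ENGLAND):
    """Membership is a lookup in the explicit set."""
    return value in members

def in_rule_based(value, rule=FIVE_DIGIT_CODE):
    """Membership is decided by testing the value against the rule."""
    return rule.fullmatch(value) is not None
```

The enumerated form suits small, stable value sets; the rule-based form suits domains too large to list.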
Record Parsing • Tokenizing data elements within an attribute • Assign meaning to tokens • Domain membership • Patterns • Context
Tokenizing • Straightforward • white-space separated • punctuation – important or not? • Result: stream of tokens
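A minimal tokenizer along these lines might look like the sketch below. The `keep_punctuation` flag is a hypothetical way to express the "important or not?" question, and the regular expression is a simplification (for instance, it would split a name such as O'Brien at the apostrophe).

```python
import re

def tokenize(text, keep_punctuation=True):
    """Split text into a list of tokens.

    Words are runs of letters, digits, or underscores. When
    keep_punctuation is True, each punctuation mark becomes its own
    token; otherwise punctuation is dropped.
    """
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text)

with_punct = tokenize("Smith, John Q.")
without_punct = tokenize("Smith, John Q.", keep_punctuation=False)
```

Keeping the comma matters for names written last-name-first, where it is the only signal of the ordering.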
Domain Membership • Can each token be assigned to a domain? • Based strictly on token value • Based on patterns • Based on context
Domain Membership 2 • Domains can be maintained in memory using hash tables • Search for domain membership is the same as hash table lookups • What if a token belongs to more than one domain?
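As a sketch of the hash-table approach, Python sets (which are hash-based) can hold each domain, so a membership test is a hash lookup. The domain contents below are invented for illustration, and `domains_of` is a hypothetical helper that returns every matching domain so that ambiguous tokens are visible.

```python
# Each domain is a hash-based set; the contents are illustrative only.
DOMAINS = {
    "first_name": {"john", "dean", "virginia"},
    "last_name": {"smith", "dean", "north"},
    "direction": {"north", "south", "east", "west"},
    "state_name": {"virginia"},
}

def domains_of(token, domains=DOMAINS):
    """Return the sorted names of all domains containing the token."""
    key = token.lower()
    return sorted(name for name, members in domains.items() if key in members)

unambiguous = domains_of("John")
ambiguous = domains_of("North")
```

A token that comes back with more than one domain is exactly the case the last bullet asks about.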
Patterns • Certain kinds of data attributes are organized around token patterns • Example: names can appear using these kinds of patterns: • (title) (first) (middle) (last) • (title) (first) (initial) (last) • (first) (middle) (last) • (last) (comma) (first) (middle) • etc.
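The name patterns listed on this slide can be tried against a token stream one at a time. The sketch below assumes the input is already tokenized, uses a tiny invented `TITLES` set, and checks the more specific (initial) pattern before the general (middle) one; the ordering and the helper names are assumptions, not part of the slides.

```python
TITLES = {"mr", "mrs", "ms", "dr"}  # illustrative sample only

# Checked in order, so the pattern with "initial" is tried first.
PATTERNS = [
    ("title", "first", "initial", "last"),
    ("title", "first", "middle", "last"),
    ("last", "comma", "first", "middle"),
    ("first", "middle", "last"),
]

def _fits(role, token):
    """Decide whether a single token can play the given role."""
    if role == "title":
        return token.lower() in TITLES
    if role == "comma":
        return token == ","
    if role == "initial":
        return len(token) == 1 and token.isalpha()
    # first, middle, last: any alphabetic token that is not a title
    return token.isalpha() and token.lower() not in TITLES

def match_name(tokens):
    """Return {role: token} for the first pattern that fits, else None."""
    for pattern in PATTERNS:
        if len(pattern) == len(tokens) and all(
            _fits(role, tok) for role, tok in zip(pattern, tokens)
        ):
            return dict(zip(pattern, tokens))
    return None

parsed = match_name(["Public", ",", "John", "Quincy"])
```

Here `parsed` assigns "Public" to the last-name role because the comma pattern is the only one of that length that fits.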
Context • What happens when a token belongs to more than one domain? • We can use context to decide which domain applies • Build weights based on frequency = training
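One simple reading of "weights based on frequency" is to count how often one domain follows another in labeled examples, then pick the candidate domain seen most often after the previous token's domain. The sketch below follows that reading; the training sequences, the "START" marker, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def train(labeled_sequences):
    """Count (previous domain, current domain) pairs in labeled examples."""
    counts = Counter()
    for sequence in labeled_sequences:
        previous = "START"
        for domain in sequence:
            counts[(previous, domain)] += 1
            previous = domain
    return counts

def resolve(counts, previous_domain, candidates):
    """Pick the candidate seen most often after previous_domain.

    Candidates are sorted first so that ties break alphabetically.
    """
    return max(sorted(candidates), key=lambda d: counts[(previous_domain, d)])

# Invented training data: domain labels for three already-parsed names.
weights = train([
    ["title", "first", "last"],
    ["first", "last"],
    ["first", "middle", "last"],
])

choice = resolve(weights, "title", {"first", "last"})
```

With these counts, a token that could be either a first or a last name is read as a first name when it directly follows a title.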
Next Week • Dimensions of Data Quality • Project specification