
Data Quality

Data Quality. David Loshin. Course Structure. Overview of Data Quality Data Ownership and Data Roles Cost Analysis of Poor Data Quality Dimensions of Data Quality Data models, Data values, Presentation Data Extraction and Transformation ETL, Data transformation. Course Structure (2).


Presentation Transcript


  1. Data Quality David Loshin

  2. Course Structure • Overview of Data Quality • Data Ownership and Data Roles • Cost Analysis of Poor Data Quality • Dimensions of Data Quality • Data models, Data values, Presentation • Data Extraction and Transformation • ETL, Data transformation

  3. Course Structure (2) • Data Quality Improvement • Metadata and Enterprise Reference Data • Domains and Mappings • Data Quality Rules • Definition of Rules • Discovery of Rules

  4. Course Structure (3) • Using Data Quality Rules • Message Transformation and Routing • Data warehouse validation • GUI Generation • Data Warehouse Population

  5. Course Structure (4) • Data Cleansing • Data Parsing • Standardization • Linkage • Duplicate Elimination • Approximate Searching • Scalability Issues

  6. Project • Build a data quality tool • rule definition • data parsing • data element standardization • record linkage • Apply the tool in characterizing real-world data (I’ll supply some, don’t worry ;-)

  7. Some Examples • Frequent Flyer Miles and Long-Distance Service • Corporate Credit Card • Direct Marketing Event • CD Club Scam

  8. What is Data? • Working definitions: • Data: arbitrary values (with their own representation) • Information: data within a context • Knowledge: Understanding of information within its context • Metadata: data about data

  9. Who Owns Data? • Important question, because the answers indicate where responsibility for data quality lies • Data quality can be difficult to effect because of complicating notions • Data Processing as an “Information Factory” • Actors in the information factory and their roles

  10. Actors and Their Roles • Supplier • Acquirer • Creator • Processor • Packager • Delivery Agent • Consumer • Middle Manager • Senior Manager • Decision-maker

  11. Ownership Responsibilities • Definition of data • Authorization and Security • User support • Data packaging and delivery • Maintenance • Data quality • Management of business rules • Management of metadata • Standards management • Supplier management

  12. Ownership Paradigms • Creator • Consumer • Compiler • Enterprise • Funder • Decoder • Packager • Reader • Subject • Purchaser • Everyone

  13. Complicating Notions • Ownership is affected by the value of data • Privacy • Turf • Fear • Bureaucracy

  14. The Data Ownership Policy • Order of enforcement • Identify stakeholders • Identify data sets • Allocation of ownership • Ownership roles and responsibilities • Dispute Resolution

  15. The Data Ownership Policy (2) • Maintain a metadata database for data ownership • Parties table • Data set table • Roles and responsibilities • Policies (i.e., dispute resolution, communication, etc.)
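
One way to picture the ownership metadata database described on this slide is a small relational schema. The sketch below is only an illustration: the slide names the tables (parties, data sets, roles and responsibilities, policies) but not their columns, so every column name, the `owners_of` helper, and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical schema: the slide lists the tables, the columns are assumed.
SCHEMA = """
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE data_set (
    data_set_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE ownership_role (
    party_id       INTEGER NOT NULL REFERENCES party(party_id),
    data_set_id    INTEGER NOT NULL REFERENCES data_set(data_set_id),
    role           TEXT NOT NULL,
    responsibility TEXT NOT NULL,
    PRIMARY KEY (party_id, data_set_id, role)
);
CREATE TABLE policy (
    policy_id   INTEGER PRIMARY KEY,
    data_set_id INTEGER REFERENCES data_set(data_set_id),
    kind        TEXT NOT NULL,   -- e.g. 'dispute resolution', 'communication'
    description TEXT NOT NULL
);
"""


def build_ownership_db():
    """Create an in-memory database holding the ownership metadata tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def owners_of(conn, data_set_name):
    """Return (party name, role) pairs registered for one data set."""
    rows = conn.execute(
        """
        SELECT p.name, r.role
        FROM ownership_role AS r
        JOIN party AS p ON p.party_id = r.party_id
        JOIN data_set AS d ON d.data_set_id = r.data_set_id
        WHERE d.name = ?
        ORDER BY p.name, r.role
        """,
        (data_set_name,),
    )
    return rows.fetchall()


# Illustrative sample data (invented for this sketch).
conn = build_ownership_db()
conn.execute("INSERT INTO party VALUES (1, 'Billing department')")
conn.execute("INSERT INTO party VALUES (2, 'Marketing department')")
conn.execute("INSERT INTO data_set VALUES (1, 'customer_master')")
conn.execute("INSERT INTO ownership_role VALUES (1, 1, 'steward', 'maintenance')")
conn.execute("INSERT INTO ownership_role VALUES (2, 1, 'consumer', 'report defects')")
print(owners_of(conn, "customer_master"))
```

Keeping roles in their own table, keyed by party and data set, lets one party hold different roles on different data sets, which is what an ownership dispute usually turns on.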

  16. Ownership Roles • CIO • CKO • Trustee • Policy Manager • Registrar • Steward • Custodian • Data Administrator • Security Administrator • Information Flow • Information Processing • Application development • Data Provider • Data Consumer

  17. The Information Factory • Information processing can be broken down into a graph • Each node in the graph is a data producer, data consumer, or both • The edges represent communication paths
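
The graph view of the information factory can be sketched in a few lines. This is a minimal illustration, assuming the communication paths are given as directed `(source, destination)` pairs; the function name `classify_nodes` and the example stage names are hypothetical.

```python
from collections import defaultdict


def classify_nodes(edges):
    """Label each node 'producer', 'consumer', or 'both'.

    edges: iterable of (source, destination) pairs, one per communication path.
    A node that only sends data is a producer, one that only receives data
    is a consumer, and one that does both is labeled 'both'.
    """
    outgoing = defaultdict(int)
    incoming = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        outgoing[src] += 1
        incoming[dst] += 1
        nodes.update((src, dst))

    roles = {}
    for node in sorted(nodes):
        sends = outgoing.get(node, 0) > 0
        receives = incoming.get(node, 0) > 0
        if sends and receives:
            roles[node] = "both"
        elif sends:
            roles[node] = "producer"
        else:
            roles[node] = "consumer"
    return roles


# Illustrative processing stages (invented for this sketch).
edges = [
    ("order_entry", "billing"),
    ("billing", "warehouse"),
    ("order_entry", "warehouse"),
]
print(classify_nodes(edges))
```

Nodes labeled "both" are natural places to measure data quality, since they can be checked on what they receive and on what they pass along.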

  18. What is Data Quality? • “Fitness for Use” • Different rules for different data sets • Includes, but is more than: • Data cleansing • Standardization • Deduplication • Merge-purge
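
"Fitness for use" with different rules for different data sets can be illustrated by scoring the same records against two rule sets. The sketch below is an assumption-laden example: the rule sets, field names, and the `conformance` function are invented here and do not come from the slides.

```python
def conformance(records, rules):
    """Return the fraction of records that satisfy every rule in `rules`."""
    if not records:
        return 1.0  # an empty data set violates nothing
    passing = sum(1 for rec in records if all(rule(rec) for rule in rules))
    return passing / len(records)


# Hypothetical rule sets: each use of the data has its own notion of "fit".
RULES_BY_USE = {
    "billing": [
        lambda r: bool(r.get("postal_code")),  # an invoice needs a postal code
        lambda r: r.get("amount", 0) >= 0,     # amounts must not be negative
    ],
    "mailing_list": [
        lambda r: "@" in r.get("email", ""),   # only the email address matters
    ],
}

# Illustrative records (invented for this sketch).
records = [
    {"email": "a@example.com", "postal_code": "", "amount": 10},
    {"email": "b@example.com", "postal_code": "10001", "amount": 5},
]

for use, rules in RULES_BY_USE.items():
    print(use, conformance(records, rules))
```

The first record is unfit for billing (no postal code) yet perfectly fit for the mailing list, so the same data scores differently depending on the use.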

  19. Lather, Rinse, Repeat • Data quality is a process: 1. Assess the current state of the quality of data 2. Determine the area that needs most improvement 3. Determine success criteria 4. Implement the improvement 5. Measure against success threshold 6. If success: goto 2
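
The process on this slide is a loop, and a rough sketch makes the control flow explicit. Everything concrete here is an assumption: quality is reduced to one integer score per area, the success criterion is a single shared threshold, and `improve` is a stand-in for whatever the actual improvement project does.

```python
def improvement_cycle(scores, target, improve, max_steps=20):
    """Run the assess / select / improve / measure loop.

    scores:  dict mapping an area name to its current quality score (0-100),
             i.e. the result of the current-state assessment.
    target:  success threshold every area should reach.
    improve: callable (area, score) -> new score; a hypothetical placeholder
             for implementing an improvement and measuring again.
    Returns the final scores and the areas that reached the target, in order.
    """
    scores = dict(scores)  # work on a copy of the assessment
    completed = []
    steps = 0
    while steps < max_steps:
        # Determine the area that needs the most improvement.
        area = min(scores, key=scores.get)
        if scores[area] >= target:
            break  # every area already meets the success criterion
        # Implement improvements and measure until the threshold is met.
        while scores[area] < target and steps < max_steps:
            scores[area] = improve(area, scores[area])
            steps += 1
        if scores[area] >= target:
            completed.append(area)  # success: go back and pick the next area
    return scores, completed


# Illustrative assessment; each improvement adds 20 points, capped at 100.
initial = {"addresses": 60, "phone_numbers": 80, "names": 95}
final, done = improvement_cycle(initial, 90, lambda area, s: min(100, s + 20))
print(final, done)
```

The `max_steps` guard is an addition of this sketch, so that an improvement that never reaches the threshold cannot loop forever.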

  20. Data Quality is Hard to Do • No one wants to admit mistakes • Denial of responsibility • Lack of understanding • “Dirty work” • Lack of recognition

  21. Steps to Data Quality • Training • Data ownership policy • Economic model of data quality • Current state assessment and requirements analysis • Project selection and implementation
