
Data Profiling and Quality Issues


Presentation Transcript


  1. Data Profiling and Quality Issues
     Alka Vaidya, NIBM

  2. Data Profiling
     • “Begin at the beginning”, regardless of the state of the information within the enterprise
     • It is a fundamental step that should begin every data-driven initiative
     • It is a proactive approach to understanding the data
     • It discovers the data present in your organization and the characteristics of that data
     • It also gives you insight into your business processes and helps you refine them over time

  3. Data Profiling Defined
     • A process in which one examines the data available in an existing database and collects statistics and information about that data. The purpose of these statistics may be to:
       • find out whether existing data can easily be used for other purposes
       • provide metrics on data quality, including whether the data conforms to company standards
       • assess the risk involved in integrating data for new applications
       • track data quality
       • assess whether metadata accurately describes the actual values in the source database
       • understand data challenges early in any data-intensive project, so that late project surprises are avoided; finding data problems late in a project can cause delays and cost overruns
       • provide an enterprise view of all data, for applications such as data warehousing

  4. Data Profiling Techniques
     • Structure Discovery
       • Does your data match the corresponding metadata?
       • Does the data adhere to proper uniqueness and null value constraints?
     • Data Discovery
       • Are the data values complete, accurate and unambiguous?
     • Relationship Discovery
       • Does the data adhere to the specified required key relationships across columns and tables?
       • Are there inferred relationships across columns and tables?

  5. Structure Discovery
     • Validation with Metadata
       • If data and metadata do not conform to each other, it may have far-reaching implications. E.g. consider a table with 10M records in which a particular field is declared char(255) but the longest value is 200 characters: the 55 unused characters per record waste approximately 550MB of disk space (assuming fixed-width storage at one byte per character)
       • Missing values in fields that should not have missing values can cause joins to fail
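
The arithmetic behind the 550MB figure, together with simple null and uniqueness checks, can be sketched as follows. This is a minimal illustration in Python, assuming fixed-width storage at one byte per character; the function names (`wasted_bytes`, `profile_structure`) and the sample rows are hypothetical and are not part of the original slides.

```python
def wasted_bytes(row_count, declared_width, longest_value, bytes_per_char=1):
    """Bytes spent on padding that no value ever uses (fixed-width storage assumed)."""
    unused_chars = max(declared_width - longest_value, 0)
    return row_count * unused_chars * bytes_per_char


def profile_structure(rows, column, declared_width):
    """Collect basic structural facts about one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "null_count": len(values) - len(present),
        "duplicate_count": len(present) - len(set(present)),
        "max_length": max((len(str(v)) for v in present), default=0),
        "exceeds_declared_width": any(len(str(v)) > declared_width for v in present),
    }


# The slide's example: 10M records, char(255), longest value 200 characters.
print(wasted_bytes(10_000_000, 255, 200))  # 550000000 bytes, i.e. about 550MB

# Hypothetical sample data, for illustration only.
sample_rows = [
    {"customer_id": "C001"},
    {"customer_id": "C002"},
    {"customer_id": "C002"},  # duplicate of the previous row's key
    {"customer_id": None},    # missing value in a field that should be filled
]
report = profile_structure(sample_rows, "customer_id", declared_width=4)
print(report)
```

A non-zero `null_count` or `duplicate_count` on a column documented as a key is the kind of mismatch between data and metadata that this slide warns about.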

  6. Pattern Matching
     • Typically, it determines whether the data values in a field are in the expected format, and it can validate that the values in a field are consistent across the data source
     • It will also tell you whether a field is all numeric and whether its values have consistent lengths
     • E.g. phone numbers
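
One common way to apply pattern matching is to reduce each value to its "shape" and count how often each shape occurs. The sketch below assumes a hypothetical phone number format of three digits, a hyphen, and seven digits; the sample values and the helper names are made up for illustration.

```python
import re
from collections import Counter


def value_pattern(value):
    """Reduce a value to its shape: every digit becomes 9, every letter becomes A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def profile_patterns(values):
    """Count how many values share each shape."""
    return Counter(value_pattern(v) for v in values)


# Hypothetical phone numbers, for illustration only.
phones = ["022-5551234", "022-5559876", "0225551234", "22-555-ABCD", "020-5550000"]

patterns = profile_patterns(phones)
for pattern, count in patterns.most_common():
    print(pattern, count)

# Assumed expected format: three digits, a hyphen, seven digits.
expected = re.compile(r"\d{3}-\d{7}")
nonconforming = [p for p in phones if not expected.fullmatch(p)]

all_numeric = all(p.isdigit() for p in phones)  # is the field purely numeric?
lengths = {len(p) for p in phones}              # more than one length means inconsistency

print(nonconforming, all_numeric, sorted(lengths))
```

The dominant shape usually reveals the intended format, and the rare shapes point at the records worth inspecting.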

  7. Basic Statistics
     • One can learn a lot by reviewing basic statistics about the data, especially numeric data
     • Reviewing statistics such as the minimum, maximum and standard deviation can give you insight into the validity of the data
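
As a rough sketch of this idea, the following Python snippet computes a few basic statistics for a numeric column using only the standard library. The `ages` sample and the plausible range of 0 to 120 are assumptions made for illustration.

```python
import statistics


def basic_stats(values):
    """Summary statistics for a numeric column (needs at least two values)."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical "age" column, for illustration only.
ages = [34, 29, 41, 38, -1, 45, 230]

stats = basic_stats(ages)
print(stats)

# Assumed plausible range for an age; values outside it are suspect.
out_of_range = [a for a in ages if not 0 <= a <= 120]
print(out_of_range)
```

Here the minimum and maximum alone are enough to show that the column contains invalid entries, which is the kind of insight the slide refers to.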

  8. Content Discovery
     • This can help in validating rules and assessing data completeness
     • Content discovery techniques include:
       • Standardization
       • Frequency counts and outlier detection
       • Business rule validation
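
The three techniques listed above can be illustrated with a short Python sketch. Everything specific here is an assumption made for the example: the standardization rule (trim, collapse whitespace, upper-case), the 15% rarity threshold, the sample data, and the business rule that an order cannot ship before it is placed.

```python
from collections import Counter
from datetime import date


def standardize(value):
    """Trim, collapse internal whitespace and upper-case a text value."""
    return " ".join(value.split()).upper()


def rare_values(counts, max_share=0.15):
    """Values whose share of all rows is at or below max_share: candidate outliers."""
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total <= max_share)


# Hypothetical "city" column, for illustration only.
cities = ["Pune", "pune ", "PUNE", " Mumbai", "Mumbai",
          "mumbai", "MUMBAI", "Pune", "Pnue", "Mumbai"]

# Standardize first, so that spelling variants are counted together.
frequencies = Counter(standardize(c) for c in cities)
print(frequencies.most_common())
print(rare_values(frequencies))  # rarely occurring values are worth a manual look

# Assumed business rule: an order cannot ship before it was placed.
orders = [
    {"order_id": 1, "order_date": date(2024, 1, 5), "ship_date": date(2024, 1, 7)},
    {"order_id": 2, "order_date": date(2024, 1, 9), "ship_date": date(2024, 1, 8)},
]
rule_violations = [o["order_id"] for o in orders if o["ship_date"] < o["order_date"]]
print(rule_violations)
```

Counting before standardizing would split "Pune" and "pune " into separate values and hide the one genuinely suspicious entry.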

  9. Relationship Discovery
     • This technique highlights potential key relationships across tables
     • It helps you understand how data sources interact with other data sources
     • It points out key violations
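
A minimal sketch of relationship discovery between two tables, assuming each table is held as a list of dictionaries. The table and column names (`customers`, `orders`, `customer_id`) and both helper functions are hypothetical.

```python
def orphan_keys(child_rows, child_column, parent_rows, parent_column):
    """Child values with no matching parent value, i.e. key violations."""
    parent_keys = {row[parent_column] for row in parent_rows}
    child_keys = {row[child_column] for row in child_rows}
    return sorted(child_keys - parent_keys)


def containment(child_rows, child_column, parent_rows, parent_column):
    """Share of distinct child values that also appear in the parent column.

    A share close to 1.0 hints at a possible (inferred) key relationship.
    """
    child_values = {row[child_column] for row in child_rows}
    parent_values = {row[parent_column] for row in parent_rows}
    if not child_values:
        return 0.0
    return len(child_values & parent_values) / len(child_values)


# Hypothetical tables, for illustration only.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
    {"order_id": 13, "customer_id": 4},  # refers to a customer that does not exist
]

violations = orphan_keys(orders, "customer_id", customers, "customer_id")
overlap = containment(orders, "customer_id", customers, "customer_id")
print(violations, overlap)
```

Running the containment check across many column pairs is one simple way to surface relationships that were never declared in the metadata.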
