
Quality Assurance & Quality Control


Presentation Transcript


  1. Quality Assurance & Quality Control Kristin Vanderbilt, Ph.D. Sevilleta LTER New Mexico, USA

  2. What is data quality? • fitness of a dataset or record for a particular use Chrisman, 1991

  3. Two Types of Errors • Commission: Incorrect or inaccurate data are entered into a dataset • Transcription Error • Malfunctioning instrumentation • Omission: Data or metadata are not recorded • Difficult or impossible to find • Inadequate documentation of data values, sampling methods, anomalies in field, human errors

  4. Loss of data quality can occur at many stages: • At the time of collection • During digitisation • During documentation • During archiving

  5. QA/QC • “mechanisms [that] are designed to prevent the introduction of errors into a data set, a process known as data contamination” Brunt 2000

  6. Scientist → Database • Quality Control (applied near data collection): datasheet design, data entry constraint checks, documentation • Quality Assurance (applied once data are in the database): graphics, statistics, documentation • The cost of error correction increases the later an error is caught

  7. QA/QC Activities • Defining and enforcing standards for formats, codes, measurement units and metadata. • Checking for unusual or unreasonable patterns in data. • Checking for comparability of values between data sets. • Documenting all QA/QC activities Brunt 2000

  8. Outline: • QC procedures • Designing data sheets • Data entry using validation rules, filters, lookup tables • QA procedures • Graphics and Statistics • Outlier detection • Archiving data

  9. The Cost of Errors • Error prevention is considered to be far superior to error detection, since detection is often costly and can never guarantee to be 100% successful (Dalcin 2004).

  10. Quality Control starts with people • Assign responsibility for the quality of data to those who create them. If this is not possible, assign responsibility as close to data creation as possible (Redman 2001) • Poor training lies at the root of many data quality problems.

  11. Flowering Plant Phenology – Data Collection Form Design • Three sites (Deep Well, Five Points, Goat Draw), each with 3 transects (1, 2, 3) • On each transect, every species will have its phenological class recorded

  12. Data Collection Form Development: What’s wrong with this data sheet? [Figure: a form with only two column headings, Plant and Life Stage, above blank lines – no fields for collectors, date, time, location, or transect, and no pre-printed species or life-stage codes.]

  13. PHENOLOGY DATA SHEET
  Collectors: _____ Date: _____ Time: _____
  Location (circle one): Deep Well / Five Points / Goat Draw Transect (circle one): 1 2 3
  Notes: _____
  Plant / Life Stage (circle one code per plant):
  ardi P/G V B FL FR M S D NP
  arpu P/G V B FL FR M S D NP
  atca P/G V B FL FR M S D NP
  bamu P/G V B FL FR M S D NP
  zigr P/G V B FL FR M S D NP
  ____ P/G V B FL FR M S D NP
  ____ P/G V B FL FR M S D NP
  Codes: P/G = perennating or germinating; V = vegetating; B = budding; FL = flowering; FR = fruiting; M = dispersing; S = senescing; D = dead; NP = not present

  14. PHENOLOGY DATA ENTRY INTERFACE
  Collectors: Mike Friggens Date: 16 May 1998 Time: 13:12
  Location: Deep Well Transect: 1
  Notes: Cloudy day, 3 gopher burrows on transect
  [Screenshot: the entry form mirrors the data sheet – one row per species (ardi, deob, arpu, asbr, …), each with the P/G V B FL FR M S D NP life-stage choices and a Y/N control.]

  15. Validation Rules • Control the values that a user can enter into a field • Examples in Access: <=100; Between #1/1/70# and Date() • Coordinate data (latitude): Degrees: >=0 and <=90; Minutes: >=0 and <60
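
The same range constraints can be sketched outside Access as small Python checks. This is an illustrative sketch only – the function names are not part of any tool mentioned on the slides:

```python
from datetime import date

def valid_percent(value):
    """Mirror of the Access rules >=0 and <=100."""
    return 0 <= value <= 100

def valid_date(d):
    """Mirror of: Between #1/1/70# and Date()."""
    return date(1970, 1, 1) <= d <= date.today()

def valid_latitude(degrees, minutes):
    """Degrees in [0, 90]; minutes in [0, 60)."""
    return 0 <= degrees <= 90 and 0 <= minutes < 60
```

Checks like these can be run at entry time or as a batch filter over an existing file.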

  16. Validation rules in MS Access: enter in Table Design View (e.g., >=0 and <=100)

  17. Look-up Fields • Display a list of values from which entry can be selected

  18. Look-up Tables in MS Access: Enter in Table Design View

  19. Other methods for preventing data contamination • Double-keying of data by independent data entry technicians followed by computer verification for agreement • Use text-to-speech program to read data back • Filters for “illegal” data • Statistical/database/spreadsheet programs • Legal range of values • Sanity checks • Unit counts Edwards 2000
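
The double-keying idea above – two technicians key the same sheets independently, then a program checks for agreement – can be sketched minimally in Python (a hypothetical illustration, not a tool from the slides):

```python
def compare_double_keyed(rows_a, rows_b):
    """Report every position where two independently keyed copies disagree.

    Returns (row_index, field_index, value_a, value_b) tuples; each one
    marks a field a technician must re-check against the paper sheet.
    """
    mismatches = []
    for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        for j, (va, vb) in enumerate(zip(row_a, row_b)):
            if va != vb:
                mismatches.append((i, j, va, vb))
    return mismatches
```

An empty result means the two keyings agree everywhere; any tuple returned is a probable transcription error.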

  20. Flow of Information in Filtering “Illegal” Data: Raw Data File → Illegal Data Filter (checks against a Table of Possible Values and Ranges) → Report of Probable Errors. Edwards 2000

  21. Tree Growth Data – a filter written in SAS:
  data checkum;
    set tree;
    message = repeat(" ", 39);
    if cover < 0 or cover > 100 then do;
      message = "cover is not in the interval [0,100]";
      output;
    end;
    if dbh_1998 > dbh_1999 then do;
      message = "dbh_1998 is larger than dbh_1999";
      output;
    end;
    if message ne repeat(" ", 39);
    keep id message;
  run;
  proc print data=checkum;
  run;
  Error report:
  Obs ID message
  1 a dbh_1998 is larger than dbh_1999
  2 b cover is not in the interval [0,100]
  3 c dbh_1998 is larger than dbh_1999
  4 d cover is not in the interval [0,100]
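
For readers without SAS, the same illegal-data filter can be sketched in Python. Field names mirror the SAS example; the dict-based record layout is an assumption for illustration:

```python
def check_tree_records(records):
    """Flag records with impossible cover or shrinking DBH, as in the SAS filter.

    Each record is a dict with keys id, cover, dbh_1998, dbh_1999.
    Returns (id, message) pairs – a report of probable errors.
    """
    errors = []
    for rec in records:
        # Legal range check: percent cover must lie in [0, 100]
        if not (0 <= rec["cover"] <= 100):
            errors.append((rec["id"], "cover is not in the interval [0,100]"))
        # Sanity check: trees should not shrink between years
        if rec["dbh_1998"] > rec["dbh_1999"]:
            errors.append((rec["id"], "dbh_1998 is larger than dbh_1999"))
    return errors
```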

  22. Spreadsheet column statistics: Peromyscus truei example

  23. Spreadsheet range checks: =IF(mass>50,1,0)

  24. Unit counts: was everything measured? [Diagram: one plot containing Webs 1–3; each web is sampled in the N, E, S, and W directions with numbered quadrats, 16 quads per web.] This study design is replicated at 3 sites • At each site, ensure that there are 16 quads per web

  25. SQL: SELECT site, season, web, COUNT(DISTINCT plot, quad) FROM `palmtop`.`new_npp` WHERE year = 2003 GROUP BY site ASC, season ASC, web ASC
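
The same unit-count check can be sketched in plain Python; the record keys below are assumed to match the SQL columns:

```python
from collections import defaultdict

def quads_per_web(records):
    """Count distinct (plot, quad) combinations per (site, season, web).

    Mirrors the SQL COUNT(DISTINCT plot, quad) ... GROUP BY query:
    any group whose count is not 16 is missing (or duplicating) quads.
    """
    seen = defaultdict(set)
    for rec in records:
        key = (rec["site"], rec["season"], rec["web"])
        seen[key].add((rec["plot"], rec["quad"]))
    return {key: len(quads) for key, quads in seen.items()}
```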

  26. Other methods for preventing data contamination • Double-keying of data by independent data entry technicians followed by computer verification for agreement • Use text-to-speech program to read data back • Filters for “illegal” data • Statistical/database/spreadsheet programs • Legal range of values • Sanity checks • Properly designing database (taxon names, localities, persons are only entered once) Edwards 2000

  27. Good database design can improve data quality • Atomize data • Use consistent terminologies – don’t mix, e.g., “FLOWER COLOUR=RED” and “FLOWER COLOUR=CRIMSON” for the same condition • Document changes • Record level – what changes have been made and by whom • Dataset level

  28. Atomize data: One cell, one piece of information

  29. Avoid Domain schizophrenia • Fields used for purposes for which they were not intended Dalcin 2000

  30. Document changes to data • Why? • Avoid duplication of error checking • Users can determine fitness of data for use • How? • Include data quality/accuracy fields in database design • Develop an audit trail, so that changes can be undone – record the who, when, how and why of record updates
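
The audit-trail idea above – record the who, when, how and why before a value is changed – might be sketched as follows (the field names and dict structure are illustrative assumptions):

```python
from datetime import datetime, timezone

def update_record(record, field, new_value, who, why, audit_log):
    """Apply a change to one field, appending the change to an audit trail.

    Storing the old value alongside who/when/why means the change
    can later be reviewed or undone.
    """
    audit_log.append({
        "field": field,
        "old": record[field],
        "new": new_value,
        "who": who,
        "when": datetime.now(timezone.utc).isoformat(),
        "why": why,
    })
    record[field] = new_value
```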

  31. Metadata for ‘bad’ data – example description of variable 9:
  • Name: Average Wind Speed
  • Label: Avg_Windspeed
  • Definition: Average wind speed during the hour at 3 m
  • Units of Measure: meters/second
  • Precision of Measurements: 0.11 m/s
  • Range or List of Values: 0–50
  • Data Type: Real
  • Column Format: #.###
  • Field Position: Columns 51–58
  • Missing Data Codes: -999 (bad), -888 (not measured)
  • Computational Method for Derived Data: na

  32. Flagging Data Values
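
Using the missing-data codes from the metadata example above (-999 = bad, -888 = not measured), flagging might be sketched like this – values are paired with a quality flag rather than silently dropped:

```python
# Missing-data codes from the example metadata (variable 9)
MISSING_CODES = {-999: "bad", -888: "not measured"}

def flag_values(values):
    """Replace coded values with None and attach a QC flag to every value."""
    flagged = []
    for v in values:
        flag = MISSING_CODES.get(v, "ok")
        flagged.append((v if flag == "ok" else None, flag))
    return flagged
```

Downstream users can then filter on the flag column and judge for themselves whether the data are fit for their use.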

  33. Outline: • QC procedures • Designing data sheets • Data entry using validation rules, filters, lookup tables • QA procedures • Graphics and Statistics: • Consistency checks • Unusual patterns • Outliers • Archiving data

  34. Consistency check to identify sensor errors: Comparison of data from three Meteorology stations, Sevilleta LTER

  35. Identification of Sensor Errors: Comparison of data from three Met stations, Sevilleta LTER

  36. Manual QC/QA Flagging Sheldon, GCE LTER

  37. Outliers • An outlier is “an unusually extreme value for a variable, given the statistical model in use” • The goal of QA is NOT to eliminate outliers! Rather, we wish to detect unusually extreme values. • Attempt to determine if contamination is responsible and, if so, flag the contaminated value. Edwards 2000

  38. Methods for Detecting Outliers • Graphics • Scatter plots • Box plots • Histograms • Normal probability plots • Formal statistical methods • Grubbs’ test Edwards 2000

  39. X-Y scatter plots of gopher tortoise morphometrics Michener 2000

  40. Box Plot Interpretation • Inter-quartile range: IQR = Q(75%) – Q(25%) • Median: the line inside the box • Upper adjacent value = largest observation < Q(75%) + (1.5 × IQR) • Lower adjacent value = smallest observation > Q(25%) – (1.5 × IQR) • Extreme outlier: > 3 × IQR beyond the upper or lower adjacent value
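
The quantities on this slide can be computed with the standard library; note this sketch uses statistics.quantiles (exclusive method), and quartile conventions differ slightly between packages:

```python
import statistics

def box_plot_fences(data):
    """Compute IQR and the upper/lower adjacent values used in a box plot."""
    q1, _median, q3 = statistics.quantiles(data, n=4)  # exclusive method
    iqr = q3 - q1
    # Largest/smallest observations inside the 1.5 * IQR fences
    upper_adjacent = max(x for x in data if x < q3 + 1.5 * iqr)
    lower_adjacent = min(x for x in data if x > q1 - 1.5 * iqr)
    return q1, q3, iqr, lower_adjacent, upper_adjacent
```

Any observation beyond the adjacent values is plotted individually and is a candidate for closer inspection.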

  41. Box Plots Depicting Statistical Distribution of Soil Temperature

  42. Normal density and cumulative distribution functions [Figure: normal curves with the 95% and 99% regions marked] • If data are normally distributed, a data point falling more than 3 standard deviations away from the mean is an outlier. Edwards 2000
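
The 3-standard-deviation screen can be sketched in a few lines. One caveat worth noting: a large outlier inflates both the mean and the standard deviation it is compared against, which is one reason formal tests such as Grubbs’ test (below) exist:

```python
import statistics

def sd_outliers(data, k=3):
    """Flag points more than k sample standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * sd]
```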

  43. Normal Plot of 30 Observations from a Normal Distribution Edwards 2000

  44. Normal Plots from Non-normally Distributed Data Edwards 2000

  45. Statistical tests for outliers assume that the data are normally distributed…. CHECK THIS ASSUMPTION!

  46. Grubbs’ test for outlier detection in a univariate data set: Tn = (Yn – Ȳ)/S, where Yn is the possible outlier, Ȳ is the mean of the sample, and S is the standard deviation of the sample. Contamination exists if Tn is greater than T.01[n]. Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics 11(1), 1–21.

  47. Example of Grubbs’ test for outliers: rainfall in acre-feet from seeded clouds (Simpson et al. 1975): 4.1, 7.7, 17.5, 31.4, 32.7, 40.6, 92.4, 115.3, 118.3, 119.0, 129.6, 198.6, 200.7, 242.5, 255.0, 274.7, 274.7, 302.8, 334.1, 430.0, 489.1, 703.4, 978.0, 1656.0, 1697.8, 2745.6. T26 = 3.539 > 3.029, so contamination is indicated (Edwards 2000). But Grubbs’ test is sensitive to non-normality……
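
The slide’s computation can be reproduced with a short script (sample standard deviation; the largest observation is taken as the suspect point):

```python
import statistics

# Rainfall in acre-feet from seeded clouds (Simpson et al. 1975), as listed above
RAINFALL = [4.1, 7.7, 17.5, 31.4, 32.7, 40.6, 92.4, 115.3, 118.3, 119.0,
            129.6, 198.6, 200.7, 242.5, 255.0, 274.7, 274.7, 302.8, 334.1,
            430.0, 489.1, 703.4, 978.0, 1656.0, 1697.8, 2745.6]

def grubbs_statistic(data):
    """Tn = (Yn - Ybar) / S for the largest observation Yn."""
    ybar = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return (max(data) - ybar) / s
```

grubbs_statistic(RAINFALL) gives T26 ≈ 3.54, exceeding the 1% critical value 3.029 – the slide’s conclusion of contamination, subject to the normality caveat that follows.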

  48. Checking Assumptions on Rainfall Data [Figure panels: “Skewed distribution: Grubbs’ test detects contaminating points”; “Normal distribution: Grubbs’ test detects no contamination”] Edwards 2000

  49. References about outliers: • Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. John Wiley & Sons, New York. • Iglewicz, B. and Hoaglin, D. C. (1993). How to Detect and Handle Outliers. American Society for Quality Control, Milwaukee, WI.
