Quality assurance quality control
Download
1 / 68

Quality Assurance & Quality Control - PowerPoint PPT Presentation


  • 150 Views
  • Updated On :

Quality Assurance & Quality Control. Kristin Vanderbilt, Ph.D. Sevilleta LTER New Mexico, USA. What is data quality?. fitness of a dataset or record for a particular use. Chrisman, 1991. Two Types of Errors. Commission: Incorrect or inaccurate data are entered into a dataset

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Quality Assurance & Quality Control' - nona


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Quality assurance quality control l.jpg

Quality Assurance & Quality Control

Kristin Vanderbilt, Ph.D.

Sevilleta LTER

New Mexico, USA


What is data quality l.jpg
What is data quality?

  • fitness of a dataset or record for a particular use

Chrisman, 1991


Two types of errors l.jpg
Two Types of Errors

  • Commission: Incorrect or inaccurate data are entered into a dataset

    • Transcription Error

    • Malfunctioning instrumentation

  • Omission: Data or metadata are not recorded

    • Difficult or impossible to find

    • Inadequate documentation of data values, sampling methods, anomalies in field, human errors


  • Loss of data quality can occur at many stages l.jpg
    Loss of data quality can occur at many stages:

    • At the time of collection

    • During digitisation

    • During documentation

    • During archiving


    Qa qc l.jpg
    QA/QC

    • “mechanisms [that] are designed to prevent the introduction of errors into a data set, a process known as data contamination”

      Brunt 2000


    Slide6 l.jpg

    Scientist

    Database

    • Quality Assurance:

    • Graphics

    • Statistics

    • Documentation

    • Quality Control:

    • Datasheet Design

    • Data Entry Constraint checks

    • Documentation

    Cost of error correction increases


    Qa qc activities l.jpg
    QA/QC Activities

    • Defining and enforcing standards for formats, codes, measurement units and metadata.

    • Checking for unusual or unreasonable patterns in data.

    • Checking for comparability of values between data sets.

    • Documenting all QA/QC activities

      Brunt 2000


    Outline l.jpg
    Outline:

    • QC procedures

      • Designing data sheets

      • Data entry using validation rules, filters, lookup tables

    • QA procedures

      • Graphics and Statistics

      • Outlier detection

    • Archiving data


    The cost of errors l.jpg
    The Cost of Errors

    • Error prevention is considered to be far superior to error detection, since detection is often costly and can never guarantee to be 100% successful (Dalcin 2004).


    Quality control starts with people l.jpg
    Quality Control starts with people

    • Assign responsibility for the quality of data to those who create them. If this is not possible, assign responsibility as close to data creation as possible (Redman 2001)

    • Poor training lies at the root of many data quality problems.


    Flowering plant phenology data collection form design l.jpg

    1

    1

    1

    2

    2

    2

    3

    3

    3

    Flowering Plant Phenology – Data Collection Form Design

    • Three sites, each with 3 transects

    • On each transect, every species will have its phenological class recorded

    Deep Well

    Five Points

    Goat Draw


    Slide12 l.jpg

    Data Collection Form Development:

    What’s wrong with this data sheet?

    PlantLife Stage

    _________________________________________

    _________________________________________

    _________________________________________

    _________________________________________

    _________________________________________


    Slide13 l.jpg

    PHENOLOGY DATA SHEET

    Collectors:_________________________________

    Date:___________________ Time:_________

    Location:deep well, five points, goat draw

    Transect: 1 2 3

    Notes:_________________________________________

    PlantLife Stage

    ardi P/G V B FL FR M S D NP

    arpu P/G V B FL FR M S D NP

    atca P/G V B FL FR M S D NP

    bamu P/G V B FL FR M S D NP

    zigr P/G V B FL FR M S D NP

    P/G V B FL FR M S D NP

    P/G V B FL FR M S D NP

    P/G = perennating or germinating M = dispersing

    V = vegetating S = senescing

    B = budding D = dead

    FL = flowering NP = not present

    FR = fruiting


    Slide14 l.jpg

    ardi P/G V B FL FR M S D NP

    deob P/G V B FL FR M S D NP

    arpu P/G V B FL FR M S D NP

    asbr P/G V B FL FR M S D NP

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    YN

    Y N

    YN

    Y N

    Y N

    YN

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    Y N

    YN

    Y N

    PHENOLOGY DATA ENTRY INTERFACE

    Collectors Mike Friggens

    Date: 16 May 1998 Time: 13:12

    Location: Deep Well

    Transect: 1

    Notes:Cloudy day, 3 gopher burrows on transect


    Validation rules l.jpg
    Validation Rules:

    • Control the values that a user can enter into a field

    • Example in Access:

      • <=100

      • Between #1/1/70# and Date()

      • Coordinate data (latitude):

        • Degrees: >=0 and <= 90

        • Minutes: >=0 and <=60


    Validation rules in ms access enter in table design view l.jpg

    >=0 and <= 100

    Validation rules in MS Access: Enter in Table Design View


    Look up fields l.jpg
    Look-up Fields

    • Display a list of values from which entry can be selected


    Look up tables in ms access enter in table design view l.jpg
    Look-up Tables in MS Access: Enter in Table Design View


    Other methods for preventing data contamination l.jpg
    Other methods for preventing data contamination

    • Double-keying of data by independent data entry technicians followed by computer verification for agreement

    • Use text-to-speech program to read data back

    • Filters for “illegal” data

      • Statistical/database/spreadsheet programs

        • Legal range of values

        • Sanity checks

        • Unit counts

          Edwards 2000


    Slide20 l.jpg

    Flow of Information in Filtering “Illegal” Data

    Raw

    Data

    File

    Illegal

    Data

    Filter

    Table of

    Possible

    Values

    and Ranges

    Report of

    Probable

    Errors

    Edwards 2000


    Slide21 l.jpg

    Tree Growth Data

    A filter written in SAS

    • Data Checkum; Set tree;

      • Message=repeat(“ ”,39);

      • If cover<0 or cover>100 then do; message=“cover is not in the interval [0,100]”; output; end;

      • If dbh_1998>dbh_1999 then do; message=“dbh_1998 is larger than dbh_1999”; output; end;

      • If message NE repeat(“ ”, 39);

      • Keep ID message;

    • Proc Print Data=Checkum;

    Error report

    Obs ID message

    1 a dbh_1998 is larger than dbh_1999

    2 b cover is not in the interval [0,100]

    3 c dbh_1998 is larger than dbh_1999

    4 d cover is not in the interval [0,100]


    Spreadsheet column statistics peromyscus truei example l.jpg
    Spreadsheet column statistics:Peromyscus truei example


    Spreadsheet range checks l.jpg
    Spreadsheet range checks

    =if(mass>50,1,0)


    Unit counts was everything measured l.jpg
    Unit counts: was everything measured?

    Quadrats

    1

    1

    1

    2

    2

    2

    3

    3

    3

    4

    4

    4

    Web 1

    Plot

    N

    Web 2

    E

    W

    Web 3

    S

    This study design is replicated at 3 sites.

    • At each site:

      • Ensure that there are 16 quads per web


    Slide25 l.jpg

    SQL:

    SELECT site, season, web, count(distinct plot, quad)

    FROM `palmtop`.`new_npp`

    where year = 2003

    group by site ASC, season ASC, web ASC


    Other methods for preventing data contamination27 l.jpg
    Other methods for preventing data contamination

    • Double-keying of data by independent data entry technicians followed by computer verification for agreement

    • Use text-to-speech program to read data back

    • Filters for “illegal” data

      • Statistical/database/spreadsheet programs

        • Legal range of values

        • Sanity checks

    • Properly designing database (taxon names, localities, persons are only entered once)

      Edwards 2000


    Good database design can improve data quality l.jpg
    Good database design can improve data quality

    • Atomize data

    • Use consistent terminologies

      • “FLOWER COLOUR=RED”, and

      • “ FLOWER COLOUR=CRIMSON”.

    • Document changes

      • Record level – what changes have been made and by whom

      • Dataset level



    Avoid domain schizophrenia l.jpg
    Avoid Domain schizophrenia

    • Fields used for purposes for which they were not intended

    Dalcin 2000


    Document changes to data l.jpg
    Document changes to data

    • Why?

      • Avoid duplication of error checking

      • Users can determine fitness of data for use

    • How?

      • Include data quality/accuracy fields in database design

      • Develop an audit trail, so that changes can be undone – record the who, when, how and why of record updates


    Metadata for bad data l.jpg
    Metadata for ‘bad’ data

    Variable 9* Name: Average Wind Speed

    * Label: Avg_Windspeed

    * Definition: Average wind speed during the hour at 3 m

    * Units of Measure: meters/second

    * Precision of Measurements: .11 m/s

    * Range or List of Values: 0-50

    * Data Type: Real

    * Column Format: #.###

    * Field Position: Columns 51-58

    * Missing Data Code: -999 (bad) -888 (not measured)

    * Computational Method for Derived Data: na



    Outline34 l.jpg
    Outline:

    • QC procedures

      • Designing data sheets

      • Data entry using validation rules, filters, lookup tables

    • QA procedures

      • Graphics and Statistics:

        • Consistency checks

        • Unusual patterns

        • Outliers

    • Archiving data


    Slide35 l.jpg

    Consistency check to identify sensor errors: Comparison of data from three Meteorology stations, Sevilleta LTER



    Slide37 l.jpg

    Manual QC/QA Flagging three Met stations, Sevilleta LTER

    Sheldon, GCE LTER


    Outliers l.jpg
    Outliers three Met stations, Sevilleta LTER

    • An outlier is “an unusually extreme value for a variable, given the statistical model in use”

    • The goal of QA is NOT to eliminate outliers! Rather, we wish to detect unusually extreme values.

    • Attempt to determine if contamination is responsible and, if so, flag the contaminated value.

      Edwards 2000


    Methods for detecting outliers l.jpg
    Methods for Detecting Outliers three Met stations, Sevilleta LTER

    • Graphics

      • Scatter plots

      • Box plots

      • Histograms

      • Normal probability plots

    • Formal statistical methods

      • Grubbs’ test

        Edwards 2000


    Slide40 l.jpg

    X-Y scatter plots of gopher tortoise morphometrics three Met stations, Sevilleta LTER

    Michener 2000


    Box plot interpretation l.jpg
    Box Plot Interpretation three Met stations, Sevilleta LTER

    IQR = Q(75%) – Q(25%)

    Upper adjacent value = largest observation < (Q(75%) + (1.5 X IQR))

    Lower adjacent value = smallest observation > (Q(25%) - (1.5 X IQR))

    Extreme outlier: > 3 X IQR beyond upper or lower adjacent values

    Inter-quartile range

    Median



    Slide43 l.jpg

    Normal density and Cumulative Distribution Functions Temperature

    95%

    99%

    Edwards 2000

    If data are normally distributed, a data point falling more than 3 standard deviations away from the mean is an outlier




    Statistical tests for outliers assume that the data are normally distributed l.jpg
    Statistical tests for outliers assume that the data are normally distributed….

    CHECK THIS ASSUMPTION!


    Grubbs test for outlier detection in a univariate data set l.jpg
    Grubbs’ test for outlier detection in a univariate data set:

    Tn = (Yn – Ybar)/S

    where Yn is the possible outlier,

    Ybar is the mean of the sample, and

    S is the standard deviation of the sample

    Contamination exists if Tn is greater than T.01[n]

    Grubbs, Frank (February 1969), Procedures for Detecting Outlying Observations in Samples, Technometrics, Vol. 11, No. 1, pp. 1-21.


    Example of grubbs test for outliers rainfall in acre feet from seeded clouds simpson et al 1975 l.jpg
    Example of Grubbs’ test for outliers: rainfall in acre-feet from seeded clouds (Simpson et al. 1975)

    4.1 7.7 17.5 31.4 32.7 40.6 92.4 115.3 118.3 119.0 129.6 198.6 200.7 242.5 255.0 274.7 274.7 302.8 334.1 430.0 489.1 703.4 978.0 1656.0 1697.8 2745.6

    T26 = 3.539 > 3.029;Contaminated

    Edwards 2000

    But Grubbs’ test is sensitive

    to non-normality……


    Slide49 l.jpg

    Checking Assumptions on Rainfall Data acre-feet from seeded clouds (Simpson et al. 1975)

    Skewed distribution: Grubbs’ Test detects contaminating points

    Normal

    Distribution: Grubbs’ test detects no contamination

    Edwards 2000


    References about outliers l.jpg
    References about outliers: acre-feet from seeded clouds (Simpson et al. 1975)

    • Barnett, V. and Lewis, T.: 1994, Outliers in Statistical Data, John Wiley & Sons, New York

    • Iglewicz, B. and Hoaglin, D. C.: 1993 How to Detect and Handle Outliers, American Society for Quality Control, Milwaukee, WI.


    Slide51 l.jpg

    QA/QC in the Lab: Using Control Charts acre-feet from seeded clouds (Simpson et al. 1975)


    Slide52 l.jpg

    Laboratory quality control using statistical process control charts:

    • Determine whether analytical system is “in control” by examining

      • Mean

      • Variability (range)


    Mean control chart l.jpg
    Mean Control Chart charts:

    N concentration in sample of known concentration

    UCL = Mean + 3 SD

    UCL

    UCL

    Mean

    Mean

    N Concentration

    LCL

    LCL

    Time


    Linear trend l.jpg
    Linear trend charts:


    Control charts for qa of weather data l.jpg
    Control Charts for QA of weather data charts:

    Hourly air temperatures (1-24) for each month (calculated for many years of historical data)

    Hourly air temperature, 26th day of each month 2002

    Eching and Snyder 2005


    Slide56 l.jpg

    Daily average air temperature for each month charts:

    Daily average air temperature for each day in 2002


    North temperate lakes qa algorithm l.jpg
    North Temperate Lakes QA Algorithm charts:

    Figure 7. Hourly wind speed plot of 1997 and 2000 with 0 speed highlighted in dots

    Hu and Benson, unpublished


    Slide58 l.jpg

    # of wind speeds in 0-1 bin/ charts:

    Total # of wind speeds recorded

    Figure 9. Hourly average wind speed PDF by month for 1995.


    Slide59 l.jpg

    Average frequency of 0 wind speed in each month based on 1995 data + 3 SD

    Figure 10. Detecting abnormal wind speed distribution.


    Outline60 l.jpg
    Outline: 1995 data + 3 SD

    • QC procedures

      • Designing data sheets

      • Data entry using validation rules, filters, lookup tables

    • QA procedures

      • Graphics and Statistics:

        • Consistency checks

        • Unusual patterns

        • Outliers

    • Archiving data


    Archiving high quality data for easy reuse l.jpg
    Archiving high quality data for easy reuse 1995 data + 3 SD

    • Avoid using the same column title more than once

    • Avoid inconsistencies (e.g. different date ranges in title vs. the data)

    Figure courtesy of Christine Laney, JRN LTER


    Avoid formatting errors cryptic data and metadata interspersed with the data l.jpg
    Avoid formatting errors, cryptic data, and metadata interspersed with the data

    Figure courtesy of Christine Laney, JRN LTER


    The details l.jpg
    The details interspersed with the data

    • Dates as an example of what not to do:

      • 2-digit years

      • range of dates in single cell (e.g., 02/01-03/2006 or 02/01/2006,02/03/2006)

      • date with a letter appended to the end (ex: 02/01/1999A)

      • single digit day and month, especially when there are no delimiters between month, day, year. (e.g., 1212005)

    courtesy of Christine Laney, JRN LTER


    Preferred data formats for synthesis l.jpg
    Preferred data formats for synthesis interspersed with the data

    • Simple ascii delimited with commas, spaces, tabs, etc. with headers, or very simple excel spreadsheets. If fixed-width, give widths and spaces.

    • Metadata in separate file

    • All data in single file, not separated by year. If not possible, each file in exactly the same format.

    • Complex formatting systems, like multisheets & several tables in one sheet, are more difficult to interpret and extract information.


    More archival suggestions l.jpg
    More archival suggestions: interspersed with the data

    • Assign descriptive file names:

      • File names should be unique and reflect the file contents

        • Bad file names

          • Mydata

          • 2001_data

        • Good file name:

          • Sevilleta _LTER_NM_2001_NPP.asc

            • Sevilleta_LTER is the project name

            • NM is the state abbreviation

            • 2001 is the calendar year

            • NPP represents Net Primary Productivity Data

            • Asc stands for the file type-ASCII

    • Assign descriptive data set titles (similar to file names):

      • Shrub net primary productivity at the Sevilleta LTER, New Mexico, 2001


    Best practices reference l.jpg
    Best practices reference: interspersed with the data

    • Cook, R. B., R. J. Olson, P. Kanciruk, and L. A. Hook. 2001. Best practices for preparing ecological and ground-based data sets to share and archive. Ecol. Bulletins 82:138-141.


    Slide67 l.jpg

    References interspersed with the data

    • Michener and Brunt (2000) Ecological Data: Design, Management and Processing. Blackwell Science.

      • Edwards (2000), “Data Quality Assurance”

      • Brunt (2000) Ch. 2, “Data Management Principles, Implementation, and Administration”

      • Michener (2000) Ch. 7: “Transforming Data into Information and Knowledge”

    • Redman, T.C. 2001. Data Quality: The Field Guide. Boston, MA: Digital Press.

    • Chapman, A.D. 2005a. Principles of Data Quality. Report for the Global Biodiversity Information Facility 2004. Copenhagen: GBIF. http://www.gbif.org/prog/digit/data_quality/data_quality

    • Dalcin, E.C. 2005. Data Quality Concepts Applied to Taxonomic Databases. Ph.D. Dissertation, Univeresity of Southhamptom, UK 2005. (http://www.dalcin.org/eduardo/downloads/edalcin_thesis_submission.pdf)


    Slide68 l.jpg

    Questions? interspersed with the data