Chapter 3: Reliability and Objectivity


Chapter 3 Outline

  • Selecting a Criterion Score

  • Types of Reliability

  • Reliability Theory

  • Estimating Reliability – Intraclass R

  • Spearman-Brown Prophecy Formula

  • Standard Error of Measurement

  • Objectivity

  • Reliability of Criterion-referenced Tests

  • Reliability of Difference Scores


Objectivity

  • Interrater Reliability

  • Agreement of competent judges about the value of a measure.


Reliability

  • Dependability of scores

  • Consistency

  • Degree to which a test is free from measurement error.


Selecting a Criterion Score

  • Criterion score – the measure used to indicate a person’s ability.

    • Can be based on the mean score or the best score.

  • Mean Score – average of all trials.

    • Usually a more reliable estimate of a person’s true ability.

  • Best Score – optimal score a person achieves on any one trial.

    • May be used when criterion score is to be used as an indicator of maximum possible performance.


Potential Methods to Select a Criterion Score

  • Mean of all trials.

  • Best score of all trials.

  • Mean of selected trials based on trials on which group scored best.

  • Mean of selected trials based on trials on which individual scored best (i.e., omit outliers).

    Appropriate method to use depends on the situation.


Norm-referenced Test

  • Designed to reflect individual differences.


In Norm-referenced Framework

  • Reliability - ability to detect reliable differences between subjects.


Types of Reliability

  • Stability

  • Internal Consistency


Stability (Test-retest) Reliability

  • Each subject is measured with same instrument on two or more different days.

  • Scores are then correlated.

    • An intraclass correlation should be used.


Internal Consistency Reliability

  • Consistent rate of scoring throughout a test or from trial to trial.

  • All trials are administered in a single day.

  • Trial scores are then correlated.

    • An intraclass correlation should be used.


Sources of Measurement Error

  • Lack of agreement among raters (i.e., lack of objectivity).

  • Lack of consistent performance by person.

  • Failure of instrument to measure consistently.

  • Failure of tester to follow standardized procedures.


Reliability Theory

X = T + E

Observed score = True score + Error

2X = 2t + 2e

Observed score variance = True score variance + Error variance

Reliability = 2t ÷ 2X

Reliability = (2X - 2e) ÷ 2X


Reliability depends on:

  • Decreasing measurement error

  • Detecting individual differences among people

    • ability to discriminate among different ability levels


Reliability

  • Ranges from 0 to 1.00

    • When R = 0, there is no reliability.

    • When R = 1, there is maximum reliability.


Reliability from Intraclass R

  • ANOVA is used to partition the variance of a set of scores.

  • Parts of the variance are used to calculate the intraclass R.


Estimating Reliability

  • Intraclass correlation from one-way ANOVA:

  • R = (MSA – MSW) ÷ MSA (see the sketch after this list)

    • MSA = Mean square among subjects (also called between subjects)

    • MSW = Mean square within subjects

    • Mean square = variance estimate

  • This represents reliability of the mean test score for each person.
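
The sketch below works through the one-way calculation in Python; the 4 × 3 score matrix is made up for illustration, and the mean squares are computed directly from their definitions rather than with a statistics package.

```python
# One-way ANOVA intraclass R: R = (MSA - MSW) / MSA
# Hypothetical data: 4 subjects, 3 trials each.
scores = [
    [9, 8, 9],   # subject 1
    [5, 6, 5],   # subject 2
    [7, 7, 8],   # subject 3
    [3, 4, 4],   # subject 4
]
n, k = len(scores), len(scores[0])           # number of subjects, trials per subject

grand_mean = sum(sum(row) for row in scores) / (n * k)
subject_means = [sum(row) / k for row in scores]

# Among-subjects (between-subjects) mean square
ss_among = k * sum((m - grand_mean) ** 2 for m in subject_means)
ms_among = ss_among / (n - 1)

# Within-subjects mean square: all trial-to-trial variation is treated as error
ss_within = sum((x - m) ** 2 for row, m in zip(scores, subject_means) for x in row)
ms_within = ss_within / (n * (k - 1))

R = (ms_among - ms_within) / ms_among        # reliability of each person's mean score
print(round(R, 3))
```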



Estimating Reliability

  • Intraclass correlation from two-way ANOVA:

  • R = (MSA – MSR) ÷ MSA (see the sketch after this list)

    • MSA = Mean square among subjects (also called between subjects)

    • MSR = Mean square residual

    • Mean square = variance estimate

  • Used when trial-to-trial variance is not considered measurement error (e.g., a Likert-type scale).
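
A sketch of the two-way version follows, using the same hypothetical score matrix as above; the only change is that systematic trial-to-trial differences are removed from the error term before R is computed.

```python
# Two-way ANOVA intraclass R: R = (MSA - MSR) / MSA
# Same hypothetical 4 x 3 score matrix as in the one-way sketch.
scores = [
    [9, 8, 9],
    [5, 6, 5],
    [7, 7, 8],
    [3, 4, 4],
]
n, k = len(scores), len(scores[0])
grand = sum(sum(row) for row in scores) / (n * k)
subject_means = [sum(row) / k for row in scores]
trial_means = [sum(row[j] for row in scores) / n for j in range(k)]

ss_total = sum((x - grand) ** 2 for row in scores for x in row)
ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
ss_trials = n * sum((m - grand) ** 2 for m in trial_means)
ss_residual = ss_total - ss_subjects - ss_trials   # error with the trial effect removed

ms_a = ss_subjects / (n - 1)
ms_r = ss_residual / ((n - 1) * (k - 1))
R = (ms_a - ms_r) / ms_a
print(round(R, 3))
```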



What is acceptable reliability?

  • Depends on:

    • age

    • gender

    • experience of people tested

    • size of reliability coefficients others have obtained

    • number of days or trials

    • stability vs. internal consistency coefficient


What is acceptable reliability?

  • Most physical measures are stable from day to day.

    • Expect test-retest Rxx between .80 and .95.

  • Expect lower Rxx for tests with an accuracy component (e.g., .70).

  • For written tests, want Rxx > .70.

  • For psychological instruments, want Rxx > .70.

  • Critical issue: the time interval between two test sessions for stability reliability estimates. An interval of 1 to 3 days is usually appropriate for physical measures.


Factors Affecting Reliability

  • Type of test.

    • Maximum effort test: expect Rxx of about .80 or higher.

    • Accuracy-type test: expect Rxx of about .70.

    • Psychological inventories: expect Rxx of about .70.

  • Range of ability.

    • Rxx higher for heterogeneous groups than for homogeneous groups.

  • Test length.

    • Longer test, higher Rxx


Factors Affecting Reliability

  • Scoring accuracy.

    • Person administering test must be competent.

  • Test difficulty.

    • Test must discriminate among ability levels.

  • Test environment, organization, and instructions.

    • Conditions should be favorable to good performance; examinees should be motivated to do well, ready to be tested, and know what to expect.


Factors Affecting Reliability

  • Fatigue

    • decreases Rxx

  • Practice trials

    • increase Rxx


Coefficient Alpha

  • AKA Cronbach’s alpha

  • Most widely used with attitude instruments

  • Same as two-way intraclass R through ANOVA

  • An estimate of Rxx of a criterion score that is the sum of trial scores in one day


Coefficient Alpha

Rα = [K ÷ (K − 1)] × [(S²x − ΣS²trials) ÷ S²x]

• K = number of trials or items

• S²x = variance of the criterion score (the sum of all trials)

• ΣS²trials = sum of the variances of the individual trials
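
A minimal sketch of the calculation, assuming a small made-up matrix of trial scores; population variances are used throughout so the two variance terms are on the same footing.

```python
# Coefficient (Cronbach's) alpha from the formula above.
# Hypothetical data: 4 subjects, each with 3 trial (or item) scores.
from statistics import pvariance

scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
]
k = len(scores[0])                            # number of trials/items
criterion = [sum(row) for row in scores]      # criterion score = sum of trials
s2_x = pvariance(criterion)                   # S^2_x
s2_trials = sum(pvariance([row[j] for row in scores]) for j in range(k))  # sum of trial variances

alpha = (k / (k - 1)) * ((s2_x - s2_trials) / s2_x)
print(round(alpha, 3))
```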


Kuder-Richardson (KR)

  • Estimate of internal consistency reliability by determining how all items on a test relate to the total test.

  • KR formulas 20 and 21 are typically used to estimate Rxx of knowledge tests.

  • Used with dichotomous items (scored as right or wrong).

  • KR20 = coefficient alpha


KR20

  • KR20 = [K ÷ (K − 1)] × [(S²x − Σpq) ÷ S²x]

    • K = # of trials or items

    • S²x = variance of scores

    • p = proportion answering the item correctly

    • q = proportion answering the item incorrectly

    • Σpq = sum of the pq products for all K items


KR20 Example

Item    p      q      pq
1      .50    .50    .25
2      .25    .75    .1875
3      .80    .20    .16
4      .90    .10    .09

Σpq = 0.6875

If Mean = 2.45 and SD = 1.2 (so S²x = 1.44), what is KR20?

KR20 = (4 ÷ 3) × (1.44 − 0.6875) ÷ 1.44

KR20 = .70
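
The same arithmetic as a short Python sketch, reproducing the worked example (the four p values, the mean of 2.45, and the SD of 1.2 come from the slide above):

```python
# KR20 for the worked example above.
p = [0.50, 0.25, 0.80, 0.90]                     # proportion answering each item correctly
q = [1 - pi for pi in p]
sum_pq = sum(pi * qi for pi, qi in zip(p, q))    # 0.6875

k = len(p)                                       # 4 items
s2_x = 1.2 ** 2                                  # SD = 1.2, so variance = 1.44
kr20 = (k / (k - 1)) * ((s2_x - sum_pq) / s2_x)
print(round(kr20, 2))                            # 0.70
```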


KR21

  • If all test items are assumed to be equally difficult, KR20 simplifies to KR21:

    KR21 = [(K × S²) − (Mean × (K − Mean))] ÷ [(K − 1) × S²]

    • K = # of trials or items

    • S² = variance of the test

    • Mean = mean of the test
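
Using the same summary statistics from the KR20 example (K = 4, Mean = 2.45, S² = 1.44), a quick sketch shows that KR21 comes out lower, as expected when item difficulties actually differ:

```python
# KR21 under the equal-difficulty assumption.
k, mean, s2 = 4, 2.45, 1.44
kr21 = ((k * s2) - (mean * (k - mean))) / ((k - 1) * s2)
print(round(kr21, 2))    # about 0.45, lower than the KR20 of .70
```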


Equivalence Reliability (Parallel Forms)

  • Two equivalent forms of a test are administered to same subjects.

  • Scores on the two forms are then correlated.


Spearman-Brown Prophecy formula

  • Used to estimate rxx of a test that is changed in length.

  • rkk = (k × r11) ÷ [1 + (k − 1) × r11]

  • k = the factor by which the test is changed in length.

  • k = (# of trials you want) ÷ (# of trials you have)

  • r11 = reliability of the test you're starting with

  • Spearman-Brown formula will give an estimate of maximum reliability that can be expected (upper bound estimate).
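
As a sketch, suppose a 6-trial test has r11 = .60 and you want the most you could expect from an 18-trial version (k = 18 ÷ 6 = 3); the numbers are hypothetical.

```python
# Spearman-Brown prophecy formula: r_kk = (k * r11) / (1 + (k - 1) * r11)
def spearman_brown(r11, k):
    """Upper-bound estimate of reliability after changing test length by factor k."""
    return (k * r11) / (1 + (k - 1) * r11)

k = 18 / 6                                   # tripling the number of trials
print(round(spearman_brown(0.60, k), 2))     # about 0.82
```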


Standard Error of Measurement (SEM)

  • Degree you expect test score to vary due to measurement error.

  • Standard deviation of a test score.

  • SEM = Sx × √(1 − Rxx)

    • Sx = standard deviation of group

    • Rxx = reliability coefficient

  • Small SEM indicates high reliability


SEM

  • Example: written test with Sx = 5 and Rxx = .88

  • SEM = 5 × √(1 − .88) = 5 × √.12 = 1.73

  • Confidence intervals:

    68%: X ± 1.00 × SEM

    95%: X ± 1.96 × SEM

  • If X = 23: 23 + 1.73 = 24.73 and 23 − 1.73 = 21.27

  • 68% confident true score is between 21.27 and 24.73
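
The same example as a short sketch (Sx = 5, Rxx = .88, observed score X = 23):

```python
# SEM and confidence intervals for the written-test example above.
from math import sqrt

s_x, r_xx, x = 5.0, 0.88, 23.0
sem = s_x * sqrt(1 - r_xx)                       # about 1.73
ci_68 = (x - sem, x + sem)                       # roughly 21.27 to 24.73
ci_95 = (x - 1.96 * sem, x + 1.96 * sem)
print(round(sem, 2), ci_68, ci_95)
```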


Objectivity (Rater Reliability)

  • Degree of agreement between raters.

  • Depends on:

    • clarity of scoring system.

    • degree to which judge can assign scores accurately.

  • If test is highly objective, objectivity is obvious and rarely calculated.

  • As subjectivity increases, test developer should report estimate of objectivity.


Two Types of Objectivity:

  • Intrajudge objectivity

    • consistency in scoring when the same test user scores the same test two or more times.

  • Interjudge objectivity

    • consistency between two or more independent judgments of the same performance.

  • Calculate objectivity like reliability, but substitute judges' scores for trial scores.


Criterion-referenced Test

  • A test used to classify a person as proficient or nonproficient (pass or fail).


In Criterion-referenced Framework:

  • Reliability - defined as consistency of classification.


Reliability of Criterion-referenced Test Scores

  • To estimate reliability, a double-classification or contingency table is formed.


Contingency Table (Double-classification Table)

                    Day 2
                 Pass   Fail
Day 1   Pass       A      B
        Fail       C      D

Proportion of Agreement (Pa)

  • Most popular way to estimate Rxx of CRT.

  • Pa = (A + D) ÷ (A + B + C + D)

  • Pa does not take into account that some consistent classifications could happen by chance.


Example for calculating Pa

                    Day 2
                 Pass   Fail
Day 1   Pass      45     12
        Fail       8     35

Pa = (A + D) ÷ (A + B + C + D)

Pa = (45 + 35) ÷ (45 + 12 + 8 + 35)

Pa = 80 ÷ 100 = .80
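
A one-line check of the calculation above, using the four cell counts from the table:

```python
# Proportion of agreement for the 2 x 2 table above.
A, B, C, D = 45, 12, 8, 35
pa = (A + D) / (A + B + C + D)
print(pa)    # 0.8
```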


Kappa Coefficient (K)

  • Estimate of CRT Rxx with correction for chance agreements.

    K = (Pa - Pc) ÷ (1 - Pc)

    • Pa = Proportion of Agreement

    • Pc = Proportion of Agreement expected by chance

    Pc = [(A + B)(A + C) + (C + D)(B + D)] ÷ (A + B + C + D)²


Example for calculating K

                    Day 2
                 Pass   Fail
Day 1   Pass      45     12
        Fail       8     35

Pa = .80 (from the previous example)

Pc = [(A + B)(A + C) + (C + D)(B + D)] ÷ (A + B + C + D)²

Pc = [(45 + 12)(45 + 8) + (8 + 35)(12 + 35)] ÷ (100)²

Pc = [(57)(53) + (43)(47)] ÷ 10,000 = 5,042 ÷ 10,000

Pc = .5042


Kappa (K)

  • K = (Pa - Pc) ÷ (1 - Pc)

  • K = (.80 - .5042) ÷ (1 - .5042)

  • K = .597
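
A short sketch tying the two previous slides together, computing Pc and kappa from the same cell counts:

```python
# Kappa coefficient from the 2 x 2 table above.
A, B, C, D = 45, 12, 8, 35
n = A + B + C + D
pa = (A + D) / n                                          # 0.80
pc = ((A + B) * (A + C) + (C + D) * (B + D)) / n ** 2     # 0.5042
kappa = (pa - pc) / (1 - pc)
print(round(kappa, 3))                                    # about 0.597
```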


Modified Kappa (Kq)

  • Kq may be more appropriate than K when proportion of people passing a criterion-referenced test is not predetermined.

  • Most situations in exercise science do not predetermine the number of people who will pass.


Modified Kappa (Kq)

  • Kq = (Pa – 1/q) ÷ (1 – 1/q)

    • q = number of classification categories

    • If pass-fail, q = 2

  • Kq = (.80 - .50) ÷ (1 - .50)

  • Kq = .60
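
The same example as a sketch; with q = 2 categories the chance term is simply 1/2:

```python
# Modified kappa (Kq) for a pass/fail test.
pa, q = 0.80, 2
kq = (pa - 1 / q) / (1 - 1 / q)
print(round(kq, 2))    # 0.6
```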


Modified Kappa

  • Interpreted same as K.

  • When proportion of masters = .50, Kq = K.

  • Otherwise, Kq > K.


Interpretation of Rxx for CRT

  • Pa (Proportion of Agreement)

    • Affected by chance classifications

    • Pa values below .50 are unacceptable.

    • Pa should be > .80 in most situations.

  • K and Kq (Kappa and Modified Kappa)

    • Interpretable range: 0.0 to 1.0

    • Minimum acceptable value = .60


When reporting results:

  • Report both indices of Rxx.


Formative Evaluation of Chapter Objectives

  • Define and differentiate between reliability and objectivity for norm-referenced tests.

  • Identify factors that influence reliability and objectivity of norm-referenced test scores.

  • Identify factors that influence reliability of criterion-referenced test scores.

  • Select a reliable criterion score based on measurement theory.


Chapter 3: Reliability and Objectivity

