Anita L. Stewart Institute for Health & Aging University of California, San Francisco

Class 4: Basic Psychometric Characteristics: Variability, Reliability, Interpretability. October 15, 2009. Anita L. Stewart, Institute for Health & Aging, University of California, San Francisco

Overview of Class 4 • Concepts of error, sources of error and bias in measures. • Indicators of variability and reasons for poor variability • Indicators of reliability • Interpretability of scores

Components of an Individual’s Observed Item Score (Simplistic view): Observed item score = true score + error

Components of an Individual’s Observed Item Score: Observed item score = true score + error, where the true score is the “score that would be obtained over repeated testings” (Nunnally, 1994, p211)

Random versus Systematic Error: Observed item score = true score + error, where error = random error + systematic error

Random versus Systematic Error: Observed item score = true score + error, where error = random error (relevant to reliability) + systematic error (relevant to validity)

Components of Variability in Item Scores of a Group of Individuals: Observed score variance = true score variance + error variance. The observed score variance is the total variance (sum of all observed item scores).

Components of Variability in Item Scores of a Group of Individuals: Observed score variance = true score variance + (random) error variance. The observed score variance is the total variance (sum of all observed item scores).

Combining Items into Multi-Item Scales • When items are combined into a summated scale, random error to some extent “cancels out” • Error variance is reduced as the number of items increases • Reducing random error increases the proportion of observed variance that is “true score” variance
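The “cancelling out” of random error can be illustrated with a minimal sketch. It assumes parallel items: every item has the same true-score variance and the same independent random-error variance, so averaging k items leaves true-score variance unchanged and divides error variance by k. The function name and the variance values are illustrative, not from the slides.

```python
def scale_reliability(true_var: float, error_var: float, n_items: int) -> float:
    """Share of variance in the mean of n_items items that is true-score variance.

    Assumption (not from the slides): items are parallel, with equal
    true-score variance and equal, independent random-error variance.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    # Averaging independent errors divides their variance by the item count.
    error_var_of_mean = error_var / n_items
    return true_var / (true_var + error_var_of_mean)


# Hypothetical item: half of its variance is random error.
for k in (1, 2, 5, 10):
    print(k, round(scale_reliability(1.0, 1.0, k), 3))
```

Under these assumptions the true-score share rises steadily as items are added, which is the point of the last two bullets.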

Sources of Error • Subjects • Observers or interviewers • Measure or instrument

Example: Measuring Weight of Children • Observed score is a linear combination of many sources of variation for an individual

Measuring Weight in Pounds (Without Shoes) of One Child: Observed weight = true weight (80 lbs) + amount of water in past 30 min + weight of clothes + scale is miscalibrated + person weighing children is not very precise

Measuring Weight in Pounds (Without Shoes) of One Child: Observed weight (82.1 lbs) = true weight (80 lbs) + amount of water in past 30 min (+.25 lb) + weight of clothes (+.70 lb) + scale is miscalibrated (+.1 lb) + person weighing children is not very precise (+1 lb). 82.1 ≈ 80 + .25 + .70 + .1 + 1 (the terms sum to 82.05)

Sources of Error in Measuring Weight of Children • Weight of clothes: subject source of random error • Scale is miscalibrated: instrument source of systematic error • Person weighing child is not precise: observer source of random error
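The weight example can be written out as a small sketch that adds the error terms to the true weight and totals them by type. The amounts and the three classifications come from the slides; the water term is not classified on the slide, so it is labeled "unclassified" here. Variable names are illustrative.

```python
TRUE_WEIGHT_LB = 80.0

# (amount in lb, source, type of error); types follow the slide where given.
ERROR_TERMS = {
    "water drunk in past 30 min": (0.25, "subject", "unclassified"),
    "weight of clothes": (0.70, "subject", "random"),
    "scale is miscalibrated": (0.10, "instrument", "systematic"),
    "person weighing is not precise": (1.00, "observer", "random"),
}

# Observed score = true score + sum of all error terms.
observed_weight = TRUE_WEIGHT_LB + sum(amount for amount, _, _ in ERROR_TERMS.values())

# Total error contributed by each type.
error_by_type = {}
for amount, _source, kind in ERROR_TERMS.values():
    error_by_type[kind] = error_by_type.get(kind, 0.0) + amount

print(round(observed_weight, 2))
print({kind: round(total, 2) for kind, total in error_by_type.items()})
```

The terms add up to 82.05 lb, which the slide shows rounded to 82.1 lb.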

Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man: Observed depression score = “true” depression (16) + hard to choose number on the 1-6 response choice scale + measure misses 2 culturally-bound symptoms + unwillingness to tell interviewer + poor memory of feelings

Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man: Observed depression score (12) = “true” depression (16) + hard to choose number on the 1-6 response choice scale (+1) + measure misses 2 culturally-bound symptoms (-2) + unwilling to tell interviewer (-2) + poor memory of feelings (-1). 12 = 16 + 1 - 2 - 2 - 1

Sources of Error in Measuring Depression • Hard to choose one number on 1-6 response scale: subject source of random error • Unwilling to tell interviewer, poor memory of feelings: subject sources of systematic error (underreport true depression) • Measure misses culturally-bound symptoms: instrument source of systematic error (underestimates true depression)
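One way to see why the two kinds of error matter differently is a small simulation: averaged over many repeated administrations, random error washes out but systematic error does not. The true score and the systematic terms are taken from the depression example; the normal distribution for the random error, its standard deviation of 1.0, and the seed are assumptions made only for illustration.

```python
import random

TRUE_SCORE = 16.0
# Systematic terms from the example: missed symptoms, unwillingness, poor memory.
SYSTEMATIC_ERRORS = [-2.0, -2.0, -1.0]


def mean_observed_score(n_repeats: int, random_sd: float = 1.0, seed: int = 0) -> float:
    """Average observed score over n_repeats hypothetical administrations."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bias = sum(SYSTEMATIC_ERRORS)
    total = 0.0
    for _ in range(n_repeats):
        # Each administration: true score + constant bias + fresh random error.
        total += TRUE_SCORE + bias + rng.gauss(0.0, random_sd)
    return total / n_repeats


mean_observed = mean_observed_score(10_000)
print(round(mean_observed, 2))
```

The average settles near 11 (the true score of 16 minus the 5 points of systematic underreporting), not near 16: repetition helps with random error only.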

Four Types of Memory Errors: From Cognitive Psychology • Encoding: information inadequately stored in memory • Storage: memory eroded over time • Retrieval: some events/feelings harder to recall • Reconstruction: errors filling in missing pieces. R Tourangeau, Chap 3, in AA Stone et al. (eds) The Science of Self-Report, London: Lawrence Erlbaum, 2000

Memory and Time • Autobiographical memory: memory of things in time and space • Events are not encoded with their calendar dates • Thus time is a poor retrieval method • Numerous errors remembering “when” and “how often” something occurred within a particular time frame. N Bradburn, Chap 4, The Science of Self-Report

Memory and Emotion • Tend to remember: • positive more than negative experiences • emotionally intense more than neutral experiences • non-threatening events more than threatening, sensitive events. Kihlstrom et al, Chap 6, The Science of Self-Report

Overview • Concepts of error • Basic psychometric characteristics • Variability • Reliability • Interpretability

Variability • Good variability • All (or nearly all) scale levels are represented • Distribution approximates bell-shaped normal • Variability is a function of the sample • Need to understand variability of a measure in sample similar to one you are studying • Review criteria • Adequate variability on the latent variable that is relevant to your study

Indicators of Variability • Range of scores • Mean, median, mode • Standard deviation (or standard error) • Interquartile range • Skewness statistic • % at floor (lowest possible score) • % at ceiling (highest possible score)

Range of Scores: Possible and Observed • Especially important for multi-item measures • Example: • CES-D possible range is 0-30 • Wong et al. study of mothers of young children: observed range was 0-23 • missing entire high end of the distribution (none had high levels of depression)

Mean, Median, Mode • Mean - average • Median - midpoint • Mode - most frequent score • In normally distributed measures, these are all the same • In non-normal distributions, they will vary
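As a quick illustration, the three statistics can be computed with Python's standard `statistics` module. The data are the 12 scores used in the quartile example later in this class; using them here is an editorial choice.

```python
import statistics

# The 12 scores from the quartile example in this class.
scores = [2, 3, 8, 1, 7, 4, 4, 3, 2, 7, 5, 3]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # midpoint of the ordered scores
mode_score = statistics.mode(scores)      # most frequent score

print(round(mean_score, 2), median_score, mode_score)
```

The three values differ, with the mean highest, which is the pattern expected when a few high scores pull the distribution to the right.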

Mean and Standard Deviation • Most information on variability is from mean and standard deviation • Can envision how measure is distributed on the possible range • Mean ± 1 SD contains about 68% of the scores (if scores are normally distributed)
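For a normal distribution, the share of scores within a given number of standard deviations of the mean can be checked with the standard library's `statistics.NormalDist`. This is a sketch that holds only under the normality assumption; the helper name is illustrative.

```python
from statistics import NormalDist


def share_within(k_sd: float) -> float:
    """Proportion of a normal distribution within k_sd SDs of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k_sd) - standard_normal.cdf(-k_sd)


print(round(share_within(1), 3))  # within 1 SD of the mean
print(round(share_within(2), 3))  # within 2 SDs of the mean
```

Within one SD the share is about 68%, and within two SDs about 95%.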

Interquartile Range (IR) • Difference between the 3rd and 1st quartiles IR = Quartile 3 - Quartile 1 • This range contains the middle 50% of the distribution • 25% of the sample is above and 25% is below this range

Quartiles Divide distribution into 4 parts with 25% of the sample in each part (quartiles) • Quartile 1 - the scale score at the boundary of the lowest 25% of the distribution • Quartile 2 - the score that divides the distribution in half (same as the median) • Quartile 3 - the score at the boundary of the highest 25% (25% of the sample scores above this point)

Set of Scores on 12 People
Person: 1 2 3 4 5 6 7 8 9 10 11 12
Score: 2 3 8 1 7 4 4 3 2 7 5 3
Re-arranged in numeric order of score:
Person: 4 9 1 8 2 12 7 6 11 10 5 3
Score: 1 2 2 3 3 3 4 4 5 7 7 8

Example of Quartiles: Set of Scores on 12 People. Ordered scores: 1 2 2 | 3 3 3 | 4 4 5 | 7 7 8. Q1 = 2.5, Q2 = 3.5, Q3 = 6 • Q1 = boundary of the lowest 25% (lowest 3 people) • Q2 = median (50% below, 50% above) • Q3 = boundary of the highest 25% (highest 3 people)

Example of Quartiles: Set of Scores on 12 People. Ordered scores: 1 2 2 | 3 3 3 | 4 4 5 | 7 7 8. Q1 = 2.5, Q2 = 3.5, Q3 = 6 • Interquartile range = quartile 3 - quartile 1 = 6 - 2.5 = 3.5
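The quartile values in this example correspond to taking the median of each half of the ordered scores. The sketch below uses that convention; other software may use different interpolation rules and give slightly different quartiles. The function name is illustrative.

```python
import statistics


def quartiles_by_halves(scores):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # leaves out the middle value when n is odd
    return (
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
    )


scores = [2, 3, 8, 1, 7, 4, 4, 3, 2, 7, 5, 3]
q1, q2, q3 = quartiles_by_halves(scores)
interquartile_range = q3 - q1
print(q1, q2, q3, interquartile_range)  # 2.5 3.5 6.0 3.5
```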

Skewness • Positive skew - scores bunched at low end, long tail to the right • Negative skew - opposite pattern • Skewness coefficient ranges from - infinity to + infinity • the closer to zero, the more symmetric (as in a normal distribution) • Coefficients beyond ±2.0 are cause for concern
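A skewness coefficient can be sketched as follows. The slides do not say which formula is meant; this uses the common moment-based coefficient (third central moment divided by the 1.5 power of the second central moment), which is an assumption. Statistical packages often apply a small-sample correction, so their values may differ slightly.

```python
def skewness(scores):
    """Moment-based skewness: m3 / m2**1.5, with no small-sample correction."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all scores are equal")
    return m3 / m2 ** 1.5


symmetric_scores = [1, 2, 3, 4, 5]
right_tailed_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 7, 8]

print(skewness(symmetric_scores))     # zero for a symmetric set
print(skewness(right_tailed_scores))  # positive: long tail to the right
```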

Ceiling and Floor Effects: Similar to Skewness Information • Ceiling effects: substantial number of people get highest possible score • Floor effects: opposite • More helpful for single-item measures or coarse scales with only a few levels
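Percent at floor and percent at ceiling are simple counts, sketched below. The responses are hypothetical, for a single item scored 1-5; note that floor and ceiling are defined by the lowest and highest possible scores, not the lowest and highest observed.

```python
def floor_ceiling_percent(scores, lowest_possible, highest_possible):
    """Return (% at floor, % at ceiling) for a list of scores."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == lowest_possible)
    at_ceiling = sum(1 for s in scores if s == highest_possible)
    return 100 * at_floor / n, 100 * at_ceiling / n


# Hypothetical responses to one item scored 1 (worst) to 5 (best).
responses = [5, 5, 5, 4, 5, 3, 5, 1, 5, 4]
pct_floor, pct_ceiling = floor_ceiling_percent(responses, 1, 5)
print(pct_floor, pct_ceiling)  # 10.0 60.0
```

With 60% already at the best possible score, this hypothetical item could not register improvement for most respondents.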

Example item: “… to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?” 49% answered “not limited at all” (can’t improve)

SF-36 Variability Information in Patients with Chronic Conditions (N=3,445) All on 0-100 scales, higher is better McHorney C et al. Med Care. 1994;32:40-66.

Evidence of Floor and Ceiling Effects in One SF-36 Scale (table not reproduced; highlighted values: 24 and 37). All on 0-100 scales, higher is better. McHorney C et al. Med Care. 1994;32:40-66.

Reasons for Poor Variability • Low variability in construct being measured in that “sample” (true low variation) • Items not adequately tapping construct • If only one item, especially hard • Items not detecting variation at one end • What to do: • If developing measures, add items • If selecting measures – find another one

Advantages of Multi-item Scales Revisited • Using multi-item scales minimizes likelihood of ceiling/floor effects • Even if items are skewed, multi-item scale “normalizes” the skew


Percent with “Best” Score on 5 Items in the MOS MHI-5. 6-level response scale: all of the time to none of the time (chart not reproduced; highlighted value: 63). Stewart A. et al., Measuring Functioning and Well-Being, 1992

Percent with “Best” Score on 5 Items in the MOS MHI-5. 6-level response scale: all of the time to none of the time. 5-item scale: only 5% had highest score. Stewart A. et al., Measuring Functioning and Well-Being, 1992

Overview • Concepts of error • Basic psychometric characteristics • Variability • Reliability • Interpretability

Reliability • Extent to which an observed score is free of random error • Produces the same score each time it is administered (all else being equal) • Population-specific - reliability affected by: • sample size • variability in scores (dispersion) • a person’s level on the scale

Back to Components of Variability in Item Scores of a Group of Individuals: Observed score variance = true score variance + error variance. The observed score variance is the total variance (variation is the sum of all observed item scores).

Reliability Depends on True Score Variance • Reliability is a group-level statistic • Reliability = 1 – (error variance / total variance) • OR, equivalently, reliability = true score variance / total variance (the proportion of total variance due to true score)

Reliability Depends on True Score Variance • Reliability of .70 means 30% of variance in observed scores is due to error • With total variance scaled to 1.0: reliability = total variance – error variance, so .70 = 1.0 – .30
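The two ways of writing reliability give the same number, as this minimal sketch shows. It assumes the variance components are known, which in practice they are not (they are estimated); the function name is illustrative.

```python
def reliability(true_var: float, error_var: float) -> float:
    """Proportion of total (observed) variance that is true-score variance."""
    total_var = true_var + error_var
    return true_var / total_var


# Variance components expressed as shares of a total variance of 1.0.
true_var, error_var = 0.70, 0.30
as_true_share = reliability(true_var, error_var)
as_one_minus_error = 1 - error_var / (true_var + error_var)

print(round(as_true_share, 2), round(as_one_minus_error, 2))  # 0.7 0.7
```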

Reliability Coefficient • Typically ranges from .00 - 1.00 • Higher scores indicate better reliability

Importance of Reliability • Necessary for validity (but not sufficient) • Low reliability (or high measurement error) attenuates correlations with other variables • May conclude that two variables are not related when they are • Greater reliability = greater power • The more reliable your scales, the smaller sample size you need to detect an association
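The attenuation described above can be quantified with the classical attenuation formula from test theory, which is not stated on the slide and is brought in here as an illustration: the expected observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. The numbers are hypothetical.

```python
import math


def attenuated_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Expected observed correlation given true-score correlation and reliabilities.

    Classical attenuation formula: r_observed = r_true * sqrt(rel_x * rel_y).
    """
    return true_r * math.sqrt(rel_x * rel_y)


# Hypothetical: true-score correlation of .50, both measures with reliability .70.
observed_r = attenuated_correlation(0.50, 0.70, 0.70)
print(round(observed_r, 2))  # 0.35
```

A true association of .50 would be expected to appear as about .35, which is why less reliable scales need larger samples to detect the same association.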

Reliable Scale? • NO! • There is no such thing as a “reliable” scale • We accumulate “evidence” of reliability in a variety of populations in which it has been tested

How Do You Know if a Scale or Measure Has Adequate Reliability? • Adequacy of reliability judged according to standard criteria • Criteria depend on type of coefficient
