Measurement and Psychometrics

Measurement and Psychometrics PM 515 Jenifer Unger Ping Sun

Why is measurement important? • How do we assign a value to a latent construct? • How do we assign units of measurement to something that’s psychosocial? • How can we make sure we’re measuring what we think we’re measuring? • What distinguishes a good measure from a bad measure?

Important things to consider • Scaling • Assigning numbers or names to responses • Response bias • How do people’s responses correspond to their “true” states/traits? • Validity • How can you build a valid scale from an infinite pool of possible items? • Reliability • How precisely the scale measures the construct • The repeatability of scores across samples, times, raters, sub-forms, etc.

Scaling • How numerical values are assigned to psychological attributes • Uni-dimensional vs. multidimensional

Properties of scaling • Property of identity • If person A and person B both have depression scores of 20, they should be equally depressed. • Property of order • If person A has a depression score of 20 and person B has a depression score of 21, person B is more depressed than person A. • Property of quantity • There is some basic unit of depression, represented by 1. (Often a questionable assumption, or it’s arbitrary, like IQ points.)

Zero • If someone gets a 0 on a depression scale, do they have 0 depressive feelings? • If someone gets a 0 on a spelling test, do they have 0 ability to spell? • Zeros are usually arbitrary (like 0 degrees F) rather than absolute (like Absolute 0—the coldest possible temperature).

Types of scales • Nominal • Describes groups • Can label the groups with numbers, but the numbers don’t indicate an amount • Example: • 1=African American • 2=Asian • 3=Hispanic • 4=White

How to handle a nominal variable • Always treat it as a class variable, not continuous. • Or dummy code • If ethnic=1 then do; af=1; as=0; hi=0; wh=0; end; • If ethnic=2 then do; af=0; as=1; hi=0; wh=0; end; • If ethnic=3 then do; af=0; as=0; hi=1; wh=0; end; • (who is the reference group here?)
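The same dummy-coding idea can be sketched in Python. This is a minimal illustration, assuming the codes from the nominal-scale slide (1=African American, 2=Asian, 3=Hispanic, 4=White); the function name `dummy_code` and the indicator names `af`, `as_`, and `hi` are made up for this example.

```python
# Codes follow the nominal-scale example: 1=African American, 2=Asian,
# 3=Hispanic, 4=White. Indicator names are illustrative only.
INDICATORS = {1: "af", 2: "as_", 3: "hi"}  # code 4 gets no indicator


def dummy_code(ethnic):
    """Return 0/1 indicator variables for one respondent's ethnicity code."""
    if ethnic not in (1, 2, 3, 4):
        raise ValueError(f"unknown ethnicity code: {ethnic}")
    # Every indicator starts at 0; at most one is switched on.
    row = {name: 0 for name in INDICATORS.values()}
    if ethnic in INDICATORS:
        row[INDICATORS[ethnic]] = 1
    return row


print(dummy_code(2))  # Asian respondent
print(dummy_code(4))  # all indicators are 0
```

The group whose indicators are all 0 is the reference group: each coefficient for an indicator is then interpreted as a difference from that group.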

Ordinal • Describes a ranking (highest-lowest or lowest-highest) • Example: • Least depressed person=1 • Most depressed person=100 • But this doesn’t mean that person 100 is 100 times more depressed than person 1.

Interval • There is a unit of measurement, with a constant distance between the units • Zero point is arbitrary • Example: • Temperature scales – the degree is a constant

Ratio • Like an interval scale, but there is an absolute zero. • Example: • Distance (0 really means no distance)

Be careful about how you describe scales • A score of 0 usually doesn’t mean a total lack of the characteristic. • Be careful about saying one person is twice as depressed, intelligent, etc., as another person.

Make sure the response options are appropriate for the sample! • How many sexual partners have you had? • For adolescents: • A. 0 • B. 1 • C. 2 • D. 3 or more • For sex workers: • A. Less than 20 • B. 20-50 • C. 51-100 • D. More than 100

Response Bias

Acquiescence bias • Automatically saying “yes” or “no”, especially to complicated questions. • Solutions: • Phrase questions in both directions so people can’t just agree with all of them. • Don’t make the questions too complicated.

Extreme and moderate responding • Some people like to choose extreme answers. • Some people like to avoid extreme answers. • Some people like to choose the middle answer.

Social Desirability • Tendency to respond in a way that is socially appealing or acceptable • More likely if it’s clear what the desirable response would be (e.g., honesty vs. dishonesty) • Less likely if responses are anonymous or confidential • Some people are more likely to do this than others • (Can measure this with a social desirability scale)

Malingering • Faking bad • Opposite of social desirability • Happens when the person could benefit from being diagnosed with a disorder • Criminal competency hearings, claims for workers’ compensation, personal injury hearings, etc. • Also happens when kids try to be funny

Careless or random responding • Making patterns on the answer sheet • Giving the same answer to every question • Happens when people are unmotivated to give the correct answer • Survey is too long • Questions are too hard for them to read or understand

How to limit response bias • Manage the testing context • Make it anonymous or confidential • Minimize fatigue, stress, distraction • Tell respondents that it’s possible to evaluate the validity of their responses

Manage the content • Write simple items • Write items that are neutral in terms of social desirability • “I am sometimes less friendly than other people” instead of “I am an unfriendly person” • Forced-choice items (friendly vs. assertive) • Include items that are worded in both directions • Include a social desirability measure or lie scale
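Items worded in both directions have to be reverse-scored before they are summed. The sketch below assumes a 1-to-5 response scale and uses hypothetical function names; it is one simple way to do it, not a prescribed procedure.

```python
def reverse_score(response, low=1, high=5):
    """Flip a response on a low..high scale (e.g., 1<->5, 2<->4)."""
    if not low <= response <= high:
        raise ValueError("response outside the scale range")
    return low + high - response


def total_score(responses, reversed_items, low=1, high=5):
    """Sum item responses after reverse-scoring the negatively worded items.

    reversed_items holds the 0-based positions of the reversed items.
    """
    return sum(
        reverse_score(r, low, high) if i in reversed_items else r
        for i, r in enumerate(responses)
    )


# A respondent who picks 5 for every item no longer gets the maximum score.
print(total_score([5, 5, 5, 5], reversed_items={1, 3}))
```

With half of the items reversed, someone who agrees with everything lands in the middle of the score range, which makes acquiescent responding easier to spot.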

Newer approaches to measurement • Generalizability Theory • Item Response Theory

Creating new measures • What is the best way to measure a latent construct? • (Latent construct = something that can’t be measured directly) • [Diagram: one latent construct linked to observed variables 1, 2, and 3, each with its own error term]

Classical Test Theory • Total variance in observed scores = • True score variance + error variance • What is in that amorphous error variance? • Peculiarities of specific items, raters, methods, occasion, etc. • Anything that is not true score variance counts as error
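The decomposition above can be written compactly. The symbols are not from the slides and are assumed here: X is the observed score, T the true score, and E the error, with T and E taken to be uncorrelated.

```latex
X = T + E, \qquad
\sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad
\text{reliability} = \frac{\sigma^2_T}{\sigma^2_X}
                   = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

Under these assumptions, reliability is the share of observed-score variance that is true-score variance.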

Generalizability Theory • Instead of treating the non-true-score variance as a blob of error, try to decompose it. • Facets of measurement that can vary • Different items • Different raters • Different measurement modalities • Different measurement occasions • Etc.

Example: Measuring a professor’s helpfulness • Facet 1: • Self-ratings vs. other-ratings vs. direct observation • Facet 2: • Items (explains concepts clearly, answers questions, approachable, etc.) • Facet 3: • Occasion (first day of class, night before the exam, office hours, etc.)

The main question asked by Generalizability Theory • We can only have a limited number of items. • How well does our limited set of items measure the construct, compared with an infinite set of all possible items? • Example: • To measure helpfulness, we only asked whether the professor explains concepts clearly, answers questions, and is approachable. • Would adding caring, cheerful, or intelligent make the measure better?

In other words…. • Is the variability (across professors) in our short measure consistent with the variability that we would obtain with an infinitely long measure? • How well do our chosen measures represent the infinite universe of possible measures of helpfulness?

Example: 11 targets X 3 items = 33 datapoints

Could use these 33 datapoints in an ANOVA • Total variance= • Variance due to the target + • Variance due to the question + • Error • Could add a facet by adding additional raters: • Student 1, student 2, student 3 • Each student rates each professor on each item • Goal: calculate what proportion of the total variance is due to the target, not the item or rater or error

Signal and noise • Signal • Real differences in helpfulness across professors • Noise • Random measurement error and variability that’s due to facets of measurement (items, raters, etc.) • Generalizability coefficient = • Signal / (Signal + Noise) • Interpreted like Cronbach’s alpha (.80 is good)
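For the simplest case (targets crossed with items, one rating per cell), the signal and noise can be estimated from the ANOVA mean squares. The sketch below is one standard way to do this and is not taken from the slides; the function names and the example ratings are invented for illustration.

```python
def variance_components(scores):
    """Estimate variance components for a targets x items table.

    scores[p][i] is the rating of target p on item i (one rating per cell).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    target_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Mean squares for targets, items, and the residual (interaction + error).
    ms_target = n_i * sum((m - grand) ** 2 for m in target_means) / (n_p - 1)
    ms_item = n_p * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ss_resid = sum(
        (scores[p][i] - target_means[p] - item_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations; negative estimates are set to 0.
    return {
        "target": max((ms_target - ms_resid) / n_i, 0.0),
        "item": max((ms_item - ms_resid) / n_p, 0.0),
        "residual": ms_resid,
    }


def g_coefficient(components, n_items):
    """Signal / (signal + noise) for a score averaged over n_items items."""
    signal = components["target"]
    noise = components["residual"] / n_items
    return signal / (signal + noise)


# Invented ratings: 4 professors (rows) x 3 helpfulness items (columns).
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2]]
vc = variance_components(ratings)
print(vc)
print(g_coefficient(vc, n_items=3))
```

Here the item main effect is left out of the noise term because it shifts every professor's score equally and so does not change their relative standing. For this one-facet design the coefficient should match Cronbach's alpha computed on the same table.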

D-study (Decision study) • Try various combinations of numbers of items and raters and see how the generalizability coefficient changes. • Goal: Most efficient design that still gives a good generalizability coefficient

Deciding how many items to use

Two-facet design • Two facets: item and observer • Effects in your ANOVA: • Target • Item • Observer • Target X Item • Target X Observer • Item X Observer • Residual

D-study • Calculate this ANOVA for different combinations of numbers of items and numbers of observers. • 1 observer with 5 items vs. 5 observers with 1 item, etc. • Find the easiest/cheapest combination that gives a decent generalizability coefficient.
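A D-study of this kind can be sketched as a small search over designs. The variance components below are invented numbers standing in for ANOVA estimates, the function names are hypothetical, and cost is assumed to be simply the number of ratings per target (items times observers). The coefficient shown is the one for relative decisions, so only the effects that involve the target enter the noise term.

```python
def g_coefficient_two_facet(vc, n_items, n_observers):
    """Relative generalizability coefficient for a target x item x observer design."""
    noise = (
        vc["target_x_item"] / n_items
        + vc["target_x_observer"] / n_observers
        + vc["residual"] / (n_items * n_observers)
    )
    return vc["target"] / (vc["target"] + noise)


def d_study(vc, max_items, max_observers, threshold=0.80):
    """Return (cost, n_items, n_observers, coefficient) for the cheapest
    design that reaches the threshold, or None if no design does."""
    candidates = []
    for n_items in range(1, max_items + 1):
        for n_observers in range(1, max_observers + 1):
            coef = g_coefficient_two_facet(vc, n_items, n_observers)
            if coef >= threshold:
                cost = n_items * n_observers  # assumed cost: ratings per target
                candidates.append((cost, n_items, n_observers, coef))
    return min(candidates) if candidates else None


# Hypothetical variance component estimates (not from real data).
vc = {
    "target": 0.50,
    "target_x_item": 0.20,
    "target_x_observer": 0.30,
    "residual": 0.40,
}
print(g_coefficient_two_facet(vc, n_items=5, n_observers=1))
print(g_coefficient_two_facet(vc, n_items=1, n_observers=5))
print(d_study(vc, max_items=10, max_observers=10))
```

With these made-up components, observers contribute more noise than items, so adding observers helps more than adding items. A real study would replace the cost rule with actual time or money per item and per observer.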

Item Response Theory (IRT) • Another newer alternative to classical test theory

Same basic underlying concept • Individual’s response to an item is influenced by qualities of the individual and qualities of the item. • IRT focuses on specific qualities of the individual and the item. • Often used to develop measures of intelligence, ability, etc.

Example: SAT/GRE • Designed to assess a wide range of abilities • Some questions discriminate the 800 students from the 780 students. • Some questions discriminate the 250 students from the 300 students.

Easy questions and hard questions • Based on previous pilot testing with questions…. • Easy questions are those that most people answer correctly (e.g., 2+2) • These questions identify the lowest-performing people. • Hard questions are those that only a few people answer correctly (e.g., a really tricky geometry problem) • These questions identify the highest-performing people • Need a mix of question difficulties to make finer discriminations.

Similar principle for psychosocial scales – item “difficulty” • Depression items that are “easy” to say yes to • I felt sad. • I had trouble keeping my mind on what I was doing. • (These items discriminate between no depression and mild depression.) • Depression items that are “hard” to say yes to • I thought my life had been a failure. • I had crying spells • (These items discriminate between moderate depression and severe depression.)

Relationship between item difficulty and individuals’ traits • In IRT, item difficulty and the person’s traits are related. • A person needs to be really depressed to say yes to certain items. • A person only needs to be a little depressed to say yes to other items. • A person needs to be really smart to get certain math questions right. • A person only needs to be a little smart to get other math questions right.

The item’s difficulty is expressed in terms of the trait level • The item’s difficulty is the amount of the trait necessary to have a 50% chance of answering the item correctly (or saying yes to it). • Usually expressed in standard deviations. • If an item has a difficulty of 1.5, a person with math ability (or depression) 1.5 standard deviations above the mean has a 50% chance of getting it correct (or saying yes to it). • If an item has a difficulty of 0, a person exactly at the mean of the trait has a 50% chance of getting it correct.
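The 50% rule can be illustrated with a logistic item response function, which is one common choice in IRT. The slides do not name a specific model, so the logistic form, the function name, and the optional `discrimination` parameter are assumptions of this sketch.

```python
import math


def prob_endorse(theta, difficulty, discrimination=1.0):
    """Probability of a correct/yes response under a logistic IRT model.

    theta and difficulty are on the same scale (standard deviations from
    the mean trait level); discrimination controls how steep the curve is.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A person whose trait level equals the item difficulty has a 50% chance.
print(prob_endorse(theta=1.5, difficulty=1.5))
# An average person (theta = 0) facing the same hard item has a lower chance.
print(prob_endorse(theta=0.0, difficulty=1.5))
# The same average person facing an easy item has a higher chance.
print(prob_endorse(theta=0.0, difficulty=-1.5))
```

Plotting `prob_endorse` against `theta` for a fixed item gives that item's characteristic curve, with its steepest point at the item's difficulty.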

Item characteristic curve (ICC) • [Figure: item characteristic curve] • Where it’s steepest, this item does the best job of differentiating between individuals. • Most of the depressed people say yes • Most of the non-depressed people say no

Comparing items • [Figure: item characteristic curves for an easy item and a difficult item]

Why is this important? • Want to develop the most valid (but also efficient) measures. • Want to develop measures that differentiate among individuals across a wide range of the construct of interest.

Other Examples of Uni-dimensional Scale Development Methods • Thurstone or Equal-Appearing Interval Scaling • Likert or “Summative” scaling • Guttman or “Cumulative” scaling

Thurstone or Equal-Appearing Interval Scaling • Method of equal-appearing intervals • Start with a large set of potential statements. • Ask judges to rate each statement (e.g., 1-11) on how favorably it reflects the construct. • Calculate Q1, the median, and Q3 of the ratings for each statement. • Select statements whose medians are evenly spaced across the rating range, preferring those with the smallest interquartile range. • The final scale consists of the selected statements with “Agree”/“Disagree” responses. • Method of successive intervals • Method of paired comparisons • http://www.socialresearchmethods.net/kb/scalthur.php
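The selection step of the equal-appearing intervals method can be sketched as follows. This is a simplified illustration: statements are grouped by their rounded median rating, and the statement with the smallest interquartile range is kept in each group. The function name, the grouping rule, and the example ratings are assumptions made for this sketch.

```python
import statistics


def select_statements(judge_ratings):
    """Pick one statement per scale value, preferring the smallest IQR.

    judge_ratings maps each statement to the list of judges' ratings (1-11).
    Returns a dict: scale value (rounded median) -> chosen statement.
    """
    best = {}  # scale value -> (iqr, statement)
    for statement, ratings in judge_ratings.items():
        q1, _, q3 = statistics.quantiles(ratings, n=4)
        scale_value = round(statistics.median(ratings))
        candidate = (q3 - q1, statement)
        # Keep the statement the judges agreed on most (smallest spread).
        if scale_value not in best or candidate < best[scale_value]:
            best[scale_value] = candidate
    return {value: stmt for value, (_, stmt) in sorted(best.items())}


# Invented ratings from five judges for three candidate statements.
ratings = {
    "A": [3, 3, 3, 3, 3],      # median 3, judges agree closely
    "B": [1, 2, 3, 4, 5],      # median 3, judges disagree
    "C": [9, 10, 10, 10, 11],  # median 10
}
print(select_statements(ratings))
```

In practice the chosen statements should also cover the rating range at roughly equal steps, so gaps in the returned scale values signal that more candidate statements are needed.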

Likert or “Summative” scaling • Generating the items • Should be items that can be rated on a 1-to-5 or 1-to-7 Disagree/Agree response scale. • Rating the items (e.g., on a 1-5 scale) • Selecting the items • Throw out any items that have a low correlation with the total score. • Split subjects into top and bottom sub-groups by average score, then compare the rating on each item between the two sub-groups (t-test). Retain items with larger t values because they are better discriminators.
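The item-total correlation step can be sketched in plain Python. The cutoff of 0.30 is an arbitrary illustrative value, not one given in the slides, and the function names and responses are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant item carries no information
    return cov / (sd_x * sd_y)


def item_total_correlations(responses):
    """Correlate each item with the total score (rows are respondents)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson([row[i] for row in responses], totals) for i in range(n_items)
    ]


def keep_items(responses, min_r=0.30):
    """Return the positions of items whose item-total correlation is at least min_r."""
    correlations = item_total_correlations(responses)
    return [i for i, r in enumerate(correlations) if r >= min_r]


# Invented 1-5 ratings: items 0 and 1 move together, item 2 runs the other way.
responses = [[1, 1, 5], [2, 2, 4], [3, 3, 3], [4, 4, 2], [5, 5, 1]]
print(item_total_correlations(responses))
print(keep_items(responses))
```

An item with a strongly negative correlation, like item 2 here, may simply be worded in the opposite direction, so it is worth checking whether it needs reverse-scoring before discarding it.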

Guttman or “Cumulative” scaling • We would like a set of items or statements such that a respondent who agrees with any specific question in the list will also agree with all previous questions. • Put more formally, we would like to be able to predict item responses perfectly knowing only the total score for the respondent.

Guttman or “Cumulative” scaling • Develop the items (80-100 items!) • Ask judges to rate (Yes/No) each statement (not their own feelings, but the relevance of the statement to the focus of the measure). • Sort the matrix (respondent × item) by frequency of agreement, from most agreeable to least agreeable. • Use scalogram analysis to select the items. • The final scale can then be used with Yes/No or Agree/Disagree answers.
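One summary used in scalogram analysis is a coefficient of reproducibility: how often responses match the pattern predicted from each respondent's total score. The sketch below uses one simple error-counting rule (other rules exist), and the function name and data are invented for illustration.

```python
def reproducibility(responses):
    """Coefficient of reproducibility for a 0/1 respondent x item matrix.

    Items are ordered from most to least endorsed. A respondent with total
    score k is expected to say yes to the k most-endorsed items only.
    """
    n_resp, n_items = len(responses), len(responses[0])
    # Order item positions from most to least frequently endorsed.
    order = sorted(
        range(n_items), key=lambda i: -sum(row[i] for row in responses)
    )
    errors = 0
    for row in responses:
        total = sum(row)
        for rank, item in enumerate(order):
            expected = 1 if rank < total else 0
            errors += int(row[item] != expected)
    return 1 - errors / (n_resp * n_items)


# A perfectly cumulative pattern: every response is predictable from the total.
perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(reproducibility(perfect))

# The last respondent endorses only the least popular item: two errors.
imperfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
print(reproducibility(imperfect))
```

Values close to 1 indicate that the items behave cumulatively; a cutoff around 0.90 is often cited as a rule of thumb, though that threshold is a convention rather than something stated in these slides.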
