## Measurement and Psychometrics


PM 515. Jenifer Unger, Ping Sun

**Why is measurement important?**

- How do we assign a value to a latent construct?
- How do we assign units of measurement to something that's psychosocial?
- How can we make sure we're measuring what we think we're measuring?
- What distinguishes a good measure from a bad measure?

**Important things to consider**

- Scaling
  - Assigning numbers or names to responses
- Response bias
  - How do people's responses correspond to their "true" states/traits?
- Validity
  - How can you build a valid scale from an infinite pool of possible items?
- Reliability
  - How precisely the scale measures the construct
  - The repeatability across two samples, times, raters, sub-forms, etc.

**Scaling**

- How numerical values are assigned to psychological attributes
- Unidimensional vs. multidimensional

**Properties of scaling**

- Property of identity
  - If person A and person B both have depression scores of 20, they should be equally depressed.
- Property of order
  - If person A has a depression score of 20 and person B has a depression score of 21, person B is more depressed than person A.
- Property of quantity
  - There is some basic unit of depression, represented by 1. (Often a questionable assumption, or the unit is arbitrary, like IQ points.)

**Zero**

- If someone gets a 0 on a depression scale, do they have 0 depressive feelings?
- If someone gets a 0 on a spelling test, do they have 0 ability to spell?
- Zeros are usually arbitrary (like 0 degrees F) rather than absolute (like absolute zero, the coldest possible temperature).

**Types of scales**

- Nominal
  - Describes groups
  - Can label the groups with numbers, but the numbers don't indicate an amount
  - Example: 1 = African American, 2 = Asian, 3 = Hispanic, 4 = White

**How to handle a nominal variable**

- Always treat it as a class variable, not continuous.
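As a sketch (not from the slides), the same recoding can be done in plain Python, using the nominal codes from the example above (1 = African American, 2 = Asian, 3 = Hispanic, 4 = White):

```python
# Dummy-code a nominal ethnicity variable into 0/1 indicators.
# Variable names mirror the slide's SAS-style code; data are illustrative.

def dummy_code(ethnic):
    """Return 0/1 indicators; White (4) gets all zeros, making it the reference group."""
    return {
        "af": 1 if ethnic == 1 else 0,  # African American indicator
        "as": 1 if ethnic == 2 else 0,  # Asian indicator
        "hi": 1 if ethnic == 3 else 0,  # Hispanic indicator
    }

rows = [dummy_code(e) for e in [1, 2, 3, 4]]
```

In a regression, only these three indicators enter the model; under this coding, White is the group with all indicators equal to zero.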
- Or dummy code (SAS-style):
  - `if ethnic=1 then do; af=1; as=0; hi=0; wh=0; end;`
  - `if ethnic=2 then do; af=0; as=1; hi=0; wh=0; end;`
  - `if ethnic=3 then do; af=0; as=0; hi=1; wh=0; end;`
  - (Who is the reference group here?)

**Ordinal**

- Describes a ranking (highest to lowest or lowest to highest)
- Example:
  - Least depressed person = 1
  - Most depressed person = 100
- But this doesn't mean that person 100 is 100 times more depressed than person 1.

**Interval**

- There is a unit of measurement, with a constant distance between the units.
- The zero point is arbitrary.
- Example: temperature scales; the degree is a constant.

**Ratio**

- Like an interval scale, but there is an absolute zero.
- Example: distance (0 really means no distance).

**Be careful about how you describe scales**

- A score of 0 usually doesn't mean a total lack of the characteristic.
- Be careful about saying one person is twice as depressed, intelligent, etc., as another person.

**Make sure the response options are appropriate for the sample!**

- How many sexual partners have you had?
  - For adolescents: A. 0, B. 1, C. 2, D. 3 or more
  - For sex workers: A. Less than 20, B. 20-50, C. 51-100, D. More than 100

**Acquiescence bias**

- Automatically saying "yes" or "no," especially to complicated questions.
- Solutions:
  - Phrase questions in both directions so people can't just agree with all of them.
  - Don't make the questions too complicated.

**Extreme and moderate responding**

- Some people like to choose extreme answers.
- Some people like to avoid extreme answers.
- Some people like to choose the middle answer.

**Social desirability**

- Tendency to respond in a way that is socially appealing or acceptable
- More likely if it's clear what the desirable response would be (e.g., honesty vs. dishonesty)
- Less likely if responses are anonymous or confidential
- Some people are more likely to do this than others
- (Can measure this with a social desirability scale)

**Malingering**

- "Faking bad": the opposite of social desirability
- Happens when the person could benefit from being diagnosed with a disorder
  - Criminal competency hearings, claims for workers' compensation, personal injury hearings, etc.
- Also happens when kids try to be funny

**Careless or random responding**

- Making patterns on the answer sheet
- Giving the same answer to every question
- Happens when people are unmotivated to give the correct answer
  - Survey is too long
  - Questions are too hard for them to read or understand

**How to limit response bias**

- Manage the testing context
  - Make it anonymous or confidential
  - Minimize fatigue, stress, and distraction
  - Tell respondents that it's possible to evaluate the validity of their responses

**Manage the content**

- Write simple items
- Write items that are neutral in terms of social desirability
  - "I am sometimes less friendly than other people" instead of "I am an unfriendly person"
  - Forced-choice items (friendly vs. assertive)
- Include items that are worded in both directions
- Include a social desirability measure or lie scale

**Newer approaches to measurement**

- Generalizability Theory
- Item Response Theory

**Creating new measures**

- What is the best way to measure a latent construct?
- (Latent construct = something that can't be measured directly)
- [Figure: a path diagram in which the latent construct predicts observed variables 1, 2, and 3, each with its own error term.]

**Classical Test Theory**

- Total variance in observed scores = true-score variance + error variance
- What is in that amorphous error variance?
  - Peculiarities of specific items, raters, methods, occasions, etc.
  - Anything that is not true-score variance counts as error.

**Generalizability Theory**

- Instead of treating the non-true-score variance as a blob of error, try to decompose it.
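The classical test theory identity (observed score = true score + error, so total variance splits into true-score variance plus error variance) can be illustrated with a quick simulation; all numbers here are invented for illustration:

```python
import random

# Simulate observed = true + error with independent error.
# By construction, Var(observed) ~= Var(true) + Var(error),
# and reliability = Var(true) / Var(observed) ~= 100 / 125 = 0.8.
random.seed(0)
n = 100_000
true_scores = [random.gauss(50, 10) for _ in range(n)]  # SD 10 -> variance 100
errors = [random.gauss(0, 5) for _ in range(n)]         # SD 5  -> variance 25
observed = [t + e for t, e in zip(true_scores, errors)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

reliability = var(true_scores) / var(observed)  # ~= 0.8 by construction
```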
- Facets of measurement that can vary:
  - Different items
  - Different raters
  - Different measurement modalities
  - Different measurement occasions
  - Etc.

**Example: measuring a professor's helpfulness**

- Facet 1: self-ratings vs. other-ratings vs. direct observation
- Facet 2: items (explains concepts clearly, answers questions, approachable, etc.)
- Facet 3: occasion (first day of class, night before the exam, office hours, etc.)

**The main question asked by Generalizability Theory**

- We can only have a limited number of items.
- How well does our limited set of items measure the construct, compared with an infinite set of all possible items?
- Example:
  - To measure helpfulness, we only asked whether the professor explains concepts clearly, answers questions, and is approachable.
  - Would adding caring, cheerful, or intelligent make the measure better?

**In other words...**

- Is the variability (across professors) in our short measure consistent with the variability that we would obtain with an infinitely long measure?
- How well do our chosen items represent the infinite universe of possible measures of helpfulness?

**Could use these 33 datapoints in an ANOVA**

- Total variance = variance due to the target + variance due to the question + error
- Could add a facet by adding additional raters:
  - Student 1, student 2, student 3
  - Each student rates each professor on each item
- Goal: calculate what proportion of the total variance is due to the target, not the item, rater, or error.

**Signal and noise**

- Signal: real differences in helpfulness across professors
- Noise: random measurement error, plus variability due to facets of measurement (items, raters, etc.)
- Generalizability coefficient = signal / (signal + noise)
- Interpreted like a Cronbach's alpha (.80 is good)

**D-study (decision study)**

- Try various combinations of numbers of items and raters and see how the generalizability coefficient changes.
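The signal / (signal + noise) idea can be sketched numerically. For the simplest one-facet design (targets crossed with items, one rating per cell), the generalizability coefficient reduces to Cronbach's alpha; the professor-by-item ratings below are invented:

```python
# Generalizability coefficient for a targets-x-items design, computed as
# Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) / var(total scores)).
# Rows = professors (targets), columns = items; ratings are illustrative.

ratings = [
    [5, 4, 5],
    [2, 3, 2],
    [4, 4, 3],
    [1, 2, 1],
    [3, 3, 4],
]

def var(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

k = len(ratings[0])  # number of items
item_vars = [var([row[i] for row in ratings]) for i in range(k)]
total_scores = [sum(row) for row in ratings]
g_coef = (k / (k - 1)) * (1 - sum(item_vars) / var(total_scores))  # ~0.92 here
```

A D-study would rerun this kind of calculation while varying the number of items (and, in richer designs, raters) to find the cheapest design that keeps the coefficient acceptably high.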
- Goal: find the most efficient design that still gives a good generalizability coefficient.

**Two-facet design**

- Two facets: item and observer
- Effects in your ANOVA:
  - Target
  - Item
  - Observer
  - Target x Item
  - Target x Observer
  - Item x Observer
  - Residual

**D-study**

- Calculate this ANOVA for different combinations of numbers of items and numbers of observers.
  - 1 observer with 5 items vs. 5 observers with 1 item, etc.
- Find the easiest/cheapest combination that gives a decent generalizability coefficient.

**Item Response Theory (IRT)**

- Another newer alternative to classical test theory

**Same basic underlying concept**

- An individual's response to an item is influenced by qualities of the individual and qualities of the item.
- IRT focuses on specific qualities of the individual and the item.
- Often used to develop measures of intelligence, ability, etc.

**Example: SAT/GRE**

- Designed to assess a wide range of abilities
- Some questions discriminate the 800 students from the 780 students.
- Some questions discriminate the 250 students from the 300 students.

**Easy questions and hard questions**

- Based on previous pilot testing with the questions:
  - Easy questions are those that most people answer correctly (e.g., 2 + 2).
    - These questions identify the lowest-performing people.
  - Hard questions are those that only a few people answer correctly (e.g., a really tricky geometry problem).
    - These questions identify the highest-performing people.
- Need a mix of question difficulties to make finer discriminations.

**Similar principle for psychosocial scales: item "difficulty"**

- Depression items that are "easy" to say yes to:
  - I felt sad.
  - I had trouble keeping my mind on what I was doing.
  - (These items discriminate between no depression and mild depression.)
- Depression items that are "hard" to say yes to:
  - I thought my life had been a failure.
- I had crying spells.
  - (These items discriminate between moderate depression and severe depression.)

**Relationship between item difficulty and individuals' traits**

- In IRT, item difficulty and the person's trait level are related.
- A person needs to be really depressed to say yes to certain items.
- A person only needs to be a little depressed to say yes to other items.
- A person needs to be really smart to get certain math questions right.
- A person only needs to be a little smart to get other math questions right.

**The item's difficulty is expressed in terms of the trait level**

- The item's difficulty is the amount of the trait necessary to have a 50% chance of answering the item correctly.
- Usually expressed in standard deviations.
- If an item has a difficulty of 1.5, a person with math ability (or depression) 1.5 standard deviations above the mean has a 50% chance of getting it correct (or saying yes to it).
- If an item has a difficulty of 0, Average Joe Bloggs™ has a 50% chance of getting it correct.

**Item characteristic curve (ICC)**

- [Figure: an ICC plotting the probability of a "yes"/correct response against trait level.] Where the curve is steepest, the item does the best job of differentiating between individuals: most of the depressed people say yes, and most of the non-depressed people say no.

**Comparing items**

- [Figure: two ICCs, an easy item and a difficult item, shifted along the trait axis.]

**Why is this important?**

- We want to develop the most valid (but also efficient) measures.
- We want to develop measures that differentiate among individuals across a wide range of the construct of interest.

**Other examples of unidimensional scale development methods**

- Thurstone or "equal-appearing interval" scaling
- Likert or "summative" scaling
- Guttman or "cumulative" scaling

**Thurstone or equal-appearing interval scaling**

- Method of equal-appearing intervals:
  - Start with a large set of potential statements.
  - Ask judges to rate each statement (e.g., on a 1-11 scale).
  - Calculate the Q1, median, and Q3 percentiles of the ratings for each statement.
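The quartile step above can be sketched with the standard library; the judge ratings below are invented:

```python
import statistics

# For each candidate statement, summarize the judges' 1-11 ratings by the
# median (scale position) and the interquartile range (judge disagreement).
judge_ratings = {
    "statement_a": [2, 3, 3, 4, 3, 2, 3],
    "statement_b": [6, 9, 2, 10, 5, 8, 3],
    "statement_c": [9, 10, 9, 8, 10, 9, 9],
}

summary = {}
for stmt, ratings in judge_ratings.items():
    q1, med, q3 = statistics.quantiles(ratings, n=4)  # Q1, median, Q3
    summary[stmt] = {"median": med, "iqr": q3 - q1}
# statement_b has a large IQR (judges disagree), so it would be a poor pick.
```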
- Select items whose medians are spaced at roughly equal intervals across the scale, preferring those with the smallest interquartile ranges.
- The final scale presents the selected statements with Agree/Disagree responses.
- Related methods: the method of successive intervals, and the method of paired comparisons.
- See: http://www.socialresearchmethods.net/kb/scalthur.php

**Likert or "summative" scaling**

- Generating the items
  - Should be items that can be rated on a 1-to-5 or 1-to-7 Disagree/Agree response scale.
- Rating the items (e.g., on a 1-5 scale)
- Selecting the items
  - Throw out any items that have a low correlation with the total score.
  - Form top and bottom subgroups of subjects by average total score, then compare (t-test) the ratings on each item between these two subgroups. Items with larger t values should be retained because they are better discriminators.

**Guttman or "cumulative" scaling**

- We would like a set of items or statements such that a respondent who agrees with any specific question in the list will also agree with all previous questions.
- Put more formally, we would like to be able to predict item responses perfectly knowing only the total score for the respondent.

**Guttman or "cumulative" scaling (procedure)**

- Develop the items (80-100 items!).
- Ask the judges to rate (Yes/No) each statement (not their own feeling, but the relevance of the statement to the focus of the measure).
- Sort the matrix (respondent x item) by frequency of agreement, from more agreeable to less agreeable.
- Use scalogram analysis to determine the selected items.
- The final scale can then be used with Yes/No or Agree/Disagree answers.
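The Guttman requirement, that each item response can be reproduced from the total score alone, can be sketched in Python; the response patterns below are invented, with items ordered from most to least agreeable:

```python
# In a perfect Guttman scale, every response pattern is a run of 1s
# followed by 0s, so the total score fully determines the pattern.
responses = [
    [1, 1, 1, 0],  # total 3 -> endorses the three easiest items
    [1, 1, 0, 0],  # total 2
    [1, 0, 0, 0],  # total 1
    [1, 1, 1, 1],  # total 4
]

def is_cumulative(pattern):
    """True if no 0 is ever followed by a 1 (i.e., 1s then 0s)."""
    return all(not (a == 0 and b == 1) for a, b in zip(pattern, pattern[1:]))

# Fraction of respondents whose pattern is reproducible from their total
# score alone (1.0 for this toy data; real data would show some errors).
reproducibility = sum(is_cumulative(r) for r in responses) / len(responses)
```

Scalogram analysis generalizes this check, selecting the item subset that keeps such "errors" (patterns like `[1, 0, 1, 0]`) to a minimum.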