1. Research Methods I: Measurement
2. What is measurement? – attempt to quantify a hypothetical construct.
Psychometrics – the study of measurement in psychology; sociometry is its counterpart for measuring social relationships
Operationalism is important
3. Forms of measurement
Observational measurement
Setting, disguise, how to measure behavior
Physiological measures
Brain activity, hormone levels, etc.
Response patterns
Latencies, choices, etc.
Self-report measures
Questionnaires, interviews, limitations
Archival measures
Often constructs can be measured in multiple ways.
E.g., anger
4. Often it is best to get multiple measures of the construct of interest
Triangulation (converging operations) – observing from several different viewpoints to understand the construct.
Several types of triangulation
Measurement
Observers
Theory
Method
5. What makes a good measure (test)? Reliability – consistency or dependability of the test.
If you use the test multiple times, do you get the same results?
E.g., scale for weight.
Measurement error affects reliability
Observed score = true score + measurement error
6. Factors that contribute to measurement error
Transient states of the participant
E.g., anxiety
Stable attributes of the participant
E.g., low motivation, low intelligence
Situational factors
E.g., treatment of participants, lighting, etc.
Characteristics of the measure
E.g., bad questions
Mistakes
E.g., miscounting
7. Inverse relationship between measurement error and reliability
Estimating reliability
Observed variance in a set of scores is due to two factors, individual differences and measurement error
Total variance = systematic variance + error
Reliability = systematic variance / total variance
Reliability coefficient – values near 1 indicate a reliable test; the closer to 0, the lower the reliability.
Rule of thumb – good if above 0.7
70% of the total variance in scores is systematic variance.
8. Assessing Reliability Use two scores: if they are similar, there is reliability
How similar are the scores?
Correlation coefficient – assesses how similar they are.
Correlation actually measures systematic variance / total variance
Three methods to estimate reliability
Test-retest reliability
Inter-item reliability
Interrater reliability
9. Test-retest reliability – give the test on multiple occasions. Correlate the scores.
Correlation reflects the degree of reliability
Assumes that participants' behavior is stable over time
Some behaviors are not
E.g., personality probably is stable
E.g., hunger is not
For some behaviors, test-retest reliability cannot be meaningfully measured
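A minimal sketch (not from the slides): test-retest reliability can be estimated by correlating scores from the two administrations; the scores below are invented for illustration.

```python
# Sketch: estimating test-retest reliability with a Pearson correlation
# between two administrations of the same test (made-up data).
import numpy as np

time1 = np.array([12, 18, 15, 22, 9, 17, 20, 14])   # first administration
time2 = np.array([13, 17, 16, 21, 10, 18, 19, 15])  # same people, second administration

r = np.corrcoef(time1, time2)[0, 1]  # the correlation is the reliability estimate
print(f"Test-retest reliability estimate: r = {r:.2f}")
```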
10. Inter-item reliability – degree of consistency among the items on the test.
Most measures use many items and sum them into a total score.
Methods of calculating
Item-total correlation – correlation between a particular item and the sum of all other items on the scale.
Can be used to assess bad questions.
Get rid of all of the bad questions, reliability increases
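A minimal sketch of the item-total calculation, using an invented respondents-by-items matrix; low or negative values flag candidate "bad questions."

```python
# Sketch: item-total correlation = correlation of each item with the
# sum of all *other* items (illustrative 5-point responses).
import numpy as np

responses = np.array([   # rows = respondents, columns = items
    [5, 4, 5, 2, 4],
    [3, 3, 4, 5, 3],
    [4, 4, 4, 1, 5],
    [2, 1, 2, 4, 2],
    [5, 5, 4, 3, 5],
])

for j in range(responses.shape[1]):
    item = responses[:, j]
    rest = responses.sum(axis=1) - item          # sum of all other items
    r = np.corrcoef(item, rest)[0, 1]
    print(f"Item {j + 1}: item-total r = {r:.2f}")
```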
11. Split-half reliability – divide items on the test into two parts
Even/odd, First half/second half, random halves
If the two halves do not correlate well, there is measurement error.
Cronbach's alpha – a measure of split-half reliability (in effect, the average of all possible split-half correlations)
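A minimal sketch of computing Cronbach's alpha directly from its definition; the response matrix is invented.

```python
# Sketch: Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

responses = np.array([   # rows = respondents, columns = items (made-up)
    [5, 4, 5, 4],
    [3, 3, 4, 3],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```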
12. Interrater reliability –
aka interjudge or interobserver reliability.
consistency among researchers that observe the behavior.
They both watch the same thing, do they see it the same?
E.g., conditioning
Percentage of time they agree
Correlation between the ratings
Generally need around 0.9. (90%)
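A minimal sketch of both summaries mentioned above, percent agreement and the correlation between the two observers, using invented ratings.

```python
# Sketch: interrater reliability as percent agreement and as a correlation
# between two raters' codings of the same behavior (made-up ratings).
import numpy as np

rater_a = np.array([3, 4, 4, 2, 5, 3, 4, 1])
rater_b = np.array([3, 4, 5, 2, 5, 3, 4, 2])

agreement = np.mean(rater_a == rater_b)     # proportion of exact matches
r = np.corrcoef(rater_a, rater_b)[0, 1]     # correlation between ratings

print(f"Percent agreement = {agreement:.0%}")
print(f"Correlation between raters: r = {r:.2f}")
```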
13. Increasing Reliability Eliminate potential sources of measurement error.
Clearly conceptualize the construct
Test construction – clarify instructions and questions
Test administration and conditions – standardize for all measures
Test scoring and interpretation –
Train observers
Careful in coding, tabulating, or computing data.
Use pilot studies
14. Validity – the extent to which the measurement procedure measures what you think it is measuring.
Reliability is a necessary but not sufficient condition for validity.
Test can be reliable but not valid.
Test must be reliable to be valid.
E.g., dart board, phrenology
Different types of validity
Face
Content
Construct
Criterion
15. Face validity – on the surface, does the test seem to measure what it is supposed to measure?
If no face validity, many will doubt its relevance
E.g., Target and MMPI and CPI
Evil spirits possess me sometimes
Problems with face validity
1. Not always useful
Can have face validity but not be valid
2. Not necessary
Can be valid without face validity
3. Sometimes disguising purpose is important.
16. Content validity – special type of face validity
Extent to which a test measures the content area it was designed to measure.
Does it assess all of the content and not just a part of it?
E.g., Psychology GRE, Math test
17. Construct validity – does the test measure the construct of interest?
Construct – an entity that cannot be directly observed but is inferred on the basis of empirical evidence
Most variables in the social sciences
E.g., intelligence, media bias, anxiety
Correlate the score on the test with scores on other converging tests.
Two parts to construct validity
18. Convergent validity – the measure should correlate with other measures of the same construct.
E.g., measures of anxiety
Divergent (discriminant) validity – the measure should not correlate with measures of a different construct
19. Criterion validity - the degree to which one measurement agrees with other approaches for measuring the same characteristic.
Two parts – depends on time
Concurrent validity – does the test agree with a preexisting measure?
Predictive validity – the test's ability to predict future behavior relevant to the criterion.
E.g., SAT scores and college GPA
20. Comparing apples and oranges How do you compare scores on various measures with each other?
E.g., SAT and ACT
E.g., WAIS vs. Raven’s
Standardization – placing all scores in the same unit of measurement.
Force the measurement to have the same scale.
Z scores – the most common standardization procedure.
Convert the scores into standard deviation units.
21. Z-scores represent distance away from the mean in terms of standard deviation units.
Positive z-scores are above the mean.
Negative z-scores are below the mean.
Value represents distance from the mean.
Ex. z = 3 → 3 standard deviations above the mean.
Ex. z = -1 → 1 standard deviation below the mean.
Note: mean of all z scores = 0, Standard deviation = 1
22. Calculating Z - scores
Individual score minus the mean, divided by the standard deviation: z = (X - mean) / SD
23. Example Calculate z scores on an IQ test if the mean is 100 and the standard deviation is 15
Brian – 130
Jan – 72
Jim – 100
Converting z-scores to raw scores
Jody – z = 3
Jeremiah – z = 0
Zach – z= - 2.5
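A worked sketch of this example's arithmetic (names and numbers come from the slide; the code itself is added for illustration).

```python
# Worked version of the slide's example: IQ mean = 100, SD = 15.
mean, sd = 100, 15

def to_z(score):    # raw score -> z-score
    return (score - mean) / sd

def to_raw(z):      # z-score -> raw score
    return mean + z * sd

for name, score in [("Brian", 130), ("Jan", 72), ("Jim", 100)]:
    print(f"{name}: z = {to_z(score):.2f}")        # 2.00, -1.87, 0.00

for name, z in [("Jody", 3), ("Jeremiah", 0), ("Zach", -2.5)]:
    print(f"{name}: raw score = {to_raw(z):.1f}")  # 145.0, 100.0, 62.5
```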
24. Comparison between tests
Compare z-scores
Ex. Jerry makes a 1200 on the SAT, Terry makes a 30 on the ACT. Who scored better?
SAT – mean=1000, sd = 150
ACT – mean = 21, sd = 3
In 1960, the mean baseball salary was $50,000 with a standard deviation of $10,000. Today, the mean salary is $2,000,000, with a standard deviation of $500,000. In 1960, Clete Boyer, the third baseman for the New York Yankees, made $30,000. What would he earn at today's salaries?
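A worked sketch of the two comparisons above, using the means and standard deviations given on the slide.

```python
# Put both scores on a common (z-score) scale, then compare or convert.
def z(score, mean, sd):
    return (score - mean) / sd

# Jerry vs. Terry
jerry = z(1200, 1000, 150)   # SAT: z = 1.33
terry = z(30, 21, 3)         # ACT: z = 3.00 -> Terry's score is relatively better
print(f"Jerry z = {jerry:.2f}, Terry z = {terry:.2f}")

# Clete Boyer: keep the same z-score, express it in today's salary distribution.
boyer_z = z(30_000, 50_000, 10_000)       # z = -2.0 in 1960
today = 2_000_000 + boyer_z * 500_000     # -> $1,000,000 at today's salaries
print(f"Boyer's equivalent salary today: ${today:,.0f}")
```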
25. Percentiles and z-scores If normal distribution is assumed, the percentile score is known based on the z score.
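A minimal sketch (not from the slides) of turning a z-score into a percentile under the normality assumption, using the standard normal CDF.

```python
# Sketch: percentile = standard normal CDF evaluated at the z-score.
from scipy.stats import norm

for z in (-1.87, 0.0, 1.33, 2.0, 3.0):
    print(f"z = {z:+.2f} -> {norm.cdf(z):.1%} percentile")
```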
27. Scaling and index construction Difference between the two
Scale – measure the direction or intensity of a construct
Index – measure that combines several indicators of a construct into a single score.
Ex. FBI crime index
Ex. Consumer price index
Constructing an instrument to assign numbers to a qualitative concept
28. Important characteristics of indexing and scaling
Mutual exclusiveness – Individual cases fit into one category only
Exhaustive – all cases fit into one of the categories formed.
Unidimensionality – items comprising a scale or index must measure one and only one dimension or concept at a time
Ex. Long tests
29. Index construction Exams are examples of indices.
Best places to live
America’s best colleges
The difficult aspect of index construction is the evaluation of the construct
Process is largely theoretical – face validity is often used.
Weighting – certain factors are valued more than others
Combination of factors does not assume that all are equal.
30. US News and World Report
Peer assessment (weighted by 25 percent).
Retention (20 percent).
Faculty resources (20 percent).
Student selectivity (15 percent).
Financial resources (10 percent).
Graduation rate performance (5 percent).
Alumni giving rate (5 percent).
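A minimal sketch of combining weighted indicators into a single index score, using the weights listed above; the college's indicator scores are invented and assumed to already be on a 0-100 scale.

```python
# Sketch: a weighted index does not treat all factors as equal.
weights = {
    "peer_assessment": 0.25,
    "retention": 0.20,
    "faculty_resources": 0.20,
    "student_selectivity": 0.15,
    "financial_resources": 0.10,
    "graduation_rate_performance": 0.05,
    "alumni_giving": 0.05,
}

college = {   # made-up indicator scores for one college
    "peer_assessment": 80, "retention": 90, "faculty_resources": 70,
    "student_selectivity": 85, "financial_resources": 60,
    "graduation_rate_performance": 75, "alumni_giving": 50,
}

index = sum(weights[k] * college[k] for k in weights)
print(f"Weighted index score = {index:.1f}")
```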
31. Missing data – can be very damaging to reliability and validity
Systematic missing data is problematic
Ways to handle missing data
1. Eliminate that case
2. Substitute with the average of available sources
3. Try to estimate using another source
4. Insert a random value
5. Make "not available" a possible response option
6. Analyze the reasons for the missing data
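A minimal sketch of option 2 above, substituting the average of the available values for a missing response (invented data).

```python
# Sketch: mean substitution for a single missing response.
import numpy as np

scores = np.array([4.0, 5.0, np.nan, 3.0, 4.0])   # np.nan marks the missing case
filled = np.where(np.isnan(scores), np.nanmean(scores), scores)
print(filled)   # missing value replaced by the mean of the observed scores (4.0)
```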
32. Scaling Scaling is the assignment of objects to numbers according to a rule.
Types of scaling
Likert
Thurstone
Bogardus Social Distance
Semantic Differential
Guttman
33. Likert Scaling Assigns the construct a value based on a bipolar response set. E.g., level of agreement, approval, etc.
Also known as summated-rating or additive scales
E.g., Capital punishment should be reinstated.
____Strongly agree _____Agree _____Neutral _____Disagree ____Strongly disagree
Generally want 5-point Likert scales
Can have more – little reason to go above 7, given limits to reliability gains
Collapsing data
Some prefer even numbers – forced decision
34. Multiple items can be combined into an index.
E.g., sum ten different questions together
Dummy code the responses
Strongly agree = 5
Agree = 4
Neutral = 3
Disagree = 2
Strongly disagree = 1
Very important: data from Likert scaling is ordinal data
Implication: strictly speaking, statistics that assume interval data are not appropriate
No single way to code data.
Strongly agree = 100, agree = 50, Neutral = 25, disagree = 10, strongly disagree = 1
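A minimal sketch of the dummy coding and summing described above, with invented responses.

```python
# Sketch: code each Likert response, then sum ten items into one index score.
coding = {"strongly agree": 5, "agree": 4, "neutral": 3,
          "disagree": 2, "strongly disagree": 1}

answers = ["agree", "strongly agree", "neutral", "agree", "disagree",
           "agree", "strongly agree", "neutral", "agree", "agree"]

index_score = sum(coding[a] for a in answers)
print(f"Summated index score = {index_score}")   # out of a possible 50
```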
35. Summing items
Questions included in the scale must be chosen carefully
Requirements:
Items must be more or less addressing the same concept
Check with item-whole correlation
Items need to have the same level of importance to the respondent
E.g., restaurant opinion
Do not ask the same question repeatedly
Repetition only inflates the statistics
36. When coding, be careful of direction of question.
E.g., 1. I feel like I make a useful contribution at work - sd, d, n, a, sa
2. At times I feel incompetent at my job - sd, d, n, a, sa
Reverse coding
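A minimal sketch of reverse coding the second item so that a higher number always means a more positive attitude toward work; the 1-5 coding is an assumption carried over from the previous slide.

```python
# Sketch: reverse coding flips the scale for negatively worded items.
coding = {"sd": 1, "d": 2, "n": 3, "a": 4, "sa": 5}

def code(response, reverse=False):
    value = coding[response]
    return 6 - value if reverse else value   # on a 5-point scale: 5->1, 4->2, ...

print(code("sa"))                 # item 1 ("useful contribution"): sa -> 5
print(code("sa", reverse=True))   # item 2 ("feel incompetent"): sa -> 1
```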
37. Thurstone scaling Also known as Method of Equal-appearing Intervals
Generally used to assess attitudes
First, generate a number of questions about the topic of interest (at least 100).
Ex. Attitude towards RSU
Second, have many judges (around 100) rate the questions on a scale of 1 to 11.
Is the question favorable towards the concept?
1 = least favorable to concept
11 = most favorable to concept
Important – judges aren't answering the questions, but evaluating them.
38. Third, analyze the rating data
Calculate the median of the judges' ratings for each item, along with the variability
Some questions will be rated very favorable, some very unfavorable
E.g., RSU has small class sizes (11)
E.g., RSU only offers a few bachelor’s degrees (1)
Fourth, choose questions from all 11 median values
Reduce questionnaire to about 20 items.
Fifth, administer the scale
E.g., RSU has small class sizes: agree or disagree
Can combine it with Likert scale
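A minimal sketch of the "analyze the rating data" step: the median and spread of the judges' ratings per item. The ratings, and the third item, are invented for illustration; items with large spread are typically dropped, and the rest are chosen to cover the whole 1-11 range.

```python
# Sketch: median rating and interquartile range for each candidate item.
import numpy as np

judge_ratings = {   # item -> ratings from several judges (invented)
    "RSU has small class sizes": [10, 11, 10, 9, 11],
    "RSU only offers a few bachelor's degrees": [1, 2, 1, 2, 1],
    "RSU's library hours are adequate": [6, 7, 5, 8, 6],
}

for item, ratings in judge_ratings.items():
    ratings = np.array(ratings)
    q1, q3 = np.percentile(ratings, [25, 75])
    print(f"{item}: median = {np.median(ratings)}, IQR = {q3 - q1}")
```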
39. Thurstone:
Allows for analysis of agreement among the judges on each item
Allows for the identification of homogeneous items.
Very time-consuming and costly
No more reliable than a Likert scale
40. Bogardus Social Distance Scale Used to measure social distances between groups of people
E.g., ethnic closeness
Religious closeness
Prejudice
Participants respond to a series of ordered statements. Threatening → Nonthreatening
41. E.g., Attitude towards homosexuals
Would marry
Would have as regular friends
Would work beside in an office
Would have several families in my neighborhood
Would have merely as speaking acquaintances
Would have them live outside my neighborhood
Would have them live outside my country
E.g., mental illness
42. Can measure social distance to other variables
E.g., level of education, geographical location, etc.
E.g., Angermeyer and Matschinger (1997) – mental illness perception in Germany
Alcoholics elicited more social distance than schizophrenics
Personal experience with someone with mental illness reduced social distance
43. Semantic Differential Scaling Provides an indirect measure of how a person feels about a concept, object, or other person
Measures feelings by using adjectives
Humans use language to communicate feelings by using adjectives
Adjectives tend to have polar opposites
Hot/cold
Tall/short
Uses adjectives to create a rating scale.
45. Three main uses of adjectives: i.e., three dimensions of attitudes
Evaluation (good-bad, pleasant-unpleasant, kind-cruel)
Potency (strong-weak, thick-thin, hard-soft)
Activity (active-passive, slow-fast, hot-cold)
Multiple uses for semantic differential scale
46. Guttman Scaling Also known as cumulative scaling
Evaluation of data after they are collected
Meant to determine if a relationship exists within a group of items
Items are arranged such that a person who agrees with an item will also agree with less extreme items.
47. Example:
1. Slapping a child’s hand is an appropriate way to teach the meaning of “No!”
2. Spanking is sometimes necessary
3. Sometimes discipline requires using a belt or paddle
4. Some children need a good beating to keep them in line.
Source: Monette et al. (1994)
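A minimal sketch (not from the slides) of checking whether a respondent's answers follow the cumulative Guttman pattern.

```python
# Sketch: with items ordered from least to most extreme, a respondent who
# agrees with item k should also agree with items 1..k-1.
def is_cumulative(responses):
    """responses: list of True/False answers, ordered least -> most extreme."""
    seen_disagree = False
    for agreed in responses:
        if agreed and seen_disagree:
            return False   # agreed with a more extreme item after disagreeing
        if not agreed:
            seen_disagree = True
    return True

print(is_cumulative([True, True, False, False]))   # fits the cumulative pattern
print(is_cumulative([True, False, True, False]))   # scale error
```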
48. Factor Analysis Statistical method for determining unidimensionality.
Is the test measuring more than one concept?
Analyzes the pattern of responding to each item.
E.g., Factor analysis and intelligence
Tests unidimensionality
Can allow for evaluation of constructs
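A minimal sketch (not from the slides) of checking unidimensionality with scikit-learn's FactorAnalysis on randomly generated illustration data; if one factor carries most of the loading for every item, a single concept is a reasonable interpretation.

```python
# Sketch: fit a two-factor model to items generated from one underlying construct.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                                 # one construct
items = latent @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))   # loadings: one row per factor, one column per item
```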