
Research Methods I

What is measurement?

Presentation Transcript


    1. Research Methods I Measurement

    2. What is measurement? – an attempt to quantify a hypothetical construct. Psychometrics – the study of measurement in psychology (its counterpart in sociology is sociometry). Operationalism is important: the construct must be defined in terms of the operations used to measure it.

    3. Forms of measurement:
    - Observational measurement – setting, disguise, how to measure the behavior
    - Physiological measures – brain activity, hormone levels, etc.
    - Response patterns – latencies, choices, etc.
    - Self-report measures – questionnaires, interviews; note their limitations
    - Archival measures
    Often a construct can be measured in multiple ways, e.g., anger.

    4. Often it is best to get multiple measures of the construct of interest. Triangulation (converging operations) – observing from several different viewpoints to understand the construct. Several types of triangulation: measurement, observers, theory, and method.

    5. What makes a good measure (test)? Reliability – the consistency or dependability of the test. If you use the test multiple times, do you get the same results? E.g., a scale for weight. Measurement error reduces reliability: observed score = true score + measurement error.

    6. Factors that contribute to measurement error:
    - Transient states of the participant (e.g., anxiety)
    - Stable attributes of the participant (e.g., low motivation, low intelligence)
    - Situational factors (e.g., treatment of participants, lighting, etc.)
    - Characteristics of the measure (e.g., bad questions)
    - Mistakes (e.g., miscounting)

    7. There is an inverse relationship between measurement error and reliability. Estimating reliability: the observed variance in a set of scores is due to two factors, individual differences and measurement error. Total variance = systematic variance + error variance, and reliability = systematic variance / total variance. The reliability coefficient runs from 0 to 1: the closer to 1, the more reliable the test. Rule of thumb – good if above 0.7, meaning at least 70% of the total variance in scores is systematic variance.

    8. Assessing reliability: obtain two scores; if they are similar, the measure is reliable. How similar are the scores? The correlation coefficient assesses this – in classical test theory, the reliability correlation estimates systematic variance / total variance. Three methods to estimate reliability: test-retest reliability, inter-item reliability, and interrater reliability.

    9. Test-retest reliability – give the test on multiple occasions and correlate the scores; the correlation reflects the degree of reliability. This assumes that participants' behavior is stable over time, and some behaviors are not: personality probably is stable, but hunger is not. For such behaviors, test-retest reliability cannot meaningfully be measured.
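    A minimal sketch of the computation in Python, with made-up scores for five participants (the numbers are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical scores for five participants tested on two occasions.
time1 = np.array([12, 18, 9, 15, 11])
time2 = np.array([13, 17, 10, 14, 12])

# Test-retest reliability is the Pearson correlation between occasions.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: r = {r:.2f}")
```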

    10. Inter-item reliability – the degree of consistency among the items on the test. Most measures use many items and sum them. Methods of calculating: item-total correlation – the correlation between a particular item and the sum of all the other items on the scale. This can be used to identify bad questions; removing the bad questions increases reliability.
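    A rough illustration of the item-total correlation, again with hypothetical Likert responses; a low or negative correlation flags a candidate bad question:

```python
import numpy as np

# Hypothetical responses: rows = respondents, columns = items (1-5 Likert).
responses = np.array([
    [5, 4, 5, 2],
    [4, 4, 4, 1],
    [2, 1, 2, 5],
    [3, 3, 3, 3],
    [5, 5, 4, 2],
])

# Corrected item-total correlation: each item vs. the sum of the OTHER items.
for i in range(responses.shape[1]):
    item = responses[:, i]
    rest = responses.sum(axis=1) - item
    r = np.corrcoef(item, rest)[0, 1]
    print(f"item {i + 1}: r = {r:.2f}")  # item 4 comes out negative: a bad item
```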

    11. Split-half reliability – divide the items on the test into two halves (even/odd, first half/second half, or random halves). If the two halves do not correlate well, there is measurement error. Cronbach's alpha – a summary of split-half reliability, equivalent to the average of all possible split-half coefficients.
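    The standard formula for Cronbach's alpha can be computed directly from an item matrix; here is a short sketch with invented scores:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 3-item scale answered by four respondents.
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 3],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # about 0.92 for these data
```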

    12. Interrater reliability – aka interjudge or interobserver reliability: consistency among researchers who observe the same behavior. They both watch the same thing – do they see it the same way? E.g., coding a conditioning trial. Quantified as the percentage of time they agree, or as the correlation between the ratings. Generally need around 0.9 (90%).
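    Percent agreement is the simplest version; a sketch with two hypothetical observers coding ten trials:

```python
# Two observers code the same 10 trials (1 = behavior occurred, 0 = not).
rater_a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"percent agreement: {agreements / len(rater_a):.0%}")  # 90%
```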

    13. Increasing reliability – eliminate potential sources of measurement error:
    - Clearly conceptualize the construct
    - Test construction – clarify instructions and questions
    - Test administration and conditions – standardize them for all measurements
    - Test scoring and interpretation – train observers; be careful in coding, tabulating, and computing data
    - Use pilot studies

    14. Validity – the extent to which the measurement procedure measures what you think it is measuring. Reliability is a necessary but not sufficient condition for validity: a test can be reliable but not valid, but a test must be reliable to be valid. E.g., a dart board (a tight cluster can still miss the bullseye), phrenology. Different types of validity: face, content, construct, criterion.

    15. Face validity – on the surface, does the test seem to measure what it is supposed to? If a test lacks face validity, many will doubt its relevance. E.g., Target's use of MMPI and CPI items such as "Evil spirits possess me sometimes." Problems with face validity: (1) it is not sufficient – a test can have face validity but not be valid; (2) it is not necessary – a test can be valid without face validity; (3) sometimes disguising the purpose of the test is important.

    16. Content validity – a special type of face validity: the extent to which a test measures the content area it was designed to measure. Does it assess all of the content, not just a part of it? E.g., the Psychology GRE, a math test.

    17. Construct validity – does the test measure the construct of interest? Constructs are entities that cannot be directly observed but are inferred on the basis of empirical evidence – most variables in the social sciences, e.g., intelligence, media bias, anxiety. Assessed by correlating the score on the test with scores on other, converging tests. There are two parts to construct validity:

    18. Convergent validity – the measure should correlate with measures it theoretically should correlate with, e.g., other measures of anxiety. Divergent (discriminant) validity – the measure should not correlate with measures of a different construct.

    19. Criterion validity – the degree to which one measurement agrees with other approaches for measuring the same characteristic. Two parts, depending on timing: concurrent validity – does the test agree with a preexisting measure? Predictive validity – the test's ability to predict future behavior relevant to the criterion. E.g., SAT scores and college GPA.

    20. Comparing apples and oranges: how do you compare scores on different measures with each other? E.g., SAT and ACT; the WAIS vs. Raven's. Standardization – placing all scores in the same unit of measurement, forcing the measurements onto the same scale. Z-scores – the most common standardization procedure: convert the scores into standard-deviation units.

    21. Z-scores represent distance from the mean in standard-deviation units. Positive z-scores are above the mean; negative z-scores are below it; the magnitude is the distance from the mean. Ex. z = 3 is 3 standard deviations above the mean; z = -1 is 1 standard deviation below the mean. Note: the mean of a set of z-scores is 0 and their standard deviation is 1.

    22. Calculating z-scores: the individual score minus the mean, divided by the standard deviation: z = (X - M) / SD.

    23. Example: calculate z-scores on an IQ test where the mean is 100 and the standard deviation is 15: Brian – 130; Jan – 72; Jim – 100. Converting z-scores back to raw scores: Jody – z = 3; Jeremiah – z = 0; Zach – z = -2.5.
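    A worked version of both directions in Python (the names and numbers are the slide's own; the helper functions are just for illustration):

```python
def z_score(x, mean, sd):
    """Standardize a raw score: distance from the mean in SD units."""
    return (x - mean) / sd

def raw_score(z, mean, sd):
    """Convert a z-score back to the raw-score scale."""
    return mean + z * sd

# IQ test: mean = 100, sd = 15.
for name, iq in [("Brian", 130), ("Jan", 72), ("Jim", 100)]:
    print(f"{name}: z = {z_score(iq, 100, 15):.2f}")
# Brian: z = 2.00, Jan: z = -1.87, Jim: z = 0.00

for name, z in [("Jody", 3), ("Jeremiah", 0), ("Zach", -2.5)]:
    print(f"{name}: IQ = {raw_score(z, 100, 15):.1f}")
# Jody: 145.0, Jeremiah: 100.0, Zach: 62.5
```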

    24. Comparison between tests: compare z-scores. Ex. Jerry makes a 1200 on the SAT (mean = 1000, sd = 150); Terry makes a 30 on the ACT (mean = 21, sd = 3). Who scored better? In 1960, the mean baseball salary was $50,000 with a standard deviation of $10,000; today, the mean salary is $2,000,000 with a standard deviation of $500,000. In 1960, Clete Boyer, the third baseman for the New York Yankees, made $30,000. What would he earn at today's salaries?
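    Working the slide's two examples through the same formula:

```python
# Jerry vs. Terry: standardize each score on its own test's scale.
jerry = (1200 - 1000) / 150   # z = 1.33
terry = (30 - 21) / 3         # z = 3.00 -> Terry scored better

# Clete Boyer: $30,000 in 1960 (mean $50,000, sd $10,000) gives z = -2;
# apply that z to today's distribution (mean $2,000,000, sd $500,000).
z_boyer = (30_000 - 50_000) / 10_000
today = 2_000_000 + z_boyer * 500_000
print(f"Jerry z = {jerry:.2f}, Terry z = {terry:.2f}")
print(f"Boyer's equivalent salary today: ${today:,.0f}")  # $1,000,000
```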

    25. Percentiles and z-scores: if a normal distribution is assumed, the percentile score is known directly from the z-score.
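    Concretely, the percentile is the cumulative probability of the standard normal distribution below the z-score (here via SciPy's normal CDF):

```python
from scipy.stats import norm

for z in (-2.5, 0, 2):
    print(f"z = {z:+.1f} -> {norm.cdf(z):.1%} percentile")
# z = -2.5 -> 0.6%,  z = +0.0 -> 50.0%,  z = +2.0 -> 97.7%
```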

    27. Scaling and index construction – the difference between the two: a scale measures the direction or intensity of a construct; an index combines several indicators of a construct into a single score (ex. the FBI crime index, the Consumer Price Index). Both involve constructing an instrument to assign numbers to a qualitative concept.

    28. Important characteristics of indexes and scales: mutual exclusiveness – each individual case fits into one category only; exhaustiveness – all cases fit into one of the categories formed; unidimensionality – the items comprising a scale or index must measure one and only one dimension or concept at a time (ex. long tests).

    29. Index construction – exams are examples of indexes, as are "best places to live" and "America's best colleges" rankings. The difficult aspect of index construction is the evaluation of the construct; the process is largely theoretical, and face validity is often relied on. Weighting – certain factors are valued more than others; combining factors does not assume that all are equal.

    30. US News and World Report college-ranking weights:
    - Peer assessment (25 percent)
    - Retention (20 percent)
    - Faculty resources (20 percent)
    - Student selectivity (15 percent)
    - Financial resources (10 percent)
    - Graduation rate performance (5 percent)
    - Alumni giving rate (5 percent)
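    A weighted index is just a weighted sum of standardized component scores. A sketch using the weights above and invented component scores for one college:

```python
# Weights from the slide (they sum to 1.0).
weights = {
    "peer_assessment": 0.25, "retention": 0.20, "faculty_resources": 0.20,
    "student_selectivity": 0.15, "financial_resources": 0.10,
    "graduation_performance": 0.05, "alumni_giving": 0.05,
}

# Hypothetical component scores (already standardized to a 0-100 scale).
college = {
    "peer_assessment": 80, "retention": 90, "faculty_resources": 70,
    "student_selectivity": 85, "financial_resources": 60,
    "graduation_performance": 75, "alumni_giving": 50,
}

index = sum(weights[k] * college[k] for k in weights)
print(f"overall index: {index:.1f}")  # 77.0 for these made-up scores
```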

    31. Missing data can be very damaging to reliability and validity, and systematically missing data is especially problematic. Ways to handle missing data: (1) eliminate that case; (2) substitute the average of the available sources; (3) try to estimate the value from another source; (4) insert a random value; (5) make "not available" a possible item response; (6) analyze the reason the data are missing.

    32. Scaling is the assignment of objects to numbers according to a rule. Types of scaling: Likert, Thurstone, Bogardus Social Distance, Semantic Differential, Guttman.

    33. Likert scaling assigns the construct a value based on a bipolar response set (e.g., level of agreement, approval, etc.). Also known as summated-rating or additive scales. E.g., "Capital punishment should be reinstated." ____ Strongly agree ____ Agree ____ Neutral ____ Disagree ____ Strongly disagree. Five-point Likert scales are generally preferred; more points are possible, but there is little reason to go above 7 given the limits on reliability, and responses can always be collapsed. Some prefer an even number of points to force a decision.

    34. Multiple items can be combined into an index, e.g., by summing ten different questions together. Code the responses numerically: strongly agree = 5, agree = 4, neutral = 3, disagree = 2, strongly disagree = 1. Very important: data from Likert scaling are ordinal, not interval. Implication: statistical analyses that assume interval data do not strictly apply. There is also no single way to code the data – strongly agree = 100, agree = 50, neutral = 25, disagree = 10, strongly disagree = 1 is an equally legitimate ordinal coding.
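    A minimal coding-and-summing sketch (the items and answers are hypothetical):

```python
CODES = {"strongly agree": 5, "agree": 4, "neutral": 3,
         "disagree": 2, "strongly disagree": 1}

# One respondent's answers to a three-item scale.
answers = ["agree", "strongly agree", "neutral"]
total = sum(CODES[a] for a in answers)
print(f"summated score: {total}")  # 4 + 5 + 3 = 12
```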

    35. Summing items – the questions included in the scale must be chosen carefully. Requirements: the items must address more or less the same concept (check with the item-whole correlation); the items need to have the same level of importance to the respondent (e.g., the items in a restaurant-opinion survey); and do not ask the same question repeatedly – that only inflates the statistics.

    36. When coding, be careful about the direction of the question. E.g.: 1. "I feel like I make a useful contribution at work" – SD, D, N, A, SA; 2. "At times I feel incompetent at my job" – SD, D, N, A, SA. The second item runs in the opposite direction and requires reverse coding before summing.
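    On a k-point scale, reverse coding maps a score s to (k + 1) - s; a short sketch:

```python
def reverse(score: int, points: int = 5) -> int:
    """Reverse-code a response on a `points`-point scale."""
    return (points + 1) - score

# "Strongly agree" (5) to a reverse-keyed item becomes 1 before summing.
print([reverse(s) for s in [5, 4, 3, 2, 1]])  # [1, 2, 3, 4, 5]
```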

    37. Thurstone scaling – also known as the method of equal-appearing intervals; generally used to assess attitudes. First, generate a large number of statements about the topic of interest (at least 100), e.g., attitudes toward RSU. Second, have many judges (around 100) rate each statement on a scale of 1 to 11 according to how favorable it is toward the concept (1 = least favorable, 11 = most favorable). Important – the judges aren't answering the statements, they are evaluating them.

    38. Third, analyze the rating data: calculate the median of the judges' ratings for each item, along with the variability. Some statements will be very favorable, some very unfavorable. E.g., "RSU has small class sizes" (11); "RSU only offers a few bachelor's degrees" (1). Fourth, choose statements from across all 11 median values, reducing the questionnaire to about 20 items. Fifth, administer the scale, e.g., "RSU has small class sizes: agree or disagree"; it can also be combined with a Likert response format.
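    A sketch of the third step, with invented judge ratings; items with a small interquartile range (high judge agreement) and medians spread across the 1-11 range would be kept:

```python
import numpy as np

# Hypothetical favorability ratings: rows = items, columns = judges.
ratings = np.array([
    [10, 11, 10,  9, 11],  # "RSU has small class sizes"
    [ 2,  1,  3,  1,  2],  # "RSU only offers a few bachelor's degrees"
    [ 6,  9,  2, 11,  4],  # an ambiguous item: judges disagree widely
])

medians = np.median(ratings, axis=1)
iqr = np.percentile(ratings, 75, axis=1) - np.percentile(ratings, 25, axis=1)
for m, s in zip(medians, iqr):
    print(f"median = {m:.0f}, IQR = {s:.0f}")  # drop the high-IQR item
```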

    39. Thurstone scaling allows for analysis of inter-item agreement among the judges and for the identification of homogeneous items. However, it is very time-consuming and costly, and the result is no more reliable than a Likert scale.

    40. Bogardus Social Distance Scale – used to measure the social distance between groups of people (e.g., ethnic closeness, religious closeness, prejudice). Participants respond to a series of ordered statements, ranging from threatening to nonthreatening.

    41. E.g., attitude toward homosexuals:
    - Would marry
    - Would have as regular friends
    - Would work beside in an office
    - Would have several families in my neighborhood
    - Would have merely as speaking acquaintances
    - Would have live outside my neighborhood
    - Would have live outside my country
    E.g., attitudes toward the mentally ill.
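    One common scoring convention (an assumption here, not stated on the slide) records the most intimate relationship the respondent accepts, with 1 = closest:

```python
# Statements ordered from most to least intimate (1 = would marry, ...,
# 7 = would have live outside my country). Lower score = less distance.
levels = ["would marry", "regular friends", "work beside in office",
          "families in my neighborhood", "speaking acquaintances",
          "outside my neighborhood", "outside my country"]

# Hypothetical respondent: accepts everything except the closest level.
accepted = set(levels[1:])

score = next(i + 1 for i, lvl in enumerate(levels) if lvl in accepted)
print(f"social distance score: {score}")  # 2
```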

    42. Social distance can also be related to other variables, e.g., level of education, geographical location, etc. E.g., Angermeyer and Matschinger (1997) studied perceptions of mental illness in Germany: alcoholics drew more social distance than schizophrenics, and personal experience with someone with mental illness reduced social distance.

    43. Semantic Differential Scaling provides an indirect measure of how a person feels about a concept, object, or other person. It measures feelings through adjectives: humans use adjectives to communicate feelings, and adjectives tend to have polar opposites (hot/cold, tall/short). These opposing adjective pairs are used to create a rating scale.

    45. Three main uses of adjectives, i.e., three dimensions of attitudes: evaluation (good-bad, pleasant-unpleasant, kind-cruel), potency (strong-weak, thick-thin, hard-soft), and activity (active-passive, slow-fast, hot-cold). The semantic differential scale has multiple uses.

    46. Guttman scaling – also known as cumulative scaling: an evaluation of the data after they are collected, meant to determine whether a cumulative relationship exists within a group of items. The items are arranged such that a person who agrees with an extreme item will also agree with all less extreme items.

    47. Example: 1. Slapping a child's hand is an appropriate way to teach the meaning of "No!" 2. Spanking is sometimes necessary. 3. Sometimes discipline requires using a belt or paddle. 4. Some children need a good beating to keep them in line. (Source: Monette et al., 1994.)
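    In a perfect Guttman pattern, responses ordered from the mildest to the most extreme item look like a run of agreements followed by a run of disagreements; a tiny check:

```python
def is_cumulative(pattern: list[int]) -> bool:
    """True if the 1s (agreements) all precede the 0s (disagreements)."""
    return sorted(pattern, reverse=True) == pattern

print(is_cumulative([1, 1, 0, 0]))  # True: agrees with items 1-2 only
print(is_cumulative([1, 0, 1, 0]))  # False: violates the cumulative order
```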

    48. Factor analysis – a statistical method for determining unidimensionality: is the test measuring more than one concept? It analyzes the pattern of responding across the items (e.g., factor analysis and intelligence). It tests unidimensionality and can allow for the evaluation of constructs.
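    A hedged sketch of the idea using scikit-learn's FactorAnalysis on simulated data, where three items share one underlying trait and a fourth does not (the data and loadings are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))            # one latent construct
loadings = np.array([[0.9, 0.8, 0.7, 0.1]])  # item 4 barely loads on it
items = trait @ loadings + rng.normal(scale=0.5, size=(200, 4))

fa = FactorAnalysis(n_components=1).fit(items)
print(np.round(fa.components_, 2))  # item 4's loading is near zero:
                                    # it likely measures a different concept
```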
