Test Validity S-005
Validity of measurement • Reliability refers to consistency • Are we getting something stable over time? • Internally consistent? • Validity refers to accuracy • Is the measure accurate? • Are we really measuring what we want?
Important distinction! The term “validity” is used in two different ways • Validity of an assessment or method of collecting data • The validity of a test or questionnaire or interview • Validity of a research study • Was the entire study of high quality? • Did it have high internal and external validity?
Important distinction! The term “validity” is used in two different ways • Referring to entire studies or research reports: • OK: “We examined the internal validity of the study.” • OK: “We looked for the threats to validity.” • OK: “That study involved randomly assigning students to groups, so it had strong internal validity, but it was carried out in a special school, so it is weak on external validity.” • Referring to a test or questionnaire or some assessment: • OK: “The test is a widely used and well-validated measure of student achievement.” • OK: “The checklist they used seemed reasonable, but they did not present any information on its reliability or validity.” • NOT: “The test lacked internal validity.” (This sounds very strange to me.)
Types of validity • Validity – the extent to which the instrument (test, questionnaire, etc.) is measuring what it intends to measure • Examples: • Math test • is it covering the right content and concepts? • is it also influenced by reading level or background knowledge? • Attitude assessment • are the questions appropriate? • does it assess different dimensions of attitudes (intensity, direction, etc.)? • Validity is also assessed in a particular context • A test may be valid in some contexts and not in others • A questionnaire may be useful with some populations and not so useful with other groups • NOT: “The test has high validity.” • OK: “The test has been useful in assessing early reading skills among native speakers of English.”
Types of validity • Content validity • The extent to which the items reflect a specific domain of content • Is the sample of items really representative? • Often a matter of judgment • Experts may be asked to rate the relevance and appropriateness of the items or questions • e.g., rate each item: very important / nice to know / not important • “Face validity” refers to whether the items appear to be valid (to the test taker or test user)
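One simple way to summarize expert ratings like these is the share of experts who rate each item "very important". The sketch below is only an illustration: the item names and ratings are hypothetical, and the summary statistic is an assumption, not a procedure prescribed in these slides.

```python
from collections import Counter

# Hypothetical ratings from five experts per item (illustrative data only).
ratings = {
    "item_1": ["very important", "very important", "very important",
               "nice to know", "very important"],
    "item_2": ["nice to know", "not important", "nice to know",
               "very important", "not important"],
}


def proportion_very_important(item_ratings):
    """Return the share of experts who rated the item 'very important'."""
    if not item_ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(item_ratings)
    return counts["very important"] / len(item_ratings)


for item, item_ratings in ratings.items():
    print(item, proportion_very_important(item_ratings))
```

Items with a low share would be candidates for revision or removal, but where to draw that line remains a matter of judgment.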
Types of validity Criterion-related validity • Concurrent validity • agreement with a separate measure • common in educational assessments • e.g., Bayley Scales and S-B IQ test • Complete version and screening test version • Issue: Is there really a strong existing measure, a “gold standard” we can use for validating a new measure? • Predictive validity • agreement with some future measure • SAT scores and college GPA • GRE scores and graduate school performance
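"Agreement" with a criterion measure is commonly summarized as a correlation between the two sets of scores. Here is a minimal sketch using a Pearson correlation. The scores are hypothetical, and the choice of Pearson's r is an assumption (other agreement statistics may suit other kinds of data).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a new screening score and an established criterion score
# for the same six children (illustrative values only).
new_measure = [12, 15, 9, 20, 17, 11]
criterion = [48, 55, 40, 70, 61, 50]

print(round(pearson_r(new_measure, criterion), 2))
```

For concurrent validity both measures are collected at about the same time. For predictive validity the criterion is collected later (for example, SAT now and college GPA afterward). The computation is the same.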
Types of validity (cont.) Construct validity • Does the measure appear to produce results that are consistent with our theories about the construct? • Example: We have a “stage-model” of development, so does our measure produce scores/results that look like “stages”? • Convergent validity • Does our measure converge or agree with other measures that should be similar? And . . . • Discriminant validity • Does our measure disagree (or diverge) where it should be different?
McCarthy Screening test example • A test for pre-school children (ages 2.5 – 8.5 years) • Six subtests: • Verbal, perceptual-performance, quantitative, general cognitive (composite), memory, motor • Reliability evidence for using a short version as a screening test • Split-half correlations for several scales (r = .60 to .80) • Test-retest reliability for other scales (on a subset of children) showed a range of correlations, from .32 to .70.
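A split-half correlation can be sketched as follows: split the items into two halves, total each half per child, and correlate the two totals. The item responses below are hypothetical, and the odd/even split is one common choice, not necessarily the one used for the McCarthy data. The Spearman-Brown step-up is not mentioned on the slide. It is included as an assumption because it is the usual way to estimate full-length reliability from a half-test correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_r(responses):
    """Correlate totals on odd-position items with totals on even-position items."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


# Hypothetical data: rows are children, columns are items scored 0 or 1.
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
]

r_half = split_half_r(responses)
print(round(r_half, 2), round(spearman_brown(r_half), 2))
```

Whether the correlations of .60 to .80 reported on the slide were already stepped up in this way is not stated.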
McCarthy Scales of Children’s Abilities • Reliability • The internal consistency coefficients for the General Cognitive Index (GCI) averaged .93 across 10 age groups between 2.5 and 8.5 years. • Test-retest reliability of the GCI over a one-month interval was .80. Stability coefficients of the cognitive scales ranged from .62 to .76, with the Motor Scale emerging as the only scale that lacked stability (r = .33).
A short version developed as a screening test Validity information for a short version • A sample of 60 children with learning disabilities • On the full version of the test • 53 out of 60 (88%) failed at least 2 of the 6 subtests • On the short version (the proposed screening version) • 40 out of 60 (67%) failed (and would be identified) • Is this enough information?
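The percentages above can be checked with a few lines of arithmetic. The function name is hypothetical, and the counts are the ones given on the slide.

```python
def identification_rate(identified, total):
    """Proportion of the sample that the test flags."""
    if total <= 0 or not 0 <= identified <= total:
        raise ValueError("counts must satisfy 0 <= identified <= total and total > 0")
    return identified / total


full = identification_rate(53, 60)   # full version: failed at least 2 of 6 subtests
short = identification_rate(40, 60)  # proposed screening version

print(f"full version:  {full:.0%}")
print(f"short version: {short:.0%}")

# The slide does not say which children each version flagged, so only a lower
# bound is available: at least 53 - 40 = 13 of the children flagged by the full
# version were not flagged by the short version.
print("missed by short version (at least):", 53 - 40)
```

Both rates describe only children already known to have learning disabilities. The sample includes no children without learning disabilities, so these numbers cannot show how often the screening version would flag a child who does not need follow-up.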