
Validity


Presentation Transcript


  1. Validity: Does the test cover what we are told (or believe) it covers? To what extent? Is the assessment being used for an appropriate purpose?

  2. Validity Topics: • Definition (usual and refined) • Categories of validity evidence • A. face validity • B. content validity: table of specifications, alignment analysis, opportunity to learn • C. criterion-related validity • D. construct validity • E. consequential validity • Test fairness

  3. Introduction Without good validity, all else is lost. Validity is the most important characteristic of a test or assessment technique. • Usual Definition: • It measures what it purports to measure. • Refined Definition: • It involves the interpretation of a score for a particular purpose or use (because a score may be valid for one use but not another). • It is a matter of degree, not all-or-none. As a practical matter, our concern is to determine the extent (for example, in non-mathematical terms we might say slight, moderate, or considerable).

  4. Some Helpful Terms • Construct: • The trait or characteristic that interests us. We might call it a “target” or “what we want to get at”. We create a test to “cover” this attribute. • Validity addresses how well an assessment technique provides useful information about the construct / target. • Construct underrepresentation: • The test we made is not assessing all of the construct; our test misses things we should be assessing. • Construct irrelevant variance: • The test we made is assessing things that are not really part of our construct; we are assessing irrelevant stuff that we don’t want. [see next two slides for illustrations]

  5. The Construct and Valid Measurement

  6. Varying Degrees of Construct Underrepresentation and Construct Irrelevant Variance

  7. A. Face Validity: Think of the idiom “on the face of it . . .” • A test is said to have face validity if it "looks like" it is going to measure what it is supposed to measure. • Face validity is not empirical; one is saying that the test “appears it will work,” as opposed to saying “it has been shown to work.” • Face validity is often “created” to influence the opinions of participants who are not expert in testing methodologies, e.g. test takers, parents, politicians.

  8. B. Content Validity: Most used in achievement tests and employment exams • Meaning of this type of validity • there is a good match between the content of the test and some well-defined domain of knowledge or behavior. Reference to content defines the orientation of the test. • For teachers, considered the most important type of validity for • your own classroom tests • achievement tests • Where do we find the “well-defined domain”? • Examination of textbooks in the field, with special attention to the learning objectives at the beginning of a chapter and the terms at the end • Curriculum guides of school districts • Ohio’s Academic Content Standards. So, now that we have the content topics identified, what should we actually expect “students to know and be able to do” in relation to these topics? This question deals with “process” or “depth” indicators. How should we make sure we include both the content and the depth expected in our tests?

  9. The Table of Specifications: Building content validity into classroom tests • Table of Specifications – this connects the content determined earlier to the mental processes students are expected to employ regarding this content • Two-way table • Content • Bloom’s taxonomy (simplest mental operation to the most complex) • Each test item I create then falls into one cell • By creating the table, I can see the relative weight assigned to each cell. Is this what I want? (A small sketch of such a table follows below.)
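To make the idea concrete, here is a minimal Python sketch (not part of the original slides) of a table of specifications. The content areas, Bloom's levels, and item tags are hypothetical examples; the point is only that each planned item is assigned to one cell and the relative weights become visible.

```python
# A minimal sketch of a table of specifications: rows are content areas,
# columns are Bloom's taxonomy levels, cells count the items planned.
# All content areas, levels, and item counts here are hypothetical.
from collections import Counter

bloom_levels = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
content_areas = ["Fractions", "Decimals", "Percents"]

# Each planned test item is tagged with (content area, Bloom level).
items = [
    ("Fractions", "Remember"), ("Fractions", "Apply"), ("Fractions", "Apply"),
    ("Decimals", "Understand"), ("Decimals", "Apply"),
    ("Percents", "Understand"), ("Percents", "Analyze"),
]

cell_counts = Counter(items)
total = len(items)

# Print the table with relative weights so imbalances are easy to spot.
print("Content".ljust(12) + "".join(level.ljust(12) for level in bloom_levels))
for area in content_areas:
    row = area.ljust(12)
    for level in bloom_levels:
        n = cell_counts[(area, level)]
        row += f"{n} ({n / total:.0%})".ljust(12)
    print(row)
```

Scanning the printed weights against the instructional emphasis is the question the slide raises: is this distribution what I want?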

  10. Alignment Analysis: Checking content validity in existing tests • These steps parallel building your own good test and constructing its table of specifications. There are some things to watch for and consider as you do this: • Be wary of using the summary outline provided by the test maker; examine the actual test items • Match items on the test with the content you are teaching; watch for mismatches • Items on the test you are not teaching • Content you are teaching that is not tested • This matching requires considerable judgment • The test does not have to cover every detail; it could be a representative sample • If stakes are high, use a panel of individuals

  11. Opportunity to Learn: But was it taught . . . An emerging idea related to content validity is a concern called instructional validity. This relates to your behavior as teacher. The content may be in the book; the content may be in the state standards . . . BUT . . . did you actually teach it? Some teachers skip items of instruction they don’t like, don’t understand, or don’t have time for. If related items appear on a test, this would reduce the validity of the test since the students had no opportunity to learn the knowledge or skill being assessed.

  12. C. Criterion-Related Validity: While the term “test” is used, also think “measure” or “procedure” • The basic idea – to demonstrate the degree of accuracy of a test by comparing it with another “test, measure or procedure which has been demonstrated to be valid” (i.e. a valued criterion). • Two general contexts • predictive validity – one measure is administered now, the other later. The later test is known to be valid. This approach allows me to show my current test is valid by comparing it to a future valid test. • For example, a behind-the-wheel driving test has been shown to be an accurate test of driving skills. By comparing the scores on a written rules-of-the-road test with the scores from the driving test, the written test can be validated by using a criterion-related strategy. • concurrent validity – both measures are current. This approach allows me to show my test is valid by comparing it with an already valid test. I can do this if I can show my test varies directly with a measure of the same construct or inversely with a measure of an opposite construct. • The computed statistic in both cases is “r” (which we now call a validity coefficient), and it has all the characteristics we have already discussed about correlation coefficients in general. (A small computational sketch follows below.)
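As an illustration of the validity coefficient, here is a minimal Python sketch (not from the slides) that correlates scores on a new test with scores on an already-validated criterion measure. The scores are invented purely for illustration.

```python
# A minimal sketch of a concurrent criterion-related validity check:
# correlate scores on the new test with scores on an established criterion
# measure from the same examinees. The scores below are invented.
import statistics

def pearson_r(x, y):
    """Pearson correlation; here it serves as the validity coefficient."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

new_test  = [12, 15, 9, 20, 17, 11, 18, 14]   # scores on the test being validated
criterion = [45, 52, 38, 70, 61, 41, 66, 50]  # scores on the already-valid measure

print(f"validity coefficient r = {pearson_r(new_test, criterion):.2f}")
```

The same computation applies to the predictive case; only the timing of the criterion measurement differs.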

  13. Special Considerations for Interpreting Criterion-Related Validity • Group Variability • The greater the variability, the greater the “r”. • Reliability-Validity Relationship • Reliability limits validity; reliability is a prerequisite to validity • Validity of the Criterion • How good is the criterion? Do you agree with the operational definition of the criterion? (A simulation of the group-variability effect follows below.)
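The group-variability point can be illustrated by simulation: computing r for a full group and again for a score-restricted subgroup typically shrinks the coefficient. The sketch below uses simulated data and is only an illustration of the principle, not anything reported in the slides.

```python
# A sketch of how restricted group variability attenuates a validity
# coefficient: r is computed for the full group and for a subgroup whose
# test scores fall in a narrow (top) range. All data are simulated.
import numpy as np

rng = np.random.default_rng(0)
true_ability = rng.normal(0, 1, 500)
test_scores  = true_ability + rng.normal(0, 0.5, 500)   # new test with some error
criterion    = true_ability + rng.normal(0, 0.5, 500)   # valid criterion measure

r_full = np.corrcoef(test_scores, criterion)[0, 1]

# Restrict to examinees who scored in the top third on the test
# (e.g. only admitted applicants are followed up on the criterion).
cutoff = np.quantile(test_scores, 2 / 3)
mask = test_scores >= cutoff
r_restricted = np.corrcoef(test_scores[mask], criterion[mask])[0, 1]

print(f"r, full group:       {r_full:.2f}")
print(f"r, restricted group: {r_restricted:.2f}  (smaller, as expected)")
```

The measurement error added to both scores also hints at the reliability point: an unreliable test or criterion puts a ceiling on how large r can be.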

  14. D. Construct Validity • When we ask about a test’s construct validity, we are taking a broad view of the test. Does the test adequately measure the underlying, unobserved construct?  The question is asked both in terms of • convergent validity: are test scores related to the behaviors and tests they should be related to? and • divergent validity: are test scores unrelated to the behaviors and tests they should be unrelated to? • There is no single measure of construct validity.  Construct validity is based on the accumulation of knowledge about the test and its relationship to other tests and behaviors. • To establish construct validity, we demonstrate that the measure changes in a logical way when other conditions change. (A small sketch of convergent and divergent correlations follows below.)
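A minimal sketch of gathering convergent and divergent evidence, again with simulated data; the "new anxiety scale," the established anxiety measure, and the vocabulary test are hypothetical stand-ins, not instruments from the slides.

```python
# A sketch of convergent/divergent evidence: correlate a new scale with a
# measure it should track (convergent) and one it should not (divergent).
# All scores are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 300
anxiety_trait   = rng.normal(size=n)
new_scale       = anxiety_trait + rng.normal(scale=0.6, size=n)
established_anx = anxiety_trait + rng.normal(scale=0.6, size=n)  # should converge
vocabulary_test = rng.normal(size=n)                             # should diverge

print("convergent r:", round(np.corrcoef(new_scale, established_anx)[0, 1], 2))
print("divergent  r:", round(np.corrcoef(new_scale, vocabulary_test)[0, 1], 2))
```

A high convergent r together with a near-zero divergent r is one piece of the accumulating evidence the slide describes; no single number settles the question.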

  15. E. Consequential Validity: A recent, controversial entry into the assessment lexicon . . . • Some professionals feel that, in the real world, the consequences that follow from the use of assessments are important indications of validity. • Some professionals feel that these consequences are matters of politics and policymaking; important considerations, yes, but not matters of validity. • On which side are we? As educators, we sometimes see the consequences as more important than the technical validity of the test. Judgments based on assessments we give and use have value implications and social consequences. • What is the intended use of these test scores? • How are the scores really being used? • Does this testing lead to educational benefits? • Are there negative spin-offs?

  16. Test Fairness, Test Bias • Test fairness and test bias refer to the same issue, with opposite connotations • Fairness – an assessment or test measures a trait, construct, or target with equal validity for different groups. • Bias – the groups do not differ in terms of real status on the trait, construct, or target being assessed; yet the test suggests they do.

  17. Methods of Reviewing Fairness • Test Companies: (look in the test manual to see what a particular company did about test fairness issues on this test) • Panel review – most “popular,” but is this just face validity? • Differential item functioning (DIF) – subsets (a rough sketch of the DIF idea follows below) • Criterion-related validity – whole test • Teacher-Created Assessments: (teachers need to be knowledgeable about, and sensitive to, issues of test fairness) • Is there anything about my test that will unfairly advantage or disadvantage a student or group of students? • Is there anything about the mechanics of the test that calls for skills other than those I intend to measure?
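A rough sketch of the idea behind a DIF check, using simulated responses (the groups, the item, and the matching score below are all hypothetical): compare how often two groups answer an item correctly after matching examinees on a proxy for overall ability. This is a simplified illustration, not the Mantel-Haenszel procedure a test company would actually report.

```python
# A rough sketch of differential item functioning (DIF): within strata of a
# matching score, compare the proportion correct on one item for a reference
# group and a focal group. Large within-stratum gaps suggest possible bias.
# All responses below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
group = rng.integers(0, 2, n)            # 0 = reference group, 1 = focal group
ability = rng.normal(size=n)
# Stand-in for the matched total score, simulated from ability for simplicity.
total_score = np.clip((ability * 5 + 15).round(), 0, 30)

# Simulate one item that is harder for the focal group at equal ability (DIF).
p_correct = 1 / (1 + np.exp(-(ability - 0.7 * group)))
item_correct = rng.random(n) < p_correct

# Compare proportion correct within total-score strata.
for lo, hi in [(0, 10), (10, 20), (20, 31)]:
    stratum = (total_score >= lo) & (total_score < hi)
    ref = item_correct[stratum & (group == 0)].mean()
    foc = item_correct[stratum & (group == 1)].mean()
    print(f"scores {lo:2d}-{hi-1:2d}: reference {ref:.2f}  focal {foc:.2f}")
```

If the focal group lags the reference group even among examinees with comparable total scores, the item, not the examinees, becomes the suspect.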

  18. Practical Advice • For building your own tests, think content validity. • For judging externally prepared achievement tests, start with a clear definition of what’s to be covered. • For criterion-related validity, take into account group variability, and think about the validity of the criterion. • For test fairness (bias), distinguish between differences in groups’ average scores and group status on the trait. • For your own assessments, try to eliminate the influence of any factors not related to what you want to measure.
