Packing and Unpacking Sources of Validity Evidence: History Repeats Itself Again

Stephen G. Sireci, University of Massachusetts Amherst • Presentation for the conference “The Concept of Validity: Revisions, New Directions, & Applications” • October 9, 2008 • University of Maryland, College Park

Validity • A concept that has evolved and is still evolving • The most important consideration in educational and psychological testing • Simple, but complex • Can be misunderstood • Disagreements regarding what it is, and what is important

Purposes of this presentation • Provide some historical context on the concept of validity in testing • Present current, consensus definitions of validity • Describe the validation framework implied in the Standards • Discuss limitations of current framework • Suggest new directions for validity research and practice

Packing and unpacking: A prelude

Validity defined • What is validity? • How have psychometricians come to define it?

What does valid mean? • Truth? • According to Webster's, Valid: 1. having legal force; properly executed and binding under the law. 2. sound; well grounded on principles or evidence; able to withstand criticism or rejection. 3. effective, effectual, cogent. 4. robust, strong, healthy (rare).

What is validity? • According to Webster's, Validity: 1. the state or quality of being valid; specifically, (a) strength or force from being supported by fact; justness; soundness; (b) legal strength or force. 2. strength or power in general. 3. value (rare).

How have psychometricians defined validity? • Some History

In the beginning

In the beginning • Modern measurement started at the turn of the 20th century • 1905: Binet-Simon scale • 30-item scale designed to ensure that no child could be denied instruction in the Paris school system without formal examination • Binet died in 1911 at age 54

Note • College Board was established in 1900 • Began essay testing in 1901

What else was happening around the turn of the century? • 1896: Karl Pearson, Galton Professor of Eugenics at University College, published the formula for the correlation coefficient

Given the predictive purpose of Binet’s test, • interest in heredity and individual differences, • and a new statistical formula relating variables to one another • validity was initially defined in terms of correlation

Earliest definitions of Validity • “Valid scale” (Thorndike, 1913) • A test is valid for anything with which it correlates • Kelley (1927); Thurstone (1932); Bingham (1937); Guilford (1946); others • Validity coefficients • correlations of test scores with grades, supervisor ratings, etc.
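To make the idea of a validity coefficient concrete, the sketch below computes a Pearson correlation between test scores and a criterion. The function name and the score and grade values are made up for illustration; they are not data from any study cited in this presentation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: six examinees' test scores and their later course grades
test_scores = [52, 61, 70, 48, 85, 77]
course_grades = [2.1, 2.8, 3.0, 2.0, 3.9, 3.4]

# Under the early definition, this single number is the test's "validity"
validity_coefficient = pearson_r(test_scores, course_grades)
print(round(validity_coefficient, 3))
```

Note that the coefficient is only as trustworthy as the criterion itself (here, grades), which is the weakness the later slides on "validity = correlation" take up.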

Validation started with group tests • 1917: Army Alpha and Army Beta • (Yerkes) • Classification of 1.5 million recruits • Borrowed items and ideas from Otis Tests • Otis was one of Terman’s graduate students

Military Testing • Tests were added to or removed from batteries based solely on correlational evidence (e.g., increase in R²). • How well does the test predict a pass/fail criterion several weeks later? • Jenkins (1946) and others emerged in response to problems with the notion that validity = correlation • See also Pressey (1920)
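The "increase in R²" rule can be illustrated with a small sketch. It assumes the simplest case, a battery of one test to which a second test is added, and uses the standard two-predictor formula for the squared multiple correlation. The function names and correlation values are hypothetical.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two tests.

    r_y1, r_y2: correlation of each test with the criterion
    r_12: correlation between the two tests
    """
    if abs(r_12) >= 1:
        raise ValueError("the two tests must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_r_squared(r_y1, r_y2, r_12):
    """Gain in R² from adding test 2 to a battery that already has test 1."""
    return r_squared_two_predictors(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical values: test 1 correlates .50 with the criterion,
# the candidate test correlates .40, and the two tests correlate .30.
gain = incremental_r_squared(0.5, 0.4, 0.3)
print(round(gain, 4))

# A candidate test that is redundant with test 1 (r_y2 = r_y1 * r_12)
# adds essentially nothing, even though it correlates with the criterion.
redundant_gain = incremental_r_squared(0.5, 0.3, 0.6)
print(round(redundant_gain, 4))
```

A decision rule of this kind never looks at what the added test contains, only at how much prediction improves, which is the purely correlational practice the critics objected to.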

Problems with the notion that validity = correlation • Finding criterion data • Establishing reliability of criterion • Establishing validity of criterion • If valid, measurable criteria exist, why do we need the test?

What did critics of correlational evidence of validity suggest for validating tests? • Professional judgment “...it is proper for the test developer to use his individual judgment in this matter though he should hardly accept it as being on par with, or as worthy of credence as, experimentally established facts showing validity.” • (Kelley, 1927, pp. 30-31)

What did critics of correlational evidence of validity suggest for validating tests? • Appraisal of test content with respect to the purpose of testing (Rulon, 1946) • rational relationship • Sound familiar? • Early notions of content validity • (Kelley, Mosier, Rulon, Thorndike, others) • but notice Kelley’s hesitation in endorsing this evidence, or going against the popular notion

Other precursors to content validity • Guilford (1946): validity by inspection? • Gulliksen (1950): “Intrinsic Validity” • pre/post instruction test score change • consensus of expert judgment regarding test content • examine relationship of test to other tests measuring same objectives • Herring (1918): 6 experts evaluated the “fitness of items”

Development of Validity Theory • By the 1950s, there was consensus that correlational evidence was not enough • and that judgmental data of the adequacy of test content should be gathered • Growing idea of multiple lines of “validity evidence”

Emergence of Professional Standards • Cureton (1951): First “Validity” chapter in first edition of “Educational Measurement” (edited by Lindquist). • Two aspects of validity • Relevance (what we would call criterion-related) • Reliability

Cureton (1951) • Validity defined as “the correlation between actual test scores and true criterion scores” • but: “curricular relevance or content validity” may be appropriate in some situations.

Emergence of Professional Standards • 1952: APA Committee on Test Standards • Technical Recommendations for Psychological Tests and Diagnostic Techniques: A Preliminary Proposal • Four “categories of validity” • predictive, status, content, congruent

Emergence of Professional Standards • 1954: APA, AERA, & NCMUE produced • Technical Recommendations for Psychological Tests and Diagnostic Techniques • Four “types” or “attributes” of validity: • construct validity (instead of congruent) • concurrent validity (instead of status) • predictive • content

1954 Standards • Chair was Cronbach and guess who else was on the Committee? • Hint: A philosopher • Promoted idea of: • different types of validity • multiple types of evidence preferred • some types preferable in some situations

Subsequent Developments • 1955: Cronbach and Meehl • Formally defined and elaborated the concept of construct validity. • Introduced term “criterion-related validity” • 1956: Lennon • Formally defined and elaborated the concept of content validity.

Subsequent Developments • Loevinger (1957): big promoter of construct validity idea. • Ebel (1961…): big antagonist of unified validity theory • Preferred “meaningfulness”

Evolution of Professional Standards • 1966: AERA, APA, NCME Standards for Educational and Psychological Tests and Manuals • Three “aspects” of validity: • Criterion-related (concurrent + predictive) • Construct • Content

1966: Standards • Introduced notion that test users are also responsible for test validity • Specific testing purposes called for specific types of validity evidence. • Three “aims of testing” • present performance • future performance • standing on trait of interest • Important developments in content validation

Evolution of Professional Standards • 1974: AERA, APA, NCME Standards for Educational and Psychological Tests • Validity descriptions borrowed heavily from Cronbach (1971) • Validity chapter in 2nd edition of “Educational Measurement” (edited by R.L. Thorndike)

1974: Standards • Defined content validity in operational, rather than theoretical, terms. • Beginning of notion that construct validity is much cooler than content or criterion-related. • Early consensus of “unitary” conceptualization of validity

Evolution of Professional Standards • 1985: AERA, APA, NCME Standards for Educational and Psychological Testing • note “ing” • Described validity as unitary concept • Notion of validating score-based inferences • Very Messick-influenced

1985 Standards • More responsibility on test users • More standards on applications and equity issues • Separate chapters for • Validity • Reliability • Test development • Scaling, norming, equating • Technical manuals

1985 Standards • New chapters on specific testing situations • Clinical • Educational • Counseling • Employment • Licensure & Certification • Program Evaluation • Linguistic Minorities • “People who have handicapping conditions”

1985 Standards • New chapters on • Administration, scoring, reporting • Protecting the rights of test takers • General principles of test use • Listed standards as • primary, • secondary, or • conditional.

1999 Standards • New “Fairness in Testing” section • No more “primary,” “secondary,” “conditional.” • 3-part organizational structure • Test construction, evaluation, & documentation • Fairness in testing • Testing applications

1999 Standards (2) • Incorporated the “argument-based approach to validity” • Five “Sources of Validity Evidence” • Test content • Response processes • Internal structure • Relations to other variables • Testing consequences • We’ll return to these sources later.

Comparing the Standards: Packing & Unpacking Validity Evidence

What are the current and influential definitions of validity? • Cronbach: Influential, but not current (1971…) • Messick (1989…) • Shepard (1993) • Standards (1999) • Kane (1992, 2006)

Messick (1989): 1st sentence “Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment.” (p. 13)

This “integrated” judgment led Messick, and others, to conclude: All validity is construct validity. It is outside my purpose today to debate the unitary conceptualization of validity, but like all theories, it has strengths and limitations. But two quick points…

Unitary conceptualization of validity • Focuses on inferences derived from test scores • Assumes measurement of a construct motivates test development and purpose • The focus on analysis of scores may undermine attention to content validity • Removal of term “content validity” may have had negative effect on validation practices.

Consider Ebel (1956) “The degree of construct validity of a test is the extent to which a system of hypothetical relationships can be verified on the basis of measures of the construct…but this system of relationships always involves measures of observed behaviors which must be defended on the basis of their content validity” (p. 274).

Consider Ebel (1956) “Statistical validation is not an alternative to subjective evaluation, but an extension of it. All statistical procedures for validating tests are based ultimately upon common sense agreement concerning what is being measured by a particular measurement process” (p. 274).

The 1999 Standards accepted the unitary conceptualization, but also took a practical stance. • The practical stance stems from the use of an argument-based approach to validity. • Cronbach (1971, 1988) • Kane (1992, 2006)

The Standards (1999) succinctly defined validity: “Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.” (p. 9)

Why do I say the Standards incorporated the argument-based approach to validation? “Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use.” (AERA et al., 1999, p. 9)

Kane (1992) “it is not possible to verify the interpretive argument in any absolute sense. The best that can be done is to show that the interpretive argument is highly plausible, given all available evidence” (p. 527).

Kane: Argument-based approach • Decide on the statements and decisions to be based on the test scores. • Specify inferences/assumptions leading from test scores to statements and decisions. • Identify competing interpretations. • Seek evidence supporting inferences and assumptions and refuting counterarguments.
