Chapter 7 Evaluating What a Test Really Measures

Validity • APA – Standards for Educational and Psychological Testing (1985) – Recognized three ways of deciding whether a test is sufficiently valid to be useful.

Validity: Does the test measure what it claims to measure? The appropriateness with which inferences can be made on the basis of test results.

Validity • There is no single type of validity appropriate for all testing purposes. • Validity is not a matter of all or nothing, but a matter of degree.

Types of Validity • Content • Criterion-Related (concurrent or predictive) • Construct • Face

Content Validity • Whether items (questions) on a test are representative of the domain (material) that should be covered by the test. • Most appropriate for tests such as achievement tests (i.e., tests of concrete attributes).

Content Validity Guiding Questions: • Are the test questions appropriate and does the test measure the domain of interest? • Does the test contain enough information to cover appropriately what it is supposed to measure? • What is the level of mastery at which the content is being assessed? ***NOTE – Content validity does not involve statistical analysis.

Obtaining Content Validity Two ways: • Define the testing universe and administer the test. • Have experts rate how essential each question is (1 = essential; 2 = useful, but not essential; 3 = not necessary). A question is considered valid if more than half of the experts rate it “essential”.
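
The expert-rating rule above can be sketched in a few lines of Python. This is only an illustration: the function name `essential_items`, the item labels, and the ratings are hypothetical, and the rule coded here is just the "more than half rate it essential" criterion stated on the slide.

```python
def essential_items(ratings_by_item):
    """Return the items that more than half of the experts rated 1 (essential).

    ratings_by_item maps an item label to a list of expert ratings, where
    1 = essential, 2 = useful but not essential, 3 = not necessary.
    """
    kept = []
    for item, ratings in ratings_by_item.items():
        n_essential = sum(1 for rating in ratings if rating == 1)
        # Strictly more than half: an exact 50/50 split does not qualify.
        if n_essential > len(ratings) / 2:
            kept.append(item)
    return kept


# Hypothetical ratings from a small expert panel.
panel_ratings = {
    "Q1": [1, 1, 1, 2, 3],  # 3 of 5 experts say essential
    "Q2": [1, 2, 2, 3, 3],  # 1 of 5 experts say essential
    "Q3": [1, 1, 2, 2],     # exactly half, so not retained
}
retained = essential_items(panel_ratings)
print(retained)
```

Only counting is involved, which is consistent with the note that content validity does not rest on statistical analysis.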

Defining the Testing Universe • What is the body of knowledge or behaviors that the test represents? • What are the intended outcomes (skills, knowledge)?

Developing A Test Plan Step 1 • Define testing universe • Locate theoretical or empirical research on the attribute • Interview experts Step 2 • Develop test specifications • Identify content areas (topics to be covered in test) • Identify instructional objectives (what one should be able to do with these topics) Step 3 • Establish a test format Step 4 • Construct test questions

Attributes • Concrete Attributes: attributes that can be described in terms of specific behaviors; e.g., ability to play piano, do math problems. • Abstract Attributes: more difficult to describe in terms of behaviors because people might disagree on what the behaviors represent; e.g., intelligence, creativity, personality.

Chapter 8 Using Tests to Make Decisions: Criterion-Related Validity

What is a criterion? • This is the standard by which your measure is being judged or evaluated. • The measure of performance that is correlated with test scores. • An evaluative standard that can be used to measure a person’s performance, attitude, or motivation.

Two Ways to Demonstrate Criterion-Related Validity • Predictive Method • Concurrent Method

Criterion-Related Validity • Predictive validity – correlating test scores with later behavior on the criterion, measured after examinees have had a chance to exhibit the predicted behavior; e.g., success on the job.

Concurrent validity – correlating test scores with an independent, currently available measure of the same trait that the test is designed to measure. Alternatively, showing that the test distinguishes between groups known to be different; i.e., the groups have significantly different mean scores on the test.

Examples of Concurrent Validity • E.g. 1: Teachers’ ratings of reading ability validated by correlating them with reading test scores. • E.g. 2: An index of self-reported delinquency validated by comparing responses with official police records on the respondents.

In both predictive and concurrent validity, we validate by comparing scores with a criterion (the standard by which your measure is being judged or evaluated). • Most appropriate for tests that claim to predict outcomes. • Evidence of criterion-related validity depends on empirical or quantitative methods of data analysis.

Example of How To Determine Predictive Validity • Give test to applicants for a position. • For all those hired, compare their test scores to supervisors’ ratings after 6 months on the job. • The supervisors’ ratings are the criterion. • If test scores correlate strongly and positively with the supervisors’ ratings, then the predictive validity of the test is supported.
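
As a rough illustration of the comparison in this example, the sketch below computes a Pearson correlation between test scores and supervisors' ratings. The data are invented for illustration, and `pearson_r` is a hand-written helper rather than a function from any particular library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: test scores at hiring (predictor) and supervisors'
# ratings six months later (criterion) for the same six employees.
test_scores = [55, 60, 65, 70, 75, 80]
supervisor_ratings = [2, 3, 3, 4, 4, 5]

validity_coefficient = pearson_r(test_scores, supervisor_ratings)
print(round(validity_coefficient, 2))
```

A coefficient near 1 would support predictive validity; a coefficient near 0 would not. Six cases are far too few for a real validation study and are used here only to keep the arithmetic visible.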

Problems with using predictive validity • Restricted range of scores on either predictor or criterion measure will cause an artificially lower correlation. • Attrition of criterion scores; i.e., some folks drop out before you can measure them on the criterion measure (e.g., 6 months later).
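
The restricted-range problem can be seen in a small simulation. The sketch below is an assumption-laden illustration, not a result from the chapter: it generates artificial standardized scores with a built-in correlation of .60, then recomputes the correlation using only the top 20% of test scorers (as if only they had been hired).

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
built_in_r = 0.6         # assumed population correlation (illustrative)
n_applicants = 5000

# Simulated standardized test scores and criterion scores.
test_scores = [rng.gauss(0, 1) for _ in range(n_applicants)]
criterion = [built_in_r * x + sqrt(1 - built_in_r ** 2) * rng.gauss(0, 1)
             for x in test_scores]

full_r = pearson_r(test_scores, criterion)

# Keep only applicants at or above the 80th percentile on the test.
cutoff = sorted(test_scores)[int(0.8 * n_applicants)]
hired = [(x, y) for x, y in zip(test_scores, criterion) if x >= cutoff]
restricted_r = pearson_r([x for x, _ in hired], [y for _, y in hired])

print(round(full_r, 2), round(restricted_r, 2))
```

The correlation in the hired group comes out noticeably lower than in the full applicant pool, even though the underlying relationship is the same.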

Selecting a Criterion • Objective criteria: observable and measurable; e.g., sales figures, number of accidents, etc. • Subjective criteria: based on a person’s judgment; e.g., employee job ratings. Example…

CRITERION MEASUREMENTS MUST THEMSELVES BE VALID! • Criteria must be representative of the events that they are supposed to measure. • e.g., sales ability – not just $ amount, but also # of sales calls made, size of target population, etc. • Criterion Contamination – occurs when the criterion measures more dimensions than those measured by the test.

BOTH PREDICTOR AND CRITERION MEASURES MUST BE RELIABLE FIRST! • E.g., inter-rater reliability obtained by supervisors rating the same employees independently. • Reliability estimates of predictors can be obtained by one of the 4 methods covered in Chapter 6.

Calculating & Estimating Validity Coefficients • Validity Coefficient – Evidence of predictive and concurrent validity is expressed as a correlation coefficient between test scores and the criterion. It represents the amount or strength of criterion-related validity that can be attributed to the test.

Two Methods for Evaluating Validity Coefficients • Test of significance: A process of determining the probability that a validity coefficient of the size calculated would have been obtained by chance. - Requires that you take into account the size of the group (N) from which the data were obtained. - When researchers or test developers report a validity coefficient, they should also report its level of significance. • The coefficient must be demonstrated to be greater than zero. • p < .05. Look up the critical value in a table.
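
One common way to carry out this test of significance is to convert r to a t statistic with N − 2 degrees of freedom and compare it with a tabled critical value. The slides do not name a specific procedure, so treat the sketch below as one standard option; the values r = .30 and N = 50 are hypothetical, and the critical value is taken from a t table rather than computed.

```python
from math import sqrt


def t_for_correlation(r, n):
    """t statistic for testing whether a Pearson r differs from zero.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r**2) with n - 2 degrees of freedom.
    """
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)


r, n = 0.30, 50                 # hypothetical validity coefficient and group size
t = t_for_correlation(r, n)

# Two-tailed critical value for p < .05 with df = 48, looked up in a t table.
critical_t = 2.011
significant = abs(t) > critical_t

print(round(t, 2), significant)
```

The same r with a much smaller N would give a smaller t, which is why the size of the group has to be taken into account.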

Two Methods for Evaluating Validity Coefficients • Coefficient of determination: The amount of variance shared by two variables being correlated, such as test and criterion, obtained by squaring the validity coefficient. r² tells us how much covariation exists between predictor and criterion; e.g., if r = .70, then 49% of the variance is common to both. Likewise, if the correlation (r) is .30, then the coefficient of determination (r²) is .09, which means that the test and criterion have 9% of their variation in common.
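
The arithmetic in the two examples above can be checked with a short Python sketch; the function name `shared_variance_percent` is made up for this illustration.

```python
def shared_variance_percent(r):
    """Coefficient of determination (r squared), expressed as a percentage."""
    return round(100 * r * r, 1)


for r in (0.70, 0.30):
    print(r, shared_variance_percent(r))
```

Note how quickly shared variance falls as r drops: a coefficient less than half as large (.30 versus .70) leaves less than a fifth of the shared variance (9% versus 49%).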

Using Validity Information To Make Predictions • Linear regression: predicting Y from X. • Set a “pass” or acceptance score on Y. • Determine what minimum X score (“cutting score”) will produce that Y score or better (“success” on the job) • Examples…
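
The steps above can be sketched as a least-squares fit followed by solving for the cutting score. All numbers are hypothetical, `fit_line` and `cutting_score` are illustrative helper names, and the sketch assumes a positive slope (higher test scores go with higher criterion scores).

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting Y from X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def cutting_score(intercept, slope, pass_y):
    """Minimum X whose predicted Y reaches pass_y (assumes slope > 0)."""
    if slope <= 0:
        raise ValueError("a cutting score needs a positive slope")
    return (pass_y - intercept) / slope


# Hypothetical data: X = test score, Y = later job-performance rating.
x_scores = [10, 20, 30, 40, 50]
y_ratings = [2.0, 2.5, 3.5, 4.0, 5.0]

intercept, slope = fit_line(x_scores, y_ratings)
pass_y = 4.0                       # the "pass" score set on Y
cut_x = cutting_score(intercept, slope, pass_y)

print(round(intercept, 2), round(slope, 3), round(cut_x, 1))
```

Applicants scoring at or above `cut_x` are predicted to reach the pass score on Y; how often that prediction holds depends on the size of the validity coefficient.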

Outcomes of Prediction Hits: a) True positives - predicted to succeed and did. b) True negatives - predicted to fail and did. Misses: a) False positives - predicted to succeed and didn’t. b) False negatives - predicted to fail and would have succeeded. WE WANT TO MAXIMIZE TRUE HITS AND MINIMIZE MISSES!
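
The four outcomes can be tallied with a short sketch like the one below. The scores, cutting score, and outcomes are hypothetical, and the sketch assumes the criterion is known for everyone, including people predicted to fail, which is often not the case in real selection settings.

```python
from collections import Counter


def classify(predicted_success, actual_success):
    """Label one prediction as a hit or miss of the appropriate kind."""
    if predicted_success and actual_success:
        return "true positive"
    if not predicted_success and not actual_success:
        return "true negative"
    if predicted_success and not actual_success:
        return "false positive"
    return "false negative"


def tally_outcomes(test_scores, cutting_score, succeeded):
    """Count outcomes when scoring at or above the cutting score predicts success."""
    counts = Counter()
    for score, actual in zip(test_scores, succeeded):
        counts[classify(score >= cutting_score, actual)] += 1
    return counts


# Hypothetical test scores and whether each person later succeeded.
scores = [30, 35, 40, 45, 50, 55]
succeeded = [False, True, False, True, True, True]

outcomes = tally_outcomes(scores, 40, succeeded)
hits = outcomes["true positive"] + outcomes["true negative"]
hit_rate = hits / len(scores)

print(dict(outcomes), round(hit_rate, 2))
```

Moving the cutting score trades one kind of miss for the other: raising it tends to reduce false positives and increase false negatives.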

The size of the predictive validity correlation determines the accuracy of prediction.

Chapter 9 Construct Validity

Construct Validity • The extent to which the test measures a theoretical construct. • Most appropriate when a test measures an abstract construct (e.g., marital satisfaction).

What is a construct? • An attribute that exists in theory, but is not directly observable or measurable. (Remember there are 2 kinds of attributes: concrete and abstract.) • We can observe & measure the behaviors that show evidence of these constructs. • Definitions of constructs can vary from person to person. • e.g., self-efficacy • Example…

When some trait, attribute or quality is not operationally defined you must use indirect measures of the construct, e.g., a scale which references behaviors that we consider evidence of the construct. • But how can we validate that scale?

Construct Validity • Evidence of construct validity of a scale may be provided by comparing high vs. low scoring people on behavior implied by the construct, e.g., Do high scorers on the Attitudes Toward Church Going Scale actually attend church more often than low scorers? • Or by comparing groups known to differ on the construct; e.g., comparing pro-life members with pro-choice members on Attitudes Toward Abortion scale.

Construct Validity (cont’d) • Factor analysis also gives you a look at the unidimensionality of the construct being measured; i.e., homogeneity of items. • As does the split-half reliability coefficient. • ONLY ONE CONSTRUCT CAN BE MEASURED BY ONE SCALE!

Convergent Validity • Evidence that the scores on a test correlate strongly with scores on other tests that measure the same construct. • e.g., we would expect two measures of general self-efficacy to yield strong, positive, and statistically significant correlations.

Discriminant Validity • Evidence that the test scores are not correlated with measures of unrelated constructs.
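
A minimal sketch of how convergent and discriminant evidence might be checked: correlate a new scale with an established measure of the same construct and with a measure of an unrelated construct. All three score lists are invented for illustration, and `pearson_r` is a hand-written helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six people on three measures.
new_scale = [10, 12, 14, 16, 18, 20]          # new self-efficacy scale
same_construct = [21, 24, 26, 31, 35, 38]     # established self-efficacy scale
unrelated_construct = [7, 4, 9, 9, 4, 7]      # measure of an unrelated construct

convergent_r = pearson_r(new_scale, same_construct)
discriminant_r = pearson_r(new_scale, unrelated_construct)

print(round(convergent_r, 2), round(abs(discriminant_r), 2))
```

A high `convergent_r` together with a near-zero `discriminant_r` is the pattern that supports construct validity; with real data both coefficients would also need a significance test and a much larger sample.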

Multitrait-Multimethod Method • Searching for convergence across different measures of the same thing and for divergence between measures of different things.

Face Validity • The items look like they reflect whatever is being measured. • The extent to which the test taker perceives that the test measures what it is supposed to measure. • The attractiveness and appropriateness of the test as perceived by the test takers. • Influences how test takers approach the test. • Uses experts to evaluate.

Which type of validity would be most suitable for the following? a) mathematics test b) intelligence test c) vocational interest inventory d) music aptitude test

Discuss the value of predictive validity to each of the following: a) personnel manager b) teacher or principal c) college admissions officer d) prison warden e) psychiatrist f) guidance counselor g) veterinary dermatologist h) professor in medical school
