
Chapter 7 Evaluating What a Test Really Measures


- APA – Standards for Educational and Psychological Testing (1985) – Recognized three ways of deciding whether a test is sufficiently valid to be useful.

Validity: Does the test measure what it claims to measure? The appropriateness with which inferences can be made on the basis of test results.

- There is no single type of validity appropriate for all testing purposes.
- Validity is not a matter of all or nothing, but a matter of degree.

- Content
- Criterion-Related (concurrent or predictive)
- Construct
- Face

- Whether items (questions) on a test are representative of the domain (material) that should be covered by the test.
- Most appropriate for tests such as achievement tests (i.e., tests of concrete attributes)

Guiding Questions:

- Are the test questions appropriate and does the test measure the domain of interest?
- Does the test contain enough information to cover appropriately what it is supposed to measure?
- What is the level of mastery at which the content is being assessed?
***NOTE – Content validity does not involve statistical analysis.

Two ways:

- Define the testing universe and administer the test.
- Have experts rate how "essential" each question is (1 = essential, 2 = useful but not essential, 3 = not necessary). A question is considered valid if more than half of the experts rate it "essential".
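The expert-rating rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming hypothetical ratings from five experts; the function names are invented here, and the ratio in `content_validity_ratio` follows Lawshe's content validity ratio, which the slides do not name.

```python
from collections import Counter

def essential_share(ratings):
    """Fraction of experts who rated the question 1 ("essential")."""
    return Counter(ratings)[1] / len(ratings)

def content_validity_ratio(ratings):
    """Lawshe-style ratio (n_e - N/2) / (N/2).

    Positive exactly when more than half of the experts say "essential".
    """
    n = len(ratings)
    n_essential = Counter(ratings)[1]
    return (n_essential - n / 2) / (n / 2)

# Hypothetical ratings: 1 = essential, 2 = useful but not essential, 3 = not necessary.
expert_ratings = {
    "Q1": [1, 1, 1, 2, 1],
    "Q2": [1, 2, 3, 2, 1],
    "Q3": [3, 3, 2, 1, 2],
}

# Keep only questions that more than half of the experts call essential.
retained = [q for q, r in expert_ratings.items() if essential_share(r) > 0.5]
print(retained)
```

With these made-up ratings only Q1 clears the "more than half" rule, since four of its five raters chose "essential".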

- What is the body of knowledge or behaviors that the test represents?
- What are the intended outcomes (skills, knowledge)?

Step 1

- Define testing universe
- Locate theoretical or empirical research on the attribute
- Interview experts
Step 2

- Develop test specifications
- Identify content areas (topics to be covered in test)
- Identify instructional objectives (what one should be able to do with these topics)
Step 3

- Establish a test format
Step 4

- Construct test questions

Concrete Attributes

Attributes that can be described in terms of specific behaviors.

e.g., ability to play piano, do math problems

Abstract Attributes

More difficult to describe in terms of behaviors because people might disagree on what the behaviors represent

e.g., intelligence, creativity, personality

Chapter 8 Using Tests to Make Decisions: Criterion-Related Validity

- This is the standard by which your measure is being judged or evaluated.
- The measure of performance that is correlated with test scores.
- An evaluative standard that can be used to measure a person’s performance, attitude, or motivation.

- Predictive Method
- Concurrent Method

- Predictive validity – correlating test scores with future behavior on the criterion, measured after examinees have had a chance to exhibit the predicted behavior; e.g., success on the job.

Concurrent validity – correlating test scores with an independent measure of the same trait that the test is designed to measure – currently available.

Or being able to distinguish between groups known to be different; i.e., significantly different mean scores on the test.

E.g. 1: Teachers’ ratings of reading ability validated by correlating them with reading test scores.

E.g. 2: Validate an index of self-reported delinquency by comparing responses to official police records on the respondents.

- In both predictive and concurrent validity, we validate by comparing scores with a criterion (the standard by which your measure is being judged or evaluated).
- Most appropriate for tests that claim to predict outcomes.
- Evidence of criterion-related validity depends on empirical or quantitative methods of data analysis.

- Give test to applicants for a position.
- For all those hired, compare their test scores to supervisors’ rating after 6 months on the job.
- The supervisors’ ratings are the criterion.
- If employees’ test scores correlate with their supervisors’ ratings, then the predictive validity of the test is supported.
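The comparison in this procedure is usually made with a correlation coefficient. The sketch below assumes hypothetical test scores and six-month supervisor ratings for eight hires and computes a Pearson correlation by hand; `pearson_r` and the data are illustrative, not part of the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: selection test scores at hiring (predictor) and
# supervisors' ratings after 6 months on the job (criterion).
test_scores = [52, 61, 45, 70, 66, 58, 49, 75]
supervisor_ratings = [3.1, 3.8, 2.9, 4.2, 3.6, 3.5, 3.0, 4.6]

validity_coefficient = pearson_r(test_scores, supervisor_ratings)
print(round(validity_coefficient, 2))
```

A strongly positive coefficient, as these invented numbers produce, is the pattern that would support predictive validity.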

- Restricted range of scores on either predictor or criterion measure will cause an artificially lower correlation.
- Attrition of criterion scores; i.e., some folks drop out before you can measure them on the criterion measure (e.g., 6 months later).
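The effect of a restricted range can be seen in a small simulation. This is a sketch under stated assumptions: scores are simulated with a true predictor-criterion correlation of 0.6, and "hiring" is modelled as keeping only applicants whose test score exceeds an arbitrary threshold. None of these numbers come from the slides.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(7)   # fixed seed so the run is repeatable
true_r = 0.6     # assumed population correlation

# Simulate standardized test scores (x) and criterion scores (y).
applicants = []
for _ in range(5000):
    x = random.gauss(0, 1)
    y = true_r * x + sqrt(1 - true_r ** 2) * random.gauss(0, 1)
    applicants.append((x, y))

full_r = pearson_r([x for x, _ in applicants], [y for _, y in applicants])

# Restrict the range: only applicants scoring above 0.5 are "hired",
# so criterion data exist only for this high-scoring subgroup.
hired = [(x, y) for x, y in applicants if x > 0.5]
restricted_r = pearson_r([x for x, _ in hired], [y for _, y in hired])

print(round(full_r, 2), round(restricted_r, 2))
```

The correlation computed on the hired subgroup comes out lower than the one computed on all applicants, even though the underlying relationship is unchanged.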

- Objective criteria: observable and measurable; e.g., sales figures, number of accidents, etc.
- Subjective criteria: based on a person’s judgment; e.g., employee job ratings. Example…

CRITERION MEASUREMENTS MUST THEMSELVES BE VALID!

- Criteria must be representative of the events that they are supposed to measure.
- e.g., sales ability – not just the dollar amount sold, but also the number of sales calls made, the size of the target population, etc.

- Criterion Contamination – If the criterion measures more dimensions than those measured by the test.

BOTH PREDICTOR AND CRITERION MEASURES MUST BE RELIABLE FIRST!

- E.g., inter-rater reliability obtained by supervisors rating the same employees independently.
- Reliability estimates of predictors can be obtained by one of the 4 methods covered in Chapter 6.

- Validity Coefficient – Predictive and concurrent validity also represented by correlation coefficients. Represents the amount or strength of criterion-related validity that can be attributed to the test.

- Test of significance: The process of determining the probability that the study would have yielded the calculated validity coefficient by chance.
- Requires that you take into account the size of the group (N) from which the data were obtained.

- When researchers or test developers report a validity coefficient, they should also report its level of significance.

- must be demonstrated to be greater than zero
- p < .05. Look up in table.
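One common way to run this test is to convert r to a t statistic with N - 2 degrees of freedom and compare it with the tabled critical value. The sketch below assumes that approach (the slides only say "look up in table"); the two critical values are standard two-tailed values at p < .05, with the df = 98 value approximate.

```python
from math import sqrt

def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from zero.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
    Not defined for r = 1 or r = -1.
    """
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Two-tailed critical t values at p < .05 (from a standard t table).
CRITICAL_T_05 = {28: 2.048, 98: 1.984}

# The same validity coefficient evaluated with two different group sizes.
r = 0.30
t_small = t_for_r(r, 30)    # df = 28
t_large = t_for_r(r, 100)   # df = 98

print(abs(t_small) > CRITICAL_T_05[28])  # significant with N = 30?
print(abs(t_large) > CRITICAL_T_05[98])  # significant with N = 100?
```

The same coefficient of .30 fails the test with 30 people but passes it with 100, which is why N has to be taken into account.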

- Coefficient of determination: The amount of variance shared by two variables being correlated, such as test and criterion, obtained by squaring the validity coefficient.
r² tells us how much covariation exists between predictor and criterion; e.g., if r = .7, then 49% of the variance is common to both.

E.g., if the correlation (r) is .30, then the coefficient of determination (r²) is .09. (This means that the test and criterion have 9% of their variation in common.)
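To make "shared variance" concrete, the sketch below checks numerically that squaring r gives the same number as the proportion of criterion variance explained by a least-squares line fitted to the predictor. The data and the function name are hypothetical.

```python
from math import sqrt

def r_and_explained_variance(x, y):
    """Return (r, r squared, proportion of variance in y explained by x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)

    # Least-squares line predicting y from x.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    explained = 1 - ss_residual / syy
    return r, r ** 2, explained

# Hypothetical predictor (test) and criterion scores.
test = [10, 12, 15, 18, 20, 23, 25, 30]
criterion = [3, 5, 4, 6, 5, 8, 6, 9]

r, r_squared, explained = r_and_explained_variance(test, criterion)
print(round(r_squared, 3), round(explained, 3))

# The two worked figures from the text.
print(round(0.7 ** 2, 2), round(0.30 ** 2, 2))
```

The two printed values on the first line agree, which is the sense in which r² measures the variance the test and criterion have in common.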

- Linear regression: predicting Y from X.
- Set a “pass” or acceptance score on Y.
- Determine what minimum X score (“cutting score”) will produce that Y score or better (“success” on the job)
- Examples…
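As one possible example, the steps above can be sketched as follows: fit a least-squares line predicting the criterion (Y) from the test score (X), set a pass score on Y, then solve the line for the X value that predicts exactly that Y. The data, the pass score of 3.5, and the function names are assumptions for illustration, and the inversion only makes sense when the slope is positive.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def cutting_score(slope, intercept, y_pass):
    """Minimum X whose predicted Y reaches y_pass (assumes slope > 0)."""
    return (y_pass - intercept) / slope

# Hypothetical validation data: test scores (X) and job ratings (Y).
test_scores = [52, 61, 45, 70, 66, 58, 49, 75]
job_ratings = [3.1, 3.8, 2.9, 4.2, 3.6, 3.5, 3.0, 4.6]

slope, intercept = fit_line(test_scores, job_ratings)
y_pass = 3.5  # assumed "success" threshold on the criterion
x_cut = cutting_score(slope, intercept, y_pass)

print(round(x_cut, 1))
```

Applicants scoring at or above `x_cut` are the ones the regression line predicts will reach the pass score on the job.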

Hits: a) True positives - predicted to succeed and did.

b) True negatives - predicted to fail and did.

Misses: a) False positives - predicted to succeed and didn’t.

b) False negatives - predicted to fail and would have succeeded.

WE WANT TO MAXIMIZE TRUE HITS AND MINIMIZE MISSES!
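The four outcomes can be tallied once a cutting score and a criterion pass score are fixed. The sketch below assumes a validation sample in which criterion data exist for everyone, including people the test would have rejected; the scores and both thresholds are hypothetical.

```python
from collections import Counter

def classify(predicted_success, actual_success):
    """Label one decision as one of the four hit/miss outcomes."""
    if predicted_success and actual_success:
        return "true positive"
    if not predicted_success and not actual_success:
        return "true negative"
    if predicted_success and not actual_success:
        return "false positive"
    return "false negative"

CUTTING_SCORE = 58   # assumed minimum test score to predict success
PASS_RATING = 3.5    # assumed criterion score that counts as success

# Hypothetical (test score, later job rating) pairs.
people = [(62, 4.0), (55, 3.8), (70, 3.1), (48, 2.9), (59, 3.6), (51, 3.7)]

outcomes = Counter(
    classify(score >= CUTTING_SCORE, rating >= PASS_RATING)
    for score, rating in people
)
hit_rate = (outcomes["true positive"] + outcomes["true negative"]) / len(people)

print(dict(outcomes), hit_rate)
```

Raising or lowering `CUTTING_SCORE` trades false positives against false negatives, so the tally is worth recomputing for each candidate cut.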

- The extent to which the test measures a theoretical construct.
- Most appropriate when a test measures an abstract construct (e.g., marital satisfaction)

- An attribute that exists in theory, but is not directly observable or measurable. (Remember there are 2 kinds: concrete and abstract.)
- We can observe & measure the behaviors that show evidence of these constructs.
- Definitions of constructs can vary from person to person.
- e.g., self-efficacy

- Example…

- When some trait, attribute or quality is not operationally defined you must use indirect measures of the construct, e.g., a scale which references behaviors that we consider evidence of the construct.
- But how can we validate that scale?

- Evidence of construct validity of a scale may be provided by comparing high vs. low scoring people on behavior implied by the construct, e.g., Do high scorers on the Attitudes Toward Church Going Scale actually attend church more often than low scorers?
- Or by comparing groups known to differ on the construct; e.g., comparing pro-life members with pro-choice members on Attitudes Toward Abortion scale.
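A known-groups comparison comes down to asking whether the two groups' mean scale scores differ by more than chance would suggest. The sketch below assumes hypothetical scores for two groups expected to differ on the construct and uses Welch's t statistic, one reasonable choice that the slides do not specify.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances
    standard_error = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical attitude-scale scores for two groups known to differ.
group_a = [42, 38, 45, 40, 44, 39]
group_b = [21, 25, 19, 28, 23, 20]

t_statistic = welch_t(group_a, group_b)
print(round(mean(group_a), 1), round(mean(group_b), 1), round(t_statistic, 1))
```

A large t statistic in the expected direction is evidence that the scale separates the groups; the value would still need to be checked against a t table for significance.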

- Factor analysis also gives you a look at the unidimensionality of the construct being measured; i.e., homogeneity of items.
- As does the split-half reliability coefficient.
- ONLY ONE CONSTRUCT CAN BE MEASURED BY ONE SCALE!
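A split-half coefficient can be sketched by splitting the items into odd and even halves, summing each half per person, and correlating the two sets of half scores. The responses below are hypothetical, and the Spearman-Brown step that adjusts the half-test correlation to full test length is an addition not mentioned in the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical responses: one row per person, one column per item (1-5 scale).
responses = [
    [4, 5, 4, 5, 4, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 1, 2, 1],
    [3, 4, 3, 4, 3, 3],
]

# Odd-even split: items 1, 3, 5 versus items 2, 4, 6.
first_half = [sum(row[0::2]) for row in responses]
second_half = [sum(row[1::2]) for row in responses]

half_r = pearson_r(first_half, second_half)

# Spearman-Brown adjustment to estimate reliability at full test length.
full_length_estimate = 2 * half_r / (1 + half_r)

print(round(half_r, 2), round(full_length_estimate, 2))
```

A high coefficient means the two halves rank people similarly, which is consistent with the items measuring one construct.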

- Evidence that the scores on a test correlate strongly with scores on other tests that measure the same construct.
- e.g., we would expect two measures of general self-efficacy to yield strong, positive, and statistically significant correlations.

- When the test scores are not correlated with unrelated constructs.

- Searching for convergence across different measures of the same thing and for divergence between measures of different things.
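The search for convergence and divergence can be sketched by correlating a new scale with one measure of the same construct and one measure of an unrelated construct. All three score lists and their names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same eight people on three measures.
new_scale = [10, 14, 18, 22, 26, 30, 34, 38]          # scale being validated
established_scale = [12, 15, 17, 24, 25, 31, 33, 40]  # same construct
unrelated_measure = [5, 9, 4, 8, 5, 9, 4, 8]          # different construct

convergent_r = pearson_r(new_scale, established_scale)
discriminant_r = pearson_r(new_scale, unrelated_measure)

print(round(convergent_r, 2), round(discriminant_r, 2))
```

The pattern to look for is a high first value and a second value near zero: the scale agrees with measures of the same thing and not with measures of different things.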

- The items look like they reflect whatever is being measured.
- The extent to which the test taker perceives that the test measures what it is supposed to measure.
- The attractiveness and appropriateness of the test as perceived by the test takers.
- Influences how test takers approach the test.
- Uses experts to evaluate.

a) mathematics test

b) intelligence test

c) vocational interest inventory

d) music aptitude test

a) personnel manager

b) teacher or principal

c) college admissions officer

d) prison warden

e) psychiatrist

f) guidance counselor

g) veterinary dermatologist

h) professor in medical school