Test validity revisited again
Test Validity – Revisited Again. Virginia Association of Test Directors Conference, Richmond, VA, October 29, 2008, 1:30 – 2:30. David Mott, Tests for Higher Standards, ROSworks, LLC.

The title of this talk is deliberately repetitive.

It’s like:

“The Department of Redundancy Department”,


“Déjà vu, all over again”,


“Mount Fujiyama”.

Why is this?

Because validity is seen as the most important quality that a test or assessment can have.


Discussions about test validity have been going on for years. One simple prepositional phrase needs to be added to make the ongoing discussion meaningful — “FOR . . .” Test scores can never be either valid or invalid unless the space after FOR is filled in. This is a presentation and a discussion of what needs to go after the FOR, within the world of AYP and NCLB, and why it matters. Data will be presented and there will be a test.

What about me?

Began as a psychologist

  • Worked at the VA DOE for 17+ years as “Supervisor of various things” in research and testing:

    Including Supervisor of Test Development

    Testing Supervisor in charge of VSAP

  • Took some time off to decide what I wanted to do with the rest of my life — Then I started writing tests.

  • Continued writing tests until we had K – 11 in four subject areas —TfHS.

  • Then discovered we needed a way to score them and report them — so we created a way to score and report them —ROS.

Back to Validity

  • Validity means asking whether you are measuring what you think you’re measuring.

  • Validity also means that the way you are measuring something supports the purpose you have in measuring — and what inferences you wish to make.

    This is the FOR in the abstract.

Two Basic Types of Validity

  • The first type is correlational validity

    • Concurrent Validity

    • Predictive Validity

    • Convergent Validity

  • The second type is non-correlational validity (for lack of a better term).

    • Face validity

    • Content validity

    • Diagnostic Validity

The correlational types of validity have become less relevant as more attention has been paid to the “non-correlational” validities.

I will call this (modestly), “Mott’s Law of Validities.”

Concurrent Validity

Concurrent validity refers to the degree to which test scores correlate with other measures of the same underlying thing taken at the same time. For example, scores on tests administered to students should correlate with their grades, and scores on one reading test should correlate with scores on another reading test.

It can have a useful function for us.
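As a rough illustration of what "correlate" means here, the sketch below computes a Pearson correlation between test scores and grades. The student data and the variable names are invented for the example, and Pearson's r is only one of several correlation measures a division might choose.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: test scores and course grades for six students.
test_scores = [52, 61, 70, 74, 85, 93]
course_grades = [68, 72, 75, 80, 88, 91]
r = pearson_r(test_scores, course_grades)
print(f"r = {r:.2f}")
```

The same function would serve for a predictive check if the second list held scores from a later test instead of grades gathered at the same time.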

Predictive Validity

Predictive validity is concurrent validity with a sense of time. How well does one score predict another in the future? How will this benchmark test predict scores on the SOL test?

This is what divisions ask my company and others to supply when they ask us for “the validity evidence” for our tests.

Problems with this?

Convergent Validity

Convergent validity refers to the degree to which a measure is correlated with many other measures that it is theoretically predicted to correlate with. This is sort of concurrent validity on steroids.

Reading ability is the best example of this:

DRP units or Lexiles.

We have long had a hint of this because all decent reading tests correlate pretty highly with each other.

Face Validity

Face validity: If it looks like a duck, and walks like a duck, and quacks like a duck — It’s a . . .

It’s a non-statistical type of validity

Content Validity

Content validity is the systematic examination of test content to determine whether it covers a good sample of the domain the test is intended to measure. For standards-related tests: Does the test cover all the desired standards? Does it cover them evenly or in the proportion it should? Does its coverage exclude other standards? How about matching the cognitive levels?
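The coverage questions above can be checked mechanically once every item is tagged with a standard. The sketch below is one possible way to do that. The standard codes, the blueprint proportions, and the function name are all hypothetical, and it does not address cognitive levels.

```python
from collections import Counter

def coverage_report(item_standards, blueprint):
    """Compare the share of items per standard with the intended share.

    item_standards: one standard code per test item.
    blueprint: dict mapping each desired standard to its intended proportion.
    """
    counts = Counter(item_standards)
    total = len(item_standards)
    report = {}
    for standard, intended in blueprint.items():
        actual = counts.get(standard, 0) / total
        report[standard] = {"intended": intended, "actual": actual}
    # Desired standards with no items at all.
    missing = [s for s in blueprint if counts.get(s, 0) == 0]
    # Items tagged with standards the test was not meant to cover.
    off_blueprint = sorted(s for s in counts if s not in blueprint)
    return report, missing, off_blueprint

# Hypothetical ten-item test and blueprint.
blueprint = {"3.1": 0.4, "3.2": 0.4, "3.3": 0.2}
items = ["3.1"] * 5 + ["3.2"] * 4 + ["4.1"]
report, missing, off_blueprint = coverage_report(items, blueprint)
print(missing)        # ['3.3']
print(off_blueprint)  # ['4.1']
```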

Diagnostic Validity

Diagnostic validity is the ability of a test to discriminate accurately between the skills, abilities, etc. an individual has and can do versus those not attained. The overall score is not really important, except as context.

Okay, I made up this term, but it is akin to what Jim Popham has called “instructional sensitivity”. His point is that many of our tests don’t have it.

Diagnostic Validity

Why is diagnostic validity of importance?

Because it is what most testing on the formative end of the spectrum is used for.

This type is not well described statistically. There are no well-known statistics used to measure it.

And there are always questions about how specific we should be or can be in our diagnoses:

Total Test Score

Reporting Category



Moving back to the FOR

Some common reasons for testing:

  • Selection or sorting

  • Summative

  • Formative

  • Motivational

Formative Assessment

Arguments about what is formative vs. summative go on and on.

However, most of the research showing that formative assessment is highly effective in learning is based on assessment that is deeply integrated into learning. These are techniques that teachers use in the midst of teaching. The teacher and the student are in a very short feedback loop.

From Assessment Manifesto, Rick Stiggins, 2008

“ . . . evidence gathered from dozens of studies conducted around the world consistently reveals a half to a full standard deviation gain in student achievement attributable to the careful management of the classroom assessment process, with the largest gains accruing for struggling learners.” (Black and Wiliam, 1998; Hattie and Timperley, 2007).


This really does not mean benchmark tests, mid-semester exams, end-of-course tests and the like. It may not even mean teacher made pop quizzes and so on.

It does mean tests in which the student finds out what he does and doesn’t know and can use the information, together with his teacher, to drive his own learning.

Motivation

This is usually not the stated purpose for assessment, but both the good things and the bad things that come out of testing often relate to this function.

It is the “stakes” part of the high-stakes vs. low-stakes dichotomy.

High-stakes for whom, is a good question to ask.

Motivation — The Key Factor

Stiggins makes a number of points about motivation in his Manifesto.

  • Tests cannot motivate students if they have given up.

  • Too-hard tests can make students give up.

  • Too-hard tests (not properly targeted) can perpetuate the cycle of failure because they reinforce the student’s loss of self-efficacy.

So . . .

  • If we give too-hard (mistargeted) tests we are directly part of the problem.

  • The tests are invalid for any use because the students don’t try.

  • We must avoid this vicious cycle.

    I say we because I mean me, my company, you, your division, your teachers, all of us.

How Can We Tell?

  • Christmas trees on answer sheets

  • A B C D A B C D A B C D

  • B B B B B B B B B B B B

  • Getting finished too fast

  • Talking, etc.

  • Scores near the chance point (<30% correct)

  • Ask the students
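Some of these signs can be screened for automatically. The sketch below flags answer sheets that repeat a short cycle (such as A B C D or all B) and sheets scoring under the 30% chance point mentioned above. The function names, the maximum cycle length, and the sample key are assumptions made for the example. A flag is a reason to look closer or ask the student, not proof of low effort.

```python
def repeats_short_cycle(answers, max_period=4):
    """True if the whole sheet repeats one short cycle at least three times.

    Period 1 catches straight-lining (B B B B ...); longer periods catch
    patterns such as A B C D A B C D.
    """
    n = len(answers)
    for period in range(1, max_period + 1):
        if n >= 3 * period and all(
            answers[i] == answers[i % period] for i in range(n)
        ):
            return True
    return False

def percent_correct(answers, key):
    """Proportion of items answered correctly."""
    right = sum(1 for a, k in zip(answers, key) if a == k)
    return right / len(key)

def low_effort_flags(answers, key, chance_cutoff=0.30):
    """Return a list of warning labels for one answer sheet."""
    flags = []
    if repeats_short_cycle(answers):
        flags.append("patterned responses")
    if percent_correct(answers, key) < chance_cutoff:
        flags.append("score near chance")
    return flags

# Hypothetical 12-item answer key.
key = list("ACBDABDCACBD")
print(low_effort_flags(list("ABCDABCDABCD"), key))
print(low_effort_flags(list("BBBBBBBBBBBB"), key))
```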

How Can We Tell #2?

  • Distracter choices that don’t make sense.

  • Response times for CBTs (a more sophisticated approach).

    Let’s look at this last, because this started me off on this tack.
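For computer-based tests, one simple way to use response times is to flag answers given too quickly for the student to have read the item. The sketch below assumes a single fixed threshold of 5 seconds, which is a made-up value. A real threshold would need to be set per item from actual timing data, and the function names are hypothetical.

```python
def rapid_guess_flags(response_times, threshold_seconds=5.0):
    """Mark each response as a likely rapid guess (True) or not (False)."""
    return [t < threshold_seconds for t in response_times]

def effort_proportion(response_times, threshold_seconds=5.0):
    """Share of items on which the student took at least the threshold time."""
    flags = rapid_guess_flags(response_times, threshold_seconds)
    return 1 - sum(flags) / len(flags)

# Hypothetical response times in seconds for one student on eight items.
times = [42.0, 3.1, 2.4, 37.5, 1.9, 55.0, 28.3, 2.2]
print(rapid_guess_flags(times))
print(effort_proportion(times))  # 0.5
```

A low proportion suggests the score says more about motivation than about what the student knows.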


  • Much of the clever psychometric work of the last 100 years is heavily contaminated by invalid data.

  • Many students are only partially motivated.

  • Some questions look hard, even if they aren’t.

  • We may be able to use these data to better uncover what students don’t know.

  • Now that we know this we can watch for it, especially with CBT.

What can we do about these motivational issues?

  • Give properly targeted tests

    • That means we can’t always give the same tests to all the students. This is not a problem with formative assessments because they’re used for learning, not for evaluation. (Could we call this “Differentiated Assessment”?)

    • Correct misunderstandings early, before they multiply.

    • Encourage students to admit when they don’t know.

A Technique to Encourage Students to Admit When They Don’t Know

Try a different method of scoring:

Tell the students that they get 1 point for every answer they get right, 1/3 of a point for every answer they omit, and 0 points for every answer they get wrong. This removes the incentive for guessing and it lets you know explicitly that students need help.
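A short sketch of this scoring rule follows, along with the arithmetic behind the guessing claim. It assumes four-choice items (A to D) and that an omitted answer is recorded as `None`. The function names are invented for the example.

```python
from fractions import Fraction

def score_test(responses, key, omit_credit=Fraction(1, 3)):
    """Score a test: 1 point per right answer, omit_credit per omitted
    item (response None), and 0 points per wrong answer."""
    total = Fraction(0)
    for given, correct in zip(responses, key):
        if given is None:
            total += omit_credit
        elif given == correct:
            total += 1
    return total

def expected_guess_value(num_choices):
    """Expected points from a blind guess among equally likely choices."""
    return Fraction(1, num_choices)

# Hypothetical six-item test: 3 right, 2 omitted, 1 wrong.
key = ["A", "C", "B", "D", "A", "B"]
responses = ["A", "C", None, "D", "B", None]
print(score_test(responses, key))                # 11/3
print(expected_guess_value(4) < Fraction(1, 3))  # True
```

On a four-choice item a blind guess is worth 1/4 of a point on average, which is less than the 1/3 earned by omitting. With three choices the two break even, and a student who can rule out some options may still do better by guessing, so the incentive is removed only for blind guesses on items with four or more choices.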

Discussion . . .

Comes Full Circle

So it comes full circle. Assessment is about learning. Assessment, used correctly, is really all FOR learning.

The test that follows this session is really a test of us.

Resources

Stiggins, R. (2008). Assessment Manifesto: A Call for the Development of Balanced Assessment Systems. Portland, OR: ETS Assessment Training Institute. [www.ets.org/ati]

Popham, W. J. (2008). All about Assessment: A Misunderstood Grail. Educational Leadership, 66(1), 82-83.

Popham, W. J. (2008). Transformative Assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

Wise, S. L. and DeMars, C. E. at the Center for Assessment & Research Studies (CARS) in the JMU School of Education — Response-time studies.

Discussion . . .


David Mott




5310 Markel Road, Suite 104

Richmond, VA 23230