Characteristics of Successful Assessment Measures

Presentation Transcript
Characteristics of Successful Assessment Measures
  • Reliable
  • Valid
  • Efficient
    • Time
    • Money
    • Resources
  • Don’t result in complaints
What Do We Mean by Reliability?
  • The extent to which a score from a test is consistent and free from errors of measurement
Methods of Determining Reliability
  • Test-retest (temporal stability)
  • Alternate forms (form stability)
  • Internal reliability (item stability)
  • Interrater Agreement



Test-Retest Reliability
  • Measures Temporal Stability
    • Stable measures
    • Measures expected to vary
  • Administration
    • Same participants
    • Same test
    • Two testing periods
Test-Retest Reliability: Scoring
  • To obtain the reliability of an instrument, the scores at time one are correlated with the scores at time two
  • The higher the correlation the more reliable the test
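
The scoring step above can be sketched in a few lines of Python. This is a minimal illustration, assuming a Pearson correlation is the coefficient used; the scores and the helper name `pearson` are hypothetical and not taken from the presentation.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same five participants at two testing periods.
time_one = [10, 12, 14, 16, 18]
time_two = [11, 12, 15, 15, 19]

test_retest_r = pearson(time_one, time_two)
print(round(test_retest_r, 2))  # roughly 0.96 for these made-up scores
```

A coefficient near 1 means participants kept roughly the same rank order and spacing across the two administrations.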
Test-Retest Reliability: Problems
  • Sources of measurement errors:
    • Characteristic or attribute being measured may change over time
    • Reactivity
    • Carry over effects
  • Practical problems:
    • Time consuming
    • Expensive
    • Inappropriate for some types of test
Standard Error of Measurement
  • Provides a range of estimated accuracy
    • 1 SE = 68% confident
    • 1.96 SE = 95% confident
  • The higher the reliability of a test, the lower the standard error of measurement
  • Formula: SEM = SD x √(1 - reliability)
Serial Killer IQ Exercise
Mean = 100, SD = 15, Reliability = .90
IQ of 70 for death penalty
Serial Killer IQ: Answers
Mean = 100, SD = 15, Reliability = .90
IQ of 70 for death penalty
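
A minimal sketch of the arithmetic behind this exercise, assuming the standard formula SEM = SD x √(1 - reliability) and a normal-curve multiplier of 1.96 for the 95% band. The observed score of 70 is used only as an illustration; the function names are hypothetical and the individual scores from the original slide are not reproduced here.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(observed, sd, reliability, z=1.96):
    """Range around an observed score at the confidence level implied by z."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return observed - margin, observed + margin


sem = standard_error_of_measurement(sd=15, reliability=0.90)
low, high = score_band(observed=70, sd=15, reliability=0.90)

print(round(sem, 2))                  # about 4.74 IQ points
print(round(low, 1), round(high, 1))  # about 60.7 to 79.3
```

With these parameters, an observed IQ of 70 is compatible with scores several points on either side of the cutoff, which is why the standard error matters for a fixed threshold.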


Alternate Forms

Alternate Forms Reliability
  • Establishes form stability
  • Used when there are two or more forms of the same test
    • Different questions
    • Same questions, but different order
    • Different administration or response method (e.g., computer, oral)
  • Why have alternate forms?
    • Prevent cheating
    • Prevent carry over from people who take a test more than once
      • GRE or SAT
      • Promotion exams
      • Employment tests
Alternate Forms Reliability: Administration
  • Two forms of the same test are developed, and to the highest degree possible, are equivalent in terms of content, response process, and statistical characteristics
  • One form is administered to examinees, and at some later date, the same examinees take the second form
Alternate Forms Reliability: Scoring
  • Scores from the first form of test are correlated with scores from the second form
  • If the scores are highly correlated, the test has form stability
Alternate Forms Reliability: Disadvantages
  • Difficult to develop
  • Content sampling errors
  • Time sampling errors
What the Research Shows
  • Computer vs. Paper-Pencil
    • Few test score differences
    • Cognitive ability scores are lower on the computer for speed tests but not power tests
  • Item order
    • Few differences
  • Video vs. Paper-Pencil
    • Little difference in scores
    • Video reduces adverse impact



Internal Reliability
  • Defines measurement error strictly in terms of consistency or inconsistency in the content of the test
  • With this form of reliability the test is administered only once and measures item stability
Determining Internal Reliability: Split-Half Method
  • Test items are divided into two equal parts
  • Scores for the two parts are correlated to get a measure of internal reliability
  • Need to adjust for smaller number of items
  • Spearman-Brown prophecy formula:
    • (2 x split-half reliability) ÷ (1 + split-half reliability)
Spearman-Brown Formula

(2 x split-half correlation) ÷ (1 + split-half correlation)

If we have a split-half correlation of .60, the corrected reliability would be:

(2 x .60) ÷ (1 + .60) = 1.2 ÷ 1.6 = .75

Spearman-Brown Formula: Estimating the Reliability of a Longer Test

Estimated new reliability = (L x current reliability) ÷ (1 + (L - 1) x current reliability)

L = the number of times longer the new test will be

Suppose you have a test with 20 items and it has a reliability of .50. You wonder if using a 60-item test would result in acceptable reliability.

L = 60 ÷ 20 = 3

(3 x .50) ÷ (1 + (3 - 1) x .50) = 1.5 ÷ 2.0 = .75

Estimated New Reliability = .75
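
Both Spearman-Brown calculations can be checked with one small function. This is a sketch that assumes the general form of the formula, (L x r) ÷ (1 + (L - 1) x r), of which the split-half correction is the special case L = 2; the function and parameter names are illustrative.

```python
def spearman_brown(reliability, length_factor=2.0):
    """Predicted reliability of a test that is `length_factor` times as long."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Split-half correction: a half-test correlation of .60, doubled in length.
corrected = spearman_brown(0.60)

# Lengthening a 20-item test with reliability .50 to 60 items (L = 3).
longer = spearman_brown(0.50, length_factor=60 / 20)

print(round(corrected, 2), round(longer, 2))  # 0.75 0.75
```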

Common Methods to Determine Internal Reliability
  • Cronbach’s Coefficient Alpha
    • Used with ratio or interval data
  • Kuder-Richardson Formula
    • Used for tests with dichotomous items
      • yes-no
      • true-false
      • right-wrong
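
As a rough illustration of how coefficient alpha is computed, the sketch below assumes the usual formula, alpha = k ÷ (k - 1) x (1 - sum of item variances ÷ variance of total scores), with population variances throughout. Under that assumption, right-wrong items scored 0/1 give the same value as Kuder-Richardson formula 20, because each item variance equals p x q. The response data are made up.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical right-wrong (1/0) answers: four respondents, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(alpha)  # 0.75
```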
Interrater Reliability
  • Used when human judgment of performance is involved in the selection process
  • Refers to the degree of agreement between 2 or more raters
  • 3 common methods used to determine interrater reliability
    • Percent agreement
    • Correlation
    • Cohen’s Kappa
Interrater Reliability Methods: Percent Agreement
  • Determined by dividing the total number of agreements by the total number of observations
  • Problems
    • Exact match?
    • Very high or very low frequency behaviors can inflate agreement
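
The inflation problem is easy to see with a made-up example. In the sketch below, two hypothetical raters code 100 observation intervals for a rare behavior (1 = observed, 0 = not observed); the data and function name are assumptions for illustration only.

```python
def percent_agreement(rater_a, rater_b):
    """Number of agreements divided by number of observations."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same observations")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


# Rater A never records the behavior; rater B records it in 5 intervals.
rater_a = [0] * 100
rater_b = [1] * 5 + [0] * 95

agreement = percent_agreement(rater_a, rater_b)
print(agreement)  # 0.95
```

The raters agree on 95% of intervals even though they never agree that the behavior occurred, because nearly all intervals are "not observed" for both.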
Interrater Reliability Methods: Correlation
  • Ratings of two judges are correlated
  • Pearson for interval or ratio data and Spearman for ordinal data (ranks)
  • Problems
    • Shows pattern similarity but not similarity of actual ratings
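
A short sketch of the pattern-versus-level problem, using hypothetical ratings in which one judge is consistently two points more lenient than the other. The `pearson` helper is written out here so the example is self-contained.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ratings of five candidates; judge B is always 2 points higher.
judge_a = [1, 2, 3, 4, 5]
judge_b = [3, 4, 5, 6, 7]

r = pearson(judge_a, judge_b)
exact_matches = sum(a == b for a, b in zip(judge_a, judge_b))

print(r, exact_matches)  # 1.0 0
```

The correlation is perfect because the two judges rank and space the candidates identically, yet they never give the same rating to any candidate.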
Interrater Reliability Methods: Cohen’s Kappa
  • Allows one to determine not only the level of agreement, but also the level of agreement that would be expected by chance
  • A Kappa of .70 or higher is considered acceptable agreement

[Table: Forensic Examiner A by Examiner B]
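
A minimal sketch of Cohen's kappa for two raters, assuming the standard definition: kappa = (observed agreement - chance agreement) ÷ (1 - chance agreement), where chance agreement comes from each rater's own category proportions. The ten yes/no judgments are hypothetical and are not the values from the examiner table.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined (division by zero) if both raters use a single identical category.
    return (observed - expected) / (1 - expected)


# Hypothetical judgments by two examiners on ten cases.
examiner_a = ["yes"] * 6 + ["no"] * 4
examiner_b = ["yes"] * 5 + ["no"] * 4 + ["yes"]

kappa = cohens_kappa(examiner_a, examiner_b)
print(round(kappa, 2))  # 0.58
```

Here the examiners agree on 8 of 10 cases (80%), but with 52% agreement expected by chance the kappa is about .58, below the .70 guideline.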

Increasing Rater Reliability
  • Have clear guidelines regarding various levels of performance
  • Train raters
  • Practice rating and provide feedback
Scorer Reliability
  • Allard, Butler, Faust, & Shea (1995)
    • 53% of hand scored personality tests contained at least one error
    • 19% contained enough errors to alter a clinical diagnosis


Validity

The degree to which inferences from scores on tests or assessments are justified by the evidence


Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. ... The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in more than one way, each intended interpretation must be validated. Sources of validity evidence include, but are not limited to: evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, evidence based on consequences of testing.

Standards for Educational and Psychological Testing (1999)

Common Methods of Determining Validity
  • Content Validity
  • Criterion Validity
  • Construct Validity
  • Known Group Validity
  • Face Validity


Content Validity
  • The extent to which test items sample the content that they are supposed to measure
  • In industry the appropriate content of a test or test battery is determined by a job analysis
  • Considerations
    • The content that is actually in the test
    • The content that is not in the test
    • The knowledge and skill needed to answer the question
Test of Logic
  • Stag is to deer as ___ is to human
  • Butch is to Sundance as ___ is to Sinatra
  • Porsche is to cars as Gucci is to ____
  • Puck is to hockey as ___ is to soccer

What is the content of this exam?

Messick (1995): Sources of Invalidity
  • Construct underrepresentation
  • Construct-irrelevant variance
      • Construct-irrelevant difficulty
      • Construct-irrelevant easiness

[Figure: Domain Content and Test Content]



Criterion Validity
  • Criterion validity refers to the extent to which a test score is related to some measure of job performance called a criterion
  • Established using one of the following research designs:
    • Concurrent Validity
    • Predictive Validity
    • Validity Generalization
Concurrent Validity
  • Uses current employees
  • Range restriction can be a problem
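
Range restriction can be illustrated with made-up numbers. The sketch below assumes eight hypothetical people with a test score and a performance rating, and compares the correlation in the full group with the correlation among only the higher scorers, which is roughly what a sample of current employees may look like if low scorers were never hired.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test scores and job performance ratings.
test_scores = [1, 2, 3, 4, 5, 6, 7, 8]
performance = [2, 1, 4, 3, 6, 5, 8, 7]

r_full = pearson(test_scores, performance)

# Keep only people scoring 5 or higher (a restricted range of test scores).
kept = [(t, p) for t, p in zip(test_scores, performance) if t >= 5]
r_restricted = pearson([t for t, _ in kept], [p for _, p in kept])

print(round(r_full, 2), round(r_restricted, 2))  # 0.9 0.6
```

The same underlying relationship yields a noticeably smaller coefficient once the lower half of the score range is missing.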
Predictive Validity
  • Correlates test scores with future behavior
  • Reduces the problem of range restriction
  • May not be practical
Validity Generalization
  • Validity Generalization is the extent to which a test found valid for a job in one location is valid for the same job in a different location
  • The key to establishing validity generalization is meta-analysis and job analysis


Construct Validity
  • The extent to which a test actually measures the construct that it purports to measure
  • Is concerned with inferences about test scores
  • Determined by correlating scores on a test with scores from other tests
Construct Valid Measures
  • Correlate highly with measures of similar constructs
  • Don’t correlate highly with irrelevant or competing constructs
  • Correlate with other measures of the construct that are measured in different ways
  • Don’t correlate highly with competing or irrelevant constructs that are measured in similar ways


Face Validity
  • The extent to which a test appears to be job related
  • Reduces the chance of legal challenge
  • Increases test taker motivation
  • Increases acceptance of test results
  • Increasing face validity
    • Item content
    • Professional look of the test
    • Explaining the nature of the test


Known-Group Validity
  • Compares test scores of groups “known” to be different
  • If no differences, test may not be valid
  • If differences, conclusions are hard to make