Advanced Topics in Standard Setting

  • Methodology

  • Implementation

  • Validity of standard setting

What is Standard Setting?

  • Students taking standards-based tests must show, through their test scores, that they have achieved at least a minimum level of competency (MLC). The MLC is defined operationally as the cut score.

  • The process of recommending the cut score is called standard setting.

Standard Setting Methods

  • Test Centered

  • Examinee Centered

  • Item Mapping

Standard Setting Methods

  • Test Centered:

    • Items are presented to the panelists in the same order as they appear on the test

    • Panelists predict a target student’s (e.g., a barely proficient) performance on each item in the test

Angoff (1971)

  • Very popular in licensure and certification tests

  • What is the probability that a barely proficient student will answer the item correctly?

Angoff (1971)

  • Cognitively very challenging

    • Conceptualize a barely proficient student

    • Predict probability of giving correct response to the item

  • Panelists tend to underestimate difficulty of an easy item and overestimate difficulty of a hard item
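As a rough illustration of how Angoff ratings are commonly turned into a cut score, the sketch below sums each panelist's item probabilities and averages those sums across panelists. The ratings and the function name `angoff_cut_score` are hypothetical, and operational studies may aggregate differently.

```python
from statistics import mean

# Hypothetical ratings: ratings[p][i] is panelist p's estimate of the
# probability that a barely proficient student answers item i correctly.
ratings = [
    [0.60, 0.75, 0.40, 0.90],
    [0.55, 0.80, 0.50, 0.85],
    [0.65, 0.70, 0.45, 0.95],
]

def angoff_cut_score(ratings):
    """Average, over panelists, of each panelist's summed item probabilities."""
    panelist_cuts = [sum(row) for row in ratings]  # one raw cut per panelist
    return mean(panelist_cuts)

cut = angoff_cut_score(ratings)
print(f"Recommended raw cut score: {cut:.2f} out of {len(ratings[0])}")
```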

Yes/No (1997)

  • For each item, the panelist asks, “Will 2/3 of the barely proficient students be able to answer the test item correctly (yes or no)?”

  • Cognitively less challenging

  • Used only for MC item tests

Extended Angoff

  • Each item has maximum possible score of 4.

  • Scoring rubric: students can get only 0, 1, 2, 3, or 4 points.

  • What is the probable score that a barely proficient student will get on the item?
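For rubric-scored items, one plausible way to aggregate Extended Angoff judgments is to sum each panelist's expected item scores and average the totals. The sketch below assumes the 0 to 4 rubric described above; the data and the name `extended_angoff_cut` are hypothetical.

```python
MAX_SCORE = 4  # from the slide: each item is scored 0, 1, 2, 3, or 4

# Hypothetical judgments: expected[p][i] is the score panelist p expects
# a barely proficient student to earn on item i.
expected = [
    [2, 3, 1],
    [2, 2, 2],
    [3, 3, 1],
]

def extended_angoff_cut(expected, max_score=MAX_SCORE):
    """Mean of panelists' summed expected scores, after a range check."""
    for row in expected:
        for score in row:
            if not 0 <= score <= max_score:
                raise ValueError(f"score {score} is outside 0..{max_score}")
    totals = [sum(row) for row in expected]  # one raw cut per panelist
    return sum(totals) / len(totals)

cut = extended_angoff_cut(expected)
print(f"Raw cut: {cut:.2f} out of {MAX_SCORE * len(expected[0])}")
```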

Standard Setting Methods

  • Examinee Centered:

    • Panelists identify a target student’s paper (i.e., student folder) that is consistent with performance level descriptions

  • Popular Methods

    • Analytical Judgmental Method (2000)

      • Used only for CR and writing tests

      • Not used for a test that has both MC and CR items

    • Body of Work (1994)

Standard Setting Methods

  • Item Mapping:

    • Items are presented to panelists in an ordered item booklet (OIB), arranged by item difficulty

    • Panelists review each item and determine what knowledge, skills, and abilities are required to answer it correctly

    • Panelists also consider what makes each item more difficult than the previous item in the OIB

Standard Setting Methods

  • Bookmark (1995)

    • Can be used for both MC and CR item tests

    • Cognitively less challenging

  • Mapmark (2005)

    • Used in NAEP (2005)

    • Too expensive: may not be suitable for state assessment programs

    Standard Setting Methods

    • Mixed Method (2006)

      • Blend of Angoff and Bookmark standard setting method

      • Using the strengths of these two methods

      • Not yet operationally implemented

      • Experts in standard setting (personal communication) seem to have liked this method.

    Summary: Methods

    • Select the method that is appropriate for the assessment program

      • Test item type

      • Consistency, method used previously in the program

      • Prior experience with the method (person implementing)

      • Resources available

    Implementation

    • General Standard Setting Process

      • Selection of panelists

      • Orientation

      • Review test materials

      • Review and discuss Performance Level Descriptors (PLDs)

      • Round 1 ratings

      • Feedback

      • Round 2 ratings

      • Evaluation

    Who are the Panelists

    (1) Psychometricians (2) DOE staff (3) Item writers (4) Subject matter experts

    • Content knowledge

    • Understands student population

    • Knowledge of instructional environment

    • Appreciates the consequences of the standards

    • Relevant stakeholders (parents, university level educators, etc.)

    How Many Panelists

    • How many panelists should we have for a standard setting study?

      • (a) 0 (b) 5-6 (c) 7-9 (d) 10-15 (e) 16-40

    • 10-15 panelists may be sufficient to set a defensible cut score

    • The magnitude of error is influenced by more than the number of participants, so the precision of the cut score also depends on other factors

    What Factors Influence their Prediction

    • Qualification of panelists

    • Orientation

    • Impact data

    • NCLB

    • Conceptualization of a barely proficient student

    • Thinking about a particular student in one’s own classroom

    Number of Rounds

    • How many times should we collect panelists’ ratings?

      • (a) 0 (b) 2 (c) 3 (d) 4 (e) 5

    • Two rounds of panelists’ ratings may be adequate to get reasonable and defensible cut score

    • Third round ratings may not differ much from round two ratings



    • Should we provide feedback to the panelists between the rounds?

    • If yes, what kind of data?

    • We often provide

      • Impact data (e.g., if the cut score is 23 out of a possible 40 points, what percentage of students in the state will be classified as Proficient?)

      • Summary of their round 1 ratings (Panelists’ locations)

      • Student’s profile
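The impact-data question above amounts to counting the students at or above a candidate cut score. A minimal sketch, using made-up scores on a 40-point test and a hypothetical helper `percent_at_or_above`:

```python
def percent_at_or_above(scores, cut):
    """Percentage of students whose raw score is at or above the cut."""
    if not scores:
        raise ValueError("no scores supplied")
    return 100.0 * sum(score >= cut for score in scores) / len(scores)

# Hypothetical raw scores; a real study would use the state's score file.
scores = [12, 18, 23, 25, 30, 31, 22, 35, 23, 19]

for cut in (21, 23, 25):
    print(f"cut = {cut}: {percent_at_or_above(scores, cut):.0f}% Proficient")
```

Showing the percentage for several neighboring cuts, as in the loop, lets panelists see how sensitive the classification is to a one- or two-point shift.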



    • Process feedback (panelists’ location) typically has the effect of reducing the standard deviation of cut scores set by panelists

    • It is also evident that the effect of feedback diminishes as the number of rounds of feedback and rating increases.



    • Evaluation will be discussed in the Validity section

    • It is a very important component in a standard setting study

    Streamlining of Standard Setting Methods

    Angoff-based Methods

    • Web-based standard setting

    • Reduce number of items rated by panelists

      • Divide the test and the panelists into equivalent groups

      • Use only a subset of items

        • content and difficulty

        • content, discrimination, and difficulty

        • 50% items may be adequate

    Streamlining of Standard Setting Methods

    • Bookmark Method

      • Use only a subset of items

        • Content, item-type, and difficulty

        • 70% items may be adequate

    Ordered Item Booklet in Bookmark

    [Figure: ordered item booklet with bookmark placements for Cut 1, Cut 2, and Cut 3. Cut 1 = average of -1.85 and -1.5 = -1.68]
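The Cut 1 calculation on this slide averages the locations of the two items on either side of the bookmark. A small sketch of that arithmetic follows; only -1.85 and -1.5 come from the slide, the other item locations are invented, and `bookmark_cut` is a hypothetical helper. Other Bookmark conventions exist (for example, using the location of the item just before the bookmark).

```python
def bookmark_cut(locations, bookmark):
    """Midpoint between the item just before the bookmark and the bookmarked item.

    locations: item locations (theta) in OIB order, easiest first.
    bookmark:  0-based index of the item where the bookmark is placed.
    """
    if not 1 <= bookmark < len(locations):
        raise ValueError("bookmark must fall between two items")
    return (locations[bookmark - 1] + locations[bookmark]) / 2

# Only -1.85 and -1.50 appear on the slide; the rest are placeholders.
oib_locations = [-2.40, -1.85, -1.50, -0.90, 0.10, 0.75]

cut_1 = bookmark_cut(oib_locations, 2)  # bookmark on the item at -1.50
print(cut_1)  # the slide reports this value rounded to -1.68
```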

    Conversion of Raw Cuts to Theta Cuts

    [Figure: conversion of raw cuts to theta cuts; labels: Cut 1 = 2, Cut 2 = 3, Cut 3 = 4]
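The transcript does not say how raw cuts were converted to theta cuts. One common approach is to invert the test characteristic curve, that is, to find the ability at which the expected raw score equals the raw cut. The sketch below assumes a Rasch model and five invented item difficulties, and treats the figure labels 2, 3, and 4 as raw cuts; all of these are assumptions.

```python
import math

def expected_raw_score(theta, difficulties):
    """Test characteristic curve under a Rasch model (an assumption here)."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def raw_to_theta(raw_cut, difficulties, lo=-6.0, hi=6.0, tol=1e-8):
    """Find the theta whose expected raw score equals raw_cut, by bisection."""
    low_score = expected_raw_score(lo, difficulties)
    high_score = expected_raw_score(hi, difficulties)
    if not low_score < raw_cut < high_score:
        raise ValueError("raw cut is outside the range the curve can reach")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_raw_score(mid, difficulties) < raw_cut:
            lo = mid  # curve is increasing, so the answer lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical item difficulties for a five-item test.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]

theta_cuts = {raw: raw_to_theta(raw, difficulties) for raw in (2, 3, 4)}
for raw, theta in theta_cuts.items():
    print(f"raw cut {raw} -> theta cut {theta:.2f}")
```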

    Scaled Score

    • Scaled scores are typically a linear transformation of ability estimates

    • Example of a linear transformation:

      • Scaled score = (Ability × Slope) + Intercept
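The linear transformation above can be written in a few lines. The slope of 50 and intercept of 500 are placeholders chosen for illustration, not values from any particular program.

```python
def scaled_score(ability, slope, intercept):
    """Linear transformation of an ability (theta) estimate to the reporting scale."""
    return ability * slope + intercept

# Hypothetical scaling constants.
SLOPE = 50.0
INTERCEPT = 500.0

theta_cut = -1.68  # the Cut 1 value from the ordered item booklet slide
print(round(scaled_score(theta_cut, SLOPE, INTERCEPT)))
```

Because the transformation is linear and the slope is positive, the ordering of students and cut scores on the theta scale is preserved on the reporting scale.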

    Summary: Implementation

    • Panelists should be subject matter experts

    • 10-15 panelists are adequate for a standard setting study

    • Two rounds of panelists’ ratings may be sufficient to estimate a defensible cut score

    • Reasonable feedback data should be provided

    Summary: Implementation

    • When designing a standard setting study, potential influencing factors should be considered

    • Streamlined standard setting procedures may be a good consideration for low-stakes tests.

    Validity of Standard Setting Process

    Validity of Standard Setting

    • Assumptions:

      • Policy assumption: It claims that the performance standards are appropriate, given the purpose of the decisions

      • Operational assumption: It claims that students with scores at or above the cut score are likely to meet the performance standard, and students with scores below the cut score are not likely to meet it.

    Validity of Standard Setting

    • The policy assumption is often evaluated by documenting procedural evidence, e.g.:

      • Purposes of the decision process

      • Selection of panelists

      • Training of panelists

      • Definition of performance standard

      • Data collection procedure

    Validity of Standard Setting

    • Operational assumption is examined through internal consistency evidence and external criteria

      • Internal consistency: Results that are not internally consistent do not justify any conclusions.

    Internal Consistency Evidence

    • Precision of estimates of the cut score

      • Standard error of the cut score: if the standard setting study were repeated, how likely would we be to get the same cut score?

    • Analysis of item-level data

      • Examining performances of students with scores near the cut score

      • Comparing performance of two groups of students (one with scores a bit above the cut and the other a bit below the cut)
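One simple way to estimate the standard error of the cut score is to divide the standard deviation of the panelists' individual cut scores by the square root of the number of panelists. This treats panelists as a simple random sample and ignores other sources of error, so it is a rough sketch; the data and the name `cut_score_standard_error` are hypothetical.

```python
import math
from statistics import mean, stdev

def cut_score_standard_error(panelist_cuts):
    """Sample SD of panelists' cut scores divided by sqrt(number of panelists)."""
    n = len(panelist_cuts)
    if n < 2:
        raise ValueError("need at least two panelists")
    return stdev(panelist_cuts) / math.sqrt(n)

# Hypothetical final-round cut scores from ten panelists (40-point test).
panelist_cuts = [24, 25, 26, 25, 27, 23, 25, 26, 24, 25]

print(f"mean cut = {mean(panelist_cuts):.1f}")
print(f"standard error = {cut_score_standard_error(panelist_cuts):.2f}")
```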

    Internal Consistency Evidence

    • Examining performances of students with scores near the cut score. For example, panelists set the proficient cut at 25 out of a possible score of 40.

      • Suppose panelists judge that almost 90% of borderline proficient students should answer an item correctly. If the item’s conditional p-value for students who scored 25 differs substantially from 0.90, the cut score may not be accurately placed.

    Internal Consistency Evidence

    • Comparing performance of two groups of students (one with scores a bit above the cut and the other a bit below the cut):

      • Compare conditional p-values of the items. We would expect p-values for the above-the-cut group to be higher than p-values for the below-the-cut group.
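A conditional p-value is the proportion of students within a given total-score range who answered an item correctly. The sketch below computes it at the cut (25, as in the earlier example) and for groups just below and just above it. The response data, the three-point band width, and the name `conditional_p_value` are all assumptions made for illustration.

```python
def conditional_p_value(responses, low, high):
    """Proportion correct on one item among students with low <= total <= high.

    responses: list of (total_score, item_correct) pairs, item_correct in {0, 1}.
    """
    selected = [correct for total, correct in responses if low <= total <= high]
    if not selected:
        raise ValueError("no students in the requested score range")
    return sum(selected) / len(selected)

# Hypothetical (total score, item correct) pairs for a single item.
responses = [
    (22, 0), (23, 1), (23, 0), (24, 1), (24, 0),
    (25, 1), (25, 1), (25, 0), (25, 1),
    (26, 1), (26, 1), (27, 1), (27, 0), (28, 1),
]

CUT = 25
at_cut = conditional_p_value(responses, CUT, CUT)
below = conditional_p_value(responses, CUT - 3, CUT - 1)
above = conditional_p_value(responses, CUT + 1, CUT + 3)
print(f"at cut: {at_cut:.2f}, below: {below:.2f}, above: {above:.2f}")
```

With real data, `at_cut` would be compared against the panelists' judgment for the item, and `above` would be expected to exceed `below`.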

    Internal Consistency Evidence

    • Intra-panelist consistency: A measure of how consistently a panelist provides judgments across the items

    • Inter-panelist consistency: A measure of how consistently the panelists provide judgments on each item

      • An index can be used to measure inter-panelist consistency for the Angoff, Body of Work, and Bookmark standard setting methods
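The transcript does not name the consistency index. As a simple stand-in, not the index the slide refers to, the sketch below reports the standard deviation of panelists' ratings on each item; smaller values mean panelists agree more closely on that item. The ratings are hypothetical Angoff-style probabilities.

```python
from statistics import stdev

def per_item_spread(ratings):
    """Sample SD of ratings across panelists, computed separately for each item."""
    # zip(*ratings) regroups the panelist-by-item table into one tuple per item.
    return [stdev(item_ratings) for item_ratings in zip(*ratings)]

# Hypothetical ratings: one row per panelist, one column per item.
ratings_by_panelist = [
    [0.6, 0.8, 0.4],
    [0.6, 0.7, 0.5],
    [0.6, 0.9, 0.3],
]

for item, spread in enumerate(per_item_spread(ratings_by_panelist), start=1):
    print(f"item {item}: SD across panelists = {spread:.2f}")
```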

    External Criteria

    • Comparisons to results of other standard setting methods

      • Challenges: Different methods are not equally appropriate for a certain type of test

    • Judgments by stakeholder groups (e.g., classroom judgment data)

      • Challenges: Finding stakeholders who are qualified to make this judgment (e.g., stakeholders may have incomplete understanding about the performance level definition)

    • Comparisons involving other assessment methods

      • Existing classification data could be used as the basis for checking the appropriateness

      • Challenges: Finding an appropriate external criterion

    Summary: Validity

    • Standard setting (or setting performance standards) is a judgmental method

    • Different methods may set different performance standards for the same test

    • This is ultimately a policy decision

    • Standard setting procedure needs to be well documented in order to make it defensible.
