
Validating Scoring Rules for Technology-enhanced Items (TEIs): A Case Study from ELPA21


Presentation Transcript


  1. Validating Scoring Rules for Technology-enhanced Items (TEIs): A Case Study from ELPA21 National Conference on Student Assessment, June 20, 2016. Terri Schuster, Mark Hansen, & Phoebe Winter

  2. Purpose of Presentation Spark thinking about developing principles for scoring (and developing) TEIs through • ELPA21 experience • Discussion of what others have done • Discussion of what we already know • Discussion of questions we need to ask and answer

  3. Background on ELPA21 • State consortium-developed assessment of English language acquisition • Based on ELP Standards (CCSSO, 2012) • Default mode of administration is online • Evidence-centered design (ECD) process, assumed online delivery • Blueprint was built around tasks and standards • Items field tested in spring 2015 • Operational in spring 2016

  4. ELPA21 Item Development and TEIs • ECD framework encouraged use of technology in items when appropriate • Some task specs required the use of TEIs (e.g., follow instructions) • Some allowed or suggested the use of TEIs (e.g., student presentation) • Each task was coded as to which PLD(s) it addressed

  5. English Language Proficiency Assessments—Background Knowledge • Prior to NCLB • Commercially available assessments • Based on EFL instruments • NCLB • ELP testing became required • Must assess the 4 language domains—listening, speaking, reading, and writing • Based on standards focused on grammar/syntax/language skills separated from content learning • Paper-and-pencil delivery with use of recording instruments • Next-generation tests like ELPA21 • New standards designed to focus on practices and skills ELs will need to access content—focus on academic language • Online environment • Technology-enhanced items

  6. TEI Scoring Rules Review, Construct-Based1 Each TEI (and MC-MS item) was reviewed as to • Whether the item could be scored for partially correct responses • Whether the item should be scored for partially correct responses • Do the possible responses tell us something different about the student’s skills as they relate to the PLDs? • Do the possible responses differentiate between different levels of the standard? • Do the possible responses differ only (or primarily) in difficulty? 1 Details of procedures used in ELPA21 can be found in Pooler, Wang, and Doyle (Sept 19, 2015). ELPA21 Partial Credit Scoring Rules Validation Report. Written for the ELPA21 consortium. Washington, DC: CCSSO.

  7. Sample items – Candidate for Partial Credit? Note that most of the sample items are from the interactive demo designed to help students learn how to respond to item types. Easier items were deliberately selected for the demo so that students could concentrate on the interaction.

  8. Sample items – Candidate for Partial Credit?

  9. Items • 89 of the 1,138 TEIs in the item pool were deemed eligible for partial credit • 17 multiple-choice multi-select (MC-MS) • 36 drop-down paragraph-embedded tasks (pre-review decision to score for partial credit due to likely dependencies) • Scoring rules were reviewed for 70 items, all drag-and-drop

  10. Committee Decisions on Scoring Rules • Using consistent scoring rules for a given format • Avoiding automatic credit • Attending to the construct being measured and the PLDs (sometimes leads to different decisions for seemingly similar task types) • Sequence items

  11. Using Consistent Scoring for a Given Format Move x objects from a set of 2x or greater into x spaces. Example: Student reads a paragraph about 2 or more related things, say, electrons, protons, and neutrons. (Note: This is NOT an ELPA21 sample item.) Look at the chart showing facts about atomic particles. Complete the chart by dragging one or more facts about each atomic particle to the correct place in the chart.
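  As an illustration only (this is not an ELPA21 scoring rule), the sketch below shows one way a "move x objects into x spaces" response could earn partial credit, assuming the response is captured as a mapping from chart cell to the fact placed there; the function name, cell labels, and one-point-per-correct-placement rule are all hypothetical.

```python
# Hypothetical partial-credit rule for a drag-to-chart format: one point per
# correctly filled cell. Not ELPA21's actual scoring rule.
def score_drag_to_chart(response, key):
    """Count the chart cells whose placed fact matches the answer key."""
    return sum(1 for cell, fact in response.items() if key.get(cell) == fact)

# Hypothetical atomic-particles item patterned on the slide 11 example.
key = {"electron": "negative charge", "proton": "positive charge", "neutron": "no charge"}
response = {"electron": "negative charge", "proton": "no charge", "neutron": "no charge"}
print(score_drag_to_chart(response, key))  # 2 (two of three placements match the key)
```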

  12. Avoiding automatic credit

  13. Similar Item Types, Different Decision Move the sentences into the correct order. Fold the paper. Cut many shapes into the paper. Draw on the paper. Note: This is not the format used in the demo; it has been altered to fit.

  14. Similar Item Types, different decision NOT an ELPA21 sample item. NOT representative of ELPA21 artwork (this item would NOT pass ELPA21 review). Read about an energy chain. Then answer the questions. Paragraph title Brief paragraph describing what an energy chain is (how the arrows work) and that the grass provides energy for the grasshopper, which provides energy for the frog, which provides energy for the hawk, which fertilizes the grass. Put the pictures in the correct order.

  15. Sequence items 4+ objects

  16. Using Response data to Develop/evaluate scoring Rules • Here, we’ll illustrate the kinds of analyses that were used to support the development/verification of scoring rules for ELPA21 TEIs • Committee decisions were informed by data collected during ELPA21 2015 field test • Operational data—with much larger sample sizes per item—support additional analyses, including item response theory (IRT) modeling of responses • Nominal response model (a particularly general/flexible IRT model) may provide a useful framework for scoring TEIs
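  To make the "informed by data" step concrete, here is a minimal sketch of the kind of descriptive check that can precede IRT modeling: tabulating how examinees in each response category performed on the test overall. The file name, item ID, and column names are assumptions for illustration.

```python
# Descriptive check: do examinees giving "partially correct" responses score
# higher overall than those giving fully incorrect responses?
# (Hypothetical file, item ID, and column names.)
import pandas as pd

responses = pd.read_csv("field_test_item_responses.csv")
# assumed columns: student_id, item_id, response_category, total_score

item = responses[responses["item_id"] == "TEI_0042"]
summary = (item.groupby("response_category")["total_score"]
               .agg(n="count", mean_score="mean")
               .sort_values("mean_score"))
print(summary)  # categories ordered by the mean total score of their examinees
```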

  17. A (simplified) listening item Student hears something like: “Select the [shape A]. Place the [shape A] in the [box 4].” (Figure: a grid of boxes numbered 1 through 9 and three shapes labeled A, B, and C.)

  18. A (simplified) listening item

  19. A fully correct response: student moves [shape A] into [box 4]

  20. A fully incorrect response: wrong object, wrong placement

  21. A partially correct (partially incorrect) response: wrong object, correct placement

  22. A partially correct (partially incorrect) response: correct object, wrong placement

  23. A partially correct (partially incorrect) response: correct object, correct placement

  24. A partially correct (partially incorrect) response: correct object, correct placement, but ALSO incorrect object(s) with wrong placement(s)
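  The walkthrough above suggests a small set of response categories for this item. For illustration only (not the ELPA21 scoring engine), the sketch below maps a response, assumed to be recorded as a set of (shape, box) placements, onto those categories.

```python
# Illustrative categorization of the simplified listening item (slides 17-24).
# The target ([shape A] into [box 4]) and category labels follow the slides;
# the response representation (a set of (shape, box) placements) is assumed.
TARGET = ("A", 4)

def categorize(placements):
    """Map a set of (shape, box) placements to a response category."""
    if not placements:
        return "no response"
    hit = TARGET in placements
    extras = [p for p in placements if p != TARGET]
    if hit and not extras:
        return "fully correct"                      # correct object, correct placement
    if hit and extras:
        return "correct placement plus extra incorrect placements"
    if any(shape == TARGET[0] for shape, _ in placements):
        return "correct object, wrong placement"
    if any(box == TARGET[1] for _, box in placements):
        return "wrong object, correct placement"
    return "fully incorrect"                        # wrong object, wrong placement

print(categorize({("A", 4)}))            # fully correct
print(categorize({("B", 4)}))            # wrong object, correct placement
print(categorize({("A", 7)}))            # correct object, wrong placement
print(categorize({("A", 4), ("B", 7)}))  # correct placement plus extra incorrect placements
```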

  25. Observed responses

  26. Nominal response model The nominal response model (Thissen, Cai, & Bock, 2010, p. 43) “does not assume ordered polytomous response data and can therefore be used to measure traits and abilities with items that have unordered response categories. It can be used to identify the empirical ordering of response categories where that ordering is unknown a priori but of interest, or it can be used to check whether the expected ordering of response categories is supported in data.” Thissen, D., Cai, L., & Bock, R. D. (2010). The nominal categories item response model. Handbook of Polytomous Item Response Theory Models, 43-75.
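  Under the nominal model, the probability of each response category is a multinomial logit in the latent trait: P(k | theta) = exp(a_k * theta + c_k) / sum_j exp(a_j * theta + c_j). The sketch below computes these category probabilities; the slope and intercept values are invented for illustration and are not ELPA21 estimates.

```python
# Nominal response model (Bock): category probabilities as a multinomial logit.
# Parameter values below are made up for illustration, not ELPA21 estimates.
import numpy as np

def nrm_probabilities(theta, a, c):
    """Category probabilities under the nominal response model."""
    z = np.outer(theta, a) + c                    # shape: (n_theta, n_categories)
    z -= z.max(axis=1, keepdims=True)             # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

theta = np.linspace(-3, 3, 7)
a = np.array([0.0, 0.6, 1.4])                     # slopes: fully incorrect, partial, fully correct
c = np.array([0.0, 0.3, -0.5])                    # intercepts
print(nrm_probabilities(theta, a, c).round(3))    # rows sum to 1; "fully correct" rises with theta
```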

  27. Item tracelines (figure: tracelines for the fully incorrect, partially (in)correct, and fully correct response categories)

  28. Observed responses collapsed

  29. Tracelines for collapsed responses

  30. Impact on scale scores (figure annotations: scores increase slightly when partial credit is given; scores decrease, with the largest decreases at the highest scores; scores unaffected)

  31. Evaluating options for item scoring (figure comparing alternative scoring options: fully incorrect, incorrect, and not fully correct responses receive no credit; partially incorrect, partially (in)correct, and multiple-response categories are candidates for partial credit; fully correct responses receive full credit under every option)
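  As a concrete illustration of these scoring options, the sketch below contrasts a dichotomous rule with a partial-credit rule applied to the same response categories; the category labels follow the slides, while the point values are illustrative only.

```python
# Two candidate rules mapping the same response categories to points.
# Category labels follow the slides; the point values are illustrative only.
DICHOTOMOUS = {"fully incorrect": 0, "partially (in)correct": 0, "fully correct": 1}
PARTIAL_CREDIT = {"fully incorrect": 0, "partially (in)correct": 1, "fully correct": 2}

for category in ("fully incorrect", "partially (in)correct", "fully correct"):
    print(f"{category}: dichotomous={DICHOTOMOUS[category]}, "
          f"partial credit={PARTIAL_CREDIT[category]}")
```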

  32. Final comments • Analyses of response data can shed light on the reasonableness of scoring rules, including whether or not partial credit makes sense (and, if so, how much and for which responses) • May be used during rule development or evaluation/validation (if sufficient data are available) • Although response data may demonstrate certain empirical relationships between item response and overall test performance, there may be other considerations in establishing scoring rules

  33. Thank you! • Terri Schuster, Nebraska Department of Education • Mark Hansen, UCLA/CRESST • Phoebe Winter, Independent Consultant SOURCE: http://xkcd.com/1289/
