Automated Scoring for High Stakes Assessments: Ready for Prime Time? Sue Lottridge, Ph.D. Charles Yee, Ph.D. Pacific Metrics Corporation June 29, 2012
Introduction & Background • Both PARCC and SBAC are creating constructed response (CR) items that are expected to be fully or partially scored by automated scoring engines • From a consequential validity and potentially construct validity standpoint, CR items have value in large-scale assessment (formative, benchmark, and summative) • However, little is known about what CR item types are score-able beyond ‘very’ short or grid-based items
Introduction & Background - 2 • Little is known because: • Automated Scoring (AS) expertise has generally resided with vendors and their clients • AS remains a highly empirical field, driven by individual item demands and student response data • Electronic scored responses available for a variety of CR item types are not common (although growing more common) • We do know that: • Quality human scoring is a prerequisite to quality computer scoring • AS staff should participate in item/task and rubric development, and in range finding and scoring • AS engine calibration requires a level of knowledge of scoring decisions and rubric interpretation that is often not required of human scoring
Introduction & Background - 3 • In this presentation, we offer a general continuum for categorizing constructed response items and describe the issues that cause difficulty in scoring • Our focus is on: • CR items that elicit answers of 1-5 sentences • CR items across a range of content areas • CR items that are NOT mathematical computation items
Automated Scoring of CR Items • The degree of difficulty posed to AS depends on how predictable and limited the semantic, lexical and syntactic variety of the responses is. • The more limited and predictable the responses, the more likely AS is to be effective. • However, limiting the variety of responses by crafting restrictive test questions and formats narrows the types of problem solving skills we wish to address, and runs counter to the purpose of using CR questions in the first place (Bennett, 2011).
Questions in Automated Scoring • What types of linguistic phenomena are commonly exhibited in an item’s responses? • How computationally complex are the technologies needed to process and score a response? • How complex is the labeling scheme, and how large is the dataset to be labeled? • Which type of statistical model yields the highest agreement rate with human scores on the dataset?
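The "human agreement rate" in the last question can be made concrete with a minimal Python sketch. It computes exact agreement and Cohen's kappa between human and engine scores. The function names and the eight scores are illustrative assumptions; the slides do not say which agreement statistic is used in practice.

```python
from collections import Counter

def exact_agreement(human, engine):
    """Fraction of responses on which the two score lists agree exactly."""
    if len(human) != len(engine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == e for h, e in zip(human, engine)) / len(human)

def cohens_kappa(human, engine):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = exact_agreement(human, engine)
    human_counts = Counter(human)
    engine_counts = Counter(engine)
    # Chance agreement: probability that both raters independently give the same score.
    expected = sum(human_counts[s] * engine_counts[s] for s in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical scores on a 0-2 rubric for eight responses.
human_scores = [0, 1, 2, 2, 1, 0, 2, 1]
engine_scores = [0, 1, 2, 1, 1, 0, 2, 2]

agreement = exact_agreement(human_scores, engine_scores)
kappa = cohens_kappa(human_scores, engine_scores)
```

Kappa is reported alongside raw agreement because raw agreement can look high simply when most responses receive the same score.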
Challenging Linguistic Phenomena The use of language in a CR largely correlates with the CR item type. Item types, ordered from easier to harder to score: • Bag of words/keyword search • Explain/describe what is in the prompt - a diagram, a table, an essay, etc. • Higher level abstractions related to the prompt (what are the hypotheses and assumptions, the constants and variables, correlations, etc.) • Compare/contrast X and Y (and Z…) • Explain/analyze how something works, “why” questions • Open-ended questions
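A minimal Python sketch of the simplest item type on this list shows why keyword search alone falls short for compare/contrast items. The helper names and the two sample responses are illustrative assumptions, not taken from any scoring engine.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word tokens, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def keyword_hits(text, keywords):
    """Return the expected keywords that occur somewhere in the response."""
    return set(keywords) & set(bag_of_words(text))

# Two hypothetical responses: the second reverses the claim made by the first.
correct = "Ordinal variables are ordered. Categorical variables are not."
reversed_claim = "Categorical variables are ordered. Ordinal variables are not."

keywords = {"ordinal", "categorical", "ordered"}
same_bag = bag_of_words(correct) == bag_of_words(reversed_claim)
```

Both responses contain every keyword and produce identical bags of words, so a scorer that ignores word order cannot tell the correct comparison from the reversed one.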
Comparison Item Item: Describe how categorical and ordinal variables are different. Answer: A categorical variable is one that has two or more categories, but no intrinsic ordering to the categories (e.g. gender, hair color). In an ordinal variable, there are clear orderings (e.g. level of education, income brackets).
Typical Student Responses Ellipsis: • Categorical variables have no ordering. Ordinal variables do (have ordering). • Ordinal variables are ordered. Categorical variables are the opposite. • Ordinal variables require ordering. Unlike categorical variables. (A more general phenomenon) Pronouns: • Ordinal variable is ordered. It is not for categorical variable. • Ordering is important for ordinal variable. It is not for categorical variable.
“Why” Question Item: Why do spiral staircases in medieval castles run clockwise? Answer: All knights used to be right-handed. When an intruding army climbs the stairs, they have a harder time using their right hand, which holds the sword. This is due to the confined space on their right, where the center of the spiral lies. Left-handed knights would have had no trouble, except left-handed people could never become knights because it was assumed that they were descendants of the devil.
“Why” Student Responses Statements of Causality: • It is easier to defend because the attackers are right-handed, and going up the stairs would give them less space to wield the sword. Conditional Constructions: • There is less space on the right. (Therefore) if the attackers are right-handed, (then) they will not be able to use their sword effectively. Entailment: • Clockwise spiral staircases are cramped on the right going up. This makes it easy to defend (or hard to attack).
Complexities in CR • A perfectly correct CR must satisfy a set of rubrics. • A single rubric may consist of a number of Concepts. • A Concept is a semantic entity. Like all meaning, it can be expressed in a very large (but finite) number of ways. • The syntactic manifestation of a Concept is one or more sentences. • Many sentences can describe the same Concept, either explicitly or through entailment or presupposition. • Synonymous words in the same context or grammatical construction can be interchanged.
Complexities in CR This causes an explosion of possible data for us to handle! $S = (C \times G)^{n}$ where S = No. of possible sentences that can satisfy a rubric, C = No. of Concepts, G = No. of grammatical constructions for a Concept, n = Total number of synonyms
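To give a feel for the growth, here is a minimal Python sketch that evaluates the formula for small values. It assumes the trailing n on the slide is an exponent; the function name and the sample values of C, G and n are hypothetical.

```python
def possible_sentences(concepts, constructions, synonyms):
    """Evaluate S = (C * G) ** n, reading the slide's trailing n as an exponent."""
    if min(concepts, constructions, synonyms) < 0:
        raise ValueError("counts must be non-negative")
    return (concepts * constructions) ** synonyms

# Hypothetical rubric: 2 Concepts, 3 grammatical constructions each,
# with the number of synonyms n growing from 1 to 5.
growth = [possible_sentences(2, 3, n) for n in range(1, 6)]
# growth == [6, 36, 216, 1296, 7776]
```

Even with only two Concepts and three constructions, five synonyms already give several thousand candidate sentences under this reading.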
Computational Complexity in CR AS • The goal is to analyze student responses and match them to some expected, ‘score-able’ form. • The technologies used to analyze CR strings to obtain the information needed for AS vary in their level of complexity. • Which technology to use depends on the types of linguistic phenomena that consistently recur in the responses. • The choice of AS technology therefore depends on the type of CR item we are dealing with.
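One low-complexity way to match responses to a score-able form is regular-expression matching, sketched below in Python for the categorical/ordinal comparison item. The concept names and patterns are hypothetical and written only for this illustration. A production engine would need far broader coverage, and this sketch does not handle negation in the second pattern.

```python
import re

# Hypothetical patterns, one per Concept in the rubric. [^.]* keeps each
# match inside a single sentence.
CONCEPT_PATTERNS = {
    "categorical_unordered": re.compile(
        r"\bcategorical\b[^.]*\b(?:no|not|without)\b[^.]*\border(?:ed|ing)?\b",
        re.IGNORECASE,
    ),
    "ordinal_ordered": re.compile(
        r"\bordinal\b[^.]*\border(?:ed|ing|ings)?\b",
        re.IGNORECASE,
    ),
}

def matched_concepts(response):
    """Return the names of the Concepts whose pattern matches the response."""
    return {name for name, pattern in CONCEPT_PATTERNS.items()
            if pattern.search(response)}

explicit = ("A categorical variable has no intrinsic ordering. "
            "In an ordinal variable, there are clear orderings.")
elliptical = "Categorical variables have no ordering. Ordinal variables do."

explicit_hits = matched_concepts(explicit)
elliptical_hits = matched_concepts(elliptical)
```

The explicit response matches both Concepts, while the elliptical one matches only the first: "Ordinal variables do" never states the ordering, so surface patterns miss it. Recovering the elided content calls for heavier technology, such as parsing.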
Annotation and Dataset • Depending upon the item, student responses are manually labeled, a common linguistic method for ‘learning’ language • In the case of CR items, labeling depends upon the complexity of the rubric and the nature of the student response. • As we know, AS is a data-hungry monster: the more responses we have, generally the better the system performs.
Statistical Modeling of Dataset Most of the NLP technologies currently available utilize one or more of the following statistical models: • Naive Bayes Classifier • Maximum Entropy Classifier • Hidden Markov Model • Latent Semantic Analysis • Support Vector Machines • Conditional Random Field
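As an illustration of the first model on this list, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, in plain Python. The class, the six labeled responses, and the 0/1 scores are all hypothetical and far smaller than any real calibration set. It is not the implementation used by any particular engine.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase word tokens; hyphenated words split into their parts."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant
        self.label_counts = Counter()           # responses per score
        self.word_counts = defaultdict(Counter) # token counts per score
        self.vocab = set()

    def fit(self, responses, scores):
        for text, score in zip(responses, scores):
            self.label_counts[score] += 1
            for tok in tokenize(text):
                self.word_counts[score][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, count in self.label_counts.items():
            logp = math.log(count / total)      # log prior
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for tok in tokenize(text):
                if tok in self.vocab:           # skip unseen words
                    logp += math.log(
                        (self.word_counts[label][tok] + self.alpha) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical human-scored responses to the spiral staircase item.
train = [
    ("The attackers are right-handed so they have less space to swing the sword.", 1),
    ("Right-handed attackers cannot use their sword going up the stairs.", 1),
    ("It is easier to defend because attackers have no room on the right.", 1),
    ("The stairs look nicer that way.", 0),
    ("Castles were built by kings a long time ago.", 0),
    ("I do not know why the stairs turn.", 0),
]
model = NaiveBayes().fit([t for t, _ in train], [s for _, s in train])
predicted = model.predict("Attackers hold the sword in their right hand and have less space.")
```

With so few training responses the model only picks up surface vocabulary, which is why the amount of labeled data matters so much for this family of models.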
Conclusion • We’ve presented a continuum of CR items and their associated AS ‘score-ability’ • The ability of an AS engine to score items depends upon: • Linguistic complexity • Computational complexity • Annotation • Statistical models
Bibliography • Randy Elliot Bennett, Automated Scoring of Constructed Response Literacy, ETS, 2011 • Daniel Cer, Marie-Catherine de Marneffe, Dan Jurafsky and Chris Manning, Parsing to Stanford Dependencies: Trade-offs between speed and accuracy, In Proceedings of Language Resources Evaluation Conference (LREC-10), 2010 • Claudia Leacock and Martin Chodorow, C-rater: Automated Scoring of Short-Answer Questions, Computers and the Humanities 37(4): 389-405, 2003 • Russ Cox, Regular Expression Matching Can Be Simple And Fast, 2007, from http://swtch.com/~rsc/regexp/regexp1.html
Bibliography • Jana Sukkarieh and John Blackmore, c-rater: Automatic Content Scoring for Short Constructed Responses, Florida Artificial Intelligence Research Society (FLAIRS): Proceedings of the Twenty-Second International FLAIRS Conference, 2009, retrieved November 18, 2010, from http://www.aaai.org/ocs/index.php/FLAIRS/2009/paper/download/122/302 • Jana Z. Sukkarieh and Stephen G. Pulman, Information Extraction and Machine Learning: Auto-Marking Short Free Text Responses to Science Questions, AIED 2005: 629-637 • R.E. Traub and C.W. Fisher, On the equivalence of constructed response and multiple-choice tests, Applied Psychological Measurement, 1(3): 355-369, 1977