Can subject matter experts’ ratings of statement extremity be used to streamline the development of unidimensional pairwise preference scales? Stephen Stark
Introduction: applications of noncognitive CAT • For the 3PL model, a sample size of at least 1,000 is recommended to attain adequate classification accuracy. • Pools for noncognitive tests rarely contain more than 50 items per dimension, because it is difficult to generate a large number of descriptors. • Pairwise preference items • Just 20 statements can generate 190 pairwise items
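The 190 figure is the number of unordered pairs that can be drawn from 20 statements, C(20, 2) = 20 × 19 / 2. A minimal Python check, using hypothetical statement labels:

```python
from itertools import combinations
from math import comb

# 20 hypothetical statement labels; the real statements are not listed on the slide.
statements = [f"statement_{i}" for i in range(1, 21)]

# Every unordered pair of distinct statements is one candidate pairwise item.
pairs = list(combinations(statements, 2))

print(len(pairs), comb(20, 2))
```

The count grows quadratically, which is why a small statement pool can still support a fairly large pairwise item pool.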
Using subject matter experts (SMEs) to develop adaptive UPP scales • Marginal maximum likelihood (MML) • Pretesting is usually a lengthy and expensive endeavor. • Subject matter experts (SMEs) • Already a part of test development and validation • Test blueprints: numbers, types, and difficulties of items for each content area • History: • Calibrating statements for behaviorally anchored rating scales (BARS) and behavioral summary scales (BSS) • Scoring situational judgment tests
Using subject matter experts (SMEs) to develop adaptive UPP scales • Sources of MML error: • Sampling error • Priors used during estimation • SME error • Hard to detect (true value unknown)
SMEs in an IRT-based UPP testing framework • Each item pairs 2 statements that measure the same dimension • Each statement is characterized by one location parameter μ • SMEs could rate the statements' locations, and the ratings could then be used for scoring • Empirical data are used to compare the standard errors of scores computed with MML versus SME location parameters • A simulation study examines the recovery of known trait scores when scoring and CAT are based on SME or MML locations
The ZG IRT model for UPPs • [Figure or equation not recovered; surviving labels: θ, μs - μt]
The ZG IRT model for UPPs • μs = 2.0, μt = -1.4 • μs - μt = 3.4 • μs = 0.6, μt = 2.2 • μs - μt = -1.6
The response function is monotonically ↑ or ↓ in θ • Distance ↑, slope ↑ • μs = μt: slope = 0 (Pst = 0.5)
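The behavior listed above can be checked numerically. The sketch below assumes the Zinnes-Griggs response function in the form it is usually given in the literature, Pst(θ) = 1 - Φ(a) - Φ(b) + 2Φ(a)Φ(b), with a = (2θ - μs - μt)/√3 and b = μs - μt. The formula on the original slide did not survive, so this form and the helper names `norm_cdf` and `zg_prob` are assumptions.

```python
from math import erf, sqrt


def norm_cdf(x: float) -> float:
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def zg_prob(theta: float, mu_s: float, mu_t: float) -> float:
    """Probability of preferring statement s to statement t (assumed ZG form)."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    return 1.0 - norm_cdf(a) - norm_cdf(b) + 2.0 * norm_cdf(a) * norm_cdf(b)


thetas = [-3.0 + 0.5 * i for i in range(13)]  # -3.0, -2.5, ..., 3.0

# The two pairs from the slides: mu_s - mu_t = 3.4 and mu_s - mu_t = -1.6.
rising = [zg_prob(th, 2.0, -1.4) for th in thetas]
falling = [zg_prob(th, 0.6, 2.2) for th in thetas]
# Identical locations: the preference probability should not depend on theta.
flat = [zg_prob(th, 1.0, 1.0) for th in thetas]

for th, r, f, e in zip(thetas, rising, falling, flat):
    print(f"theta={th:+.1f}  P(3.4)={r:.3f}  P(-1.6)={f:.3f}  P(0)={e:.3f}")
```

Under this form, a positive μs - μt gives a curve that rises with θ, a negative difference gives a falling curve, and equal locations give a flat line at 0.5.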
Distance ↑, slope ↑, because the choice probability changes rapidly over a narrow range of θ. • Contrary to the intuitive notion that two similar alternatives provide more information: • Distance ↑, information ↑ • Information attains its maximum when |μs - μt| = 2.0
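A rough numerical illustration of the distance and information relationship, again assuming the usual Zinnes-Griggs form Pst(θ) = 1 - Φ(a) - Φ(b) + 2Φ(a)Φ(b) with a = (2θ - μs - μt)/√3 and b = μs - μt, and the standard dichotomous item information I(θ) = P'(θ)² / [P(θ)(1 - P(θ))]. The function names are hypothetical. The sketch only evaluates information at the midpoint of each pair for distances up to 2.0; it does not try to verify where information peaks.

```python
from math import erf, exp, pi, sqrt


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def norm_pdf(x: float) -> float:
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def zg_info(theta: float, mu_s: float, mu_t: float) -> float:
    """Item information for one pairwise item under the assumed ZG form."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    p = 1.0 - norm_cdf(a) - norm_cdf(b) + 2.0 * norm_cdf(a) * norm_cdf(b)
    # dP/dtheta = (2 / sqrt(3)) * phi(a) * (2 * Phi(b) - 1)
    dp = (2.0 / sqrt(3.0)) * norm_pdf(a) * (2.0 * norm_cdf(b) - 1.0)
    denom = p * (1.0 - p)
    return dp * dp / denom if denom > 0.0 else 0.0


# Pairs centered on theta = 0 with increasing distance between the statements.
distances = [0.0, 0.5, 1.0, 1.5, 2.0]
info_at_midpoint = [zg_info(0.0, d / 2.0, -d / 2.0) for d in distances]

for d, info in zip(distances, info_at_midpoint):
    print(f"|mu_s - mu_t| = {d:.1f}  information at midpoint = {info:.3f}")
```

With identical locations the slope is zero, so the item carries no information; as the statements move apart, the midpoint information grows.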
CAT with the ZG UPP model • Generate a pool of pairwise items from the statements’ location parameters • Constraint on the minimum distance between paired statements (not used) • Prior: θ ~ N(0,1) • Initial score = 0 • EAP estimation • Randomly select the next item from the subset of items with near-maximum information (at least 90% of the maximum)
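The scoring step above (N(0,1) prior, initial score of 0, EAP estimation) can be sketched as a sum over a fixed grid of θ values. This is a minimal illustration, not the authors' implementation: the grid, the example items, and the function names are assumptions, and the response function is the commonly cited Zinnes-Griggs form.

```python
from math import erf, exp, sqrt


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def zg_prob(theta: float, mu_s: float, mu_t: float) -> float:
    """Probability of preferring s to t (assumed ZG form)."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    return 1.0 - norm_cdf(a) - norm_cdf(b) + 2.0 * norm_cdf(a) * norm_cdf(b)


GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points from -4 to 4


def eap(items, responses, grid=GRID):
    """Posterior mean and SD of theta.

    items: list of (mu_s, mu_t); responses: 1 if s was preferred, else 0.
    """
    weights = []
    for theta in grid:
        w = exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for (mu_s, mu_t), u in zip(items, responses):
            p = zg_prob(theta, mu_s, mu_t)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, sqrt(var)


# Before any response, the estimate is the prior mean, matching "initial score = 0".
start, start_sd = eap([], [])

# Three hypothetical items in which s is always the higher-located statement.
items = [(1.0, -1.0), (2.0, 0.0), (0.5, -1.5)]
high, high_sd = eap(items, [1, 1, 1])  # always prefers the higher statement
low, low_sd = eap(items, [0, 0, 0])    # always prefers the lower statement

print(f"start={start:.3f}  high={high:.3f}  low={low:.3f}")
```

In an adaptive test, `eap` would be re-run after each response and the updated estimate passed to the item selection step.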
CAT with the ZG UPP model • Termination: • maximum test length reached, • or no satisfactory items remain • Availability constraints • Limit the number of times a statement appears (repeat only once and not within the last 2 or 3 items) • Strict information and availability constraints might result in premature test termination
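The selection and termination rules above can be sketched as follows. This is a simplified illustration under stated assumptions: the response function is the commonly cited Zinnes-Griggs form, the six statement locations are invented, θ is held at 0 instead of being re-estimated after each response, and `select_item` with its parameters (`max_uses`, `gap`, `frac`) is a hypothetical helper.

```python
import random
from itertools import combinations
from math import erf, exp, pi, sqrt


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def zg_info(theta: float, mu_s: float, mu_t: float) -> float:
    """Item information under the assumed ZG form."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    p = 1.0 - norm_cdf(a) - norm_cdf(b) + 2.0 * norm_cdf(a) * norm_cdf(b)
    dp = (2.0 / sqrt(3.0)) * exp(-0.5 * a * a) / sqrt(2.0 * pi) * (2.0 * norm_cdf(b) - 1.0)
    denom = p * (1.0 - p)
    return dp * dp / denom if denom > 0.0 else 0.0


def select_item(theta, mu, pool, administered, rng, max_uses=2, gap=2, frac=0.9):
    """Pick a random item among those with at least `frac` of the maximum information.

    A statement may appear at most `max_uses` times and not within the last `gap` items.
    Returns None when no item satisfies the constraints (early termination).
    """
    uses = {}
    for s, t in administered:
        uses[s] = uses.get(s, 0) + 1
        uses[t] = uses.get(t, 0) + 1
    recent_items = administered[-gap:] if gap > 0 else []
    recent = {k for pair in recent_items for k in pair}
    eligible = [
        (s, t) for s, t in pool
        if (s, t) not in administered
        and uses.get(s, 0) < max_uses and uses.get(t, 0) < max_uses
        and s not in recent and t not in recent
    ]
    if not eligible:
        return None
    info = {pair: zg_info(theta, mu[pair[0]], mu[pair[1]]) for pair in eligible}
    best = max(info.values())
    candidates = [pair for pair, value in info.items() if value >= frac * best]
    return rng.choice(candidates)


rng = random.Random(0)
mu = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]         # hypothetical statement locations
pool = list(combinations(range(len(mu)), 2))   # items as pairs of statement indices
administered = []
max_length = 8
while len(administered) < max_length:
    pair = select_item(0.0, mu, pool, administered, rng)
    if pair is None:                            # no satisfactory item remains
        break
    administered.append(pair)

print(administered)
```

With only six statements and each statement allowed twice, the loop cannot reach eight items, which illustrates how strict constraints can end a test early.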
Study 1 • Effect of using SME estimates on scoring accuracy and criterion validity in an empirical sample • Choose realistic correlations between SME and MML locations for Study 2 • External validity: • Preventative health behavior • Health checklist (8 items, reliability = .76) • Study behavior • Study behavior questionnaire (SBQ, 10 items, reliability = .76)
Participants and Measures • 602 freshmen and sophomores • Female: 77%; male: 23% • 2 dimensions: Order and Self-control • 20 to 25 personality statements for each dimension • 2 SMEs rated the statement locations (μs and μt) on a 7-point scale • The locations were transformed to a -3 to 3 scale. • Average distance = 2.95; distances for the 24 items ranged from 0.5 to 5.0.
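One simple way to put 7-point ratings on a -3 to 3 scale is to subtract the scale midpoint and average across raters. The slide does not say which transformation or aggregation was used, so the linear shift, the averaging, and the example ratings below are all assumptions.

```python
def rescale(rating: float) -> float:
    """Map a rating on a 1..7 scale to -3..+3 by subtracting the midpoint (assumed)."""
    if not 1 <= rating <= 7:
        raise ValueError("rating must lie between 1 and 7")
    return rating - 4.0


def sme_location(ratings) -> float:
    """Average the rescaled ratings of several SMEs (assumed aggregation rule)."""
    return sum(rescale(r) for r in ratings) / len(ratings)


# Hypothetical ratings from two SMEs for the two statements of one item.
mu_s = sme_location([6, 7])
mu_t = sme_location([2, 3])
distance = abs(mu_s - mu_t)

print(mu_s, mu_t, distance)
```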
Analysis • Software: MODFIT 2.0 • Fit statistics for single items, pairs, and groups of three items • Correlation between MML and SME locations • EAP trait score estimates based on MML and SME locations • Marginal reliabilities • Correlations with external criteria
Result • Item 3 shows a problem but was retained. • < 3.0
Trait scores correlated highly for both scales • SME ratings approximate the MML estimates • As expected, Order scores correlate more highly with the criteria than Self-control scores • MML- and SME-based scores correlate with the criteria similarly • Error in the SME locations did not affect the rank order of θ estimates or their criterion validities
Study 2: Scoring Accuracy in CAT • CAT aims at increasing accuracy or reducing testing time. • The UPP method can yield relatively large pools • Statement parameters affect both item selection and scoring, so the detrimental effect of parameter error could propagate in CAT.
Method • Condition 1: locations drawn from Uniform[-3,+3] • Conditions 2 & 3: responses generated from the Condition 1 parameters, with locations estimated by MML; θ ~ N(0,1) • Conditions 4~8:
Method • Item selection: CAT and non-CAT • Test length: 8 items and 15 items • 100 examinees at each θ point: [-3.0, -2.8, …, +3.0] • Bias • AbsBias • RMSE • Correlation
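The four accuracy indices listed above can be computed from paired true and estimated trait scores. The sketch below uses the conventional definitions (mean error, mean absolute error, root mean squared error, Pearson correlation). Whether the study averaged within each θ point or over all examinees is not stated, so pooling everything is an assumption, and the shrunken estimates are made-up data.

```python
from math import sqrt


def accuracy(true, est):
    """Bias, absolute bias, RMSE, and Pearson correlation of estimates with true scores."""
    n = len(true)
    errors = [e - t for t, e in zip(true, est)]
    bias = sum(errors) / n
    abs_bias = sum(abs(d) for d in errors) / n
    rmse = sqrt(sum(d * d for d in errors) / n)
    mean_t, mean_e = sum(true) / n, sum(est) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true, est))
    ss_t = sum((t - mean_t) ** 2 for t in true)
    ss_e = sum((e - mean_e) ** 2 for e in est)
    corr = cov / sqrt(ss_t * ss_e)
    return {"bias": bias, "abs_bias": abs_bias, "rmse": rmse, "corr": corr}


# The 31 theta points from the slide: -3.0, -2.8, ..., +3.0.
true_theta = [-3.0 + 0.2 * i for i in range(31)]
# Hypothetical estimates shrunk toward the prior mean, as EAP scores tend to be.
est_theta = [0.8 * t for t in true_theta]

result = accuracy(true_theta, est_theta)
print(result)
```

Shrinkage of this kind leaves the overall bias and the correlation nearly untouched while still producing nonzero absolute bias and RMSE, which is why the indices are reported together.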
Result • TRUE locations correlated highly with the MML estimates • SME locations correlated with TRUE near the intended values of .9, .8, .7, .6, and .0 • SME8 and SME6 correspond reasonably to the empirical results (.83 and .62) in Study 1
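The slides do not say how the simulated SME locations were produced. One common way to obtain values with a chosen correlation to a set of true values is to mix the standardized true values with independent normal noise, as sketched below. The function names, the seed, and the pool size of 5,000 (chosen only so the realized correlations are stable) are assumptions; a real statement pool would be far smaller and its realized correlations would vary more.

```python
import random
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def noisy_locations(true, r, rng):
    """Return values whose expected correlation with `true` is r (assumed method)."""
    n = len(true)
    mean = sum(true) / n
    sd = sqrt(sum((t - mean) ** 2 for t in true) / n)
    out = []
    for t in true:
        z = (t - mean) / sd                       # standardized true location
        noise = rng.gauss(0.0, 1.0)               # independent rating error
        out.append(mean + sd * (r * z + sqrt(1.0 - r * r) * noise))
    return out


rng = random.Random(2024)
true_mu = [rng.uniform(-3.0, 3.0) for _ in range(5000)]  # Uniform[-3, +3] locations

targets = (0.9, 0.8, 0.7, 0.6, 0.0)
realized = {r: pearson(true_mu, noisy_locations(true_mu, r, rng)) for r in targets}

for r in targets:
    print(f"intended {r:.1f}  realized {realized[r]:.3f}")
```

The generated values are not clipped to [-3, +3], so a few can fall outside the rating range; a closer imitation of SME ratings would also round them to the rating scale.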
For CAT, the correlations for SME6 and SME7 are high. Even with 8 items, the CAT correlations are still higher than those of the 15-item non-CAT tests.
Non-CAT, 15 items • Effect of regression to the mean (EAP): SME9 < SME6 • SME0: responses appear random, and scores average near the prior mean (0)
Discussion • Using other IRT models would be of interest • The correlations between SME- and MML-based trait scores were above .90, so SME-based scoring could be applied to actual personnel decisions; the criterion validities were also acceptable. • Score accuracy would improve with longer tests or with a CAT procedure.
Discussion • Only two SMEs were used; a larger panel should yield more precise ratings • Trained raters can estimate item locations and set standards well in cognitive (ability) tests. • Explore classification accuracy (further simulation) • Pairwise preference items that pair statements from different domains.