Stephen Stark

Can subject matter experts’ ratings of statement extremity be used to streamline the development of unidimensional pairwise preference scales? Stephen Stark. Introduction: applications of noncognitive CAT.

Presentation Transcript


  1. Can subject matter experts’ ratings of statement extremity be used to streamline the development of unidimensional pairwise preference scales? Stephen Stark

  2. Introduction: applications of noncognitive CAT • For the 3PL model, a sample size of at least 1,000 is recommended to attain adequate classification accuracy. • Pools for noncognitive tests rarely contain more than 50 items per dimension, because it is difficult to generate a large number of descriptors. • Pairwise preference items • Just 20 statements can generate 190 pairwise items
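
The count of 190 on slide 2 is the number of unordered pairs that can be drawn from 20 statements, 20 × 19 / 2. A minimal sketch that enumerates them (the statement labels are placeholders, not the study's statements):

```python
from itertools import combinations

# Placeholder labels for 20 statements; the real statements are not in the transcript.
statements = [f"S{i}" for i in range(1, 21)]

# Every unordered pair of distinct statements is one candidate pairwise item.
pairs = list(combinations(statements, 2))

print(len(pairs))  # 190
```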

  3. Using subject matter experts (SMEs) to develop adaptive UPP scales • Marginal maximum likelihood (MML) • Pretesting is usually a lengthy and expensive endeavor. • Subject matter experts (SMEs) • Already a part of test development and validation • Test blueprints: numbers, types, and difficulties of items for each content area • History: • Calibrating statements for behaviorally anchored rating scales (BARS) and behavioral summary scales (BSS) • Scoring situational judgment tests

  4. Using subject matter experts (SMEs) to develop adaptive UPP scales • Sources of MML error: • Sampling error • Priors used during estimation • SME error • Hard to detect (true value unknown)

  5. SMEs in an IRT-based UPP testing framework • The 2 statements in an item measure the same dimension • Each statement is characterized by one location parameter μ • SMEs could rate the statements, and the ratings could be used in scoring • Use empirical data to compare the standard errors of scores when location parameters come from MML or from SMEs • A simulation study examines the recovery of known trait scores when scoring and CAT are based on SME or MML locations

  6. The ZG IRT model for UPPs (figure labels: θ, μs - μt)

  7. The ZG IRT model for UPPs • μs = 2.0, μt = -1.4 • μs - μt = 3.4 • μs = 0.6, μt = 2.2 • μs - μt = -1.6

  8. Response functions are monotonically ↑ or ↓ • Distance ↑, slope ↑ • μs = μt: slope = 0 (Pst = 0.5)
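
The model equation on slides 6 to 8 did not survive in this transcript. The sketch below uses the form in which the Zinnes-Griggs (ZG) model is commonly written, P_st(θ) = 1 - Φ(a) - Φ(b) + 2Φ(a)Φ(b), with a = (2θ - μs - μt)/√3 and b = μs - μt. Treat that formula, and the function names `phi_cdf` and `zg_prob`, as assumptions rather than the presenter's own notation. The first two location pairs come from slide 7; the third is an added equal-location case.

```python
from math import erf, sqrt

def phi_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def zg_prob(theta: float, mu_s: float, mu_t: float) -> float:
    """Probability of preferring statement s over t (assumed ZG form)."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    big_a, big_b = phi_cdf(a), phi_cdf(b)
    return 1.0 - big_a - big_b + 2.0 * big_a * big_b

thetas = (-3.0, -1.0, 0.0, 1.0, 3.0)
for mu_s, mu_t in [(2.0, -1.4), (0.6, 2.2), (1.0, 1.0)]:
    # Positive mu_s - mu_t gives a rising curve, negative a falling one,
    # and equal locations give a flat curve at 0.5.
    curve = [round(zg_prob(th, mu_s, mu_t), 3) for th in thetas]
    print(f"mu_s={mu_s}, mu_t={mu_t}: {curve}")
```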

  9. Distance ↑, slope ↑, because the choice probability changes rapidly over a narrow range. • Contrary to the intuitive notion that two similar alternatives provide more information: • Distance ↑, information ↑ • Information attains its maximum when |μs - μt| = 2.0

  10. CAT with the ZG UPP model • Generate a pool of pairwise items from the statements’ location parameters • Constraints on the minimum distance (not used) • Prior ~ N(0,1) • Initial score = 0 • EAP estimation • Randomly select an item from the subset of items with near-maximum information (at least 90% of the maximum)

  11. CAT with the ZG UPP model • Termination: • Maximum test length reached, • or no satisfactory items remain • Availability constraints • Limit the number of times a statement appears (it may repeat only once, and not within the last 2 or 3 items) • Strict information and availability constraints might result in premature test termination
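
The CAT loop on slides 10 and 11 can be sketched as follows. This is a simplified illustration, not the presenter's software. It assumes the ZG response function in its commonly written form, computes EAP on a fixed grid under the N(0,1) prior, approximates item information numerically as P'(θ)² / [P(1 - P)], and stops only at a maximum test length. The statement-availability constraints are omitted, and the statement locations, grid, and function names are hypothetical.

```python
import random
from itertools import combinations
from math import erf, exp, sqrt

def phi_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def zg_prob(theta, mu_s, mu_t):
    # Assumed ZG form: probability of preferring statement s over t.
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    big_a, big_b = phi_cdf(a), phi_cdf(mu_s - mu_t)
    return 1.0 - big_a - big_b + 2.0 * big_a * big_b

def item_info(theta, mu_s, mu_t, h=1e-4):
    # Fisher information for a dichotomous item, with a numerical derivative.
    p = min(max(zg_prob(theta, mu_s, mu_t), 1e-12), 1.0 - 1e-12)
    dp = (zg_prob(theta + h, mu_s, mu_t) - zg_prob(theta - h, mu_s, mu_t)) / (2.0 * h)
    return dp * dp / (p * (1.0 - p))

GRID = [-4.0 + 0.1 * k for k in range(81)]  # quadrature points (assumed)

def eap(responses):
    """EAP score; responses is a list of ((mu_s, mu_t), u), u = 1 if s preferred."""
    weights = []
    for theta in GRID:
        w = exp(-0.5 * theta * theta)  # unnormalized N(0,1) prior
        for (mu_s, mu_t), u in responses:
            p = zg_prob(theta, mu_s, mu_t)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def select_item(theta_hat, pool, rng, fraction=0.9):
    # Pick at random among items within 90% of the maximum information.
    infos = [item_info(theta_hat, s, t) for s, t in pool]
    cutoff = fraction * max(infos)
    candidates = [item for item, i in zip(pool, infos) if i >= cutoff]
    return rng.choice(candidates)

rng = random.Random(0)
locations = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]  # hypothetical statement locations
pool = list(combinations(locations, 2))         # 15 pairwise items
true_theta = 1.0                                # simulated examinee
responses, theta_hat = [], 0.0                  # initial score = 0

for _ in range(8):                              # termination: maximum test length
    item = select_item(theta_hat, pool, rng)
    pool.remove(item)                           # each item is used at most once
    u = 1 if rng.random() < zg_prob(true_theta, *item) else 0
    responses.append((item, u))
    theta_hat = eap(responses)

print(round(theta_hat, 2))
```

Selecting at random within the near-maximum subset, rather than always taking the single most informative item, spreads exposure across items at a small cost in information.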

  12. Study 1 • Effect of using SME estimates on scoring accuracy and criterion validity in an empirical sample • Choose realistic correlations between SME and MML locations for Study 2 • External validity: • Preventative health behavior • Health checklist (8 items, reliability = .76) • Study behavior • Study Behavior Questionnaire (SBQ, 10 items, reliability = .76)

  13. Participants and Measures • 602 freshmen and sophomores • Female: 77%, Male: 23% • 2 dimensions: Order and Self-control • 20~25 personality statements for each dimension • 2 SMEs rated μs and μt on a 7-point scale • The locations were transformed to a -3 to 3 scale. • Average distance = 2.95; distances for the 24 items range from 0.5 to 5.0.
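
The transcript does not say how the 7-point ratings were transformed to the -3 to 3 scale. One simple possibility, shown here purely as an assumption, is to average the two SMEs' ratings and subtract the scale midpoint of 4. The ratings below are made up for illustration.

```python
def to_location(ratings):
    """Map 7-point ratings to a -3..3 location (assumed: mean rating minus 4)."""
    return sum(ratings) / len(ratings) - 4.0

# Hypothetical ratings from two SMEs for the two statements in one item.
mu_s = to_location([6, 7])
mu_t = to_location([2, 3])

print(mu_s, mu_t, abs(mu_s - mu_t))  # 2.5 -1.5 4.0
```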

  14. Analysis • Software: MODFIT 2.0 • Fit statistics for single items, pairs, and groups of three items • Correlation between MML locations and SME locations • EAP trait score estimates based on MML and SME locations • Marginal reliabilities • Correlations with external criteria

  15. Results • Item 3 was problematic but was retained. • < 3.0

  16. Inter-rater correlations: .95 and .91

  17. Range of locations: MML < SMEs, because of the prior

  18. In terms of Order: SME > MML

  19. Correlation(MML, SME): 0.83 and 0.62

  20. Trait scores correlated highly for both scales • SME ratings approximate the MML estimates • As expected, Order scores correlate more highly with the criteria than Self-control scores • MML and SME scores correlate with the criteria similarly • Error in the SME locations did not affect the rank order of θ or the criterion validities

  23. Study 2: Scoring Accuracy in CAT • CAT can increase accuracy or reduce testing time. • The UPP method can yield relatively large pools • Statement parameters affect both item selection and scoring, so the detrimental effect of error in them could propagate in CAT.

  24. Method • Condition 1: true locations ~ Uniform[-3, +3] • Conditions 2 & 3: generate responses from Condition 1 and estimate locations using MML, θ ~ N(0,1) • Conditions 4~8:

  25. Method • Item selection: CAT and non-CAT • Test length: 8 items and 15 items • 100 examinees at each θ point: [-3.0, -2.8, …, +3.0] • Bias • AbsBias • RMSE • Correlation
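
The four accuracy indices on slide 25 are not defined in the transcript. The sketch below uses their usual definitions, which is an assumption: mean error, mean absolute error, root mean squared error, and the Pearson correlation between true and estimated θ. The function name and the three data points are illustrative only.

```python
from math import sqrt

def accuracy_indices(true_thetas, estimates):
    """Bias, AbsBias, RMSE, and Pearson correlation (usual definitions assumed)."""
    n = len(true_thetas)
    errors = [e - t for t, e in zip(true_thetas, estimates)]
    mean_t = sum(true_thetas) / n
    mean_e = sum(estimates) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_thetas, estimates))
    var_t = sum((t - mean_t) ** 2 for t in true_thetas)
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    return {
        "bias": sum(errors) / n,                      # signed errors can cancel
        "abs_bias": sum(abs(x) for x in errors) / n,  # absolute errors cannot
        "rmse": sqrt(sum(x * x for x in errors) / n),
        "correlation": cov / sqrt(var_t * var_e),
    }

# Estimates shrunk toward 0: zero bias overall, but nonzero AbsBias and RMSE.
result = accuracy_indices([-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5])
print(result)
```

The toy data show why Bias and AbsBias are reported separately: shrinkage toward the prior mean produces errors of opposite sign at the two ends of the θ range, which cancel in the signed average.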

  26. Results • TRUE locations correlated highly with MML estimates • SME locations correlated with TRUE near the intended values of .9, .8, .7, .6, and .0 • SME8 & SME6 correspond reasonably to the empirical results (.83 & .62) in Study 1
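
The transcript does not describe how the simulated SME locations were generated. One standard way to produce values that correlate with the true locations at an intended level r, shown here as an assumption, is to mix the standardized true values with independent normal noise, using weights r and √(1 - r²). All names are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def noisy_locations(true_locs, r, rng):
    """Return locations whose population correlation with true_locs is r."""
    n = len(true_locs)
    mean = sum(true_locs) / n
    sd = sqrt(sum((v - mean) ** 2 for v in true_locs) / n)
    noise_weight = sqrt(1.0 - r * r)
    # Keep the mean and spread of the true locations; add noise to reach r.
    return [mean + r * (v - mean) + sd * noise_weight * rng.gauss(0.0, 1.0)
            for v in true_locs]

rng = random.Random(1)
true_locs = [rng.uniform(-3.0, 3.0) for _ in range(2000)]  # as in Condition 1
for r in (0.9, 0.8, 0.7, 0.6, 0.0):
    sme_like = noisy_locations(true_locs, r, rng)
    print(r, round(pearson(true_locs, sme_like), 2))
```

With only a few dozen statements per scale, the realized correlation in any one replication can differ noticeably from the intended value; the large sample here is only to make the target visible.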

  29. It is mimicked in MML1000 & MML500

  30. SME shows larger AbsBias and RMSE

  31. With 8 items, correlations for the SME conditions are high (.70~.90)

  32. Correlations for SME6 and SME8 are very high.

  33. For CAT, correlations for SME6 and SME7 are high. Even with 8 items, CAT correlations are still higher than those for non-CAT with 15 items.

  34. Non-CAT, 15 items • Effect of regression to the mean (EAP): SME9 < SME6 • SME0 behaves like random responding, with scores averaging near the prior mean (0)

  35. CAT, 15 items

  36. Discussion • There is interest in using other IRT models • The correlations between SME-based and MML-based trait scores were above .90, so the approach can be applied to actual personnel decisions. Further, the validity is acceptable. • Score accuracy would improve with a longer test or by implementing a CAT procedure.

  37. Discussion • Only two SMEs were used; a larger number should give more precision • Trained raters can estimate locations and set standards well in cognitive (ability) tests. • Explore classification accuracy (advanced simulation) • Pairwise preferences between different domains.
