Interpretation: How to Use Psychometrics

  1. Interpretation: How to Use Psychometrics

  2. A Different Format • Previous talks were generally about one topic • Today’s presentation: Where does this stuff come up at MP, outside of the psychos? • A little bit of info on several different things

  3. The goals • Understand various psychometric analyses as they arise in day-to-day work • See which stats are used in different applications • Answer questions

  4. Topics Covered • Things you’d find in a key verification file • Classical stats (p-values, point-biserials) • Things you’d find at a form pulling • IRT stats (TCC’s, TIF’s) • Things you’d find in a technical manual • All sorts of info • A question you’d hear at a standard setting • IRT

  5. 1. Key Verification Files • Purpose: To check the correctness of answer keys (MC items) • A list of items whose stats are unusual or merit further investigation • Items identified based on their p-values and/or point-biserials

  6. P-value: The proportion of students answering an item correctly • “How easy is the item?” • Point-biserial: The correlation between item score and total score • “If you do well on the item, do you tend to do well on the test?”
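
As a rough illustration of these two statistics, here is a minimal Python sketch; the data layout, variable names, and the use of the uncorrected total score are assumptions for illustration, not anything from an actual scoring system.

```python
# Minimal sketch, assuming `scores` is an N-students x K-items 0/1 matrix
# of multiple-choice item scores (names and layout are illustrative only).
import numpy as np

def p_values(scores):
    """P-value per item: the proportion of students answering it correctly."""
    return scores.mean(axis=0)

def point_biserials(scores):
    """Point-biserial per item: correlation of item score with total score."""
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, j], total)[0, 1]
                     for j in range(scores.shape[1])])

# Tiny made-up example: 5 students, 3 items
scores = np.array([[1, 0, 1],
                   [1, 1, 1],
                   [0, 0, 1],
                   [1, 0, 0],
                   [1, 1, 1]])
print(p_values(scores))         # how easy is each item?
print(point_biserials(scores))  # do students who get it right do well overall?
```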

  7. When might we be alarmed? • Not many kids are picking the right answer • The p-value is low (less than .25) • Low-performing kids are doing better on the item than high-performing kids • The point-biserial is low (less than .15) • and/or an incorrect answer choice has strange stats

  8. Distractor Stats • Distractor p-value: The proportion of students picking the distractor (say, choice C when the correct answer is B) • “How popular is choice C?” • Flag item if distractor p-value is higher than .3 • Distractor point-biserial: The correlation between picking the distractor and total test score • “If you picked C, how well did you tend to do on the test?” • Flag item if distractor PBS is positive
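
A hedged sketch of how these flag rules could be coded up. The thresholds come from the two slides above; the data layout, function names, and use of the total score as the criterion are illustrative assumptions.

```python
# Minimal sketch: apply the flag rules above to one MC item.
# Assumed inputs: `responses` is an N x K array of chosen options ('A'-'D'),
# `keys` is the length-K answer key, `total` is each student's total score.
import numpy as np

def choice_stats(responses, total, item, choice):
    """P-value and point-biserial for picking a given choice on a given item."""
    picked = (responses[:, item] == choice).astype(float)
    if picked.std() == 0:                 # nobody (or everybody) picked it
        return picked.mean(), 0.0
    return picked.mean(), np.corrcoef(picked, total)[0, 1]

def flag_item(responses, keys, total, item,
              p_cut=0.25, pbis_cut=0.15, dist_p_cut=0.30):
    """Return the reasons (if any) an item would land in a key verification file."""
    flags = []
    key_p, key_pbis = choice_stats(responses, total, item, keys[item])
    if key_p < p_cut:
        flags.append(f"low p-value ({key_p:.2f})")
    if key_pbis < pbis_cut:
        flags.append(f"low point-biserial ({key_pbis:.2f})")
    for choice in "ABCD":
        if choice == keys[item]:
            continue
        d_p, d_pbis = choice_stats(responses, total, item, choice)
        if d_p > dist_p_cut:
            flags.append(f"popular distractor {choice} (p-value {d_p:.2f})")
        if d_pbis > 0:
            flags.append(f"positive point-biserial for distractor {choice}")
    return flags
```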

  9. An Operational Example • A recent item had the following stats: • Key = D • P-value = 0.10 • Point-biserial = -0.02 • P-value for “C” = 0.60 • Point-biserial for “C” = 0.20 • So the key was wrong? Nope

  10. How Can That Happen? • An example: What is the definition of the word travesty? A: Mockery B: Injustice C: Bellybutton D: Some even stupider answer than “bellybutton” • Actual definition: “Any grotesque or debased likeness or imitation” • The correct answer is “A”, but “travesty of justice” threw off the high-performing students

  11. To sum up… • Psychometrics can help us identify items whose keys need to be checked • Stats used: • P-values • Point-biserials • Distractor p-values and point-biserials • P-values & point-biserials should be relatively high, distractor values should be relatively low • The key usually turns out to be right, but that’s OK

  12. 2. Form Pulling • Context: We are choosing items for next year’s exam • Clients like to look at psychometric info when picking items (e.g., MCAS) • We know the stats ahead of time because items were field-tested • Relevant stats: Test Characteristic Curves (TCC’s), raw score cut points, Test Information Functions (TIF’s)

  13. This stuff relates to Item Response Theory (IRT) • TCC is a plot that tells you the expected raw score for each value of ability (denoted theta) • As ability increases, expected raw score increases
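
A minimal sketch of what a TCC computes, under an assumed 3PL IRT model with made-up item parameters (not real test items):

```python
# Minimal sketch: a TCC under an assumed 3PL model with hypothetical items.
import numpy as np

def p_correct(theta, a, b, c):
    """3PL probability of answering correctly at ability theta."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at theta = sum of the item probabilities."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

# Five hypothetical MC items: (discrimination a, difficulty b, guessing c)
items = [(1.0, -1.0, 0.2), (0.8, -0.5, 0.2), (1.2, 0.0, 0.2),
         (0.9, 0.7, 0.2), (1.1, 1.5, 0.2)]
for theta in np.linspace(-3, 3, 7):
    print(f"theta = {theta:+.1f}  expected raw score = {tcc(theta, items):.2f}")
```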

  14. Example of a TCC: 5 Items

  15. Raw Score Cut Points • Suppose test has 4 performance levels: Below Basic, Basic, Proficient, Advanced • How many points do you need in order to reach the Basic level? Proficient? Advanced? • Example: Test goes from 0 to 72. Need 35 to reach Basic; 51 to reach Proficient; 63 to reach Advanced • Standard Setting often tells us theta cut points; clients want to know raw score cuts

  16. Using the TCC to find a cut point • Suppose theta cut is 0.4 • Find expected raw score at 0.4 using the TCC. It is 3.3 • Cut is placed between 3 and 4
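
Continuing the hypothetical five-item 3PL example from the TCC sketch above, the cut-point lookup could be done like this. Purely illustrative: the 3.3 on the slide comes from the real form, not from these made-up parameters.

```python
# Minimal sketch: turn a theta cut into a raw score cut via the TCC.
import math
import numpy as np

def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

items = [(1.0, -1.0, 0.2), (0.8, -0.5, 0.2), (1.2, 0.0, 0.2),
         (0.9, 0.7, 0.2), (1.1, 1.5, 0.2)]   # hypothetical parameters

theta_cut = 0.4
expected = sum(p_correct(theta_cut, a, b, c) for a, b, c in items)
raw_cut = math.floor(expected) + 1   # cut falls between floor and floor + 1
print(f"Expected raw score at theta = {theta_cut}: {expected:.1f}")
print(f"Students need {raw_cut} or more raw points to reach the level")
```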

  17. Test Information Functions • TIF’s tell us the test precision at each level of ability • The higher the curve, the more precision • Easy items give us precision for low values of theta. Similarly: • Hard items give precision at high values • Medium items give precision at medium values
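
A minimal sketch of where the TIF comes from, again under an assumed 3PL model with made-up parameters, showing that easy items pile up information at low theta and hard items at high theta:

```python
# Minimal sketch: item and test information under an assumed 3PL model.
import numpy as np

def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    """Standard 3PL item information formula."""
    p = p_correct(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def test_information(theta, items):
    """TIF: the sum of the item information functions."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

easy_form = [(1.0, -1.5, 0.2)] * 5   # five easy hypothetical items
hard_form = [(1.0, 1.5, 0.2)] * 5    # five hard hypothetical items
for theta in (-2.0, 0.0, 2.0):
    print(f"theta {theta:+.1f}: "
          f"easy form info {test_information(theta, easy_form):.2f}, "
          f"hard form info {test_information(theta, hard_form):.2f}")
```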

  18. Example of a TIF

  19. Why does the client care? • It is often desired that next year’s forms are similar to this year’s forms • Make sure tests have the correct difficulty (TCC, raw score cut points) & precision (TIF) • Match TCC’s, cut points, TIF’s of the two years

  20. Why should the forms be similar? • Theoretically, we should be able to account for differences through equating (Liz’s talk) • However, we want the student experience to be similar from year to year • Don’t want to give an easy test to the Class of ’07 and a hard test to the Class of ’08 • Don’t want to make this year’s test less precise than last year’s

  21. Example: 2007 MCAS, Grade 10 Math • Proposed 2007 TCC was lower than last year’s • Solution: Replace some hard items with easy items

  22. Example, Continued • Proposed 2007 TIF had less info at low abilities, more info at high abilities • Solution: • Replace some hard items with easy items • Use hard items with lower PBS, easy items with higher PBS

  23. Example, Continued • Proposed 2007 raw score cuts lower than 2006 raw score cuts • Solution: Replace some hard items with easy items

  24. Guide to making changes • Some rules of thumb for different problems (table on the slide)

  25. To sum up… • Item Response Theory is useful in form pulling • TCC’s, raw score cuts, TIF’s are often examined • Proposed values should be similar to current year’s • Tests shouldn’t be too easy or hard • Tests should be informative but not too informative • It’s helpful to know how we can change these things based on item stats

  26. 3. Technical Manuals • Things in Technical Manuals vary from program to program • Often see some of the following: • P-values and point-biserials (thanks Louis!) • Test reliabilities (thanks Louis!) • TCC’s and TIF’s (thanks Mike!) • DIF (thanks Won!) • Standard Setting (thanks Liz and Abdullah!) • Equating (thanks in advance Liz!) • Inter-rater reliability (thanks for nothing!) • Decision consistency and accuracy (ditto)

  27. Technical Manuals: P-Values & Point-Biserials • You’ll often see a table like this:

  28. Technical Manuals: Reliabilities (and other stats) • Louis said: Reliability is the correlation between scores on parallel forms. Higher reliability → greater consistency • You’ll often see a table like this:
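
The reliability table itself is not reproduced in this transcript. As a hedged illustration of one common way such reported numbers are estimated, here is coefficient (Cronbach's) alpha computed from made-up item scores; whether a given program reports alpha or some other coefficient is program-specific.

```python
# Minimal sketch: coefficient alpha, a common internal-consistency estimate
# of reliability, from an N-students x K-items score matrix (made-up data).
import numpy as np

def cronbach_alpha(scores):
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```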

  29. Technical Manuals: TCC’s and TIF’s • Give the TCC and TIF of each grade / content area

  30. Technical Manuals: DIF • Won said: An item has DIF if the probability of getting the item right is dependent on group membership (e.g., gender, ethnic group) • Measured Progress uses a method called the Standardized P-Difference • Comparing groups • Male-Female • White-Black • White-Hispanic • Minimum 200 examinees in each group

  31. DIF, Continued • A: [-0.05, 0.05] negligible • B: [-0.1, -0.05) and (0.05, 0.1] low • C: outside [-0.1, 0.1] high
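
A hedged sketch of a standardized p-difference calculation in the spirit described above. It is simplified: the matching on total score and the focal-group weighting follow the common Dorans-Kulick style, and the A/B/C ranges come from the slide, but the variable names, strata, and data layout are illustrative assumptions rather than the production procedure.

```python
# Minimal sketch: standardized p-difference DIF for one item, conditioning on
# total score and weighting by the focal group's score distribution.
import numpy as np

def standardized_p_diff(item_correct, total, group, focal, reference):
    """item_correct: 0/1 per student; total: total scores; group: group labels."""
    num = den = 0.0
    for s in np.unique(total):
        foc = (group == focal) & (total == s)
        ref = (group == reference) & (total == s)
        if foc.sum() == 0 or ref.sum() == 0:
            continue
        w = foc.sum()                                    # focal-group weight
        num += w * (item_correct[foc].mean() - item_correct[ref].mean())
        den += w
    return num / den

def dif_category(delta):
    """A/B/C classification from the slide."""
    if abs(delta) <= 0.05:
        return "A (negligible)"
    if abs(delta) <= 0.10:
        return "B (low)"
    return "C (high)"
```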

  32. DIF, Continued • You may see a table like this:

  33. Technical Manuals: Standard Setting & Equating • Liz and Abdullah discussed Standard Setting • In technical manuals, you’ll often see: • Report / summary of standard setting process • Info about panelists (how many, who they are) • What method was used (e.g., bookmark / Body of Work) • Cut points • Info about panelist evaluations • Equating: Come next week and find out!

  34. Inter-rater reliability • When constructed-response items are rated by multiple scorers, how well do raters agree? • The more agreement, the better • Exact agreement: What % of the time do they give the same score? • Adjacent agreement: What % of the time are they off by 1?
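
A minimal sketch of these two agreement statistics, using made-up scores from two raters on the same set of papers:

```python
# Minimal sketch: exact and adjacent agreement between two raters (made-up data).
import numpy as np

rater1 = np.array([3, 2, 4, 1, 0, 2, 3, 4])
rater2 = np.array([3, 3, 4, 1, 1, 2, 2, 4])

diff = np.abs(rater1 - rater2)
exact = (diff == 0).mean()       # same score awarded
adjacent = (diff == 1).mean()    # scores differ by exactly 1
print(f"Exact agreement: {exact:.0%}  Adjacent agreement: {adjacent:.0%}")
```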

  35. Decision Accuracy and Consistency: Introduction • For most programs, four achievement levels, e.g., Below Basic, Basic, Proficient, Advanced • Decision accuracy: degree to which observed categorizations match true categorizations • Decision consistency: degree to which observed categorizations match those of a parallel form

  36. Intuitive examples of accuracy • TRUE LEVEL: Proficient • OBSERVED LEVEL: Proficient • DIAGNOSIS: ACCURATE (GOOD) • TRUE LEVEL: Proficient • OBSERVED LEVEL: Below Basic • DIAGNOSIS: INACCURATE (BAD). False negative • TRUE LEVEL: Basic • OBSERVED LEVEL: Advanced • DIAGNOSIS: INACCURATE (BAD). False positive

  37. Intuitive examples of consistency • OBSERVED LEVEL, Form 1: Basic • OBSERVED LEVEL, Form 2: Basic • DIAGNOSIS: CONSISTENT (GOOD) • OBSERVED LEVEL, Form 1: Basic • OBSERVED LEVEL, Form 2: Advanced • DIAGNOSIS: INCONSISTENT (BAD)

  38. Decision Accuracy and Consistency: Introduction • Livingston and Lewis (1995) proposed method of estimating decision accuracy/consistency • For most programs, many stats are computed. We will give an example of each • The stats are all based on joint distributions • A joint distribution gives the proportion of times that 2 things both happen. • What proportion of students are truly Basic and are observed as Below Basic?

  39. Joint Distribution: True/Observed Achievement Levels (true status by observed status) • Overall accuracy: 0.7484

  40. Joint Distribution: Observed/Observed Achievement Levels (observed status on Form 1 by Form 2) • Overall consistency: 0.6574
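
The two joint-distribution tables themselves are not reproduced in this transcript, but the overall indices are just sums of diagonal cells. A minimal sketch with a made-up 4x4 joint distribution (not the 0.7484 / 0.6574 tables above):

```python
# Minimal sketch: overall accuracy is the sum of the diagonal of the
# true-by-observed joint distribution; overall consistency is the same
# calculation on the form-1-by-form-2 joint distribution. Made-up numbers.
import numpy as np

# joint[i, j] = proportion of students truly at level i, observed at level j
# (levels 0-3 = Below Basic, Basic, Proficient, Advanced)
joint_true_obs = np.array([
    [0.15, 0.04, 0.01, 0.00],
    [0.05, 0.20, 0.05, 0.00],
    [0.01, 0.05, 0.25, 0.04],
    [0.00, 0.00, 0.05, 0.10],
])

print(f"Overall accuracy: {np.trace(joint_true_obs):.4f}")
```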

  41. Indices Conditional upon Level • Proportion of students correctly classified, given true level • Proportion of students consistently classified by parallel form, given observed level

  42. Indices at Cut Points • Accuracy & consistency at specified cut point • Accuracy: What is the chance that a student is classified on the “correct side” of a cut point? • Consistency: What is the chance that a student is classified on the same side of a cut point twice?
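
Continuing the made-up joint distribution from the sketch above, the conditional and cut-point indices could be read off like this (illustrative only):

```python
# Minimal sketch: conditional accuracy by true level, and accuracy at one cut.
import numpy as np

joint = np.array([   # rows: true level, columns: observed level (made up)
    [0.15, 0.04, 0.01, 0.00],
    [0.05, 0.20, 0.05, 0.00],
    [0.01, 0.05, 0.25, 0.04],
    [0.00, 0.00, 0.05, 0.10],
])

# P(correctly classified | true level): diagonal cell divided by its row sum
conditional = np.diag(joint) / joint.sum(axis=1)
print("Conditional accuracy by true level:", np.round(conditional, 3))

# Accuracy at the cut between levels 2 and 3 (e.g., the Proficient cut):
# the chance a student lands on the correct side of that cut
cut = 2
same_side = joint[:cut, :cut].sum() + joint[cut:, cut:].sum()
print(f"Accuracy at the cut: {same_side:.3f}")
```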

  43. To sum up… • Lots of stuff in technical manuals • Both classical test theory material (p-values, point-biserials, reliabilities) & IRT material (TCC’s, TIF’s, equating) are important to understand • Hopefully, these seminars have helped familiarize you with their contents

  44. 4. Standard Setting • Comes up all the time outside Psychoville • Should be a perfect topic for this talk, but… • Liz and Abdullah already did a wonderful job

  45. 4. Standard Setting • Standard Setting is the process of recommending cut scores between achievement levels • Advanced (A) • Proficient (P) • Below Proficient (BP) • Failing (F) • (Slide diagram: cut points 1, 2, and 3 separate the four levels) • Focus on one FAQ in bookmark: How do we determine the arrangement of items in the ordered item booklets?

  46. Brief Review of Bookmark • Each panelist makes use of the ordered item booklet • Items in the OIB are presented from easiest to hardest. One page per MC item • Panelists’ job is to place bookmark in OIB for each cut • For a given cut, where do panelists place a bookmark? • Where they think borderline students would no longer have a 2/3 chance (or better) of a correct answer • Abdullah said: cut points are derived from bookmark placements
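
A hedged sketch of the item-ordering idea behind the OIB: order items by the theta at which the model gives a 2/3 chance of a correct response. The 3PL form, the item parameters, and the bisection search here are illustrative assumptions, not the production procedure.

```python
# Minimal sketch: order hypothetical MC items by the theta at which a student
# has a 2/3 chance of answering correctly (assumed 3PL model, bisection search).
import numpy as np

def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def theta_at_rp(a, b, c, rp=2/3, lo=-6.0, hi=6.0):
    """Find the theta where P(correct) = rp (P is increasing in theta)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if p_correct(mid, a, b, c) < rp:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

items = {"item_1": (1.0, 0.5, 0.2),
         "item_2": (0.8, -0.3, 0.2),
         "item_3": (1.3, 1.1, 0.2)}
ordered = sorted(items, key=lambda name: theta_at_rp(*items[name]))
print("Ordered item booklet, easiest to hardest:", ordered)
```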

  47. A Very Frequently-Asked Question • First, an FMC: “You messed up the order of the items!” • Then, the FAQ: “Well, how did you determine the order?” • Important: Order is based on actual student performance • We use the concept of IRT

  48. Two MC items: Which is easier? (Plot of two item characteristic curves: an easier item and a harder item)

  49. Depending on the IRT model, this issue can become quite complex
