
How to Assess and Measure Competency


Presentation Transcript


  1. How to Assess and Measure Competency Robert C. Shaw, Jr., PhD Program Director

  2. Presentation Outline • Describe a program’s responsibilities • Assess appropriate content • Measure abilities as precisely as possible • Reference each cut score to a criterion

  3. The validity claim • Our program is confident we can make valid inferences from an assessment because • we carefully selected and structured the content and • observed scores are reasonably precise • Weakness in either claim diminishes the validity argument

  4. Define appropriate content What should we assess?

  5. Information sources for content • Stakeholders’ expectations • Certification Board’s expectations

  6. What should we assess? • A program should seek multiple opinions about program content • May mean more than one faculty person in the program • Could extend to survey results from several stakeholders • Those who hire your graduates • Those who graduated

  7. Describe potential content • Define potential content by describing job behaviors or tasks • Interpret ABG results • Determine the appropriate time to refer a patient for consultation from another service • Adjust mechanical ventilation settings to optimize oxygenation for a patient while minimizing the risk of pulmonary injury

  8. Define terminal behaviors • Focus terminal assessments on end-product behavior you expect students to master • Insert a pulmonary artery catheter in a patient within a critical care setting using standard technique while minimizing risks of infection and lung involvement • Integrate pulmonary function testing results with patient history and other laboratory results to produce a diagnosis

  9. Measure task criticality • Typically expressed by the interaction of an • importance/significance/risk measure and a • frequency/extent measure

  10. Potential survey measurements • How important is the task to success? OR How significant is the task to safe and effective practice? • 4=Extremely 3=Very 2=Moderately 1=Minimally

  11. Potential survey measurements • If this task is incorrectly performed, how strong is the risk? • 3=Potentially fatal 2=Likely to increase morbidity 1=Unlikely to have an adverse effect • Alternative scale: 3=High 2=Moderate 1=Low

  12. Potential survey measurements • How frequently do you perform the task? • 3=Every week 2=A few times each year 1=Less than once a year • Alternative scale: 3=Very often 2=Occasionally 1=Infrequently

  13. Potential survey measurements • Have you performed the task in the last year? • 1=Yes 0=No

  14. What can we do with task measurements? • Norm-referenced approach • Rank order tasks from most to least critical • Start at the top and work down using available time • Criterion-referenced approach • Identify tasks that are sufficiently critical to ensure program coverage and competency assessment
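
A minimal sketch of how a task criticality index could be computed and used in both approaches above; the product of mean importance and mean frequency ratings, the example tasks, and the 6.0 threshold are illustrative assumptions, not prescriptions from the presentation.

```python
# Turn stakeholder survey ratings into a task criticality index,
# then use it norm-referenced (rank) and criterion-referenced (cut).
from statistics import mean

# Hypothetical survey data: importance on the 1-4 scale, frequency on the 1-3 scale.
ratings = {
    "Interpret ABG results":            {"importance": [4, 4, 3], "frequency": [3, 3, 3]},
    "Adjust ventilator settings":       {"importance": [4, 3, 4], "frequency": [3, 2, 3]},
    "Insert pulmonary artery catheter": {"importance": [3, 3, 2], "frequency": [1, 1, 2]},
}

# Criticality as the interaction (product) of the two mean ratings.
criticality = {
    task: mean(r["importance"]) * mean(r["frequency"])
    for task, r in ratings.items()
}

# Norm-referenced use: rank tasks from most to least critical.
for task, score in sorted(criticality.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:5.2f}  {task}")

# Criterion-referenced use: keep every task at or above a chosen cut.
CUT = 6.0  # assumed threshold for "sufficiently critical"
covered = [t for t, s in criticality.items() if s >= CUT]
```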

  15. Select item type(s) for each assessment • Constructed response (e.g., short answer, essay, performance) • Short development time • Long scoring time • Scores have strong subjective characteristics • Selected response (e.g., true/false, matching, multiple-choice) • Long development time • Short scoring time • Scores have strong objective characteristics

  16. High-stakes terminal assessments should be standardized • Specify how the assessment should look before writing/selecting items • Test specifications ensure each assessment is similar, fair, and covers critical content

  17. Test specifications are typically two-dimensional

  18. Entire test blueprint/matrix

  19. Test specifications and items • Each item should be linked to a task and a cognitive process level • It helps to store items in a database • A sophisticated database will permit additional layers of classification • Acute/chronic care • Age groups
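
A minimal sketch of an item bank record carrying the classifications described above, plus a tally of the two-dimensional blueprint; the field names and example values are assumptions for illustration, not the presenter's database design.

```python
# Each item is linked to a task and a cognitive process level, with
# optional extra classification layers (care setting, age group).
from collections import Counter
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    task: str               # content dimension: the job task the item samples
    cognitive_level: str    # e.g., recall / application / analysis
    care_setting: str = ""  # optional layer, e.g., acute vs. chronic
    age_group: str = ""     # optional layer

bank = [
    Item("001", "Interpret ABG results", "application", "acute", "adult"),
    Item("002", "Adjust ventilator settings", "analysis", "acute", "adult"),
    Item("003", "Interpret ABG results", "recall", "chronic", "geriatric"),
]

# Tally the blueprint: how many items fill each task-by-cognitive-level cell.
blueprint = Counter((item.task, item.cognitive_level) for item in bank)
for (task, level), count in sorted(blueprint.items()):
    print(f"{task:35s} {level:12s} {count}")
```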

  20. Item banking software • FastTest $$$ • www.assess.com/frmSoftCat.htm • ExamView $$$ • www.pearsonncs.com/examview/examview.htm • LXR*Test $$$ • www.lxrtest.com/

  21. Measure abilities precisely Are we confident an assessment has yielded a sufficiently precise ability estimate?

  22. Reliability • Theoretical premise • Observed scores are assumed to express true ability plus some measurement error • High reliability implies low measurement error

  23. Reliability • Reliability indices are interpreted like R² values: they express the proportion of observed score variance that can be attributed to true score variance • How high is high enough? • A test score reliability value of at least .85 is a characteristic of large-scale, standardized assessments; many exceed .90 • Sufficiently reliable test scores from a test built by a program should show values of at least .60

  24. Reliability • Reliability is an attribute of a set of test scores; it is not an attribute of a test • Therefore, a program should assess reliability for each group of scores • KR20 is appropriate for dichotomously scored (0,1) items • Coefficient alpha works for polytomously (0, 1,…n) scored items
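
A minimal sketch of coefficient alpha computed from a score matrix; for 0/1 items the same formula reduces to KR20. The toy score data are invented for illustration.

```python
# Coefficient alpha = (k / (k - 1)) * (1 - sum(item variances) / total score variance)
def coefficient_alpha(scores):
    """scores: one row per examinee, each row a list of item scores.
    With dichotomous (0,1) items this is equivalent to KR20."""
    k = len(scores[0])                       # number of items
    totals = [sum(row) for row in scores]    # each examinee's total score

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))

# Toy data: 5 examinees by 6 dichotomously scored items.
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(f"KR20 / alpha = {coefficient_alpha(scores):.2f}")
```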

  25. Why are selected response items used for so many assessments? • Assuming the time to assess is constant, more responses can be elicited from students using selected response items • more items = • broader content coverage = • increased information = • enhanced measurement precision = • stronger validity • Scores are more strongly objective

  26. Add items or options? • A program cannot go wrong by adding more items to an assessment • A program may only consume space and time by adding more options to multiple-choice items • There is growing evidence that items with 3 options are optimal, particularly when doing so permits inclusion of more items on an assessment • Dr. Thomas Haladyna, Arizona State University

  27. Up to a point, measurement precision and item quantity are directly related • [Chart: test score reliability plotted against item count, with separate curves for higher quality items and lower quality items]
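
One common way to draw the curve on this slide is the standard Spearman-Brown prophecy formula, which is not named in the presentation; the starting reliabilities below are invented to contrast higher and lower quality items.

```python
# Predicted reliability as test length grows, assuming added items
# are of the same quality as the existing ones.
def spearman_brown(reliability, length_factor):
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Higher quality items start at a better per-length reliability and plateau sooner.
for label, r_base in [("higher quality items", 0.60), ("lower quality items", 0.35)]:
    row = [f"{spearman_brown(r_base, m):.2f}" for m in (1, 2, 3, 4)]
    print(f"{label:22s} at 1x, 2x, 3x, 4x length: {', '.join(row)}")
```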

  28. What encourages high item quality? • Write well • Clear, concise, accurate • Remove unnecessary information from the stimulus • Present nuanced choices that require a sophisticated mastery of material to correctly respond • Item review is another opportunity to seek multiple opinions

  29. What encourages high item quality? • Avoid formats known to be flawed • D. All of the above • D. None of the above • Negative wording • All of the following are true EXCEPT • Which of the following is not true?

  30. What encourages high item quality? • Apply quality improvement principles • Analyze item performance • Retain items that contribute to test score reliability • Change or discard items that fail to contribute or negatively affect reliability

  31. Item analysis properties • Difficulty • p = proportion of students who correctly responded • Discrimination • rpb = correlation between item success and students’ test scores
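
A minimal sketch of the two statistics defined above; the helper names and toy data are not from the presentation.

```python
# p = proportion correct; rpb = point-biserial correlation between
# item success (0/1) and each student's total test score.
from statistics import mean, pstdev

def item_difficulty(item_scores):
    return mean(item_scores)

def point_biserial(item_scores, total_scores):
    mx, my = mean(item_scores), mean(total_scores)
    sx, sy = pstdev(item_scores), pstdev(total_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    return cov / (sx * sy)

# Toy data: responses to one item and each examinee's total test score.
item = [1, 1, 0, 1, 0, 1, 0, 1]
totals = [42, 38, 25, 40, 30, 36, 22, 44]
print(f"p = {item_difficulty(item):.2f}, rpb = {point_biserial(item, totals):.2f}")
```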

  32. Item difficulty • [Chart: contribution to test score reliability plotted against item difficulty p, from 0.0 to 1.0, with reference marks at 0.4 and 0.6]

  33. Item discrimination • Because rpb values are correlations, values reflect one of three possibilities relative to reliability • Positive contribution • No contribution • Negative contribution

  34. Using item parameters diagnostically • Relative to reliability contribution, item • p values provide magnitude information • rpb values provide magnitude and direction (+ or -) information

  35. Using item parameters diagnostically • Difficulty and discrimination properties equally contribute to reliability • The best items show .30 < p < .70 AND rpb > .20 • The worst items exist at the difficulty extremes and show zero or negative discrimination

  36. After diagnosing an item that shows a weak or negative reliability contribution • What should we do? • Observe option response frequencies and mean scores • Identify incorrect responses that attracted students with test scores equal to or greater than the average • Replace the offending option with a less attractive response • Rewrite the stem to clarify ambiguities OR • Discard the whole item and use a better one the next time
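
A minimal sketch of the distractor check described above: option response frequencies and the mean total score of the students choosing each option, flagging incorrect options that attract above-average students. The response data and option letters are invented.

```python
from collections import defaultdict
from statistics import mean

# (chosen option, student's total test score) for one multiple-choice item; key = "B".
responses = [("B", 44), ("A", 40), ("B", 38), ("C", 41), ("B", 36),
             ("C", 39), ("D", 25), ("B", 42), ("A", 28), ("C", 43)]
KEY = "B"

by_option = defaultdict(list)
for option, score in responses:
    by_option[option].append(score)

overall_mean = mean(score for _, score in responses)
for option in sorted(by_option):
    scores = by_option[option]
    flag = ""
    # An incorrect option chosen by above-average students suggests ambiguity.
    if option != KEY and mean(scores) >= overall_mean:
        flag = "  <-- attracting strong students; revise the option or the stem"
    print(f"option {option}: n={len(scores)}, mean score={mean(scores):.1f}{flag}")
```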

  37. Item analysis software • Iteman $$$ • www.assess.com/Software/iteman.htm • examSystem II $$$ • www.pearsonncs.com/examsystem/index.htm • LXR*Test $$$ • www.lxrtest.com/ • True Score II $$ • www.nine-patch.com/TSCDL.htm • Excel Templates $Free • www.eflclub.com/elvin/publications/2003/itemanalysis.html

  38. Internal resources may be available • There is a good probability a large university with education, psychology, and/or statistics departments will have a system available for scoring items and providing analyses of test scores and items

  39. Reference each cut score to a criterion Should we define and assess minimal competence for our program?

  40. Cut points • Highly reliable test scores reveal differences between students’ abilities and can help accurately rank order students, which may be important to employers • However, the program is likely interested in assessing whether each student is sufficiently competent to safely and effectively practice • Such assessment concerns typically surface as students are about to graduate

  41. Measuring minimal competence • A program should decide whether it wants to create one large assessment with a single compensatory cut point OR • give each content domain its own cut point, a conjunctive model

  42. Why are there so many compensatory-cut competency assessments? • If a program selects the more rigorous conjunctive model, then each component test will produce its own set of scores, each with its own reliability • Each component must have a sufficient number of items or data points to be confident each student group’s test scores will show adequate reliability • Modules of fewer than 80-100 program-made items are unlikely to produce adequate reliability
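
A minimal sketch contrasting the two decision models from slides 41-42; the domain scores, cut points, and the simple averaging of domains into a composite are illustrative assumptions.

```python
# Compensatory: one cut on the overall score, so strength in one domain
# can offset weakness in another. Conjunctive: every domain must clear its own cut.
domain_scores = {"patient assessment": 78, "mechanical ventilation": 58, "pharmacology": 85}
domain_cuts   = {"patient assessment": 65, "mechanical ventilation": 65, "pharmacology": 65}
OVERALL_CUT = 70  # single cut applied to the composite score

overall = sum(domain_scores.values()) / len(domain_scores)
passes_compensatory = overall >= OVERALL_CUT
passes_conjunctive = all(domain_scores[d] >= domain_cuts[d] for d in domain_scores)

print(f"overall = {overall:.1f}: compensatory pass = {passes_compensatory}, "
      f"conjunctive pass = {passes_conjunctive}")
```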

  43. Seek multiple opinions . . . again • Program faculty should define skills competent practitioners possess • This is a group activity • Each cut point should be linked to a definition of minimally competent practitioners

  44. Performance assessments • Pick your spots • Ensure a sufficient quantity of information is collected • Standardize administration • Measure agreement between/among evaluators
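
A minimal sketch of measuring agreement between two evaluators on a pass/fail performance assessment; Cohen's kappa is one common agreement index but is not named on the slide, and the ratings below are invented.

```python
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # raw percent agreement

# Chance-expected agreement from each rater's marginal proportions.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(rater_a) | set(rater_b))

kappa = (observed - expected) / (1 - expected)
print(f"percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```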

  45. Summary • Collective opinions are closer to the truth about • appropriate assessment content, • item quality, and • justifiable cut scores than any one opinion • Unreliable scales have no utility

  46. Thank you for the opportunity to share some details about measurement Questions?
