
Designing Trustworthy & Reliable GME Evaluations

Conference Session: SES85

2011 ACGME Annual Education Conference

Nancy Piro, PhD, Program Manager/Ed Specialist

Alice Edler, MD, MPH, MA (Educ)

Ann Dohn, MA, DIO

Bardia Behravesh, EdD, Manager/Ed Specialist

Stanford Hospital & Clinics

Department of Graduate Medical Education

Overall Questions
  • What is assessment…an evaluation?
  • How are they different?
  • What are they used for?
  • Why do we evaluate?
  • How do we construct a useful evaluation?
  • What is cognitive bias?
  • How do we eliminate bias from our evaluations?
  • What is validity?
  • What is reliability?
Assessment - Evaluation: What’s the difference and what are they used for?

Assessment…is the analysis and use of data by residents or sub-specialty residents (trainees), faculty, program directors and/or departments to make decisions about improvements in teaching and learning. 

Assessment - Evaluation: What’s the difference and what are they used for?

Evaluation is the analysis and use of data by faculty to make judgments about trainee performance. Evaluation includes obtaining accurate, performance-based, empirical information which is used to make competency decisions on trainees across the six domains.

Evaluation Examples
  • Example 1: A trainee delivers an oral presentation at a Journal Club. The faculty member provides a critique of the delivery and content accompanied by a rating for the assignment.
  • Example 2: A program director provides a final evaluation to a resident accompanied by an attestation that the resident has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently.
Why do we assess and evaluate? (Besides the fact it is required…)
  • Demonstrate and improve trainee competence in core and related competency areas - Knowledge and application
  • Ensure our programs produce graduates, each of whom: “has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently.”
  • Track the impact of curriculum/organizational change
  • Gain feedback on program, curriculum and faculty effectiveness
  • Provide residents/fellows a means to communicate confidentially
  • Provide an early warning system
  • Identify gaps between competency based goals and individual performance
So What’s the Game Plan for Constructing Effective Evaluations?

Without a plan… evaluations can take on a life of their own!!

How do we construct a useful evaluation?

STEP 1. Create the Evaluation (Plan)

Curriculum (Competency) Goals, Objectives and Outcomes

Question and Scale Development

STEP 2. Deploy (Do)

Online / In-Person (Paper)

STEP 3. Analyze (Study / Check)

Reporting, Benchmarking and Statistical Analysis

Rank Order / Norms (Within the Institution or National)

STEP 4. Take Action (Act)

Develop & Implement Learning/Action Plans

Measure Progress Against Learning Goals

Adjust Learning/Action Plans

Question and Response Scale Construction

Two Basic Goals:

  • Construct unbiased, unconfounded, and non-leading questions that produce valid data
  • Design and use unbiased and valid response scales
What is cognitive bias…
  • Cognitive bias is distortion in the way we perceive reality or information.
  • Response bias is a particular type of cognitive bias which can affect the results of an evaluation if respondents answer questions in the way they think they are designed to be answered, or with a positive or negative bias toward the examinee.
Where does response bias occur?
  • Response bias most often occurs in the wording of the question.
    • Response bias is present when a question contains a leading phrase or words.
    • Response bias can also occur in rating scales.
  • Response bias can also be in the raters themselves
        • Halo Effect
        • Devil Effect
        • Similarity Effect
        • First Impressions
Step 1: Create the Evaluation - Question Construction
  • Example (1):
    • "I can always talk to my Program Director about residency related problems.”
  • Example (2):
    • “Sufficient career planning resources are available to me and my program director supports my professional aspirations.”
Question Construction
  • Example (3):
    • “Incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment.”
  • Example (4):
    • "Communication in my sub-specialty program is good."
Create the Evaluation - Question Construction
  • Example (5):
    • "The pace on our service is chaotic."
Exercise One
  • Review each question and share your thinking about what makes it a good or bad question.
Question Construction - Test Your Knowledge
  • Example 1: "I can always talk to my Program Director about residency related problems."
  • Problem: Terms such as "always" and "never" will bias the response in the opposite direction.
  • Result: Data will be skewed.
Question Construction - Test Your Knowledge
  • Example 2: “Career planning resources are available to me and my program director supports my professional aspirations."
  • Problem: Double-barreled: resources and aspirations… Respondents may agree with one and not the other. The researcher cannot make valid assumptions about which part of the question respondents were rating.
  • Result: Data is useless.
Question Construction - Test Your Knowledge
  • Example 3: "Communication in my sub-specialty program is good."
  • Problem: Question is too broad. If score is less than 100% positive, researcher/evaluator still does not know what aspect of communication needs improvement.
  • Result: Data is of little or no usefulness.
Question Construction - Test Your Knowledge
  • Example 4: “Evidences incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment.”
  • Problem: Septuple-barreled: respondents may agree with some parts and not others. The evaluator cannot make assumptions about which part of the question respondents were rating.
  • Result: Data is useless.
Question Construction - Test Your Knowledge
  • Example (5):
    • "The pace on our service is chaotic.“
  • Problem: The question is negative, and broadcasts a bad message about the rotation/program.
  • Result: Data will be skewed, and the climate may be negatively impacted.
Evaluation Question Design Principles

Avoid ‘double-barreled’ questions

  • A double-barreled question combines two or more issues or “attitudinal objects” in a single question.
Avoiding Double-Barreled Questions
  • Example: Patient Care Core Competency

“Resident provides sensitive support to patients with serious illness and to their families, and arranges for on-going support or preventive services if needed.”  

Minimal Progress    Progressing    Competent

Evaluation Question Design Principles
  • Combining two or more questions into one makes it unclear which attribute is being measured, as each question may elicit a different perception of the resident’s performance.
  • RESULT:
    • Respondents are confused and results are confounded, leading to unreliable or misleading results.
  • Tip: If the word “and” or the word “or” appears in a question, check to verify whether it is a double-barreled question.
Evaluation Question Design Principles
  • Avoid questions with double negatives…
  • When respondents are asked for their agreement with a negatively phrased statement, double negatives can occur.
    • Example: Do you agree or disagree with the following statement?
Evaluation Question Design Principles
  • “Attendings should not be required to supervise their residents during night call.”
  • If you respond that you disagree, you are saying you do not think attendings should not supervise residents. In other words, you believe that attendings should supervise residents.
  • If you do use a negative word like “not”, consider highlighting the word by underlining or bolding it to catch the respondent’s attention.
Evaluation Question Design Principles
  • Because every question is measuring something, it’s important for each to be clear and precise.
  • Remember…Your goal is for each respondent to interpret the meaning of each question in exactly the same way.
Evaluation Question Design Principles
  • If your respondents are not clear on what is being asked in a question, their responses may result in data that cannot or should not be applied to your evaluation results…
  • "For me, further development of my medical competence, it is important enough to take risks" – Does this mean to take risks with patient safety, risks to one's pride, or something else?
Evaluation Question Design Principles
  • Keep questions short. Long questions can be confusing.
  • Bottom line: Focus on short, concise, clearly written statements that get right to the point, producing actionable data that can inform individual learning plans (ILPs).
    • Take only seconds to respond to/rate
    • Easily interpreted.
Evaluation Question Design Principles
  • Do not use “loaded” or “leading” questions
  • A loaded or leading question biases the response given by the respondent. A loaded question is one that contains loaded words.
    • For example: “I’m concerned about doing a procedure if my performance would reveal that I had low ability”

Disagree Agree

Evaluation Question Design Principles

"I’m concerned about doing a procedure on my unit if my performance would reveal that I had low ability"

  • How can this be answered with “agree or disagree” if you think you have good abilities in appropriate tasks for your area?
Evaluation Question Design Principles
  • A leading question is phrased in such a way that suggests to the respondent that a certain answer is expected:
    • Example: Don’t you agree that nurses should show more respect to residents and attendings?
      • Yes, they should show more respect
      • No, they should not show more respect
Evaluation Question Design Principles
  • Use of Open-Ended Questions
  • Comment boxes after negative ratings
    • To explain the reasoning and target areas for focus and improvement
  • General, open-ended questions at the end of the evaluation.
    • Can prove beneficial
    • Responses often reveal entire topics that should have been included in the evaluation but were omitted.
Evaluation Question Design Principles – Exercise 2 “Post Test”

1. Please rate the general surgery resident’s communication and technical skills

2. Rate the resident’s ability to communicate with patients and their families

3. Rate the resident’s abilities with respect to case familiarization; effort in reading about patient’s disease process and familiarizing with operative care and post op care

4. Residents deserve higher pay for all the hours they put in, don’t they?

Evaluation Question Design Principles – Exercise 2 “Post Test”

5. Explains and performs steps in resuscitation and stabilization

6. Do you agree or disagree that residents shouldn’t have to pay for their meals when on-call?

7. Demonstrates an awareness of and responsiveness to the larger context of health care

8. Demonstrates ability to communicate with faculty and staff

Evaluation Design Principles: Rating Scales
  • By far the most popular scale asks respondents to rate their agreement with the evaluation questions or statements – “stems”.
  • After you decide what you want respondents to rate (competence, agreement, etc.), you need to decide how many levels of rating you want them to be able to make.
Evaluation Design Principles: Rating Scales
  • Using too few points gives less precise, less differentiated information, while using too many can make the question hard to read and answer (do you really need a 9- or 10-point scale?)
  • Determine how fine a distinction you want to be able to make between agreement and disagreement.
Evaluation Design Principles: Rating Scales
  • Psychological research has shown that a 6-point scale with three levels of agreement and three levels of disagreement works best. An example would be:
          • Disagree Strongly
          • Disagree Moderately
          • Disagree Slightly
          • Agree Slightly
          • Agree Moderately
          • Agree Strongly
Evaluation Design Principles: Rating Scales
  • This scale affords you ample flexibility for data analysis.
  • Depending on the questions, other scales may be appropriate, but the important thing to remember is that it must be balanced, or you will build in a biasing factor.
  • Avoid “neutral” and “neither agree nor disagree” options…you’re just giving up 20% of your evaluation ‘real estate’ (a quick sketch of coding a balanced scale follows)
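Added for illustration only (the slides contain no code): a minimal sketch, assuming responses are coded numerically for analysis, of the balanced 6-point scale above and a quick check that positive and negative options are in balance. The label-to-number coding is an assumption, not a standard.

```python
# Minimal sketch: coding a balanced 6-point agreement scale numerically and
# checking that a scale offers as many negative options as positive ones.
BALANCED_6_POINT = {
    "Disagree Strongly": -3,
    "Disagree Moderately": -2,
    "Disagree Slightly": -1,
    "Agree Slightly": 1,
    "Agree Moderately": 2,
    "Agree Strongly": 3,
}

# Hypothetical coding of the common Poor-to-Excellent scale discussed later.
UNBALANCED_5_POINT = {"Poor": -1, "Fair": 1, "Good": 2, "Very Good": 3, "Excellent": 4}

def is_balanced(scale: dict) -> bool:
    """A scale is balanced when it has equal numbers of negative and positive options."""
    negatives = sum(1 for v in scale.values() if v < 0)
    positives = sum(1 for v in scale.values() if v > 0)
    return negatives == positives

print(is_balanced(BALANCED_6_POINT))    # True
print(is_balanced(UNBALANCED_5_POINT))  # False: four positive options vs. one negative
```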
Evaluation Design Principles: Rating Scales

1. Please rate the volume and variety of patients available to the program for educational purposes.

Poor Fair Good Very Good Excellent

2. Please rate the performance of your faculty members. 

Poor Fair Good Very Good Excellent

3. Please rate the competence and knowledge in general medicine.

Poor Fair Good Very Good Excellent

Evaluation Design Principles: Rating Scales

The data will be artificially skewed in the positive direction using this scale because there are far more (4:1) positive than negative rating options….Yet we see this scale being used all the time!

Gentle Words of Wisdom….

Avoid large numbers of questions….

  • Respondent fatigue – the respondent tends to give similar ratings to all items without giving much thought to individual items, just wanting to finish
  • In situations where many items are considered important, a large number can receive very similar ratings at the top end of the scale
  • Items are not traded off against one another; therefore many items that are not at the extreme ends of the scale, or that are considered similarly important, are given a similar rating
Gentle Words of Wisdom….

Avoid large numbers of questions….but ensure your evaluation is both valid and has enough questions to be reliable….

How many questions (raters) are enough?

Not intuitive

A little bit of math is necessary (sorry)

True Score = Observed Score ± Error Score
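To unpack the slide’s formula, here is the standard classical test theory notation (added for clarity; not taken from the slides):

```latex
% An observed score X is the unobservable true score T plus an error term E:
X = T + E \qquad\Longleftrightarrow\qquad T = X - E
% Reliability is the proportion of observed-score variance that is true-score variance:
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
% Averaging k independent raters (or items) shrinks the error variance,
% which is why the number of questions and raters matters:
\operatorname{Var}(\bar{E}) = \frac{\sigma_E^2}{k}
```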

Why are we talking about reliability in a question-writing session?

To create your own evaluation questions and ensure their reliability

To share/use other evaluations that are assuredly reliable

To read the evaluation literature

Reliability
  • Reliability is the "consistency" or "repeatability" of your measures.
  • If you could create one perfect test question (unbiased and perfectly representative of the task), you would need only that one question
  • OR if you could find one perfect rater (unbiased and fully understanding the task), you would need only one rater
Reliability Estimates
  • Test designers use four correlational methods to check the reliability of an evaluation:
    • the test-retest method (pre-test / post-test),
    • alternate forms,
    • internal consistency (e.g., Cronbach’s alpha; see the sketch below),
    • and inter-rater reliability.
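For illustration (not from the original slides), a minimal sketch of one internal-consistency estimate, Cronbach’s alpha, computed from a small matrix of item ratings; the ratings are invented.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 respondents x 4 items on a 1-6 agreement scale.
ratings = np.array([
    [5, 6, 5, 6],
    [4, 4, 5, 4],
    [2, 3, 2, 3],
    [5, 5, 6, 5],
    [3, 3, 4, 3],
])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```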
Generalizability

A measure based on score variances

Generalizability Theory

Problems with Correlation Methods
  • Based on comparing portions of a test to one another (split-half, coefficient α, ICC)
    • Assumes that all portions are strictly parallel (measuring the same skill, knowledge, attitude)
  • Test-Retest assumes no learning has occurred in the interim.
  • Inter-rater reliability only provides consistency of raters across an instrument of evaluation

UNLIKE A MATH TEST, ALL CLINICAL SITUATIONS ARE NOT PARALLEL…

Methods Based on Score Variance
  • Generalizability Theory
    • Based in analysis of variance (ANOVA)
    • Can parse out the differences in the sources of error
      • For example, capture the essence of differing clinical situations
Generalizability Studies
  • Two types:
    • G study
      • The ANOVA is based on the actual # of facets (factors) that you put into the equation
      • Produces a G coefficient (similar to r or α); see the sketch below
    • D study
      • Allows you to extrapolate to other testing formats
      • Produces a D coefficient
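A minimal sketch (added for illustration; the variance-component values are invented and a one-facet person-by-rater design is assumed) of how a G coefficient is formed and how a D study projects it for different numbers of raters:

```python
def g_coefficient(var_person: float, var_residual: float, n_raters: int) -> float:
    """Relative G coefficient for a one-facet (person x rater) design."""
    return var_person / (var_person + var_residual / n_raters)

# Hypothetical variance components estimated from a G study's ANOVA.
var_person = 0.60      # true differences between trainees
var_residual = 0.90    # rater-by-trainee interaction plus unexplained error

# D study: project reliability as the number of raters per trainee changes.
for n in (1, 2, 4, 8):
    print(f"{n} rater(s): G = {g_coefficient(var_person, var_residual, n):.2f}")
```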
What can we do about this problem?

Train the raters

Increase the # of raters

Would increasing the # of test items help?
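One way to reason about that last question (added here for illustration, not from the slides) is the Spearman–Brown prophecy formula, which projects reliability when a test is lengthened k-fold, assuming the added items are parallel to the existing ones:

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Projected reliability when the test is lengthened k-fold (parallel items assumed)."""
    return (k * reliability) / (1 + (k - 1) * reliability)

current = 0.55                      # hypothetical reliability of the current form
for k in (2, 3, 4):
    print(f"{k}x items: projected reliability = {spearman_brown(current, k):.2f}")
```

The catch is the assumption of parallel items; as the earlier slide warns, clinical situations are rarely strictly parallel, so adding items does not automatically buy reliability.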

Reliability Goals

All reliability coefficients can be interpreted against roughly the following benchmarks:

< 0.50 poor

0.50 – 0.70 moderate

0.70 – 0.90 good

> 0.90 excellent

Interrater Reliability (Kappa)

IRR is not really a measure of test reliability; rather, it is a property of the raters

It does not tell us anything about the inherent variability within the questions themselves

Rather, it reflects

The quality of the raters

Or the misalignment of one rater/examinee dyad
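Added as an illustration only (the ratings below are invented): a minimal sketch of Cohen’s kappa for two raters classifying the same residents, which is agreement corrected for chance and therefore speaks to the raters rather than to the instrument.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement between two raters corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of ten residents by two faculty raters.
a = ["competent", "competent", "progressing", "competent", "minimal",
     "progressing", "competent", "competent", "progressing", "competent"]
b = ["competent", "progressing", "progressing", "competent", "minimal",
     "progressing", "competent", "competent", "competent", "competent"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```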

Reliability

Evaluation Reliability (consistency) is an essential but not sufficient requirement for validity

Validity
  • Validity is a property of evaluation scores. Valid evaluation scores are ones with which accurate inferences can be made about the examinee’s performance.
  • The inferences can be in the areas of:
    • Content knowledge
    • Performance ability
    • Attitudes, behaviors and attributes
Three types of test score validity
  • 1. Content
    • Inferences from the scores can be generalized to a larger domain of items similar to those on the test itself
      • Example (content validity): board scores
  • 2. Criterion
    • Score inferences can be generalized to performance on some real behavior (present or anticipated) of practical importance
      • Example
        • Present behavioral generalization (concurrent validity ): OSCE
        • Future behavioral generalization (predictive validity): MCAT
Validity
  • 3. Construct
    • Score inferences have “no criterion or universe of content accepted as entirely adequate to define the quality to be measured” (Cronbach and Meehl, 1955), but the inferences can be drawn under the label of a particular psychological construct
      • Example : professionalism
Example Question:

Does not demonstrate extremes of behavior

Communicates well

Uses lay terms when discussing issues

Is seen as a role model

Introduces oneself and role in the care team

Skillfully manages difficult patient situations  

Sits down to talk with patients

Process of Validation
  • Define the intended purposes/use of inferences to be made from the evaluation
  • Five Arguments for Validity (Messick, 1995)
    • Content
    • Substance
    • Structure
    • GENERALIZABILITY
    • Consequence
Generalizability
  • Inferences from this performance task can be extended to like tasks
    • Task must be representative (not just simple to measure)
    • Should represent the domain as fully as practically possible
      • Example: Multiple Mini Interview (MMI)
Why are validity statements critical now?

Performance evaluation is now at the crux of credentialing and certification decisions.

We are asked to measure constructs…not just content knowledge and performance abilities.

Gentle Words of Wisdom: Begin with the End in Mind
  • What do you want as your outcomes? What is the purpose of your evaluation?
  • Be prepared to put in the time with pretesting for reliability and understandability
  • The faculty member, nurse, patient, resident has to be able to understand the intent of the question - and each must find it credible and interpret it in the same way
  • Adding more items to the test may not always be the answer to increased reliability
Gentle Words of Wisdom Continued…
  • Relevancy and Accuracy –

If the questions aren’t framed properly, if they are too vague or too specific, it’s impossible to get any meaningful data.

    • Question miswording can lead to skewed data with little or no usefulness.
    • Ensure your response scales are balanced and appropriate.
    • If you don't plan or know how you are going to use the data, don't ask the question!
Gentle Words of Wisdom Continued…
  • Use an appropriate number of questions based on your evaluation's purpose.
  • If you are using aggregated data, the statistical analyses must be appropriate for your evaluation; otherwise the numbers generated, however sophisticated and impressive they look, will be false and misleading.
  • Are differences really significant given your sample size? (A small sketch follows.)
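A minimal sketch of that sample-size question (the ratings are invented and the use of scipy is an assumption, not something prescribed by the slides): before trusting a difference between two small groups of aggregated ratings, test whether it is more than noise.

```python
from scipy import stats

# Hypothetical mean faculty ratings for two rotations, six residents each (1-6 scale).
rotation_a = [4.2, 4.8, 5.1, 4.5, 4.9, 4.4]
rotation_b = [4.6, 5.0, 5.3, 4.8, 5.2, 4.7]

# Independent-samples t-test: is the apparent difference significant
# given only six observations per group?
result = stats.ttest_ind(rotation_a, rotation_b)
print(f"mean A = {sum(rotation_a)/len(rotation_a):.2f}, "
      f"mean B = {sum(rotation_b)/len(rotation_b):.2f}, p = {result.pvalue:.3f}")
```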
Summary: Evaluation Do’s and Don’ts

DO’s

  • Keep Questions Clear, Precise and Relatively Short.
  • Use a balanced response scale
    • (4-6 scale points recommended)
  • Use open-ended questions
  • Use an appropriate number of questions

DON’Ts

  • Do not use Double- (or Multi-) Barreled Questions
  • Do not use Double Negative Questions
  • Do not use Loaded or Leading Questions.
  • Don’t assume there is no need for rater training