Measurement Issues Inherent in Educator Evaluation Michigan School Testing Conference Workshop C February 21, 2012
Presenters & Developers* • Bruce Fay, Consultant, Wayne RESA • Ed Roeber, Professor, Michigan State University • Based on presentations previously developed by • Jim Gullen, Consultant, Oakland Schools • Ed Roeber, Professor, Michigan State University • Affiliated with the * We are Ph.D.s, not J.D.s. We are not lawyers and have not even played lawyers on TV. Nothing in this presentation should be construed as legal, financial, medical, or marital advice. Please be sure to consult your legal counsel, tax accountant, minister or rabbi, and/or doctor before beginning any exercise or aspirin regimen. The use of the information in this presentation is subject to political whim. This presentation may cause drowsiness or insomnia…depending on your point of view. Any likeness to characters, real or fictional, is purely coincidental.
Before we even begin... Educator Evaluation is still very fluid in Michigan. This workshop will try to establish basic measurement concepts and potential issues related to the evaluation of educators regardless of what legal requirements may ultimately be imposed. The systems of educator evaluation that eventually get implemented in Michigan may not be consistent with the information presented here.
Housekeeping • Cell phones on silent • Please take calls out into the lobby • We will take a few breaks (long workshop), but... • Please take care of personal needs as needed • Restrooms in several locations • Let's get questions/comments out as they come up • Deal with them at that time if on point • Defer to later if we plan to cover it later • Parking lot – hold to end, if time • The fact that something is a measurement issue does not mean there is an answer/solution presently at hand • We don't know everything, so we may not have an answer or even a good suggestion, so please be kind
Workshop Outline • Introduction / Framing • Purpose / Components • Measuring Educator Practice • Measuring Student Achievement • Evaluating Educators – Putting it ALL Together • Reporting & Use of Educator Evaluations • Wrap Up
Workshop Outline • Introduction / Framing • What are we talking about today? • Purpose / Components • Measuring Educator Practice • Measuring Student Achievement • Evaluating Educators – Putting it ALL Together • Reporting & Use of Educator Evaluations • Wrap Up
Why are we here? • Not just an existential question! • We have legislation that requires performance evaluation systems that are... • Rigorous, transparent, and fair • Based on multiple rating categories • Based in part on student growth, as determined by multiple measures of student learning, including national, state, or local assessments or other objective criteria as a “significant” factor • We have a Governor’s Council that will... • Make specific recommendations to the Governor and Legislature regarding this by April 30, 2012
Things we’re thinking about today • What does it mean to evaluate something? • What is the purpose of an educator evaluation system? • What components are needed? • How can components be combined? • What role does measurement play? • What do we mean by “measurement issues”? • Are there different measurement issues associated with different purposes? Roles? Stakes? • Are there other non-measurement technical issues?
Things we’re thinking about today • What do we know (or believe to be true) about the degree to which educator practice determines student results? • What do we know (or believe to be true) about the degree to which student results can be attributed to educators? • What types of student achievement metrics could/should be used? • Is it only about academics, or do other things matter? • What’s “growth” and how do we measure it?
Things we’re thinking about today • What does it mean for the system to be reliable, fair (unbiased), and able to support valid decisions? • What could impact our systems and threaten reliability, fairness, and/or validity? • Some of the things we are thinking about are clearly NOT measurement issues, so we may not deal with them today, but...
What Are YOU Thinking About? • What issues are you concerned about? • What questions do you have coming in? • What are you expecting to take away from this workshop?
Workshop Outline • Introduction / Framing • Purpose / Components • Why are we talking about this? • The system has to consist of something • Measuring Educator Practice • Measuring Student Achievement • Evaluating Educators – Putting it ALL Together • Reporting & Use of Educator Evaluations • Wrap Up
The Big Three • Quality Assurance / Accountability • Performance-based Rewards • Continuous Improvement • Some of these are higher stakes than others! • High stakes systems need to be rigorous in order to be defensible • Rigorous systems are more difficult, time consuming, and costly to implement
Rigor vs. Purpose (quadrant chart) • Vertical axis: Rigor, from Low (not defensible) to High (defensible) • Horizontal axis: Stakes, from Low (not consequential) to High (consequential) • QA and Rewards fall toward the high-rigor, high-stakes region; CI falls toward the lower-rigor, lower-stakes region
Teacher Examples (quadrant chart) • High rigor, high stakes: National Board Certification, Praxis III • High rigor, low stakes: structured mentoring programs, e.g., New Teacher Center • Low rigor, low stakes: informal mentoring programs • Low rigor, high stakes: traditional evaluation systems – DANGER!
Quality Assurance / Accountability (very high stakes) • What: • Assure all personnel competently perform their job function(s) • A minimum acceptable standard • The ability to accurately identify and remove those who do not meet this standard • Why: • Compliance with legal requirements • Fulfill a public trust obligation • Belief that sanction-based systems (stick) “motivate” people to fix deficient attitudes and behaviors
Performance-based Rewards (moderate stakes) • What: • Determine which personnel (if any) deserve some form of reward for performance / results that are: • Distinguished • Above average • The ability to accurately distinguish performance and correctly tie it to results • Why: • Belief that incentive-based systems (carrot) “motivate” people to strive for better results (change attitudes and behaviors) • Assumes that these changes will be fundamentally sound, rather than “gaming” the system to get the reward
Continuous Improvement (lower stakes, but not less important) • What: • Improvement in personal educator practice • Improvement in collective educator practice • Professional learning • Constructive / actionable feedback • Self-reflection • Why: • Belief that quality should be strived for but is never attained • Status quo is not an option; things are improving or declining • It’s what professionals do • Can’t “fire our way to quality” • Clients deserve it
The Nature of Professional Learning • Trust • Self-assessment • Reflection on practice • Professional conversation • A community of learners
Can One Comprehensive System Serve Many Purposes? • Measurement & evaluation issues (solutions) may be different for each: • Purpose • Role (Teachers, Administrators, Other Staff) • Nationally, our industry has not done particularly well at designing/implementing systems for any of these purposes (or roles) • We have to try, but...very challenging task ahead! • We won’t get it right the first time; the systems will need continuous improvement
A Technical Concept (analogy) • When testing statistical hypotheses, one has to decide, a priori, what probability of reaching a particular type of incorrect conclusion is acceptable. • Hypotheses are normally stated in “null” form, i.e., that there was no effect as a result of the treatment • Rejecting a “true” null hypothesis is a “Type I Error,” and this is the probability that is usually set a priori. • Failing to reject a false null hypothesis is a “Type II Error.” • The probability of correctly detecting a treatment effect is known as “Power.”
A Picture of Type I & II Error and Power (a 2 × 2 table: rows = Decision, columns = Null Hypothesis) • Reject / False null: Correct decision – the treatment had a statistically significant effect; the ability to do this is known as Power • Reject / True null: Type I Error • Accept / True null: Correct decision – the treatment did NOT have a statistically significant effect • Accept / False null: Type II Error
These Things Are Related • Obviously we would prefer not to make mistakes, but...the lower the probability of making a Type I Error: • The higher the probability of making a Type II Error • The lower the Power of the test
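The tradeoff just described can be made concrete with a short sketch, using only the Python standard library. For a one-sided z-test, power depends on the significance level (alpha), the standardized effect size, and the sample size; the effect size and sample size below are illustrative assumptions, not values from the presentation:

```python
from statistics import NormalDist

def power_one_sided_z(alpha, effect_size, n):
    """Power of a one-sided z-test: P(reject H0 | a real effect of the
    given standardized size exists), for a sample of size n."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)              # rejection cutoff under H0
    return 1 - nd.cdf(z_crit - effect_size * n ** 0.5)

# Lowering alpha (Type I error risk) lowers power (raises Type II risk)
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  power={power_one_sided_z(alpha, 0.3, 50):.3f}")
```

Running the loop shows power shrinking as alpha is tightened, which is exactly the tension above: guarding harder against Type I errors makes Type II errors more likely.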
An Example From Medicine • When testing drugs, we want a low Type I Error (we do not want to decide that a drug is effective when in fact it is not) • When testing patients, however, we want high power and minimum Type II Error (we want to make sure we detect disease when it is actually present) • The price we pay in medicine is a willingness to tell a patient that they are sick when they are not (Type I Error) • However, since treatments can have serious side effects, our safeguard is to use multiple tests, get multiple expert opinions, and to continue to re-check the patient once treatment begins
Do These Concepts Have Something To Do With Educator Evaluation? • We are not aware of anyone contemplating using statistical testing to do educator evaluation at this time, but • Yes, conceptually these ideas are relevant and have technical analogs in educator evaluation
Application to Educator Evaluation • In our context, a reasonable null hypothesis might be that an educator is presumed competent (innocent until proven guilty) • Deciding that a competent educator is incompetent would be a: • Type I error (by analogy) • Very serious / consequential mistake • If we design our system to guard against (minimize) this type of mistake, it may lack the power to accomplish other purposes without error
Possible Components of an Educator Evaluation System Conceptual, Legal, & Technical
Conceptually Required Components • The measurement of practice (what an educator does) based on a definition of practice that is clear, observable, commonly accepted, and supported by transparent measurement methods / instruments that are technically sound and validated against desired outcomes
Conceptually Required Components • The measurement of student outcomes based on a definition of desired student outcomes that is clear, commonly accepted, and supported by transparent measurement methods / instruments that are technically sound and validated for that use
Conceptually Required Components • A clear and commonly accepted method for combining the two preceding components (the measures of practice and of student outcomes) to make accurate, fair, and defensible high-stakes evaluative decisions that: • Is technically sound • Has an appropriate role for professional judgment • Includes a fair review process • Provides specific / actionable feedback and guidance • Affords a reasonable chance to improve, with appropriate supports
Legally Required Components – RSC 380.1249(2)(c) and (ii) ... • Annual (year-end) Evaluation • Classroom Observations (for teachers) • A review of lesson plans and the state curriculum standard being used in the lesson • A review of pupil engagement in the lesson • Growth in Student Academic Achievement • The use of multiple measures • Consideration of how well administrators do evaluations as part of their evaluation
Additional Legal Requirements • The Governor’s Council on Educator Effectiveness shall submit by April 30, 2012 • A student growth and assessment tool RSC 380.1249(5)(a) • That is a value-added model RSC 380.1249(5)(a)(i) • Has at least a pre- and post-test RSC 380.1249(5)(a)(iv) • A process for evaluating and approving local evaluation tools for teachers and administrators RSC 380.1249(5)(f) • There are serious / difficult technical issues implicit in this language
Yet More Legislation – RSC 380.1249(2)(a)(ii) ... • If there are student growth and assessment data (SGaAD) available for a teacher for at least 3 school years, the annual year-end evaluation shall be based on the student growth and assessment data for the most recent 3-consecutive-school-year period. • If there are not SGaAD available for a teacher for at least 3 school years, the annual year-end evaluation shall be based on all SGaAD that are available for the teacher.
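The data-selection logic this language implies can be sketched as follows. The function name and data shape are hypothetical, and the statute leaves edge cases (such as gaps between available years) unresolved, so this is only one plausible reading:

```python
def growth_data_for_evaluation(yearly_growth):
    """Sketch of the RSC 380.1249(2)(a)(ii) rule: use the most recent
    3-consecutive-school-year window of student growth and assessment
    data (SGaAD) if one exists; otherwise fall back to all available data.
    yearly_growth: dict mapping school year (e.g., 2011) to growth data.
    """
    years = sorted(yearly_growth)
    # scan from the most recent window backward
    for i in range(len(years) - 3, -1, -1):
        window = years[i:i + 3]
        if window[2] - window[0] == 2:              # three consecutive years
            return {y: yearly_growth[y] for y in window}
    return dict(yearly_growth)                      # no 3-consecutive-year window
```

Note the technical issue hiding here: whether years with a gap (e.g., a teacher on leave) count, and what "available" means, are exactly the kinds of implicit difficulties the previous slide flags.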
The Ultimate Legal Requirement • Mandatory dismissal of educators who have too many consecutive ineffective ratings
Required Technical Properties • In general, to be legally and ethically defensible, the system must have these technical properties embedded throughout: • Reliable – internally and externally consistent • Fair & Unbiased – objectively based on data • Validated – capable of consistently making correct (accurate) conclusions about adult performance that lead to correct (accurate) decisions/actions about educators
Technical Components – Practice • Clear operational definition of practice (teaching, principalship, etc.) with Performance Levels • Professional development for educators and evaluators to ensure a thorough and common understanding of all aspects of the evaluation system • Educators, evaluators, mentors / coaches trained together (shared understanding) • Assessments to establish adequate depth and commonality of understanding (thorough and common)
Technical Components – Practice • Validated instruments and procedures that provide consistently accurate, unbiased, defensible evidence of practice from multiple sources, with ongoing evidence of high inter-rater reliability, including: • Trained/certified evaluators • Periodic calibration of evaluators to ensure consistent evidence collection between evaluators and over time (overlapping, independent observation and analysis of artifacts) • Adequate sampling of practice to ensure that evidence is representative of actual practice (may be the big measurement challenge with respect to practice) • Methods for summarizing practice evidence • Tools (software, databases) to support the work
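One widely used statistic for the inter-rater reliability mentioned above is Cohen's kappa, which corrects raw percent agreement for the agreement two raters would reach by chance. A minimal standard-library sketch (the two rating lists are assumed to be the same lessons scored independently by two calibrated evaluators):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    1.0 = perfect agreement, 0.0 = chance-level, negative = worse than chance."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # chance agreement: product of each rater's marginal category rates
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

Periodic calibration would then amount to checking that kappa between overlapping observers stays acceptably high over time.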
Technical Components – Outcomes • Multiple validated measures of student academic achievement need to be: • Aligned to curriculum / learning targets • Common (where possible / appropriate) • Standardized (admin and scoring) where possible • Adequate samples of what students know and can do • Capable of measuring “growth” • Instructionally sensitive • Able to support attribution to teacher practice • Methods for summarizing the measurement of student academic achievement and attributing it to educators • Tools (software, databases) to support the work
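As a deliberately naive illustration of "measuring growth," the simplest model is a gain score summarized at the class level. It assumes the pre- and post-tests sit on a common vertical scale, which is itself a hard measurement requirement; real growth models (e.g., value-added) are far more involved:

```python
from statistics import median

def class_growth_summary(pre_post_scores):
    """Summarize 'growth' for a class as the median pre-to-post gain.
    pre_post_scores: list of (pre, post) score pairs. Gains are only
    meaningful if both tests are on a common vertical scale."""
    gains = [post - pre for pre, post in pre_post_scores]
    return median(gains)
```

Even this toy version surfaces the attribution problem: a median gain says nothing by itself about how much of that gain is due to the teacher.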
Technical Components – Evaluation • Methods for combining practice and outcome evidence • Careful use of formulas and compensatory systems, if used at all (great caution needed here) • Opportunity for meaningful self-evaluation and input by evaluee • Process for making initial summative overall judgment • Opportunity for review of summative judgment by evaluee • Reasonable appeal process
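The caution about formulas and compensatory systems can be illustrated with a hypothetical sketch. A purely compensatory weighted average lets a high score on one component mask a very low score on the other, so some designs add a non-compensatory floor. The weights, the 1–4 scale, and the cap below are all illustrative assumptions, not anything proposed in Michigan:

```python
def overall_rating(practice, growth, w_practice=0.6, w_growth=0.4, floor=1):
    """Compensatory weighted average with a non-compensatory floor:
    a very low score on either component caps the overall rating, so
    strength in one area cannot fully mask weakness in the other.
    Inputs are on a hypothetical 1-4 scale; weights are illustrative."""
    combined = w_practice * practice + w_growth * growth
    if min(practice, growth) <= floor:
        combined = min(combined, 2.0)   # cap at a 'minimally effective' level
    return round(combined, 2)
```

Without the floor, a teacher rated 4 on practice and 1 on growth would average to 2.8 and could look acceptable; whether that is the right outcome is a policy judgment the formula alone cannot settle.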
Evaluation vs. Measurement • Measurement = accurate data, often quantitative, to be used as evidence. A science and an art. • Evaluation = value judgment (hopefully accurate) • What is the (proper) role of measurement within an evaluation system? • Appropriate, sufficient, reliable, and unbiased data to inform evaluative decisions • What else does an evaluation system need? • Performance standards for competent practice and acceptable outcomes • Methods for combining multiple sources of evidence and making meaning of them • Methods for making overall judgments about educator effectiveness based on the above
“Effectiveness” • Effectiveness (ineffectiveness) and Competence (incompetence) are not (automatically) the same thing • Effectiveness implies results, in this case relating to the (academic) achievement of specific sets of students • Educators can be (and often are) effective in one setting but not in another • Competence is something we tend to think of as a somewhat fixed set of attributes of an individual at any point in time, regardless of setting • Effectiveness is definitely contextual – it is not a fixed attribute of a person
Questions & Comments ...and perhaps a short break
Workshop Outline • Introduction / Framing • Purpose / Components • Measuring Educator Practice • ...on the assumption that what educators do makes a difference in student outcomes • Measuring Student Achievement • Evaluating Educators – Putting it ALL Together • Reporting & Use of Educator Evaluations • Wrap Up
Caution! “After 30 years of doing such work, I have concluded that classroom teaching … is perhaps the most complex, most challenging, and most demanding, subtle, nuanced, and frightening activity that our species has ever invented. … The only time a physician could possibly encounter a situation of comparable complexity would be in the emergency room of a hospital during or after a natural disaster.” Lee Shulman, The Wisdom of Practice
Double Caution!! • The work of effectively leading, supervising, and evaluating the work of classroom teachers (principalship) must also be some of the most “complex, challenging, demanding, subtle, nuanced, and frightening” work on the planet. • Humility rather than hubris seems appropriate here. While adults certainly do not have the right to harm children, do we have the right to harm adults under the pretense of looking out for children?
The Nature of Evidence • Evidence is not just someone’s opinion • It is... • Factual (accurate & unbiased) • Descriptive (non-judgmental, non-evaluative) • Relevant • Representative • Interpreting evidence is separate from collecting it, and needs to occur later in the evaluation process
Cognitive Load and Grain Size • There is only so much a person can attend to • The ability to collect good evidence, especially from observation, requires frameworks of practice with associated rubrics, protocols, and tools that... • Have an appropriate “grain size” • Don’t have too many performance levels • Are realistic in terms of cognitive load for the observer • Allow/support quick but accurate sampling of specific targets
Classification Accuracy • Intuitively, the more categories into which something can be classified, the more accurately it should be classifiable. In practice, the opposite is usually true • The more categories (performance levels) there are, the more difficult it is to: • Write descriptions that are unambiguously distinguishable • Internalize the differences • Keep them clearly in mind while observing/rating • The result is that classification becomes less accurate and less reliable as the number of categories increases
Classification Accuracy • Classification will be most accurate and reliable when there are only two choices, e.g., • “satisfactory”/“unsatisfactory” (Effective/Ineffective) • However, the law says we need more levels, and we tend to want to separate the following anyway: • Proficient from Not Proficient • Proficient from Advanced • Basic (needs improvement) from Deficient • Trained, conscientious observers/raters can reliably distinguish four levels of performance on good rubrics
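The claim that accuracy falls as categories multiply can be demonstrated with a Monte Carlo sketch. The model of rater error here (true performance plus Gaussian noise, binned into equal-width categories) is a simplifying assumption, but the qualitative pattern holds for any noisy rating process:

```python
import random

def classification_accuracy(n_levels, noise_sd=0.5, trials=20000, seed=1):
    """Monte Carlo sketch: a rater observes true performance plus noise,
    then bins it into n_levels equal-width categories on [0, 1].
    Returns the fraction of ratings landing in the correct category."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        true = rng.random()
        observed = true + rng.gauss(0, noise_sd)
        bin_true = min(int(true * n_levels), n_levels - 1)
        bin_obs = min(max(int(observed * n_levels), 0), n_levels - 1)
        correct += bin_true == bin_obs
    return correct / trials

for k in (2, 4, 7, 10):
    print(f"{k} levels: accuracy = {classification_accuracy(k):.3f}")
```

With the rater's noise held fixed, accuracy drops steadily from 2 levels to 10, echoing the slide's point that four well-defined levels is about the practical limit for trained observers.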