Measuring teacher and principal effectiveness

Laura Goe, Ph.D. • Research Scientist, ETS, and Principal Investigator for the National Comprehensive Center for Teacher Quality • Workshop Presentation to Nebraska Leadership Committee • Lincoln, NE • April 19, 2012

Laura Goe, Ph.D. • Former teacher in rural & urban schools • Special education (7th & 8th grade, Tunica, MS) • Language arts (7th grade, Memphis, TN) • Graduate of UC Berkeley’s Policy, Organizations, Measurement & Evaluation doctoral program • Principal Investigator for the National Comprehensive Center for Teacher Quality • Research Scientist in the Performance Research Group at ETS

The National Comprehensive Center for Teacher Quality • A federally-funded partnership whose mission is to help states carry out the teacher quality mandates of ESEA • Vanderbilt University • Learning Point Associates, an affiliate of American Institutes for Research • Educational Testing Service

Today’s presentation available online • To download a copy of this presentation go to www.lauragoe.com • Go to Publications and Presentations page • Today’s presentation is at the bottom of the page

To be discussed… • A new era in teacher and principal evaluation • An aligned system of teacher and principal evaluation • Developing a shared vocabulary • Components of teacher and principal evaluation systems • Student, parent, and staff feedback measures • Professional responsibility measures and other valued actions • Weighting components of the evaluation model • Frontier and rural school models • Professional growth opportunities aligned with evaluation results • Merit pay and teacher retention • Teacher preparation programs • Principal evaluation standards and instruments • Moving forward: next steps

The goal of teacher evaluation

Trends in teacher evaluation • The policy imperative to change teacher evaluation has outstripped the research • Though we don’t yet know which model and combination of measures will identify effective teachers, many states and districts feel compelled to move forward at a rapid pace • Inclusion of student achievement growth data represents an important “culture shift” in evaluation • Communication and teacher/administrator participation and buy-in are crucial to ensure change • The implementation challenges are considerable • Few models exist for states and districts to adopt or adapt • Many districts have limited capacity to implement comprehensive systems, and states have limited resources to help them

It’s an equity issue • Value-added research shows that teachers vary greatly in their contributions to student achievement (Rivkin, Hanushek, & Kain, 2005). • The Widget Effect report (Weisberg et al., 2009) found that 90% of teachers were rated “good” or better in districts where students were failing at high levels

An aligned teacher evaluation system: Part I

An aligned teacher evaluation system: Part II

“Effective” vs. “Highly Qualified” • The focus has shifted away from ensuring highly qualified teachers in every classroom to ensuring effective teachers in every classroom • This shift is a result of numerous studies that show that qualifications provide a “floor” or “minimum” set of competencies but do not predict which teachers will be most successful at helping students learn

Definitions in the research & policy worlds • Much of the research on teacher effectiveness doesn’t define effectiveness at all though it is often assumed to be teachers’ contribution to student achievement • Bryan C. Hassel of Public Impact stated in 2009 that “The core of a state’s definition of teacher effectiveness must be student outcomes” • Checker Finn stated in 2010 that “An effective teacher is one whose pupils learn what they should while under his/her tutelage”

Definitions in the research & policy worlds (2) • Anderson (1991) stated that “… an effective teacher is one who quite consistently achieves goals which either directly or indirectly focus on the learning of their students” (p. 18).

Definitions in the research & policy worlds (3) • Hunt (2009) stated that “…the term ‘teacher effectiveness’ is used broadly, to mean the collection of characteristics, competencies, and behaviors of teachers at all educational levels that enable students to reach desired outcomes, which may include the attainment of specific learning objectives as well as broader goals such as being able to solve problems, think critically, work collaboratively, and become effective citizens” (p. 1).

Goe, Bell, & Little (2008) definition of teacher effectiveness • Have high expectations for all students and help students learn, as measured by value-added or alternative measures. • Contribute to positive academic, attitudinal, and social outcomes for students, such as regular attendance, on-time promotion to the next grade, on-time graduation, self-efficacy, and cooperative behavior. • Use diverse resources to plan and structure engaging learning opportunities; monitor student progress formatively, adapting instruction as needed; and evaluate learning using multiple sources of evidence. • Contribute to the development of classrooms and schools that value diversity and civic-mindedness. • Collaborate with other teachers, administrators, parents, and education professionals to ensure student success, particularly the success of students with special needs and those at high risk for failure.

Race to the Top definition of effective & highly effective teacher • Effective teacher: students achieve acceptable rates (e.g., at least one grade level in an academic year) of student growth (as defined in this notice). States, LEAs, or schools must include multiple measures, provided that teacher effectiveness is evaluated, in significant part, by student growth (as defined in this notice). Supplemental measures may include, for example, multiple observation-based assessments of teacher performance. (pg 7) • Highly effective teacher: students achieve high rates (e.g., one and one-half grade levels in an academic year) of student growth (as defined in this notice).

Measures and models: Definitions • Measures are the instruments, assessments, protocols, rubrics, and tools that are used in determining teacher effectiveness • Models are the state or district systems of teacher evaluation including all of the inputs and decision points (measures, instruments, processes, training, and scoring, etc.) that result in determinations about individual teachers’ effectiveness

Teaching standards • A set of practices teachers should aspire to • A teaching tool in teacher preparation programs • A guiding document with which to align: • Measurement tools and processes for teacher evaluation, such as classroom observations, surveys, portfolios/evidence binders, student outcomes, etc. • Teacher professional growth opportunities, based on evaluation of performance on standards • A tool for coaching and mentoring teachers: • Teachers analyze and reflect on their strengths and challenges and discuss with consulting teachers

Who should be at the table? • “An SEA must meaningfully engage and solicit input from diverse stakeholders and communities in the development of its request.” (NCLB Waiver application, pg. 15) • A description of how the SEA meaningfully engaged and solicited input on its request from teachers and their representatives. • A description of how the SEA meaningfully engaged and solicited input on its request from other diverse communities, such as students, parents, community-based organizations, civil rights organizations, organizations representing students with disabilities and English Learners, business organizations, and Indian tribes.

Multiple measures of teacher effectiveness • Evidence of growth in student learning and competency • Standardized tests, pre/post tests in untested subjects • Student performance (art, music, etc.) • Curriculum-based tests given in a standardized manner • Classroom-based tests such as DIBELS • Evidence of instructional quality • Classroom observations • Lesson plans, assignments, and student work • Student surveys such as Harvard’s Tripod • Evidence binder (next generation of portfolio) • Evidence of professional responsibility • Administrator/supervisor reports, parent surveys • Teacher reflection and self-reports, records of contributions

Teacher observations: strengths and weaknesses • Strengths • Great for teacher formative evaluation (if observation is followed by opportunity to discuss) • Helps evaluator (principals or others) understand teachers’ needs across school or across district • Weaknesses • Only as good as the instruments and the observers • Considered “less objective” • Expensive to conduct (personnel time, training, calibrating) • Validity of observation results may vary with who is doing them, depending on how well trained and calibrated they are

Why teachers generally value observations • Observations are the traditional measure of teacher performance • Teachers feel they have some control over the process and outcomes • They report that having a conversation with the observer and receiving constructive feedback after the observation is greatly beneficial • Evidence-centered discussions can help teachers improve instruction • Peer evaluators often report that they learn new teaching techniques

When teachers don’t value observations, it’s because… • They do not receive feedback at all • The feedback they receive is not specific and actionable • The observer suggests actions but is unable to offer the means and resources to carry out those actions • Mentors/coaches, other support personnel • Time for individual growth planning/activities • Protected time for collaboration with others

Validity of classroom observations is highly dependent on training • A teacher should get the same score no matter who observes him • This requires that all observers be trained on the instruments and processes • Occasional “calibrating” should be done; more often if there are discrepancies or new observers • Who the evaluators are matters less than adequate training • Teachers should be trained on the observation forms and processes
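One simple way to check whether observers are calibrated is to have two of them score the same lessons and compare the results. The sketch below is illustrative only: the function name `agreement`, the 1-4 rubric scale, and the sample scores are assumptions, not part of any particular observation instrument.

```python
def agreement(rater_a, rater_b):
    """Return (exact, adjacent) agreement rates for two observers.

    rater_a, rater_b: rubric scores given by each observer to the same
    lessons, in the same order (hypothetical 1-4 scale).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both observers must score the same non-empty set of lessons")
    n = len(rater_a)
    pairs = list(zip(rater_a, rater_b))
    # Exact agreement: both observers gave the identical score.
    exact = sum(a == b for a, b in pairs) / n
    # Adjacent agreement: scores differ by at most one rubric level.
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / n
    return exact, adjacent


# Made-up scores for five lessons observed by two raters.
observer_1 = [3, 2, 4, 1, 3]
observer_2 = [3, 3, 4, 3, 2]
exact_rate, adjacent_rate = agreement(observer_1, observer_2)
```

Low agreement on a set of shared lessons would be one signal that recalibration is needed before scores are used for evaluation.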

Reliability results when using different combinations of raters and lessons Figure 2. Errors and Imprecision: the reliability of different combinations of raters and lessons. From Hill et al., 2012 (see references list). Used with permission of author.

Cincinnati study results • Study by Kane et al. (2010) used teacher evaluation scores plus value-added scores • “…policies and programs that help a teacher get better on all eight ‘teaching practice’ and ‘classroom environment’ skills measured by TES will lead to student achievement gains” (p. 28) • “…helping teachers improve their ‘classroom environment’ management will likely also generate higher student achievement” (p. 28) • “…[adding] pedagogy that utilizes ‘questioning and discussion’ practices will generate higher reading achievement, but not higher math achievement” (p. 28)

Value-added models • Many variations on value-added models • TVAAS (Sanders’ original model) typically uses 3+ years of prior test scores to predict the next score for a student • Used since the 1990s for teachers in Tennessee, but not for high-stakes evaluation purposes • Most states and districts that currently use VAMs use the Sanders model, also called EVAAS • There are other models that use less student data to make predictions • Considerable variation in “controls” used
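The basic idea of predicting a student's next score from prior scores, then comparing actual to predicted, can be sketched in a few lines. This is a deliberately simplified illustration, not the TVAAS/EVAAS model: it assumes a single predictor (the mean of each student's prior scores), a one-variable least-squares fit, no statistical controls, and made-up data. All function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def value_added(records):
    """records: list of (teacher, prior_scores, current_score) per student.

    Returns each teacher's mean residual: how far, on average, that
    teacher's students scored above or below their predicted score.
    """
    prior_means = [mean(prior) for _, prior, _ in records]
    currents = [current for _, _, current in records]
    intercept, slope = fit_line(prior_means, currents)
    residuals = {}
    for (teacher, _, current), prior_mean in zip(records, prior_means):
        predicted = intercept + slope * prior_mean
        residuals.setdefault(teacher, []).append(current - predicted)
    return {teacher: mean(r) for teacher, r in residuals.items()}


# Made-up data: three prior-year scores and one current score per student.
records = [
    ("A", [40, 42, 44], 50),
    ("A", [60, 62, 64], 70),
    ("B", [40, 42, 44], 44),
    ("B", [60, 62, 64], 64),
]
estimates = value_added(records)
```

In this toy data, Teacher A's students score above prediction and Teacher B's below, even though both teachers have students with identical prior scores.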

Growth vs. Proficiency Models • [Figure: achievement (vertical axis, with the “Proficient” level marked) from start of school year to end of year; Teacher A: “Success” on achievement levels; Teacher B: “Failure” on achievement levels] • In terms of growth, Teachers A and B are performing equally • Slide courtesy of Doug Harris, Ph.D., University of Wisconsin-Madison

Growth vs. Proficiency Models (2) • [Figure: achievement (vertical axis, with the “Proficient” level marked) from start of school year to end of year, for Teacher A and Teacher B] • A teacher with low-proficiency students can still be high in terms of GROWTH (and vice versa) • Slide courtesy of Doug Harris, Ph.D., University of Wisconsin-Madison
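The contrast between the two lenses can be made concrete with a small worked example. The cut score of 50 and the class averages below are invented for illustration; the point is only that the same pair of classrooms looks different under a proficiency view than under a growth view.

```python
PROFICIENT = 50  # hypothetical cut score on a made-up scale


def summarize(start, end, cut=PROFICIENT):
    """Describe one classroom's average score under both lenses."""
    return {
        "growth": end - start,       # change from start to end of year
        "proficient": end >= cut,    # status relative to the cut score
    }


# Teacher A's students start near the cut score; Teacher B's start well below it.
teacher_a = summarize(start=45, end=55)
teacher_b = summarize(start=30, end=40)
```

Both classrooms gain 10 points, so a growth model treats them equally, while a proficiency model labels only Teacher A's classroom a success.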

Most popular growth models: Colorado Growth Model • Colorado Growth model • Focuses on “growth to proficiency” • Measures students against “academic peers” • Also called criterion‐referenced growth‐to‐standard models • The student growth percentile is “descriptive” whereas value-added seeks to determine the contribution of a school or teacher to student achievement (Betebenner 2008)
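The notion of measuring a student against “academic peers” can be illustrated with a rough sketch. This is an assumption-laden simplification: it defines peers as students whose prior score falls within a fixed band (the `band` parameter is hypothetical) and reports a plain percentile rank. Operational student growth percentile calculations are more sophisticated than this.

```python
def growth_percentile(student, cohort, band=5):
    """Percentile rank of a student's current score among academic peers.

    student: (prior_score, current_score)
    cohort:  list of (prior_score, current_score) for other students
    band:    peers are students whose prior score is within this many points
    """
    prior, current = student
    peer_scores = [cur for p, cur in cohort if abs(p - prior) <= band]
    if not peer_scores:
        return None  # no comparable peers, so no percentile can be reported
    below = sum(1 for cur in peer_scores if cur < current)
    return round(100 * below / len(peer_scores))


# Made-up cohort: four students with similar prior scores, one much higher.
cohort = [(40, 38), (41, 45), (42, 50), (39, 41), (70, 80)]
sgp = growth_percentile((40, 47), cohort)
```

The result describes where the student's current score falls among similar-starting students; by itself it makes no claim about what caused that growth.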

An illustration of student growth over time in Denver, CO Slide courtesy of Damian Betebenner at www.nciea.org

What value-added and growth models cannot tell you • Value-added and growth models are really measuring classroom, not teacher, effects • Value-added models can’t tell you why a particular teacher’s students are scoring higher than expected • Maybe the teacher is focusing instruction narrowly on test content • Or maybe the teacher is offering a rich, engaging curriculum that fosters deep student learning. • How the teacher is achieving results matters!

Measuring teachers’ contributions to student learning growth (classroom)

Race to the Top definition of student growth • Student growth means the change in student achievement (as defined in this notice) for an individual student between two or more points in time. A State may also include other measures that are rigorous and comparable across classrooms. (pg 11)

Measuring teachers’ contributions to student learning growth: A summary of current models

School-wide VAM illustration

DC Impact: Score comparison for Groups 1-3

Validity • There is little research-based support for the validity of using student growth measures for teacher evaluation • Mainly because using student growth measures in evaluation hasn’t been done • Herman et al. (2011) state, “Validity is a matter of degree (based on the extent to which an evidence-based argument justifies the use of an assessment for a specific purpose).” (pg. 1)

Propositions that justify the use of these measures for evaluating teacher effectiveness • IF standards clearly define learning expectations for the subject area and each grade level • AND IF the assessment instruments have been designed to yield scores that can accurately reflect student achievement of standards • AND IF there is evidence that the assessment scores actually measure the learning expectations • AND IF the assessment instruments have been designed to yield scores that accurately reflect student learning growth over the course of the year • AND IF student growth scores accurately and fairly measure student progress over the course of the year • AND IF assessment scores represent teachers’ contribution to student growth • THEN interpretation of scores may be appropriately used to inform judgments about teacher effectiveness • Adaptation based on Bailey & Heritage, 2010 and Perie & Forte (in press) (Herman, Heritage & Goldschmidt, 2011). Slide used courtesy of Margaret Heritage.

Validity is a process • Starts with defining the criteria and standards you want to measure • Requires judgment about whether the instruments and processes are giving accurate, helpful information about performance • Verify validity by • Comparing results on multiple measures • Multiple time points, multiple raters

The 4 Ps (Projects, Performances, Products, Portfolios) • Some learning is best measured with assessments other than a standardized test • Yes, they can be used to demonstrate teachers’ contributions to student learning growth • Here’s the basic approach • Use a high-quality rubric to judge initial knowledge and skills required for mastery of the standard(s) • Use the same rubric to judge knowledge and skills at the end of a specific time period (unit, grading period, semester, year, etc.)
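The pre/post approach described above can be sketched as a simple comparison of rubric scores. The criterion names, the rubric levels, and the function `rubric_growth` are hypothetical; the only thing taken from the slide is that the same rubric is applied at the start and the end of the period.

```python
def rubric_growth(pre, post):
    """Compare two scorings of the same student on the same rubric.

    pre, post: dicts mapping rubric criterion -> level awarded.
    Returns (gain per criterion, average gain across criteria).
    """
    if pre.keys() != post.keys():
        raise ValueError("pre and post must be scored on the same rubric criteria")
    gains = {criterion: post[criterion] - pre[criterion] for criterion in pre}
    return gains, sum(gains.values()) / len(gains)


# Made-up portfolio scores at the start and end of a semester.
pre_scores = {"technique": 1, "composition": 2, "reflection": 1}
post_scores = {"technique": 3, "composition": 3, "reflection": 2}
gains, average_gain = rubric_growth(pre_scores, post_scores)
```

Checking that both scorings use identical criteria guards against comparing results from two different rubrics, which would not show growth.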

Assessing Musical Behaviors: The type of assessment must match the knowledge or skill • 4 types of musical behaviors: Responding, Creating, Performing, Listening • Types of assessment: Rubrics, Playing tests, Written tests, Practice sheets, Teacher observation, Portfolios, Peer and self-assessment • Slide used with permission of authors Carla Maltas, Ph.D. and Steve Williams, M.Ed. See reference list for details.

Georgia CLASS KEYS

Washington DC IMPACT: Rubric for Determining Success (for teachers in non-tested subjects/grades)

The “caseload” educators • For nurses, counselors, librarians and other professionals who do not have their own classroom, what counts for you is your “caseload” • May be all the students in the school • May be a specific set of students • May be other teachers • May be all of the above!

Other teachers with “caseloads” • For team teachers, special ed teachers, ELL teachers, other itinerant teachers • Caseload would be the students you provide instruction or assistance to • When students are shared between two teachers, those students belong to both teachers’ caseloads • This may be done as a percentage, or the shared student scores would be counted for each teacher
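The two options for shared students (attributing a percentage, or counting the shared student fully for each teacher) can be compared with a small sketch. The growth scores and the 50% share are invented, and `caseload_growth` is a hypothetical helper, not part of any state's model.

```python
def caseload_growth(caseload):
    """Weighted average growth for one teacher's caseload.

    caseload: list of (growth_score, share), where share is the fraction
    of the student's instruction attributed to this teacher (1.0 = sole teacher).
    """
    total_share = sum(share for _, share in caseload)
    if total_share == 0:
        raise ValueError("caseload has no attributed students")
    return sum(growth * share for growth, share in caseload) / total_share


# One student taught alone (growth 10) and one shared student (growth 4).
by_percentage = caseload_growth([(10, 1.0), (4, 0.5)])  # shared student counts half
counted_fully = caseload_growth([(10, 1.0), (4, 1.0)])  # shared student counts fully
```

With these made-up numbers the teacher's average is 8.0 under percentage attribution and 7.0 when the shared student counts fully, so the choice of rule changes the result.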

Tripod Survey (1) • Harvard’s Tripod Survey – the 7 C’s • Caring about students (nurturing productive relationships); • Controlling behavior (promoting cooperation and peer support); • Clarifying ideas and lessons (making success seem feasible); • Challenging students to work hard and think hard (pressing for effort and rigor); • Captivating students (making learning interesting and relevant); • Conferring (eliciting students’ feedback and respecting their ideas); • Consolidating (connecting and integrating ideas to support learning)

Tripod Survey (2) • Improved student performance depends on strengthening three legs of teaching practice: content, pedagogy, and relationships • There are multiple versions: K-2, 3-5, 6-12 • Measures: • student engagement • school climate • home learning conditions • teaching effectiveness • youth culture • family demographics • Takes 20-30 min • There are English and Spanish versions • Comes in paper form or in online version

Tripod Survey (3) • Control is the strongest correlate of value-added gains • However, it is important to keep in mind that a good teacher achieves control by being good on the other dimensions
