Robert W. Lissitz University of Maryland


The Evaluation of Teacher and School Effectiveness Using Growth Models and Value Added Modeling:Hope Versus Reality

Robert W. Lissitz

University of Maryland

http://marces.org/Completed.htm

Maryland Assessment Research Center for Education Success

Thank you
  • First, I want to thank…
  • The creators of this symposium
    • Burcu Kaniskan
  • The State of Maryland
  • MARCES:
    • Laura Reiner, Yuan Zhang, Xiaoshu Zhu, and Dr. Bill Schafer
  • Drs. Xiaodong Hou and Ying Li
  • Yong Luo, Matt Griffin, Tiago Calico, and Christy Lewis
Preview
  • History of VAM
  • Literature:
    • Reliability
    • Validity
  • Application of VAM
  • Direction of VAM in the future
    • Applied viewpoint
    • Psychometric viewpoint
Introduction and History

RACE TO THE MIDDLE

  • The federal government is asking psychometricians to help make decisions
    • Race to the Top
    • Earlier: No Child Left Behind (“Race to the Middle”)
  • The government wants a system that will
    • Pressure educational administrations to do the right thing
    • Combat the teachers’ unions perceived as obstacles

Introduction and History

WHAT IS VAM?

  • Value-added modeling (VAM) is a system that we hope can determine the effectiveness of some mechanism
    • Usually teachers or schools
  • Most popular models include
    • Simple regression
    • Recording transitions between performance levels in adjacent grades
    • Mixed effects or multilevel regression models
      • Teacher or school as level 2 effect
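
To make the multilevel option concrete, a two-level random-intercept model could be written as below. This is only a sketch: the notation is assumed for illustration and is not taken from the talk. Here $y_{ij}$ is the current-year score of student $i$ taught by teacher $j$, $x_{ij}$ is that student's prior-year score, and the level 2 term $u_j$ plays the role of the teacher (or school) effect.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \tau^2),
\qquad e_{ij} \sim N(0, \sigma^2)
```

Under these assumptions, a teacher whose estimated $u_j$ is positive has students scoring above what their prior scores predict, on average.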

Introduction and History

WHAT IS VAM?

  • Results for each student are usually aggregated
    • Provides summaries of every student for each teacher
  • Attempt to show whether students associated with a teacher are performing above or below statistically expected values, or values associated with other teachers
  • Usually normative in nature

Introduction and History

MANDEVILLE – late 1980’s

  • Investigated school effectiveness and reliability of indicators
  • Findings:
    • Some schools are better than others
    • Differences in quality are inconsistent
      • Across years
      • Within schools across grade levels and subject areas

Introduction and History

DALLAS – mid-1990’s

  • 1994: School effects
  • 1995-1996: Teacher effects
  • Model with two stages:
    • Regression to control for “fairness variables”
      • Gender, ethnicity, English proficiency, SES, etc.
    • HLM to control for prior achievement, attendance, and school-level variables
  • High stakes decisions
    • Bonuses
    • Frequency of classroom observations

Introduction and History

TVAAS – mid-1990’s

  • Sanders et al.
  • “Layered” multiple regression model
    • Effects of teachers and past teachers
  • Multiple years of prior performance on several subject matter exams
    • Used to covary out the effect of undesirable student characteristics on growth
  • Complex interactions could not be statistically removed
    • Effects may have different influence on students of different ability levels
    • Probably not possible to eliminate statistically
  • Future might look at latent classes of students and teachers

Introduction and History

CHALLENGES – CRITICISM

  • Nonrandom assignment of students to teachers
    • Effect not controlled by use of prior performance level
    • Bias reduced by using multiple prior measures
  • “Dynamic” interaction between students and teachers
    • Association between teacher effectiveness and student characteristics
  • VAM for high-stakes decisions cannot be applied to all teachers
    • Many teachers with subjects not tested
    • Memphis, TN – VAM does not apply to 70% of teachers
Reliability

GENERALIZABILITY

  • Think of the reliability of VAM as a generalizability problem.
  • Is teacher effectiveness justified as a main effect, or are teachers actually effective in some circumstances and ineffective in others?
  • If interactions exist, the problem for the principal changes from “who is ineffective?” to “are there conditions in which this teacher can be effective?”
Reliability

STABILITY OVER A ONE-YEAR PERIOD

  • Mandeville (1988):
  • Year-to-year correlations of school effectiveness estimates ranged from 0.34 to 0.66
    • Large differences across grade level and subject matter
  • McCaffrey (2009):
  • Teacher effect estimates one year apart had correlations around 0.2 to 0.3
  • Teaching itself may not be a stable phenomenon
    • Variability may be due to actual performance changes from year to year; instability may be intractable

Reliability

STABILITY OVER A SHORT PERIOD OF TIME

  • Sass (2008) and Newton, et al (2010):
  • Estimates of teacher effectiveness from test-retest assessments over a short time period
  • Correlations in the range of 0.6
  • For high stakes testing, we usually require reliability greater than 0.8
  • Still may indicate a real phenomenon, but modest

Reliability

STABILITY ACROSS GRADE AND SUBJECT

  • Mandeville & Anderson (1987) and others (Rockoff, 2004; Newton, et al, 2010):
  • Stability fluctuates across grade and subject matter
  • Limited stability found more often with math courses, less often with reading courses
  • Success depends on what class you are assigned rather than your ability?
    • Serious issues of fairness and comparability

Reliability

STABILITY AT THE SCHOOL LEVEL

  • Perception that entire school is good or bad is very popular
  • St. Louis, early 1990’s
    • Challenged advisory committee to find a school that remained at the top 3 years in a row
    • No system that reported back had even one
  • Federal Blue Ribbon Schools
    • “Winning school in one year was typically not at the top a year or two later”
  • Bottom line:
    • Rankings or groupings of schools (e.g., quintiles) are not stable.

Reliability

STABILITY ACROSS TEST FORMS

  • Sass (2008):
  • Top quintile and bottom quintile seem the most stable
  • Correlation of teacher effectiveness in those groups was 0.48 across comparable exams over a short time
  • Time extended to a year between tests: correlation dropped to 0.27
  • Papay (2011):
  • Three different tests
  • Rank order correlations of teacher effectiveness across time ranged from 0.15 to 0.58 across different tests
  • Test timing and measurement error have effects

Reliability

STABILITY ACROSS STATISTICAL MODELS

  • Tekwe, et al (2004):
  • Compared four regression models
  • Unless models involve different variables, results tend to be similar
  • Dawes (1979):
  • Linear composites seem to be pretty much the same regardless of how one gets the weights
  • Hill, et al (2011):
  • Convergent validity problem

Reliability

STABILITY ACROSS CLASSROOMS

  • Newton, et al (2010):
  • Students who are less advantaged, ESL, or on a lower track can have a negative impact on teacher effect estimates
  • Multiple VAM models were tested
  • Success of matching teacher characteristics to VAM outcomes was modest
  • VAM could be used as a criterion to judge other variables, but validity is questionable

Reliability

SOURCES OF UNRELIABILITY

  • Persistent effects (teacher consistency), non-persistent effects (inconsistency), and non-persistence due to sampling error (unknown)
  • 30-60% of variation is due to sampling error
    • In part due to small numbers of students as the basis of effectiveness estimates
  • Regression to the mean
    • Class sizes vary within a school or district
    • Classrooms with fewer students tend toward the mean
  • Bayes estimates in multilevel modeling also introduce bias that is a function of sample size
  • Other occupations: Lack of consistency is typical of complex professions – baseball players, stock investors…
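
The point about Bayes estimates and sample size can be illustrated with a small sketch. It assumes the standard empirical-Bayes shrinkage form, in which a class mean is pulled toward the grand mean by a reliability weight that depends on class size. The function name, the variance components, and the class sizes are hypothetical values chosen for illustration, not figures from the studies cited.

```python
def shrunken_estimate(class_mean, n_students, grand_mean, tau2, sigma2):
    """Shrink one teacher's class mean toward the grand mean.

    tau2   : assumed variance of true teacher effects
    sigma2 : assumed variance of student-level noise
    The weight tau2 / (tau2 + sigma2 / n) grows with class size,
    so small classes are pulled harder toward the grand mean.
    """
    reliability = tau2 / (tau2 + sigma2 / n_students)
    return grand_mean + reliability * (class_mean - grand_mean)


# Hypothetical example: two teachers with the same observed class mean
# (10 points above a grand mean of 0) but different class sizes.
small_class = shrunken_estimate(10.0, 8, 0.0, tau2=4.0, sigma2=64.0)
large_class = shrunken_estimate(10.0, 32, 0.0, tau2=4.0, sigma2=64.0)
print(small_class, large_class)
```

With these made-up numbers the teacher with 8 students receives a noticeably smaller estimate than the teacher with 32 students, even though the observed means are identical.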

Validity

JOB APPLICATIONS AS PREDICTIVE MEASURES

  • Years of experience, advanced degrees, certification, licensure, school quality, etc. have low relationships (if any) to teacher effectiveness
    • Weak relationship between effectiveness and advanced degree
  • Knowledge of mathematics positively correlated with teaching mathematics effectively
  • VAM estimates provide better measures of teacher impact on student test scores than measures on teacher’s job application

Validity

TRIANGULATION OF MULTIPLE INDICATORS

  • Goe, et al (2008):
  • Context for evaluation
  • Teachers should be compared to other teachers who:
    • Teach similar courses
    • In same grade
    • In a similar context
    • Assessed by same or similar examination
  • Probably necessary to establish validity

Validity

COMPARABILITY

  • Ability is very likely correlated with growth and status
    • Do gifted students learn at the same rate as others?
    • Gifted students and their teachers have an advantage
  • Interaction between student ability and teachers’ ability to be effective
  • Mixture models are in development

Validity

CAUSALITY, RESEARCH DESIGN, AND THEORY

  • Rubin (2004):
  • Missing data is not missing at random
    • Missing in a way that confounds results and complicates inferences
  • We do not have a clear idea what our hypothesis is
  • Multiple operational definitions of growth, but no developmental science for the phenomenon

Validity

CAUSALITY, RESEARCH DESIGN, AND THEORY

  • Without carefully controlled experiments, we cannot isolate teacher effects
    • Students have multiple teachers
    • Influence of prior performance and experience
  • What do we even mean by causal effect?
    • How do teachers and schools impart their effect?
    • How is it internalized by the student?
  • Lord’s paradox
    • ANCOVA does not lead to unambiguous interpretations
  • Only experimental efforts will provide adequate results
  • Eminent faculty member: teacher decision-making - unclear what is optimal

Validity

WHY SHOULD WE CARE?

  • Are teachers the most important factor determining student achievement?
    • Nye, et al (2004): 11% of variation in student gains explained by teacher effects
    • Rockoff (2004):
      • Teacher effects 5.0-6.4%
      • School effects 2.7-6.1%
      • Student fixed effects 59-68%


Validity

WHY SHOULD WE CARE?

  • Importance of classroom context
    • Kennedy (2010), etc.:
      • Situational factors influence teacher success
        • Time, materials, work assignments
      • Controlling behavioral issues; mainstreaming only students who are willing/capable to be non-disruptive
      • Technical assistance with teaching (computers..)
  • New teacher’s goal: Maximize context for learning

Validity

WHY SHOULD WE CARE?

  • New paradigm – different orientation toward the learning process
  • Teacher optimizes the context of the classroom
    • Adding to motivation
    • Preventing disruption
    • Providing opportunity for enhanced learning engagement
  • Use of assistive teaching devices (computers) will change teacher’s role
  • Develop a learning science
    • Current paradigm emphasizes external validity and immediate generality
    • Instead, create laboratory for education science

Validity

WHY SHOULD WE CARE?

  • Fairness
    • Little evidence VAM is ready for high stakes use
    • But…
      • Is it less fair than traditional personnel selection that focuses on advanced degrees and certificates, more credit hours, and working more years?
      • Is it less fair than classroom observations?


OUR STUDY

COMPARING MODELS USING REAL DATA

  • The MARCES Center has studied 11 of the simplest models that might be applied
  • The full VAM report and the full text supporting this presentation can be accessed at
  • http://marces.org/Completed.htm

OUR STUDY

COMPARING MODELS USING REAL DATA

  • We obtained 3 years of data on the same students, linked to their teachers
  • Students divided into four cohorts: (N ≈ 5000 per cohort)
  • Math and reading data from yearly spring state assessment (2008-2010)
    • No vertical scale
    • Horizontally equated from year to year
  • VAM models chosen for comparison do not require vertical scaling
    • Nine models compare growth from first to second year
    • Two models compare growth from first and second to third year

TABLE 2:

Data used in our study


OUR STUDY

MODELS


OUR STUDY

MODELS

  • BETEBENNER’S MODEL
  • Used in Colorado
  • Looks at conditional percentile of each student’s performance in the second year, compared to other students who started in same percentile the first year
  • Aggregates conditional percentiles of students exposed to the same teacher
  • QRG1 uses prior year to condition the percentile the next year
  • ConD is a simplification: aggregates students into deciles one year and compares to deciles the second year
  • QRG2 uses 2 prior years to condition the percentile the 3rd year
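
The idea of a conditional percentile can be sketched in a few lines. The code below is not Betebenner's quantile-regression procedure, and it is not claimed to be the exact ConD computation. It is a simplified, decile-style illustration that assumes students are grouped by their year-one rank and that each student's year-two score is then ranked only against peers from the same starting group. All function names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rank_groups(scores, n_groups=10):
    """Assign each student a group index (0..n_groups-1) from rank order.

    Tied scores are split arbitrarily by input order in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(scores)
    return groups


def conditional_percentiles(year1, year2, n_groups=10):
    """Percentile of each year-two score among peers with a similar year-one start."""
    groups = rank_groups(year1, n_groups)
    peers_by_group = defaultdict(list)
    for g, y2 in zip(groups, year2):
        peers_by_group[g].append(y2)
    result = []
    for g, y2 in zip(groups, year2):
        peers = peers_by_group[g]
        below = sum(1 for p in peers if p < y2)
        ties = sum(1 for p in peers if p == y2)  # includes the student
        result.append(100.0 * (below + 0.5 * ties) / len(peers))
    return result


def teacher_means(values, teachers):
    """Aggregate student-level values into one mean per teacher."""
    by_teacher = defaultdict(list)
    for v, t in zip(values, teachers):
        by_teacher[t].append(v)
    return {t: mean(vs) for t, vs in by_teacher.items()}


# Hypothetical data: four students, two starting groups, two teachers.
year1 = [10, 20, 30, 40]
year2 = [15, 12, 45, 35]
teachers = ["A", "B", "A", "B"]
percentiles = conditional_percentiles(year1, year2, n_groups=2)
teacher_scores = teacher_means(percentiles, teachers)
print(teacher_scores)
```

Taking the mean per teacher is one possible aggregation; a median would be an equally plausible choice, and the talk does not say which was used.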

OUR STUDY

MODELS

  • THUM’S MODEL
  • Similar to ConD, but looks at effect size
  • Uses z score to identify student’s performance level compared to the average student the first year
  • In second year, compares student’s z score to students who started at same z position (within a decile) in the prior year
  • Conditional z scores aggregated for each teacher to provide measure of effectiveness
  • Our simplification: z score conditional on prior deciles:
    • Rank order all students’ year one scale scores; divide into 10 deciles
    • Compute mean of year 2 scale scores for students within each decile
    • Compute deviation scores from the decile mean of year 2 scale scores for students within each decile
    • Compute pooled within-decile SD of year 2 scale scores
    • Compute growth z score for each student
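
The five steps of the simplification above can be sketched as follows. The function name and the data are hypothetical, and the pooled within-decile SD is assumed here to divide the within-group sum of squares by N minus the number of groups; the talk does not specify the denominator.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def growth_z_scores(year1, year2, n_groups=10):
    """Growth z score: year-two deviation from the mean of one's year-one group,
    divided by the pooled within-group SD of year-two scores."""
    n = len(year1)
    # Step 1: rank order year-one scores and cut into n_groups groups.
    order = sorted(range(n), key=lambda i: year1[i])
    group = [0] * n
    for rank, i in enumerate(order):
        group[i] = rank * n_groups // n
    # Step 2: mean of year-two scores within each group.
    by_group = defaultdict(list)
    for g, y in zip(group, year2):
        by_group[g].append(y)
    group_mean = {g: mean(v) for g, v in by_group.items()}
    # Step 3: deviation of each student from the group mean.
    deviations = [y - group_mean[g] for g, y in zip(group, year2)]
    # Step 4: pooled within-group SD (assumed denominator: N - number of groups;
    # this requires at least one group with more than one student).
    pooled_sd = sqrt(sum(d * d for d in deviations) / (n - len(by_group)))
    # Step 5: growth z score for each student.
    return [d / pooled_sd for d in deviations]


# Hypothetical data: four students split into two groups instead of ten.
year1 = [10, 20, 30, 40]
year2 = [15, 12, 45, 35]
z = growth_z_scores(year1, year2, n_groups=2)
print(z)
```

Averaging these z scores over the students linked to each teacher would then give the teacher-level measure described above.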

OUR STUDY

MODELS

  • ORDINARY LEAST SQUARES REGRESSION
  • Aggregates errors of prediction across teachers to see which teacher’s students tend to perform above or below prediction

  • OLS1
    • Independent variable: first year scale score
    • Effectiveness measure: deviation from expected scale score for year two
  • OLS2
    • Independent variables: first two years’ scale scores
    • Effectiveness measure: deviation from expected scale score for year three
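
A minimal sketch of the OLS1 idea follows: regress year-two scores on year-one scores, then average the prediction errors by teacher. The function name and the data are hypothetical, and the regression is fitted with the textbook closed-form formulas rather than any particular statistics package.

```python
from collections import defaultdict
from statistics import mean


def ols1_teacher_effects(year1, year2, teachers):
    """Mean residual per teacher from a simple regression of year2 on year1."""
    mx, my = mean(year1), mean(year2)
    sxx = sum((x - mx) ** 2 for x in year1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(year1, year2))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = defaultdict(list)
    for x, y, t in zip(year1, year2, teachers):
        predicted = intercept + slope * x
        residuals[t].append(y - predicted)  # positive = above prediction
    return {t: mean(r) for t, r in residuals.items()}


# Hypothetical data: four students and two teachers.
year1 = [1, 2, 3, 4]
year2 = [2, 5, 6, 9]
teachers = ["A", "B", "A", "B"]
effects = ols1_teacher_effects(year1, year2, teachers)
print(effects)
```

An OLS2-style variant would add the second prior year as another predictor of the year-three score, which needs a multiple-regression fit instead of the closed form used here.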


OUR STUDY

MODELS

  • REGRESSION USING SPLINE SCORES
  • Calculated with scores that had been transformed by a spline function
  • Gives relational meaning to points along the performance continuum across grades
    • Builds a quasi-vertical scale without common items
  • Transformation matched to cut scores for 3 proficiency levels: basic, proficient, advanced

  • OLSS applies ordinary least squares to the spline scale scores and looks at deviations from predicted scores
  • DIFS subtracts the spline-transformed score at year 1 from the spline-transformed score at year 2, as though they were on a true vertical scale
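
To illustrate how a transformation matched to cut scores might behave, the sketch below uses a piecewise-linear (linear spline) mapping. This is an assumption: the talk does not give the form of the spline function. The cut scores, scale limits, and anchor values are all hypothetical. Each grade's knots (scale minimum, proficient cut, advanced cut, scale maximum) are mapped to a common set of anchors, and a DIFS-style score is the difference between the transformed scores.

```python
def spline_transform(score, knots, anchors):
    """Piecewise-linear map that sends each knot to its anchor value."""
    if score <= knots[0]:
        return float(anchors[0])
    if score >= knots[-1]:
        return float(anchors[-1])
    segments = zip(zip(knots, knots[1:]), zip(anchors, anchors[1:]))
    for (k0, k1), (a0, a1) in segments:
        if k0 <= score <= k1:
            return a0 + (a1 - a0) * (score - k0) / (k1 - k0)


# Hypothetical knots: scale minimum, proficient cut, advanced cut, scale maximum.
grade3_knots = [240, 380, 440, 650]
grade4_knots = [240, 390, 460, 650]
anchors = [0, 100, 200, 300]  # hypothetical common scale

year1_transformed = spline_transform(380, grade3_knots, anchors)
year2_transformed = spline_transform(425, grade4_knots, anchors)
difs = year2_transformed - year1_transformed
print(year1_transformed, year2_transformed, difs)
```

In this made-up case the student sits exactly at the proficient cut in year one and halfway between the proficient and advanced cuts in year two, so the difference is positive even though the raw scales are not vertically linked.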


OUR STUDY

MODELS

  • TRANSITION MODELS
  • Used in Delaware and Arkansas
  • Classify students into categories in year one (basic, proficient, advanced)
    • Divide each category into three subcategories
  • Observe year two category conditional on year one performance
  • Matrix associated with transition from level at year one to level at year two
    • Values represent importance of each transition; determined by educators
  • TRUG rewards students only for growth
    • Does not punish for regressing
    • Does not distinguish much between amounts of growth
  • TRUD values reflect growth as well as decreased performance
    • Does not reward for status
  • TRSG rewards students for maintaining previous status and for growth within and across performance levels
    • Reward increases with higher performance level status
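
A transition model of this kind can be sketched as a value lookup over (year-one category, year-two category) pairs. The actual value matrices were set by educators and are not given in the talk, so the two value rules below are hypothetical stand-ins: one loosely in the spirit of TRUG (credit only for moving up) and one loosely in the spirit of TRUD (credit for gains, penalty for declines). Categories are numbered 0 to 8 for the nine subcategories.

```python
from statistics import mean

N_CATEGORIES = 9  # 3 performance levels x 3 subcategories


def growth_only_value(start, end):
    """Hypothetical TRUG-like rule: 1 point for any upward move, else 0."""
    return 1 if end > start else 0


def growth_and_decline_value(start, end):
    """Hypothetical TRUD-like rule: signed number of categories moved."""
    return end - start


def value_matrix(rule):
    """Build the full transition-value matrix for a given rule."""
    return [[rule(s, e) for e in range(N_CATEGORIES)] for s in range(N_CATEGORIES)]


def teacher_score(transitions, matrix):
    """Mean transition value over one teacher's students."""
    return mean(matrix[s][e] for s, e in transitions)


# Hypothetical (year-one, year-two) categories for one teacher's students.
transitions = [(2, 3), (4, 4), (5, 3), (0, 2)]
trug_like = teacher_score(transitions, value_matrix(growth_only_value))
trud_like = teacher_score(transitions, value_matrix(growth_and_decline_value))
print(trug_like, trud_like)
```

The same students yield different teacher scores under the two rules, which is the kind of model dependence the comparison in this study is meant to expose.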


OUR STUDY

INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY

  • Each student had growth calculation from year 1-2 and year 2-3
  • Factor analysis of student growth from these models intercorrelated for year 1-2 and replicated for 2-3
    • One dimension accounts for largest percentage of variance
    • Great deal of noise in results
    • Over 80% of variance undefined by first dimension
    • Results of factor analysis same for each pair of years, for each cohort and for each content area
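
The statement that one dimension leaves most of the variance unexplained can be illustrated with a small calculation. The sketch below is a principal-components shortcut, not the factor analysis used in the study: it finds the largest eigenvalue of a correlation matrix by power iteration and divides it by the number of variables. The correlation matrix is hypothetical (eleven growth measures that all intercorrelate at 0.1) and is not the study's data.

```python
from math import sqrt


def first_component_share(corr, iterations=200):
    """Share of total variance carried by the largest eigenvalue of corr.

    Uses power iteration from a vector of ones, which works when the
    dominant eigenvector is not orthogonal to that starting vector
    (typical when most correlations are positive).
    """
    n = len(corr)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    return eigenvalue / n


# Hypothetical correlation matrix for 11 growth measures.
n_models = 11
corr = [[1.0 if i == j else 0.1 for j in range(n_models)] for i in range(n_models)]
share = first_component_share(corr)
print(round(share, 4))
```

With weak intercorrelations like these, the first dimension is the largest single component yet still accounts for less than a fifth of the total variance.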

OUR STUDY

INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY

  • Example: Scree Plot for Math 2008-2009, Cohort 1

OUR STUDY

RELATION TO DEMOGRAPHIC VARIABLES AND PRE- AND POSTTEST SCORES

  • Growth in reading tends to be slightly more correlated with SES and race than growth in math
  • Correlations between TRSG and pre- and post-tests are strongest among all the models
    • Correlation between TRSG and pretest around 0.5
    • Correlation between TRSG and posttest around 0.8
  • Correlations otherwise…
    • Between pretest and regression-based models: low
    • Between pretest and transition-based models: medium
    • Between posttest and regression-based models: higher
    • Between posttest and transition-based models: lower

OUR STUDY

THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING

  • Year 2008-2009

OUR STUDY

THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING

  • Year 2009-2010

OUR STUDY

THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3)

  • Math

OUR STUDY

THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3)

  • Reading

OUR STUDY

TEACHER EFFECTIVENESS AND TEACHER RELIABILITY

  • Square Root of Intra-Class Correlations for Year 2008-2009

OUR STUDY

TEACHER EFFECTIVENESS AND TEACHER RELIABILITY

  • Square Root of Intra-Class Correlations for Year 2009-2010

OUR STUDY

TEACHER EFFECTIVENESS AND TEACHER RELIABILITY

  • Year to Year Reliability of Teacher Effectiveness
  • Between 2008-2009 and 2009-2010

OUR STUDY

SCHOOL EFFECTIVENESS AND SCHOOL RELIABILITY

  • Sq. root of School Intra-Class Correlation for Year 2008-2009

OUR STUDY

SCHOOL EFFECTIVENESS AND SCHOOL RELIABILITY

  • Sq. root of School Intra-Class Correlation for Year 2009-2010

OUR STUDY

SCHOOL EFFECTIVENESS AND SCHOOL RELIABILITY

  • Year to Year Reliability of School Effectiveness
  • Between 2008-2009 and 2009-2010

OUR STUDY

COMPARISON BETWEEN SCHOOL AND TEACHER EFFECT

  • Levels of Effectiveness
  • 2008-2009

OUR STUDY

COMPARISON BETWEEN SCHOOL AND TEACHER EFFECT

  • Levels of Effectiveness
  • 2009-2010

OUR STUDY

METHODOLOGICAL ISSUES

  • Math Cohort 1 in Year 2008-2009

OUR CONCLUSIONS

The model you use can make a difference
  • Deciding how to balance status against growth
  • No standardization for the modeling of VAM
  • Traditional qualitative approaches used by principals are not likely to be an improvement on VAM
  • Using either approach for high stakes testing and decision-making seems premature
    • Combining two procedures that are not valid will not necessarily result in a valid system

OUR CONCLUSIONS

More sophisticated growth models
  • Would be nice to explore different models
    • Example: 4 level model
      • Many vertically scaled time points
      • Many subject matter assessments
      • Nested within students (level 2)
      • Nested within teachers (level 3)
      • Nested within school context (level 4)
    • Mixture and latent class models
      • Student and teachers as members of discrete groups that interact
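
One way to write down the four-level example above is as a random-intercept growth model. The notation is assumed for illustration and is not from the talk. Here $y_{tijk}$ is the score at time point $t$ for student $i$ of teacher $j$ in school $k$. The model also treats each student as nested within a single teacher, which is a simplification because students change teachers over time.

```latex
y_{tijk} = \beta_0 + \beta_1\,\mathrm{time}_{tijk} + r_{ijk} + u_{jk} + v_k + e_{tijk},
\qquad
r_{ijk} \sim N(0, \sigma_r^2),\;
u_{jk} \sim N(0, \sigma_u^2),\;
v_k \sim N(0, \sigma_v^2),\;
e_{tijk} \sim N(0, \sigma_e^2)
```

In this sketch $r_{ijk}$, $u_{jk}$, and $v_k$ are the student, teacher, and school effects at levels 2, 3, and 4, and $e_{tijk}$ is the occasion-level residual at level 1.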

OUR CONCLUSIONS

Interactions should be modeled
  • Why model teacher effects…
    • as if all students react the same way?
    • as if all teachers are the same over time?

  • School and classroom context effects
    • Should be investigated as well
  • Implications for how to create a learning science
    • May add to the modest results for teachers and schools


OUR CONCLUSIONS

Change in instruction involving supportive technology
  • The transition (paradigm shift) may be closer than we think
  • Cognitive scientists, computer scientists, econometricians, and engineers are beginning to study education
    • Field can be expected to change as researchers and their students change
  • The nature of teachers and instructional decision-making
    • Radical changes for the better are expected

OUR CONCLUSIONS

VAM for high stakes
  • Right now, I do not encourage it
  • It makes a difference what VAM model we implement
  • Choose the model based on policy decisions that capture the goals and intent of the school system
  • Factors not in the teacher’s control have an effect

OUR CONCLUSIONS

Relating VAM to what teachers are doing
  • Create causal models and explore with experiments

Interested in implementing a VAM?

Read Finlay and Manavi (2008)

  • Practical political issues of using VAM in schools
    • Unions, federal government, special education advocates…
  • Effective teaching requires good measurement; it presents a great challenge and is a worthy goal…

Questions?

Visit http://marces.org/Completed.htm to find references, the full text of this talk, our comparison of value-added models, and other projects.

Robert W. Lissitz

University of Maryland

Maryland Assessment Research Center for Education Success
