Robert L. Linn

Validity of Inferences from Test-Based Educational Accountability Systems. Robert L. Linn. Paper presented at the National Evaluation Institute, sponsored by the Consortium for Research on Educational Accountability and Teacher Evaluation (CREATE) and the Dallas Independent School District, Dallas, TX, July 7, 2006.

State Accountability Systems • Most states had test-based accountability systems before the enactment of NCLB • Systems varied • Grades and subjects • Reporting results • School report cards • Sanctions and/or rewards

State Accountability Systems • Systems varied • Current status • Progress • Combination • Assessing progress • Comparison of successive cohorts • Longitudinal tracking of individual students

NCLB States required to adopt “challenging academic content standards” that “specify what children are expected to know and be able to do; contain coherent and rigorous content; [and] encourage the teaching of advanced skills” (NCLB, 2001, part A, subpart 1, Sec. 1111, a (D)).

NCLB • States required to assess all students in grades 3 through 8 and one grade in high school in mathematics and reading/English language arts • Assessments must be aligned with state’s academic content standards

Definition of Proficient Achievement • NCLB: States must “describe two levels of high achievement (proficient and advanced) [and] a third level of achievement (basic)” • Setting levels left to the states, but must have all students at “proficient” level by 2014

Adequate Yearly Progress (AYP) • Central to the Accountability System of the No Child Left Behind (NCLB) Act of 2001 • States required to define AYP for the state, school districts, and schools in a way that enables all children to meet the state’s student achievement standards by 2014

Annual Measurable Objectives (AMOs) • Target percentages proficient or better in mathematics and reading/language arts • Set each year from 2002 to 2014 so that they lead to 100% proficient or above in 2014
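As a rough illustration of how a set of AMOs can lead to 100% proficient in 2014, the sketch below builds a straight-line trajectory from a baseline. The function name `linear_amos`, the 40% baseline, and the equal annual steps are assumptions made only for this example; the slide does not say how any particular state spaced its targets.

```python
def linear_amos(baseline_pct, start_year=2002, end_year=2014):
    """Return {year: target percent proficient}, rising in equal steps to 100.

    Equal annual increments are an illustrative assumption, not a rule
    taken from the presentation.
    """
    n_steps = end_year - start_year
    step = (100.0 - baseline_pct) / n_steps
    return {
        start_year + i: round(baseline_pct + i * step, 1)
        for i in range(n_steps + 1)
    }


# Hypothetical baseline: 40% of students proficient or above in 2002.
amos = linear_amos(40.0)
for year, target in amos.items():
    print(year, target)
```

Whatever the baseline, the final target is fixed at 100%, so a lower starting point implies a steeper required annual increase.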

Mixed Messages • State accountability system results and NCLB results often give conflicting indications of school success • A school that fails to make AYP may look good according to the state system and vice versa

Florida Example • 68% of schools got a grade of A or B and only 8.8% got a grade of D or F in 2004 • 77% of schools failed to make AYP in 2004 • 56% of schools that got an A in 2004 failed to make AYP that year

Confusion Regarding Mixed Messages The frequently conflicting messages about how schools are performing from state accountability and NCLB accountability systems “are confusing to the public” (Public Education Network, 2006, p. 8).

Test-based Accountability and School Effectiveness • School effectiveness – a causal inference • Must be able to eliminate plausible alternative explanations

Alternative Explanations • Prior achievement differences • Differences in student characteristics relevant to achievement • Differences in home support during school year

Alternative Explanations (cont’d) • Score inflation: “a gain in scores that substantially overstates the improvement in learning it implies” (Koretz, 2005) • Differential inflation of test scores

Inferences About School Effectiveness From AYP Inferences about school effectiveness from differences in student test performance at a fixed point in time are “scientifically indefensible” (Raudenbush, 2004)

AYP and School Effectiveness Current status on achievement tests used for purpose of NCLB accountability is “contaminated with factors other than school performance, in particular the average level of achievement prior to entering first grade – average effects of student family and community characteristics on student growth from first grade through the grade in which the student is tested” (Myers, 2000).

Current Status vs. Progress Measures • “If NCLB benchmarks are not reached, no amount of improvement can put a school in compliance with NCLB” (Public Education Network, 2006, p. 10). • A large majority (85%) of the public thinks that “school performance should be judged based on improvement shown” while only 13% think that it should be judged on the basis of the percentage of students who pass a test (Rose & Gallup, 2005, p. 55).

Progress Measures Progress of successive cohorts of students, longitudinal tracking of students, and value-added analyses can rule out some, but not all, of the alternative explanations of school differences in performance.

Value-Added Models • Value-added label implies causal interpretation of results. • But causal claims are not justified. • Value-added analyses “should not be seen as estimating causal effects of teachers or schools, but rather as descriptive measure” (Rubin, Stewart & Zanutto, 2004).

Vertical Scales • Scores treated as if exchangeable, but they do not meet the requirements of equating: • Measure the same constructs • Equal difficulty • Nearly equal reliability

Vertical Scales (cont’d) • Test difficulty increases with grade level by design • Mix of constructs changes with grade level • “For mathematics, for example, the tests at the 3rd grade measure predominately arithmetic skills. By 8th grade, the test shifts to problem solving, pre-algebra and algebra skills” (Reckase, 2004)

Making AYP • Conjunctive, multiple-hurdle approach • Many ways to fail but only one way to make AYP • Small school with homogeneous student body must clear 5 hurdles • Large school with diverse student body and enough students in each of 4 subgroups for disaggregated reporting must clear 21 hurdles
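The hurdle counts on this slide can be reproduced with a small sketch. The breakdown used here is an assumption, not something stated on the slide: each reportable group (all students plus each subgroup) faces a percent-proficient target and a participation-rate check in each of the two subjects, and the school as a whole faces one additional academic indicator. The names `count_ayp_hurdles` and `makes_ayp` are hypothetical.

```python
def count_ayp_hurdles(n_reportable_subgroups):
    """Count the hurdles a school must clear under the assumed breakdown."""
    subjects = 2             # mathematics and reading/English language arts
    checks_per_subject = 2   # assumed: percent-proficient target + participation rate
    groups = 1 + n_reportable_subgroups  # "all students" plus each subgroup
    other_indicator = 1      # assumed: one school-level additional indicator
    return groups * subjects * checks_per_subject + other_indicator


def makes_ayp(hurdle_results):
    """Conjunctive rule: AYP is made only if every single hurdle is cleared."""
    return all(hurdle_results)


small_homogeneous = count_ayp_hurdles(0)  # no subgroup large enough to report
large_diverse = count_ayp_hurdles(4)      # four reportable subgroups
print(small_homogeneous, large_diverse)

# A school that clears 20 of 21 hurdles still fails under the conjunctive rule.
print(makes_ayp([True] * 20 + [False]))
```

Under these assumptions the counts come out to 5 and 21, matching the slide, and the `all(...)` check makes concrete why there are many ways to fail but only one way to make AYP.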

Reporting on Subgroup Performance • Critical for monitoring the closing of gaps in achievement • No real relevance for small schools with homogeneous student bodies • However, it leads to many hurdles that large, diverse schools must meet

Subgroup Gains in NAEP Mathematics Scale Scores (1996 to 2005)

Closing Achievement Gaps: NAEP Mathematics Average Scale Scores (1996 to 2005)

Subgroup Gains in NAEP Reading Scale Scores (1998 to 2005)

Closing Achievement Gaps: NAEP Reading Average Scale Scores (1998 to 2005)

Apparent gains and changes in achievement gaps using NAEP achievement levels depend on choice of level, e.g., basic or above vs. proficient or above. See, for example, Holland, P. W. (2002). Two measures of change in gaps between CDFs of test score distributions. JEBS, 27, 3-17.

Subgroup Gains in NAEP Mathematics Percent at or Above Basic or Proficient (1996 to 2005)

Closing Achievement Gaps: NAEP Mathematics Percent at or Above Basic or Proficient (1996 to 2005)

Gaps and Percent Proficient or Above “Using differences in percents above cut scores can give a confusing impression of a rather simple situation” (Holland, 2002). Need to look beyond percents basic or above or proficient or above – average scale scores and comparisons of score distributions
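A minimal sketch of why percent-above-cut gaps can mislead, using made-up numbers rather than NAEP data. It assumes two normally distributed groups whose means are one standard deviation apart and both rise by half a standard deviation, so the scale-score gap does not change at all. The cut scores, the means, and the function names are hypothetical.

```python
from statistics import NormalDist


def pct_above(mean, sd, cut):
    """Percent of a normal(mean, sd) score distribution at or above the cut."""
    return 100.0 * (1.0 - NormalDist(mean, sd).cdf(cut))


def gap_change(cut, shift, mean_a=0.0, mean_b=-1.0, sd=1.0):
    """Change in the percent-above-cut gap when both groups gain `shift`.

    Both means move by the same amount, so the mean-score gap is constant.
    """
    before = pct_above(mean_a, sd, cut) - pct_above(mean_b, sd, cut)
    after = (pct_above(mean_a + shift, sd, cut)
             - pct_above(mean_b + shift, sd, cut))
    return after - before


# Hypothetical cut scores: a lower "basic" cut and a higher "proficient" cut.
basic_change = gap_change(cut=-1.0, shift=0.5)
proficient_change = gap_change(cut=1.0, shift=0.5)
print(round(basic_change, 1), round(proficient_change, 1))
```

With these assumed values the gap in percent at or above the lower cut narrows while the gap at the higher cut widens, even though the difference in average scores is unchanged. That is the kind of confusion that average scale scores and full distribution comparisons avoid.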

Comparing States on Closing Gaps Gaps measured in terms of percent proficient or above on state assessments could be quite misleading due to the wide variation in the stringency of state definitions of the proficient performance standard.

Conclusions 1. Test-based accountability systems are used to infer relative school effectiveness, but the validity of such inferences is dubious at best. 2. Although NCLB emphasizes “scientifically-based research,” the NCLB accountability results do not live up to that level of evidence. Inferences about school effectiveness based on AYP are not scientifically defensible.

Conclusions (continued) 3. Causal inferences about school effectiveness are not justified from test-based accountability systems regardless of whether they rely on current status measures, progress of successive cohorts, or value-added analyses of longitudinal data. 4. Accountability results can still be valuable if treated as descriptive measures and as a source of hypotheses that can be followed up by collecting information about instructional practices and about teacher and student characteristics.

Conclusions (continued) 5. Closing gaps in achievement is a worthwhile goal of NCLB and essential to achieving equity in education. 6. Measuring achievement gaps needs to involve more than tracking the percentage of students in various subgroups who are at the proficient level or above.
