
Presentation Transcript


    1. Benchmark Assessments: Promises and Perils (Are we wasting our resources?) James H. McMillan Virginia Commonwealth University

    2. Roadmap: a few questions for the audience; data-driven instruction; benchmark test characteristics; formative assessment; research on benchmark testing; MERC study; recommendations

    5. Commercial Influence "The formative assessment market is considered 'one of the fastest-growing segments of test publishing' - predicted to generate revenues of $323 million for vendors" (Olson, 2006).

    7. This is what we are doing: Scores → Inference (conclusion, claim) → Use (consequence)

    8. More accurately: Score + Error → Inference (conclusion, claim) → Use (consequence)

    9. Why Add Error? Error in testing (e.g., bad items, administrative differences); limited coverage of important objectives due to sampling; unethical test preparation; single indicator; inflation (not your $)

    10. Test Score Inflation "Score inflation refers to increases in scores that do not signal a commensurate increase in proficiency in the domain of interest... gains on scores of high-stakes tests are often far larger than true gains in students' learning... [this] leads to an illusion of progress and to erroneous judgments about the performance of schools... it cheats students who deserve better and more effective schooling." Dan Koretz, Measuring Up: What Educational Testing Really Tells Us, Harvard University Press, 2008.

    11. How Much and What Kind of Error for Individual Scores? Test Score + Error → Conclusion
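
The size of the error band around a single benchmark score can be estimated from the test's reliability. Below is a minimal sketch, assuming a published reliability coefficient and score standard deviation; the numeric values are illustrative and not taken from any particular benchmark assessment.

```python
# Minimal sketch: how much uncertainty surrounds one benchmark score?
# Assumes a published reliability coefficient and score standard deviation;
# the values below are illustrative, not from any particular test.
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate confidence band around an observed score (observed +/- z * SEM)."""
    e = sem(sd, reliability)
    return observed - z * e, observed + z * e

if __name__ == "__main__":
    # Illustrative values: scale SD of 42 and reliability of 0.85.
    low, high = score_band(observed=394, sd=42, reliability=0.85)
    print(f"SEM = {sem(42, 0.85):.1f}; 95% band: {low:.0f} to {high:.0f}")
```

A band this wide is the practical meaning of "Score + Error": single-score conclusions (remediation, false negatives, false positives) should be made with the band, not the point estimate, in mind.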

    12. Inferences from Low Scores: instruction not aligned; poor teaching; poor item(s); student weakness (individual students or whole class); remediation; false negative

    13. Inferences from High Scores: instruction well aligned; good teaching; student competence; instructional enhancements; false positives

    14. Table Time to Talk How can we account for the error that is a part of every benchmark assessment we give, whether 1) commercially prepared or 2) developed at the local level?

    15. Features of Formative Assessment A process of several components, not simply a test; used by both teachers and students; takes place during instruction; provides feedback to students; provides instructional adjustments or correctives

    16. Formative Assessment Cycle

    17. Formative Assessment Characteristics

    20. Table Time to Talk In your division/school, is benchmark = formative? To what extent is benchmark testing in your division/school very or barely formative?

    21. Research: 2007 IES Study Purpose: Predictive validity of four commercial benchmark tests used in Delaware, Maryland, New Jersey, Pennsylvania, and Washington, DC Method: Analysis of predictive correlations of district and individual student scores with state assessments Result: Content was well matched, but "evidence is generally lacking of the predictive validity with respect to state assessment tests."
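
The predictive-validity question the IES study examined can be illustrated with a small sketch: correlate students' benchmark scores with their later state test scores. The paired scores below are hypothetical, and this is not the IES analysis itself, only the general idea.

```python
# Minimal sketch of a predictive-validity check: correlate students' fall
# benchmark scores with their spring state test scores. The paired scores
# here are hypothetical; this is not the IES analysis itself.
from statistics import correlation  # Python 3.10+

benchmark = [412, 388, 450, 395, 430, 370, 405, 441]   # fall benchmark scale scores
state_test = [418, 380, 460, 400, 425, 365, 410, 450]  # spring state test scale scores

r = correlation(benchmark, state_test)
print(f"Predictive correlation (Pearson r): {r:.2f}")
```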

    22. 2008 IES Study Purpose: Study effects of using benchmark assessments for grade 8 math in Massachusetts Method: Quasi-experimental comparison of 22 intervention and 44 comparison schools Result: No significant differences in student achievement on state tests

    23. CTB/McGraw-Hill, 2009 AERA Presentation Research-based assessments with sound technical foundations. Rapid or immediate turnaround of data and reports to support and modify instruction. Component flexibility. In response to market need and customer requests, we developed a system built on a sound technical base, with flexibility, that supports modifying instruction by providing data in a useful way.

    24. Flexibility Publisher-developed components can be chosen as desired: types of tests (predictive tests reflecting state tests; diagnostic tests reflecting pacing, scope, and sequence), numbers of tests/frequency of administration, and item types (MC, GR, CR). A flexible system is required to meet user needs. Users can select various types of publisher-developed tests and administer them as desired. If the full battery of predictive and diagnostic assessments is selected, then testing occurs approximately monthly. Diagnostic tests may replace less reliable teacher-made tests that take teachers' time to develop and score.

    25. Flexibility Teachers can: create custom tests using a pre-populated item bank aligned to state standards; share custom-developed forms with other teachers on the system; write new items (or instructional activities) using item-authoring software; and share the new items or instructional exercises on their local system.

    26. Flexibility Administration and data capture modes: paper and pencil (scan and score answer sheets), online, and student response devices or "clickers." Instructional activities can be assigned directly from online reports, automatically or manually.

    27. Empirical Data Studies: Did Forms Achieve Appropriate Reliability and Desired Difficulty? Reliability is monitored for publisher-developed tests. Test difficulty is calibrated to be developmentally appropriate. Tests administered later in the year are developed to be more difficult than tests administered earlier. Test difficulty is intended to match student growth, so average proportion correct should be approximately constant over the year, as illustrated by the Colorado Language Arts assessments (.6, .61, .61 for forms administered in Sept, Dec, and Feb, respectively). Other forms indicated here are within .05 in difficulty over the year.
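
The monitoring described above, form difficulty as average proportion correct plus a reliability check, can be sketched from a simple 0/1 item-response matrix. The responses below are invented, and Cronbach's alpha stands in for whatever reliability index the publisher actually reports.

```python
# Minimal sketch of form monitoring: average proportion correct (difficulty)
# and Cronbach's alpha (internal consistency) from a 0/1 item-response matrix.
# The response matrix below is invented for illustration.
students = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
]

n_items = len(students[0])
totals = [sum(row) for row in students]

# Form difficulty: mean proportion correct across all items and students.
p_correct = sum(totals) / (len(students) * n_items)

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Cronbach's alpha = (k / (k - 1)) * (1 - sum(item variances) / total-score variance).
item_vars = [variance([row[i] for row in students]) for i in range(n_items)]
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / variance(totals))

print(f"Mean proportion correct: {p_correct:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
```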

    29. Different numbers of items appear under each standard and benchmark. Training and caveats are provided to use the results responsibly. Note: Only the portion of the Grade Level Expectations (GLE) covered by the assessed curriculum is measured on this form of the Diagnostic Assessment. Thus, inferences from students' performances should not be made to the GLE as a whole, but only to the assessed portion of the GLE. A specific GLE is measured by items on this form only when the GLE comprised at least five percent of the assessed content as indicated by the pacing guide; GLEs that did not comprise at least five percent of the curriculum are not measured by this form. Also, the reported results for GLEs measured with fewer items are less reliable than for GLEs measured with more items. Thus, when small numbers of items are used to measure a GLE, other measures (e.g., observations, homework, etc.) should be used to confirm the results reported here.

    30. Class Report The Predictive Class Assessment Report shows the average scale score for a class and how the class is predicted to perform on CRCT in the spring of each year. A—The percentage of students in a particular class who are expected to fall into each proficiency category on CRCT. For example, 4% of students in this class are expected to fall into the "Exceeds" category on CRCT. The categories listed will include the three CRCT proficiency levels—Does Not Meet, Meets, Exceeds. B—The average scale score of the class—394. The number in parentheses is the Standard Deviation (SD). In this case, the SD is 42. C—The minimum/maximum scale score range for the Acuity assessments. The scale scores range from 230 to 590.
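
A class summary of the kind this report describes (mean, SD, and the share of students predicted to land in each proficiency category) can be sketched as follows. The predicted scores and the cut points below are hypothetical, not the actual Acuity or CRCT scales.

```python
# Minimal sketch of a class summary like the Predictive Class Report:
# class mean, standard deviation, and the share of students whose predicted
# scores fall into each proficiency category. Scores and cut points are
# hypothetical, not the actual CRCT or Acuity scales.
from statistics import mean, stdev

predicted_scores = [350, 372, 401, 420, 389, 415, 360, 444, 398, 405]

# Hypothetical cut points: below 380 = Does Not Meet, 380-429 = Meets, 430+ = Exceeds.
def category(score):
    if score < 380:
        return "Does Not Meet"
    if score < 430:
        return "Meets"
    return "Exceeds"

counts = {"Does Not Meet": 0, "Meets": 0, "Exceeds": 0}
for s in predicted_scores:
    counts[category(s)] += 1

print(f"Class mean: {mean(predicted_scores):.0f} (SD {stdev(predicted_scores):.0f})")
for label, n in counts.items():
    print(f"{label}: {100 * n / len(predicted_scores):.0f}%")
```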

    31. Longitudinal Reports The Student Longitudinal Report demonstrates the progress by scale score of an individual student on each Acuity assessment. The Longitudinal Report will not be displayed until two Acuity assessments have been taken. The report will display information over a three-year period beginning in the 2008-09 school year. A—A specific scale score range for the Acuity assessments. B—A line graph that shows a series of scale scores from Acuity assessments. The scale score from each assessment is indicated by a point symbol, and the band around this symbol is the Standard Error of Measurement (SEM). C—Acuity assessment forms included in this longitudinal report. On this report, the scale scores represent data points from the three predictive assessments taken by a student in third, fourth and fifth grades.

    32. MERC Study Purpose: Explore the extent to which benchmark test results are used by teachers in formative ways to support student learning. What is the policy context and nature of benchmark testing? How do teachers use benchmark testing data in formative ways? What factors support and/or mitigate teachers' formative use of benchmark testing data?

    33. Methods Role of MERC study team. Qualitative double-layer category focus-group study design (Krueger & Casey, 2009). Layers: school type (elementary or middle) and district (N=4). A protocol was developed and piloted covering the general nature of benchmark testing policies and the type of data teachers receive; expectations for using benchmark test results; instructional uses of benchmark test results; and general views on benchmark testing policies, practices, and procedures. Focus groups lasted 1-1.5 hours, included 4-5 participants, and were digitally recorded.

    34. Participants A two-stage convenience sampling process was used to select and recruit focus group participants: District → School Principal → Teachers. Spring 2009: 9 focus groups with 40 core-content area teachers across 4 districts. The majority were white (85%) and female (90%), with an average of 12.5 years of teaching experience (range of 1-32 years). 25% were beginning teachers with 1-3 years of teaching experience and 25% had been teaching for over 20 years. The majority (80%) taught at the elementary level in grades 4 and 5, and the remaining were middle school teachers in the areas of civics, science, mathematics and language arts.

    35. Preliminary Findings: Informing Instruction 1. Teachers make a variety of instructional adjustments based on the results of benchmark assessments, especially when there is an expectation or culture established for using data. “If I see a large number of my students missing in this area, I am going to try to re-teach it to the whole class using a different method. If it is only a couple of [students], I will pull them aside and instruct one-on-one.” “We are asked to be accountable for each and every one of those students and we sit face-to-face with an administrator who says to you, ‘how are you going to address their needs?’ And we have to be able to say, well, I am pulling them for remediation during this time, or I am working with a small group or I’ve put them in additional enrichment…we have got to be able to explain how we are addressing those weaknesses.”

    36. Preliminary Findings: Learning Time v. Testing Time 2. Teachers have significant concerns about the amount of instructional time that is devoted to testing in general and the implications of this for the quality of instruction they can provide. "I think it has definitely made us change the way we teach … I do feel like sometimes I don't teach things as well as I used to because of the time constraints." "Just the time it takes to give all these assessments. As important as these assessments are, it does take instructional time… we don't just do those [benchmark assessments], because we do a lot of pre- and post-assessments so this is just one more thing on top of a lot of other testing we do." "You are sacrificing learning time for testing time…we leave very little time to actually teach."

    37. Preliminary Findings: Value of Test Results 3. The value teachers place on benchmark testing data is associated with their views on the quality of the test items, the integrity of the scoring process, and the alignment of the test with the curriculum. “We really need to focus on the tests being valid. It is hard to take it seriously when you don’t feel like it is valid. When you look at it and you see mistakes or passages you know your students aren’t going to be able to read because it is way above their reading level. The people writing the tests need some kind of training.” “Sometimes, I think…you have to use your professional judgment…sometimes the questions that are on the test are just simply bad questions… was it because they didn’t understand cause and effect or is it because that was really a poorly written question?”  

    38. “Many times the 9-week assessments are so all encompassing that it is difficult for the students….you may only have one question that addresses a specific objective. And so that is not really a true representation of what the child knows about that objective.”

    39. Preliminary Findings: Instructional Benefits Teachers' views on benchmark testing policies are somewhat positive; they recognize the benefits to their instruction and students' learning. "It [benchmark test results] helps me analyze my instruction as a whole. I didn't present the questions in my classroom the way the questions are presented on the test. So maybe I need to back track or maybe I didn't spend a lot of time in this area, and I need to go back and [re-teach]." "They are one more piece you can have and when you compare it across the county you can use it to see how you are doing…but again, it just shows that it isn't the only thing we base our instructional decisions on by all means." "It tells you what you need to place more emphasis on. It really alerts you to the weaknesses of your class and how much more practice you need to provide. I think they [the benchmark tests] are excellent, I think they are great because of the correlation with the SOL tests and this is one way of getting the children prepared and familiar with the formatting of the SOL test so that they will be successful."

    40. Conclusions Preliminary findings to date seem consistent with the literature. School culture and expectations for the use of test results are key factors in how results are used. The results so far indicate that, under the right conditions, benchmark testing may serve a meaningful formative purpose. Additional focus groups are planned for fall 2009 to test preliminary findings.

    41. Recommendations for Effective Use of Benchmark Assessments Clarify Purpose. Instructional: to adapt instruction and curriculum to meet student needs (content, pace, and strategies; whole class or small groups; re-teaching); to identify low-performing students; to implement teacher strategies for using the information. Program Evaluation: to compare different instructional programs; to determine instructional effectiveness; to modify curriculum and instruction in the future. Predicting State Test Results.

    42. Recommendations for Effective Use of Benchmark Assessments Use About 30 High-Quality Items: 3-5 items per topic or trait for diagnostic information; pilot items for discrimination, difficulty, fairness, and clarity. Establish Alignment Evidence: content matched to SOLs; numbers of items matched to Reporting Category percentages; content matched to instruction and opportunity to learn; cognitive demands of items. Provide Clear Guidelines for Using Results. Standardize Administration Procedures.
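
Two of the piloting statistics named in this recommendation, item difficulty and item discrimination, can be sketched from a small 0/1 response matrix. The data below are invented, and discrimination here is the point-biserial correlation between an item and the total of the remaining items.

```python
# Minimal sketch of item piloting statistics: difficulty (proportion correct)
# and discrimination (point-biserial correlation between an item and the
# total of the remaining items). The 0/1 responses are invented.
from statistics import correlation  # Python 3.10+

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]

n_items = len(responses[0])
for i in range(n_items):
    item = [row[i] for row in responses]
    rest = [sum(row) - row[i] for row in responses]   # rest-of-test score
    difficulty = sum(item) / len(item)                # proportion correct
    discrimination = correlation(item, rest)          # point-biserial
    print(f"Item {i + 1}: difficulty {difficulty:.2f}, discrimination {discrimination:.2f}")
```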

    43. Recommendations for Effective Use of Benchmark Assessments Verify Results with Other Evidence (classroom tests, contrasted groups). Include Estimates of Error. Monitor Unintended Consequences. Ensure Fairness (equitable treatment, opportunity to learn). Document Costs (financial, student time, teacher time).

    44. Recommendations for Effective Use of Benchmark Assessments Evaluate Use of Results: what evidence exists that teachers are using results to modify instruction and that students are learning more? Use Teams of Teachers for Review and Analysis. Provide Adequate Professional Development. Commercial or Locally Prepared?

    45. Commercial or Locally Developed?

    46. Benchmark Assessments: Promises and Perils (Are we wasting our resources?) James H. McMillan Virginia Commonwealth University
