
Welcome Reliability Programme: Leading the way to better testing and assessments 22 March 2011


Presentation Transcript


  1. Welcome Reliability Programme: Leading the way to better testing and assessments 22 March 2011 Event Chair: Dame Sandra Burslem, DBE, Ofqual's Deputy Chair

  2. Welcome and Setting the Scene Glenys Stacey, Ofqual Chief Executive

  3. Ofqual’s Reliability Programme Dennis Opposs

  4. Background – Reliability: quantifying the luck of the draw. Reliability work in England has generally been isolated, partial, under-theorised, under-reported and misunderstood. Ofqual’s Reliability Programme aimed to improve the situation.

  5. Aims To gather evidence for Ofqual to develop regulatory policy on the reliability of results from national tests, examinations and qualifications.

  6. Programme structure Strand 1: generating evidence of reliability. Strand 2: interpreting and communicating evidence of reliability. Strand 3: developing reliability policy – Strand 3a: exploring public understanding of reliability; Strand 3b: developing Ofqual policy on reliability.

  7. Our Technical Advisory Group: Alastair Pollitt, Anton Beguin, Jo-Anne Baird, Paul Black, Gordon Stanley

  8. Strand 1 – Generating evidence Synthesising pre-existing evidence (literature reviews); generating new evidence (monitoring existing practices, experimental studies)

  9. Strand 2 – Interpreting and communicating evidence How do we conceptualise reliability? How do we interpret our findings? How do we communicate our findings?

  10. Strand 3 – Developing policy Exploring public understanding of, and attitudes towards, assessment error; stimulating national debate on the significance of the reliability evidence generated by the programme; developing Ofqual’s policy on reliability

  11. Student misclassification Controversial area - earlier conclusions include: “… it is likely that the proportion of students awarded a level higher or lower than they should be because of the unreliability of the tests is at least 30% at key stage 2” Wiliam, D. (2001). Level best? London: ATL. “Professors Black, Gardner and Wiliam argued […] that up to 30% of candidates in any public examination in the UK will receive the wrong level or grade” House of Commons Children, Schools and Families Committee. (2008a). Testing and Assessment. Third Report of Session 2007–08. Volume I. HC 169-I. London: TSO. Is this accurate?
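The 30% figure can be examined with a simple model: under classical test theory, some misclassification follows mechanically from a test's reliability and the placement of level boundaries. Below is a minimal illustrative simulation; every number in it is assumed for illustration and is not taken from the programme's own studies.

```python
# Illustrative simulation: how much misclassification follows from a
# given reliability? All numbers here are assumed, for illustration.
import random

random.seed(1)

RELIABILITY = 0.88                 # assumed test reliability
SD_OBSERVED = 9.0                  # assumed SD of observed scores (marks)
SEM = SD_OBSERVED * (1 - RELIABILITY) ** 0.5   # classical SEM
CUTS = [20, 30, 40]                # assumed level boundaries, 50-mark test

def level(score):
    """Return the level implied by a score (0..len(CUTS))."""
    return sum(score >= c for c in CUTS)

N = 100_000
misclassified = 0
for _ in range(N):
    # Under CTT, true-score SD = observed SD * sqrt(reliability).
    true = random.gauss(28.0, SD_OBSERVED * RELIABILITY ** 0.5)
    observed = true + random.gauss(0.0, SEM)
    if level(observed) != level(true):
        misclassified += 1

print(f"SEM = {SEM:.1f} marks")
print(f"misclassified: {100 * misclassified / N:.1f}%")
```

Even with alpha around 0.88, a substantial minority of simulated candidates sit close enough to a boundary to be placed at the wrong level, which is why the debate is about how to interpret and report such figures rather than about whether measurement error exists.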

  12. Strand 1 – Generating evidence (1) National Curriculum tests: the reliabilities of KS2 science pre-tests and the stability of consistency over time; the reliabilities of the 2008 KS2 English reading pre-test. General qualifications: the reliabilities of GCSE components/units; the reliability of GCE units. Vocational qualifications.

  13. Strand 1 – Generating evidence (2) KS2 science pre-tests: the reliabilities of KS2 science tests over five years. Values of internal consistency reliability (alpha) generally over 0.85; classification accuracy (pre-tests) 83%-88%; classification consistency (between pre-tests and live tests) 72%-79%. Reliability indices were relatively stable over time, and reliability was relatively high compared with similar tests.
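For readers unfamiliar with the alpha statistic quoted above, here is a minimal sketch of Cronbach's alpha; the item-score matrix is invented for illustration.

```python
# Minimal sketch of Cronbach's alpha: the ratio-based internal
# consistency statistic quoted on the slide above.
def cronbach_alpha(scores):
    """scores: list of candidates, each a list of per-item marks."""
    k = len(scores[0])                     # number of items
    n = len(scores)                        # number of candidates

    def var(xs):                           # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([cand[i] for cand in scores]) for i in range(k)]
    total_var = var([sum(cand) for cand in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: 4 candidates x 3 items (made up).
data = [[2, 3, 2], [1, 1, 0], [3, 3, 3], [0, 1, 1]]
print(f"alpha = {cronbach_alpha(data):.2f}")
```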

  14. Strand 1 – Generating evidence (3) A KS2 English reading pre-test: data collected in 2007 during pre-testing of the 2008 KS2 English reading test, which contained 34 items worth a total of 50 marks (mean 28.5, standard deviation 9.1, 1,387 pupils). Internal consistency reliability 0.88; standard error of measurement 3.1; classification accuracy (IRT) 83%; classification consistency (IRT) 76%.
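The quoted standard error of measurement is consistent with the classical relation SEM = SD x sqrt(1 - reliability). A quick check using the slide's rounded figures (the result differs slightly in the second decimal, as expected from rounding):

```python
# CTT relation: SEM = SD * sqrt(1 - reliability).
sd, alpha = 9.1, 0.88               # rounded figures from the slide
sem = sd * (1 - alpha) ** 0.5
print(f"SEM = {sem:.2f} marks")     # 3.15, close to the quoted 3.1
```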

  15. Strand 1 – Generating evidence (4) Cronbach’s alpha for GCSE components/units

  16. Strand 1 – Generating evidence (5) Cronbach’s alpha for GCE units

  17. Strand 1 – Generating evidence (6) Assessor agreement rates for a workplace-based vocational qualification

  18. Strand 1 – Generating evidence (7) The 2009 and 2010 live tests (populations)

  19. Strand 2 – Interpreting and communicating evidence (1) External research projects: estimating and interpreting reliability based on CTT; estimating and interpreting reliability based on CTT and G-theory; quantifying and interpreting GCSE and GCE component reliability based on G-theory; reporting of results and measurement uncertainties; representing and reporting of assessment results and measurement uncertainties in some USA tests; reliability of teacher assessment. Internal research projects: reliability of composite scores at qualification level, based on CTT, G-theory and IRT.
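One quantity at stake in the internal project on composite scores is the reliability of a weighted composite of units. A sketch of the standard CTT formula follows, assuming uncorrelated errors across components; the programme's own reports may use different estimators, and all numbers below are invented.

```python
# Sketch of composite reliability under CTT (uncorrelated errors):
#   rho_C = 1 - sum(w_i^2 * s_i^2 * (1 - r_i)) / s_C^2
# where w_i are component weights, s_i component SDs, r_i component
# reliabilities, and s_C^2 the variance of the weighted composite.
def composite_reliability(weights, sds, rels, composite_var):
    error_var = sum(w ** 2 * s ** 2 * (1 - r)
                    for w, s, r in zip(weights, sds, rels))
    return 1 - error_var / composite_var

# Made-up two-unit qualification: equal weights, correlated units.
w, s, r = [1.0, 1.0], [9.0, 12.0], [0.85, 0.90]
unit_corr = 0.7
comp_var = s[0] ** 2 + s[1] ** 2 + 2 * unit_corr * s[0] * s[1]
print(f"composite reliability = {composite_reliability(w, s, r, comp_var):.2f}")
```

Note that the composite can be more reliable than either unit alone, because correlated true scores accumulate while (assumed independent) errors partially cancel.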

  20. Strand 2 – Interpreting and communicating evidence (2) Reporting results and associated errors (students and parents)

  21. Strand 2 – Interpreting and communicating evidence (3) Technical seminars covered: factors that affect the reliability of results from assessments; the definition and meaning of different forms of reliability; statistical methods used to produce reliability estimates; representing and reporting assessment results and reliability estimates / measurement errors; improving reliability and its implications; disseminating reliability statistics; the tension in managing public confidence whilst exploring and improving reliability; operational issues for awarding bodies in producing reliability information; challenges posed by the reliability programme in vocational qualifications.

  22. Strand 2 – Interpreting and communicating evidence (4) International perspective on reliability: reliability studies should be built into the assessment quality assurance process; information on reliability (primary and derived indices) should be in the public domain; the introduction of information about reliability (misclassification / measurement error) should be managed carefully; educating the public to understand the concept of reliability (measurement error) is seen as playing an important part in alleviating misinterpretation by the media; the reporting of results and measurement error can be complex, as results are normally used by multiple users; primary reliability indices and classification indices should be reported at population level; standard error of measurement should be reported at individual test-taker level.

  23. Strand 3a – Public perceptions of reliability (1) External research projects: Ipsos MORI survey; Ipsos MORI workshops; AQA focus groups. Internal research project: online questionnaire survey. Investigating: understanding of the assessment process; understanding of factors affecting performance on exams; understanding of factors introducing uncertainty in exam results; the distinction between inevitable errors and preventable errors; tolerance for errors in results; disseminating reliability information.

  24. Strand 3a – Public perceptions of reliability (2) Views on accuracy of GCSE grades

  25. Strand 3a – Public perceptions of reliability (3) Views on national exams system

  26. Strand 3b – Developing Ofqual reliability policy (1) Ofqual reliability policy is based on: evaluating findings from this programme; evaluating findings from other reliability-related studies; reviewing current practices adopted elsewhere.

  27. Ofqual Board recommendations • Continue work on reliability as a contribution to improving the quality assurance of qualifications, examinations and tests • Encourage awarding organisations to generate and publish reliability data • Continue to improve public and professional understanding of reliability and increase public confidence

  28. Next steps • Publishing a reliability compendium later this year • Reliability work becomes “business as usual” • Developing further policy

  29. Today • Presentations from the Technical Advisory Group and experts in teaching, assessment research and communications • Question and answer session • Tell us your opinions or email them to reliability.programme@ofqual.gov.uk

  30. Findings from the Reliability Research Professor Jo-Anne Baird, Technical Advisory Group Member

  31. Refreshment Break

  32. A view from the assessment community Paul E. Newton Director, Cambridge Assessment Network Division Presentation to Ofqual event The reliability programme: leading the way to better testing and assessments. 22 March 2011.

  33. We need to talk about error

  34. Talking about error

  35. The Telegraph (front page)

  36. The professional justification: what the profession needs to accomplish through talking about error

  37. The bad old days Boards seem to have strong objections to revealing their mysteries to ‘outsiders’ […] There have undoubtedly been cases of inquiries […] where publication would have been in the interests of education, and would have helped to prevent the spread of ‘horror-stories’ about such things as lack of equivalence which is an inevitable concomitant of the present cloak of secrecy. Wiseman, S. (1961). The efficiency of examinations. In S. Wiseman (Ed.). Examinations in education. Manchester: MUP.

  38. Promulgating the myth However, any level of error has to be unacceptable – even just one candidate getting the wrong grade is entirely unacceptable for both the individual student and the system. QCA. (2003). A level of preparation. TES Insert. The TES, 4 April.

  39. The technical justification: why users and stakeholders need to know about error

  40. Using knowledge of error Students and teachers: maybe you’re better, or worse, than your grades suggest. Employers and selectors: maybe such fine distinctions shouldn’t be drawn; maybe other information should be taken into account. Parents: maybe that difference in value added is insignificant; maybe inferences like that should not be drawn. Awarding bodies: maybe that examination (structure) is insufficiently robust. Policy makers: maybe that proposed use of results is illegitimate; maybe that policy change will compromise accuracy.

  41. Talking about error The commitment to greater openness and transparency about error is nothing new, but there is still a long way to go.

  42. The 20-point scale (1969-72) The presentation of results on (i) the broadsheet will be by a single number denoting a scale point for each subject taken by each candidate, accompanied by a statement on the range of uncertainty; and (ii) the candidate's certificate as a range of scale points (e.g. 13-17, corresponding to 15 on the broadsheet and indicating a range of uncertainty of plus or minus 2 scale points). Schools Council (1971). General Certificate of Education. Introduction of a new A-level grading scheme. London: Schools Council.
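The certificate format in the 1971 proposal is simple to state programmatically. A sketch follows; clipping at the ends of the 1-20 scale is an assumption, as the quoted documents do not specify edge cases.

```python
# Sketch of the 1971 certificate format: a scale point reported as a
# range of plus or minus 2 points. Clipping to 1-20 is assumed here.
def certificate_range(point, uncertainty=2, lo=1, hi=20):
    return f"{max(lo, point - uncertainty)}-{min(hi, point + uncertainty)}"

print(certificate_range(15))   # "13-17", the example from the slide
```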

  43. The 20-point scale (1969-72) The following rubric is proposed, to be prominently displayed on both broadsheets and certificates: "Attention is drawn to the uncertainty inherent in any examination. In terms of the scale on which the above results are recorded, users should consider that a candidate's true level of attainment in each subject while possibly represented by a scale point one or two higher or lower, is more likely to be represented by the scale point awarded than by any other scale point [...]." Report by the Joint Working Party on A-level comparability to the Second Examinations Committee of the Schools Council on grading at A-level in GCE examinations. (1971)

  44. 20-point scale (1983-86) It was proposed that the new scheme should have the following characteristics: [...] (d) results should be accompanied by a statement of the possible margin of error. JMB (1983). Problems of the GCE Advanced level grading scheme. Manchester: Joint Matriculation Board.

  45. Talking about error There is disagreement within the profession over the concept of error but, at least, we are beginning to make these differences of opinion more explicit.

  46. Measuring attainment

  47. Judging performance I argue that there is a strong case for saying that it is more sensible to accept that exams are just about fair competition – which means your performance must be reliably turned into a score but you accept as the luck of the draw things like the question paper being tough for you or having hay fever on the day, etc. Moreover, I think if you do that you can design things like regulatory work on reliability so that they reflect the priorities of the public. This was behind my first question to you about your presentation yesterday – do you really think Joe Public is interested in Cell 6? That’s an empirical question of course; I think the answer is no, but I’d love to find out for sure. Mike Cresswell, 20 October 2009, personal communication

  48. Uses of reliability information Evaluation and improvement: highly technical (detailed & specific & idiosyncratic); obscure (typically not published); primary users = awarding bodies. Accountability: technical (but how detailed & generic & uniform?); translatable (published but not necessarily disseminated); primary users = regulator & analysts. Education: non-technical (uncomplicated & generic & uniform); translated (widely disseminated); primary users = members of the public.

  49. For education: how can we achieve greater openness and transparency?

  50. The Sun
