Investigating the impact of empirically-developed rating scales on the assessment of students’ writing in the FPE Part I - Report Writing
A Presentation for PDD, Language Centre
May 9, 2012
by Farah Bahrouni [email protected]
Outline: Claim & Questions Study Methodology Data collection
recognize a priori that lge learners may not learn lge skills at the same pace & may display different levels in different skills at a given time (Hamp-Lyons, 1991; Weigle, 1994, 2002)
diagnose sts’ strong and weak areas (Bachman, 1990, 2007; Bachman & Palmer, 1996; Fulcher, 2000, 2010; Fulcher et al., 2011; North, 2003)
help Ts give constructive feedback (Alderson, 1991, 2007; Alderson et al., 1995)
help Ts & sts focus on weaknesses
give a picture of how much of the curriculum/LOs a st has achieved
help Ts reflect on their teaching = positive washback (Alderson, 1991, 2007; Alderson et al., 1995; Bachman, 1990, 2007; Bachman & Palmer, 1996; North, 2000, 2003; North & Schneider, 1998)
I argue that analytic scales as they are presented in the literature (see Bachman & Palmer 1996, pp.214-216, 275-280; Hamp-Lyons, 1992, in North, 2003, pp.78-79; Hughey et al., 1983; Shohamy, 1992, p. 33), to mention only these, are still holistic in nature in terms of construct definitions as well as band descriptors of the language components being assessed.
In a multi-cultural context similar to ours, with more than 230 teachers from about 30 different countries, such scales are still not enough to do away with idiosyncrasies in assessing writing in general. I am inclined to believe that more rigorous scales would leave raters minimal leeway to call on their personal experience when interpreting vague descriptors (Brindley, 1998).
Continuum of scoring methods (Hunter et al., 1996, p. 64)
Holistic Approaches <-> Analytic Approaches
General Impression Scoring → Holistic Scoring → Primary Trait Scoring → Analytic Scoring → Atomistic Scoring
The further to the right a scale sits on this continuum, the better it suits a multi-cultural context, because it yields more reliable results; hence this study:
Develop a new set of rating scales and compare it to the current one.
2) Are there any significant differences between the two sets of scales?
a. taken as whole sets (when looking at the total mark = the sum of the scores from the 4 components)?
b. in terms of categories?
3) Which of the two sets of rating scales yields less variation among raters?
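One simplified way to look at questions 2a and 3 is to sum each rater's four component scores into a total mark and then measure how far raters disagree on the same script. The sketch below is only an illustration: the rater names, the scores, and the choice of "mean standard deviation across raters" as the spread measure are all assumptions, not the study's data or its FACETS analysis.

```python
from statistics import mean, stdev


def total_mark(component_scores):
    """Total mark = the sum of the scores from the 4 components."""
    return sum(component_scores)


def mean_rater_spread(ratings):
    """Average, over scripts, of the standard deviation of total marks across raters.

    `ratings` maps a rater name to a list of scripts; each script is a tuple
    of four component scores. A smaller value means raters agree more closely.
    """
    n_scripts = len(next(iter(ratings.values())))
    spreads = []
    for script in range(n_scripts):
        totals = [total_mark(ratings[rater][script]) for rater in ratings]
        spreads.append(stdev(totals))
    return mean(spreads)


# Hypothetical scores: 3 raters, 2 scripts, 4 components per script.
current_scale = {
    "rater_A": [(3, 4, 3, 4), (2, 2, 3, 2)],
    "rater_B": [(4, 4, 4, 4), (3, 3, 3, 3)],
    "rater_C": [(2, 3, 3, 3), (1, 2, 2, 2)],
}
new_scale = {
    "rater_A": [(3, 4, 3, 4), (2, 2, 3, 2)],
    "rater_B": [(3, 4, 4, 4), (2, 3, 3, 2)],
    "rater_C": [(3, 3, 3, 4), (2, 2, 2, 2)],
}

spread_current = mean_rater_spread(current_scale)
spread_new = mean_rater_spread(new_scale)
print(f"current scale spread: {spread_current:.2f}")
print(f"new scale spread:     {spread_new:.2f}")
```

With these invented numbers the new scale shows the smaller spread, but that outcome is built into the example and says nothing about the real scales.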
What are we after?
A rating scale functioning properly is expected to show:
Studied the FPE LOs: 22 testable writing LOs
Q.: writing features to be assessed; CCs & Prog. Ts; Likert scale
Selection: items ranked at 3 & 4 confirm with CCs
Scrutinized 65 sts’ live reports for extra features = 38 Fs
2.1 Data collection (I)
# categ. Ts suggest
wht Fs go in each cat.
scale (# lvls/pnts) Ts sug. for each cat.
Ts’ description/definition = wht each lvl/pnt means
Piloting: 7 teachers scored 10 samples twice (in the analysis, discarded 2 = left with 5)
2.2 Data collection (II)
Analysis: FACETS + One-Way ANOVA
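As a reminder of what the One-Way ANOVA step computes, the sketch below derives the F statistic by hand: the between-group mean square divided by the within-group mean square. The grouping (one list of marks per group) and the numbers are hypothetical, and the FACETS (many-facet Rasch) part of the analysis is not reproduced here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical marks for three groups (for example, three raters).
marks = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(marks)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Turning F into a p-value requires the F distribution; a library routine such as `scipy.stats.f_oneway` reports both, which is probably the more practical choice for real data.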
1.1 EXCEL (Appendix 1)
1.2 ANOVA (Appendix 2)
2. Categories Statistics from FACETS & ANOVA (Appendix 3)
3. Implication & significance:
(changes made in version 3, which needs to be piloted)
Alderson, J. C. (1991). Bands and Scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan Publishers Limited.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research. Cambridge: Cambridge University Press.
Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.
Fulcher, G. (2010). Practical Language Testing. Hodder Education.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex Publishing Corporation.
Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The Use of Holistic versus Analytic Scoring for Large-Scale Assessment of Writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.
North, B. (2000). The development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. P. Lang.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and Qualitative approaches. University of California, Los Angeles.
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.