Investigating the impact of empirically-developed rating scales on the assessment of students’ writing in the FPE Part I - Report Writing

A Presentation for PDD, Language Centre, May 9, 2012, by Farah Bahrouni



  Outline

  • Claim & Questions

  • Study

    • Methodology

    • Data collection

    • Analysis: FACETS & One-Way ANOVA

    • Results

  • Conclusion

    • Implication & Significance

  • Claim

    Analytic scales:

    • recognize a priori that language learners may not learn language skills at the same pace and may display different levels in different skills at a given time (Hamp-Lyons, 1991; Weigle, 1994, 2002)

    • diagnose students’ strong and weak areas (Bachman, 1990, 2007; Bachman & Palmer, 1996; Fulcher, 2000, 2010, 2011; North, 2003)

    • help teachers give constructive feedback (Alderson, 1991, 2007; Alderson et al., 1995)

    • help teachers and students focus on weaknesses

    • give a picture of how much of the curriculum/learning outcomes (LOs) a student has achieved

    • help teachers reflect on their teaching = positive washback (Alderson, 1991, 2007; Alderson et al., 1995; Bachman, 1990, 2007; Bachman & Palmer, 1996; North, 2000, 2003; North & Schneider, 1998)


    I argue that analytic scales as they are presented in the literature (see, to mention only these, Bachman & Palmer, 1996, pp. 214-216, 275-280; Hamp-Lyons, 1992, in North, 2003, pp. 78-79; Hughey et al., 1983; Shohamy, 1992, p. 33) are still holistic in nature, both in their construct definitions and in the band descriptors of the language components being assessed.

    In a multi-cultural context similar to ours, where more than 230 teachers come from about 30 different countries, such scales are still not enough to do away with idiosyncrasies in assessing writing in general. I am inclined to believe that more rigorous scales would leave raters minimal leeway to call on their personal experience when interpreting vague descriptors (Brindley, 1998).

    Continuum of scoring methods (Hunter et al., 1996, p. 64), running from holistic approaches (left) to analytic approaches (right):

    General Impression Scoring → Holistic Scoring → Primary Trait Scoring → Analytic Scoring → Atomistic Scoring


    The further to the right a scale sits on this continuum, the better it suits a multi-cultural context and the more reliable the results it yields; hence this study:


    Develop a new set of rating scales and compare it to the current one.


    Questions

    1) Which of the two sets of rating scales, the one currently in use or the newly devised one, functions better in terms of the attributes listed under “What are we after?”

    2) Are there any significant differences between the two sets of scales

      a. taken as whole sets (when looking at the total mark = the sum of the scores from the 4 components)?

      b. in terms of categories?

    3) Which of the two sets of rating scales yields less variation among raters?
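    Questions 2a and 3 can be illustrated with a small sketch. It assumes that every script receives four component scores from each rater, sums them into a total mark, and uses the standard deviation of those totals across raters as a simple indicator of rater variation. The scores, the script and rater labels, and the helper names are all invented for illustration; none of them come from the study, and the study's own comparison relied on FACETS and ANOVA rather than on this shortcut.

```python
from statistics import mean, pstdev

# Invented scores: scale[script][rater] = four component scores.
rs1 = {
    "script_a": {"r1": [3, 4, 3, 2], "r2": [4, 4, 4, 3], "r3": [2, 3, 3, 2]},
    "script_b": {"r1": [5, 4, 4, 4], "r2": [4, 3, 4, 3], "r3": [5, 5, 4, 4]},
}
rs2 = {
    "script_a": {"r1": [3, 4, 3, 3], "r2": [3, 4, 4, 3], "r3": [3, 3, 4, 3]},
    "script_b": {"r1": [4, 4, 4, 4], "r2": [4, 4, 4, 3], "r3": [5, 4, 4, 4]},
}


def total_marks(scale):
    """Sum the four component scores into one total per script and rater."""
    return {
        script: {rater: sum(parts) for rater, parts in raters.items()}
        for script, raters in scale.items()
    }


def mean_rater_spread(scale):
    """Average, over scripts, of the SD of the total marks across raters."""
    totals = total_marks(scale)
    return mean(pstdev(list(by_rater.values())) for by_rater in totals.values())


# A lower value means the raters' totals lie closer together.
print(round(mean_rater_spread(rs1), 2), round(mean_rater_spread(rs2), 2))
```

    With these made-up numbers the second set shows the smaller spread, which is the pattern Question 3 asks about; real data could of course go either way.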


    What are we after?

    A rating scale that functions properly is expected to show:

    • significant discrimination between candidates, i.e. a good distribution of abilities (ANOVA was used to show this)

    • higher inter-rater and intra-rater consistency: we look at the SD (descriptive statistics) and the rater separation ratio (FACETS). For both, the lower the better, because big differences between raters are not welcome; the closer to 0, the better.

    • fewer raters marking either inconsistently (misfits) or over-consistently (overfits, i.e. overusing the central categories of the rating scale) (fit measures from FACETS: 0.6 to 1.6)

    • measurement values that increase as the scale points get higher, so that a higher score means a higher ability in the construct being assessed

    • all points on the scale being used
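    The last two attributes lend themselves to a quick check. The sketch below assumes we already have, for each awarded score, an ability measure for the candidate (for example in logits, exported from a Rasch analysis such as FACETS); it does not compute those measures. The data, the 5-point scale, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def unused_points(awarded_scores, scale_points):
    """Return the scale points that were never awarded."""
    return sorted(set(scale_points) - set(awarded_scores))


def average_measure_by_point(observations):
    """Average ability measure of the candidates awarded each scale point."""
    grouped = defaultdict(list)
    for score, measure in observations:
        grouped[score].append(measure)
    return {score: mean(values) for score, values in sorted(grouped.items())}


def measures_increase(observations):
    """True if the average measure rises with every higher scale point used."""
    averages = list(average_measure_by_point(observations).values())
    return all(low < high for low, high in zip(averages, averages[1:]))


# Invented (awarded score, ability measure) pairs for a 5-point scale.
observations = [(1, -2.1), (2, -0.9), (2, -1.2), (3, 0.1), (4, 0.8), (4, 1.3)]

print(unused_points([score for score, _ in observations], range(1, 6)))  # [5]
print(measures_increase(observations))  # True
```

    In this toy data point 5 is never used, which is exactly the kind of finding that would prompt a second look at the top band descriptor.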

    • Studied the FPE learning outcomes (LOs): 22 testable writing LOs

    • Questionnaire on the writing features to be assessed, given to CCs and programme teachers (Likert scale)

    • Selection: items ranked at 3 and 4, confirmed with CCs

    • Scrutinized 65 students’ live reports for extra features = 38 features

    2.1 Data collection (I)

    QUAL data:

    • the number of categories teachers suggest

    • what features go in each category

    • the scale (number of levels/points) teachers suggest for each category

    • teachers’ description/definition of what each level/point means

    • Write:

      • construct definitions based on Bachman & Palmer’s (1996) communicative approach

      • definitions of performance levels based on the LOs, teachers’ responses and the 65 studied reports

    • Piloting: 7 teachers scored 10 samples twice (in the analysis, 2 were discarded, leaving 5)

    2.2 Data collection (II)



    Analysis: FACETS + One-Way ANOVA


    1. Descriptive Statistics

      1.1 EXCEL (Appendix 1)

      1.2 ANOVA (Appendix 2)

    2. Categories Statistics from FACETS & ANOVA (Appendix 3)
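    As a reminder of what the one-way ANOVA step computes, here is a minimal sketch of the F statistic written with the standard library only. It assumes each group holds the marks one candidate received from several raters, so a large F suggests the scale separates candidates well; that grouping, the function name, and the numbers are assumptions for illustration, not the study's data or its actual software output.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the scores around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented marks: three candidates, each scored by three raters.
marks = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_b, df_w = one_way_anova_f(marks)
print(f_ratio, df_b, df_w)
```

    Turning F into a p-value needs the F distribution, which the standard library does not provide; a statistics package (for instance `scipy.stats.f_oneway`, if SciPy is available) would normally handle that step.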

    3. Implication & significance:

    • Analysis indicates that

    • RS2 functions more effectively than RS1 in most of the areas investigated in the study

    • the optimum number of points/levels to have on the scale is 5

    • Language Use category should be divided into more sub-categories

    • title and length (2 sub-categories of Content) are ‘over weighted’

      (changes made in version 3, which needs to be piloted)

    • For good results, RS2 needs to be used along with anchor papers representing different levels. Teacher training would also be helpful, as research suggests that it can reduce, but cannot eliminate, raters’ tendency towards overall severity or leniency in assessing performance (Lumley & McNamara, 1995; McNamara, 1996; McNamara & Adams, 1991; Weigle, 1998)

    • Teachers’ involvement in defining what they think should be assessed in students’ writing and in describing the levels of performance (what labels such as ‘Excellent’, ‘Good’, or ‘Poor’ stand for) helped them reach a more common understanding of the language aspects being assessed and a shared interpretation of the score descriptions

    • The rating scales I have developed are ‘home-made’, based on the LOs and tailored to the FPE, and therefore to the LC’s needs. They can be extended, with some adaptation, to essay writing

    • They can be generalised to any similar multi-cultural context to produce a less personalized, more institutionalized and objective assessment of students’ writing performance.


    References

    Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan Publishers Limited.

    Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

    Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

    Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

    Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press.

    Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.

    Fulcher, G. (2010). Practical Language Testing. Hodder Education.

    Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.

    Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex Publishing Corporation.

    Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.

    North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. Peter Lang.

    North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.

    North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.

    Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.

    Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.

    Thank you
