Investigating the impact of empirically-developed rating scales on the assessment of students’ writing in the FPE, Part I: Report Writing


Investigating the impact of empirically-developed rating scales on the assessment of students’ writing in the FPE, Part I: Report Writing

A Presentation for PDD, Language Centre

May 9, 2012

by Farah Bahrouni ([email protected])



Outline:

  • Claim & Questions

  • Study

    • Methodology

    • Data collection

    • Analysis: FACETS & One-Way ANOVA

    • Results

  • Conclusion

    • Implication & Significance



    Claim

Analytic scales:

• recognize a priori that language learners may not learn language skills at the same pace and may display different levels in different skills at a given time (Hamp-Lyons, 1991; Weigle, 1994, 2002)

• diagnose students’ strong and weak areas (Bachman, 1990, 2007; Bachman & Palmer, 1996; Fulcher, 2000, 2010, 2011; North, 2003)

• help teachers give constructive feedback (Alderson, 1991, 2007; Alderson et al., 1995)

• help teachers and students focus on weaknesses

• give a picture of how much of the curriculum/learning outcomes (LOs) a student has achieved

• help teachers reflect on their teaching, i.e. positive washback (Alderson, 1991, 2007; Alderson et al., 1995; Bachman, 1990, 2007; Bachman & Palmer, 1996; North, 2000, 2003; North & Schneider, 1998)



    However,

I argue that analytic scales as they are presented in the literature (see Bachman & Palmer, 1996, pp. 214-216, 275-280; Hamp-Lyons, 1992, in North, 2003, pp. 78-79; Hughey et al., 1983; Shohamy, 1992, p. 33), to mention only these, are still holistic in nature, in terms of both the construct definitions and the band descriptors of the language components being assessed.



In a multi-cultural context such as ours, where more than 230 teachers come from about 30 different countries, such scales are still not enough to do away with idiosyncrasies in assessing writing in general. I am inclined to believe that more rigorous scales would leave minimal leeway for raters to call on their personal experience to interpret vague descriptors (Brindley, 1998).



Continuum of scoring methods (Hunter et al., 1996, p. 64):

Holistic Approaches <------------------------------------------------> Analytic Approaches

General impression scoring | Holistic scoring | Primary Trait scoring | Analytic scoring | Atomistic scoring

The further to the right on this continuum a scale sits, the better it functions in a multi-cultural context and the more reliable the results it yields; hence this study.

    Action:

    Develop a new set of rating scales and compare it to the current one.



    Questions:

1) Which of the two sets of rating scales, the one currently in use or the newly devised one, functions better in terms of the attributes listed on the next slide?

2) Are there any significant differences between the two sets of scales:

a. taken as whole sets (i.e. looking at the total mark, the sum of the scores from the four components)?

b. in terms of categories?

3) Which of the two sets of rating scales yields less variation among raters?



2. Study

    What are we after?

A rating scale functioning properly is expected to show:

• significant discrimination between candidates, i.e. a good distribution of abilities (ANOVA is used to show this)

• high inter-rater and intra-rater consistency: we look at the SD (descriptive statistics) and the rater separation ratio (FACETS); the lower these are, the better, since big differences between raters are not welcome (the closer to 0, the better)

• few raters marking either inconsistently (misfits) or over-consistently (overfits), e.g. by overusing the central categories of the rating scale (acceptable fit measures from FACETS: 0.6-1.6)

• measurement values that increase as the scale points get higher, i.e. a higher score means higher ability in the construct being assessed

• all points on the scale being used
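As a toy illustration of the first two criteria, the sketch below (Python, with invented marks rather than study data) runs a one-way ANOVA across candidates to check discrimination, and uses the SD of the rater means as a rough inter-rater consistency index. The FACETS fit statistics themselves come from a many-facet Rasch analysis and are not reproduced here.

```python
# Illustrative check of two of the criteria above on made-up scores:
# (1) candidate discrimination via one-way ANOVA, (2) rater spread via SD.
# All numbers are invented for illustration; they are not study data.
import numpy as np
from scipy.stats import f_oneway

# Rows = 3 raters, columns = 5 candidates; each cell is a total mark.
scores = np.array([
    [12, 15, 18, 9, 14],   # rater A
    [11, 16, 17, 10, 13],  # rater B
    [13, 15, 19, 8, 14],   # rater C
])

# One-way ANOVA across candidates: each candidate's column is a group.
# A small p-value indicates the scale discriminates between candidates.
f_stat, p_value = f_oneway(*scores.T)

# SD of each rater's mean mark: the closer to 0, the more consistent the
# raters are with one another (cf. the rater separation ratio in FACETS).
rater_sd = scores.mean(axis=1).std()

print(f"F = {f_stat:.2f}, p = {p_value:.4f}, rater SD = {rater_sd:.2f}")
```

With these invented marks the candidates differ far more than the raters do, which is exactly the pattern a well-functioning scale should produce.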


2.1 Data collection (I)

Studied the FPE LOs: 22 testable writing LOs

Questionnaire on the writing features to be assessed: CCs & programme teachers, Likert scale

Selection: items ranked at 3 & 4, confirmed with the CCs

Scrutinized 65 students’ live reports for extra features, yielding 38 features


QUAL data:

• the number of categories teachers suggest

• what features go in each category

• the scale (number of levels/points) teachers suggest for each category

• teachers’ description/definition of what each level/point means


2.2 Data collection (II)

• Write:

• construct definitions based on Bachman & Palmer’s (1996) communicative approach

• definitions of performance levels based on the LOs, teachers’ responses and the 65 studied reports

Piloting: 7 teachers scored 10 samples twice (in the analysis, 2 were discarded, leaving 5)


    RS1

    RS2

    Analysis: FACETS + One-Way ANOVA
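For question 2a, the whole-set comparison can be sketched the same way; the totals below are invented, purely to show the shape of the computation (with two groups, a one-way ANOVA reduces to an independent-samples t-test).

```python
# Toy version of comparing total marks under the two scale sets (RS1 vs RS2).
# All numbers are invented for illustration; they are not study data.
import numpy as np
from scipy.stats import f_oneway

# Hypothetical total marks (sum of the 4 component scores) for 10 scripts,
# marked once with each set of scales.
rs1_totals = np.array([14, 11, 16, 9, 13, 15, 10, 12, 17, 8])
rs2_totals = np.array([13, 12, 15, 10, 14, 15, 11, 12, 16, 9])

f_stat, p_value = f_oneway(rs1_totals, rs2_totals)
# A p-value above .05 would mean no significant difference between the sets.
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

Since the same scripts are marked under both sets, a paired comparison would also be defensible; a one-way ANOVA is sketched here simply because it is the analysis named in the study.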



    Results

1. Descriptive statistics

  1.1 Excel (Appendix 1)

  1.2 ANOVA (Appendix 2)

2. Category statistics from FACETS & ANOVA (Appendix 3)



    3. Implication & significance:

• Analysis indicates that:

• RS2 functions more effectively than RS1 in most of the areas investigated in the study

• the optimum number of points/levels to have on the scale is 5

• the Language Use category should be divided into more sub-categories

• title and length (2 sub-categories of Content) are ‘over-weighted’

(these changes were made in version 3, which still needs to be piloted)

• For good results, RS2 needs to be used along with anchor papers representing different levels. Teacher training would also be helpful, as research suggests that it can reduce, but not eliminate, raters’ tendency towards overall severity or leniency in assessing performance (Lumley & McNamara, 1995; McNamara, 1996; McNamara & Adams, 1991; Weigle, 1998)



• Teachers’ involvement in defining what they think should be assessed in students’ writing, and in describing the levels of performance (what labels such as ‘Excellent’, ‘Good’, or ‘Poor’ stand for), helped them reach a more common understanding of the language aspects being assessed and a shared interpretation of the score descriptions

• The rating scales I have developed are ‘home-made’, based on the LOs and tailored to the FPE, and therefore to the LC’s needs. They can be extrapolated, with some adaptation, to essay writing

• They can be generalised to any similar multi-cultural context to produce a less personalized, more institutionalized and objective assessment of students’ writing performance.



    REFERENCES

Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces Between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press.

Fulcher, G. (2000). The ‘communicative’ legacy in language testing. System, 28, 483-497.

Fulcher, G. (2010). Practical Language Testing. London: Hodder Education.

Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.

Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex.

Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.

North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Peter Lang.

North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.

North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.

Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.

Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.

    Thank you

