LG675 Session 4: Reliability I
Sophia Skoufaki
8/2/2012
What does ‘reliability’ mean in the context of applied linguistic research? Definition and examples
Which are the two broad categories of reliability tests?
A researcher who has created a vocabulary-knowledge test wants to see whether any questions in this test are inconsistent with the test as a whole (Gyllstad 2009).
A researcher who has collected data through a questionnaire she designed wants to see whether any questionnaire items are inconsistent with the questionnaire as a whole (Sasaki 1996).
A researcher wants to see whether she and another coder agree to a great extent in their coding of idiom-meaning guesses given by EFL learners as right or wrong (Skoufaki 2008).
Reliability of norm-referenced data-collection instruments
Reliability of criterion-referenced data-collection instruments, AKA ‘dependability’.
(e.g., gender, country of residence)
(e.g., students in a class can be ordered)
(e.g., in tests where people have to get a minimum score to pass)
(e.g., in length, time)
a) a rater rates test-takers’ performance consistently (intra-rater reliability) and
b) two or more raters who rate test-takers’ performance give ratings that agree with one another (inter-rater agreement)
Cronbach’s α is the most frequently used internal-consistency reliability statistic.
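To make the statistic concrete, here is a minimal sketch of how Cronbach’s α can be computed by hand, assuming the usual formula α = k/(k−1) × (1 − sum of item variances / variance of total scores). The function names and the scores are made up for illustration; they are not taken from any of the studies cited above.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for rows of scores (one row per respondent, one column per item)."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance(item) for item in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def alpha_if_item_deleted(scores):
    """Alpha recomputed with each item left out in turn (hypothetical helper)."""
    k = len(scores[0])
    return [cronbach_alpha([row[:i] + row[i + 1:] for row in scores]) for i in range(k)]

# Made-up scores: 5 respondents x 3 items (illustrative only).
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
    [3, 3, 4],
]
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.3f}")
print("alpha if item deleted:", [round(a, 3) for a in alpha_if_item_deleted(scores)])
```

An item whose removal raises α noticeably is a candidate for being inconsistent with the rest of the instrument.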
When the data are interval, correlations can be used (Pearson r if the data are normally distributed and Spearman rho if they are not).
When there are more than two judges and the data are interval, Cronbach’s α can be used.
When the data are categorical, we can calculate the agreement percentage (e.g., the two raters agreed 30% of the time) or Cohen’s Kappa. Kappa corrects for chance agreement between judges. However, the agreement percentage is good enough in some studies and Kappa has been criticised (see Phil’s handout, pp. 14-17).
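The two-judge options in this list can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are mine, the ratings are made up, Spearman rho is computed as Pearson r on average ranks, and no normality check is included.

```python
from collections import Counter
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of interval ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def percent_agreement(a, b):
    """Proportion of cases that the two coders put in the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a.keys() | count_b.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)  # undefined if chance agreement is 1

# Made-up interval ratings from two raters.
rater_a = [10, 20, 30, 40, 50]
rater_b = [1, 4, 9, 16, 25]
print(pearson_r(rater_a, rater_b), spearman_rho(rater_a, rater_b))

# Made-up right/wrong codings from two coders.
coder_1 = ["right"] * 6 + ["wrong"] * 4
coder_2 = ["right"] * 5 + ["wrong"] * 4 + ["right"]
print(percent_agreement(coder_1, coder_2), cohen_kappa(coder_1, coder_2))
```

In the categorical example the two coders agree on 8 of 10 cases, but Kappa comes out lower than the agreement percentage because part of that agreement would be expected by chance.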
This is an activity from Larson-Hall (2010).
Go to http://cw.routledge.com/textbooks/9780805861853/spss-data-sets.asp
Download the file MunroDerwingMorton and open it in SPSS.
Click on Analyze…Scale…Reliability Analysis.
In the Model box, choose Alpha.
Click on Statistics and tick Scale, Scale if item deleted and Correlations.
The file ‘Kappa data.sav’ contains the results of an error-tagging task that a former colleague and I performed on some paragraphs written by learners of English.
Each number is an error category (e.g., 2=spelling error). There are 7 categories.
For SPSS to understand what each row means, you should weight the cases by the ‘Count’ variable (Data…Weight Cases).
Go to Analyze…Descriptive Statistics…Crosstabs.
Move one of the judge variables into the ‘Row(s)’ box and the other into the ‘Column(s)’ box.
In the Crosstabs dialog, you shouldn’t do anything with the ‘Count’ variable.
In ‘Statistics’ tick ‘Kappa’.
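What the weighting and Crosstabs steps amount to can be sketched as follows. This is not the content of ‘Kappa data.sav’: the rows below are made up, and I assume each row has the form (judge 1 category, judge 2 category, count), so that the count says how many cases the row stands for.

```python
from collections import defaultdict

def crosstab(rows):
    """Build a contingency table from (judge1, judge2, count) rows."""
    table = defaultdict(int)
    for judge1, judge2, count in rows:
        table[(judge1, judge2)] += count  # each row stands for `count` cases
    return dict(table)

def kappa_from_table(table):
    """Cohen's Kappa computed from a contingency table of counts."""
    n = sum(table.values())
    categories = {c for pair in table for c in pair}
    p_o = sum(table.get((c, c), 0) for c in categories) / n
    row_totals = {c: sum(v for (r, _), v in table.items() if r == c) for c in categories}
    col_totals = {c: sum(v for (_, k), v in table.items() if k == c) for c in categories}
    p_e = sum(row_totals[c] * col_totals[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Made-up aggregated data with two error categories (illustrative only).
rows = [
    (1, 1, 20),
    (1, 2, 5),
    (2, 1, 10),
    (2, 2, 15),
]
table = crosstab(rows)
kappa = kappa_from_table(table)
print(table)
print(f"kappa = {kappa:.3f}")
```

Without the weighting, every row would count as a single case, so the table would contain a 1 in each filled cell instead of the real frequencies.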
Go to the website above and select the number of categories in the data.
In the table, enter the raw numbers as they appear in the SPSS contingency table.
The result is not only the same Kappa value as in SPSS, but also three more.
Phil recommends using one of the Kappas which have a possible maximum value (see p. 18 of his handout).
Item analysis for norm-referenced measures
Reliability tests for criterion-referenced measures
Bachman, L.F. and Palmer, A.S. 1996. Language Testing in Practice. Oxford: Oxford University Press. (pp. 19-21)
Bachman, L.F. 2004. Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press. (Chapter 5)
Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill.
Fulcher, G. 2010. Practical language testing. London: Hodder Education. (pp. 46-7)
Hughes, A. 2003. Testing for Language Teachers. (2nd ed.) Cambridge: Cambridge University Press. (pp. 36-44)
Bachman, L.F. 2004. Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press. (Chapter 5)
Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill. (Chapter 8)
Brown, J.D. 1997. Reliability of surveys. Shiken: JALT Testing & Evaluation SIG Newsletter 1 (2), 18-21.
Field, A. 2009. Discovering statistics using SPSS. (3rd ed.) London: Sage. (sections 17.9, 17.10)
Fulcher, G. 2010. Practical language testing. London: Hodder Education. (pp. 47-52)
Howell, D.C. 2007. Statistical methods for psychology. Calif.: Wadsworth. (pp. 165-166)
Larson-Hall, J. 2010. A guide to doing statistics in second language research using SPSS. London: Routledge. (sections 6.4, 6.5.4, 6.5.5)
The file ‘P-FP Sophia Sumei.xls’ contains the number of pauses (unfilled, filled, and total) in some spoken samples of learners of English, as judged by me and a former colleague.
Which of the aforementioned statistical tests of interjudge agreement seem appropriate for this kind of data?
What else would you need to find out about the data in order to decide which test is the most appropriate?