**LG675 Session 4: Reliability I** Sophia Skoufaki sskouf@essex.ac.uk 8/2/2012

**What does ‘reliability’ mean in the context of applied linguistic research?** • Definition and examples • What are the two broad categories of reliability tests? • How can we use SPSS to examine the reliability of a norm-referenced measure? • Work with typical scenarios

**Reliability: broad definitions** • The degree to which a data-collection instrument (e.g., a language test, questionnaire) yields consistent results. • The degree to which a person categorises linguistic output consistently (as compared to himself/herself or someone else).

**Reliability in applied linguistic research: examples** • A researcher who has created a vocabulary-knowledge test wants to see whether any questions in this test are inconsistent with the test as a whole (Gyllstad 2009). • A researcher who has collected data through a questionnaire she designed wants to see whether any questionnaire items are inconsistent with the questionnaire as a whole (Sasaki 1996). • A researcher wants to see whether she and another coder largely agree in coding EFL learners’ guesses at idiom meanings as right or wrong (Skoufaki 2008).

**Two kinds of reliability** • Reliability of norm-referenced data-collection instruments • Reliability of criterion-referenced data-collection instruments, also known as ‘dependability’

**Classification of data-collection instruments according to the basis of grading**

**Norm-referenced** • “Each student’s score on such a test is interpreted relative to the scores of all other students who took the test. Such comparisons are usually done with reference to the concept of the normal distribution …” (Brown 2005) • In language testing, norm-referenced tests assess knowledge and skills that are not tied to specific content taught.

**Criterion-referenced** • “… the purpose of a criterion-referenced test is to make a decision about whether an individual test taker has achieved a pre-specified criterion…” (Fulcher 2010) • “The interpretation of scores on a CRT is considered absolute in the sense that each student’s score is meaningful without reference to the other students’ scores.” (Brown 2005) • In language testing, criterion-referenced tests assess knowledge and skills that are tied to specific content taught.

**How reliability is assessed in norm-referenced data-collection instruments**

**Which reliability test we will use also depends on the nature of the data**

**Scoring scales** • Nominal: Numbers are arbitrary; they only distinguish between groups of individuals (e.g., gender, country of residence). • Ordinal: Numbers show a greater or lesser amount of something; they distinguish between groups of individuals and also rank them (e.g., students in a class can be ordered).

**Scoring scales (cont.)** • Interval: Numbers show a greater or lesser amount of something, and the difference between adjacent numbers remains stable throughout the scale; numbers distinguish between groups of individuals, rank them, and show how large the difference between two numbers is (e.g., in tests where people have to get a minimum score to pass). • Ratio: This scale has a true zero point, for cases which completely lack a characteristic; numbers do everything that numbers in interval scales do and also include a zero point (e.g., length, time).

**Assessing reliability through the test-retest or equivalent-forms approach** • The procedure for this kind of reliability test is to ask the same people to take the same test again (test-retest) or to take an equivalent version of the test (equivalent forms). • Correlational or correlation-like statistics are used to see how similar the participants’ scores are across the two tests.
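**Sketch: test-retest correlation outside SPSS** The correlation step above can be illustrated with a short Python sketch. It assumes interval scores and uses Pearson r; the scores and the names `first_sitting` and `second_sitting` are hypothetical and are not taken from the handout.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores of five learners on two sittings of the same test.
first_sitting = [12, 15, 9, 18, 14]
second_sitting = [13, 14, 10, 17, 15]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(round(test_retest_r, 3))
```

A value close to 1 would suggest that learners keep roughly the same relative standing across the two sittings. If the scores are not normally distributed, a rank-based correlation such as Spearman rho may be more appropriate.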

**SPSS: Testing test-retest reliability** • Open SPSS and input the data from Set 1 on page 5 of Phil’s ‘Simple statistical approaches to reliability and item analysis’ handout. • Do this activity. • Then do the activity with Sets 2 and 3.

**Rater reliability: The degree to which** a) a rater rates test-takers’ performance consistently (intra-rater reliability) and b) two or more raters who rate test-takers’ performance give ratings that agree with one another (inter-rater agreement)

**Ways of assessing internal-consistency reliability** • Split-half reliability: We split the test items into two halves and correlate the scores on the two halves. Because the result indicates how reliable half of the test is (not all of it), and a longer test tends to be more reliable, we need to adjust the result. We use the Spearman-Brown prophecy formula for that. • Alternatively, we use a statistic that compares the distribution of the scores on each item with the distribution of the scores on the whole test, e.g., Cronbach’s α or Kuder-Richardson formula 20 or 21. • In both cases, the higher the similarity found, the higher the internal-consistency reliability. Cronbach’s α is the most frequently used internal-consistency reliability statistic.
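**Sketch: split-half reliability with the Spearman-Brown adjustment** The split-half procedure can be sketched in Python as follows. The sketch assumes an odd/even split and the two-halves form of the Spearman-Brown formula, r_full = 2·r_half / (1 + r_half); the item scores are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

# Hypothetical data: one row per test taker, one column per item (1 = correct).
item_scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Odd-numbered items are columns 0, 2, 4; even-numbered items are columns 1, 3, 5.
odd_totals = [sum(row[0::2]) for row in item_scores]
even_totals = [sum(row[1::2]) for row in item_scores]

r_half = pearson_r(odd_totals, even_totals)
r_full = spearman_brown(r_half)
print(round(r_half, 3), round(r_full, 3))
```

For a positive half-test correlation below 1, the adjusted value is higher than the raw one, which reflects the point that a longer test tends to be more reliable.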

**SPSS: Assessing internal-consistency reliability with Cronbach’s α** • This is an activity from Brown (2005). He split the scores from a cloze test into odd-numbered and even-numbered ones, as shown in the table in your handout. • Open the file ‘Brown_2005.sav’ in SPSS. • Then click on Analyze...Scale...Reliability analysis.... • In the Model box, choose Alpha. • Click on Statistics and tick Scale and Correlations.
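**Sketch: what the Alpha model computes** As a rough illustration of the statistic requested in the steps above, the sketch below computes Cronbach’s alpha as k/(k−1) · (1 − sum of item variances / variance of total scores). The two score lists are hypothetical and are not the values in ‘Brown_2005.sav’; each list plays the role of one “item” (here, one half of a test).

```python
from statistics import variance

def cronbach_alpha(columns):
    """Cronbach's alpha for a list of item columns (one list of scores per item)."""
    k = len(columns)
    # Total score of each test taker across all items.
    totals = [sum(scores) for scores in zip(*columns)]
    item_variance_sum = sum(variance(column) for column in columns)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical odd-item and even-item scores for five test takers.
odd_half = [3, 2, 1, 3, 1]
even_half = [2, 2, 0, 3, 1]

alpha = cronbach_alpha([odd_half, even_half])
print(round(alpha, 3))
```

The same function accepts any number of item columns, so it could also be applied to one column per judge when several judges rate the same samples on an interval scale.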

**Assessing intra-judge agreement or inter-judge agreement between two judges** • When the data are interval, correlations can be used (Pearson r if the data are normally distributed and Spearman rho if they are not). • When there are more than two judges and the data are interval, Cronbach’s α can be used. • When the data are categorical, we can calculate the agreement percentage (e.g., the two raters agreed 30% of the time) or Cohen’s Kappa. Kappa corrects for chance agreement between the judges. However, the agreement percentage is good enough in some studies and Kappa has been criticised (see Phil’s handout, pp. 14-17).
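**Sketch: agreement percentage versus Cohen’s Kappa** To make the difference between the two categorical statistics concrete, here is a minimal Python sketch. It uses the standard formula Kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the proportion expected by chance from each judge’s category totals. The codings in `judge_a` and `judge_b` are hypothetical.

```python
from collections import Counter

def percent_agreement(ratings_1, ratings_2):
    """Proportion of cases on which the two judges chose the same category."""
    matches = sum(a == b for a, b in zip(ratings_1, ratings_2))
    return matches / len(ratings_1)

def cohens_kappa(ratings_1, ratings_2):
    """Unweighted Cohen's kappa for two judges rating the same cases."""
    n = len(ratings_1)
    p_observed = percent_agreement(ratings_1, ratings_2)
    counts_1 = Counter(ratings_1)
    counts_2 = Counter(ratings_2)
    # Chance agreement: product of the judges' marginal proportions per category.
    p_expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codings of ten idiom-meaning guesses by two judges.
judge_a = ["right", "right", "wrong", "right", "wrong",
           "wrong", "right", "wrong", "right", "right"]
judge_b = ["right", "wrong", "wrong", "right", "wrong",
           "right", "right", "wrong", "right", "right"]

agreement = percent_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
print(round(agreement, 2), round(kappa, 3))
```

In this made-up example the judges agree on 8 of 10 cases, but Kappa is noticeably lower than 0.8 because part of that agreement would be expected by chance alone.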

**SPSS: Assessing interjudge agreement with Cronbach’s α** • This is an activity from Larson-Hall (2010). • Go to http://cw.routledge.com/textbooks/9780805861853/spss-data-sets.asp • Download the file MunroDerwingMorton and open it in SPSS. • Click on Analyze...Scale...Reliability analysis.... • In the Model box, choose Alpha. • Click on Statistics and tick Scale, Scale if item deleted and Correlations.

**SPSS: Assessing rater reliability through Cohen’s Kappa** The file ‘Kappa data.sav’ contains the results of an error-tagging task that a former colleague of mine and I performed on some paragraphs written by learners of English. Each number is an error category (e.g., 2=spelling error). There are 7 categories. Each row gives a combination of the two judges’ categories and how often it occurred, so for SPSS to understand what each row means, you should weight the cases by the ‘Count’ variable (Data…Weight Cases).

**SPSS: Assessing rater reliability through Cohen’s Kappa (cont.)** Go to Analyze…Descriptive Statistics…Crosstabs. Move one of the judge variables into the ‘Row(s)’ box and the other into the ‘Column(s)’ box. You shouldn’t do anything with the ‘Count’ variable. In ‘Statistics’ tick ‘Kappa’.
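**Sketch: Kappa from count-weighted rows** The role of the ‘Count’ variable can be illustrated with a small Python sketch in which each row holds the first judge’s category, the second judge’s category, and the number of cases with that combination. The rows below are hypothetical and use only two categories; they are not the contents of ‘Kappa data.sav’.

```python
from collections import defaultdict

def kappa_from_counts(rows):
    """Unweighted Cohen's kappa from (judge1_category, judge2_category, count) rows."""
    total = 0
    agreements = 0
    judge1_totals = defaultdict(int)
    judge2_totals = defaultdict(int)
    for category_1, category_2, count in rows:
        # Each row stands for `count` cases, which is what weighting achieves.
        total += count
        judge1_totals[category_1] += count
        judge2_totals[category_2] += count
        if category_1 == category_2:
            agreements += count
    p_observed = agreements / total
    p_expected = sum(
        judge1_totals[c] * judge2_totals[c] for c in judge1_totals
    ) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical rows: (judge 1 category, judge 2 category, count).
rows = [
    (1, 1, 20),
    (1, 2, 5),
    (2, 1, 10),
    (2, 2, 15),
]

kappa = kappa_from_counts(rows)
print(round(kappa, 3))
```

Without the counts, each row would be treated as a single case, so the four rows above would be read as four cases rather than fifty.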

**Kappa test**

**Kappa at http://faculty.vassar.edu/lowry/kappa.html** Go to the website above and select the number of categories in the data. In the table, enter the raw numbers as they appear in the SPSS contingency table. Click Calculate. The output includes the same Kappa value as in SPSS, plus three more. Phil recommends using one of the Kappas which have a possible maximum value (see p. 18 of his handout).

**Next week** • Item analysis for norm-referenced measures • Reliability tests for criterion-referenced measures • Validity tests

**References** • Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill. • Fulcher, G. 2010. Practical language testing. London: Hodder Education. • Gyllstad, H. 2009. Designing and evaluating tests of receptive collocation knowledge: COLLEX and COLLMATCH. In Barfield, A. and Gyllstad, H. (eds.) Researching Collocations in Another Language: Multiple Interpretations (pp. 153-170). London: Palgrave Macmillan. • Larson-Hall, J. 2010. A guide to doing statistics in second language research using SPSS. London: Routledge. • Sasaki, C. L. 1996. Teacher preferences of student behavior in Japan. JALT Journal 18(2), 229-239. • Scholfield, P. 2011. Simple statistical approaches to reliability and item analysis. LG675 Handout. University of Essex. • Skoufaki, S. 2008. Investigating the source of idiom transparency intuitions. Metaphor and Symbol 24(1), 20-41.

**Suggested readings** • On the meaning of ‘reliability’ (particularly in relation to language testing) Bachman, L.F. and Palmer, A.S. 1996. Language Testing in Practice. Oxford: Oxford University Press. (pp.19-21) Bachman, L.F. 2004. Statistical Analyses for Language Assessment. Cambridge University Press. (Chapter 5) Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill. Fulcher, G. 2010. Practical language testing. London: Hodder Education. (pp.46-7) Hughes, A. 2003. Testing for Language Teachers. (2nd ed.) Cambridge: Cambridge University Press. (pp. 36-44) • On the statistics used to assess language test reliability Bachman, L.F. 2004. Statistical Analyses for Language Assessment. Cambridge University Press. (chapter 5) Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill. (chapter 8)

**Suggested readings (cont.)** Brown, J.D. 1997. Reliability of surveys. Shiken: JALT Testing & Evaluation SIG Newsletter 1(2), 18-21. Field, A. 2009. Discovering statistics using SPSS. (3rd ed.) London: Sage. (sections 17.9, 17.10) Fulcher, G. 2010. Practical language testing. London: Hodder Education. (pp.47-52) Howell, D.C. 2007. Statistical methods for psychology. Calif.: Wadsworth. (pp. 165-166) Larson-Hall, J. 2010. A guide to doing statistics in second language research using SPSS. London: Routledge. (sections 6.4, 6.5.4, 6.5.5)

**Homework** The file ‘P-FP Sophia Sumei.xls’ contains the number of pauses (unfilled, filled, and total) in some spoken samples of learners of English according to my and a former colleague’s judgment. Which of the aforementioned statistical tests of interjudge agreement seem appropriate for this kind of data? What else would you need to find out about the data in order to decide which test is the most appropriate?