Experimental Research Methods in Language Learning


Presentation Transcript


  1. Experimental Research Methods in Language Learning Chapter 7 Quantitative Research Instruments and Techniques

  2. Leading Questions • Can you give an example of a research instrument and describe a situation in which it is used? • If you designed an experimental study, what kinds of research instruments would you adopt? • Why do you think an understanding of the advantages and disadvantages of a particular research instrument or data elicitation technique is important?

  3. The Nature of Research Data • Empirical data = raw data/unprocessed data gathered in the course of a study • Empirical data are used to understand the nature of language learning which is not always directly observable. • Experimental research needs to employ a variety of research instruments such as tests, elicitation tasks, inventories, questionnaires, and rating scales to obtain data.

  4. Examples of Research Instruments

  5. Quantitative Research Instruments and Techniques Language Tests and Assessments • Assessment is a broader concept than testing, as it includes both tests and non-tests (e.g., self-assessment and portfolio assessment). • A test is typically used in a strictly controlled and standardized manner.

  6. The Need for Language Tests and Assessments In language learning research, tests and assessments are needed because, for example, we need to: • assess students’ language ability/proficiency; • discover how successful students have been in achieving the objectives of a course of study; • provide feedback to learners; • evaluate the effectiveness of teaching or an experimental program.

  7. Language Proficiency Tests • Are based on a theoretical model of language proficiency • Assess students’ knowledge of and ability to use a language in general, without reference to a particular curriculum or syllabus • Focus on discriminating validly among different ability levels • Examples: TOEFL, TOEIC, IELTS, and OET

  8. Achievement Tests • Are associated with the language curriculum or syllabus for a course that students are undertaking • Can assess what students have learnt and rank them by their level of mastery of the subject • Discrimination among students may or may not be important for achievement tests.

  9. Researcher-made Tests • Are developed by the researcher, drawing on existing theories about the topic of interest and on instruments used by researchers in similar studies • Can be designed to elicit the specific ability under investigation • Require a pilot study to make sure the instrument is appropriate and feasible to use • Can be time-consuming and expensive to develop

  10. Performance Assessments • Measure what students can do (e.g., speak and write), rather than what they know • Are a form of direct assessment in which students carry out an activity that requires them to use a particular target language skill • Use a holistic or an analytic scoring method • Are subjective and often need two trained raters to mark each learner
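Since performance assessments often require two trained raters scoring against a rubric, a minimal Python sketch of analytic scoring may help; the rubric criteria, the 1-5 scale, and the ratings below are hypothetical illustrations, not taken from the chapter.

```python
# Hypothetical analytic rubric: each rater scores three criteria on a 1-5 scale.
CRITERIA = ("fluency", "accuracy", "coherence")

def analytic_score(ratings: dict[str, int]) -> float:
    """Average one rater's criterion scores into a single mark."""
    return sum(ratings[c] for c in CRITERIA) / len(CRITERIA)

rater_a = {"fluency": 4, "accuracy": 3, "coherence": 4}
rater_b = {"fluency": 5, "accuracy": 3, "coherence": 4}

# The learner's final mark is the mean of the two trained raters' scores.
final = (analytic_score(rater_a) + analytic_score(rater_b)) / 2
print(round(final, 2))  # 3.83
```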

  11. Self-assessment • Typically uses ‘can-do’ statements (e.g., I can carry on a daily conversation with a stranger) • Useful for formative assessment purposes • Practical for low-stakes decision making • Subject to scepticism because learners may be unable to judge their own achievement, ability, or proficiency accurately • Should not be the sole method of measurement in experimental research

  12. Peer Evaluation • Is useful for formative assessment and encourages the development of evaluative processes • Promotes peer mentoring and a supportive learning environment • Allows students to rate their peers’ performance • Can provide written feedback from peers

  13. Portfolio Assessment • Is based on a collection of students’ language performance samples gathered over time • Is difficult to use to determine gains or improvement in an experimental study • Scoring a portfolio can be much harder than scoring a test • Raises issues of validity and fairness because of a lack of control over the factors contributing to students’ performance

  14. Objective versus Subjective Tests • An objective test is a test that has answer keys to mark students’ responses to questions. • An answer is either correct or incorrect and a human scorer does not need to make his/her own judgment. • A subjective test requires a human scorer to make a judgment on students’ performance. • Subjective tests include tests that require learners to complete a task by speaking or writing.
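To make the objective/subjective distinction concrete, here is a minimal Python sketch of objective scoring against an answer key, where no human judgment is needed; the item IDs, key, and responses are invented for the example.

```python
# Hypothetical answer key for a four-item selected-response test.
ANSWER_KEY = {"q1": "b", "q2": "d", "q3": "a", "q4": "c"}

def score_objective_test(responses: dict[str, str]) -> int:
    """Count the responses that match the answer key exactly."""
    return sum(1 for item, key in ANSWER_KEY.items()
               if responses.get(item) == key)

student = {"q1": "b", "q2": "a", "q3": "a", "q4": "c"}
print(score_objective_test(student))  # 3 of 4 items correct
```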

  15. Skill-based Tests and Assessments • Each skill requires careful theoretical and methodological consideration of the skill construct and its assessment methods: • Speaking: see Luoma (2004) • Writing: see Weigle (2002) • Listening: see Buck (2001) • Reading: see Alderson (2000) • Grammar: see Purpura (2004) • Vocabulary: see Read (2000)

  16. Test Techniques • Selected-response techniques (e.g., multiple-choice, true/false, and ordering) • Constructed-response techniques: limited-production tasks (e.g., short answers, information transfer, cloze, gap-filling, dictation, and sentence completion) and extended-production tasks (e.g., essays, reports, role plays, and interviews)

  17. The Importance of Test Specifications • Test specifications are the blueprints of a test. • It is essential to develop test specifications for the pre- and posttests or other tests to be used in your experimental study. • Test specifications provide a detailed guideline of the abilities to be measured and how to measure them validly, reliably, and appropriately.

  18. What Makes up a Test Score? • An observed test score is not necessarily a true reflection of students’ ability or performance. • An observed score comprises a true score and an error score. • A test score is affected by the underlying ability of interest, test-method facets, personal characteristics, and random measurement error (see Bachman & Palmer 2010).
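The idea that an observed score combines a true score with random error (classical test theory: observed = true + error) can be illustrated with a short simulation; the normal error distribution and its standard deviation below are illustrative assumptions, not values from the chapter.

```python
import random

random.seed(7)  # reproducible example

def observed_score(true_score: float, error_sd: float = 3.0) -> float:
    """Classical test theory sketch: observed = true score + random error.
    The normal error model and its SD are illustrative assumptions."""
    return true_score + random.gauss(0, error_sd)

# The same learner (true score 70) produces different observed scores
# on repeated administrations because of measurement error.
print([round(observed_score(70.0), 1) for _ in range(5)])
```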

  19. The Ceiling and Floor Effects • The ceiling effect is a restriction at the upper end of the test score range. It concerns higher-ability students, whose performance may be underestimated because the test is too easy or does not allow them to demonstrate their ability to a sufficiently high level.

  20. The Ceiling and Floor Effects • The floor effect is a restriction at the lower end of the test score range. It concerns lower-ability students, whose performance cannot be captured adequately simply because the test is too difficult for them.
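Both effects can be pictured as truncation of scores at the ends of the reporting scale; in this hypothetical sketch, learners whose 'true' performance lies beyond the scale bounds become indistinguishable from one another.

```python
def recorded_score(raw: float, floor: float = 0, ceiling: float = 100) -> float:
    """Clip a raw performance value to the test's 0-100 reporting range."""
    return max(floor, min(ceiling, raw))

# Two high-ability learners whose underlying performance differs...
print(recorded_score(112), recorded_score(131))  # both 100: ceiling effect
# ...and two low-ability learners the test cannot separate either.
print(recorded_score(-8), recorded_score(-25))   # both 0: floor effect
```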

  21. Questionnaires and Inventories • Can collect quantitative or qualitative data, or both. • There are no right or wrong answers in questionnaires. • A Likert scale is often used to quantify a construct of interest, for example, 1 (never), 2 (rarely), 3 (often), 4 (usually), or 5 (always). • Other techniques: dichotomous items (e.g., yes/no), multiple-choice items, order of importance, checklists, semantic differential items, and open-ended questions
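As a sketch of how a Likert scale quantifies a construct, the snippet below codes responses on the 1-5 frequency scale above and averages them into a composite score; the responses are hypothetical.

```python
# Coding scheme taken from the scale above (1 = never ... 5 = always).
SCALE = {"never": 1, "rarely": 2, "often": 3, "usually": 4, "always": 5}

def composite_score(responses: list[str]) -> float:
    """Average the coded Likert responses across a questionnaire's items."""
    coded = [SCALE[r] for r in responses]
    return sum(coded) / len(coded)

print(composite_score(["often", "always", "usually", "rarely"]))  # 3.5
```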

  22. Rating Scales • Are often used in self-assessment, peer evaluation, and performance assessment. They can be affected by various types of error/bias related to rater characteristics. • Leniency error = the tendency of a rater to be generous when rating individuals’ abilities • Severity error = the tendency of a rater to be harsh on all individuals • Central tendency error = a rater’s tendency to avoid extreme rating scores
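These rater effects can be made concrete with a toy rater model; the bias parameters below are invented for illustration and do not represent a validated model of rater behavior.

```python
import random

random.seed(1)  # reproducible example

def rate(true_quality: float, severity: float = 0.0,
         central_pull: float = 0.0, midpoint: float = 3.0) -> float:
    """Toy rater on a 1-5 scale (all parameters are assumptions):
    severity > 0 lowers every rating (severity error; negative = leniency),
    central_pull in [0, 1] drags ratings toward the scale midpoint
    (central tendency error)."""
    rating = true_quality - severity
    rating += central_pull * (midpoint - rating)
    return max(1.0, min(5.0, rating + random.gauss(0, 0.2)))

performance = 4.5
print(round(rate(performance), 2))                    # about 4.5: unbiased rater
print(round(rate(performance, severity=1.0), 2))      # about 3.5: severe rater
print(round(rate(performance, central_pull=0.5), 2))  # about 3.75: avoids extremes
```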

  23. Language Aptitude Tests • Aptitude is often viewed as raw learning power (Dörnyei 2005). • MLAT (Modern Language Aptitude Test; Carroll & Sapon 1959) • DLAB (Defense Language Aptitude Battery; Peterson & Al-Haik 1976) • Hi-LAB (High-Level Language Aptitude Battery; Doughty et al. 2010)

  24. Quantitative Observations • Can help researchers overcome the discrepancy between what learners report and what they actually do, by allowing researchers to observe learners’ patterns of behavior in a specific context. • Follow standardized procedures covering not only who and what is to be observed, but also when, where, and how to observe. • Examples: COLT (Communicative Orientation of Language Teaching; Spada & Fröhlich 1995) and MOLT (Motivation Orientation in Language Teaching; Guilloteaux & Dörnyei 2008)

  25. Validity, Reliability, Practicality, Fairness, and Ethics Revisited • An instrument is valid when it measures what it intends to measure and fulfils its purpose. • A measure is reliable when it produces scores consistently. • A measure is practical when it can be administered and scored in a reasonable amount of time and with a reasonable use of resources. • A measure is fair when the participants know the purpose of the measurement and are treated equitably across groups. • A measure is ethical when it is not only fair, but also used appropriately, bearing in mind its potential consequences for an individual or society.
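Reliability as score consistency can be illustrated with a test-retest check: the same learners take the test twice, and a Pearson correlation near 1.0 suggests the instrument ranks them consistently. The score pairs below are invented for the example.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical scores for six learners on two administrations of one test.
test = [55, 62, 70, 74, 81, 88]
retest = [57, 60, 72, 71, 83, 90]

r = correlation(test, retest)
print(round(r, 3))  # close to 1.0 indicates consistent (reliable) scores
```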

  26. Validation and a Pilot Study • Validation comprises the steps taken by the researcher to make sure that a measure to be used is likely to be valid and that proper inferences about the construct of interest can be made from the data. • Validity evidence may include content-related evidence, internal structures, and criterion-related evidence.

  27. Validation and a Pilot Study Examples of validation processes: • Expert Judgments • Analysis of Cognitive Processes • Analysis of Internal Structures • Analysis across Different Groups of Participants • Comparative Analysis with Other External Criteria

  28. Discussion • What are common characteristics of quantitative research instruments? • If you would like to examine the effects of feedback on students’ academic writing, what instruments would you choose? Explain your reasons. • What would you gain if you piloted your research instruments before conducting your experiment?
