Effective Test Construction Techniques for Various Formats

Using Tests (Part II) • Dr Ayaz Afsar

Consider the form of the test
• Much of the discussion in this session assumes that the test is of the pen-and-paper variety. Clearly this need not be the case; for example, tests can be written, oral, practical, interactive, computer-based, dramatic, diagrammatic, pictorial or photographic, and they can involve audio and video material, presentations, role-play and simulations. Oral tests, for example, can be conducted if the researcher feels that reading and writing will obstruct the true purpose of the test (i.e. it becomes a reading and writing test rather than, say, a test of mathematics).
• This does not negate the issues discussed here, for the form of the test will still need to consider, for example, reliability and validity, difficulty, discriminability, marking and grading, item analysis and timing.

Cont.
• Indeed, several of these factors take on an added significance in non-written forms of testing; for example, reliability is a major issue in judging live musical performance or the performance of a gymnastics routine – where a ‘one-off’ event is likely. Furthermore, reliability and validity are significant issues in group performance or group exercises – where group dynamics may prevent a testee’s true abilities from being demonstrated.
• Clearly the researcher will need to consider whether the test will be undertaken individually, or in a group, and what form it will take.

Write the test item
The test will need to address the intended and unintended clues and cues that might be provided in it, for example (Morris et al. 1987):
• The number of blanks might indicate the number of words required.
• The number of dots might indicate the number of letters required.
• The length of blanks might indicate the length of response required.
• The space left for completion will give cues about how much to write.
• Blanks in different parts of a sentence will be assisted by the reader having read the other parts of the sentence (anaphoric and cataphoric reading cues).
Hanna (1993: 139–41) and Cunningham (1998) provide several guidelines for constructing short-answer items to overcome some of these problems:

Cont.
• Make the blanks close to the end of the sentence.
• Keep the blanks the same length.
• Ensure that there can be only a single correct answer.
• Avoid putting several blanks close to each other (in a sentence or paragraph) such that the overall meaning is obscured.
• Only make blanks of key words or concepts, rather than of trivial words.
• Avoid addressing only trivial matters.
• Ensure that students know exactly the kind and specificity of the answer required.
• Specify the units in which a numerical answer is to be given.
• Use short-answer items for testing knowledge recall.

Cont.
With regard to multiple choice items there are several potential problems:
• The number of choices in a single multiple choice item and whether there is one or more right answer(s).
• The number and realism of the distractors in a multiple choice item (e.g. there might be many distractors but many of them are too obvious to be chosen – there may be several redundant items).
• The sequence of items and their effects on each other.
• The location of the correct response(s) in a multiple choice item.

Constructing effective multiple choice test items
Some suggestions for constructing effective multiple choice test items:
• Ensure that they catch significant knowledge and learning rather than low-level recall of facts.
• Frame the nature of the issue in the stem of the item, ensuring that the stem is meaningful in itself (e.g. replace the general ‘sheep: (a) are graminivorous (b) are cloven footed (c) usually give birth to one or two calves at a time’ with ‘how many lambs are normally born to a sheep at one time?’).

Cont… Write the test item
• Ensure that the stem includes as much of the item as possible, with no irrelevancies.
• Avoid negative stems to the item.
• Keep the readability levels low; ensure clarity and unambiguity.
• Ensure that all the options are plausible, so that guessing of the only possible option is avoided.
• Avoid the possibility of students making the correct choice through incorrect reasoning.
• Include some novelty in the item if it is being used to measure understanding.

Cont.
• Ensure that there can only be a single correct option (if a single answer is required) and that it is unambiguously the right response.
• Avoid syntactical and grammatical clues by making all options syntactically and grammatically parallel and by avoiding matching the phrasing of a stem with similar phrasing in the response.
• Avoid including in the stem clues as to which may be the correct response.
• Ensure that the length of each response item is the same (e.g. to prevent one long correct answer from standing out).

Cont… Write the test item
• Keep each option separate, avoiding options which are included in each other.
• Ensure that the correct option is positioned differently for each item (e.g. so that it is not always option 2).
• Avoid using options like ‘all of the above’ or ‘none of the above’.
• Avoid answers from one item being used to cue answers to another item – keep items separate.
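The advice on varying the position of the correct option can be sketched in code. The following is a minimal illustration, not a prescribed procedure: the item bank, the function name `shuffle_options` and the fixed seed are all hypothetical, and only the lamb stem comes from the slides. It shuffles each item's options and tallies where the correct option ends up, so that a skewed pattern can be spotted before the paper is printed.

```python
import random
from collections import Counter

def shuffle_options(options, correct_index, rng):
    """Shuffle one item's options; return (shuffled options, new index of the correct one)."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[k] is the original index now sitting at position k
    return shuffled, order.index(correct_index)

# Hypothetical item bank: (stem, options, index of the correct option).
# The correct option is deliberately always first, which is the pattern to avoid.
item_bank = [
    ("How many lambs are normally born to a sheep at one time?",
     ["One or two", "Three or four", "Five or six", "Seven or eight"], 0),
    ("How many sides does a hexagon have?", ["Six", "Five", "Seven", "Eight"], 0),
    ("How many minutes are there in two hours?", ["120", "100", "90", "150"], 0),
]

rng = random.Random(7)  # fixed seed (an assumption) so the same paper can be regenerated
paper = []
positions = Counter()
for stem, options, correct in item_bank:
    shuffled, new_correct = shuffle_options(options, correct, rng)
    paper.append((stem, shuffled, new_correct))
    positions[new_correct] += 1

# A heavily skewed tally suggests reshuffling or reordering by hand.
print(dict(positions))
```

Random shuffling does not guarantee an even spread on a short test, which is why the tally is printed for inspection rather than assumed to be balanced.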

True-false questions
Morris et al. (1987: 161), Gronlund and Linn (1990), Hanna (1993: 147), Cunningham (1998) and Aiken (2003) also indicate particular problems in true–false questions:
• Ambiguity of meaning.
• Some items might be partly true or partly false.
• Items that polarize – being too easy or too hard.
• Most items might be true or false under certain conditions.
• It may not be clear to the student whether facts or opinions are being sought.
• As this is dichotomous, students have an even chance of guessing the correct answer.
• An imbalance of true to false statements.
• Some items might contain ‘absolutes’ which give powerful clues, e.g. ‘always’, ‘never’, ‘all’, ‘none’.

How to overcome problems
To overcome these problems several points need to be addressed:
• Avoid generalized statements (as they are usually false).
• Avoid trivial questions.
• Avoid negatives and double negatives in statements.
• Avoid over-long and over-complex statements.
• Ensure that items are rooted in facts.
• Ensure that statements can be only true or false.
• Write statements in everyday language.
• Decide where it is appropriate to use ‘degrees’ – ‘generally’, ‘usually’, ‘often’ – as these are capable of interpretation.
• Avoid ambiguities.
• Ensure that each statement contains only one idea.
• If an opinion is to be sought then ensure that it is attributable to a named source.
• Ensure that true statements and false statements are equal in length and number.

Matching items
Morris et al. (1987), Hanna (1993: 150–2), Cunningham (1998) and Aiken (2003) also indicate particular potential difficulties in matching items:
• It might be very clear to a student which items in a list simply cannot be matched to items in the other list (e.g. by dint of content, grammar, concepts), thereby enabling the student to complete the matching by elimination rather than understanding.
• One item in one list might be able to be matched to several items in the other.
• The lists might contain unequal numbers of items, thereby introducing distractors and rendering the selection as much a multiple choice item as a matching exercise.

Cont.
The authors suggest that difficulties in matching items can be addressed thus:
• Ensure that the items for matching are homogeneous – similar – over the whole test (to render guessing more difficult).
• Avoid constructing matching items whose answers can be worked out by elimination, e.g. by ensuring that: (a) there are different numbers of items in each column, so that there are more options to be matched than there are items; (b) students cannot reduce the field of options as they increase the number of items that they have matched; (c) the same option may be used more than once.
• Decide whether to mix the two columns of matched items (i.e. ensure, if desired, that each column includes both items and options).

Cont.
• Sequence the options for matching so that they are logical and easy to follow (e.g. by number, by chronology).
• Avoid over-long columns and keep the columns on a single page.
• Make the statements in the options columns as brief as possible.
• Avoid ambiguity by ensuring that there is a clearly suitable option that stands out from its rivals.
• Make it clear what the nature of the relationship should be between the item and the option (on what terms they relate to each other).
• Number the items and letter the options.

Essay questions
• With regard to essay questions, there are several advantages that can be claimed. For example, an essay, as an open form of testing, enables complex learning outcomes to be measured.
• It enables the student to integrate, apply and synthesize knowledge, to demonstrate the ability for expression and self-expression, and to demonstrate higher order and divergent cognitive processes.
• On the other hand, essays have been criticized for yielding unreliable data (Gronlund and Linn 1990; Cunningham 1998), for being prone to unreliable (inconsistent and variable) scoring, for neglecting intended learning outcomes, and for being prone to marker bias and preference (being too intuitive, subjective, holistic, and time-consuming to mark).

How to improve essay questions?
• The essay question must be restricted to those learning outcomes that are unable to be measured more objectively.
• The essay question must ensure that it is clearly linked to desired learning outcomes and that it is clear what behaviours the students must demonstrate.
• The essay question must indicate the field and tasks very clearly (e.g. ‘compare’, ‘justify’, ‘critique’, ‘summarize’, ‘classify’, ‘analyse’, ‘clarify’, ‘examine’, ‘apply’, ‘evaluate’, ‘synthesize’, ‘contrast’, ‘explain’, ‘illustrate’).
• Time limits should be set for each essay.
• Options should be avoided or, if students are given a list of titles from which to choose, each title should be equally difficult and equally capable of enabling the student to demonstrate achievement, understanding etc.

Marking
• Marking criteria should be prepared and explicit, indicating what must be included in the answers and the points to be awarded for such inclusions, or the ratings to be scored for the extent to which certain criteria have been met.
• Decisions should be agreed on how to address and score irrelevancies, inaccuracies, poor grammar and spelling.
• The work should be double marked, blind and, where appropriate, without the marker knowing (the name of) the essay writer.
• Clearly these are issues of reliability.

Consider the layout of the test
The issue here is that layout can exert a profound effect on the test. Deciding on the layout will include the following factors:
• The nature, length and clarity of the instructions, for example: what to do; how long to take; how much to do; how many items to attempt; what kind of response is required (e.g. a single word, a sentence, a paragraph, a formula, a number, a statement etc.); how and where to enter the response; where to show the ‘working out’ of a problem; where to start new answers (e.g. in a separate booklet); whether only one answer is required to a multiple choice item, or more than one.
• Spreading out the instructions through the test.

Cont.
• Avoiding overloading students with too much information at first, and providing instructions for each section as they come to it.
• What marks are to be awarded for which parts of the test.
• Minimizing ambiguity and taking care over the readability of the items.
• The progression from the easy to the more difficult items of the test (i.e. the location and sequence of items).
• The visual layout of the page, for example avoiding overloading students with visual material or words.
• The grouping of items – keeping together items that have the same contents or the same format.
• The setting out of the answer sheets or locations so that they can be entered onto computers and read by optical mark readers and scanners (if appropriate).
The layout of the text should be such that it supports the completion of the test and that this is done as efficiently and as effectively as possible for the student.

Consider the timing of the test
• The timing refers to two areas: when the test will take place (the day of the week, month, time of day) and the time allowances to be given to the test and its component items.
• With regard to the former, in part this is a matter of reliability, for the time of day or week etc. might influence how alert, motivated or capable a student might be. With regard to the latter, the researcher will need to decide what time restrictions are being imposed and why; for example, is the pressure of a time constraint desirable – to show what a student can do under time pressure – or is it an unnecessary impediment, putting a time boundary around something that need not be bounded?

Cont.
• It is vital that students know what the overall time allowance is for the test, but it might also be helpful to indicate notional time allowances for different elements of the test; if these are aligned to the relative weightings of the test, they enable students to decide where to place emphasis in the test – they may want to concentrate their time on the high scoring elements of the test.
• Further, if the items of the test have exact time allowances, this enables a degree of standardization to be built into the test, and this may be useful if the results are going to be used to compare individuals or groups.
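Aligning notional time allowances to relative weightings is simple proportional arithmetic. The sketch below assumes a hypothetical 90-minute paper with three sections; the section names, marks and the function name `allocate_time` are illustrative only and do not come from the slides.

```python
def allocate_time(total_minutes, marks_by_section):
    """Split the overall time allowance in proportion to each section's marks."""
    total_marks = sum(marks_by_section.values())
    return {
        name: total_minutes * marks / total_marks
        for name, marks in marks_by_section.items()
    }

# Hypothetical paper: section name -> marks available
sections = {"Short answer": 20, "Multiple choice": 30, "Essay": 50}
allowances = allocate_time(90, sections)

for name, minutes in allowances.items():
    print(f"{name}: about {minutes:.0f} minutes")
```

Here the essay carries half the marks and so receives half the time. In practice a test setter might round these figures or reserve a few minutes for reading and checking.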

Plan the scoring of the test
• The awarding of scores for different items of the test is a clear indication of the relative significance of each item – the weightings of each item are addressed in their scoring.
• It is important to ensure that easier parts of the test attract fewer marks than more difficult parts of it, otherwise a student’s results might be artificially inflated by answering many easy questions and fewer more difficult questions (Gronlund and Linn 1990).
• Additionally, there are several attractions to making the scoring of tests as detailed and specific as possible (Cresswell and Houston 1991; Gipps 1994; Aiken 2003), awarding specific points for each item and sub-item, for example:
• It enables partial completion of the task to be recognized – students gain marks in proportion to how much of the task they have completed successfully (an important feature of domain-referencing).

Cont.
• It enables a student to compensate for doing badly in some parts of a test by doing well in other parts of the test.
• It enables weightings to be made explicit to the students.
• It enables the rewards for successful completion of parts of a test to reflect considerations such as the length of the item, the time required to complete it, its level of difficulty, its level of importance.
• It facilitates moderation because it is clear and specific.
• It enables comparisons to be made across groups by item.
• It enables reliability indices to be calculated.
• Scores can be aggregated and converted into grades straightforwardly.
• Scoring will also need to be prepared to handle issues of poor spelling, grammar and punctuation – are these to be penalized, and how will consistency be assured here?
• Further, how will issues of omission be treated, e.g. if a student omits the units of measurement (miles per hour, dollars or pounds, metres or centimetres)?
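The attractions of detailed, sub-item scoring listed above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the mark scheme, the awarded marks, the grade boundaries and the names `SubItem`, `score_script` and `to_grade` are hypothetical. It shows partial credit per sub-item, totals per item for item-by-item comparison, and a straightforward conversion of the aggregate into a grade.

```python
from dataclasses import dataclass

@dataclass
class SubItem:
    label: str      # e.g. "2.b" means item 2, part b
    max_marks: int

def score_script(scheme, awarded):
    """Return marks per item, capping each sub-item at its maximum; omitted sub-items score 0."""
    by_item = {}
    for sub in scheme:
        marks = min(max(awarded.get(sub.label, 0), 0), sub.max_marks)
        item = sub.label.split(".")[0]
        by_item[item] = by_item.get(item, 0) + marks
    return by_item

def to_grade(percent, boundaries):
    """boundaries: (cut-off percentage, grade) pairs sorted from highest to lowest."""
    for cutoff, grade in boundaries:
        if percent >= cutoff:
            return grade
    return "U"  # hypothetical label for scores below the lowest cut-off

# Hypothetical mark scheme, one student's marks, and grade boundaries
scheme = [SubItem("1.a", 2), SubItem("1.b", 3), SubItem("2.a", 5), SubItem("2.b", 10)]
awarded = {"1.a": 2, "1.b": 1, "2.a": 5, "2.b": 6}
boundaries = [(80, "A"), (65, "B"), (50, "C")]

by_item = score_script(scheme, awarded)
total = sum(by_item.values())
maximum = sum(sub.max_marks for sub in scheme)
percent = 100 * total / maximum
print(by_item, total, percent, to_grade(percent, boundaries))
```

Keeping the per-item totals alongside the grade preserves the specificity that a single aggregated grade would otherwise lose.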

Reporting the results
• Related to the scoring of the test is the issue of reporting the results. If the scoring of a test is specific then this enables variety in reporting to be addressed, for example, results may be reported item by item, section by section, or whole test by whole test.
• This degree of flexibility might be useful for the researcher, as it will enable particular strengths and weaknesses in groups of students to be exposed.
• The desirability of some of the above points is open to question. For example, it could be argued that the strength of criterion-referencing is precisely its specificity, and that to aggregate data (e.g. to assign grades) is to lose the very purpose of the criterion-referencing.
• For example, if a student is awarded a grade E for spelling in English, and a grade A for imaginative writing, this could be aggregated into a C grade as an overall grade of the student’s English language competence, but what does this C grade mean? It is meaningless, it has no frame of reference or clear criteria, it loses the useful specificity of the A and E grades, it is a compromise that actually tells us nothing. Further, aggregating such grades assumes equal levels of difficulty of all items.

Cont.
• Of course, raw scores are still open to interpretation – which is a matter of judgement rather than exactitude or precision (Wiliam 1996). For example, if a test is designed to assess ‘mastery’ of a subject, then the researcher is faced with the issue of deciding what constitutes ‘mastery’ – is it an absolute (i.e. a very high score) or are there gradations, and if the latter, then where do these gradations fall?
• For published tests the scoring is standardized and already made clear, as are the conversions of scores into, for example, percentiles and grades.
• Underpinning the discussion of scoring is the need to make it unequivocally clear exactly what the marking criteria are – what will and will not score points.
• This requires a clarification of whether there is a ‘checklist’ of features that must be present in a student’s answer.

Cont.
• Clearly criterion-referenced tests will have to declare their lowest boundary – a cut-off point – below which the student is deemed to have failed to meet the criteria.
• A compromise can be seen in those criterion-referenced tests that award different grades for different levels of performance of the same task, necessitating the clarification of different cut-off points in the examination.
• A common example of this can be seen in the GCSE examinations for secondary school pupils in the United Kingdom, where students can achieve a grade between A and F for a criterion-related examination.
• The issue of scoring takes in a range of factors, for example: grade norms, age norms, percentile norms and standard score norms.

Devising a pretest and post-test
The construction and administration of tests is an essential part of the experimental model of research, where a pretest and a post-test have to be devised for the control and experimental groups. The pretest and post-test must adhere to several guidelines:
• The pretest may have questions which differ in form or wording from the post-test, though the two tests must test the same content, i.e. they will be alternate forms of a test for the same groups.
• The pretest must be the same for the control and experimental groups.
• The post-test must be the same for both groups.
• Care must be taken in the construction of a post-test to avoid making the test easier to complete by one group than another.
• The level of difficulty must be the same in both tests.
Test data feature centrally in the experimental model of research; additionally they may feature as part of a questionnaire, interview and documentary material.

Reliability and validity of tests
• Suffice it here to say that reliability concerns the degree of confidence that can be placed in the results and the data, which is often a matter of statistical calculation and subsequent test redesigning.
• Validity, on the other hand, concerns the extent to which the test tests what it is supposed to test. This devolves on content, construct, face, criterion-related and concurrent validity.

Ethical issues in preparing for tests
• A major source of unreliability of test data derives from the extent and ways in which students have been prepared for the test. These can be located on a continuum from direct and specific preparation, through indirect and general preparation, to no preparation at all. With the growing demand for test data (e.g. for selection, for certification, for grading, for employment, for tracking, for entry to higher education, for accountability, for judging schools and teachers) there is a perhaps understandable pressure to prepare students for tests.
• This is the ‘high-stakes’ aspect of testing (Harlen 1994), where much hinges on the test results. At one level this can be seen in the backwash effect of examinations on curricula and syllabuses; at another level it can lead to the direct preparation of students for specific examinations.

Cont.
Preparation can take many forms:
• Ensuring coverage, among other programme contents and objectives, of the objectives and programme that will be tested.
• Restricting the coverage of the programme content and objectives to only those that will be tested.
• Preparing students with ‘exam technique’.
• Practising with past or similar papers.
• Directly matching the teaching to specific test items, where each piece of teaching and contents is the same as each test item.
• Practising on an exactly parallel form of the test.
• Telling students in advance what will appear on the test.
• Practising on and preparing the identical test itself (e.g. giving out test papers in advance) without teacher input.
• Practising on and preparing the identical test itself (e.g. giving out the test papers in advance), with the teacher working through the items, maybe providing sample answers.

Cont.
• How ethical it would be to undertake the final four of these is perhaps questionable, or indeed any apart from the first on the list. Are they cheating or legitimate test preparation? Should one teach to a test; is not to do so a dereliction of duty (e.g. in criterion- and domain-referenced tests) or giving students an unfair advantage and thus reducing the reliability of the test as a true and fair measure of ability or achievement?
• In high-stakes assessment (e.g. for public accountability and to compare schools and teachers) there is even the issue of not entering for tests students whose performance will be low. There is a risk of a correlation between the ‘stakes’ and the degree of unethical practice – the greater the stakes, the greater the incidence of unethical practice.

Ethical issues in preparing for tests
• Unethical practice, observes Gipps (1994), occurs where scores are inflated without any corresponding gain in reliable inference about performance or achievement, and where different groups of students are prepared differentially for tests, i.e. giving some students an unfair advantage over others. To overcome such problems, she suggests that it is ethical and legitimate for teachers to teach to a broader domain than the test, that teachers should not teach directly to the test, and that only better instruction, rather than test preparation, is acceptable (Cunningham 1998).
• One can add to this list of considerations that tests must be valid and reliable.
• The administration, marking and use of the test should be undertaken only by suitably competent/qualified people (i.e. people and projects should be vetted).

Cont.
• Access to test materials should be controlled; thus test items should not be reproduced apart from selections in professional publications, and the tests should be released only to suitably qualified professionals in connection with specific professionally acceptable projects.
• Tests should benefit the testee (beneficence).
• Clear marking and grading protocols should exist.
• Test results should be reported only in a way that cannot be misinterpreted.
• The privacy and dignity of individuals should be respected (e.g. confidentiality, anonymity, non-traceability).
• Individuals should not be harmed by the test or its results (non-maleficence).
• Informed consent to participate in the test should be sought.

Computerized adaptive testing
• In computerized adaptive testing, the decision on which particular test items to administer is based on the subject’s responses to previous items.
• It is particularly useful for large-scale testing, where a wide range of ability can be expected. Here a test must be devised that enables the tester to cover this wide range of ability; hence it must range from easy to difficult items – too easy and it does not enable a range of high ability to be charted (testees simply getting all the answers right), too difficult and it does not enable a range of low ability to be charted (testees simply getting all the answers wrong). We find out very little about a testee if we ask a battery of questions which are too easy or too difficult.
• Further, it is more efficient and reliable if a test can avoid the problem for high ability testees of having to work through a mass of easy items in order to reach the more difficult items and for low ability testees of having to try to guess the answers to more difficult items. Hence it is useful to have a test that is flexible and that can be adapted to the testees.
• For example, if a testee found an item too hard the next item could adapt to this and be easier, and, conversely, if a testee was successful on an item the next item could be harder.
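The harder-after-success, easier-after-failure idea can be sketched as a simple up/down "staircase" rule. This is a deliberately minimal illustration and not how operational adaptive tests necessarily select items (those typically rely on item response theory). The item pool, its difficulty values, the step size and the simulated testee are all hypothetical.

```python
def run_adaptive_test(item_difficulties, answer_fn, n_items, start=0.0, step=1.0):
    """Administer up to n_items, each time picking the unused item whose
    difficulty is closest to the current target difficulty."""
    remaining = dict(item_difficulties)  # item id -> difficulty (copy, so the pool is untouched)
    target = start
    administered = []
    for _ in range(min(n_items, len(remaining))):
        item_id = min(remaining, key=lambda i: abs(remaining[i] - target))
        difficulty = remaining.pop(item_id)
        correct = answer_fn(item_id, difficulty)
        administered.append((item_id, difficulty, correct))
        # Success: aim harder next time. Failure: aim easier.
        target = difficulty + step if correct else difficulty - step
    return administered

# Hypothetical pool on an arbitrary difficulty scale (negative = easy, positive = hard)
pool = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 1.0, "q5": 2.0}

def simulated_testee(item_id, difficulty):
    """Stand-in for a real response: this testee succeeds on items up to difficulty 0.5."""
    return difficulty <= 0.5

log = run_adaptive_test(pool, simulated_testee, n_items=3)
for item_id, difficulty, correct in log:
    print(item_id, difficulty, "correct" if correct else "incorrect")
```

With this simulated testee the test quickly settles around the items near the boundary of what the testee can do, without working through the whole pool.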

Cont.
• Computers here provide an ideal opportunity to address the flexibility, discriminability and efficiency of testing.
• Computer adaptive testing can reduce the number of test items presented to around 50 per cent of those used in conventional tests.
• Testees can work at their own pace, they need not be discouraged but can be challenged, the test is scored instantly to provide feedback to the testee, a greater range of items can be included in the test and a greater degree of precision and reliability of measurement can be achieved; indeed, test security can be increased and the problem of understanding answer sheets is avoided.
• Clearly the use of computer adaptive testing has several putative attractions. On the other hand, it requires different skills from traditional tests, and these might compromise the reliability of the test, for example:
• The mental processes required to work with a computer screen and computer program differ from those required for a pen and paper test.

Cont.
• Motivation and anxiety levels may increase or decrease when testees work with computers.
• The physical environment might exert a significant difference, e.g. lighting, glare from the screen, noise from machines, loading and running the software.
• Reliability shifts from an index of the variability of the test to an index of the standard error of the testee’s performance.
• The usual formula for calculating standard error assumes that error variance is the same for all scores, whereas item response theory assumes that error variance depends on each testee’s ability. The conventional statistic calculates a single average error variance of summed scores; from the standpoint of item response theory this is at best very crude and at worst misleading, as variation is a function of ability rather than of the test and cannot fairly be summed.
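The contrast between the two notions of standard error can be made concrete with standard textbook formulas, which are not given in the slides and are brought in here as assumptions. The classical standard error of measurement is SD × sqrt(1 − reliability), one value for every testee. Under a two-parameter logistic item response model (chosen here purely for illustration), the standard error at ability theta is 1 / sqrt(I(theta)), where the test information I(theta) sums a² × P × (1 − P) over the items. The item parameters below are hypothetical.

```python
import math

def classical_sem(score_sd, reliability):
    """Classical standard error of measurement: the same value for all testees."""
    return score_sd * math.sqrt(1.0 - reliability)

def p_correct(theta, a, b):
    """Two-parameter logistic model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def irt_standard_error(theta, items):
    """Standard error of the ability estimate at theta: 1 / sqrt(test information)."""
    information = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        information += a * a * p * (1.0 - p)
    return 1.0 / math.sqrt(information)

# Hypothetical items as (discrimination a, difficulty b) pairs
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

print("Classical SEM:", classical_sem(score_sd=10.0, reliability=0.91))
for theta in (-3.0, 0.0, 3.0):
    print("theta", theta, "IRT standard error", round(irt_standard_error(theta, items), 2))
```

In this sketch the standard error is smallest for abilities near the middle of the item difficulties and grows for very low or very high abilities, which is the ability-dependence that a single averaged figure conceals.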

Cont.
• Computer adaptive testing requires a large item pool to be developed for each area of the content domain, with sufficient numbers, variety and spread of difficulty.
• All items must measure a single aptitude or dimension, and the items must be independent of each other, i.e. a person’s response to an item should not depend on that person’s response to another item.
• The items have to be pretested and validated, their difficulty and discriminability calculated, the effect of distractors reduced, the capability of the test to address unidimensionality and/or multidimensionality clarified, and the rules for selecting items enacted.
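Calculating item difficulty and discriminability from pretest data can be sketched as follows. The sketch uses two common classical indices, which are assumptions rather than anything the slides specify: the facility value (proportion of testees answering the item correctly) and an upper-lower discrimination index (proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group, here the top and bottom 27 per cent). The response data are hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """responses: one list of 0/1 item scores per testee.
    Returns a (facility, discrimination) pair for each item."""
    n_testees = len(responses)
    n_items = len(responses[0])
    ranked = sorted(responses, key=sum, reverse=True)  # highest total score first
    k = max(1, round(n_testees * group_fraction))      # size of upper and lower groups
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for j in range(n_items):
        facility = sum(r[j] for r in responses) / n_testees
        discrimination = (sum(r[j] for r in upper) - sum(r[j] for r in lower)) / k
        results.append((facility, discrimination))
    return results

# Hypothetical pretest data: six testees, three items (1 = correct, 0 = incorrect)
pretest = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

analysis = item_analysis(pretest)
for number, (facility, discrimination) in enumerate(analysis, start=1):
    print(f"Item {number}: facility {facility:.2f}, discrimination {discrimination:.2f}")
```

Items with a facility close to 0 or 1, or with a discrimination index near zero or negative, would be candidates for revision before entering the item pool.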

Further Reading
• Cohen, L. and Manion, L. (2012) Research Methods in Education (7th edition). London: Routledge.

