
The BILC BAT: A Research and Development Success Story


Presentation Transcript


  1. The BILC BAT: A Research and Development Success Story. Ray T. Clifford, BILC Professional Seminar, Vienna, Austria, 11 October.

  2. Language is the most complex of human behaviors. • Language proficiency is clearly not a simple, one-dimensional trait. • Therefore, language development cannot be expected to be linear. • However, language proficiency can be assessed against a hierarchy of identifiable common stages of language skill development.

  3. Testing Language Proficiency in the Receptive Skills • Norm-referenced statistical analyses are problematic when testing for proficiency. • Rasch one-parameter IRT analysis assumes: • A one-dimensional trait. • Linear skill development. • That all test items discriminate equally well. • Norm-referenced statistics are meant to distinguish all students from one another, not to separate passing students from failing students.
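
To make these assumptions concrete, here is a minimal Python sketch of the Rasch one-parameter model (the code and its parameter names are illustrative, not part of the BAT):

```python
import math

def rasch_p_correct(theta: float, b: float) -> float:
    """Rasch (one-parameter) IRT model: probability that a person of
    ability theta answers an item of difficulty b correctly.
    Every item shares the same slope, i.e. all items are assumed to
    discriminate equally well -- the assumption questioned above."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Curves for two items differ only in location, never in steepness.
for theta in (-1.0, 0.0, 1.0):
    print(f"{theta:+.1f}  easy={rasch_p_correct(theta, 0.0):.2f}"
          f"  hard={rasch_p_correct(theta, 1.0):.2f}")
```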

  4. Testing Language Proficiency in the Receptive Skills • Norm-referenced statistical analyses are problematic when testing for proficiency. • They require too many subjects for use in LCTLs (less commonly taught languages). • About 100 to 300 test subjects of varying abilities must answer each item, and there may not be that many people available to test. • The results have no direct relationship to proficiency levels or other external criteria.

  5. Testing Language Proficiency in the Receptive Skills • Norm-referenced statistical analyses are problematic when testing for proficiency. • There has been no adequate way of ensuring that the range of skills tested and the difficulty of any given test match the targeted range of the language proficiency scale. • Setting passing scores using norm-referenced statistics is an imprecise process. • Setting multiple cut-scores from a single total test score violates the criterion-referenced principle of non-compensatory scoring.

  6. Test Development Procedures: Norm-Referenced Tests • Create a table of test specifications. • Train item writers in item-writing techniques. • Develop items. • Test the items for difficulty and reliability by administering them to several hundred learners. • Use statistics to eliminate “bad” items (see the sketch below). • Administer the resulting test. • Report results compared to other students, or attempt to relate these norm-referenced results to a polytomous set of criteria (such as the STANAG scale).
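
The “use statistics to eliminate bad items” step conventionally means computing each item's difficulty (proportion correct) and its point-biserial discrimination. A minimal sketch, assuming a simple 0/1 response matrix; the 0.2 screening threshold in the comment is a common rule of thumb, not a BILC figure:

```python
from statistics import mean, pstdev

def item_stats(responses):
    """responses: one 0/1 list per person, one entry per item.
    Returns (difficulty, point_biserial) for each item -- the
    norm-referenced statistics typically used to flag 'bad' items."""
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    out = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        p = mean(item)                                 # difficulty: proportion correct
        rest = [t - x for t, x in zip(totals, item)]   # total minus this item
        sd = pstdev(rest)
        right = [r for r, x in zip(rest, item) if x == 1]
        wrong = [r for r, x in zip(rest, item) if x == 0]
        if sd == 0 or not right or not wrong:
            rpb = 0.0                                  # degenerate item: no spread
        else:
            rpb = (mean(right) - mean(wrong)) / sd * (p * (1 - p)) ** 0.5
        out.append((p, rpb))
    return out

# Items with p near 0 or 1, or a point-biserial below ~0.2, would be discarded.
```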

  7. Traditional Method of Setting Cut Scores [Diagram: a test to be calibrated is administered to groups of “known” ability at Levels 1, 2, and 3.]

  8. The Results You Hope For [Diagram: the same groups of “known” ability (Levels 1, 2, and 3) and the test to be calibrated; the hoped-for score distributions separate cleanly by level.]

  9. The Results You Always Get [Diagram: the test scores received by the Level 1, 2, and 3 groups overlap.]

  10. Why is there always an overlap? • Total scores are by definition “compensatory” scores. • Every answer guessed correctly adds to the individual’s score. • There is no way to check for ability at a given proficiency level. • Students with different abilities may have attained the same score, e.g. by • Answering only the Level 1 questions right, or • Answering 25% of all the questions right.
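
A tiny worked example of the ambiguity; the 30-item layout (10 items per level) is hypothetical:

```python
# Hypothetical 30-item test: 10 items at each of Levels 1, 2, 3.
student_a = {"L1": 10, "L2": 0, "L3": 0}   # solid Level 1, nothing above
student_b = {"L1": 4,  "L2": 3, "L3": 3}   # ~25% everywhere, likely guessing

total_a = sum(student_a.values())   # 10
total_b = sum(student_b.values())   # 10 -> identical compensatory totals

# A single cut score cannot tell these profiles apart, even though
# student A has demonstrated sustained Level 1 ability and student B
# has demonstrated sustained ability at no level at all.
print(total_a == total_b)  # True
```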

  11. No matter where the cut scores are set, they are wrong for someone. [Diagram: the overlapping test-score distributions of the Level 1, 2, and 3 groups, with no cut score that separates them cleanly.]

  12. A Better Way • We can test language proficiency using criterion-referenced instead of norm-referenced testing procedures.

  13. Criterion-Referenced Proficiency Testing in the Receptive Skills • Items must strictly adhere to the proficiency “Table of Specifications”. • Every component of the test item must be aligned with and match the specifications of a single level of the proficiency scale: • The text difficulty. • The author's purpose. • The task asked of the reader/listener.

  14. Criterion-Referenced Proficiency Testing in the Receptive Skills • Testing reading and listening proficiency requires “independent, non-compensatory scoring” for each proficiency level, rather than calculating a single score for the entire test. • This makes the test development process more complex. • It requires trained item writers and reviewers. • It begins with “modified Angoff” ratings instead of IRT procedures to validate items.

  15. The BILC Benchmark Advisory Test (Reading) Is a Criterion-Referenced Proficiency Test.

  16. Steps in the Process • We updated the STANAG 6001 Proficiency Scale. • Each level describes a measurable point on the scale. • These assessment points are not arbitrary, but represent useful levels of ability, e.g. Survival, Functional, Professional, etc. • Thus, each level represents a defined “construct” of language ability.

  17. Steps in the Process • We validated the scale. • The hierarchical nature of these constructs had been operationally – but not statistically – validated. • A statistical validation process was run in Sofia, Bulgaria. • The results substantiated the validity of the scale’s operational use.

  18. STANAG 6001 Scale Validation Exercise, Conducted at Sofia, Bulgaria, 13 October 2005

  19. Instructions • On the top of a blank piece of paper, write the following information: • Your current work assignment: Teacher, Tester, Administrator, Other______ • Your first (or dominant) language: _________ • You do not need to write your name!

  20. Instructions • Next, write the numbers: 0 1 2 3 4 5 down the left side of the paper.

  21. Instructions • You will now be shown 6 descriptions of language speaking proficiency. • Each description will be labeled with a color.

  22. Instructions • Rank the descriptions according to their level of difficulty by writing their color designation next to the appropriate number: 0 (easiest) = Color ? 1 (next easiest) = Color ? 2 (next easiest) = Color ? 3 (next easiest) = Color ? 4 (next easiest) = Color ? 5 (most difficult) = Color ?

  23. Ready? • The descriptions will now be presented… • One at a time, • In a random sequence, • For 15 seconds each. • You will see each of the descriptors 4 times. • Thank you for participating in this experiment.

  24. STANAG 6001 Scale Validation: A Timed Exercise Without Training • 74 people turned in their rankings. • They marked their current work assignments as: • Administrator 49 • Teacher 26 • Tester 19 • Other 1

  25. Results of the STANAG Scale Validation (n = 74)
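
The slides do not name the statistic used to summarize the 74 rankings; one conventional choice for this kind of exercise is Kendall's coefficient of concordance (W), sketched here as an assumption:

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m judges ranking n items
    (here: the six color-coded STANAG level descriptors).
    rankings: one list per judge, each a permutation of ranks 0..n-1.
    W = 1 means every judge produced the identical ordering."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Perfect agreement among three judges on six descriptors:
print(kendalls_w([[0, 1, 2, 3, 4, 5]] * 3))  # 1.0
```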

  26. Steps in the Process • We used the STANAG 6001 base proficiency levels as the definitive specifications for item development. • The author's task and purpose in producing the text have to be aligned with the question or task asked of the reader. • The written (or audio) text type and the linguistic characteristics of each item must also be characteristic of the proficiency level targeted by the item.

  27. Steps in the Process • The items developed then had to pass a strict review of whether each item matched the design specifications. • Multiple expert judges made independent judgments of whether each item matched the targeted level. • Only the items which passed this review with the unanimous consensus of trained judges were taken to the next step.

  28. Steps in the Process • The next step was a “bracketing” process to check the adequacy of each question's multiple-choice options. • Experts were asked to make independent judgments about how likely a learner at the next lower level would be to answer the question correctly. • Responses significantly above chance (25% on a four-option item) made the item unacceptable. • In such cases the item, the item question, or the item choices had to be discarded or revised.
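
A minimal sketch of the lower-bracket check. The slide calls for responses “significantly above chance”; a formal significance test would be used in practice, so the fixed 0.10 margin here is purely an illustrative assumption:

```python
CHANCE = 0.25  # four-option multiple choice

def passes_lower_bracket(judge_estimates, margin=0.10):
    """judge_estimates: each expert's independent estimate of the
    probability that a learner ONE LEVEL BELOW the item's target
    answers it correctly.  An item survives only if that probability
    stays near chance; the fixed margin stands in for a real
    significance test and is an assumption for illustration."""
    mean_est = sum(judge_estimates) / len(judge_estimates)
    return mean_est <= CHANCE + margin

print(passes_lower_bracket([0.20, 0.25, 0.30]))  # True  -> keep item
print(passes_lower_bracket([0.50, 0.45, 0.60]))  # False -> revise or discard
```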

  29. Steps in the Process (cont.) • Experts made independent judgments about how likely a learner at the next higher level would be to answer each question correctly. • If the item would not be answered correctly by this more competent group, it was rejected. • (Because of human limitations such as inattention, fatigue, and carelessness, it was recognized that the correct-response probability for this more competent group would be less than 100%.)

  30. Steps in the Process • Items that passed the technical specifications review and the bracketing process then underwent a “Modified Angoff” rating procedure. • Expert judges rated the probability that each item would be correctly answered by a person who was fully competent at the targeted proficiency level. • If the independent probability ratings produced an outlier rating or a standard deviation of more than 5 points, the item was rejected or revised.
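
A minimal sketch of how such Angoff ratings might be screened; the 2-standard-deviation outlier rule is an assumption, since the slides do not define “outlier” precisely:

```python
from statistics import mean, stdev

def screen_angoff(ratings, max_sd=5.0, outlier_sds=2.0):
    """ratings: independent judge estimates (0-100) of the probability
    that a fully competent examinee answers the item correctly.
    Rejects the item if the ratings disagree too much (SD > 5 points,
    per the slide) or if any single rating is an outlier (the 2-SD
    rule here is an illustrative assumption)."""
    m, sd = mean(ratings), stdev(ratings)
    if sd > max_sd:
        return False, "ratings too dispersed"
    if sd > 0 and any(abs(r - m) > outlier_sds * sd for r in ratings):
        return False, "outlier rating"
    return True, "accepted"

print(screen_angoff([78, 80, 82, 79]))   # (True, 'accepted')
print(screen_angoff([60, 85, 90, 88]))   # (False, 'ratings too dispersed')
```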

  31. Steps in the Process • Items found acceptable in the “Modified Angoff” rating procedure were assembled into an online test. • The test had three subtests of 20 items each: a separate subtest for each of the Reading proficiency Levels 1, 2, and 3. • Each subtest was graded separately. • “Sustained performance” (passing) on each subtest was defined as the mean Angoff rating minus one standard deviation, or 70%.
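
A sketch of the resulting cut-score rule; reading “the mean Angoff rating minus one standard deviation, or 70%” as a 70% floor is an interpretive assumption:

```python
from statistics import mean, stdev

def subtest_cut_score(angoff_ratings):
    """Passing threshold ('sustained performance') for one 20-item
    subtest: the mean Angoff rating minus one standard deviation,
    treated here -- as an interpretive assumption -- as never falling
    below 70%."""
    return max(mean(angoff_ratings) - stdev(angoff_ratings), 70.0)

# e.g. judges' mean rating 81% with SD ~2.6 -> cut score ~78.4%,
# i.e. at least 16 of the 20 items must be answered correctly.
print(subtest_cut_score([80, 82, 78, 84]))
```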

  32. More About Scoring • Scoring had to follow criterion-referenced, non-compensatory proficiency assessment procedures. • “Sustained” ability would be required to qualify as proficient at each level. • Summary ratings would consider both “floor” and “ceiling” abilities. • Each learner’s performance profile would determine “between-level” ratings (if any).
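
A simplified sketch of non-compensatory profile scoring with a floor check; the cut scores and the stop-at-first-failure rule are illustrative, as the BAT's actual “between-level” profile rules are not detailed in these slides:

```python
CUTS = {"L1": 70.0, "L2": 70.0, "L3": 70.0}  # per-subtest pass marks (assumed)

def profile_rating(subtest_pct, cuts=CUTS):
    """Non-compensatory scoring: each level's subtest percentage is
    judged against its own cut score, and points never transfer
    between levels.  Returns the highest level at which performance
    is sustained from the floor upward."""
    rating = 0
    for level in ("L1", "L2", "L3"):
        if subtest_pct[level] >= cuts[level]:
            rating = int(level[1])   # floor sustained; check the next ceiling
        else:
            break                    # non-compensatory: stop at first failure
    return rating

print(profile_rating({"L1": 90, "L2": 75, "L3": 40}))  # 2
print(profile_rating({"L1": 60, "L2": 85, "L3": 80}))  # 0: no sustained floor
```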

  33. And the results… More pilot testing will be done, but here are the results of the first 36 pilot tests:

  34. Congratulations! Working together, we have solved a major testing problem – a problem which has plagued language testers for decades. We have developed a criterion-referenced proficiency test of Reading which • Accurately assigns proficiency levels. • Has both face validity and statistical validity.

  35. Questions?

  36. Some additional thoughts… • The assessment points or levels in the STANAG 6001 scale may be thought of as “chords” – each of which describes a short segment along an extended multi-dimensional proficiency development scale. • These “chords” represent cross-dimensional constellations of factors that describe different levels of language ability. Like chords approximating a curve in calculus, these defined progress levels allow us to accurately measure whether the particular set of factors described at each level has been mastered. • Each proficiency level or factor constellation can also be seen as a separate construct, and these constructs can be shown to form an ascending hierarchy of increasing language proficiency which meets Guttman scaling criteria (see the sketch below). • Therefore, these “points” on the scale can also indicate overall proficiency development.
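
For the Guttman-scaling claim, a minimal sketch of the classical coefficient of reproducibility, applied here to hypothetical pass/fail profiles on the ordered level subtests:

```python
def guttman_reproducibility(profiles):
    """profiles: per-person pass/fail (1/0) results on the ordered
    level subtests, lowest level first.  In a perfect Guttman scale
    every pass precedes every failure (e.g. 1, 1, 0), giving a
    coefficient of 1.0; values >= 0.90 are the conventional
    threshold for scalability."""
    errors = total = 0
    for p in profiles:
        total += len(p)
        # ideal Guttman pattern with the same number of passes
        ideal = [1] * sum(p) + [0] * (len(p) - sum(p))
        errors += sum(a != b for a, b in zip(p, ideal))
    return 1 - errors / total

print(guttman_reproducibility([[1, 1, 0], [1, 0, 0], [1, 1, 1]]))  # 1.0
print(guttman_reproducibility([[0, 1, 1]]))                        # ~0.33
```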
