Presentation Transcript
  1. Role of Statistics in Developing Standardized Examinations in the US by Mohammad Hafidz Omar, Ph.D. April 19, 2005

  2. Map of Talk • What is a standardized test? • Why standardize tests? • Who builds standardized tests in the United States? • Steps to building a standardized test • Test questions & some statistics used to describe them • Statistics used for describing exam scores • Research studies in educational testing that use advanced statistical procedures

  3. What is a “standardized examination”? • A standardized test: a test for which the conditions of administration and the scoring procedures are designed to be the same in all uses of the test • Conditions of administration: • 1) physical test setting • 2) directions for examinees • 3) test materials • 4) administration time • Scoring procedures: • 1) derivation of scores • 2) transformation of raw scores

  4. Why standardize tests? • Statistical reason: • Reduction of unwanted variations in • Administration conditions • Scoring practices • Practical reason: • Appeal to many test users • Same treatment and conditions for all students taking the tests (fairness)

  5. Who builds standardized tests in the United States? • Testing organizations • Educational Testing Service (ETS) • American College Testing (ACT) • National Board of Medical Examiners (NBME) • Iowa Testing Programs (ITP) • Center for Educational Testing and Evaluation (CETE) • State departments of education • New Mexico State Department of Education • Build tests themselves or • Contract out the job to testing organizations • Large school districts • Wichita Public School Districts

  6. a) Administration conditions • Design-of-experiments concept: control for extraneous factors • Apply the same treatment conditions for all test takers • 1) physical test setting (group vs individual testing, etc) • 2) directions for examinees • 3) test materials • 4) administration time

  7. b) Scoring Procedures • Same scoring process • Scoring rubric for open-ended items • Same score units and same measurements for everybody • Raw test scores (X) • Scale scores • Same transformation of raw scores • Raw (X) → Equating process → Scale scores h(X)
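The raw-to-scale conversion h(X) above can be sketched as a simple linear transformation applied identically to every examinee. The slope and intercept here are hypothetical; real testing programs derive them from an equating study.

```python
# Sketch: one raw-to-scale transformation h(X) used for all examinees.
# The slope/intercept values below are illustrative only.
def h(raw_score, slope=2.5, intercept=100):
    """Linear raw-to-scale conversion: h(X) = slope * X + intercept."""
    return slope * raw_score + intercept

raw_scores = [0, 20, 40]
scale_scores = [h(x) for x in raw_scores]  # same rule for everybody
```

Because the same h(X) is applied to all examinees, differences in scale scores reflect only differences in raw performance, not in scoring practice.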

  8. Overview of the Typical Standardized Examination Building Process • Costly process • Important quality-control procedures at each phase • Process takes time (months to years) • Creating test specifications • Fresh item development • Field-test development • Operational (live) test development

  9. 1) Creating Test Specifications • Purpose: • To operationalize the intended purpose of testing • A team of content experts and stakeholders • discusses the specifications vs the intended purpose • Serves as a guideline for building examinations • How many items should be written in each content/skill category? • Which content/skill areas are more important than others? • A 2-way table of specifications typically contains • content areas (domains) versus • learning objectives • with a % of importance associated with each cell
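A 2-way table of specifications of this kind can be turned directly into item counts. The content areas, objectives, and percentages below are made up for illustration; a real blueprint would come from the expert panel.

```python
# Hypothetical 2-way table of specifications: % importance per
# (content area, learning objective) cell, converted to item counts
# for a 50-item test. All cell names and percentages are illustrative.
spec_percent = {
    ("Algebra", "Recall"): 10,    ("Algebra", "Apply"): 20,
    ("Geometry", "Recall"): 20,   ("Geometry", "Apply"): 20,
    ("Statistics", "Recall"): 10, ("Statistics", "Apply"): 20,
}
total_items = 50

# Number of items to write for each cell, proportional to its weight
counts = {cell: round(total_items * pct / 100)
          for cell, pct in spec_percent.items()}
```

The cells with the largest percentages are exactly the ones that "need more items" in the fresh-item-development step that follows.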

  10. 2) Fresh Item Development • Purpose: • Building quality items to meet test specifications • Writing items to meet test specifications • Q: Minimum # of items to write? • Which cells will need more items? • Item review (content & bias review) • Design of Experiment stage • Design of test (easy items first, then mixture – increases motivation) • Design of testing event (what time of year, sample, etc) • Data Collection stage: • Pilot-testing of items • Scoring of items & PT exams • Analyses stage: • Analyzing test items • Data Interpretation & decision-making stage: • Item review with the aid of item statistics • Content review • Bias review • Quality-control step: (1) keep good-quality items, (2) revise items with minor problems & re-pilot, or (3) scrap bad items

  11. 3) Field-Test Development • Purpose: • Building quality exam scales to measure the construct (structure) of the test as intended by the test specifications • Design of Experiment stage • Designing field-test booklets to meet specifications • Use only good items from the previous stage (items with known descriptive statistics) • Design of testing event • Data collection: • Field-testing of test booklets • Scoring of items and FT exams • Analyses • Analyzing examination booklets (for scale reliability and validity) • Interpreting results: item & test review • Do tests meet the minimum statistical requirements (e.g., rxx′ > 0.90)? • If not, what can be done differently?

  12. 4) Operational (Live) Test Development • Purpose: • To measure student abilities as intended by the purpose of the test • Design of Experiment stage • Design of operational test • Use only good FT items and FT item sets • Assembling operational exam booklets • Design of pilot tests (e.g. some state-mandated programs) • New & some of the revised items • Design of field test (e.g. GRE experimental section) • Good items that have been piloted before • How many sections? How many students per section? • Design of additional research studies • e.g. Different forms of the test (paper-and-pencil vs computer version) • Design of testing events • Data Collection: • First operational testing of students with the final version of examinations • Scoring of items and exams • Analyses of operational examinations • Research studies to establish reporting scales

  13. Different types of exam item formats • Machine-scorable formats • Multiple-choice questions • True-false • Multiple true-false • Multiple-mark questions (Pomplun & Omar, 1997) – aka multiple-answer multiple-choice questions • Likert-type items (agree/disagree continuum) • Manual (human) scoring formats • Short answers • Open-ended test items • Require a scoring rubric to score papers

  14. Statistical considerations in Examination construction • Overall design of tests • to achieve reliable (consistent) and valid results • Designing testing events • to collect reliable and valid data (correct pilot sample, correct time of the year, etc) • e.g. SAT: Spring/Summer student population difference • Appropriate & Correct Statistical analyses of examination data • Quality Control of test items and exams

  15. Analyses & Interpretation: Descriptive statistics for distractors (Distractor Analysis) • Applies to multiple-choice, true-false, and multiple true-false formats only • Statistics: • Proportion endorsing each distractor • Informs the exam authors which distractor(s) • are not functioning or • are counter-intuitively more attractive than the intended right answer (high-ability examinees choosing a wrong answer)
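The proportion endorsing each option can be sketched with a few lines of Python. The response vector and the keyed answer ("B") are made-up data.

```python
from collections import Counter

# Sketch of a distractor analysis for one multiple-choice item.
# Ten examinees' option choices; the keyed (correct) answer is "B".
responses = ["B", "A", "B", "C", "B", "D", "A", "B", "B", "C"]
key = "B"

n = len(responses)
proportions = {opt: cnt / n for opt, cnt in Counter(responses).items()}
# A distractor endorsed by ~0% of examinees is "not functioning";
# a distractor endorsed more often than the key deserves review.
```

In operational work these proportions are also broken down by ability group, so that a wrong option attracting high scorers stands out.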

  16. Analyses and Interpretation: Item-Level Statistics • Difficulty of items • Statistics: • Proportion correct {p-value} – mc, t/f, m-t/f, mm, short answer • Item mean – mm, open-ended items • Describes how difficult an item is • Discrimination • Statistics: • Discrimination index: high-vs-low examinee group difference in p-value • An index describing sensitivity to instruction • Item-total correlations: correlation of an item (dichotomously or polychotomously scored) with the total score • Pt-biserials: correlation between the total score & the dichotomous (right/wrong) item being examined • Biserials: same as pt-biserials except that the dichotomous item is now assumed to come from a normal distribution of student ability in responding to the item • Polyserials: same as biserials except that the item is polychotomously scored • Describes how an item relates (thus, contributes) to the total score
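Two of the statistics above, the p-value and the point-biserial, can be sketched directly. The 0/1 item responses and total scores below are made-up data.

```python
import statistics

# Sketch: item p-value and point-biserial for one dichotomous item.
item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]    # right (1) / wrong (0)
total = [9, 8, 3, 7, 4, 8, 6, 2, 9, 7]   # total test scores

p_value = sum(item) / len(item)          # proportion correct

def point_biserial(item, total):
    """Pearson correlation between a 0/1 item and the total score."""
    mi, mt = statistics.mean(item), statistics.mean(total)
    cov = sum((i - mi) * (t - mt) for i, t in zip(item, total))
    si = sum((i - mi) ** 2 for i in item) ** 0.5
    st = sum((t - mt) ** 2 for t in total) ** 0.5
    return cov / (si * st)
```

A strongly positive point-biserial means examinees who answer the item correctly also tend to score high overall, i.e. the item contributes to, rather than fights against, the total score.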

  17. Examination-Level Statistics • Overall difficulty of exams/scale • Statistics: test mean, average item difficulty • Overall dispersion of exam/scale scores • Statistics: test variability – standard deviation, variance, range, etc • Test speededness • Statistics: 1) Percent of students attempting the last few questions • 2) Percentage of examinees finishing the test within the allotted time period • A test is considered unspeeded if this percentage exceeds 95% • Consistency of the scale/exam scores • Statistics: • Scale reliability indices • KR-20: for dichotomously scored items • Coefficient alpha: for dichotomously and polychotomously scored items • Standard error of measurement indices • Validity measures of scale/exam scores • Intercorrelation matrix • High correlation with similar measures • Low correlation with dissimilar measures • Structural analyses (factor analyses, etc)
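Coefficient alpha (of which KR-20 is the dichotomous special case) can be computed from a small response matrix. The 0/1 data below are made up; rows are examinees and columns are items.

```python
import statistics

# Sketch: coefficient alpha for a 5-examinee x 4-item response matrix.
# With dichotomous items and population variances this equals KR-20.
X = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

k = len(X[0])                                          # number of items
item_vars = [statistics.pvariance(col) for col in zip(*X)]
total_var = statistics.pvariance([sum(row) for row in X])

# alpha = k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

A field-test review would compare this value against the program's minimum (e.g. the 0.90 threshold mentioned on slide 11) before accepting the scale.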

  18. Statistical procedures describing Validity of Examination scores for the intended use • Is the exam, as experienced by students, the same as the authors’ exam specifications? • Construct validity: analyses of exam structures (intercorrelation matrix, factor analyses, etc) • Can the exam measure the intended learning factors (constructs)? • Answer: with factor analyses (a data-reduction method) • Predictive validity: predictive power of exam scores for explaining important variables • e.g. Can exam scores explain (or predict) success in college? • Regression analyses • Differential Item Functioning: statistical bias in test items • Are test items fair for all subgroups (female, Hispanic, Black, etc) of examinees taking the test? • Mantel-Haenszel chi-squared statistics
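The Mantel-Haenszel approach to DIF can be sketched via its common odds ratio: within each total-score stratum, compare the odds of a correct answer for the reference and focal groups. The counts below are made-up data.

```python
# Sketch of the Mantel-Haenszel common odds ratio used in DIF
# screening. Each stratum matches reference and focal examinees on
# total score; all counts are illustrative.
# Each entry: (ref_right, ref_wrong, focal_right, focal_wrong)
strata = [
    (30, 10, 25, 15),   # low-score stratum
    (40, 5, 35, 10),    # middle stratum
    (45, 2, 44, 3),     # high-score stratum
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den   # ~1.0 suggests no DIF; values far from 1 flag the item
```

The chi-squared significance test mentioned in the slide adds a continuity-corrected test statistic on top of this ratio; the ratio itself is what is usually reported as the effect size.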

  19. Some research areas in Educational Testing that involve further statistical analyses • Reliability Theory • How consistent is a set of examination scores? The signal-to-(signal + noise) ratio, σ²T / (σ²T + σ²E), in educational measurement • Generalizability Theory • Describing & controlling for more than one source of error variance • Differential Item Functioning • Pair-wise differences (F vs M, B vs W) in student performance on items • Type I error rate control (many items & comparisons → inflated false-detection rates) issue

  20. Some research areas in Educational Testing that involve further statistical analyses (continued) • Test Equating • Two or more forms of the exam: are they interchangeable? • If scores on form X are regressed on scores from form Y, will the scores from either test edition be interchangeable? Different regression functions • Item Response Theory • Theory relating students’ unobserved ability to their responses to items • Probability of responding correctly to test items at each level of ability (item characteristic curves) • Can put items (not tests) on the same common scale • Vertical Scaling • How does student performance from different school grade groups compare? • Are the group means increasing rapidly, slowly, etc? • Are the group variances constant, increasing, or decreasing?
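An item characteristic curve can be sketched with the two-parameter logistic (2PL) model, one of the standard IRT models. The discrimination (a) and difficulty (b) values here are hypothetical item parameters.

```python
import math

# Sketch: a two-parameter logistic (2PL) item characteristic curve.
# a = discrimination (slope), b = difficulty (location on the
# ability scale). Parameter values are illustrative only.
def icc(theta, a=1.2, b=0.0):
    """P(correct | ability theta) under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Probability of a correct response rises with ability theta
probs = [icc(t) for t in (-2.0, 0.0, 2.0)]
```

At theta = b the curve passes through 0.5, which is what lets items of different difficulty be placed on the same common ability scale.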

  21. Some research areas in Educational Testing that involve further statistical analyses (continued) • Item Banking • Are the same items from different administrations significantly different in their statistical properties? • Need Item Response Theory to calibrate all items so that there is one common scale • Advantage: can easily build test forms with similar test difficulty • Computerized Testing • Are score results taken on computers interchangeable with those from paper-and-pencil editions? (e.g. http://ftp.ets.org/pub/gre/002.pdf) • Are measures of student performance free from, or tainted by, examinees’ level of computer anxiety? • Computerized Adaptive Testing • Increases measurement precision (test information function) by allowing students to take only items that are at their own ability level
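The adaptive-selection idea can be sketched as maximum-information item choice: from a calibrated bank, administer the item with the most Fisher information at the current ability estimate. The bank contents and parameter values below are hypothetical.

```python
import math

# Sketch of adaptive item selection from an IRT-calibrated bank:
# pick the item with maximum information at the current ability
# estimate. All item names and (a, b) parameters are illustrative.
def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1 - p)

bank = {"item1": (1.0, -1.5),   # easy item
        "item2": (1.3, 0.0),    # medium item
        "item3": (0.9, 1.8)}    # hard item
theta_hat = 0.1                 # current ability estimate

# The medium-difficulty item is most informative near theta_hat = 0.1
best = max(bank, key=lambda name: info(theta_hat, *bank[name]))
```

Because information peaks where the item difficulty matches the examinee's ability, this rule naturally gives each student items "at their own ability level," as the slide describes.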