
FUTURE OF ASSESSMENT IN INDIA: CHALLENGES AND SOLUTIONS
Hariharan Swaminathan, University of Connecticut


Presentation Transcript


  1. FUTURE OF ASSESSMENT IN INDIA: CHALLENGES AND SOLUTIONS. Hariharan Swaminathan, University of Connecticut

  2. Testing: A Brief History • Testing is part of human nature and has a long history. In the Old Testament, “mastery testing” was used to “classify” people into two groups: • The Gileadites defeated the Ephraimites, and when the Ephraimites tried to escape, they were given a high-stakes test; • The test was to pronounce the word “shibboleth”. Failure had a drastic consequence. • Civil service exams were used in China more than 3000 years ago.

  3. Testing: A Brief History • Testing in India may go even further back in time. • In the West, testing has a mixed history; its use has waxed and waned. • In the US, Horace Mann argued for written (objective-type) exams, and the first was introduced in Boston in 1845. • Tests were used for grade-to-grade promotion. • This testing practice fell into disrepute because of teaching to the test.

  4. Testing: A Brief History • Grade promotion based on testing was abolished in Chicago in 1881. • Binet introduced mental testing in 1901 (his test later became the Stanford-Binet test). • The issue of fairness, that “everyone should get the same test,” was not relevant to Binet. • Binet rank-ordered the items by difficulty and targeted items to the child’s ability. • The first “adaptive test” was born.

  5. Individualized Testing • Adaptive testing was the primary mode of testing before the notion of group testing was introduced. • With the advent of group testing, individualized (adaptive) testing was put on the back burner, primarily because it was impossible to administer adaptively to large groups. • Group-based adaptive testing was not feasible, until… • We will return to adaptive testing later.

  6. Testing in India • Testing in India has a long history. • Knowledge/skill testing was common in ancient India. • Rama was subjected to competency testing by Sugriva. • Yudhisthira was tested with a 133-item, high-stakes test (the Yaksha Prashna). • According to the Bhagavatam, the name Parikshit (Examiner) was given to the successor of Yudhisthira. • Testing in the form of puzzles and games was often used for entertainment.

  7. Testing in India • India is a country of superlatives. • India can boast of probably the longest tradition of education, stretching over at least two and a half millennia. • Taxasila is considered the oldest seat of learning in the world. • Nalanda is perhaps the oldest university in the world. • This tradition of learning and the value placed on education continue in India.

  8. Testing in India • [Chart: Population Distribution in India]

  9. Testing in India • These numbers are expected to decline slightly over the next twenty years but will nevertheless far exceed those of any other country in the world. • The tradition of learning, the value placed on education, and the population explosion have created considerable stress on the Indian education system. • It is not surprising that assessment and testing procedures in India have focused on selection and certification.

  10. Testing in India • Some of the selection examinations in India are perhaps the most grueling and most selective in the world. • IAS Examination: of the 450,000 candidates, 1,200 are selected (0.3%). • IIT Joint Entrance Examination: of the 500,000 applicants, only about 10,000 are selected for admission (2%). • According to C.N.R. Rao, scientific advisor to the previous prime minister, “India has an examination system but not an education system.” • Another criticism levelled against the examination system is that it promotes intensive coaching with little regard for a properly grounded knowledge base.

  11. Testing in India • Needless to say, these criticisms are not unlike those levelled against testing in the US. • However, intensive testing in US schools is now directed more towards assessment of learning and growth for accountability purposes than towards assessment of student achievement for certification and promotion. • Although assessment and testing play an important role in Indian education, assessment practices in India do not seem to have kept pace with modern approaches and trends in testing and assessment.

  12. Test Uses in the U.S. • Management of Instruction • Placement and Counseling • Selection • Licensure and Certification • Accountability

  13. Test Uses in the U.S. • 1. Management of Instruction • Classroom and standardized tests for daily management of instruction (formative and diagnostic evaluation) • Classroom and standardized tests for grading (summative evaluation) • 2. Placement and Counseling • Standardized tests for transition from one level of school to another or from school to work

  14. Test Uses in the U.S. • 3. Selection (Entry Decisions) • Standardized achievement and aptitude tests for admission to college, graduate school, and special programs • 4. Licensure and Certification • Standardized tests for determining qualification for entry into a profession • School graduation requirements

  15. Test Uses in the U.S. • 5. Accountability • Standardized tests to show satisfactory achievement or growth of students in schools receiving public money, often required by state and federal legislation. If students in schools do not show adequate progress, sanctions are imposed on the schools and on the state.

  16. Theoretical Framework for Tests • For all these purposes we need to design tests for measuring the student’s “ability” or “proficiency”. • The ability/proficiency test scores must be appropriate for the intended use. • Tests are measurement instruments, and when we use them to measure, as with all measurement devices, we make measurement errors.

  17. Theoretical Framework for Tests • The objective of measurement is to measure what we want to measure appropriately and do so with minimum error. • The construction of tests and the determination of an examinee’s proficiency level/ability scores are carried out within one of two theoretical frameworks: • Classical Test Theory • Modern Test Theory or Item Response Theory (IRT)

  18. Classical Test Theory Model • X = T + E (Observed Score = True Score + Error) • σ²(X) = σ²(T) + σ²(E) (Observed Score Variance = True Score Variance + Error Variance)
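To make the variance decomposition concrete, here is a minimal simulation sketch in Python; the normal distributions and their parameters are illustrative assumptions, not values from the presentation:

```python
import numpy as np

# Minimal sketch of the classical test theory model X = T + E.
# Distributions and sample size are assumed for illustration only.
rng = np.random.default_rng(0)

true_scores = rng.normal(loc=50, scale=10, size=100_000)  # T
errors = rng.normal(loc=0, scale=5, size=100_000)         # E, independent of T
observed = true_scores + errors                           # X = T + E

# Because T and E are uncorrelated, var(X) ~= var(T) + var(E).
print(round(observed.var(), 1), round(true_scores.var() + errors.var(), 1))

# Reliability = true-score variance / observed-score variance ~= 100/125 = 0.8.
print(round(true_scores.var() / observed.var(), 2))
```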

  19. Indices Used In Traditional Test Construction • Item and Test Indices: • Item difficulty • Item discrimination • Test score reliability • Standard Error of Measurement • Examinee Indices: • Test score

  20. Classical Item Indices: Item Difficulty • Item difficulty : Proportion of examinees answering the item correctly • It is an index of how difficult an item is. • If the value is low, it indicates that the item is very difficult. Only a few examinees will respond correctly to this item. • If the value is high, the item is easy as many examinees will respond correctly to it.
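As a quick illustration (the scored response matrix below is hypothetical), the classical difficulty index is just the column mean of a 0/1 response matrix:

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items,
# 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

# Item difficulty (p-value): proportion of examinees answering correctly.
p_values = responses.mean(axis=0)
print(p_values)  # [0.8 0.6 0.2 0.8] -- low p = hard item, high p = easy item
```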

  21. Classical Item Indices: Item Discrimination • Item discrimination: correlation between item score and total score • A value close to 1 indicates that examinees with high scores (ability) answer this item correctly, while examinees with low ability will respond incorrectly • A low value implies that there is hardly any relationship between ability and how examinees respond to this item • Such items are not very useful for separating high ability examinees from low ability examinees • Items with high values of discrimination are very useful for ADAPTIVE TESTING
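Continuing the hypothetical matrix from the sketch above, the discrimination index can be computed as the correlation between each item score and the total score; correlating against the total with the item removed (the “corrected” item-total correlation) is a common refinement, assumed here:

```python
import numpy as np

# Same hypothetical response matrix as in the difficulty sketch.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

total = responses.sum(axis=1)
for j in range(responses.shape[1]):
    rest = total - responses[:, j]  # total score excluding item j
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j}: discrimination = {r:.2f}")
```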

  22. Standard Error Of Measurement • Indicates the amount of error to be expected in the test scores • Arguably, the most important quantity • It depends on the scale of the scores, so its magnitude is difficult to assess in isolation • It can be re-expressed in terms of reliability, which varies between 0 and 1.

  23. Test Score Reliability • The reliability index, ρ_XX′, is defined as the correlation between scores on “parallel” tests. • It takes on values between 0 and 1, with 0 denoting totally unreliable test scores and 1 perfectly reliable test scores. • It is related to the Standard Error of Measurement through the expression SEM = σ_X √(1 − ρ_XX′). • If the test scores are perfectly reliable, ρ_XX′ = 1 and SEM = 0.
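For illustration (with assumed numbers, not values from the slides): if σ_X = 10 and ρ_XX′ = 0.91, then SEM = 10 × √(1 − 0.91) = 10 × 0.3 = 3, so an observed score of 60 would typically lie within about ±3 points (one SEM) of the examinee’s true score.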

  24. Reliability • Error in scores is due to factors such as testing conditions, fatigue, guessing, emotional or physical condition of student, etc. • Different types of reliability coefficients reflect different interpretations of error

  25. Reliability • 1. Reliability refers to consistency of test scores • over time • over different sets of test items • 2. Reliability refers to test results, not the test itself • 3. A test can have more than one type of reliability coefficient • 4. Reliability is necessary but not sufficient for validity

  26. Shortcomings of the Indices Based on Classical Test Theory • They are group DEPENDENT, i.e., they change as the groups change. • Reliability, and hence the Standard Error of Measurement, is defined in terms of parallel tests, which are almost impossible to realize in practice.

  27. And what’s wrong with that? • We cannot compare item characteristics for items whose indices were computed on different groups of examinees • We cannot compare the test scores of individuals who have taken different sets of test items

  28. Wouldn’t it be nice if …? • Our item indices did not depend on the characteristics of the individuals on which the item data were obtained • Our examinee measures did not depend on the characteristics of the items that were administered

  29. ITEM RESPONSE THEORY solves the problem!* * Certain conditions apply. Individual results may vary. IRT is not for everyone, including those with small samples. Side effects include nausea, drowsiness, and difficulty swallowing. If symptoms persist, consult a psychometrician. For more information, see Hambleton and Swaminathan (1985), and Hambleton, Swaminathan and Rogers (1991).

  30. Item Response Theory • Based on the postulate that the probability of a correct response to an item depends on the ability of the examinee and the characteristics of the item

  31. The Item Response Model • The mathematical relationship between the probability of a response, the ability of the examinee, and the characteristics of the item is specified by the ITEM RESPONSE MODEL

  32. Item Characteristics • An item may be characterized by its • DIFFICULTY level (usually denoted as b), • DISCRIMINATION level (usually denoted by a), • “PSEUDO-CHANCE” level (usually denoted as c).
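The slides do not print the model equation, but the standard three-parameter logistic (3PL) form is P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))); some treatments include a scaling constant D = 1.7, omitted here. The sketch below evaluates the model for the three parameter sets shown on the next slides:

```python
import numpy as np

# Three-parameter logistic (3PL) item response model:
#   P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
# a = discrimination, b = difficulty, c = pseudo-chance level.
def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Parameter values taken from the curves on slides 33-35.
for a, b, c in [(0.5, -0.5, 0.0), (2.0, 0.0, 0.25), (0.8, 1.5, 0.1)]:
    print(f"a={a}, b={b}, c={c}:", np.round(p_correct(theta, a, b, c), 2))

# Check: at theta = b the probability of a correct response is (1 + c) / 2.
print(p_correct(0.0, 2.0, 0.0, 0.25))  # 0.625 = (1 + 0.25) / 2
```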

  33. [Figure: item characteristic curve for an item with a = 0.5, b = -0.5, c = 0.0]

  34. [Figure: ICCs comparing a = 0.5, b = -0.5, c = 0.0 with a = 2.0, b = 0.0, c = 0.25]

  35. [Figure: ICCs comparing a = 0.5, b = -0.5, c = 0.0; a = 2.0, b = 0.0, c = 0.25; and a = 0.8, b = 1.5, c = 0.1]

  36. IRT Item Difficulty b • Differs from classical item difficulty • b is the θ value at which the probability of a correct response is .5 • The harder the item, the higher the b • b is on the same scale as θ and does not depend on the characteristics of the group of test takers

  37. IRT Item Discrimination a • Differs from classical item discrimination • a is proportional to the slope of the ICC at θ = b • The slope indicates how much the probability of a correct response changes for individuals with slightly different θ values, i.e., how well the item discriminates between them

  38. IRT Item Guessing Parameter • No analog in classical test theory • c is the probability that an examinee with very low θ will answer the item correctly • b is now the θ value at which the probability of a correct response is (1 + c)/2

  39. Item Response Models • The One-Parameter Model (Rasch Model) • The Two-Parameter Model • The Three-Parameter Model

  40. How Is IRT Used In Practice? • Test construction • Equating of test forms • Vertical scaling (for growth assessment) • Detection of differential item functioning • Adaptive testing

  41. Test Construction • Traditional approach: select items with p-values in the .2 to .8 range and as highly discriminating as possible • With this approach, however, we cannot design a test with pre-specified characteristics, such as a required reliability, SEM, or score distribution.

  42. Test Construction (cont.) • IRT approach: INFORMATION FUNCTIONS • The information the test provides about an examinee at a given ability is directly related to the Standard Error of Measurement • We CAN assemble a test that has the characteristics we want, which is impossible to accomplish in a classical framework

  43. Test Construction (cont.) • The TEST INFORMATION FUNCTION specifies the information provided by the test across the θ range • Test information is the sum of the information provided by each item • Because of this property, we can combine items to obtain a pre-specified test information function

  44. Test Construction (cont.) • Items can be selected to maximize information in desired θ regions depending on test purpose • Tests can be constructed of minimal length to keep standard errors below a specified maximum • By selecting items that have optimal properties, we can create a shorter test that has the same degree of precision as a longer test

  45. Item Information Functions • Bell-shaped • Peak is at or near the difficulty value b: the item provides greatest information at θ values near b • Height depends on discrimination; more discriminating items provide greater information over a narrow range around b • Items with low c provide the most information
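As a sketch of how these pieces fit together: the standard 3PL item information function is I(θ) = a² (Q/P) [(P − c)/(1 − c)]² (see Hambleton & Swaminathan, 1985). Summing it over items gives the test information, and 1/√(test information) gives the standard error of θ. The item parameters below reuse the illustrative values from slides 33 to 35:

```python
import numpy as np

# 3PL model and its item information function,
#   I(theta) = a^2 * (Q/P) * ((P - c) / (1 - c))^2, where Q = 1 - P.
def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    p = p_correct(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

# Illustrative item parameters from the earlier slides.
items = [(0.5, -0.5, 0.0), (2.0, 0.0, 0.25), (0.8, 1.5, 0.1)]
theta = np.linspace(-3, 3, 61)

# Test information is the sum of the item information functions;
# the standard error of theta is its inverse square root.
test_info = sum(item_information(theta, a, b, c) for a, b, c in items)
se_theta = 1 / np.sqrt(test_info)

peak = np.argmax(test_info)
print(f"information peaks near theta = {theta[peak]:.1f}")
print(f"SE(theta) at the peak = {se_theta[peak]:.2f}")
```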

  46. [Figure: test and item information functions]

  47. And what do we get for all this? • The proficiency level (ability) of an examinee is not tied to the specific items we administer • We CAN compare the ability scores of examinees who have taken different sets of test items • We can therefore match items to examinee’s ability level and measure ability more precisely with shorter tests

  48. And what do we get for all this? • We can create a bank of items by administering different items to different groups of examinees at different times • This will allow us to administer comparable tests or individually tailored tests to examinees • By administering different items to different individuals or groups we can improve test security and minimize cheating

  49. And what do we get for all this? • We can ensure the fairness of tests by making sure the test and test items function in the same way across different groups • For assessment of learning, we can give different SHORT tests made up of items that measure the entire domain of skills; such coverage would otherwise require a very long, unmanageable test.

  50. Equating • Purpose of equating is to place scores from one form of a test on the scale of another • The goal of equating is for scores to be exchangeable; it should not matter to examinees which form of the test they take • True equating is not strictly possible using traditional procedures
