4. Evaluation of measuring tools: item analysis

Psychometrics. 2011/12. Group A (English)


Introduction

  • Items can adopt different formats and assess cognitive variables (skills, performance, etc.) where there are right and wrong answers, or non-cognitive variables (attitudes, interests, values, etc.) where there are no right answers.

  • The statistics that we present are used primarily with attitudinal or performance items.


  • To carry out the item analysis, the following should be available:

    • A data matrix with the subjects' responses to each item.

      • To analyze test scores and the responses to the correct alternative, the matrix takes the form of ones (hits) and zeros (mistakes).

      • To analyze incorrect alternatives, the matrix should contain the specific option selected by each subject.

  • The analysis of the correct alternative (which offers more information about the quality of the test) allows us to obtain the difficulty, discrimination, reliability and validity indices of each item.


  • Empirical difficulty of an item: proportion of subjects who answer it correctly.

  • Discriminative power: the ability of the item to distinguish between subjects with different levels of the measured trait.

  • Both statistics are directly related to the mean and variance of total test scores.

  • The reliability and validity of the items are related to the standard deviation of the test and indicate the possible contribution of each item to the reliability and validity of total scores of the test.


Item difficulty


  • Proportion of subjects who have responded correctly to the item:

    • One of the most popular indices to quantify the difficulty of dichotomous or dichotomized items.

  • The difficulty thus considered is relative because it depends on:

    • Number of people who attempt to answer the item.

    • Their characteristics.

      ID = A / N

      A: number of subjects who hit the item.

      N: number of subjects who attempt to answer the item.

  • It ranges between 0 and 1.

    • 0: No subject has hit the item. It is difficult.

    • 1: All subjects have answered the item correctly. It is easy.
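As a minimal sketch of the index just described, the code below computes ID = A / N for each column of a 0/1 matrix. The matrix, the use of `None` for items a subject did not attempt, and the function name `difficulty_index` are assumptions made for illustration.

```python
# Hypothetical 0/1 response matrix: rows are subjects, columns are items.
# None marks an item that the subject did not attempt.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, None],
    [0, 1, 0],
    [1, 1, 1],
]

def difficulty_index(item_scores):
    """ID = A / N: hits divided by the number of subjects who attempted the item."""
    attempted = [s for s in item_scores if s is not None]
    return sum(attempted) / len(attempted)

columns = list(zip(*responses))               # one tuple of scores per item
ids = [difficulty_index(col) for col in columns]
print(ids)                                    # one value per item, between 0 and 1
```

Values close to 1 correspond to easy items and values close to 0 to difficult ones, always relative to this particular sample.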


  • Example: A performance item in mathematics is applied to 10 subjects, with the following result:

  • The obtained value does not indicate whether the item is good or bad. It represents how hard the item was for the sample of subjects who attempted to answer it.


  • The ID is directly related to the mean and variance of the test. In dichotomous items:

    • The sum of all scores obtained by subjects in this item is equal to the number of hits. Therefore, the item difficulty index is equal to its mean.

  • If we generalize to the total test, the average of the test scores is equal to the sum of the difficulty indices of items.
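The relationship between the test mean and the difficulty indices can be checked numerically. The sketch below uses a made-up 0/1 matrix in which every subject attempts every item (an assumption, so that N is the same for all items).

```python
# Hypothetical 0/1 matrix (rows: subjects, columns: items), all items attempted.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
n_subjects = len(responses)

# Difficulty index of each item = mean of its 0/1 column.
p = [sum(col) / n_subjects for col in zip(*responses)]

# Mean of the total test scores.
totals = [sum(row) for row in responses]
test_mean = sum(totals) / n_subjects

print(p, sum(p), test_mean)   # sum(p) and test_mean coincide
```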


  • The relationship between difficulty and variance of the test is also direct. In dichotomous items, the item variance is Sj² = pj·qj, with qj = 1 − pj.

  • In item analysis, a relevant question is which value of pj maximizes the variance of an item.

    • An item reaches its maximum variance when its pj is 0.5.

    • An item is appropriate when it elicits different answers from different subjects.
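A quick way to see why pj = 0.5 gives the maximum variance is to evaluate pj·qj over a grid of difficulty values. This is only an illustrative sketch; the grid and the function name `item_variance` are arbitrary choices.

```python
def item_variance(p):
    """Variance of a dichotomous item with difficulty index p: p * q, q = 1 - p."""
    return p * (1 - p)

# Scan a grid of difficulty values and keep the one with maximum variance.
grid = [i / 10 for i in range(11)]            # 0.0, 0.1, ..., 1.0
variances = {p: item_variance(p) for p in grid}
best_p = max(variances, key=variances.get)
print(best_p, variances[best_p])
```

Items with pj equal to 0 or 1 have zero variance: every subject gives the same answer, so the item cannot separate subjects.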


Correction of hits by chance

  • Hitting an item depends not only on the subjects knowing the answer, but also on the luck of those who choose the correct answer without knowing it.

  • The higher the number of distractors, the less likely it is that subjects hit the item by chance.

  • It is advisable to correct the ID:
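A common form of this correction, assumed here to be the one intended, is IDc = p − q / (K − 1), where p and q are the proportions of hits and errors and K is the number of alternatives. The sketch below applies it to made-up items; the function name and the data are hypothetical.

```python
def corrected_difficulty(hits, errors, n_alternatives):
    """IDc = p - q / (K - 1), where p and q are the proportions of hits and
    errors among the subjects who answered and K is the number of alternatives."""
    n = hits + errors
    p = hits / n
    q = errors / n
    return p - q / (n_alternatives - 1)

# Hypothetical items with 3 alternatives, each answered by 10 subjects.
for hits in (9, 5, 2):
    errors = 10 - hits
    print(hits / 10, round(corrected_difficulty(hits, errors, 3), 2))
```

Under this formula the amount subtracted grows with the proportion of errors, so the most difficult items receive the largest correction.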


Example. Test composed of items with 3 alternatives

Calculate ID and IDc for each item


Results

The items that have undergone the largest correction are those that proved most difficult.


Recommendations

  • Items whose difficulty index takes extreme values for the population they target should be eliminated from the final test.

  • In aptitude tests we will get better psychometric results if the majority of items are of medium difficulty.

    • Easy items should be included, preferably at the beginning (to measure less competent subjects).

    • Difficult items should be included too (to measure more competent subjects).


Discrimination


  • Logic: for a given item, subjects with good test scores should answer it correctly in a higher proportion than those with low scores.

  • If an item is not useful to differentiate between subjects based on their skill level (does not discriminate between subjects) it should be deleted.


Index of discrimination based on extreme groups (D)

  • It is based on the proportions of hits in the extreme skill groups (the top and bottom 25% or 27% of the total sample).

    • The top 25% or 27% is formed by the subjects who scored above the 75th or 73rd percentile on the total test.

  • Once the groups are formed, we calculate the proportion of correct answers to a given item in each group and then apply the following equation:

    D = pu − pl, where pu and pl are the proportions of hits in the upper and lower groups.


  • The D index ranges between -1 and 1.

    • 1= all people in the upper group hit the item and all people in the lower group fail it.

    • 0= the item is hit equally in both groups.

    • Negative values= less competent subjects hit the item more often than the most competent subjects (the item confuses the more skilled subjects).
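The extreme-groups procedure can be sketched as follows, taking D as the proportion of hits in the upper group minus the proportion in the lower group. The data, the function name `discrimination_d` and the way ties are broken (by position in the list) are assumptions for illustration.

```python
def discrimination_d(item_scores, total_scores, fraction=0.27):
    """D = p_upper - p_lower, using the top and bottom `fraction` of subjects
    ranked by total test score."""
    n_group = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower = order[:n_group]                   # lowest total scores
    upper = order[-n_group:]                  # highest total scores
    p_upper = sum(item_scores[i] for i in upper) / n_group
    p_lower = sum(item_scores[i] for i in lower) / n_group
    return p_upper - p_lower

# Hypothetical data for 10 subjects: total test score and 0/1 score on one item.
totals = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
d = discrimination_d(item, totals)
print(round(d, 2))
```

With 10 subjects and a 27% cut, each extreme group contains 3 subjects; real applications need much larger samples for the proportions to be stable.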


Example

  • Answers given by 370 subjects to the 3 alternatives (A, B, C) of an item where B is the correct option. The rows show the frequency of subjects who selected each alternative among those scoring in the top 27% and the bottom 27% of the sample on the total test, and in the group formed by the central 46%.

  • Calculate the corrected index of difficulty and the discrimination index.


Result

  • Proportion of correct answers:

    (53+70+19)/370=0.38

  • Proportion of mistakes:

    228/370=0.62


Interpretation of D values (Ebel, 1965)


Indices of discrimination based on the correlation

  • If an item discriminates adequately, the correlation between the scores obtained by subjects on that item and the ones obtained on the total test will be positive.

    • Subjects who score high on the test are more likely to hit the item.

  • Definition: correlation between subjects' scores on the item and their scores on the test (Muñiz, 2003).

  • The total test score of each subject is calculated excluding the item score.


1. Correlation coefficient Φ

  • When test score and item scores are strictly dichotomous.

  • It allows us to estimate the discrimination of an item against some criterion of interest (e.g. pass/fail, sex, etc.).

  • First, we have to sort data in a 2x2 contingency table.

    • 1= item is hit/criterion is exceeded.

    • 0= item is failed/criterion is not exceeded.
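Once the 2x2 table is built, the Φ coefficient is the difference of the diagonal products divided by the square root of the product of the four marginal totals. The sketch below assumes a cell layout (a, b, c, d) described in the docstring and uses made-up counts.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table laid out as:
                    criterion = 1   criterion = 0
        item = 1          a               b
        item = 0          c               d
    """
    numerator = a * d - b * c
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Hypothetical counts for 50 subjects (not the table from the slides).
phi = phi_coefficient(20, 5, 5, 20)
print(round(phi, 2))
```

A positive value means that subjects who hit the item tend to exceed the criterion; a value near 0 means the item does not discriminate with respect to it.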


Example

  • The following table shows the sorted results from 50 subjects who took the last psychometrics exam.


Result

  • There is a high correlation between the item and the criterion. That is, those subjects who hit the item usually pass the psychometrics exam.


2. Point-biserial correlation

  • When the item is a dummy variable and the test score is continuous.

  • Remove the item score from the test score.


Example

  • The following table shows the responses of 5 subjects to 4 items. Calculate the point-biserial correlation of the second item.


Result

  • Subjects who have hit the item are A, B, C and E, so their mean is:

    • (1 + 2 + 3 + 2)/4 = 2

  • The total mean is:

    • (1 + 2 + 3 + 1 + 2)/5 = 1.8

  • The standard deviation of the test scores is:

    • √[(1² + 2² + 3² + 1² + 2²)/5 − 1.8²] = √0.56 = 0.75

  • The proportion of subjects who have hit item 2 is 4/5 = 0.8, and the proportion who have failed it is 1/5 = 0.2.
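The quantities above can be combined with the usual point-biserial expression, r_pb = (mean of those who hit − total mean) / SD × √(p/q). The sketch below assumes that expression and reuses the numbers of the example (scores 1, 2, 3, 1, 2 for subjects A to E once item 2 is removed, with D as the only subject who failed the item); the function name is an illustrative choice.

```python
from math import sqrt

def point_biserial(item_scores, rest_scores):
    """r_pb = (mean_hit - mean_all) / sd_all * sqrt(p / q), with the population SD."""
    n = len(rest_scores)
    mean_all = sum(rest_scores) / n
    sd_all = sqrt(sum(x * x for x in rest_scores) / n - mean_all ** 2)
    hit = [x for x, u in zip(rest_scores, item_scores) if u == 1]
    mean_hit = sum(hit) / len(hit)
    p = len(hit) / n                          # proportion of hits
    q = 1 - p                                 # proportion of failures
    return (mean_hit - mean_all) / sd_all * sqrt(p / q)

# Test scores without item 2 for subjects A..E and their 0/1 scores on item 2.
rest_scores = [1, 2, 3, 1, 2]
item_2 = [1, 1, 1, 0, 1]
r_pb = point_biserial(item_2, rest_scores)
print(round(r_pb, 2))
```

With these values the coefficient is about 0.53, the same figure obtained by hand as (2 − 1.8)/0.75 × √(0.8/0.2).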


3. Biserial correlation

  • When both item and test score are inherently continuous variables, although one is dichotomized.

  • We can find values greater than 1, especially when one of the variables is not normal.


Example

  • Biserial correlation of item 3.

  • Because the value p = 0.4 does not appear in the first column of the table, we look up its complement (0.6), which is associated with y = 0.3863.
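The table lookup can be reproduced in code: y is the height of the standard normal curve at the point that divides it into areas p and 1 − p, which is why p = 0.4 and its complement 0.6 share the same y. The sketch below assumes the usual biserial expression r_b = (mean of those who hit − total mean) / SD × (p / y); the function names and the small data set are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normal_ordinate(p):
    """Height y of the standard normal curve at the point that splits it into
    areas p and 1 - p (the value usually read from a table)."""
    z = NormalDist().inv_cdf(p)
    return NormalDist().pdf(z)

def biserial(item_scores, rest_scores):
    """r_b = (mean_hit - mean_all) / sd_all * (p / y), with the population SD."""
    n = len(rest_scores)
    mean_all = sum(rest_scores) / n
    sd_all = sqrt(sum((x - mean_all) ** 2 for x in rest_scores) / n)
    hit = [x for x, u in zip(rest_scores, item_scores) if u == 1]
    p = len(hit) / n
    return (sum(hit) / len(hit) - mean_all) / sd_all * p / normal_ordinate(p)

print(round(normal_ordinate(0.4), 4), round(normal_ordinate(0.6), 4))

# Hypothetical 0/1 item scores and test scores without the item.
print(round(biserial([1, 1, 1, 0, 1], [1, 2, 3, 1, 2]), 2))
```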


Discrimination in attitude items

  • There are no right or wrong answers; the subject is instead placed on a continuum according to the degree of the measured attribute.

  • Correlation between item scores and test scores.

    • Because the items are not dichotomous, the Pearson correlation coefficient is used.

      • That coefficient can be interpreted as a Homogeneity index (IH). It indicates how much the item is measuring the same dimension or attitude as the rest of the scale items.


  • Items in which IH is below 0.20 should be eliminated.

  • Correction: subtract the item score from the total score, or apply the formula below:

    rj(x−j) = (rjx·Sx − Sj) / √(Sx² + Sj² − 2·rjx·Sx·Sj)
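Both ways of correcting the homogeneity index give the same value, which can be checked numerically. The sketch below compares the correlation with the rest score against the standard correction expression (rjx·Sx − Sj) / √(Sx² + Sj² − 2·rjx·Sx·Sj); the answers and the helper names are made up for illustration.

```python
from math import sqrt
from statistics import fmean, pstdev

def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def corrected_by_formula(item, total):
    """Corrected item-total correlation computed from r_jx, S_x and S_j."""
    r, sx, sj = pearson(item, total), pstdev(total), pstdev(item)
    return (r * sx - sj) / sqrt(sx ** 2 + sj ** 2 - 2 * r * sx * sj)

# Hypothetical answers of 5 subjects to 4 attitude items scored from 1 to 5.
answers = [
    [2, 4, 4, 3],
    [3, 4, 3, 5],
    [5, 2, 4, 3],
    [3, 5, 2, 4],
    [4, 3, 4, 4],
]
item = [row[3] for row in answers]            # scores on the fourth item
total = [sum(row) for row in answers]         # total score, item included
rest = [t - i for t, i in zip(total, item)]   # total score without the item

print(round(pearson(item, total), 3))               # uncorrected (inflated)
print(round(pearson(item, rest), 3))                # item removed from the total
print(round(corrected_by_formula(item, total), 3))  # same value via the formula
```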


  • Other procedures:

    • Determine whether the item mean of the subjects with higher scores on the total test is statistically higher than the mean of those with lower scores. It is common to use the 25% or 27% of subjects with the best and worst scores.

    • This procedure is useful but less efficient than the previous one because it does not use the entire sample.

    • Once the groups are identified, we test whether the mean difference is statistically significant using Student's t test.

H0: the means of both groups are equal.
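A minimal sketch of this contrast is shown below, assuming the pooled-variance version of Student's t for two independent groups. The group scores and the function name `t_independent` are hypothetical; the resulting t still has to be compared with the tabulated critical value for n1 + n2 − 2 degrees of freedom.

```python
from math import sqrt

def t_independent(upper, lower):
    """Student's t for two independent groups with pooled variance.
    Degrees of freedom: len(upper) + len(lower) - 2."""
    n1, n2 = len(upper), len(lower)
    m1, m2 = sum(upper) / n1, sum(lower) / n2
    ss1 = sum((x - m1) ** 2 for x in upper)   # sum of squares, upper group
    ss2 = sum((x - m2) ** 2 for x in lower)   # sum of squares, lower group
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical item scores of the subjects in the upper and lower groups.
upper_group = [5, 4, 5, 4]
lower_group = [2, 3, 1, 2]
t = t_independent(upper_group, lower_group)
print(round(t, 2))
```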


Example

  • Answers of 5 people to 4 attitude items. Calculate the discrimination of item 4 using the Pearson correlation and that of item 2 using Student's t test.


Solution

  • The correlation or IH between item 4 and total score of the test will be:

  • Inflated result because item 4 score is included in total score. Correction:

    • Standard deviation both for item 4 and total score:


  • The big difference when applying the correction is due to the small number of items that we have used in the example.

    • As the number of items increases, that effect decreases because the influence of each item score on the total score gets smaller. With more than 25 items, the corrected and uncorrected results are very close.


  • To calculate the discrimination of item 2 by Student's t test, we have to form groups with extreme scores. For didactic reasons, we use just 2 subjects in each group.


  • Confidence level: 95%. 2 df (2+2-2). T: 2.92

  • We accept H0: the item does not discriminate properly.

  • To apply Student's T test, item scores and total scale scores should be normally distributed and with equal variances. If not, we have to use a nonparametric test.


Factors that affect discrimination


Variability

  • Relation between test variability and item discrimination: the standard deviation of the test equals the sum, over items, of each item's standard deviation times its discrimination index, Sx = Σ Sj·rjx.

  • If the test is composed of dichotomous items: Sx = Σ √(pj·qj)·rjx.

  • To maximize the discriminative ability of a test we have to consider jointly the difficulty (pj) and the discrimination (rjx) of its items.

    • It is maximized when discrimination is maximum (rjx = 1) and difficulty is medium (pj = 0.5).
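The link between test variability and item discrimination can be verified on a small data set: the standard deviation of the total scores equals the sum of each item's standard deviation times its item-total correlation. In the sketch below the matrix is invented, and rjx is the uncorrected correlation (the item is included in the total), which is what this identity requires.

```python
from statistics import fmean, pstdev

def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical 0/1 matrix (rows: subjects, columns: items).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
items = list(zip(*responses))
totals = [sum(row) for row in responses]

# For a 0/1 item, pstdev(item) equals sqrt(p * q).
weighted_sum = sum(pstdev(item) * pearson(item, totals) for item in items)
print(round(pstdev(totals), 4), round(weighted_sum, 4))   # the two values agree
```

Each term of the sum is largest when the item's standard deviation is largest (pj = 0.5) and its correlation with the total is high, which is the joint condition stated above.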


Item difficulty

  • An item reaches its maximum discriminative power when its difficulty is medium.

  • To optimize the discrimination we must take into account the item difficulty.


Dimensionality of the test

  • Number of constructs that we are measuring.

  • When a test is constructed, we normally try to make it measure a single construct (unidimensionality).

  • In multidimensional tests, item discrimination should be estimated considering only the items that are associated with each dimension.


Test reliability

  • If discrimination is defined as the correlation between scores obtained by subjects in the item and test scores, then reliability and discrimination are closely related.

    • It is possible to express Cronbach's alpha coefficient in terms of the discrimination of the items:

      α = [n / (n − 1)] · [1 − Σ Sj² / (Σ Sj·rjx)²], where n is the number of items.

  • Small values of item discrimination are typically associated with unreliable tests.

  • As the mean discrimination of the test increases, so does the reliability coefficient.
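As a sketch of this relationship, the code below computes Cronbach's alpha in two ways: from the item and test variances, and from the item discriminations by replacing the test variance with (Σ Sj·rjx)². The data are invented, and rjx is taken as the uncorrected item-total correlation, an assumption needed for the two results to match.

```python
from statistics import fmean, pstdev

def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def alpha_from_variances(items, totals):
    """alpha = n/(n-1) * (1 - sum of item variances / test variance)."""
    n = len(items)
    return n / (n - 1) * (1 - sum(pstdev(i) ** 2 for i in items) / pstdev(totals) ** 2)

def alpha_from_discrimination(items, totals):
    """Same coefficient, with the test variance written as (sum of S_j * r_jx)^2."""
    n = len(items)
    denominator = sum(pstdev(i) * pearson(i, totals) for i in items) ** 2
    return n / (n - 1) * (1 - sum(pstdev(i) ** 2 for i in items) / denominator)

# Hypothetical 0/1 matrix (rows: subjects, columns: items).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
items = list(zip(*responses))
totals = [sum(row) for row in responses]
a1 = alpha_from_variances(items, totals)
a2 = alpha_from_discrimination(items, totals)
print(round(a1, 3), round(a2, 3))
```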


Indices of item reliability and validity


Reliability index

  • It quantifies the degree to which an item accurately measures the attribute of interest.

  • When any correlation coefficient is used to calculate the discrimination of items: IFj = Sj·rjx.

  • The square of the sum of the items' IF equals the variance of the subjects' test scores: Sx² = (Σ IFj)².

  • The higher the IF of the items we select, the better the reliability of the test will be.


Validity index

  • The validity of an item involves the correlation of the scores obtained by a sample of subjects in the item with the scores obtained by the same subjects in any external criterion of our interest.

    • It serves to determine how much each item of one test contributes to successfully make predictions about that external criterion.

  • If the criterion is a continuous variable and the item is dichotomous, we use the point-biserial correlation. It is not necessary to subtract the item score from the criterion score, because the item is not part of the criterion.

  • Test validity can be expressed as a function of the items' validity indices (IVj = Sj·rjy). The higher the items' IV, the higher the validity of the test.


  • The following formula shows how the validity of the test can be estimated from the discrimination of each item (rjx), its validity (rjy) and its difficulty (Sj² = pj·qj):

    rxy = Σ Sj·rjy / Σ Sj·rjx

  • Paradox in the selection of items: to maximize the reliability of the test we have to choose items with a high discrimination index (high IF), but this would reduce the validity coefficient of the test, which increases when the IV are high and the IF are low.
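The expression of test validity as a ratio of sums, rxy = Σ Sj·rjy / Σ Sj·rjx, can be checked against the direct correlation between test scores and the criterion. The responses and criterion scores below are invented for illustration, and rjx is again the uncorrected item-total correlation.

```python
from statistics import fmean, pstdev

def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical 0/1 matrix (rows: subjects, columns: items) and external criterion.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
criterion = [7, 4, 8, 3, 9, 2]
items = list(zip(*responses))
totals = [sum(row) for row in responses]

validity_indices = [pstdev(i) * pearson(i, criterion) for i in items]   # S_j * r_jy
reliability_indices = [pstdev(i) * pearson(i, totals) for i in items]   # S_j * r_jx

r_xy_from_items = sum(validity_indices) / sum(reliability_indices)
r_xy_direct = pearson(totals, criterion)
print(round(r_xy_from_items, 4), round(r_xy_direct, 4))
```

The ratio makes the paradox visible: larger validity indices raise the numerator, while larger reliability indices raise the denominator.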


Analysis of Distractors


  • It involves examining how subjects are distributed across the distractors, for example to detect possible reasons for the low discrimination of an item or to see whether some alternatives are not selected by anyone.

  • In this analysis, the steps are:

    • Check that all the incorrect options are chosen by a minimum number of subjects. If possible, they should be equiprobable.

      • Criterion: each distractor is selected by at least 10% of the sample and there is not much difference between them.

    • Check that the test performance of subjects who selected each incorrect alternative is lower than that of subjects who selected the correct one.

    • It is expected that, as the skill level of subjects increases, the percentage who select incorrect alternatives decreases, and vice versa.


Equiprobability of distractors

  • Distractors are equiprobable if each is selected by a minimum number of subjects and they are equally attractive to those who do not know the correct answer.

  • χ² test: χ² = Σ (FO − FT)² / FT, where FO are the observed frequencies and FT the theoretical (expected) frequencies.

    • Degrees of freedom: K − 1, where K is the number of incorrect alternatives.

    • H0: FT = FO (for subjects who do not know the correct answer, every distractor is equally attractive).


Example

  • Determine if the incorrect alternatives are equally attractive.


Solution

  • FT: (136+92)/2=114

    • Each distractor should be selected by 114 subjects.

  • FO is the frequency that appears in the last row.

  • χ² = (136 − 114)²/114 + (92 − 114)²/114 = 8.49

  • χ² table: the critical value for 1 df and a 95% confidence level is 3.84.

  • 8.49 > 3.84: the incorrect alternatives are not equally attractive to all subjects, even though each is selected by at least 10% of the sample.
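The same contrast can be sketched in code for any number of distractors. The function name `chi_square_equiprobable` is an illustrative choice; the observed frequencies (136 and 92) and the critical value (3.84) are the ones used in this solution.

```python
def chi_square_equiprobable(observed):
    """Chi-square statistic against equal expected frequencies.
    Degrees of freedom: number of distractors minus 1."""
    expected = sum(observed) / len(observed)
    return sum((fo - expected) ** 2 / expected for fo in observed)

# Observed frequencies of the two distractors in the worked example.
chi2 = chi_square_equiprobable([136, 92])
critical_value = 3.84   # chi-square table, 1 df, 95% confidence level
print(round(chi2, 2), chi2 > critical_value)
```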


Discriminative power of distractors

  • It is expected from a good distractor that its correlation with test scores is negative.

  • To quantify the discriminative power of incorrect alternatives we use the correlation. Depending on the kind of variable, we will use the biserial, point-biserial, phi or Pearson coefficient.


Example

  • Answers of 5 subjects to 4 items. Brackets show the alternative selected by each subject, and the correct alternative is marked with an asterisk. Calculate the discrimination of distractor b in item 3.


Solution

  • The subjects who selected alternative b (the incorrect one) in item 3 are A, B and D. The test mean of these subjects after removing the score of the analyzed item is:

  • Total mean of the test after subtracting the score of item 3 from each subject's score:

  • Standard deviation of the scores corresponding to (X-i):

  • The proportion of subjects that have hit the item is 2/5=0.4; and the proportion of subjects that have failed is 3/5=0.6.


  • The point-biserial correlation between the incorrect alternative “b” and test scores discounting the item score is:

    • Since subjects who chose the incorrect alternative score 0 on the item, there is nothing to remove from their total test score.

  • The distractor discriminates in the opposite direction to the correct alternative. It is a good distractor.


  • Visual inspection of the distribution of subject answers to the various alternatives.

    • Proportion of subjects that have selected each option: p

    • Test mean of subjects that have selected each alternative: mean

    • Discrimination index of all options: rbp
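The three quantities listed above can be tabulated for every option of an item. In the sketch below, the choices, the test scores (assumed to exclude the analyzed item) and the correct option "C" are all hypothetical; the point-biserial of each option is obtained as the Pearson correlation between a 0/1 indicator of choosing it and the test score.

```python
from statistics import fmean, pstdev

def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical data: option chosen by each subject on one item (correct: "C")
# and test score excluding that item.
choices = ["A", "C", "B", "C", "A", "C", "B", "C", "C", "A"]
scores = [3, 8, 5, 9, 2, 7, 4, 6, 10, 4]

summary = {}
for option in sorted(set(choices)):
    chose = [1 if c == option else 0 for c in choices]
    p = fmean(chose)                                        # proportion choosing it
    mean_score = fmean([s for s, u in zip(scores, chose) if u])
    r_pb = pearson(chose, scores)                           # discrimination index
    summary[option] = (p, mean_score, r_pb)
    print(option, round(p, 2), round(mean_score, 2), round(r_pb, 2))
```

In this made-up item the correct option has a positive index and both distractors have negative ones, which is the pattern expected from a well-functioning item.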


  • Positive discrimination index: the correct alternative is mostly chosen by competent subjects.

  • Distractor A:

    • It has been selected by an acceptable minimum of subjects (28%), and less competent subjects select it in a higher proportion.

    • The test mean of subjects who have selected it is less than the test mean of subjects that have selected the correct alternative (consistent with its negative discrimination index).


  • Distractor B should be revised:

    • It is chosen as correct by the subjects with better scores in the test.

    • It has been the most selected (50%), its discrimination is positive, and the mean of the subjects that have selected it is higher than that of subjects who have chosen the correct alternative.


  • In distractor analysis we can go further and use statistical inference. The test mean of subjects that choose the correct alternative should be higher than the test mean of subjects that have chosen each distractor.

    • ANOVA. Independent variable (factor): the item, with as many levels as answer alternatives. Dependent variable: the subjects' raw test score.
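A one-way ANOVA of this kind can be sketched by hand, with the chosen alternative as the factor and the raw test score as the dependent variable. The scores and the function name `one_way_anova_f` are hypothetical, and the resulting F still has to be compared with the tabulated critical value for its degrees of freedom.

```python
def one_way_anova_f(groups):
    """F = MS_between / MS_within for a one-way ANOVA.
    Returns (F, df_between, df_within)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical test scores grouped by the alternative chosen on one item.
by_option = {"A": [3, 2, 4], "B": [5, 4], "C": [8, 9, 7, 6, 10]}
f, df1, df2 = one_way_anova_f(list(by_option.values()))
print(round(f, 2), df1, df2)
```

A significant F only indicates that the group means differ; checking that the correct alternative has the highest mean still requires looking at the means themselves or running post hoc comparisons.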

