
Educational Research


Presentation Transcript


  1. Educational Research Chapter 9

  2. Validity and Reliability • Validity – “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” • Reliability – indicates how consistently a test measures whatever it does measure

  3. Validity Validity – the focus is no longer on the test/instrument itself but on the interpretation and meaning of the scores derived from the instrument. It is not the validity of the instrument that is at issue but the validity of the interpretations drawn from the instrument’s scores. A test may be valid for one population but not another. • You need to move from the theoretical domain surrounding the construct to the empirical level that operationalizes the construct. • Select a specific set of observable tasks that are believed to be indicators of the particular construct • Then you assume that scores on the tasks reflect the construct of interest

  4. Validity cont Two problems threaten the validity of test scores • Construct underrepresentation – the test fails to include important dimensions of the construct – for example, a general self-concept measure that measures only social self-concept, not academic or physical self-concept • Construct-irrelevant variance – the extent to which test scores are affected by processes that are extraneous to the construct – for example, reading comprehension is a source of construct-irrelevant variance in a science achievement test.

  5. Validity cont • Evidence for validity: 1. Test content 2. Test-criterion relationships 3. Construct-related

  6. Validity cont 1. Test content evidence – is there evidence that the test represents a balanced and adequate sampling of all the relevant skills and dimensions? Example: • teacher-made achievement test – sample the topics and cognitive processes in proportion to how they were covered in class (e.g., test blueprinting); • for a national math test – you would have to look at the math curriculum across the states and write a representative set of questions • this evidence is based on a logical analysis

  7. Validity cont 2. Test-criterion relationships: the relationship between a test and a relevant outcome criterion – for example, SAT (test) and GPA (relevant criterion)

  8. Validity cont • Two ways to determine test-criterion relationships: • (a) predictive validity – look at the relationship between scores on a test and criterion scores available at a future time – SAT/GPA • (b) concurrent validity – look at the relationship between test scores and criterion scores obtained at the same time – a new psychiatric screening device and a well-established instrument already in use.

  9. Validity cont • Validity coefficient – the coefficient of correlation between test scores and a criterion – the nearer to 1.00 (+ or –), the stronger the evidence that the test is useful for its intended purpose. • How high should it be? It depends on the purpose of the test – you want to use the test with the highest validity of all the available tests. • If a test has a high correlation with a future criterion, then the test can be used to predict that criterion.
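
To make the computation concrete, here is a minimal sketch (not from the chapter) of a validity coefficient as a Pearson correlation between test and criterion scores; the SAT and GPA values below are invented for illustration.

```python
# Minimal sketch: a validity coefficient is the Pearson correlation
# between test scores and criterion scores. All numbers are invented.
from statistics import correlation  # Python 3.10+

sat_scores = [1100, 1250, 980, 1400, 1320, 1050]  # test (predictor)
gpas       = [3.1, 3.4, 2.7, 3.9, 3.6, 2.9]       # criterion

r = correlation(sat_scores, gpas)
print(f"validity coefficient r = {r:.2f}")  # nearer 1.00 = stronger evidence
```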

  10. Validity cont. 3. Construct-related validity – to what extent do test scores reflect the theory behind the psychological construct being measured? • a. Related measures studies: • convergent evidence – there is a relationship between the test scores and other measures intended to assess similar constructs – math reasoning test scores should correlate with grades in math because both reflect the same construct • discriminant evidence – there is little or no relationship between test scores and measures of different constructs – a math reasoning test should have little to no relationship with a reading test.

  11. Validity cont. • b. Known-groups technique – compare two groups known to differ on the construct – if the expected difference is found – there is evidence of support for the intended interpretation and use of scores – example – a music aptitude test should distinguish between musicians and non-musicians

  12. Validity cont • c. Intervention studies – apply an experimental manipulation and see if the scores change in the hypothesized way. For example: with a scale to measure anxiety – administer the scale, then put experimental subjects in an anxiety-provoking situation and see if scores on the anxiety scale increase. If they do, and if the scores of the control group, who were not put in an anxiety-provoking situation, do not increase, you can say the scale measures anxiety.

  13. Validity cont d. Internal structure studies – examine whether all the items making up the test or scale are measuring the same thing, that is, whether the test has internal consistency. You would expect people who support stem cell research to be consistent and agree with the positive statements and disagree with the negative statements about stem cell research. You look at the intercorrelations among items; high intercorrelations indicate the test is measuring a single construct, and then you decide whether these intercorrelations conform to the theory.
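
As a hypothetical sketch of such an internal-structure check, the code below computes the intercorrelations among four attitude items; the responses are invented, and negatively worded items are assumed to be reverse-scored already.

```python
# Hypothetical internal-structure check: intercorrelations among the
# items of a 4-item attitude scale. Responses (1-5 Likert) are invented,
# and negatively worded items are assumed already reverse-scored.
from itertools import combinations
from statistics import correlation

responses = [  # rows = respondents, columns = items
    [5, 4, 5, 4],
    [2, 2, 1, 2],
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 1],
]
items = list(zip(*responses))  # transpose: one tuple of scores per item

for i, j in combinations(range(len(items)), 2):
    print(f"r(item {i+1}, item {j+1}) = {correlation(items[i], items[j]):.2f}")
# Uniformly high positive r's suggest the items tap a single construct.
```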

  14. Validity cont e. Studies of response processes: question test takers about the mental processes and skills they use when responding to the items on a test. An example – if you have a new verbal reasoning test and you find out that other factors, such as vocabulary or reading comprehension, are involved in responding, then scores for different subgroups may have different meanings.

  15. Validity cont • No test is valid for all purposes or in all situations

  16. Reliability • the reliability of a measuring instrument is the degree of consistency (repeatability, dependability) with which it measures whatever it is measuring • two types of error in measurement: random and systematic • random errors of measurement – error that is the result of pure chance – example: if you are measuring how far a child throws a ball, the child would not throw it the same distance each time

  17. Reliability cont. • Sources of random error • The individual being measured – fluctuations in the person’s motivation, interest, or level of fatigue • The administration of the measuring instrument may introduce error – there may be a departure from standardized procedures in administering or scoring, and testing conditions may vary (noise, heat) • The instrument itself may be a source of error – for example, a short test.

  18. Reliability cont • Example of the ball throw – a throw on one day is not a reliable measure of ability • Say you had the child throw on 2 consecutive days – the scores would almost never be the same, because: • the student may change from day to day • the task may change from day to day • a limited sample of behavior results in a less reliable score

  19. Reliability cont • RELIABILITY IS CONCERNED WITH THE EFFECT OF SUCH RANDOM ERRORS OF MEASUREMENT ON THE CONSISTENCY OF SCORES • systematic errors – errors in measurement that are systematic and predictable. Example: if the instructions for ball throwing are given in English and not everyone speaks English, the instructions would consistently depress the scores of the non-English-speaking children (a validity problem) • Validity of score-based inferences is lowered whenever scores are systematically changed by the influence of anything other than what you are trying to measure (measuring throwing ability and also English comprehension).

  20. Reliability cont A measuring instrument can be reliable without being valid, but it cannot be valid unless it is first reliable. Coefficient of reliability – ranges from 1.00, where there is no error in measurement, to 0, when the measurement is all error; a 1.00 indicates that each individual’s relative position on the two administrations remained exactly the same and the test is perfectly reliable. To determine reliability, you can (1) make repeated measurements of a single individual or (2) determine the extent to which each individual maintains the same relative position in the group.

  21. Reliability cont • 3 types of reliability coefficients: • Test-retest reliability involves administering a test to the same group of individuals on two different occasions and correlating the paired scores. The correlation coefficient obtained by this procedure is called a test-retest reliability coefficient or, because it is indicative of the consistency of subjects’ scores over time, it is sometimes referred to as the coefficient of stability • problems involve practice, memory, or carry-over effects; you can increase the time interval between tests, but if the interval is too long, individuals can undergo real changes; it is recommended not to use this method, or to use it with caution
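
A minimal sketch of the test-retest computation, using invented scores for six individuals tested on two occasions:

```python
# Sketch of a coefficient of stability: correlate the same individuals'
# scores from two administrations. All scores are invented.
from statistics import correlation

scores_week1 = [78, 85, 62, 90, 71, 88]
scores_week4 = [80, 83, 65, 92, 70, 85]  # same people, retested later

print(f"test-retest reliability = {correlation(scores_week1, scores_week4):.2f}")
```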

  22. Reliability cont • Equivalent forms (alternate or parallel forms) reliability – used when it is probable that subjects will recall their responses to the test items. Here one correlates the results of equivalent forms of the test administered to the same individuals. The resulting reliability coefficient is called the coefficient of equivalence if the forms are administered at essentially the same time. The two forms must have the same number of items, form, instructions, time limits, format, content, range, and level of difficulty, but the actual questions are not the same – the difficulty lies in creating equal forms. • Called the coefficient of stability and equivalence if the two forms are administered with some time between administrations – the most rigorous approach.

  23. Reliability cont • Internal consistency estimates of reliability – determine whether the items in a test are measuring the same thing. They require only a single administration of one form of a test. 1. Split-half reliability – here the test is administered to a group and later the items are divided into two comparable halves. Scores are obtained for each individual on the comparable halves and a coefficient of correlation is calculated for the two sets of scores – called the split-half reliability coefficient. If individuals perform similarly on the two halves, the test has high reliability; if they perform differently, it has low reliability. Usually, odd-numbered items are correlated with even-numbered items.
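
A sketch of the odd/even split-half procedure on invented item scores (1 = correct, 0 = incorrect):

```python
# Split-half sketch: total each examinee's odd-numbered and even-numbered
# items separately, then correlate the two half-test scores. Data invented.
from statistics import correlation

item_scores = [           # rows = examinees, columns = items 1..8
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
]
odd_half  = [sum(row[0::2]) for row in item_scores]  # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, 8

half_r = correlation(odd_half, even_half)
print(f"half-test correlation = {half_r:.2f}")  # still needs Spearman-Brown
```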

  24. Reliability cont The correlation coefficient between the two halves underestimates the reliability of the full-length test – you need to correct for this with the Spearman-Brown prophecy formula (for two halves, corrected r = 2r / (1 + r)). 2. Coefficient alpha – an internal-consistency approach based on the notion that the items, or subparts, of the instrument measure the same phenomenon – this means that the items are homogeneous. This type of reliability is not established through a single correlation, but rather estimates internal consistency by determining how all items on a test relate to all other items and to the total test
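
As a sketch of these two estimates, the code below implements the standard Spearman-Brown correction and coefficient alpha formulas; the response data reuse the invented 4-item scale from the earlier example.

```python
# Sketch of the two internal-consistency corrections described above.
from statistics import pvariance

def spearman_brown(half_r: float) -> float:
    """Spearman-Brown prophecy formula: full-test r from a half-test r."""
    return 2 * half_r / (1 + half_r)

def cronbach_alpha(item_scores: list[list[int]]) -> float:
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(item_scores[0])              # number of items
    items = list(zip(*item_scores))      # one tuple of scores per item
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

print(f"{spearman_brown(0.70):.2f}")  # a half-test r of .70 implies ~.82 full-test
responses = [[5, 4, 5, 4], [2, 2, 1, 2], [4, 5, 4, 4], [3, 3, 2, 3], [1, 2, 1, 1]]
print(f"alpha = {cronbach_alpha(responses):.2f}")  # ~.97 on these invented data
```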
