Agreement Indices in Multi-Level Analysis
Ayala Cohen
Faculty of Industrial Engineering & Management, Technion - Israel Institute of Technology
May 2007

Outline: Introduction (on interrater agreement, IRA); the rWG(J) index of agreement; the AD (Absolute Deviation) index
Alternative measure of agreement
Review of our work:
2001 (with Etti Doveh and Uri Eick)
2007 (with Etti Doveh and Inbal Shani)
In recent years there has been a growing number of studies based on multi-level data in applied psychology, organizational behavior, and clinical trials.
Typical data structure:
Individuals within groups ( two levels)
Groups within departments (three levels)
The construct is supposed to be shared, is it really?
Groups are expected to differ significantly, do they really?
This framework includes the assessment of inter-group agreement
Assessment of agreement is a prerequisite for arguing that a higher level construct can be operationalized.
Interrater reliability (IRR= Interrater Reliability) and
Interrater agreement (IRA= Interrater Agreement)
Many past studies have wrongly used the two terms interchangeably.
Interrater reliability (IRR) refers to the degree to which the ratings of different judges are proportional when expressed as deviations from their means (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).
Interrater agreement (IRA) refers to the absolute consensus in scores assigned by the raters, namely the extent to which raters provide essentially the same rating; it is assessed by measures of variability.
1 = Strongly Disagree, 2 = Disagree, 3 = Indifferent, 4 = Agree, 5 = Strongly Agree
J= 5 items
To do so, we used two complementary approaches (Kozlowski & Klein, 2000)
A consistency-based approach: computation of the intraclass correlation coefficient, ICC(1)
A consensus-based approach: an index of agreement
MAD (Mean Absolute Deviation)
Problem: What are “small / large” values ?
James, Demaree & Wolf (1984).
J stands for the number of items in the scale
Group socialization emphasis
Transformational and transactional leadership
Positive and negative affective group tone
In the particular case of one item (stimulus), J = 1, the index is denoted rWG and is equal to

rWG = 1 − S² / σE²

where S² is the observed variance of the ratings and σE² is the variance of some “null distribution” corresponding to no agreement.
A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution that models no agreement is debatable.
If the null distribution used to define σE² fails to model a random response properly, then the interpretability of the index is suspect.
For an item with A response categories, a uniform null distribution assumes the proportion of cases in each category equals 1/A, yielding the null variance σEU² = (A² − 1) / 12.
We have n ratings and suppose n is “small”
A = 5, k = 9 raters: 3 3 3 3 5 5 5 5 4
(with (n − 1) in the denominator of the variance)
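As a worked check on the nine ratings above (a sketch, assuming the uniform null for a 5-point scale):

```python
# Sketch: rWG for a single item, using the uniform null variance
# sigma_EU^2 = (A^2 - 1) / 12. Ratings are the k = 9 judges above (A = 5).
from statistics import variance  # sample variance, (n - 1) in the denominator

ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]
A = 5
sigma2_null = (A**2 - 1) / 12   # = 2.0 for a 5-point scale
s2 = variance(ratings)          # observed variance = 1.0 (mean is exactly 4)
r_wg = 1 - s2 / sigma2_null
print(r_wg)                     # 0.5
```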
“ The distribution of responses could be non-uniform when no genuine agreement exists among the judges.
The systematic biasing of a response distribution due to a common response style within a group of judges should be considered.
This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution”.
P(1) = .05, P(2) = .15, P(3) = .20, P(4) = .35, P(5) = .25,
yielding σE² = 1.34
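A quick verification of the variance of this slightly skewed null distribution:

```python
# Sketch: variance of the skewed "null" distribution with category
# probabilities .05, .15, .20, .35, .25 on a 5-point scale.
probs = {1: 0.05, 2: 0.15, 3: 0.20, 4: 0.35, 5: 0.25}
mu = sum(x * p for x, p in probs.items())             # mean = 3.60
var = sum(p * (x - mu) ** 2 for x, p in probs.items())
print(round(var, 2))                                  # 1.34
```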
Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998).
Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random response to leadership and attitude questionnaires.
Consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
The index can have negative values, when the observed variance is larger than expected under random response.
Example: half the judges answer 1, half answer 5.
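A sketch of the polarized case (here with ten judges and the population variance, n in the denominator, to keep the numbers clean):

```python
# Sketch: polarized ratings produce a variance larger than the uniform
# null, so rWG goes negative. Ten judges: half answer 1, half answer 5 (A = 5).
from statistics import pvariance  # population variance, n in the denominator

ratings = [1] * 5 + [5] * 5
sigma2_null = (5**2 - 1) / 12    # 2.0
r_wg = 1 - pvariance(ratings) / sigma2_null
print(r_wg)                      # 1 - 4/2 = -1.0
```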
James et al. (1984) recommended replacing a negative value by zero.
Criticized by Lindell et al. (1999).
rWG(J) = J(1 − s̄²/σE²) / [J(1 − s̄²/σE²) + s̄²/σE²]

where s̄² is the average variance over the J items. The multiplication by J parallels the Spearman-Brown reliability formula: ρ(J) = Jρ / (1 + (J − 1)ρ).
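A minimal sketch of rWG(J) for a multi-item scale: the average item variance replaces the single-item variance, and the Spearman-Brown-type multiplication by J is applied. The ratings below are hypothetical illustration data, not from the source.

```python
# Sketch: rWG(J) for a J-item scale under a uniform null (A = 5).
# The three rating lists (one per item, k = 5 judges) are hypothetical.
from statistics import variance

items = [
    [3, 4, 4, 3, 4],
    [4, 4, 5, 4, 4],
    [3, 3, 4, 4, 3],
]
J = len(items)
sigma2_null = (5**2 - 1) / 12
s2_bar = sum(variance(x) for x in items) / J   # average item variance
ratio = 1 - s2_bar / sigma2_null
r_wg_j = J * ratio / (J * ratio + s2_bar / sigma2_null)
print(round(r_wg_j, 3))
```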
A 7-category Likert scale
Var calculated with n in the denominator
It was initially described by James et al. (1984) as a measure of interrater reliability.
Schmidt & Hunter (1989) criticized this index claiming that an index of reliability cannot be defined on a single item
James, Demaree & Wolf (1993) concurred with this distinction, and it is now accepted that rWG(J) is a measure of agreement.
First, it can obtain negative values, when the observed agreement is less than hypothesized.
Second, unlike rWG(J), it does not include a Spearman-Brown correction and thus does not depend on the number of items (J).
Does CEO Charisma Matter? … Agle et al.
“Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistics r*WG(J)……………”
Overall, the very high interrater agreement justified the combination of individual manager’s responses into a single measure of charisma for each CEO….
r*WG(J) = One number ?
Shall we report median, mean?
….” Observed distributions of rWG(J) are often wildly skewed ….medians are the most appropriate summary statistic”…..
Ehrhart, M. G. (Personnel Psychology, 2004): Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior
Grocery store chain
3914 employees in 249 departments
WHAT TO CONCLUDE?
The practice of viewing rWG in the 0.70’s and higher as representing acceptable convergence is widespread.
For example: Zohar (2000) cited rWG values in the .70’s and mid .80’s as proof that judgments “were sufficiently homogeneous for within group aggregation”
Paper presented in the Annual Conference of the Society for Industrial and Organizational Psychology
Chicago April 2004
R.J. Harvey and E. Hollander
Lance, Butts, Michels (2006) ORM
1) GFI > .9 indicates well-fitting SEMs
2) Reliability of .7 or higher is acceptable
A reviewer …
“ I believe the phrase has its origins in describing the size of the stick appropriate for beating one’s wife. A stick was to be no larger in diameter than the man’s thumb….. Thus, use of this phrase might be offensive to some of the readership”…
Feminists often make the claim that “rule of thumb” used to mean that it was legal to beat your wife with a rod, so long as that rod was no thicker than the husband’s thumb. But this turns out to be an excellent example of what may be called fiction….
The length from the tip of one’s thumb to the first knuckle was used as an approximation for one inch
As such, we apologize to readers who may be offended by the reference to “rule of thumb” but remind them of the mythology surrounding its interpretation.
Test the null hypothesis of no agreement.
Dunlap et al. (JAP, 2003) provided a table of rWG “critical values” under the null hypothesis of a uniform null distribution
( J=1, one item, different A values of the Likert scale, and different number of judges )
“Statistical tests of rWG are useful if one’s objective is to determine if any nonzero agreement exists; although useful, this reflects a qualitatively different goal from determining if reasonable consensus exists for a group to aggregate individual level data to the group level of analysis” ……
Proposed by Burke, Finkelstein & Dusig (Organizational Research Methods, 1999)
“The ADM(J) index has several advantages compared with the James, Demaree and Wolf (1984) interrater agreement index rWG; see Burke et al. (1999)”…
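A minimal sketch of the AD index: the mean absolute deviation of the ratings from the item mean, averaged over the J items. It is shown here on the single-item example used earlier (k = 9 judges), so J = 1.

```python
# Sketch: AD index (Burke, Finkelstein & Dusig, 1999) -- mean absolute
# deviation from the item mean, averaged over items.
from statistics import mean

def ad_m(items):
    """items: list of J lists, each holding the k ratings on one item."""
    per_item = []
    for xs in items:
        m = mean(xs)
        per_item.append(mean(abs(x - m) for x in xs))
    return mean(per_item)

print(ad_m([[3, 3, 3, 3, 5, 5, 5, 5, 4]]))  # 8/9, about 0.889
```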
Measured by three items ( J=3).
Respondents answered using a 5-point scale (A=5)
The mean ADM(J) was 0.68 ( SD=0.32)
and the ICC(1) was .22.
The one-way ANOVA result, F(196, 441) = 1.7, p < .01,
suggests adequate between-group differentiation and supports the validity of the aggregated organizational commitment measure.
Test the null hypothesis of no agreement.
Dunlap et al. (JAP, 2003) provided a table of AD “critical values” under the null hypothesis of a uniform null distribution
( One item, different A values of the Likert scale, and different number of judges )
We performed a systematic literature search of organizational multilevel studies published during the years 2000-2006 (ADM(J) was introduced in 1999), finding more than 450 papers.
Among the 41 papers that included justification of the aggregation from individual to group level, there were 40 (98%) that used the index rWG(J) and only 2 (5%) used the index ADM(J) . One study used both indices.
Studied the sampling distribution of rWG(J) under the null hypothesis
“Easy” for a single item
Simulate data for a discrete “null” distribution
Display of E(rWG(J))
n = 3, 10, 100, corresponding to small, medium, and large group sizes
Data either uniform or slightly skewed
Items independent or CS (Compound Symmetry)
“Error of the first kind”
Skewed data, uniform null
Ayala Cohen ,Etti Doveh & Inbal Shani
Organizational Research Methods (2007)
CS correlation structure
Simulate the null distribution of the index for a given n, A, J, k and the correlation structure among the items.
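A sketch of the simulation idea in the simplest case: one item, independent uniform ratings. The paper's full procedure extends this to J items with a CS correlation structure among items, which needs a multivariate draw and is omitted here; the parameter values below are illustrative.

```python
# Sketch: Monte Carlo null distribution of rWG for one item (J = 1),
# n judges drawing independent uniform ratings on an A-point scale.
# The 95th percentile of the simulated distribution is the critical value.
import random
from statistics import variance

def simulate_null_rwg(n=10, A=5, nsim=20_000, seed=1):
    rng = random.Random(seed)
    sigma2_null = (A**2 - 1) / 12
    sims = [1 - variance([rng.randint(1, A) for _ in range(n)]) / sigma2_null
            for _ in range(nsim)]
    return sorted(sims)

sims = simulate_null_rwg()
crit = sims[int(0.95 * len(sims))]  # simulated 95th-percentile critical value
print(round(crit, 2))
```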
Bliese et al.'s (2002) sample of 2042 soldiers in 49 U.S. Army Companies.
The companies ranged in size from n=10 to n=99.
Task significance is a three-item scale ( J=3) assessing a sense of task significance during the deployment (A=5).
Analogous to regression analysis: F vs individual t tests
Based on the rWG(J) tests, significance was obtained for 5 companies, while based on ADM(J) it was obtained for 6 companies.
Under the null hypothesis, the probability of obtaining at least 5 (out of 49 independent tests), with a significance level α=0.05, is 0.097 and it is 0.035 for obtaining at least 6.
Thus, if we allow a global significance level of no more than 0.05, we cannot reject the null hypothesis based on rWG(J) , but will reject based on ADM(J).
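The global-significance argument above is a binomial tail calculation: 49 independent tests, each rejecting with probability α = 0.05 under the null.

```python
# Check: probability of at least 5 (and at least 6) rejections among
# 49 independent tests at alpha = 0.05, under the null hypothesis.
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(binom_tail(49, 0.05, 5), 3))  # 0.097
print(round(binom_tail(49, 0.05, 6), 3))  # 0.035
```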
What shall we do with groups that fail the threshold?
If on average we get reasonable agreement across groups on a construct, it justifies aggregation for all groups.
(Compare CFA: we index everyone, even if some people answered in a way that is inconsistent across items; we do not drop them.)