Agreement Indices in Multi-Level Analysis. Ayala Cohen, Faculty of Industrial Engineering & Management, Technion-Israel Institute of Technology, May 2007. Outline: introduction (on interrater agreement, IRA); the rWG(J) index of agreement; AD (Absolute Deviation).

Agreement Indices in Multi-Level Analysis

Ayala Cohen

Faculty of Industrial Engineering & Management

Technion-Israel Institute of Technology

May 2007

Outline

  • Introduction (on interrater agreement, IRA)
  • rWG(J) index of agreement
  • AD (Absolute Deviation), an alternative measure of agreement
  • Review of our work: (2001) with Etti Doveh and Uri Eick; (2007) with Etti Doveh and Inbal Shani

Introduction: Why we need a measure of agreement

In recent years there has been a growing number of studies based on multi-level data in applied psychology, organizational behavior, and clinical trials.

Typical data structure:

Individuals within groups ( two levels)

Groups within departments (three levels)

  • Constructs are our building blocks in developing and in testing theory.
  • Group-level constructs describe the group as a whole and are of three types (Kozlowski & Klein, 2000):
    • Global, shared, or configural.
Global Constructs
  • Relatively objective, easily observable, descriptive group characteristics.
  • Originate and are manifest at the group level.
  • Examples:
    • Group function, size, or location.
  • No meaningful within-group variability.
  • Measurement is generally straightforward.
Shared Constructs
  • Group characteristics that are common to group members
  • Originate in group members’ attitudes, perceptions, cognitions, or behaviors
    • Which converge as a function of socialization, leadership, shared experience, and interaction.
  • Within-group variability predicted to be low.
  • Examples: Group climate, norms.
Configural Group-Level Constructs
  • Group characteristics that describe the array, pattern, dispersion, or variability within a group.
  • Originate in group member characteristics (e.g., demographics, behaviors, personality)
    • But no assumption or prediction of convergence.
  • Examples:
    • Diversity, team star or weakest member.
Justifying Aggregation
  • Why is this essential?
    • In the case of shared constructs, our construct definitions rest on assumptions regarding within- and between-group variability.
    • If our assumptions are wrong, our construct “theories” and our measures are flawed, and so are our conclusions.
  • So, test both:
    • Within-group agreement: the construct is supposed to be shared; is it really?
    • Between-group variability (reliability): groups are expected to differ significantly; do they really?

Chen, Mathieu & Bliese (2004) proposed a framework for conceptualizing and testing multilevel constructs.

This framework includes the assessment of within-group agreement.

Assessment of agreement is a prerequisite for arguing that a higher-level construct can be operationalized.

A distinction should be made between:

Interrater reliability (IRR) and

Interrater agreement (IRA)

Many past studies wrongly used the two terms interchangeably in their discussions.

Interrater agreement refers to the degree to which ratings from individuals are interchangeable; namely, it reflects the extent to which raters provide essentially the same rating (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).

Interrater reliability refers to the degree to which the ratings of different judges are proportional when expressed as deviations from their means.

In short: interrater reliability (IRR) refers to relative consistency and is assessed by correlations; interrater agreement (IRA) refers to the absolute consensus in the scores assigned by the raters and is assessed by measures of variability.
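The distinction can be illustrated with a small sketch. The ratings are hypothetical and the helper `pearson` is written here only for illustration: two raters whose scores move together perfectly (high IRR) can still disagree on every single target (low IRA).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings of four targets by two raters on a 5-point scale.
rater_a = [1, 2, 3, 4]
rater_b = [2, 3, 4, 5]  # always exactly one point above rater_a

# Relative consistency (IRR view): perfect correlation.
consistency = pearson(rater_a, rater_b)

# Absolute consensus (IRA view): the raters never give the same score.
mean_abs_gap = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(consistency, mean_abs_gap)
```

A correlation-based measure reports perfect consistency here, while a variability-based measure reports a one-point gap on every target.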

Scale of Measurement
  • Questionnaire with J parallel items on a Likert scale with A categories

e.g. A = 5

1 = Strongly disagree
2 = Disagree
3 = Indifferent
4 = Agree
5 = Strongly agree

Example: k = 3 raters, Likert scale with A = 7 categories, J = 5 items

Prior to aggregation, we assessed within-unit agreement on……

To do so, we used two complementary approaches (Kozlowski & Klein, 2000):

A consistency-based approach: computation of the intraclass correlation coefficient, ICC(1)

A consensus-based approach: an index of agreement

How can we assess agreement?
  • Variability measures, e.g. the variance or the MAD (Mean Absolute Deviation)

Problem: what are “small” and “large” values?

The most widely used index of interrater agreement on Likert-type scales has been rWG(J), introduced by James, Demaree & Wolf (1984).

J stands for the number of items in the scale.

Examples where rWG(J) was used to assess interrater agreement

Group cohesiveness

Group socialization emphasis

Transformational and transactional leadership

Positive and negative affective group tone

Organizational climate

This index compares the observed within-group variance with the variance expected under “random responding”.

In the particular case of one item (stimulus), J = 1, the index is denoted rWG and is equal to

$r_{WG} = 1 - \dfrac{S_x^2}{\sigma_E^2}$

where $S_x^2$ is the variance of the ratings for the single stimulus and $\sigma_E^2$ is the variance of some “null distribution” corresponding to no agreement.


A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution that models no agreement is debatable.

If the null distribution used to define the null variance fails to model a random response properly, then the interpretability of the index is suspect.

The most natural candidate to represent non-agreement is the uniform (rectangular) distribution, which implies that for an item with A categories

the proportion of cases in each category will be equal to 1/A

For a uniform null

For an item with A categories, the uniform null variance is

$\sigma_{EU}^2 = \dfrac{A^2 - 1}{12}$
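As a quick check of the uniform-null variance, the sketch below computes the variance of a discrete uniform distribution over the categories 1..A directly and compares it with the closed form (A^2 - 1)/12. The function names are illustrative and do not come from the presentation.

```python
def null_variance(probs):
    """Variance of a discrete distribution over categories 1..A."""
    cats = range(1, len(probs) + 1)
    mean = sum(c * p for c, p in zip(cats, probs))
    return sum((c - mean) ** 2 * p for c, p in zip(cats, probs))

def uniform_null_variance(A):
    """Closed form (A^2 - 1) / 12 for A equally likely categories."""
    return (A ** 2 - 1) / 12

for A in (5, 7):
    direct = null_variance([1 / A] * A)   # computed from the definition
    closed = uniform_null_variance(A)     # 2.0 for A = 5, 4.0 for A = 7
    print(A, round(direct, 6), closed)
```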

How to calculate the sample variance?

We have n ratings, and suppose n is “small”.

Example: A = 5, k = 9 raters: 3 3 3 3 5 5 5 5 4

$S_x^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (with n - 1 in the denominator)
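To make the choice of denominator concrete, here is a minimal sketch that applies both variance conventions to the nine ratings above and plugs each into the single-item index with a uniform null. The resulting rWG values are computed here for illustration; they are not quoted from the slides.

```python
from statistics import pvariance, variance

ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]    # k = 9 raters, A = 5 categories
A = 5
sigma2_uniform = (A ** 2 - 1) / 12        # uniform null variance, 2.0 for A = 5

s2_unbiased = variance(ratings)           # divides by n - 1: 8 / 8 = 1
s2_biased = pvariance(ratings)            # divides by n:     8 / 9

rwg_unbiased = 1 - s2_unbiased / sigma2_uniform   # 0.5
rwg_biased = 1 - s2_biased / sigma2_uniform       # about 0.556

print(rwg_unbiased, round(rwg_biased, 3))
```

With small groups the two conventions give visibly different index values, which is why the denominator should be reported.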

James et al. (1984):

“ The distribution of responses could be non-uniform when no genuine agreement exists among the judges.

The systematic biasing of a response distribution due to a common response style within a group of judges be considered.

This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution”.

Slightly Skewed Null

Category proportions: p1 = .05, p2 = .15, p3 = .20, p4 = .35, p5 = .25

yielding a null variance of $\sigma_E^2 = 1.34$
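The value 1.34 can be verified directly from the five category proportions. The sketch below is only a check of the arithmetic; the variable names are illustrative.

```python
# Proportions for categories 1..5 under the slightly skewed null.
slight_skew = [0.05, 0.15, 0.20, 0.35, 0.25]

mean = sum(c * p for c, p in enumerate(slight_skew, start=1))
null_var = sum((c - mean) ** 2 * p for c, p in enumerate(slight_skew, start=1))

print(round(mean, 2), round(null_var, 2))   # 3.6 1.34
```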

Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998).

Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random response to leadership and attitude questionnaires.

James et al. (1984) suggested several skewed distributions (which differ in their skewness and variance) to accommodate systematic bias.
Often, several null distributions (including the uniform) could be suitable to model disagreement. Thus, the following procedure is suggested.

Consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
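One way to read this procedure is that the smallest and largest null variances bracket the index. The sketch below assumes a subset containing just the uniform and the slightly skewed null for A = 5, and a hypothetical observed variance of 0.5; neither the subset nor the observed value is prescribed by the slides.

```python
def rwg(observed_var, null_var):
    """Single-item agreement index: 1 - observed variance / null variance."""
    return 1 - observed_var / null_var

# Assumed subset of plausible nulls for a 5-point item.
candidate_null_vars = {"uniform": 2.0, "slight skew": 1.34}
observed_var = 0.5   # hypothetical within-group variance

lower = rwg(observed_var, min(candidate_null_vars.values()))  # smallest null variance
upper = rwg(observed_var, max(candidate_null_vars.values()))  # largest null variance

print(round(lower, 3), round(upper, 3))
```

Reporting the interval rather than a single value makes the dependence on the chosen null explicit.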

Additional “problem”

The index can take negative values. This happens when the observed variance is larger than the variance expected from random response.

Bi-modal distribution (extreme disagreement)

Example: A = 5; half of the raters answer 1 and half answer 5.

Observed variance: 4. Uniform null variance: $(5^2 - 1)/12 = 2$. Hence $r_{WG} = 1 - 4/2 = -1$.

What to do when rWG is negative?

James et al. (1984) recommended replacing a negative value by zero.

This recommendation was criticized by Lindell et al. (1999).
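The bi-modal case and the truncation rule can be sketched as follows. The group of eight raters is a hypothetical instance of “half answer 1, half answer 5”, and the variance uses n in the denominator.

```python
from statistics import pvariance

ratings = [1, 1, 1, 1, 5, 5, 5, 5]        # extreme disagreement on a 5-point item
A = 5

s2 = pvariance(ratings)                    # variance with n in the denominator: 4
rwg = 1 - s2 / ((A ** 2 - 1) / 12)         # 1 - 4 / 2 = -1.0

# Rule recommended by James et al. (1984): replace a negative value by zero.
rwg_truncated = max(0.0, rwg)

print(rwg, rwg_truncated)
```

After truncation, this group becomes indistinguishable from a group that responded at random, which is one reason the rule was criticized.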

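For a scale of J items

$r_{WG(J)} = \dfrac{J\left(1 - \bar{S}_x^2/\sigma_E^2\right)}{J\left(1 - \bar{S}_x^2/\sigma_E^2\right) + \bar{S}_x^2/\sigma_E^2}$

where $\bar{S}_x^2$ is the average of the observed variances over the J items and $\sigma_E^2$ is the null variance.

Spearman-Brown reliability: with $r_{WG} = 1 - \bar{S}_x^2/\sigma_E^2$, the index has the Spearman-Brown form

$r_{WG(J)} = \dfrac{J\, r_{WG}}{1 + (J - 1)\, r_{WG}}$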
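A minimal sketch of the multi-item index, written in the form usually attributed to James et al. (1984): the average item variance is divided by the null variance, and the result is stepped up with the Spearman-Brown formula. The item variances are hypothetical and the function name is illustrative.

```python
def rwg_j(item_variances, null_var):
    """Multi-item rWG(J) from per-item observed variances and a null variance."""
    J = len(item_variances)
    ratio = (sum(item_variances) / J) / null_var   # mean observed / null variance
    return J * (1 - ratio) / (J * (1 - ratio) + ratio)

null_var = (7 ** 2 - 1) / 12          # uniform null for A = 7 categories: 4.0
item_vars = [1.0] * 5                 # hypothetical variances for J = 5 items

single = rwg_j(item_vars[:1], null_var)   # J = 1 reduces to 1 - 1/4 = 0.75
scale = rwg_j(item_vars, null_var)        # 3.75 / (3.75 + 0.25) = 0.9375

print(single, scale)
```

The same per-item agreement gives a higher index as J grows, which is the dependence on the number of items that later critics objected to.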


Example: 3 raters, 7-category Likert scale, 5 items; variance calculated with n in the denominator.

Since its introduction, the use of rWG(J)has raised several criticisms and debates.

It was initially described by James et al. (1984) as a measure of interrater reliability.

Schmidt & Hunter (1989) criticized this index claiming that an index of reliability cannot be defined on a single item

In response, Kozlowski and Hattrup (1992) argued that it is an index of agreement not reliability.

James, Demaree & Wolf (1993) concurred with this distinction, and it has now been accepted that rWG(J) is a measure of agreement.

Lindell, Brandt and Whitney (1999) suggested, as an alternative to rWG(J), a modified index r*WG(J) which is allowed to take negative values (even beyond minus 1):

$r^*_{WG(J)} = 1 - \dfrac{\bar{S}_x^2}{\sigma_E^2}$

where $\bar{S}_x^2$ is the average observed variance over the J items and $\sigma_E^2$ is the null variance.

The modified index r*WG(J) addresses two of the criticisms that were raised against rWG(J).

First, it can take negative values when the observed agreement is less than hypothesized.

Secondly, unlike rWG(J), it does not include a Spearman-Brown correction, and thus it does not depend on the number of items (J).
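The two properties can be checked with a short sketch. It assumes the usual form of the modified index, one minus the ratio of the mean item variance to the null variance; the input variances are hypothetical.

```python
def rwg_star_j(item_variances, null_var):
    """Modified index: 1 - (mean observed item variance) / (null variance)."""
    mean_var = sum(item_variances) / len(item_variances)
    return 1 - mean_var / null_var

# Property 1: negative values are allowed, even beyond -1.
# A variance of 4 is the maximum on a 5-point item (half 1s, half 5s).
at_uniform_null = rwg_star_j([4.0, 4.0], 2.0)     # exactly -1.0
at_skewed_null = rwg_star_j([4.0, 4.0], 1.34)     # below -1

# Property 2: no Spearman-Brown step, so the number of items does not matter
# when the average item variance is the same.
two_items = rwg_star_j([1.0] * 2, 4.0)
five_items = rwg_star_j([1.0] * 5, 4.0)

print(at_uniform_null, round(at_skewed_null, 3), two_items, five_items)
```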

Academy of Management Journal, 2006

“Does CEO Charisma Matter…” (Agle et al.)

  • CEOs’ 770 team members

“Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistics r*WG(J)……………”

Agle et al. (2006)

Overall, the very high interrater agreement justified the combination of individual manager’s responses into a single measure of charisma for each CEO….

They display ICC(1) =

r*WG(J) = One number?

Ensemble of groups (RMNET)

Shall we report the median or the mean?

“…Observed distributions of rWG(J) are often wildly skewed … medians are the most appropriate summary statistic…”

Ehrhart, M. G. (2004). Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior. Personnel Psychology.

Grocery store chain

3914 employees in 249 departments

“…The median rWG values across the 249 departments were: 0.88 for servant leadership, …”


Rule of Thumb

The practice of viewing rWG in the 0.70’s and higher as representing acceptable convergence is widespread.

For example: Zohar (2000) cited rWG values in the .70’s and mid .80’s as proof that judgments “were sufficiently homogeneous for within group aggregation”

Benchmarking rWG Interrater Agreement Indices: Let’s Drop the .70 Rule-Of-Thumb

Paper presented in the Annual Conference of the Society for Industrial and Organizational Psychology

Chicago April 2004

R.J. Harvey and E. Hollander

“It is puzzling why many researchers and practitioners continue to rely on arbitrary rules-of-thumb to interpret rWG, especially the popular rule-of-thumb stating that rWG≥0.70 denotes acceptable agreement”…..
“The justification of the rule rests largely on the argument that some researchers ( e.g. James et al., 1984) viewed rater agreement as being similar to reliability, reliabilities as low as .7 are useful ( e.g. Nunnaly,1978) , therefore rWG ≥ 0.7 implies interrater reliability”…..
There is little empirical basis for a .70 cutoff and few studies have attempted to determine how various values equate with “real world” levels of interrater agreement
The sources of four commonly reported cutoff criteria

Lance, Butts & Michels (2006), ORM

1) GFI > .9 indicates well-fitting SEMs

2) Reliability of .7 or higher is acceptable

3) rWG values > .7 justify aggregation of individual responses to group-level measures

4) Keep the number of factors whose eigenvalues are greater than 1
Rule of Thumb

A reviewer …

“ I believe the phrase has its origins in describing the size of the stick appropriate for beating one’s wife. A stick was to be no larger in diameter than the man’s thumb….. Thus, use of this phrase might be offensive to some of the readership”…

Rule of Thumb

Feminists often make that claim that the “rule of thumb” used to mean that it was legal to beat your wife with a rod, so long as that rod was no thicker than the husband’s thumb. But, it turns out to be an excellent example of what may be called fiction….

Rule of Thumb

From carpentry:

The length from the tip of one’s thumb to the first knuckle was used as an approximation for one inch

As such, we apologize to readers who may be offended by the reference to “rule of thumb” but remind them of the mythology surrounding its interpretation.

Statistical Tests

Test the null hypothesis of no agreement.

Dunlop et al. (JAP, 2003) provided a table of rWG “critical values” under the uniform null hypothesis (J = 1, one item; different numbers of categories A of the Likert scale; different numbers of judges).

Dunlop et al. (2003):

“Statistical tests of rWG are useful if one’s objective is to determine if any nonzero agreement exists; although useful, this reflects a qualitatively different goal from determining if reasonable consensus exists for a group to aggregate individual level data to the group level of analysis”……

Alternative index AD

Proposed by Burke, Finkelstein & Dusig (Organizational Research Methods, 1999)
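The slide's formula did not survive in this transcript. As commonly stated, the AD index for one item is the mean absolute deviation of the ratings from the item mean, and the scale-level ADM(J) is the average of the item values. The sketch below follows that reading and should be taken as an assumption; the function names and the second item's ratings are hypothetical.

```python
def ad_item(ratings):
    """Average absolute deviation of one item's ratings from the item mean."""
    m = sum(ratings) / len(ratings)
    return sum(abs(x - m) for x in ratings) / len(ratings)

def ad_scale(items):
    """ADM(J): mean of the per-item AD values over the J items."""
    return sum(ad_item(r) for r in items) / len(items)

item_1 = [3, 3, 3, 3, 5, 5, 5, 5, 4]   # nine raters, 5-point scale
item_2 = [4, 4, 4, 4, 4, 4, 4, 4, 4]   # hypothetical item with perfect agreement

ad_1 = ad_item(item_1)                  # 8 / 9, about 0.889
ad_j = ad_scale([item_1, item_2])       # (8/9 + 0) / 2

print(round(ad_1, 3), round(ad_j, 3))
```

Unlike rWG, the result is in the metric of the response scale, so it can be compared directly with a cutoff such as the null response range of 1 that is cited for 5-point scales.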


Gonzalez-Roma, V., Peiro, J. M., & Tordera, N. (2002). JAP

“ The ADM(J) index has several advantages compared with the James, Demaree and Wolf ( 1984) interrater agreement index rwg, see Burke et al. (1999)”…..

The ADM(J) index does not require modeling the null response distribution. It only requires an a priori specification of a null response range of interrater agreement. Second, the index provides estimates of interrater agreement in the metric of the original response range.
We followed Burke and colleagues’(1999) specification of using a null response range equal to or less than 1 when the response scale is a Likert-type 5 point scale. This value is consistent with our judgement that any two contiguous scale points are somewhat similar for the 5-point Likert-type scales used in the present study
…..Organizational commitment

Measured by three items ( J=3).

Respondents answered using a 5-point scale (A=5)

The mean ADM(J) was 0.68 (SD = 0.32)

and the ICC(1) was .22.

The one-way ANOVA result, F(196, 441) = 1.7, p < .01,

suggests adequate between-group differentiation and supports the validity of the aggregate organizational commitment measure.

Statistical Tests

Test the null hypothesis of no agreement.

Dunlop et al. (JAP, 2003) provided a table of AD “critical values” under the uniform null hypothesis (one item; different numbers of categories A of the Likert scale; different numbers of judges).

Criticism vs Reality
  • Citations and applications of rWG(J) in more than 450 papers

So far, the index rWG(J) has been much more frequently used than ADM(J) .

We performed a systematic literature search of organizational multilevel studies, that were published during the years 2000-2006 (ADM(J) was introduced in 1999).

Among the 41 papers that included justification of the aggregation from individual to group level, there were 40 (98%) that used the index rWG(J) and only 2 (5%) used the index ADM(J) . One study used both indices.

Statistical properties of rWG(J)
  • Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rWG(J) index of agreement. Psychological Methods, 6, 297-310.

Studied the sampling distribution of rWG(J) under the null hypothesis

Simulations to Study the Sampling Distribution

The sampling distribution depends on:

  • J (number of items in the scale)
  • A (number of categories in the Likert scale)
  • n (group size)
  • The null variance
  • The correlation structure between the items in the scale
Simulations to Study the Sampling Distribution

“Easy” for a single item

Simulate data for a discrete “null” distribution
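For the single-item case, such a simulation might look like the sketch below. It is not the authors' software: the uniform null, the group size n = 10, the number of replications, and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import mean, variance

def simulate_null_rwg(n, A, reps=2000, seed=0):
    """Simulate single-item rWG when n raters answer uniformly at random."""
    rng = random.Random(seed)
    null_var = (A ** 2 - 1) / 12              # uniform null variance
    values = []
    for _ in range(reps):
        ratings = [rng.randint(1, A) for _ in range(n)]
        # Sample variance with n - 1 in the denominator.
        values.append(1 - variance(ratings) / null_var)
    return sorted(values)

values = simulate_null_rwg(n=10, A=5)
average = mean(values)                         # should be close to 0 under the null
crit_95 = values[int(0.95 * len(values))]      # empirical .95 percentile

print(round(average, 3), round(crit_95, 3))
```

An observed rWG above the empirical .95 percentile would then be declared significant at the 5% level for that n and A.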

Dependent vs Independent

[Figure: display of E(rWG(J)) for n = 3, 10, 100 (small, medium, and large group sizes); data either uniform or slightly skewed; items either independent or with CS (compound symmetry) correlation.]

[Figures: uniform data with uniform null (“error of the first kind”); skewed data with uniform null.]

Testing Agreement for Multi-item Scales with the Indices rWG(J) and ADM(J)

Ayala Cohen, Etti Doveh & Inbal Shani

Organizational Research Methods (2007)

Table of .95 RWG Percentile

CS correlation structure

  • Software available:

Simulate the null distribution of the index for a given n, A, J, k and the correlation structure among the items.


Bliese et al.'s (2002) sample of 2042 soldiers in 49 U.S. Army Companies.

The companies ranged in size from n=10 to n=99.

Task significance is a three-item scale ( J=3) assessing a sense of task significance during the deployment (A=5).

Inference on the Ensemble
  • ICC summarizes the whole structure, assuming homogeneity of within variances.
  • Agreement indices are derived for each group separately.
  • How to infer on the ensemble?

Analogous to regression analysis: F vs individual t tests


Based on the rWG(J) tests, significance was obtained for 5 companies, while based on ADM(J) it was obtained for 6 companies.

Under the null hypothesis, the probability of obtaining at least 5 (out of 49 independent tests), with a significance level α=0.05, is 0.097 and it is 0.035 for obtaining at least 6.

Thus, if we allow a global significance level of no more than 0.05, we cannot reject the null hypothesis based on rWG(J) , but will reject based on ADM(J).
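The two tail probabilities quoted above follow from the binomial distribution with 49 independent tests, each at level 0.05. The sketch below reproduces them; the function name is illustrative.

```python
from math import comb

def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha): at least k significant tests by chance."""
    return sum(comb(n, i) * alpha ** i * (1 - alpha) ** (n - i)
               for i in range(k, n + 1))

p_at_least_5 = prob_at_least(5, 49, 0.05)   # about 0.097, above 0.05
p_at_least_6 = prob_at_least(6, 49, 0.05)   # about 0.035, below 0.05

print(round(p_at_least_5, 3), round(p_at_least_6, 3))
```

Five significant companies are therefore not enough to reject the global null at the 0.05 level, while six are.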

Ensemble of groups (RMNET)

What shall we do with groups that fail the threshold?

1) Toss them out, because something is “wrong” with these groups. The logic is that if they do not agree, then the construct has not solidified with them; maybe they are not a collective, or they are distinct subgroups….

2) Keep them. If on average we get reasonable agreement across groups on a construct, it justifies aggregation for all. (Compare CFA: we index everyone, even if some people may have answered in a way which is inconsistent across items, and we do not drop them.)

Open Questions
  • Extension of slight skew to A>5
  • Power Analysis
  • Comparison of degrees of agreement in non-null cases