Agreement Indices in Multi-Level Analysis

Ayala Cohen, Faculty of Industrial Engineering & Management, Technion-Israel Institute of Technology, May 2007

Outline • Introduction (on interrater agreement, IRA) • rWG(J), an index of agreement • AD (Absolute Deviation), an alternative measure of agreement • Review • Our work: (2001) Etti Doveh, Uri Eick; (2007) Etti Doveh, Inbal Shani

INTRODUCTION: Why we need a measure of agreement. In recent years there has been a growing number of studies based on multi-level data in applied psychology, organizational behavior, and clinical trials. Typical data structure: • Individuals within groups (two levels) • Groups within departments (three levels)

Constructs • Constructs are our building blocks in developing and in testing theory. • Group-level constructs describe the group as a whole and are of three types (Kozlowski & Klein, 2000): • Global, shared, or configural.

Global Constructs • Relatively objective, easily observable, descriptive group characteristics. • Originate and are manifest at the group level. • Examples: • Group function, size, or location. • No meaningful within-group variability. • Measurement is generally straightforward.

Shared Constructs • Group characteristics that are common to group members • Originate in group members’ attitudes, perceptions, cognitions, or behaviors • Which converge as a function of socialization, leadership, shared experience, and interaction. • Within-group variability predicted to be low. • Examples: Group climate, norms.

Configural Group-Level Constructs • Group characteristics that describe the array, pattern, dispersion, or variability within a group. • Originate in group member characteristics (e.g., demographics, behaviors, personality) • But no assumption or prediction of convergence. • Examples: • Diversity, team star or weakest member.

Justifying Aggregation • Why is this essential? • In the case of shared constructs, our construct definitions rest on assumptions regarding within- and between-group variability. • If our assumptions are wrong, our construct “theories” and our measures are flawed, and so are our conclusions. • So, test both: • Within-group agreement: the construct is supposed to be shared; is it really? • Between-group variability (reliability): groups are expected to differ significantly; do they really?

Chen, Mathieu & Bliese (2004) proposed a framework for conceptualizing and testing multilevel constructs. This framework includes the assessment of within-group agreement. Assessment of agreement is a prerequisite for arguing that a higher-level construct can be operationalized.

A distinction should be made between interrater reliability (IRR) and interrater agreement (IRA). Many past studies wrongly used the two terms interchangeably.

The term interrater agreement refers to the degree to which ratings from individuals are interchangeable; namely, it reflects the extent to which raters provide essentially the same rating (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).

Interrater reliability refers to the degree to which the ratings of different judges are proportional when expressed as deviations from their means.

Interrater reliability (IRR) refers to relative consistency and is assessed by correlations. Interrater agreement (IRA) refers to absolute consensus in the scores assigned by the raters and is assessed by measures of variability.
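The difference between consistency and consensus can be made concrete with a small sketch. The ratings below are hypothetical (they are not from the talk): rater B always scores two points above rater A, so the correlation is perfect while the actual scores never coincide.

```python
from math import sqrt

# Hypothetical ratings of five targets by two raters (illustrative only).
rater_a = [1, 2, 3, 4, 5]
rater_b = [x + 2 for x in rater_a]  # same ordering, always two points higher


def pearson(x, y):
    """Pearson correlation, used here as a simple consistency (IRR) measure."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Consistency: do the raters order the targets the same way?
consistency = pearson(rater_a, rater_b)

# Consensus: how far apart are the actual scores, on average?
mean_abs_gap = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"correlation = {consistency:.2f}")
print(f"mean absolute gap = {mean_abs_gap:.1f}")
```

High consistency with a large gap is the situation in which IRR and IRA lead to different conclusions.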

Scale of Measurement • Questionnaire with J parallel items on a Likert scale with A categories, e.g. A = 5: 1 = Strongly disagree, 2 = Disagree, 3 = Indifferent, 4 = Agree, 5 = Strongly agree

Example: k = 3 raters, Likert scale with A = 7 categories, J = 5 items

“Prior to aggregation, we assessed within-unit agreement on …… To do so, we used two complementary approaches (Kozlowski & Klein, 2000)”: • A consistency-based approach: computation of the intraclass correlation coefficient, ICC(1) • A consensus-based approach (index of agreement)

How can we assess agreement? • Variability measures, e.g. the variance or the MAD (Mean Absolute Deviation). Problem: what are “small” and “large” values?
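As a minimal sketch of the two variability measures named above, assuming a hypothetical group of five ratings and taking the absolute deviation around the group mean (one common convention; the talk does not specify which is used):

```python
from statistics import mean, variance

# Hypothetical ratings from one group on a 5-point scale (illustrative only).
ratings = [4, 4, 5, 3, 4]

group_mean = mean(ratings)

# Sample variance, with n - 1 in the denominator.
sample_variance = variance(ratings)

# Mean absolute deviation around the group mean.
mean_abs_dev = mean([abs(x - group_mean) for x in ratings])

print(f"variance = {sample_variance}, mean absolute deviation = {mean_abs_dev}")
```

Neither number says by itself whether agreement is high or low, which is the problem the slide raises.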

The most widely used index of interrater agreement on Likert-type scales has been rWG(J), introduced by James, Demaree & Wolf (1984). J stands for the number of items in the scale.

Examples where rWG(J) was used to assess interrater agreement: • Group cohesiveness • Group socialization emphasis • Transformational and transactional leadership • Positive and negative affective group tone • Organizational climate

This index compares the observed within-group variance to the variance expected from “random responding”. In the particular case of one item (stimulus), J = 1, this index is denoted rWG and is equal to

$$r_{WG} = 1 - \frac{S_x^2}{\sigma_E^2}$$

where $S_x^2$ is the variance of the ratings for the single stimulus and $\sigma_E^2$ is the variance of some “null distribution” corresponding to no agreement.

Problem: A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution that models no agreement is debatable. If the null distribution used to define the expected variance fails to model a random response properly, then the interpretability of the index is suspect.

The most natural candidate to represent no agreement is the uniform (rectangular) distribution, which implies that for an item with A categories, the proportion of cases in each category is 1/A.

For a uniform null on an item with A categories, the null (expected) variance is

$$\sigma_{EU}^2 = \frac{A^2 - 1}{12}$$
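A quick check of the uniform null variance, sketched under the assumption that the categories are scored 1, 2, ..., A. The function names are illustrative, not from the talk.

```python
def uniform_null_variance(num_categories):
    """Closed form (A^2 - 1) / 12 for a discrete uniform on 1..A."""
    return (num_categories ** 2 - 1) / 12


def discrete_uniform_variance(num_categories):
    """Same quantity computed directly from the definition of variance."""
    values = range(1, num_categories + 1)
    mean = sum(values) / num_categories
    return sum((v - mean) ** 2 for v in values) / num_categories


for a in (5, 7):
    print(a, uniform_null_variance(a), discrete_uniform_variance(a))
```

For the two scale lengths used in this talk, A = 5 and A = 7, both routes give 2 and 4 respectively.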

How to calculate the sample variance ? We have n ratings and suppose n is “small”

Example: A = 5, k = 9 raters, with ratings 3 3 3 3 5 5 5 5 4 (variance computed with n − 1 in the denominator).
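The numeric results for this example did not survive in the transcript. The sketch below recomputes the single-item rWG for the nine ratings under a uniform null, once with n − 1 and once with n in the denominator of the variance, to show how much the choice matters for a small group.

```python
from statistics import pvariance, variance

ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]  # the nine ratings from the example
num_categories = 5

# Uniform null variance for A = 5 categories: (A^2 - 1) / 12 = 2.
null_variance = (num_categories ** 2 - 1) / 12

var_n_minus_1 = variance(ratings)  # sum of squares 8, divided by n - 1 = 8
var_n = pvariance(ratings)         # sum of squares 8, divided by n = 9

rwg_n_minus_1 = 1 - var_n_minus_1 / null_variance
rwg_n = 1 - var_n / null_variance

print(f"n - 1 denominator: variance = {var_n_minus_1}, rWG = {rwg_n_minus_1:.3f}")
print(f"n denominator:     variance = {var_n:.3f}, rWG = {rwg_n:.3f}")
```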

James et al. (1984): “The distribution of responses could be non-uniform when no genuine agreement exists among the judges. The systematic biasing of a response distribution due to a common response style within a group of judges be considered. This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution.”

Slightly skewed null: P(1) = .05, P(2) = .15, P(3) = .20, P(4) = .35, P(5) = .25, yielding a null variance of 1.34. Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998). Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random response to leadership and attitude questionnaires.
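The variance of any candidate null distribution follows directly from its category probabilities. A minimal sketch, assuming categories scored 1 to A (the function name is illustrative):

```python
def discrete_variance(probabilities):
    """Variance of a distribution on categories 1..A given P(1), ..., P(A)."""
    categories = range(1, len(probabilities) + 1)
    mean = sum(c * p for c, p in zip(categories, probabilities))
    return sum(p * (c - mean) ** 2 for c, p in zip(categories, probabilities))


slight_skew = [0.05, 0.15, 0.20, 0.35, 0.25]  # probabilities from the slide
uniform = [0.2] * 5

print(f"slight skew: {discrete_variance(slight_skew):.2f}")
print(f"uniform:     {discrete_variance(uniform):.2f}")
```

The skewed null has a smaller variance than the uniform null, so the same observed variance yields a lower rWG.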

Null Distributions A=5

James et al. (1984) suggested several skewed distributions (which differ in their skewness and variance) to accommodate systematic bias.

Often, several null distributions (including the uniform) could be suitable to model disagreement. Thus, the following procedure is suggested: consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
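One way to read this procedure is as a range for rWG rather than a single value. The sketch below assumes a subset of two nulls for A = 5 (the uniform and the slightly skewed null, with variances 2.0 and 1.34) and a hypothetical observed variance.

```python
# Null variances for A = 5 under two candidate null distributions.
null_variances = {"uniform": 2.0, "slight skew": 1.34}

# Hypothetical observed within-group variance for a single item.
observed_variance = 0.5

# Single-item rWG under each candidate null.
rwg_by_null = {
    name: 1 - observed_variance / null_var
    for name, null_var in null_variances.items()
}

# The smallest null variance gives the lower bound, the largest the upper bound.
rwg_lower = min(rwg_by_null.values())
rwg_upper = max(rwg_by_null.values())

print(f"rWG lies between {rwg_lower:.3f} and {rwg_upper:.3f}")
```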

Additional “problem”: the index can take negative values. This happens when the observed variance is larger than the variance expected from random responding.

Bi-modal distribution (extreme disagreement). Example: A = 5, half answer 1 and half answer 5. Variance: 4. Uniform null variance: (A² − 1)/12 = 2, so rWG = 1 − 4/2 = −1.

What to do when rWG is negative? James et al. (1984) recommended replacing a negative value by zero. This was criticized by Lindell et al. (1999).
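A small sketch of the negative case and of the truncation rule, assuming a hypothetical group of ten raters split evenly between the two extreme categories of a 5-point scale, a uniform null, and the variance computed with n in the denominator:

```python
from statistics import pvariance

# Hypothetical polarized group: half answer 1, half answer 5.
ratings = [1] * 5 + [5] * 5
num_categories = 5

null_variance = (num_categories ** 2 - 1) / 12  # uniform null: 2
observed_variance = pvariance(ratings)          # every rating is 2 away from the mean of 3

rwg = 1 - observed_variance / null_variance

# Truncation recommended by James et al. (1984): negative values become zero.
rwg_truncated = max(0.0, rwg)

print(f"rWG = {rwg}, truncated rWG = {rwg_truncated}")
```

After truncation, a polarized group is indistinguishable from a group responding at random, which is one reason the rule was criticized.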

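For a scale of J items:

$$r_{WG(J)} = \frac{J\left(1 - \bar{S}_x^2/\sigma_E^2\right)}{J\left(1 - \bar{S}_x^2/\sigma_E^2\right) + \bar{S}_x^2/\sigma_E^2}$$

where $\bar{S}_x^2$ is the average variance over the J items and $\sigma_E^2$ is the variance of the null distribution.

This is the Spearman-Brown reliability formula applied to the single-item index. Writing $\bar{r}_{WG} = 1 - \bar{S}_x^2/\sigma_E^2$:

$$r_{WG(J)} = \frac{J\,\bar{r}_{WG}}{1 + (J - 1)\,\bar{r}_{WG}}$$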

Example: 3 raters, 7-category Likert scale, 5 items. Variance calculated with n in the denominator.
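The data table for this example is not preserved in the transcript. The sketch below uses hypothetical ratings with the same layout (3 raters, 5 items, A = 7) and computes the multi-item index as James et al. (1984) define it: the Spearman-Brown step-up of one minus the ratio of the mean item variance to a uniform null variance. The function name and data are illustrative assumptions.

```python
from statistics import mean, pvariance


def rwg_j(ratings_by_rater, num_categories):
    """Multi-item rWG(J) with a uniform null; item variances use n in the denominator."""
    null_variance = (num_categories ** 2 - 1) / 12
    items = list(zip(*ratings_by_rater))  # regroup the ratings item by item
    num_items = len(items)
    mean_item_variance = mean([pvariance(item) for item in items])
    ratio = mean_item_variance / null_variance
    return num_items * (1 - ratio) / (num_items * (1 - ratio) + ratio)


# Hypothetical ratings: one row per rater, one column per item.
ratings = [
    [5, 6, 5, 4, 6],
    [6, 6, 5, 5, 7],
    [5, 7, 6, 4, 6],
]

print(f"rWG(J) = {rwg_j(ratings, num_categories=7):.3f}")
```

With these ratings every item has variance 2/9, so the single-item value is 17/18 (about 0.944) and the five-item value is stepped up to 85/86 (about 0.988).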

Since its introduction, the use of rWG(J) has raised several criticisms and debates. It was initially described by James et al. (1984) as a measure of interrater reliability. Schmidt & Hunter (1989) criticized this index, claiming that an index of reliability cannot be defined on a single item.

In response, Kozlowski and Hattrup (1992) argued that it is an index of agreement, not reliability. James, Demaree & Wolf (1993) concurred with this distinction, and it is now accepted that rWG(J) is a measure of agreement.

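Lindell, Brandt and Whitney (1999) suggested, as an alternative to rWG(J), a modified index, r*WG(J), which is allowed to take negative values (even below −1). It is one minus the ratio of the average item variance to the null variance, with no step-up for the number of items.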

The modified index r*WG(J) provides corrections to two of the criticisms which were raised against rWG(J). First, it can take negative values when the observed agreement is less than hypothesized. Second, unlike rWG(J), it does not include a Spearman-Brown correction and thus does not depend on the number of items (J).
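Both properties can be illustrated with a short sketch. It assumes the Lindell et al. (1999) form of the index, one minus the mean item variance over the null variance, a uniform null, and hypothetical ratings from a polarized group; the function name is illustrative.

```python
from statistics import mean, pvariance


def rwg_star_j(ratings_by_rater, num_categories):
    """r*WG(J): 1 - (mean item variance / uniform null variance), no step-up."""
    null_variance = (num_categories ** 2 - 1) / 12
    items = list(zip(*ratings_by_rater))  # regroup the ratings item by item
    mean_item_variance = mean([pvariance(item) for item in items])
    return 1 - mean_item_variance / null_variance


# Hypothetical polarized group of four raters on a 5-point scale, two items.
two_items = [[1, 1], [1, 1], [5, 5], [5, 5]]

# The same response pattern repeated over four items.
four_items = [row * 2 for row in two_items]

print(rwg_star_j(two_items, num_categories=5))
print(rwg_star_j(four_items, num_categories=5))
```

The value is negative and is the same for two and four items, since only the average item variance enters the index.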

Academy of Management Journal, 2006: “Does CEO Charisma Matter…..”, Agle et al. • CEOs’ 770 team members • “Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistics r*WG(J)……………”

Agle et al. (2006): “Overall, the very high interrater agreement justified the combination of individual manager’s responses into a single measure of charisma for each CEO….” They display a single value for each of ICC(1), ICC(2), and r*WG(J). One number?

Ensemble of groups (RMNET): shall we report the median or the mean? “….Observed distributions of rWG(J) are often wildly skewed ….medians are the most appropriate summary statistic…..”

Ehrhart, M. G. (Personnel Psychology, 2004): Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior. Grocery store chain: 3,914 employees in 249 departments.

“….The median rWG values across the 249 departments were: 0.88 for servant leadership, …………” What to conclude?

Rule of Thumb: The practice of viewing rWG values in the 0.70’s and higher as representing acceptable convergence is widespread. For example, Zohar (2000) cited rWG values in the .70’s and mid .80’s as proof that judgments “were sufficiently homogeneous for within group aggregation”.

“Benchmarking rWG Interrater Agreement Indices: Let’s Drop the .70 Rule-Of-Thumb”, R. J. Harvey and E. Hollander. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, April 2004.

“It is puzzling why many researchers and practitioners continue to rely on arbitrary rules-of-thumb to interpret rWG, especially the popular rule-of-thumb stating that rWG≥0.70 denotes acceptable agreement”…..
