
360-degree reviews: Advantages and traps


Presentation Transcript


  1. 360-degree reviews: Advantages and traps Duncan J. R. Jackson – King’s College London George Michaelides – University of East Anglia Chris Dewberry – Birkbeck University of London Ben Schwencke – Test Partnership LTD

  2. Duncan Jackson • King’s Business School, London • Researches multifaceted assessment • Works as a consultant to several organisations in the UK and NZ

  3. Focus for Tonight’s Talk • I’ll focus on what is scored in 360s and how to make the most of 360 scores for feedback and development purposes • 360s are often designed with the purpose of measuring competencies • But, until recently, no one has (properly) checked if or how competencies are relevant to 360 assessments • We’ll do that very thing tonight!

  4. Proposition: Averaging Competency Observations Across Raters and Sources • It is implied in some measurement designs that it’s possible to simply average all of the observations concerning competencies to form competency scores, and that such aggregate scores can be used in feedback • We can think of this suggestion as a 2-way interaction involving: • Ratees × Dimensions • We will test this proposition to see if it holds up.
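
Before testing it, here is a minimal sketch of what this aggregation proposition implies in practice, assuming a toy long-format table of ratings (all column names, levels, and values below are illustrative, not from the study). The first grouping forms the 2-way Ratees × Dimensions scores; the second anticipates the source-dependent, 3-way proposition introduced on the next slide.

```python
import pandas as pd

# Hypothetical long-format 360 ratings; every name and value here is illustrative only
ratings = pd.DataFrame({
    "ratee":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "source":    ["peer", "peer", "manager", "manager"] * 2,
    "rater":     ["r1", "r2", "r3", "r4", "r1", "r2", "r3", "r4"],
    "dimension": ["Teamwork", "Motivation"] * 4,
    "rating":    [4, 3, 5, 2, 3, 4, 4, 5],
})

# Proposition 1 (Ratees x Dimensions): average over all raters and sources
competency_scores = ratings.groupby(["ratee", "dimension"])["rating"].mean().unstack()

# Proposition 2 (Ratees x Dimensions x Sources): keep the source facet separate
by_source = ratings.groupby(["ratee", "source", "dimension"])["rating"].mean().unstack()

print(competency_scores)
print(by_source)
```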

  5. Proposition: Competency Use Depends on the Source • Others have historically suggested that different sources (i.e., subgroups of raters, e.g., peers, managers, clients) use different competencies in different ways in order to form their judgments (Guion, 1965; Klimoski & London, 1974) • E.g., peer ratings might be high for social skills and low for task performance, while manager ratings are low for social skills and high for task performance • This idea can be thought of as a 3-way interaction involving: • Ratees × Dimensions × Sources • We’ll test this proposition too

  6. 360-degree Ratings (AKA Multisource Ratings) • Popular among organizations for performance assessment and development (Toegel & Conger, 2003) • Provide a rich evaluation incorporating perspectives from multiple sources, e.g., supervisors, executive managers, peers, clients • The multisource design feature is often seen as an advantage because of the value of different perspectives on behavior (Borman, 1974; Borman & Motowidlo, 1997)

  7. Meaningful Developmental Feedback • To provide developmental feedback to ratees, it is necessary to understand the measurement structure of the feedback procedure • Otherwise we won’t have any idea about whether what is being fed back is: • Meaningful to employees • Helpful for their development • Of assistance to the growth of organisations

  8. Measurement Design • Multisource ratings involve • Ratees • Raters • Sources • Rating items • Competencies/Dimensions • But there’s more to this because of the different ways in which these design features are related

  9. Relationships Between Measurement Design Features • Notation is used to help describe the relationships between design features • In multisource ratings, the design that’s often applied is: • p × i:d × r:s

  10. Relationships Between Measurement Design Features • …let’s see what this really means with a diagram

  11. [Diagram of the p × i:d × r:s design] Each ratee (p) is assessed by raters (r) who are nested in sources (s): managers, clients, and peers. Ratings are assigned using items (i), which are nested in each of several dimensions (d), i.e., competencies such as Communication, Teamwork, and Tolerance.
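
To make the p × i:d × r:s notation concrete, here is a hedged sketch of one way the observation cells of such a design could be enumerated in long format; the facet labels and level counts are made up for illustration and do not describe the operational instrument.

```python
import pandas as pd

# Illustrative facet levels; counts and labels are assumptions for the sketch only
dimensions = {"Communication": ["com1", "com2"],      # items (i) nested in dimensions (d)
              "Teamwork":      ["team1", "team2"]}
sources    = {"manager": ["m1"],                      # raters (r) nested in sources (s)
              "peer":    ["p1", "p2"],
              "client":  ["c1"]}
ratees     = ["ratee_1", "ratee_2", "ratee_3"]        # persons (p)

rows = []
for ratee in ratees:                                  # p is crossed with everything else
    for dim, items in dimensions.items():
        for item in items:                            # i:d
            for source, raters in sources.items():
                for rater in raters:                  # r:s
                    rows.append({"ratee": ratee, "dimension": dim, "item": item,
                                 "source": source, "rater": rater})

design = pd.DataFrame(rows)
print(design.head())
print(len(design))  # 3 ratees x 4 items x 4 raters = 48 observation cells
```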

  12. Implications of Using Complex Designs • A complex, multifaceted p × i:d × r:s design means that many main effects and interactions will be relevant to the measurement process

  13. Seventeen Effects in Total • General performance • Dimensions • Sources • Raters • Dimensions and sources • Other effects concern item-related, rater-related, residual, and BP-irrelevant variance

  14. Previous Research • Attempts have been made to isolate some of these effects • Considerable variability in findings across studies

  15. Variance estimates reported in previous research: • General performance: variable • Dimensions: small • Sources: variable • Raters: variable • Residual: variable • Missing effects

  16. What’s with the Cross-Study Variability? • There could be several reasons for this, some of which might be methodological, while others might relate to situation specificity • But one issue that certainly does not help in terms of promoting clarity is confounding • All of the studies mentioned are confounded to some degree, as are others in the literature

  17. Unconfounding • What is needed is an unconfounded perspective on multisource ratings • Confounding represents a major barrier to our understanding of the measurement structure of multisource ratings

  18. Generalizability Theory • Our approach involved an application of generalizability theory (G theory)

  19. What Is G theory? • Developed by Lee Cronbach and his colleagues in the 1960s as an approach to estimating reliability in multifaceted measurement contexts • Any form of assessment with multiple sources of variance • Cronbach worked in the education context, where G theory is clearly applicable • E.g., students (s) crossed with classes (c) rated on items (i) nested in tests (t) • s • c • t • i:t • …and interactions

  20. What is the Statistical Basis for G theory? • Its basis lies in Ronald Fisher’s work on factorial ANOVA • But there are marked differences • No significance testing • Focus on effect sizes • And nowadays, non-ANOVA estimators (e.g., REML, Bayesian estimation) tend to be used more often because of their flexibility
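
As a hedged illustration of those ANOVA roots, the sketch below (Python, not from the talk) simulates the simplest fully crossed persons × items design with assumed variance components and recovers them using the classical expected-mean-squares equations; operational multisource designs like p × i:d × r:s would typically be estimated with REML or Bayesian methods instead.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the simplest one-facet G-study design: persons (p) crossed with items (i),
# one observation per cell, with known (assumed) variance components
n_p, n_i = 200, 10
person = rng.normal(0.0, 1.0, size=(n_p, 1))                     # sigma2_p   = 1.00
item = rng.normal(0.0, 0.5, size=(1, n_i))                       # sigma2_i   = 0.25
X = 3.0 + person + item + rng.normal(0.0, 0.8, size=(n_p, n_i))  # sigma2_res = 0.64

# Mean squares from the two-way ANOVA without replication (Fisher's machinery)
grand = X.mean()
ms_p = n_i * np.sum((X.mean(axis=1) - grand) ** 2) / (n_p - 1)
ms_i = n_p * np.sum((X.mean(axis=0) - grand) ** 2) / (n_i - 1)
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

# Expected-mean-squares estimators of the variance components (no significance tests)
var_res = ms_res
var_p = (ms_p - ms_res) / n_i
var_i = (ms_i - ms_res) / n_p
print(round(var_p, 2), round(var_i, 2), round(var_res, 2))  # roughly 1.00, 0.25, 0.64
```

The same logic extends, with many more components, to the multisource design discussed above.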

  21. What Is G theory? • How does it help? • If you are faced with a measurement design involving many sources of variance (i.e., effects), it can help to unconfound all of those effects • Other approaches might provide more detail on effects of interest but might not be able, in a practical sense, to estimate as many effects as G theory can • You can use G theory to help optimize assessments (e.g., the optimal number of exercises for your assessment centre, optimal number of raters for your performance management system)
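
That last point is the basis of a "decision study": once variance components are estimated, generalizability can be projected for different numbers of raters (or exercises, items, and so on). A minimal sketch for a simple ratees × raters design, with made-up variance components that are illustrative only:

```python
# Hypothetical variance components from a G study of a ratees x raters design;
# the numbers are illustrative, not results from the present study
var_ratee = 0.40        # universe-score variance (what we want to generalize over raters)
var_rater = 0.10        # rater main effect (overall leniency/severity)
var_ratee_rater = 0.25  # ratee x rater interaction
var_residual = 0.25     # residual

def g_coefficient(n_raters: int) -> float:
    """Relative G coefficient when each ratee's score averages n_raters raters."""
    relative_error = (var_ratee_rater + var_residual) / n_raters
    return var_ratee / (var_ratee + relative_error)

def phi_coefficient(n_raters: int) -> float:
    """Absolute (dependability) coefficient: rater leniency also counts as error."""
    absolute_error = (var_rater + var_ratee_rater + var_residual) / n_raters
    return var_ratee / (var_ratee + absolute_error)

for n in (1, 3, 5, 10):
    print(n, round(g_coefficient(n), 2), round(phi_coefficient(n), 2))
# e.g., 1 rater: G ~ .44; 5 raters: G ~ .80; 10 raters: G ~ .89 (phi slightly lower)
```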

  22. Why is it called G theory? • Because the researcher aims to generalize across specific sources of variance regarded as error • Implied here is that the researcher has some reasonable discretion over what is regarded as true score and what is regarded as error

  23. Why is it called G theory? • Take, for example, a Big Five model across different situations • Here, we would have the following main effects: p, t, i:t, s • And the following interactions: pt, pi:t, ps, ts, is:t, pts, pis:t • p is the general effect for participants (true score), pt captures scores for individuals depending on traits (true score), ps captures scores for individuals depending on situations (error), and pis:t is residual variance (error) • It is across the effects regarded as error that we wish to generalize our observed scores
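
Spelled out, this says the observed score decomposes additively into those effects. A sketch of the standard G-theory decomposition for this persons × (items:traits) × situations example, with the final term confounding the highest-order interaction and residual error:

```latex
X_{p(i:t)s} = \mu
  + \nu_{p} + \nu_{t} + \nu_{i:t} + \nu_{s}                                    % main effects
  + \nu_{pt} + \nu_{p(i:t)} + \nu_{ps} + \nu_{ts} + \nu_{(i:t)s} + \nu_{pts}   % interactions
  + \nu_{p(i:t)s,\,e}                                                          % residual

\sigma^{2}(X_{p(i:t)s}) = \sigma^{2}_{p} + \sigma^{2}_{t} + \sigma^{2}_{i:t} + \sigma^{2}_{s}
  + \sigma^{2}_{pt} + \sigma^{2}_{p(i:t)} + \sigma^{2}_{ps} + \sigma^{2}_{ts}
  + \sigma^{2}_{(i:t)s} + \sigma^{2}_{pts} + \sigma^{2}_{p(i:t)s,e}
```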

  24. Let’s Apply this Logic to Multisource Ratings

  25. Present Study • Operational multisource rating procedure in the UK • 1495 raters • 5 sources • Senior managers, colleagues, direct reports, self, stakeholder • 392 managerial ratees • 4 dimensions • Teamwork, organizational citizenship, results-focus, motivation • Multiple items per dimension

  26. Analysis • Generalizability theory with Bayesian inference • 17 effects, 15 of which are relevant to between-participant comparisons, as described earlier • Aggregation was taken into account • Corrected for the ill-structured nature of the measurement design • See Putka, Le, McCloy, and Diaz (2008)
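
The full 17-effect model is well beyond a short example, but the flavour of "G theory with Bayesian inference" can be sketched with a cut-down crossed random-effects model that estimates only ratee, source, ratee × source, and residual components on simulated data. Everything below (library choice, priors, effect sizes, data layout) is an assumption for illustration, not the study's actual specification, and the Putka et al. (2008) ill-structured-design correction is not shown.

```python
import numpy as np
import pandas as pd
import pymc as pm

# Simulate toy multisource data (all sizes and effect SDs are made up for the sketch)
rng = np.random.default_rng(0)
n_p, n_s, n_r = 30, 3, 4                      # ratees, sources, raters per cell
p_eff = rng.normal(0, 0.7, n_p)               # ratee (general performance) effects
s_eff = rng.normal(0, 0.3, n_s)               # source effects
ps_eff = rng.normal(0, 0.4, (n_p, n_s))       # ratee x source effects
rows = [{"ratee": p, "source": s,
         "rating": 3 + p_eff[p] + s_eff[s] + ps_eff[p, s] + rng.normal(0, 0.5)}
        for p in range(n_p) for s in range(n_s) for _ in range(n_r)]
ratings = pd.DataFrame(rows)

ratee_idx = ratings["ratee"].to_numpy()
source_idx = ratings["source"].to_numpy()
cell_idx = ratee_idx * n_s + source_idx       # index of each ratee x source cell

with pm.Model() as model:
    # Standard deviations of the variance components (weakly informative priors)
    sd_ratee = pm.HalfNormal("sd_ratee", 1.0)
    sd_source = pm.HalfNormal("sd_source", 1.0)
    sd_cell = pm.HalfNormal("sd_ratee_x_source", 1.0)
    sd_resid = pm.HalfNormal("sd_residual", 1.0)

    intercept = pm.Normal("intercept", 0.0, 2.0)
    ratee_eff = pm.Normal("ratee_eff", 0.0, sd_ratee, shape=n_p)
    source_eff = pm.Normal("source_eff", 0.0, sd_source, shape=n_s)
    cell_eff = pm.Normal("cell_eff", 0.0, sd_cell, shape=n_p * n_s)

    mu = (intercept + ratee_eff[ratee_idx] + source_eff[source_idx]
          + cell_eff[cell_idx])
    pm.Normal("y", mu, sd_resid, observed=ratings["rating"].to_numpy())

    idata = pm.sample(1000, tune=1000, target_accept=0.9)

# Posterior SDs should land near the simulated values 0.7, 0.3, 0.4, and 0.5
print(idata.posterior[["sd_ratee", "sd_source", "sd_ratee_x_source", "sd_residual"]].mean())
```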

  27. Results • General performance: sizable • Source effects I: sizable • Dimension effects: uh-oh! • Source effects II: sizable • Source effects don’t have anything to do with dimension effects • Other rater- and item-related effects tend to be quite small

  28. What Should We Feed Back to Ratees? • Our results suggest that feedback should focus on: • General performance • Source perspectives • Dimensions have almost nothing to do with what is being assessed • Problematic for the historical literature, which assumed source × dimension interactions (Guion, 1965; Klimoski & London, 1974) • Rater- and item-related effects tend to be small, so the instrument tends to be fairly reliable (>.80, regardless of generalization type) • We have replicated these results with a second sample

  29. Dimensions • Dimensions are routinely the focus of assessments of this type and perhaps they should not be • This is not the same as saying that there are no constructs being assessed in multisource ratings, but that the true constructs might not be those designated for measurement

  30. Role Perspectives • Provide potentially meaningful insights (Borman, 1974; Borman & Motowidlo, 1997) • E.g., the hard worker

  31. General Performance • Might subsume psychological factors (e.g., g, personality) if these are relevant to the assessment

  32. References
  Borman, W. C. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior & Human Performance, 12, 105-124. doi: 10.1016/0030-5073(74)90040-3
  Borman, W. C., & Motowidlo, S. J. (1997). Task performance and contextual performance: The meaning for personnel selection research. Human Performance, 10, 99-109.
  Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
  Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs in organizational research: Implications for estimating interrater reliability. Journal of Applied Psychology, 93, 959-981. doi: 10.1037/0021-9010.93.5.959
  Toegel, G., & Conger, J. A. (2003). 360-degree assessment: Time for reinvention. Academy of Management Learning and Education, 2, 297-311. doi: 10.5465/AMLE.2003.10932156
