What’s New in the I/O Testing and Assessment Literature That’s Important for Practitioners? Paul R. Sackett
New Developments in the Assessment of Personality
Topic 1: A faking-resistant approach to personality measurement • Tailored Adaptive Personality Assessment System (TAPAS) • Developed for the Army Research Institute by Drasgow Consulting Group • Multidimensional Pairwise Preference format combined with an applicable Item Response Theory model • Items are created by pairing statements from different dimensions that are similar in desirability and trait “location” • Example item: “Which is more like you?” • __1a) People come to me when they want fresh ideas. • __1b) Most people would say that I’m a “good listener”.
A faking-resistant approach to personality measurement (continued) • Extensive work shows it’s faking-resistant • A non-operational field study in the Army shows useful prediction of attrition, disciplinary incidents, completion of basic training, and adjustment to Army life, among other criteria • Now in operational use on a trial basis • Drasgow, F., Stark, S., Chernyshenko, O. S., Nye, C. D., & Hulin, C. L. (2012). Development of the Tailored Adaptive Personality Assessment System (TAPAS) to Support Army Selection and Classification Decisions. Technical Report 1311, Army Research Institute.
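To make the pairwise preference idea concrete, here is a minimal sketch. It assumes one common way of modeling such items: the respondent is treated as judging each statement independently, and the probability of picking statement (a) is the chance of endorsing (a) but not (b), divided by the chance of endorsing exactly one of the two. The `endorse` curve is a hypothetical ideal-point function chosen only for illustration. It is not the IRT model used in TAPAS, which the slides do not specify, and all parameter values are invented.

```python
import math

def endorse(theta: float, location: float, width: float = 1.0) -> float:
    """Hypothetical ideal-point curve (an assumption, not the TAPAS model).

    Endorsement is most likely when the respondent's trait level (theta)
    matches the statement's location; the result stays within (0.05, 0.95].
    """
    return 0.05 + 0.9 * math.exp(-((theta - location) ** 2) / (2 * width ** 2))

def prefer_a(theta_a: float, loc_a: float, theta_b: float, loc_b: float) -> float:
    """Probability of choosing statement a over statement b.

    theta_a and theta_b are the respondent's standing on the two different
    dimensions that statements a and b measure.
    """
    p_a = endorse(theta_a, loc_a)
    p_b = endorse(theta_b, loc_b)
    only_a = p_a * (1 - p_b)   # endorses a, rejects b
    only_b = (1 - p_a) * p_b   # rejects a, endorses b
    return only_a / (only_a + only_b)

# Invented respondent: high on the dimension behind statement a,
# low on the dimension behind statement b; both statements sit at 1.5.
p = prefer_a(theta_a=1.5, loc_a=1.5, theta_b=-1.0, loc_b=1.5)
print(f"P(choose a) = {p:.3f}")
```

Because both statements in a pair are matched on desirability, choosing the "better-looking" one is not an option; the choice is driven by the respondent's relative standing on the two dimensions.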
Topic 2: The Value of Contextualized Personality Items • A new meta-analysis documents the higher predictive power obtained by “contextualizing” items (e.g., asking about behavior at work, rather than behavior in general) • Mean r with supervisory ratings for work context vs. general: • Conscientiousness: .30 vs. .22 • Emotional Stability: .17 vs. .12 • Extraversion: .25 vs. .08 • Agreeableness: .24 vs. .10 • Openness: .19 vs. .02 • Shaffer, J. A., & Postlethwaite, B. E. (2012). A matter of context: A meta-analytic investigation of the relative validity of contextualized and noncontextualized personality measures. Personnel Psychology, 65, 445-494.
Topic 3: Moving from the Big 5 to Narrower Dimensions
• DeYoung, Quilty and Peterson (2007) suggested the following:
• Neuroticism:
    • Volatility - irritability, anger, and difficulty controlling emotional impulses
    • Withdrawal - susceptibility to anxiety, worry, depression, and sadness
• Agreeableness:
    • Compassion - empathetic emotional affiliation
    • Politeness - consideration and respect for others’ needs and desires
• Conscientiousness:
    • Industriousness - working hard and avoiding distraction
    • Orderliness - organization and methodicalness
• Extraversion:
    • Enthusiasm - positive emotion and sociability
    • Assertiveness - drive and dominance
• Openness to Experience:
    • Intellect - ingenuity, quickness, and intellectual engagement
    • Openness - imagination, fantasy, and artistic and aesthetic interests
• DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the Big Five. Journal of Personality and Social Psychology, 93, 880-896.
Moving from the Big 5 to Narrower Dimensions (continued) • Dudley et al. (2006) show the value of this perspective • Four conscientiousness facets: achievement, dependability, order, and cautiousness • Validity was driven largely by the achievement and/or dependability facets, with relatively little contribution from cautiousness and order • Achievement receives the dominant weight in predicting task performance, while dependability receives the dominant weight in predicting counterproductive work behavior • Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology, 91, 40-57.
Topic 4: The Use of Faking Warnings • Landers et al. (2011) administered a warning after one-third of the items to managerial candidates exhibiting what they called “blatant extreme responding” • The rate of extreme responding was halved after the warning • Landers, R. N., Sackett, P. R., & Tuzinski, K. A. (2011). Retesting after initial failure, coaching rumors, and warnings against faking in online personality measures for selection. Journal of Applied Psychology, 96(1), 202.
More on the Use of Faking Warnings • Nathan Kuncel suggests three potentially relevant goals when individuals take a personality test: be impressive, be credible, and be true to oneself
More on the Use of Faking Warnings (continued) • Jenson and Sackett (2013) suggested that “priming” concern for being credible could reduce faking • Test-takers who scheduled a follow-up interview just before taking the personality test obtained lower scores than those who did not • Jenson, C. E., & Sackett, P. R. (2013). Examining ability to fake and test-taker goals in personality assessments. SIOP presentation.
A cognitive test with reduced adverse impact • In 2011, SIOP awarded its M. Scott Myers Award for applied research to Yusko, Goldstein, Scherbaum, and Hanges for the development of the Siena Reasoning Test • This is a nonverbal reasoning test, using unfamiliar item content, such as made-up words (if a GATH is larger than a SHET…) and figures • The concept is that adverse impact will be reduced by eliminating content with which groups have differential familiarity
Validity and subgroup d for Siena Test • Black-White d commonly in the .3-.4 range • Sizable number of validity studies, with validities in the range commonly seen for cognitive tests. • In one independent study, HumRRO researchers included Siena along with another cognitive test; corrected validity .45 for other test (d = 1.); .35 for Siena (d = .38) (SIOP 2010: Paullin, Putka, and Tsacoumis)
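For readers less familiar with the subgroup d statistic quoted above: it is a standardized mean difference, the gap between two group means expressed in standard deviation units. A minimal sketch, assuming the usual pooled-standard-deviation form of d; the scores are invented and are not from any study mentioned here.

```python
from statistics import mean, variance

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5

# Invented test scores for two hypothetical applicant groups.
group_a = [52, 55, 49, 61, 58, 50, 54, 57]
group_b = [50, 53, 47, 59, 56, 48, 52, 55]

d = cohens_d(group_a, group_b)
print(f"d = {d:.2f}")
```

A smaller d means the two score distributions overlap more, which is why a test with d in the .3-.4 range produces less adverse impact at a given cutoff than one with d near 1.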
Why the reduced d? • Somewhat of a puzzle. There is a history of using nonverbal reasoning tests • Raven’s Progressive Matrices • Large-sample military studies in Project A • But these do not show the reduced d that is seen with the Siena Test • Things to look into: does d vary with item difficulty, and how does Siena compare with other tests? • (Note: Nothing published to date that I am aware of. Some PowerPoint decks from SIOP presentations can be found online: search for “Siena Reasoning Test”)
Sample SJT item • You find yourself in an argument with several co-workers about who should do a very disagreeable, but routine task. Which of the following would be the most effective way to resolve this situation? • (a) Have your supervisor decide, because this would avoid any personal bias. • (b) Arrange for a rotating schedule so everyone shares the chore. • (c) Let the workers who show up earliest choose on a first-come, first-served basis. • (d) Randomly assign a person to do the task and don't change it.
Key findings • Extensive validity evidence • Can measure different constructs (problem solving, communication skills, integrity, etc.) • Incremental validity over ability and personality • Small subgroup differences, except for cognitively oriented SJTs • Items can be presented in written form or by video; recent move to animation rather than recording live actors
Lievens, Sackett, and Buyse (2009): comparing response instructions • Ongoing debate re “would do” vs. “should do” instructions • Lievens et al. randomly assigned Belgian medical school applicants to “would do” or “should do” instructions in an operational interpersonal skills SJT; did the same with a student sample
Lievens, Sackett, and Buyse (2009): comparing response instructions (continued) • In the operational setting, applicants under both instructions gave “should do” responses • So: we’d like to know “would do”, but in effect, can only get “should do”
Arthur et al. (2014): comparing response formats • Compared 3 options: • Rate effectiveness of each response • Rank the responses • Choose best and worst response • 20-item integrity-oriented SJT • Administered to over 30,000 retail/hospitality job applicants • Online administration; each format used for one week
“Rate each response” emerges as superior • Higher reliability • Lower correlation with cognitive ability • Smaller gender mean difference • Higher correlation with conceptually relevant personality dimensions (conscientiousness, agreeableness, emotional stability) • Follow-up study with student sample • Higher retest reliability • More favorable reactions
Krumm et al. (in press) • Question: how “situational” is situational judgment? • Some suggest SJTs really just measure general knowledge about appropriate social behavior • So Krumm et al. conducted a clever experiment: they “decapitated” SJT items • Removed the stem – just presented the responses
559 airline pilots completed 10 items each from • Airline pilot knowledge SJT • Integrity SJT • Teamwork SJT • Overall, mean scores are 1 SD higher with the stem • But for more than half the items, there is no difference with and without stem • So stem matters overall, but is irrelevant for lots of SJT items • Depends on specificity of stem content
“You are flying an “angel flight” with a nurse and noncritical child patient, to meet an ambulance at a downtown regional airport. You filed visual flight rule: it is 11:00 p.m. on a clear night, when, at 60 nm out, you notice the ammeter indicating a battery discharge and correctly deduce the alternator has failed. Your best guess is that you have from 15 to 30 min of battery power remaining. You decide to: • (a) Declare an emergency, turn off all electrical systems, except for 1 NAVCOM and transponder, and continue to the regional airport as planned. • (b) Declare an emergency and divert to the Planter’s County Airport, which is clearly visible at 2 o’clock, at 7 nm. • (c) Declare an emergency, turn off all electrical systems, except for 1 NAVCOM, instrument panel lights, intercom, and transponder, and divert to the Southside Business Airport, which is 40 nm straight ahead. • (d) Declare an emergency, turn off all electrical systems, except for 1 NAVCOM, instrument panel lights, intercom, and transponder, and divert to Draper Air Force Base, which is at 10 o’clock, at 32 nm.”
Arthur, W., Jr., Glaze, R. M., Jarrett, S. M., White, C. D., Schurig, I., & Taylor, J. E. (2014). Comparative evaluation of three situational judgment test response formats in terms of construct-related validity, subgroup differences, and susceptibility to response distortion. Journal of Applied Psychology, 99(3), 535-545. • Krumm, S., Lievens, F., Huffmeier, J., Lipnevich, A., Bendels, H., & Hertel, G. (in press). How “situational” is judgment in situational judgment tests? Journal of Applied Psychology. • Lievens, F., Sackett, P. R., & Buyse, T. (2009). The effects of response instructions on situational judgment test performance and validity in a high-stakes context. Journal of Applied Psychology, 94, 1095-1101.
Two meta-analyses with differing findings • Ones, Viswesvaran, and Schmidt (1993) is the “classic” analysis of integrity test validity • Found 662 studies, including many where only raw data were provided (i.e., no write-up); many publishers shared information • In 2012, Van Iddekinge et al. conducted an updated meta-analysis • Applied strict inclusion rules (e.g., reporting of study detail) • 104 studies (including 132 samples) met the inclusion criteria • 30 publishers contacted; only 2 shared information • Both based bottom-line conclusions on studies using a predictive design and a non-self-report criterion.
Predicting Counterproductive Behavior
Source                                   K     N       Mean Validity
Ones et al. - overt tests                10    5598    .39
Ones et al. - personality-based tests    62    93092   .29
Van Iddekinge et al.                     10    5056    .11
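A "mean validity" of this kind is typically a sample-size-weighted average of the study-level correlations, so large studies count for more than small ones. The sketch below shows only that weighting step, under that assumption; it omits the artifact corrections the published meta-analyses apply, and the (N, r) pairs are hypothetical, not data from either paper.

```python
def weighted_mean_r(studies):
    """Sample-size-weighted mean correlation.

    studies: list of (n, r) pairs, one per study.
    """
    total_n = sum(n for n, _ in studies)
    return sum(n * r for n, r in studies) / total_n

# Hypothetical studies: (sample size, observed validity).
hypothetical_studies = [(120, 0.10), (450, 0.25), (80, 0.40), (1000, 0.15)]

mean_r = weighted_mean_r(hypothetical_studies)
unweighted = sum(r for _, r in hypothetical_studies) / len(hypothetical_studies)
print(f"weighted mean r = {mean_r:.3f}, unweighted mean r = {unweighted:.3f}")
```

With weighting, which studies are included matters a great deal: adding or dropping one very large sample can move the mean more than many small ones, which is one reason inclusion rules can change the bottom line.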
Why the difference? • Not clear. A number of factors do not seem to be the cause: • Differences in types of studies examined (e.g., both excluded studies with polygraph results as the criterion) • Differences in corrections (e.g., for unreliability) • Several factors may contribute, though this is speculation • Some counterproductive behaviors may be more predictable than others, but all are lumped together in these analyses • Given that both rely on studies not readily available to public scrutiny, this won’t be resolved until further work is done
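The "corrections" mentioned above adjust an observed validity for measurement error. As a minimal sketch, assuming the standard correction for unreliability in the criterion only (observed r divided by the square root of the criterion's reliability); the numbers are illustrative and are not taken from either meta-analysis.

```python
import math

def correct_for_criterion_unreliability(r_observed: float,
                                        criterion_reliability: float) -> float:
    """Estimate the validity that would be seen with a perfectly reliable criterion."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return r_observed / math.sqrt(criterion_reliability)

# Illustrative values only: observed r of .20, criterion reliability of .64.
corrected = correct_for_criterion_unreliability(0.20, 0.64)
print(f"corrected r = {corrected:.2f}")
```

Two meta-analyses that assume different reliability values will report different corrected validities from the same observed data, so it is worth checking, as was done here, whether that accounts for a discrepancy.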
Broader questions • This raises broader issues about data openness policies • Publisher obligations? • Researcher obligations? • Journal publication standards? • Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 679-703. • Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012). The criterion-related validity of integrity tests: An updated meta-analysis. Journal of Applied Psychology, 97, 499-530.
Since Hunter and Hunter (1984), interest in using interest measures for selection has diminished greatly • They report a meta-analytic estimate of validity for predicting performance as .10 • BUT: how many studies in this meta-analysis? • 3!!!
New meta-analysis by Van Iddekinge et al. (2011) • Lots of studies (80) • Mean validity for a single interest dimension: .11 • Mean validity for a single interest dimension relevant to the job in question: .23 • Other studies suggest incremental validity over ability and personality
The “catch”: studies use data collected for research purposes • Concern that candidates can “fake” a job-relevant interest profile • I expect interest to turn to developing faking-resistant interest measures
Van Iddekinge, C. H., Roth, P. L., Putka, D. J., & Lanivich, S. E. (2011). Are you interested? A meta-analysis of relations between vocational interests and employee performance and turnover. Journal of Applied Psychology, 96(6), 1167. • Nye, C. D., Su, R., Rounds, J., & Drasgow, F. (2012). Vocational interests and performance: A quantitative summary of over 60 years of research. Perspectives on Psychological Science, 7(4), 384-403.
Van Iddekinge et al. (in press) • Students about to graduate made Facebook info available • Recruiters rated each profile on 10 dimensions • Supervisors rated performance a year later • Facebook ratings did not predict performance • Higher ratings for women than men • Lower ratings for Blacks and Hispanics than Whites • Van Iddekinge, C. H., Lanivich, S. E., Roth, P. L., & Junco, E. (in press). Social Media for Selection? Validity and Adverse Impact Potential of a Facebook-Based Assessment. Journal of Management.
Is performance normally distributed? • We’ve implicitly assumed this for years • Data analysis strategies assume normality • Evaluations of selection system utility assume normality • O’Boyle and Aguinis (2012) offer hundreds of data sets, all consistently showing that a “power law” distribution fits better • This is a distribution with the largest number of observations at the very bottom, with the number of observations then dropping rapidly
The O’Boyle and Aguinis data • They argue against looking at ratings data, as ratings may be “forced” to fit a normal distribution • Thus they focus on objective data • Tallies of publication in journals • Sports performance (e.g., golf tournaments won, points scored in NBA) • Awards in arts and letters (e.g., number of Academy Award nominations) • Political elections (number of terms to which one has been elected)
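To see what the difference between the two shapes means in practice, a small simulation can compare how much of total output the top 10% of performers account for under a normal distribution versus a Pareto (power law) distribution. This is a sketch with arbitrary parameters, not a reanalysis of any data discussed here: the mean, SD, Pareto shape parameter, and random seed are all assumptions chosen for illustration.

```python
import random

def top_share(values, fraction=0.10):
    """Share of the total accounted for by the top `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Normally distributed performance (assumed mean 100, SD 15; floored at 0).
normal_perf = [max(0.0, rng.gauss(100, 15)) for _ in range(n)]

# Power-law performance (assumed Pareto shape parameter 1.5).
pareto_perf = [rng.paretovariate(1.5) for _ in range(n)]

normal_share = top_share(normal_perf)
pareto_share = top_share(pareto_perf)
print(f"Top 10% share of output, normal: {normal_share:.1%}")
print(f"Top 10% share of output, Pareto: {pareto_share:.1%}")
```

Under normality the top decile contributes only slightly more than its headcount share; under a power law a small number of "stars" account for a large part of the total, which is why the choice of distribution matters for utility estimates.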
An alternate view • “Job performance is defined as the total expected value of the discrete behavioral episodes an individual carries out over a standard period of time” (Motowidlo and Kell, 2013)
References • O’Boyle, E., Jr., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65(1), 79. • Beck, J., Beatty, A. S., & Sackett, P. R. (2014). On the distribution of performance: A reply to O’Boyle and Aguinis. Personnel Psychology, 67, 531-566.