The Bland-Altman LIMITS OF AGREEMENT: How Often HAVE THEY Been Misapplied?

Introdução à Medicina (Introduction to Medicine), 23 May 2011, Class 13

INTRODUCTION

Background • Thanks to advances in technology, new methods of clinical measurement appear constantly and keep becoming more innovative.1 • In the 1980s, Bland and Altman noticed that the correlation coefficient was widely used to evaluate the agreement between two methods of clinical measurement. • They realized it wasn't adequate for that purpose. • So they created their own method: the Bland-Altman limits of agreement.2 • 1 - Zietman A, Goitein M, Tepper JE. Technology evolution: is it survival of the fittest? Journal of Clinical Oncology, 2010; 28(27): 4275-4279. • 2 - Bland JM, Altman DG. Statistical Methods For Assessing Agreement Between Two Methods of Clinical Measurement. Lancet, 1986; i: 307-310.

Statistical methods for assessing agreement between two methods of clinical measurement • The Lancet, 1986 • Objective of the method: assess the agreement between two methods of clinical measurement • Importance: if agreement is not achieved, there is a high risk of diagnostic mistakes, which may lead to severe consequences.3 • 3 - Stoker M. Common Errors in Clinical Measurement. Anesthesia & Intensive Care Medicine, 2008; 9(12): 553-558.

How do we apply the method?2 • [Diagram labels: correlation coefficient; Instrument 1 measurement; Instrument 2 measurement; average; difference] • 2 - Bland JM, Altman DG. Statistical Methods For Assessing Agreement Between Two Methods of Clinical Measurement. Lancet, 1986; i: 307-310.

Average of the differences = 0 → no systematic error • Average of the differences ≠ 0 → systematic error

If the limits of agreement are… • Too wide: there are random errors associated with the measuring instrument; it is unacceptable for clinical use. • Small, but the average of the differences is ≠ 0: there is a systematic error; the measuring device must be calibrated.

The evaluation of whether the limits of agreement are too wide or adequate may be somewhat subjective. Therefore, it is important that the maximum acceptable limits of agreement are defined according to clinical needs.
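The calculation described in the preceding slides can be sketched as follows. This is a minimal illustration, not the authors' own code: it assumes the usual form of the limits (average difference ± 1.96 standard deviations of the differences), and the readings and the clinical tolerance `MAX_ACCEPTABLE_DIFFERENCE` are hypothetical values chosen only for the example.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(method_a) != len(method_b):
        raise ValueError("measurements must be paired")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # average difference; non-zero suggests systematic error
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd


# Hypothetical paired readings from two instruments (illustrative only).
instrument_1 = [102.0, 98.0, 110.0, 95.0, 105.0, 99.0]
instrument_2 = [100.0, 99.0, 107.0, 96.0, 102.0, 98.0]

bias, lower, upper = limits_of_agreement(instrument_1, instrument_2)

# Hypothetical clinical tolerance, chosen from clinical needs rather than from the data.
MAX_ACCEPTABLE_DIFFERENCE = 5.0
acceptable = (-MAX_ACCEPTABLE_DIFFERENCE <= lower) and (upper <= MAX_ACCEPTABLE_DIFFERENCE)

print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f}), acceptable={acceptable}")
```

Comparing the limits with a tolerance set from clinical needs is what makes the judgement of "too wide" less subjective.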

Assumptions • Images: Bland JM and Altman DG. Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology, 2003; 22, 85-93.

Example of the existence of a relation between the averages and the differences • Images: Bland JM and Altman DG. Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology, 2003; 22, 85-93.
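A relation between the averages and the differences is normally judged by looking at the plot. As a rough numerical companion, one could also fit a least-squares slope of the differences on the averages. The sketch below is an assumption of this write-up, not a step taken from the cited paper, and the data are hypothetical: the second instrument is made to read 10% lower, so the differences grow with the magnitude of the measurement.

```python
from statistics import mean


def slope_of_differences_on_averages(method_a, method_b):
    """Least-squares slope of (a - b) regressed on (a + b) / 2."""
    averages = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    diffs = [a - b for a, b in zip(method_a, method_b)]
    avg_bar, diff_bar = mean(averages), mean(diffs)
    sxy = sum((x - avg_bar) * (y - diff_bar) for x, y in zip(averages, diffs))
    sxx = sum((x - avg_bar) ** 2 for x in averages)
    return sxy / sxx


# Hypothetical readings: instrument 2 reads 10% lower than instrument 1.
instrument_1 = [50.0, 100.0, 150.0, 200.0]
instrument_2 = [45.0, 90.0, 135.0, 180.0]

slope = slope_of_differences_on_averages(instrument_1, instrument_2)
print(f"slope of differences on averages: {slope:.3f}")
```

A slope clearly different from zero suggests that a single pair of limits would not describe the agreement equally well across the whole measurement range.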

The Bland-Altman method had a great impact on the scientific community and, after being published in The Lancet, was cited more than 17,000 times.4 But some of the citations/applications of this method may not have been made correctly! Bland and Altman themselves noticed that their limits of agreement were being misapplied, which led to false conclusions about the agreement between two instruments of clinical measurement.5 • 4 - Ryan TP and Woodall WH. The Most Cited Statistical Papers. Journal of Applied Statistics, 2005; 32: 461-474. • 5 - Bland JM and Altman DG. Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology, 2003; 22, 85-93.

RESEARCH QUESTION AND AIMS

Research Question • “What is the percentage of articles in which the Bland-Altman method is applied correctly?”

METHODS

Methods • Sample • 70 randomly chosen articles indexed by ISI that cite the article, published in The Lancet, in which Bland and Altman present their method


Check-list • Evaluates each article with respect to: • Verification of the assumptions; • Application of the method itself; • Interpretation of the obtained limits of agreement. • The check-list also gathers some relevant data about the articles: type of article, and the year and journal in which it was published.

Reproducibility of the check-list • [Diagram: Article X is evaluated by Student A and by Student B] • Comparison between the answers given by the two students.
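One simple way to summarize such a comparison is the percentage of articles on which both students gave the same answer to a given check-list question. The sketch below assumes that measure; the answers are hypothetical and are not the study's data.

```python
def percent_agreement(answers_a, answers_b):
    """Percentage of items on which two raters gave the same answer."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("both raters must answer the same non-empty set of items")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return 100.0 * matches / len(answers_a)


# Hypothetical answers of two students to one check-list question, five articles.
student_a = ["yes", "no", "yes", "yes", "no"]
student_b = ["yes", "no", "yes", "no", "no"]

agreement = percent_agreement(student_a, student_b)
print(f"agreement: {agreement:.0f}%")
```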

To analyze our results… • We calculated the median of the impact factor and of the year of publication • Created two groups: ≤ median and > median
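The median split can be sketched as below. The impact factors are hypothetical; values equal to the median go into the lower group, matching the "≤ median" and "> median" labels on the slide.

```python
from statistics import median


def split_at_median(values):
    """Split values into (cutoff, values <= median, values > median)."""
    cutoff = median(values)
    low = [v for v in values if v <= cutoff]
    high = [v for v in values if v > cutoff]
    return cutoff, low, high


# Hypothetical journal impact factors (illustrative only).
impact_factors = [1.2, 2.5, 0.8, 4.1, 3.3, 2.0]

cutoff, low, high = split_at_median(impact_factors)
print(f"median={cutoff}, low group={low}, high group={high}")
```

The same function applies to the year of publication.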

How do we know the differences are significant? • Table data on the year of publication, the journal and the type of data of each article → Chi-Square Test6 • p ≤ 0.05 → statistically significant • 6 - Perla RJ, Carifio J. Use of the Chi-square Test to Determine Significance of Cumulative Antibiogram Data. American Journal of Infectious Diseases, 2005; 1(4): 162-167.
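For a comparison between two groups on a yes/no check-list item, the test reduces to a 2×2 table. The sketch below assumes a Pearson chi-square test without continuity correction (the slides do not say which variant was used) and hypothetical counts. With one degree of freedom, the p-value can be computed from the complementary error function, so only the standard library is needed.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are the two groups (<= median, > median),
# columns are (item fulfilled, item not fulfilled).
statistic, p_value = chi_square_2x2(4, 16, 12, 8)
significant = p_value <= 0.05

print(f"chi-square={statistic:.2f}, p={p_value:.4f}, significant={significant}")
```

When expected counts are small, an exact test such as Fisher's is often preferred over the chi-square approximation.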

EXPECTED RESULTS

Many articles will have misapplied the method • Main reason: • lack of verification of the assumptions; • wrong verification of the assumptions.5 • Least fulfilled assumption: • verifying if the differences follow a normal distribution • Why? • It requires the construction of a different graph (a histogram of the differences), while the other assumption can be verified on the averages vs. differences plot, which is often used to observe the limits of agreement. • 5 - Bland JM and Altman DG. Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology, 2003; 22, 85-93.

There will be variations in the percentage of articles misapplying the method over the years • Why? Researchers started to notice that the method was being misapplied • How? They realized that two methods of clinical measurement that had passed the Bland-Altman test of agreement didn't actually agree very well. • Example: the methods didn't agree for values higher than the ones used in the test.5 • 5 - Bland JM and Altman DG. Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology, 2003; 22, 85-93.

The impact factor of a journal should influence the percentage of misapplications of the method in the articles published there • Why? • Higher impact factor → higher quality → more attention to scientific correctness

RESULTS

Reproducibility of the check-list • To ensure the correct analysis of the articles, two students analyzed the same article • Of the 5 articles analyzed by two different students, there was 100% agreement on all questions except the one asking whether the article had interpreted the outcome correctly according to clinical needs, on which there was a disagreement about 1 article • The two students who disagreed re-evaluated the question and came to an agreement.

What percentage of articles fit into each of the document types defined by ISI? n = 18360 • Articles: 16230 • Reviews: 291 • Meeting abstracts: 70 • Reprints: 2 • Proceedings papers: 1059 • Notes: 121 • Corrections/Additions: 2 • Correction: 1 • Letters: 471

The Sample • 70 (articles and proceedings papers) • Out of those 56, 5 weren't applications of the Bland and Altman limits of agreement, while 51 were.

THE MAIN FINDINGS of our study with regard to our original research question and aims:

Table 1. n (%) of articles which fulfil each point of the check-list.

… if the impact factor of a journal influences the percentage of articles published in it that apply the method correctly • p > 0.05 • Table 2 – Percentage of articles fulfilling each main point of the check-list, divided according to the impact factor of the journal where they were published. We used a Chi-Square test to compare the percentages between the two levels of impact factor. LA – Limits of agreement.

… if, through the years, the percentage of articles applying the method incorrectly has varied • p < 0.05 • Table 3 – Percentage of articles fulfilling each main point of the check-list, divided according to the year when they were published. We used a Chi-Square test to compare the percentages between the two publication-year groups. LA – Limits of agreement. * – statistically significant.

… if the percentage of articles applying the method correctly varies according to whether it is used to obtain primary or secondary data. • Table 4 – Percentage of articles fulfilling each main point of the check-list, divided according to the type of data obtained by using the limits of agreement. We used a Chi-Square test to compare the percentages between the two types of data. LA – Limits of agreement. * – statistically significant.

DISCUSSION

“What is the percentage of articles in which the Bland-Altman method is applied incorrectly?” • Interestingly, out of all the articles we analyzed, THERE WAS NOT ONE article which correctly applied the method in its entirety. • The method seems to be mostly misapplied at the level of verifying the assumptions. • The 7 articles that correctly verified the least fulfilled assumption (the second one), a mere 14%, also correctly fulfilled the first one. • So the errors of the articles that correctly verified the second assumption were only minor ones. • Table 1. n (%) of articles which fulfil each point of the check-list.

… if the impact factor of a journal influences the percentage of articles published in it that apply the method correctly • The articles published in journals with a lower IF appear to have a higher percentage of correct applications! • The differences are, however, not statistically significant (p > 0.05). • Table 2 – Percentage of articles fulfilling each main point of the check-list, divided according to the impact factor of the journal where they were published. We used a Chi-Square test to compare the percentages between the two levels of impact factor. LA – Limits of agreement.

… if, through the years, the percentage of articles applying the method incorrectly has varied • In every single category, the more recently published articles have a higher percentage of correct application of the method • Only one of the results is not statistically significant (p < 0.05 for the others). • With the passing of time, authors have come to realize that the Bland-Altman method sometimes leads to incorrect findings when misapplied • This would lead the authors of more recent studies to be more careful when employing the method. • Table 3 – Percentage of articles fulfilling each main point of the check-list, divided according to the year when they were published. We used a Chi-Square test to compare the percentages between the two publication-year groups. LA – Limits of agreement. * – statistically significant.

… if the percentage of articles applying the method correctly varies according to whether it is used to obtain primary or secondary data. • Articles that use the method to obtain primary data have a higher percentage of correct application than those that use it to obtain secondary data. • Authors are more likely to pay attention to the correct use of a scientific method if it is their main method, or one of their main methods, for acquiring data. • Table 4 – Percentage of articles fulfilling each main point of the check-list, divided according to the type of data obtained by using the limits of agreement. We used a Chi-Square test to compare the percentages between the two types of data. LA – Limits of agreement. * – statistically significant.

Limitations of our work • Relatively small sample • Human error • No other works to cross-reference with

Acknowledgements • Prof. Dr. Cristina Santos • Prof. Dr. Altamiro Rodrigues da Costa Pereira • João Cláudio Antunes, MSc • Class 4
