Grading Strength of Evidence Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Systematic Reviews Methods Guide www.ahrq.gov
Learning Objectives • To define what “grading strength of evidence (SOE)” is • To describe why grading SOE is important • To distinguish between grading SOE and rating the quality of individual articles • To list primary and additional domains for grading SOE • To describe options for scoring SOE domains • To describe how to score and present SOE grades
Grading Strength of Evidence • Is distinct from rating the quality of individual studies • Is generally used only to assess: • Major outcomes (benefits and harms) • Major comparisons, when relevant
Why Grade Strength of Evidence? • To facilitate use of systematic reviews by diverse decisionmakers and stakeholders • To give decisionmakers: • A comprehensive evaluation of the evidence • A sense of how much confidence they can place in the evidence • To foster transparency and documentation
Three Steps to Grading Strength of Evidence • Scoring four required domains • Risk of bias • Consistency • Directness • Precision • Considering, and possibly scoring, four additional domains • Dose-response association • Plausible confounders • Strength of association • Publication bias • Combining scores from required domains into a single strength-of-evidence score, taking scores on additional domains into account as needed
Four Required Domains: Risk of Bias • Concerns both study design and study conduct for individual studies, rated by usual methods • Assesses the aggregate quality of studies within each major study design and integrates those assessments into an overall risk-of-bias score • Risk-of-bias scores: • High — lowers strength-of-evidence grade • Medium • Low — raises strength-of-evidence grade
Four Required Domains: Consistency • Defined as the degree of similarity in the effect sizes of different studies within an evidence base • Consistent evidence bases: • Have the same direction of effect (same side of “no effect”) • Have a narrow range of effect sizes • Inconsistent evidence bases: • Have nonoverlapping confidence intervals • Have significant unexplained clinical or statistical heterogeneity
Four Required Domains: Consistency Scores • Only three possible scores for consistency: • Consistent (i.e., no inconsistency) • Inconsistent • Unknown or not applicable (a single study cannot be assessed) • Meta-analysis: • Use appropriate tests, such as Cochran's Q test or the I² statistic
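The Cochran's Q test and I² statistic named in this slide can be computed directly from study-level results. The sketch below is purely illustrative: the effect estimates and standard errors are hypothetical values on a common scale (e.g., log odds ratios), and the calculation is standard meta-analysis arithmetic rather than anything specific to the EPC guidance.

```python
import numpy as np
from scipy import stats

# Hypothetical study-level effect estimates (e.g., log odds ratios) and standard errors
effects = np.array([0.35, 0.20, 0.55, 0.10])
std_errs = np.array([0.15, 0.12, 0.25, 0.18])

# Inverse-variance weights and the fixed-effect pooled estimate
weights = 1.0 / std_errs**2
pooled = np.sum(weights * effects) / np.sum(weights)

# Cochran's Q: weighted sum of squared deviations from the pooled estimate
q = np.sum(weights * (effects - pooled) ** 2)
df = len(effects) - 1
p_value = stats.chi2.sf(q, df)  # heterogeneity test p-value

# I²: percentage of variability attributable to heterogeneity rather than chance
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"Q = {q:.2f} (df = {df}, p = {p_value:.3f}), I² = {i_squared:.1f}%")
```

A large Q relative to its degrees of freedom (or a high I²) would support a consistency score of "inconsistent," provided the heterogeneity cannot be explained clinically.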
Four Required Domains: Directness • Defined as whether the evidence being assessed: • Reflects a single, direct link between the interventions of interest and the ultimate health outcome under consideration • Relies on multiple links in a causal chain • If multiple links are involved, strength of evidence can be only as strong as the weakest link • Using analytic frameworks* is important *See the “Analytic Frameworks” module
Four Required Domains: Aspects of Indirectness • Intermediate or surrogate outcomes instead of health or patient-centered outcomes • Example: laboratory test results or radiographic findings versus patient-reported functional outcomes or death • Indirect comparisons rather than direct, head-to-head comparisons • Direct (e.g., A vs. B, A vs. C, and B vs. C): • Head-to-head studies in the evidence base • Generally assumes use of health outcomes, not surrogate/proxy outcomes • Better strength of evidence • Indirect (e.g., A vs. B, B vs. C, but not A vs. C): • No head-to-head studies that cover all interventions or outcomes of interest • Problematic situation for all types of comparisons • Strength-of-evidence grades not as strong as with direct evidence
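To make the cost of indirectness concrete, the sketch below uses the Bucher adjusted indirect comparison, a standard technique that this module does not describe: when only A vs. B and C vs. B trials exist, an A vs. C estimate can be formed through the common comparator B, at the price of a wider confidence interval. All numbers are hypothetical.

```python
import math

# Hypothetical trial results on the log odds ratio scale:
# head-to-head A vs. B and C vs. B trials exist, but no A vs. C trial.
lor_ab, se_ab = -0.30, 0.12   # A vs. B
lor_cb, se_cb = -0.10, 0.15   # C vs. B

# Bucher adjusted indirect comparison: A vs. C through the common comparator B.
lor_ac = lor_ab - lor_cb
se_ac = math.sqrt(se_ab**2 + se_cb**2)   # variances add, so precision is lower

ci_lower = lor_ac - 1.96 * se_ac
ci_upper = lor_ac + 1.96 * se_ac
print(f"Indirect A vs. C log OR = {lor_ac:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```

The wider interval of the indirect estimate illustrates why strength-of-evidence grades based on indirect comparisons are generally not as strong as those based on head-to-head evidence.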
Related Issue of Applicability* • Applicability is evaluated separately from directness for the Evidence-based Practice Center (EPC) program. • For decisionmakers, the applicability of evidence depends on the different interests of diverse groups. • A PICOS framework (patient populations, interventions, comparators, outcomes, and settings) is used for applicability assessment in the EPC program. • Although the EPC program separates applicability from strength-of-evidence grading, other systems that work with one decisionmaker may incorporate applicability issues into their evaluations of directness. *See the “Assessing Applicability” module
Four Required Domains: Directness Scores • Only two possible scores for directness: • Direct: • Evidence is based on a single link between the intervention and health outcomes • Indirect: • Evidence relies on: • Surrogate/proxy outcomes • More than one body of evidence • Both situations
Four Required Domains: Precision • Defined as the degree of certainty surrounding an estimate of effect for a specific outcome • Is a complicated concept that: • Asks the question: • What can decisionmakers conclude about whether one treatment is, clinically speaking, inferior, superior, or equivalent (neither inferior nor superior) to another? • Includes considerations of: • Statistical significance of effect estimates • Confidence intervals for those effect estimates
Four Required Domains: Precision Scores • Are rated separately for each important outcome or comparison, including for any summary estimate of effect size • Only two scores are possible • Precise: estimate allows a clinically useful conclusion • Imprecise: confidence interval is so wide it could include clinically distinct (even conflicting) conclusions
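One way to operationalize the precise/imprecise distinction is to ask whether both ends of the confidence interval lead to the same clinical conclusion relative to a prespecified minimal clinically important difference. The sketch below is a hypothetical rule of thumb, not an algorithm prescribed by the EPC program; the effect estimate, interval, and threshold are invented for illustration.

```python
def region(x: float, mcid: float) -> str:
    """Classify a value relative to a minimal clinically important difference (MCID)."""
    if x <= -mcid:
        return "important benefit"
    if x >= mcid:
        return "important harm"
    return "no important difference"

# Hypothetical pooled risk difference with its 95% confidence interval
estimate, ci_lower, ci_upper = -0.04, -0.09, 0.01
mcid = 0.05  # hypothetical threshold for a clinically important difference

# If both interval bounds fall in the same clinical region, a single clinically
# useful conclusion is possible; otherwise the interval supports distinct conclusions.
score = "Precise" if region(ci_lower, mcid) == region(ci_upper, mcid) else "Imprecise"
print(score)  # -> Imprecise for these hypothetical numbers
```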
Additional Domains • Four “discretionary” domains: • Dose-response association • Plausible confounders • Strength of association • Publication bias • Use when they are: • Applicable • Helpful in reaching conclusions about overall grades for strength of evidence
Additional Domains: Dose-Response Association • Pattern of a larger effect with greater exposure (dose, duration, adherence), either across or within studies • Rate this domain only when studies report levels of exposure
Additional Domains: Dose-Response Scores • Three scores are possible for dose-response: • Present: dose-response pattern observed • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence. • Not present: no dose-response pattern observed (dose-response relationship not present) • Not applicable or not tested
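When studies report graded levels of exposure, one simple way to look for a dose-response pattern across studies is to fit a weighted regression of effect size on dose. This is only a sketch with hypothetical data; it is not a method specified in the EPC guidance, and within-study dose-response evidence would need to be examined separately.

```python
import numpy as np

# Hypothetical per-study dose levels, effect estimates (log relative risks),
# and standard errors
doses = np.array([10.0, 20.0, 40.0, 80.0])
effects = np.array([0.10, 0.18, 0.35, 0.60])
std_errs = np.array([0.08, 0.07, 0.10, 0.12])

# Weighted linear fit (weights of 1/SE, as numpy.polyfit expects for Gaussian errors).
# A clearly positive slope suggests larger effects at greater exposure.
slope, intercept = np.polyfit(doses, effects, deg=1, w=1.0 / std_errs)
print(f"Estimated dose-response slope: {slope:.4f} per unit dose")
```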
Additional Domains: Plausible Confounding • In an observational study, plausible confounding factors sometimes work in the direction opposite that of the observed effect. • Had such "effect-weakening" confounders not been present, the true effect would likely be even larger than the one observed. • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence. • Consider whether plausible confounding exists that would decrease the observed effect.
Additional Domains: Plausible Confounding Scores • Two scores are possible for plausible confounding: • Present: confounding factors that would decrease the observed effect may be present • Absent: confounding factors that would decrease the observed effect are not likely to be present
Additional Domains: Strength of Association • Magnitude of effect: • Defined as the likelihood that the observed effect is large enough that it cannot have occurred solely as a result of bias from potential confounding factors • Consider when effect size is particularly large
Additional Domains: Strength of Association Scores • Two scores are possible for strength of association: • Strong: large effect size that is unlikely to have occurred in the absence of a true effect of the intervention • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence. • Weak: small enough effect size that it could have occurred solely as a result of bias from confounding factors
Additional Domains: Publication Bias • Studies may have been published selectively. • Example: only a small proportion of relevant trials or other studies have been published. • Estimated effects of an intervention that are based only on published studies may then fail to reflect the true effect. • Publication bias may undermine the overall robustness of a body of evidence.
Additional Domains: Publication Bias Scores • Publication bias scores: • Need not be formally computed but can influence ratings of the required domains • Possible publication bias should be taken into account when: • Rating consistency • Calculating a summary confidence interval for an effect • Add comments on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.
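Although publication bias need not be formally scored, reviewers sometimes explore it with a funnel-plot asymmetry test. The sketch below uses Egger's regression test, a common technique that this module does not name; the data are hypothetical, and asymmetry can have causes other than publication bias, so the result is only one signal among several.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical study-level effect estimates (log odds ratios) and standard errors
effects = np.array([0.42, 0.31, 0.55, 0.12, 0.48, 0.60])
std_errs = np.array([0.10, 0.14, 0.22, 0.09, 0.25, 0.30])

# Egger's regression: standardized effect (effect / SE) regressed on precision (1 / SE).
# An intercept far from zero suggests funnel-plot asymmetry.
standardized = effects / std_errs
precision = 1.0 / std_errs
model = sm.OLS(standardized, sm.add_constant(precision)).fit()
intercept, intercept_p = model.params[0], model.pvalues[0]
print(f"Egger intercept = {intercept:.2f}, p = {intercept_p:.3f}")
```

In practice such a test needs a reasonable number of studies (often ten or more) to be informative.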
Procedures for Assessing Domains • Use two or more reviewers with the appropriate clinical and methodological expertise. • Assess separately: • Each required domain (or each optional domain, as relevant) • Each major outcome, including benefits and harms • Resolve differences by consensus or mediation by an additional expert; consensus scores should appear in tables. • Record each reviewer's individual judgments about the domains and maintain them as background documentation.
Strength of Evidence Grades (I) • Reflect a global assessment that: • Takes the required domains directly into account • Incorporates judgments about the additional domains as needed • Aim to: • Provide “actionable” information for a variety of different users, readers, and stakeholders • Be transparent in how the strength-of-evidence grades are reached
Strength of Evidence Grades (II) • For each comparison of interest, rate the strength of evidence for: • Each major benefit (e.g., positive effects on health outcomes such as physical function or quality of life, or effects on laboratory measures or other surrogate variables) • Each major harm (ranging from rare, serious, or life-threatening adverse events to common but bothersome effects) • For both benefits and harms: • Focus on the outcomes most relevant to patients, clinicians, and policymakers
Strength of Evidence Grades and Definitions • High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect. • Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate. • Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate. • Insufficient: Evidence either is unavailable or does not permit a conclusion.
Strength of Evidence Grades: Additional Points (I) • Using the high, moderate, or low strength-of-evidence grade: • Implies that a body of evidence actually exists • Is intended to convey how confident reviewers are about decisions that may be made based on evidence graded one way or another • Requires the use of only one designation, not a range (e.g., not “low to moderate”)
Strength of Evidence Grades: Additional Points (II) • The insufficient strength-of-evidence grade: • Is applied when: • Reviewers cannot draw conclusions about an outcome, comparison, or other question • Is appropriate when: • No evidence is available at all • Evidence is too insubstantial to permit conclusions to be drawn (e.g., opposing results from studies with a similar risk of bias; wide and overlapping confidence intervals)
Scoring and Reporting: General Guidance • Different approaches may be used to incorporate the multiple domains into an overall strength-of-evidence grade: • The GRADE algorithm • The Evidence-based Practice Center weighting system • A qualitative approach • Use (at least) two reviewers • Assess the resulting interrater reliability for each domain score, and keep records
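As a purely hypothetical illustration of the "point system" idea, the sketch below starts from the risk-of-bias score and downgrades once for each unfavorable score on the other required domains. It is not the GRADE algorithm and not the EPC weighting system; the additional domains, and any upgrading they might justify, are omitted for brevity.

```python
# Hypothetical point scheme for combining required-domain scores into an
# overall strength-of-evidence grade (illustrative only).
GRADES = ["insufficient", "low", "moderate", "high"]

def overall_grade(risk_of_bias: str, consistency: str,
                  directness: str, precision: str) -> str:
    # Start from risk of bias: low risk of bias -> "high" starting point.
    start = {"low": 3, "medium": 2, "high": 1}[risk_of_bias]
    # Downgrade one level for each unfavorable score on the other required domains.
    penalties = [
        consistency == "inconsistent",
        directness == "indirect",
        precision == "imprecise",
    ]
    return GRADES[max(0, start - sum(penalties))]

# Example: low risk of bias, consistent, direct, but imprecise evidence
print(overall_grade("low", "consistent", "direct", "imprecise"))  # -> moderate
```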
Guiding Principles: Risk of Bias • Risk of bias (given design and conduct of available studies) is the essential component in determining the strength-of-evidence grade. • First, consider which study design is most appropriate to reduce bias for each question. • Next, consider the risk of bias from available studies.
Guiding Principles: Risk of Bias Example • Drug comparisons in randomized controlled trials (RCTs), with either placebo or an active comparator as an appropriate design: • Evidence from well-conducted RCTs will have less risk of bias than evidence based on observational studies. • For RCTs, reviewers can start with a rating of low for risk of bias and change the assessment if the RCTs have important flaws. • For observational data, reviewers can start with a rating of high for risk of bias and change the assessment, depending upon how well studies were conducted.
Further Guidance: Principles for Scoring • Be explicit about how the evidence grade will be determined. • A point system for combining ratings of the domains • A qualitative consideration of the domains • Carefully document procedures. • Keep records of procedures and results for each review so that they may contribute to the overall expertise of the Evidence-based Practice Center and the science of grading evidence.
Further Guidance: Principles for Reporting (I) • Explain the rationale for the approach used and identify which domains were important in upgrading or downgrading the strength of evidence. • Explain judgments about the degree to which any additional domains altered the overall strength-of-evidence grade. • Provide enough detail within the report to ensure that users can grasp the methods.
Further Guidance: Principles for Reporting (II) • Use the terms high, moderate, low, or insufficient. • Do not use Roman numerals or other symbols. • Use or adapt the illustrative tabular approach to reporting (see the publications listed below for examples). • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality; posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf. • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions—Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
Grading Strength of Evidence: Presentation of Results — Moderate and High Grades
CI = confidence interval; RCT = randomized controlled trial
Grading Strength of Evidence: Presentation of Results — Insufficient and Low Grades
CI = confidence interval; RCT = randomized controlled trial
Comparison With the GRADE System • The grading system used by the Evidence-based Practice Centers (EPCs) is similar to the GRADE system. • The EPC grading system reflects the needs of AHRQ stakeholders for reviews on a wide variety of topics, not for recommendations or guidelines. • The main differences between the two grading systems: • The definitions of domains differ slightly; in the EPC system, "directness" excludes "applicability," which is handled separately. • In the EPC system, observational studies are considered to have less risk of bias for outcomes such as harms, which can raise the initial grade to "moderate." • The definition of the overall grade differs; the EPC system emphasizes confidence in the estimate, whereas the GRADE system emphasizes the effect of future research. • The EPC system permits three different ways to reach an overall strength-of-evidence grade; the GRADE formula has one.
Summary: Grading Strength of Evidence • Is a critical last step in analysis and presentation • Is done after the quality of articles is rated by at least two independent reviewers • Helps users of systematic reviews understand the body of evidence and how much confidence they can have in making decisions based on that evidence • Uses scores on four primary (mandatory) domains and four additional (discretionary) domains • Focuses on major outcomes and comparisons • Is denoted in terms of high, moderate, or low strength or insufficient evidence • Presents strength-of-evidence grades in tabular form
References • Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ 2004;328:1490. • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Agency for Healthcare Research and Quality. Methods Guide for Comparative Effectiveness Reviews [posted July 2009]. Rockville, MD. Available at: http://effectivehealthcare.ahrq.gov/healthInfo.cfm?infotype=rr&ProcessID=60. • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions—Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
Author • This presentation was prepared by Kathleen N. Lohr, Ph.D., a Distinguished Fellow at RTI International. • This module is based on an update of chapter 11 in version 1.0 of the Methods Guide for Comparative Effectiveness Reviews (updated chapter available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf).