Accuracy, Reliability, and Validity of FreeSurfer Measurements

David H. Salat, salat@nmr.mgh.harvard.edu

Why Talk About This?
• This is not meant to imply that everything is perfect in FreeSurfer processing.
• The information here should be used as a guide for how to assess the data in your own projects.
• These are general principles that apply to all types of data: structural, functional, cognitive, etc.

What is Accuracy?
• Accuracy: the degree of closeness of a measured or calculated quantity to its actual (true) value (e.g. a physical property such as length or thickness).
• MRI measures are indirect. We may be able to measure morphometry accurately given the contrast of the MR image; however, measurements based on this contrast may differ from measurements of the actual tissue properties.

What is Reliability?
• Reliability: the agreement between measures obtained for the same individual on two different trials, typically close together in time to avoid a biological influence on the reliability measure.
• Reliability of a labeling procedure in the same scan (e.g. hippocampus; usually for manual labeling).
• Reliability of the labeling procedure in the same subject on two different scans collected on the same scanner (automated procedures).
• Reliability of the labeling procedure in the same subject on two different scans collected on two different scanners (multi-site studies).
• Effect reliability: replication of the experiment in an independent sample.
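One common way to put a number on test-retest reliability is the intraclass correlation coefficient (ICC). The sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single measurement) from repeated measurements. The function name `icc_2_1` and the thickness values are hypothetical, chosen only for illustration; the slides do not state which reliability statistic was used in the cited studies.

```python
def icc_2_1(data):
    """ICC(2,1) for a table with one row per subject and one column per session."""
    n = len(data)          # number of subjects
    k = len(data[0])       # number of sessions (scans) per subject
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Hypothetical mean cortical thickness (mm) for 5 subjects, 2 scan sessions each
    thickness = [
        [2.51, 2.49],
        [2.30, 2.33],
        [2.75, 2.71],
        [2.42, 2.45],
        [2.60, 2.58],
    ]
    print(f"ICC(2,1) = {icc_2_1(thickness):.3f}")
```

Because this form of the ICC measures absolute agreement, a constant offset between sessions (a systematic error) lowers the value even when subjects keep the same rank order.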

What is Validity?
• Validity: the extent to which an indirect measurement is representative of what it is supposed to measure.
• For example, in fMRI we use blood flow as an indirect measure of neural activity. Is this a valid measure of neural activity?

Validity Examples
• Internal validity: How strong are the overall experimental design, study sample size, analysis procedures, etc.?
• External validity: Do the results generalize to another sample? (replication)
• Ecological validity: Can the measure be applied in the real world outside of the experimental setting? (clinical application)
• Construct validity: What does the totality of evidence show? (do the data fit with what is known?)
• Convergent validity: Does the measure correlate with other types of measures that it should theoretically be correlated with? (do the data correlate with ‘gold standards’?)
• Discriminant validity: Is the measure uncorrelated with measures it should not be correlated with? (intracranial volume/age)
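Convergent and discriminant validity are usually checked with correlations. The sketch below uses a plain Pearson correlation; all values are hypothetical, and `unrelated_measure` is an invented stand-in for any variable the measure should not track.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical values for 6 subjects
    automated_thickness = [2.1, 2.4, 2.6, 2.9, 3.1, 3.3]   # mm, automated measure
    manual_thickness = [2.0, 2.5, 2.5, 3.0, 3.0, 3.4]      # mm, manual "gold standard"
    unrelated_measure = [5.0, 3.0, 4.0, 4.0, 3.0, 5.0]     # should not track thickness

    # Convergent validity: expect a high correlation with the gold standard
    print("convergent r  :", round(pearson_r(automated_thickness, manual_thickness), 3))
    # Discriminant validity: expect a correlation near zero
    print("discriminant r:", round(pearson_r(automated_thickness, unrelated_measure), 3))
```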

Types of Error
• Random error: unknown and unpredictable changes in the measurement.
• Should be unbiased.
• Accuracy, reliability, and validity are all limited by error.
• Systematic error: predictable offset or scaling of data.
• Typically comes from some more obvious aspect of the data acquisition/analysis (e.g. there is a global offset of values at 3T relative to 1.5T; this must be considered when combining data across scanners).
• Can potentially be identified and corrected.
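As an illustration of how a systematic offset or scaling might be identified and removed, the sketch below fits a straight line to paired measurements of the same subjects at two field strengths and inverts it. The linear model, the function names, and the values are assumptions made for this example; the slides do not specify a correction method.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def remove_systematic_error(y, slope, intercept):
    """Map values from the second scanner back onto the first scanner's scale."""
    return [(yi - intercept) / slope for yi in y]


if __name__ == "__main__":
    # Hypothetical thickness (mm) for the same subjects at 1.5T and 3T
    thickness_15t = [2.20, 2.45, 2.60, 2.85, 3.00]
    thickness_3t = [2.33, 2.56, 2.74, 2.97, 3.15]

    slope, intercept = fit_line(thickness_15t, thickness_3t)
    corrected = remove_systematic_error(thickness_3t, slope, intercept)
    print(f"scaling = {slope:.3f}, offset = {intercept:.3f} mm")
    print("3T values on the 1.5T scale:", [round(v, 2) for v in corrected])
```

What remains after the correction is the random error, which a fit like this cannot remove.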

Why is this important?
• Sensitivity: Poor reliability increases variance across individuals and across timepoints.
• Many studies would benefit from the ability to measure minute changes across time.
• Interpretation: Validity is directly tied to interpretation. You may have a valid measure of ‘cortical thickness’, but ‘cortical thickness’ might not be a valid measure of degeneration.
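The cost of poor reliability can be made concrete with a rough power calculation. The sketch below assumes classical test theory (reliability = true variance / observed variance, so an observed standardized effect shrinks by the square root of the reliability) and the usual normal approximation for a two-group comparison. The function names and the example effect size are hypothetical.

```python
from statistics import NormalDist


def observed_effect_size(true_d, reliability):
    """Standardized group difference after attenuation by measurement error."""
    return true_d * reliability ** 0.5


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2


if __name__ == "__main__":
    true_d = 0.5  # hypothetical true effect size
    for reliability in (1.0, 0.9, 0.7, 0.5):
        d_obs = observed_effect_size(true_d, reliability)
        print(f"reliability {reliability:.1f}: observed d = {d_obs:.2f}, "
              f"about {n_per_group(d_obs):.0f} subjects per group")
```

Under these assumptions the required sample size scales as 1 / reliability, so halving the reliability doubles the number of subjects needed to detect the same effect.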

[Figure: FreeSurfer processing overview. Original T1 data go through cortical reconstruction and subcortical segmentation (recon-all). Output data: surfaces (computer models; thickness, aparc, curv, sulc, jacobian) and volumes (labeled MRI images; segmentation, parcellation, white matter parcellation). Individual subjects: visualization with tksurfer (surfaces) and tkmedit (volumes), and region-of-interest analysis in spreadsheet/stat software. Group comparisons/statistics: Qdec, mri_glmfit.]

Accuracy and Validity of Spherical Averaging for Labeling Structural and Functional Anatomy
Use of folding patterns to align subjects. Alternative to Talairach/MNI.
Fischl et al., 1999

Anatomic Labeling
Matching a manual anatomic label of the central sulcus across individuals. Bruce will talk about matching cytoarchitectonics.
Fischl et al., 1999

Functional Labeling
Matching functional retinotopic labels of the visual fields across individuals.
Fischl et al., 1999

Enhanced fMRI Statistical Power
Averaging functional data across subjects on a cognitive task.
Fischl et al., 1999

Cortical Thickness (Results fall within expected range)
• Consistent with published findings:
  • crowns of gyri are thicker than the fundi of sulci
  • sensory areas are among the thinnest in the cortex.
Fischl et al., 1999

Values match manual measurements from published imaging data.
Fischl et al., 1999

Manual Measurements
• Age effects with automated procedures replicated with manual measures.
• Can only be done in regions where folds are appropriate.
• Calcarine values are also consistent across studies (different scanners).
[Figure panels: Calcarine, Orbitofrontal]
Salat et al., 2004; Kuperberg et al., 2003

Cortical Thickness Comparison with Postmortem Measures
Rosas et al., 2002

Subcortical/Volumetric Segmentation
Automated measures are similar in size and region to manual measures, and predict who will develop Alzheimer's disease (AD).
Fischl et al., 2002

Cortical Parcellations: Compared to Manually Labeled Data
• 1 volume-based and 2 surface-based labeling schemes.
• Percent of subjects labeled correctly at each location across the surface.
[Figure panels: Volume Atlas, Surface Atlas, Surface Atlas 2]
Fischl et al., 2004; Desikan et al., 2006
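A map of "percent of subjects labeled correctly at each location" can be sketched as follows. The example assumes that automated and manual labels have already been resampled to a common space, so that index `v` refers to the same location in every subject. The function name and the tiny label arrays are hypothetical.

```python
def percent_correct_per_location(auto_labels, manual_labels):
    """For each location, percent of subjects whose automated label matches the manual one.

    auto_labels, manual_labels: one list of labels per subject, all the same length.
    """
    n_subjects = len(auto_labels)
    n_locations = len(auto_labels[0])
    result = []
    for v in range(n_locations):
        hits = sum(
            1 for s in range(n_subjects)
            if auto_labels[s][v] == manual_labels[s][v]
        )
        result.append(100.0 * hits / n_subjects)
    return result


if __name__ == "__main__":
    # Hypothetical labels at 3 surface locations for 2 subjects
    auto = [
        ["precentral", "postcentral", "postcentral"],
        ["precentral", "precentral", "postcentral"],
    ]
    manual = [
        ["precentral", "postcentral", "postcentral"],
        ["precentral", "postcentral", "postcentral"],
    ]
    print(percent_correct_per_location(auto, manual))
```

Locations with low agreement tend to sit near label boundaries, where manual and automated borders are most likely to differ.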

White Matter Parcellation: same subjects scanned at different times
• Most regions within 5%.
Salat et al., 2008
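A "within 5%" criterion can be checked per region with a percent difference between the two sessions. The sketch below uses the symmetric percent difference (absolute difference divided by the mean of the two values); the slides do not say how the 5% figure was computed, so this definition, the function names, and the volumes are assumptions.

```python
def percent_difference(a, b):
    """Symmetric percent difference between two measurements of the same quantity."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)


def regions_within_tolerance(scan1, scan2, tolerance_pct=5.0):
    """Map each region name to True if its two measurements agree within the tolerance."""
    return {
        region: percent_difference(scan1[region], scan2[region]) <= tolerance_pct
        for region in scan1
    }


if __name__ == "__main__":
    # Hypothetical regional white matter volumes (mm^3) from two sessions
    session1 = {"superiorfrontal": 18500.0, "precentral": 13200.0, "entorhinal": 900.0}
    session2 = {"superiorfrontal": 18150.0, "precentral": 13420.0, "entorhinal": 980.0}

    for region, ok in regions_within_tolerance(session1, session2).items():
        diff = percent_difference(session1[region], session2[region])
        print(f"{region:16s} {diff:5.2f}%  {'within 5%' if ok else 'exceeds 5%'}")
```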

Comparison across time, scanner, field strength, number of scans, sequence type, scanner upgrade, and scanner manufacturer
Han et al., 2006

Effects of Pulse Sequence, Voxel Geometry/Resolution, and Parallel Imaging
• Wonderlick et al., 2009: Parallel acceleration, increased spatial resolution, high bandwidth multiecho sequence.
  • Reliability high across imaging parameters.
  • Significant measurement bias observed between MPR and all isotropic sequences for all cortical regions and some subcortical structures.
  • Improvements in MRI acquisition technology do not compromise data reproducibility, but consistency should be maintained.
• Jovicich et al., 2009: Averaging multiple acquisitions, B1 correction, acquisition sequence (MPRAGE vs. multi-echo-FLASH), scanner upgrades (Sonata-Avanto, Trio-TrioTIM), segmentation atlas (MPRAGE or multi-echo-FLASH).
  • Minimally affected by different manipulations.
  • Volume measurements across platforms (Siemens Sonata vs. GE Signa) and field strengths (1.5 T vs. 3 T) result in bias but with comparable variance as within-scanner.
  • Multi-site studies may not necessarily require a much larger sample to detect a specific effect.
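The distinction between bias and variance in cross-scanner comparisons can be illustrated with the mean and standard deviation of paired differences (the quantities behind a Bland-Altman analysis). This is a sketch with hypothetical volumes and an invented function name; it is not the analysis used in the cited studies.

```python
from statistics import mean, stdev


def bias_and_spread(a, b):
    """Mean (bias) and standard deviation (spread) of the paired differences b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs), stdev(diffs)


if __name__ == "__main__":
    # Hypothetical hippocampal volumes (mm^3) for the same 5 subjects
    scanner_a_scan1 = [3900.0, 4150.0, 3720.0, 4300.0, 4010.0]
    scanner_a_scan2 = [3925.0, 4120.0, 3745.0, 4290.0, 4030.0]   # same scanner, rescan
    scanner_b_scan = [4020.0, 4260.0, 3850.0, 4405.0, 4135.0]    # different scanner

    within_bias, within_sd = bias_and_spread(scanner_a_scan1, scanner_a_scan2)
    cross_bias, cross_sd = bias_and_spread(scanner_a_scan1, scanner_b_scan)
    print(f"within-scanner: bias = {within_bias:7.1f}, SD = {within_sd:5.1f}")
    print(f"cross-scanner : bias = {cross_bias:7.1f}, SD = {cross_sd:5.1f}")
```

A cross-scanner result with a clearly nonzero bias but a spread similar to the within-scanner case is the pattern described above: a systematic offset that can be modeled, with no loss of precision.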

Replication of Study Results: Split Sample
• Concordant results are likely not due to statistical error.
• Current study with 5 samples used in prior literature assessing the replicability of cortical/subcortical results (Fjell et al., 2009; Walhovd et al., 2009).
Salat et al., 2004

Replicable WM Parcellation Results Across Sex and Hemisphere
[Figure panels: Men, Women]
Salat et al., 2008

Replication of Effects in Same Participants Across Scanning Conditions
Dickerson et al., 2008

Consistent Findings Across 4 Samples Used To Identify Regions with Predictive Validity
• Regional measures predict who will progress to AD.
Dickerson et al., 2008

Conclusions
• Any tool used for MR analysis should be rigorously tested for accuracy, reliability, and validity.
• Most of the measures from FreeSurfer have good accuracy, reliability, and validity across a range of conditions.
• These results are dependent on optimal input data and correct implementation.
• These data provide confidence, but do not substitute for using similar procedures to check data from each new study.

Cross Sequence Parameters
• Different pairs of flip angles can be used for reliable measures.
Fischl et al., 2004
