
Analysis of fMRI data using Support Vector Machine (SVM) Janaina Mourao-Miranda






Presentation Transcript


  1. Analysis of fMRI data using Support Vector Machine (SVM), Janaina Mourao-Miranda

  2. Recently, pattern recognition methods have been used to analyze fMRI data with the goal of decoding the information represented in the subject’s brain at a particular time.
  • Carlson, T.A., Schrater, P., He, S. (2003). Patterns of activity in the categorical representations of objects. J Cogn Neurosci.
  • Cox, D.D., Savoy, R.L. (2003). Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage.
  • Mourão-Miranda, J., Bokde, A.L.W., Born, C., Hampel, H., Stetter, M. (2005). Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. NeuroImage.
  • Davatzikos, C., Ruparel, K., Fan, Y., Shen, D.G., Acharyya, M., Loughead, J.W., Gur, R.C., Langleben, D.D. (2005). Classifying spatial patterns of brain activity with machine learning methods: application to lie detection. NeuroImage.
  • Haynes, J.D., Rees, G. (2005). Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature Neuroscience, 8:686-91.
  • Kriegeskorte, N., Goebel, R., Bandettini, P. (2006). Information-based functional brain mapping. PNAS.
  • LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X. (2005). Support vector machines for temporal classification of block design fMRI data. NeuroImage.
  • Mitchell, T.M., Hutchinson, R., Niculescu, R.S., Pereira, F., Wang, X., Just, M., Newman, S. (2004). Learning to decode cognitive states from brain images. Machine Learning.
  • Mourão-Miranda, J., Reynaud, E., McGlone, F., Calvert, G., Brammer, M. (2006). The impact of temporal compression and space selection on SVM analysis of single-subject and multi-subject fMRI data. NeuroImage (accepted).
  • Norman, K.A., Polyn, S.M., Detre, G.J., Haxby, J.V. (2006). Beyond mind-reading: multivoxel pattern analysis of fMRI data. Trends in Cognitive Sciences.
  • Haynes, J.D., Rees, G. (2006). Decoding mental states from brain activity in humans. Nature Reviews Neuroscience.

  3. Learning Methodology
  Automatic procedures that learn a task from a series of examples. Pattern recognition is a field within the area of machine learning. This is supervised learning: we have inputs (X1, X2, X3, …) and outputs (y1, y2, y3, …), but no mathematical model available.
  • Learning/Training: given training examples (X1, y1), (X2, y2), ..., (Xn, yn), generate a function or hypothesis f such that f(Xi) -> yi.
  • Test/Prediction: for a test example Xi, predict f(Xi) = yi.
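The train/test procedure above can be sketched in a few lines. The toy 2-D data and the use of scikit-learn's LinearSVC are illustrative assumptions, not part of the original methodology.

```python
# Minimal sketch of supervised learning: fit f on (Xi, yi) pairs,
# then predict labels for unseen examples. Toy 2-D data, not fMRI.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# training examples (X1, y1), ..., (Xn, yn): two well-separated clusters
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([-1] * 20 + [+1] * 20)

f = LinearSVC().fit(X_train, y_train)  # learning: generate f with f(Xi) -> yi

X_test = np.array([[0.1, -0.2], [3.2, 2.9]])  # unseen test examples
pred = f.predict(X_test)                      # prediction: f(X_test)
```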

  4. SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
  • SVMs were introduced by Boser, Guyon and Vapnik at COLT-92.
  • A powerful tool for statistical pattern recognition.
  Machine learning methods include: artificial neural networks, decision trees, Bayesian networks, support vector machines, ...

  5. SVM Applications: face/object recognition, fMRI data analysis, handwritten digit recognition, classification of microarray gene expression data, texture classification, protein structure prediction.

  6. fMRI Data Analysis
  Classical approach (mass-univariate analysis, e.g. GLM): input = voxel time series (BOLD signal intensity over time) and the experimental design; output = a map of activated regions for task 1 vs. task 2.
  Pattern recognition approach (multivariate analysis): SVM training takes volumes from task 1 and volumes from task 2 as input and outputs a map of the regions discriminating between task 1 and task 2; SVM test takes a new example and outputs a prediction: task 1 or task 2.

  7. fMRI data as input to a classifier
  • Each fMRI volume is treated as a vector in an extremely high-dimensional space (~200,000 voxels, i.e. dimensions, after the mask).
  • fMRI volume -> feature vector (dimension = number of voxels).
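The vectorization step can be sketched as follows; the array sizes are toy assumptions (a real masked volume would have ~200,000 voxels):

```python
# Turn one fMRI volume into a feature vector: apply a brain mask and
# keep the in-mask voxel intensities. Shapes here are toy assumptions.
import numpy as np

volume = np.random.rand(4, 4, 4)        # one fMRI volume (x, y, z)
mask = np.zeros((4, 4, 4), dtype=bool)  # brain mask
mask[1:3, 1:3, 1:3] = True              # 8 "in-brain" voxels
feature_vector = volume[mask]           # dimension = number of in-mask voxels
```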

  8. Binary classification can be viewed as the task of finding a hyperplane: volumes acquired during task 1 and task 2 are plotted as points in voxel space (voxel 1 vs. voxel 2), separated by a hyperplane with normal vector w. A volume from a new subject is then assigned to task 1 or task 2 according to the side of the hyperplane on which it falls.

  9. Simplest approach: Fisher Linear Discriminant (FLD)
  • The FLD classifies by projecting the training set onto the axis defined by the difference between the centres of mass of the two classes, corrected for the within-class covariance, and comparing each projection with a threshold.
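A minimal FLD, following the description above, on assumed two-voxel toy data:

```python
# Fisher Linear Discriminant: project onto the class-mean difference
# corrected for within-class covariance, w = Sw^{-1} (mu1 - mu2),
# then threshold the projection. Toy two-voxel data.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 0.5, (50, 2))  # class 1 ("task 1") samples
X2 = rng.normal([2, 2], 0.5, (50, 2))  # class 2 ("task 2") samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within-class scatter
w = np.linalg.solve(Sw, mu1 - mu2)  # FLD projection axis
thr = w @ (mu1 + mu2) / 2           # threshold halfway between the means

def is_class1(X):
    """Classify by projecting onto w and comparing with the threshold."""
    return X @ w > thr
```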

  10. Optimal Hyperplane: which of the linear separators is optimal?
  Data: <Xi, yi>, i = 1, ..., N; observations Xi ∈ R^d; labels yi ∈ {-1, +1}.
  • All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·X + b = 0.
  • Our aim is to find a hyperplane/decision function f(X) = sign(w·X + b) that correctly classifies our data: f(X1) = +1 and f(X2) = -1.
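The decision function can be written directly; the particular w and b below are hand-picked for illustration, not learned:

```python
# The decision function f(X) = sign(w·X + b) from the slide, with an
# illustrative (assumed) hyperplane w·X + b = 0.
import numpy as np

w = np.array([1.0, 1.0])  # hyperplane normal (assumed for illustration)
b = -3.0                  # offset

def f(X):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return np.sign(X @ w + b)

X1 = np.array([2.5, 2.5])  # w·X1 + b = +2, so f(X1) = +1
X2 = np.array([0.5, 0.5])  # w·X2 + b = -2, so f(X2) = -1
```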

  11. Optimal hyperplane: largest margin classifier
  • Among all hyperplanes separating the data there is a unique optimal hyperplane: the one with the largest margin (the distance of the closest points to the hyperplane).
  • Let us consider that all test points are generated by adding bounded noise r to the training examples (test and training data are assumed to have been generated by the same underlying dependence).
  • If the optimal hyperplane has margin γ > r, it will correctly separate the test points.

  12. Support Vector Machine (SVM)
  Data: <Xi, yi>, i = 1, ..., N; observations Xi ∈ R^d; labels yi ∈ {-1, +1}.
  • The distance between the separating hyperplane and a sample Xi is d = |w·Xi + b| / ||w||.
  • Assuming that a margin γ exists, all training patterns obey the inequality yi·d ≥ γ, i = 1, ..., n.
  • Substituting d into the previous inequality gives yi(w·Xi + b) / ||w|| ≥ γ.
  • Thus maximizing the margin γ is equivalent to minimizing the norm of w. To limit the number of solutions we fix the scale of the product: γ||w|| = 1.
  • Finding an optimal hyperplane is a quadratic optimization problem with linear constraints and can be formally stated as: determine w and b that minimize the functional Φ(w) = ||w||²/2 subject to the constraints yi(w·Xi + b) ≥ 1, i = 1, ..., n.
  • The solution has the form w = Σ αi yi Xi, and b = w·Xi - yi for any Xi such that αi ≠ 0.
  • The examples Xi for which αi > 0 are called the support vectors.
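The stated result w = Σ αi yi Xi can be checked numerically; scikit-learn's SVC is an assumed stand-in for the authors' software, and the four data points are illustrative:

```python
# Train a linear SVM, then recover w = sum_i alpha_i yi Xi from the
# support vectors and compare with the primal weight vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# dual_coef_ stores alpha_i * yi for each support vector, so this
# product reproduces the primal weight vector w
w = clf.dual_coef_ @ clf.support_vectors_
```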

  13. How to interpret the learning weight vector (discriminating volume)?
  Example: weight vector (discriminating volume) w = [0.45 0.89].
  • The value of each voxel in the discriminating volume indicates the importance of that voxel in differentiating between the two classes or brain states.
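Building the discriminating volume amounts to putting each SVM weight back at its voxel position inside the brain mask; the mask shape is a toy assumption and the weights reuse the slide's illustrative w = [0.45, 0.89]:

```python
# Map the learned weight vector back into a "discriminating volume":
# one weight per in-mask voxel.
import numpy as np

mask = np.zeros((2, 2, 1), dtype=bool)
mask[0, 0, 0] = mask[1, 1, 0] = True  # two in-mask voxels
w = np.array([0.45, 0.89])            # learned SVM weight vector

disc_volume = np.zeros(mask.shape)
disc_volume[mask] = w                 # discriminating volume
```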

  14. Advantage of using multivariate methods
  Example with two voxels: voxel 1 shows no mean difference between conditions; voxel 2 does.
  • Univariate analysis only detects activation in voxel 2.
  • SVM (multivariate analysis) assigns a weight to both voxels.

  15. Pattern Recognition Method: General Procedure
  • Pre-processing: normalization, realignment, smoothing.
  • Split data: training and test.
  • Dimensionality reduction (e.g. PCA) and/or feature selection (e.g. ROI).
  • SVM training and test.
  • Output: accuracy and discriminating maps (SVM weight vector).
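The general procedure can be sketched end-to-end on synthetic data; the sample sizes and component count are assumptions:

```python
# General procedure: split into training and test sets, reduce
# dimensionality with PCA, then train and test a linear SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 50)), rng.normal(1, 1, (30, 50))])
y = np.array([0] * 30 + [1] * 30)

# split data: training and test
X_tr, X_te, y_tr, y_te = X[::2], X[1::2], y[::2], y[1::2]

# dimensionality reduction (PCA) followed by SVM training and test
model = make_pipeline(PCA(n_components=10), SVC(kernel="linear"))
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)  # output: accuracy on the test set
```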

  16. Applications

  17. Can we classify brain states using the whole brain information from different subjects?

  18. Training subjects vs. test subject: fMRI volumes acquired while each training subject views a pleasant or an unpleasant stimulus are fed, with their labels, to the machine learning method (support vector machine). Given a scan from a new test subject, the trained classifier predicts the label, e.g. "the subject was viewing a pleasant stimulus".

  19. Application I
  Number of subjects: 16. Tasks: viewing unpleasant and pleasant pictures (6 blocks of 7 scans).
  • Pre-processing procedures: motion correction, normalization to standard space, spatial filtering; mask to select voxels inside the brain.
  • Leave-one-out test: training on 15 subjects, test on 1 subject. This procedure was repeated 16 times and the results (error rate) were averaged.
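The leave-one-out loop above can be sketched as follows; the synthetic per-subject patterns (one "unpleasant" and one "pleasant" example per subject) are assumptions standing in for real preprocessed volumes:

```python
# Leave-one-out over subjects: train on 15 subjects, test on the
# held-out one, repeat for all 16, and average the error rate.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n_subjects, n_voxels = 16, 20
# per subject: one "unpleasant" (-1) and one "pleasant" (+1) pattern
X = np.stack([np.vstack([rng.normal(0, 1, n_voxels),
                         rng.normal(2, 1, n_voxels)])
              for _ in range(n_subjects)])  # shape (16, 2, 20)
y = np.array([-1, +1])

errors = []
for s in range(n_subjects):
    train = np.delete(np.arange(n_subjects), s)
    clf = SVC(kernel="linear").fit(X[train].reshape(-1, n_voxels),
                                   np.tile(y, n_subjects - 1))
    errors.append(1.0 - clf.score(X[s], y))  # error on the held-out subject
mean_error = float(np.mean(errors))          # averaged error rate
```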

  20. Pipeline: pre-processing -> PCA -> SVM applied to spatial observations. Output: accuracy and discriminating volume (spatial weight vector; unpleasant vs. pleasant weights shown on axial slices z = -18 to 42).

  21. Can we classify groups using the whole brain information from different subjects?

  22. Pattern Classification of Brain Activity in Depression (collaboration with Cynthia H.Y. Fu). TP = 74%, TN = 63%.

  23. Can we improve the accuracy by averaging time points?

  24. First approach: use single volumes as training examples.
  Second approach: use the average of the volumes within the block as training examples (one example per block).
  Third approach: use block-specific estimators as training examples.
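The second approach (temporal compression by block averaging) can be sketched as follows; the 6 blocks of 7 scans match the design described earlier, while the voxel count is an assumption:

```python
# Temporal compression: average the volumes within each block so that
# each block yields one training example.
import numpy as np

n_blocks, scans_per_block, n_voxels = 6, 7, 10
scans = np.random.rand(n_blocks * scans_per_block, n_voxels)  # (42, 10)

# one averaged example per block -> (6, 10)
block_examples = scans.reshape(n_blocks, scans_per_block, n_voxels).mean(axis=1)
```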

  25. Impact of temporal compression and spatial selection on SVM accuracy (multi-subject classifier; unpleasant, neutral, pleasant). Three pipelines, each starting from the whole data with a training/test split:
  (A) No temporal compression: SVD/PCA -> SVM training and test.
  (B) Temporal compression I: average volumes within the blocks -> SVD/PCA -> SVM training and test.
  (C) Temporal compression II: GLM analysis (block-specific estimator) -> SVD/PCA -> SVM training and test.

  26. Can we improve the accuracy by using ROIs?

  27. Fourth approach: space restriction (training with the ROIs selected by the GLM).
  Fifth approach: space restriction (training with the ROIs selected by the SVM).

  28. Impact of temporal compression and spatial selection on SVM accuracy (multi-subject classifier; unpleasant, neutral, pleasant). Each pipeline starts with a training/test split:
  (A) Whole brain: SVD/PCA -> SVM training and test.
  (B) Space selection by the GLM: GLM analysis -> select activated voxels based on the GLM -> SVD/PCA -> SVM training and test.
  (C) Space selection by the SVM: SVM analysis -> select discriminating voxels based on the SVM -> SVD/PCA -> SVM training and test.

  29. Impact of temporal compression and spatial selection on SVM accuracy (single-subject classifier; unpleasant, neutral, pleasant). Each pipeline starts with a training/test split:
  (A) Whole brain: SVD/PCA -> SVM training and test.
  (B) Temporal compression: average volumes within the blocks -> SVD/PCA -> SVM training and test.
  (C) Space selection by the GLM: GLM analysis -> select activated voxels based on the GLM -> SVD/PCA -> SVM training and test.
  (D) Temporal compression + space selection: GLM analysis -> select activated voxels based on the GLM -> average volumes within the blocks -> SVD/PCA -> SVM training and test.

  30. Summary: multi-subject classifier vs. single-subject classifier.

  31. How similar are the results of the GLM and the SVM?

  32. Contrast between viewing unpleasant (red scale) and neutral pictures (blue scale). (A) Standard GLM analysis: SPM{t}, random effects, p < 0.001 (uncorrected). (B) SVM, whole data. (C) SVM, time-compressed data.

  33. Contrast between viewing unpleasant (red scale) and pleasant pictures (blue scale). (A) Standard GLM analysis: SPM{t}, random effects, p < 0.001 (uncorrected). (B) SVM, whole data. (C) SVM, time-compressed data.

  34. Contrast between viewing neutral (red scale) and pleasant pictures (blue scale). (A) Standard GLM analysis: SPM{t}, random effects, p < 0.001 (uncorrected). (B) SVM, whole data. (C) SVM, time-compressed data.

  35. Does the classifier work for ER designs?

  36. Block vs. ER designs. Stimuli: pleasant, unpleasant and neutral pictures (TR = 3 s). In the block design, stimuli are presented over sustained blocks of scans; in the event-related (ER) design, each stimulus is a brief event.

  37. Block design vs. ER (unpleasant, neutral, pleasant; whole data). Each pipeline uses leave-one-out:
  (A) Event-related (ER): average 2 volumes "within" the event -> SVD/PCA -> SVM training.
  (B) Block as ER: average 2 volumes within the block -> SVD/PCA -> SVM training.
  (C) Block: average all volumes within the block -> SVD/PCA -> SVM training.

  38. Summary

  39. Discriminating volume (SVM weight vector): unpleasant (red scale) vs. neutral (blue scale). Block design vs. ER design.

  40. Can we make use of the temporal dimension in decoding?

  41. Spatial observation: a single volume acquired while the subject views unpleasant or pleasant stimuli (vs. fixation).

  42. Spatial SVM: pre-processing -> PCA -> SVM applied to spatial observations. Output: accuracy and discriminating volume (spatial weight vector; unpleasant vs. pleasant shown on axial slices z = -18 to 42).

  43. Spatiotemporal observation: one training example is the whole duty cycle (unpleasant or pleasant stimuli followed by fixation), i.e. the 14 volumes concatenated into a single vector Vi = [v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14].
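Forming a spatiotemporal observation is a concatenation; the voxel count below is an illustrative assumption:

```python
# Concatenate the 14 volumes of one duty cycle into a single
# spatiotemporal training example Vi = [v1 ... v14].
import numpy as np

n_timepoints, n_voxels = 14, 5
duty_cycle = np.random.rand(n_timepoints, n_voxels)  # volumes v1 ... v14
Vi = duty_cycle.ravel()                              # one training example
```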

  44. Spatiotemporal SVM, block design: pre-processing -> PCA -> SVM applied to spatiotemporal observations (4D data including all volumes T1 ... T14 within the duty cycle; training example = whole duty cycle). Output: accuracy and dynamic discriminating volume.

  45. Spatiotemporal weight vector: dynamic discriminating map, time points T1-T7 (unpleasant vs. pleasant; axial slices z = -18 to 42).

  46. Spatiotemporal weight vector: dynamic discriminating map, time points T8-T14 (unpleasant vs. pleasant).

  47. Dynamic discriminating map at T5, slice z = -18 (unpleasant vs. pleasant; panels A-D).

  48. Dynamic discriminating map at T5, slice z = -6 (unpleasant vs. pleasant; panels A-B).

  49. Dynamic discriminating map at T5, slice z = -6 (unpleasant vs. pleasant; panels C-F).

  50. The Brain Image Analysis Unit (BIAU)
