Presentation Transcript


  1. Information Driven Healthcare: Data Visualization & Classification Lecture 1: Introduction & preprocessing Centre for Doctoral Training in Healthcare Innovation Dr. Gari D. Clifford, University Lecturer & Associate Director, Centre for Doctoral Training in Healthcare Innovation, Institute of Biomedical Engineering, University of Oxford

  2. The course • A practical overview of (a subset of) classifiers and visualization tools • Data preparation, PCA, K-means Clustering, KNN • Statistics, regression, LDA, logistic regression • Neural Networks • Gaussian Mixture models, EM • Support Vector Machines • Labs – try to • Classify flowers (classic dataset), … then • Predict mortality in the ICU! (... & publish if you do well!)

  3. Workload • Two lectures each morning • Five 4-hour labs (each afternoon) • Read one article each eve (optional)

  4. Assessment /assignments • Class interaction • Lab diary – write up notes as you perform investigations – submit lab code (m-file) and Word/OO doc answering the questions at 5pm each day … No paper please! • Absolutely no homework! • ... but you can write a paper afterwards if your results are good!

  5. Course texts • Ian Nabney, Netlab: Algorithms for Pattern Recognition, in the series Advances in Pattern Recognition. Springer (2001) ISBN 1-85233-440-1 http://www.ncrg.aston.ac.uk/netlab/book.php • Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer (2006) ISBN 0-387-31073-8 http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm • Press, Teukolsky, Vetterling & Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, Cambridge University Press, 1992. [Ch. 2.6, 10.5 (p414-417), 11.0 (p465-460), 15.4 (p671-688), 15.5 (p681-688), 15.6 & 15.7 (p689-700)] Online at http://www.nrbook.com/a/bookcpdf.php • L. Tarassenko, A Guide to Neural Computation, John Wiley & Sons (February 1998), Ch. 7 (p77-101) • Ian Nabney, Netlab2? - when available!

  6. Syllabus – Week 1 • Monday Data exploration [GDC] • Lecture 1: (9.30-10.30am) Introduction, probabilities, entropy, preprocessing, normalization, segmenting data (PCA, ICA) • Lecture 2: (11am-12pm) Feature extraction, visualization (K-means, SOM, GTM, Neuroscale) • Lab 1 (1-5pm) Preprocessing of data & visualization - segmentation (train, test, evaluation), PCA & K-means with 2 classes • Reading for tomorrow: Bishop PRML, Ch4.1 p179-196, Ch4.3.2 p205-206, Ch2.3.7 p102-103, 691, Netlab Ch3.5-3.6 p101-107 • Tuesday Clinical Statistics & Classifiers [IS] • Lecture 3: (9.30-10.30am) Clinical statistics: t-test, χ2 test, Wilcoxon rank sum test, linear regression, bootstrap, jackknife • Lecture 4: (11am-12pm) Clinical classifiers: LDA, KNN, logistic regression • Lab 2 (1-5pm) P-values, statistical testing, LDA, KNN and logistic regression • Reading for tomorrow: Netlab: Ch5.1-5.6 p165-167, Ch6 p191-221 • Wednesday Optimization and Neural Networks [GDC] • Lecture 5: (9.30-10.30am) ANNs - RBFs and MLPs - choosing an architecture, balancing the data • Lecture 6: (11am-12pm) Training & optimization, N-fold validation • Lab 3 (1-5pm) Training an MLP to classify flower types and then mortality - partitioning and balancing data • Reading for tomorrow: Netlab: Ch3.1-3.4 p79-100 • Thursday Probabilistic Methods [DAC] • Lecture 7: (9.30-10.30am) GMM, MCMC, density estimation • Lecture 8: (11am-12pm) EM, Variational Bayes, missing data • Lab 4 (1-5pm) GMM and EM • Reading for tomorrow: Bishop: Ch7 p325-345 (SVM) • Friday Support Vector Machines [CO/GDC] • Lecture 9: (9.30-10.30am) SVMs and constrained optimization • Lecture 10: (11am-12pm) Wrap-up • Lab 5 (1-5pm) Use SVM toolbox and vary 2 parameters for regression & classification (1 class death and then alive), then 2 class.

  7. Overview of data for lab • You will be given two datasets: • 1. A simple dataset for learning – Fisher’s Iris dataset • 2. A complex ICU database (if this works – publish!!!) • In each lab you will use dataset 1 to understand the problem, then dataset 2 to see how you can apply this to more challenging data

  8. So let’s start … what are we doing? • Trying to learn classes from data so that when we see new data, we can make a good guess about its class membership • (e.g. is this patient in the set of people likely to die, and if so, can we change his/her treatment?) • How do we do this? • Supervised – use labelled data to train an algorithm • Unsupervised – use heuristics or metrics to look for clusters in data (K-means clustering, KNN, SOMs, GMM, …)
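
A minimal MATLAB sketch of the supervised/unsupervised contrast, assuming the Statistics Toolbox (which ships Fisher's Iris data); the function names here are illustrative and are not the course toolboxes:

    % Unsupervised vs supervised learning on Fisher's Iris data (illustrative sketch)
    load fisheriris                    % meas: 150x4 features, species: 150x1 labels
    rng(1);                            % make the k-means initialisation repeatable
    idx = kmeans(meas, 3);             % unsupervised: find 3 clusters from the features alone

    mdl  = fitcknn(meas, species, 'NumNeighbors', 5);   % supervised: train on labelled data
    pred = predict(mdl, meas);         % (in the labs you would predict on held-out data)
    fprintf('Training accuracy: %.2f\n', mean(strcmp(pred, species)));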

  9. Data preprocessing/manipulation • Filter data to remove outliers (reject obvious large/small values) • Zero-mean, unit variance data if parameters are not in same units! • Compress data into lower dimensions to reduce workload or to visualize data relationships • Rotate data, or expand into higher dimensions to improve the separation between classes.
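
As a rough sketch of these steps (assuming X is an N-by-D matrix with one observation per row; the 5-standard-deviation outlier threshold is an arbitrary choice for illustration):

    mu = mean(X);  sd = std(X);
    % Crude outlier rejection: drop rows with any feature more than 5 SDs from its mean
    Z      = abs(X - repmat(mu, size(X,1), 1)) ./ repmat(sd, size(X,1), 1);
    Xclean = X(all(Z < 5, 2), :);
    % Zero-mean, unit-variance data (needed when features are in different units)
    Xnorm  = (Xclean - repmat(mean(Xclean), size(Xclean,1), 1)) ...
              ./ repmat(std(Xclean), size(Xclean,1), 1);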

  10. The curse of dimensionality • Richard Bellman (1953) coined the term The Curse of Dimensionality (or Hughes effect) • It is the problem caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space • Bellman gives the following example: • 100 evenly-spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points; • An equivalent sampling of a 10-D unit hypercube with a lattice spacing of 0.01 between adjacent points would require 10^20 sample points; • Therefore, at this spatial sampling resolution, the 10-dimensional hypercube is a factor of 10^18 ‘larger’ than the unit interval.
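
Bellman's counting argument can be checked with a throwaway sketch:

    spacing = 0.01;
    for d = [1 2 3 10]
        fprintf('d = %2d needs about %g grid points\n', d, (1/spacing)^d);
    end
    % d = 10 gives 1e20 points: a factor of 1e18 more than the ~100 needed in 1-D.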

  11. So what does that mean for us? • Need to think about how much data we have and how many parameters we use. • “Rule of thumb”: need at least 10 training samples of each class per input feature dimension (although this depends on the separability of the data and can be up to 30 for complex problems and as low as 2-5 for simple problems [*]) • So for the Iris dataset – we have 4 measured features on 50 examples of each of the three classes … so we have enough! • For the ICU data we have 1400 patients, 970 survived and 430 died … so taking the minimum of these we could use up to 43 of the 112 features • Generally though you need more data … • Or you compress the data into a smaller number of dimensions • [*] Thomas G. Van Niel, Tim R. McVicar and Bisun Datt, On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification, Remote Sensing of Environment, Volume 98, Issue 4, 30 October 2005, Pages 468-480 doi:10.1016/j.rse.2005.08.011
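
Applying the rule of thumb to the ICU numbers above (a trivial sketch, using the class counts quoted on this slide):

    n_survived = 970;  n_died = 430;
    max_features = floor(min(n_survived, n_died) / 10);    % = 43
    fprintf('Rule of thumb allows up to %d of the 112 features.\n', max_features);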

  12. Principal Component Analysis (PCA) • Standard signal/noise separation method • Compress data into lower dimensions to reduce workload or to visualize data relationships • Rotate data to improve the separation between classes • Also known as the Karhunen-Loève (KL) transform or the Hotelling transform or Singular Value Decomposition (SVD) – although SVD is, strictly, a mathematical method for computing PCA

  13. Principal Component Analysis (PCA) • A form of Blind Source Separation – an observation, X, can be broken down into a mixing matrix, A, and a set of basis functions, Z: X = AZ • Second-order decorrelation = independence • Find a set of orthogonal axes in the data (independence metric = variance) • Project data onto these axes to decorrelate • Independence is forced onto the data through the orthogonality of the axes
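
A minimal sketch of this idea, assuming X is an N-by-D data matrix with one observation per row (variable names are illustrative):

    % PCA via the sample covariance: orthogonal axes of maximal variance
    Xc     = X - repmat(mean(X), size(X,1), 1);      % remove the mean first
    C      = (Xc' * Xc) / (size(Xc,1) - 1);          % sample covariance matrix
    [V, D] = eig(C);                                 % columns of V are the orthogonal axes
    [lambda, order] = sort(diag(D), 'descend');      % order the axes by variance (eigenvalue)
    V      = V(:, order);
    Y      = Xc * V;                                 % projections: the columns of Y are decorrelated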

  14. Two dimensional example • Where are the principal components? • Hint: axes of maximum variation, and orthogonal

  15. Two dimensional example • Gives the best axes onto which to project the data (minimum RMS reconstruction error) • Data becomes ‘sphered’ or whitened / decorrelated
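
A sketch of the whitening step via SVD (again assuming an N-by-D data matrix X; this is one of several equivalent ways to sphere the data):

    Xc        = X - repmat(mean(X), size(X,1), 1);
    [U, S, V] = svd(Xc, 'econ');
    Xwhite    = Xc * V * pinv(S) * sqrt(size(X,1) - 1);  % decorrelated, ~unit-variance columns
    cov(Xwhite)                                          % approximately the identity matrix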

  16. Singular Value Decomposition (SVD) • Decompose the observation X = AZ into… • X = USV^T • S is a diagonal matrix of singular values with elements arranged in descending order of magnitude (the singular spectrum) • The columns of V are the eigenvectors of C = X^T X (the orthogonal subspace … dot(vi,vj) = 0) … they ‘demix’ or rotate the data • U is the matrix of projections of X onto the eigenvectors of C … the ‘source’ estimates
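
The claims on this slide can be verified numerically (a sketch; X is an N-by-D observation matrix with N >= D):

    [U, S, V] = svd(X, 'econ');
    C = X' * X;
    disp(norm(C*V - V*(S.^2)));            % ~0: columns of V are eigenvectors of C, eigenvalues S_ii^2
    disp(norm(V'*V - eye(size(V,2))));     % ~0: the eigenvectors are orthonormal
    disp(norm(X*V - U*S));                 % ~0: U (scaled by S) holds the projections of X onto them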

  17. SVD – matrix algebra • Decompose the observation X = AZ into… X = USV^T

  18. Eigenspectrum of decomposition • S = singular matrix … zeros except on the leading diagonal • The diagonal elements S_ii are the (eigenvalues)^1/2, placed in order of descending magnitude • They correspond to the magnitude of the projected data along each eigenvector • Eigenvectors are the axes of maximal variation in the data • Eigenspectrum = plot of the eigenvalues [stem(diag(S).^2)] • Variance = power (analogous to Fourier components in power spectra)
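
The eigenspectrum plot suggested above, written out in full (a sketch; X is assumed to be a zero-mean data matrix):

    [U, S, V] = svd(X, 'econ');
    lambda = diag(S).^2;                  % eigenvalues = squared singular values
    stem(lambda / sum(lambda));           % fraction of the total variance per component
    xlabel('component'); ylabel('fraction of variance');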

  19. SVD: Method for PCA See BSS notes and example at end of presentation

  20. SVD for noise/signal separation • To perform SVD filtering of a signal, use a truncated SVD decomposition (using the first p eigenvectors): Y = U S_p V^T • [Reduce the dimensionality of the data by setting the noise projections to zero, S_noise = 0, then reconstruct the data with just the signal subspace] • Most of the signal is contained in the first few principal components; discarding the remaining components and projecting back into the original observation space effects a noise-filtering or a noise/signal separation
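
A minimal sketch of the truncated reconstruction (p, the number of retained components, is chosen by the user; X is the observation matrix):

    p = 2;
    [U, S, V] = svd(X, 'econ');
    Sp = S;
    Sp(p+1:end, p+1:end) = 0;     % zero the 'noise' singular values
    Y  = U * Sp * V';             % reconstruct from the first p components only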

  21. e.g. • Imagine a ‘spectral decomposition’ of the matrix: X = l1·u1·v1^T + l2·u2·v2^T + … [figure: the matrix written as a weighted sum of outer products of the u and v vectors]

  22. SVD – Dimensionality reduction • How exactly is dimension reduction performed? • A: Set the smallest singular values to zero [figure: the matrix product with the truncated singular-value matrix]

  23. SVD – Dimensionality reduction • … note the approximation sign ≈ [figure: the same product after truncation]

  24. SVD – Dimensionality reduction • … and the resultant matrix is an approximation using only 3 eigenvectors (hence the ≈)
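
The quality of such a rank-3 approximation can be checked directly (a sketch; no rank-3 matrix gives a smaller least-squares error):

    [U, S, V] = svd(X, 'econ');
    k  = 3;
    Xk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
    relative_error = norm(X - Xk, 'fro') / norm(X, 'fro')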

  25. Real ECG data example • Truncated reconstruction X_p = U S_p V^T of an ECG observation X [figure: the eigenspectrum S^2, the noisy observation, and the reconstructions X_p for p = 2 and p = 4]

  26. Recap - PCA • Second-order decorrelation = independence • Find a set of orthogonal axes in the data (independence metric = variance) • Project data onto these axes to decorrelate • Independence is forced onto the data through the orthogonality of the axes • Conventional noise / signal separation technique • Often used as a method of initializing weights for neural networks and other learning algorithms (see Wed lectures).

  27. Appendix • Worked example (see lecture notes) • http://www.robots.ox.ac.uk/~gari/cdt/IDH/docs/ch14_ICASVDnotes_2009.pdf

  28. Worked example

  29. Worked example

  30. Worked example

  31. Worked example
