
CLASSIFICATION



Presentation Transcript


  1. CLASSIFICATION

  2. Periodic Table of Elements

  3. 1789 Lavoisier, 1869 Mendeleev

  4. Measures of similarity: i) distance, ii) angular (correlation)

  5. Two objects plotted in the two-dimensional variable space (Var 1, Var 2). The difference between the object vectors is defined as the Euclidean distance between the objects, $d_{kl} = \|x_k^T - x_l^T\|$; the angle between the vectors gives the angular (correlation) measure. [Figure: object vectors $x_k^T$ and $x_l^T$ in variable space]

  6. Measuring similarity. Distance: i) Euclidean, ii) Minkowski ("Manhattan", "taxi"), iii) Mahalanobis (correlated variables)

  7. Distance between points p1 and p2 in the (X1, X2) plane. Euclidean: $d = \sqrt{\sum_i (x_{1i} - x_{2i})^2}$. Manhattan: $d = \sum_i |x_{1i} - x_{2i}|$. [Figure: Euclidean and Manhattan paths between p1 and p2]
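As a quick illustration of these metrics, here is a minimal NumPy sketch (the function names and example points are my own, not from the slides):

```python
import numpy as np

def euclidean(p1, p2):
    # Euclidean: square root of the summed squared coordinate differences
    return np.sqrt(np.sum((p1 - p2) ** 2))

def manhattan(p1, p2):
    # Manhattan ("taxi"): sum of absolute coordinate differences
    return np.sum(np.abs(p1 - p2))

def mahalanobis(p1, p2, cov):
    # Mahalanobis: weights the difference by the inverse covariance matrix,
    # so correlated variables are not double-counted
    d = p1 - p2
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

p1, p2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(p1, p2))  # 5.0
print(manhattan(p1, p2))  # 7.0
```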

  8. Classification using distance: the nearest neighbor(s) define the membership of an object. KNN (K nearest neighbors), e.g. K = 1 or K = 3.
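A minimal KNN sketch in the same spirit, assuming Euclidean distance and a majority vote (the toy data and function name are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # Distance from the new object to every training object
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # The k nearest neighbors define the membership by majority vote
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two small classes in a two-variable space (illustrative data)
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9]])
y = np.array(["class 1", "class 1", "class 2", "class 2"])
print(knn_classify(X, y, np.array([0.15, 0.15]), k=1))  # class 1
print(knn_classify(X, y, np.array([0.95, 0.95]), k=3))  # class 2
```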

  9. Classification. X1 and X2 are uncorrelated, cov(X1, X2) = 0, for both subsets (classes) => KNN can be used to measure similarity. [Figure: two uncorrelated classes in the X1-X2 plane]

  10. Classification. Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) provides separation. For class 3 and class 4, PC analysis provides excellent separation on PC2. [Figure: four classes in the X1-X2 plane with PC1 and PC2 axes]

  11. Classification. X1 and X2 are correlated, cov(X1, X2) ≠ 0, for both "classes" (high X1 => high X2). KNN fails, but PC analysis provides the correct classification. [Figure: two correlated classes in the X1-X2 plane]

  12. Classification. Cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances. Drawback: no separation of noise from information. Cure: use scores from major PCs (sketched below).
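A hedged sketch of the suggested cure: replace the raw variables by scores on the first few PCs and measure distances there, so the noise left in the residuals is excluded (the data and the choice of two components are illustrative):

```python
import numpy as np

def pca_scores(X, n_components):
    # Mean-centre, then project onto the leading right singular vectors
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T          # loadings
    return Xc @ P                    # scores on the major PCs

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))        # illustrative: 20 objects, 10 variables
T = pca_scores(X, n_components=2)
# KNN-style distances computed in score space instead of raw-variable space
d_12 = np.linalg.norm(T[0] - T[1])
print(d_12)
```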

  13. VARIABLE CORRELATION AND SIMILARITY BETWEEN OBJECTS

  14. CORRELATION & SIMILARITY. [Figure: objects in the (Var 1, Var 2) variable space]

  15. CORRELATION & SIMILARITY. SUPERVISED COMPARISON (SIMCA). [Figure: separate PC models, PC class 1 and PC class 2, in the (Var 1, Var 2) variable space]

  16. CORRELATION & SIMILARITY. UNSUPERVISED COMPARISON (PCA). [Figure: common PC1 and PC2 axes in the (Var 1, Var 2) variable space]

  17. CORRELATION & SIMILARITY. [Figure: object vector $x_k^T$, its projection $x_c^T$ on the class model, and residual $e_k^T$ in the (Var 1, Var 2) variable space]

  18. CORRELATION & SIMILARITY. Unsupervised: PCA (score plot), fuzzy clustering. Supervised: SIMCA.

  19. CORRELATION & SIMILARITY. Characterisation and correlation of crude oils…, Kvalheim et al. (1985) Anal. Chem. [Figure: sampling map with a 0-30 km scale bar]

  20. CORRELATION & SIMILARITY. [Figure: chromatograms of Sample 1, Sample 2, …, Sample N]

  21. CORRELATION & SIMILARITY. SCORE PLOT. [Figure: t1-t2 (PC1 vs PC2) score plot of samples 1-14, showing groups of similar samples]
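A minimal sketch of producing such a score plot with NumPy and matplotlib (random data stand in for the peak-area matrix):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(14, 6))         # illustrative: 14 samples, 6 variables
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Vt[:2].T                    # scores t1, t2 on PC1 and PC2

fig, ax = plt.subplots()
ax.scatter(T[:, 0], T[:, 1])
for k, (t1, t2) in enumerate(T, start=1):
    ax.annotate(str(k), (t1, t2))    # label each sample as in the slide
ax.set_xlabel("t1 (PC1)")
ax.set_ylabel("t2 (PC2)")
plt.show()
```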

  22. Soft Independent Modelling of Class Analogies (SIMCA)

  23. SIMCA. Data (variance) = Model (covariance pattern; angular correlation) + Residuals (unique variance, noise; distance).

  24. SIMCA. Data matrix $X_{ki}$: objects (rows) by variables 1, 2, …, M (columns). Rows 1…N form the training set (reference set), divided into class 1, class 2, …, class Q; rows N+1…N+N' are unassigned objects, the test set. Terminology: class = group of similar objects; object = sample, individual; variable = feature, characteristic, attribute.

  25. SIMCA. Data matrix $X_{ki}$ of peak areas: each row is a chromatogram (rows 1…N form the training set (reference set), grouped by oil field 1, oil field 2, …, oil field Q), each column one of M peak-area variables. Rows N+1…N+N' are new samples, the test set.

  26. PC MODELS. Zero-component model: $x_{ki} = \bar{x}_i + e_{ki}$, i.e. $x_k' = \bar{x}' + e_k'$. One-component model: $x_{ki} = \bar{x}_i + t_k p_i' + e_{ki}$, i.e. $x_k' = \bar{x}' + t_k p' + e_k'$. [Figure: objects 1, 2, 3 projected onto the class mean and onto the first PC axis $p_1$]

  27. p2 3’ xki = xi + tkp’i + eki x’k = x’ + tk1p’1 + tk2p’2 + e’k 3 X 2’ 2 1’ 1 p1 PC MODELS

  28. PRINCIPAL COMPONENT CLASS MODEL. $X_c = \bar{X}_c + T_c P_c' + E_c$, where $\bar{X}_c + T_c P_c'$ is the information (structure) and $E_c$ is the noise. Indices: k = 1, 2, …, N (object, sample); i = 1, 2, …, M (variable); a = 1, 2, …, A (principal component); c = 1, 2, …, C (class).
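A minimal sketch of fitting this class model with an SVD (illustrative data; a real SIMCA fit would choose A by cross-validation as described on the next slides):

```python
import numpy as np

def fit_class_model(Xq, A):
    # X = mean + T P' + E: the mean and loadings carry the structure,
    # E carries the unique variance (noise)
    mean = Xq.mean(axis=0)
    Xcen = Xq - mean
    U, S, Vt = np.linalg.svd(Xcen, full_matrices=False)
    P = Vt[:A].T                 # loadings, M x A
    T = Xcen @ P                 # scores, N x A
    E = Xcen - T @ P.T           # residuals, N x M
    return mean, T, P, E

rng = np.random.default_rng(2)
Xq = rng.normal(size=(10, 5))    # illustrative class-q training objects
mean, T, P, E = fit_class_model(Xq, A=2)
print(E.std())                   # residual noise level of the class model
```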

  29. PC MODELS. Deletion pattern for objects in the leave-out-one-group-of-elements-at-a-time cross-validation procedure developed by Wold. [Figure: matrix elements numbered into deletion groups along diagonals]

  30. CROSS-VALIDATING PC MODELS. i) Calculate scores and loadings for PC a+1, $t_{a+1}$ and $p_{a+1}'$, excluding the elements in one group. ii) Predict the values of the excluded elements: $\hat{e}_{ki,a+1} = t_{k,a+1} p_{a+1,i}'$. iii) Sum the squared prediction errors over the excluded elements. iv) Repeat i)-iii) for all the other groups of elements. v) Compare the total prediction error with the residual sum of squares from PC a, adjusting for degrees of freedom (a simplified sketch follows below).
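A simplified, hedged sketch of this element-wise cross-validation. The NIPALS variant that skips left-out elements, the deletion groups, and the data are my own illustrative choices, and the final comparison omits the degrees-of-freedom adjustment the slide mentions:

```python
import numpy as np

def nipals_missing(E, present, n_iter=200, tol=1e-10):
    # One NIPALS component, using only elements where present is True
    Ew = np.where(present, E, 0.0)
    t = Ew[:, 0].copy()
    for _ in range(n_iter):
        p = (Ew * t[:, None]).sum(0) / ((present * t[:, None] ** 2).sum(0) + 1e-12)
        p /= np.linalg.norm(p)
        t_new = (Ew * p).sum(1) / ((present * p ** 2).sum(1) + 1e-12)
        if np.linalg.norm(t_new - t) < tol:
            break
        t = t_new
    return t_new, p

def press_next_pc(E, groups):
    # Leave out one group of elements at a time, refit PC a+1,
    # and predict the left-out residual elements as t * p'
    press = 0.0
    for g in groups:                       # g: boolean mask of left-out elements
        t, p = nipals_missing(E, ~g)
        press += np.sum((E[g] - np.outer(t, p)[g]) ** 2)
    return press

rng = np.random.default_rng(3)
E = rng.normal(size=(8, 5))                # residuals after the first a PCs
flat = np.arange(E.size) % 3               # three diagonal-style deletion groups
groups = [(flat == j).reshape(E.shape) for j in range(3)]
print(press_next_pc(E, groups), np.sum(E ** 2))  # keep PC a+1 if PRESS << RSS
```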

  31. [Figure: 1-component PC model with residual-distance limits $S_{max}$ at p = 0.01 and p = 0.05 around the PC 1 axis]

  32. Residual Standard Deviation (RSD). RSD of object k: $s_k = \sqrt{\sum_i e_{ki}^2 / (M - A)}$. Mean RSD of class: $S_0 = \sqrt{\sum_k \sum_i e_{ki}^2 / ((N - A - 1)(M - A))}$. [Figure: class model along PC 1 with $S_0$ and the limit $S_{max}$]

  33. [Figure: class box along PC 1 bounded by $s_{max}$ in the residual direction and by $t_{lower} = t_{min} - \frac{1}{2}s_t$ and $t_{upper} = t_{max} + \frac{1}{2}s_t$ in the score direction]

  34. CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model. ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test).

  35. CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model: for a = 1, 2, …, A this defines the scores $t_{ka} = (x_k' - \bar{x}') p_a$; then calculate the residuals of the object, $e_{ki} = x_{ki} - \bar{x}_i - \sum_a t_{ka} p_{ai}'$. ii) Compare the residual distance of the object with the average residual distance of the class objects (F-test): if $F \le F_{critical}$ => k ∈ class q; if $F > F_{critical}$ => k ∉ class q (see the sketch below).
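A hedged sketch of this classification step. Degrees-of-freedom conventions differ between SIMCA implementations; the ones below follow the RSD formulas of slide 32, and the data are illustrative:

```python
import numpy as np
from scipy import stats

def classify(x_new, mean, P, S0, dof_class, alpha=0.05):
    M, A = P.shape
    t = (x_new - mean) @ P               # i) fit: scores on the class model
    e = (x_new - mean) - t @ P.T         # residuals after the A-component fit
    s_k2 = np.sum(e ** 2) / (M - A)      # residual variance of the object
    F = s_k2 / S0 ** 2                   # ii) compare with class residual variance
    F_crit = stats.f.ppf(1 - alpha, M - A, dof_class)
    return F <= F_crit                   # True => k is assigned to class q

# Build a class model as on the earlier slides (illustrative data)
rng = np.random.default_rng(4)
Xq = rng.normal(size=(10, 5))
N, M, A = Xq.shape[0], Xq.shape[1], 2
mean = Xq.mean(0)
U, S, Vt = np.linalg.svd(Xq - mean, full_matrices=False)
P = Vt[:A].T
E = (Xq - mean) - (Xq - mean) @ P @ P.T
S0 = np.sqrt(np.sum(E ** 2) / ((N - A - 1) * (M - A)))
print(classify(rng.normal(size=M), mean, P, S0, dof_class=(N - A - 1) * (M - A)))
```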

  36. Objects outside the model. [Figure: the class box of slide 33, with objects falling outside the score limits or above $s_{max}$]

  37. Detection of atypical objects. Object k: $s_k > RSD_{max}$ => k is outside the class. Object l: $t_l$ is outside the "normal area" $\{t_{min} - \frac{1}{2}s_t,\ t_{max} + \frac{1}{2}s_t\}$ => calculate the residual distance to the extreme point; if $s_l > RSD_{max}$, l is outside the class. [Figure: class box along PC 1 with objects k and l outside it]

  38. Detection of outliers: 1. Score plots. 2. Dixon tests on each latent variable. 3. Normal plots of scores for each latent variable. 4. Test of residuals, F-test (class model).

  39. MODELLING POWER AND DISCRIMINATION POWER

  40. MODELLING POWER: the variable's contribution to the class model q (intra-class variation). $MP_i^q = 1 - s_{i,A}^q / s_{i,0}^q$. $MP_i = 1.0$ => variable i is completely explained by the class model; $MP_i = 0.0$ => variable i does NOT contribute to the class model (see the sketch below).
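A minimal sketch of the MP formula per variable (the degrees-of-freedom choices are one common convention, not necessarily the slide's):

```python
import numpy as np

def modelling_power(Xq, mean, P):
    # MP_i = 1 - s_i,A / s_i,0: residual SD of variable i after the
    # A-component model versus its raw SD about the class mean
    Xcen = Xq - mean
    A = P.shape[1]
    E = Xcen - Xcen @ P @ P.T
    s_iA = E.std(axis=0, ddof=A + 1)
    s_i0 = Xcen.std(axis=0, ddof=1)
    return 1 - s_iA / s_i0

rng = np.random.default_rng(5)
Xq = rng.normal(size=(10, 5))
mean = Xq.mean(0)
U, S, Vt = np.linalg.svd(Xq - mean, full_matrices=False)
print(modelling_power(Xq, mean, Vt[:2].T))  # near 1 => variable well modelled
```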

  41. DISCRIMINATION POWER: the variable's ability to separate two class models (inter-class variation). $DP_i^{r,q} = 1.0$ => no discrimination power; $DP_i^{r,q} > 3{-}4$ => "good" discrimination power (see the sketch below).
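The slide does not spell out the DP formula; the sketch below uses the usual SIMCA form, the ratio of cross-fitted to self-fitted residual variances per variable, which should be treated as an assumption:

```python
import numpy as np

def residual_sd(X, mean, P):
    # Residual SD per variable when objects X are fitted to a class model
    Xcen = X - mean
    E = Xcen - Xcen @ P @ P.T
    return np.sqrt((E ** 2).mean(axis=0))

def discrimination_power(Xq, model_q, Xr, model_r):
    # DP_i ~ 1: no discrimination; DP_i > 3-4: "good" discrimination
    s_qr = residual_sd(Xq, *model_r)     # class q fitted to model r
    s_rq = residual_sd(Xr, *model_q)     # class r fitted to model q
    s_qq = residual_sd(Xq, *model_q)
    s_rr = residual_sd(Xr, *model_r)
    return np.sqrt((s_qr ** 2 + s_rq ** 2) / (s_qq ** 2 + s_rr ** 2))

def fit(X, A=1):
    mean = X.mean(0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:A].T

rng = np.random.default_rng(6)
Xq = rng.normal(size=(10, 4))
Xr = rng.normal(size=(10, 4)) + np.array([0, 0, 5, 0])  # classes differ in variable 3
print(discrimination_power(Xq, fit(Xq), Xr, fit(Xr)))    # variable 3 scores high
```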

  42. SEPARATION BETWEEN CLASSES. Each object is fitted both to its own class model and to the other one, giving residual distances such as $s_k(q)$, $s_k(r)$ for object k and $s_l(q)$, $s_l(r)$ for object l. The worst ratio and the class distance compare cross-fitted with self-fitted residual distances; a large class distance => "good separation". [Figure: objects k and l between class q and class r with their four residual distances]

  43. POLISHED CLASSES: 1) Remove "outliers". 2) Remove variables with both low MP (< 0.3-0.4) and low DP (< 2-3).

  44. How does SIMCA differ from other multivariate methods? i) It models systematic intra-class variation (angular correlation). ii) Assuming a normally distributed population, the residuals can be used to decide class belonging (F-test). iii) "Closed" models. iv) It considers correlation, important for large data sets. v) SIMCA separates noise from systematic (predictive) variation in each class.

  45. Separating-surface methods: Linear Discriminant Analysis (LDA). Open issues for this approach: new classes? outliers? the asymmetric case? looking for dissimilarities.

  46. MISSING DATA. [Figure: objects with unknown values ("?") in the (x1, x2) plane, fitted against class models f1(x1, x2) and f2(x1, x2)]

  47. WHEN DOES SIMCA WORK? 1. Similarity between objects in the same class, i.e. homogeneous data. 2. Some variables relevant for the problem in question (MP, DP). 3. At least 5 objects and 3 variables.

  48. ALGORITHM FOR SIMCA MODELLING. 1) Read raw data. 2) Pretreatment of data (square root, normalise, and more). 3) Select subset/class. 4) Variable weighting, standardise. 5) Cross-validated PC model. 6) Outliers? If yes, remove and remodel. 7) Eliminate variables with low modelling and discrimination power => "polished" subsets; remodel if needed. 8) Fit new objects. 9) More classes? If yes, repeat from 3). 10) Evaluation of subsets. (A compact sketch follows below.)
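A compact, hedged sketch of the loop above for one pass over the classes (the pretreatment choices and the fixed number of components stand in for the interactive decisions in the flow chart):

```python
import numpy as np

def simca_fit_classes(classes, A=2):
    # One PC model per class: pretreat, standardise, fit.  A "polished"
    # run would also drop outliers and low-MP/DP variables, then remodel.
    models = {}
    for name, X in classes.items():
        X = np.sqrt(np.abs(X))                   # illustrative pretreatment (square root)
        X = (X - X.mean(0)) / X.std(0, ddof=1)   # standardise each variable
        mean = X.mean(0)                         # zero after standardising; kept for clarity
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        models[name] = (mean, Vt[:A].T)          # class mean + loadings
    return models

rng = np.random.default_rng(7)
classes = {"oil field 1": rng.normal(size=(8, 5)),
           "oil field 2": rng.normal(size=(8, 5)) + 2.0}
print(list(simca_fit_classes(classes)))
```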
