BINF 733 Spring 2005 Statistical Methods of Outlier Detection




    1. BINF 733 Spring 2005 Statistical Methods of Outlier Detection Jeff Solka Ph.D. Jennifer Weller Ph.D.

    2. Sir Francis Bacon – Novum Organum 1620 For he that knows the ways of nature will more easily observe her deviations; and on the other hand he that knows her deviations will more accurately describe her ways.

    3. Sir Francis Bacon Revisited To identify outliers we need some sort of model to start with. We can do a better job at identifying our model if we first remove the outliers. The process of outlier identification/model building is an iterative process.

    4. What is an Outlier? Given a set of observations X, an outlier is an observation that is an element of this set but is inconsistent with the majority of the data.

    5. Manifestation of Outliers in Gene Expression Data Given a set of replicate arrays, the replicates can be used to identify an aberrant spot. $X_{gi}$ = transformed and normalized spot intensity measurement for the gth gene on the ith array. An outlier is an observation $X_{gi}$ that is markedly different from its fellow observations.

    6. Nonresistant Rules for Outlier Identification

    7. The z-score Rule (Grubbs' Test) Calculate a z-score $z_{gi}$ for every observation: $z_{gi} = (X_{gi} - \bar{X}_g) / s_g$, where $\bar{X}_g$ and $s_g$ are the mean and standard deviation of the gth gene. Call $X_{gi}$ an outlier if $|z_{gi}|$ is large, say greater than five.
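
The z-score rule for a single gene can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the replicate values are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed because the slide does not say which form is used.

```python
from statistics import mean, stdev

def z_scores(values):
    """Classical z-scores for one gene's replicate measurements."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation; zero if all values are equal
    return [(x - centre) / spread for x in values]

def z_score_outliers(values, cutoff=5.0):
    """Indices of observations whose |z| exceeds the cutoff (five on the slide)."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]

# Hypothetical data: 39 well-behaved replicates and one aberrant spot.
replicates = [10.0] * 39 + [20.0]
print(z_score_outliers(replicates))
```

A gene whose replicates are all identical has zero standard deviation, so a real implementation would need to guard against the division by zero.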

    8. The CV Rule Call the observation $X_{gi}$ furthest from the mean, $\bar{X}_g$, an outlier if the coefficient of variation $CV_g$ exceeds some prespecified cutoff.
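
A minimal sketch of the CV rule follows, assuming the usual definition of the coefficient of variation (standard deviation divided by the mean) and a nonzero mean. The function name, the cutoff of 0.1, and the data are hypothetical; the slide does not give a cutoff.

```python
from statistics import mean, stdev

def cv_rule_outlier(values, cv_cutoff):
    """Return the index of the observation furthest from the mean if the
    coefficient of variation exceeds cv_cutoff, otherwise None."""
    centre = mean(values)
    cv = stdev(values) / abs(centre)  # assumes the mean is not zero
    if cv <= cv_cutoff:
        return None
    return max(range(len(values)), key=lambda i: abs(values[i] - centre))

# Hypothetical replicates for one gene; the last value is aberrant.
print(cv_rule_outlier([10.1, 9.9, 10.0, 15.0], cv_cutoff=0.1))
```

On log-transformed intensities the mean can be close to zero or negative, in which case the CV is unstable and this sketch would not be appropriate as written.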

    9. Problems With the z-score and CV Methods of Outlier Detection Both are based on measures that are heavily influenced by outliers: the mean and the standard deviation. Masking: an outlier remains undetected because it is hidden by its own influence on the method's parameters, or by another adjacent outlier. Swamping: a normal observation is classified as an outlier due to the presence of an unrelated outlier or outliers.
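
Masking by an outlier's own influence can be checked numerically. With n observations and the sample standard deviation, no |z| can exceed (n - 1)/sqrt(n), so with five replicates a cutoff of five is never reached however extreme the aberrant value is. The sketch below uses hypothetical data and a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def max_abs_z(values):
    """Largest absolute classical z-score in the sample."""
    centre, spread = mean(values), stdev(values)
    return max(abs(x - centre) / spread for x in values)

n = 5
bound = (n - 1) / sqrt(n)  # upper limit on |z| for a sample of size n

# Make the aberrant value more and more extreme; the z-score barely moves.
for aberrant in (20.0, 1e3, 1e6):
    sample = [10.0, 10.1, 9.9, 10.0, aberrant]
    print(aberrant, round(max_abs_z(sample), 4), "bound:", round(bound, 4))
```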

    10. Resistant Rules for Outlier Detection

    11. One Approach to Crafting Resistant Rules for Outlier Detection Base the rules on outlier-resistant statistical measures: the median, and the median absolute deviation from the median (MAD).

    12. The Resistant z-score Rule Calculate a resistant z-score $z^*_{gi}$ for every observation using $z^*_{gi} = (X_{gi} - m_g) / \mathrm{MAD}_g$, where $m_g$ and $\mathrm{MAD}_g$ are the median and MAD of the gth gene. Call $X_{gi}$ an outlier if $|z^*_{gi}|$ is large, say greater than five.
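
The resistant rule can be sketched the same way as the classical one, replacing the mean with the median and the standard deviation with the MAD. The MAD here is left unscaled, which is an assumption (some texts multiply it by a consistency constant); function names and data are hypothetical.

```python
from statistics import median

def resistant_z_scores(values):
    """Resistant z-scores: (x - median) / MAD, with an unscaled MAD."""
    m = median(values)
    mad = median([abs(x - m) for x in values])  # zero if over half the values tie
    return [(x - m) / mad for x in values]

def resistant_outliers(values, cutoff=5.0):
    """Indices of observations whose resistant |z*| exceeds the cutoff."""
    return [i for i, z in enumerate(resistant_z_scores(values)) if abs(z) > cutoff]

# Hypothetical replicates: with six values the classical |z| cannot reach five,
# but the resistant score of the aberrant value is far above it.
replicates = [10.0, 10.1, 9.9, 10.2, 9.8, 20.0]
print(resistant_outliers(replicates))
```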

    13. Problem of Too Few Replicates Microarray experiments usually have little replication. With so few replicates, the median and MAD are not dependable estimates of the location and scale of the data.

    14. A Strategy for the Problem of Too Few Replicates - I With microarray data there is a relationship between the median and MAD across all of the genes. Assume this relationship is a true relationship, s2g = f(mg). Use this to compute a smoothed version of MAD that will be more stable, as it “borrows strength” from similarly expressing genes.

    15. A Strategy for the Problem of Too Few Replicates - II

    16. A Strategy for the Problem of Too Few Replicates - III The revised z-score rule: call $X_{gi}$ an outlier if the computed score is large, say greater than five.
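
The idea of borrowing strength across genes can be illustrated as follows. The transcript does not say how the function f is estimated, so this sketch swaps in a simple running median of the per-gene MADs over genes ordered by their medians; that smoother, the window size, the function names, and the data are all assumptions made for illustration only.

```python
from statistics import median

def gene_median_and_mad(values):
    """Median and unscaled MAD of one gene's replicates."""
    m = median(values)
    return m, median([abs(x - m) for x in values])

def smoothed_mads(genes, half_window=1):
    """One smoothed MAD per gene: running median of the MADs of genes
    with neighbouring medians (an assumed stand-in for f)."""
    stats = [gene_median_and_mad(v) for v in genes]
    order = sorted(range(len(genes)), key=lambda g: stats[g][0])
    smoothed = [0.0] * len(genes)
    for rank, g in enumerate(order):
        lo = max(0, rank - half_window)
        hi = min(len(order), rank + half_window + 1)
        smoothed[g] = median([stats[order[k]][1] for k in range(lo, hi)])
    return smoothed

def revised_z_scores(genes, half_window=1):
    """(x - gene median) / smoothed MAD for every observation."""
    smoothed = smoothed_mads(genes, half_window)
    return [[(x - median(v)) / s for x in v] for v, s in zip(genes, smoothed)]

# Hypothetical data: gene 2 has two tied replicates, so its own MAD is zero.
genes = [
    [5.0, 5.1, 5.2],
    [6.0, 6.2, 6.1],
    [7.0, 7.0, 12.0],
    [8.0, 8.2, 8.1],
    [9.0, 9.1, 9.2],
]
for g, scores in enumerate(revised_z_scores(genes)):
    print(g, [i for i, z in enumerate(scores) if abs(z) > 5.0])
```

Gene 2 shows why smoothing helps: its own MAD is zero, so the unsmoothed resistant score is undefined, while the smoothed MAD taken from its neighbours is positive and the aberrant replicate can be scored.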

    17. Mahalanobis’ Distance for Outlier Detection

    18. Advantages of the Mahalanobis’ Distance Approach Mahalanobis' distance identifies observations which lie far away from the centre of the data cloud, giving less weight to variables with large variances or to groups of highly correlated variables (Jolliffe, 1986). This distance is often preferred to the Euclidean distance, which ignores the covariance structure and thus treats all variables equally.
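
The contrast with Euclidean distance can be sketched for two variables, where the 2 x 2 covariance matrix can be inverted by hand. The squared Mahalanobis distance is (x - mean)' S^{-1} (x - mean), with S the sample covariance matrix. Function names and data are hypothetical; the last point lies off the main trend without being extreme in either coordinate.

```python
from math import hypot, sqrt
from statistics import mean

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be nonzero for the inverse to exist
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the explicit inverse of [[sxx, sxy], [sxy, syy]].
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        out.append(sqrt(d2))
    return out

def euclidean_2d(points):
    """Euclidean distance of each 2-D point from the sample mean."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    return [hypot(x - mx, y - my) for x, y in points]

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1),
          (6.0, 5.9), (7.0, 7.2), (8.0, 7.8), (4.0, 6.5)]

def furthest(distances):
    return max(range(len(distances)), key=distances.__getitem__)

print("Euclidean furthest:", furthest(euclidean_2d(points)))
print("Mahalanobis furthest:", furthest(mahalanobis_2d(points)))
```

In this data the Euclidean distance singles out an end point of the trend, while the Mahalanobis distance singles out the off-trend point, because it accounts for the strong correlation between the two variables.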

    19. A Circle Becomes an Ellipse Based on the Mahalanobis’ Distance

    20. A Test Statistic for the Mahalanobis’ Distance

    21. Principal Components Huber (1985) cites two main reasons why principal components are interesting projections: first, in the case of clustered data, the leading principal axes pick projections with good separations; second, the leading principal components collect the systematic structure of the data. Thus, the first principal component reflects the first major linear trend, the second principal component the second major linear trend, and so on. So, if an observation is located far away from any of the major linear trends, it can be considered an outlier.
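
For two variables, the distance of each observation from the first major linear trend can be computed without a linear algebra library, since the direction of the first principal axis of a 2 x 2 covariance matrix has a closed form. The sketch below is an illustration under that two-variable assumption; the function name and data are hypothetical.

```python
from math import atan2, cos, sin
from statistics import mean

def pc1_residuals(points):
    """Orthogonal distance of each 2-D point from the first principal axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Angle of the leading eigenvector of the covariance matrix.
    theta = 0.5 * atan2(2.0 * sxy, sxx - syy)
    # Project each centred point onto the direction perpendicular to that axis.
    return [abs(-sin(theta) * (x - mx) + cos(theta) * (y - my)) for x, y in points]

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1),
          (6.0, 5.9), (7.0, 7.2), (8.0, 7.8), (4.0, 6.5)]
residuals = pc1_residuals(points)
print(max(range(len(points)), key=residuals.__getitem__))
```

With more than two variables the same idea applies, but the principal axes would come from an eigendecomposition of the full covariance matrix.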

    22. Clustering and Outlier Detection Cluster analysis can be used for outlier detection. Outliers may emerge as singletons or as small clusters far removed from the others. To do outlier detection at the same time as clustering the main body of the data, use enough clusters to represent both the main body of the data and the outliers.
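
One way to see outliers emerge as singletons is to link points that lie within a fixed distance of each other and then report the members of very small clusters. The slide does not name a clustering method, so this sketch swaps in single-linkage grouping with a distance cut; the threshold, the size limit, the function names, and the data are assumptions.

```python
from math import dist

def single_linkage_clusters(points, threshold):
    """Group points so that any two points closer than the threshold
    end up in the same cluster (single linkage cut at a fixed distance)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def small_cluster_members(points, threshold, max_size=1):
    """Indices of points that fall in clusters of at most max_size members."""
    return sorted(i for c in single_linkage_clusters(points, threshold)
                  if len(c) <= max_size for i in c)

# Hypothetical data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.6),
          (5.0, 5.0), (5.4, 5.1), (5.2, 4.7),
          (10.0, 0.0)]
print(small_cluster_members(points, threshold=1.0))
```

Raising max_size also flags small groups of outliers, which matches the remark that outliers may appear as small clusters far removed from the others.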

    23. Fisher Iris Data 150 cases, 5 variables: sepal length, sepal width, petal length, petal width, species (3 types).

    24. Iris data

    25. Line Example

    26. Data Image of the Interpoint Distance Matrix of the Line Example

    27. Body Weight Brain Weight Data

    28. Stackloss Dataset

    29. Data Image for the Mahalanobis Distance

    30. Data Image for the Mahalanobis Distance Where the Covariance in the Mahalanobis Distance Calculation is Constructed Using Observations 4 - 21

    31. An Artificial Dataset from Rousseeuw and Leroy [1987]

    32. A Particularly Onerous Elliptical Dataset

    33. Euclidean and Mahalanobis Data Images of the Ellipse Data

    34. Pairs Plot and Data Image for 5 Dimensional Sphere Case

    35. Artificial Nose Dataset Fiber optic artificial olfactory system: 19 fibers × 2 wavelengths × 60 times/inhalation = 2280. Each data point resides in $\mathbb{R}^{2280}$.

    36. Artificial Nose Data Image of TCE Present

    37. References
        Afifi, A.A. and Azen, S.P. (1972), Statistical Analysis: A Computer Oriented Approach, Academic Press, New York.
        Barnett, V. and Lewis, T. (1994), Outliers in Statistical Data, Wiley, New York.
        Huber, P.J. (1985), Projection pursuit, The Annals of Statistics, 13(2), 435-475.
        Jolliffe, I.T. (1986), Principal Component Analysis, Springer-Verlag, New York.
        Marchette, D.J. and Solka, J.L. (2003), Using data images for outlier detection, Computational Statistics & Data Analysis, 43(4), 541-552.
        Rousseeuw, P.J. and Leroy, A.M. (2003), Robust Regression and Outlier Detection, Wiley Series in Probability and Statistics, Wiley-Interscience.
