Learn how to deal with outliers in clustering using a mean-shift approach and medoid-shift noise removal. Experiment with noisy data and explore outlier detection methods for improved clustering results.
Mean-shift outlier detection. Jiawei Yang, Susanto Rahardja, Pasi Fränti. 18.11.2018
How to deal with outliers in clustering
Approach 1: Cluster
Approach 2: Outlier removal, Cluster
Approach 3: Outlier removal, Cluster
Mean-shift approach: Mean-shift, Cluster
K-nearest neighbors (k=4). Figure labels: point, k-neighborhood.
Expected result. Figure labels: point, mean of neighbors.
Result with noise point. Figure labels: noise point, k-neighborhood, mean of neighbors.
Part I: Noise removal Fränti and Yang, "Medoid-shift noise removal to improve clustering", Int. Conf. Artificial Intelligence and Soft Computing (CAISC), June 2018.
Mean-shift process: move points to the means of their neighbors.
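One step of this process can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, distances are Euclidean, the neighbor search is brute force, and a point is assumed not to count as its own neighbor.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row of X (brute force)."""
    # Pairwise Euclidean distances, shape (n, n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # assumption: a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def mean_shift_step(X, k):
    """Move every point to the mean of its k nearest neighbors."""
    idx = knn_indices(X, k)
    return X[idx].mean(axis=1)  # X[idx] has shape (n, k, dims)

# Four clustered points and one noise point (made-up data)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]], dtype=float)
Y = mean_shift_step(X, k=3)
shift = np.linalg.norm(Y - X, axis=1)  # how far each point moved
```

In this toy data the noise point is pulled into the cluster, so it moves much farther than the clustered points do.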
Mean or medoid? Mean-shift vs. medoid-shift
Medoid-shift algorithm
• REPEAT 3 TIMES
• Calculate kNN(x)
• Calculate medoid M of the neighbors
• Replace point x by the medoid M
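The loop above can be sketched as follows. This is an illustrative reading of the steps, not the published implementation: the function name is hypothetical, the medoid is taken as the neighbor with the smallest total distance to the other neighbors, the neighbors are recomputed on the shifted points in every round, and a point is assumed not to be its own neighbor.

```python
import numpy as np

def medoid_shift(X, k, iterations=3):
    """Replace each point by the medoid of its k nearest neighbors, repeatedly."""
    X = np.asarray(X, dtype=float).copy()
    for _ in range(iterations):
        # Pairwise Euclidean distances between the current points
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        masked = d.copy()
        np.fill_diagonal(masked, np.inf)  # assumption: exclude the point itself
        idx = np.argsort(masked, axis=1)[:, :k]
        new_X = np.empty_like(X)
        for i, nbrs in enumerate(idx):
            # Medoid: neighbor with the smallest summed distance to the other neighbors
            within = d[np.ix_(nbrs, nbrs)].sum(axis=1)
            new_X[i] = X[nbrs[np.argmin(within)]]
        X = new_X
    return X

# Four clustered points and one noise point (made-up data)
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]], dtype=float)
cleaned = medoid_shift(data, k=3)
```

Unlike the mean, the medoid is always one of the existing points, so every row of `cleaned` is a point from `data`.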
Effect on clustering result. Noisy original: CI=4. Iteration 1: CI=4. Iteration 2: CI=3. Iteration 3: CI=0.
Datasets: P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets", Applied Intelligence, 2018. Figure panels are labeled with the number of clusters (k = 15, 20, 35, 50) and the data size (N = 5000).
Noise types: random noise and data-dependent noise, each shown on datasets S1 and S3.
Results for noise type 1: Centroid Index (CI) reduces from 5.1 to 1.4. KM: K-means with random initialization. RS: random swap (P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 5:13, 1-29, 2018).
Results for noise type 2: Centroid Index (CI) reduces from 5.0 to 1.1.
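The results above are reported as a Centroid Index (CI), which the slides do not define. The sketch below assumes the definition commonly used for CI: map each centroid to its nearest centroid in the other solution, count the centroids that receive no mapping (orphans), and take the maximum over both directions. The function names and the toy centroids are hypothetical.

```python
import numpy as np

def orphan_count(src, dst):
    """Number of centroids in dst that are nearest to no centroid in src."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=2)
    nearest = d.argmin(axis=1)  # nearest dst centroid for each src centroid
    return len(dst) - len(np.unique(nearest))

def centroid_index(A, B):
    """Symmetric Centroid Index: 0 means the cluster-level structure matches."""
    return max(orphan_count(A, B), orphan_count(B, A))

# Toy example: B places two centroids near one true cluster and misses another
truth = np.array([[0, 0], [10, 0], [20, 0]], dtype=float)
found = np.array([[0, 1], [1, 0], [20, 1]], dtype=float)
ci = centroid_index(truth, found)
```

Under this definition, CI counts how many clusters are allocated wrongly, which is why CI=0 is the target value.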
Part II: Outlier detection Yang, Rahardja, Fränti, "Mean-shift outlier detection", Int. Conf. Fuzzy Systems and Data Mining (FSDM), November 2018.
Outlier scores D=|x-y| (k=4). Example scores shown in the figure: 203 and 12.
Mean-shift for Outlier Detection
Algorithm: MOD
• Calculate mean-shift y for every point x
• Outlier score Di = |yi - xi|
• SD = standard deviation of all Di
• IF Di > SD THEN xi is OUTLIER
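The MOD steps above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name and the test data are hypothetical, the mean-shift is repeated three times with neighbors recomputed on the shifted points, a point is assumed not to be its own neighbor, and SD is taken as the population standard deviation.

```python
import numpy as np

def mod_outliers(X, k, iterations=3):
    """Mean-shift outlier detection: score = distance between x and its shifted copy y."""
    X = np.asarray(X, dtype=float)
    Y = X.copy()
    for _ in range(iterations):
        # kNN on the current (shifted) points; brute-force Euclidean distances
        d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)  # assumption: exclude the point itself
        idx = np.argsort(d, axis=1)[:, :k]
        Y = Y[idx].mean(axis=1)      # move each point to the mean of its neighbors
    scores = np.linalg.norm(Y - X, axis=1)  # Di = |yi - xi|
    sd = scores.std()                       # SD of all Di (population form)
    return scores, sd, scores > sd          # IF Di > SD THEN xi is an outlier

# A 3x3 grid of clustered points plus one far-away point (made-up data)
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
points = np.vstack([grid, [[20.0, 20.0]]])
scores, sd, is_outlier = mod_outliers(points, k=3)
```

Here only the far-away point is flagged, because its score is far larger than the standard deviation of all scores while the grid points barely move.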
Result after three iterations: D=|x-y3|. Figure labels: x, y1, y2, y3.
Outlier scores D=|x-y|: 69, 203, 65, 36, 13, 64, 26, 44, 28, 14, 40, 30, 120, 187, 12, 4, 46, 27, 16, 58, 118, 83, 95. SD=52
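As a check on the numbers above, the threshold can be recomputed from the listed scores. This is a small worked example, assuming the population standard deviation (NumPy's default, `ddof=0`); the mapping from scores to individual points is not recoverable from the slide, so only the values themselves are used.

```python
import numpy as np

# Scores as listed on the slide
scores = np.array([69, 203, 65, 36, 13, 64, 26, 44, 28, 14, 40, 30,
                   120, 187, 12, 4, 46, 27, 16, 58, 118, 83, 95])

sd = scores.std()                  # population standard deviation (ddof=0)
flagged = scores[scores > sd]      # MOD rule: Di > SD marks an outlier

print(round(float(sd)))            # threshold, rounded as on the slide
print(sorted(flagged.tolist()))    # scores above the threshold
```

With these 23 values the population standard deviation is about 52.2, which matches the slide's SD=52, and 10 of the scores exceed it.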
Distribution of the scores. Figure labels: outliers, normal points.
Results for noise type 1. Panels: a priori noise level (7%); automatically estimated (SD).
Results for noise type 2. Panels: a priori noise level (7%); automatically estimated (SD).
Conclusions
• Why use it: Simple but effective!
• How it performs: Outperforms existing methods!
• Usefulness: No threshold parameter needed!
98-99%