
RRN a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Knorr-Ng Qualitative Outlier Method: User chooses k and d (d is a qualitative parameter), e.g., d=2, k=3. Find those points p such that the d-disk at p, Dd(p), contains < k other points. d=0, 2-disk-count=1. d=0, 2-disk-count=2.

mingan




Presentation Transcript


  1. Knorr-Ng Qualitative Outlier Method: User chooses k and d (d is a qualitative parameter), e.g., d=2, k=3. Find those points p such that the d-disk at p, Dd(p), contains < k other points.

     Nested loop (brute force) method: for each p, find the 3 nearest nbrs. (Hamming distance, d(x,y) = # of mismatches on a1 thru a9.)

     p = t0 = 1 0 1 0 0 0 1 1 0
              1 0 1 0 0 0 1 1 0   d=0, 2-disk-count=1
              1 0 1 0 0 0 1 1 0   d=0, 2-disk-count=2
              1 0 1 0 0 0 1 1 0   d=0, 2-disk-count=3, so t0 is not a 3-2-outlier

     RRN a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
       0  1  0  1  0  0  0  1  1  0   1   0   1   1   0   1
       1  1  0  1  0  0  0  1  1  0   1   0   1   0   0   1
       2  1  0  1  0  0  0  1  1  0   1   0   1   0   1   0
       3  1  0  1  0  0  0  1  1  0   1   1   0   1   0   1
       4  0  1  1  0  1  1  0  0  0   1   1   0   1   0   0
       5  0  1  1  0  1  1  0  0  0   1   0   0   1   1   0
       6  0  1  0  0  1  0  0  0  1   1   1   0   1   0   0
       7  0  1  0  0  1  0  0  0  1   1   0   1   1   0   1
       8  0  1  0  0  1  0  0  0  1   1   0   1   0   0   1
       9  0  1  0  0  1  0  0  0  1   1   0   1   0   1   0
      10  0  1  0  1  0  0  1  1  0   0   1   0   1   0   0
      11  0  1  0  1  0  0  1  1  0   0   0   1   0   0   1
      12  0  1  0  1  0  0  1  1  0   0   0   1   0   1   0
      13  0  1  0  1  0  0  1  1  0   0   0   0   1   1   0
      14  1  0  1  0  1  0  0  0  1   0   1   0   1   0   0
      15  0  0  1  1  0  0  1  1  0   0   0   1   1   0   1

  2. Knorr-Ng Qualitative Outlier Method: User chooses k (e.g., =3) and d (e.g., =2). Find those points p such that the d-disk at p, Dd(p), contains < k other points.

     Nested loop (brute force) method: for each p, find the 3 nearest nbrs. (Hamming distance, d(x,y) = # of mismatches on a1 thru a9.)

     p = t1 = 1 0 1 0 0 0 1 1 0 = t2 = t3 = t0, so they are not 3-2-outliers either.

     RRN table: the same 16 rows (a1 thru a15) as on slide 1.
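The nested-loop scan on slides 1 and 2 can be sketched as follows. This is a minimal illustration, not the presenter's code: `ROWS`, `hamming`, `disk_count`, and `knorr_ng_outliers` are hypothetical names, and `ROWS` holds only a1 thru a9 of the slide table as bit strings. Unlike the walk-through on the slide, which stops as soon as the count reaches k, this sketch counts the whole disk.

```python
# Rows 0..15 of the slide table, restricted to a1..a9 (the attributes used for distance).
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])


def hamming(x, y):
    """Number of positions where the two bit strings differ."""
    return sum(a != b for a, b in zip(x, y))


def disk_count(rows, p, d):
    """How many OTHER rows lie within Hamming distance d of row p."""
    return sum(1 for q in range(len(rows))
               if q != p and hamming(rows[p], rows[q]) <= d)


def knorr_ng_outliers(rows, k, d):
    """Rows whose d-disk holds fewer than k other rows (nested-loop scan)."""
    return [p for p in range(len(rows)) if disk_count(rows, p, d) < k]


if __name__ == "__main__":
    print(disk_count(ROWS, 0, 2))             # other rows within distance 2 of t0
    print(knorr_ng_outliers(ROWS, k=3, d=2))  # the 3-2-outliers
```

On this table the 2-disk of t0 holds 4 other rows, so t0 is not a 3-2-outlier, which agrees with the slide; under the same parameters the sketch flags rows 4, 5, and 14.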

  3. Ptree: |D2(t0)| < 3 or |D2(t0)| ≥ 3 ?

     So the 1234567 slice of t0 contains 4 nbrs! Next we AND a8. If the 12345678 slice is too populous, removing a9 will always result in a too populous slice. If so, then we check if the 123456789 slice is too populous. If so, done (no outlier). So there might be a killer pruning method in which we find the population of the 9-dim slice; if it's too populous, done; else find the populations of all 8-dim slices (there are only 9 of them). If any are too populous, prune AND program accordingly.

     Ptree method: t0 is an outlier if rc(P) ≤ 3, where P =
       a3^a4^a5^a6^a7^a8^a9 v a2^a4^a5^a6^a7^a8^a9 v a2^a3^a5^a6^a7^a8^a9 v a2^a3^a4^a6^a7^a8^a9 v a2^a3^a4^a5^a7^a8^a9 v a2^a3^a4^a5^a6^a8^a9 v a2^a3^a4^a5^a6^a7^a9 v a2^a3^a4^a5^a6^a7^a8 v
       a1^a4^a5^a6^a7^a8^a9 v a1^a3^a5^a6^a7^a8^a9 v a1^a3^a4^a6^a7^a8^a9 v a1^a3^a4^a5^a7^a8^a9 v a1^a3^a4^a5^a6^a8^a9 v a1^a3^a4^a5^a6^a7^a9 v a1^a3^a4^a5^a6^a7^a8 v
       a1^a2^a5^a6^a7^a8^a9 v a1^a2^a4^a6^a7^a8^a9 v a1^a2^a4^a5^a7^a8^a9 v a1^a2^a4^a5^a6^a8^a9 v a1^a2^a4^a5^a6^a7^a9 v a1^a2^a4^a5^a6^a7^a8 v
       a1^a2^a3^a6^a7^a8^a9 v a1^a2^a3^a5^a7^a8^a9 v a1^a2^a3^a5^a6^a8^a9 v a1^a2^a3^a5^a6^a7^a9 v a1^a2^a3^a5^a6^a7^a8 v
       a1^a2^a3^a4^a7^a8^a9 v a1^a2^a3^a4^a6^a8^a9 v a1^a2^a3^a4^a6^a7^a9 v a1^a2^a3^a4^a6^a7^a8 v
       a1^a2^a3^a4^a5^a8^a9 v a1^a2^a3^a4^a5^a7^a9 v a1^a2^a3^a4^a5^a7^a8 v
       a1^a2^a3^a4^a5^a6^a9 v a1^a2^a3^a4^a5^a6^a8 v
       a1^a2^a3^a4^a5^a6^a7

     Bit vectors used in P for t0 = 1 0 1 0 0 0 1 1 0 (bit r is for RRN r; each is the attribute's vector where t0 has a 1 and its complement where t0 has a 0):
       a1  1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0
       a2  1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1
       a3  1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1
       a4  1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0
       a5  1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1
       a6  1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1
       a7  1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1
       a8  1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1
       a9  1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1

     Attribute bit vectors (the columns of the table, bit r for RRN r):
       a1   1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0
       a2   0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0
       a3   1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1
       a4   0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1
       a5   0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0
       a6   0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
       a7   1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1
       a8   1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1
       a9   0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0
       a10  1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
       a11  0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0
       a12  1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 1
       a13  1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1
       a14  0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0
       a15  1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1

     RRN table: the same 16 rows (a1 thru a15) as on slide 1.
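The AND/OR expression P on slide 3 can be evaluated with vertical bit vectors. The sketch below is an assumption-laden illustration: plain Python integers stand in for Ptrees (the tree structure the slides mention is not modeled), `root_count` is just a population count, and all function names are hypothetical. It reads "outlier" as a root count of at most k, since the count produced here includes the point itself.

```python
from itertools import combinations

# Rows 0..15 of the slide table, restricted to a1..a9.
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])
FULL = (1 << len(ROWS)) - 1  # bit vector with one set bit per row


def match_vectors(rows, p):
    """For each attribute, the rows that agree with row p on it
    (the attribute vector where p has a 1, its complement where p has a 0)."""
    vecs = []
    for i in range(len(rows[0])):
        v = 0
        for r, row in enumerate(rows):
            if row[i] == rows[p][i]:
                v |= 1 << r
        vecs.append(v)
    return vecs


def disk_vector(rows, p, d=2):
    """OR, over every choice of d attributes to leave out, of the AND of the rest."""
    m = match_vectors(rows, p)
    result = 0
    for left_out in combinations(range(len(m)), d):
        term = FULL
        for i, v in enumerate(m):
            if i not in left_out:
                term &= v
        result |= term
    return result


def root_count(v):
    """Number of set bits, i.e. number of rows selected."""
    return bin(v).count("1")


if __name__ == "__main__":
    k = 3
    print(root_count(disk_vector(ROWS, 0)))  # rows within distance 2 of t0, t0 included
    print([p for p in range(len(ROWS))
           if root_count(disk_vector(ROWS, p)) <= k])
```

With d=2 the loop visits the same 36 seven-attribute ANDs that the slide writes out. A row with fewer than two mismatches also satisfies at least one of those terms, so the OR covers the whole 2-disk.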

  4. Sridhar-Rastogi Quantitative Outlier Method: User chooses k and n (n is a qualitative parameter: find the n least dense points as outliers), e.g., n=2, k=3. d(x,y) = # of mismatches on a1 thru a9.

     Find the n points with the furthest k-closest-nbrs (kCN). Need to find the kth nearest nbr of each point (nested loop-ish! Sort of building the pairwise distance matrix).

     What we see is that outlier detection is inherently O(n^2)! Is there a way to get the complexity linear using vertical methods? Is there a vertical structure on the whole table that will help? Our basic Ptree methods attempt to reduce nested loop algs to O(n*log n) by using the tree structure, which has log n levels (log_2 n for 1-D Ptrees, log_4 n for 2-D Ptrees, etc.). It seems at least possible that there is a way to make one big tree, reducing the complexity to (log n)^2.

     [Figure: scatter plot of points, two of them blue.] Note that the 2 blue points would be considered the n=2 k=5 outliers, but they are quite different in terms of being "outlying".

     RRN table: a1 thru a9 of the 16 rows on slide 1.
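The kth-nearest-neighbor ranking described on slide 4 can be sketched with the same nested loop. This is a hedged illustration with hypothetical names (`kth_neighbor_distance`, `top_n_outliers`); it builds every pairwise distance, which is exactly the O(n^2) cost the slide complains about.

```python
# Rows 0..15 of the slide table, restricted to a1..a9.
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])


def hamming(x, y):
    """Number of positions where the two bit strings differ."""
    return sum(a != b for a, b in zip(x, y))


def kth_neighbor_distance(rows, p, k):
    """Distance from row p to its kth closest other row."""
    dists = sorted(hamming(rows[p], rows[q])
                   for q in range(len(rows)) if q != p)
    return dists[k - 1]


def top_n_outliers(rows, n, k):
    """The n rows whose kth closest neighbor is furthest away.
    Ties keep table order because Python's sort is stable."""
    ranked = sorted(range(len(rows)),
                    key=lambda p: kth_neighbor_distance(rows, p, k),
                    reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for p in (0, 4, 14, 15):
        print(p, kth_neighbor_distance(ROWS, p, k=3))
    print(top_n_outliers(ROWS, n=2, k=3))
```

On this table rows 4, 5, and 14 all have their 3rd closest neighbor at distance 3, so with n=2 the cut between them is arbitrary; the sketch simply keeps table order.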

  5. Kriegel-Breunig Method: (like S-R, but the choice of n depends upon the local density.)

       a3^a4^a5^a6^a7^a8^a9 v a2^a4^a5^a6^a7^a8^a9 v a2^a3^a5^a6^a7^a8^a9 v a2^a3^a4^a6^a7^a8^a9 v a2^a3^a4^a5^a7^a8^a9 v a2^a3^a4^a5^a6^a8^a9 v a2^a3^a4^a5^a6^a7^a9 v a2^a3^a4^a5^a6^a7^a8 v
       a1^a4^a5^a6^a7^a8^a9 v a1^a3^a5^a6^a7^a8^a9 v a1^a3^a4^a6^a7^a8^a9 v a1^a3^a4^a5^a7^a8^a9 v a1^a3^a4^a5^a6^a8^a9 v a1^a3^a4^a5^a6^a7^a9 v a1^a3^a4^a5^a6^a7^a8 v
       a1^a2^a5^a6^a7^a8^a9 v a1^a2^a4^a6^a7^a8^a9 v a1^a2^a4^a5^a7^a8^a9 v a1^a2^a4^a5^a6^a8^a9 v a1^a2^a4^a5^a6^a7^a9 v a1^a2^a4^a5^a6^a7^a8 v
       a1^a2^a3^a6^a7^a8^a9 v a1^a2^a3^a5^a7^a8^a9 v a1^a2^a3^a5^a6^a8^a9 v a1^a2^a3^a5^a6^a7^a9 v a1^a2^a3^a5^a6^a7^a8 v
       a1^a2^a3^a4^a7^a8^a9 v a1^a2^a3^a4^a6^a8^a9 v a1^a2^a3^a4^a6^a7^a9 v a1^a2^a3^a4^a6^a7^a8 v
       a1^a2^a3^a4^a5^a8^a9 v a1^a2^a3^a4^a5^a7^a9 v a1^a2^a3^a4^a5^a7^a8 v
       a1^a2^a3^a4^a5^a6^a9 v a1^a2^a3^a4^a5^a6^a8 v
       a1^a2^a3^a4^a5^a6^a7

     Which is to say, leave out pairs:
       a1 a2 v a1 a3 v a1 a4 v a1 a5 v a1 a6 v a1 a7 v a1 a8 v a1 a9 v
       a2 a3 v a2 a4 v a2 a5 v a2 a6 v a2 a7 v a2 a8 v a2 a9 v
       a3 a4 v a3 a5 v a3 a6 v a3 a7 v a3 a8 v a3 a9 v
       a4 a5 v a4 a6 v a4 a7 v a4 a8 v a4 a9 v
       a5 a6 v a5 a7 v a5 a8 v a5 a9 v
       a6 a7 v a6 a8 v a6 a9 v
       a7 a8 v a7 a9 v
       a8 a9

     Then leave out singles: a1 v a2 v a3 v a4 v a5 v a6 v a7 v a8 v a9

     RRN table: a1 thru a9 of the 16 rows on slide 1.

     Ptree method: (for Boolean data, but generalizes to any Manhattan distance) Choose a max distance, d = # of mismatches. Compute the AND/OR program for it. Then compute the AND/OR programs for each smaller-size disk as a modification of it. This method still assumes each pt will be examined as a potential outlier in turn.

     Variations (in which all outliers are found at one time), for d=2:
       - compute all combos with 2 left out, then 1 left out, in some optimal way;
       - compute all combos with 1 left out, then 2 left out, in some optimal way.
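One way to get the disks of every radius up to d in a single pass, instead of enumerating left-out pairs and then left-out singles, is a small recurrence over the attributes. This is a swapped-in technique offered as a sketch, not the AND/OR program the slide has in mind; integers stand in for Ptrees and the names (`nested_disks`, `match_vectors`) are hypothetical.

```python
# Rows 0..15 of the slide table, restricted to a1..a9.
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])
FULL = (1 << len(ROWS)) - 1  # bit vector with one set bit per row


def match_vectors(rows, p):
    """For each attribute, a bit vector of the rows that agree with row p on it."""
    vecs = []
    for i in range(len(rows[0])):
        v = 0
        for r, row in enumerate(rows):
            if row[i] == rows[p][i]:
                v |= 1 << r
        vecs.append(v)
    return vecs


def nested_disks(rows, p, d):
    """at_most[j] = rows with at most j mismatches against row p, for j = 0..d."""
    at_most = [FULL] * (d + 1)
    for m in match_vectors(rows, p):
        # Walk j downward so at_most[j - 1] is still the value from before
        # this attribute was processed.
        for j in range(d, 0, -1):
            # Stay within j mismatches: either match here, or had a spare mismatch.
            at_most[j] = (at_most[j] & m) | at_most[j - 1]
        at_most[0] &= m
    return at_most


def root_count(v):
    """Number of set bits, i.e. number of rows selected."""
    return bin(v).count("1")


if __name__ == "__main__":
    print([root_count(v) for v in nested_disks(ROWS, 0, 2)])
```

The design trade-off: the recurrence needs on the order of (d + 1) AND/OR steps per attribute, whereas the explicit expression for d=2 has 36 seven-way AND terms, and the smaller disks come out as by-products rather than as separate programs.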

  6. attribute          1  2  3  4  5  6  7  8  9
     count of 1s        5 10  8  5  7  2  9  9  5
     root signal        1  0  0  1  1  1  0  0  1   cnt=0
     rs w 7-8 margin    1  0  -  1  -  1  0  0  1   cnt=0
     rs w 6-9 margin    1  0  -  1  -  1  -  -  1   cnt=0
     rs w 5-10 margin   -  -  -  -  -  1  -  -  -   cnt=2

     RRN table: a1 thru a9 of the 16 rows on slide 1.
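The numbers on slide 6 are consistent with the following reading, sketched here under stated assumptions: the second row is the count of 1s per attribute, the root signal takes the rarer bit value of each attribute (ties going to 0), a margin lo-hi drops every attribute whose count falls in that inclusive range, and cnt is the number of rows matching the signal on the attributes that remain. The function names are hypothetical.

```python
# Rows 0..15 of the slide table, restricted to a1..a9.
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])


def one_counts(rows):
    """Number of 1s in each attribute column."""
    return [sum(r[i] == "1" for r in rows) for i in range(len(rows[0]))]


def minority_signal(rows):
    """Per attribute, the rarer bit value (ties resolve to 0, as on the slide)."""
    n = len(rows)
    return [1 if 2 * c < n else 0 for c in one_counts(rows)]


def signal_count(rows, signal, margin=None):
    """Rows matching the signal on every attribute whose 1-count lies
    outside the inclusive margin (lo, hi); margin=None keeps all attributes."""
    counts = one_counts(rows)
    kept = [i for i, c in enumerate(counts)
            if margin is None or not (margin[0] <= c <= margin[1])]
    return sum(all(int(r[i]) == signal[i] for i in kept) for r in rows)


if __name__ == "__main__":
    sig = minority_signal(ROWS)
    print(one_counts(ROWS), sig)
    for margin in (None, (7, 8), (6, 9), (5, 10)):
        print(margin, signal_count(ROWS, sig, margin))
```

Under this reading the widest margin leaves only a6, and the two matching rows are rows 4 and 5.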

  7. qid=0
     attribute      1  2  3  4  5  6  7  8  9
     count of 1s    4  4  6  0  4  2  4  4  2
     signal         0  0  0  1  0  1  0  0  1   cnt=0
     3-4 signal     -  -  0  1  -  1  -  -  1   cnt=0
     2-5 signal     -  -  0  1  -  -  -  -  -   cnt=0
     1-6 signal     -  -  -  1  -  -  -  -  -   cnt=0

     Should eliminate all 0's and max'es first?
     signal         0  0  0  -  0  1  0  0  1   cnt=0
     3-4 signal     -  -  0  -  -  1  -  -  1   cnt=0
     2-5 signal     -  -  0  -  -  -  -  -  -   cnt=2

     RRN table: a1 thru a9 of the 16 rows on slide 1.

  8. qid=1
     attribute      1  2  3  4  5  6  7  8  9
     count of 1s    1  6  2  5  3  0  5  5  3
     signal         1  0  1  0  1  -  0  0  1   cnt=1
     3-4 signal     1  0  1  0  -  -  0  0  -   cnt=1
     2-5 signal     1  0  -  -  -  -  -  -  -   cnt=1
     1-6 signal     -  -  -  -  -  -  -  -  -   cnt=8

     RRN table: a1 thru a9 of the 16 rows on slide 1.
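The per-qid figures on slides 7 and 8 match the table if qid=0 is taken to be rows 0 thru 7 and qid=1 rows 8 thru 15; that split is an assumption of this sketch, inferred from the counts. The sketch also applies the "eliminate all 0's and max'es first" idea by giving constant attributes no signal at all. Names such as `half_signal` are hypothetical.

```python
# Rows 0..15 of the slide table, restricted to a1..a9.
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])


def half_signal(rows):
    """Minority bit per attribute; None where the attribute is constant
    (count 0 or count == len(rows)) and so carries no signal."""
    n = len(rows)
    sig = []
    for i in range(len(rows[0])):
        c = sum(r[i] == "1" for r in rows)
        sig.append(None if c in (0, n) else (1 if 2 * c < n else 0))
    return sig


def signal_count(rows, sig, margin=None):
    """Rows matching sig on the attributes that have a signal and whose
    1-count lies outside the inclusive margin (lo, hi)."""
    kept = []
    for i, s in enumerate(sig):
        c = sum(r[i] == "1" for r in rows)
        if s is not None and (margin is None
                              or not (margin[0] <= c <= margin[1])):
            kept.append(i)
    return sum(all(int(r[i]) == sig[i] for i in kept) for r in rows)


if __name__ == "__main__":
    halves = {0: ROWS[:8], 1: ROWS[8:]}  # assumed meaning of qid
    for qid, rows in halves.items():
        sig = half_signal(rows)
        print(qid, sig, [signal_count(rows, sig, m)
                         for m in (None, (3, 4), (2, 5), (1, 6))])
```

In the second half the single row that keeps matching as the margin widens is row 14, one of the rows the distance-based methods on the earlier slides would also single out.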

  9. Maybe we should look for dense clusters first (to prune from the outlier search). Therefore looking for a signal match.

     attribute          1  2  3  4  5  6  7  8  9
     count of 1s        5 10  8  5  7  2  9  9  5
     root signal        0  1  1  0  0  0  1  1  0   cnt=0
     rs w 7-8 margin    0  1  -  0  -  0  1  1  0   cnt=0
     rs w 6-9 margin    0  1  -  0  -  0  -  -  0   cnt=0
     rs w 5-10 margin   -  -  -  -  -  0  -  -  -   cnt=14

     RRN table: a1 thru a9 of the 16 rows on slide 1.

  10. So this signal is in a dense area.

      qid=0
      attribute      1  2  3  4  5  6  7  8  9
      count of 1s    4  4  6  0  4  2  4  4  2
      signal         1  1  1  0  1  0  1  1  0   cnt=0
      3-4 signal     -  -  1  0  -  0  -  -  0   cnt=4
      2-5 signal     -  -  1  0  -  -  -  -  -   cnt=6
      1-6 signal     -  -  -  0  -  -  -  -  -   cnt=8

      RRN table: a1 thru a9 of the 16 rows on slide 1.
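The dense-area search on slides 9 and 10 appears to use the opposite signal: the more common bit value of each attribute, with ties going to 1. The sketch below assumes that reading, along with the inclusive lo-hi margins and the rows 0 thru 7 meaning of qid=0; the function names are hypothetical.

```python
# Rows 0..15 of the slide table, restricted to a1..a9.
ROWS = (["101000110"] * 4 + ["011011000"] * 2 + ["010010001"] * 4
        + ["010100110"] * 4 + ["101010001", "001100110"])


def one_counts(rows):
    """Number of 1s in each attribute column."""
    return [sum(r[i] == "1" for r in rows) for i in range(len(rows[0]))]


def majority_signal(rows):
    """Per attribute, the more common bit value (ties resolve to 1)."""
    n = len(rows)
    return [1 if 2 * c >= n else 0 for c in one_counts(rows)]


def signal_count(rows, signal, margin=None):
    """Rows matching the signal on every attribute whose 1-count lies
    outside the inclusive margin (lo, hi); margin=None keeps all attributes."""
    counts = one_counts(rows)
    kept = [i for i, c in enumerate(counts)
            if margin is None or not (margin[0] <= c <= margin[1])]
    return sum(all(int(r[i]) == signal[i] for i in kept) for r in rows)


if __name__ == "__main__":
    root = majority_signal(ROWS)
    print(root, [signal_count(ROWS, root, m)
                 for m in (None, (7, 8), (6, 9), (5, 10))])
    first_half = ROWS[:8]  # assumed meaning of qid=0
    sig = majority_signal(first_half)
    print(sig, [signal_count(first_half, sig, m)
                for m in (None, (3, 4), (2, 5), (1, 6))])
```

Here the count grows quickly as the margin widens, which is the behavior the slides treat as a sign of a dense region that can be pruned from the outlier search.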
