
Distance and Similarity Measures: What does “close” Mean? Separation Boundaries


Presentation Transcript


  1. Distance and Similarity Measures: What does “close” Mean? Separation Boundaries

  2. Distance Metric. Measures the dissimilarity between two data points. A metric is a function, d, of two points X and Y, such that:
• d(X, Y) is positive definite: if X ≠ Y, d(X, Y) > 0; if X = Y, d(X, Y) = 0
• d(X, Y) is symmetric: d(X, Y) = d(Y, X)
• d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)
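As a quick illustration of these three axioms, here is a minimal sketch (not from the slides) that spot-checks them numerically for a candidate distance function on a small sample of points; the Euclidean candidate and the function names are illustrative assumptions.

```python
import itertools
import math

def euclidean(x, y):
    """Candidate distance function to test (here: Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def check_metric_axioms(d, points, tol=1e-12):
    """Spot-check positive definiteness, symmetry, and the triangle
    inequality of d on a finite sample of points."""
    for x, y in itertools.product(points, repeat=2):
        if x == y:
            assert d(x, y) <= tol                     # d(X, X) = 0
        else:
            assert d(x, y) > 0                        # d(X, Y) > 0 when X != Y
        assert abs(d(x, y) - d(y, x)) <= tol          # symmetry
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, y) + d(y, z) >= d(x, z) - tol     # triangle inequality
    return True

sample = [(2, 1), (6, 4), (0, 0), (3, 3)]
print(check_metric_axioms(euclidean, sample))   # True
```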

  3. Standard Distance Metrics. Minkowski distance, or Lp distance; Manhattan distance (p = 1); Euclidean distance (p = 2); Max distance (p = ∞).

  4. An Example. A two-dimensional space with X = (2, 1), Y = (6, 4), and Z the corner of the right triangle between them (so Z = (6, 1)):
• Manhattan: d1(X, Y) = XZ + ZY = 4 + 3 = 7
• Euclidean: d2(X, Y) = XY = 5
• Max: d∞(X, Y) = max(XZ, ZY) = XZ = 4
so d1 ≥ d2 ≥ d∞. For any positive integer p, dp(X, Y) = (Σ_{i=1..n} |xi − yi|^p)^(1/p).
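A short sketch reproducing this example in code; the generic minkowski helper (with p = inf handled as the max distance) is an assumption of this sketch, not something defined on the slide.

```python
def minkowski(x, y, p):
    """L_p distance; p = float('inf') gives the max (chessboard) distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

X, Y = (2, 1), (6, 4)
print(minkowski(X, Y, 1))             # Manhattan: 4 + 3 = 7
print(minkowski(X, Y, 2))             # Euclidean: sqrt(16 + 9) = 5.0
print(minkowski(X, Y, float("inf")))  # Max: max(4, 3) = 4
```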

  5. HOBbit Similarity. (These notes contain NDSU confidential and proprietary material. Patents pending on bSQ, Ptree technology.) Higher Order Bit (HOBbit) similarity between two scalars (integers) A and B: HOBbitS(A, B) = max{ s : 0 ≤ s ≤ m and ai = bi for all 1 ≤ i ≤ s }, where ai, bi are the ith bits of A and B (left to right) and m is the number of bits. Example (bit positions 1-8):
x1: 0 1 1 0 1 0 0 1    y1: 0 1 1 1 1 1 0 1    HOBbitS(x1, y1) = 3
x2: 0 1 0 1 1 1 0 1    y2: 0 1 0 1 0 0 0 0    HOBbitS(x2, y2) = 4

  6. HOBbit Distance (High Order Bifurcation bit). HOBbit distance between two scalar values A and B: dv(A, B) = m − HOBbitS(A, B). HOBbit distance between points X and Y: dh(X, Y) = max_{i=1..n} dv(xi, yi). Example (bit positions 1-8):
x1: 0 1 1 0 1 0 0 1    y1: 0 1 1 1 1 1 0 1    HOBbitS(x1, y1) = 3, so dv(x1, y1) = 8 − 3 = 5
x2: 0 1 0 1 1 1 0 1    y2: 0 1 0 1 0 0 0 0    HOBbitS(x2, y2) = 4, so dv(x2, y2) = 8 − 4 = 4
In this 2-dimensional example: dh(X, Y) = max(5, 4) = 5.
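A sketch of HOBbit similarity and distance as defined on the last two slides, assuming 8-bit unsigned integers; the function names are illustrative.

```python
M = 8  # number of bits

def hobbit_similarity(a, b, m=M):
    """Number of most-significant bits (left to right) on which a and b agree."""
    s = 0
    for i in range(m - 1, -1, -1):            # bit m-1 is the leftmost bit
        if ((a >> i) & 1) != ((b >> i) & 1):
            break
        s += 1
    return s

def hobbit_distance_scalar(a, b, m=M):
    """d_v(A, B) = m - HOBbitS(A, B)."""
    return m - hobbit_similarity(a, b, m)

def hobbit_distance(x, y, m=M):
    """d_h(X, Y) = max over dimensions of the scalar HOBbit distances."""
    return max(hobbit_distance_scalar(a, b, m) for a, b in zip(x, y))

x = (0b01101001, 0b01011101)   # (x1, x2) from the slide
y = (0b01111101, 0b01010000)   # (y1, y2) from the slide
print(hobbit_similarity(x[0], y[0]))   # 3
print(hobbit_similarity(x[1], y[1]))   # 4
print(hobbit_distance(x, y))           # max(5, 4) = 5
```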

  7. HOBbit Distance Is a Metric. HOBbit distance is positive definite: if X = Y, dh(X, Y) = 0; if X ≠ Y, dh(X, Y) > 0. HOBbit distance is symmetric, and it satisfies the triangle inequality.

  8. Neighborhood of a Point. The neighborhood of a target point T is a set of points S such that X ∈ S if and only if d(T, X) ≤ r. If X is a point on the boundary, d(T, X) = r. (Figures: neighborhoods of T of radius r under the Manhattan, Euclidean, Max, and HOBbit metrics.)

  9. Decision Boundary. The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X). The decision boundary for the HOBbit distance is perpendicular to the axis along which the distance is maximal. (Figures: decision boundaries between A and B under the Manhattan, Euclidean, and Max distances, for separation angles greater than and less than 45°.)
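To make the point concrete, here is a small sketch (the points A, B, X are chosen for illustration and are not from the slide) showing that which of A and B is closer to a query point X, i.e. which side of the decision boundary X lies on, can change with the metric.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def chessboard(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def closer_to(A, B, X, d):
    """Which of A, B is nearer to X under metric d, i.e. which side of the
    decision boundary d(A, X) = d(B, X) the point X falls on."""
    if d(A, X) < d(B, X):
        return "A"
    if d(A, X) > d(B, X):
        return "B"
    return "boundary"

A, B, X = (0, 0), (6, 2), (3.5, 0)
for name, d in [("Manhattan", manhattan), ("Euclidean", euclidean), ("Max", chessboard)]:
    print(name, closer_to(A, B, X, d))
# Manhattan -> A, Euclidean -> B, Max -> B: the boundary depends on the metric.
```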

  10. Minkowski Metrics. Lp-metrics (aka Minkowski metrics): dp(X, Y) = (Σ_{i=1..n} wi |xi − yi|^p)^(1/p), with the weights wi assumed = 1. Unit-disk boundaries: p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, ..., p = ∞ (chessboard), p = ½, ⅓, ¼, .... dmax ≡ max_i |xi − yi| = d∞ ≡ lim_{p→∞} dp(X, Y). Proof (sketch): let b ≡ max_i(ai) in lim_{p→∞} (Σ_{i=1..n} ai^p)^(1/p). For p large enough, the other ai^p << b^p, since (ai/b)^p → 0 for ai < b, so Σ_{i=1..n} ai^p ≈ k·b^p (k = multiplicity of b in the sum); hence (Σ_{i=1..n} ai^p)^(1/p) ≈ k^(1/p)·b, and k^(1/p) → 1.
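A brief numerical check of this limit (the same pattern the tables on the next slide tabulate); the sample vector is the first entry of that table and the weights wi are taken as 1.

```python
def lp(x, y, p):
    """Unweighted L_p distance between equal-length sequences x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

X, Y = (0.5, 0.5), (0.0, 0.0)
for p in (2, 4, 9, 100):
    print(p, lp(X, Y, p))      # 0.7071..., 0.5946..., 0.5400..., 0.5034...
print("max", max(abs(a - b) for a, b in zip(X, Y)))   # 0.5
# d_p decreases toward d_max = 0.5 as p grows, matching the first block of the table.
```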

  11. Lp metrics, p > 1. Lq distance from X to Y = (0, 0) for increasing q (each block converges down to the max distance):
X = (0.5, 0.5):   q = 2: 0.7071067812; q = 4: 0.5946035575; q = 9: 0.5400298694; q = 100: 0.503477775; MAX: 0.5
X = (0.71, 0.71): q = 2: 1.0; q = 3: 0.8908987181; q = 7: 0.7807091822; q = 100: 0.7120250978; MAX: 0.7071067812
X = (0.99, 0.99): q = 2: 1.4000714267; q = 8: 1.0796026553; q = 100: 0.9968859946; q = 1000: 0.9906864536; MAX: 0.99
X = (1, 1):       q = 2: 1.4142135624; q = 9: 1.0800597389; q = 100: 1.0069555501; q = 1000: 1.0006933875; MAX: 1
X = (0.9, 0.1):   q = 2: 0.9055385138; q = 9: 0.9000000003; q = 100: 0.9; q = 1000: 0.9; MAX: 0.9
X = (3, 3):       q = 2: 4.2426406871; q = 3: 3.7797631497; q = 8: 3.271523198; q = 100: 3.0208666502; MAX: 3
X = (90, 45):     q = 6: 90.232863532; q = 9: 90.019514317; q = 100: 90; MAX: 90

  12. Lp metrics, p < 1. For p < 1, dp(X, Y) = (Σ_{i=1..n} |xi − yi|^p)^(1/p). Lq distance from X to Y = (0, 0) for decreasing q (the values blow up as q → 0; for p = 0, i.e. the limit as p → 0, Lp does not exist because it does not converge):
X = (0.1, 0.1): q = 2: 0.141421356; q = 1: 0.2; q = 0.8: 0.238; q = 0.4: 0.566; q = 0.2: 3.2; q = 0.1: 102; q = 0.04: 3355443; q = 0.02: 112589990684263; q = 0.01: 1.2676E+29
X = (0.5, 0.5): q = 2: 0.7071; q = 1: 1; q = 0.8: 1.19; q = 0.4: 2.83; q = 0.2: 16; q = 0.1: 512; q = 0.04: 16777216; q = 0.02: 5.63E+14; q = 0.01: 6.34E+29
X = (0.9, 0.1): q = 2: 0.906; q = 1: 1; q = 0.8: 1.098; q = 0.4: 2.1445; q = 0.2: 10.82; q = 0.1: 326.27; q = 0.04: 10312196.962; q = 0.02: 341871052443154; q = 0.01: 3.8E+29

  13. Other Interesting Metrics.
• Canberra metric: dc(X, Y) = Σ_{i=1..n} |xi − yi| / (xi + yi), a normalized Manhattan distance.
• Squared chord metric: dsc(X, Y) = Σ_{i=1..n} (√xi − √yi)^2. Already discussed as Lp with p = 1/2.
• Squared chi-squared metric: dchi(X, Y) = Σ_{i=1..n} (xi − yi)^2 / (xi + yi).
• Scalar product metric: dsp(X, Y) = X • Y = Σ_{i=1..n} xi * yi.
• Hyperbolic metrics (which map infinite space 1-1 onto a sphere).
Which are rotationally invariant? Translation invariant? Other? Some notes on distance functions can be found at http://www.cs.ndsu.NoDak.edu/~datasurg/distance_similarity.pdf
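A sketch of these measures, assuming nonnegative coordinates where square roots or denominators require it; the function names and test vectors are illustrative.

```python
import math

def canberra(x, y):
    """Canberra metric: sum |xi - yi| / (xi + yi), a normalized Manhattan distance."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if (a + b) != 0)

def squared_chord(x, y):
    """Squared chord metric: sum (sqrt(xi) - sqrt(yi))^2 (nonnegative inputs)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    """Squared chi-squared metric: sum (xi - yi)^2 / (xi + yi)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if (a + b) != 0)

def scalar_product(x, y):
    """Scalar (dot) product X . Y = sum xi * yi."""
    return sum(a * b for a, b in zip(x, y))

x, y = (1.0, 4.0, 9.0), (1.0, 1.0, 4.0)
print(canberra(x, y), squared_chord(x, y), squared_chi_squared(x, y), scalar_product(x, y))
```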

  14. LDS - Local DisSimilarity Measure. Dissimilarity measures don't have to be full distance metrics. A metric is a function, d, of two points X and Y, such that: d(X, Y) is positive definite (if X ≠ Y, d(X, Y) > 0; if X = Y, d(X, Y) = 0); d(X, Y) is symmetric (d(X, Y) = d(Y, X)); and d(X, Y) satisfies the triangle inequality (d(X, Y) + d(Y, Z) ≥ d(X, Z)). Dissimilarity measures may satisfy only definiteness and symmetry. Or they can be only what we will call "local dissimilarities", or LDSs, satisfying positive definiteness alone. The fact that an LDS need not be symmetric means that Y can be the nearest neighbor of X even though X is not the nearest neighbor of Y. This is the case, for instance, in Kriging (in which dissimilarity can vary with angle as well as with the center).

  15. In this Bioinformatics Data Warehouse (BDW) model, we can define an LDS usefully as follows (actually used in the DataSURG winning submission to the ACM KDD-Cup competition in 2002: http://www.cs.ndsu.nodak.edu/~datasurg/kddcup02). In this competition we were given a training set of data on yeast genes (the yeast slice of the GeneOrgDimTbl) with ~3000 genes and ~1700 attributes (1700 basic Ptrees). We were to classify unclassified gene samples as Y/N (what the Yes or No meant is not important here). The approach was to find all nearest training neighbors of an unclassified sample, g, and let them vote. The definition of "nearest" came from an LDS: for an unclassified gene g, define fg: T → NonNegReals by fg(x) = Σ_{i: gi=1} wi xi, where the wi are weights. This assumes that a training point should not be considered close unless it agrees with g where g is 1 (where g is 0 it doesn't matter). This definition solves the curse of dimensionality (for sparse data, anyway). It also worked! Possible improvements include figuring some of the g = 0 attributes into the LDS definition (e.g., lethality, stop codon type, duplicity, ...), for instance by simply including Not-lethal as an attribute along with lethal, etc. One can also step out from the sample more finely. One could use horizontal ANDs of Ptrees for the initial nearest-neighbor set, but then, if that set is small, use vertical scans from there onward (one can do vertical scans with vertical data! How?). A sketch of this LDS-based vote appears below. (Slide figure: the GeneOrgDimTbl data cube with its GeneDimTbl, OrgDimTbl, ExperimentDimTbl, and ExpGeneDimTbl dimension tables; gene attributes such as function, pathway, complex, localization, protein class, GO-id, and MIPS / EMBL / Medline links; organism attributes such as name, species, and vertebrate; and an interaction table, a unipartite symmetric fact cube such as a ProteinProteinInteractionGraph or GeneAttributeSimilarityGraph.)
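Below is a minimal sketch of this LDS-based nearest-neighbor vote, assuming binary attribute vectors and unit weights wi = 1; the toy data, function names, and the simple majority-vote handling are illustrative assumptions, not the DataSURG implementation.

```python
def lds_score(g, x, w=None):
    """f_g(x) = sum over positions where g_i == 1 of w_i * x_i.
    Only attributes where the unclassified sample g is 1 are considered,
    so the measure is asymmetric (f_g(x) != f_x(g) in general)."""
    if w is None:
        w = [1.0] * len(g)
    return sum(wi * xi for gi, xi, wi in zip(g, x, w) if gi == 1)

def classify_by_vote(g, training):
    """Nearest-neighbor vote: the training points with the highest LDS score
    relative to g vote with their Y/N labels (majority wins; illustrative only)."""
    scores = [(lds_score(g, x), label) for x, label in training]
    best = max(s for s, _ in scores)
    votes = [label for s, label in scores if s == best]
    return max(set(votes), key=votes.count)

# Toy example: 6 binary attributes, two labeled training genes.
training = [((1, 1, 0, 0, 1, 0), "Y"),
            ((0, 1, 1, 0, 0, 1), "N")]
g = (1, 1, 0, 0, 0, 0)                 # unclassified sample
print(classify_by_vote(g, training))   # "Y": the first gene agrees on both 1-bits of g
```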

  16. Correlation in Business Intelligence. Business Intelligence (BI) problems are very interesting data mining problems. Typical BI training data sets are large and extremely unbalanced among classes. The Business Intelligence problem is a classification problem: given a history of customer events (purchases or rentals) and satisfaction levels (ratings), one classifies potential future customer events as to their most likely satisfaction level. This information is then used to suggest to customers what item they might like to purchase or rent next. Clearly this is a central need for any Internet-based company, or for any company seeking to do targeted advertising (all companies?). The classification training set, or Business Intelligence Data Set (BIDS), usually consists of the past history of customer satisfaction rating events (the rating of a product purchased or rented; in the Netflix case, a 1-5 rating of a rented movie). Thus, at its simplest, the training set looks something like BIDS(customerID, productID, rating), where the features are customerID and productID and the class label is rating. Very often there are other features, such as DateOfEvent. Of course there are many other single-entity features that might be useful to the classification process, such as customer features (name, address, ethnicity, age, ...) and product features (type, DateOfCreation, creator, color, weight, ...).

  17. Correlation in Business Intelligence. Together these form a Data Warehouse in the Star Model, in which the central fact is BIDS and the dimension star points are the Customer and Product feature files. We will keep it simple and work only with BIDS. In the more complex case, the dimension files are usually joined into the central fact file to form the training set. It is assumed that the data is in vertical format. Horizontal format is the standard in the industry and is ubiquitous. However, it is the authors' contention (proved valid by the success of the method) that vertical formatting is superior for many data mining applications, whereas for standard data processing, horizontal formatting is still best. Horizontal data formatting simply means that the data on an entity type is collected into horizontal records (of fields), one for each entity instance (e.g., employee records, one for each employee). In order to process horizontal data, scans down the collection (file) of these horizontal records are required. Of course, these horizontal files can be indexed, providing efficient alternate entry points, but scans are still usually necessary (full inversion of the file would eliminate the need for scans, but that is extremely expensive to do and maintain, and impossible for large, volatile horizontal files).

  18. Correlation in Business Intelligence. Vertical data formatting simply means that the data on an entity is collected into vertical slices by feature, or by bit position of feature (or by some other vertical slicing of a coding scheme applied to a feature or features of the entity, such as value slicing, in which each individual value in a feature column is bitmapped). In this BIDS classifier, variant forms of nearest-neighbor-vote-based classification were combined with a vertical data structure (Predicate Tree, or P-tree) for efficient processing. Even for small data sets, nearest-neighbor voting over horizontal records would require a large number of training database scans to arrive at an acceptable solution; the vertical data structure (P-tree) provides acceptable computation times (as opposed to scanning horizontal record data sets). A simplified illustration of vertical bit-slicing follows.
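The sketch below shows vertical bit-slicing with plain Python bit-vectors (not actual compressed P-trees): each column of integer values is decomposed into one bit-vector per bit position, and a predicate such as equality is evaluated by combining slices bitwise instead of scanning horizontal records. The helper names are illustrative assumptions.

```python
def bit_slices(column, nbits):
    """Decompose an integer column vertically into nbits bit-vectors,
    slice 0 holding the most significant bit of every value."""
    return [[(v >> (nbits - 1 - b)) & 1 for v in column] for b in range(nbits)]

def equals(slices, value, nbits):
    """Mask of rows where the column equals `value`, built purely from the
    vertical slices (bitwise combination, no horizontal record scan)."""
    mask = [1] * len(slices[0])
    for b in range(nbits):
        bit = (value >> (nbits - 1 - b)) & 1
        mask = [m & (s if bit else 1 - s) for m, s in zip(mask, slices[b])]
    return mask

ratings = [5, 3, 5, 1, 4]        # one vertical column of 3-bit values
slices = bit_slices(ratings, 3)
print(slices)                    # three bit-vectors, MSB slice first
print(equals(slices, 5, 3))      # [1, 0, 1, 0, 0]
```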

  19. Correlation in Business Intelligence: BACKGROUND - Similarity and Relevance Analysis. Classification models that consider all attributes equally, such as classical K-Nearest-Neighbor classification (KNN), work well when all attributes are similar in their relevance to the classification task [14]. This is, however, often not the case. The problem is particularly pronounced in situations with a large number of attributes, some of which are known to be irrelevant. Many solutions have been proposed that weight dimensions according to their relevance to the classification problem. The weighting can be derived as part of the algorithm [5]. In an alternative strategy, the attribute dimensions are scaled, using an evolutionary algorithm to optimize the classification accuracy of a separate algorithm such as KNN [21]. In an effort to reduce the number of classification voters and to reduce the dimension of the vector space over which these class votes are calculated, judicious customer (also called "user") and item (also called "movie") selection proved extremely important. That is to say, if one lets all neighboring users vote over all neighboring movies, the resulting predicted class (rating) turns out to be grossly in error. It is also important to note that the vertical representation facilitates efficient attribute relevance analysis in situations with a large number of attributes: relevance of a particular attribute with respect to the class attribute can be computed by reading only those two data columns, compared to reading all the long rows of data in a horizontal approach. A sketch of such column-wise relevance scoring follows.
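In the sketch below, only the attribute column and the class column are read for each relevance score. Absolute Pearson correlation is used as the score purely for illustration; it is an assumption of this sketch, not necessarily the weighting the authors used.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def relevance_weights(columns, class_column):
    """One pass per attribute column against the class column only;
    no horizontal row scans are needed."""
    return [abs(pearson(col, class_column)) for col in columns]

# Toy data: two attribute columns and a class column (ratings).
attr1 = [1, 2, 3, 4, 5]       # strongly related to the class
attr2 = [7, 1, 4, 4, 2]       # weakly related
ratings = [1, 2, 3, 5, 5]
print(relevance_weights([attr1, attr2], ratings))
```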

  20. Correlation in Business Intelligence: A Correlation-Based Nearest Neighbor Pruning. This correlation, for each pair of data points in a data set, can be computed rapidly using vertical P-trees [2]. For example, for each pair of users (u, v), this correlation (also called the 1-Perpendicular Correlation), PC(u, v), is
PC(u, v) = [ sqrt( (v − u − vbar + ubar) ∘ (v − u − vbar + ubar) / (a + n·b) ) ]^2,
where ∘ is the inner product of the ratings of users u and v across the n selected movies, and a and b are tunable parameters. This approach avoids having to compute all Euclidean distances between the sample and all the other points in the training data when arriving at the nearest-neighbor set. More importantly, it produces a select set of neighbors which, when allowed to vote for the most likely rating, produces very low error. Later in this paper, the motivation for this correlation measure is explained and a pictorial description of it is provided.
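A literal sketch of the formula as stated above (the outer square and the square root cancel, leaving the squared length of (v − u − vbar + ubar) divided by (a + n·b)); dense rating lists stand in for P-trees, and the parameter values and sample ratings are placeholders.

```python
import math

def perpendicular_correlation(u, v, a=1.0, b=1.0):
    """PC(u, v) = [ sqrt( (v - u - vbar + ubar) o (v - u - vbar + ubar) / (a + n*b) ) ]^2,
    computed over the n movies rated by both users; a and b are tunable parameters."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    diff = [vi - ui - vbar + ubar for ui, vi in zip(u, v)]
    inner = sum(d * d for d in diff)
    return math.sqrt(inner / (a + n * b)) ** 2

# Ratings of two users on the same n = 5 movies (illustrative values).
u = [5, 4, 3, 4, 5]
v = [4, 4, 2, 5, 5]
print(perpendicular_correlation(u, v))   # small value: u and v rate the movies similarly
```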

  21. Correlation in Business Intelligence: REFERENCES
1. Abidin, T., Perrizo, W., SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining, SAC, 2006.
2. Abidin, T., Perera, A., Serazi, M., Perrizo, W., Vertical Set Square Distance, CATA, 2005.
3. Ding, Q., Khan, M., Roy, A., Perrizo, W., P-tree Algebra, Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
4. Bandyopadhyay, S., Murthy, C.A., Pattern Classification Using Genetic Algorithms, Pattern Recognition Letters, V16, 1995.
5. Cost, S., Salzberg, S., A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, 1993.
6. DataSURG, P-tree Application Programming Interface Documentation, http://midas.cs.ndsu.nodak.edu/~datasurg/ptree/
7. Ding, Q., Ding, Q., Perrizo, W., ARM on RSI Using P-trees, Pacific-Asia KDD Conference, pp. 66-79, Taipei, May 2002.
8. Duch, W., Grudziński, K., Diercksen, G., Neural Minimal Distance Methods, World Congress of Computational Intelligence, IJCNN'98.
9. Goldberg, D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, 1989.
10. Guerra-Salcedo, C., Whitley, D., Feature Selection Mechanisms for Ensemble Creation, AAAI Workshop, 1999.
11. Jain, A., Zongker, D., Feature Selection: Evaluation, Application, and Small Sample Performance, IEEE TPAMI, V9:2, 1997.
12. Khan, M., Ding, Q., Perrizo, W., k-NN Classification on Spatial Data Streams Using P-trees, Springer LNAI, 2002.
13. Khan, M., Ding, Q., Perrizo, W., K-Nearest Neighbor Classification, PAKDD, pp. 517-528, 2002.
14. Krishnaiah, P.R., Kanal, L.N., Handbook of Statistics 2, North Holland, 1982.
15. Kuncheva, L.I., Jain, L.C., Designing Classifier Fusion Systems by Genetic Algorithms, IEEE Transactions on Evolutionary Computation, V33, 2000.
16. Lane, T., ACM Knowledge Discovery and Data Mining Cup 2006, http://www.kdd2006.com/kddcup.html
17. Martin-Bautista, M.J., Vila, M.A., A Survey of Genetic Feature Selection in Mining Issues, Congress on Evolutionary Computation, 1999.
18. Perera, A., Perrizo, W., et al., Vertical Set Square Distance Based Clustering, Intelligent and Adaptive Systems and Software Engineering, 2004.
19. Perera, A., Perrizo, W., et al., P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, V4:2, 2002.
20. Perera, A., Perrizo, W., Vertical K-Median Clustering, Conference on Computers and Their Applications, 2006.
21. Punch, W.F., et al., Further Research on Feature Selection and Classification Using Genetic Algorithms, Conference on GAs, 1993.
22. Rahal, I., Perrizo, W., An Optimized Approach for KNN Text Categorization Using P-trees, Symposium on Applied Computing, 2004.
23. Raymer, M.L., et al., Dimensionality Reduction Using Genetic Algorithms, IEEE Transactions on Evolutionary Computation, Vol. 4, pp. 164-171, 2000.
24. Serazi, M., Perera, A., Perrizo, W., et al., DataMIME, ACM SIGMOD, Paris, France, June 2004.
25. Vafaie, H., De Jong, K., Robust Feature Selection Algorithms, Tools with AI, 1993.
26. Fisher, R.A., Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188, 1936.
