Measuring Distance

leia

Presentation Transcript


  1. Measuring Distance: Input for Multidimensional Scaling and Clustering

  2. Distances and Similarities • Both are ways of measuring how similar two objects are • Distances increase as objects are less similar. The distance of an object to itself is 0 • Similarities increase as objects are more similar. The similarity of an object to itself is the maximum value for the similarity measure

  3. Distance Examples • Mileage between two towns measured as straight-line (Euclidean) distance (“as the crow flies”), as driving distance, or as great circle (spherical) distance • Instead of geographic locations we can treat measurements such as length, width, and thickness of an artifact as defining its position
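
As a rough illustration of the two ideas on this slide, the R sketch below computes a great circle distance with the haversine formula and then treats artifact measurements as coordinates. The function name `great_circle_km`, the mean Earth radius of 6371 km, and all coordinates and measurements are assumptions made up for the example, not values from the presentation.

```r
# Great-circle (haversine) distance in km between two points given in degrees.
# The Earth radius of 6371 km is an assumed mean value.
great_circle_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical towns (latitude, longitude in degrees)
town_a <- c(lat = 30.0, lon = -97.0)
town_b <- c(lat = 32.0, lon = -95.0)
great_circle_km(town_a[["lat"]], town_a[["lon"]],
                town_b[["lat"]], town_b[["lon"]])

# Hypothetical artifacts: the measurements act as coordinates
artifacts <- rbind(
  A = c(Length = 40, Width = 20, Thickness = 6),
  B = c(Length = 43, Width = 24, Thickness = 6)
)
artifact_dist <- dist(artifacts)  # sqrt(3^2 + 4^2 + 0^2)
artifact_dist
```

Driving distance has no closed formula; it depends on the road network and would need routing data.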

  4. Similarity Examples • The number of characteristics two objects have in common (cultural traits, genes, presence/absence traits) • Similarity measures can be converted to distances by subtracting each similarity from the maximum possible similarity
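
The conversion in the second bullet can be sketched in R as follows. The trait matrix is hypothetical, and the similarity used here counts agreements in both presence and absence; that choice is an assumption made so that an object's similarity to itself equals the maximum (the number of traits).

```r
# Hypothetical presence/absence data: rows are objects, columns are traits
traits <- rbind(
  obj1 = c(1, 1, 0, 1, 0),
  obj2 = c(1, 0, 0, 1, 1),
  obj3 = c(0, 0, 1, 0, 1)
)

# Similarity: number of traits on which two objects agree
# (shared presences plus shared absences)
matches <- traits %*% t(traits) + (1 - traits) %*% t(1 - traits)

# Distance: subtract each similarity from the maximum possible similarity
max_sim <- ncol(traits)
trait_dist <- as.dist(max_sim - matches)
trait_dist
```

With this similarity the diagonal of `matches` equals `max_sim`, so every object ends up at distance 0 from itself.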

  5. Interval/Ratio Measures • Manhattan Distance (or City Block, 1-norm) • Euclidean Distance (2-norm) and Squared Euclidean Distance • Minkowski Distance (p-norm) • Chebyshev Distance (Maximum Distance, infinity norm)

  6. Definitions
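
The formulas for this slide did not survive in the transcript. The block below gives the standard textbook definitions of the four measures listed on the previous slide; the notation is an assumption (x_{ij} is the value of variable i for object j, with n variables and objects j and k), chosen to match the j and k used for samples later in the presentation.

```latex
\begin{align*}
d_{\text{Manhattan}}(j,k) &= \sum_{i=1}^{n} \lvert x_{ij} - x_{ik} \rvert \\
d_{\text{Euclidean}}(j,k) &= \sqrt{\sum_{i=1}^{n} (x_{ij} - x_{ik})^{2}} \\
d_{\text{Minkowski}}(j,k) &= \Bigl(\sum_{i=1}^{n} \lvert x_{ij} - x_{ik} \rvert^{p}\Bigr)^{1/p} \\
d_{\text{Chebyshev}}(j,k) &= \max_{i} \lvert x_{ij} - x_{ik} \rvert
\end{align*}
```

Setting p = 1 in the Minkowski distance gives Manhattan, p = 2 gives Euclidean, and the limit as p grows without bound gives Chebyshev.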

  7. Counts • Ecologists use counts of species between plots to analyze compositional changes in community structure • Bray-Curtis compares the number of specimens and number of overlapping species

  8. Definitions: Bray-Curtis Dissimilarity • Note: If samples j and k are percentages, then the denominator becomes 200.
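
The equation for this slide is missing from the transcript. The usual form of the Bray-Curtis dissimilarity, which is consistent with the note about the denominator, is shown below; the notation (x_{ij} is the count of species i in sample j, with n species) is an assumption.

```latex
\[
BC_{jk} = \frac{\sum_{i=1}^{n} \lvert x_{ij} - x_{ik} \rvert}
               {\sum_{i=1}^{n} \left( x_{ij} + x_{ik} \right)}
\]
```

When both samples are expressed as percentages, each sums to 100, so the denominator is 100 + 100 = 200.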

  9. Ordinal Measures • Few measures are designed specifically for rank data, but rank correlation coefficients (Spearman, Kendall) can be used
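
A minimal sketch of using rank correlations as the basis for a distance matrix in R. The rankings are hypothetical, and converting a correlation to a distance as 1 minus the correlation is one common choice assumed here, not something the slide specifies.

```r
# Hypothetical rankings: rows are objects, columns are ranked variables
ranks <- rbind(
  site1 = c(1, 2, 3, 4, 5),
  site2 = c(2, 1, 3, 5, 4),
  site3 = c(5, 4, 3, 2, 1)
)

# cor() correlates columns, so transpose to compare objects (rows)
rho <- cor(t(ranks), method = "spearman")
tau <- cor(t(ranks), method = "kendall")

# Assumed conversion: distance = 1 - correlation (ranges from 0 to 2)
rho_dist <- as.dist(1 - rho)
tau_dist <- as.dist(1 - tau)
rho_dist
tau_dist
```

Under this conversion, perfectly reversed rankings (site1 and site3) sit at the maximum distance of 2.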

  10. Dichotomies • Can use interval/ratio measures • Numerous options based on 2x2 table • Many similarity measures based on weighting of presence/presence and absence/absence • Subtract from 1 to create distances

  11. Definitions (a = present in both, b and c = present in only one, d = absent in both) • Simple Matching Coefficient: (a+d)/(a+b+c+d) • Jaccard’s Coefficient (asymmetric binary): a/(a+b+c) • Phi and Yule’s Q are measures of association • The ade4 and proxy packages have many different options for dichotomies
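
The two coefficients can be computed by hand from the cells of the 2x2 table, as in this R sketch. The vectors `x` and `y` are hypothetical, and the cell labels assume the usual convention (a = present in both, b and c = present in only one, d = absent in both).

```r
# Two hypothetical objects scored on six presence/absence traits
x <- c(1, 1, 0, 1, 0, 0)
y <- c(1, 0, 0, 1, 1, 0)

# Cells of the 2x2 table
n11 <- sum(x == 1 & y == 1)  # a: present in both
n10 <- sum(x == 1 & y == 0)  # b: present in x only
n01 <- sum(x == 0 & y == 1)  # c: present in y only
n00 <- sum(x == 0 & y == 0)  # d: absent in both

simple_matching <- (n11 + n00) / (n11 + n10 + n01 + n00)
jaccard         <- n11 / (n11 + n10 + n01)

# Subtract from 1 to turn each similarity into a distance
1 - simple_matching
1 - jaccard

# Base R's "binary" method gives the Jaccard distance directly
binary_dist <- dist(rbind(x, y), method = "binary")
binary_dist
```

The two coefficients differ only in whether joint absences (d) count as agreement.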

  12. Nominal Variables • Similarity can be measured with chi-square based measures • Convert to multiple dichotomies • E.g. Temper: Sand, Silt, Gravel becomes three variables: TSand, TSilt, TGravel • Then use measures for dichotomies/metric variables
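
One way to do the conversion in R is with `model.matrix()`, sketched below. The `sherds` data frame is hypothetical; the column renaming is only there to reproduce the TSand, TSilt, TGravel names from the slide.

```r
# Hypothetical sherds with a nominal Temper variable
sherds <- data.frame(
  Temper = factor(c("Sand", "Silt", "Gravel", "Sand"),
                  levels = c("Sand", "Silt", "Gravel"))
)

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ Temper - 1, data = sherds)
colnames(dummies) <- sub("^Temper", "T", colnames(dummies))
dummies

# The dichotomies can now go into a measure for binary data
temper_dist <- dist(dummies, method = "binary")
temper_dist
```

Sherds with the same temper are at distance 0 and sherds with different tempers are at distance 1.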

  13. Multiple Types • Gower’s Index is the only one of these measures that computes a similarity index from variables with different levels of measurement. Take the mean of the per-variable scores: • Presence/Absence – Jaccard • Categorical – 1 if the same, 0 if not • Interval/Ratio/Ranks – 1 minus the absolute difference divided by the range
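
The per-variable scoring can be sketched by hand in R. Everything here is hypothetical: the `pots` data, the column names, and the helper `gower_sim()`, which hard-codes which columns are of which type. It also assumes the Jaccard-style rule means a presence/absence variable is skipped when both objects lack the trait, and that numeric variables score 1 minus the scaled absolute difference so that the result is a similarity.

```r
# Hypothetical pots with mixed variable types
pots <- data.frame(
  Handle    = c(1, 1, 0),                 # presence/absence
  Slip      = c(0, 0, 0),                 # presence/absence
  Temper    = c("Sand", "Silt", "Sand"),  # categorical
  Thickness = c(4, 6, 8)                  # interval/ratio
)

# Gower similarity between rows i and j of this particular data frame
gower_sim <- function(i, j, data) {
  scores <- c()
  # Presence/absence: joint absences are skipped (Jaccard-style)
  for (v in c("Handle", "Slip")) {
    if (data[i, v] == 1 || data[j, v] == 1) {
      scores <- c(scores, as.numeric(data[i, v] == data[j, v]))
    }
  }
  # Categorical: 1 if the same, 0 if not
  scores <- c(scores, as.numeric(data$Temper[i] == data$Temper[j]))
  # Interval/ratio: 1 minus absolute difference divided by the range
  rng <- diff(range(data$Thickness))
  scores <- c(scores, 1 - abs(data$Thickness[i] - data$Thickness[j]) / rng)
  mean(scores)
}

gower_sim(1, 2, pots)  # mean of 1 (Handle), 0 (Temper), 0.5 (Thickness)
gower_sim(1, 3, pots)  # mean of 0 (Handle), 1 (Temper), 0 (Thickness)
```

For real data, a packaged implementation such as `daisy(..., metric = "gower")` in the cluster package is likely a better choice than a hand-rolled helper.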

  14. Issues • Weighting – how to weight variables with different variances – standardization, weighting • Correlations between variables – how (and whether) to take correlations into account (Mahalanobis Distances)
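
The weighting issue can be illustrated with a small R sketch: without standardization, the variable with the largest variance dominates the Euclidean distance. The `artifacts` measurements are hypothetical, and z-score standardization with `scale()` is only one of the possible weighting schemes.

```r
# Hypothetical artifacts: Length varies far more than Thickness
artifacts <- data.frame(
  Length    = c(40, 55, 47, 62, 51),
  Thickness = c(5.0, 6.5, 5.5, 7.0, 6.0)
)

# Raw distances are driven almost entirely by Length
raw_dist <- dist(artifacts)

# scale() centers each column and divides by its standard deviation,
# so both variables contribute on the same footing
std_artifacts <- scale(artifacts)
std_dist <- dist(std_artifacts)

raw_dist
std_dist
```

Standardization equalizes variances but does nothing about correlation between variables; that is the case Mahalanobis distance is meant to handle.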

  15. Distance Matrix • For simple analyses, dist() in base R provides euclidean, maximum, manhattan, canberra, binary (Jaccard), and minkowski • Other packages provide many additional measures; see ade4, amap, cluster, ecodist, labdsv, proxy, and vegan
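
A quick way to see how the `dist()` methods differ is to apply them to the same pair of points. The two points below are hypothetical and chosen so the results are easy to check by hand.

```r
# Two hypothetical points forming a 3-4-5 right triangle with the origin
pts <- rbind(p1 = c(0, 0), p2 = c(3, 4))

d_euclid    <- dist(pts, method = "euclidean")         # sqrt(3^2 + 4^2)
d_manhattan <- dist(pts, method = "manhattan")         # 3 + 4
d_maximum   <- dist(pts, method = "maximum")           # max(3, 4)
d_minkowski <- dist(pts, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)
d_canberra  <- dist(pts, method = "canberra")          # 3/3 + 4/4

c(euclidean = d_euclid, manhattan = d_manhattan, maximum = d_maximum,
  minkowski = d_minkowski, canberra = d_canberra)
```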

  16. # Load Darl
      # Rcmdr to create scatterplot matrix
      > Euclid <- dist(Darl[,2:5])
      > Euclid
                35-3043   35-2871   35-2866   36-3619   36-3520
      35-2871 11.437657
      35-2866  5.380520  6.542935
      36-3619 14.621217  3.682391  9.570266
      36-3520 15.309148  4.068169 10.163661  1.757840
      36-3036  7.760155  4.442972  2.495997  7.195832  7.860662
      > scatterplot(Width~Length, reg.line=lm, smooth=FALSE, spread=FALSE, pch=16, id.n=6, boxplots=FALSE, ellipse=TRUE, grid=FALSE, data=Darl)
      # colMeans() gives the centroid; mean() on a data frame no longer returns column means
      > mahalanobis(Darl[,2:3], colMeans(Darl[,2:3]), cov=cov(Darl[,2:3]))
        35-3043   35-2871   35-2866   36-3619   36-3520   36-3036
      2.2577596 1.8173684 0.4641912 2.9652763 1.7527347 0.7426699

  17. > install.packages("ecodist")
      > library(ecodist)
      > Mahal <- distance(Darl[,2:3], method="mahalanobis")
      > Mahal
                35-3043   35-2871   35-2866   36-3619   36-3520
      35-2871 4.9367446
      35-2866 0.6900956 2.8905096
      36-3619 8.5903617 7.5849187 4.7250487
      36-3520 6.8826044 0.6084649 3.6631704 4.9720621
      36-3036 2.4467510 4.8835727 0.8163226 1.9192663 4.3901066
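
Pairwise Mahalanobis distances can also be obtained in base R without an extra package, sketched below under some assumptions. The `pts` measurements are hypothetical stand-ins for `Darl[,2:3]`, so the numbers will not match the output above. The approach transforms the data with the Cholesky factor of the covariance matrix and then takes ordinary Euclidean distances; it returns unsquared distances, whereas base R's `mahalanobis()` returns squared ones. The slide does not say which scale the ecodist result uses.

```r
# Hypothetical Length/Width measurements for six points
pts <- data.frame(
  Length = c(42.8, 40.5, 37.5, 40.3, 30.6, 41.8),
  Width  = c(15.8, 17.4, 16.3, 16.1, 17.1, 16.8)
)
X <- as.matrix(pts)
S <- cov(X)

# chol(S) returns upper-triangular R with t(R) %*% R == S.
# Multiplying by solve(R) removes the scale and correlation, so
# Euclidean distance on the result is Mahalanobis distance on X.
whitened <- X %*% solve(chol(S))
mahal_pairs <- dist(whitened)
mahal_pairs

# Cross-check one pair against mahalanobis(), which is squared
check_sq <- mahalanobis(X[1, ], center = X[2, ], cov = S)
```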

  18. # Rcmdr
      > .PC <- princomp(~Length+Weight, cor=TRUE, data=Darl)
      > Darl$PC1 <- .PC$scores[,1]
      > Darl$PC2 <- .PC$scores[,2]
      # Typed commands
      > PCDist <- dist(Darl[,6:7])
      > PCDist
                35-3043   35-2871   35-2866   36-3619   36-3520
      35-2871 2.5498737
      35-2866 2.1968323 1.1918768
      36-3619 3.7858013 1.2539806 1.9883494
      36-3520 4.2220041 1.8034110 2.1957351 0.7029308
      36-3036 2.6677120 0.9201698 0.5717135 1.4339465 1.6290415
      > scatterplot(PC2~PC1, reg.line=FALSE, smooth=FALSE, spread=FALSE, grid=FALSE, boxplots=FALSE, pch=16, ellipse=TRUE, id.n=6, span=0.5, data=Darl)
      [1] "35-3043" "35-2866" "36-3619" "36-3520" "35-2871" "36-3036"
