Chapter 1

104 Views

Download Presentation
## Chapter 1

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Chapter 1**Introduction to Clustering**Section 1.1**Introduction**Objectives**• Introduce clustering and unsupervised learning. • Explain the various forms of cluster analysis. • Outline several key distance metrics used as estimates of experimental unit similarity.**Definition**• “Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.” • B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”**Unsupervised Learning**Learning without a priori knowledge about the classification of samples; learning without a teacher. Kohonen (1995), “Self-Organizing Maps”**Section 1.2**Types of Clustering**Objectives**• Distinguish between the two major classes of clustering methods: • hierarchical clustering • optimization (partitive) clustering.**Hierarchical Clustering**Iteration Agglomerative Divisive 1 2 3 4**(error)**(error) (error) Propagation of Errors Iteration 1 2 3 4**Old location**X X X X X X X X X New location X X X “Seeds” Observations Initial State Final State Optimization (Partitive) Clustering**Heuristic Search**• Find an initial partition of the n objects into g groups. • Calculate the change in the error function produced by moving each observation from its own cluster to another group. • Make the change resulting in the greatest improvement in the error function. • Repeat steps 2 and 3 until no move results in improvement.**Section 1.3**Similarity Metrics**Objectives**• Define similarity and what comprises a good measure of similarity. • Describe a variety of similarity metrics.**What Is Similarity?**• Although the concept of similarity is fundamental to our thinking, it is also often difficult to precisely quantify. • Which is more similar to a duck: a crow or a penguin? • The metric that you choose to operationalize similarity (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.**What Makes a Good Similarity Metric?**• The following principles have been identified as a foundation of any good similarity metric: • symmetry: d(x,y) = d(y,x) • non-identical distinguishability: if d(x,y) 0 then x y • identical non-distinguishability: if d(x,y) = 0 then x = y • Some popular similarity metrics (for example, correlation) fail to meet one or more of these criteria.**Euclidean Distance Similarity Metric**• Pythagorean Theorem: The square of the hypotenuse is equal to the sum of the squares of the other two sides. (x1, x2) x2 (0, 0) x1**(x1,x2)**City Block Distance Similarity Metric • City block (Manhattan) distance is the distance between two points measured along axes at right angles. (w1,w2)**.**. . . . . . . . Marie . . Marie . . . . • No Similarity . . . . . . . . . . . . . . . . . . . . Tom . Jerry . . . • Dissimilar Jerry Correlation Similarity Metrics • Similar Tom**The Problem with Correlation**• VariableObservation 1Observation 2 • x1 5 51 • x2 4 42 • x3 3 33 • x4 2 24 • x5 1 15 • Mean 3 33 • Std. Dev. 1.5811 14.2302 • The correlation between observations 1 and 2 is a perfect 1.0, but are the observations really similar?**Density Estimate Based Similarity Metrics**• Clusters can be seen as areas of increased observation density. Similarity is a function of the distance between the identified density bubbles (hyper-spheres). similarity Density Estimate 2 (Cluster 2) Density Estimate 1 (Cluster 1)**Hamming Distance Similarity Metric**• 1 2 3 4 5 … 17 • Gene A0 1 10010 0 10 01 1 1 001 • Gene B 0 1 11000 0 11 11 1 1 011 • DH = 0 0 0 1 0 1 0 0 0 11 0 0 0 0 1 0 = 5 • Gene expression levels under 17 conditions • (low=0, high=1)**The DISTANCE Procedure**• General form of the DISTANCE procedure: • Both the PROC DISTANCE statement and the VAR statement are required. PROCDISTANCEMETHOD=method <options> ; COPY variables; VAR level (variables < / option-list >) ; RUN;**Generating Distancesch1s3d1**• This demonstration illustrates the impact on cluster formation of two distance metrics generated by the DISTANCE procedure.**Section 1.4**Classification Performance**Objectives**• Use classification matrices to determine the quality of a proposed cluster solution. • Use the chi-square and Cramer’s V statistic to assess the relative strength of the derived association.**No Solution**Typical Solution Quality of the Cluster Solution Perfect Solution**Probability**Frequency Probability of Cluster Assignment The probability that a cluster number represents a given class is given by the cluster’s proportion of the row total.**The Chi-Square Statistic**• The chi-square statistic (and associated probability) • determine whether an association exists • depend on sample size • do not measure the strength of the association.**S**T R O N G WEAK 0 1 C R A M E R ' S V S T A T I S T I C Measuring Strength of an Association Cramer’s V ranges from -1 to 1 for 2X2 tables.