Clustering

Instructor: Max Welling

ICS 178 Machine Learning & Data Mining

Unsupervised Learning
  • In supervised learning we were given attributes & targets (e.g. class labels).
  • In unsupervised learning we are only given attributes.
  • Our task is to discover structure in the data.
  • Example: the data may be structured in clusters:

Is this a good clustering?

Why Discover Structure?
  • Often, the result of an unsupervised learning algorithm is a new representation for the same data. This new representation should be more meaningful and could be used for further processing (e.g. classification).
    • Clustering: The new representation is given by the label of the cluster to which the data-point belongs. This tells us which data-cases are similar to each other.
  • The new representation is smaller and hence more convenient computationally.
    • Clustering: Each data-case is now encoded by its cluster label. This is a lot cheaper than storing its attribute values.
    • Collaborative filtering (CF): We can group the users into user-communities and/or the movies into movie genres. If we need to predict something, we simply pick the average rating in the group.
Clustering: K-means
  • We iterate two operations:
    • 1. Update the assignment of data-cases to clusters.
    • 2. Update the location of the cluster.
  • Denote by $z_i = c$ the assignment of data-case “i” to cluster “c”.
  • Denote by $\mu_c$ the position of cluster “c” in a d-dimensional space.
  • Denote by $x_i$ the location of data-case “i”.
  • Then iterate until convergence:
    • 1. For each data-case, compute distances to each cluster and pick the closest one: $z_i = \arg\min_c \|x_i - \mu_c\|^2$
    • 2. For each cluster location, compute the mean location of all data-cases assigned to it: $\mu_c = \frac{1}{N_c} \sum_{i \in S_c} x_i$, where $N_c$ is the number of data-cases in cluster c and $S_c$ is the set of data-cases assigned to cluster c.
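
The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference implementation: the function name `kmeans`, the initialization on randomly chosen data-cases, the stopping rule (assignments no longer change), and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain k-means on an (N, d) array X; returns (assignments, cluster locations)."""
    rng = np.random.default_rng(seed)
    # Assumption: start the cluster locations on K distinct, randomly chosen data-cases.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    z = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 1: squared distance from every data-case to every cluster, shape (N, K),
        # then pick the closest cluster for each data-case.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = d2.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments stopped changing: converged
        z = z_new
        # Step 2: move each cluster to the mean of the data-cases assigned to it.
        for c in range(K):
            members = X[z == c]
            if len(members) > 0:  # assumption: an empty cluster stays where it is
                mu[c] = members.mean(axis=0)
    return z, mu

# Two well-separated blobs in 2-D (synthetic data, for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
z, mu = kmeans(X, K=2)
```

Here `z[i]` is the cluster label of data-case i and `mu[c]` is the location of cluster c.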

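K-means
  • Cost function: $C = \sum_i \|x_i - \mu_{z_i}\|^2$, the total squared distance between each data-case $x_i$ and the location $\mu_{z_i}$ of the cluster it is assigned to.
  • Each step in k-means decreases this cost function (or leaves it unchanged once converged).
  • Initialization is often very important, since C has very many local minima.
  • Relatively good initialization: place cluster locations on K randomly chosen data-cases.
  • How to choose K?
  • Add a complexity term to C that penalizes large K, and minimize also over K.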
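
To make the claim about the cost function concrete, the sketch below records the sum of squared distances after every iteration and runs k-means from several random initializations. The helper names `kmeans_cost` and `run_and_track`, the synthetic data, and the fixed number of iterations are assumptions chosen for the illustration.

```python
import numpy as np

def kmeans_cost(X, z, mu):
    """Sum of squared distances from each data-case to its assigned cluster location."""
    return float(((X - mu[z]) ** 2).sum())

def run_and_track(X, K, seed, n_iters=20):
    """Run k-means from a random data-case initialization; return the cost after each iteration."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    costs = []
    for _ in range(n_iters):
        # Assignment step: the closest cluster for every data-case.
        z = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Update step: the mean of the assigned data-cases (empty clusters are left alone).
        for c in range(K):
            if np.any(z == c):
                mu[c] = X[z == c].mean(axis=0)
        costs.append(kmeans_cost(X, z, mu))
    return costs

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # synthetic data, for illustration only
costs = run_and_track(X, K=4, seed=0)
# Different initializations may end in different local minima of the cost.
final_costs = [run_and_track(X, K=4, seed=s)[-1] for s in range(5)]
```

The sequence `costs` never goes up from one iteration to the next. One common way to cope with local minima is to keep the run with the lowest entry in `final_costs`.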
Vector Quantization
  • K-means divides the space up into a Voronoi tessellation.
  • Every point on a tile is summarized by the code-book vector “+”.
  • This clearly allows for data compression!
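
The compression idea can be sketched as an encode/decode pair: each point is replaced by the index of its nearest code-book vector, and decoding looks that vector up again. The function names `vq_encode` and `vq_decode` and the tiny hand-made code-book are hypothetical; in practice the code-book would be the cluster locations found by k-means.

```python
import numpy as np

def vq_encode(X, codebook):
    """Map each row of X to the index of its nearest code-book vector."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def vq_decode(codes, codebook):
    """Reconstruct each point as the code-book vector of the tile it fell on."""
    return codebook[codes]

# Hypothetical code-book with two entries (e.g. two k-means cluster locations).
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[0.5, -0.2], [9.0, 11.0], [1.0, 1.0]])
codes = vq_encode(X, codebook)      # one small integer per data-case
X_hat = vq_decode(codes, codebook)  # lossy reconstruction
```

Only the code-book and one integer per data-case need to be stored, instead of d real numbers per data-case. The price is that every point on a tile is reconstructed as the same vector.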
Mixtures of Gaussians
  • K-means assigns each data-case to exactly 1 cluster. But what if clusters are overlapping?
  • Maybe we are uncertain as to which cluster it really belongs.
  • The mixtures of Gaussians algorithm assigns data-cases to clusters with a certain probability.
MoG Clustering

Covariance determines the shape of these contours.

  • Idea: fit these Gaussian densities to the data, one per cluster.
EM Algorithm: E-step
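  • $r_{ic} = \frac{\pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(x_i \mid \mu_{c'}, \Sigma_{c'})}$, where $\mathcal{N}(x_i \mid \mu_c, \Sigma_c)$ is the Gaussian density of cluster “c” (mean $\mu_c$, covariance $\Sigma_c$) evaluated at data-case $x_i$.
  • $r_{ic}$ is the probability that data-case “i” belongs to cluster “c”.
  • $\pi_c$ is the a priori probability of being assigned to cluster “c”.
  • Note that if the Gaussian has high probability on data-case “i” (i.e. the bell-shape is on top of the data-case) then it claims high responsibility for this data-case.
  • The denominator just normalizes the responsibilities so that they sum to 1: $\sum_c r_{ic} = 1$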
EM Algorithm: M-Step
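
  • $N_c = \sum_i r_{ic}$: total responsibility claimed by cluster “c”, where $r_{ic}$ is the responsibility of cluster “c” for data-case “i”.
  • $\pi_c = N_c / N$: expected fraction of the N data-cases assigned to this cluster.
  • $\mu_c = \frac{1}{N_c} \sum_i r_{ic} \, x_i$: weighted sample mean where every data-case is weighted according to the probability that it belongs to that cluster.
  • $\Sigma_c = \frac{1}{N_c} \sum_i r_{ic} \, (x_i - \mu_c)(x_i - \mu_c)^T$: weighted sample covariance.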

EM for MoG
  • EM stands for “expectation maximization”. We won’t go through the derivation.
  • If we are forced to decide, we should assign a data-case to the cluster which claims highest responsibility.
  • For a new data-case, we should compute responsibilities as in the E-step and pick the cluster with the largest responsibility.
  • E and M steps should be iterated until convergence (which is guaranteed, although only to a local optimum).
  • Every step increases, or at least never decreases, the following objective function (which is the total log-probability of the data under the model we are learning): $L = \sum_i \log \sum_c \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)$
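
The E-step, M-step and objective can be put together in a short NumPy sketch. It is an illustration under stated assumptions rather than a definitive implementation: the names `gaussian_pdf` and `em_mog`, the initialization (means on random data-cases, every covariance set to the covariance of the whole data set, equal priors), the fixed number of iterations, and the small ridge `reg` added to each covariance for numerical stability are all choices made for the example.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of a Gaussian with mean mu and covariance Sigma at each row of X."""
    d = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T   # Sigma^{-1} (x - mu) for every row
    quad = (diff * sol).sum(axis=1)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_mog(X, K, n_iters=50, seed=0, reg=1e-6):
    """EM for a mixture of K Gaussians; returns (pi, mu, Sigma, r, log_liks)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Assumed initialization: means on random data-cases, shared data covariance, equal priors.
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X, rowvar=False) + reg * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    log_liks = []
    for _ in range(n_iters):
        # E-step: responsibilities r[i, c], normalized over clusters.
        p = np.stack([pi[c] * gaussian_pdf(X, mu[c], Sigma[c]) for c in range(K)], axis=1)
        log_liks.append(float(np.log(p.sum(axis=1)).sum()))  # objective before this update
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted counts, priors, means and covariances.
        Nc = r.sum(axis=0)
        pi = Nc / N
        mu = (r.T @ X) / Nc[:, None]
        for c in range(K):
            diff = X - mu[c]
            Sigma[c] = (r[:, c, None] * diff).T @ diff / Nc[c] + reg * np.eye(d)
    return pi, mu, Sigma, r, log_liks

# Synthetic two-blob data, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
pi, mu, Sigma, r, log_liks = em_mog(X, K=2)
hard = r.argmax(axis=1)  # forced decision: the cluster claiming the highest responsibility
```

The returned `r` holds the responsibilities from the last E-step, and `log_liks` holds the objective at the start of each iteration, which should not decrease (up to the tiny effect of `reg`).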
Agglomerative Hierarchical Clustering

Every data-case is a cluster

  • Define a “distance” between clusters (later).
  • Initially, every data-case is its own cluster.
  • At each iteration, compute the distances between all existing clusters (you can store distances and avoid their re-computation).
  • Merge the two closest clusters into a single cluster.
  • Update your “dendrogram”.
Iteration 3
  • This way you build a hierarchy.
  • Complexity Order (why?)

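  • Single linkage, $d(A, B) = \min_{a \in A, b \in B} \|x_a - x_b\|$ (distance between the closest pair of data-cases, one from each cluster): produces a minimal spanning tree.
  • Complete linkage, $d(A, B) = \max_{a \in A, b \in B} \|x_a - x_b\|$ (distance between the farthest pair): avoids elongated clusters.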
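
The merging procedure can be sketched as a deliberately naive loop. Two cluster distances are offered: the distance between the closest pair of data-cases (single linkage) and between the farthest pair (complete linkage). The function name `agglomerative`, the `linkage` argument, and the representation of the dendrogram as a list of merges are assumptions made for the illustration.

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Naive agglomerative clustering; returns the list of merges (the dendrogram).

    Each merge is (members_a, members_b, distance), with members as tuples of row indices.
    """
    N = len(X)
    # Pairwise distances between data-cases, computed once and reused.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    reduce_fn = np.min if linkage == "single" else np.max  # anything else means "complete"
    clusters = [(i,) for i in range(N)]  # initially every data-case is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = reduce_fn(D[np.ix_(clusters[a], clusters[b])])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], float(dist)))
        merged = clusters[a] + clusters[b]
        # Remove the two merged clusters (higher index first) and add their union.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return merges

X = np.array([[0.0], [1.0], [5.0], [7.0]])  # four points on a line
merges = agglomerative(X, linkage="single")
# merges == [((0,), (1,), 1.0), ((2,), (3,), 2.0), ((0, 1), (2, 3), 4.0)]
```

This version recomputes every cluster-to-cluster distance at each iteration, so it is slower than an implementation that stores and updates those distances. With `linkage="complete"` the last merge on the same data happens at distance 7.0 instead of 4.0.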

Gene Expression Data / Micro-array Data
  • The expression level of genes is tested under different experimental conditions.
  • We would like to find the genes which co-express in a subset of conditions.
  • Both genes and conditions are clustered and shown as dendrograms.
Exercise I
  • Imagine I have run a clustering algorithm on some data describing 3 attributes of cars: height, weight, length. I have found two clusters. An expert comes by and tells you that class 1 is really Ferraris while class 2 is Hummers.
  • A new data-case (car) is presented, i.e. you get to see the height, weight, length. Describe how you can use the output of your clustering, including the information obtained from the expert, to classify the new car as a Ferrari or a Hummer. Be very precise: use an equation or pseudo-code to describe what to do.
  • You add the new car to the dataset and run K-means starting at the converged assignments and cluster means obtained from before. Is it possible that the assignments of the old data change due to the addition of the new data-case?
Exercise II
  • We classify data according to the 3-nearest neighbors (3-NN) rule. Explain in detail how this works.
  • Which decision surface do you think is smoother: the one for 1-NN or for 100-NN? Explain.
  • Is k-NN a parametric or non-parametric method?
  • Give an important property of non-parametric classification methods.
  • We will do linear regression on data of the form $(X_n, Y_n)$, where $X_n$ and $Y_n$ are real values: $Y_n = A X_n + b + \epsilon_n$, where A and b are parameters and $\epsilon_n$ is the noise variable.
  • Provide the equation for the total Error of the data-items.
  • We want to minimize the Error. With respect to what?
  • You are given a new attribute $X_{new}$. What would you predict for $Y_{new}$?