## Clustering and NLP

**Clustering and NLP**

Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille, Andrew Moore, and others

NLP

**Outline**

- Clustering Overview
- Sample Clustering Techniques for NLP
  - K-means
  - Agglomerative
  - Model-based (EM)

**What is clustering?**

- Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns objects to groups.

**Why should we care about clustering?**

- Clustering is a basic step in most data mining procedures. Examples:
  - Clustering movie viewers for movie ranking.
  - Clustering proteins by their functionality.
  - Clustering text documents for content similarity.

**Clustering as Data Exploration**

Clustering is one of the most widely used tools for exploratory data analysis.

- Social Sciences
- Biology
- Astronomy
- Computer Science
- ...

All apply clustering to gain a first understanding of the structure of large data sets.

**There are Many Clustering Tasks**

"Clustering" is an ill-defined problem.

- There are many different clustering tasks, leading to different clustering paradigms.

**Issues**

The clustering problem: given a set of objects, find groups of similar objects.

1. What is similar? Define appropriate metrics.
2. What makes a good group? Groups that contain the highest average similarity between all pairs? Groups that are most separated from neighboring groups?
3. How can you evaluate a clustering algorithm?

**Formal Definition**

Given a data set S and a clustering "objective" function f, find a partition P of S that maximizes (or minimizes) f(P).

A partition is a set of subsets of S such that the subsets don't intersect, and their union is equal to S.
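
The formal definition can be tried out on a toy data set. The following is a minimal sketch, not part of the original slides: the helper names `is_partition` and `avg_within_distance`, the choice of 1-D numbers as objects, and the use of absolute difference as the distance are all assumptions made for illustration.

```python
from itertools import combinations

def is_partition(parts, s):
    """True if the subsets in parts are pairwise disjoint and their union is s."""
    parts = [set(p) for p in parts]
    disjoint = all(a.isdisjoint(b) for a, b in combinations(parts, 2))
    return disjoint and set().union(*parts) == set(s)

def avg_within_distance(parts):
    """Hypothetical objective f(P): mean distance over all same-cluster pairs.

    Distance here is the absolute difference between numbers (an assumption).
    """
    dists = [abs(a - b) for p in parts for a, b in combinations(sorted(p), 2)]
    return sum(dists) / len(dists) if dists else 0.0

S = {1, 2, 3, 10, 11, 12}
good = [{1, 2, 3}, {10, 11, 12}]   # groups nearby values together
bad = [{1, 10, 3}, {2, 11, 12}]    # mixes distant values

print(is_partition(good, S), is_partition(bad, S))
print(avg_within_distance(good), avg_within_distance(bad))
```

Both `good` and `bad` are valid partitions of `S`; only the objective function tells them apart, which is why the choice of f matters so much.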

**Sample Objective Functions**

- Objective 1: Minimize the average distance between points in the same cluster.
- Objective 2: Maximize the margin (smallest distance) between neighboring clusters.
- Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering plus the number of bits needed to describe the points in each cluster.

**More Issues**

- Having an objective function f gives a way of evaluating a clustering. But the real f is usually not known!
- Efficiency: comparing N points to each other means making $O(N^2)$ comparisons.
- Curse of dimensionality: the more features in your data, the more likely the clustering algorithm is to get it wrong.

**Clustering as "Unsupervised" Learning**

Clustering is just like ML, except ...

[Figure: supervised learning example mapping Input to Output, with H = space of Boolean functions and f = X1 ∧ ¬X3 ∧ ¬X4]

**Clustering as "Unsupervised" Learning**

- Supervised learning has:
  - Labeled training examples
  - A space Y of possible labels
- Unsupervised learning has:
  - Unlabeled training examples
  - No information (or limited information) about the space of possible labels

**Some Notes on Complexity**

- The ML example used a space of Boolean functions of N Boolean variables
  - $2^{2^N}$ possible functions
  - But many possibilities are eliminated by training data and assumptions
- How many possible clusterings?
  - ~2N * K / K!, for K clusters (K>1)
  - No possibilities eliminated by training data
  - Need to search for a good one efficiently!

**Clustering Problem Formulation**

- General assumptions
  - Each data item is a tuple (vector)
  - Values of tuples are nominal, ordinal or numerical
  - Similarity (or distance) function is provided
- For pure numerical tuples, for example:
  - $\mathrm{sim}(d_i, d_j) = \sum_k d_{i,k}\, d_{j,k}$
  - $\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j)$
  - ...and many more (slide after next)

**Similarity Measures in Data Analysis**

- For ordinal values, e.g.
  "small," "medium," "large," "X-large"
  - Convert to numerical assuming constant …on a normalized [0,1] scale, where max(v)=1, min(v)=0, and others interpolate
  - E.g. "small"=0, "medium"=0.33, etc.
  - Then, use numerical similarity measures
  - Or, use a similarity matrix (see next slide)

**Similarity Measures (cont.)**

- For nominal values, e.g. "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
  - Binary rule: if $d_{i,k} = d_{j,k}$, then sim = 1, else 0
  - Use an underlying semantic property, e.g. Sim(Boston, LA) = $\mathrm{dist}(\text{Boston}, \text{LA})^{-1}$, or Sim(Boston, LA) = |size(Boston) - size(LA)| / max(size(cities))
  - Or, use a similarity matrix

**Similarity Matrix**

|        | tiny | little | small | medium | large | huge |
|--------|------|--------|-------|--------|-------|------|
| tiny   | 1.0  | 0.8    | 0.7   | 0.5    | 0.2   | 0.0  |
| little |      | 1.0    | 0.9   | 0.7    | 0.3   | 0.1  |
| small  |      |        | 1.0   | 0.7    | 0.3   | 0.2  |
| medium |      |        |       | 1.0    | 0.5   | 0.3  |
| large  |      |        |       |        | 1.0   | 0.8  |
| huge   |      |        |       |        |       | 1.0  |

- Diagonal must be 1.0
- Monotonicity property must hold
- No linearity (value interpolation) assumed
- Qualitative transitive property must hold

**Document Clustering Techniques**

- Similarity or distance measure: alternative choices
  - Cosine similarity
  - Euclidean distance
  - Kernel functions
  - Language modeling: $P(y \mid \text{model}_x)$, where x and y are documents
  - Kullback-Leibler distance ("relative entropy")

**Some Clustering Methods**

- K-means and K-medoids algorithms:
  - CLARANS [Ng and Han, VLDB 1994]
- Hierarchical algorithms
  - CURE [Guha et al, SIGMOD 1998]
  - BIRCH [Zhang et al, SIGMOD 1996]
  - CHAMELEON [Karypis et al, COMPUTER, 32]
- Density based algorithms
  - DENCLUE [Hinneburg, Keim, KDD 1998]
  - DBSCAN [Ester et al, KDD 96]
- Clustering with obstacles [Tung et al, ICDE 2001]

**K-Means**

**K-means and K-medoids algorithms**

- Objective function: minimize the sum of squared distances of points to a cluster representative (centroid)
- Efficient iterative algorithms (O(n))

**K-Means Clustering**

1. Select K seed centroids s.t. $d(c_i, c_j) > d_{min}$
2. Assign points to clusters by minimum distance to centroid
3.
   Compute new cluster centroids (the mean of the points assigned to each cluster)
4. Iterate steps 2 & 3 until no points change clusters

**K-Means Clustering: Initial Data Points**

Step 1: Select k random seeds s.t. $d(c_i, c_j) > d_{min}$. [Figure: initial seeds (k=3)]

**K-Means Clustering: First-Pass Clusters**

Step 2: Assign points to clusters by minimum distance. [Figure: initial seeds]

**K-Means Clustering: Seeds Centroids**

Step 3: Compute new cluster centroids. [Figure: new centroids]

**K-Means Clustering: Second Pass Clusters**

Step 4: Recompute centroids.

**K-Means Clustering: Iterate Until Stability**

[Figure: new centroids] And so on.

**Question**

If the space of possible clusterings is exponential, why is it that K-means can find one in O(n) time?

**Problems with K-means type algorithms**

- Clusters are assumed to be approximately spherical
- High dimensionality is a problem
- The value of K is an input parameter

**Hierarchical Clustering**

- Quadratic algorithms
- Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]

**Hierarchical Agglomerative Clustering**

- Create N single-document clusters
- For i in 1..N-1:
  - Merge the two clusters with greatest similarity

Information Retrieval and Digital Libraries

**Hierarchical Agglomerative Clustering**

Hierarchical agglomerative clustering gives a hierarchy of clusters.

- This makes it easier to explore the set of possible k-cluster values to choose the best number of clusters.

[Figure: cluster hierarchy, with labels 3, 4, 5]

**High density variations**

- Intuitively "correct" clustering
- HAC-generated clusters
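
The agglomerative loop (start with single-item clusters, then repeatedly merge the two most similar clusters) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the slides' implementation: the `similarity` function, the use of 1-D numbers in place of documents, and the single-link rule for comparing clusters are all hypothetical choices, and the naive search over all cluster pairs is slower than the quadratic algorithms mentioned earlier.

```python
from itertools import combinations

def similarity(a, b):
    """Hypothetical similarity for 1-D points: closer values score higher."""
    return 1.0 / (1.0 + abs(a - b))

def single_link(c1, c2):
    """Cluster similarity = similarity of the most similar cross-cluster pair."""
    return max(similarity(a, b) for a in c1 for b in c2)

def hac(points, k=1, cluster_sim=single_link):
    """Merge the two most similar clusters until only k clusters remain (k >= 1)."""
    clusters = [[p] for p in points]   # N single-item clusters
    merges = []                        # record of merges, i.e. the hierarchy
    while len(clusters) > k:
        # Find the pair of clusters with the greatest similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = hac([1.0, 1.2, 5.0, 5.1, 9.0], k=2)
print(clusters)
print(len(merges), "merges performed")
```

Because `merges` records every step, one run with `k=1` yields the full hierarchy, and any intermediate number of clusters can be read off from it.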

**Document Clustering Techniques**

- Example: group documents based on similarity.
  - [Figure: similarity matrix]
  - Thresholding at a similarity value of .9 yields:
    - a complete graph, C1 = {1,4,5} (complete linkage)
    - a connected graph, C2 = {1,4,5,6} (single linkage)

For clustering we need three things:

- A similarity measure for pairwise comparison between documents
- A clustering criterion (complete link, single link, …)
- A clustering algorithm

**Document Clustering Techniques**

- Clustering criterion: alternative linkages
  - Single-link ("nearest neighbor"): the similarity of two clusters is the similarity of their most similar pair of members
  - Complete-link: the similarity of two clusters is the similarity of their least similar pair of members
  - Average-link ("group average clustering", or GAC): the similarity of two clusters is the average similarity over all pairs of members

**Hierarchical Agglomerative Clustering Methods**

- Generic agglomerative procedure (Salton '89), which results in nested clusters via iterations:
  1. Compute all pairwise document-document similarity coefficients
  2. Place each of n documents into a class of its own
  3. Merge the two most similar clusters into one:
     - replace the two clusters by the new cluster
     - recompute intercluster similarity scores w.r.t. the new cluster
  4. Repeat the above step until there are only k clusters left (note k could = 1).

**Group Agglomerative Clustering**

[Figure: points labeled 2, 1, 5, 4, 3, 6, 9, 7, 8]

**Expectation-Maximization**

**Clustering as Model Selection**

Let's look at clustering as a probabilistic modeling problem: I have some set of clusters $C_1$, $C_2$, and $C_3$. Each one has a certain probability distribution for generating points: $P(x_i \mid C_1)$, $P(x_i \mid C_2)$, $P(x_i \mid C_3)$.

**Clustering as Model Selection**

How can I determine which points belong to which cluster?

Cluster for $x_i$ = $\arg\max_j P(x_i \mid C_j)$

So, all I need is to figure out what $P(x_i \mid C_j)$ is, for each i and j. But without training data! How can I do that?
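
One common answer is Expectation-Maximization: guess the cluster distributions, compute how strongly each cluster "explains" each point, re-estimate the distributions from those soft assignments, and repeat. The sketch below is an illustration only, and it assumes that each $P(x \mid C_j)$ is a 1-D Gaussian, which the slides do not specify. The function names `gaussian_pdf`, `em_gmm_1d`, and `assign`, the initial means, and the toy data are all hypothetical, and the code is not numerically hardened for large or poorly separated data.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, init_mus, n_iter=50):
    """EM for a 1-D Gaussian mixture; init_mus are initial guesses for the means."""
    k = len(init_mus)
    mus = list(init_mus)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each cluster j for each point x.
        resp = []
        for x in xs:
            p = [weights[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each cluster from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor to avoid a zero variance
            weights[j] = nj / len(xs)
    return mus, variances, weights

def assign(x, mus, variances):
    """Cluster for x = argmax_j P(x | C_j)."""
    return max(range(len(mus)), key=lambda j: gaussian_pdf(x, mus[j], variances[j]))

xs = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
mus, variances, weights = em_gmm_1d(xs, init_mus=[0.0, 6.0])
print(mus)
print([assign(x, mus, variances) for x in xs])
```

The final `assign` step is exactly the argmax rule from the slide; EM only supplies the estimates of $P(x \mid C_j)$ that the rule needs.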