Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Lecture 7: Clustering and Decomposition of a Document to Authorial Components Moshe Koppel and NavotAkiva

Clustering • The idea is to define “natural” clusters • Position in space depends on feature set • Need to have some notion of what you want in order to choose right feature types • e.g., cluster by topic: content words • e.g., cluster by author: style features • With many features, clusters aren’t so elegant

How Many Clusters? • Some methods decide in advance • Some methods try to optimize k based on data • e.g., in previous example, would want 3 clusters • Caution: some optimization criteria biased towards one giant cluster or many tiny ones

Flat Clustering: K-Means Algorithm Select K random docs {s1, s2,… sK} as seeds. Until clustering converges or other stopping criterion: For each doc di: Assign di to the cluster cjsuch that dist(xi, sj) is minimal. For each cluster cj: sj= (cj)(Update the seeds to the centroid of each cluster)

Pick seeds Reassign clusters Compute centroids Reassign clusters x x Compute centroids x x x x K-Means Example (K=2) Reassign clusters Converged!

Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change.

Convergence • Why should the K-means algorithm ever reach a fixed point? • A state in which clusters don’t change. • K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. • EM is known to converge. • Number of iterations could be large.

Convergence of K-Means • Define goodness measure of cluster k as sum of squared distances from cluster centroid: Gk = Σi (di – ck)2 (sum over all di in cluster k) G = Σk Gk • Reassignment monotonically decreases G since each vector is assigned to the closest centroid.

Dependence on Seed Choice • Results depend on random seed selection. • Some tricks for avoiding bad seeds: • Choose seed far from any already chosen • Try out multiple seed sets If you start with B and E as centroids you converge to {A,B,C} and {D,E,F} If you start with D and F you converge to {A,B,D,E} and {C,F}

A Problem… • The methods we’ve discussed so far don’t always give very satisfying results. • Have a look at this example….

Ng et al On Spectral clustering: analysis and algorithm

Spectral Methods • Because of examples like that, spectral methods have been developed. • Here’s the idea: Imagine weighted edges between points, where weight reflects similarity. Zero out all edges except k nearest neighbors with high weights. • Now make the “cheapest” cuts that result in separate components.

Ng et al On Spectral clustering: analysis and algorithm

Spectral Clustering Properly formulated, the min Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions: • Pre-processing • Construct a matrix representation of the dataset. • Decomposition • Compute eigenvalues and eigenvectors of the matrix. • Map each point to a lower-dimensional representation based on one or more eigenvectors. • Grouping • Assign points to two or more clusters, based on the new representation.

New Problem: Separating Document Components • Often documents consist of multiple authorial components. • Our object is to tease apart the components of a composite document.

Basic Idea • Divide the document into natural chunks (e.g., chapters, paragraphs) • Vectorize chunks using some feature set • Cluster the vectors

Classic Example: The Bible • Great historic and cultural interest • Much prior research on components • Has been manually tagged in every conceivable way

Test Case Let’s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out. • Each is presumably the work of a single (primary) author. • There’s no reason to think it’s the same author. • They are thematically quite similar, though far from identical.

Clustering Jeremiah+Ezekiel • Chunks = chapters (no sequence info) • Features = bag of words • Cluster method = Ncut • # clusters = 2

Results using all words • Not too exciting. We must be picking up thematic or genre-related differences that cross books. • Let’s try using only function words.

Results using function words • Not any better. • Let’s try a new approach.

A Better Idea • Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh). • It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets. • Note that words aren’t synonyms, rather word senses are synonyms. (For example mateh=staff is a synonym of makel, but mateh=tribe is not.)

Automatically Finding Synonyms • There are various clever methods for identifying synsets, but most are not exact enough for our purpose. • Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging. • We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.) • Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)

Synonym Method • The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren’t quite right. • If two docs use the same synonym, they are similar. • If two docs use opposite synonyms, they are different. • If one of the docs uses one of the synonyms, but the other doesn’t, cosine would regard them as different. But are they? • For measuring similarity, we only consider synsets represented in both docs.

Results using synonyms • Now we’re getting somewhere. • But we’re not done yet…

Core of a Cluster • Some chapters are in a cluster because they really belong there; some just have to be somewhere. • Let’s consider only chapters near one centroid and far from the other. These are the “cores” of the respective clusters. 36 2 12 3 36 7 4

Cluster cores

Cluster cores Ezekiel 1, 10

Expanding the Core • Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary. • In fact, we can use function words as our features. • Using synonyms will just get us back where we started. • And besides, FW are generally very reliable for supervised authorship attribution.

SVM expansion of core • Two training examples are “misclassed” by SVM. • Incredibly, these are Ezekiel 1 and 10, which were part of Jeremiah core, but are on Ezekiel side of optimal SVM boundary. • The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.

Decomposing Unsegmented Text • Until now, we’ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That’s a major unrealistic assumption. • Let’s create a munged “Jer-iel” book by alternately taking a random number of verses from each book until we run out of verses. • Can we tease that apart?

Chunks Might Be Mixed xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Decomposing Unsegmented Text • Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.) • Now proceed exactly as above, until you have cluster cores. • With a bit of luck, the chunks in the core will be the “pure” chunks.

A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Decomposing Unsegmented Text • As above, use SVM to learn a classifier, using the core chunks as training examples. • But at the last step, classify individual verses instead of chapters. • Some smoothing techniques are necessary to: • Deal with the sparseness of features in individual verses • Take advantage of serial correlation

A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Results on Munged Books

Multiple Books • The same method works if we create a munged book out of k>2 constituent books. • BUT, we need to be given the k. • So we don’t yet know how many components there are, only how to best split into a given number of components.

Open Question • Can we extend this method so it can be used also for texts that don’t have canonical translations like the Bible? • Can we use this method to identify plagiarized sections of a text?

Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Presentation Transcript

The Implementation Advance Planning Document

Eigen Decomposition and Singular Value Decomposition

Clustering of non-numerical data

Matrix Decomposition and its Application in Statistics

Lecture 9: Gene expression analysis/Clustering

Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage

LECTURE 11 EXPLOSIVES AND PROPELLANTS

Post-Mortem Changes

807 - TEXT ANALYTICS

Lecture 8 OOP v.s . FP, Subtyping

An Introduction To Matrix Decomposition and Graphical Model

Clustering IV

Computational Movement Analysis Lecture 3: Curve simplification Joachim Gudmundsson

الجلسة الرابعة التحليل العنقودي Clustering Analysis تشرح لكل الفئات

Recursive Bipartite Spectral Clustering for Document Categorization

Fall 2004, CIS, Temple University CIS527: Data Warehousing, Filtering, and Mining Lecture 6

Combining Linear Programming Based Decomposition Techniques with Constraint Programming

Clustering Methods

Small Galaxy Groups Clustering and the Evolution of Galaxy Clustering

Lecture 4 Protein Function prediction using network concepts Hierarchical Clustering

Segmentation and Clustering