Loading in 2 Seconds...

Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Loading in 2 Seconds...

- 89 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Lecture 7: Clustering and Decomposition of a Document to Authorial Components' - oralee

Download Now**An Image/Link below is provided (as is) to download presentation**

Download Now

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Moshe Koppel and NavotAkiva

Clustering

- The idea is to define “natural” clusters
- Position in space depends on feature set
- Need to have some notion of what you want in order to choose right feature types
- e.g., cluster by topic: content words
- e.g., cluster by author: style features
- With many features, clusters aren’t so elegant

How Many Clusters?

- Some methods decide in advance
- Some methods try to optimize k based on data
- e.g., in previous example, would want 3 clusters
- Caution: some optimization criteria biased towards one giant cluster or many tiny ones

Flat Clustering: K-Means Algorithm

Select K random docs {s1, s2,… sK} as seeds.

Until clustering converges or other stopping criterion:

For each doc di:

Assign di to the cluster cjsuch that dist(xi, sj) is minimal.

For each cluster cj:

sj= (cj)(Update the seeds to the centroid of each cluster)

Reassign clusters

Compute centroids

Reassign clusters

x

x

Compute centroids

x

x

x

x

K-Means Example (K=2)Reassign clusters

Converged!

Termination conditions

- Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don’t change.

Convergence

- Why should the K-means algorithm ever reach a fixed point?
- A state in which clusters don’t change.
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
- EM is known to converge.
- Number of iterations could be large.

Convergence of K-Means

- Define goodness measure of cluster k as sum of squared distances from cluster centroid:

Gk = Σi (di – ck)2 (sum over all di in cluster k)

G = Σk Gk

- Reassignment monotonically decreases G since each vector is assigned to the closest centroid.

Dependence on Seed Choice

- Results depend on random seed selection.
- Some tricks for avoiding bad seeds:
- Choose seed far from any already chosen
- Try out multiple seed sets

If you start with B and E as centroids

you converge to {A,B,C} and {D,E,F}

If you start with D and F you converge to {A,B,D,E} and {C,F}

A Problem…

- The methods we’ve discussed so far don’t always give very satisfying results.
- Have a look at this example….

Spectral Methods

- Because of examples like that, spectral methods have been developed.
- Here’s the idea: Imagine weighted edges between points, where weight reflects similarity. Zero out all edges except k nearest neighbors with high weights.
- Now make the “cheapest” cuts that result in separate components.

Spectral Clustering

Properly formulated, the min Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions:

- Pre-processing
- Construct a matrix representation of the dataset.
- Decomposition
- Compute eigenvalues and eigenvectors of the matrix.
- Map each point to a lower-dimensional representation based on one or more eigenvectors.
- Grouping
- Assign points to two or more clusters, based on the new representation.

New Problem: Separating Document Components

- Often documents consist of multiple authorial components.
- Our object is to tease apart the components of a composite document.

Basic Idea

- Divide the document into natural chunks (e.g., chapters, paragraphs)
- Vectorize chunks using some feature set
- Cluster the vectors

Classic Example: The Bible

- Great historic and cultural interest
- Much prior research on components
- Has been manually tagged in every conceivable way

Test Case

Let’s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out.

- Each is presumably the work of a single (primary) author.
- There’s no reason to think it’s the same author.
- They are thematically quite similar, though far from identical.

Clustering Jeremiah+Ezekiel

- Chunks = chapters (no sequence info)
- Features = bag of words
- Cluster method = Ncut
- # clusters = 2

Results using all words

- Not too exciting. We must be picking up thematic or genre-related differences that cross books.
- Let’s try using only function words.

Results using function words

- Not any better.
- Let’s try a new approach.

A Better Idea

- Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh).
- It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets.
- Note that words aren’t synonyms, rather word senses are synonyms. (For example mateh=staff is a synonym of makel, but mateh=tribe is not.)

Automatically Finding Synonyms

- There are various clever methods for identifying synsets, but most are not exact enough for our purpose.
- Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging.
- We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.)
- Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)

Synonym Method

- The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren’t quite right.
- If two docs use the same synonym, they are similar.
- If two docs use opposite synonyms, they are different.
- If one of the docs uses one of the synonyms, but the other doesn’t, cosine would regard them as different. But are they?
- For measuring similarity, we only consider synsets represented in both docs.

Results using synonyms

- Now we’re getting somewhere.
- But we’re not done yet…

Core of a Cluster

- Some chapters are in a cluster because they really belong there; some just have to be somewhere.
- Let’s consider only chapters near one centroid and far from the other. These are the “cores” of the respective clusters.

36

2

12

3

36

7

4

Cluster cores

Ezekiel 1, 10

Expanding the Core

- Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary.
- In fact, we can use function words as our features.
- Using synonyms will just get us back where we started.
- And besides, FW are generally very reliable for supervised authorship attribution.

SVM expansion of core

- Two training examples are “misclassed” by SVM.
- Incredibly, these are Ezekiel 1 and 10, which were part of Jeremiah core, but are on Ezekiel side of optimal SVM boundary.
- The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.

Decomposing Unsegmented Text

- Until now, we’ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That’s a major unrealistic assumption.
- Let’s create a munged “Jer-iel” book by alternately taking a random number of verses from each book until we run out of verses.
- Can we tease that apart?

Chunks Might Be Mixed

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Chunks Might Be Mixed

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Decomposing Unsegmented Text

- Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.)
- Now proceed exactly as above, until you have cluster cores.
- With a bit of luck, the chunks in the core will be the “pure” chunks.

A Munged Document

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Decomposing Unsegmented Text

- As above, use SVM to learn a classifier, using the core chunks as training examples.
- But at the last step, classify individual verses instead of chapters.
- Some smoothing techniques are necessary to:
- Deal with the sparseness of features in individual verses
- Take advantage of serial correlation

A Munged Document

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

A Munged Document

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

A Munged Document

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Multiple Books

- The same method works if we create a munged book out of k>2 constituent books.
- BUT, we need to be given the k.
- So we don’t yet know how many components there are, only how to best split into a given number of components.

Open Question

- Can we extend this method so it can be used also for texts that don’t have canonical translations like the Bible?
- Can we use this method to identify plagiarized sections of a text?

Download Presentation

Connecting to Server..