- By
**booth** - Follow User

- 108 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'HCC class lecture 14 comments' - booth

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

HCC classlecture 14 comments

John Canny3/9/05

Clustering: LSA again

- The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents)
- Columns are distinct terms
- Matrix elements are term counts (x tfidf weight)
- The idea is to “Factor” this matrix into A D B:

Themes

Terms

Terms

D

B

=

Textblocks

M

A

Textblocks

LSA again

- A encodes the representation of each text block in a space of themes.
- B encodes each theme with term weights. It can be used to explicitly describe the theme.

Themes

Terms

Terms

D

B

=

Textblocks

M

A

Textblocks

LSA limitations

- LSA has a few assumptions that don’t make much sense:
- If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices.
- LSA implicitly models gaussian random processes for theme and word generation. Actual document statistics are far from gaussian.
- SVD forces themes to be orthogonal in the A and B matrices. Why should they be?

Non-negative Matrix Factorization

- NMF deals with non-negativity and orthogonality, but still uses gaussian statistics:
- If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices.
- LSA implicitly models gaussian random processes for theme and word generation. Actual document statistics are far from gaussian.
- SVD forces themes to be orthogonal in the A and B matrices. Why should they be?

LSA again

- The consequences are:
- LSA themes are not meaningful beyond the first few (the ones with strongest singular value).
- LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do).

NMF

- The corresponding properties:
- NMF components track themes well (up to 30 or more).
- The NMF components can be used directly as topic markers, so the choice is important.

NMF

- NMF is an umbrella term for several algorithms.
- The one in this paper uses least squares to match the original term matrix. i.e. it minimizes:

(M – AB)2

- Another natural metric is the KL or Kullback-Liebler divergence.The KL-divergence between two probability distributions p and q is:

p log p/q

- Another natural version of NMF uses KL-divergence between M and its approximation as A B.

NMF

- KL-divergence is usually a more accurate way to compare probability distributions.
- However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters.
- KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering.
- The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.

A Simpler Text Summarizer

- A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference).
- This algorithm called “TextRank” was based on a graphical analysis of the similarity graph between sentences in the text.

A Simpler Text Summarizer

- Vertices in the graph represent sentences, edge weights are similarity between sentences:

S1

S2

S7

S3

S6

S4

S5

Textrank

- TextRank computes vertex strength using a variant of Google’s Pagerank. It gives the probability of being at a vertex during a long random walk on the graph.

S1

S2

S7

S3

S6

S4

S5

Textrank

- The highest-ranked vertices comprise the summary.
- Textrank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004)

Discussion Topics

T1: The best text analysis algorithms for a variety of tasks seem to use numerical (BOW or graphical models) of texts. Discuss what information these representations capture and why they might be effective.

Download Presentation

Connecting to Server..