1 / 15

# HCC class lecture 14 comments - PowerPoint PPT Presentation

HCC class lecture 14 comments. John Canny 3/9/05. Administrivia. Clustering: LSA again. The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents) Columns are distinct terms Matrix elements are term counts (x tfidf weight)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Clustering: LSA again
• The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents)
• Columns are distinct terms
• Matrix elements are term counts (x tfidf weight)
• The idea is to “Factor” this matrix into A D B:

Themes

Terms

Terms

D

B

=

Textblocks

M

A

Textblocks

LSA again
• A encodes the representation of each text block in a space of themes.
• B encodes each theme with term weights. It can be used to explicitly describe the theme.

Themes

Terms

Terms

D

B

=

Textblocks

M

A

Textblocks

LSA limitations
• LSA has a few assumptions that don’t make much sense:
• If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices.
• LSA implicitly models gaussian random processes for theme and word generation. Actual document statistics are far from gaussian.
• SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Non-negative Matrix Factorization
• NMF deals with non-negativity and orthogonality, but still uses gaussian statistics:
• If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices.
• LSA implicitly models gaussian random processes for theme and word generation. Actual document statistics are far from gaussian.
• SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
LSA again
• The consequences are:
• LSA themes are not meaningful beyond the first few (the ones with strongest singular value).
• LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do).
NMF
• The corresponding properties:
• NMF components track themes well (up to 30 or more).
• The NMF components can be used directly as topic markers, so the choice is important.
NMF
• NMF is an umbrella term for several algorithms.
• The one in this paper uses least squares to match the original term matrix. i.e. it minimizes:

(M – AB)2

• Another natural metric is the KL or Kullback-Liebler divergence.The KL-divergence between two probability distributions p and q is:

 p log p/q

• Another natural version of NMF uses KL-divergence between M and its approximation as A B.
NMF
• KL-divergence is usually a more accurate way to compare probability distributions.
• However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters.
• KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering.
• The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.
A Simpler Text Summarizer
• A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference).
• This algorithm called “TextRank” was based on a graphical analysis of the similarity graph between sentences in the text.
A Simpler Text Summarizer
• Vertices in the graph represent sentences, edge weights are similarity between sentences:

S1

S2

S7

S3

S6

S4

S5

Textrank
• TextRank computes vertex strength using a variant of Google’s Pagerank. It gives the probability of being at a vertex during a long random walk on the graph.

S1

S2

S7

S3

S6

S4

S5

Textrank
• The highest-ranked vertices comprise the summary.
• Textrank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004)
Discussion Topics

T1: The best text analysis algorithms for a variety of tasks seem to use numerical (BOW or graphical models) of texts. Discuss what information these representations capture and why they might be effective.