
# HCC class lecture 14 comments - PowerPoint PPT Presentation

HCC class lecture 14 comments. John Canny, 3/9/05. Administrivia. Clustering: LSA again. The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents). Columns are distinct terms. Matrix elements are term counts (× tf-idf weight).



Presentation Transcript

John Canny, 3/9/05

• The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents)

• Columns are distinct terms

• Matrix elements are term counts (× tf-idf weight)

• The idea is to “Factor” this matrix into A D B:

[Figure: M (text blocks × terms) = A (text blocks × themes) · D (themes × themes) · B (themes × terms)]

• A encodes the representation of each text block in a space of themes.

• B encodes each theme with term weights. It can be used to explicitly describe the theme.
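The factorization described above can be sketched with NumPy's SVD. This is a minimal illustration, not the lecture's own code: the toy corpus, the variable names, and the particular tf-idf variant (count × log(N / document frequency)) are all assumptions.

```python
import numpy as np

# Toy corpus (made up for illustration): each string is one text block.
blocks = [
    "matrix factor matrix theme",
    "theme term weight",
    "graph vertex edge graph",
    "vertex edge rank",
]

# Columns are the distinct terms, in sorted order.
terms = sorted({w for b in blocks for w in b.split()})
col = {t: j for j, t in enumerate(terms)}

# Rows are text blocks; entries start as raw term counts.
counts = np.zeros((len(blocks), len(terms)))
for i, b in enumerate(blocks):
    for w in b.split():
        counts[i, col[w]] += 1

# Assumed tf-idf variant: count * log(N / document frequency).
df = (counts > 0).sum(axis=0)
M = counts * np.log(len(blocks) / df)

# Thin SVD: M = U diag(s) Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the k strongest themes.
k = 2
A, D, B = U[:, :k], np.diag(s[:k]), Vt[:k, :]
M_k = A @ D @ B  # rank-k approximation of M

# Describe the first theme by its three largest-magnitude term weights in B.
top_terms = [terms[j] for j in np.argsort(-np.abs(B[0]))[:3]]
print(A.shape, D.shape, B.shape, top_terms)
```

Row `i` of `A` is the theme-space representation of text block `i`, and row `t` of `B` holds the term weights of theme `t`.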


• LSA has a few assumptions that don’t make much sense:

• If documents really do comprise different “themes”, there shouldn’t be negative weights in the LSA matrices.

• LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.

• SVD forces themes to be orthogonal in the A and B matrices. Why should they be?

• NMF deals with non-negativity and orthogonality, but still uses Gaussian statistics. Of the three LSA assumptions listed above, it addresses the first (negative weights) and the third (forced orthogonality), but not the second (Gaussian processes).
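The negativity and orthogonality points can be checked numerically. The sketch below uses a small made-up non-negative matrix (an assumption, not data from the lecture) and shows that the SVD factors mix signs and have orthonormal theme vectors even though the input has no negative entries.

```python
import numpy as np

# Small non-negative count matrix (made up): 4 text blocks x 4 terms.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# M has no negative entries, yet the second theme vector mixes signs.
second_theme = Vt[1]
has_mixed_signs = bool(second_theme.min() < 0 < second_theme.max())

# Rows of Vt (themes over terms) are orthonormal: Vt @ Vt.T is the identity.
themes_orthogonal = bool(np.allclose(Vt @ Vt.T, np.eye(len(s))))

print(has_mixed_signs, themes_orthogonal)
```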

• The consequences for LSA are:

• LSA themes are not meaningful beyond the first few (those with the largest singular values).

• LSA is largely insensitive to the choice of semantic space (most 300-dimensional spaces will do).

• The corresponding properties of NMF are:

• NMF components track themes well (up to 30 or more).

• The NMF components can be used directly as topic markers, so the choice of components is important.

• NMF is an umbrella term for several algorithms.

• The one in this paper uses least squares to match the original term matrix, i.e. it minimizes:

$\|M - AB\|^2 = \sum_{i,j} \left(M_{ij} - (AB)_{ij}\right)^2$

• Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is:

$\sum_x p(x) \log \frac{p(x)}{q(x)}$

• Another natural version of NMF uses the KL-divergence between M and its approximation AB.
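A minimal sketch of least-squares NMF follows. It uses the widely known multiplicative update rules of Lee and Seung. Whether the paper discussed in the lecture uses exactly these updates is an assumption, as are the function name, the toy matrix, and the small `eps` added to avoid division by zero.

```python
import numpy as np

def nmf_least_squares(M, k, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative M (n x m) by A (n x k) @ B (k x m), with A, B >= 0."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    A = rng.random((n, k)) + eps
    B = rng.random((k, m)) + eps
    errors = []
    for _ in range(iters):
        # Multiplicative updates: factors are only rescaled by non-negative
        # ratios, so every entry of A and B stays non-negative.
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
        errors.append(float(np.sum((M - A @ B) ** 2)))
    return A, B, errors

# Toy matrix (made up): two groups of text blocks using disjoint terms.
M = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 2.0, 3.0],
])
A, B, errors = nmf_least_squares(M, k=2)

# One simple clustering rule: assign each text block to its strongest theme.
clusters = A.argmax(axis=1)
print(errors[0], errors[-1], clusters)
```

Because the result depends on the random starting point, different values of `seed` may reach different local optima.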

• KL-divergence is usually a more accurate way to compare probability distributions.

• However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters.

• KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering.

• The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.
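For comparison, here is a sketch of the KL-divergence variant, again using the Lee and Seung multiplicative updates as an assumed choice of recurrence. The objective is the generalized KL divergence, which reduces to $\sum p \log(p/q)$ when both arguments sum to 1. Function names and the toy matrix are hypothetical.

```python
import numpy as np

def generalized_kl(M, R, eps=1e-12):
    """sum(M * log(M / R) - M + R); equals sum p log(p/q) for distributions."""
    return float(np.sum(M * np.log((M + eps) / (R + eps)) - M + R))

def nmf_kl(M, k, iters=200, seed=0, eps=1e-12):
    """Fit M ~ A @ B with A, B >= 0 by reducing the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    A = rng.random((n, k)) + 0.1
    B = rng.random((k, m)) + 0.1
    history = []
    for _ in range(iters):
        ratio = M / (A @ B + eps)
        B *= (A.T @ ratio) / (A.sum(axis=0)[:, None] + eps)
        ratio = M / (A @ B + eps)
        A *= (ratio @ B.T) / (B.sum(axis=1)[None, :] + eps)
        history.append(generalized_kl(M, A @ B))
    return A, B, history

# Toy non-negative matrix (made up for illustration).
M = np.array([
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 4.0],
    [1.0, 0.0, 2.0, 3.0],
])

# The updates only find a local optimum, so compare several random starts.
finals = [nmf_kl(M, k=2, seed=sd)[2][-1] for sd in range(3)]
A, B, history = nmf_kl(M, k=2, seed=0)
print(finals)
```

Comparing the entries of `finals` gives a rough sense of how much the fit depends on the starting point.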

• A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference).

• This algorithm, called “TextRank”, is based on an analysis of the similarity graph between sentences in the text.

• Vertices in the graph represent sentences, edge weights are similarity between sentences:

[Figure: similarity graph with sentence vertices S1 to S7]

• TextRank computes vertex strength using a variant of Google’s PageRank. It gives the probability of being at a vertex during a long random walk on the graph.


• The highest-ranked vertices comprise the summary.

• TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004.)
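The ranking procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the published implementation: the similarity function (word overlap normalized by log sentence lengths), the damping factor of 0.85, the fixed iteration count, and the toy sentences are all choices made here.

```python
import math

def similarity(s1, s2):
    """Assumed similarity: shared words, normalized by log sentence lengths."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom == 0:
        return 0.0
    return len(set(w1) & set(w2)) / denom

def textrank(sentences, damping=0.85, iters=100):
    """Weighted PageRank over the sentence similarity graph."""
    n = len(sentences)
    # Symmetric edge weights, no self-loops.
    w = [[similarity(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    out = [sum(row) for row in w]  # total edge weight leaving each vertex
    rank = [1.0 / n] * n
    for _ in range(iters):
        # With probability `damping` follow an edge in proportion to its
        # weight; otherwise jump to a uniformly random vertex.
        rank = [
            (1 - damping) / n
            + damping * sum(rank[j] * w[j][i] / out[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return rank

def summarize(sentences, k=2):
    """Return the k highest-ranked sentences in their original order."""
    rank = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: rank[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# Toy text (made up for illustration).
sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the mouse ate the cheese",
    "the market fell today",
]
rank = textrank(sentences)
print(rank, summarize(sentences, k=2))
```

When every sentence has at least one neighbour, as in this toy text, the scores sum to 1 and can be read as visit probabilities of the random walk.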

T1: The best text analysis algorithms for a variety of tasks seem to use numerical models (BOW or graphical) of texts. Discuss what information these representations capture and why they might be effective.