HCC class lecture 14 comments

John Canny, 3/9/05



Clustering: LSA again

  • The input is a matrix. Rows represent text blocks (sentences, paragraphs or documents)

  • Columns are distinct terms

  • Matrix elements are term counts (optionally multiplied by a tf-idf weight)

  • The idea is to “Factor” this matrix into A D B:

[Diagram: M (text blocks × terms) = A (text blocks × themes) · D · B (themes × terms)]
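
A minimal sketch of this factorization, assuming M is a small dense NumPy array of (optionally tf-idf weighted) term counts; a real LSA pipeline would typically use a sparse matrix and a truncated SVD, but the shapes mirror the diagram above:

    import numpy as np

    # M: text blocks x terms (term counts, optionally tf-idf weighted) -- toy data
    M = np.array([[2., 1., 0., 0.],
                  [1., 2., 0., 1.],
                  [0., 0., 3., 1.],
                  [0., 1., 2., 2.]])

    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = 2                    # number of themes to keep
    A = U[:, :k]             # text blocks x themes
    D = np.diag(s[:k])       # diagonal matrix of singular values (theme strengths)
    B = Vt[:k, :]            # themes x terms
    M_approx = A @ D @ B     # rank-k approximation of M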


LSA again

  • A encodes the representation of each text block in a space of themes.

  • B encodes each theme with term weights. It can be used to explicitly describe the theme.

[Diagram repeated: M (text blocks × terms) = A (text blocks × themes) · D · B (themes × terms)]
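
Continuing the sketch above, a small illustration of this interpretation: a row of A places one text block in theme space, and sorting a row of B by magnitude surfaces the terms that describe a theme (vocab is an assumed term list aligned with the columns of M):

    vocab = ["cat", "dog", "stock", "market"]   # assumed terms, one per column of M

    def top_terms(B, vocab, theme, n=3):
        """Terms with the largest-magnitude weight in one row of B (one theme)."""
        order = np.argsort(-np.abs(B[theme]))[:n]   # absolute value: LSA weights may be negative
        return [vocab[i] for i in order]

    block0_themes = A[0]                  # representation of text block 0 over the k themes
    print(top_terms(B, vocab, theme=0))   # terms that best describe theme 0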


LSA limitations

  • LSA has a few assumptions that don’t make much sense:

    • If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices.

    • LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.

    • SVD forces themes to be orthogonal in the A and B matrices. Why should they be?


Non-negative Matrix Factorization

  • NMF deals with non-negativity and orthogonality, but still uses Gaussian statistics:

    • If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices.

    • LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.

    • SVD forces themes to be orthogonal in the A and B matrices. Why should they be?


LSA again

  • The consequences are:

    • LSA themes are not meaningful beyond the first few (the ones with the largest singular values).

    • LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do).


NMF

  • The corresponding properties:

    • NMF components track themes well (up to 30 or more).

    • The NMF components can be used directly as topic markers, so the choice is important.


NMF

  • NMF is an umbrella term for several algorithms.

  • The one in this paper uses least squares to match the original term matrix, i.e. it minimizes:

    ‖M − A B‖²  (the squared Frobenius norm)

  • Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is:

    D_KL(p ‖ q) = Σᵢ pᵢ log(pᵢ / qᵢ)

  • Another natural version of NMF minimizes the KL-divergence between M and its approximation A B (a minimal sketch of both variants follows below).
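
A hedged sketch of both variants using scikit-learn's NMF, which supports a least-squares (Frobenius) objective and a Kullback-Leibler objective through its beta_loss parameter; the document list is only a placeholder:

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",           # placeholder text blocks
            "dogs chase cats in the yard",
            "stock markets fell sharply today",
            "investors sold shares as markets fell"]
    M = TfidfVectorizer().fit_transform(docs)   # text blocks x terms, non-negative

    # Least-squares NMF: minimizes ||M - A B||^2 (Frobenius norm)
    nmf_ls = NMF(n_components=2, beta_loss="frobenius", init="nndsvda", random_state=0)
    A = nmf_ls.fit_transform(M)                 # text blocks x themes
    B = nmf_ls.components_                      # themes x terms

    # KL-divergence NMF: minimizes D_KL(M || A B); requires the multiplicative-update solver
    nmf_kl = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
                 max_iter=500, init="nndsvda", random_state=0)
    A_kl = nmf_kl.fit_transform(M)
    B_kl = nmf_kl.components_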


NMF

  • KL-divergence is usually a more accurate way to compare probability distributions.

  • However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters.

  • KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering.

  • The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.


A Simpler Text Summarizer

  • A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference).

  • This algorithm, called “TextRank”, was based on an analysis of the similarity graph between sentences in the text.


A Simpler Text Summarizer

  • Vertices in the graph represent sentences; edge weights are the similarity between sentences:

[Diagram: similarity graph over sentences S1–S7]
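
A sketch of the graph construction under one stated assumption: the TextRank paper defines sentence similarity by normalized word overlap, but the snippet below substitutes TF-IDF cosine similarity, a common stand-in; the sentence list is a placeholder:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [                                    # placeholder sentences (S1..S4)
        "The company reported record profits this quarter.",
        "Profits rose because of strong overseas sales.",
        "The CEO announced a new product line.",
        "Analysts expect sales to keep growing next year.",
    ]

    X = TfidfVectorizer().fit_transform(sentences)   # sentences x terms
    W = cosine_similarity(X)                         # edge weights: similarity between sentences
    np.fill_diagonal(W, 0.0)                         # drop self-loops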


TextRank

  • TextRank computes vertex strength using a variant of Google’s PageRank. It gives the probability of being at a vertex during a long random walk on the graph.

[Diagram repeated: similarity graph over sentences S1–S7]
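
Continuing the sketch above, a minimal weighted PageRank by power iteration over the similarity matrix W; the damping factor d = 0.85 follows the usual PageRank convention, and the score of a sentence is its long-run visit probability under the random walk described on this slide:

    def weighted_pagerank(W, d=0.85, tol=1e-6, max_iter=100):
        """Long-run visit probabilities of a random walk on the weighted graph W."""
        n = W.shape[0]
        col_sums = W.sum(axis=0)
        col_sums[col_sums == 0] = 1.0      # guard against isolated vertices
        P = W / col_sums                   # column-stochastic transition matrix
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_new = (1 - d) / n + d * (P @ r)
            if np.abs(r_new - r).sum() < tol:
                break
            r = r_new
        return r

    scores = weighted_pagerank(W)
    top = np.argsort(-scores)[:2]                  # highest-ranked sentences
    summary = [sentences[i] for i in sorted(top)]  # keep original sentence order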


TextRank

  • The highest-ranked vertices comprise the summary.

  • TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004.)


Discussion Topics

T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (bag-of-words or graphical models) of texts. Discuss what information these representations capture and why they might be effective.

