HCC class lecture 14 comments
John Canny, 3/9/05

Clustering: LSA again
  • The input is a matrix. Rows represent text blocks (sentences, paragraphs, or documents)
  • Columns are distinct terms
  • Matrix elements are term counts (optionally scaled by tf-idf weights)
  • The idea is to factor this matrix M into A D B (a small code sketch follows the diagram below):

[Diagram: M (text blocks × terms) = A (text blocks × themes) · D · B (themes × terms)]
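The factorization above is just a truncated SVD of the term matrix. Below is a minimal sketch in Python/NumPy; the tiny corpus, the term list, and the choice of k = 2 themes are made up purely for illustration, not data from the lecture.

```python
# Minimal LSA sketch: factor a term-count matrix M into A D B via SVD.
import numpy as np

# Rows = text blocks, columns = distinct terms (counts, optionally tf-idf weighted).
terms = ["graph", "vertex", "matrix", "factor"]
M = np.array([
    [2, 1, 0, 0],   # text block 0
    [1, 2, 0, 1],   # text block 1
    [0, 0, 3, 2],   # text block 2
    [0, 1, 2, 2],   # text block 3
], dtype=float)

# SVD gives M = U diag(s) Vt; keeping the top k singular values yields
# the rank-k factorization M ~ A D B used by LSA.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                      # number of "themes" to keep
A = U[:, :k]               # text blocks in theme space
D = np.diag(s[:k])         # theme strengths (singular values)
B = Vt[:k, :]              # themes as term-weight vectors

print("reconstruction error:", np.linalg.norm(M - A @ D @ B))
```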

LSA again
  • A encodes the representation of each text block in a space of themes.
  • B encodes each theme with term weights; it can be used to explicitly describe the theme (illustrated in the sketch below).

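Continuing the hypothetical sketch above (same M, A, D, B, k, and terms), the rows of B give the term weights that describe each theme, and A·D gives each text block's coordinates in theme space:

```python
# B describes each theme by its term weights; rows of A (scaled by D)
# locate each text block in the theme space.
for t in range(k):
    # Terms with the largest absolute weight characterize theme t.
    top = np.argsort(-np.abs(B[t]))[:3]
    print(f"theme {t}:", [terms[i] for i in top])

# Coordinates of each text block in the theme space (rows of A D).
print("text-block coordinates:\n", A @ D)
```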

LSA limitations
  • LSA has a few assumptions that don’t make much sense:
    • If documents really do comprise different “themes”, there shouldn’t be negative weights in the LSA matrices.
    • LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
    • SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Non-negative Matrix Factorization
  • NMF addresses the non-negativity and orthogonality problems, but still assumes Gaussian statistics. Recall the LSA issues:
    • If documents really do comprise different “themes”, there shouldn’t be negative weights in the factor matrices (addressed by NMF).
    • LSA implicitly models Gaussian random processes for theme and word generation; actual document statistics are far from Gaussian (still true of NMF).
    • SVD forces themes to be orthogonal in the A and B matrices; NMF does not impose orthogonality.
LSA again
  • The consequences are:
    • LSA themes are not meaningful beyond the first few (the ones with the strongest singular values).
    • LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do).
NMF
  • The corresponding properties:
    • NMF components track themes well (up to 30 or more).
    • The NMF components can be used directly as topic markers, so the choice of components matters.
NMF
  • NMF is an umbrella term for several algorithms.
  • The one in this paper uses least squares to match the original term matrix, i.e. it minimizes:

‖M − AB‖²  (the sum of squared entries of M − AB)

  • Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is:

KL(p ‖ q) = Σᵢ pᵢ log (pᵢ / qᵢ)

  • Another natural version of NMF uses the KL-divergence between M and its approximation A B. (A code sketch of the least-squares variant follows this list.)
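Below is a minimal sketch of the least-squares variant using the standard multiplicative update rules (Lee & Seung style); this is an illustration, not necessarily the exact algorithm of the paper discussed here. A small KL-divergence helper is included for comparison.

```python
# Minimal sketch of least-squares NMF via multiplicative updates.
import numpy as np

def nmf_ls(M, k, iters=200, eps=1e-9, seed=0):
    """Factor non-negative M (n x m) into A (n x k) @ B (k x m), minimizing ||M - AB||^2."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    A = rng.random((n, k))
    B = rng.random((k, m))
    for _ in range(iters):
        # Multiplicative updates keep A and B non-negative.
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i log(p_i / q_i) for probability vectors p, q."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

# Toy usage with a random non-negative "term matrix".
M = np.abs(np.random.default_rng(1).random((6, 8)))
A, B = nmf_ls(M, k=2)
print("least-squares error:", np.linalg.norm(M - A @ B))
print("KL example:", kl_divergence([0.5, 0.5], [0.9, 0.1]))
```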
NMF
  • KL-divergence is usually a more accurate way to compare probability distributions.
  • However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters.
  • KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares for clustering.
  • The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.
A Simpler Text Summarizer
  • A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference).
  • This algorithm, called “TextRank”, is based on an analysis of the similarity graph between sentences in the text.
A Simpler Text Summarizer
  • Vertices in the graph represent sentences; edge weights are similarities between sentences:

[Diagram: sentence similarity graph with vertices S1–S7]

TextRank
  • TextRank computes vertex strength using a variant of Google’s PageRank. It gives the probability of being at a vertex during a long random walk on the graph.


TextRank
  • The highest-ranked vertices comprise the summary (see the sketch below).
  • TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004.)
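A minimal sketch of the TextRank idea, under simplifying assumptions: whitespace tokenization, the word-overlap/log-length similarity from the TextRank paper, and a damped power-iteration "random walk" with d = 0.85. The example sentences are invented for illustration.

```python
# Minimal TextRank-style extractive summarizer sketch.
import math
import numpy as np

def similarity(s1, s2):
    """Word overlap normalized by sentence lengths (Mihalcea & Tarau)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, d=0.85, iters=50):
    n = len(sentences)
    # Edge weights: pairwise sentence similarities (no self-loops).
    W = np.array([[similarity(a, b) if i != j else 0.0
                   for j, b in enumerate(sentences)]
                  for i, a in enumerate(sentences)])
    # Normalize each column so a vertex spreads its score proportionally.
    col = W.sum(axis=0)
    col[col == 0] = 1.0
    P = W / col
    # Damped power iteration: the PageRank-style random-walk recurrence.
    r = np.ones(n) / n
    for _ in range(iters):
        r = (1 - d) / n + d * P @ r
    return r

sentences = [
    "LSA factors a term matrix into themes.",
    "NMF factors the term matrix with non-negative weights.",
    "TextRank ranks sentences by a random walk on the similarity graph.",
    "The highest ranked sentences form the summary.",
]
scores = textrank(sentences)
# The top-ranked sentences comprise the summary.
for i in np.argsort(-scores)[:2]:
    print(scores[i].round(3), sentences[i])
```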
Discussion Topics

T1: The best text analysis algorithms for a variety of tasks seem to use numerical models of texts (bag-of-words or graphical models). Discuss what information these representations capture and why they might be effective.
