Term weighting and vector representation of text


### Term weighting and vector representation of text

Lecture 3

Last lecture
• Tokenization
• Token, type, term distinctions
• Case normalization
• Stemming and lemmatization
• Stopwords
• 30 most common words account for 30% of the tokens in written text
Last lecture: empirical laws
• Heaps’ law: estimating the size of the vocabulary
• Linear in the number of tokens in log-log space
• Zipf’s law: frequency distribution of terms
This class
• Term weighting
• Tf.idf
• Vector representations of text
• Computing similarity
• Relevance
• Redundancy identification
Term frequency and weighting
• A word that appears often in a document is probably very descriptive of what the document is about
• Assign to each term in a document a weight that depends on the number of occurrences of that term in the document
• Term frequency (tf)
• Assign the weight to be equal to the number of occurrences of term t in document d
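The raw term-frequency weighting above can be sketched in a few lines (the tokenizer here is a simple whitespace split, just for illustration):

```python
from collections import Counter

def term_frequencies(document):
    """Raw term frequency: tf(t, d) = number of occurrences of t in d."""
    tokens = document.lower().split()  # naive tokenization for illustration
    return Counter(tokens)

tf = term_frequencies("to be or not to be")
# tf["to"] == 2, tf["be"] == 2, tf["or"] == 1
```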
Bag of words model
• A document can now be viewed as the collection of terms in it and their associated weight
• Mary is smarter than John
• John is smarter than Mary
• Equivalent in the bag of words model
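The equivalence of the two sentences under the bag of words model can be verified directly, since the model keeps only terms and their counts, not word order:

```python
from collections import Counter

def bag_of_words(sentence):
    """A document reduced to its terms and their counts; word order is lost."""
    return Counter(sentence.lower().split())

a = bag_of_words("Mary is smarter than John")
b = bag_of_words("John is smarter than Mary")
# Identical bags of words: the model cannot tell the two sentences apart
assert a == b
```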
Problems with term frequency
• Stop words
• Semantically vacuous
• Example: in a collection of documents about the auto industry
• “Auto” or “car” would appear in nearly every document, so neither would be at all indicative of what a particular document/sentence is about
• Need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination
• Reduce the tf weight of a term by a factor that grows with the collection frequency
• More common for this purpose is document frequency (how many documents in the collection contain the term)
Inverse document frequency
• idf(t) = log(N / df(t)), where N is the number of documents in the collection (natural log in the examples below)
• N = 1000; df[the] = 1000; idf[the] = 0
• N = 1000; df[some] = 100; idf[some] = 2.3
• N = 1000; df[car] = 10; idf[car] = 4.6
• N = 1000; df[merger] = 1; idf[merger] = 6.9
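The idf values on this slide correspond to idf(t) = ln(N / df(t)), which is a two-line computation:

```python
import math

def idf(N, df):
    """Inverse document frequency with natural log: idf(t) = ln(N / df(t))."""
    return math.log(N / df)

N = 1000
print(round(idf(N, 1000), 1))  # "the":    0.0
print(round(idf(N, 100), 1))   # "some":   2.3
print(round(idf(N, 10), 1))    # "car":    4.6
print(round(idf(N, 1), 1))     # "merger": 6.9
```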
tf.idf weighting
• Highest when t occurs many times within a small number of documents
• Thus lending high discriminating power to those documents
• Lower when the term occurs fewer times in a document, or occurs in many documents
• Thus offering a less pronounced relevance signal
• Lowest when the term occurs in virtually all documents
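Combining the two factors gives the tf.idf score. A minimal sketch (the document-frequency table here is hypothetical, standing in for a precomputed index over a 1000-document collection):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, N):
    """tf-idf(t, d) = tf(t, d) * ln(N / df(t)), for each term t in the document."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

# Hypothetical document frequencies for a 1000-document collection
df = {"the": 1000, "car": 10, "merger": 1}
vec = tfidf_vector(["the", "car", "car", "merger"], df, 1000)
# "the" scores 0 no matter how often it occurs; "car" (twice) scores 2 * ln(100)
```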
Document vector space representation
• Each document is viewed as a vector with one component corresponding to each term in the dictionary
• The value of each component is the tf-idf score for that word
• For dictionary terms that do not occur in the document, the weights are 0
Computing similarity
• One option: magnitude of the vector difference (Euclidean distance)
• Problematic when the lengths of the documents differ
• Better: cosine similarity
• sim(d1, d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)
• The effect of the denominator is thus to length-normalize the initial document representation vectors to unit vectors
• So cosine similarity is the dot product of the normalized versions of the two documents
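The cosine similarity described above can be sketched over sparse term-weight dictionaries (the example vectors are made up to show the length-normalization effect):

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the vector magnitudes, i.e. the dot product
    of the unit-normalized versions of the two document vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

d1 = {"auto": 1.0, "insurance": 2.0}
d2 = {"auto": 2.0, "insurance": 4.0}
# Same direction, different length: similarity is 1.0 after normalization
assert abs(cosine_similarity(d1, d2) - 1.0) < 1e-9
```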
Tf modifications
• It is unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence
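A common remedy (not named on the slide, but standard) is sublinear tf scaling, wf = 1 + ln(tf) for tf > 0, which dampens repeated occurrences:

```python
import math

def sublinear_tf(tf):
    """Dampened term frequency: wf = 1 + ln(tf) if tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# Twenty occurrences now weigh roughly 4x a single occurrence, not 20x
print(sublinear_tf(1))   # 1.0
print(sublinear_tf(20))  # ~4.0
```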
Problems with the normalization
• A change in the stop word list can dramatically alter term weightings
• A document may contain an outlier term