Vocabulary size and term distribution: tokenization, text normalization and stemming

Lecture 2: Overview



  • Getting started:
    • tokenization, stemming, compounds
    • end of sentence
  • Collection vocabulary
    • Terms, tokens, types
    • Vocabulary size
    • Term distribution
  • Stop words
  • Vector representation of text and term weighting
  • Input: Friends, Romans, Countrymen, lend me your ears;
  • Output: Friends | Romans | Countrymen | lend | me | your | ears

Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing

Type: the class of all tokens containing the same character sequence

Term: a type that is included in the system dictionary (normalized)
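
The three definitions can be made concrete by counting each on a short text. The following is a minimal sketch, assuming a letters-only tokenizer and lowercasing as the only normalization; the sample text and variable names are illustrative.

```python
import re

text = "Friends, Romans, Countrymen, lend me your ears. Lend me your time."

# Tokens: every run of letters is one instance, repeats included.
tokens = re.findall(r"[A-Za-z]+", text)

# Types: distinct character sequences among the tokens ("lend" and "Lend" differ).
types = set(tokens)

# Terms: types after normalization (here lowercasing only), so "Lend" merges with "lend".
terms = {t.lower() for t in types}

print(len(tokens), len(types), len(terms))
```

With this text there are more tokens than types, and more types than terms, because normalization merges types that differ only in case.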

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.

How to handle special cases involving apostrophes, hyphens etc?

C++, C#, URLs, emails, phone numbers, dates

San Francisco, Los Angeles
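
To see why apostrophes are a problem, the sketch below runs three simple tokenization strategies over the example sentence. The regular expressions are illustrative choices, not a recommended tokenizer, and the sentence uses straight apostrophes for simplicity.

```python
import re

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Strategy 1: split on whitespace; punctuation stays attached ("Mr.", "amusing.").
by_whitespace = sentence.split()

# Strategy 2: keep only runs of letters and digits; "O'Neill" becomes "O" and "Neill".
by_alnum = re.findall(r"[A-Za-z0-9]+", sentence)

# Strategy 3: allow apostrophes inside a token, but not at its edges.
keep_inner_apostrophes = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", sentence)

for name, toks in [("whitespace", by_whitespace),
                   ("alphanumeric", by_alnum),
                   ("inner apostrophes", keep_inner_apostrophes)]:
    print(name, toks)
```

None of the three is right for every case: the third keeps O'Neill and aren't intact but still cannot treat San Francisco as one unit or handle C++ and C#.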

Issues of tokenization are language specific
    • Requires the language to be known
  • Language identification based on classifiers that use short character subsequences as features is highly effective
    • Most languages have distinctive signature patterns
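
A toy illustration of the idea: the sketch below scores a text by how many of its character trigrams also occur in a small sample of each language. The sample sentences, the trigram size, and the overlap score are assumptions made for the example; a real language identifier would train a classifier on large corpora.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Tiny illustrative samples; real systems learn from much more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the table with the other things",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist auf dem tisch",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def identify_language(text):
    """Pick the language whose profile shares the most trigrams with the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: len(grams & PROFILES[lang]))

print(identify_language("this is the house"))
print(identify_language("das ist der hund"))
```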
Very important for information retrieval
  • Splitting tokens on spaces can cause bad retrieval results
    • A search for York University returns pages containing New York University
  • German: compound nouns
    • Retrieval systems for German greatly benefit from the use of a compound-splitter module
    • Checks if a word can be subdivided into words that appear in the vocabulary
  • East Asian Languages (Chinese, Japanese, Korean, Thai)
    • Text is written without any spaces between words
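
The compound-splitter check described for German can be sketched as a recursive dictionary lookup. This is a simplified illustration: the vocabulary is a hypothetical toy set, and real splitters also handle linking elements and ambiguous splits.

```python
def split_compound(word, vocabulary, min_len=3):
    """Split word into vocabulary words if possible; otherwise return it whole."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = split_compound(tail, vocabulary, min_len)
            if all(part in vocabulary for part in rest):
                return [head] + rest
    return [word]

vocabulary = {"computer", "linguistik", "haus", "tür", "schlüssel"}
print(split_compound("Computerlinguistik", vocabulary))
print(split_compound("Haustürschlüssel", vocabulary))
```

A word with no valid segmentation is returned as a single part, so the function is safe to apply to every token.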
Stop words
  • Very common words that have no discriminatory power
Building a stop word list
  • Sort terms by collection frequency and take the most frequent
    • In a collection about insurance practices, “insurance” would be a stop word
  • Why do we need stop lists?
    • Smaller indices for information retrieval
    • Better approximation of importance for summarization etc.
  • Using a stop list is problematic in phrasal searches
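
Building a stop list by collection frequency can be sketched as follows. The three documents are made up for illustration and tokenization is plain whitespace splitting; note how "insurance" lands on the list for this collection.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    collection_frequency = Counter()
    for doc in documents:
        collection_frequency.update(doc.lower().split())
    return [term for term, _ in collection_frequency.most_common(size)]

docs = [
    "the insurance policy covers the car",
    "the insurance claim was denied",
    "insurance rates and the deductible",
]
stop_list = build_stop_list(docs, 2)
print(stop_list)
```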
Trend in IR systems over time
    • Large stop lists (200-300 terms)
    • Very small stop lists (7-12 terms)
    • No stop list whatsoever
    • The 30 most common words account for 30% of the tokens in written text
  • Good compression techniques for indices
  • Term weighting leads to very common words having little impact on document representation
  • Token normalization
    • Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
    • U.S.A vs USA
    • Anti-discriminatory vs antidiscriminatory
    • Car vs automobile?
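
A minimal sketch of character-level normalization, assuming the rules are simply to delete periods and hyphens and to lowercase. The function name is illustrative. It handles the first two examples but, as expected, cannot relate car to automobile, which needs a synonym list rather than a character rule.

```python
def normalize(token):
    """Canonicalize a token by removing periods and hyphens and lowercasing."""
    return token.replace(".", "").replace("-", "").lower()

print(normalize("U.S.A") == normalize("USA"))
print(normalize("Anti-discriminatory") == normalize("antidiscriminatory"))
print(normalize("Car") == normalize("automobile"))
```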
Normalization sensitive to query

Query term    Terms that should match
Windows       Windows
windows       Windows, windows, window
Window        window, windows

Capitalization/case folding
  • Good for
    • Allows instances of Automobile at the beginning of a sentence to match a query for automobile
    • Helps a search engine when most users type ferrari when they are interested in a Ferrari car
  • Bad for
    • Proper names vs common nouns
    • General Motors, Associated Press, Black
  • Heuristic solution: lowercase only words at the beginning of the sentence; true casing via machine learning
  • In IR, lowercasing is most practical because of the way users issue their queries
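
The sentence-start heuristic can be sketched as below. The sentence splitter is a simple assumption (split after ., ! or ?), and the example shows the heuristic's weakness: a proper name that opens a sentence is lowercased too.

```python
import re

def lowercase_sentence_starts(text):
    """Lowercase only the first word of each sentence; leave other words as typed."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = []
    for sentence in sentences:
        words = sentence.split()
        if words:
            words[0] = words[0].lower()
        tokens.extend(words)
    return tokens

result = lowercase_sentence_starts("Automobile sales rose. General Motors gained.")
print(result)
```

Here "Automobile" is correctly folded to "automobile", but "General" in "General Motors" is folded as well, which is why a learned truecaser is the more robust option.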
Other languages
  • 60% of webpages are in English
    • Less than one third of Internet users speak English
    • Less than 10% of the world’s population primarily speak English
  • Only about one third of blog posts are in English
Stemming and lemmatization
  • Organize, organizes, organizing
  • Democracy, democratic, democratization

Am, are, is → be

Car, cars, car’s, cars’ → car

  • Stemming
    • Crude heuristic process that chops off the ends of the words
      • Democratic → democa
  • Lemmatization
    • Use of vocabulary and morphological analysis; returns the base form of a word (lemma)
      • Democratic → democracy
      • Sang → sing
Porter stemmer
  • Most common algorithm for stemming English
    • 5 phases of word reduction
    • SSES → SS
      • caresses → caress
    • IES → I
      • ponies → poni
    • SS → SS
    • S → (removed)
      • cats → cat
    • EMENT → (removed), only when the remaining stem is long enough
      • replacement → replac
      • cement → cement
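
The rules listed above can be sketched in a few lines. This covers only the suffix rules shown on the slide, not the full five-phase algorithm; the function names are illustrative, and the measure computation treats only a, e, i, o, u as vowels (the real algorithm also handles y). The condition that the stem measure be greater than 1 is what keeps cement intact.

```python
import re

def measure(stem):
    """Count vowel-consonant sequences in the stem (Porter's m), ignoring 'y'."""
    pattern = re.sub(r"[aeiou]+", "V", stem)
    pattern = re.sub(r"[^V]+", "C", pattern)
    return pattern.count("VC")

def strip_plural(word):
    """Apply the first matching rule among SSES, IES, SS, S."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def strip_ement(word):
    """Remove EMENT only if the remaining stem has measure greater than 1."""
    if word.endswith("ement"):
        stem = word[:-5]
        if measure(stem) > 1:
            return stem
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, strip_plural(w))
for w in ["replacement", "cement"]:
    print(w, strip_ement(w))
```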
Vocabulary size
  • Dictionaries
    • 600,000+ words
  • But they do not include names of people, locations, products etc
Heaps’ law: estimating the number of terms

M = kT^b

M: vocabulary size (number of terms)

T: number of tokens

Typical values: 30 < k < 100, b ≈ 0.5

The relation between vocabulary size and number of tokens is linear in log-log space
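
As a worked example of Heaps’ law, M = kT^b, the sketch below estimates the vocabulary size for a collection of one million tokens. The values k = 50 and b = 0.5 are assumptions picked from the typical ranges, not fitted to any collection. Taking logarithms gives log M = log k + b log T, which is the straight line in log-log space.

```python
import math

def heaps_vocabulary_size(num_tokens, k=50.0, b=0.5):
    """Estimate vocabulary size M = k * T**b (k and b are assumed, not fitted)."""
    return k * num_tokens ** b

estimate = heaps_vocabulary_size(1_000_000)
print(estimate)

# The same relation in log-log space: log M = log k + b * log T.
log_estimate = math.log(50.0) + 0.5 * math.log(1_000_000)
print(math.isclose(math.log(estimate), log_estimate))
```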

Zipf’s law: modeling the distribution of terms
  • The collection frequency of the ith most common term is proportional to 1/i
  • If the most frequent term occurs cf1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
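
The proportionality can be turned into expected counts directly. A minimal sketch, assuming a hypothetical most-frequent term with 6000 occurrences:

```python
def zipf_frequencies(cf1, num_ranks):
    """Expected collection frequency at ranks 1..num_ranks under Zipf's law."""
    return [cf1 / rank for rank in range(1, num_ranks + 1)]

expected = zipf_frequencies(6000, 5)
print(expected)
```

Real collections only approximate these values; the law describes the overall shape of the distribution rather than exact counts.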
Problems with the normalization
  • A change in the stop word list can dramatically alter term weightings
  • A document may contain an outlier term