CES 514 Data Mining Feb 17, 2010 Lecture 3: The dictionary and tolerant retrieval. Ch. 2. Recap of the previous lecture. The type/token distinction Terms are normalized types put in the dictionary Tokenization problems: Hyphens, apostrophes, compounds, Chinese
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Lecture 3: The dictionary and tolerant retrieval
Can retrieve all words in range: nom ≤ w < non.
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent?
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.
E.g., Herman becomes H655.
Will hermann generate the same code?
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)