
Information Retrieval

This presentation discusses the process of tokenization and indexing in information retrieval, including issues such as document parsing, linguistic modules, token streams, and inverted indexing. It also covers topics like stemming, case folding, and the use of thesauri in search engines.


Presentation Transcript


  1. Information Retrieval: Document Parsing

  2. Basic indexing pipeline
  Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer → Token stream: Friends Romans Countrymen
  → Linguistic modules → Modified tokens: friend roman countryman
  → Indexer → Inverted index.
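The pipeline can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the slides: the normalize step stands in for the linguistic modules, using only case folding and a crude plural strip (it does not handle irregular plurals such as countrymen → countryman).

```python
import re
from collections import defaultdict

def tokenize(doc):
    """Tokenizer: document -> token stream."""
    return re.findall(r"\w+", doc)

def normalize(token):
    """Linguistic modules (sketch): case folding + crude plural strip."""
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def build_index(docs):
    """Indexer: build an inverted index, term -> sorted list of doc ids."""
    inverted = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in tokenize(doc):
            inverted[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

print(build_index(["Friends, Romans, countrymen."]))
# {'friend': [0], 'roman': [0], 'countrymen': [0]}
```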

  3. Parsing a document
  • What format is it in? pdf / word / excel / html?
  • What language is it in?
  • What character set is in use? Plain ASCII, UTF-8, UTF-16, …
  Each of these is a classification problem, with many complications…
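As a small illustration of the character-set question: the slides name no tool, so the third-party chardet library (pip install chardet) is an assumption chosen here for illustration.

```python
import chardet  # third-party encoding detector, used here only as an example

raw = "Müller reads résumés".encode("utf-8")  # bytes of unknown encoding
guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.87, ...}
text = raw.decode(guess["encoding"])          # decode with the guessed charset
print(guess["encoding"], text)
```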

  4. Tokenization: issues
  • Chinese/Japanese have no spaces between words: フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  • The example mixes four writing systems: Katakana, Hiragana, Kanji, “Romaji” (Latin letters)
  • A unique tokenization is not always guaranteed
  • Dates/amounts come in multiple formats
  • What about DNA sequences? ACCCGGTACGCAC...
  Definition of tokens ⇒ what you can search!
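For scripts without explicit word boundaries, one standard workaround (a general technique, not one the slides prescribe) is to index overlapping character n-grams instead of words; the same trick applies to DNA sequences.

```python
def char_ngrams(text, n=2):
    """Tokenize into overlapping character n-grams (no word boundaries needed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("情報不足"))                 # ['情報', '報不', '不足']
print(char_ngrams("ACCCGGTACGCAC", n=3)[:4])  # ['ACC', 'CCC', 'CCG', 'CGG']
```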

  5. Case folding
  • Reduce all letters to lower case
  • Possible exception: upper case in mid-sentence? e.g., General Motors; USA vs. usa
  • Ambiguity: in “Morgen will ich in MIT …” is MIT the university, or the German word “mit” (“with”)?
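A sketch of case folding with one crude mid-sentence heuristic; the heuristic is an assumption for illustration, not a rule from the slides.

```python
def fold(tokens):
    """Case folding sketch: keep mid-sentence capitalized tokens
    (possible names/acronyms such as MIT) and lowercase the rest."""
    return [tok if i > 0 and tok[0].isupper() else tok.lower()
            for i, tok in enumerate(tokens)]

print(fold("Morgen will ich in MIT".split()))
# ['morgen', 'will', 'ich', 'in', 'MIT']
```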

  6. Stemming
  • Reduce terms to their “roots”; language dependent
  • e.g., automate(s), automatic, automation all reduced to automat
  • e.g. (Italian), casa, casalinga, casata, casamatta, casolare, casamento, casale, rincasare, case all reduced to cas
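In practice one would call an off-the-shelf stemmer; for instance (an assumption, since the slides prescribe no library) NLTK's implementation of Porter's algorithm, described on the next slide.

```python
# Requires the third-party NLTK package (pip install nltk).
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["automates", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
# Porter maps these to near-identical stems ("autom"/"automat"),
# matching the automat root shown above up to a trailing suffix.
```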

  7. Porter’s algorithm
  • Commonest algorithm for stemming English
  • Conventions + 5 phases of reductions; phases applied sequentially
  • Each phase consists of a set of commands
  • Sample rules: sses → ss, ies → i, ational → ate, tional → tion
  • Sample convention: of the rules in a compound command, select the one that applies to the longest suffix
  • Full morphological analysis ⇒ modest benefit!
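A minimal sketch of a single Porter-style compound command, using exactly the four sample rules above and the longest-suffix convention. This is an illustration only, not the full five-phase algorithm, which also places conditions on the stem.

```python
# One compound command: among the rules whose suffix matches,
# apply the rule with the longest suffix.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_command(word, rules=RULES):
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))  # longest wins
    return word[:-len(suffix)] + repl

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_command(w))
# caresses -> caress, ponies -> poni,
# relational -> relate (ational beats tional: longest suffix wins),
# conditional -> condition
```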

  8. Thesauri
  • Handle synonyms and homonyms
  • Hand-constructed equivalence classes: e.g., car = automobile; macchina = automobile = spider (Italian)
  • A thesaurus is a list of words important for a given domain; for each word it specifies a list of correlated words (usually synonyms, polysemous terms, or phrases for complex concepts)
  • Co-occurrence pattern: BT (broader term), NT (narrower term): Vehicle (BT) → Car → Fiat 500 (NT)
  How to use it in a search engine?
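One common answer to the closing question is query expansion: look each query term up in the thesaurus and OR in its equivalence class. A sketch, with a hand-constructed table invented here for illustration:

```python
# Hand-constructed equivalence classes (illustrative data, not from the slides).
THESAURUS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
}

def expand_query(terms):
    """Replace each query term by its equivalence class, if it has one."""
    expanded = set()
    for term in terms:
        expanded |= THESAURUS.get(term, {term})
    return expanded

print(expand_query(["car", "rental"]))  # {'car', 'automobile', 'rental'}
```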

  9. Dmoz Directory

  10. Yahoo! Directory

  11. Information Retrieval: Statistical Properties of Documents

  12. Statistical properties of texts
  • Tokens are not distributed uniformly: they follow the so-called “Zipf Law”
  • Few tokens are very frequent; a middle-sized set has medium frequency; many are rare
  • The first 100 tokens sum up to 50% of the text, and many of them are stopwords

  13. The Zipf Law, in detail
  • The k-th most frequent term has frequency approximately proportional to 1/k; equivalently, the product of a token’s frequency f and its rank r is almost a constant: r * f = c * |T|, i.e., f = c * |T| / r
  • General law: f = c * |T| / r^z, with z ≈ 1.5–2.0
  • For the initial top elements the frequency is roughly a constant
  • The sum of the frequencies after the k-th element is ≤ f_k * k / (z − 1)
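The law is easy to check empirically: count the tokens of any large text, rank them by frequency, and inspect r * f, which should stay within a small factor over the top ranks. A sketch, where corpus.txt stands for any large plain-text file (e.g., a Project Gutenberg book):

```python
import re
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower()
counts = Counter(re.findall(r"[a-z]+", text))

for rank, (token, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:2d}  {token:12s} f = {freq:7d}   r*f = {rank * freq}")
# On natural-language text, r*f varies slowly across the top ranks.
```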

  14. An example of “Zipf curve”

  15. Zipf’s law log-log plot

  16. Consequences of Zipf Law
  • There exist many very frequent tokens that do not discriminate: the so-called “stop words”. English: to, from, on, and, the, ... Italian: a, per, il, in, un, …
  • There exist many tokens that occur only once in a text (errors?) and thus discriminate poorly. English: Calpurnia; Italian: Precipitevolissimevolmente (or “paklo”)
  • Words with medium frequency ⇒ words that discriminate
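These two consequences suggest a simple frequency-based pruning rule, sketched below; the thresholds are arbitrary placeholders, not values from the slides.

```python
from collections import Counter

def medium_frequency_terms(tokens, low=2, high=0.005):
    """Keep discriminating terms: drop tokens seen fewer than `low` times
    (possible errors) and tokens covering more than a fraction `high`
    of the text (stop words). Thresholds are illustrative."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t for t, f in counts.items() if f >= low and f / total <= high}
```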

  17. Other statistical properties of texts
  • The number of distinct tokens grows with the text length |T| according to the so-called “Heaps Law”: it is Θ(|T|^b) with b < 1
  • Hence the token length is Θ(log |T|)
  • Interesting words are the ones with medium frequency (Luhn)
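Heaps' law can be observed the same way as Zipf's: track how many distinct tokens have appeared as a prefix of the text grows (again, corpus.txt stands for any large plain-text file).

```python
import re

text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)

distinct = set()
for i, token in enumerate(tokens, start=1):
    distinct.add(token)
    if i % 100_000 == 0:
        # Heaps: len(distinct) ≈ k * i**b with b < 1, so growth keeps slowing.
        print(f"|T| = {i:9d}   distinct = {len(distinct):8d}")
```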

  18. Frequency vs. Term significance (Luhn)
