Information Retrieval in Text Part I


  1. Information Retrieval in Text Part I • Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999. • Reading Assignment: Chapters 1 and 2.

  2. Outline • Introduction • Basic Process of Information Retrieval • Content Representation • Document Purification and Analysis • Item Normalization • Index Construction • Manual Indexing • Automatic Indexing • Inverted File Structures • Signature Files

  3. Introduction • Expectations from our search engines • Type principal, where one meant principle • Type Lanzcos, where one meant Lanczos • Type right and left, where one meant • Party associations • Traffic laws • Chaos • Find what we want from a gigantic collection of documents (handle the tsunami of data) • We are asking the computer to supply the information we want, rather than the information we asked for • Reference librarians are already good at that, asking the patron a few questions before directing him to the results

  4. Introduction • An Information system consists of • database of documents • search engine • interface • search results

  5. Basic Process of IR • The basic process of information retrieval can be described as: • Representing the content of documents • Document Purification and Analysis • Item Normalization • Index Construction • Representing the user’s information need • Query Representation • User Interface • Ranking and Relevance Feedback • The main objective of an IR system is to increase precision and recall, efficiently.

  6. Precision and Recall • Precision: the fraction of the documents retrieved by an algorithm that are correct (relevant) • Recall: the fraction of the documents that should have been retrieved by an algorithm that were in fact retrieved • Average Precision
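
The two definitions above can be sketched in a few lines of Python. The slide names Average Precision without defining it, so the version here is an assumption: one common non-interpolated form that averages the precision measured at each rank where a relevant document appears. The document IDs are made up for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def average_precision(ranked, relevant):
    """Non-interpolated average precision (an assumed definition).

    Sums precision-at-k over every rank k that holds a relevant document,
    then divides by the total number of relevant documents.
    """
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d5"}

print(precision(ranked, relevant))          # 2 of 5 retrieved are relevant
print(recall(ranked, relevant))             # 2 of 3 relevant were retrieved
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 3
```

Note that precision and recall ignore ranking order, while average precision rewards placing relevant documents near the top of the list.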

  7. Document Purification and Analysis • Unless documents are cleaned up (making sure every document has a title, a beginning, and an end, and handling non-textual portions such as images), the wrong documents, or the wrong portions of documents, may be retrieved

  8. Document Purification and Analysis • Taking HTML documents, for example, one needs to decide which “tags” to index • According to references published in 1997 and 1998, the following features are ignored in building a search engine index • <COMMENT> tags • <ALT TEXT> attribute • <META> tags • Image maps, frames, and some URLs

  9. Document Purification and Analysis • Usually, search engines extract • text, excluding punctuation, from title tags, header tags, and the first characters of an HTML file. This may include • The first 100 significant words • The first 20 lines per record • Search engines would ignore • invisible text • Text with smaller fonts • Words containing numbers
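
As a rough illustration of the extraction step described above, the following sketch uses Python's standard `html.parser` module to keep only text inside title and header tags, strip punctuation, drop words containing digits, and cap the result at a word limit. The class name, the function name, the tag set, and the `max_words` default are all assumptions chosen for the example; no particular search engine is known to work exactly this way.

```python
import string
from html.parser import HTMLParser

# Translation table that deletes ASCII punctuation.
_NO_PUNCT = str.maketrans("", "", string.punctuation)


class TitleHeaderExtractor(HTMLParser):
    """Collects words that appear inside <title> and <h1>..<h6> tags."""

    INDEXED_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while inside an indexed tag
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INDEXED_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.INDEXED_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:
            return  # body text, comments, etc. are not indexed here
        for word in data.translate(_NO_PUNCT).lower().split():
            if not any(ch.isdigit() for ch in word):  # skip words with numbers
                self.words.append(word)


def extract_index_text(html, max_words=100):
    """Return at most max_words indexable words from title/header tags."""
    parser = TitleHeaderExtractor()
    parser.feed(html)
    parser.close()
    return parser.words[:max_words]


page = (
    "<html><head><title>Web 2.0 Search Engines!</title></head>"
    "<body><!-- hidden note --><h1>Text, Retrieval</h1>"
    "<p>Body text is skipped in this sketch.</p></body></html>"
)
print(extract_index_text(page))
```

HTML comments are ignored because the parser routes them to a separate handler that this sketch leaves at its default (no-op) behavior.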

  10. Document Purification and Analysis • Text formatting • Use standard ASCII/Unicode • May need to convert certain formats to text or extract text information from them (e.g. PostScript, PDF) • What about OCRed documents?

  11. Item Normalization • Words must be sliced and diced before being considered for index construction. This may include • Identification of processing tokens (words) • Characterizations of tokens • Stemming of tokens

  12. Item Normalization • Applying stop lists to the collections of processing tokens • ftp://ftp.cs.cornell.edu/pub/smart/english.stop • E.g. able, about, after, allow, became, been, before, certainly, clearly, enough, everywhere, etc. • What to do with singletons (words appearing once in a collection of documents)?
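
A minimal sketch of token identification, stop-list filtering, and singleton detection follows. The stop list is a small hand-picked set that includes the examples from the slide; it is not the full SMART list. The function names and sample documents are hypothetical, and the tokenizer (runs of ASCII letters) is a simplifying assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop list (NOT the full SMART english.stop file).
STOP_WORDS = {
    "able", "about", "after", "allow", "became", "been", "before",
    "certainly", "clearly", "enough", "everywhere", "the", "a", "of", "is",
}


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def processing_tokens(text, stop_words=STOP_WORDS):
    """Tokens that survive the stop list, in document order."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


def singletons(documents, stop_words=STOP_WORDS):
    """Tokens that occur exactly once in the whole collection."""
    counts = Counter(
        tok for doc in documents for tok in processing_tokens(doc, stop_words)
    )
    return sorted(tok for tok, n in counts.items() if n == 1)


docs = [
    "The index is built after parsing the text.",
    "Clearly the index became large enough.",
]
print(processing_tokens(docs[0]))
print(singletons(docs))
```

Whether singletons should be dropped is a policy decision the slide leaves open; the sketch only reports them.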

  13. Item Normalization • Stemming: Removal of suffixes, and sometimes prefixes, to reduce a word to its root form. • E.g. reformation, reformative, reformatory, reformed, and reformism can all be stemmed to • reform or form? • This saves a considerable amount of space • However, one may lose the context of the search • E.g. someone looking for reformation gets some results that refer to reformatories (reform schools) • Syntactic stemmers vs. dictionary-based stemmers
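
The "reform or form?" question can be made concrete with a deliberately naive suffix stripper. This is not the Porter algorithm or any published stemmer; the suffix list, the optional prefix list, and the minimum stem length are assumptions picked to cover the slide's example words.

```python
# Hypothetical suffix list chosen only to cover the slide's examples.
SUFFIXES = ("ation", "ative", "atory", "ism", "ed")


def naive_stem(word, suffixes=SUFFIXES, prefixes=(), min_stem=3):
    """Strip the longest matching suffix, then optionally one prefix.

    A stem shorter than min_stem characters is never produced, which keeps
    short words such as "red" or "bed" intact.
    """
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[: -len(suffix)]
            break
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    return word


words = ["reformation", "reformative", "reformatory", "reformed", "reformism"]
print([naive_stem(w) for w in words])                    # suffixes only
print([naive_stem(w, prefixes=("re",)) for w in words])  # also strip "re"
```

Stripping the prefix as well collapses the whole family onto "form", which would also match unrelated words such as "formed"; this is the precision cost the slide alludes to.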

  14. Item Normalization • Stemming Advantages • Reduces diversity of word representations • Misspelled words are recognized • Handles plurals and common suffixes • Increases recall • Stemming Disadvantages • Retrieval of irrelevant documents (reduces precision) • Cannot be applied to proper nouns • Currently available stemmers • Al Stem: http://tides.umiacs.umd.edu/software.html • http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicStemmer.html • Porter Stemmer: http://maya.cs.depaul.edu/~classes/ds575/porter.html • http://webscripts.softpedia.com/scriptDownload/Porter-Stemmer-Download-42859.html

  15. Index Construction • Manual Indexing • Automatic Indexing • Inverted File Structure • Signature Files • Vector Space Models

  16. Manual Indexing • Every document is catalogued based on some individual’s or group’s assessment of what that document is about, and an appropriate list of descriptive entries is generated. • Advantage • Human indexers can establish relationships and concepts between seemingly different topics that can be very useful to future readers • Broader, narrower and related subjects

  17. Manual Indexing • Disadvantages • Expensive • Time consuming (think of manually indexing the Web) • Can be subject to the background and personality of the indexer • Cleverdon reported that if two groups of people construct thesauri in a particular subject area, the overlap of index terms was about 60% • Moreover, if two indexers used the same thesaurus on the same document, common index terms that were shared were about 30%. • May not be reproducible in case of modification or loss of information

  18. Manual Indexing • Manual indexing has shifted its focus toward “the abstraction of concepts and judgments on the value of the information” G. Kowalski, 1997

  19. Manual Indexing • Yahoo! (up to 1999) • Instead of using a web crawler, webmasters submit URLs for Yahoo! to pursue. If Yahoo! judges a site appropriate, it is included in the index; otherwise it is not. • Around 30% acceptance rate. • What about sites fitting in more than one category? • However, precision increases because the index is small

  20. Manual Indexing • EMBASE (Elsevier Science’s Bibliographic Database) Excerpta Medica DataBASE • Covers pharmacology and biomedicine • Uses machine-aided indexing to work hand in hand with manual indexing • National Library of Medicine • Publishes MeSH (Medical Subject Headings) • Uses indexers to assign as many headings as necessary to characterize accurately the content of a journal article. • H. W. Wilson Company (similar to the MeSH approach)

  21. Automatic Indexing • Using algorithms/software to extract terms for indexing is the predominant method for processing documents from large repositories. • On the Web, this consists of automated robots (crawlers) traversing the Web all day and night, collecting documents and indexing every word in the text. • Concepts may result from the index construction stage (as with vector space models), or may feed the index construction (as with inverted file structures and signature files), which is similar to manual indexing.

  22. Inverted File Structure • Consists of a document file, an inversion list and a dictionary. • Document File • Each document is given a unique identifier • Processing tokens within the document are identified • Dictionary • A sorted list of all unique words or processing tokens in the system, each with a pointer to the location of its inversion list. • May also include the frequency of each term in the collection (global frequency) • N-grams and PAT trees are well-known data structures for processing dictionaries • Inversion List • Contains, for each term, pointers to the documents that contain that term [and the positions within those documents].

  23. Figure 1: Inversion File Structure • Figure columns: Documents, Dictionary, Inversion Lists • Inversion lists shown • bit: 1, 3 • byte: 1, 2, 4 • computer: 1, 3, 4 • memory: 2, 3
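
A minimal sketch of building such a structure in Python is given below. The four documents are hypothetical: the figure's document contents are not in the transcript, so they were chosen only so that the resulting inversion lists match the ones listed for Figure 1. The function names are likewise illustrative.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it.

    documents: dict of doc_id -> list of processing tokens.
    The returned dict is ordered by term, playing the role of the
    sorted dictionary; each value is that term's inversion list.
    """
    inversion = defaultdict(set)
    for doc_id, tokens in documents.items():
        for token in tokens:
            inversion[token].add(doc_id)
    return {term: sorted(inversion[term]) for term in sorted(inversion)}


def search_all(index, terms):
    """Document IDs that contain every query term (conjunctive query)."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical document file (IDs 1-4) consistent with the figure's lists.
documents = {
    1: ["computer", "bit", "byte"],
    2: ["memory", "byte"],
    3: ["computer", "bit", "memory"],
    4: ["byte", "computer"],
}

index = build_inverted_index(documents)
for term, inversion_list in index.items():
    print(f"{term}: {inversion_list}")
print(search_all(index, ["computer", "byte"]))
```

A query for a misspelled term simply finds no inversion list, which illustrates why this structure depends on exact spelling.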

  24. Inversion lists may also include the position within the document • May help in supporting queries of • Phrases (consecutive keywords) • Words within specified proximity
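
To show why stored positions help, here is a small sketch of a positional index with an ordered proximity check; a two-word phrase is treated as the special case of a gap of exactly one position. The function names, the sample documents, and the choice to require the second word to follow the first are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in documents.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


def proximity_search(index, first, second, max_gap):
    """Docs where `second` occurs 1..max_gap positions after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        if any(
            0 < q - p <= max_gap
            for p in first_postings[doc_id]
            for q in second_postings[doc_id]
        ):
            hits.append(doc_id)
    return hits


def phrase_search(index, first, second):
    """Two-word phrase: `second` immediately follows `first`."""
    return proximity_search(index, first, second, max_gap=1)


# Hypothetical tokenized documents.
documents = {
    1: ["computer", "memory", "is", "fast"],
    2: ["memory", "of", "the", "computer"],
    3: ["computer", "main", "memory"],
}
pos_index = build_positional_index(documents)
print(phrase_search(pos_index, "computer", "memory"))
print(proximity_search(pos_index, "computer", "memory", max_gap=2))
```

Without positions, all three documents would match a query on both words, and the phrase could only be confirmed by rescanning the document text.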

  25. Pros • For queries interested only in more recent information, only the latest databases need to be searched. • Provides optimum performance. • Concepts and their relationships can be stored.

  26. Cons • Space requirement for personal file system • Needs exact spelling

  27. Signature File • Signature file search is a linear scan of the compressed version of items, producing a response time linear with respect to file size. • In signature file indexing, each record is allocated a fixed-width signature, or bitstring, of w bits. • Each word that appears in the record is hashed a number of times to determine the bits in the signature that should be set.

  28. Signature File • Any record whose signature has a 1-bit corresponding to every 1-bit in the query signature is a potential answer • Each such record must be fetched and checked directly against the query to determine whether it is a false match or a true match. • Many variants of signature file are available
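
The scheme described on the two slides above can be sketched as follows. This is one simple variant (signatures formed by OR-ing together the bits of every word), not a definitive implementation; the width of 64 bits, the three hash functions derived from SHA-256, and all function names are assumptions made for illustration.

```python
import hashlib


def word_bits(word, width, num_hashes):
    """Bits set by one word: hash it num_hashes times into `width` bits."""
    bits = 0
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % width
        bits |= 1 << position
    return bits


def signature(words, width=64, num_hashes=3):
    """Fixed-width signature of a record or query (stored as an int)."""
    sig = 0
    for word in words:
        sig |= word_bits(word, width, num_hashes)
    return sig


def build_signature_file(records, width=64, num_hashes=3):
    """Map record ID -> signature for every record."""
    return {rid: signature(ws, width, num_hashes) for rid, ws in records.items()}


def search(records, signature_file, query_words, width=64, num_hashes=3):
    """Linear scan; returns (true_matches, false_matches)."""
    query_sig = signature(query_words, width, num_hashes)
    true_matches, false_matches = [], []
    for rid, sig in signature_file.items():
        if (sig & query_sig) == query_sig:                # potential answer
            if set(query_words) <= set(records[rid]):     # direct check
                true_matches.append(rid)
            else:
                false_matches.append(rid)
    return true_matches, false_matches


# Hypothetical records.
records = {
    1: ["computer", "bit", "byte"],
    2: ["memory", "byte"],
    3: ["computer", "bit", "memory"],
    4: ["byte", "computer"],
}
sig_file = build_signature_file(records)
true_hits, false_hits = search(records, sig_file, ["computer", "bit"])
print(true_hits, false_hits)
```

A record that really contains all the query words always passes the bit test, so the scan never misses an answer; false matches arise only when unrelated words happen to set the same bits, and they become more likely as the signature fills up.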

  29. Signature Files

  30. Pros & Cons • Pros • Support ranked queries • Cons • A variety of parameters must be fixed in advance • Expensive for disjunctive queries • Response time is unpredictable • Not scalable
