Introduction to Text Retrieval

Presentation Transcript


  1. Introduction to Text Retrieval CSE3201/4500 Information Retrieval Systems (c) Maria Indrawan 2004

  2. Database Types
     • Ordered from highly structured to ill-structured:
       • Relational DB
       • XML collections
       • Text Collections
       • Multimedia Collections

  3. Ill-structured data
     • Attributes:
       • Variable-length records and fields
       • Repeated fields [non-normalised]
       • Mixed media
       • Often large
       • Often accessed by “novice” users
       • Need for both currency and completeness

  4. Information Retrieval
     • Information retrieval is the term applied to areas such as:
       • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications, etc.
     • These systems are typical of variable-length record systems.
     • Text retrieval is a subset of information retrieval.
       • Research articles, especially those from the 1970s, 1980s and 1990s, may use the term IR to mean text retrieval.

  5. Text Retrieval - Overview
     • Information retrieval:
       • a branch of database theory
       • specialises in managing the retrieval of unstructured data
       • deals with large amounts of free-format text
     • Key problem:
       • How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query.
     • Response to a query:
       • does not answer the query directly
       • identifies relevant information

  6. Text Retrieval Characteristics
     • The document space is large.
     • The document space may or may not be structured.
     • The query may not be structured.
     • Exact matching, as used in relational databases, will not work effectively.
     • The objects to be retrieved are usually represented by surrogate records.

  7. Surrogate Records
     • Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves.
     • The quality of the surrogate records often decides how well the system retrieves.
     • The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.

  8. Text Retrieval Processes
     • Representation
     • Storage
     • Organization
     • Retrieval
     • Presentation

  9. Text Retrieval Processes Model

  10. Retrieval Process

  11. Indexing (Document Analysis)
      • Document: natural language text
      • ANALYSE: keywords, stemming, thesauri replacement, (weight assignment)
      • STORE

  12. Query Formulation
      • Controlled vocabulary:
        • keyword of query ⇔ keyword in document collection

  13. Indexing in Text Retrieval Systems

  14. Indexed Files in Traditional Databases
      • An index is a lookup table that establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file.
      • One named (physical) file, two logical files:
        • Data file: contains the full data records
        • Index file: “records” consist of two fields, key value and address
      • The index file is small, so it is quick to search.
      • Addresses obtained from the index enable direct access to the data file.
      • Logically sequential access is also available via the index.
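
The index file described in the slide above can be sketched as a table of (key value, address) pairs kept separately from the data file. This is a minimal illustration rather than how any particular database lays out its files; the record contents, the in-memory `io.BytesIO` "file", and the function names are all assumptions made for the example.

```python
import io

# Hypothetical variable-length data records, stored in arrival (unsorted) order.
records = [
    b"1003,Smith,Clayton\n",
    b"1001,Nguyen,Caulfield\n",
    b"1002,Lee,Berwick\n",
]

data_file = io.BytesIO()   # stands in for the data file on disk
index = {}                 # index "records": key value -> byte address

for rec in records:
    key = rec.split(b",")[0].decode()
    index[key] = data_file.tell()   # address where this record starts
    data_file.write(rec)


def fetch(key):
    """Direct access: find the address in the index, then seek to it."""
    data_file.seek(index[key])
    return data_file.readline().decode().rstrip("\n")


def scan_in_key_order():
    """Logically sequential access via the index, despite unsorted storage."""
    return [fetch(key) for key in sorted(index)]
```

Calling `fetch("1002")` reads one record without scanning the data file, and `scan_in_key_order()` returns the records in key order even though they were written out of order.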

  15. Indexed Non-Sequential File [figure: Index, Data Records]

  16. Indexed Sequential File [figure: Index, Data Records]

  17. Indexing in Text Retrieval Systems [figure: Doc-1 (data record), Doc-2 (data record)]

  18. Purpose of Indexing
      • To give a description of a document that is sufficiently general for the document to be retrieved by queries that concern the same subject as the document;
      • yet sufficiently specific that the document is not returned for queries unrelated to it.

  19. Indexing
      • Manual indexing
      • Automatic indexing

  20. Style of indexing
      • The style of indexing depends on the form of the queries, and vice versa.
      • We must decide whether the terms available for indexing are predefined (a controlled vocabulary) or chosen at the time of indexing (an uncontrolled vocabulary).

  21. Controlled Vocabulary
      • A controlled vocabulary predetermines the terms that will be used in a specific domain so that:
        • indexers select from a limited set of terms
        • searchers can use terms knowing that they have been applied in an objective manner
        • index sets are reduced in size
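
As a rough sketch of the first point above, an indexing tool could accept only terms that appear in the predetermined list. The vocabulary contents and the `assign_terms` helper are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical controlled vocabulary for one domain (an assumption for this example).
CONTROLLED_VOCABULARY = {"information retrieval", "indexing", "databases"}


def assign_terms(proposed):
    """Split an indexer's proposed terms into accepted and rejected terms."""
    accepted = sorted(
        term.casefold() for term in proposed
        if term.casefold() in CONTROLLED_VOCABULARY
    )
    rejected = sorted(
        term for term in proposed
        if term.casefold() not in CONTROLLED_VOCABULARY
    )
    return accepted, rejected
```

For example, `assign_terms(["Indexing", "search engines"])` accepts `indexing` and rejects `search engines`, which keeps the index set limited to the agreed terms.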

  22. Manual Indexing Methods
      • 1. Give the document a single code from a predefined list, e.g.:
        • the first letter of the first author’s family name
        • a Dewey Decimal number
      • 2. Assign several codes from a predefined list to a document, e.g.:
        • assign the Computing Reviews classification to articles
      • 3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.

  23. Manual Indexing - Analysis
      • Single-term indexing: simple, with a low indexing cost, but poor retrieval.
      • All other techniques require a more complex index to be maintained.
      • When a controlled vocabulary is used, a taxonomy of the document contents must be devised. Once devised, it must be adhered to from then on.

  24. Manual Indexing - Analysis
      • Advantage: terms that never appear in the text but are highly descriptive may be assigned to the document.
      • Disadvantages:
        • inter-indexer consistency is hard to achieve
        • inflexible view of documents
        • no control over the number of documents that satisfy a query

  25. Automatic Indexing - A Basic Method
      • Assume that a document consists of just text and that the indexing terms are derived from this text.
      • Break the text into words, casefold them, and index on every word. This technique is very simple and performs reasonably well.
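
The basic method above (split into words, casefold, index every word) can be sketched as a small inverted index. The tokenising pattern, the sample documents and the `build_index` name are assumptions made for illustration; a real system would need a more careful definition of "word".

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map every casefolded word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: a "word" is a run of ASCII letters or digits.
        for word in re.findall(r"[A-Za-z0-9]+", text):
            index[word.casefold()].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    "Doc-1": "Text retrieval systems index text.",
    "Doc-2": "Relational databases use exact matching.",
}
index = build_index(docs)
```

Here `index["text"]` contains only `Doc-1`, and because of casefolding the capitalised form `Text` is not stored as a separate entry.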

  26. Automatic Indexing - Refinement
      • Refinements are language dependent.
        • Refinements for English will differ from those for Chinese.
      • Stop List
      • Stemming
      • Term Weighting

  27. Indexing Refinement – Stop List
      • A stop list is a list of common words.
      • It generally contains words that are not nouns, verbs, adjectives or adverbs.
      • A stop list might consist of: a, the, an, is, be, ...
      • Common stop lists run from 10 to hundreds of words.
      • The exact contents of the stop list matter little; around 300 common words will typically do well.
      • The indexing process ignores the words in the stop list.

  28. Stop Lists
      • Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
        • Fox, C. (1992). Lexical Analysis and Stoplists. In Frakes, W.B. and Baeza-Yates, R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
      • The first 20 stop words:
        • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by
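
A minimal sketch of applying a stop list during indexing, using the 20 words listed above. The `remove_stop_words` helper and the sample sentence are assumptions for the example; a practical stop list would usually be longer.

```python
# The 20 stop words quoted on the slide above.
STOP_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}


def remove_stop_words(tokens):
    """Drop any token whose casefolded form is in the stop list."""
    return [token for token in tokens if token.casefold() not in STOP_WORDS]


tokens = "The quality of the surrogate records".split()
kept = remove_stop_words(tokens)
```

Only `quality`, `surrogate` and `records` survive, so the index stores three terms for this sentence instead of six.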

  29. Refinement - Stemming
      • Stemming conflates the many variant forms of a word that express the same concept.
      • This avoids exceedingly long “or” query statements.
        • Example: inquiry or inquired or inquiries
      • The process is performed after the “stop list” process.
      • Porter stemming algorithm:
        • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3): 130-137.

  30. Stemming - Suffix
      • In English, most changes to a word for grammatical purposes are made with suffixes.
      • Most retrieval systems allow “trailing” (suffix) truncation.
      • Example:
        • “inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry”, etc.
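
Trailing truncation of the kind shown above can be sketched as a prefix match against the index vocabulary. The `$` convention is taken from the slide; the `truncation_match` function and the sample vocabulary are assumptions for illustration.

```python
def truncation_match(pattern, vocabulary):
    """Return the vocabulary terms matched by a query term.

    A trailing "$" means "any suffix"; otherwise the term must match exactly.
    """
    if pattern.endswith("$"):
        stem = pattern[:-1]
        return sorted(word for word in vocabulary if word.startswith(stem))
    return sorted(word for word in vocabulary if word == pattern)


# Hypothetical index vocabulary.
vocabulary = {
    "inquire", "inquired", "inquires", "inquiring", "inquiry",
    "enquiry", "require",
}
matches = truncation_match("inquir$", vocabulary)
```

The query `inquir$` matches the five `inquir...` terms but not `enquiry` or `require`, since only the end of the word is left open.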

  31. Stemming - Prefix
      • Prefix stemming is not usually used in English text retrieval systems.
      • A prefix can modify meaning substantially, or even negate it.
      • Example:
        • flammable and inflammable.
      • Prefix stemming may be useful in chemical databases.

  32. Stemming – Exception List
      • Irregularities in the language need to be handled with a “lookup list”.
      • Examples:
        • Irregular plurals
          • woman => women
          • child => children
        • Past tense
          • choose => chose
          • find => found
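
One way to read the "lookup list" above is as a table consulted before any rule-based suffix stripping. The sketch below uses the four irregular forms from the slide; the fallback rule is a deliberately crude stand-in and is not the Porter algorithm, and the `normalise` name and the suffix list are assumptions.

```python
# Irregular forms from the slide, mapped back to their base form.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}


def normalise(word):
    """Check the exception list first, then fall back to naive suffix stripping."""
    w = word.casefold()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    # Crude illustrative fallback (NOT Porter): strip one common suffix,
    # keeping at least three characters of the word.
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w
```

With this ordering `normalise("Women")` gives `woman` through the lookup, while regular forms such as `records` fall through to the suffix rule and become `record`.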

  33. Summary
      • Text Retrieval Systems:
        • motivation
        • model
      • Indexing Refinements:
        • Stop List
        • Stemming
        • Term Weight (week 8)