Introduction to Text Retrieval

CSE3201/4500 Information Retrieval Systems

(c) Maria Indrawan 2004

Database Types

highly-structured

Relational DB

XML collections

Text Collections

Multimedia Collections

ill-structured


Ill-structured data

  • Attributes:

    • Variable length records, fields

    • Repeated fields [non-normalised]

    • Mixed media

    • Often large

  • Often accessed by “novice” users

  • Need for both currency and completeness

Information Retrieval

  • Information retrieval has been the term applied to such areas as:

    • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications etc.

  • These systems are typical of variable-length record systems

  • Text retrieval is a subset of Information Retrieval.

    • Research articles may use the term IR to mean text retrieval, especially in the 1970s, 1980s and 1990s.

Text Retrieval - Overview

  • Information retrieval

    • branch of database theory

    • specialises in managing retrieval of unstructured data

    • large amount of free format text.

  • Key problem:

    • How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query.

  • Response to a query:

    • Does not answer the query directly

    • Identifies relevant information

Text Retrieval Characteristics

  • large volume of document space

  • document space may/may not be structured.

  • query may not be structured.

  • exact matching, as used in relational databases, will not work effectively.

  • the objects to be retrieved are usually represented by surrogate records.

Surrogate Records

  • Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves.

  • The quality of the surrogate records often decides how well the system retrieves.

  • The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.

Text Retrieval Processes

  • Representation

  • Storage

  • Organization

  • Retrieval

  • Presentation

Text Retrieval Processes Model

[Figure: text retrieval processes model]

Retrieval Process

[Figure: retrieval process]
Indexing (Document Analysis)

Natural Language Text

- Stemming

- Thesauri Replacement

- (Weight Assignment)
Query Formulation

  • Controlled vocabulary:

    • keywords of the query are drawn from the same vocabulary as the keywords in the document collection

Indexing in Text Retrieval Systems

[Figure: indexing in text retrieval systems]

Indexed Files in Traditional Databases

  • An index is a lookup table which establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file.

  • One named (physical) file - two logical files:

    • Data file - contains full data records

    • Index file - “records” consist of two fields: key value and address

  • The index file is small, so it is quick to search

  • Addresses obtained from the index enable direct access to the data file

  • Logically sequential access is also possible via the index
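
The index/data split described above can be sketched roughly as follows. This is an illustration only: the fixed record width, the in-memory "files", and the names `build_files`, `fetch` and `scan_in_key_order` are assumptions, not part of the lecture material.

```python
import io

# Assumed fixed record width, chosen only to keep the sketch short.
RECORD_SIZE = 16

def build_files(records):
    """Return (data_file, index) for a list of (key, text) pairs."""
    data_file = io.BytesIO()          # stands in for the data file
    index = {}                        # stands in for the index file: key value -> address
    for key, text in records:
        index[key] = data_file.tell()                     # address = byte offset
        data_file.write(text.encode("ascii").ljust(RECORD_SIZE))
    return data_file, index

def fetch(data_file, index, key):
    """Direct access: look up the address, then seek straight to the record."""
    data_file.seek(index[key])
    return data_file.read(RECORD_SIZE).decode("ascii").rstrip()

def scan_in_key_order(data_file, index):
    """Logically sequential access: walk the index in key order."""
    return [fetch(data_file, index, key) for key in sorted(index)]

data_file, index = build_files([(30, "Smith"), (10, "Jones"), (20, "Brown")])
print(fetch(data_file, index, 20))           # Brown
print(scan_in_key_order(data_file, index))   # ['Jones', 'Brown', 'Smith']
```

The records are stored in arrival order, yet the index still gives both direct access by key and a key-ordered scan.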

Indexed Non-Sequential File

[Figure: indexed non-sequential file and its data records]

Indexed Sequential File

[Figure: indexed sequential file and its data records]

Indexing in Text Retrieval Systems

[Figure: indexing in text retrieval systems, showing data records]

Purpose of Indexing

  • To provide a sufficiently general description of a document, so that it can be retrieved by queries that concern the same subject as the document;

  • To provide a sufficiently specific description, so that the document will not be returned for queries that are unrelated to it.


  • Manual indexing

  • Automatic indexing

Style of Indexing

  • depends on the form of queries and vice-versa.

  • We must decide whether the terms available for indexing are predefined, a controlled vocabulary, or chosen at the time of indexing, an uncontrolled vocabulary.

Controlled Vocabulary

  • Controlled vocabulary is a method of predetermining the terms which will be used in a specific domain so that

    • indexers will select from a limited set of terms

    • searchers can use terms knowing that they have been applied in an objective manner

    • index sets are reduced in size

Manual Indexing Methods

  • 1. Give the document a single code from a predefined list. e.g.:

    • the first letter of the first author’s family name

    • a Dewey Decimal number

  • 2. Assign several codes from a predefined list to a document. e.g.:

    • assign the Computing Reviews classification to articles.

  • 3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.

Manual Indexing - Analysis

  • Single term indexing: simple and low index cost, but poor retrieval.

  • All other techniques require that a more complex index be maintained.

  • When a controlled vocabulary is used, a taxonomy of the document contents must be devised. Having devised this it must be adhered to henceforth.

Manual Indexing - Analysis

  • Advantage: terms that never appear in the text but are highly descriptive may be assigned to the document.

  • Disadvantages:

    • poor inter-indexer consistency

    • inflexible view of documents

    • no control over the number of documents that satisfy a query.

Automatic Indexing - A Basic Method

  • Assume that a document consists of just text and that we will derive our indexing terms from this text.

  • Break the text up into words, casefold, and index on every word. This technique is very simple and performs reasonably well.
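
A minimal sketch of this basic method, assuming documents are plain strings keyed by a numeric id. The tokenising rule (runs of ASCII letters and digits) and the names `tokenize` and `build_index` are illustrative choices, not something the slides prescribe.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Break the text up into words and casefold each one."""
    return [word.casefold() for word in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(documents):
    """Index on every word: map each word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

documents = {
    1: "Text retrieval is a subset of information retrieval.",
    2: "Information Retrieval Systems",
}
index = build_index(documents)
print(sorted(index["retrieval"]))  # [1, 2]
print(sorted(index["text"]))       # [1]
```

Because of casefolding, "Retrieval" in document 2 and "retrieval" in document 1 end up under the same index term.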

Automatic Indexing - Refinement

  • Language dependent.

    • refinement for English will be different from Chinese

  • Stop List

  • Stemming

  • Term Weighting

Indexing Refinement – Stop List

  • A list of common words.

  • Generally contains words that are not nouns, verbs, adjectives or adverbs.

  • A stop list might consist of a, the, an, is, be, ...

  • Common stop lists run from 10 to hundreds of words.

    • The exact contents of the stop list matter little; typically around 300 common words will do well.

  • The indexing process ignores the words listed in the stop list.
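
As a rough illustration, stop-list filtering might look like the sketch below. The five-word `STOP_LIST` is just the slide's example and the whitespace tokeniser is an assumption; a practical list would be much longer.

```python
# The slide's example words only; a real stop list would hold a few hundred words.
STOP_LIST = {"a", "the", "an", "is", "be"}

def index_terms(text, stop_list=STOP_LIST):
    """Casefold the words of `text` and drop those found in the stop list."""
    words = [word.casefold() for word in text.split()]
    return [word for word in words if word not in stop_list]

print(index_terms("The index is a lookup table"))  # ['index', 'lookup', 'table']
```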

Stop Lists

  • Fox indicates that the first 20 stop words account for 31.19% of the English corpus.

    • Fox, C. (1992). Lexical Analysis and Stoplists. In Frakes, W.B. and Baeza-Yates, R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.

  • The first 20 stop words:

    • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.

Refinement - Stemming

  • Stemming reduces the variations of a word to a common stem, so that the many word forms that express one concept are treated as the same index term.

  • This avoids exceedingly long “or” query statements.

  • Example: inquiry or inquired or inquiries

  • The process is performed after the “stop list” process.

  • Porter stemming algorithm

    • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3): 130-137.

Stemming - Suffix

  • Most English meaning shifts for grammatical purposes are handled by suffixes.

  • Most retrieval systems allow for “trailing” (suffix) truncation.

  • Example:

    • “inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry”, etc.
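
Trailing truncation can be sketched as a prefix match against the index vocabulary. The function name `truncation_search` and the sample vocabulary are hypothetical; only the "$" convention comes from the example above.

```python
def truncation_search(pattern, vocabulary):
    """Return the vocabulary words matched by a query term.

    A term ending in "$" matches every word that starts with the text
    before the "$"; any other term must match a word exactly.
    """
    if pattern.endswith("$"):
        stem = pattern[:-1]
        return sorted(word for word in vocabulary if word.startswith(stem))
    return sorted(word for word in vocabulary if word == pattern)

vocabulary = {"inquire", "inquired", "inquires", "inquiring", "inquiry",
              "inquest", "require"}
print(truncation_search("inquir$", vocabulary))
# ['inquire', 'inquired', 'inquires', 'inquiring', 'inquiry']
```

Words such as "inquest" and "require" are not returned, since they do not begin with "inquir".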

Stemming - Prefix

  • Prefix stemming is usually not used in English text retrieval systems.

  • A prefix is a substantial modifier, and may even be a negation.

  • Example:

    • flammable and inflammable.

  • Prefix stemming may be useful in chemical databases.

Stemming – Exception List

  • Irregularities in the language need to be implemented as a “lookup list”.

  • Examples:

    • Irregular plurals

      • woman => women

      • child => children

    • Past tense

      • choose => chose

      • find => found
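
One way to combine an exception list with suffix stripping is sketched below. The lookup table stores the slide's pairs in the reverse direction (irregular form to base form), because that is the direction a stemmer needs. The four-entry suffix list is a deliberately crude stand-in and is not the Porter algorithm; all names here are illustrative.

```python
# Exception list: irregular form -> base form (the slide's examples, reversed).
EXCEPTIONS = {"women": "woman", "children": "child",
              "chose": "choose", "found": "find"}

# A deliberately tiny suffix list for illustration; this is NOT Porter's rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Consult the exception list first, then strip one known suffix."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Keep at least three letters so that short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Children", "found", "indexing", "indexed", "indexes", "documents"]
print([stem(w) for w in words])
# ['child', 'find', 'index', 'index', 'index', 'document']
```

Checking the lookup list before stripping suffixes ensures that irregular forms are mapped directly to their base form rather than being passed through the suffix rules.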

  • Text Retrieval Systems:

    • motivation

    • model

  • Indexing Refinements:

    • Stop List

    • Stemming

    • Term Weight (week 8)
