Introduction to Text Retrieval

CSE3201/4500

Information Retrieval Systems

(c) Maria Indrawan 2004

Database Types

  • Spectrum from highly-structured to ill-structured:
    • Relational DB
    • XML collections
    • Text Collections
    • Multimedia Collections

Ill-structured data
  • Attributes:
    • Variable length records, fields
    • Repeated fields [non-normalised]
    • Mixed media
    • Often large
  • Often accessed by “novice” users
  • Need for both currency and completeness

Information Retrieval
  • Information retrieval has been the term applied to such areas as:
    • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications etc.
  • These systems are typical of variable-length record systems
  • Text retrieval is a subset of Information Retrieval.
    • research articles often use the term IR to mean text retrieval, especially in the 1970s, 80s and 90s.

Text Retrieval - Overview
  • Information retrieval
    • branch of database theory
    • specialises in managing retrieval of unstructured data
    • large amount of free format text.
  • Key problem:
    • How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query.
  • Response to a query:
    • Does not answer the query directly
    • Identifies relevant information.

Text Retrieval Characteristics
  • large volume of document space
  • document space may/may not be structured.
  • query may not be structured.
  • exact matching, as used in relational databases, will not work effectively.
  • objects to be retrieved are usually represented by surrogate records.

Surrogate Records
  • Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves.
  • The quality of the surrogate records often decides how well the system retrieves.
  • The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.

Text Retrieval Processes
  • Representation
  • Storage
  • Organization
  • Retrieval
  • Presentation

Text Retrieval Processes Model

Retrieval Process


Indexing (Document Analysis)
  • Document: natural language text
  • ANALYSE: extract keywords
    • Stemming
    • Thesauri replacement
    • (Weight assignment)
  • STORE
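
The analyse-then-store flow on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the `THESAURUS` entries, the `analyse` function, and the use of raw term counts as a stand-in for weights are all hypothetical and are not taken from the lecture.

```python
from collections import Counter

# Hypothetical thesaurus: maps variant keywords to one preferred term.
THESAURUS = {"car": "automobile", "auto": "automobile"}

def analyse(text, stem=lambda word: word):
    """Turn natural language text into weighted keywords."""
    keywords = [word.casefold() for word in text.split()]
    keywords = [stem(word) for word in keywords]             # stemming step (identity by default)
    keywords = [THESAURUS.get(word, word) for word in keywords]  # thesauri replacement
    return Counter(keywords)  # raw counts used as a placeholder weight

# STORE: keep the analysed representation under the document id.
store = {}
store["Doc-1"] = analyse("Car auto engine car")
print(store["Doc-1"]["automobile"])  # 3
```

Passing a real stemmer as `stem` would slot stemming into the same pipeline without changing the other steps.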

Query Formulation
  • Controlled vocabulary:
    • keywords of the query are drawn from the same set as the keywords in the document collection


Indexing in Text Retrieval Systems

Indexed Files in Traditional Databases
  • An index is a look up table which establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file.
  • One named (physical) file - two logical files:
    • Data file - contains full data records
    • Index file - “records” consist of two fields: key value and address

  • Index file small - quick to search
  • Addresses obtained from the index enable direct access to the data file
  • Logically sequential access also via index
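
As a rough illustration of the points above, the sketch below keeps a small index of key values and addresses alongside a data file. It is a minimal example under assumptions: the record contents, the `build_index` and `fetch` names, and the in-memory buffer standing in for the physical file are all invented for the illustration.

```python
import io

def build_index(data_file, records):
    """Append records to the data file; return {key value: (offset, length)}."""
    index = {}
    for key, text in records:
        payload = text.encode("utf-8")
        index[key] = (data_file.tell(), len(payload))  # address of the record
        data_file.write(payload)
    return index

def fetch(data_file, index, key):
    """Direct access: look the key up in the index, then seek to its address."""
    offset, length = index[key]
    data_file.seek(offset)
    return data_file.read(length).decode("utf-8")

# Hypothetical records; a BytesIO buffer stands in for the data file.
data_file = io.BytesIO()
records = [
    ("smith", "Smith, J. | 12 High St"),
    ("adams", "Adams, K. | 3 Low Rd"),
    ("jones", "Jones, P. | 7 Mid Ave"),
]
index = build_index(data_file, records)

print(fetch(data_file, index, "adams"))

# Logically sequential access: walk the index in key order,
# even though the records are stored in arrival order.
for key in sorted(index):
    print(key, "->", fetch(data_file, index, key))
```

The index holds only keys and addresses, so it stays small relative to the data file, which is what makes it quick to search.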


Indexed Non-Sequential File
  • [Figure: Index and Data Records]


Indexed Sequential File
  • [Figure: Index and Data Records]

Indexing in Text Retrieval Systems

  • [Figure: data records Doc-1 and Doc-2]

Purpose of Indexing
  • a sufficiently general description of a document so that it can be retrieved with queries that concern the same subject as the document;
  • sufficiently specific description so that the document will not be returned for those queries which are not related to the document.

Indexing
  • Manual indexing
  • Automatic indexing

Style of indexing
  • depends on the form of queries and vice-versa.
  • We must decide whether the terms available for indexing are predefined, a controlled vocabulary, or chosen at the time of indexing, an uncontrolled vocabulary.

Controlled Vocabulary
  • Controlled vocabulary is a method of predetermining the terms which will be used in a specific domain so that
    • indexers will select from a limited set of terms
    • searchers can use terms knowing that they have been applied in an objective manner
    • index sets are reduced in size
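
One way to picture a controlled vocabulary is as a fixed set that every proposed index term is checked against. The sketch below is only an illustration: the vocabulary contents and the `validate_terms` helper are hypothetical.

```python
# Hypothetical controlled vocabulary for a small computing collection.
CONTROLLED_VOCABULARY = {"databases", "information retrieval", "indexing", "networks"}

def validate_terms(proposed_terms, vocabulary=CONTROLLED_VOCABULARY):
    """Split proposed index terms into accepted and rejected lists."""
    accepted, rejected = [], []
    for term in proposed_terms:
        normalised = term.strip().casefold()
        if normalised in vocabulary:
            accepted.append(normalised)
        else:
            rejected.append(normalised)
    return accepted, rejected

accepted, rejected = validate_terms(["Indexing", "text search", "Databases"])
print(accepted)  # ['indexing', 'databases']
print(rejected)  # ['text search']
```

The same check can be applied to query terms, so that indexers and searchers share one limited set of terms.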

Manual Indexing Methods
  • 1. Give the document a single code from a predefined list, e.g.:
    • the first letter of the first author’s family name
    • a Dewey Decimal number
  • 2. Assign several codes from a predefined list to a document, e.g.:
    • assign Computing Reviews classifications to articles.
  • 3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.

Manual Indexing - Analysis
  • Single term indexing: simple and low index cost, but poor retrieval.
  • All other techniques require that a more complex index be maintained.
  • When a controlled vocabulary is used, a taxonomy of the document contents must be devised. Having devised this it must be adhered to henceforth.

Manual Indexing - Analysis
  • Advantage: terms that never appear in the text but are highly descriptive may be assigned to the document.
  • Disadvantages:
    • poor inter-indexer consistency
    • inflexible view of documents
    • no control over the number of satisfying documents.

Automatic Indexing - A Basic Method
  • Assume that a document consists of just text and that we will derive our indexing terms from this text.
  • Break the text up into words, casefold, and index on every word. This technique is very simple and performs reasonably well.
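
The basic method above can be sketched as a small inverted index. This is a minimal example, not the lecture's implementation: the `tokenize` and `build_inverted_index` names, the word pattern, and the two sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Break the text up into words and casefold each one."""
    return [word.casefold() for word in re.findall(r"[A-Za-z0-9]+", text)]

def build_inverted_index(documents):
    """Index on every word: map each word to the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            postings[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in postings.items()}

# Hypothetical two-document collection.
documents = {
    "Doc-1": "Text retrieval is a subset of information retrieval.",
    "Doc-2": "An index is a look up table.",
}
index = build_inverted_index(documents)
print(index["retrieval"])  # ['Doc-1']
print(index["is"])         # ['Doc-1', 'Doc-2']
```

Because every word is indexed, very common words such as "is" end up pointing at almost every document, which is what the refinements that follow address.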

Automatic Indexing - Refinement
  • Language dependent.
    • refinement for English will be different from Chinese
  • Stop List
  • Stemming
  • Term Weighting

Indexing Refinement – Stop List
  • A list of common words.
  • Generally contains words that are not nouns, verbs, adjectives or adverbs.
  • A stop list might consist of a, the, an, is, be, ...
  • Common stop lists run from 10 to several hundred words.
    • The exact choice of stop list matters little; typically around 300 common words will do well.
  • Indexing process will ignore the words listed in the stop list.
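
A stop list can be applied as a simple filter before indexing. The sketch below uses only the five example words from this slide; the `remove_stop_words` name and the sample sentence are assumptions made for illustration.

```python
# The example stop words given on the slide; a real list would be longer.
STOP_LIST = {"a", "the", "an", "is", "be"}

def remove_stop_words(words, stop_list=STOP_LIST):
    """Drop every word that appears in the stop list (case-insensitively)."""
    return [word for word in words if word.casefold() not in stop_list]

words = "The index is a look up table".split()
print(remove_stop_words(words))  # ['index', 'look', 'up', 'table']
```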

Stop Lists
  • Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
    • Fox, C. (1992). Lexical Analysis and Stoplists. In Frakes, W.B. and Baeza-Yates, R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
  • The first 20 stop words:
    • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.
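
The kind of coverage figure quoted above can be measured on any text by counting how many tokens fall in the stop list. The sketch below is illustrative only: the `stop_word_coverage` function and the sample sentence are hypothetical, and the result for such a tiny sample says nothing about the corpus-level figure reported by Fox.

```python
# The 20 stop words listed on this slide.
FIRST_20_STOP_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}

def stop_word_coverage(text, stop_words=FIRST_20_STOP_WORDS):
    """Return the fraction of whitespace-separated tokens that are stop words."""
    tokens = [token.casefold() for token in text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)

sample = "the cat sat on the mat and it was not happy"
print(f"{stop_word_coverage(sample):.2%}")  # 63.64%
```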

Refinement - Stemming
  • Stemming reduces the many variant forms of a word to a common stem, so that all variations expressing a concept are matched together.
  • This avoids exceedingly long “or” query statements.
  • Example: inquiry or inquired or inquiries
  • The process is performed after the “stop list” process.
  • Porter stemming algorithm
    • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3): 130-137.
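
To make the idea concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies a much more careful sequence of rules; the suffix list, the minimum stem length, and the `crude_stem` name are assumptions chosen so that the slide's example words collapse to one stem.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ies", "ing", "ed", "es", "y", "e", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.casefold()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["inquiry", "inquired", "inquiries"]
print({crude_stem(word) for word in variants})  # {'inquir'}
```

With every variant indexed under the same stem, a single query term replaces the long "or" statement.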

Stemming - Suffix
  • Most English meaning shifts for grammatical purposes are handled by suffixes.
  • Most retrieval systems allow “trailing” (suffix) truncation.
  • Example:
    • “inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry” etc.
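
Trailing truncation amounts to a prefix match against the indexed vocabulary. The sketch below treats `$` as the truncation symbol, as in the example above; the `truncation_match` function and the sample vocabulary are hypothetical.

```python
def truncation_match(pattern, vocabulary):
    """Return the vocabulary words matched by a pattern such as 'inquir$'."""
    if pattern.endswith("$"):
        prefix = pattern[:-1].casefold()
        return sorted(word for word in vocabulary if word.casefold().startswith(prefix))
    # Without the truncation symbol, fall back to an exact match.
    return sorted(word for word in vocabulary if word.casefold() == pattern.casefold())

vocabulary = ["inquire", "inquired", "inquires", "inquiring", "inquiry", "index", "query"]
print(truncation_match("inquir$", vocabulary))
# ['inquire', 'inquired', 'inquires', 'inquiring', 'inquiry']
```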

Stemming - Prefix
  • Usually not used in English text retrieval systems.
  • A prefix is a substantial modifier of meaning, and may even be a negation.
  • Example:
    • flammable and inflammable.
  • Prefix stemming may be useful in Chemical databases.

Stemming – Exception List
  • Irregular forms in the language need to be handled with a “lookup list”.
  • Example:
    • Irregular plurals
      • woman => women
      • child => children
    • past tense
      • choose => chose
      • find => found
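
An exception list is typically consulted before any rule-based stemming is tried. The sketch below uses the four pairs from this slide as the lookup list; the `normalise` function and the trailing-"s" fallback rule are assumptions made for illustration.

```python
# Lookup list built from the slide's examples: irregular form -> base form.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}

def normalise(word, fallback=lambda w: w):
    """Check the exception list first; otherwise apply the fallback rule."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return fallback(word)

def strip_plural_s(word):
    """Hypothetical fallback rule: remove a single trailing 's'."""
    return word[:-1] if word.endswith("s") else word

print(normalise("Children"))                            # child
print(normalise("documents", fallback=strip_plural_s))  # document
```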

Summary
  • Text Retrieval Systems:
    • motivation
    • model
  • Indexing Refinements:
    • Stop List
    • Stemming
    • Term Weight (week 8)
