Automatic Indexing (Term Selection)

Automatic Indexing (Term Selection)

Automatic Text Processing

by G. Salton, Chap 9,

Addison-Wesley, 1989.



Automatic Indexing

  • Indexing:

    • assign identifiers (index terms) to text documents.

  • Identifiers:

    • single-term vs. term phrase

    • controlled vs. uncontrolled vocabularies (e.g., instruction manuals, terminological schedules, …)

    • objective vs. nonobjective text identifiers (objective identifiers are defined by cataloging rules, e.g., author names, publisher names, dates of publication, …)



Two Issues

  • Issue 1: indexing exhaustivity

    • exhaustive: assign a large number of terms

    • nonexhaustive

  • Issue 2: term specificity

    • broad terms (generic): cannot distinguish relevant from nonrelevant documents

    • narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant



Term-Frequency Consideration

  • Function words

    • for example, "and", "or", "of", "but", …

    • the frequencies of these words are high in all texts

  • Content words

    • words that actually relate to document content

    • varying frequencies in the different texts of a collection

    • indicate term importance for content



A Frequency-Based Indexing Method

  • Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high frequency function words.

  • Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.

  • Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
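The three steps above can be sketched as follows; the stop list and the threshold T are illustrative choices for this example, not values from Salton's text.

```python
from collections import Counter

# Step 1: a toy stop list of high-frequency function words.
STOP_LIST = {"and", "or", "of", "but", "the", "a", "an", "is", "in", "to"}

def index_terms(document, threshold=1):
    """Steps 2-3: compute tf_ij for the remaining terms and keep
    every term whose frequency exceeds the threshold T."""
    tokens = [w for w in document.lower().split() if w not in STOP_LIST]
    tf = Counter(tokens)
    return {term: count for term, count in tf.items() if count > threshold}

doc = "indexing assigns index terms to documents and index terms describe documents"
print(index_terms(doc))  # {'index': 2, 'terms': 2, 'documents': 2}
```

With threshold T = 1, only terms occurring at least twice in the document survive; function words are removed before counting.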



How to Compute wij?

  • Inverse document frequency, idfj

    • tfij*idfj (TFxIDF)

  • Term discrimination value, dvj

    • tfij*dvj

  • Probabilistic term weighting, trj

    • tfij*trj

  • Global properties of terms in a document collection



Inverse Document Frequency

  • Inverse document frequency for term Tj: idfj = log(N / dfj), where N is the number of documents in the collection and dfj (the document frequency of term Tj) is the number of documents in which Tj occurs

    • terms with high idfj help fulfil both the recall and the precision goals

    • such terms occur frequently in individual documents but rarely in the remainder of the collection



TFxIDF

  • Weight wij of a term Tj in a document Di: wij = tfij x idfj = tfij x log(N / dfj)

  • Eliminating common function words

  • Computing the value of wij for each term Tj in each document Di

  • Assigning to the documents of a collection all terms with sufficiently high (tfxidf) factors
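The tf x idf assignment above can be sketched as follows; pre-tokenized documents and a natural-log idf are assumptions of this example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_ij = tf_ij * log(N / df_j) for every term T_j
    in every document D_i of a collection of tokenized documents."""
    N = len(docs)
    df = Counter()                      # document frequency df_j
    for doc in docs:
        df.update(set(doc))             # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency tf_ij
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["apple", "apple", "fruit"],
        ["fruit", "banana"],
        ["apple", "banana"]]
w = tfidf_weights(docs)
print(w[0])  # "apple" gets 2 * log(3/2); a term in every document would get 0
```

Terms with sufficiently high w_ij would then be assigned as index terms, per the last step above.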



Term-Discrimination Value

  • Useful index terms

    • Distinguish the documents of a collection from each other

  • Document Space

    • Two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together

    • Assigning a high-frequency term without discrimination power increases the document space density



A Virtual Document Space

[Figure: a virtual document space in three states: the original state, the space after assignment of a good discriminator (documents move apart), and the space after assignment of a poor discriminator (documents cluster together).]



Good Term Assignment

  • When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection.

  • This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.



Poor Term Assignment

  • A high-frequency term is assigned that does not discriminate between the objects of a collection. Its assignment renders the documents more similar.

  • This is reflected in an increase in document space density.



Term Discrimination Value

  • Definition: dvj = Q - Qj, where Q and Qj are the space densities before and after the assignment of term Tj

  • dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term



Variation of Term-Discrimination Value with Document Frequency

[Figure: dvj plotted against document frequency, from 0 to N:]

  • Low document frequency: dvj ≈ 0

  • Medium document frequency: dvj > 0

  • High document frequency: dvj < 0



tfij x dvj

  • wij = tfij x dvj

  • compared with idfj:

    • idfj: decreases steadily with increasing document frequency

    • dvj: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger



Document Centroid

  • Issue: efficiency; measuring space density from all pairwise similarities requires N(N-1) computations

  • Document centroid: C = (c1, c2, c3, ..., ct), where cj = (1/N) Σi wij and wij is the weight of the j-th term in document i

  • Space density: Q = (1/N) Σi sim(C, Di), which needs only N similarity computations
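A minimal sketch of centroid-based density and the discrimination value, assuming cosine similarity and the earlier convention dvj = Q - Qj (Q computed with term Tj withheld, Qj with it assigned); the toy vectors are illustrative.

```python
import math

def centroid(vectors):
    """C = (c_1, ..., c_t): the componentwise average of the document vectors."""
    N = len(vectors)
    return [sum(v[j] for v in vectors) / N for j in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(vectors):
    """Q = (1/N) * sum_i sim(C, D_i): N similarities instead of N(N-1)."""
    C = centroid(vectors)
    return sum(cosine(C, v) for v in vectors) / len(vectors)

def discrimination_value(vectors, j):
    """dv_j = Q - Q_j: density before term j is assigned (term j
    zeroed out) minus density after (full vectors)."""
    withheld = [[w if k != j else 0.0 for k, w in enumerate(v)] for v in vectors]
    return density(withheld) - density(vectors)

# Term 0 occurs uniformly in every document (poor discriminator);
# terms 1 and 2 each single out one document (good discriminators).
docs = [[5.0, 1.0, 0.0], [5.0, 0.0, 1.0], [5.0, 0.0, 0.0]]
print(discrimination_value(docs, 0) < 0)  # True: assigning term 0 packs the space
print(discrimination_value(docs, 1) > 0)  # True: assigning term 1 spreads it out
```

Zeroing a term's column simulates the space "before assignment" of that term, so a good discriminator lowers the density when assigned and earns dvj > 0.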



Probabilistic Term Weighting

  • Goal: make explicit distinctions between occurrences of terms in the relevant and nonrelevant documents of a collection

  • Definition: given a user query q, and the ideal answer set of the relevant documents

  • From decision theory, the best ranking algorithm scores a document D by g(D) = log ( Pr(D|rel) / Pr(D|nonrel) ) + log ( Pr(rel) / Pr(nonrel) )



Probabilistic Term Weighting

  • Pr(rel), Pr(nonrel): a document’s a priori probabilities of relevance and nonrelevance

  • Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets



Assumptions

  • Terms occur independently in documents



Derivation Process


Given a document D = (d1, d2, …, dt), assume each di is either 0 (term absent) or 1 (term present).

For a specific document D:

Pr(di = 1 | rel) = pi,  Pr(di = 0 | rel) = 1 - pi

Pr(di = 1 | nonrel) = qi,  Pr(di = 0 | nonrel) = 1 - qi

Under the term-independence assumption:

Pr(D | rel) = Πi pi^di (1 - pi)^(1 - di),  Pr(D | nonrel) = Πi qi^di (1 - qi)^(1 - di)



Term Relevance Weight

  • trj = log [ pj (1 - qj) / ( qj (1 - pj) ) ]



Issue

  • How to compute pj and qj?

    pj = rj / R,  qj = (dfj - rj) / (N - R)

    • rj: the number of relevant documents containing term Tj

    • R: the total number of relevant documents

    • N: the total number of documents



Estimation of Term-Relevance

  • The occurrence probability of a term in the nonrelevant documents, qj, is approximated by the occurrence probability of the term in the entire document collection: qj = dfj / N

  • Because the number of relevant documents is small, the occurrence probabilities of the terms in them are approximated by a constant value pj = 0.5 for all j
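A sketch of the term-relevance weight trj = log[pj(1 - qj) / (qj(1 - pj))] under these estimates; the branch using relevance information applies pj = rj/R and qj = (dfj - rj)/(N - R) from the previous slide.

```python
import math

def term_relevance(df_j, N, r_j=None, R=None):
    """tr_j = log[ p_j(1-q_j) / (q_j(1-p_j)) ].
    With relevance information: p_j = r_j/R, q_j = (df_j - r_j)/(N - R).
    Without it, fall back on the estimates p_j = 0.5, q_j = df_j/N."""
    if r_j is not None and R is not None:
        p = r_j / R                     # term's occurrence rate in relevant docs
        q = (df_j - r_j) / (N - R)      # and in nonrelevant docs
    else:
        p = 0.5                         # constant estimate for all terms
        q = df_j / N                    # collection-wide occurrence rate
    return math.log((p * (1 - q)) / (q * (1 - p)))

# With the fallback estimates, tr_j = log((N - df_j)/df_j), which is
# close to idf_j = log(N/df_j) when N is large:
print(term_relevance(10, 1000))  # log(99),  about 4.595
print(math.log(1000 / 10))       # log(100), about 4.605
```

The numeric closeness of the two printed values previews the comparison on the next slide.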



Comparison

With pj = 0.5 and qj = dfj / N:

trj = log [ pj (1 - qj) / ( qj (1 - pj) ) ] = log [ (N - dfj) / dfj ]

When N is sufficiently large, N - dfj ≈ N, so

trj ≈ log ( N / dfj ) = idfj

