
Vocabulary & languages in indexing & searching




tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/

© Tefko Saracevic

Central idea
Indexing and searching: inexorably connected
  • you cannot search what was not first indexed in some manner or other
    • to be searched, everything is and must be indexed somehow, even if it is not called “indexed”
  • indexing of documents or objects is done so that they can be searched
    • there are a great many ways to do indexing
  • to index, one needs an indexing language
    • there are a great many indexing languages
      • even taking every word in a document is an indexing language

Knowing searching is knowing indexing




Controlled & uncontrolled vocabularies

Inverted indexes


Defined concepts valid for application in indexing & searching



Indexing: index terms, indexing vocabulary, indexing language

Searching: search terms, search vocabulary, query language

  • language
  • vocabulary

General definitions [Encarta Dictionary]

Language
1. communication with words: the human use of spoken or written words as a communication system
2. system of communication: a system of communication with its own set of conventions or special words

Vocabulary
1. words of language: all the words used in a language as a whole
2. words of subject area: the set of words associated with a subject or area of activity, or used by an individual person

Specific definitions
Starting from the most basic concept:

Index term:

A word or phrase that denotes (describes) a concept & connotes (implies) a class

index term “table” describes a table [picture on the original slide]

and implies many kinds of tables: [pictures on the original slide]

for which, if desired, we may have more specific index terms

More definitions ...

Indexing vocabulary

a set of index terms used in a domain or for a set of documents or objects

  • it could be even a single document or object e.g. a book

Indexing language

an indexing vocabulary together with rules – syntax, grammar – for their application and use

Variation on Index term


Word or phrase used to identify a topic or idea. Part of a controlled vocabulary, normally listed in a thesaurus (defined later). May be used as a search term.


A significant word from a text of a record which can be used as a search term in a free-text search to retrieve all the records containing it

  • Could be assigned manually, but now done mostly automatically – key entry in automatic indexing

Searching definitions


Question

request by a user related to the user’s information need, task, or problem at hand

Question analysis

breakdown & elaboration of concepts in a question to be translated into search terms

Query

question or part thereof as stated for searching according to the rules of a given system


more ...

Search term

a counterpart to index term, also denoting a concept and connoting a class for a search

Search vocabulary

a set of search terms in a domain or available in a system

Query language

a search vocabulary together with rules for their use in searching


elaboration …
  • Example: Question:
    • What are some major historical developments in the area of information retrieval?
  • Transformed into query
    • history information retrieval (in Google)
    • history AND information(w)retrieval (in Dialog, where you also have to select which file(s) to search)
  • Question is what user asks and what you may then have elaborated
  • Query is what is asked of computer to match – what is put in for searching
  • Question is transformed into query
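
The transformation of a question into a query can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the concepts are assumed to have already come out of question analysis. It renders the same two concepts in the two styles shown above.

```python
# Hypothetical helpers: one set of concepts rendered in two query styles.

def to_web_query(concepts):
    """Web search style: plain words separated by spaces."""
    return " ".join(concepts)

def to_dialog_query(concepts):
    """Dialog style: words of a phrase joined with (w), concepts joined with AND."""
    return " AND ".join("(w)".join(c.split()) for c in concepts)

# Assumed result of question analysis for the example question.
concepts = ["history", "information retrieval"]

print(to_web_query(concepts))     # history information retrieval
print(to_dialog_query(concepts))  # history AND information(w)retrieval
```

The question stays the same; only the query changes with the rules of the system being searched.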


more …

“An index language is the language used to describe documents and requests.

The elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.

The vocabulary of an index language may be controlled or uncontrolled.”

(van Rijsbergen, 1979)

Controlled vocabulary
  • Predetermined – indicating which terms are to be used in indexing
    • may show definition of and relations between terms
      • examples: thesaurus, subject heading list, classification
  • Also indicates terms that may be selected for searching
  • An indexing AND a searching tool
  • Human constructed
    • and costly to construct and use
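
A toy sketch of the idea, assuming a small hand-made mapping: variant expressions are reduced to one authorized term, and the same mapping serves the indexer and the searcher. The terms are invented for illustration and do not come from any real thesaurus.

```python
# Toy controlled vocabulary: variant expressions map to one authorized term.
# All terms here are made up for illustration.
PREFERRED = {
    "cars": "automobiles",
    "autos": "automobiles",
    "automobiles": "automobiles",
    "lorries": "trucks",
    "trucks": "trucks",
}

def normalize(term):
    """Return the authorized term, or None if the vocabulary does not cover it."""
    return PREFERRED.get(term.lower())

index_term = normalize("Autos")   # what an indexer would assign
search_term = normalize("cars")   # what a searcher would enter
print(index_term == search_term)  # True: both sides meet on the same term
```

Someone has to build and maintain the mapping by hand, which is where the cost comes from.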

Example of controlled vocabularies

Medical Subject Headings (MeSH) of the National Library of Medicine

  • One of the largest & most comprehensive
    • used in indexing & searching
  • More than 22,000 descriptors, with more than 106,000 cross-references
  • More than 139,000 Supplementary Concept Records
  • Approximately 50 publication types (Journal Article, News, Editorial, Review, Randomized Controlled Trial, etc)
  • Done by indexers
  • But also experimenting with semi-automatic indexing

Uncontrolled vocabulary
  • Derived from texts – natural language - in documents
    • nowadays automatically
      • using various ways or algorithms
    • constantly tested: which algorithm is better?
  • Used to construct inverted indexes
  • In turn, inverted indexes are used for free text searching

Comparison of vocabularies


Controlled
  • The idea of a controlled vocabulary is to reduce the variability of expressions used to characterize documents being indexed & searched for
  • Manual, costly, time consuming, also semi-automatic in some systems
  • Dynamic – needs constant changing, updating

Uncontrolled or free
  • The idea is to follow natural language expressions as they occur in documents
  • Could be automatic
    • great advantage
  • Algorithms constantly changing & improving
    • e.g. parsing phrases, connections
  • Prevailing in many applications

Controlled vs. free text searching

Endless source of debate & controversy

But each has its place for a given circumstance & retrieval goal

Each has strengths & weaknesses

can you list or find a list comparing them? – this is a good search assignment

Users mostly use free text searching

Professional searchers use both as warranted – have to know when

Professional credo:

KNOW THY CONTROLLED VOCABULARY so you can apply it in searching as or when needed

Inverted indexes & searching

Useful to know how they function to understand search & retrieval. Steps:

  • Each document is indexed
    • every word in a document is taken as index term with exception of stop words, if any
    • position in text is noted, even for stop words
  • Indexes for all documents are merged
    • index terms are arranged alphabetically in the bowels of the system, so they can be searched
      • under each index term are document numbers in which it appears & position in text for that document
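
The steps above can be sketched as follows. This is a minimal illustration, not how any particular retrieval system is implemented; the stop list and the two sample documents are assumptions.

```python
from collections import defaultdict

STOP_WORDS = {"a", "of", "in"}  # assumed stop list

def build_inverted_index(docs):
    """Map each non-stop word to {document number: [positions]}, positions 1-based."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if word in STOP_WORDS:
                continue  # not indexed, but its position has still been counted
            index[word].setdefault(doc_id, []).append(position)
    return dict(sorted(index.items()))  # index terms in alphabetical order

# Invented sample documents.
docs = {
    1: "history of information retrieval",
    2: "retrieval of digital information",
}
index = build_inverted_index(docs)
print(list(index))           # ['digital', 'history', 'information', 'retrieval']
print(index["information"])  # {1: [3], 2: [4]}
```

Under each index term sit the document numbers and, for each document, the positions of the term in its text.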

So, when you search

for digital AND libraries:

  • computer takes all documents under digital
  • and all documents under libraries
  • compares to “see” which documents have both terms and then
  • gives you the list of those documents that contain both terms, no matter where in the document
  • This is also called “coordinate indexing”
    • coordination is done at time of searching
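
In set terms, the AND search is an intersection of two lists of document numbers. A small sketch with invented document numbers, assuming the lists are already available from an inverted index:

```python
# Document numbers listed under two index terms (invented for illustration).
postings = {
    "digital": {2, 5, 7, 9},
    "libraries": {1, 5, 8, 9},
}

def search_and(term_a, term_b):
    """Documents listed under both terms, in ascending order."""
    return sorted(postings.get(term_a, set()) & postings.get(term_b, set()))

print(search_and("digital", "libraries"))  # [5, 9]
```

Nothing about the pairing of the two terms was decided when the documents were indexed; the coordination happens only when the search is run.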

Variation: when you search

for digital (WITH) libraries or

“digital libraries”, i.e. as a phrase

  • computer goes through the same steps as before but then also
  • “looks” for documents where digital is positioned right before libraries
    • remember: computer “knows” position of each term in each document, each sentence
  • So searching for a phrase is a form of searching of terms connected with AND but in a given sequence
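
A sketch of the extra step, assuming positional postings are already at hand. The documents and positions are invented, and the function name is hypothetical.

```python
# Positional postings: term -> {document number: positions of the term}.
# Documents and positions are invented for illustration.
positions = {
    "digital": {1: [2], 2: [4, 9]},
    "libraries": {1: [7], 2: [5]},
}

def search_phrase(first, second):
    """Documents in which `second` occurs at the position right after `first`."""
    hits = []
    first_docs = positions.get(first, {})
    second_docs = positions.get(second, {})
    for doc_id in sorted(first_docs.keys() & second_docs.keys()):  # the AND step
        following = set(second_docs[doc_id])
        if any(p + 1 in following for p in first_docs[doc_id]):    # the sequence step
            hits.append(doc_id)
    return hits

print(search_phrase("digital", "libraries"))  # [2]
```

Document 1 contains both words but five positions apart, so it passes the AND step and fails the sequence step.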


Inverted index

Example of searches in inverted file

For simplicity, documents have one sentence. Stop words: “a”, “of”, “in” – but their position is counted

Search for slow AND truck gets as results documents 1 and 3 since both contain slow and truck

Search for slow (w) truck retrieves only document 3, in which slow is 7th and truck is 8th; they are right next to each other. Doc 1 has both words, but not next to each other, thus it is not retrieved
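
The two searches can be replayed in code. The three sentences below are stand-ins written for this sketch, since the slide's own documents are not in the transcript; they are chosen only so that the positions match the description (slow is 7th and truck is 8th in document 3).

```python
STOP_WORDS = {"a", "of", "in"}

# Stand-in sentences, NOT the documents from the original slide.
docs = {
    1: "a truck was slow in heavy traffic",
    2: "a fast car in the city",
    3: "police stopped the driver of a slow truck",
}

# Build the inverted index; stop words are skipped but still use up a position.
index = {}
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split(), start=1):
        if word not in STOP_WORDS:
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)

def search_and(a, b):
    """a AND b: documents containing both words, anywhere."""
    return sorted(index.get(a, {}).keys() & index.get(b, {}).keys())

def search_adjacent(a, b):
    """a (w) b: documents where b immediately follows a."""
    return [d for d in search_and(a, b)
            if any(p + 1 in index[b][d] for p in index[a][d])]

print(search_and("slow", "truck"))          # [1, 3]
print(search_adjacent("slow", "truck"))     # [3]
print(index["slow"][3], index["truck"][3])  # [7] [8]
```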

Everything is inverted – consequences for searching
  • All words in all fields are inverted, no matter if
    • in title, full text, descriptor, author …
  • Thus all are searchable
  • In some systems (but not all) phrases are parsed & thus searchable
    • but in most, phrases are searched as A(w)B or “A B”
  • But beware:
    • search for libraries as descriptor
      • e.g. libraries/DE in Dialog
    • will retrieve ALL other descriptors where libraries appears, in addition to the descriptor libraries itself
      • e.g. academic libraries, public libraries, special libraries, research libraries …
    • but there are search tricks to avoid that

What is a thesaurus?

“For writers, it is a tool like Roget’s – one with words grouped and classified to help select the best word to convey a specific nuance of meaning.

For indexers and searchers, it is an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus.”

(Milstead, 2000)



“A thesaurus to an information scientist is a controlled set of the terms used to index information in a database, and therefore also to search for information in that database so the same concepts are represented by the same term.”

(Batty, 1998)


  • Good old Peter Mark Roget had a most useful idea in the 1850s & did a great job
  • Following this idea, the thesaurus became THE major tool for controlled vocabulary in IR
    • starting in the 1950s & to this day, a great many IR thesauri have been developed for all kinds of subjects
      • including, for instance, in information science
    • all have a similar structure & function
    • but they are difficult & costly to construct & maintain

Standards, software
  • Subject to international standards:
    • “Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies” ANSI/NISO Standard Z39.19
    • followed by “Construction of Controlled Vocabularies. A Primer”
  • A number of software products are available for thesaurus construction and maintenance
    • e.g. as listed by American Society for Indexing

Examples of thesauri
  • Thesauri have been constructed for a great many domains, from A to Z
    • here are some lists
      • international & multilingual thesauri
      • online thesauri
      • among them ERIC Thesaurus (we use it for example)
    • BUT: different thesauri may and do treat the same descriptor (index term) differently
      • having different, more or fewer narrower, broader, related terms
      • thus it is dangerous to use them interchangeably

Basic thesaurus components
  • For each entry thesaurus has a classification grid:
    • Descriptor (DE) – an index term that has
      • Scope note (SN) – context in which used
      • Broader terms (BT) – higher in a hierarchy
      • Narrower terms (NT) – lower in a hierarchy
      • Related terms (RT) – other connected descriptors
      • Used for (UF) – synonyms that are not descriptors
    • Note: not all of these may be present for every descriptor
  • A searcher or indexer can use these as a guide for selection/rejection & for browsing to get ideas
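
The components map naturally onto a small record type. A sketch, assuming Python 3.9 or later; the class, the function and the sample values are illustrative and are not copied from ERIC or any other thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    descriptor: str                                     # DE
    scope_note: str = ""                                # SN
    broader: list[str] = field(default_factory=list)    # BT
    narrower: list[str] = field(default_factory=list)   # NT
    related: list[str] = field(default_factory=list)    # RT
    used_for: list[str] = field(default_factory=list)   # UF

# Illustrative values only; several components are left empty on purpose.
entry = ThesaurusEntry(
    descriptor="libraries",
    narrower=["academic libraries", "public libraries", "special libraries"],
    related=["library services"],
)

def candidate_terms(e):
    """Authorized terms a searcher or indexer could browse to from this entry."""
    return [e.descriptor, *e.broader, *e.narrower, *e.related]

print(candidate_terms(entry))
```

Used-for terms are deliberately left out of `candidate_terms`: they are synonyms that point to the descriptor, not descriptors themselves.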

Standard structure

With variations on the theme, thesauri have a similar conceptual structure to guide the searcher or indexer:

  • Descriptor - DE
  • Scope note - SN
  • Broader terms - BT
  • Narrower terms - NT
  • Related terms - RT
  • Used for - UF

Note: not every descriptor has to have all of these

Same thesaurus but …
  • Examples of ERIC (Educational Resources Information Center) thesaurus as used differently in different systems:
    • ERIC own system
    • ERIC file on Dialog (begin 1)
    • ERIC file on OVID (accessible through RUL)
  • Notice how each displays & searches the same ERIC thesaurus in its own way, but the principles are still the same
  • Oh well…

ERIC online thesaurus on ERIC
  • Allows for
    • searching for words that are included in descriptors by category or all categories
    • browsing alphabetically
    • browsing in one of about 40 categories
  • Search for libraries in all categories found 50 descriptors that have “library” included
  • Out of these we selected libraries

ERIC online thesaurus on ERIC: descriptor libraries


Other descriptors – one could browse

ERIC thesaurus on Dialog
  • In a convoluted way ERIC thesaurus (and other ones) can be displayed on Dialog (and other vendors, such as OVID)
  • How?
    • begin in file 1 – ERIC
    • then expand a desired term – here we used term library
    • you will see under R that certain terms have related terms – meaning that these are thesaurus entries
    • then expand on one of those to see related terms
    • then you can browse & choose which ones to use in search
  • And here are printed screens of the process

Note on command expand (E) in Dialog
  • Dialog (and some other systems) has a neat way to display all entries in any inverted index alphabetically
    • command is Expand or e
    • it could be done in any of the indexes – basic and additional

For instance:

e library will provide alpha list of term library in basic index & then after expanding again you can see related terms (see next)

e Au=Saracevic will provide alpha list of all entries in the author additional index around that name
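
The effect of such an alphabetical display can be imitated with a binary search over a sorted list. This sketches the idea only and makes no claim about how Dialog implements Expand; the index entries and the function name are invented.

```python
import bisect

# A tiny alphabetical index; the entries are invented for illustration.
index_terms = sorted([
    "librarians", "libraries", "library", "library administration",
    "library services", "licensing", "literacy",
])

def expand(term, before=2, after=3):
    """Entries alphabetically around `term`, whether or not `term` itself is indexed."""
    i = bisect.bisect_left(index_terms, term)
    return index_terms[max(0, i - before): i + after]

for entry in expand("library"):
    print(entry)
```

Because the lookup is positional, it also works for a term that is not in the index: the display simply opens at the place where the term would have been.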


going …

RT indicates related terms

46865 items have library

This one has 14 related terms


going …

We now chose descriptor LIBRARY ADMINISTRATION and expand on that one

Neat trick:

You can expand on expand

& get related terms out of ERIC thesaurus


going …

These are now R terms of various type

14 related terms for this one are listed

Can expand on this one to see other RT

You can also select any of these to search


going …

We have now selected r15 – library services to search for documents


going …

And this is the no. of items we got

Now we can view some items in a chosen format

or we can further modify this search: add, refine, …



This is one of the items we got

Descriptors used for this item

Additional index terms

Automatically gets you to thesaurus

This one was selected to enlarge

Allows you to select thesaurus (or not)

This one was selected to enlarge



Point being that the same thesaurus is handled differently by different databases

Next go and select additional terms

Or search for libraries only

See no. of results

Select fields and formats by making a check

and happy going …

suggestion: repeat this exercise

Relevance feedback - an important search tactic
  • Method for using information in items judged relevant to further refine or change the search
    • first you find a relevant document (or documents)
    • in relevant document(s) you browse titles, descriptors, identifiers, abstracts … to get leads (e.g. keywords) for further search terms & tactics
    • then you search for those
  • in some advanced systems this may be done automatically
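
A rough sketch of what an automatic version might do: collect the descriptors of the records judged relevant and propose the most frequent ones as further search terms. The records, descriptors and function name are invented, and real systems may weigh terms quite differently.

```python
from collections import Counter

# Records as a searcher might see them; titles and descriptors are invented.
records = {
    101: {"title": "Managing a branch library",
          "descriptors": ["library administration", "public libraries"]},
    102: {"title": "Reference desk staffing",
          "descriptors": ["library services", "library administration"]},
    103: {"title": "School budgets",
          "descriptors": ["educational finance"]},
}

def feedback_terms(relevant_ids, already_used):
    """Descriptors of relevant records, most frequent first, minus terms already searched."""
    counts = Counter(
        d for rid in relevant_ids for d in records[rid]["descriptors"]
    )
    return [term for term, _ in counts.most_common() if term not in already_used]

# The searcher judged records 101 and 102 relevant after searching "public libraries".
print(feedback_terms([101, 102], already_used={"public libraries"}))
# ['library administration', 'library services']
```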

Query expansion – another important search tactic
  • Method for adding, modifying, changing search terms in a query
    • to broaden, narrow, focus, change … terms
  • Many sources can be used
    • relevance feedback, thesauri, dictionaries, textbooks, documents, catalogs, & people: users, colleagues, your own mind & experience
  • Some systems suggest terms for query expansion

Query expansion tactics

[Diagram: Query term linked to Broader terms - BT, Related terms - RT, Narrower terms - NT]
  • You can use the same structure for expanding query terms as in a thesaurus
    • think of what may be broader, narrower, related terms or synonyms to use as search terms
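
A sketch of thesaurus-driven expansion, assuming a toy thesaurus fragment and a generic OR syntax with quoted phrases. The relationships are invented, and the query syntax is not that of any particular system.

```python
# Toy thesaurus fragment; the relationships are invented for illustration.
THESAURUS = {
    "libraries": {
        "BT": ["information sources"],
        "NT": ["academic libraries", "public libraries"],
        "RT": ["library services"],
        "UF": ["book collections"],
    },
}

def expand_query(term, relations):
    """OR the query term with the terms found under the chosen relations."""
    entry = THESAURUS.get(term, {})
    terms = [term] + [t for rel in relations for t in entry.get(rel, [])]
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)

# Also catch records indexed only under narrower descriptors.
print(expand_query("libraries", ["NT"]))
# Widen the search toward broader and related descriptors.
print(expand_query("libraries", ["BT", "RT"]))
```

Which relations to follow is the searcher's decision; the code only makes the options explicit.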


At the base of all searching are




but a variety exists

In reality in searching there is no completely controlled or uncontrolled vocabulary

matter of degree

& most importantly, matter of mastery


thank you!

Tefko Saracevic