SUBJECT ANALYSIS AND REPRESENTATION

Presented by GARRY L. BASTIDA

INTRODUCTION/REVIEW One of the major functions of an information retrieval system is to match the contents of documents with users' queries. To do this, system personnel have to prepare a surrogate for every document, and all such surrogates must be maintained in an organized manner. This process is indexing.

INTRODUCTION/REVIEW TASK: analyze the content of the given document and represent this analysis by some content identifiers or keywords. Lancaster: indexing involves two quite distinct steps, conceptual analysis and representation. In subject classification, the basic objective of which is to arrange documents according to their subject contents, the result of the conceptual analysis is represented by some artificial language or notational symbol.

Subject Analysis What’s it all about, Garry?

What is it? • Subject analysis • Examination of a bibliographic item by a trained subject specialist to determine the most specific subject heading(s) or descriptor(s) that fully describe its content, to serve in the bibliographic record as access points in a subject search of a library catalog, index, abstracting service, or bibliographic database. When no applicable subject heading can be found in the existing headings list or thesaurus of indexing terms, a new one must be created.

What is it? • It means the presence, identification and expression of subject matter in document texts, databases, controlled and natural languages, information requests and search strategies.

Say what?

Why do all that? • If we don’t, we can’t find stuff! • “Subject analysis is [essentially] all methods and processes which can be described as representation for retrieval of information by its subjects, be they names, geographic locations, or topical subjects.” • Quoted from Williamson, N. J. (1997). The Importance of Subject Analysis in Library and Information Science Education. Technical Services Quarterly 15(1/2):67-87, by Pamela Hill in LS 500 Organization of Information, Tuesday, February 24, 2004

Why use a standardized list? • Why Subject Headings? • Subject headings often indicate the contents of books in terms that their titles, which may be nondescriptive or very general, do not use. Subject headings in online databases are often referred to as descriptors, but they serve the same purpose in locating valuable resources. • Along with their subdivisions, subject headings provide a clear and systematic way of scanning the catalog for what is needed. Assigned headings are usually the dominant, and most important, subjects of a given item. • Subject headings bring like materials together, so the searcher does not have to try the wide variety of synonymous terms that may describe a single concept (teen, youth, adolescent, young adult, etc.). • Using Subject Headings in PantherCat

BS 6529 factors in choosing the subject of a document • Does the document deal with a specific product, condition or phenomenon? • Does the subject contain an action concept, an operation or a process? • Is the object or patient affected by the action identified? • Does the document deal with the agent of this action?

BS 6529 factors in choosing the subject of a document • Does it refer to a particular means for accomplishing the action? • Were these factors considered in the context of a particular location or environment? • Are any independent or dependent variables identified? • Was the subject considered from a special viewpoint not normally associated with that field of study?

SUBJECT INDEXING • Subject indexing is the act of describing a document by index terms to indicate what the document is about or to summarize its content. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents (such as books and articles) within a field of knowledge. • Subject indexing systems have been classified broadly as pre-coordinate and post-coordinate systems. The major objective of any indexing system is to represent the contents of documents through keywords or descriptors.

Exhaustivity and Specificity • An exhaustive index is one which lists all possible index terms. Greater exhaustivity gives higher recall, that is, a greater likelihood that all the relevant articles are retrieved. However, this occurs at the expense of precision: the user may retrieve a larger number of irrelevant documents, or documents which deal with the subject only in little depth. In a manual system, a greater level of exhaustivity also brings a greater cost, as more man-hours are required. • Specificity describes how closely the index terms match the topics they represent. An index is said to be specific if the indexer uses descriptors that parallel the concepts of the document and reflect those concepts precisely.

Recall vs Precision

$$\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}}$$

$$\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Number of relevant documents in the collection}}$$
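The two ratios above can be worked through on a small example. The following is a minimal sketch; the function name `precision_recall` and the document identifiers are made up for illustration and do not come from any particular retrieval system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    retrieved: identifiers of the documents the search returned
    relevant:  identifiers of all relevant documents in the collection
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved_docs = {"d1", "d2", "d3", "d7"}
relevant_docs = {"d1", "d2", "d3", "d4", "d5", "d6"}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.75 recall=0.50
```

Retrieving more documents can only keep recall the same or raise it, but every irrelevant document added lowers precision, which is the trade-off described under exhaustivity.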

Manual indexing • Analysis of subject • Identification of keywords • Standardization of keywords • Choice of an indexing system • If the chosen system is a post-coordinate one, then: • Preparation of entries under each term with reference to the document identification number. • Preparation of reference entries.

Manual indexing • If the chosen system is a pre-coordinate one, then: • Preparation of an entry (main entry) using all the keywords organized in a way prescribed by the system. • Preparation of index entries by using each significant term as an entry element and the full entry (main entry) as the context, or by rotation/permutation of the significant terms in the main entry according to the rules prescribed by the system chosen. • Preparation of reference entries. • Filing of entries.

STEPS IN MANUAL INDEXING SYSTEM

Pre-coordinate indexing system • Chain indexing • Dr. S.R. Ranganathan developed this method of pre-coordinate indexing. It attempts to represent, in natural language, the chain of concepts that constitutes a subject.

Pre-coordinate indexing system • Basic steps in chain indexing may be represented as follows: • Take the class number prepared for the given document. • Consult the corresponding classification schedule and write the notation at each step and the corresponding term or phrase (from the schedule). This will produce a chain of concepts from the general to the specific.

Basic steps in chain indexing may be represented as follows: • Identify the sought, unsought, and false links. Sought links denote the concepts that the user is likely to use as access points; unsought links are those that are not likely to be used as access points; and false links are those that do not really represent any valid concept. • Invert the chain, and this will generate the index entries.
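The inversion step can be sketched in a few lines. This is a simplified illustration, not Ranganathan's full procedure: the notation is made up rather than taken from any real schedule, the names `Link` and `chain_index_entries` are hypothetical, and the sketch assumes that only sought links are used both as lead terms and as qualifiers.

```python
from typing import NamedTuple


class Link(NamedTuple):
    notation: str  # class number at this step of the chain
    term: str      # verbal equivalent from the schedule
    kind: str      # "sought", "unsought" or "false"


def chain_index_entries(chain):
    """Invert a general-to-specific chain into (heading, notation) entries."""
    entries = []
    # Walk from the most specific link back up to the most general one.
    for i in range(len(chain) - 1, -1, -1):
        link = chain[i]
        if link.kind != "sought":
            continue  # unsought and false links never lead an entry
        # Qualify the lead term with the sought links above it, nearest first.
        context = [up.term for up in reversed(chain[:i]) if up.kind == "sought"]
        entries.append((", ".join([link.term] + context), link.notation))
    return entries


# Made-up chain for a document on heart diseases.
chain = [
    Link("Q", "Medicine", "sought"),
    Link("Q4", "Circulatory system", "unsought"),
    Link("Q41", "Heart", "sought"),
    Link("Q41:", "", "false"),  # connecting symbol only, not a concept
    Link("Q41:7", "Diseases", "sought"),
]

for heading, notation in chain_index_entries(chain):
    print(f"{heading}  {notation}")
# prints:
# Diseases, Heart, Medicine  Q41:7
# Heart, Medicine  Q41
# Medicine  Q
```

Each entry leads with one sought concept and carries the class number of that link, so a user who enters the index at any sought level is directed to the right place in the classified sequence.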

Pre-coordinate indexing system • Relational indexing • J.E.L. Farradane devised this scheme. The system was first developed in the early 1950s and has been modified several times since then; the latest changes may be noted from Farradane’s own papers that appeared in 1980. According to Farradane, any subject can be represented by identifying the relationship between each pair of its constituent concepts and expressing it in the form of what he called analets (pairs of terms linked by an operator). He suggested that any possible relationship can be represented by one of nine relational operators.

Pre-coordinate indexing system • PRECIS – PREserved Context Index System. • Developed by Derek Austin; it first came out in 1974. Major tasks: • Analysing the document concerned and identifying key concepts. • Organizing the concepts into a subject statement based on the principle of context dependency. • Assigning codes (operators) which signify the syntactical function of each term. • Deciding which terms should be the access points and which terms would be in other positions in the index entries, and assigning further codes to achieve these results. • Adding further prepositions, auxiliaries or phrases which would result in clarity and expressiveness of the resulting index entries. • Making supporting reference entries from semantically related terms taken from a thesaurus.
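To make the idea of access points and "other positions" concrete, here is a minimal sketch of how a context-dependent string can be turned into entries in which each lead term is followed by its wider context (the qualifier) and has its narrower context shown beneath it (the display). It deliberately ignores the role operators, codes and added prepositions listed above; the function names and the sample string are illustrative assumptions only.

```python
def precis_entries(string, leads=None):
    """Return (lead, qualifier, display) triples for a context-ordered string.

    string: terms ordered so that each term sets the context for the next
    leads:  terms chosen as access points (default: every term)
    """
    entries = []
    for i, term in enumerate(string):
        if leads is not None and term not in leads:
            continue
        qualifier = list(reversed(string[:i]))  # wider context, nearest first
        display = string[i + 1:]                # narrower context, in order
        entries.append((term, qualifier, display))
    return entries


def format_entry(lead, qualifier, display):
    """Lay out one entry: lead and qualifier on line 1, display on line 2."""
    first_line = ". ".join([lead] + qualifier)
    if display:
        return first_line + "\n  " + ". ".join(display)
    return first_line


# Illustrative subject statement, ordered by context dependency.
subject = ["India", "Cotton industries", "Personnel", "Training"]

for entry in precis_entries(subject):
    print(format_entry(*entry))
# prints:
# India
#   Cotton industries. Personnel. Training
# Cotton industries. India
#   Personnel. Training
# Personnel. Cotton industries. India
#   Training
# Training. Personnel. Cotton industries. India
```

Whichever term leads, the full context is preserved around it, which is the point of the name.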

Pre-coordinate indexing system • POPSI, Postulate-based Permuted Subject Indexing • Developed by Bhattacharyya. It uses the analytico-synthetic method for string formulation and permutation of the constituent terms in order to satisfy different approach points to the document. • Each entry has two parts: the lead heading, which contains the index term or access term; and the context heading, which generally appears in the line following the lead heading and contains the subject words, with auxiliary words, denoting the context in which the lead term has been discussed in the given document.

Rules that govern POPSI • A manifestation of property follows immediately the manifestation in relation to which it is a property. • A manifestation of action follows immediately the manifestation in relation to which it is an action. • Property and action can have another property and/or action directly related. • A species or part follows immediately the manifestation in relation to which it is a species or part; "part" is used to denote the whole-part relationship. • A modifier follows immediately the manifestation in relation to which it is a modifier.
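The two-part entry can be illustrated with a short sketch. It assumes the subject string has already been formulated and only shows how each approach term becomes a lead heading over the same context heading; the function name `popsi_entries` and the sample string are hypothetical, and real POPSI strings carry more detail (such as indicator digits and auxiliary words) than is modelled here.

```python
def popsi_entries(subject_string, lead_terms):
    """Return (lead_heading, context_heading) pairs, filed alphabetically."""
    missing = [t for t in lead_terms if t not in subject_string]
    if missing:
        raise ValueError(f"lead terms not in subject string: {missing}")
    # The context heading is the same full string under every lead heading.
    context_heading = ", ".join(subject_string)
    return [
        (lead.upper(), context_heading)
        for lead in sorted(lead_terms, key=str.lower)
    ]


# Illustrative subject string, already in its synthesized order.
subject_string = ["Medicine", "Heart", "Diseases", "Treatment"]

for lead, context in popsi_entries(subject_string, ["Heart", "Treatment"]):
    print(lead)
    print("  " + context)
# prints:
# HEART
#   Medicine, Heart, Diseases, Treatment
# TREATMENT
#   Medicine, Heart, Diseases, Treatment
```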

Post-coordinate indexing system • Uniterm • Developed by Mortimer Taube in 1953. A card is prepared for each term that is considered to be an appropriate index term for a given document, and the document number is recorded on it. Searching relies on the ability of the searcher to notice matching numbers on the cards that are retrieved. • Optical coincidence/peek-a-boo cards • Developed to overcome the problem of manual searching. Each card is divided into small units of numbered squares, each unit bearing a specific number, and a document number is recorded by punching the appropriate position on the card.
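Both card systems amount to the same operation: finding the document numbers that several term cards have in common. The sketch below models a Uniterm card as a set of document numbers and a peek-a-boo card as an integer bitmask in which a set bit stands for a punched hole. All function names and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_term_cards(documents):
    """Uniterm-style cards: one card per term, listing document numbers."""
    cards = defaultdict(set)
    for doc_number, terms in documents.items():
        for term in terms:
            cards[term].add(doc_number)
    return cards


def search(cards, *terms):
    """Coordinate terms at search time: numbers present on every card."""
    pulled = [cards.get(term, set()) for term in terms]
    return sorted(set.intersection(*pulled)) if pulled else []


def punch_card(doc_numbers):
    """Peek-a-boo card as a bitmask: bit n set means a hole at position n."""
    mask = 0
    for n in doc_numbers:
        mask |= 1 << n
    return mask


def coincident_holes(masks):
    """Stack the cards: light passes only where every card is punched."""
    if not masks:
        return []
    light = masks[0]
    for mask in masks[1:]:
        light &= mask
    return [n for n in range(light.bit_length()) if (light >> n) & 1]


# Hypothetical collection: document number -> assigned index terms.
documents = {
    101: ["indexing", "libraries"],
    102: ["indexing", "automation"],
    103: ["libraries", "automation", "indexing"],
}
cards = build_term_cards(documents)

print(search(cards, "indexing", "automation"))
# prints: [102, 103]

stack = [punch_card(cards["indexing"]), punch_card(cards["automation"])]
print(coincident_holes(stack))
# prints: [102, 103]
```

The set version mirrors a searcher comparing numbers by eye, while the bitmask version mirrors holding the stacked cards up to the light.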

PROBLEMS OF MANUAL INDEXING • Salton, and Salton and McGill, note two major shortcomings: • It is not quite clear that all the complexities and refinements, exemplified by the categorization of terms and assignment of relations between terms, are really beneficial. • Even if the indexing process is carried out accurately, and at the right level of detail, it is not possible to maintain consistency, since more than one indexer will be needed in practice.

Theory of indexing • 1st level: the concordance, which consists of references to all words in the original text, arranged in alphabetical order. • 2nd level: the information-theoretical level, which calculates the likelihood of a word being chosen for indexing based on its frequency of occurrence in a given text document. • 3rd level: the linguistic level, which attempts to explain how meaningful words are extracted from large units of text. • 4th level: the textual or skeletal framework; the text is prepared by the author in an organized manner and held together by a skeletal structure. • 5th level: the inferential level. An indexer should be able to make inferences about the relationships between words and phrases by observing the sentence and paragraph structure, and by stripping the sentence of extraneous details.
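The first two levels are mechanical enough to sketch in code. The example below builds a concordance (every word, alphabetically, with its positions in the text) and then uses raw frequency with a simple cut-off as a crude stand-in for the information-theoretical level. The stopword list, the `min_count` threshold and the sample text are assumptions for illustration, not part of the theory itself.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(text):
    """1st level: every word, alphabetically, with its word positions."""
    positions = defaultdict(list)
    for position, word in enumerate(tokenize(text), start=1):
        positions[word].append(position)
    return dict(sorted(positions.items()))


def frequency_candidates(text, stopwords, min_count=2):
    """2nd level (simplified): words frequent enough to be index candidates."""
    counts = Counter(w for w in tokenize(text) if w not in stopwords)
    return [(w, c) for w, c in counts.most_common() if c >= min_count]


text = (
    "Indexing describes a document by index terms. "
    "Index terms summarize the document."
)

index = concordance(text)
print(index["document"])
# prints: [4, 12]

print(sorted(frequency_candidates(text, stopwords={"a", "by", "the"})))
# prints: [('document', 2), ('index', 2), ('terms', 2)]
```

The higher levels (linguistic, skeletal, inferential) depend on meaning and structure, so they do not reduce to counting in this way.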

Fugmann proposes a theory based on five axioms • Axiom of definability: compiling information relevant to a topic can only be accomplished to the degree to which the topic can be defined. • Axiom of order: any compilation of information relevant to a topic is an order-creation process. • Axiom of the sufficient degree of order: the demands made on the degree of order increase as the size of a collection and the frequency of searches increase. • Axiom of predictability: the success of any directed search for relevant information hinges on how readily predictable or reconstructible the modes of expression for concepts and statements in the search file are. • Axiom of fidelity: the success of any directed search for relevant information depends on the fidelity with which concepts and statements are expressed in the search file.

That's what it's all about!
