Indexing

Indexing • Overview • Approaches to indexing • Automatic indexing • Information extraction

Overview • Indexing: the transformation of documents into searchable data structures. • May be manual or automatic. • Creates the basis for direct search, or for search through index files. • Historically performed by professional indexers associated with library organizations. • A critical process: a user’s ability to find documents on a particular subject is limited by the index terms the indexer created for that subject. • Initial computerization still relied on human indexers, but encouraged using more index terms (index cards no longer being required for each index term).

Changes in Objectives of Indexing Due to Full-Text Availability • Indexing defines the source and major concepts of documents. • The use of a controlled vocabulary (the domain of the index) helps standardize the choice of terms. • Controlled vocabularies slow the indexing process, but aid users because they know the domain the indexer had to use. • With the availability of full text, the need for manual indexing is diminishing: • Source information (citation data) can easily be extracted. • Every word of a document (after appropriate normalization) may be used as a term. • Thesauri compensate for the lack of controlled vocabularies. • Hence, the importance of manual indexing shifts to its ability to: • Perform abstractions and determine additional related terms. • Judge the value of the information (e.g., more difficult to “cheat”).

Approaches: Scope • Exhaustivity: the extent to which concepts are indexed. • Should we index only the most important concepts, or also more minor concepts? • In a 10-page document, should a 2-sentence discussion of some subject be indexed? • Specificity: the preciseness of the index term used. • Should we use general indexing terms or more specific terms? • Should we use the term “computer”, “personal computer”, or “IBM Aptiva Model M61”? • Main effect: • Low exhaustivity has an adverse effect on recall. • Low specificity has an adverse effect on precision. • Related issues: • Index title and abstract only, or the entire document? • Should index terms be weighted?

Approaches: Pre-coordination • Post-coordination: terms are linked at search time, when a query combines a set of terms with AND. • Pre-coordination: links among terms are specified in the index. Pre-coordination improves retrieval for post-coordinated queries. • Example: Document discusses drilling of oil wells in Mexico by CITGO and introduction of oil refineries in Peru by the U.S. • No pre-coordination of terms: • oil, wells, Mexico, CITGO, refineries, Peru, U.S. • Document retrieved if query links “oil”, “Mexico” and “Peru”. • Simple pre-coordination: • (oil wells, Mexico, CITGO) • (oil refineries, Peru, U.S.) • Document not retrieved if query links “oil”, “Mexico” and “Peru”.

Example (cont.) • Pre-coordination with position indicating role: • (CITGO, drill, oil wells, Mexico) • (U.S., introduce, oil refineries, Peru) • Discriminates which country introduces refineries into the other country. • Pre-coordination with modifier indicating role: • (Subject: CITGO, Action: drill, Object: oil wells, Modifier: in Mexico) • (Subject: U.S., Action: introduce, Object: oil refineries, Modifier: in Peru) • If the document discussed the U.S. introducing refineries in Peru, Bolivia, and Argentina, one entry is used with three Modifier fields.
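
The difference between the unlinked and the simply pre-coordinated index in the CITGO example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the phrase "oil wells" is split into the separate terms "oil" and "wells" so that a query on "oil" can match either representation.

```python
# Hypothetical index entries for the single example document.
# Without pre-coordination: one unlinked bag of terms.
flat_index = {"oil", "wells", "Mexico", "CITGO", "refineries", "Peru", "U.S."}

# Simple pre-coordination: terms that belong together are grouped in the index.
precoordinated_index = [
    {"oil", "wells", "Mexico", "CITGO"},
    {"oil", "refineries", "Peru", "U.S."},
]


def matches_flat(query, index):
    """AND query against an unlinked index: every query term must be present."""
    return set(query) <= index


def matches_precoordinated(query, groups):
    """AND query against a pre-coordinated index: all terms must share one group."""
    return any(set(query) <= group for group in groups)


query = ["oil", "Mexico", "Peru"]
print(matches_flat(query, flat_index))                      # retrieved
print(matches_precoordinated(query, precoordinated_index))  # not retrieved
```

The query mixes terms from two unrelated statements, so only the unlinked index returns the document; a query such as "oil" AND "Mexico" still matches the first pre-coordinated group.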

Automatic Indexing • System automatically determines the index terms assigned to documents. • Relative advantages: • Human indexing: • Ability to determine concept abstractions. • Ability to judge the value of concepts. • Automatic indexing: • Reduced cost: once the initial hardware cost is amortized, the operational cost is lower than compensation for human indexers. • Reduced processing time: at most a few seconds vs. at least a few minutes. • Improved consistency: algorithms select index terms much more consistently than humans.

Weighted and Unweighted Indexes • Unweighted indexing: • No attempt to determine the value of the different terms assigned to a document. Not possible to distinguish between major topics and casual references. • All retrieved documents are equal in value. • Typical of commercial systems through the 1980s. • Weighted indexing: • Attempt made to place a value on each term as a description of the document. • This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use this term (lower is better). • Query weights and document weights are combined into a value describing the likelihood that a document matches a query, and a threshold value limits the number of documents returned. • Typically used only with automatic indexing.
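
The two factors named for weighted indexing (term frequency within the document, and the number of collection documents using the term) are often combined as a TF-IDF weight. The sketch below assumes the common form tf × log(N / df); the slide does not specify a formula, so this is one plausible instance rather than the method it describes, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weight grows with the term's
    frequency in the document and shrinks as more documents use the term.
    Assumed formula: tf * log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights


docs = [["oil", "oil", "mexico"], ["oil", "peru"]]
for doc_weights in tf_idf_weights(docs):
    print(doc_weights)
```

In this toy collection "oil" appears in every document, so its weight is zero even though it is the most frequent term in the first document, while "mexico" and "peru" receive positive weights.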

Automatic Indexing by Term and by Concept • Indexing by term: The document is represented by terms extracted from the document. • The Vector Model • The Bayesian Model • Natural language processing • Indexing by concept: The document is represented by concepts not necessarily used in the document.

Indexing by Term: the Vector Model • The SMART system, developed by Salton at Cornell University. • Each document is stored as a vector of weights. • Each vector position represents a term in the database domain (the dimension of these vectors is the size of the vocabulary). • The query is represented by a similar vector. • Search involves calculating the vector distance between the query vector and each document vector.
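
A minimal sketch of search in the vector model follows. It assumes cosine similarity as the comparison between query and document vectors; the slide only says "vector distance", so the measure is an assumption, as are the function names.

```python
import math


def to_vector(term_weights, vocabulary):
    """Place each term's weight at that term's position in the vocabulary."""
    return [term_weights.get(term, 0.0) for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_weights, doc_weights_list, vocabulary):
    """Return (similarity, document index) pairs, most similar first."""
    q = to_vector(query_weights, vocabulary)
    scored = [
        (cosine_similarity(q, to_vector(d, vocabulary)), i)
        for i, d in enumerate(doc_weights_list)
    ]
    return sorted(scored, reverse=True)


vocabulary = ["oil", "mexico", "peru"]
documents = [{"oil": 1.0, "mexico": 1.0}, {"peru": 1.0}]
print(rank({"mexico": 1.0}, documents, vocabulary))
```

Every vector has one position per vocabulary term, so the vectors are long and mostly zero in practice; a real system would use a sparse representation rather than the dense lists shown here.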

Indexing by Term: the Bayesian Model • Bayes' rule of conditional probability: • P(A|B) = P(A,B)/P(B) = P(A)P(B|A)/P(B) • Bayesian methods can be used to determine the processing tokens and their weights. • Principle: calculate the (posterior) probability that a given document contains concept C, given the presence of features (words) F1,…,Fm in the document. • To calculate this probability we need to know: • The prior probability that the document is relevant to the concept C. • The conditional probability that the features Fi are present in a document, given that the document is relevant to the concept C.
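
The posterior calculation can be sketched as follows. Two assumptions go beyond the slide: the features are treated as conditionally independent (the "naive Bayes" simplification), and the probability of each feature in documents not relevant to C is also supplied, since it is needed to compute P(F1,…,Fm). The function name and the numbers are illustrative only.

```python
def posterior(prior, p_feature_given_c, p_feature_given_not_c):
    """P(C | F1..Fm) via Bayes' rule, assuming independent features.

    prior                  -- P(C)
    p_feature_given_c      -- [P(Fi | C)] for the features present
    p_feature_given_not_c  -- [P(Fi | not C)] for the same features
    """
    joint_c = prior            # accumulates P(C) * product of P(Fi | C)
    joint_not_c = 1.0 - prior  # accumulates P(not C) * product of P(Fi | not C)
    for p_c, p_not_c in zip(p_feature_given_c, p_feature_given_not_c):
        joint_c *= p_c
        joint_not_c *= p_not_c

    evidence = joint_c + joint_not_c  # P(F1..Fm)
    if evidence == 0:
        return 0.0
    return joint_c / evidence


# Illustrative numbers: two features that are much likelier under concept C.
print(posterior(0.1, [0.8, 0.6], [0.1, 0.2]))
```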

Indexing by Term: Natural Language Processing • The DR-LINK system. • Enhance indexing by using semantic information (in addition to statistical information). • Process the language, rather than treat each word as an independent entity. • Process documents at different levels: morphological, lexical, semantic, syntactic, and discourse (beyond the sentence).

Indexing by Concept • There are many ways to represent the same idea, and increased retrieval performance comes from using a single representation. • Hence, a single canonical set of concepts is determined and is used for indexing all documents. • The MatchPlus system: • A set of n features (concepts) is selected. • For each word stem a context vector of dimension n is built, describing how strongly the stem reflects each feature. • The context vectors for the word stems are combined with a weighted sum, to create a single context vector for the entire document. • This vector represents the document in terms of the concepts. • Queries go through the same analysis to determine vector representations. • During search, the query vector is compared to document vectors.
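
The weighted-sum step can be sketched as below. This is not MatchPlus code: the two features, the stem context vectors, and the stem weights are invented purely for illustration, and the function name is hypothetical.

```python
def document_context_vector(stem_weights, context_vectors, n_features):
    """Combine per-stem context vectors into one document vector.

    stem_weights    -- {stem: weight of that stem in the document}
    context_vectors -- {stem: list of n_features feature strengths}
    Stems without a context vector are skipped.
    """
    doc_vector = [0.0] * n_features
    for stem, weight in stem_weights.items():
        stem_vector = context_vectors.get(stem)
        if stem_vector is None:
            continue
        for i in range(n_features):
            doc_vector[i] += weight * stem_vector[i]
    return doc_vector


# Illustrative features: position 0 = "extraction", position 1 = "processing".
context_vectors = {"drill": [1.0, 0.0], "refin": [0.0, 1.0]}
print(document_context_vector({"drill": 2.0, "refin": 1.0}, context_vectors, 2))
```

Because a query goes through the same function, query and document end up in the same n-dimensional concept space and can be compared even when they share no words.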

Information Extraction • Two processes related to indexing: • Extraction of facts (e.g., when building indexes automatically). • Document summarization. • Extraction of facts into a database: • Extract specific types of information using extraction criteria (indexing attempts to represent the entire document). • Recall now refers to how much information was extracted from a document (vs. how much should have been extracted). • Precision now refers to the proportion of the extracted information that is accurate. • Experiments show that automatic extraction performs much worse than human extraction (55% precision and recall vs. about 80%), but operates about 20 times faster.
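
Under these redefinitions, precision and recall for an extraction run can be computed against a hand-built reference set. The sketch below assumes facts can be compared for exact equality; the tuples and the function name are hypothetical.

```python
def extraction_scores(extracted, reference):
    """Return (precision, recall) for a set of extracted facts.

    precision -- share of extracted facts that are correct
    recall    -- share of the facts that should have been extracted that were
    """
    extracted, reference = set(extracted), set(reference)
    correct = extracted & reference
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


extracted = {("CITGO", "drills", "Mexico"), ("U.S.", "drills", "Peru")}
reference = {("CITGO", "drills", "Mexico"), ("U.S.", "introduces", "Peru")}
print(extraction_scores(extracted, reference))
```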

Information Extraction (cont.) • Document summarization: • Extract the most important ideas, while reducing the size significantly. • Example: the abstract of a document. • “True summarization” is not feasible. • Instead, most summarization techniques extract the “most significant” subsets (e.g., sentences), and concatenate them. • Each sentence is assigned a score, and the highest scoring sentences are extracted. • No guarantee of a coherent narrative. • Heuristic algorithms, with no overall theory. For example: • Consider sentences over 5 words in length. • Look for “cues”; e.g., “in conclusion”. • Focus on the first 10 and last 5 paragraphs.
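
An extractive summarizer built from two of these heuristics (the length filter and the cue phrase) might look like the sketch below. The base score, the average collection frequency of a sentence's words, is an assumption added to make the example complete; the slide does not say how sentences are scored. The paragraph-position heuristic is omitted, and all names and parameter values are hypothetical.

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion",)


def summarize(text, max_sentences=2, min_words=6, cue_bonus=2.0):
    """Pick the highest scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"[a-z]+", sentence.lower())
        if len(words) < min_words:  # keep only sentences over 5 words
            continue
        # Assumed base score: average frequency of the sentence's words.
        score = sum(word_freq[w] for w in words) / len(words)
        if any(cue in sentence.lower() for cue in CUE_PHRASES):
            score += cue_bonus
        scored.append((score, position, sentence))

    top = sorted(scored, key=lambda item: item[0], reverse=True)[:max_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))


text = (
    "Indexing maps documents to index terms for later search. Short one. "
    "In conclusion, automatic indexing is cheaper and more consistent "
    "than manual indexing."
)
print(summarize(text, max_sentences=1))
```

The selected sentences are simply concatenated, which is why such summaries carry no guarantee of a coherent narrative.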
