
automatic indexing


Presentation Transcript


  1. automatic indexing – Jenny Winston, Andrew Coleman, Jennifer Boyter

  2. order of presentation • Part One: definition & approaches • Part Two: current efforts in automatic indexing • Part Three: the future of automatic indexing

  3. what is automatic indexing? “A method of indexing in which an algorithm is applied by a computer to the title and/or text of a work to identify and extract words and phrases representing subjects, for use as headings under which entries are made in the index” (Online Dictionary for Library & Information Science)

  4. automatic indexing vs. computer-assisted indexing • Two tasks in indexing: coming up with surrogates to represent the information and clerical tasks related to index production • Automatic indexing software searches for words in the text and builds a list of words; it attempts the “intellectual” work. • Computer-assisted software does the clerical work; a human still does the intellectual task of indexing.

  5. general approaches to automatic indexing • Statistical – Counts of words, statistical associations • Syntactical – Grammar, parts of speech • Semantic systems – Words and their meaning in terms of the context • Knowledge-based – Knowing the relationships between words

  6. extraction • Automatic indexing in its simplest form extracts all of the words from a document as possible index terms (perhaps excluding stop words – negative vocabulary control) • Concordances • This does not consider the subject or content of the document as a whole.
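
As a minimal sketch of extraction-style indexing (an illustration, not any particular product), the code below builds a concordance that maps each non-stop word to the positions where it occurs; the tiny stop-word list stands in for negative vocabulary control.

    import re
    from collections import defaultdict

    # A small stop-word list standing in for negative vocabulary control (illustrative).
    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "as"}

    def build_concordance(text: str) -> dict[str, list[int]]:
        """Map every remaining word to the token positions where it appears."""
        concordance: dict[str, list[int]] = defaultdict(list)
        for position, token in enumerate(re.findall(r"[a-z]+", text.lower())):
            if token not in STOP_WORDS:
                concordance[token].append(position)
        return dict(concordance)

    print(build_concordance("Extraction indexing lists the words of a document as candidate index terms."))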

  7. subject analysis • Computers have been attempting to extract subject terms from documents for over forty years. • Hans Peter Luhn developed a program to identify words that were indicative of subject content in natural language text. • His program was based on absolute frequency of words in a document, using negative vocabulary control.
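
In the spirit of Luhn's frequency-based approach (a sketch with assumed details, not his original program), candidate subject terms can be selected by counting how often each non-stop word occurs and keeping the most frequent ones.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}  # illustrative

    def frequent_terms(text: str, top_n: int = 5) -> list[tuple[str, int]]:
        """Rank non-stop words by absolute frequency; keep the top_n as candidate subject terms."""
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
        return Counter(tokens).most_common(top_n)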

  8. the process • Algorithms are then applied to “stem” words – bake, baker, baking can all be retrieved with the same term at the search stage • Then the idea of proximity is applied.
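
A sketch of the stemming and proximity steps follows; it assumes the NLTK package is available and uses its Porter stemmer as one common choice, with a five-token proximity window chosen purely for illustration.

    from nltk.stem import PorterStemmer  # assumes the nltk package is installed

    stemmer = PorterStemmer()
    print(stemmer.stem("baking"), stemmer.stem("bake"))  # both conflate to the stem "bake"

    def within_proximity(tokens: list[str], a: str, b: str, window: int = 5) -> bool:
        """True if stems of a and b occur within `window` tokens of each other."""
        stems = [stemmer.stem(t.lower()) for t in tokens]
        pos_a = [i for i, s in enumerate(stems) if s == stemmer.stem(a.lower())]
        pos_b = [i for i, s in enumerate(stems) if s == stemmer.stem(b.lower())]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)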

  9. statistics-based • Systems are “trained” by examining a set of documents (50-60) associated with each keyword in a thesaurus. • The process then uses scenarios from word occurrence & word location in the training documents • Probability – Bayesian statistics
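
The sketch below illustrates the general statistical idea (not any vendor's actual system): each thesaurus keyword is trained on a handful of example documents, and a new document is scored for each keyword with a naive-Bayes-style log-likelihood; class priors are omitted and Laplace smoothing is an assumed detail.

    import math
    from collections import Counter

    def train(keyword_docs: dict[str, list[str]]) -> dict[str, Counter]:
        """keyword_docs maps each thesaurus keyword to its training documents."""
        return {kw: Counter(" ".join(docs).lower().split()) for kw, docs in keyword_docs.items()}

    def score(document: str, model: dict[str, Counter]) -> dict[str, float]:
        """Log-likelihood of the document's words under each keyword's word distribution."""
        scores = {}
        for keyword, counts in model.items():
            total, vocab = sum(counts.values()), len(counts)
            scores[keyword] = sum(
                math.log((counts[w] + 1) / (total + vocab))  # Laplace smoothing
                for w in document.lower().split()
            )
        return scores

    model = train({"petroleum industry": ["oil drilling refinery crude barrels"],
                   "banking": ["loans deposits interest rates bank"]})
    print(score("crude oil refinery output rose", model))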

  10. rule-based • Simple rules are built through matching and synonyms of a list of keywords already approved. • An organization or editor can designate certain “if/then” conditions – ex: tobacco & smokeless or chewing • Define proximity, how close to one another key terms need to be before a word from the controlled vocabulary is assigned • Rules can be modified based on results.

  11. rule-based • Claim to do processing more quickly than statistical systems • Said to offer greater precision • May cost less

  12. natural language processing • A subfield of artificial intelligence and linguistics that studies the problems of automated generation and understanding of natural human languages • Latent semantic indexing – an NLP technique aiming to solve the fundamental problems of synonymy and polysemy *definitions from http://www.wikipedia.org
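
As a toy sketch of latent semantic indexing (invented counts, rank k = 2 chosen arbitrarily), a term-document matrix is factored with a truncated SVD so that documents using different but related vocabulary can be compared in a reduced "latent" space.

    import numpy as np

    # Rows are terms, columns are documents (toy counts).
    terms = ["car", "automobile", "engine", "bank", "money"]
    A = np.array([
        [2, 0, 1, 0],   # car
        [0, 2, 1, 0],   # automobile
        [1, 1, 0, 0],   # engine
        [0, 0, 0, 2],   # bank
        [0, 0, 1, 2],   # money
    ], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                        # keep only the strongest latent dimensions
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each row is a document in the 2-D latent space
    print(doc_vectors)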

  13. the debate • Can computer programs truly create high-quality indexes with no human intervention? • Advances must be made in the field of artificial intelligence, natural language processing • For now, semi-automatic indexing programs seem to be having success.

  14. current efforts in automatic indexing

  15. automatic back-of-the-book indexing? • Computers can easily construct a concordance, but this is not an index. • Computers cannot decide what is and what is not a valid reference. • Computers cannot recognize concepts which are discussed over a range of pages.

  16. unclear terminology • Key: on a piano, to unlock a door, for computer security, of a piece of music, a geographical feature (e.g., Key West), or just an abstract “something vital”? • In the computing field, software, application, and program are often used interchangeably. But program does not always mean the same thing as software.

  17. human vs. computer back-of-the-book indexing

  18. “semi-automatic” back-of-the-book indexing TExtract • Drag and drop a PDF file onto TExtract; it fully automatically creates an index, which the user can then edit. • Works with authority files • Tries to construct compound terms (“pre-existing legal relationship”), prepositional phrases (“security of transactions”) and inverted phrases (“preemptive war, doctrine of”)

  19. however… “In short, you cannot assume this software is going to pick out words or phrases on any predictable basis, and while it is easy enough to delete what is superfluous and edit what is incorrectly structured, you cannot know what it has omitted.” -- Review of TExtract in THE INDEXER, October 2005

  20. NewsIndexer • Uses a thesaurus specific to the newspaper industry • Rule-based generation of index terms • “Semi-automatic” in that a human is supposed to accept or reject the terms; compiles statistics on hits, misses, and noise to allow for rule improvement

  21. example NewsIndexer rules
  IF-ELSE rules: if the initial statement is true, the term is applied and the process ends; if the initial statement is false, a default term will be applied.
  Text to match: Norwegian
  IF (MENTIONS "language")
  ....USE Norwegian language
  ELSE
  ....USE Norway
  ENDIF
  Negative rules: negate rules under stated conditions.
  Text to match: bear
  IF (NOT NEAR "Chicago")
  ....USE Wild animals
  ENDIF
  IF rules: some additional condition(s) must apply before the thesaurus term is invoked.
  Text to match: building
  IF (NEAR "security")
  ....USE Crime prevention
  ENDIF
  Text to match: hospitals
  IF (WITH "psychiatric")
  ....USE Mental health facilities
  ENDIF
  Text to match: theater
  IF (MENTIONS "improv")
  ....USE Experimental theatre
  ENDIF
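
As a rough illustration only (not NewsIndexer's actual implementation), the sketch below shows how if/else rules of this kind could be evaluated in code. The operator semantics assumed here – MENTIONS meaning "occurs anywhere in the document", NEAR meaning "within ten tokens" – are placeholders chosen for the example.

    import re

    def mentions(text: str, word: str) -> bool:
        """True if the word occurs anywhere in the document (assumed semantics)."""
        return word.lower() in re.findall(r"[a-z']+", text.lower())

    def near(text: str, a: str, b: str, window: int = 10) -> bool:
        """True if a and b occur within `window` tokens of each other (assumed semantics)."""
        tokens = re.findall(r"[a-z']+", text.lower())
        pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
        pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)

    def index_terms(text: str) -> list[str]:
        terms = []
        if mentions(text, "Norwegian"):                                   # IF-ELSE rule
            terms.append("Norwegian language" if mentions(text, "language") else "Norway")
        if mentions(text, "bear") and not near(text, "bear", "Chicago"):  # negative rule
            terms.append("Wild animals")
        if near(text, "building", "security"):                            # IF rule
            terms.append("Crime prevention")
        return terms

    print(index_terms("Norwegian language classes will move to a new building after a security review."))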

  22. Data Harmony – M.A.I. (Machine-Aided Indexer) • Similar approach to NewsIndexer, but more generalized • Usually marketed with a module that lets users build their own thesaurus • Runs in either a fully automatic or “assisted” mode, in which users have the opportunity to review index terms

  23. Data Harmony, cont’d. Explicitly rejects a semantic approach: “We find that a rulebase approach is more efficient, more flexible, easier to maintain, and less costly to maintain. A rulebase can be easily managed by an editor, not requiring the more expensive services of a programmer. There is no limit to the number of controlled vocabulary terms or size of thesaurus it serves. Modification or addition of rules is easily accomplished. A rulebase does not require research to locate and prove a large corpus of documents that exemplify the concept represented by a single term.”

  24. LexisNexis “SmartIndexing” • Combined approach using humans and computers • Human indexers pick out the terms for a given subject, developing a controlled vocabulary. • Computer searches documents as they arrive and applies an algorithm to the terms in the controlled vocabulary, assigning a relevancy score for each term

  25. LexisNexis relevancy scores • Results might say “petroleum industry (90),” which means the article has a 90% chance of being relevant • Searchers can expand or refine their searches by adjusting the desired relevancy score: • Broadest search - all matching documents: terms(index term) • Narrower search - documents discussing a topic: terms(index term PRE/2 8*% OR 9*%) • Narrowest search - strongest discussions of a topic: terms(index term 9*%)

  26. example of Lexis taxonomy
  Information Management & Technology: APPLICATION SERVICE PROVIDERS, DATA MINING, DATA PROCESSING SERVICES, DATA WAREHOUSING, DECISION SUPPORT SYSTEMS, DOCUMENT MANAGEMENT, INFORMATION MANAGEMENT, KNOWLEDGE MANAGEMENT, LIBRARY TECHNOLOGY, METADATA MANAGEMENT
  Internet & World Wide Web: B2B ELECTRONIC COMMERCE, COMPUTER NETWORK SECURITY, CYBERCRIME, DIGITAL SIGNATURES, ELECTRONIC BILLING, ELECTRONIC COMMERCE, ELECTRONIC COMMUNICATIONS NETWORKS, ELECTRONIC MAIL, ELECTRONIC TICKETS, ELECTRONIC WALLETS, ENTERPRISE PORTALS, INTERNET & WWW, INTERNET 2, INTERNET AUCTIONS, INTERNET AUDIO, INTERNET BANKING, INTERNET BROWSERS, INTERNET CONTENT PROVIDERS, INTERNET CRIME, INTERNET FILTERS, INTERNET PRIVACY, INTERNET PUBLISHING & BROADCASTING, INTERNET RETAILING, INTERNET SERVICE PROVIDERS, INTERNET TELEPHONY, INTERNET VIDEO, MICROBROWSERS, MOBILE COMMERCE, ONLINE INFORMATION VENDORS, ONLINE LEGAL RESEARCH, ONLINE SECURITY & PRIVACY, ONLINE TRADING, SEARCH ENGINES, SECURE ONLINE TRANSACTIONS, WEB DEVELOPMENT, WEB SEARCH PORTALS, WEB SITES & PORTALS, WEB SITES, WIRELESS INTERNET ACCESS
  Networks: COMPUTER NETWORK SECURITY, COMPUTER NETWORKS, EXTRANETS, INTRANETS, LOCAL AREA NETWORKS, NETWORK PROTOCOLS, NETWORK SERVERS

  27. Lexis, cont’d. • In addition to the thousands of topical terms, the controlled vocabulary includes 340,000 companies and organizations, 20,000 personal names, and 950 places. • Automatically extracts and indexes personal names, even when not matched to controlled vocabulary. • Human indexers evaluate results weekly and adjust vocabulary, then periodically apply updates retroactively.

  28. Factiva • Similar approach to Lexis – more sophisticated • Controlled vocabulary is polyarchical, while Lexis’s is a simpler classification scheme. • Factiva seems to use a more sophisticated linguistics-based algorithm, whereas Lexis extracts terms from a controlled vocabulary

  29. Factiva polyarchical structure
  Internet/Online services: E-commerce, Internet browsers, Internet portals, Internet search engines, Internet service providers, etc.
  Computers: Computer hardware, Computer services, Computer stores, Networking, Semiconductors, Software
  Software: Applications software, GroupWare, Intelligent agents, Internet browsers, etc.
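
To make "polyarchical" concrete, here is a small illustrative sketch (the data structure is assumed, not Factiva's): a term may have more than one broader term, as "Internet browsers" does in the excerpt above, so walking upward can reach several branches.

    # Broader-term relationships; a term may have several parents (polyhierarchy).
    BROADER: dict[str, list[str]] = {
        "Internet browsers": ["Internet/Online services", "Software"],
        "Software": ["Computers"],
        "Internet/Online services": [],
        "Computers": [],
    }

    def all_broader_terms(term: str) -> set[str]:
        """Collect every broader term reachable from `term`, across all parent paths."""
        found: set[str] = set()
        stack = [term]
        while stack:
            for parent in BROADER.get(stack.pop(), []):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    print(all_broader_terms("Internet browsers"))
    # {'Internet/Online services', 'Software', 'Computers'}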

  30. setting up Factiva rules

  31. Nstein • Claims to collect and index “structured and unstructured information, in almost any language, from multiple sources such as e-mails, reports, chat rooms, newsfeeds, Web pages, conversations from call centers and so on.” • Very highly tailored to indexing specific subjects for specific clients.

  32. Nstein methods • Use all four approaches – statistical, syntactic, semantic, and knowledge-based. “Linguistic DNA.” • Works with any controlled vocabulary – customers’ existing, off-the-shelf, or custom-built. • Automated summarization provides an “abstract.” • Automated “similar documents” extraction suggests documents containing similar topics.

  33. the future of automatic indexing

  34. the future of automatic indexing • Will automatic indexing one day replace human indexers? • Will automatic indexing help human indexers do their job faster and easier? • Is it really possible for a computer to parse full sentences, recognize the core ideas, the important terms, and the relationships between related concepts? • Or is it unlikely that the technology will ever be good enough to completely mimic the skills and talents of human indexers?

  35. human indexers defend themselves • “Indexing is an arcane art whose time has not yet come” (Wright). • “If we rely totally on automation to retrieve information, some will be lost” (Wright). • “Those who advocate automatic software, however, would argue that the machine gets ‘close enough’ so that a human can edit the resulting product. However, expert evaluators unanimously agree that the software fails; those who disagree are likely those who are sufficiently ignorant of indexing in the first place such that they are unable to determine the quality differences” – Ouch! (Maislin).

  36. humans vs. machines Computer-generated results are often more like concordances than truly usable indexes. 1. Abstraction is more important than alphabetization. Abstractions are the result of intellectual processes based on judgments about what to include and what to exclude. 2. Index headings do not depend solely on terms used in the document.

  37. humans vs. machines 3. Indexes should not contain headings for topics for which there is no information in the document. • Can a computer tell when a name is being used in a trivial and non-useful way? 4. Headings and subheadings should be tailored to the needs and viewpoints of anticipated users. (Tulic)

  38. humans vs. machines • Index can use cross-references to alert user to wider or allied concepts, while a computer can only find designated words and phrases and can’t identify text that may convey the same meaning but use different words.

  39. importance of aboutness • Human indexers use their knowledge to find the “aboutness” of a document. • It still takes human analysis to provide oversight on “aboutness” • Example – Factiva and LexisNexis both use human indexers to review and check results. • Almost all automatic indexing programs use human indexers to tweak rules and algorithms to improve results.

  40. humans vs. machines “To date, no one has found a way to provide computer programs with the judgment, expertise, intelligence or audience awareness that is needed to create usable indexes. Until they do, automatic indexing will remain a pipe dream” (Tulic).

  41. or is it just a matter of time? • Is it just a matter of time until the technology IS good enough? • Indexer Seth Maislin says he is “a firm believer that the technology doesn’t exist, and that a human being is required to write an index.” • But concedes that “good automatic indexes will exist once there’s good artificial intelligence, something that presently doesn’t exist.”

  42. the other side • Cost – Automatic Indexing is much cheaper on a per-unit basis • Time – Automatic Indexing can index large amounts of material in a short time period • Content to be Indexed – Automatic Indexing is routinely used on the full text, where a human indexer may be limited to abstracts.

  43. the other side • Exhaustivity – Automatic indexing is by its nature more exhaustive and inclusive. • Headings – Human indexing is better at combining terms, identifying context. • Vocabulary – Human indexing has advantage: can cross-reference, link synonyms, display related terms. This is still a work in progress for automatic indexing.

  44. the other side Automatic Indexing: • Predictable, less biased than human indexing • Cheaper and faster • Becoming more sophisticated • Good for materials that are homogeneous • Somewhat inflexible – takes time to adapt rules / algorithms to include new vocabulary

  45. what the research shows • Research finds that automatic and human indexing produce different results. • But that users find them “on balance, more or less equally effective” • Similar evidence comes from observing the behavior of expert searchers. • When they have access to indexing from both approaches, they generally use both, preferring human indexing for some types of searches and automatic indexing for others.

  46. what the research shows "The bottom line is clear, however: automatic indexing works! And it appears to work just as well as human indexing, just differently. Automatic indexing is also considerably faster and cheaper than indexing based on human intellectual analysis. Automatic indexing can be applied to enormous collections of messages (such as the world-wide web) where the volume of texts and constant change, both in individual texts and in the composition of the collection as a whole, makes human indexing impractical, if not impossible" (Anderson, 2001a)

  47. who will embrace automatic indexing? • Corporations and government agencies have embraced automated indexing due to significant cost and time savings. • Is this because they do not require such a precise level of indexing? Is it good enough for what they need? • Represents knowledge management – used with government records such as property records. • Users do report success: a client with a rule-based system reported an accuracy rate of 92% (Hlava, 2005).

  48. real-world success story • Medical Text Indexer (MTI) has been used since 2002 by the National Library of Medicine (NLM) to assist human indexers in their indexing of MEDLINE by selecting appropriate Medical Subject Headings (MeSH). • Five nights a week, MTI indexes 3,700 citations at about 530 per hour. • MTI uses text words in article titles and abstracts to generate a ranked list of potentially applicable MeSH terms. • Other records are automatically indexed as they are added to the database. • Allows for indexing of records that would not normally be indexed due to the sheer number of medical journals.
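
The sketch below conveys the general shape of that output – a ranked list of candidate controlled-vocabulary terms driven by title and abstract words – using a simple overlap score and a made-up vocabulary; it is an illustration, not NLM's actual MTI algorithm.

    import re

    def rank_candidate_terms(title: str, abstract: str, vocabulary: list[str]) -> list[tuple[str, int]]:
        """Rank controlled-vocabulary terms by how many of their words appear in the title/abstract."""
        text_words = set(re.findall(r"[a-z]+", (title + " " + abstract).lower()))
        scored = [(term, sum(w in text_words for w in re.findall(r"[a-z]+", term.lower())))
                  for term in vocabulary]
        return sorted([(t, s) for t, s in scored if s > 0], key=lambda pair: pair[1], reverse=True)

    print(rank_candidate_terms(
        "Aspirin therapy in coronary artery disease",
        "We review aspirin use for prevention of myocardial infarction.",
        ["Aspirin", "Coronary Artery Disease", "Myocardial Infarction", "Diabetes Mellitus"],
    ))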

  49. future of automatic indexing • Both types of indexing make important, but different, contributions to successful information retrieval. • But human indexing gets more expensive, while automated indexing gets cheaper and more effective. • Maximize benefits by allocating human analysis and indexing to documents where the benefits of human expertise are most apparent. • We must stop treating every document as if all were equally important (Anderson, 2001b).

  50. future of automatic indexing • Develop methods for predicting the most important documents and devoting human analysis to them. • All documents can receive inexpensive, relatively effective automatic indexing. • For important documents, can be augmented by human indexing, to make them even more accessible. • Can identify important documents by studying usage, citation patterns, reviews.
