This SIGIR 2001 workshop presentation by Mark Wasson of LexisNexis describes operational text classification at LexisNexis through the Topic Identification System (TIS) model. It covers Term-based Topic Identification (TTI), named entity indexing, and how frequency and weights are applied at the word-concept level rather than to individual terms. It notes TTI's use of chi-square analysis and stepwise linear regression to evaluate and weight word concepts, reports recall and precision in the 90-95% range for most tests, and emphasizes human verification and the iterative manual process used to build topic definitions.
Classification Technology at LexisNexis SIGIR 2001 Workshop on Operational Text Classification Mark Wasson LexisNexis mark.wasson@lexisnexis.com September 13, 2001
The Topic Identification System • The Topic Identification System Model • Term-based Topic Identification (TTI) • Term Mapping System • Company Concept Indexing • Named Entity Indexing (Companies, People, Organizations, Places) • Subject Indexing Prototype (not released) • NEXIS Topical Indexing
Psycholinguistics Features • Propositional Language Model Underlies Surface Forms • Word Concepts • Semantic Priming, Additive up to a Point • Spreading Activation
Terms and Word Concepts • All words and phrases are searchable – no stop words • No automatic morphological or thesaurus expansion • Exception – name variant generation, but subject to human verification • Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept
Frequency & Weighting • Frequency & weighting at word concept level rather than at individual term level • TTI used chi-square to compare individual word concepts to supervised training set • TTI used stepwise linear regression to test in combination and suggest weights • Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
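The slide says TTI used chi-square to compare individual word concepts against a supervised training set, but does not give the exact formulation. As a rough illustration only, the sketch below computes the standard Pearson chi-square statistic for a 2x2 table (relevant vs. irrelevant documents, with vs. without the word concept). The function name and the counts are hypothetical, and no continuity correction is applied.

```python
def chi_square_2x2(rel_with, rel_without, irr_with, irr_without):
    """Pearson chi-square for a 2x2 contingency table (no continuity correction).

    rel_with / rel_without: relevant training documents that do / do not
    contain the word concept; irr_with / irr_without: same for irrelevant ones.
    """
    n = rel_with + rel_without + irr_with + irr_without
    row_rel = rel_with + rel_without
    row_irr = irr_with + irr_without
    col_with = rel_with + irr_with
    col_without = rel_without + irr_without
    denom = row_rel * row_irr * col_with * col_without
    if denom == 0:
        # A row or column is empty, so the statistic is undefined; report 0.
        return 0.0
    return n * (rel_with * irr_without - rel_without * irr_with) ** 2 / denom


# Hypothetical training counts for one word concept (assumed, not from the talk).
score = chi_square_2x2(rel_with=40, rel_without=10, irr_with=5, irr_without=45)
print(f"chi-square = {score:.2f}")
```

A large value suggests the word concept is distributed differently in relevant and irrelevant documents, which would make it a candidate for the regression step that tests concepts in combination.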
Problem Word Concepts • 5 documents: 3 relevant (G), 2 irrelevant (B) • W1 in G1, G2, B1 • W2 in G2, G3, B2 • W3 in G1, G3, B1 • Each W by itself produces 67% recall, 67% precision • W1 + W2 -> 100% recall, 60% precision • W1 + W3 -> 100% recall, 75% precision • W2 + W3 -> 100% recall, 60% precision • W1 + W2 + W3 -> 100% recall, 60% precision • Also, fewer terms -> faster processing
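The figures on this slide can be checked directly. The short sketch below encodes the five documents and three word concepts exactly as listed and computes recall and precision for every combination, treating a document as retrieved if it contains any of the selected word concepts. The variable and function names are illustrative.

```python
from itertools import combinations

# Documents from the slide: G1-G3 are relevant, B1-B2 are irrelevant.
relevant = {"G1", "G2", "G3"}
postings = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}


def recall_precision(concepts):
    """Recall and precision when any of the given word concepts retrieves a document."""
    retrieved = set().union(*(postings[c] for c in concepts))
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)


results = {}
for size in range(1, len(postings) + 1):
    for combo in combinations(sorted(postings), size):
        results[combo] = recall_precision(combo)
        recall, precision = results[combo]
        print(f"{' + '.join(combo)}: recall {recall:.0%}, precision {precision:.0%}")
```

W1 and W3 share the same irrelevant document (B1), so pairing them reaches full recall while retrieving only one irrelevant document. Adding W2 brings in B2 without finding any new relevant document.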
Looking Up Terms in Documents • Count a term extra in key document parts • Headlines • Leading text • Captions • Count all potential matches • American gets counted for 100s of companies • Don’t count a term when part of another • Mead in Mead Corp. • French in French Fry
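Two of the lookup rules above can be sketched in a few lines: counting a term extra in key document parts, and not counting a term where it is part of a longer term. This is a minimal sketch under assumptions, not the production lookup: the zone names, the multipliers in `ZONE_BONUS`, and the longest-match-first strategy are all hypothetical choices. It also does not cover the "count all potential matches" rule, which concerns how one term maps to many word concepts.

```python
import re
from collections import Counter

# Hypothetical multipliers: matches in key parts count extra.
ZONE_BONUS = {"headline": 2, "lead": 2, "caption": 2, "body": 1}


def count_terms(text, terms):
    """Count whole-term matches, longest term first, so that a shorter term
    is skipped wherever it overlaps an already matched longer term."""
    counts = Counter()
    taken = [False] * len(text)  # characters already claimed by a match
    for term in sorted(terms, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            if not any(taken[m.start():m.end()]):
                counts[term] += 1
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return counts


def count_in_document(zones, terms):
    """Sum per-zone counts, applying the (assumed) zone multipliers."""
    total = Counter()
    for zone, text in zones.items():
        for term, n in count_terms(text, terms).items():
            total[term] += n * ZONE_BONUS.get(zone, 1)
    return total


doc = {
    "headline": "Mead Corp. profit up",
    "body": "Mead Corp. said Mead shares rose. French fry sales fell.",
}
terms = ["Mead", "Mead Corp.", "French", "French fry"]
totals = count_in_document(doc, terms)
print(dict(totals))
```

In this example "Mead" is counted only for its standalone occurrence in the body, and "French" is not counted at all because its only occurrence is inside "French fry".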
Calculating Topic Scores • Summation of frequency * weight across all word concepts • Normalize score • Compare to threshold • Verification range in TTI • Major references, strong passing references, weak passing references in indexing tools • Add controlled vocabulary term or marker to document if score >= threshold • Add score, any associated secondary CVTs
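The scoring steps above (sum of frequency * weight across word concepts, normalize, compare to threshold) can be sketched as follows. The slide does not specify the normalization or how the verification range is defined, so both are assumptions here: the score is normalized per 100 words of document length, and the verification range is modeled as a band just below the threshold. All names and numbers are hypothetical.

```python
def topic_score(concept_freqs, weights, doc_length):
    """Weighted sum over word concepts, normalized per 100 words (assumed)."""
    raw = sum(freq * weights.get(concept, 0.0)
              for concept, freq in concept_freqs.items())
    if doc_length <= 0:
        return 0.0
    return raw * 100 / doc_length


def classify(score, threshold, verify_margin):
    """Assign the topic at or above the threshold; flag scores within an
    assumed band below it for human verification; otherwise reject."""
    if score >= threshold:
        return "assign"
    if score >= threshold - verify_margin:
        return "verify"
    return "reject"


# Hypothetical word concepts with positive and negative weights.
weights = {"merger_terms": 2.0, "sports_terms": -3.0}
freqs = {"merger_terms": 4, "sports_terms": 1}

score = topic_score(freqs, weights, doc_length=200)
print(score, classify(score, threshold=2.0, verify_margin=0.5))
```

Negative weights let a word concept count against a topic without excluding the document outright, which is the middle ground between weighting and the absolute yes/no Boolean behavior mentioned on the Frequency & Weighting slide.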
Source-dependent, -independent • Similar field functions, different field names and locations • Database and file information to guide production processes • The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types
Manual vs. Automatic • Build each definition using iterative manual process • Use supervised learning? • TTI’s chi-square and regression • Cost of creating training samples • Automate repetitive, labor-intensive tasks • Generate name variants • Cheap labor cost – few minutes to 8 hours
Test, Test, Test • Business unit benchmarks prior to adoption • Development process test cases • Internal benchmarks with 3rd party technologies • Sorry, not TREC • Most tests, topics, sources – recall and precision both in the 90-95% range
The End? • TIS Model? 16 years old • TTI? In production for 11 years • Term Mapping? 9 years old • Entity Indexing? 6-7 years old • Topical Indexing? 3 years old • Complemented by SRA NetOwl-based indexing 2 years ago • No movement afoot to replace any of them
Related Papers • TTI • Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting. • Company Concept Indexing • Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.