
An Experiment in Using Lexical Disambiguation to Enhance Information Access



Presentation Transcript


  1. An Experiment in Using Lexical Disambiguation to Enhance Information Access Robert Wilensky, Isaac Cheng, Timotius Tjahjadi, and Heyning Cheng

  2. Goal
  • Enhance information access by
    • fully automated text categorization
    • adding searching by word sense
  • Applied to the World Wide Web

  3. Manual vs. Automatically Created Directories
  • Manual classification of documents is:
    • Expensive
    • Not scalable
    • Hard to keep up with the rapid growth and change of information sources such as the Web
  • Would like fully automatic classification:
    • no training set
    • no rules
    • appeal instead to “intrinsic semantics”

  4. Lexical Disambiguation
  • Problem: determine the intended sense of an ambiguous word
  • Approach: based on Yarowsky et al.
    • Thesaurus categories as proxies for senses
    • We used Roget’s 5th
  • Training: count nearby word–category co-occurrences
  • Deployment: add up the word-category evidence

  5. Counting Co-occurrences of Terms with Categories
  • Example: “…while storks and cranes make their nests in the bank…”
  • The result is a category co-occurrence vector for each term, e.g. [Tools, Animals] for “cranes”.
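The training and deployment steps above can be sketched in a few lines. The toy thesaurus entries, corpus, and function names below are hypothetical illustrations, not the actual IAGO! implementation (which used Roget’s 5th and large training corpora):

```python
from collections import Counter, defaultdict

# Hypothetical toy thesaurus: word -> Roget-style categories.
# "cranes" is ambiguous between Tools (machines) and Animals (birds).
THESAURUS = {
    "storks": {"Animals"},
    "nests": {"Animals"},
    "hoist": {"Tools"},
    "cranes": {"Animals", "Tools"},
}

def train(corpus, half_window=10):
    """For each term, count co-occurrences with the thesaurus
    categories of nearby words (within +/- half_window tokens)."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - half_window), min(len(tokens), i + half_window + 1)
            for j in range(lo, hi):
                if j != i:
                    for cat in THESAURUS.get(tokens[j], ()):
                        vectors[term][cat] += 1
    return vectors

def disambiguate(term, context, vectors):
    """Add up the word-category evidence: score each candidate sense
    of `term` by summing the context words' co-occurrence counts."""
    candidates = THESAURUS.get(term, set())
    scores = {c: sum(vectors[w][c] for w in context) for c in candidates}
    return max(scores, key=scores.get)

vectors = train(["while storks and cranes make their nests in the bank"])
sense = disambiguate("cranes", ["storks", "nests"], vectors)  # -> "Animals"
```

Here the “storks” and “nests” context pulls the Animals evidence for “cranes” above the Tools evidence, which is the slide’s point.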

  6. Automatic Topic Assignment Based on Word Sense
  • Hearst: topic → word-category association vectors
  • Fisher and Wilensky:
    • Contrasted different algorithms
    • Concluded that exploiting word senses may improve topic assignment
  • We use the prior probability distribution of word senses (and, more recently, disambiguation per se)
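One way to read the “prior probability distribution of word senses” idea: each word contributes its prior sense mass to every category it can denote, rather than full weight to all of them. A minimal sketch with hypothetical priors (the numbers and words are made up for illustration):

```python
from collections import Counter

# Hypothetical priors P(category | word), e.g. as estimated from web text.
PRIORS = {
    "cranes": {"Animals": 0.7, "Tools": 0.3},
    "storks": {"Animals": 1.0},
    "hoist":  {"Tools": 1.0},
}

def topic_scores(tokens):
    """Score each candidate topic by the summed prior sense mass of the
    document's words; ambiguous words split their weight across senses."""
    scores = Counter()
    for t in tokens:
        for cat, p in PRIORS.get(t, {}).items():
            scores[cat] += p
    return scores

scores = topic_scores(["storks", "cranes", "hoist"])
# Animals: 1.0 + 0.7 = 1.7; Tools: 0.3 + 1.0 = 1.3
```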

  7. IAGO 0.1 vs. 1.0
  • IAGO 0.1:
    • Eliminated short (< 100 content words) pages
    • Trained on newswire text
  • IAGO 1.0:
    • Trained on the Encarta encyclopedia
    • Estimated word sense priors on the Web (used 10 million words of random web documents)
    • Ignored proper nouns
    • Augmented the stop list to deal with various problems
  • Tested categorization by mapping Yahoo categories to ours
  • Tested disambiguation on newswire, then on a sample of the Web

  8. IAGO! Overview

  9. Classification Results — Then (version 0.1) vs. Now (version 1.0)

  Category Name       v0.1 Precision  v0.1 Recall  v1.0 Precision  v1.0 Recall
  -----------------   --------------  -----------  --------------  -----------
  ComputerScience     31.6%           17.1%        87.5%           19.4%
  FinanceInvestment   94.4%           22.0%        100.0%          13.4%
  FitnessExercise     100.0%          4.3%         100.0%          1.8%
  MotionPictures      100.0%          57.1%        100.0%          54.8%
  Music               97.5%           58.3%        98.2%           42.4%
  Nutrition           80.3%           35.6%        97.9%           29.9%
  Occupation          100.0%          13.1%        97.8%           30.3%
  TheEnvironment      n/a             0.0%         n/a             0.0%
  Travel              50.0%           5.7%         75.0%           15.4%

  Overall (v0.1): precision = 88%, recall = 23%
  Overall (v1.0): precision = 97%, recall = 21% (92.3% and 20.4% with no adjustment by hand)

  10. IAGO! 1.0 Internet Directory
  • Used the engine to classify a few tens of thousands of web documents into Roget’s categories.

  11. Disambiguation Results

  12. Application to Text Searching
  • Present the user with the set of known word senses from which to select
    • e.g., keyword = “rock”: sense = stone, or sense = kind of music
  • Retrieve by word, filter by word sense
  • Rank by number of matching word senses
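The retrieve-filter-rank pipeline could look like the sketch below. The toy index and sense labels are hypothetical, standing in for a real word-sense index:

```python
# Hypothetical toy index: doc id -> disambiguated sense per word.
DOCS = {
    "d1": {"rock": "stone",         "climb": "ascend"},
    "d2": {"rock": "kind of music", "band": "ensemble"},
    "d3": {"band": "strip"},
}

def search(query):
    """`query` maps each keyword to the user-selected sense.
    Retrieve by word, filter by word sense, and rank by the
    number of query words whose sense matches."""
    hits = []
    for doc_id, senses in DOCS.items():
        if not any(w in senses for w in query):        # retrieve by word
            continue
        matches = sum(1 for w, s in query.items() if senses.get(w) == s)
        if matches:                                    # filter by sense
            hits.append((matches, doc_id))
    hits.sort(key=lambda h: -h[0])                     # rank by matches
    return [doc_id for _, doc_id in hits]

results = search({"rock": "kind of music"})  # -> ["d2"]
```

Document d1 is retrieved by the word “rock” but filtered out because its sense is “stone”; d3 never contains the keyword at all.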

  13. Is it Useful?
  • Results in the literature generally suggest that disambiguation is not useful for long queries, and that its utility is highly sensitive to disambiguation accuracy.
  • However, 40% of search queries on the web are reported to be single words.
  • So: does disambiguation work well enough to aid single-word queries?

  14. Usefulness
  • Let r be the frequency of the most common of a word’s (non-overlapping) senses.
  • Can show that, to be better than plain keyword retrieval, disambiguation accuracy needs to be at least 50% for small r, with the required accuracy increasing as r increases; but it need not be highly accurate. (Below that threshold, disambiguation can perform below the keyword baseline.)
  • IAGO! 1.0 performs well above this level.

  15. Usefulness
  • Keyword retrieval yields word-sense retrieval precision and recall of r and 1 for the common sense, and (1 − r) and 1 for the less common sense.
  • A disambiguation method that is correct a fraction p of the time would have precision and recall of rp / (rp + (1 − r)(1 − p)) and p for a word sense with frequency r.
  • Using E as the metric, can show that p needs to exceed a threshold (a function of r and the weighting β) for a disambiguation method to outperform keyword retrieval.
  • For small r, p must be greater than 50%. For large r, this compares favorably with keyword retrieval even with fairly low disambiguation accuracy.
  • E.g., with a 90/10 distribution of word senses, then, for the more common word sense, E, with a beta of .5, is better for a disambiguation algorithm with an accuracy over 77% than for keyword retrieval. (For the less common word sense, a “disambiguation” algorithm that is completely random gives a superior result.)
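The 90/10 figure can be checked numerically. This is a reconstruction, not the authors’ code: it assumes van Rijsbergen’s E-measure (lower is better) and a two-sense model in which common-sense documents (fraction r) are labelled correctly with probability p and the rest are mislabelled with probability 1 − p:

```python
def e_measure(precision, recall, beta=0.5):
    """van Rijsbergen's E (lower is better):
    E = 1 - (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta ** 2
    return 1 - (1 + b2) * precision * recall / (b2 * precision + recall)

def disamb_precision(r, p):
    """Precision on the common sense under the two-sense model:
    rp true positives vs. (1-r)(1-p) mislabelled rare-sense docs."""
    return r * p / (r * p + (1 - r) * (1 - p))

r = 0.9                                    # 90/10 sense distribution
e_keyword = e_measure(r, 1.0)              # keyword retrieval: P = r, R = 1
e_77 = e_measure(disamb_precision(r, 0.77), 0.77)
e_76 = e_measure(disamb_precision(r, 0.76), 0.76)
# e_77 < e_keyword < e_76: under these assumptions, accuracy just above
# 77% beats keyword retrieval on the common sense, matching the slide.
```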

  16. More results
  • Latest implementation (by Heyning Cheng) reduces training to about 1 hour (from about 24); classifying 1000 documents takes about 10 minutes.
  • Also improved the performance of disambiguation, making it practical to use disambiguation in topic assignment:
    • Produces slightly better results; also appears to be less sensitive to changes in the stop list, and can be made to run quickly.
  • Disambiguation with a substantially smaller window size (even as small as 5) did not reduce accuracy; in some cases, a half-window size of 10 out-performed one of 50.

  17. More results (cont’d)
  • Weighted word sense priors by the IDF of the term

  18. More Results
  • Excluding low-utility or confusing Roget’s categories (down to about 200) improved recall to about 40% on the 1000-document test set.
  • The “purity” of a topic assignment (the percentage of all word senses disambiguated to the assigned topic) seems correlated with accuracy at least as well as IAGO’s ranking algorithm.
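One plausible reading of that purity metric as code, inferred from the slide’s parenthetical (the function name and example counts are hypothetical):

```python
def purity(assigned_topic, sense_counts):
    """Fraction of all disambiguated word senses in a document that
    fall under the topic the classifier assigned to the document."""
    return sense_counts.get(assigned_topic, 0) / sum(sense_counts.values())

# e.g. a document assigned "Music" where 12 of its 15 disambiguated
# word senses belong to Music:
p = purity("Music", {"Music": 12, "Travel": 2, "Animals": 1})  # 12/15 = 0.8
```

A high-purity assignment is one where the document’s disambiguated senses mostly agree with the chosen topic, which is why it could plausibly track accuracy.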

  19. Future Work
  • Get better word sense proxies!
  • Word-sense searching:
    • Create a word sense index
    • Support word-sense searching within more general searches
    • Improve disambiguation by exploiting priors
    • Test against synonym expansion methods
  • Automatic topic-categorization:
    • Handle multi-word phrases; proper names

  20. Future Plans: Longer Term
  • Disambiguation:
    • Handle non-nouns
    • Better word sense source
    • Automatic grouping of thesaural word senses
  • Topic-categorization:
    • Multiple topic assignment
    • Quality
  • Summarization via the same techniques
  • Other linguistic choices, e.g., thematic roles
