Special Topics on Information Retrieval

Special Topics onInformation Retrieval Manuel Montes y Gómezhttp://ccc.inaoep.mx/~mmontesg/ mmontesg@inaoep.mx

Beyond word-based representations

Content of the section • Language ambiguity and IR • Indexing with parts of speech • POS tagging • Indexing with senses • Approaches for word sense disambiguation • Concept indexing • DOR and TCOR representations • Random indexing Special Topics on Information Retrieval

Language ambiguity • Ambiguity is a condition where information can be understood or interpreted in more than one way. • Context may play a role in resolving ambiguity. • Different kinds of ambiguity: • Lexical: words may have different meanings • Syntactic: sentence can be parsed in more than one way (or words having two parts of speech). • Semantic: words or concepts have an inherently diffuse meaning based on informal usage Special Topics on Information Retrieval

Ambiguity and IR – looking for what? • “Paris Hilton” • Really interested in The Hilton Hotel in Paris? • “Tiger Woods” • Searching something about wildlife or thefamous golf player? • Conclusion, “simple word matching fails”.

Examples of ambiguity • Lexical: • “Plants/N need light and water” vs. “Each one plant/V one” • “The fisherman jumped off the bank and into the water” vs. “The bank down the street was robbed!” • Syntactic • He ate the cookies on the couch • He was seated on the couch or the cookies were there? Special Topics on Information Retrieval

Ambiguity and IR – two problems • Most IR models represent documents as “bag of words” • There is no information on the words’ positions. • Two main problems: • Synonymy: many ways to refer to the same object, e.g. car and automobile • leads to poor recall • Polysemy: most words have more than one distinct meaning, e.g. model, bank, chip • leads to poor precision

Example: Vector Space Model(Taken from Lillian Lee) auto engine bonnet tyres lorry boot car emissions hood make model trunk make hidden Markov model emissions normalize Synonymy Will have small cosine but are related Polysemy Will have large cosine but not truly related

First idea: indexing with POS tags Whole vocabulary of the collection with POS tags Weight indicating the contributionof term-pos j in document i. • Simple and nice idea, but how to determine the POS tag of each word of a given document? Special Topics on Information Retrieval

Part-Of-Speech tagging(based on matieial from Dana S. Nau of University of Maryland and Huong LeThanh of the Dresden University of Technology) • Part of Speech (POS) tagging is the problem of assigning each word in a sentence the part of speech that it assumes in that sentence. • Input: a string of words + a tag set • Output: a single best tag for each word • Example (from Penn Treebank): • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. Special Topics on Information Retrieval

Brown/Penn Treebank tags Special Topics on Information Retrieval

Main approaches • Rule-Based POS tagging • e.g., ENGTWOL [ Voutilainen, 1995 ] • Transformation-based tagging • e.g.,Brill’s tagger [ Brill, 1995 ] • Stochastic (Probabilistic) tagging • e.g., TNT [ Brants, 2000 ] • Necessitates a training corpus (the Brown Corpus) • Based on probability of certain tag occurring, given information from the word and previous tags. Special Topics on Information Retrieval

Very first approach • Assign each word its most likely POS tag • If w has tags t1, …, tk, then can use • P(ti|w) = c(w,ti)/(c(w,t1) + … + c(w,tk)), where • c(w,ti) = number of times w/ti appears in the corpus • Success: 91% for English • For instance, heat is more used as a noun than as a verb. Special Topics on Information Retrieval

HMM tagging • A HMM simplifying assumption: the tagging problem can be solved by looking at nearby words and tags. ti = argmaxj P(tj| tj-1 )P(wi|tj ) Previous tag sequence (tag co-occurence) word (lexical) likelihood Special Topics on Information Retrieval

Example Secretariat/NNP is/VBZ expected/VBNto/TO race/VBtomorrow/NN • Suppose we have tagged all but race • Look at just preceding word (bigram): • to/TO race/??? NN or VB? • Choose tag with greater of the two probabilities: • P(VB|TO)P(race|VB) or P(NN|TO)P(race|NN) Special Topics on Information Retrieval

Does indexing with POS work? • Improves precision but reduces recall. • Conclusion, annotating POS does not seem worthy as a standalone indexing strategy, even if tagging is performed manually. • Example: • Query: “talented baseball player” • Document: “is one of the top talents of the time” Special Topics on Information Retrieval

Second idea: motivation • Using single words as index terms generally has good exhaustivity, but poor specificity due to word ambiguity. • Some word associations have a totally different meaning of the “sum” of the meanings of the words that compose them. • Hot + dog ≠ “hot dog” • To remedy this problem: use index terms more complex than single words, such as phrases. • Distinguish the two meanings by using phrasal index terms such as “bank of the Seine” and “bank of Japan” Special Topics on Information Retrieval

Second idea: indexing with phrases Extracted phrases from the collection Weight indicating the contributionof phrase j in document i. • Here the questions are, which kind of word sequences are relevant phrases?, how to extract them? Special Topics on Information Retrieval

Syntactical phrases as index terms This apple pie looks good and is a real treat • adjective-noun relation (real-treat) • noun-noun relation (apple-pie) • subject-verb relation (pie-looks) • verb-object relation (is-treat) • The complication is that they are extracted from the POS tagged text or from the syntactic tree. Special Topics on Information Retrieval

Named entities as index terms • Proper names in texts • Three universally accepted categories: person, location and organisation • Other categories: date/time expressions, measures (percent, money, weight etc), email addresses, etc. • One problem: they can be also ambiguous! • George Bush: person or location? • Mexico: geo-political organization or location? • How to detect named entities? Special Topics on Information Retrieval

Named entity recognition • Two tasks: identification and classification • Two main approaches: • Knowledge-based • rule based; developed by experienced language engineers; make use of human intuition • Names often have internal structure and style. • Learning-based • Use statistics or machine learning methods • Requires large amounts of labeled documents • Typical features are: Capitalisation, numeric symbols, punctuation marks, position in the sentence and the words. Special Topics on Information Retrieval

N-grams as index terms • N-gram is a subsequence of n items from a given sequence • N-grams are easily computed • Combining n-grams for different sizes produces great flexibility at searching time. • Main problem is the high dimensionality. How to reduce dimensionality? How to select only the most useful n-grams? Special Topics on Information Retrieval

Maximal Frequent Sequences as index terms • Sequences of words that are frequent in the document collection and that are not contained in any other longer frequent sequence. • A sequence is considered to be frequent if it appears in at least σ documents. • Its main strength is to form a very compact index • Avoids storing the numerous least significant phrases • The extraction of MFS is commonly based on a combination of bottom-up and greedy methods Special Topics on Information Retrieval

Does indexing with phrases work? • Early results were very promising. However, the constant growth of test collections caused a drastic fall in the quality of the results. • A conclusion of research works is that phrases improve results in low levels of recall. • The recommendation is to consider phrases as supplementary terms of the vector space • Terms + phrases as index terms Special Topics on Information Retrieval

Third idea: motivation • Traditional IR approaches are highly dependent on term-matching • Term matching is affected by the synonymy and polysemy phenomena. • Need to capture the concepts instead of only the words • Solution: indexing by senses! Special Topics on Information Retrieval

What is word sense? • Word sense is one of the meanings of a word. • “Words” are having different meanings based on the context of the word. • Example: • We went to see a play at the theater • The children went out to play in the park A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human NILESH.A.SHEWALE

Third idea: indexing by senses • How to construct this index? How to determine the sense of each word from the document collection? All different word senses from the target collection Weight indicating the contribution of the word-sense j in document i. Special Topics on Information Retrieval

Word sense disambiguation • The task of selecting a sense for a word from a set of predefined possibilities. • Sense Inventory usually comes from a dictionary or thesaurus. • A related task is word sense discrimination; the task of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. Special Topics on Information Retrieval

The WSD process • Choose a sense inventory • Dictionary or thesaurus where word senses are explicitly indicated. • Design/apply a disambiguation procedure • Two main approaches: Knowledge-Based and Machine Learning • Evaluate the performance of the procedure • Using a manually labeled corpus • Using as baseline the more frequent sense Special Topics on Information Retrieval

Approaches for WSD • Knowledge Based Approaches • Rely on knowledge resources like WordNet, Thesaurus, etc. • May use grammar rules for disambiguation. • May use hand coded rules for disambiguation. • Machine Learning Based Approaches • Rely on corpus evidence. • Train a model using tagged (and untagged) corpus. • Probabilistic/Statistical models. Special Topics on Information Retrieval

Knowledge resources • Dictionaries in Machine-readable form (MRD) • Oxford English Dictionary, Collins, Longman Dictionary of Ordinary Contemporary English. Roget’s Thesaurus • Thesaurus – add synonymy information • Roget’s Thesaurus • Semantic networks – add more relations • WordNet, EuroWordNet Special Topics on Information Retrieval

Wordnet • A large lexical database organized in terms of meanings. • Includes nouns, adjectives, adverbs, and verbs • Synonym words are grouped into synset • Example: • {car, auto, automobile, machine, motorcar} Henrik Bulskov

Wordnet example Special Topics on Information Retrieval

Lesk algorithm • Identify senses of words in a context using definition overlap. • Identify simultaneously the correct senses for all words in context • Algorithm: • Retrieve from MRD all sense definitions of the words to be disambiguated • Determine the definition overlap for all possible sense combinations • Choose senses that lead to highest overlap Special Topics on Information Retrieval

Example • Disambiguate “PINE CONE” • PINE • kinds of evergreen tree with needle-shaped leaves • waste away through sorrow or illness • CONE • solid body which narrows to a point • something of this shape whether solid or hollow • fruit of certain evergreen trees Pine#1  Cone#1 = 0 Pine#2  Cone#1 = 0 Pine#1  Cone#2 = 1 Pine#2  Cone#2 = 0 Pine#1  Cone#3 = 2 Pine#2  Cone#3 = 0 Special Topics on Information Retrieval

Disadvantages of Lesk algorithm • Two many combinations need to be evaluated; problem with long sentences. • Simplified version is to compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood. • No enough overlapping words between definitions • Extend definitions by use such information as synonyms, different derivatives, or words from definitions of words from definitions. Special Topics on Information Retrieval

WSD using the conceptual density • Select a sense based on the relatedness of that word-sense to the context. • Relatedness is measured in terms of conceptual density (in a structured hierarchical semantic net) • Idea: if all words in the context are strong indicators of a particular concept then that concept will have a higher density. Special Topics on Information Retrieval

Example of the conceptual density • The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context. • The CD formula will yield highest density for the sub-hierarchy containing more senses. • The sense of W contained in the sub-hierarchy with the highest CD will be chosen. Special Topics on Information Retrieval

Supervised approach for WSD • Induces a classifier from manually sense-tagged text using machine learning techniques. • Resources: • Sense Tagged Text • Dictionary (implicit source of sense inventory) • Syntactic Analysis (POS tagger, Chunker, Parser, …) • Reduces WSD to a classification problem • A target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs Special Topics on Information Retrieval

Supervised methodology • Create a sample of training data where a given target word is manually annotated with its senses • Select a set of features with which to represent context information. • Convert sense-tagged training instances to feature vectors. • Apply a machine learning algorithm to induce a classifier. • Convert a held out sample of test data into feature vectors. • Apply classifier to test instances to assign a sense tag. Special Topics on Information Retrieval

Some interesting data • High polysemy: especially verbs. • Imbalanced training sets: Most examples are from the first sense. • Current methods: explore semi-supervised machine learning approaches. Special Topics on Information Retrieval

Does indexing with senses work? • How much can WSD help improve IR effectiveness? Open question • Weiss: 1%, Voorhees’ method : negative • Krovetz and Croft, Sanderson : only useful for short queries • Schütze and Pedersen’s approaches and Gonzalo’s experiment : positive result • WSD must be accurate to be useful for IR • It seems that it can be more useful as visualization strategy.

Fourth idea: motivation • Bag of words representation ignores all semantic or conceptual information. • It simply looks at the surface word forms • Words (forms) are very ambiguous. • Polysemy and synonymy are big problems • It is necessary to have representations at concept level. • “Concept ” is related with “sense”, but from a practical (usage) point of view. Special Topics on Information Retrieval

Fourth idea: concept-based representations • In IR, documents are represented by the words occurring in them. • The semantics of a document is conveyed by the words that occur in it. • Can the semantics of a word be conveyed by the documents in which it occurs? • Basis of a representation called: • Document Occurrence Representation (DOR) Special Topics on Information Retrieval

Document Occurrence Representation • Intuitions about the weights: • The more frequently ti occurs in dj, the more important is djfor characterizing the semantics of ti • The more distinct the words dj contains, the smaller its contribution to characterizing the semantics of ti. All documents from the collection All words from the collection Weight indicating the contribution of document j for the semantics of term i. Special Topics on Information Retrieval

Representing documents by DOR • DOR is a word representation, not a document representation. • Representation of documents is obtained by the sum of the vectors from their words. • Queries are represented in the same way: sum of the vectors from its words. Word representation Word–Document matrix Index for IR Document–Document matrix SUM Special Topics on Information Retrieval

Alternative representation • In WSD, words are represented by the terms occurring in their context. • The semantics (meaning) of a word is conveyed by the words commonly co-occurring with it. • Basis of a representation called: • Term Co-Occurrence Representation (TCOR) Special Topics on Information Retrieval

Term Co-Occurrence Representation • Intuitions about the weights: • The more words ti and tj co-occur in, the more important tj is for characterizing the semantics of ti • The more distinct words tj co-occurs with, the smaller its contribution for characterizing the semantics of ti. All words from the collection Weight indicating the co-occurrenceof words i and j Special Topics on Information Retrieval

Representing documents by TCOR • TCOR, such as DOR, is a word representation, not a document representation. • Representation of documents is obtained by the sum of the vectors from their words. • Queries are represented in the same way: sum of the vectors from its words. Word representation Word–Word matrix Index for IR Document–Word matrix SUM Special Topics on Information Retrieval

Other bag-of-concepts representations • Standard BoW representations are usually refined before used: • Feature selection: remove some words based on statistical measures • Feature extraction: artificial features are created from the originals using distributional clustering of words or factor analytic methods. • Problem with these approaches is that they are computationally expensive. • Random indexing is a simple approach to generate BoC representations Special Topics on Information Retrieval

Special Topics on Information Retrieval