CS621 : Artificial Intelligence Pushpak Bhattacharyya, CSE Dept., IIT Bombay Lecture 27: Towards more intelligent search
Desired Features of the Search Engines • Meaning based • More relevant results • Multilingual • Query in English, e.g. • Fetch document in Hindi, e.g. • Show it in English
Precision (P) and Recall (R) • Tradeoff between P and R • Obtained set: O; Actual (relevant) set: A; their intersection (the shaded area): S • P = S/O, R = S/A
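The P = S/O and R = S/A definitions above can be sketched directly with set operations (a minimal illustration, not from the lecture; the document sets are invented):

```python
# Precision and recall over an obtained set O and an actual (relevant) set A.
def precision_recall(obtained, actual):
    s = obtained & actual                              # intersection S
    p = len(s) / len(obtained) if obtained else 0.0    # P = S/O
    r = len(s) / len(actual) if actual else 0.0        # R = S/A
    return p, r

# Retrieved docs {1,2,3,4}; relevant docs {3,4,5}; S = {3,4}.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5})
print(p, r)   # 0.5 and roughly 0.667
```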
Building blocks of UNL • Universal Words (UWs) • Relations • Attributes • Knowledge Base
UNL Graph for "He forwarded the mail to the minister." • Head node: forward(icl>send).@entry.@past • agt → he(icl>person) • obj → mail(icl>collection).@def • gol → minister(icl>person).@def
UNL Expression • agt(forward(icl>send).@entry.@past, he(icl>person)) • obj(forward(icl>send).@entry.@past, mail(icl>collection).@def) • gol(forward(icl>send).@entry.@past, minister(icl>person).@def)
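Each UNL expression is a binary relation rel(uw1, uw2). A small sketch of turning one expression string into a (rel, uw1, uw2) triple (a hypothetical helper, not part of any UNL toolkit; it only has to find the comma that sits outside the UWs' own parentheses):

```python
# Parse a UNL binary relation "rel(uw1, uw2)" into a (rel, uw1, uw2) triple.
def parse_unl(expr):
    rel, rest = expr.split("(", 1)
    assert rest.endswith(")")
    rest = rest[:-1]                       # drop the closing ')'
    depth = 0
    for i, ch in enumerate(rest):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:     # top-level separator between the UWs
            return rel.strip(), rest[:i].strip(), rest[i + 1:].strip()
    raise ValueError("no top-level comma in: " + expr)

print(parse_unl("agt(forward(icl>send).@entry.@past, he(icl>person))"))
```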
Universal Word (UW) • vocabulary of UNL • represents a concept • Basic UW: an English word/compound word/phrase with no restrictions (no Constraint List) • Restricted UW: carries a Constraint List • Examples: "crane(icl>device)" and "crane(icl>bird)" are nouns; "crane(icl>do)" is a verb ("crane the neck")
Desirable features of UWs • Expressibility: able to represent any concept in a language • Economy: only enough to disambiguate the head word • Formal situatedness: every UW should be defined in the UNL Knowledge-Base
UNL Knowledge Base • A semantic network comprising every possible UW • A lattice structure
Enconversion • Input sentence/query → EnConverter → UNL expression • The EnConverter uses a rule base and a dictionary
Enconversion process • Analysis at 3 levels • Morphological • Syntactic • Semantic • Crucial role of disambiguation • Sense (I bank with the bank on the river bank) • Part of speech • Attachment (I saw the boy with a telescope)
Deconversion • UNL expression → DeConverter → Output sentence • The DeConverter uses a rule base and a dictionary
Deconversion process • Input UNL graph: win.@entry.@past with agt → Brazil, obj → match, ptn → Japan ("Brazil won the match with Japan") • Syntax planning: Braajila jaapaan mecha jiit • Case marking: Braajila ne jaapaan ke saatha mecha jiit • Morphology: Braajila ne jaapaan ke saatha mecha jiitaa
Top Level Description of the Methodology • Documents represented as meaning graphs • Queries converted to meaning graphs • Matching performed on meaning graphs • Retrieved documents (collections of meaning graphs) displayed in the language of interest
System Constituents: 1/2 • Search Front • Crawler • Indexer (3 level) • On Expression • On Concept • On keywords
System Constituents: 2/2 • Language Front • EnConverter (analyses sentence to UNL) • DeConverter (generates sentence from UNL) • Stemmer and Morphology Analyser • Parser • Word Sense Disambiguator • Needs wordnets
Overall Architecture: Failsafe Search Strategy • The query (after WSD and query expansion) and the HTML corpus are both enconverted to UNL; stemmers and Lucene build a keyword index alongside the UNL index • The UNL search engine tries a complete UNL match, else a partial UNL match, else a UW match, else falls back to keyword search via Lucene • Retrieved UNL documents are deconverted into the display language as search results
Indexing and Failsafe Search Strategy: 1/2 • The indexer creates a three-level index in the form of a. UNL expressions (phrasal and sentential concepts) b. Universal Words (lexical concepts) c. Keywords/stem words (using stemmers and Lucene)
Indexing and Failsafe Search Strategy: 2/2 • This enables a failsafe search strategy: - Complete expression matching, else - Partial expression matching, else - Universal Word (UW) matching, else - Search on Keywords/Stem Words
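The cascade above can be sketched as a loop over match levels, stopping at the first level that returns results (the matcher functions here are hypothetical stand-ins for the three index levels plus the keyword fallback):

```python
# Failsafe search: try each match level in order; return the first non-empty hit list.
def failsafe_search(query, matchers):
    """matchers: ordered list of (name, fn); each fn returns a list of doc ids."""
    for name, match in matchers:
        hits = match(query)
        if hits:                      # stop at the first level that succeeds
            return name, hits
    return "no-match", []

matchers = [
    ("complete-expression", lambda q: []),        # no full UNL match
    ("partial-expression",  lambda q: []),        # no partial match either
    ("uw",                  lambda q: ["doc7"]),  # UW level succeeds
    ("keyword",             lambda q: ["doc7", "doc9"]),
]
print(failsafe_search("farmer", matchers))   # ('uw', ['doc7'])
```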
Indexing • UNL Document Index – Keeps information about each UNL Document • UNL Index – Stores the actual index of UNL Expressions
Index of UNL Expressions • Each entry (a UNL expression (rel, uw1, uw2)) points to the pairs of document id and sentence number where it occurs • Sample UNL Expressions: • mod:02(support(icl>help):4T, financial(mod<thing):4J) • mod:01(government(icl>governmental organization):5L.@def, australian(mod<thing):5A) • and:04(strategy(icl>idea):1D.@entry.@pl, trade(icl>activity):16) • mod:05(performance(icl>operation):2B.@entry.@topic, agriculture(icl>activity):2X)
Index of UWs • Each UW points to pair of document id and its sentence number where it occurs.
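Both indexes are inverted indexes whose postings are (document id, sentence number) pairs. A minimal sketch (data structures and the sample entry are illustrative, not the system's actual implementation):

```python
from collections import defaultdict

# Expression-level and UW-level inverted indexes: key -> [(doc_id, sent_no), ...]
expr_index = defaultdict(list)
uw_index = defaultdict(list)

def index_sentence(doc_id, sent_no, expressions):
    """Index every (rel, uw1, uw2) expression of a sentence at both levels."""
    for rel, uw1, uw2 in expressions:
        expr_index[(rel, uw1, uw2)].append((doc_id, sent_no))
        for uw in (uw1, uw2):
            uw_index[uw].append((doc_id, sent_no))

index_sentence("d1", 4, [("mod", "support(icl>help)", "financial(mod<thing)")])
print(uw_index["support(icl>help)"])   # [('d1', 4)]
```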
Sophisticated matching • Complete set of expressions matching • Weighted expression matching • Partial set of expressions matching • Complete UW matching • Headword matching (equivalent to keyword) • Restriction matching • Attribute matching
Keyword-based Matching • Needs morphology for Indian languages • User input (e.g. gannaa, ganne) → stemmers → Lucene index • Output: all documents containing gannoM, gannaa, or ganne
Multilingual Keyword Search • A UW-dictionary-based approach • Given a query in one language, generates a multilingual query using UW dictionaries • Example: monolingual query "Farmer" → multilingual query "Farmer किसान शेतकरी" • Pipeline: query → stemmer/preprocessor → multilingual keyword generator (backed by the UW dictionary database) → multilingual query • Provides multilingual capability at the keyword level to the search engine
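The UW-dictionary idea above can be sketched as a pivot through a shared UW: each language maps its word to the UW, so a monolingual keyword expands into every language that shares that UW. (The dictionary entries below are invented for illustration.)

```python
# Toy UW dictionaries: language -> {surface word -> universal word}.
uw_dict = {
    "en": {"farmer": "farmer(icl>person)"},
    "hi": {"किसान": "farmer(icl>person)"},
    "mr": {"शेतकरी": "farmer(icl>person)"},
}

def multilingual_query(word, src_lang):
    """Expand one keyword into all languages sharing its UW."""
    uw = uw_dict[src_lang].get(word)
    if uw is None:
        return [word]                      # fall back to the bare keyword
    return [w for lang in uw_dict for w, u in uw_dict[lang].items() if u == uw]

print(multilingual_query("farmer", "en"))  # ['farmer', 'किसान', 'शेतकरी']
```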
Experimentation • Chosen domain: Agriculture • Languages: English, Hindi, Marathi • Document base: pesticides and diseases • Word-order sensitive: "money lenders exploit farmers" vs. "farmers exploit money lenders" • For CLIR: tested on Hindi and Marathi query retrieval from English documents, with display in Hindi/Marathi
System Interface • Agricultural Search Engine
Wordnet Sub-graph (Hindi) • Synset: गाय, गऊ, गैया ("cow") • Gloss: सींगवाला एक शाकाहारी मादा चौपाया जो अपने दूध के लिए प्रसिद्ध है ("a horned, herbivorous female quadruped famous for its milk: Hindus call the cow 'Go Mata' and worship her") • Hypernymy: चौपाया ("quadruped"), स्तनपायी जंतु ("mammal") • Meronymy: थन ("udder"), दुम ("tail") • Hyponymy: नैचकी, लवाई • Antonymy: बैल ("ox") • Ability verb: जुगाली करना ("to chew the cud") • Polysemy: गाय also denotes a very meek person: "वह गाय है, उसे जो कुछ भी कहा जाता है चुपचाप स्वीकार कर लेता है" ("he is a cow; he quietly accepts whatever he is told")
Wordnet Sub-graph (Marathi) • Synset: घोडा, अश्व ("horse") • Gloss: ओझे वाहणे, गाडी ओढणे किंवा बसण्यासाठी उपयोगात आणला जाणारा एक चतुष्पाद प्राणी ("a quadruped used for carrying loads, pulling carts, or riding: Arabian horses have been famous since ancient times") • Hypernymy: जारज, सस्तन प्राणी ("mammal") • Meronymy/Holonymy: खोड, रान, बाग, मूळ • Hyponymy: अरबी ("Arabian"), भीमथडी (horse breeds) • Polysemy: घोडा also denotes a chess piece: "घोडा अडीच घरे चालतो" ("the knight moves two-and-a-half squares")
Semantically Relatable Set (SRS) Based Search (Please see publications under www.cse.iitb.ac.in/~pb for descriptions of SRS and SRS-based search)
What is SRS • SRSs are UNL expressions without the semantic relations • E.g., "the first non-white president of USA" • (the, president) • (president, of, USA) • (first, president) • (non-white, president)
SRS Based matching • Complete SRS match: all the SRSs of the query must match SRSs of the sentence • Partial SRS match: not all query SRSs need match the sentence SRSs; a subset suffices
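Modeling each SRS as a tuple of words, the two match modes reduce to all-vs-any membership tests (a minimal sketch; the example SRSs reuse the "president" sentence from the previous slide):

```python
# Complete match: every query SRS must appear among the sentence SRSs.
def complete_srs_match(query_srss, sent_srss):
    return all(srs in sent_srss for srs in query_srss)

# Partial match: at least one query SRS appears among the sentence SRSs.
def partial_srs_match(query_srss, sent_srss):
    return any(srs in sent_srss for srs in query_srss)

q = [("first", "president"), ("president", "of", "USA")]
s = [("president", "of", "USA"), ("the", "president")]
print(complete_srs_match(q, s), partial_srs_match(q, s))   # False True
```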
Experimental Setup • Text REtrieval Conference (TREC) data was used • TREC provides the gold standard of queries and relevant documents (Table: Relevance Judgments in TREC) • We chose 1919 documents and the first 250 queries, mostly from the AP newswire, the Wall Street Journal, and the Ziff data
Experiment Process • Lucene with its tf-idf search strategy as the keyword-based baseline • SRS-based search as the alternative method • Compared both search methods on various parameters
Precision Comparison • Shows that SRS search filters out non-relevant documents much more effectively than the keyword based tf-idf search.
Recall Comparison • tf-idf consistently outperforms the SRS search engine here.
Mean Average Precision (MAP) Comparison • MAP combines recall- and precision-oriented aspects and is also sensitive to the entire ranking • SRS search underperformed here because of its low recall
Reasons for poor Recall: Word Divergence 1/2 • Inflectional Morphology Divergence • Query: “child abuse” • Query SRS: (child, abuse) • Sentence: “children are abused” • Sentence SRS: (children, abused) • Derivational Morphology Divergence • Query: “debt rescheduling” • Query SRS: (debt, rescheduling) • Sentence: “rescheduling of debt” • Sentence SRS: (rescheduling, of, debt) • Query: “polluted water” • Query SRS: (polluted, water) • Sentence: “water pollution has increased in the city” • Sentence SRS: (water, pollution)
Reasons for poor Recall: Word Divergence 2/2 • Synonymy Divergence • Query: “antitrust cases” • Query SRS: (antitrust, cases) • Sentence: “An antitrust lawsuit was charged today”. • Sentence SRS: (antitrust, lawsuit) • Hypernymy Divergence • Query has keyword “car”, while the document has keyword “automobile”. • Hyponymy Divergence • Query can be “car” whereas the document might contain “minicar”.
Physical Separation Divergence • Physical Separation Divergence • Query: “antitrust lawsuit” • Query SRS: (antitrust, lawsuit) • Sentence: “The federal lawsuit represents the largest antitrust action” • Sentence SRSs: (lawsuit, represents), (represents, action), (antitrust, action)
Solution to Morphological Divergence • Stemming • All words in the document and the query SRSs are stemmed before matching • Gets the base form based on WordNet, while keeping the tag of the word unchanged • children_NN is stemmed to child_NN, but childish_JJ is not stemmed to child_NN
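The POS-preserving stemming step can be sketched as below. The lecture's version looks base forms up in WordNet; here a tiny irregular-form table and suffix list stand in for that resource, so both are illustrative assumptions:

```python
# POS-preserving stemming sketch: reduce to a base form, keep the tag.
IRREGULAR = {("children", "NN"): "child", ("abused", "VB"): "abuse"}
SUFFIXES = {"NN": ["s"], "VB": ["ed", "ing", "s"], "JJ": []}

def stem(word, tag):
    if (word, tag) in IRREGULAR:               # irregular inflections first
        return IRREGULAR[(word, tag)], tag
    for suf in SUFFIXES.get(tag, []):          # then regular suffix stripping
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], tag      # tag stays unchanged
    return word, tag

print(stem("children", "NN"))   # ('child', 'NN')
print(stem("childish", "JJ"))   # ('childish', 'JJ') -- not reduced to child
```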
Solution to Synonymy-Hypernymy-Hyponymy Divergence • Find related words from the WordNet • Algorithm Outline • Get synonyms • Get hypernyms up to depth 2 • Get hyponyms up to depth 2 • Repeat steps 1, 2 and 3 for all synonyms • All these words are the related words • Found related words for all words in the corpus (nouns and verbs) • Calculated similarity between a word and its related words
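The outlined expansion can be sketched over a hand-made graph; the real system walks WordNet, so the edges below (car/automobile/minicar, matching the divergence examples) are invented for illustration:

```python
# Toy relation graphs: word -> set of directly related words.
SYN = {"car": {"automobile"}, "automobile": {"car"}}
HYPER = {"car": {"motor_vehicle"}, "motor_vehicle": {"vehicle"}}
HYPO = {"car": {"minicar"}}

def up_to_depth(graph, word, depth):
    """Collect everything reachable in <= depth steps along one relation."""
    out, frontier = set(), {word}
    for _ in range(depth):
        frontier = set().union(*(graph.get(w, set()) for w in frontier))
        out |= frontier
    return out

def related_words(word):
    syns = SYN.get(word, set())                  # step 1: synonyms
    related = set(syns)
    for w in {word} | syns:                      # repeat steps 2-3 for all synonyms
        related |= up_to_depth(HYPER, w, 2)      # step 2: hypernyms up to depth 2
        related |= up_to_depth(HYPO, w, 2)       # step 3: hyponyms up to depth 2
    return related - {word}

print(sorted(related_words("car")))
```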
SRS Tuning • Deals with the “Other Divergences” problem. • Enriches the SRSs in the corpus. • Basically adds new SRSs by applying augment rules on existing SRSs.
Sample Rules I • Rule: (N1, N2) => (N2(J), N1) • Sentence: "water pollution" • Sentence SRS: (water_N, pollution_N) • Augmented SRS: (polluted_J, water_N)
Sample Rules II • Rule: (V, N) => (N, V(N)) • Sentence: “destroy city” • Sentence SRS: (destroy_V, city_N) • Augmented SRS: (city_N, destruction_N)
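Both sample rules can be sketched as pattern rewrites over POS-tagged SRS pairs; the derivation tables below are illustrative stand-ins for a real derivational-morphology resource:

```python
# Derivation tables (illustrative): noun -> adjective, verb -> nominalization.
NOUN_TO_ADJ = {"pollution": "polluted"}
VERB_TO_NOUN = {"destroy": "destruction"}

def tune(srs):
    """Apply the two augment rules to a ((word, tag), (word, tag)) SRS."""
    (w1, t1), (w2, t2) = srs
    if t1 == "N" and t2 == "N" and w2 in NOUN_TO_ADJ:    # (N1, N2) => (N2(J), N1)
        return [((NOUN_TO_ADJ[w2], "J"), (w1, "N"))]
    if t1 == "V" and t2 == "N" and w1 in VERB_TO_NOUN:   # (V, N) => (N, V(N))
        return [((w2, "N"), (VERB_TO_NOUN[w1], "N"))]
    return []                                            # no rule applies

print(tune((("water", "N"), ("pollution", "N"))))   # [(('polluted','J'), ('water','N'))]
print(tune((("destroy", "V"), ("city", "N"))))      # [(('city','N'), ('destruction','N'))]
```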