
CS621 : Artificial Intelligence

CS621 : Artificial Intelligence. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 27: Towards more intelligent search. Desired Features of the Search Engines. Meaning based More relevant results Multilingual Query in English, e.g. Fetch document in Hindi, e.g. Show it in English.


Presentation Transcript


  1. CS621: Artificial Intelligence • Pushpak Bhattacharyya, CSE Dept., IIT Bombay • Lecture 27: Towards more intelligent search

  2. Desired Features of the Search Engines • Meaning based • More relevant results • Multilingual • Query in English, e.g. • Fetch document in Hindi, e.g. • Show it in English

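  3. Precision (P) and Recall (R) • Tradeoff between P and R • Obtained (O): the set of documents retrieved • Actual (A): the set of documents that are actually relevant • Intersection (S): the shaded area common to O and A • P = S/O • R = S/A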
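
The two ratios on this slide can be computed directly from sets of document ids. The following is a minimal sketch; the function name and the sample ids are illustrative assumptions, not part of the lecture.

```python
def precision_recall(obtained, actual):
    """Return (P, R) for the retrieved set O and the relevant set A."""
    obtained, actual = set(obtained), set(actual)
    shared = obtained & actual  # S: the intersection of O and A
    precision = len(shared) / len(obtained) if obtained else 0.0
    recall = len(shared) / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(p, r)
```

Retrieving more documents tends to raise R and lower P, which is the tradeoff the slide refers to.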

  4. The UNL System: An Overview

  5. Building blocks of UNL • Universal Words (UWs) • Relations • Attributes • Knowledge Base

  6. UNL Graph • Sentence: He forwarded the mail to the minister. • Entry node: forward(icl>send) @entry @past • Arcs from the entry node: agt → he(icl>person), gol → minister(icl>person) @def, obj → mail(icl>collection) @def

  7. UNL Expression • agt(forward(icl>send).@entry.@past, he(icl>person)) • obj(forward(icl>send).@entry.@past, mail(icl>collection).@def) • gol(forward(icl>send).@entry.@past, minister(icl>person).@def)
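
A UNL expression is a set of binary relations between universal words, so it can be held as a list of triples. The sketch below assumes that representation; the `Relation` tuple and the helper names are illustrative, and the relation assignment follows the sentence "He forwarded the mail to the minister" (the mail is the object, the minister is the goal).

```python
from collections import namedtuple

# One labelled arc of the UNL graph: label(head, tail).
Relation = namedtuple("Relation", ["label", "head", "tail"])

FORWARD = "forward(icl>send).@entry.@past"
expression = [
    Relation("agt", FORWARD, "he(icl>person)"),
    Relation("obj", FORWARD, "mail(icl>collection).@def"),
    Relation("gol", FORWARD, "minister(icl>person).@def"),
]


def split_uw(node):
    """Split 'uw.@a.@b' into the universal word and its attribute list."""
    uw, *attrs = node.split(".@")
    return uw, ["@" + a for a in attrs]


def entry_node(relations):
    """Return the universal word carrying the @entry attribute, if any."""
    for rel in relations:
        for node in (rel.head, rel.tail):
            uw, attrs = split_uw(node)
            if "@entry" in attrs:
                return uw
    return None


print(entry_node(expression))
```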

  8. Universal Word (UW) • Vocabulary of UNL • Represents a concept • Basic UW (an English word/compound word/phrase with no restrictions or Constraint List) • Restricted UW (with a Constraint List) • Examples: “crane(icl>device)” and “crane(icl>bird)” are nouns; “crane(icl>do)” is a verb (“crane the neck”)

  9. Desirable features of UWs • Expressibility: able to represent any concept in a language • Economy: only enough to disambiguate the head word • Formal situatedness: every UW should be defined in the UNL Knowledge-Base

  10. UNL Knowledge Base • A semantic network comprising every possible UW • A lattice structure

  11. Enconversion • Input sentence/query → Enconverter (uses a rule base and a dictionary) → UNL expression

  12. Enconversion process • Analysis at 3 levels • Morphological • Syntactic • Semantic • Crucial role of disambiguation • Sense • Part of speech • Attachment • Examples: “I saw the boy with a telescope” (attachment); “I bank with the bank on the river bank” (sense and part of speech)

  13. Deconversion • UNL expression → Deconverter (uses a rule base and a dictionary) → Output sentence

  14. Deconversion process • UNL graph: win (@entry @past) with relations agt, obj, ptn over the nodes match, Brazil, Japan • Syntax Planning: Braajila jaapaan mecha jiit • Case marking: Braajila ne jaapaan ke saatha mecha jiit • Morphology: Braajila ne jaapaan ke saatha mecha jiitaa (Hindi, transliterated: “Brazil won the match with Japan”)

  15. Application: meaning based multilingual search

  16. Application: meaning based multilingual search

  17. Top Level Description of the Methodology • Documents represented as meaning graphs • Queries converted to meaning graphs • Matching on meaning graphs • Retrieved document (a collection of meaning graphs) displayed in the language of interest

  18. System Constituents: 1/2 • Search Front • Crawler • Indexer (3 level) • On Expression • On Concept • On keywords

  19. System Constituents: 2/2 • Language Front • EnConverter (analyses sentence to UNL) • DeConverter (generates sentence from UNL) • Stemmer and Morphology Analyser • Parser • Word Sense Disambiguator • Needs wordnets

  20. Overall Architecture • [Figure: overall architecture] • Corpus side: HTML corpus, Stemmers, Enconverter, UNL, UNL Index and Lucene Index • Query side: Query, WSD, Query Expansion, Enconverter, UNL, Search engine • Failsafe Search Strategy: Complete UNL Match? If no, Partial UNL Match? If no, UW Match? If no, Lucene search with Stemmers • Output: Retrieved UNL Documents, Deconverter, Search Results

  21. Indexing and Failsafe Search Strategy: 1/2 • The indexer creates a three-level index over: a. UNL expressions (phrasal and sentential concepts) b. Universal Words (lexical concepts) c. Keywords/Stem Words (using stemmers and Lucene)

  22. Indexing and Failsafe Search Strategy: 2/2 • This enables a failsafe search strategy: - Complete expression matching, else - Partial expression matching, else - Universal Word (UW) matching, else - Search on Keywords/Stem Words
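
The fallback order above can be sketched as a cascade that stops at the first level returning any documents. This is a minimal illustration, not the system's actual code: the index layout (a dict of posting sets per level), the reading of "complete" as all query expressions and "partial" as at least one, and the sample data are all assumptions.

```python
def docs_with_all(keys, postings):
    """Documents whose postings contain every key (complete match)."""
    sets = [postings.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()


def docs_with_any(keys, postings):
    """Documents whose postings contain at least one key."""
    return set().union(*(postings.get(k, set()) for k in keys))


def failsafe_search(query, index):
    """Try each matching level in order; return the first non-empty result."""
    levels = [
        ("complete expression", docs_with_all, "expressions"),
        ("partial expression", docs_with_any, "expressions"),
        ("universal word", docs_with_any, "uws"),
        ("keyword", docs_with_any, "keywords"),
    ]
    for name, matcher, field in levels:
        hits = matcher(query[field], index[field])
        if hits:
            return name, hits
    return "none", set()


# Hypothetical three-level index: key -> set of document ids.
index = {
    "expressions": {"agt(exploit, moneylender)": {1}, "obj(exploit, farmer)": {1, 2}},
    "uws": {"farmer(icl>person)": {1, 2, 3}},
    "keywords": {"farmer": {1, 2, 3, 4}},
}
query = {
    "expressions": ["agt(exploit, farmer)"],
    "uws": ["farmer(icl>person)"],
    "keywords": ["farmer"],
}
print(failsafe_search(query, index))
```

In this example no indexed expression matches the query, so the search falls through to the UW level.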

  23. Indexing • UNL Document Index – Keeps information about each UNL Document • UNL Index – Stores the actual index of UNL Expressions

  24. Index of UNL Expressions • Each entry (a UNL expression (rel, uw1, uw2)) points to the pairs of document id and sentence number where it occurs. • Sample UNL Expressions: • mod:02(support(icl>help):4T, financial(mod<thing):4J) • mod:01(government(icl>governmental organization):5L.@def, australian(mod<thing):5A) • and:04(strategy(icl>idea):1D.@entry.@pl, trade(icl>activity):16) • mod:05(performance(icl>operation):2B.@entry.@topic, agriculture(icl>activity):2X)

  25. Index of UWs • Each UW points to the pairs of document id and sentence number where it occurs.
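
Both indexes map a key to (document id, sentence number) pairs, so they can be built in one pass over the corpus. The sketch below is an assumption about the data layout: documents are given as a dict from a hypothetical document id to a list of sentences, each sentence being a list of (rel, uw1, uw2) triples with node ids and attributes already stripped.

```python
from collections import defaultdict


def build_indexes(documents):
    """Build the expression index and the UW index in a single pass."""
    expr_index = defaultdict(set)  # (rel, uw1, uw2) -> {(doc_id, sent_no)}
    uw_index = defaultdict(set)    # uw -> {(doc_id, sent_no)}
    for doc_id, sentences in documents.items():
        for sent_no, expressions in enumerate(sentences, start=1):
            for rel, uw1, uw2 in expressions:
                location = (doc_id, sent_no)
                expr_index[(rel, uw1, uw2)].add(location)
                uw_index[uw1].add(location)
                uw_index[uw2].add(location)
    return expr_index, uw_index


# Hypothetical two-document corpus reusing UWs from the sample expressions.
documents = {
    "doc1": [[("mod", "support(icl>help)", "financial(mod<thing)")]],
    "doc2": [
        [("and", "strategy(icl>idea)", "trade(icl>activity)")],
        [("mod", "support(icl>help)", "financial(mod<thing)")],
    ],
}
expr_index, uw_index = build_indexes(documents)
print(sorted(uw_index["support(icl>help)"]))
```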

  26. Sophisticated matching • Complete set of expressions matching • Weighted expression matching • Partial set of expressions matching • Complete UW matching • Headword matching (equivalent to keyword) • Restriction matching • Attribute matching

  27. Keyword-based Matching needs morphology for Indian languages • User input (e.g. gannaa, ganne) → Stemmers → Lucene Index → Output: all documents containing gannoM, gannaa, ganne

  28. Multilingual Keyword Search • A UW dictionary based approach • Given a query in one language, generates a multilingual query using UW dictionaries • Example: - Monolingual Query: Farmer - Multilingual Query: Farmer किसान शेतकरी • Components in the diagram: Query, UW Dictionary, Stemmer, Preprocessor, Multilingual Keyword Generator, Dictionary database, Multilingual Query • Provides multilingual capability at the keyword level to the search engine
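
The dictionary-based expansion can be sketched as a lookup from a query word to its UW, followed by a reverse lookup in the other languages' UW dictionaries. The dictionary layout, the UW `farmer(icl>person)` and the function name are assumptions for illustration; only the Farmer / किसान / शेतकरी example comes from the slide.

```python
# Hypothetical UW dictionaries: language -> {word: universal word}.
UW_DICTIONARIES = {
    "english": {"farmer": "farmer(icl>person)"},
    "hindi": {"किसान": "farmer(icl>person)"},
    "marathi": {"शेतकरी": "farmer(icl>person)"},
}


def multilingual_query(query, source_lang, dictionaries=UW_DICTIONARIES):
    """Expand each query word with the words sharing its UW in other languages."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        uw = dictionaries[source_lang].get(word)
        if uw is None:
            continue  # word not in the dictionary: keep it as a plain keyword
        for lang, entries in dictionaries.items():
            if lang != source_lang:
                terms.extend(w for w, u in entries.items() if u == uw)
    return terms


print(multilingual_query("Farmer", "english"))
```

A real system would stem and preprocess the query before the lookup; this sketch only lowercases it.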

  29. Experimentation • Chosen Domain: Agriculture • Languages: English, Hindi, Marathi • Document base: Pesticide and Diseases • Word order sensitive • “Money lenders exploit farmers” vs. “farmers exploit moneylenders” • For CLIR: tested on • Hindi and Marathi query → retrieval from English → display in Hindi/Marathi

  30. System Interface • Agricultural Search Engine

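  31. Wordnet Sub-graph (Hindi) • Synset (SYNONYMY): गाय, गऊ, गैया (cow) • GLOSS: सींगवाला एक शाकाहारी मादा चौपाया जो अपने दूध के लिए प्रसिद्ध है: "हिन्दू लोग गाय को गो माता कहते हैं एवं उसकी पूजा करते हैं" (a horned, herbivorous female quadruped known for its milk: "Hindus call the cow 'mother cow' and worship it") • HYPERNYMY: चौपाया (quadruped), स्तनपायी जंतु (mammal) • MERONYMY: थन (udder), दुम (tail) • ABILITY VERB: जुगाली करना (to chew the cud) • ANTONYMY: बैल (ox) • HYPONYMY: नैचकी, लवाई • POLYSEMY: गाय -- वह व्यक्ति जो बहुत सीधा-साधा हो: "वह गाय है, उसे जो कुछ भी कहा जाता है चुपचाप स्वीकार कर लेता है" (a very simple, docile person: "He is a cow; he quietly accepts whatever he is told")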

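  32. Wordnet Sub-graph (Marathi) • Synset (SYNONYMY): घोडा, अश्व (horse) • GLOSS: ओझे वाहणे, गाडी ओढणे किंवा बसण्यासाठी उपयोगात आणला जाणारा एक चतुष्पाद प्राणी: "प्राचीन काळापासून अरबस्थानातले घोडे प्रसिद्ध आहेत" (a four-legged animal used for carrying loads, pulling carts or riding: "Arabian horses have been famous since ancient times") • HYPERNYMY: सस्तन प्राणी (mammal) • HYPONYMY: अरबी, भीमथडी • POLYSEMY: घोडा -- बुद्धिबळाच्या खेळातील एक सोंगटी: "घोडा अडीच घरे चालतो" (a piece in the game of chess: "the knight moves two and a half squares") • Other relation labels in the figure: HOLONYMY, MERONYMY, HYPERNYMY • Other node labels in the figure: जारज, खोड, रान, बाग, मूळ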

  33. Semantically Relatable Set (SRS) Based Search (Please look up publications under www.cse.iitb.ac.in/~pb for descriptions of SRS and SRS based search)

  34. What is SRS • SRSs are UNL expressions without the semantic relations • E.g., “the first non-white president of USA” • (the, president) • (president, of, USA) • (first, president) • (non-white, president)

  35. SRS Based matching • Complete SRS match • All the SRSs of the query should match SRSs of the sentence • Partial SRS match • Not all of the query SRSs need to match SRSs of the sentence.
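
The two match types can be sketched as set comparisons between query SRSs and sentence SRSs. This is a minimal illustration under the assumption that SRSs are compared as exact tuples; the function name and the second query are hypothetical.

```python
def srs_match(query_srss, sentence_srss):
    """Classify the match between query SRSs and sentence SRSs."""
    query, sentence = set(query_srss), set(sentence_srss)
    matched = query & sentence
    if query and matched == query:
        return "complete"  # every query SRS occurs in the sentence
    if matched:
        return "partial"   # only some query SRSs occur in the sentence
    return "none"


# SRSs of "the first non-white president of USA".
sentence = [
    ("the", "president"),
    ("president", "of", "USA"),
    ("first", "president"),
    ("non-white", "president"),
]
print(srs_match([("president", "of", "USA")], sentence))
print(srs_match([("first", "president"), ("elected", "president")], sentence))
```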

  36. System Architecture

  37. Experimental Setup • Text Retrieval Conference (TREC) data was used. • TREC provides the gold standard for queries and relevant documents (Table: Relevance Judgments in TREC). • We chose 1919 documents and the first 250 queries. • The documents are mostly from the AP newswire, Wall Street Journal and the Ziff data.

  38. Experiment Process • Baseline: Lucene with tf-idf ranking as the keyword based search engine • Compared against: SRS based search • Compared both search methods on various parameters

  39. Precision Comparison • Shows that SRS search filters out non-relevant documents much more effectively than the keyword based tf-idf search.

  40. Recall Comparison • tf-idf consistently outperforms the SRS search engine here.

  41. Mean Average Precision (MAP) Comparison • MAP contains both recall and precision oriented aspects and is also sensitive to the entire ranking. • SRS search performed poorly here because of its low recall.
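
To make the dependence on recall concrete, here is a small sketch of the standard MAP computation: average precision sums the precision at each rank where a relevant document appears and divides by the total number of relevant documents, so relevant documents that are never retrieved pull the score down. The rankings below are made-up examples, not results from the experiment.

```python
def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k that hold a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Dividing by all relevant documents penalises the ones never retrieved.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # both relevant documents found
    (["d5"], {"d5", "d6"}),                    # precise, but d6 is missed
]
print(mean_average_precision(runs))
```

The second run has perfect precision yet scores only 0.5, which mirrors the situation described for SRS search.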

  42. Reasons for poor Recall: Word Divergence 1/2 • Inflectional Morphology Divergence • Query: “child abuse” • Query SRS: (child, abuse) • Sentence: “children are abused” • Sentence SRS: (children, abused) • Derivational Morphology Divergence • Query: “debt rescheduling” • Query SRS: (debt, rescheduling) • Sentence: “rescheduling of debt” • Sentence SRS: (rescheduling, of, debt) • Query: “polluted water” • Query SRS: (polluted, water) • Sentence: “water pollution has increased in the city” • Sentence SRS: (water, pollution)

  43. Reasons for poor Recall: Word Divergence 2/2 • Synonymy Divergence • Query: “antitrust cases” • Query SRS: (antitrust, cases) • Sentence: “An antitrust lawsuit was charged today”. • Sentence SRS: (antitrust, lawsuit) • Hypernymy Divergence • Query has keyword “car”, while the document has keyword “automobile”. • Hyponymy Divergence • Query can be “car” whereas the document might contain “minicar”.

  44. Physical Separation Divergence • Query: “antitrust lawsuit” • Query SRS: (antitrust, lawsuit) • Sentence: “The federal lawsuit represents the largest antitrust action” • Sentence SRSs: (lawsuit, represents), (represents, action), (antitrust, action)

  45. Solutions for Divergences

  46. Solution to Morphological Divergence • Stemming • All words in the document and the query SRSs are stemmed before matching. • Gets the base form based on WordNet, while keeping the tag of the word unchanged. • children_NN stemmed to child_NN, but childish_JJ not stemmed to child_NN
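
The tag-preserving behaviour can be illustrated with a small lookup keyed on (word, tag). The table below is a hypothetical stand-in for the WordNet base-form lookup the slide mentions, and the `word_TAG` token format follows the slide's examples.

```python
# Hypothetical base-form table standing in for a WordNet lookup.
BASE_FORMS = {
    ("children", "NN"): "child",
    ("abused", "VBN"): "abuse",
}


def stem_tagged(token):
    """Map 'word_TAG' to its base form while keeping the tag unchanged."""
    word, tag = token.rsplit("_", 1)
    base = BASE_FORMS.get((word.lower(), tag), word.lower())
    return f"{base}_{tag}"


print(stem_tagged("children_NN"))
print(stem_tagged("childish_JJ"))
```

Because the lookup is keyed on the tag as well as the word, an adjective such as childish_JJ is never collapsed onto the noun child_NN.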

  47. Solution to Synonymy-Hypernymy-Hyponymy Divergence • Find related words from the WordNet • Algorithm Outline • Get synonyms • Get hypernyms up to depth 2 • Get hyponyms up to depth 2 • Repeat steps 1, 2 and 3 for all synonyms • All the words collected are the related words • Found related words for all words in the corpus (nouns and verbs). • Calculated similarity between a word and its related words
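
The outline above can be sketched over a toy lexical graph. The three dictionaries are hypothetical stand-ins for WordNet lookups (only car, automobile and minicar come from the slides), and the similarity calculation is not shown.

```python
# Hypothetical toy relations standing in for WordNet.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}
HYPERNYMS = {"car": {"motor vehicle"}, "motor vehicle": {"vehicle"}, "vehicle": {"conveyance"}}
HYPONYMS = {"car": {"minicar"}, "automobile": {"coupe"}}


def walk(word, edges, depth):
    """Collect every word reachable from `word` in at most `depth` steps."""
    found, frontier = set(), {word}
    for _ in range(depth):
        frontier = set().union(*(edges.get(w, set()) for w in frontier)) - found
        found |= frontier
    return found


def neighbours(word, depth):
    """Steps 1-3: synonyms, hypernyms and hyponyms up to the given depth."""
    return (SYNONYMS.get(word, set())
            | walk(word, HYPERNYMS, depth)
            | walk(word, HYPONYMS, depth))


def related_words(word, depth=2):
    """Apply steps 1-3 to the word and then to each of its synonyms."""
    related = neighbours(word, depth)
    for synonym in SYNONYMS.get(word, set()):
        related |= neighbours(synonym, depth)
    related.discard(word)
    return related


print(sorted(related_words("car")))
```

With depth 2, "conveyance" (three hypernym steps from "car") is left out, while "coupe" is picked up through the synonym "automobile".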

  48. SRS Tuning • Deals with the “Other Divergences” problem. • Enriches the SRSs in the corpus. • Basically adds new SRSs by applying augment rules on existing SRSs.

  49. Sample Rules I • Rule: (N1, N2) => (N2(J), N1) • Sentence: “water pollution” • Sentence SRS: (water_N, pollution_N) • Tuned SRS: (polluted_J, water_N)

  50. Sample Rules II • Rule: (V, N) => (N, V(N)) • Sentence: “destroy city” • Sentence SRS: (destroy_V, city_N) • Augmented SRS: (city_N, destruction_N)
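
The two sample rules can be sketched as a function that derives new SRSs from an existing one. The derivation tables (noun to adjective, verb to noun) are hypothetical and hold only the slide examples; how the actual system obtains these derived forms is not stated in the slides.

```python
# Hypothetical derivational lookups, filled with the slide examples only.
NOUN_TO_ADJ = {"pollution": "polluted"}
VERB_TO_NOUN = {"destroy": "destruction"}


def split_tag(token):
    """Split 'word_TAG' into (word, tag)."""
    word, tag = token.rsplit("_", 1)
    return word, tag


def augment(srs):
    """Return the new SRSs produced by the two sample rules for a 2-element SRS."""
    if len(srs) != 2:
        return []
    (w1, t1), (w2, t2) = split_tag(srs[0]), split_tag(srs[1])
    new = []
    # Rule I: (N1, N2) => (N2(J), N1)
    if t1 == "N" and t2 == "N" and w2 in NOUN_TO_ADJ:
        new.append((f"{NOUN_TO_ADJ[w2]}_J", srs[0]))
    # Rule II: (V, N) => (N, V(N))
    if t1 == "V" and t2 == "N" and w1 in VERB_TO_NOUN:
        new.append((srs[1], f"{VERB_TO_NOUN[w1]}_N"))
    return new


print(augment(("water_N", "pollution_N")))
print(augment(("destroy_V", "city_N")))
```

The augmented SRSs are added alongside the originals, so a query SRS such as (polluted, water) can match a sentence that only said "water pollution".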
