1. A COMPARISON OF TWO NATURAL LANGUAGE BASED INFORMATION RETRIEVAL SYSTEMS (MURAX AND ASK JEEVES) By Chris Becker & Nick Sarnelli
2. Natural Language Introduction LINGUISTICS
Syntax – study of the patterns of formation of sentences and phrases from words and of the rules for the formation of grammatical sentences in a language.
Semantics – study of word meaning, including the ways meaning is structured in language and changes in meaning over time.
3. QUERIES Corpus-based analysis
– relies largely on statistical methods to extract data.
- well-matched to customary resources of IR.
Demand-based analysis
- reduces permanent storage requirements.
- analysis depends upon query.
- requires access to documents returned from a search.
4. Closed Class Questions Questions (queries) stated in Natural Language that have definite answers (usually nouns).
Answers are assumed to lie in a set of objects.
5. Query Results Depend Upon How clear the user’s need is.
How precisely the need has been expressed by the query.
How well the query matches the terminology of the text corpus being searched.
6. MURAX A ROBUST LINGUISTIC APPROACH FOR QUESTION-ANSWERING USING AN ON-LINE ENCYCLOPEDIA
7. Purpose To answer closed-class questions using a corpus (large collection of documents) of natural language
In simpler terms, answering general-knowledge questions using an on-line encyclopedia (Grolier Encyclopedia – http://www.grolier.com)
8. What is a closed-class question? A question stated in natural language, which assumes some definite answer typified by a noun phrase rather than a procedural answer
The system hypothesizes noun phrases that are likely to be the answer and presents the user with relevant text, focusing the user’s attention appropriately
Answers are based on relation to the question, not word frequency
9. Queries The corpus (documents) are accessed by an Information Retrieval system that supports boolean search with proximity constraints
Queries are automatically constructed from the phrasal content of the question
The query is then passed to the IR system to find relevant text
10. Queries (continued) The relevant text itself is then analyzed
Noun phrase hypothesis is extracted
new queries are independently made to confirm phrase relations for the various hypotheses
11. Why MURAX? The encyclopedia (Grolier’s) is composed of a massive amount of unrestricted text
It is impossible to manually provide detailed lexical or semantic information for the over 100,000 word stems it contains
Therefore, shallow syntactic analysis can be used where nothing is known in advance about the types of relations and objects, which may differ for each new question
12. Benefits of Natural Language Improves quality of results by providing text to the user which confirms phrase relations, instead of just word matches
Serves as a focus for the development of linguistic tools for content analysis
Reveals what kind of grammar development should be done to improve performance
13. Question Characteristics Closed-class question is a direct question whose answer is expressible as a noun phrase
Examples:
What’s the capital of the Netherlands?
Who’s won the most Oscars for costume design?
14. Question Words and Expectations Who/Whose: Person
What /Which: Thing, Person, Location
Where: Location
When: Time
How Many: Number
Why/How – expect a procedural answer instead of a noun phrase
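A minimal sketch of how the expectations in the slide above could be represented as a lookup table; the function and data names are illustrative, not part of MURAX:

    # Hypothetical mapping from question word to expected answer type,
    # mirroring the slide above; not taken from the MURAX implementation.
    EXPECTED_TYPES = {
        "who": ["Person"], "whose": ["Person"],
        "what": ["Thing", "Person", "Location"],
        "which": ["Thing", "Person", "Location"],
        "where": ["Location"],
        "when": ["Time"],
        "how many": ["Number"],
    }

    def expected_types(question):
        q = question.lower()
        if q.startswith("how many"):
            return EXPECTED_TYPES["how many"]
        if q.startswith(("why", "how")):
            return None  # procedural answer expected, not a noun phrase
        for word, types in EXPECTED_TYPES.items():
            if q.startswith(word):
                return types
        return None

    print(expected_types("Who shot President Lincoln?"))   # ['Person']
    print(expected_types("How many states are there?"))    # ['Number']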
15. Type Phrase The Type Phrase is the noun phrase at the start of the question
NP = Noun Phrase
“What is the NP …?” or “What NP …?”
Type Phrase indicates what type of thing the answer is
Ex. What is the book’s title?
What book has this title?
16. Grolier’s Encyclopedia Was chosen as the corpus (set of documents) for the system and contains approximately 27,000 articles
The components responsible for the linguistic analysis of these 27,000 articles are a part-of-speech tagger and a lexico-syntactic pattern matcher
17. Hidden Markov Model (HMM) Part of Speech Tagger
Probabilistic
Parameters are estimated by training with a sample of ordinary untagged text
Uses suffix information and local context to predict the categories of words
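As a rough illustration of the decoding step such a tagger performs, here is a toy Viterbi sketch; the tag set and probabilities are made-up placeholders, not parameters trained on the Grolier text:

    # Toy Viterbi decoding for an HMM part-of-speech tagger.
    # All probabilities below are illustrative, not trained values.
    TAGS = ["DET", "NOUN", "VERB"]
    START = {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}
    TRANS = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
             "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
             "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
    EMIT = {"DET": {"the": 0.9},
            "NOUN": {"president": 0.5, "lincoln": 0.4},
            "VERB": {"shot": 0.7}}

    def viterbi(words):
        # best[tag] = (probability, best tag sequence ending in tag)
        best = {t: (START[t] * EMIT[t].get(words[0], 1e-6), [t]) for t in TAGS}
        for w in words[1:]:
            new_best = {}
            for t in TAGS:
                prob, path = max(
                    ((p * TRANS[prev][t] * EMIT[t].get(w, 1e-6), seq + [t])
                     for prev, (p, seq) in best.items()),
                    key=lambda x: x[0])
                new_best[t] = (prob, path)
            best = new_best
        return max(best.values(), key=lambda x: x[0])[1]

    print(viterbi(["the", "president", "shot"]))   # ['DET', 'NOUN', 'VERB']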
18. Encyclopedia Training with HMM HMM was trained with encyclopedia text
Training on encyclopedia text is valuable because it allows the tagger to adapt to characteristics of the domain
Ex. The word “I”
Encyclopedia text – proper noun (King George I, World War I)
Ordinary text – pronoun
19. Tagging Phrases Improves the efficiency of boolean query construction (enables direct phrase matches, rather than requiring several words to be successively dropped from the phrase)
Phrases are identified by part-of-speech categories and by word-initial capitalization, which splits phrases containing capitalized words
20. Phrase Example “New York City borough”
Using word-initial capitalization, how do you think it would be split?
Title Phrases are tagged based on word-initial capitalization, quotes surrounding the title, or multiple words italicized
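A simplified sketch of the capitalization split described above; this is an illustration, not MURAX’s actual phrase splitter:

    # Split a noun phrase into runs of capitalized vs. lower-case words,
    # as described on the slide above (illustrative simplification).
    def split_on_capitalization(noun_phrase):
        groups, current = [], []
        current_is_cap = None
        for word in noun_phrase.split():
            is_cap = word[0].isupper()
            if current and is_cap != current_is_cap:
                groups.append(" ".join(current))
                current = []
            current.append(word)
            current_is_cap = is_cap
        if current:
            groups.append(" ".join(current))
        return groups

    print(split_on_capitalization("New York City borough"))
    # ['New York City', 'borough']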
21. Example Question and Component NP’s “Who was the Pulitzer Prize-winning novelist that ran for mayor of New York City?”
Pulitzer Prize
winning novelist
mayor
New York City
22. Primary Document Matches As illustrated from the previous example, noun phrases and main verbs are first extracted from the question
These phrases are passed through a query construction/refinement procedure to form boolean queries, which are used to search the encyclopedia for relevant articles from which primary document matches are made
Primary document matches are the sentences containing one or more of these phrases
23. Scoring System Matching head words in a noun phrase receive double the score of other matching words in a phrase
Words with matching stems, but incompatible part-of-speech categories are given minimal scores.
Primary document matches are then ranked according to their scores
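A sketch of the scoring just described; the head word is assumed to be the last word of the noun phrase and the weights are illustrative, since the slide does not give MURAX’s exact values:

    # Illustrative match scoring for one query noun phrase.
    def score_match(query_phrase, document_words):
        """query_phrase: list of (word, pos); document_words: dict word -> (pos, stem_only)."""
        head = query_phrase[-1][0]        # assumption: head word is the last word
        score = 0.0
        for word, pos in query_phrase:
            if word not in document_words:
                continue
            doc_pos, stem_only = document_words[word]
            if stem_only and doc_pos != pos:
                score += 0.1              # matching stem but incompatible part of speech
            elif word == head:
                score += 2.0              # matching head word counts double
            else:
                score += 1.0
        return score

    # "president lincoln" (head = "lincoln") against a sentence containing both words
    print(score_match([("president", "NOUN"), ("lincoln", "NOUN")],
                      {"president": ("NOUN", False), "lincoln": ("NOUN", False)}))  # 3.0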
24. Extracting Answers Extraction begins with finding the simple noun phrases located in the primary document matches
These noun phrases are answer hypotheses, distinguished by their component words, the article, and the sentence in which they occur
The system then tries to verify phrase relations implied by the question
25. Example The answer should be a person because of the “who”
The type phrase indicates the answer is probably a “Pulitzer Prize-winning novelist”
The relative clause indicates the answer also “ran for mayor of New York City”
26. Secondary Queries Used as an alternative means to confirm phrase relations
Consists either of an answer hypothesis or includes other question phrases such as the question’s type phrase
To find out whether the answer hypothesis is a “novelist”, the two phrases are included in a query; the search yields a list of relevant articles and sentences containing co-occurrences, called secondary document matches
27. System Output Answer hypotheses are shown to the user to focus his attention on likely answers and how they relate to other phrases in the question
The text presented is not from the documents with the highest similarity scores, but from those that confirm phrase relations lending evidence for an answer
28. Primary Query Construction How phrases from a question are translated into boolean queries with proximity constraints
“Who shot President Lincoln?”
Step 1 (Tagging)
Only One Noun phrase: President Lincoln
Main Verb: Shot
Step 2 (Boolean Terms are constructed from the phrases)
{p term1 term2 … termn}
29. Primary Query Construction (continued) The first query becomes:
{0 president lincoln}
The IR system is given this boolean query and searches for documents that match
New boolean queries may be generated to:
Refine the rankings of the documents
Reduce the number of hits (narrowing)
Increase the number of hits (broadening)
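A minimal sketch of building the first boolean term from the question’s noun phrase, using the notation on these slides ({p …} = words in order, within proximity p); the helper name is illustrative, not a MURAX API:

    # Turn a noun phrase into the {p word1 word2 ...} term notation
    # used on these slides (illustrative helper, not MURAX internals).
    def phrase_term(phrase, proximity=0):
        return "{%d %s}" % (proximity, " ".join(w.lower() for w in phrase.split()))

    noun_phrases = ["President Lincoln"]   # extracted in step 1 (tagging)
    main_verb = "shot"

    first_query = phrase_term(noun_phrases[0])
    print(first_query)   # {0 president lincoln}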
30. Narrowing Performed by using title phrases or by adding extra query terms such as the main verbs and performing a new search in the encyclopedia
Ex. [ {0 president lincoln} shot ]
Reduces the number of hits
Reduces the co-occurrence scope of terms in the query and constrains phrases to be closer together
A sequence of queries with increasingly smaller scope is made, until there are fewer hits
Ex. ( 10 {0 president lincoln} shot )
31. Broadening Tries to increase the number of hits for a boolean query (done in three ways)
Dropping the requirement for strict ordering of words (allowing a match of “President Abraham Lincoln”)
Dropping one or more whole phrases from the boolean query
Dropping one or more words from within multiple-word phrases in a query to produce a query that is composed of sub-phrases of the original
Ex. dropping “president” or “lincoln” (see the sketch below)
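A rough sketch of the narrowing and broadening moves from the last two slides, in the same notation; the helper functions are illustrative only:

    # Narrowing: add the main verb and constrain co-occurrence to a window
    # of 10 words, as in the example ( 10 {0 president lincoln} shot ).
    def narrow(phrase_q, extra_term, scope=10):
        return "(%d %s %s)" % (scope, phrase_q, extra_term)

    # Broadening move 1: drop strict ordering so that
    # "President Abraham Lincoln" can also match.
    def relax_ordering(phrase, scope=5):
        return "(%d %s)" % (scope, " ".join(phrase.lower().split()))

    # Broadening move 2 (dropping a whole phrase) simply removes a term
    # from the query.  Broadening move 3: drop one word at a time.
    def sub_phrases(phrase):
        words = phrase.lower().split()
        return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]

    print(narrow("{0 president lincoln}", "shot"))   # (10 {0 president lincoln} shot)
    print(relax_ordering("President Lincoln"))       # (5 president lincoln)
    print(sub_phrases("President Lincoln"))          # ['lincoln', 'president']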
32. Control Strategy The following operations for broadening and/or narrowing are applied in rank order
Co-occurrence scope is increased before terms are dropped
Single phrases are dropped before two phrases
Higher frequency phrases are dropped first
Title phrases are tried first
Complete phrases are used before sub phrases
Broadening or narrowing terminates when a threshold on the number of hits has been reached, or when no other useful queries can be made
33. Answer Extraction How the most likely answer hypotheses are found from the relevant sentences in the various hits
Phrase matching operations are considered first, followed by the construction of secondary queries to get secondary document matches
34. Answer Extraction: Phrase Matching Done with lexico-syntactic patterns which are described using regular expressions
Expressions are translated into finite-state recognizers, which are determinized and minimized so matching is done efficiently without backtracking
Recognizers are applied to primary and secondary matches, and the longest possible match is recorded
35. Phrase Matching Example NP=Noun Phrase
If the input is not a question, the system provides output typical of co-occurrence-based search methods
36. Verifying Type Phrases Used to try to verify answer hypotheses as instances of type phrases through:
Apposition: the type phrase of the question appears in apposition to the answer hypothesis in a document match
The IS-A relation: a document match states directly that the answer hypothesis is an instance of the type phrase (“… is a …”)
37. Verifying Type Phrases Also Through:
Noun Phrase Inclusion: type phrases that are included in answer hypotheses
The type phrase river is in the same noun phrase as the answer hypothesis Colorado River
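Simplified regular-expression sketches of the apposition and IS-A checks from the previous two slides. MURAX compiles lexico-syntactic patterns into finite-state recognizers; the regexes below are only illustrative and use made-up example data:

    import re

    hypothesis = "Harold"
    type_head = "king"   # head noun of the type phrase "Anglo-Saxon king of England"

    # Apposition: "Harold, the last Anglo-Saxon king of England, ..."
    apposition = re.compile(r"%s,\s+(?:the\s+)?[\w\s-]*\b%s\b"
                            % (re.escape(hypothesis), re.escape(type_head)),
                            re.IGNORECASE)

    # IS-A: "Harold was the last Anglo-Saxon king of England"
    is_a = re.compile(r"%s\s+(?:is|was)\s+(?:a|an|the)\b[\w\s-]*\b%s\b"
                      % (re.escape(hypothesis), re.escape(type_head)),
                      re.IGNORECASE)

    print(bool(apposition.search(
        "Harold, the last Anglo-Saxon king of England, was defeated in 1066.")))  # True
    print(bool(is_a.search("Harold was the last Anglo-Saxon king of England.")))  # True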
38. Predicate / Argument Match Associates answer hypotheses and other noun phrases in a document match that satisfy a verb relation implied in a question
Patterns accounting for the active/passive alternation are applied
39. Minimum Mismatch For reliable identification, simple noun phrases are extracted from primary document matches
“mayor of New York City” is broken into two simpler, independent phrases and then exact matching is done after all the document matches are found
The minimum degree of mismatch is considered best (see the sketch below)
For the type phrase “Anglo-Saxon king of England”, both document matches match equally well, but “Harold” is the shorter, more exact match, while “Saint Edward the Confessor” involves more mismatch because its matching context is “next to last Anglo-Saxon king of England”
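The slide does not give MURAX’s exact mismatch measure; as an illustration, one could simply count the words that differ between the question’s type phrase and the phrase matched in the document:

    # Illustrative mismatch measure: words present on one side but not the other.
    def mismatch(type_phrase, document_phrase):
        t = type_phrase.lower().split()
        d = document_phrase.lower().split()
        extra = [w for w in d if w not in t]
        missing = [w for w in t if w not in d]
        return len(extra) + len(missing)

    print(mismatch("Anglo-Saxon king of England",
                   "last Anglo-Saxon king of England"))          # 1  (Harold's context)
    print(mismatch("Anglo-Saxon king of England",
                   "next to last Anglo-Saxon king of England"))  # 3  (Edward the Confessor's context)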
40. Person Verification Person names almost always have word-initial capital letters
Articles about people generally have their name as the title
Have a higher percentage of words that are male or female pronouns than in other articles
To confirm, a secondary query is made to see if a person’s name is present as a title, and then it is decided whether the article is about the person
41. Co-Occurrence Queries Secondary queries are also used to find co-occurrences of answer hypotheses and question phrases that extend beyond the context of a single sentence
Useful for ranking alternative answer hypotheses in the absence of other differentiating phrase matches
Key Largo occurs with Florida Keys, and the other film hypotheses do not, allowing Key Largo to receive preference
42. Equivalent Hypotheses The same answer can be expressed by several hypotheses such as “President Kennedy”, “John F. Kennedy”, & “President John F. Kennedy”, which all refer to the same person, and in some cases even just “Kennedy” does
Hypotheses are determined to be equivalent by reference to the article’s title – if the title is “Kennedy, John F.” and “Kennedy” is mentioned in the article, they are taken to refer to the same person
If “he” or “she” is used in an article, it is assumed to refer to the person named in the article’s title (as in the Norman Mailer example: if the title is “Mailer, Norman” and “he” is used instead of “Mailer”, the result is the same)
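A small sketch of grouping equivalent hypotheses by reference to an article title; the heuristics (such as ignoring the title “President”) are illustrative, not MURAX’s actual rules:

    # Treat a hypothesis as referring to the article if all of its content
    # words appear in the article title (illustrative heuristic).
    def refers_to_title(hypothesis, article_title):
        title_words = {w.strip(",").lower() for w in article_title.split()}
        hyp_words = {w.lower() for w in hypothesis.split() if w.lower() != "president"}
        return hyp_words <= title_words

    title = "Kennedy, John F."
    for h in ["President Kennedy", "John F. Kennedy", "President John F. Kennedy", "Kennedy"]:
        print(h, refers_to_title(h, title))   # all True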
43. Combining Phrase Matches Criteria for partially ordering hypotheses in order of preference
When type phrases occur, the highest answer hypotheses are those with minimum mismatch
Number of question phrases that co-occur with an answer hypothesis – qualified by the number of different articles needed to match the most question phrases
Predicate / argument matches produce preferences among different answer hypotheses
For “who” questions, an answer hypothesis that is verified as a person takes precedence
Answer hypotheses are ranked in terms of their co-occurrence with question phrases
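As an illustration only, the preference criteria on the slide above could be collapsed into a single composite sort key; the field names are made up, not MURAX data structures, and the real system uses a partial ordering rather than one key:

    # Smaller key = preferred hypothesis.
    def ranking_key(h):
        return (
            h["type_phrase_mismatch"],        # smaller mismatch preferred
            -h["question_phrases_matched"],   # more co-occurring question phrases,
            h["articles_needed"],             # ...found in as few articles as possible
            -h["predicate_argument_matches"],
            -int(h["verified_person"]),       # matters for "who" questions
            -h["cooccurrence_count"],
        )

    mailer = {"type_phrase_mismatch": 0, "question_phrases_matched": 3,
              "articles_needed": 1, "predicate_argument_matches": 1,
              "verified_person": True, "cooccurrence_count": 4}
    other = {"type_phrase_mismatch": 2, "question_phrases_matched": 1,
             "articles_needed": 2, "predicate_argument_matches": 0,
             "verified_person": False, "cooccurrence_count": 1}
    print(sorted([other, mailer], key=ranking_key)[0] is mailer)   # True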
44. Interim Evaluation The implementation of the MURAX system is not yet complete (it cannot handle how and why questions), but enough has been completed to permit performance estimates
The system was tested with 70 “Trivial Pursuit” questions
Results:
The best guess was the correct answer for a little more than half of the questions (53%), and the correct answer was within the top 5 guesses for 74% of the questions
45. Current Status Current system is not fast enough
Articles are tagged and then completely analyzed
But, it is actually only necessary to analyze specific sentences
A new implementation is underway to improve performance
46. Future Work MURAX is a means for investigating how natural language methods can be used for intelligent information retrieval
WordNet Thesaurus appears extremely useful and could provide synonym and hyponym information
Ex. “What Pulitzer Prize-winning novelist ran for mayor of New York City?” – WordNet would indicate that novelist is a hyponym of person, so the answer should be a person’s name even though the question begins with “what”
47. Ask Jeeves
48. What is Ask Jeeves? Natural Language search engine designed to return answers to user questions
Utilizes a cluster of fault-tolerant servers
User enters a query in question, phrase, or word form
User-relevance algorithm compares user question to pre-selected questions
Pre-selected questions stored in a database that contains links to their answers
49. How does Ask Jeeves Work? User enters a question
Question-processing engine attempts to determine the nature of the question
The answer processing engine returns a list of questions
The User selects the closest match
Ask Jeeves returns an answer based on a comprehensive knowledge base.
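Ask Jeeves’ actual user-relevance algorithm is proprietary; the sketch below uses simple word-overlap (Jaccard) similarity purely to illustrate the matching step, with made-up data:

    import re

    def tokens(text):
        return set(re.findall(r"[a-z']+", text.lower()))

    def jaccard(a, b):
        a, b = tokens(a), tokens(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Pre-selected questions mapped to links to their answers (illustrative).
    knowledge_base = {
        "What is the capital of the Netherlands?": "http://example.com/netherlands",
        "Who won the most Oscars for costume design?": "http://example.com/oscars",
    }

    def closest_questions(user_question, kb, top_n=3):
        # The user then picks one of these; its stored answer link is followed.
        return sorted(kb, key=lambda q: jaccard(user_question, q), reverse=True)[:top_n]

    print(closest_questions("What's the Netherlands' capital?", knowledge_base))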
50. Ask Jeeves Knowledge Base Compiled by Ask Jeeves’ research staff
Monitored by human editors
Contains answers to over 7 million of the most popular questions on the internet
Operates on the 80/20 rule
Geared to answer 20% of the questions asked 80% of the time
80% of the answers users seek result from the same 20% of the questions asked
Supplemented by a results summary from the major search engines
51. Teoma System Acquired by Ask Jeeves in 2001
The Teoma search system is used to determine the level of authority of a particular site for a given subject.
Authority is determined by three techniques
Refine
Results
Resources
52. Refine, Results, Resources Refine: Teoma organizes sites into communities that are about the same subject
Results: Subject-Specific Popularity analyzes the relationships among the sites in a community. Authority is determined by the number of same-subject pages that reference a page, an assessment of expert opinion on the best sources for a subject, and hundreds of other criteria.
Resources: Teoma finds and identifies expert resources about a particular subject.
53. Ask Jeeves Operations Consumer-oriented
Revenues generated from advertising
Partnership with Google brings in 65% of revenue
Companies bid on ad placement for related questions.
Licensing of technology
Dell’s “Ask Dudley”
Deals with Toshiba and BellSouth
54. Ask Jeeves Conclusion Ask Jeeves interprets Natural Language queries and attempts to match them to pre-selected questions
Originally, Ask Jeeves’ Knowledge base of questions was built and maintained by humans
Teoma system now determines level of authority between a site and a query
55. Comparison MURAX is a higher precision IR system
AskJeeves tries to match your question to pre-selected subjects, and then displays links to resources of authority for these subjects
MURAX actually deciphers your question and can find any answer from the encyclopedia, as long as the question does not begin with “how” or “why”
56. Comparison AskJeeves relies more on external links, whereas MURAX provides links to all internal documents and responds with an answer.
AskJeeves is better suited for “popular” questions – 80% of their answers come from 20% of the questions asked
As long as the answer can be found in the encyclopedia, MURAX will find an answer within its top 5 results 74% of the time, and is thus more precise
57. Critique Lacked technical specifics (gave only brief, general information on each topic)
Examples led to figures that were not included in the presentation
Typos and incomplete sentences and phrases (some points made no sense)
Outdated (October 2000)
Since then AskJeeves has switched indexing methods
Basically all presentation material came from outside sources
58. Questions?