1. A COMPARISON OF TWO NATURAL LANGUAGE BASED INFORMATION RETRIEVAL SYSTEMS (MURAX AND ASK JEEVES) By Chris Becker & Nick Sarnelli
2. Natural Language Introduction LINGUISTICS
Syntax – study of the patterns of formation of sentences and phrases from words and of the rules for the formation of grammatical sentences in a language.
Semantics – study of word meaning, including the ways meaning is structured in language and changes in meaning over time.
3. QUERIES Corpus-based analysis
– relies largely on statistical methods to extract data.
- well-matched to customary resources of IR.
Demand-based analysis
- reduces permanent storage requirements.
- analysis depends upon query.
- requires access to documents returned from a search.
4. Closed Class Questions Questions (queries) stated in Natural Language that have definite answers (usually nouns).
Answers are assumed to lie in a set of objects.
5. Query Results Depend Upon How clear the user’s need is.
How precisely the need has been expressed by the query.
How well the query matches the terminology of the text corpus being searched.
6. MURAX A ROBUST LINGUISTIC APPROACH FOR QUESTION-ANSWERING USING AN ON-LINE ENCYCLOPEDIA
7. Purpose To answer closed-class questions using a corpus (large collection of documents) of natural language
In simpler terms, answering general-knowledge questions using an on-line encyclopedia (Grolier Encyclopedia – http://www.grolier.com)
8. What is a closed-class question? A question stated in natural language, which assumes some definite answer typified by a noun phrase rather than a procedural answer
The system hypothesizes noun phrases that are likely to be the answer and presents the user with relevant text, focusing the user’s attention appropriately
Answers are based on relation to the question, not word frequency
9. Queries The corpus (documents) are accessed by an Information Retrieval system that supports boolean search with proximity constraints
Queries are automatically constructed from the phrasal content of the question
The query is then passed to the IR system to find relevant text
10. Queries (continued) The relevant text itself is then analyzed
Noun phrase hypothesis is extracted
new queries are independently made to confirm phrase relations for the various hypotheses
11. Why MURAX? The encyclopedia (Grolier’s) is composed of a massive amount of unrestricted text
It is impossible to manually provide detailed lexical or semantic information for the over 100,000 word stems it contains
Therefore, shallow syntactic analysis can be used where nothing is known in advance about the types of relations and objects, which may differ for each new question
12. Benefits of Natural Language Improves quality of results by providing text to the user which confirms phrase relations, instead of just word matches
Serves as a focus for the development of linguistic tools for content analysis
Reveals what kind of grammar development should be done to improve performance
13. Question Characteristics Closed-class question is a direct question whose answer is expressible as a noun phrase
Examples:
What’s the capital of the Netherlands?
Who’s won the most Oscars for costume design?
14. Question Words and Expectations Who/Whose: Person
What /Which: Thing, Person, Location
Where: Location
When: Time
How Many: Number
Why/How – expect a procedural answer instead of a noun phrase
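A minimal sketch of how the expectations in the slide above could be represented as a lookup table; the function and data names are illustrative, not part of MURAX:

    # Hypothetical mapping from question word to expected answer type,
    # mirroring the slide above; not taken from the MURAX implementation.
    EXPECTED_TYPES = {
        "who": ["Person"], "whose": ["Person"],
        "what": ["Thing", "Person", "Location"],
        "which": ["Thing", "Person", "Location"],
        "where": ["Location"],
        "when": ["Time"],
        "how many": ["Number"],
    }

    def expected_types(question):
        q = question.lower()
        if q.startswith("how many"):
            return EXPECTED_TYPES["how many"]
        if q.startswith(("why", "how")):
            return None  # procedural answer expected, not a noun phrase
        for word, types in EXPECTED_TYPES.items():
            if q.startswith(word):
                return types
        return None

    print(expected_types("Who shot President Lincoln?"))   # ['Person']
    print(expected_types("How many states are there?"))    # ['Number']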
15. Type Phrase The Type Phrase is the noun phrase at the start of the question
NP = Noun Phrase
“What is the NP …?” or “What NP …?”
Type Phrase indicates what type of thing the answer is
Ex. What is the book’s title?
What book has this title?
16. Grolier’s Encyclopedia Was chosen as the corpus (set of documents) for the system and contains approximately 27,000 articles
The components responsible for the linguistic analysis of these 27,000 articles are a part-of-speech tagger and a lexico-syntactic pattern matcher
17. Hidden Markov Model (HMM) Part of Speech Tagger
Probabilistic
Parameters are estimated by training with a sample of ordinary untagged text
Uses suffix information and local context to predict the categories of words
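As a rough illustration of the decoding step such a tagger performs, here is a toy Viterbi sketch; the tag set and probabilities are made-up placeholders, not parameters trained on the Grolier text:

    # Toy Viterbi decoding for an HMM part-of-speech tagger.
    # All probabilities below are illustrative, not trained values.
    TAGS = ["DET", "NOUN", "VERB"]
    START = {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}
    TRANS = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
             "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
             "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
    EMIT = {"DET": {"the": 0.9},
            "NOUN": {"president": 0.5, "lincoln": 0.4},
            "VERB": {"shot": 0.7}}

    def viterbi(words):
        # best[tag] = (probability, best tag sequence ending in tag)
        best = {t: (START[t] * EMIT[t].get(words[0], 1e-6), [t]) for t in TAGS}
        for w in words[1:]:
            new_best = {}
            for t in TAGS:
                prob, path = max(
                    ((p * TRANS[prev][t] * EMIT[t].get(w, 1e-6), seq + [t])
                     for prev, (p, seq) in best.items()),
                    key=lambda x: x[0])
                new_best[t] = (prob, path)
            best = new_best
        return max(best.values(), key=lambda x: x[0])[1]

    print(viterbi(["the", "president", "shot"]))   # ['DET', 'NOUN', 'VERB']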
18. Encyclopedia Training with HMM HMM was trained with encyclopedia text
Training on encyclopedia text is valuable because it allows the tagger to adapt to characteristics of the domain
Ex. The word “I”
Encyclopedia text – proper noun (King George I, World War I)
Ordinary text – pronoun
19. Tagging Phrases Improves the efficiency of boolean query construction (enables direct phrase matches, rather than requiring several words to be successively dropped from the phrase)
Phrases are identified by part-of-speech categories and by word-initial capitalization, which splits phrases containing capitalized words
20. Phrase Example “New York City borough”
Using word-initial capitalization, how do you think it would be split?
Title Phrases are tagged based on word-initial capitalization, quotes surrounding the title, or multiple words italicized
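A simplified sketch of the capitalization split described above; this is an illustration, not MURAX’s actual phrase splitter:

    # Split a noun phrase into runs of capitalized vs. lower-case words,
    # as described on the slide above (illustrative simplification).
    def split_on_capitalization(noun_phrase):
        groups, current = [], []
        current_is_cap = None
        for word in noun_phrase.split():
            is_cap = word[0].isupper()
            if current and is_cap != current_is_cap:
                groups.append(" ".join(current))
                current = []
            current.append(word)
            current_is_cap = is_cap
        if current:
            groups.append(" ".join(current))
        return groups

    print(split_on_capitalization("New York City borough"))
    # ['New York City', 'borough']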
21. Example Question and Component NP’s “Who was the Pulitzer Prize-winning novelist that ran for mayor of New York City?”
Pulitzer Prize
winning novelist
mayor
New York City
22. Primary Document Matches As illustrated from the previous example, noun phrases and main verbs are first extracted from the question
These phrases are passed through a query construction/refinement procedure to form boolean queries, which are used to search the encyclopedia for relevant articles from which primary document matches are made
Primary document matches are the sentences containing one or more of these phrases
23. Scoring System Matching head words in a noun phrase receive double the score of other matching words in a phrase
Words with matching stems, but incompatible part-of-speech categories are given minimal scores.
Primary document matches are then ranked according to their scores
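A sketch of the scoring just described; the head word is assumed to be the last word of the noun phrase and the weights are illustrative, since the slide does not give MURAX’s exact values:

    # Illustrative match scoring for one query noun phrase.
    def score_match(query_phrase, document_words):
        """query_phrase: list of (word, pos); document_words: dict word -> (pos, stem_only)."""
        head = query_phrase[-1][0]        # assumption: head word is the last word
        score = 0.0
        for word, pos in query_phrase:
            if word not in document_words:
                continue
            doc_pos, stem_only = document_words[word]
            if stem_only and doc_pos != pos:
                score += 0.1              # matching stem but incompatible part of speech
            elif word == head:
                score += 2.0              # matching head word counts double
            else:
                score += 1.0
        return score

    # "president lincoln" (head = "lincoln") against a sentence containing both words
    print(score_match([("president", "NOUN"), ("lincoln", "NOUN")],
                      {"president": ("NOUN", False), "lincoln": ("NOUN", False)}))  # 3.0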
24. Extracting Answers Extraction begins with finding the simple noun phrases located in the primary document matches
These noun phrases are answer hypotheses, distinguished by their component words, the article, and the sentence in which they occur
The system then tries to verify phrase relations implied by the question
25. Example The answer should be a person because of the “who”
The type phrase indicates the answer is probably a “Pulitzer Prize-winning novelist”
The relative clause indicates the answer also “ran for mayor of New York City”
26. Secondary Queries Used as an alternative means to confirm phrase relations
Consists either of an answer hypothesis or includes other question phrases such as the question’s type phrase
To find out whether the answer hypothesis is a “novelist”, the two phrases are included in a query; the search yields a list of relevant articles and sentences containing co-occurrences, called secondary document matches
27. System Output Answer hypotheses are shown to the user to focus his attention on likely answers and how they relate to other phrases in the question
The text presented is not from the documents with the highest similarity scores, but from those that confirm phrase relations lending evidence for an answer
28. Primary Query Construction How phrases from a question are translated into boolean queries with proximity constraints
“Who shot President Lincoln?”
Step 1 (Tagging)
Only One Noun phrase: President Lincoln
Main Verb: Shot
Step 2 (Boolean Terms are constructed from the phrases)
{p term1 term2 … termn}
29. Primary Query Construction (continued) The first query becomes:
{0 president lincoln}
The IR system is given this boolean query and searches for documents that match
New boolean queries may be generated to:
Refine the rankings of the documents
Reduce the number of hits (narrowing)
Increase the number of hits (broadening)
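A minimal sketch of building the first boolean term from the question’s noun phrase, using the notation on these slides ({p …} = words in order, within proximity p); the helper name is illustrative, not a MURAX API:

    # Turn a noun phrase into the {p word1 word2 ...} term notation
    # used on these slides (illustrative helper, not MURAX internals).
    def phrase_term(phrase, proximity=0):
        return "{%d %s}" % (proximity, " ".join(w.lower() for w in phrase.split()))

    noun_phrases = ["President Lincoln"]   # extracted in step 1 (tagging)
    main_verb = "shot"

    first_query = phrase_term(noun_phrases[0])
    print(first_query)   # {0 president lincoln}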
30. Narrowing Performed by using title phrases or by adding extra query terms such as the main verbs and performing a new search in the encyclopedia
Ex. [ {0 president lincoln} shot ]
Reduces the number of hits
Reduces the co-occurrence scope of terms in the query and constrains phrases to be closer together
A sequence of queries with increasingly smaller scope is made, until there are fewer hits
Ex. ( 10 {0 president lincoln} shot )
31. Broadening Tries to increase the number of hits for a boolean query (done in three ways)
Dropping the requirement for strict ordering of words (allowing a match of “President Abraham Lincoln”)
Dropping one or more whole phrases from the boolean query
Dropping one or more words from within multiple-word phrases in a query to produce a query that is composed of sub-phrases of the original
Ex. dropping “president” or “lincoln” (see the sketch below)
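A rough sketch of the narrowing and broadening moves from the last two slides, in the same notation; the helper functions are illustrative only:

    # Narrowing: add the main verb and constrain co-occurrence to a window
    # of 10 words, as in the example ( 10 {0 president lincoln} shot ).
    def narrow(phrase_q, extra_term, scope=10):
        return "(%d %s %s)" % (scope, phrase_q, extra_term)

    # Broadening move 1: drop strict ordering so that
    # "President Abraham Lincoln" can also match.
    def relax_ordering(phrase, scope=5):
        return "(%d %s)" % (scope, " ".join(phrase.lower().split()))

    # Broadening move 2 (dropping a whole phrase) simply removes a term
    # from the query.  Broadening move 3: drop one word at a time.
    def sub_phrases(phrase):
        words = phrase.lower().split()
        return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]

    print(narrow("{0 president lincoln}", "shot"))   # (10 {0 president lincoln} shot)
    print(relax_ordering("President Lincoln"))       # (5 president lincoln)
    print(sub_phrases("President Lincoln"))          # ['lincoln', 'president']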
32. Control Strategy The following operations for broadening and/or narrowing are applied in rank order
Co-occurrence scope is increased before terms are dropped
Single phrases are dropped before two phrases
Higher frequency phrases are dropped first
Title phrases are tried first
Complete phrases are used before sub phrases
Broadening or narrowing terminates when a threshold on the number of hits has been reached, or when no other useful queries can be made
33. Answer Extraction How the most likely answer hypotheses are found from the relevant sentences in the various hits
Phrase matching operations are considered first, followed by the construction of secondary queries to get secondary document matches
34. Answer Extraction: Phrase Matching Done with lexico-syntactic patterns which are described using regular expressions
Expressions are translated into finite-state recognizers, which are determinized and minimized so matching is done efficiently without backtracking
Recognizers are applied to primary and secondary matches, and the longest possible match is recorded
35. Phrase Matching Example NP=Noun Phrase
If the input is not a question, the system provides output typical of co-occurrence-based search methods
36. Verifying Type Phrases Used to try to verify answer hypotheses as instances of type phrases through:
Apposition: the type phrase of the question appears in apposition to the answer hypothesis in a document match
The IS-A relation: a document match states directly that the answer hypothesis is an instance of the type phrase (“… is a …”)
37. Verifying Type Phrases Also Through:
Noun Phrase Inclusion: type phrases that are included in answer hypotheses
The type phrase river is in the same noun phrase as the answer hypothesis Colorado River
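Simplified regular-expression sketches of the apposition and IS-A checks from the previous two slides. MURAX compiles lexico-syntactic patterns into finite-state recognizers; the regexes below are only illustrative and use made-up example data:

    import re

    hypothesis = "Harold"
    type_head = "king"   # head noun of the type phrase "Anglo-Saxon king of England"

    # Apposition: "Harold, the last Anglo-Saxon king of England, ..."
    apposition = re.compile(r"%s,\s+(?:the\s+)?[\w\s-]*\b%s\b"
                            % (re.escape(hypothesis), re.escape(type_head)),
                            re.IGNORECASE)

    # IS-A: "Harold was the last Anglo-Saxon king of England"
    is_a = re.compile(r"%s\s+(?:is|was)\s+(?:a|an|the)\b[\w\s-]*\b%s\b"
                      % (re.escape(hypothesis), re.escape(type_head)),
                      re.IGNORECASE)

    print(bool(apposition.search(
        "Harold, the last Anglo-Saxon king of England, was defeated in 1066.")))  # True
    print(bool(is_a.search("Harold was the last Anglo-Saxon king of England.")))  # True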
38. Predicate / Argument Match Associates answer hypotheses and other noun phrases in a document match that satisfy a verb relation implied in a question
Patterns accounting for the active/passive alternation are applied
39. Minimum Mismatch For reliable identification, simple noun phrases are extracted from primary document matches
“mayor of New York City” is broken into two simpler, independent phrases and then exact matching is done after all the document matches are found
The minimum degree of mismatch is considered best (see the sketch below)
For the type phrase “Anglo-Saxon king of England”, both document matches match equally well, but “Harold” is the shorter, more exact match, while “Saint Edward the Confessor” involves more mismatch because its matching context is “next to last Anglo-Saxon king of England”
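The slide does not give MURAX’s exact mismatch measure; as an illustration, one could simply count the words that differ between the question’s type phrase and the phrase matched in the document:

    # Illustrative mismatch measure: words present on one side but not the other.
    def mismatch(type_phrase, document_phrase):
        t = type_phrase.lower().split()
        d = document_phrase.lower().split()
        extra = [w for w in d if w not in t]
        missing = [w for w in t if w not in d]
        return len(extra) + len(missing)

    print(mismatch("Anglo-Saxon king of England",
                   "last Anglo-Saxon king of England"))          # 1  (Harold's context)
    print(mismatch("Anglo-Saxon king of England",
                   "next to last Anglo-Saxon king of England"))  # 3  (Edward the Confessor's context)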
40. Person Verification Person names almost always have word-initial capital letters
Articles about people generally have their name as the title
Have a higher percentage of words that are male or female pronouns than in other articles
To confirm, a secondary query is made to see if a person’s name is present as a title, and then it is decided whether the article is about the person
41. Co-Occurrence Queries Secondary queries are also used to find co-occurrences of answer hypotheses and question phrases that extend beyond the context of a single sentence
Useful for ranking alternative answer hypotheses in the absence of other differentiating phrase matches
Key Largo occurs with Florida Keys, and the other film hypotheses do not, allowing Key Largo to receive preference
42. Equivalent Hypotheses The same answer can be expressed by several hypotheses such as “President Kennedy”, “John F. Kennedy”, & “President John F. Kennedy”, which all refer to the same person, and in some cases even just “Kennedy” does
Hypotheses are determined to be equivalent by reference to the article’s title – if the title is “Kennedy, John F.” and “Kennedy” is mentioned in the article, they are taken to refer to the same person
If “he” or “she” is used in an article, it is assumed to refer to the person named in the article’s title (as in the Norman Mailer example: if the title is “Mailer, Norman” and “he” is used instead of “Mailer”, the result is the same)
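A small sketch of grouping equivalent hypotheses by reference to an article title; the heuristics (such as ignoring the title “President”) are illustrative, not MURAX’s actual rules:

    # Treat a hypothesis as referring to the article if all of its content
    # words appear in the article title (illustrative heuristic).
    def refers_to_title(hypothesis, article_title):
        title_words = {w.strip(",").lower() for w in article_title.split()}
        hyp_words = {w.lower() for w in hypothesis.split() if w.lower() != "president"}
        return hyp_words <= title_words

    title = "Kennedy, John F."
    for h in ["President Kennedy", "John F. Kennedy", "President John F. Kennedy", "Kennedy"]:
        print(h, refers_to_title(h, title))   # all True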
43. Combining Phrase Matches Criteria for partially ordering hypotheses in order of preference
When type phrases occur, the highest answer hypotheses are those with minimum mismatch
Number of question phrases that co-occur with an answer hypothesis – qualified by the number of different articles needed to match the most question phrases
Predicate / argument matches produce preferences among different answer hypotheses
For “who” questions, an answer hypothesis that is verified as a person takes precedence
Answer hypotheses are ranked in terms of their co-occurrence with question phrases
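As an illustration only, the preference criteria on the slide above could be collapsed into a single composite sort key; the field names are made up, not MURAX data structures, and the real system uses a partial ordering rather than one key:

    # Smaller key = preferred hypothesis.
    def ranking_key(h):
        return (
            h["type_phrase_mismatch"],        # smaller mismatch preferred
            -h["question_phrases_matched"],   # more co-occurring question phrases,
            h["articles_needed"],             # ...found in as few articles as possible
            -h["predicate_argument_matches"],
            -int(h["verified_person"]),       # matters for "who" questions
            -h["cooccurrence_count"],
        )

    mailer = {"type_phrase_mismatch": 0, "question_phrases_matched": 3,
              "articles_needed": 1, "predicate_argument_matches": 1,
              "verified_person": True, "cooccurrence_count": 4}
    other = {"type_phrase_mismatch": 2, "question_phrases_matched": 1,
             "articles_needed": 2, "predicate_argument_matches": 0,
             "verified_person": False, "cooccurrence_count": 1}
    print(sorted([other, mailer], key=ranking_key)[0] is mailer)   # True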
44. Interim Evaluation The implementation of the MURAX system is not yet complete (it cannot handle how and why questions), but enough has been completed to permit performance estimates
The system was tested with 70 “Trivial Pursuit” questions
Results:
The best guess was the correct answer for a little more than half of the questions (53%), and the correct answer was within the top 5 guesses for 74% of the questions
45. Current Status Current system is not fast enough
Articles are tagged and then completely analyzed
But, it is actually only necessary to analyze specific sentences
A new implementation is underway to improve performance
46. Future Work MURAX is a means for investigating how natural language methods can be used for intelligent information retrieval
WordNet Thesaurus appears extremely useful and could provide synonym and hyponym information
Ex. “What Pulitzer Prize-winning novelist ran for mayor of New York City?” – WordNet would indicate that novelist is a hyponym of person, so the answer should be a person’s name even though the question begins with “what”
47. Ask Jeeves
48. What is Ask Jeeves? Natural Language search engine designed to return answers to user questions
Utilizes a cluster of fault-tolerant servers
User enters a query in question, phrase, or word form
User-relevance algorithm compares user question to pre-selected questions
Pre-selected questions stored in a database that contains links to their answers
49. How does Ask Jeeves Work? User enters a question
Question-processing engine attempts to determine the nature of the question
The answer processing engine returns a list of questions
The User selects the closest match
Ask Jeeves returns an answer based on a comprehensive knowledge base.
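Ask Jeeves’ actual user-relevance algorithm is proprietary; the sketch below uses simple word-overlap (Jaccard) similarity purely to illustrate the matching step, with made-up data:

    import re

    def tokens(text):
        return set(re.findall(r"[a-z']+", text.lower()))

    def jaccard(a, b):
        a, b = tokens(a), tokens(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Pre-selected questions mapped to links to their answers (illustrative).
    knowledge_base = {
        "What is the capital of the Netherlands?": "http://example.com/netherlands",
        "Who won the most Oscars for costume design?": "http://example.com/oscars",
    }

    def closest_questions(user_question, kb, top_n=3):
        # The user then picks one of these; its stored answer link is followed.
        return sorted(kb, key=lambda q: jaccard(user_question, q), reverse=True)[:top_n]

    print(closest_questions("What's the Netherlands' capital?", knowledge_base))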
50. Ask Jeeves Knowledge Base Compiled by Ask Jeeves’ research staff
Monitored by human editors
Contains answers to over 7 million of the most popular questions on the internet
Operates on the 80/20 rule
Geared to answer 20% of the questions asked 80% of the time
80% of the answers users seek result from the same 20% of the questions asked
Supplemented by a results summary from the major search engines
51. Teoma System Acquired by Ask Jeeves in 2001
The Teoma search system is used to determine the level of authority of a particular site for a given subject.
Authority is determined by three techniques
Refine
Results
Resources
52. Refine, Results, Resources Refine: Teoma organizes sites into communities that are about the same subject
Results: Subject-Specific Popularity analyzes the relationships among the sites in a community. Authority is determined by the number of same-subject pages that reference a page, an assessment of expert opinion on the best sources for a subject, and hundreds of other criteria.
Resources: Teoma finds and identifies expert resources about a particular subject.
53. Ask Jeeves Operations Consumer-oriented
Revenues generated from advertising
Partnership with Google brings in 65% of revenue
Companies bid on ad placement for related questions.
Licensing of technology
Dell’s “Ask Dudley”
Deals with Toshiba and BellSouth
54. Ask Jeeves Conclusion Ask Jeeves interprets Natural Language queries and attempts to match them to pre-selected questions
Originally, Ask Jeeves’ Knowledge base of questions was built and maintained by humans
Teoma system now determines level of authority between a site and a query
55. Comparison MURAX is a higher precision IR system
AskJeeves tries to match your question to pre-selected subjects, and then displays links to resources of authority for these subjects
MURAX actually deciphers your question and can find any answer from the encyclopedia, as long as the question does not begin with “how” or “why”
56. Comparison AskJeeves relies more on external links, whereas MURAX provides links to all internal documents and responds with an answer.
AskJeeves is better suited for “popular” questions – 80% of their answers come from 20% of the questions asked
As long as the answer can be found in the encyclopedia, MURAX will find an answer within its top 5 results 74% of the time, and is thus more precise
57. Critique Lacked technical specifics (gave only brief, general information on each topic)
Examples led to figures that were not included in the presentation
Typos and incomplete sentences and phrases (some points made no sense)
Outdated (October 2000)
Since then AskJeeves has switched indexing methods
Basically all presentation material came from outside sources
58. Questions?