
Is That Your Final Answer? The Role of Confidence in Question Answering Systems

Robert Gaizauskas1 and Sam Scott2

1Natural Language Processing Group

Department of Computer Science

University of Sheffield

2Centre for Interdisciplinary Studies

Carleton University



Outline of Talk

  • Question Answering: A New Challenge in Information Retrieval

  • The TREC Question Answering Track

    • The Task

    • Evaluation Metrics

    • The Potential of NLP for Question Answering

  • The Sheffield QA System

    • Okapi

    • QA-LaSIE

  • Evaluation Results

  • Confidence Measures and their Application

  • Conclusions and Discussion

Dublin Computational Linguistics Research Seminar



Question Answering: A New Challenge in IR

  • Traditionally information retrieval systems are viewed as systems that return documents in response to a query

    • Such systems better termed document retrieval systems

    • Once document returned user must search it to find required info

    • Acceptable if docs returned are short, not too many returned, and info need is general

    • Not acceptable if many docs returned or docs very long or info need very specific

  • Recently (1999, 2000) the TREC Question Answering (QA) track has been designed to address this issue

    • As construed in TREC, QA systems take natural language questions and a text collection as input and return specific answers (literal text strings) from documents in the text collection


QA: An (Incomplete) Historical Perspective

Question answering not a new topic:

  • Erotetic logic (Harrah, 1984; Belnap and Steel, 1976)

  • Deductive question answering work in AI (Green, 1969;Schubert, 1986)

  • Conceptual Theories of QA (Lehnert, 1977)

  • Natural language front-ends to databases (Copestake, 1990; DARPA ATIS evaluations)


Outline of Talk

  • Question Answering: A New Challenge in Information Retrieval

  • The TREC Question Answering Track

    • The Task

    • Evaluation Metrics

    • The Potential of NLP for Question Answering

  • The Sheffield QA System

    • Okapi

    • QA-LaSIE

  • Evaluation Results

  • Confidence Measures and their Application

  • Conclusions and Discussion


The TREC QA Track: Task Definition

  • Inputs:

    • 4GB newswire texts (from the TREC text collection)

    • File of natural language questions (200 TREC-8/700 TREC-9)

      e.g.

      Where is the Taj Mahal?

      How tall is the Eiffel Tower?

      Who was Johnny Mathis’ high school track coach?

  • Outputs:

    • Five ranked answers per question, including pointer to source document

      • 50 byte category

      • 250 byte category

    • Up to two runs per category per site

  • Limitations:

    • Each question has an answer in the text collection

    • Each answer is a single literal string from a text (no implicit or multiple answers)


The TREC QA Track: Metrics and Scoring

  • The principal metric is Mean Reciprocal Rank (MRR)

    • Correct answer at rank 1 scores 1

    • Correct answer at rank 2 scores 1/2

    • Correct answer at rank 3 scores 1/3

    • Sum over all questions and divide by number of questions

  • More formally:

    MRR = (1/N) · Σ_{i=1..N} r_i

    where N = # questions, r_i = the reciprocal of the best (lowest) rank assigned by a system at which a correct answer is found for question i, or 0 if no correct answer was found

  • Judgements made by human judges based on answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
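The metric is easy to compute directly; a minimal sketch (the function and argument names are mine, not TREC's):

```python
def mean_reciprocal_rank(ranks):
    """MRR over a question set. ranks[i] is the 1-based rank of the
    first correct answer for question i, or None if none of the five
    proposed answers was judged correct."""
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)

# Correct answers at ranks 1 and 2, plus one unanswered question:
# (1 + 1/2 + 0) / 3 = 0.5
```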


The Potential of NLP for Question Answering

  • NLP has failed to deliver significant improvements in the document retrieval task.

    Will the same be true of QA?

  • Must depend on the definition of task

    • Current TREC QA task is best construed as micro passage retrieval

  • There are a number of linguistic phenomena relevant to QA which suggest that NLP ought to be able to help, in principle.

  • But, it also now seems clear from TREC-9 results that NLP techniques do improve the effectiveness of QA systems in practice.


The Potential of NLP for Question Answering

  • Coreference. Part of the information required to answer a question may occur in one sentence, while the rest occurs in another linked via an anaphor. E.g.

    Question: How much did Mercury spend on advertising in 1993?
    Text: Mercury … Last year the company spent £12m on advertising.

  • Deixis. References (possibly relative) to here and now may need to be correctly interpreted. E.g. to answer the preceding question requires interpreting last year as 1993 via the date-line of the text (1994).

  • Grammatical knowledge. Difference in grammatical role can be of crucial importance. E.g.

    Question: Which company took over Microsoft?

    cannot be answered using the

    Text: Microsoft took over Entropic.


The Potential of NLP for Question Answering (cont)

  • Semantic knowledge. Entailments based on lexical semantics may need to be computed. E.g. To answer the

    Question: At what age did Rossini stop writing opera?

    using the

    Text: Rossini … did not write another opera after he was 35.

    requires knowing that stopping X at time t means not doing X after t.

  • World knowledge. World knowledge may be required to interpret linguistic expressions. E.g. To answer the

    Question: In which city is the Eiffel Tower?

    using the

    Text: The Eiffel Tower is in Paris.

    but not the

    Text: The Eiffel Tower is in France.

    requires the knowledge that Paris is a city, France a country.


Outline of Talk

  • Question Answering: A New Challenge in Information Retrieval

  • The TREC Question Answering Track

    • The Task

    • Evaluation Metrics

    • The Potential of NLP for Question Answering

  • The Sheffield QA System

    • Okapi

    • QA-LaSIE

  • Evaluation Results

  • Confidence Measures and their Application

  • Conclusions and Discussion


Sheffield QA System Architecture

Overall objective is to use:

  • IR system as fast filter to select small set of documents with high relevance to query from the initial, large text collection

  • IE system to perform slow, detailed linguistic analysis to extract answer from limited set of docs proposed by IR system


Okapi

  • Used “off the shelf” – available from http://www.soi.city.ac.uk/research/cisr/okapi/okapi.html

  • Based on the probabilistic retrieval model (Robertson and Sparck Jones, 1976)

  • Used passage retrieval capabilities of Okapi

  • Passage retrieval parameters:

    • Min. passage: 1 para; Max. passage: 3 paras; Para step unit: 1

      arrived at by experimentation on TREC-8 data

  • Examined trade-offs between:

    • number of documents and “answer loss” :

      184/198 questions had answer in top 20 full docs; 160/198 in top 5

    • passage length and “answer loss” :

      only 2 answers lost from top 5 3-para passages


QA-LaSIE

  • Derived from LaSIE: Large Scale Information Extraction System

  • LaSIE developed to participate in the DARPA Message Understanding Conferences (MUC-6/7)

    • Template filling (elements, relations, scenarios)

    • Named Entity recognition

    • Coreference identification

  • QA-LaSIE is a pipeline of 9 component modules – first 8 are borrowed (with minor modifications) from LaSIE

  • The question document and each candidate answer document pass through all nine components

  • Key difference between MUC and QA task: IE template filling tasks are domain-specific; QA is domain-independent


QA-LaSIE Components

1. Tokenizer. Identifies token boundaries and text section boundaries.

2. Gazetteer Lookup. Matches tokens against specialised lexicons (place,person names, etc.). Labels with appropriate name categories.

3. Sentence Splitter. Identifies sentence boundaries in the text body.

4. Brill Tagger. Assigns one of the 48 Penn TreeBank part-of-speech tags to each token in the text.

5. Tagged Morph. Identifies the root form and inflectional suffix for tokens tagged as nouns or verbs.

6. Parser. Performs two-pass bottom-up chart parsing, first with a special named entity grammar, then with a general phrasal grammar. A “best parse” (possibly partial) is selected and a quasi-logical form (QLF) of each sentence is constructed.

For the QA task, a special grammar module identifies the “sought entity” of a question and forms a special QLF representation for it.


QA-LaSIE Components (cont)

7. Name Matcher. Matches variants of named entities across the text.

8. Discourse Interpreter. Adds the QLF representation to a semantic net containing background world and domain knowledge. Additional info inferred from the input is added to the model, and coreference resolution is attempted between instances mentioned in the text.

For the QA task, special code was added to find and score a possible answer entity from each sentence in the answer texts.

9. TREC-9 Question Answering Module. Examines the scores for each possible answer entity, and then outputs the top 5 answers formatted for each of the four submitted runs.

New module for the QA task.


QA in Detail (1): Question Parsing

Q: Who released the internet worm?

Question QLF:

qvar(e1), qattr(e1,name), person(e1),
release(e2), lsubj(e2,e1), lobj(e2,e3),
worm(e3), det(e3,the),
name(e4,'Internet'), qual(e3,e4)

Phrase structure rules are used to parse different question types and produce a quasi-logical form (QLF) representation which contains:

  • a qvar predicate identifying the sought entity

  • a qattr predicate identifying the property or relation whose value is sought for the qvar (this may not always be present)


QA in Detail (2):Sentence/Entity Scoring

Two sentence-by-sentence passes through each candidate answer text

  • Sentence Scoring:

    • Co-reference system from LaSIE discourse interpreter resolves coreferring entities both within answer texts and between answer and question texts.

    • Main verb in question matched to similar verbs in answer text

    • Each non-qvar entity in the question is a “constraint”, and candidate answer sentences get one point for each constraint they contain.


QA in Detail (2):Sentence/Entity Scoring (cont)

Entity Scoring: Each entity in each candidate answer sentence which was not matched to a term in the question at the sentence scoring stage receives a score based on:

  • semantic and property similarity to the qvar

  • whether it shares with the qvar the same relation to a matched verb (the lobj or lsubj relation)

  • whether it stands in a relation such as apposition, qualification or prepositional attachment to another entity in the answer sentence which was matched to a term in the question at the sentence scoring stage

    Entity scores are normalised in the range [0-1] so that they never outweigh a better sentence match


QA in Detail (2):Sentence/Entity Scoring (cont)

  • Total Score: For each sentence a total score is computed by

    • summing the sentence score and the “best entity score”

    • dividing by the number of entities in question + 1 (has no effect on answer outcome but normalises scores in [0-1] – useful for comparisons across questions)

  • Each sentence is annotated with

    • Total sentence score

    • “best entity”

    • “exact answer” = name attribute of best entity, if found
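The total-score computation above fits in a few lines; a minimal sketch (the function name is mine, and the worked numbers come from the example slide later in the talk):

```python
def cse_score(sentence_score, best_entity_score, n_question_entities):
    # Sentence score: one point per question constraint found in the
    # candidate sentence. The best entity score is already normalised
    # to [0, 1], so it can never outweigh a one-point difference in
    # sentence score. Dividing by (#question entities + 1) maps the
    # total into [0, 1], making scores comparable across questions.
    return (sentence_score + best_entity_score) / (n_question_entities + 1)

# Sentence score 2, best entity score 0.91, two non-qvar entities
# in the question ("worm", "Internet"): (2 + 0.91) / 3 = 0.97
```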


Question Answering in Detail: Answer Generation

  • The 5 highest scoring sentences from all 20 candidate answer texts were used as the basis for the TREC answer output

  • Results from 4 runs were submitted:

    • shef50ea – output the name of the best entity if available; otherwise output its longest realization in the text

    • shef50 – output the first occurrence of the best answer entity in the text – output the entire sentence or a 50 byte window around the answer, whichever is shorter

    • shef250 – same as shef50 but with a limit of 250 bytes

    • shef250p – same as shef250 but with extra padding from the surrounding text allowed to a 250 byte maximum
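The shef50/shef250 windowing rule can be sketched as follows (a sketch, not the system's code; the exact window placement is an assumption — here the window starts at the answer occurrence, which matches the example output later in the talk):

```python
def truncate_answer(sentence, answer_start, limit=50):
    """Return the whole sentence if it fits within `limit` bytes,
    otherwise a `limit`-byte window beginning at the answer entity."""
    if len(sentence) <= limit:
        return sentence
    # Clamp so the window never runs past the end of the sentence.
    start = min(answer_start, len(sentence) - limit)
    return sentence[start:start + limit]
```

With the example sentence from the talk, a 50-byte limit yields “Morris testified that he released the internet wor”, matching the shef50 output shown later.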


Question Answering in Detail: An Example

Q: Who released the internet worm?
A: Morris testified that he released the internet worm…

Question QLF:

qvar(e1), qattr(e1,name), person(e1),
release(e2), lsubj(e2,e1), lobj(e2,e3),
worm(e3), det(e3,the),
name(e4,'Internet'), qual(e3,e4)

Answer QLF:

person(e1), name(e1,'Morris'),
testify(e2), lsubj(e2,e1), lobj(e2,e6), proposition(e6), main_event(e6,e3),
release(e3), pronoun(e4,he), lsubj(e3,e4), worm(e5), lobj(e3,e5)

Answers:

Shef50ea: “Morris”
Shef50: “Morris testified that he released the internet wor”
Shef250: “Morris testified that he released the internet worm …”
Shef250p: “… Morris testified that he released the internet worm …”

Sentence Score: 2
Entity Score (e1): 0.91
Total (normalized): 0.97


Outline of Talk

  • Question Answering: A New Challenge in Information Retrieval

  • The TREC Question Answering Track

    • The Task

    • Evaluation Metrics

    • The Potential of NLP for Question Answering

  • The Sheffield QA System

    • Okapi

    • QA-LaSIE

  • Evaluation Results

  • Confidence Measures and their Application

  • Conclusions and Discussion


Evaluation Results

  • Two sets of results:

    • Development results on 198 TREC-8 questions

    • Blind test results on 693 TREC-9 questions

  • Baseline experiment carried out using Okapi only

    • Take top 5 passages

    • Return central 50/250 bytes
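The baseline is purely mechanical; a sketch (names are mine; the passage list is assumed to be already ranked by Okapi):

```python
def okapi_baseline(passages, limit=250):
    """Okapi-only baseline: for each of the top 5 ranked passages,
    return the central `limit` bytes (or the whole passage if shorter)."""
    answers = []
    for p in passages[:5]:
        if len(p) <= limit:
            answers.append(p)
        else:
            start = (len(p) - limit) // 2  # centre the window
            answers.append(p[start:start + limit])
    return answers
```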


Best Development Results on TREC-8 Questions


TREC-9 Results


TREC-9 50 Byte Runs


TREC-9 250 Byte Runs


Outline of Talk

  • Question Answering: A New Challenge in Information Retrieval

  • The TREC Question Answering Track

    • The Task

    • Evaluation Metrics

    • The Potential of NLP for Question Answering

  • The Sheffield QA System

    • Okapi

    • QA-LaSIE

  • Evaluation Results

  • Confidence Measures and their Application

  • Conclusions and Discussion


The Role of Confidence in QA Systems

  • Little discussion to date concerning usability of QA systems, as conceptualised in the TREC QA task

  • Imagine asking How tall is the Eiffel Tower? and getting answers:

    • 400 meters (URL …)

    • 200 meters (URL …)

    • 300 meters (URL …)

    • 350 meters (URL …)

    • 250 meters (URL …)

  • There are several issues concerning the utility of such output, but two crucial ones are

    • How confident can we be in the system’s output?

    • How confident is the system in its own output?


The Role of Confidence in QA Systems (cont)

  • That these questions are important to users (question askers) is immediately apparent from watching any episode of the ITV quiz show Who Wants to be a Millionaire?

  • Participants are allowed to “phone a friend” as one of their “lifelines”, when confronted with a question they cannot answer.

    Almost invariably they

    • Select a friend who they feel is most likely to know the answer – i.e. they attach an a priori confidence rating to their friend’s QA ability (How confident can we be in the system’s output?)

    • Ask their friend how confident they are in the answer they supply – i.e. they ask their friend to supply a confidence rating on their own performance

      (How confident is the system in its own output?)

  • MRR scores give an answer to the first question; however, to date there has been no exploration of the second


The Role of Confidence in QA Systems (cont)

  • QA-LaSIE associates a normalised score in the range [0-1] with each answer - the combined sentence/entity (CSE) score

    • can the CSE scores be treated as confidence measures?

  • To determine this, need to see if CSE scores correlate with answer correctness

    • Note this is also a test of whether the CSE measure is a good one

  • Have carried out an analysis of CSE scores for shef50ea and shef250 runs on the TREC-8 question set

    • Rank all proposed answers by CSE score

    • For 20, 10, and 5 equal subdivisions of the [0-1] CSE score range determine the % answers correct in that subdivision …
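The subdivision analysis takes only a few lines to reproduce; a sketch (function and variable names are mine, and the sample data below is illustrative, not the TREC-8 data):

```python
def bin_correctness(cse_scores, correct_flags, n_bins=5):
    """Split the [0, 1] CSE range into n_bins equal subdivisions and
    report (% correct, #answers) per subdivision; None where a
    subdivision holds no data points."""
    bins = [[0, 0] for _ in range(n_bins)]   # [correct, total] per bin
    for score, ok in zip(cse_scores, correct_flags):
        i = min(int(score * n_bins), n_bins - 1)   # score 1.0 -> last bin
        bins[i][0] += int(ok)
        bins[i][1] += 1
    return [(100.0 * c / t if t else None, t) for c, t in bins]
```

A positive confidence/correctness correlation shows up as the percentage rising with the subdivision index.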


Shef50ea: CSE vs. Correctness


Shef250: CSE vs. Correctness

Caveat: analysis based on unequal distribution of data points. For the .2 chunks:

Range     Data-points
0–.19     115
.2–.39    511
.4–.59    306
.6–.79    45
.8–1.0    5


Applications of Confidence Measures

  • The CSE/Correctness correlation (preliminarily) established above indicates the CSE measure is a useful measure of confidence

  • How can we use this measure?

    • Show it to the user – good indicator of how much faith they should have in the answer/whether they should bother following up the URL to the source document

    • In a more realistic setting, where not every question can be assumed to have an answer in the text collection, CSE score may suggest a threshold below which “no answer” should be returned

      • proposal for TREC-10
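A sketch of that thresholding idea (the 0.2 cut-off is a hypothetical value to be tuned on development data, and "NIL" here simply stands for the no-answer response):

```python
NO_ANSWER_THRESHOLD = 0.2   # hypothetical cut-off, to be tuned on dev data

def answer_or_abstain(answer, cse_score, threshold=NO_ANSWER_THRESHOLD):
    # Below the confidence threshold the system abstains ("NIL")
    # rather than returning a low-confidence guess.
    return answer if cse_score >= threshold else "NIL"
```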


Outline of Talk

  • Question Answering: A New Challenge in Information Retrieval

  • The TREC Question Answering Track

    • The Task

    • Evaluation Metrics

    • The Potential of NLP for Question Answering

  • The Sheffield QA System

    • Okapi

    • QA-LaSIE

  • Evaluation Results

  • Confidence Measures and their Application

  • Conclusions and Discussion


Conclusions and Discussion

  • TREC-9 test results represent a significant drop with respect to the best training results

    • But, much better than TREC-8, vindicating the “looser” approach to matching answers

  • QA-LaSIE scores better than Okapi-baseline, suggesting NLP is playing a significant role

    • But, a more intelligent baseline (e.g. selecting answer passages based on word overlap with query) might prove otherwise

  • Computing confidence measures provides some support that our objective scoring function is sensible. They can be used for

    • User support

    • Helping to establish thresholds for “no answer” response

    • Tuning parameters in the scoring function (ML techniques?)


Future Work

  • Failure analysis

    • Okapi – for how many questions were no documents containing an answer found?

    • Question parsing – how many question forms were unanalysable?

    • Matching procedure – where did it break down?

  • Moving beyond word root matching – using Wordnet?

  • Building an interactive demo to do QA against the web – Java applet interface to Google + QA-LaSIE running in Sheffield via CGI

    • Gets the right answer to the million £ question “Who was the husband of Eleanor of Aquitaine?” !


THE END
