Task 1: Intrinsic Evaluation


Presentation Transcript


  1. Task 1: Intrinsic Evaluation
  Vasile Rus, Wei Chen, Pascal Kuyten, Ron Artstein

  2. Task definition
  • Only interested in info-seeking questions
  • Evaluation biased towards current technology
  • Asking for the “trigger” text is problematic:
    • Future QG systems may not employ a trigger
    • Trigger less important for deep/holistic questions
  • Need to define what counts as QG:
    • Would mining for questions be acceptable?
    • Require a generative component? (defined how?)
    • Internal representation? Structure?

  3. Evaluation criteria
  • Evaluate question alone, or question+answer?
    • System provides question
    • Evaluator decides if answer is available
    • Separately, evaluate system answer if given
  • Answer = contiguous text?
    • Can this be relaxed?
  • Additional criteria: conciseness?

  4. Annotation guidelines
  • Question type: need a more detailed definition
  • Yao et al (submitted):
    • “What” category includes (what|which) (NP|PP)
    • Question type identified mechanically with ad-hoc rules
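
A rough illustration of what such mechanical, ad-hoc typing rules could look like (a hypothetical Python sketch, not the actual rule set of Yao et al.; a real implementation would also check for a following NP or PP rather than only the leading words):

    import re

    # Hypothetical surface-pattern rules for coarse question typing.
    # The (what|which) pattern mirrors the "What" category described above.
    QUESTION_TYPE_RULES = [
        (re.compile(r"^(what|which)\b", re.IGNORECASE), "What"),
        (re.compile(r"^who(m|se)?\b", re.IGNORECASE), "Who"),
        (re.compile(r"^when\b", re.IGNORECASE), "When"),
        (re.compile(r"^where\b", re.IGNORECASE), "Where"),
        (re.compile(r"^why\b", re.IGNORECASE), "Why"),
        (re.compile(r"^how\b", re.IGNORECASE), "How"),
        (re.compile(r"^(is|are|was|were|do|does|did|can|could|will|would|should)\b",
                    re.IGNORECASE), "Yes/No"),
    ]

    def question_type(question: str) -> str:
        """Assign a coarse question type using ad-hoc surface patterns."""
        text = question.strip()
        for pattern, qtype in QUESTION_TYPE_RULES:
            if pattern.match(text):
                return qtype
        return "Other"

    if __name__ == "__main__":
        print(question_type("Which planet is closest to the sun?"))  # -> What
        print(question_type("Did the experiment succeed?"))          # -> Yes/No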

  5. Terminology
  • For the QG from sentences task:
    • “Ambiguity” is really specificity or concreteness
    • “Relevance” is really answerability

  6. Rating disagreements
  • Many (most?) of the disagreements are between close ratings (e.g. 3 vs. 4)
    • Need a measure that considers magnitudes, such as Krippendorff’s α
    • Perhaps normalize ratings by rater?
  • Specific disagreement on in-situ questions (e.g. “The codes are not what?”)
    • Needs to be addressed in the guidelines
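
A minimal, self-contained sketch of the magnitude-sensitive measure suggested above, assuming complete data (every rater rates every item) and an interval distance metric; the per-rater normalization helper is likewise only an illustration:

    import itertools
    import numpy as np

    def krippendorff_alpha_interval(ratings):
        """Krippendorff's alpha with an interval (squared-difference) metric.

        ratings: array of shape (n_items, n_raters) with no missing values,
        e.g. 1-4 quality scores. Close ratings (3 vs. 4) are penalised far
        less than distant ones (1 vs. 4), unlike nominal agreement measures.
        """
        ratings = np.asarray(ratings, dtype=float)
        n_items, n_raters = ratings.shape
        n_values = n_items * n_raters

        # Observed disagreement: squared differences within each item.
        d_obs = sum((a - b) ** 2
                    for row in ratings
                    for a, b in itertools.combinations(row, 2))
        d_obs *= 2.0 / (n_values * (n_raters - 1))

        # Expected disagreement: squared differences over all value pairs.
        flat = ratings.ravel()
        d_exp = sum((a - b) ** 2 for a, b in itertools.combinations(flat, 2))
        d_exp *= 2.0 / (n_values * (n_values - 1))

        return 1.0 - d_obs / d_exp

    def normalize_by_rater(ratings):
        """Z-score each rater's column to remove individual leniency/scale bias."""
        ratings = np.asarray(ratings, dtype=float)
        return (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)

    if __name__ == "__main__":
        scores = [[3, 4], [4, 4], [2, 3], [1, 2], [4, 3]]  # two raters, 1-4 scale
        print(round(krippendorff_alpha_interval(scores), 3))
        print(round(krippendorff_alpha_interval(normalize_by_rater(scores)), 3))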

  7. New tasks
  • Replace QG from sentences with QG from metadata
    • Evaluates only the generation component
    • Finding things to ask remains a component of the QG from paragraphs task
  • Make all system results public for analysis
    • Required? Voluntary?
    • Use data to learn from others’ problems
