Overspecified reference in hierarchical domains: measuring the benefits for readers Ivandre Paraboni * Judith Masthoff # Kees van Deemter # * = University of Sao Paulo # = University of Aberdeen
What this is about • Generation of Referring Expressions (GRE) • Referring expression is overspecified if a clear referring expression can be obtained by removing a property • Informally: overspecified = logically redundant
Introduction to the problem Suppose • I live on Western Road, the longest street in Aberdeen • I live at number 968. No other house in Aberdeen has that number “Number 968, Aberdeen” is a distinguishing description, but it’s not very useful It’s better to add logically redundant information, e.g., “968 Western Road, Aberdeen” , or even “968 Western Road, Bon Accord, Aberdeen”
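The definition of overspecification and the house example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy domain (every house except number 968 on Western Road) is invented, and the function names `referents`, `is_distinguishing` and `is_overspecified` are hypothetical, not part of any GRE algorithm discussed in the talk.

```python
# Toy domain: entity name -> properties. All houses other than "my_house"
# are invented purely to make the example non-trivial.
DOMAIN = {
    "my_house":  {"number": 968, "street": "Western Road", "city": "Aberdeen"},
    "neighbour": {"number": 966, "street": "Western Road", "city": "Aberdeen"},
    "other":     {"number": 12,  "street": "Union Street", "city": "Aberdeen"},
    "elsewhere": {"number": 968, "street": "High Street",  "city": "Dundee"},
}


def referents(description, domain):
    """Return the names of all entities that match every property."""
    return [
        name
        for name, props in domain.items()
        if all(props.get(attr) == value for attr, value in description.items())
    ]


def is_distinguishing(description, target, domain):
    """A description is distinguishing if it matches the target and nothing else."""
    return referents(description, domain) == [target]


def is_overspecified(description, target, domain):
    """True if dropping some single property still leaves a distinguishing description."""
    if not is_distinguishing(description, target, domain):
        return False
    for attr in description:
        reduced = {a: v for a, v in description.items() if a != attr}
        if is_distinguishing(reduced, target, domain):
            return True
    return False


minimal = {"number": 968, "city": "Aberdeen"}
redundant = {"number": 968, "street": "Western Road", "city": "Aberdeen"}
```

In this toy domain `minimal` is distinguishing but not overspecified, while `redundant` is overspecified because the street can be removed without losing uniqueness. The sketch says nothing about which of the two is more useful to a reader, which is the question the talk addresses.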
Overspecification in referring expressions • Any GRE algorithm that does not achieve “Full Brevity” (Dale 1989) • Investigated in its own right by e.g. • Arts 2004 (role of location; purely empirical) • Jordan 2000 (overspec in specific situations, e.g., when a sale is confirmed) • Horacek 2005 (overspec when there is uncertainty about applicability of properties)
Our focus: • The need for overspecification when a large domain is not fully known in advance to a hearer. Typical examples involve space or time: • A house in a city, a photocopier in a building, a picture in a document • (An event or object in time, e.g., ‘the minister of the colonies in the XYZ government’) • This talk: empirical validation of algorithms
Caveat • Overspecification can make it easier to identify the referent ... • ... but it is bound to lengthen reading times • Our terminology: we expect overspecification • to make interpretation harder • to make resolution easier
Short history ... Paraboni & van Deemter (INLG-2002): • A simple theory of the way in which hearers perform search. Ancestral Search (AS) • Two types of situations that AS predicts to be problematic for hearers: Lack of Orientation (LO) and Dead End (DE). • An algorithm (in two flavours) that adds redundant information when AS predicts these problems • An experiment to test whether these algorithms improve the output of GRE
(1) Lack of Orientation (LO) University of Brighton Watts building Cockcroft building North Wing South Wing North West South auditorium library library “the West Wing”
(2) Dead End (DE) University of Brighton Watts building Cockcroft building North Wing ? South Wing North West South auditorium library library “the library in the North Wing”
Explanation (informal!) • Why are LO and DE bad? • Ancestral Search (AS): “Search locally, then one level up at a time” • Essentially, this is just salience (cf. Krahmer & Theune 2000) applied to hierarchies
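The slogan "search locally, then one level up at a time" can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Node` class, the `ancestral_search` function, the matching on a single `kind` label, and the exact placement of rooms in the toy hierarchy (loosely modelled on the university example) are hypothetical and not taken from the original algorithm.

```python
class Node:
    """A node in a hierarchical domain (hypothetical helper class)."""

    def __init__(self, name, kind, parent=None):
        self.name = name
        self.kind = kind
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def subtree(node):
    """Yield the node itself and everything below it, depth first."""
    yield node
    for child in node.children:
        yield from subtree(child)


def ancestral_search(start, kind):
    """Search the subtree at `start`; if nothing matches, move one level up
    and search again. Returns (match, levels_climbed), or (None, levels)."""
    scope = start
    levels = 0
    while scope is not None:
        for candidate in subtree(scope):
            if candidate.kind == kind:
                return candidate, levels
        scope = scope.parent
        levels += 1
    return None, levels


# Toy hierarchy; room placement is invented for this example.
campus = Node("University of Brighton", "campus")
watts = Node("Watts building", "building", campus)
cockcroft = Node("Cockcroft building", "building", campus)
watts_north = Node("North Wing", "wing", watts)
watts_south = Node("South Wing", "wing", watts)
cockcroft_north = Node("North Wing", "wing", cockcroft)
auditorium = Node("auditorium", "auditorium", watts_north)
library = Node("library", "library", cockcroft_north)
```

A reader located in the Watts North Wing finds the auditorium without leaving the wing (`levels_climbed` is 0), but has to climb two levels, to the whole campus, before a library is found. The more levels the search has to climb, the more a description that names the enclosing building or wing is likely to help.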
Summary of Experiment 1: Descriptions compared by subjects • 15 subjects were shown documents from which most of the words were deleted • Binary forced choice between two expressions that refer to document parts: • the obvious minimal description • the redundant description generated by our algorithm
Hypotheses & Outcomes • Hyp 1: In problematic situations, redundant descriptions are preferred • Hyp 2: In non-problematic situations, non-redundant descriptions are preferred • Outcomes: • Hyp 1: overwhelmingly confirmed • Hyp 2: trend in the right direction (57%), but not statistically significant. (Too few subjects?)
Limitations of first experiment • This experiment was hybrid: partly about reading, partly about writing • It did not teach us why redundant descriptions were preferred (in problematic cases) • We think this was because non-redundant descriptions caused problems for resolution ... • ... but the experiment did not address resolution separately. (Subjects may have balanced interpretation and resolution when judging).
What next? • Therefore, a new experiment was called for, which addresses resolution only. • Documents as our domain again • Add hyperlinks to support non-linear search through the document • Track readers’ resolution (i.e., search) process • Intricate experiment, hence a new author: Judith Masthoff (University of Aberdeen)
Experiment 2: Tracking resolution • Effect of logical redundancy on the performance of readers • Focussing on resolution
Experimental Design • 40 subjects completed experiment • Within-subjects design: each subject shown 20 documents • Order of documents randomized • Documents were made to look different • Reader had knowledge of hierarchical structure • Reader was given task: “Please click on..” • Navigation actions recorded
Reader Location “Let’s talk about helicopters. Please click on picture 4 in part C”
Hypothesis 1 • In a problematic (DE/LO) situation, the number of navigation actions required for a long (FI/SL) description is smaller than that required for a minimal description. • Informally: redundancy helps resolution! (in problematic situations)
But ... • it seems likely that redundant information will always help resolution • so let’s compare the “Gain” in problematic/unproblematic situations
Hypothesis 2 • The Gain achieved by a long description over a minimal description will be larger in a problematic situation than in a non-problematic situation • Informally: redundancy helps especially in problematic situations
But ... • Even more redundancy might have helped even more • The obvious candidate: a complete description • Compare cases where our algorithm prescribes a complete description with ones where it does not. • We want b to be greater than a, where: • a = Gain(complete-description, incomplete-description-generated-by-algorithm) • b = Gain(complete-description-generated-by-algorithm, incomplete-description)
Hypothesis 3 • The Gain of a complete description over a less complete one will be larger for a situation in which our algorithms generated the complete description, than for a situation in which our algorithms generated the less complete description.
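One way to write the three hypotheses down compactly is given below. The notation is an assumption: the slides do not define Gain formally, so here N(d, s) stands for the mean number of navigation actions a reader needs to resolve description d in situation s, and Gain is taken to be the difference in that number.

```latex
% Assumed notation: N(d, s) = mean number of navigation actions for
% description d in situation s. The definition of Gain as a difference
% is an assumption, not stated on the slides.
\[
  \mathrm{Gain}_{s}(d_{\text{long}}, d_{\text{short}})
    = N(d_{\text{short}}, s) - N(d_{\text{long}}, s)
\]
\begin{align*}
  \text{H1:}\quad & N(d_{\text{long}}, s_{\text{prob}}) < N(d_{\text{min}}, s_{\text{prob}}) \\
  \text{H2:}\quad & \mathrm{Gain}_{s_{\text{prob}}}(d_{\text{long}}, d_{\text{min}})
                    > \mathrm{Gain}_{s_{\text{nonprob}}}(d_{\text{long}}, d_{\text{min}}) \\
  \text{H3:}\quad & b > a
\end{align*}
```

Here a is the Gain of a complete description over the incomplete description that the algorithm generated, and b is the Gain of an algorithm-generated complete description over an incomplete one.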
Results: Hypothesis 1 Do redundant descriptions benefit problematic situations? Yes!
Results: Hypothesis 2 Do redundant descriptions benefit problematic situations MORE than non-problematic situations?
Comparing like with like • General Linear Model (GLM) with repeated measures • Comparison of similar situations, e.g. situations 2 and 7 • sit2 & 7: minimal = “pic.3 in part A”, redundant = “pic.3 in part A of section 2” • sit2: reader is in the same section as the target • sit7: reader is in a different section
Results: Hypothesis 2 Do redundant descriptions benefit problematic situations MORE than non-problematic situations? Yes!
Results: Hypothesis 3 Are our algorithms economical with redundancy? Yes!
How much overspecification is optimal? University of Brighton Watts building Cockcroft building North Wing South North West South auditorium library library “The auditorium” “The .... on this campus” “The .... in the Watts building” “The ...in the North Wing”
Which of all these descriptions is best? • Depends on issues other than the structure of the domain, e.g., • how much time/space does the speaker/writer have available? • how important is it that misunderstandings are avoided? [cf. Van Deemter et al., this conference] • is there room for negotiation through dialogue? [cf. Khan et al., this conference]
In setting of this experiment • We did not find a point beyond which overspecification backfires • We did find a point of “diminishing returns” for resolution speed • Given that interpretation deteriorates with every added property, the figures are suggestive
Getting a feeling for the numbers
• Nonproblematic situations (situations 7 and 8):
  • short description: 1.53 clicks (2 properties)
  • redundant (other): 1.34 clicks (3 properties)
• Problematic situations (situations 3 and 4):
  • short description: 4.05 clicks (1 property)
  • redundant (algorithm): 1.77 clicks (2 properties)
  • redundant (other): 1.31 clicks (3 properties)
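The click counts above can be turned into gains with some simple arithmetic. The sketch below assumes that Gain is the difference in mean clicks between the shorter and the longer description. That definition, the dictionary keys and the function name `gain` are assumptions made for illustration, and the result is descriptive arithmetic only, not the statistical analysis reported in the talk.

```python
# Mean number of clicks, copied from the slide.
CLICKS = {
    ("nonproblematic", "short"): 1.53,
    ("nonproblematic", "redundant_other"): 1.34,
    ("problematic", "short"): 4.05,
    ("problematic", "redundant_algorithm"): 1.77,
    ("problematic", "redundant_other"): 1.31,
}


def gain(situation, longer, shorter):
    """Clicks saved by the longer description (assumed definition of Gain),
    rounded to two decimals to avoid floating-point noise."""
    return round(CLICKS[(situation, shorter)] - CLICKS[(situation, longer)], 2)


# First added property in a problematic situation: 4.05 - 1.77 clicks.
first_step = gain("problematic", "redundant_algorithm", "short")
# Second added property in a problematic situation: 1.77 - 1.31 clicks.
second_step = gain("problematic", "redundant_other", "redundant_algorithm")
# Added property in a non-problematic situation: 1.53 - 1.34 clicks.
nonproblematic_step = gain("nonproblematic", "redundant_other", "short")
```

Under these assumptions the property added by the algorithm saves 2.28 clicks, the next property saves a further 0.46, and the extra property in the non-problematic situations saves 0.19. This is the pattern of diminishing returns that the slides describe.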
Conclusion • Overspec can have many reasons (Jordan 2000, Horacek 2005) • Overspec isn’t always equally necessary • Focus on overspec for guiding “resolution” • The optimum amount of overspec is hard to determine • But we have found a point of diminishing returns, based on the need to avoid DE and LO.
[ A medical comparison • A hospital with two types of patients, all of whom have coughing (cf., clicking!) as their main symptom • chest infections (serious patients) • throat infections (light patients) • you can administer 1, 2, or 3 pills (cf., properties). But pills can be harmful, so the doctor uses them sparingly
The doctor’s regime: • light patients should get 1 pill • serious patients should get 2 pills on a normal night, and 3 pills on a bad night Is this a wise regime? Tests were done ...
Test of effectiveness of pills • Serious patients who get their 2 or 3 pills start coughing less • Serious patients benefit more from getting their prescribed high number of pills (as opposed to just 1) than light patients • Focus on serious patients. Try giving the ones that are having a good night 3 pills (i.e. one more than prescribed). They benefit less (from getting 3 instead of 2 pills) than the ones that are having a bad night benefitted (from getting 3 instead of 2 pills). ]
Results on Search Behaviour # Deviations from Ancestral Search in first navigation action for 12 documents with incomplete descriptions