Improving Search through Corpus Profiling

Improving Search through Corpus Profiling Bonnie Webber, School of Informatics, University of Edinburgh, Scotland

Original motivation • PhD research (Michael Kaisser) on using lexical resources (FrameNet, PropBank, VerbNet) to improve performance in QA • Developed two methods [Kaisser & Webber, 2007] • Evaluation on Web and AQUAINT corpus produced significantly different results. • Other research where same methods on same input produce significantly different results on different corpora.

FrameNet Example of annotated FrameNet data: (Screenshots from framenet.icsi.berkeley.edu)

Two QA methods Method 1 • Use resources to generate templates in which answer might be found. • Project templates onto quoted strings used directly as search queries. Method 2 • Use resources to generate dependency structures in which answer might occur. • Search on lexical co-occurrence. • Filter results by comparing structure of candidate sentences with the structure of the annotated resource sentences.

Method 1 Example: “Who purchased YouTube?”

Method 1 Extract simplified dependency structure from question using MiniPar: head: purchase.v head\subj: “Who” head\obj: “YouTube”

Method 1 Get annotated sentences from FrameNet for purchase.v:

Method 1 Use MiniPar to associate annotated abstract frame structure with dependency structure: Buyer[Subject, NP] VERB Goods[Object, NP]

Method 1 Buyer[Subject, NP] VERB Goods[Object, NP] head=purchase.V, Subject=“Who”, Object=“YouTube” Buyer[ANSWER] purchase.V Goods[“YouTube”]
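
The role-filling step on this slide can be sketched roughly as follows. This is a minimal illustration, not the original system: the dictionary layout, the `WH_WORDS` set and the function name `instantiate` are all assumptions made for the example.

```python
# Hypothetical data layout; only the role names, the head verb and the
# question phrases come from the slide.
WH_WORDS = {"who", "what", "which", "when", "where"}

def instantiate(frame_pattern, question):
    """Fill each frame role from the question's grammatical slots.

    frame_pattern maps a frame-element name to a grammatical function,
    e.g. {"Buyer": "Subject", "Goods": "Object"}.
    question holds the head verb and the phrase found in each slot.
    A wh-word in a slot marks the role whose filler is the answer.
    """
    filled = {}
    for role, function in frame_pattern.items():
        phrase = question["slots"].get(function)
        if phrase is None:
            continue  # the question does not mention this role
        filled[role] = "ANSWER" if phrase.lower() in WH_WORDS else phrase
    return question["head"], filled

frame_pattern = {"Buyer": "Subject", "Goods": "Object"}
question = {"head": "purchase.v",
            "slots": {"Subject": "Who", "Object": "YouTube"}}

head, roles = instantiate(frame_pattern, question)
print(head, roles)
```

Here the wh-word in subject position turns the Buyer role into the answer slot, while Goods keeps its surface string from the question.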

Method 1 Generate potential answer templates: • ANSWER[NP] purchased YouTube • ANSWER[NP] (has|have) purchased YouTube • ANSWER[NP] had purchased YouTube • YouTube (has|have) been purchased by ANSWER[NP] • ...

Method 1 • Use patterns to generate quoted strings as search queries: "YouTube has been purchased by" • Extract sentences from snippets. • Parse sentences. • If structures match, extract answer: “YouTube has been purchased by Google for $1.65 billion.”
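
One way to turn answer templates into quoted search strings is sketched below. It assumes templates are plain strings using the `ANSWER[NP]` placeholder and at most one `(a|b)` alternation, as in the examples on the previous slide; the function name `quoted_queries` is made up for the illustration.

```python
import re

def quoted_queries(templates):
    """Turn answer templates into quoted search strings.

    An alternation such as (has|have) is expanded into separate variants,
    and the ANSWER[NP] placeholder is dropped, leaving only the literal
    words that should surround the answer.
    """
    queries = []
    for template in templates:
        # Expand the first alternation group, if the template has one.
        match = re.search(r"\(([^()]+)\)", template)
        if match:
            variants = [template[:match.start()] + alt + template[match.end():]
                        for alt in match.group(1).split("|")]
        else:
            variants = [template]
        for variant in variants:
            literal = variant.replace("ANSWER[NP]", "")
            literal = re.sub(r"\s+", " ", literal).strip()
            queries.append(f'"{literal}"')
    return queries

templates = [
    "ANSWER[NP] purchased YouTube",
    "ANSWER[NP] (has|have) purchased YouTube",
    "ANSWER[NP] had purchased YouTube",
    "YouTube (has|have) been purchased by ANSWER[NP]",
]
queries = quoted_queries(templates)
for q in queries:
    print(q)
```

The sketch stops at query generation; sentence extraction, parsing and structure matching are not modelled here.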

Method 1 (extended) Create additional paraphrases using all verbs in original frame & verbs identified through inter-frame relations: • ANSWER[NP] bought YouTube • YouTube was sold to ANSWER[NP]

Method 1 (Web-based evaluation) • Accuracy results on 264 (of 500) TREC 2002 questions whose head verb is not "be":

Method 1 (further extension) FN often gives ‘interesting’ examples rather than common ones. So • assume (as default) that verbs display common patterns: • Intransitive: [ARG0] VERB • Transitive: [ARG0] VERB [ARG1] • Ditransitive: [ARG0] VERB [ARG1] [ARG2] • And if one of these patterns is observed in Q that isn’t among those found in FN, just add it.

Method 1 • Method 1 and its extensions all lead to clear improvements in QA over the web, • But they may be losing answers by finding only exact string matches. • “YouTube was recently purchased by Google for $1.65 billion.” Method 2 addresses this.

Method 2 Associates each annotated sentence in FN and PB with a set of dependency paths from the head to each of the frame elements. “The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”. • head: “purchase”, path = /i • ARG0: paths = {./s, ./subj,} • ARG1: paths = {./obj} • TMP: paths = {./mod}

Method 2 • Question analysis: Same as Method 1. • Search based on key words from question: purchased YouTube (no quotes) • Sentences are extracted from the returned snippets, e.g.: “Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.” • Dependency parse produced for each extract.

Method 2 Eight tests comparing dependency paths: 1a Do the candidate and example sentences share the same head verb? 1b Do the candidate and example sentences share the same path to the head? 2a In the candidate sentence, do we find one or more of the example’s paths to the answer role? 2b In the candidate sentence, do we find all of the example’s paths to the answer role?

Method 2 3a Can some of the paths for the other roles be found in the candidate sentence? 3b Can all of the paths for the other roles be found in the candidate sentence? 4a Do the surface strings of the other roles partially match those of the question? 4b Do the surface strings of the other roles completely match those of the question?

Method 2 • Each sentence that passes steps 1a and 2a is assigned a weight of 1. (Otherwise 0.) • For each of the remaining tests that succeeds, that weight is multiplied by 2.
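
The weighting rule can be written out as a small function. This is a sketch under the assumption that test outcomes are available as a set of test labels; the names `REQUIRED`, `OPTIONAL` and `score` are illustrative.

```python
REQUIRED = ("1a", "2a")                          # must both pass
OPTIONAL = ("1b", "2b", "3a", "3b", "4a", "4b")  # each pass doubles the weight

def score(passed):
    """Return the weight of a candidate sentence given the tests it passed."""
    if not all(test in passed for test in REQUIRED):
        return 0
    weight = 1
    for test in OPTIONAL:
        if test in passed:
            weight *= 2
    return weight

# Hypothetical outcome: both required tests plus three of the optional ones.
print(score({"1a", "2a", "1b", "3a", "4a"}))
```

Under this rule a sentence that passes only 1a and 2a scores 1, each additional test doubles the score, and a sentence passing all eight tests scores 64.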

Method 2 Annotated frame sentence (from PropBank): “The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”. Candidate sentence retrieved from the Web: “Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.” N.B. Object rel clause - string match would fail.

Method 2 Candidate sentence: • head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i • phrase: “Google”, paths = {./s, ./subj} • phrase: “which”, paths = {./obj} • phrase: “YouTube”, paths = {\i\rel} • phrase: “for more than $1 billion”, paths = {./mod} PropBank example sentence: • head: “purchase”, path = /i • ARG0: “The Soviet Union”, paths = {./s, ./subj} • ARG1: “roughly eight million tons of grain”, paths = {./obj} • TMP: “this month”, paths = {./mod}

Method 2 Candidate sentence: • head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i • phrase: “Google”, paths = {./s, ./subj} • phrase: “which”,paths = {./obj} • phrase: “YouTube”, paths = {\i\rel} • phrase: “for more than $1 billion”, paths = {./mod} The results of the tests are: This sentence returns the answer “Google”, to which a score of 8 is assigned.

Method 2 Candidate sentence: • head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i • phrase: “Google”, paths = {./s, ./subj} • phrase: “which”, paths = {./obj} • phrase: “YouTube”, paths = {../..} • phrase: “for more than $1 billion”, paths = {./mod} We get a (partially correct) role assignment: • ARG0: “Google”, paths = {./s, ./subj} • ARG1: “which”, paths = {./obj} • TMP: “for more than $1 billion”, paths = {./mod}
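
The role assignment on this slide can be reproduced with simple set comparison over the path sets shown. This is only a sketch: representing paths as Python sets of strings and assigning a role when all of the example's paths are found (in the spirit of tests 2b and 3b) are assumptions of the illustration, not a description of the original implementation.

```python
def all_paths_found(example_paths, candidate_paths):
    """True if every path of the example role occurs for the candidate phrase."""
    return set(example_paths) <= set(candidate_paths)

# Path sets from the PropBank example sentence for "purchase".
example = {
    "ARG0": {"./s", "./subj"},
    "ARG1": {"./obj"},
    "TMP": {"./mod"},
}

# Path sets observed for the phrases of the candidate sentence.
candidate = {
    "Google": {"./s", "./subj"},
    "which": {"./obj"},
    "YouTube": {"../.."},
    "for more than $1 billion": {"./mod"},
}

assignment = {}
for role, role_paths in example.items():
    for phrase, phrase_paths in candidate.items():
        if all_paths_found(role_paths, phrase_paths):
            assignment[role] = phrase

print(assignment)
```

The result is only partially correct in the same way as on the slide: ARG1 is bound to the relative pronoun “which” rather than to “YouTube”, and the price phrase lands in the TMP slot.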

Method 2 Evaluation results for method 2: PropBank outperforms FrameNet because: • More lexical entries in PropBank • More example sentences per entry in PropBank • FrameNet does not annotate peripheral adjuncts

Evaluation 21% improvement on the 264 non-’be’ TREC 2002 questions, when used on the web.

Problem Similar levels of improvement were not found when applied directly to the AQUAINT corpus, using the exact same methods.

Not an isolated case • Across 9 different IR models, [Iwayama et al, 2003] found similar differences when posing the same queries to • a corpus of Japanese patent applications (full text) • a corpus of Japanese newspaper articles But they don’t speculate on the reason for these results.

What makes for such differences? • In Kaisser’s case, the form in which information appears in the corpus may match neither the question nor any form derivable from it via FrameNet, PropBank or VerbNet.

What year was Alaska purchased? • On March 30, 1867, U.S. Secretary of State William H. Seward reached agreement with Russia to purchase the territory of Alaska for $7.2 million, a deal roundly ridiculed as Seward's Folly. (APW20000329.0213) • But by 1867, when Secretary of State William H. Seward negotiated the purchase of Alaska from the Russians, sweetheart deals like that weren't available anymore.' (NYT19980915.0275)

Hypothesis • Profiling a corpus and adapting search to its characteristics can improve performance in IR and QA. • Neither new nor surprising: “Genre, like a range of other non-topical features of documents, has been under-exploited in IR algorithms to date, despite the fact that we know that searchers rely heavily on such features when evaluating and selecting documents” [Freund et al, 2006]. • Also cf. [Argamon et al, 1998; Finn & Kushmerick, 2006; Karlgren 2004; Kessler et al, 1997]

What basis for profiling? • Documents can be characterised in terms of • genre • register • domain • These in turn implicate • lexical choice • syntactic choice • choice of referring expression • structural choices at the document level • formatting choices

Definitions • Genre, register, domain are not completely independent concepts. • Genre: A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form [Orlikowski & Yates, 1994]. • Register: Generalized stylistic choices due to situational features such as audience and discourse environment [Morato et al., 2003]. • Domain: The knowledge and assumptions held by members of a (professional) community.

Assumptions • In IR, it seems worth characterizing documents directly as to genre (and possibly register). • Doing so automatically requires characterising inter alia significant linguistic features. • For QA, further benefits will come from profiling the lexical, syntactic, referential, structural and formatting consequences of genre, register and domain, and exploiting these features directly.

Direct use of genre • [Freund et al, 2006], [Yeung et al, 2007] • Analysed behavior of software engineering consultants looking for documents they need in order to provide technical services to customers using the company’s software product. • A range of genres identified through both user interviews and analysis of the websites and repositories they used

Direct use of genre • Manuals • Presentations • Product documents • Technotes, tips • Tutorials and labs • White papers • Best practices • Design patterns • Discussions/forums • ….

Direct use of genre • Requires manually labelling each document with its genre or recognizing its genre automatically. • The latter requires characterising genres in terms of automatically recognizable features. • Best practice: Description of a proven methodology or technique for achieving a desired result, often based on practical experience. • Form: primarily text, many formats, variable length • Style: imperatives, “best practice” • Subject matter: new technologies, design, coding

Direct use of genre (X-Site) • Prototype workplace search tool for software engineers currently in use [Yeung et al, 2007]. • Provides access to ~8GB of content crawled from the Internet, intranet and Lotus Notes data. • Exploits • Task profiles • Task-genre associations: known +/_/- relationships between task and genre pairs • Automatic genre classifier

Using genre, register, domain in QA • Answers to Qs can be found anywhere, not just in documents on the specific topic. Q: When Did the Titanic Sink? Twelve months have passed since 193 people died aboard the Herald of Free Enterprise. But time has not eased the pain of Evelyn Pinnells, who lost two daughters when the ferry capsized off Belgium. They were among the victims when the Herald of Free Enterprise capsized off the Belgian port of Zeebrugge on March 6, 1987. It was the worst peacetime disaster involving a British ship since the Titanic sank in 1912.

Using genre, register, domain in QA • For this reason, IR for QA differs from general IR, using (instead) passage retrieval, quoted strings, etc. • For the same reason, one may not want to prefilter documents by genre, register or domain labels (as seems useful for IR). • Rather, it may be beneficial to exploit features of and patterns in the linguistic features that realize genre, register and domain. • What are those features?

Lexical features • Register strongly affects word choice: • MedLinePlus: “runny nose” • PubMed: “rhinitis”, “nasopharyngeal symptoms” • Clinical notes: “greenish sputum” • UMLS: Informal “greenish” doesn’t appear [Bodenreider & Pakhomov, 2004] • Domain also affects word choice: • “smoltification” occurs ~600 times in a corpus of 1000 papers on salmon, but not at all in AQUAINT [Gabbay & Sutcliffe, 2004].

Lexical features • Register strongly affects type/token ratios: • Only 850 core words (+ inflections) in Basic English, so type/token ratio is very small. • Federalist papers: ~0.36 King James Version: And God said, Let the waters bring forth abundantly the moving creature that hath life, and fowl that may fly above the earth in the open firmament of heaven. Bible in Basic English: And God said, Let the waters be full of living things, and let birds be in flight over the earth under the arch of heaven.
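
A type/token ratio is the number of distinct word forms divided by the total number of word tokens. The sketch below computes it for the two verses quoted on this slide; the tokenisation (lowercased alphabetic strings) is an assumption, and a ratio from a single verse is not comparable with a corpus-level figure such as the one given for the Federalist papers, since the ratio falls as a text gets longer.

```python
import re

def type_token_ratio(text):
    """Distinct word forms divided by total word tokens (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

kjv = ("And God said, Let the waters bring forth abundantly the moving "
       "creature that hath life, and fowl that may fly above the earth "
       "in the open firmament of heaven.")
bbe = ("And God said, Let the waters be full of living things, and let "
       "birds be in flight over the earth under the arch of heaven.")

print(round(type_token_ratio(kjv), 2))
print(round(type_token_ratio(bbe), 2))
```

For a fair comparison between corpora, the same function would be applied to samples of equal length drawn from each corpus.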

Lexical features • IR4QA using either keywords or quoted strings for passage retrieval could benefit from responding to both types of lexical divergence between question and corpus.
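
A very simple response to register divergence is to expand a keyword query with the variants a given corpus is likely to use. The sketch below is illustrative only: the `REGISTER_VARIANTS` table is a hand-built assumption seeded with the MedLinePlus and PubMed terms from the earlier slide, and treating those terms as interchangeable is a simplification.

```python
# Hypothetical lookup table: lay term -> variants seen in another register.
REGISTER_VARIANTS = {
    "runny nose": ["rhinitis", "nasopharyngeal symptoms"],
}

def expand_query(query, variants=REGISTER_VARIANTS):
    """Return the query plus copies with register-specific variants substituted."""
    expanded = [query]
    lowered = query.lower()
    for term, alternatives in variants.items():
        if term in lowered:
            for alternative in alternatives:
                expanded.append(lowered.replace(term, alternative))
    return expanded

expanded = expand_query("runny nose treatment")
for q in expanded:
    print(q)
```

In practice such a table would be derived from a profile of the target corpus rather than written by hand.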

Syntactic Features: Voice • Active • The Grid provides an ideal platform for new ontology tools and data bases, … • Users log-in using a password which is encrypted using a public key and private key mechanism. • Passive • Ontologies are recognized as having a key role in data integration on the computational Grid. • We store ontology files in hierarchical collections, based on user unique identifiers, ontology identifiers, and ontology version numbers.

Syntactic Features • Passive voice is used significantly more often in the physical sciences than in the social sciences [Bonzi, 1990].

Syntactic Features • Passives are also used significantly often in surgical reports [Bross et al, 1972] and repair reports. • For agentive verbs, the missing agent is the surgeon (or surgical team) or repair person: • “… the skin was prepared and draped .. Incision was made .. Axillary fat was dissected and bleeding controlled …” • But not for non-agentive verbs.
