An Approach to Automatic Construction of Hierarchical Subject Domain for Question Answering Systems

Anna V. Zhdanova and Pavel V. Mankevich
Novosibirsk State University and A.P. Ershov Institute of Informatics Systems, Novosibirsk

Contents
• Goals
• Research Motivation
• What Matters when You Deal with Texts?
• Proposed Algorithm
• Subject Domain Construction: Example
• Results

Our Goals
To provide a solution for a knowledge management problem
• that includes arranging natural language texts in a structure
• that is hierarchical, ordering the texts from «easy to understand» to «difficult to understand»
• and that serves natural language question answering as well as possible.
(Diagram labels: Knowledge Management, Structure, Question Answering, Hierarchy)

Automatic Construction of Subject Domain for Natural Language Queries: Motivation
Our assumptions and preconditions
• Most kinds of information presented in natural language cannot be stored and accessed effectively by widely used databases.
• Currently, most popular search engines create their own web resource hierarchies and use them in information retrieval. However, even for the largest engines the hierarchies are created manually.
• Manual construction of a hierarchy is not effective, because
  • it takes a lot of human effort and time
  • people make mistakes, but machines don’t
  • the WWW grows too rapidly to be processed manually

Automatic Construction of Subject Domain for Natural Language Queries: Motivation
Why construct a subject domain automatically (i.e., present the knowledge of a chosen field the way we do)?
• To obtain structures of text data designed specifically for answering natural language queries. This problem has had no acceptable solution so far.
• To generate additional metadata for ontologies (establishing “general - particular” and “simple - complicated” relationships between hierarchy units).

They Matter when You Deal with Text Documents and They Matter in the Hierarchy as Well!
• Weight function
  • If a text is long and contains many rare and complicated words, you will probably stop reading it. Such a text is difficult to understand, so its weight should be high (compare Zipf’s law).
• Similarity measure
  • If two texts discuss similar things using nearly the same words, the texts are similar and their similarity measure should be high.

Weight Functions: Examples
Let X be a text document, x_i the word frequencies in the document, and x_i* the word frequencies in the whole subject domain.
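The Results slide names two weight functions: the number of words in a document, and a weight inversely proportional to the product of word frequencies. The following is a minimal sketch of both, not the formulas from the slides. It assumes that the frequencies in the second function are relative frequencies in the whole subject domain (x_i*) and that a logarithm is taken to keep the value in a convenient range; the function names are hypothetical.

```python
import math
from collections import Counter

def word_count_weight(doc_words):
    """Weight equal to the number of words in the document."""
    return len(doc_words)

def inverse_frequency_weight(doc_words, domain_counts):
    """Log of 1 / (product of domain relative frequencies of the words).

    Assumption: the logarithm and the use of domain-wide relative
    frequencies are illustrative choices, not taken from the slides.
    Every word of the document must occur in domain_counts.
    """
    total = sum(domain_counts.values())
    # The sum of -log(p) equals log(1 / product of p): rare words add more.
    return sum(-math.log(domain_counts[w] / total) for w in doc_words)

# Toy subject domain: "text" is frequent, the other words are rare.
domain = Counter("text model text document text swing unicode".split())
easy = ["text", "text", "document"]
hard = ["unicode", "swing", "document"]

print(word_count_weight(easy))
print(inverse_frequency_weight(easy, domain))
print(inverse_frequency_weight(hard, domain))
```

Both toy documents have the same length, but the one built from rarer words receives the higher weight under the second function, which matches the intuition that rare words make a text harder to understand.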

Similarity Measures: Examples
Let X, Y be text documents, and x_i and y_i the word frequencies in documents X and Y.
• Jaccard Association
• Taxonomic Distance
• Cosine Measure
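The three measures can be sketched on word-frequency dictionaries as follows. These are the common textbook forms and may differ from the variants used on the slides: the Jaccard association is taken in its extended (Tanimoto) form for frequency vectors, and the taxonomic distance is assumed to be the plain Euclidean distance. The function names are hypothetical.

```python
import math

def _dot(x, y):
    """Inner product of two word-frequency dictionaries."""
    return sum(x[w] * y.get(w, 0) for w in x)

def cosine(x, y):
    """Cosine of the angle between the frequency vectors."""
    denom = math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y))
    return _dot(x, y) / denom if denom else 0.0

def jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient for frequency vectors."""
    xy = _dot(x, y)
    denom = _dot(x, x) + _dot(y, y) - xy
    return xy / denom if denom else 0.0

def taxonomic_distance(x, y):
    """Euclidean distance (assumed form); smaller means more similar."""
    words = set(x) | set(y)
    return math.sqrt(sum((x.get(w, 0) - y.get(w, 0)) ** 2 for w in words))

a = {"text": 2, "model": 1}
b = {"text": 1, "swing": 1}
print(cosine(a, b), jaccard(a, b), taxonomic_distance(a, b))
```

Note that the distance is a dissimilarity: to use it where a maximum similarity is required, it has to be negated or inverted first.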

Hierarchy Construction: Algorithm
Input: natural language texts divided into “independent” text units
Output: the corresponding hierarchy
(i) Rank the units by their weights using the chosen weight function.
(ii) Choose the unit with the lowest weight and place it as the root (i.e., the top node) of the hierarchy.
(iii) Choose the unit with the lowest weight among the remaining ones and calculate the chosen similarity measure between this unit and each of the already chosen units. Put the newly chosen unit just below the one with the maximum similarity measure.
(iv) Repeat step (iii) until the set of the remaining units is empty.
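Steps (i) to (iv) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' Java implementation: the name build_hierarchy, the representation of the hierarchy as a child-to-parent dictionary, and the toy weight and similarity functions are all hypothetical.

```python
def build_hierarchy(units, weight, similarity):
    """Return {unit_id: parent_id}; the root maps to None.

    units: dict mapping a unit id to its content
    weight: function(content) -> number
    similarity: function(content_a, content_b) -> number
    """
    # Step (i): rank the units by weight, lightest first.
    order = sorted(units, key=lambda uid: weight(units[uid]))
    parent = {}
    placed = []
    # Steps (iii) and (iv): take the lightest remaining unit each time.
    for uid in order:
        if not placed:
            parent[uid] = None  # step (ii): the lightest unit is the root
        else:
            # Attach below the already placed unit with maximum similarity.
            parent[uid] = max(
                placed, key=lambda p: similarity(units[uid], units[p])
            )
        placed.append(uid)
    return parent

def overlap(a, b):
    """Toy similarity: shared words divided by all words (set Jaccard)."""
    return len(a & b) / len(a | b)

# Toy units represented as sets of words; the weight is the set size.
units = {
    "A": {"text"},
    "B": {"text", "model"},
    "C": {"text", "model", "swing"},
    "D": {"car", "rental", "text", "policy"},
}
hierarchy = build_hierarchy(units, weight=len, similarity=overlap)
print(hierarchy)
```

In this toy run, C is attached below B rather than below the root because it shares more words with B, while D falls back to the root as its closest match.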

Construction of Hierarchical Subject Domain: Example
Step 1: elimination of stop-words (filtration)
D1 public interface Document
The Document is a container for text that serves as the model for swing text components. The goal for this interface is to scale from very simple needs (plain text textfield) to complex needs (HTML or XML documents for example).
D2 Structure
Text is rarely represented simply as featureless content. Rather, text typically has some sort of structure associated with it. Exactly what structure is modeled is up to a particular Document implementation. It might be …
D3 Content
At the simplest level, text can be modeled as a linear sequence of characters. To support internationalization, the Swing text model uses unicode characters. The sequence of characters displayed in a text component is generally referred …
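Stop-word elimination can be sketched as a simple filter over lowercased tokens. The stop-word list below is hypothetical and deliberately tiny, and the tokenizer is an assumption; the keyword lists in the example suggest further normalization (for instance, singular forms), which this sketch does not attempt.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {
    "the", "is", "a", "for", "that", "as", "to", "from",
    "of", "this", "in", "at", "be", "can",
}

def content_words(text):
    """Lowercase the text, split it into words, and drop the stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

d1 = ("The Document is a container for text that serves as the model "
      "for swing text components.")
print(content_words(d1))
```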

Construction of Hierarchical Subject Domain: Example
Step 2: calculating the weight of each document and ordering the documents by weight: D1, D3, D2. Hence, document D1 is the root.
D1: Document, Interface, Textfield, Container, Text, Swing, Component, Model
D2: Interface, Text, Document, Element, Field, Content, Structure, Attribute, Unit, Model
D3: Component, Content, Data, Word, Character, Swing, Text, Model, Document, Sequence
Weight function calculation:
D2 W = 5.33
D3 W = 5.24
D1 W = 5.14

Construction of Hierarchical Subject Domain: Example
Step 3: adding D3 to the hierarchy, indexing D3
D1 “Document”: Document, Text, Interface, Swing, Textfield, Component, Container, Model
  D3 “Content”: Content, Data, Word, Character, Sequence

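Construction of Hierarchical Subject Domain: Example
Step 4: calculating the similarity measure between D2 and D1, and between D2 and D3
similarity(D1, D2) = 0.327
similarity(D2, D3) = 0.111
D2 is more similar to D1 than to D3, so it is placed just below D1.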

Construction of Hierarchical Subject Domain: Example
Step 5: getting the hierarchy, indexing D2
D1 “Document”: Document, Text, Interface, Swing, Textfield, Component, Container, Model
  D2 “Structure”: Element, Field, Content, Structure, Attribute, Unit
  D3 “Content”: Content, Data, Word, Character, Sequence
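The example can be replayed directly from the numbers reported in Steps 2 and 4. The sketch below only performs the placement; the weights and similarities are copied from the slides, while the variable names and the child-to-parent dictionary are assumptions.

```python
# Weights from Step 2 and similarities from Step 4 of the example.
weights = {"D1": 5.14, "D2": 5.33, "D3": 5.24}
similarity = {
    frozenset(("D1", "D2")): 0.327,
    frozenset(("D2", "D3")): 0.111,
}

# Lightest document first: D1, D3, D2.
order = sorted(weights, key=weights.get)
parent = {order[0]: None}  # the lightest document becomes the root
for unit in order[1:]:
    placed = list(parent)
    if len(placed) == 1:
        # Only the root is available, so no comparison is needed.
        parent[unit] = placed[0]
    else:
        # Attach below the placed document with the highest similarity.
        parent[unit] = max(
            placed, key=lambda p: similarity[frozenset((unit, p))]
        )

print(order)
print(parent)
```

D3 is attached to the root because it is the only candidate, and D2 follows it there because 0.327 exceeds 0.111.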

Results: Experiments
• A system performing
  • automatic construction of a subject domain
  • information retrieval
  • natural language interaction with the user
  is built on the basis of Java and XML technologies.
• The main test base consists of
  • the insurance subject domain
  • the English language
  • 83 text units, i.e., hierarchy nodes
  • 27 typical natural language questions
Hierarchy listing shown on the slide (level : text unit):
0 : Decreasing TLI
1 : Lowering auto insurance rates
1 : WLI and ULI
2 : VLI, ULI and participating WLI
1 : Having an accident
1 : Buying auto insurance
1 : Amount of life insurance
1 : Mortgage LI and other TLI
1 : Variable universal WLI
2 : Variable WLI
1 : Selling LI
1 : Participating WLI
1 : Physical exam
1 : Rental car
2 : New car
1 : Credit TLI
1 : Adjustable WLI
2 : Universal WLI
2 : Permanent LI
2 : Current assumption WLI
2 : Repairing vehicle
2 : 1035 exchange
3 : Tax issues
1 : Moving to another state
2 : SR-22 form
1 : Combination Policy
2 : Special Auto Policy
1 : Family income insurance
2 : Family Automobile Policy
3 : Automobile Insurance
1 : Liability lingo
2 : Liability Insurance
2 : Packaged policy
2 : Liability limits
1 : Term LI
1 : Underinsured Motorists
1 : Deposit TLI

Results: Recall-Precision Curves in terms of question answering
Curves are shown for three configurations:
• Non-hierarchical subject domain
• Cosine similarity measure and weight function equal to the number of words in a document
• Cosine similarity measure and weight function inversely proportional to the product of word frequencies
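The slides do not state how the curve points were computed. As a reference, the standard definition for a single question can be sketched as follows: after each returned text unit, recall is the share of relevant units found so far and precision is the share of returned units that are relevant. The function name and the unit identifiers are hypothetical.

```python
def recall_precision_points(ranked, relevant):
    """Return one (recall, precision) pair per rank of the answer list.

    ranked: text units in the order the system returned them
    relevant: non-empty set of units that correctly answer the question
    """
    points = []
    hits = 0
    for k, unit in enumerate(ranked, start=1):
        if unit in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical answer list for one question with two relevant units.
points = recall_precision_points(["u3", "u7", "u1", "u9"], {"u3", "u1"})
for recall, precision in points:
    print(recall, precision)
```

Averaging such points over all test questions would give one curve per configuration.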

Results: Conclusion
• Introducing automatically constructed subject domains substantially improves the performance of question answering systems.
• The performance of a question answering system depends on the chosen combination of weight function and similarity measure. However, no single combination was found to be the best in all cases.

Thank you for your attention!
Contact us by e-mail: anna@sib3.ru, pavel@sib3.ru
This presentation was created for the Andrei Ershov Fifth International Conference, Novosibirsk, Akademgorodok, Russia, 9 - 12 July, 2003.
