
A Kernel-based Approach to Learning Semantic Parsers



Presentation Transcript


  1. A Kernel-based Approach to Learning Semantic Parsers Rohit J. Kate Doctoral Dissertation Proposal Supervisor: Raymond J. Mooney November 21, 2005

  2. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  3. Semantic Parsing • Semantic Parsing: Transforming natural language (NL) sentences into computer-executable complete meaning representations (MRs) • Importance of Semantic Parsing • Natural language communication with computers • Insights into human language acquisition • Example application domains • CLang: RoboCup Coach Language • Geoquery: A Database Query Application

  4. CLang: RoboCup Coach Language • In the RoboCup Coach competition, teams compete to coach simulated soccer players • The coaching instructions are given in a formal language called CLang • Example (coach's NL instruction mapped to CLang by semantic parsing): NL: "If our player 4 has the ball, our player 4 should shoot." CLang MR: ((bowner our {4}) (do our {4} shoot))

  5. Geoquery: A Database Query Application • A query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996] • Example (user's NL question mapped to a query by semantic parsing): NL: "Which rivers run through the states bordering Texas?" Query MR: answer(traverse_2(next_to(stateid('texas'))))

  6. Learning Semantic Parsers • Assume meaning representation languages (MRLs) have deterministic context-free grammars • true for almost all computer languages • MRs can be parsed unambiguously

  7. NL: Which rivers run through the states bordering Texas? MR: answer(traverse_2(next_to(stateid('texas')))) Parse tree of MR: [tree with the non-terminals below on its internal nodes and the terminals at its leaves] Non-terminals: ANSWER, RIVER, TRAVERSE_2, STATE, NEXT_TO, STATEID Terminals: answer, traverse_2, next_to, stateid, 'texas' Productions: ANSWER → answer(RIVER), RIVER → TRAVERSE_2(STATE), STATE → NEXT_TO(STATE), STATE → STATEID, TRAVERSE_2 → traverse_2, NEXT_TO → next_to, STATEID → 'texas'
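
Because the MRL grammar is deterministic, an MR string parses into a unique tree. Below is a minimal sketch of such a parse for the functional Geoquery-style syntax shown above; it is only an illustration (a hypothetical helper, not KRISP's actual MRL-grammar machinery).

import re

def parse_mr(mr):
    # Parse a functional MR such as answer(traverse_2(next_to(stateid('texas'))))
    # into a nested (functor, [children]) tree; the parse is unambiguous.
    tokens = re.findall(r"[\w']+|[(),]", mr)
    pos = 0

    def parse():
        nonlocal pos
        functor = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # consume "("
            while tokens[pos] != ")":
                children.append(parse())
                if tokens[pos] == ",":
                    pos += 1              # consume ","
            pos += 1                      # consume ")"
        return (functor, children)

    return parse()

print(parse_mr("answer(traverse_2(next_to(stateid('texas'))))"))
# ('answer', [('traverse_2', [('next_to', [('stateid', [("'texas'", [])])])])])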

  8. Learning Semantic Parsers • Assume meaning representation languages (MRLs) have deterministic context-free grammars • true for almost all computer languages • MRs can be parsed unambiguously • Training data consists of NL sentences paired with their MRs • Induce a semantic parser which can map novel NL sentences to their correct MRs • The learning problem differs from that of syntactic parsing, where training data has parse trees annotated over the NL sentences

  9. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  10. Related Work: CHILL [Zelle & Mooney, 1996] • Uses Inductive Logic Programming (ILP) to induce a semantic parser • Learns rules to control the actions of a deterministic shift-reduce parser • Processes the sentence one word at a time, making a hard parsing decision each time • Brittle, and ILP techniques do not scale to large corpora

  11. Related Work: SILT [Kate, Wong & Mooney, 2005] • Transformation rules associate NL patterns with MRL templates • NL patterns matched in the sentence are replaced by the MRL templates • By the end of parsing, NL sentence gets transformed into its MR • Two versions: string patterns and syntactic tree patterns

  12. Related Work: SILT contd. Weaknesses of SILT: • Hard-matching transformation rules are brittle: e.g., the NL pattern "our left [3] penalty area" cannot robustly cover all the phrasings "our left penalty area", "our left side of penalty area", "left of our penalty area", "our ah.. left penalty area" • Parsing is done deterministically, which is less robust than probabilistic parsing

  13. Related Work: WASP [Wong, 2005] • Based on Synchronous Context-free Grammars • Uses Machine Translation technique of statistical word alignment to find good transformation rules • Builds a maximum entropy model for parsing • The transformation rules are hard-matching

  14. Related Work: SCISSOR [Ge & Mooney, 2005] [Example SAPT for "our player 2 has the ball", with semantically augmented labels such as S-bowner, NP-player, VP-bowner, PRP$-team, NN-player, CD-unum, VB-bowner, NP-null, DT-null, NN-null] • Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000] • A statistical parser is used to generate a semantically augmented parse tree (SAPT) • Augments Collins' head-driven model 2 (Bikel's implementation, 2004) to incorporate semantic labels • Translates the SAPT into a complete formal meaning representation

  15. Related Work: Zettlemoyer & Collins [2005] • Uses the Combinatory Categorial Grammar (CCG) formalism to learn a statistical semantic parser • Generates a CCG lexicon relating NL words to semantic types through general hand-built template rules • Uses a maximum entropy model for compacting this lexicon and doing probabilistic CCG parsing

  16. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  17. Traditional Machine Learning with Structured Data [Diagram: Examples → Feature Engineering → Feature Vectors → Machine Learning Algorithm; the hand-engineered feature vectors lose information about the structured examples]

  18. Kernel-based Machine Learning with Structured Data [Diagram: Examples → Kernel Computations → Kernelized Machine Learning Algorithm; the kernel implicitly maps examples to a potentially infinite number of features]

  19. Kernel Functions • A kernel K is a similarity function over a domain X which maps any two objects x, y in X to their similarity score K(x,y) • If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi,xj))ij is symmetric and positive semidefinite, then the kernel function computes the dot product of implicit feature vectors in some high-dimensional feature space • Machine learning algorithms which use the data only to compute similarities can be kernelized (e.g. Support Vector Machines, Nearest Neighbor, etc.)
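
As a quick aside, a small sketch (assuming NumPy is available) of checking those two conditions on a candidate Gram matrix built from pairwise kernel values:

import numpy as np

def is_valid_gram_matrix(K, tol=1e-8):
    # The matrix of pairwise kernel values K(xi, xj) must be symmetric and
    # positive semidefinite (no eigenvalue below -tol) for a valid kernel.
    K = np.asarray(K, dtype=float)
    symmetric = np.allclose(K, K.T, atol=tol)
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2.0)
    return symmetric and bool(np.all(eigenvalues >= -tol))

print(is_valid_gram_matrix([[2.0, 1.0], [1.0, 2.0]]))   # True
print(is_valid_gram_matrix([[0.0, 2.0], [2.0, 0.0]]))   # False (negative eigenvalue)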

  20. String Subsequence Kernel • Define kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002] • All possible subsequences become the implicit feature vectors and the kernel computes their dot-products s = “left side of our penalty area” t = “our left penalty area” K(s,t) = ?

  21.-26. String Subsequence Kernel contd. For s = "left side of our penalty area" and t = "our left penalty area", the count of common subsequences accumulates one at a time: u = left (K(s,t) = 1 + ?), u = our (2 + ?), u = penalty (3 + ?), u = area (4 + ?), u = left penalty (5 + ?), and so on through left area, our penalty, our area, penalty area, our penalty area, and left penalty area, until all common subsequences are counted: K(s,t) = 11

  27. Normalized String Subsequence Kernel • Normalize the kernel (range [0,1]) to remove any bias due to different string lengths • Lodhi et al. [2002] give an O(n|s||t|) algorithm for computing the string subsequence kernel • Used for Text Categorization [Lodhi et al., 2002] and Information Extraction [Bunescu & Mooney, 2005b]
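
A brute-force sketch of this kernel at the word level, counting distinct common subsequences and normalizing to [0,1]; it reproduces K(s,t) = 11 for the example above. (The O(n|s||t|) computation of Lodhi et al. uses dynamic programming instead of this exponential enumeration.)

from itertools import combinations

def subsequences(words):
    # All distinct non-empty subsequences of a word list, as tuples.
    subs = set()
    for r in range(1, len(words) + 1):
        for idx in combinations(range(len(words)), r):
            subs.add(tuple(words[i] for i in idx))
    return subs

def subsequence_kernel(s, t):
    # Number of distinct word subsequences shared by the two strings.
    return len(subsequences(s.split()) & subsequences(t.split()))

def normalized_subsequence_kernel(s, t):
    # Normalize to [0, 1] to remove bias from different string lengths.
    return subsequence_kernel(s, t) / (subsequence_kernel(s, s) * subsequence_kernel(t, t)) ** 0.5

s = "left side of our penalty area"
t = "our left penalty area"
print(subsequence_kernel(s, t))                        # 11
print(round(normalized_subsequence_kernel(s, t), 3))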

  28. Support Vector Machines • Mapping data to high-dimensional feature spaces can lead to overfitting of training data (“curse of dimensionality”) • Support Vector Machines (SVMs) are known to be resistant to this overfitting

  29. SVMs: Maximum Margin • Given positive and negative examples, SVMs find a separating hyperplane such that the margin ρ between the closest examples is maximized • Maximizing the margin is good according to intuition and PAC theory [Figure: separating hyperplane with margin ρ between the closest positive and negative examples]

  30. SVMs: Probability Estimates • Probability estimate of a point belonging to a class can be obtained using its distance from the hyperplane [Platt, 1999]
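
A small sketch of the Platt [1999] mapping from an SVM decision value to a probability; the sigmoid parameters A and B below are illustrative placeholders (in practice they are fit to held-out decision values).

import math

def platt_probability(decision_value, A=-1.5, B=0.0):
    # Platt scaling: P(y = 1 | f) = 1 / (1 + exp(A*f + B)), where f is the
    # SVM decision value (signed distance from the separating hyperplane).
    # A is negative, so larger positive distances give higher probabilities.
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

print(platt_probability(2.0), platt_probability(-2.0))   # high vs. low estimate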

  31. Why a Kernel-based Approach to Learning Semantic Parsers? • Natural language sentences are structured • Natural languages are flexible; there are various ways to express the same semantic concept CLang MR: (left (penalty-area our)) NL: "our left penalty area", "our left side of penalty area", "left side of our penalty area", "left of our penalty area", "our penalty area towards the left side", "our ah.. left penalty area"

  32. Why a Kernel-based Approach to Learning Semantic Parsers? [Figure: a scatter of NL phrases: "right side of our penalty area", "left of our penalty area", "opponent's right penalty area", "our left side of penalty area", "our ah.. left penalty area", "our left penalty area", "our right midfield", "left side of our penalty area", "our penalty area towards the left side"] Kernel methods can robustly capture the range of NL contexts.

  33. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  34. KRISP: Kernel-based Robust Interpretation by Semantic Parsing • Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar • Productions of the MRL are treated like semantic concepts • An SVM classifier is trained for each production with a string subsequence kernel • These classifiers are used to compositionally build MRs of the sentences
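
A rough sketch of how one such per-production classifier could be set up with a precomputed kernel matrix, using scikit-learn's SVC. The word-overlap similarity is only a stand-in for the normalized string subsequence kernel, and the phrases and labels are hypothetical.

import numpy as np
from sklearn.svm import SVC

def word_overlap_kernel(a, b):
    # Stand-in similarity (shared-word count); KRISP itself uses the
    # normalized string subsequence kernel.
    return float(len(set(a.split()) & set(b.split())))

# Hypothetical training phrases for one production, e.g. NEXT_TO -> next_to:
# positives are substrings the production should cover, negatives are not.
phrases = ["the states bordering", "states that border", "state adjacent to",
           "states next to", "which rivers run", "how many people live",
           "the largest city in", "capital of the state"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Gram matrix of pairwise kernel values over the training phrases.
gram = np.array([[word_overlap_kernel(a, b) for b in phrases] for a in phrases])
clf = SVC(kernel="precomputed", probability=True).fit(gram, labels)

# Kernel values between a test phrase and every training phrase.
test = ["the states that are bordering texas"]
k_test = np.array([[word_overlap_kernel(t, p) for p in phrases] for t in test])
print(clf.predict_proba(k_test))   # columns follow clf.classes_: [P(0), P(1)]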

  35.-36. Overview of KRISP [Diagram, repeated on both slides with different stages highlighted: Training: NL sentences paired with their MRs, together with the MRL grammar, are used to collect positive and negative examples; string-kernel-based SVM classifiers are trained on these examples; the resulting semantic parser produces the best semantic derivations (correct and incorrect), which are fed back to collect better examples. Testing: novel NL sentences are given to the learned semantic parser, which outputs the best MRs]

  37. Overview of KRISP's Semantic Parsing • We first define the semantic derivation of an NL sentence • We then define the probability of a semantic derivation • Semantic parsing of an NL sentence involves finding its most probable semantic derivation • It is straightforward to obtain the MR from a semantic derivation

  38. Semantic Derivation of an NL Sentence MR parse with non-terminals on the nodes: [parse tree of answer(traverse_2(next_to(stateid('texas')))) with the non-terminals ANSWER, RIVER, TRAVERSE_2, STATE, NEXT_TO, STATEID on its nodes] Which rivers run through the states bordering Texas?

  39. Semantic Derivation of an NL Sentence MR parse with productions on the nodes: ANSWER → answer(RIVER) RIVER → TRAVERSE_2(STATE) TRAVERSE_2 → traverse_2 STATE → NEXT_TO(STATE) NEXT_TO → next_to STATE → STATEID STATEID → 'texas' Which rivers run through the states bordering Texas?

  40. Semantic Derivation of an NL Sentence Semantic Derivation: Each node covers an NL substring: ANSWER → answer(RIVER) RIVER → TRAVERSE_2(STATE) TRAVERSE_2 → traverse_2 STATE → NEXT_TO(STATE) NEXT_TO → next_to STATE → STATEID STATEID → 'texas' Which rivers run through the states bordering Texas?

  41. Semantic Derivation of an NL Sentence Semantic Derivation: Each node contains a production and the substring of the NL sentence it covers: (ANSWER → answer(RIVER), [1..9]) (RIVER → TRAVERSE_2(STATE), [1..9]) (TRAVERSE_2 → traverse_2, [1..4]) (STATE → NEXT_TO(STATE), [5..9]) (NEXT_TO → next_to, [5..7]) (STATE → STATEID, [8..9]) (STATEID → 'texas', [8..9]) Which rivers run through the states bordering Texas? (word positions 1 through 9)

  42. Semantic Derivation of an NL Sentence Substrings in the NL sentence may be in a different order: ANSWER → answer(RIVER) RIVER → TRAVERSE_2(STATE) TRAVERSE_2 → traverse_2 STATE → NEXT_TO(STATE) NEXT_TO → next_to STATE → STATEID STATEID → 'texas' Through the states that border Texas which rivers run?

  43. Semantic Derivation of an NL Sentence Nodes are allowed to permute the child productions from the original MR parse: (ANSWER → answer(RIVER), [1..10]) (RIVER → TRAVERSE_2(STATE), [1..10]) (STATE → NEXT_TO(STATE), [1..6]) (NEXT_TO → next_to, [1..5]) (STATE → STATEID, [6..6]) (STATEID → 'texas', [6..6]) (TRAVERSE_2 → traverse_2, [7..10]) Through the states that border Texas which rivers run? (word positions 1 through 10)

  44. Probability of a Semantic Derivation • Let Pπ(s[i..j]) be the probability that production π covers the substring s[i..j], e.g. P_{NEXT_TO → next_to}("the states bordering") for the node (NEXT_TO → next_to, [5..7]) covering words 5 to 7 • Obtained from the string-kernel-based SVM classifiers trained for each production π • Probability of a semantic derivation D is the product of its node probabilities: P(D) = ∏ over nodes (π, [i..j]) in D of Pπ(s[i..j])
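
Stated as code: assuming a derivation is represented as a list of (production, covered substring) node pairs and prob is the per-production SVM estimate (both hypothetical interfaces), this is a one-line product.

from math import prod

def derivation_probability(derivation_nodes, prob):
    # P(D) = product over all nodes (pi, s[i..j]) in D of P_pi(s[i..j]).
    return prod(prob(pi, substring) for pi, substring in derivation_nodes)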

  45. Computing the Most Probable Semantic Derivation • The task of semantic parsing is to find the most probable semantic derivation • Let E_{n,s[i..j]}, a partial derivation, denote any subtree of a derivation tree with n as the LHS non-terminal of the root production, covering sentence s from index i to j • Example of E_{STATE,s[5..9]}: (STATE → NEXT_TO(STATE), [5..9]) (NEXT_TO → next_to, [5..7]) (STATE → STATEID, [8..9]) (STATEID → 'texas', [8..9]) over "the states bordering Texas?" (words 5 to 9) • The full derivation D is then E_{ANSWER,s[1..|s|]}

  46.-50. Computing the Most Probable Semantic Derivation contd. • Let E*_{STATE,s[5..9]} denote the most probable partial derivation among all E_{STATE,s[5..9]} • It is computed recursively: for the root production (STATE → NEXT_TO(STATE), [5..9]) over "the states bordering Texas?" (words 5 to 9), consider every way of splitting [5..9] between the two children E*_{NEXT_TO,s[i..j]} and E*_{STATE,s[i..j]}: E*_{NEXT_TO,s[5..5]} with E*_{STATE,s[6..9]}, E*_{NEXT_TO,s[5..6]} with E*_{STATE,s[7..9]}, E*_{NEXT_TO,s[5..7]} with E*_{STATE,s[8..9]}, and so on, keeping the split whose child derivations give the highest probability
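
A simplified recursive sketch of this computation, with a toy grammar and a toy stand-in for the per-production SVM probabilities. It keeps children in their original MR order and omits the memoization and beam search a full implementation would need; it is meant only to show the recursion over split points.

from collections import namedtuple

# Hypothetical, simplified data structures for illustration only.
Production = namedtuple("Production", ["name", "rhs_nonterminals"])

GRAMMAR = {
    "STATE": [Production("STATE -> NEXT_TO(STATE)", ["NEXT_TO", "STATE"]),
              Production("STATE -> STATEID", ["STATEID"])],
    "NEXT_TO": [Production("NEXT_TO -> next_to", [])],
    "STATEID": [Production("STATEID -> 'texas'", [])],
}

def toy_prob(prod, phrase):
    # Crude stand-in for the Platt-scaled SVM estimate P_pi(s[i..j]):
    # a terminal production is likely only if its cue word is in a short span.
    cues = {"NEXT_TO -> next_to": "bordering", "STATEID -> 'texas'": "Texas"}
    if prod.name in cues:
        return 0.9 if cues[prod.name] in phrase and len(phrase.split()) <= 3 else 0.05
    return 0.5  # non-terminal-expanding productions: neutral score

def best_derivation(nonterm, words, i, j, prob=toy_prob, grammar=GRAMMAR):
    # Most probable partial derivation E*_{n, s[i..j]}: for each production of
    # the non-terminal, multiply its own probability on s[i..j] by the best
    # scores of its child derivations over every split point of [i..j].
    best_score, best_tree = 0.0, None
    for prod in grammar[nonterm]:
        p = prob(prod, " ".join(words[i:j + 1]))
        kids = prod.rhs_nonterminals
        if not kids:                      # terminal production covers [i..j] itself
            score, tree = p, (prod.name, (i, j), [])
        elif len(kids) == 1:              # single child covers the same span
            s1, t1 = best_derivation(kids[0], words, i, j, prob, grammar)
            score, tree = p * s1, (prod.name, (i, j), [t1])
        else:                             # two children: try every split point k
            score, tree = 0.0, None
            for k in range(i, j):
                sl, tl = best_derivation(kids[0], words, i, k, prob, grammar)
                sr, tr = best_derivation(kids[1], words, k + 1, j, prob, grammar)
                if p * sl * sr > score:
                    score, tree = p * sl * sr, (prod.name, (i, j), [tl, tr])
        if score > best_score:
            best_score, best_tree = score, tree
    return best_score, best_tree

words = "the states bordering Texas".split()
print(best_derivation("STATE", words, 0, len(words) - 1))
# Picks NEXT_TO -> next_to over "the states bordering" and
# STATE -> STATEID -> 'texas' over "Texas".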
