
A Kernel-based Approach to Learning Semantic Parsers



Presentation Transcript


  1. A Kernel-based Approach to Learning Semantic Parsers Rohit J. Kate Doctoral Dissertation Proposal Supervisor: Raymond J. Mooney November 21, 2005

  2. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  3. Semantic Parsing • Semantic Parsing: Transforming natural language (NL) sentences into computer-executable complete meaning representations (MRs) • Importance of Semantic Parsing • Natural language communication with computers • Insights into human language acquisition • Example application domains • CLang: RoboCup Coach Language • Geoquery: A Database Query Application

  4. CLang: RoboCup Coach Language • In the RoboCup Coach competition, teams compete to coach simulated soccer players • The coaching instructions are given in a formal language called CLang • Example (coach's NL instruction mapped to CLang by semantic parsing): NL: "If our player 4 has the ball, our player 4 should shoot." CLang MR: ((bowner our {4}) (do our {4} shoot))

  5. Geoquery: A Database Query Application • A query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996] • Example (user's NL question mapped to a query by semantic parsing): NL: "Which rivers run through the states bordering Texas?" Query MR: answer(traverse_2(next_to(stateid('texas'))))

  6. Learning Semantic Parsers • Assume meaning representation languages (MRLs) have deterministic context-free grammars • true for almost all computer languages • MRs can be parsed unambiguously

  7. NL: Which rivers run through the states bordering Texas? MR: answer(traverse_2(next_to(stateid('texas')))) Parse tree of MR: [tree with the non-terminals below on its internal nodes and the terminals at its leaves] Non-terminals: ANSWER, RIVER, TRAVERSE_2, STATE, NEXT_TO, STATEID Terminals: answer, traverse_2, next_to, stateid, 'texas' Productions: ANSWER → answer(RIVER), RIVER → TRAVERSE_2(STATE), STATE → NEXT_TO(STATE), STATE → STATEID, TRAVERSE_2 → traverse_2, NEXT_TO → next_to, STATEID → 'texas'
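
Because the MRL grammar is deterministic, an MR string parses into a unique tree. Below is a minimal sketch of such a parse for the functional Geoquery-style syntax shown above; it is only an illustration (a hypothetical helper, not KRISP's actual MRL-grammar machinery).

import re

def parse_mr(mr):
    # Parse a functional MR such as answer(traverse_2(next_to(stateid('texas'))))
    # into a nested (functor, [children]) tree; the parse is unambiguous.
    tokens = re.findall(r"[\w']+|[(),]", mr)
    pos = 0

    def parse():
        nonlocal pos
        functor = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # consume "("
            while tokens[pos] != ")":
                children.append(parse())
                if tokens[pos] == ",":
                    pos += 1              # consume ","
            pos += 1                      # consume ")"
        return (functor, children)

    return parse()

print(parse_mr("answer(traverse_2(next_to(stateid('texas'))))"))
# ('answer', [('traverse_2', [('next_to', [('stateid', [("'texas'", [])])])])])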

  8. Learning Semantic Parsers • Assume meaning representation languages (MRLs) have deterministic context-free grammars • true for almost all computer languages • MRs can be parsed unambiguously • Training data consists of NL sentences paired with their MRs • Induce a semantic parser which can map novel NL sentences to their correct MRs • The learning problem differs from that of syntactic parsing, where training data has parse trees annotated over the NL sentences

  9. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  10. Related Work: CHILL [Zelle & Mooney, 1996] • Uses Inductive Logic Programming (ILP) to induce a semantic parser • Learns rules to control the actions of a deterministic shift-reduce parser • Processes the sentence one word at a time, making a hard parsing decision each time • Brittle, and ILP techniques do not scale to large corpora

  11. Related Work: SILT [Kate, Wong & Mooney, 2005] • Transformation rules associate NL patterns with MRL templates • NL patterns matched in the sentence are replaced by the MRL templates • By the end of parsing, NL sentence gets transformed into its MR • Two versions: string patterns and syntactic tree patterns

  12. Related Work: SILT contd. Weaknesses of SILT: • Hard-matching transformation rules are brittle: e.g., the NL pattern "our left [3] penalty area" cannot robustly cover all the phrasings "our left penalty area", "our left side of penalty area", "left of our penalty area", "our ah.. left penalty area" • Parsing is done deterministically, which is less robust than probabilistic parsing

  13. Related Work: WASP [Wong, 2005] • Based on Synchronous Context-free Grammars • Uses Machine Translation technique of statistical word alignment to find good transformation rules • Builds a maximum entropy model for parsing • The transformation rules are hard-matching

  14. Related Work: SCISSOR [Ge & Mooney, 2005] [Example SAPT for "our player 2 has the ball", with semantically augmented labels such as S-bowner, NP-player, VP-bowner, PRP$-team, NN-player, CD-unum, VB-bowner, NP-null, DT-null, NN-null] • Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000] • A statistical parser is used to generate a semantically augmented parse tree (SAPT) • Augments Collins' head-driven model 2 (Bikel's implementation, 2004) to incorporate semantic labels • Translates the SAPT into a complete formal meaning representation

  15. Related Work: Zettlemoyer & Collins [2005] • Uses the Combinatory Categorial Grammar (CCG) formalism to learn a statistical semantic parser • Generates a CCG lexicon relating NL words to semantic types through general hand-built template rules • Uses a maximum entropy model for compacting this lexicon and doing probabilistic CCG parsing

  16. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  17. Traditional Machine Learning with Structured Data [Diagram: Examples → Feature Engineering → Feature Vectors → Machine Learning Algorithm; the hand-engineered feature vectors lose information about the structured examples]

  18. Kernel-based Machine Learning with Structured Data [Diagram: Examples → Kernel Computations → Kernelized Machine Learning Algorithm; the kernel implicitly maps examples to a potentially infinite number of features]

  19. Kernel Functions • A kernel K is a similarity function over a domain X which maps any two objects x, y in X to their similarity score K(x,y) • If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi,xj))ij is symmetric and positive semidefinite, then the kernel function computes the dot product of implicit feature vectors in some high-dimensional feature space • Machine learning algorithms which use the data only to compute similarities can be kernelized (e.g. Support Vector Machines, Nearest Neighbor, etc.)
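
As a quick aside, a small sketch (assuming NumPy is available) of checking those two conditions on a candidate Gram matrix built from pairwise kernel values:

import numpy as np

def is_valid_gram_matrix(K, tol=1e-8):
    # The matrix of pairwise kernel values K(xi, xj) must be symmetric and
    # positive semidefinite (no eigenvalue below -tol) for a valid kernel.
    K = np.asarray(K, dtype=float)
    symmetric = np.allclose(K, K.T, atol=tol)
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2.0)
    return symmetric and bool(np.all(eigenvalues >= -tol))

print(is_valid_gram_matrix([[2.0, 1.0], [1.0, 2.0]]))   # True
print(is_valid_gram_matrix([[0.0, 2.0], [2.0, 0.0]]))   # False (negative eigenvalue)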

  20. String Subsequence Kernel • Define kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002] • All possible subsequences become the implicit feature vectors and the kernel computes their dot-products s = “left side of our penalty area” t = “our left penalty area” K(s,t) = ?

  21.-26. String Subsequence Kernel contd. For s = "left side of our penalty area" and t = "our left penalty area", the count of common subsequences accumulates one at a time: u = left (K(s,t) = 1 + ?), u = our (2 + ?), u = penalty (3 + ?), u = area (4 + ?), u = left penalty (5 + ?), and so on through left area, our penalty, our area, penalty area, our penalty area, and left penalty area, until all common subsequences are counted: K(s,t) = 11

  27. Normalized String Subsequence Kernel • Normalize the kernel (range [0,1]) to remove any bias due to different string lengths • Lodhi et al. [2002] give an O(n|s||t|) algorithm for computing the string subsequence kernel • Used for Text Categorization [Lodhi et al., 2002] and Information Extraction [Bunescu & Mooney, 2005b]
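
A brute-force sketch of this kernel at the word level, counting distinct common subsequences and normalizing to [0,1]; it reproduces K(s,t) = 11 for the example above. (The O(n|s||t|) computation of Lodhi et al. uses dynamic programming instead of this exponential enumeration.)

from itertools import combinations

def subsequences(words):
    # All distinct non-empty subsequences of a word list, as tuples.
    subs = set()
    for r in range(1, len(words) + 1):
        for idx in combinations(range(len(words)), r):
            subs.add(tuple(words[i] for i in idx))
    return subs

def subsequence_kernel(s, t):
    # Number of distinct word subsequences shared by the two strings.
    return len(subsequences(s.split()) & subsequences(t.split()))

def normalized_subsequence_kernel(s, t):
    # Normalize to [0, 1] to remove bias from different string lengths.
    return subsequence_kernel(s, t) / (subsequence_kernel(s, s) * subsequence_kernel(t, t)) ** 0.5

s = "left side of our penalty area"
t = "our left penalty area"
print(subsequence_kernel(s, t))                        # 11
print(round(normalized_subsequence_kernel(s, t), 3))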

  28. Support Vector Machines • Mapping data to high-dimensional feature spaces can lead to overfitting of training data (“curse of dimensionality”) • Support Vector Machines (SVMs) are known to be resistant to this overfitting

  29. SVMs: Maximum Margin • Given positive and negative examples, SVMs find a separating hyperplane such that the margin ρ between the closest examples is maximized • Maximizing the margin is good according to intuition and PAC theory [Figure: separating hyperplane with margin ρ between the closest positive and negative examples]

  30. SVMs: Probability Estimates • Probability estimate of a point belonging to a class can be obtained using its distance from the hyperplane [Platt, 1999]
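
A small sketch of the Platt [1999] mapping from an SVM decision value to a probability; the sigmoid parameters A and B below are illustrative placeholders (in practice they are fit to held-out decision values).

import math

def platt_probability(decision_value, A=-1.5, B=0.0):
    # Platt scaling: P(y = 1 | f) = 1 / (1 + exp(A*f + B)), where f is the
    # SVM decision value (signed distance from the separating hyperplane).
    # A is negative, so larger positive distances give higher probabilities.
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

print(platt_probability(2.0), platt_probability(-2.0))   # high vs. low estimate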

  31. Why a Kernel-based Approach to Learning Semantic Parsers? • Natural language sentences are structured • Natural languages are flexible; there are various ways to express the same semantic concept CLang MR: (left (penalty-area our)) NL: "our left penalty area", "our left side of penalty area", "left side of our penalty area", "left of our penalty area", "our penalty area towards the left side", "our ah.. left penalty area"

  32. Why a Kernel-based Approach to Learning Semantic Parsers? [Figure: a scatter of NL phrases: "right side of our penalty area", "left of our penalty area", "opponent's right penalty area", "our left side of penalty area", "our ah.. left penalty area", "our left penalty area", "our right midfield", "left side of our penalty area", "our penalty area towards the left side"] Kernel methods can robustly capture the range of NL contexts.

  33. Outline • Semantic Parsing • Related Work • Background on Kernel-based Methods • Completed Research • Proposed Research • Conclusions

  34. KRISP: Kernel-based Robust Interpretation by Semantic Parsing • Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar • Productions of the MRL are treated like semantic concepts • An SVM classifier is trained for each production with a string subsequence kernel • These classifiers are used to compositionally build MRs of the sentences
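
A rough sketch of how one such per-production classifier could be set up with a precomputed kernel matrix, using scikit-learn's SVC. The word-overlap similarity is only a stand-in for the normalized string subsequence kernel, and the phrases and labels are hypothetical.

import numpy as np
from sklearn.svm import SVC

def word_overlap_kernel(a, b):
    # Stand-in similarity (shared-word count); KRISP itself uses the
    # normalized string subsequence kernel.
    return float(len(set(a.split()) & set(b.split())))

# Hypothetical training phrases for one production, e.g. NEXT_TO -> next_to:
# positives are substrings the production should cover, negatives are not.
phrases = ["the states bordering", "states that border", "state adjacent to",
           "states next to", "which rivers run", "how many people live",
           "the largest city in", "capital of the state"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Gram matrix of pairwise kernel values over the training phrases.
gram = np.array([[word_overlap_kernel(a, b) for b in phrases] for a in phrases])
clf = SVC(kernel="precomputed", probability=True).fit(gram, labels)

# Kernel values between a test phrase and every training phrase.
test = ["the states that are bordering texas"]
k_test = np.array([[word_overlap_kernel(t, p) for p in phrases] for t in test])
print(clf.predict_proba(k_test))   # columns follow clf.classes_: [P(0), P(1)]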

  35.-36. Overview of KRISP [Diagram, repeated on both slides with different stages highlighted: Training: NL sentences paired with their MRs, together with the MRL grammar, are used to collect positive and negative examples; string-kernel-based SVM classifiers are trained on these examples; the resulting semantic parser produces the best semantic derivations (correct and incorrect), which are fed back to collect better examples. Testing: novel NL sentences are given to the learned semantic parser, which outputs the best MRs]

  37. Overview of KRISP's Semantic Parsing • We first define the semantic derivation of an NL sentence • We then define the probability of a semantic derivation • Semantic parsing of an NL sentence involves finding its most probable semantic derivation • It is straightforward to obtain the MR from a semantic derivation

  38. Semantic Derivation of an NL Sentence MR parse with non-terminals on the nodes: [parse tree of answer(traverse_2(next_to(stateid('texas')))) with the non-terminals ANSWER, RIVER, TRAVERSE_2, STATE, NEXT_TO, STATEID on its nodes] Which rivers run through the states bordering Texas?

  39. Semantic Derivation of an NL Sentence MR parse with productions on the nodes: ANSWER → answer(RIVER) RIVER → TRAVERSE_2(STATE) TRAVERSE_2 → traverse_2 STATE → NEXT_TO(STATE) NEXT_TO → next_to STATE → STATEID STATEID → 'texas' Which rivers run through the states bordering Texas?

  40. Semantic Derivation of an NL Sentence Semantic Derivation: Each node covers an NL substring: ANSWER → answer(RIVER) RIVER → TRAVERSE_2(STATE) TRAVERSE_2 → traverse_2 STATE → NEXT_TO(STATE) NEXT_TO → next_to STATE → STATEID STATEID → 'texas' Which rivers run through the states bordering Texas?

  41. Semantic Derivation of an NL Sentence Semantic Derivation: Each node contains a production and the substring of the NL sentence it covers: (ANSWER → answer(RIVER), [1..9]) (RIVER → TRAVERSE_2(STATE), [1..9]) (TRAVERSE_2 → traverse_2, [1..4]) (STATE → NEXT_TO(STATE), [5..9]) (NEXT_TO → next_to, [5..7]) (STATE → STATEID, [8..9]) (STATEID → 'texas', [8..9]) Which rivers run through the states bordering Texas? (word positions 1 through 9)

  42. Semantic Derivation of an NL Sentence Substrings in the NL sentence may be in a different order: ANSWER → answer(RIVER) RIVER → TRAVERSE_2(STATE) TRAVERSE_2 → traverse_2 STATE → NEXT_TO(STATE) NEXT_TO → next_to STATE → STATEID STATEID → 'texas' Through the states that border Texas which rivers run?

  43. Semantic Derivation of an NL Sentence Nodes are allowed to permute the child productions from the original MR parse: (ANSWER → answer(RIVER), [1..10]) (RIVER → TRAVERSE_2(STATE), [1..10]) (STATE → NEXT_TO(STATE), [1..6]) (NEXT_TO → next_to, [1..5]) (STATE → STATEID, [6..6]) (STATEID → 'texas', [6..6]) (TRAVERSE_2 → traverse_2, [7..10]) Through the states that border Texas which rivers run? (word positions 1 through 10)

  44. Probability of a Semantic Derivation • Let Pπ(s[i..j]) be the probability that production π covers the substring s[i..j], e.g. P_{NEXT_TO → next_to}("the states bordering") for the node (NEXT_TO → next_to, [5..7]) covering words 5 to 7 • Obtained from the string-kernel-based SVM classifiers trained for each production π • Probability of a semantic derivation D is the product of its node probabilities: P(D) = ∏ over nodes (π, [i..j]) in D of Pπ(s[i..j])
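
Stated as code: assuming a derivation is represented as a list of (production, covered substring) node pairs and prob is the per-production SVM estimate (both hypothetical interfaces), this is a one-line product.

from math import prod

def derivation_probability(derivation_nodes, prob):
    # P(D) = product over all nodes (pi, s[i..j]) in D of P_pi(s[i..j]).
    return prod(prob(pi, substring) for pi, substring in derivation_nodes)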

  45. Computing the Most Probable Semantic Derivation • The task of semantic parsing is to find the most probable semantic derivation • Let E_{n,s[i..j]}, a partial derivation, denote any subtree of a derivation tree with n as the LHS non-terminal of the root production, covering sentence s from index i to j • Example of E_{STATE,s[5..9]}: (STATE → NEXT_TO(STATE), [5..9]) (NEXT_TO → next_to, [5..7]) (STATE → STATEID, [8..9]) (STATEID → 'texas', [8..9]) over "the states bordering Texas?" (words 5 to 9) • The full derivation D is then E_{ANSWER,s[1..|s|]}

  46.-50. Computing the Most Probable Semantic Derivation contd. • Let E*_{STATE,s[5..9]} denote the most probable partial derivation among all E_{STATE,s[5..9]} • It is computed recursively: for the root production (STATE → NEXT_TO(STATE), [5..9]) over "the states bordering Texas?" (words 5 to 9), consider every way of splitting [5..9] between the two children E*_{NEXT_TO,s[i..j]} and E*_{STATE,s[i..j]}: E*_{NEXT_TO,s[5..5]} with E*_{STATE,s[6..9]}, E*_{NEXT_TO,s[5..6]} with E*_{STATE,s[7..9]}, E*_{NEXT_TO,s[5..7]} with E*_{STATE,s[8..9]}, and so on, keeping the split whose child derivations give the highest probability
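
A simplified recursive sketch of this computation, with a toy grammar and a toy stand-in for the per-production SVM probabilities. It keeps children in their original MR order and omits the memoization and beam search a full implementation would need; it is meant only to show the recursion over split points.

from collections import namedtuple

# Hypothetical, simplified data structures for illustration only.
Production = namedtuple("Production", ["name", "rhs_nonterminals"])

GRAMMAR = {
    "STATE": [Production("STATE -> NEXT_TO(STATE)", ["NEXT_TO", "STATE"]),
              Production("STATE -> STATEID", ["STATEID"])],
    "NEXT_TO": [Production("NEXT_TO -> next_to", [])],
    "STATEID": [Production("STATEID -> 'texas'", [])],
}

def toy_prob(prod, phrase):
    # Crude stand-in for the Platt-scaled SVM estimate P_pi(s[i..j]):
    # a terminal production is likely only if its cue word is in a short span.
    cues = {"NEXT_TO -> next_to": "bordering", "STATEID -> 'texas'": "Texas"}
    if prod.name in cues:
        return 0.9 if cues[prod.name] in phrase and len(phrase.split()) <= 3 else 0.05
    return 0.5  # non-terminal-expanding productions: neutral score

def best_derivation(nonterm, words, i, j, prob=toy_prob, grammar=GRAMMAR):
    # Most probable partial derivation E*_{n, s[i..j]}: for each production of
    # the non-terminal, multiply its own probability on s[i..j] by the best
    # scores of its child derivations over every split point of [i..j].
    best_score, best_tree = 0.0, None
    for prod in grammar[nonterm]:
        p = prob(prod, " ".join(words[i:j + 1]))
        kids = prod.rhs_nonterminals
        if not kids:                      # terminal production covers [i..j] itself
            score, tree = p, (prod.name, (i, j), [])
        elif len(kids) == 1:              # single child covers the same span
            s1, t1 = best_derivation(kids[0], words, i, j, prob, grammar)
            score, tree = p * s1, (prod.name, (i, j), [t1])
        else:                             # two children: try every split point k
            score, tree = 0.0, None
            for k in range(i, j):
                sl, tl = best_derivation(kids[0], words, i, k, prob, grammar)
                sr, tr = best_derivation(kids[1], words, k + 1, j, prob, grammar)
                if p * sl * sr > score:
                    score, tree = p * sl * sr, (prod.name, (i, j), [tl, tr])
        if score > best_score:
            best_score, best_tree = score, tree
    return best_score, best_tree

words = "the states bordering Texas".split()
print(best_derivation("STATE", words, 0, len(words) - 1))
# Picks NEXT_TO -> next_to over "the states bordering" and
# STATE -> STATEID -> 'texas' over "Texas".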
