Lightweight Natural Language Database Interfaces

Lightweight Natural Language Database Interfaces. Jun. 23, 2004. In-Su Kang*, Seung-Hoon Na, Jong-Hyeok Lee, Gijoo Yang. Dept. of Computer Science & Engineering Pohang University of Science and Technology (POSTECH) R. of KOREA. Contents. Motivations Introduction to NLDBI



Presentation Transcript


  1. Lightweight Natural Language Database Interfaces Jun. 23, 2004 In-Su Kang*, Seung-Hoon Na, Jong-Hyeok Lee, Gijoo Yang Dept. of Computer Science & Engineering Pohang University of Science and Technology (POSTECH) R. of KOREA

  2. Contents • Motivations • Introduction to NLDBI • Issues & our concerns • Two motivations • Lightweight architecture • Lightweight NLDBI • Domain adaptation • Question answering • Conclusion

  3. Introduction • Natural Language DataBase Interfaces (NLDBI) • Access database data in natural languages [Androutsopoulos, 1995] • Main components (diagram): Natural Language Question → Analysis (using Linguistic Knowledge) → Meaning Representation → Translation (using Translation Knowledge) → Database Query → DBMS → Answer

  4. Terminology • Domain class • Refers to a table or a column • (e.g.) T_Customer, C_ID, C_Name • Domain class instance • Individual column value • (e.g.) 1034, 1035, “Bill Clinton”, “Jimmy Carter” • Class term • A lexical term referring to a domain class, such as “customer” • Value term • A lexical term indicating a domain class instance, such as “Bill” • Example table T_Customer with columns (C_ID, C_Name): (1034, Bill Clinton), (1035, Jimmy Carter)

  5. NLDBI Issues & Our Concerns • Process • Natural language understanding • Spoken language & meaning representation • Discourse analysis & dialogue model • Database query conversion (NL → DB) • Paraphrase problem: M-to-1 • Translation ambiguity problem: 1-to-N • Natural language generation • Co-operative answering • Knowledge management • Linguistic knowledge • Translation knowledge • Representation, acquisition • Domain transportability problem

  6. Motivation 1 • Previous translation knowledge acquisition • Complex translation knowledge representation • Expensive expertise required (AI/NLP/DBMS/domain knowledge) • (e.g.) Devise conversion rules from parse trees to database query expressions • (e.g.) Define database relations for logical predicates • Difficulty in initial creation and scalable expansion • Causes the domain transportability problem • No general solution • As one solution, domain tool methods are tightly coupled with the underlying NLDBI systems • (e.g.) IRUS, CHAT-80, ASK, EUFID, TEAM, MASQUE, … • Our proposal • Semi-Automatic Acquisition by Simplifying Translation Knowledge Structures

  7. Motivation 2 • Translation ambiguity • Class term ambiguity • A class term refers to several domain classes • ‘address’ → TB_Customer.Address or TB_Employee.Address • Value term ambiguity • A value term refers to several domain class instances • ‘London’ → TB_Flight.departure or TB_Flight.arrival • Resolution of translation ambiguity • So far, no systematic disambiguation scheme • We propose a Noun Translation Technique based on an Information Retrieval Framework

  8. Lightweight NLDBI Architecture

  9. Contents • Motivations • Lightweight NLDBI • Domain adaptation • Semi-automatic acquisition of translation knowledge • Physical Entity-Relationship Schema (pER schema) • Translation knowledge structures • Translation knowledge construction • Examples • Question answering • Conclusion

  10. Semi-Automatic Acquisition • Procedure (diagram): Reverse Engineering → Physical schema → Linguistic Annotation by domain experts (within a DB modeling tool) → pER schema → Automatic Extraction → Initial Trans. Know. DB • Input guidelines • (1) To each domain class, give a Linguistic Name (in the form of an NP) • (2) Make any linguistic description (called a Domain Sentence) about or among domain classes (in the form of simple sentences) • (3) In (2), an NP referring to a domain class should be either its linguistic name defined in (1), or the domain class itself

  11. Physical Entity-Relationship (pER) Schema • pER schema = pER graph + pER description • pER graph = a physical schema • Encode structural constraints among DB objects • Property-of b/w an entity and its attributes • Semantic relationship among entities and/or attributes • pER description = linguistic annotations on a pER graph • Bridge b/w DB objects and natural language expressions

  12. Translation Knowledge Structures • Class-referring info. (for the paraphrase problem) • Class document for each domain class • Synonymous class terms, and their concept codes • Value document for each column • All-length n-grams / pattern-based 2-grams generated from column data • Class-constraining info. (for the translation ambiguity problem) • Valency-based selection restrictions • Restrictions that domain verbs or case markers impose on domain classes • ⟨order, {T_Customer, T_Product, T_Order.Date}⟩ • ⟨from, {T_Flight.Departure}⟩, ⟨to, {T_Flight.Arrival}⟩ • Collocation document for each domain class • Linguistic collocations of a domain class

  13. Translation Knowledge Construction (diagram) • Inputs: NL description (linguistic names, domain sentences), DB column data • Processes: syntactic analysis, n-gram value indexing, class term extraction, concept hierarchy • Intermediate results: value terms, class terms • Class-referring information: value documents, class documents • Class-constraining information: valency-based selection restrictions, collocation documents

  14. Contents • Motivations • Lightweight NLDBI • Domain adaptation • Semi-automatic acquisition of translation knowledge • Physical Entity-Relationship Schema (pER schema) • Translation knowledge structures • Translation knowledge construction • Examples • Question answering • Conclusion

  15. Semi-Automatic Acquisition: Physical DB Schema • Physical DB schema for a university course domain • Reverse-engineered by a DB modeling tool

  16. Semi-Automatic Acquisition: NL Descriptions • Domain experts annotate NL descriptions on the physical schema

  17. Semi-Automatic Acquisition: Initial Translation Knowledge • Class-Referring Translation Knowledge • Domain class: T2C2 • Linguistic name: ‘Course name’ • All column values (non-alphanumeric): Statistics, Algorithms • Class document T2C2C: ‘Course name’, ‘Name’ • Value document T2C2V: Statistics, Algorithms

  18. Semi-Automatic Acquisition: Initial Translation Knowledge • Class-Referring Translation Knowledge • Class and value documents are built from linguistic names and DB tuples
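The class documents on the slides ('Course name' yielding 'Course name' and 'Name') look like the head-final suffixes of the linguistic name. The sketch below shows that reading only; the function name `class_terms` and the suffix rule itself are assumptions inferred from the slide examples, not a procedure stated in the presentation.

```python
def class_terms(linguistic_name: str) -> list[str]:
    """Return every word-level suffix of a linguistic name, longest first.

    Assumption: the head noun is the last word, so each suffix is still a
    plausible (shorter) way to refer to the same domain class.
    """
    words = linguistic_name.lower().split()
    return [" ".join(words[start:]) for start in range(len(words))]


print(class_terms("Course name"))
print(class_terms("Student identification number"))
```

Shorter suffixes such as "name" are what create class term ambiguity later on, because several domain classes end up sharing them.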

  19. Semi-Automatic Acquisition: Initial Translation Knowledge • Class-Constraining Translation Knowledge (diagram, built from all domain sentences) • Linguistic name of T1C3: ‘Entrance year’ • Domain sentence: “Students take courses in ‘T3C3’” • Extracted pairs: Take-student, Take-course, Take-(in) T3C3 • Class document T1: Student • Valency-based restriction: Take: {student, course, T3C3}, mapped through the class documents to Take: {T1, T2, T3C3} • Collocation document T1C3: Entrance, student

  20. Semi-Automatic Acquisition: Initial Translation Knowledge • Class-Constraining Translation Knowledge • Collocation-based selection restriction • Valency-based selection restriction

  21. Semi-Automatic Acquisition: Expansion of Initial Translation Knowledge • Paraphrase expansion by WordNet • Initial class document T1C3: Instructor, professor • Extended class document T1C3: • Instructor, teacher0, educator1, pedagogue1, professional2, professional_person2, adult3, grownup3, person4, individual4, someone4, somebody4, mortal4, human4, soul4 • Professor, academician1, academic1, faculty_member1, educator2, pedagogue2, professional3, professional_person3, adult4, grownup4, person5, individual5, someone5, somebody5, mortal5, human5, soul5 • Person, … • Adult, … • Educator, … • Instructor, …
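The expansion above can be sketched as a walk up a hypernym chain, tagging each lemma with its distance from the starting term. The trailing digits on the slide appear to be such levels, but that reading is an assumption. The two dictionaries are a hand-copied toy fragment of the slide's terms rather than a WordNet lookup, and the names `SYNONYMS`, `HYPERNYM_OF` and `expand` are hypothetical.

```python
# Toy fragment copied from the slide; a real system would query WordNet.
SYNONYMS = {
    "instructor": ["instructor", "teacher"],
    "professor": ["professor"],
    "academician": ["academician", "academic", "faculty_member"],
    "educator": ["educator", "pedagogue"],
    "professional": ["professional", "professional_person"],
    "adult": ["adult", "grownup"],
    "person": ["person", "individual", "someone", "somebody",
               "mortal", "human", "soul"],
}
HYPERNYM_OF = {
    "instructor": "educator",
    "professor": "academician",
    "academician": "educator",
    "educator": "professional",
    "professional": "adult",
    "adult": "person",
}


def expand(term: str) -> list[tuple[str, int]]:
    """Return (lemma, level) pairs for the term and all of its hypernyms."""
    expanded = []
    node, level = term, 0
    while node is not None:
        expanded.extend((lemma, level) for lemma in SYNONYMS[node])
        node = HYPERNYM_OF.get(node)  # None once the top of the chain is reached
        level += 1
    return expanded


print(expand("instructor")[:4])
```

How the level is used at retrieval time (for example, to down-weight distant hypernyms) is not stated on the slides.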

  22. Contents • Motivations • Lightweight NLDBI • Domain adaptation • Question answering • Question analysis • Noun translation • Class retrieval • Class disambiguation • Query graph & SQL generation • Conclusion

  23. Question Analysis & Noun Translation • Question analysis by parsing • A set of question nouns • Each noun has features: question focus, value operator, etc. • A set of predicate-argument (P-A) pairs • Noun translation (or domain class tagging) • Given a question noun, find the most probable domain class • Class retrieval • Retrieve candidate domain classes for each question noun • Lexically or conceptually equivalent domain classes • Class disambiguation • Select the most likely domain class

  24. Question Analysis & Noun Translation • Question : “Show me the names of students who got A in statistics from 1999”

  25. Class Retrieval • Information Retrieval (IR) framework • Translation knowledge → a target document collection • Class/value/collocation documents, valency-based selection restrictions • A question noun → an IR query • Class term → a surface word form & concept codes • ‘customer’, ‘product’ • Linguistic value term → all-length n-grams for Korean • ‘Bill’, ‘Bush’ • Alphanumeric value term → pattern-based 2-grams • C1: 1-byte character, C2: 2-byte character, N: decimal, S: special character
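A minimal sketch of value-term retrieval with all-length n-grams follows. It assumes that a column is retrieved when the query term equals one of the character n-grams of its values; the matching rule and the names `all_length_ngrams`, `retrieve_value_classes` and `COLUMN_VALUES` are illustrative assumptions. The column data comes from the T_Customer example on the terminology slide.

```python
def all_length_ngrams(text: str) -> set[str]:
    """Every contiguous character n-gram of `text`, for n = 1 .. len(text)."""
    return {
        text[start:end]
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
    }


def retrieve_value_classes(term: str, column_values: dict[str, list[str]]) -> set[str]:
    """Return the columns whose value document contains `term` as an n-gram."""
    return {
        domain_class
        for domain_class, values in column_values.items()
        if any(term in all_length_ngrams(value) for value in values)
    }


# Column data taken from the T_Customer example table.
COLUMN_VALUES = {"T_Customer.C_Name": ["Bill Clinton", "Jimmy Carter"]}

print(retrieve_value_classes("Bill", COLUMN_VALUES))
print(retrieve_value_classes("Bush", COLUMN_VALUES))
```

Indexing every n-gram avoids the need for a word segmenter, which is presumably why the slides single it out for Korean; the cost is a value document that grows quadratically with the length of each value.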

  26. Class Disambiguation • Definition of a class retrieval function • Notation: RC(t) means the set of domain classes retrieved from a document collection C using a query term t • Rref(t): retrieves from ref (a set of class/value documents) • Rval(t): retrieves from val (valency-based constraints) • Consider valency-based constraints as documents • Rcol(t): retrieves from col (collocation-based documents) • Class disambiguation by a Boolean retrieval model • Valency-based • Rref(t) ∩ Rval(head(t)) • Collocation-based • Rref(t) ∩ Rcol(adjacent(t))

  27. Class Retrieval & Class Disambiguation (value term ambiguity) • Q: Show me the names of students who got A in statistics from 1999 • Question noun: ‘1999’ • Head verb of ‘1999’: ‘Get’ • Relevant domain classes from class/value documents: {T1C3v, T3C3v} • Valency-based constraint: Get: {T1, T3, T3C4, T3C3} • Disambiguation: Rref(‘1999’) ∩ Rval(head(‘1999’)) = {T3C3v}

  28. Class Retrieval & Class Disambiguation (class term ambiguity) • Q: Show me the names of students who got A in statistics from 1999 • Question noun: ‘Name’ • Adjacent word of ‘Name’: ‘Student’ • Relevant domain classes from class/value documents: {T1C2c, T2C2c} • Collocation-based constraint: {T1C1, T1C2, T1C3} • Disambiguation: Rref(‘Name’) ∩ Rcol(adjacent(‘Name’)) = {T1C2c}
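The two worked examples reduce to a set intersection, sketched below with the candidate sets copied from the slides. One detail is an assumption: the slide labels retrieved candidates with a trailing "v" or "c" (value or class document), so the sketch stores each hit as a (domain class, document kind) pair and compares only the domain class. The function name `disambiguate` is hypothetical.

```python
def disambiguate(ref_hits: set[tuple[str, str]],
                 constraint_classes: set[str]) -> set[tuple[str, str]]:
    """Boolean AND: keep the retrieved hits whose domain class is allowed
    by the valency-based or collocation-based constraint."""
    return {hit for hit in ref_hits if hit[0] in constraint_classes}


# Value term ambiguity: '1999' with head verb 'get'.
r_ref_1999 = {("T1C3", "v"), ("T3C3", "v")}
r_val_get = {"T1", "T3", "T3C4", "T3C3"}
print(disambiguate(r_ref_1999, r_val_get))

# Class term ambiguity: 'name' with adjacent word 'student'.
r_ref_name = {("T1C2", "c"), ("T2C2", "c")}
r_col_student = {"T1C1", "T1C2", "T1C3"}
print(disambiguate(r_ref_name, r_col_student))
```

A pure Boolean model returns an empty set when no candidate satisfies the constraint; the slides do not say what happens in that case.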

  29. Query Graph & SQL Generation • Query graph • A minimal connected sub-graph • A node is a disambiguated domain class for each question noun • The query graph is located in the physical schema graph using Meng’s method (Meng et al. 1999) • SQL generation from a query graph • Entity nodes → SQL-FROM • Arcs b/w entity nodes → join operations in SQL-WHERE • From question analysis • Domain class having the question focus feature → SQL-SELECT • Domain class having the value operator feature → SQL-WHERE
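To make the idea of a connected sub-graph concrete, the sketch below joins the disambiguated classes by breadth-first shortest paths from the first one. This is a simple stand-in heuristic, not Meng's method, and it does not guarantee minimality in general. The schema edges are an assumption inferred from the join conditions in the slide's SQL example, and all function names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search returning the node list from source to target."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    raise ValueError(f"{target} is not reachable from {source}")


def connecting_subgraph(graph, terminals):
    """Union of shortest paths from the first terminal to every other one."""
    nodes = {terminals[0]}
    for terminal in terminals[1:]:
        nodes.update(shortest_path(graph, terminals[0], terminal))
    return nodes


# Assumed schema: entity-attribute edges plus T1-T3 and T2-T3 relationships.
SCHEMA = build_graph([
    ("T1", "T1C1"), ("T1", "T1C2"), ("T1", "T1C3"),
    ("T2", "T2C1"), ("T2", "T2C2"),
    ("T3", "T3C1"), ("T3", "T3C2"), ("T3", "T3C3"), ("T3", "T3C4"),
    ("T1", "T3"), ("T2", "T3"),
])

nodes = connecting_subgraph(SCHEMA, ["T1C2", "T2C2", "T3C3", "T3C4"])
print(sorted(n for n in nodes if n in {"T1", "T2", "T3"}))
```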

  30. Query Graph & SQL Generation SELECT T1C2 FROM T1, T2, T3 WHERE T1.T1C1 = T3.T3C1 and T2.T2C1 = T3.T3C2 and T2C2 = ‘Statistics’ and T3C3 = ‘A’ and T3C4 >= 1999
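The mapping from query graph to SQL clauses can be sketched as plain string assembly. The function `generate_sql` and its parameter names are hypothetical; the inputs are the entities, joins and conditions of the slide's example. Values are inlined as literals for illustration only, whereas production code would use parameterized queries.

```python
def generate_sql(focus_columns, entities, joins, conditions):
    """Assemble a SELECT statement from the parts of a query graph.

    focus_columns: domain classes with the question focus feature (SELECT)
    entities:      entity nodes of the query graph (FROM)
    joins:         (left_column, right_column) arcs between entities (WHERE)
    conditions:    (column, operator, literal) from value operators (WHERE)
    """
    predicates = [f"{left} = {right}" for left, right in joins]
    predicates += [f"{column} {op} {literal}" for column, op, literal in conditions]
    sql = f"SELECT {', '.join(focus_columns)} FROM {', '.join(entities)}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql


sql = generate_sql(
    focus_columns=["T1C2"],
    entities=["T1", "T2", "T3"],
    joins=[("T1.T1C1", "T3.T3C1"), ("T2.T2C1", "T3.T3C2")],
    conditions=[("T2C2", "=", "'Statistics'"), ("T3C3", "=", "'A'"), ("T3C4", ">=", "1999")],
)
print(sql)
```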

  31. Conclusion • Lightweight NLDBI • Domain adaptation (to deal with a paraphrase problem) • Simplification of translation knowledge in the form of documents • Semi-automatic construction of translation knowledge • Expansion of translation knowledge by dictionary • Question answering (to resolve translation ambiguities) • Noun translation technique based on an IR framework • Class retrieval • Class disambiguation

  32. Semi-Automatic Acquisition: Initial Translation Knowledge • Class-Referring Translation Knowledge • Domain class: T1C1 • Linguistic name: ‘Student identification number’ • All column values (alphanumeric): 1999-0011, 2001-0027 • Pattern symbols: 1-byte char → C, 2-byte char → C, special char → S, decimal → N • Value pattern: n4s1n4 • Class document T1C1C: ‘Student identification number’, ‘Identification number’, ‘Number’ • Value document T1C1V (n-grams): n4s1n4, n4, s1, n4s1, s1n4
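The pattern-based indexing of alphanumeric values can be sketched as run-length encoding over character classes, followed by 1-grams and 2-grams over the resulting runs. This reproduces the slide's value document for 1999-0011, but the exact procedure is an assumption. The transcript has lost whatever distinguished the two character symbols, so "c" (ASCII letter) and "C" (any other letter) are placeholders chosen here, and the function names are hypothetical.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to a pattern symbol (the letter symbols are assumed)."""
    if ch.isdigit():
        return "n"                      # decimal
    if ch.isalpha():
        return "c" if ord(ch) < 128 else "C"
    return "s"                          # special character


def pattern_tokens(value: str) -> list[str]:
    """Run-length encode the character classes, e.g. '1999-0011' -> n4, s1, n4."""
    return [f"{symbol}{len(list(run))}" for symbol, run in groupby(value, key=char_class)]


def value_document_terms(value: str) -> list[str]:
    """Full pattern, then its 1-grams and 2-grams, without duplicates."""
    tokens = pattern_tokens(value)
    terms = ["".join(tokens)] + tokens
    terms += [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]
    return list(dict.fromkeys(terms))   # de-duplicate while preserving order


print(value_document_terms("1999-0011"))
```

Because 1999-0011 and 2001-0027 share one pattern, the value document stays small no matter how many identifiers the column holds, and an unseen identifier of the same shape still matches.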