
UNIVERSITY OF MARIBOR

Inferring Context-Free Grammars for Domain-Specific Languages. Matej Črepinšek, Marjan Mernik, University of Maribor, Slovenia. Barrett R. Bryant, Faizan Javed, Alan Sprague, The University of Alabama at Birmingham, USA. Faculty of Electrical Engineering and Computer Science.



  1. Inferring Context-Free Grammars for Domain-Specific Languages • Matej Črepinšek, Marjan Mernik, University of Maribor, Slovenia • Barrett R. Bryant, Faizan Javed, Alan Sprague, The University of Alabama at Birmingham, USA • Faculty of Electrical Engineering and Computer Science, University of Maribor

  2. Outline of the Presentation • Motivation • Related Work • Inferring CFG for DSLs • Results • Conclusion

  3. Motivation • Machine learning of grammars finds many applications in • syntactic pattern recognition, • computational biology, • computational linguistics, etc. • Can grammatical inference also be useful in software engineering?

  4. Motivation • Software engineers would like to recover grammars from legacy systems in order to automatically generate various software analysis and modification tools. • Currently, this cannot be done for real general-purpose languages (GPLs, e.g., COBOL) using grammatical inference. • Grammars can be semi-automatically recovered from compilers and language manuals [R. Laemmel, C. Verhoef. Semi-automatic Grammar Recovery. SP&E, Vol. 31, No. 15, 2001].

  5. Motivation • What about grammar inference for DSLs (e.g., FDL, VHDL)? • FDL: car: all(carBody, Transmission, Engine) Transmission: one-of(automatic, manual) Engine: more-of(electric, gasoline) • VHDL: entity HALFADDER is port(A, B: in bit; SUM, CARRY: out bit); end HALFADDER; • So far, experiments have been performed only on theoretical sample languages, such as $L = \{ww \mid w \in \{a,b\}^+\}$ and $L = \{w \mid w = w^R,\ w \in \{a,b\}^+\}$.

  6. Motivation • Grammars are found in many applications outside language definition and implementation. • Grammar-based systems (GBSs) [M. Mernik, M. Črepinšek, T. Kosar, D. Rebernak, V. Žumer. Grammar-based Systems: Definition and Examples. Informatica, 28(3):245-254, 2004] • In these cases, the grammar needs to be extracted solely from artifacts represented as sentences/programs written in some unknown language.

  7. Motivation • Metamodel • Model: an instance of the metamodel

  8. Motivation • VideoStore ::= MOVIES CUSTOMERS • MOVIES ::= MOVIES MOVIE | MOVIE • MOVIE ::= title type • CUSTOMERS ::= CUSTOMERS CUSTOMER | ε • CUSTOMER ::= name days RENTALS • RENTALS ::= RENTALS RENTAL | RENTAL • RENTAL ::= MOVIE • Sample: TheRing reg Andy 3 TheRing reg • Sample: TheRing reg Shrek2 child Ann 1 Shrek2 child

  9. Motivation • [Figure: inferred metamodel with classes NT5 (title, type), NT11 and NT6 (name, days), linked by 1..* associations] • Sample: TheRing reg Andy 3 TheRing reg • Sample: TheRing reg Shrek2 child Ann 1 Shrek2 child • NT15 ::= NT11 NT7 NT15 | ε • NT11 ::= NT10 NT6 • NT10 ::= NT5 NT10 | ε • NT7 ::= NT5 NT7 | ε • NT6 ::= name days • NT5 ::= title type

  10. Related Work • Gold's theorem: it is impossible to identify any of the four classes of languages in the Chomsky hierarchy in the limit using only positive samples. • Positive and negative samples are needed. • So far, grammar inference has been mainly successful in inferring regular languages.

  11. Related Work (Regular Grammars) • A number of algorithms (e.g., RPNI) first construct the maximal canonical automaton (MCA(S+)) or the prefix tree acceptor (PTA(S+)) from the positive samples S+, and then generalize the automaton by a state-merging process.
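The prefix tree acceptor mentioned above can be sketched as follows. This is a minimal illustration, not the code of RPNI or of any tool named in the slides; the function names `build_pta` and `accepts` and the dictionary-based automaton representation are assumptions made for the example. State merging, the generalization step, is not shown.

```python
def build_pta(positive_samples):
    """Build a prefix tree acceptor: one state per distinct prefix of the samples."""
    transitions = {}   # (state, symbol) -> next state
    accepting = set()  # states reached at the end of a positive sample
    next_state = 1     # state 0 is the root (the empty prefix)
    for sample in positive_samples:
        state = 0
        for symbol in sample:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(transitions, accepting, sample):
    """Run the acceptor on a sample; reject as soon as a transition is missing."""
    state = 0
    for symbol in sample:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


if __name__ == "__main__":
    pta, final_states = build_pta(["ab", "abb", "b"])
    print(accepts(pta, final_states, "abb"))  # a positive sample is accepted
    print(accepts(pta, final_states, "a"))    # a proper prefix is not
```

Before any merging, the acceptor recognizes exactly the positive samples, which is why a generalization step is needed afterwards.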

  12. Related Work (Regular Grammars) The following equation enumerates the search space:

  13. Related Work (CF Grammars) • Learning context-free grammars G=(V, T, P, S) is more difficult than learning regular grammars. • Using representative positive samples (that is, positive samples which exercise every production rule in the grammar) along with negative samples did not result in the same level of success as with regular grammar inference.

  14. Related Work (CF Grammars) • Hence, some researchers resorted to using additional knowledge to assist in the induction process (e.g., skeleton derivation trees, i.e., unlabelled derivation trees). • [Figure: skeleton derivation trees for expressions over num and +]

  15. Inferring CFG • What is the search space in the case of CFG inference? • If we limit ourselves to binary trees (CNF), then the number of all possible unlabelled derivation trees is given by the n-th Catalan number: $C_n = \frac{1}{n+1}\binom{2n}{n}$

  16. Inferring CFG • For example, there are 14 different full binary trees with l = 5 leaves (the Catalan number C_4 = 14) …
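The count of 14 can be checked with a short sketch. It assumes that l is the number of leaves (the sentence length), so that the relevant Catalan index is n = l - 1; the function names are hypothetical.

```python
from functools import lru_cache
from math import comb


def catalan(n):
    """n-th Catalan number, C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


@lru_cache(maxsize=None)
def count_full_binary_trees(leaves):
    """Count full binary trees by splitting the leaves between left and right subtree."""
    if leaves == 1:
        return 1  # a single leaf
    return sum(
        count_full_binary_trees(left) * count_full_binary_trees(leaves - left)
        for left in range(1, leaves)
    )


if __name__ == "__main__":
    for l in range(1, 8):
        # The direct count and the closed form agree for every l.
        print(l, count_full_binary_trees(l), catalan(l - 1))
```

The rapid growth of these numbers, before any labelling of interior nodes, is what makes exhaustive search impractical.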

  17. Inferring CFG • For full binary trees to be valid derivation trees, the interior nodes need to be labeled with non-terminals.

  18. Inferring CFG Search space of context-free grammar inference

  19. Inferring CFG • For effective use of an evolutionary algorithm we have to choose a suitable representation of the problem, suitable parameters and genetic operators, and the evaluation function to determine the fitness of chromosomes.

  20. Inferring CFG • [Figure: crossover and mutation on grammar chromosomes built from productions such as E → int T, E → T E, T → operator E and T → ε, with the crossover point and the mutation point marked]
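Crossover and mutation on grammar chromosomes might look like the sketch below. The slides do not specify the chromosome encoding, so everything here is an assumption: a grammar is a list of `(lhs, rhs)` productions, crossover swaps the tails of two such lists, and mutation replaces one right-hand-side symbol. Cut points are passed explicitly to keep the example deterministic; an evolutionary run would draw them at random.

```python
def crossover(parent_a, parent_b, point_a, point_b):
    """Single-point crossover: exchange the production tails after the cut points."""
    child_a = parent_a[:point_a] + parent_b[point_b:]
    child_b = parent_b[:point_b] + parent_a[point_a:]
    return child_a, child_b


def mutate(grammar, production_index, symbol_index, new_symbol):
    """Return a copy of the grammar with one right-hand-side symbol replaced."""
    lhs, rhs = grammar[production_index]
    new_rhs = rhs[:symbol_index] + (new_symbol,) + rhs[symbol_index + 1:]
    mutated = list(grammar)
    mutated[production_index] = (lhs, new_rhs)
    return mutated


if __name__ == "__main__":
    # An empty tuple stands for an epsilon production.
    g1 = [("E", ("int", "T")), ("T", ("operator", "E")), ("T", ())]
    g2 = [("E", ("T", "E")), ("T", ("operator", "E")), ("T", ())]
    print(crossover(g1, g2, 1, 2))
    print(mutate(g1, 1, 1, "T"))
```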

  21. Inferring CFG • To enhance the search, the following heuristic operators have been proposed: • option operator, • iteration* operator, and • iteration+ operator. • [Figure: option operator applied at the option point; the grammar E → int T, T → operator E, T → ε becomes E → int T, T → operator F, T → ε, F → E, F → ε]
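One plausible reading of the option operator, based only on the example grammar on this slide, is sketched below: the symbol at the option point is replaced by a fresh non-terminal that derives either the original symbol or the empty string. The function name, the `(lhs, rhs)` encoding and the explicit `fresh_nonterminal` argument are assumptions for illustration.

```python
def apply_option(grammar, production_index, symbol_index, fresh_nonterminal):
    """Make the symbol at the option point optional via a fresh non-terminal."""
    lhs, rhs = grammar[production_index]
    target = rhs[symbol_index]
    new_rhs = rhs[:symbol_index] + (fresh_nonterminal,) + rhs[symbol_index + 1:]
    result = list(grammar)
    result[production_index] = (lhs, new_rhs)
    result.append((fresh_nonterminal, (target,)))  # F -> original symbol
    result.append((fresh_nonterminal, ()))         # F -> epsilon
    return result


if __name__ == "__main__":
    grammar = [("E", ("int", "T")), ("T", ("operator", "E")), ("T", ())]
    for production in apply_option(grammar, 1, 1, "F"):
        print(production)
```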

  22. Inferring CFG • [Figure: evolutionary cycle. Population of grammars → parser generation with the LISA compiler generator (a generated parser for each grammar in the population) → test grammars: run the parser on each fitness case (positive and negative samples) → successfulness of parsing → fitness value → selection → crossover and mutation]

  23. Inferring CFG • For the given grammar[i], its fitness $f_j(grammar[i])$ on the j-th fitness case is defined as: $f_j(grammar[i]) = \frac{length(\text{successfully parsed } program_j)}{2 \cdot length(program_j)}$ • Finally, the total fitness $f(grammar[i])$ is defined as: $f(grammar[i]) = \frac{1}{N}\sum_{k=1}^{N} f_k(grammar[i])$

  24. Inferring CFG • If a grammar correctly recognizes all positive samples, then it is also tested on negative samples. Its fitness value is defined as: $f(grammar[i]) = 1.0 - \frac{m}{2M}$, where m = number of fully parsed negative samples and M = number of all negative samples.
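The two fitness definitions can be combined in a small sketch. It assumes that the slides' `/length(program_j)*2` and `m/M*2` mean division by twice the denominator, so that positive samples alone yield at most 0.5 and the negative-sample stage yields values between 0.5 and 1.0. The function and parameter names are hypothetical, and parsed lengths are supplied directly rather than obtained from a generated parser.

```python
def positive_fitness(parsed_lengths, program_lengths):
    """Average over all positive samples of parsed length / (2 * program length)."""
    per_case = [p / (2 * l) for p, l in zip(parsed_lengths, program_lengths)]
    return sum(per_case) / len(per_case)


def fitness(parsed_lengths, program_lengths, negatives_accepted, negatives_total):
    """Two-stage fitness: negatives are considered only if every positive sample parses fully."""
    fully_parsed = all(p == l for p, l in zip(parsed_lengths, program_lengths))
    if not fully_parsed:
        return positive_fitness(parsed_lengths, program_lengths)
    return 1.0 - negatives_accepted / (2 * negatives_total)


if __name__ == "__main__":
    # Half of the first program and all of the second were parsed.
    print(fitness([5, 3], [10, 3], 0, 4))
    # All positives parsed, one of four negatives wrongly accepted.
    print(fitness([10, 3], [10, 3], 1, 4))
```

Under this reading, a grammar that fails a positive sample scores strictly below 0.5, and any grammar that parses all positive samples scores at least 0.5.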

  25. Inferring CFG • The initial population should not be generated completely at random. • [Figure: derivation tree of the sample #id := #int + #int, with interior nodes NT1 to NT8] • NT8 -> NT7 NT1 • NT7 -> NT5 NT6 • NT6 -> NT1 NT3 • NT5 -> NT4 NT2 • NT4 -> #id • NT3 -> + • NT2 -> := • NT1 -> #int
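As a quick check, the productions listed on this slide can be expanded from NT8 to confirm that they derive the sample sentence. The dictionary encoding and the `expand` helper are assumptions for the example; it works only because each non-terminal here has exactly one production.

```python
# Productions copied from the slide; each non-terminal has a single right-hand side.
GRAMMAR = {
    "NT8": ["NT7", "NT1"],
    "NT7": ["NT5", "NT6"],
    "NT6": ["NT1", "NT3"],
    "NT5": ["NT4", "NT2"],
    "NT4": ["#id"],
    "NT3": ["+"],
    "NT2": [":="],
    "NT1": ["#int"],
}


def expand(symbol, grammar):
    """Recursively replace non-terminals until only terminal tokens remain."""
    if symbol not in grammar:
        return [symbol]  # terminal token
    tokens = []
    for child in grammar[symbol]:
        tokens.extend(expand(child, grammar))
    return tokens


if __name__ == "__main__":
    print(" ".join(expand("NT8", GRAMMAR)))
```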

  26. Inferring CFG • Identify sub-languages and construct derivation trees for sub-programs first. But this is as hard as the original problem. • We can use an approximation: frequent sequences. • A string of symbols is called a frequent sequence if it appears at least a preset threshold number of times.
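Frequent sequences can be found with a simple n-gram count, as in the sketch below. The slides do not say how sequences are enumerated or bounded, so the contiguous-token windows, the `min_length` and `max_length` limits and the function name are assumptions.

```python
from collections import Counter


def frequent_sequences(samples, threshold, min_length=2, max_length=4):
    """Return contiguous token sequences that occur at least `threshold` times."""
    counts = Counter()
    for tokens in samples:
        for length in range(min_length, max_length + 1):
            for start in range(len(tokens) - length + 1):
                counts[tuple(tokens[start:start + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    samples = [
        ["#id", ":=", "#int", "+", "#int"],
        ["#id", ":=", "#int"],
    ]
    for sequence, count in frequent_sequences(samples, threshold=2, max_length=3).items():
        print(" ".join(sequence), count)
```

Each frequent sequence is a candidate right-hand side for a new non-terminal, which gives a seeded starting point instead of purely random initial grammars.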

  27. Inferring CFG • GIE-BF tool

  28. Results • Using the presented approach, we were able to infer grammars for small DSLs (Table 2 in the paper). • An example of positive/negative samples and control parameters (Table 3 in the paper). • Comparison of inferred and original grammars (Tables 4 and 5 in the paper).

  29. Conclusion • Ongoing research on context-free grammar inference was presented. • So far, we have been able to infer grammars for DSLs which are bigger in size and more pragmatic than in other research efforts. • We are convinced that this approach, when enhanced with other data mining techniques and heuristics, is scalable and feasible for inferring grammars of more realistically sized languages.

  30. Thank you! http://www.cis.uab.edu/softcom/GenParse
