
Grammatical inference Vs Grammar induction


Presentation Transcript


  1. Grammatical inference Vs Grammar induction London 21-22 June 2007 Colin de la Higuera

  2. Summary • Why study the algorithms and not the grammars • Learning in the exact setting • Learning in a probabilistic setting

  3. 1 Why study the process and not the result? • The usual approach in grammatical inference is to build a grammar (automaton), small and adapted in some way to the data from which we are supposed to learn.

  4. Grammatical inference • Is about learning a grammar given information about a language.

  5. Grammar induction • Is about learning a grammar given information about a language.

  6. Grammatical inference vs grammar induction: what is the difference? [Diagram: Data → G]

  7. Motivating* example #1 • Is 17 a random number? • Is 17 more random than 25? • Suppose I had a random number generator, would I convince you by showing how well it does on an example? On various examples? *(and only slightly provocative)

  8. Motivating example #2 • Is 01101101101101010110001111 a random sequence? • What about aaabaaabababaabbba?

  9. Motivating example #3 • Let X be a sample of strings. Is grammar G the correct grammar for sample X? • Or is it G′? • Correct meaning something like “the one we should learn”

  10. Back to the definition • Grammar induction and grammatical inference are about finding a/the grammar from some information about the language. • But once we have done that, what can we say?

  11. What would we like to say? • That the grammar is the smallest, or the best with respect to some score → a combinatorial characterisation • What we really want to say is that, having solved some complex combinatorial question, we have an Occam / compression / MDL / Kolmogorov-style argument proving that what we have found is of interest.

  12. What else might we like to say? • That in the near future, given some string, we can predict if this string belongs to the language or not. • It would be nice to be able to bet £100 on this.

  13. What else would we like to say? • That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased). • Idea: blame the data, not the algorithm.

  14. Suppose we cannot say anything of the sort? • Then that means that we may be terribly wrong even in a favourable setting.

  15. Motivating example #4 • Suppose we have an algorithm that ‘learns’ a grammar by iteratively applying the following two operations: • Merge two non-terminals whenever some nice MDL-like rule holds • Add a new non-terminal and rule corresponding to a substring when needed

  16. Two learning operators • Creation of non-terminals and rules: from NP → ART ADJ NOUN and NP → ART ADJ ADJ NOUN, introduce AP1 → ADJ NOUN to obtain NP → ART AP1 and NP → ART ADJ AP1

  17. Merging two non-terminals • From NP → ART AP1, NP → ART AP2, AP1 → ADJ NOUN and AP2 → ADJ AP1, merging AP2 into AP1 gives NP → ART AP1, AP1 → ADJ NOUN and AP1 → ADJ AP1
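A minimal sketch of these two operators, assuming a grammar is represented as a set of rules (lhs, rhs-tuple); the function names are mine, not the talk's:

```python
def create_rule(rules, substring, fresh):
    """Slide 16: introduce a fresh non-terminal for a substring,
    e.g. AP1 -> ADJ NOUN, and rewrite every right-hand side accordingly."""
    new_rules = set()
    for lhs, rhs in rules:
        rhs = list(rhs)
        i = 0
        while i + len(substring) <= len(rhs):
            if tuple(rhs[i:i + len(substring)]) == substring:
                rhs[i:i + len(substring)] = [fresh]
            i += 1
        new_rules.add((lhs, tuple(rhs)))
    new_rules.add((fresh, tuple(substring)))
    return new_rules

def merge_nonterminals(rules, keep, drop):
    """Slide 17: replace `drop` by `keep` everywhere (AP2 merged into
    AP1); duplicated rules collapse because the grammar is a set."""
    ren = lambda s: keep if s == drop else s
    return {(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in rules}

# Slide 16 worked through:
g = {("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))}
g = create_rule(g, ("ADJ", "NOUN"), "AP1")
# g is now {NP -> ART AP1, NP -> ART ADJ AP1, AP1 -> ADJ NOUN}
```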

  18. What is bound to happen? • We will learn a context-free grammar that can only generate a regular language. • Brackets are not found. • This is a hidden bias.

  19. But how do we say that a learning algorithm is good? • By accepting the existence of a target. • The question is that of studying the process of finding this target (or something close to this target). This is an inference process.

  20. If you don’t believe there is a target? • Or that the target belongs to another class • You will have to come up with another bias, for example believing that simplicity (e.g. MDL) is the correct way to handle the question.

  21. If you are prepared to accept there is a target but… • Either the target is known, and then what is the point of learning? • Or we don’t know it in the practical case (with this data set), and then it is of no use…

  22. Then you are doing grammar induction.

  23. Careful • Some statements are dangerous: • Algorithm A can learn {a^n b^n c^n : n ∈ ℕ} • Algorithm B can learn this rule with just 2 examples • Looks to me close to wanting a free lunch

  24. A compromise • You only need to believe there is a target while evaluating the algorithm. • Then, in practice, there may not be one!

  25. End of provocative example • If I run my random number generator and get 999999, I can only keep this number if I believe in the generator itself.

  26. Credo (1) • Grammatical inference is about measuring the convergence of a grammar learning algorithm in a typical situation.

  27. Credo (2) • Typical can be: • In the limit: learning is always achieved, one day • Probabilistic: • There is a distribution to be used (errors are measurably small) • There is a distribution to be found

  28. Credo (3) • Complexity theory should be used: the total or update runtime, the size of the data needed, the number of mind changes, the number and weight of errors… • …should be measured and limited.

  29. 2 Non-probabilistic setting • Identification in the limit • Resource-bounded identification in the limit • Active learning (query learning)

  30. Identification in the limit • The definitions, presentations • The alternatives • Order-free or not • Randomised algorithms

  31. A presentation is a function f: ℕ → X, where X is any set • yields: Presentations → Languages • If f(ℕ) = g(ℕ) then yields(f) = yields(g)

  32. Learning function • Given a presentation f, fn is the set of the first n elements in f. • A learning algorithm a is a function that takes as input a set fn = {f(0), …, f(n−1)} and returns a grammar. • Given a grammar G, L(G) is the language generated/recognised/represented by G.

  33. Identification in the limit • A class of languages L, named by a class of grammars G • Presentations are functions ℕ → X; yields is the naming function, with f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g) • A learner a identifies L in the limit if, for every presentation f of a language in L: ∃n ∈ ℕ : ∀k > n, L(a(fk)) = yields(f)
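A toy instance of this definition, under the assumption that the class is that of finite languages and a "grammar" is simply a finite set of strings (the learner below is my sketch, not the talk's): once every element of the language has appeared in the presentation, the hypothesis never changes again.

```python
def learner(prefix):
    """a(fn): map the first n elements of a presentation to a grammar,
    here just the set of strings seen so far."""
    return frozenset(prefix)

# A presentation f with f(N) = L for L = {"a", "ab", "abb"}: it cycles,
# so every element of L eventually appears.
L = ["a", "ab", "abb"]
f = lambda i: L[i % len(L)]

# After n = 3 the hypothesis is stable: L(a(fk)) = yields(f) for all k > n.
hyps = [learner([f(i) for i in range(n)]) for n in range(1, 8)]
assert all(h == frozenset(L) for h in hyps[2:])
```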

  34. What about efficiency? • We can try to bound: • global time • update time • errors before converging • mind changes • queries • good examples needed
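One of these quantities is easy to measure empirically; a hedged helper (the function name is mine) that counts the mind changes of a learner over the first N prefixes of a presentation:

```python
def mind_changes(learner, f, N):
    """Count how many times the hypothesis changes over the first N
    prefixes of the presentation f."""
    hyps = [learner([f(i) for i in range(n)]) for n in range(1, N + 1)]
    return sum(new != old for old, new in zip(hyps, hyps[1:]))

# For the finite-language example above, mind_changes(learner, f, 10) == 2:
# the hypothesis changes at n = 2 and n = 3, then never again.
```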

  35. What should we try to measure? • The size of G? • The size of L? • The size of f? • The size of fn?

  36. Some candidates for polynomial learning • Total runtime polynomial in ║L║ • Update runtime polynomial in ║L║ • # mind changes polynomial in ║L║ • # implicit prediction errors polynomial in ║L║ • Size of characteristic sample polynomial in ║L║

  37. [Diagram: elements f(0), f(1), …, f(n−1), …, f(k) arrive; the learner a maps each prefix f1, f2, …, fn, …, fk to hypotheses G1, G2, …, Gn, …; after some point the hypothesis no longer changes: Gk = Gn for all k > n.]

  38. Some selected results (1)

  39. Some selected results (2)

  40. Some selected results (3)

  41. 3 Probabilistic setting • Using the distribution to measure error • Identifying the distribution • Approximating the distribution

  42. Probabilistic settings • PAC learning • Identification with probability 1 • PAC learning distributions

  43. Learning a language from sampling • We have a distribution over Σ* • We sample twice: • Once to learn • Once to see how well we have learned • The PAC setting: Probably Approximately Correct
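A minimal sketch of that two-sample protocol, with hypothetical names: draw() samples a string from the distribution, learn builds a hypothesis (a predicate on strings) from the first sample, and target is the indicator function of the language.

```python
def pac_trial(draw, learn, target, m_train=200, m_test=1000):
    """Sample once to learn, then sample again to estimate the error
    Pr_{x ~ D}[h(x) != target(x)] of the learned hypothesis h."""
    train = [draw() for _ in range(m_train)]   # first sample: learn
    h = learn(train)
    test = [draw() for _ in range(m_test)]     # second sample: evaluate
    return sum(h(x) != target(x) for x in test) / m_test
```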

  44. PAC learning (Valiant 84, Pitt 89) • L a set of languages • G a set of grammars • ε > 0 and δ > 0 • m a maximal length over the strings • n a maximal size of grammars

  45. Polynomially PAC learnable • There is an algorithm that samples reasonably and returns, with probability at least 1 − δ, a grammar whose error is at most ε.

  46. Results • Using cryptographic assumptions, we cannot PAC learn DFAs. • We cannot PAC learn NFAs or CFGs with membership queries either.

  47. Learning distributions • No error • Small error

  48. No error • This calls for identification in the limit with probability 1. • It means that the probability of not converging is 0.

  49. Results • If probabilities are computable, we can learn finite state automata with probability 1. • But not with bounded (polynomial) resources.

  50. With error • PAC definition • But error should be measured by a distance between the target distribution and the hypothesis • L1, L2, L∞?
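For concreteness, here is what the three candidate distances look like when the target p and the hypothesis q are represented as finite dictionaries mapping strings to probabilities (the representation is my assumption, not the talk's):

```python
def d_L1(p, q):
    """L1 distance: total variation distance, up to a factor of 2."""
    xs = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in xs)

def d_L2(p, q):
    """L2 (Euclidean) distance between the two distributions."""
    xs = set(p) | set(q)
    return sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in xs) ** 0.5

def d_Linf(p, q):
    """L-infinity distance: largest pointwise difference in probability."""
    xs = set(p) | set(q)
    return max(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in xs)
```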
