Learning from Text Colin de la Higuera University of Nantes Zadar, August 2010
Acknowledgements • Laurent Miclet, Jose Oncina, Tim Oates, Anne-Muriel Arigon, Leo Becerra-Bonache, Rafael Carrasco, Paco Casacuberta, Pierre Dupont, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Jean-Christophe Janodet, Satoshi Kobayashi, Thierry Murgue, Frédéric Tantini, Franck Thollard, Enrique Vidal, Menno van Zaanen,... http://pagesperso.lina.univ-nantes.fr/~cdlh/ http://videolectures.net/colin_de_la_higuera/
Outline • Motivations, definition and difficulties • Some negative results • Learning k-testable languages from text • Learning k-reversible languages from text • Conclusions http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/ Chapters 8 and 11
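1 Identification in the limit [Diagram: a set of presentations $\mathbf{Pres} \subseteq \mathbb{N} \to X$; a class of languages $\mathcal{L}$; a class of grammars $\mathcal{G}$; the map $\mathrm{yields}: \mathbf{Pres} \to \mathcal{L}$; a learner $a$ from presentations to $\mathcal{G}$; the naming function $L: \mathcal{G} \to \mathcal{L}$] • $L(a(\varphi)) = \mathrm{yields}(\varphi)$ • $\varphi(\mathbb{N}) = \psi(\mathbb{N}) \Rightarrow \mathrm{yields}(\varphi) = \mathrm{yields}(\psi)$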
Learning from text • Only positive examples are available • Danger of over-generalization: why not return $\Sigma^*$? • The problem is “basic”: • Negative examples might not be available • Or they might be heavily biased: near-misses, absurd examples… • Baseline: all the rest is learning with help
GI as a search problem [Figure: the search space, with the PTA marked]
Questions? • Data is unlabelled… • Is this a clustering problem? • Is this a problem posed in other settings?
2 The theory • Gold 67: No super-finite class can be identified from positive examples (or text) only • Necessary and sufficient conditions for learning • Literature: • inductive inference, • ALT series, …
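Limit point • A class $\mathcal{L}$ of languages has a limit point if there exists an infinite sequence $(L_n)_{n \in \mathbb{N}}$ of languages in $\mathcal{L}$ such that $L_0 \subsetneq L_1 \subsetneq \dots \subsetneq L_n \subsetneq \dots$, and there exists another language $L \in \mathcal{L}$ such that $L = \bigcup_{n \in \mathbb{N}} L_n$ • $L$ is called a limit point of $\mathcal{L}$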
[Figure: nested languages $L_0, L_1, L_2, L_3, \dots, L_i, \dots$ inside $L$] $L$ is a limit point
Theorem • If $\mathcal{L}$ admits a limit point, then $\mathcal{L}$ is not learnable from text • Proof: let $s^i$ be a presentation in length-lex order for $L_i$, and $s$ a presentation in length-lex order for $L$. Then $\forall n \in \mathbb{N}\ \exists i : \forall k \le n,\ s^i_k = s_k$, so no finite prefix of $s$ lets a learner tell $L$ apart from every $L_i$ • Note: having a limit point is a sufficient condition for non-learnability, not a necessary one
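A minimal sketch of the proof idea, under an assumption that is not in the slides: the chain is $L_n = \{a^i : i \le n\}$ and the limit point is $L = a^*$. The function name `presentation` and the choice to repeat the longest string once a finite language is exhausted are illustrative only.

```python
def presentation(bound, length):
    """First `length` items of a length-lex presentation.

    bound=None stands for L = a*; an integer n stands for
    L_n = {a^i : 0 <= i <= n}, whose presentation repeats its longest
    string once every element has been listed (an assumed convention).
    """
    items = []
    for k in range(length):
        exponent = k if bound is None else min(k, bound)
        items.append("a" * exponent)
    return items


n = 5
# The presentation of L_n agrees with that of L on the first n + 1 items ...
print(presentation(n, n + 1) == presentation(None, n + 1))
# ... and the two presentations only differ afterwards.
print(presentation(n, n + 2) == presentation(None, n + 2))
```

Whatever finite prefix of the presentation of $a^*$ a learner has seen, some $L_n$ has a presentation starting with exactly the same prefix.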
Mincons classes • A class is mincons if there is an algorithm which, given a sample $S$, builds a $G \in \mathcal{G}$ such that $S \subseteq L \subseteq L(G) \Rightarrow L = L(G)$ • I.e. there is a unique minimum (for inclusion) consistent grammar
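Accumulation point (Kapur 91) • A class $\mathcal{L}$ of languages has an accumulation point if there exists an infinite sequence $(S_n)_{n \in \mathbb{N}}$ of sets such that $S_0 \subseteq S_1 \subseteq \dots \subseteq S_n \subseteq \dots$, and $L = \bigcup_{n \in \mathbb{N}} S_n \in \mathcal{L}$, and for any $n \in \mathbb{N}$ there exists a language $L'_n$ in $\mathcal{L}$ such that $S_n \subseteq L'_n \subsetneq L$ • The language $L$ is called an accumulation point of $\mathcal{L}$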
[Figure: nested sets $S_0, S_1, S_2, S_3, \dots, S_n$ inside a language $L'_n$, itself inside $L$] $L$ is an accumulation point
Theorem (for mincons classes) • $\mathcal{L}$ admits an accumulation point iff $\mathcal{L}$ is not learnable from text
Infinite Elasticity • If a class of languages has a limit point there exists an infinite ascending chain of languages $L_0 \subsetneq L_1 \subsetneq \dots \subsetneq L_n \subsetneq \dots$ • This property is called infinite elasticity
Infinite Elasticity [Figure: an infinite sequence of strings $x_0, x_1, x_2, x_3, \dots, x_i, x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}, \dots$]
Finite elasticity • $\mathcal{L}$ has finite elasticity if it does not have infinite elasticity
Theorem (Wright) • If $\mathcal{L}(\mathcal{G})$ has finite elasticity and is mincons, then $\mathcal{G}$ is learnable.
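Tell-tale sets [Figure: a tell-tale set $T_G = \{x_1, x_2, x_3, x_4\}$ inside $L(G)$; a language $L(G')$ that contains $T_G$ and is strictly included in $L(G)$ is forbidden]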
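Theorem (Angluin) • $\mathcal{G}$ is learnable iff there is a computable partial function $\psi: \mathcal{G} \times \mathbb{N} \to \Sigma^*$ such that: • $\forall n \in \mathbb{N}$, $\psi(G, n)$ is defined iff $G \in \mathcal{G}$ and $L(G) \neq \emptyset$ • $\forall G \in \mathcal{G}$, $T_G = \{\psi(G, n) : n \in \mathbb{N}\}$ is a finite subset of $L(G)$ called a tell-tale subset • $\forall G, G' \in \mathcal{G}$, if $T_G \subseteq L(G')$ then $L(G')$ is not a proper subset of $L(G)$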
Proposition (Kapur 91) • A language $L$ in $\mathcal{L}$ has a tell-tale subset iff $L$ is not an accumulation point (for mincons classes)
Summarizing • Many alternative ways of proving that identification in the limit is not feasible • Methodological-philosophical discussion • We still need practical solutions
3 Learning k-testable languages • P. García and E. Vidal. Inference of k-testable languages in the strict sense and applications to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(9):920–925, 1990 • P. García, E. Vidal, and J. Oncina. Learning locally testable languages in the strict sense. In Workshop on Algorithmic Learning Theory (ALT 90), pages 325–338, 1990
Definition • Let $k > 0$. A k-testable language in the strict sense (k-TSS) is given by a 5-tuple $Z_k = (\Sigma, I, F, T, C)$ with: • $\Sigma$ a finite alphabet • $I, F \subseteq \Sigma^{k-1}$ (allowed prefixes of length $k-1$ and suffixes of length $k-1$) • $T \subseteq \Sigma^k$ (allowed segments) • $C \subseteq \Sigma^{<k}$ contains all the strings of the language of length less than $k$ • Note that $I \cap F = C \cap \Sigma^{k-1}$
The k-testable language is $L(Z_k) = (I\Sigma^* \cap \Sigma^* F - \Sigma^*(\Sigma^k - T)\Sigma^*) \cup C$ • Strings of length at least $k$ have to use a good prefix and a good suffix of length $k-1$, and all their substrings of length $k$ have to belong to $T$. Strings of length less than $k$ should be in $C$ • Or: $\Sigma^k - T$ defines the prohibited segments • Key idea: use a window of size $k$
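The sliding-window reading of this definition can be sketched as a membership test. This is an illustrative sketch, not code from the talk: the function name `in_ktss` is hypothetical, strings are Python `str`, and the empty string plays the role of $\lambda$.

```python
def in_ktss(w, k, I, F, T, C):
    """Return True if w belongs to L(Z_k) for Z_k = (Sigma, I, F, T, C)."""
    if len(w) < k:
        # Short strings are accepted only if listed explicitly in C.
        return w in C
    # Longer strings need an allowed prefix and suffix of length k-1 ...
    if w[:k - 1] not in I or w[len(w) - (k - 1):] not in F:
        return False
    # ... and every window of size k must be an allowed segment.
    return all(w[i:i + k] in T for i in range(len(w) - k + 1))


# A 2-testable tuple: I = F = {a}, T = {aa, ab, ba}, C = {lambda, a}.
I, F, T, C = {"a"}, {"a"}, {"aa", "ab", "ba"}, {"", "a"}
print(in_ktss("aba", 2, I, F, T, C))   # every window is allowed
print(in_ktss("abba", 2, I, F, T, C))  # contains the segment bb
```

The test runs in time linear in the length of the string (for fixed $k$), since each window is looked up once.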
An example (2-testable) • $I = \{a\}$ • $F = \{a\}$ • $T = \{aa, ab, ba\}$ • $C = \{\lambda, a\}$ [Figure: the corresponding automaton over $\{a, b\}$]
Window language • By sliding a window of size 2 over a string we can parse • ababaaababababaaaab OK (every window is an allowed segment) • aaabbaaaababab not OK (the window bb is not an allowed segment)
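The hierarchy of k-TSS languages • $k\text{-TSS}(\Sigma) = \{L \subseteq \Sigma^* : L \text{ is k-TSS}\}$ • All finite languages are in $k\text{-TSS}(\Sigma)$ if $k$ is large enough! • $k\text{-TSS}(\Sigma) \subsetneq [k+1]\text{-TSS}(\Sigma)$ • $(ba^k)^* \in [k+1]\text{-TSS}(\Sigma)$ • $(ba^k)^* \notin k\text{-TSS}(\Sigma)$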
A language that is not k-testable [Figure: an automaton over $\{a, b\}$]
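K-TSS inference • Given a sample $S$, $a_{k\text{-TSS}}(S) = Z_k$ where $Z_k = (\Sigma(S), I(S), F(S), T(S), C(S))$ and • $\Sigma(S)$ is the alphabet used in $S$ • $C(S) = \Sigma(S)^{<k} \cap S$ • $I(S) = \Sigma(S)^{k-1} \cap \mathrm{Pref}(S)$ • $F(S) = \Sigma(S)^{k-1} \cap \mathrm{Suff}(S)$ • $T(S) = \Sigma(S)^k \cap \{v : uvw \in S\}$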
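Example • $S = \{a, aa, abba, abbbba\}$ • Let $k = 3$ • $\Sigma(S) = \{a, b\}$ • $I(S) = \{aa, ab\}$ • $F(S) = \{aa, ba\}$ • $C(S) = \{a, aa\}$ • $T(S) = \{abb, bbb, bba\}$ • $L(a_{3\text{-TSS}}(S)) = a + aa + abbb^*a$ (the string $aba$ is excluded, since $aba \notin T(S)$)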
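The inference step amounts to collecting short strings, prefixes, suffixes and windows from the sample. The following is a minimal sketch of that computation; the function name `ktss_infer` and the use of Python sets of `str` are illustrative choices, not part of the original slides.

```python
def ktss_infer(sample, k):
    """Build Z_k = (Sigma, I, F, T, C) from a finite sample of strings."""
    sigma = {ch for w in sample for ch in w}
    # C(S): sample strings shorter than k.
    C = {w for w in sample if len(w) < k}
    # I(S), F(S): prefixes / suffixes of length k-1 of sample strings.
    I = {w[:k - 1] for w in sample if len(w) >= k - 1}
    F = {w[len(w) - (k - 1):] for w in sample if len(w) >= k - 1}
    # T(S): all substrings of length k occurring in the sample.
    T = {w[i:i + k] for w in sample for i in range(len(w) - k + 1)}
    return sigma, I, F, T, C


sigma, I, F, T, C = ktss_infer({"a", "aa", "abba", "abbbba"}, 3)
print(sorted(I), sorted(F), sorted(T), sorted(C))
```

On the sample of the example with $k = 3$, this recomputes the sets $I(S)$, $F(S)$, $T(S)$ and $C(S)$ listed on the slide.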
Building the corresponding automaton • Each string in $I \cup C$ and $\mathrm{Pref}(I \cup C)$ is a state • Each substring of length $k-1$ of strings in $T$ is a state • $\lambda$ is the initial state • Add a transition labeled $b$ from $u$ to $ub$ for each state $ub$ • Add a transition labeled $b$ from $au$ to $ub$ for each $aub$ in $T$ • Each state/substring that is in $F$ is a final state • Each state/substring that is in $C$ is a final state
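The construction above can be sketched as follows. This is one possible reading, stated as an assumption: the rule "from $u$ to $ub$ for each state $ub$" is applied only to states that are prefixes of strings in $I \cup C$, so that the result stays deterministic. The names `ktss_automaton` and `accepts` are hypothetical.

```python
def ktss_automaton(k, I, F, T, C):
    """Return (states, initial, finals, delta) for a k-TSS tuple."""
    # Prefix states: every prefix of every string in I or C (including "").
    prefix_states = {w[:i] for w in I | C for i in range(len(w) + 1)}
    # Window states: the two length-(k-1) factors of each allowed segment.
    window_states = {t[:-1] for t in T} | {t[1:] for t in T}
    states = prefix_states | window_states | {""}
    delta = {}
    # u --b--> ub along the prefix tree (assumed reading of the rule).
    for s in prefix_states:
        if s:
            delta[(s[:-1], s[-1])] = s
    # au --b--> ub for each allowed segment aub.
    for t in T:
        delta[(t[:-1], t[-1])] = t[1:]
    finals = {s for s in states if s in F or s in C}
    return states, "", finals, delta


def accepts(automaton, w):
    """Run the deterministic automaton on w."""
    _, q, finals, delta = automaton
    for ch in w:
        q = delta.get((q, ch))
        if q is None:
            return False
    return q in finals


A = ktss_automaton(3, {"aa", "ab"}, {"aa", "ba"}, {"abb", "bbb", "bba"}, {"a", "aa"})
print(sorted(A[0]))
print(accepts(A, "abbbba"), accepts(A, "aba"))
```

With the tuple inferred from $S = \{a, aa, abba, abbbba\}$ and $k = 3$, the sketch yields six states and accepts every string of the sample.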
Running the algorithm • $S = \{a, aa, abba, abbbba\}$ • $I = \{aa, ab\}$ • $F = \{aa, ba\}$ • $T = \{abb, bbb, bba\}$ • $C = \{a, aa\}$ [Figure: the automaton built step by step, with states $\lambda$, a, aa, ab, bb, ba and transitions labeled a and b]
Properties (1) • $S \subseteq L(a_{k\text{-TSS}}(S))$ • $L(a_{k\text{-TSS}}(S))$ is the smallest k-TSS language that contains $S$ • If there is a smaller one, some prefix, suffix or substring has to be absent
Properties (2) • $a_{k\text{-TSS}}$ identifies any k-TSS language in the limit from polynomial data • Once all the prefixes, suffixes and substrings have been seen, the correct automaton is returned • If $Y \subseteq S$, $L(a_{k\text{-TSS}}(Y)) \subseteq L(a_{k\text{-TSS}}(S))$
Properties (3) • $L(a_{[k+1]\text{-TSS}}(S)) \subseteq L(a_{k\text{-TSS}}(S))$ • In $I_{k+1}$ (resp. $F_{k+1}$ and $T_{k+1}$) there are fewer allowed prefixes (resp. suffixes or substrings) than in $I_k$ (resp. $F_k$ and $T_k$) • $\forall k > \max_{x \in S} |x|$, $L(a_{k\text{-TSS}}(S)) = S$ • Because for a large $k$, $T_k(S) = \emptyset$
4 Learning k-reversible languages from text • D. Angluin. Inference of reversible languages. Journal of the Association for Computing Machinery, 29(3):741–765, 1982
The k-reversible languages • The class was proposed by Angluin (1982) • The class is identifiable in the limit from text • The class is composed of regular languages that can be accepted by a DFA whose reverse is deterministic with a look-ahead of k
Let $A = (\Sigma, Q, \delta, I, F)$ be an NFA. We denote by $A^T = (\Sigma, Q, \delta^T, F, I)$ the reversal automaton, with $\delta^T(q, a) = \{q' \in Q : q \in \delta(q', a)\}$
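A minimal sketch of the reversal, assuming the transition function is stored as a dictionary from `(state, symbol)` pairs to sets of states; the function name `reverse_nfa` and this encoding are illustrative choices, not taken from the slides.

```python
def reverse_nfa(delta, initial, final):
    """Return (delta_T, initial_T, final_T) for the reversal automaton.

    delta maps (state, symbol) -> set of successor states.
    """
    delta_t = {}
    for (q_src, a), targets in delta.items():
        for q_dst in targets:
            # q_src is in delta_T(q_dst, a) exactly when q_dst is in delta(q_src, a).
            delta_t.setdefault((q_dst, a), set()).add(q_src)
    # Initial and final states swap roles.
    return delta_t, set(final), set(initial)


# A small hypothetical NFA, not one of the automata drawn in the slides.
delta = {(0, "a"): {1, 2}, (1, "b"): {2}}
print(reverse_nfa(delta, {0}, {2}))
```

Reversing twice gives back the original transition function, which is a quick sanity check for any encoding.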
[Figure: an automaton $A$ with states 0 to 4 over $\{a, b\}$, and its reversal $A^T$]
Some definitions • $u$ is a k-successor of $q$ if $|u| = k$ and $\delta(q, u) \neq \emptyset$ • $u$ is a k-predecessor of $q$ if $|u| = k$ and $\delta^T(q, u^T) \neq \emptyset$ • $\lambda$ is 0-successor and 0-predecessor of any state
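These two definitions can be sketched directly on a transition dictionary. The encoding (a dict from `(state, symbol)` to sets of states, single-character symbols) and the function names are assumptions made for illustration; the example automaton is hypothetical, not the one drawn in the slides.

```python
def k_successors(delta, q, k):
    """All strings u with |u| = k that can be read starting from state q."""
    frontier = {("", q)}
    for _ in range(k):
        frontier = {(u + a, r)
                    for (u, p) in frontier
                    for (src, a), targets in delta.items() if src == p
                    for r in targets}
    return {u for u, _ in frontier}


def k_predecessors(delta, q, k):
    """All strings u with |u| = k that can be read on a path ending in q."""
    frontier = {("", q)}
    for _ in range(k):
        # Walk one transition backwards and prepend its label.
        frontier = {(a + u, src)
                    for (u, p) in frontier
                    for (src, a), targets in delta.items() if p in targets}
    return {u for u, _ in frontier}


delta = {(0, "a"): {1}, (1, "a"): {2}, (1, "b"): {1}, (2, "b"): {0}}
print(sorted(k_successors(delta, 0, 2)))
print(sorted(k_predecessors(delta, 2, 2)))
```

For $k = 0$ both functions return the set containing only the empty string, matching the convention that $\lambda$ is a 0-successor and 0-predecessor of any state.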
[Figure: the automaton $A$, with states 0 to 4] • aa is a 2-successor of 0 and 1 but not of 3 • a is a 1-successor of 3 • aa is a 2-predecessor of 3 but not of 1
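An NFA is deterministic with look-ahead $k$ if $\forall q, q' \in Q$ with $q \neq q'$: $\big((q, q' \in I) \lor (\exists q'' \in Q, a \in \Sigma : q, q' \in \delta(q'', a))\big) \land (u \text{ is a k-successor of } q) \land (v \text{ is a k-successor of } q') \Rightarrow u \neq v$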
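The condition says that two distinct states competing for the same choice (both initial, or both reachable from one state by the same symbol) never share a k-successor. A minimal sketch of that check follows, assuming a dictionary encoding of $\delta$ from `(state, symbol)` to sets of states; the function names and the example automaton are hypothetical.

```python
from itertools import combinations


def k_successors(delta, q, k):
    """All strings u with |u| = k that can be read starting from state q."""
    frontier = {("", q)}
    for _ in range(k):
        frontier = {(u + a, r)
                    for (u, p) in frontier
                    for (src, a), targets in delta.items() if src == p
                    for r in targets}
    return {u for u, _ in frontier}


def deterministic_with_lookahead(delta, initial, k):
    """Check the look-ahead condition for every pair of competing states."""
    # Competing groups: the initial states, and each set delta(q'', a).
    groups = [set(initial)] + [set(targets) for targets in delta.values()]
    for group in groups:
        for q, q2 in combinations(group, 2):
            # A shared k-successor means k symbols of look-ahead cannot decide.
            if k_successors(delta, q, k) & k_successors(delta, q2, k):
                return False
    return True


delta = {(0, "a"): {1, 2}, (1, "a"): {3}, (2, "b"): {3}}
print(deterministic_with_lookahead(delta, {0}, 0))
print(deterministic_with_lookahead(delta, {0}, 1))
```

With $k = 0$ the check reduces to ordinary determinism, because every state has $\lambda$ as its only 0-successor.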
Prohibited: [Figure: two states 1 and 2, both reached by a transition labeled a, each followed by the same string $u$ with $|u| = k$]
Example • This automaton is not deterministic with look-ahead 1 but is deterministic with look-ahead 2 [Figure: an automaton with states 0 to 4 over $\{a, b\}$]
K-reversible automata • $A$ is k-reversible if $A$ is deterministic and $A^T$ is deterministic with look-ahead $k$ • Example [Figure: two automata with states 0, 1, 2 over $\{a, b\}$, one labeled "deterministic" and the other "deterministic with look-ahead 1"]
Notations • $RL(\Sigma, k)$ is the set of all k-reversible languages over alphabet $\Sigma$ • $RL(\Sigma)$ is the set of all k-reversible languages over alphabet $\Sigma$ (i.e. for all values of $k$) • $a_{k\text{-RL}}$ is the learning algorithm we describe
Properties • There are some regular languages that are not in $RL(\Sigma)$ • $RL(\Sigma, k) \supsetneq RL(\Sigma, k-1)$
Violation of k-reversibility • Two states $q, q'$ violate the k-reversibility condition if • they violate the deterministic condition: $q, q' \in \delta(q'', a)$, or • they violate the look-ahead condition: • $q, q' \in F$, $\exists u \in \Sigma^k$: $u$ is k-predecessor of both $q$ and $q'$ • $\exists u \in \Sigma^k$, $\delta(q, a) = \delta(q', a)$ and $u$ is k-predecessor of both $q$ and $q'$