
Languages, grammars, and regular expressions

LING 570, Fei Xia. Week 2: 10/03/07.


Presentation Transcript


  1. Languages, grammars, and regular expressions • LING 570 • Fei Xia • Week 2: 10/03/07

  2. Today • Hw1 is due: 1% penalty per hour after 3pm. • Probability theory (from last time) • Formal languages, formal grammars, and regular expressions → J&M 2 • Hw2

  3. Coming up • Next Monday • Finite-state automaton: J&M Chapter 2 • Next Wed • Finite-state transducer: J&M Section 3.4 • Hw2 is due

  4. Probability theory (from last time)

  5. Common trick #1: Maximum likelihood estimation • An example: toss a coin 3 times and get two heads. What is the probability of getting a head with one toss? • Maximum likelihood (ML): θ* = arg max_θ P(data | θ) • In the example, P(X=2) = 3 · p · p · (1−p) ⇒ when p = 2/3, P(X=2) reaches its maximum.
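
The coin example can be checked numerically. The following is a minimal sketch, assuming a simple grid search over candidate values of p; the helper name `binomial_likelihood` and the grid resolution are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from math import comb

def binomial_likelihood(p, heads=2, tosses=3):
    """P(X = heads) for `tosses` independent tosses with head probability p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# Candidate values 0, 1/3000, 2/3000, ..., 1 (an arbitrary, hypothetical grid).
grid = [Fraction(i, 3000) for i in range(3001)]

# The ML estimate is the candidate that maximizes the likelihood of the data.
p_ml = max(grid, key=binomial_likelihood)

print(p_ml)                       # the grid point with the highest likelihood
print(binomial_likelihood(p_ml))  # the likelihood at that point
```

For this model the maximizer also has the closed form (number of heads) / (number of tosses), which is why the search lands on 2/3.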

  6. Common trick #2: Chain rule

  7. Common trick #3: joint prob → marginal prob

  8. Common trick #4: Bayes’ rule
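
For reference, tricks #2 to #4 in their standard textbook forms. This is a sketch under the assumption that the slides used the usual statements; the symbols A, B, and A_i are placeholder names, not taken from the slides.

```latex
\begin{align*}
% Chain rule: factor a joint probability into conditionals
P(A_1, \dots, A_n) &= P(A_1)\, P(A_2 \mid A_1) \cdots P(A_n \mid A_1, \dots, A_{n-1}) \\
% Marginal from joint: sum out the variable that is not of interest
P(A) &= \sum_{B} P(A, B) \\
% Bayes' rule: invert the direction of conditioning
P(B \mid A) &= \frac{P(A \mid B)\, P(B)}{P(A)}
\end{align*}
```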

  9. Independent random variables • Two random variables X and Y are independent iff the value of X has no influence on the value of Y and vice versa. • P(X,Y) = P(X) P(Y) • P(Y|X) = P(Y) • P(X|Y) = P(X) • Our previous coin examples: P(X, C) != P(X) P(C)
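
The definition P(X,Y) = P(X) P(Y) can be tested directly on a joint distribution table. A minimal sketch, assuming the joint is given as a dictionary from (x, y) pairs to probabilities; the function name `is_independent` and both toy distributions are hypothetical illustrations, not the coin example from the previous lecture.

```python
from itertools import product

def is_independent(joint, tol=1e-12):
    """Check P(x, y) == P(x) * P(y) for every pair of values."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals are obtained by summing the joint over the other variable.
    px = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
        for x, y in product(xs, ys)
    )

# Two separate fair coins: every outcome pair has probability 1/4.
two_fair_coins = {(a, b): 0.25 for a in "HT" for b in "HT"}

# The same coin read twice: the two values always agree.
same_coin = {("H", "H"): 0.5, ("T", "T"): 0.5}

print(is_independent(two_fair_coins))
print(is_independent(same_coin))
```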

  10. Conditional independence • Once we know the value of Z, knowing the value of Y does not help us predict the value of X, and vice versa. • P(X,Y | Z) = P(X|Z) P(Y|Z) • P(X|Y,Z) = P(X|Z) • P(Y|X,Z) = P(Y|Z)

  11. Independence and conditional independence • If A and B are independent, are they conditionally independent? • Example: • Burglar, Earthquake • Alarm

  12. Common trick #5: Independence assumption

  13. An example • P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1) ≈ P(w1) P(w2 | w1) … P(wn | wn-1) • Why do we make independence assumptions that we know are not true?
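
The approximation above conditions each word only on the previous one. A minimal sketch with relative-frequency estimates from a toy corpus; the corpus, the `<s>` start marker (so that P(w1) becomes P(w1 | <s>)), and the function name `bigram_prob` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical three-sentence corpus; "<s>" marks the start of a sentence.
corpus = [
    ["<s>", "the", "cat", "sat"],
    ["<s>", "the", "dog", "sat"],
    ["<s>", "the", "cat", "ran"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(sentence):
    """Approximate P(w1 ... wn) as a product of P(w_i | w_{i-1})."""
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        # Relative-frequency estimate: count(prev, word) / count(prev).
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

print(bigram_prob(["<s>", "the", "cat", "sat"]))
```

An unseen word pair gets probability zero under this estimate, which is one reason smoothing is needed.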

  14. Summary of elementary probability theory • Basic concepts: sample space, event space, random variable, random vector • Joint / conditional / marginal probability • Independence and conditional independence • Five common tricks: • Max likelihood estimation • Chain rule • Calculating marginal probability from joint probability • Bayes’ rule • Independence assumption

  15. Formal languages, formal grammars and regular languages

  16. Unit #1 • Formal grammar, language and regular expression • Finite-state automaton (FSA) • Finite-state transducer (FST) • Morphological analysis using FST

  17. Other units in ling570 • Unit #2: LM and smoothing • Unit #3: HMM and POS tagging • Unit #4: Classification and sequence labeling • Unit #5: Introduction to other common tasks (e.g., IR/IE/WSD)

  18. Regular expression • Two concepts: • Regular expression in formal language theory • Regular expression (or pattern) in pattern matching: it is a way of expressing a pattern for the purpose of matching a string • Both concepts describe a set of strings. • The two concepts are closely related, but the latter is often more expressive than the former.

  19. Outline • Formal language • Regular language • Regular expression in formal language theory • Formal grammars • Regular grammars • Regular expression in pattern matching

  20. Formal languages

  21. Definition of formal language • An alphabet is a finite set of symbols: • Ex: Σ = {a, b, c} • A string is a finite sequence of symbols from a particular alphabet juxtaposed: • Ex: the string “baccab” • Ex: the empty string ε • A formal language is a set of strings defined over some alphabet. • Ex1: {aa, bb, cc, aaaa, abba, acca, baab, bbbb, …} • Ex2: {a^n b^n | n > 0} • Ex3: the empty set ∅

  22. Definition of regular languages • The class of regular languages over an alphabet Σ is formally defined as: • The empty set, ∅, is a regular language • ∀ a ∈ Σ ∪ {ε}, {a} is a regular language. • If L1 and L2 are regular languages, then so are: (a) L1 · L2 = {xy | x ∈ L1, y ∈ L2} (concatenation) (b) L1 ∪ L2 (union or disjunction) (c) L1* = {x1 x2 … xn | xi ∈ L1, n ∈ N} (Kleene closure) • There are no other regular languages

  23. Kleene star • Another way to define L*: • L^2 = L · L • L^n = L^(n-1) · L • L* = {ε} ∪ L^1 ∪ L^2 ∪ … • Examples: • L = {a, bc} • L^2 = {aa, abc, bca, bcbc} • L* = {ε, a, bc, …, abcbca, …}, i.e., the language denoted by (a|bc)*
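
The powers of a language can be computed mechanically for small finite languages. A minimal sketch, assuming languages are represented as Python sets of strings; since L* is infinite, `star_up_to` (a hypothetical helper) only collects the powers up to a chosen bound.

```python
def concat(l1, l2):
    """Concatenation of two languages: {xy | x in l1, y in l2}."""
    return {x + y for x in l1 for y in l2}

def power(lang, n):
    """L^n, with L^0 = {empty string}."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

def star_up_to(lang, max_n):
    """Finite approximation of L*: the union of L^0 through L^max_n."""
    out = set()
    for n in range(max_n + 1):
        out |= power(lang, n)
    return out

L = {"a", "bc"}
print(sorted(power(L, 2)))
print("abcbca" in star_up_to(L, 4))
```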

  24. Properties • Regular languages are closed under • Concatenation • Union • Kleene closure • Regular languages are also closed under: • Intersection: L1 ∩ L2 • Difference: L1 − L2 • Complementation: Σ* − L1 • Reversal

  25. Are the following languages regular? • {a, aa, aaa, …} • Any finite set of strings • {xy | x ∈ Σ*, and y is the reverse of x} • {xx | x ∈ Σ*} • {a^n b^n | n ∈ N} • {a^n b^n c^n | n ∈ N}

  26. Regular expression

  27. Definition of regular expression (as in formal language theory) • The set of regular expressions is defined as follows: (1) Every symbol of Σ is a regular expression (2) ε is a regular expression (3) If r1 and r2 are regular expressions, so are (r1), r1 r2, r1 | r2, r1* (4) Nothing else is a regular expression.

  28. Examples • ab*c • a (0|1|2|..|9)* b • (CV | CCV)+ C?C?: C is a consonant, V is a vowel • Other operations that we can use: • a+ = a a* • a? = (a | ε)

  29. Relation between regular language and Regex • With every regular expression we can associate a regular language. • Conversely, every regular language can be obtained from a regular expression. • Examples: • Regex = ab*c • Regular language = {ac, abc, abbc, ….}
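
Membership in the language of ab*c can be tested with a pattern-matching library. A minimal sketch using Python's `re` module, chosen here only for illustration; `fullmatch` is used so that the whole string, not just a substring, must fit the expression. The helper name `in_language` is hypothetical.

```python
import re

AB_STAR_C = re.compile(r"ab*c")

def in_language(s):
    """True iff the entire string s belongs to the language of ab*c."""
    return AB_STAR_C.fullmatch(s) is not None

for s in ["ac", "abc", "abbc", "abcb", "bc", ""]:
    print(repr(s), in_language(s))
```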

  30. Formal grammar

  31. Definition of formal grammar • A formal grammar is a concise description of a formal language. It is a (N, Σ, P, S) tuple: • A finite set N of nonterminal symbols • A finite set Σ of terminal symbols that is disjoint from N • A finite set P of production rules, each of the form: (Σ ∪ N)* N (Σ ∪ N)* → (Σ ∪ N)* • A distinguished symbol S ∈ N that is the start symbol

  32. Chomsky hierarchy • The left-hand side of a rule must contain at least one nonterminal. • α, β, γ ∈ (N ∪ Σ)*; A, B ∈ N; a ∈ Σ • Type 0: Unrestricted grammar: no other constraints. • Type 1: Context-sensitive grammar: the rules must be of the form α A β → α γ β • Type 2: Context-free grammar (CFG): the rules must be of the form A → α • Type 3: Regular grammar: the rules are of the forms: • right regular grammar: A → a, A → aB, or A → ε • left regular grammar: A → a, A → Ba, or A → ε • Are there other kinds of grammars?

  33. Strings generated from a grammar • The rules are: S → x | y | z | S + S | S - S | S * S | S/S | (S) • What strings can be generated? • A grammar is ambiguous if there exists at least one string which has multiple parse trees. • Is this grammar ambiguous?

  34. Languages generated by grammars • Given a grammar G, L(G) is the set of strings that can be generated from G. • Ex: G = (N, Σ, P, S) • N = {S}, Σ = {a, b, c} • P = { S → a S b, S → c } • What is L(G)?
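
One way to explore L(G) for the example grammar is to enumerate short derivations. A minimal sketch, assuming sentential forms are plain strings in which the uppercase letter S is the only nonterminal; the function name `derive` and the step bound are hypothetical, and the output only suggests a pattern rather than proving it.

```python
def derive(max_steps):
    """Terminal strings derivable from S in at most max_steps rule applications."""
    rules = {"S": ["aSb", "c"]}   # S -> a S b | c
    forms = {"S"}                 # sentential forms that still contain S
    strings = set()               # fully terminal strings found so far
    for _ in range(max_steps):
        next_forms = set()
        for form in forms:
            i = form.find("S")    # each form here contains exactly one S
            for rhs in rules["S"]:
                new = form[:i] + rhs + form[i + 1:]
                if "S" in new:
                    next_forms.add(new)
                else:
                    strings.add(new)
        forms = next_forms
    return strings

print(sorted(derive(3), key=len))
```

Each generated string has the same number of a's before the c as b's after it.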

  35. The relation between regular grammars and regular languages • The regular grammars describe exactly all regular languages. • All the following are equivalent: • Regular languages • Regular grammars • Regular expression • Finite state automaton (FSA)

  36. Relation between grammars and languages (from wikipedia page)

  37. How about human languages? • Are they formal languages? • What is the alphabet? • What is a string? • What type of formal languages are they?

  38. Outline • Formal language • Regular language • Regular expression in formal language theory • Formal grammar • Regular grammar • Patterns in pattern matching → J&M 2.1

  39. Patterns in Perl • [ab]: same as a|b • . : match any character (except newline, by default) • ^ : the starting position in a string • $ : the ending position in a string • (..): defines a marked subexpression • a*: match “a” zero or more times • a+: match “a” one or more times • a?: match “a” zero or one time • a{n,m}: “a” appears n to m times

  40. Special symbols in the patterns • \s: match any whitespace char • \d: match any digit • \w: match any letter, digit, or underscore • \S: match any non-whitespace char • … • Escaped literals: \+, \-, \., \?, \*, …

  41. Examples • Integer: (\+|\-)?\d+ • Real: (\+|\-)?\d+\.\d+ • Scientific notation: (\+|\-)?\d+(\.\d+)?e(\+|\-)?\d+ • Any of the three: (\+|\-)?\d+(\.\d+)?(e(\+|\-)?\d+)?
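
The four number patterns can be tried out on sample strings. A minimal sketch using Python's `re` module, whose syntax for these particular constructs matches Perl's; the constant names and test strings are illustrative assumptions, and the patterns are written without spaces so that no literal space is required in the input.

```python
import re

INTEGER = re.compile(r"(\+|\-)?\d+")
REAL = re.compile(r"(\+|\-)?\d+\.\d+")
SCIENTIFIC = re.compile(r"(\+|\-)?\d+(\.\d+)?e(\+|\-)?\d+")
ANY_NUMBER = re.compile(r"(\+|\-)?\d+(\.\d+)?(e(\+|\-)?\d+)?")

def matches(pattern, s):
    """True iff the whole string s fits the pattern."""
    return pattern.fullmatch(s) is not None

for s in ["-42", "+3.14", "6.02e23", "1e-9", "3.", "e5"]:
    print(s, matches(ANY_NUMBER, s))
```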

  42. Patterns in Perl and Regex • /^(.*)\1$/ → {xx | x ∈ Σ*} • /^(.+)a(.+)\1\2$/ → {xayxy | x, y ∈ Σ+} ⇒ The extra power comes from the ability to refer to a marked subexpression.
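
Backreferences behave the same way in Python's `re` module, which makes the two patterns easy to try. A minimal sketch; the names `SQUARE`, `XAYXY`, and `matches` are hypothetical, and the sample strings contain no newlines because `.` does not match a newline by default.

```python
import re

SQUARE = re.compile(r"^(.*)\1$")         # some string followed by a copy of itself
XAYXY = re.compile(r"^(.+)a(.+)\1\2$")   # x, then "a", then y, then x and y again

def matches(pattern, s):
    """True iff the whole string s fits the pattern."""
    return pattern.fullmatch(s) is not None

for s in ["abab", "aa", "aba", ""]:
    print(repr(s), matches(SQUARE, s))

for s in ["bacbc", "xxayyxxyy", "bacb"]:
    print(repr(s), matches(XAYXY, s))
```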

  43. Outline • Formal language • Regular language • Regular expression in formal language theory • Formal grammars • Regular grammars • Regex patterns in pattern matching

  44. Homework #2

  45. Part I: probability • Part II: formal grammar • Part III: regular expression
