
LING 408/508: Computational Techniques for Linguists




  1. LING 408/508: Computational Techniques for Linguists Lecture 23 10/15/2012

  2. Outline • Formal language theory: strings and languages • Coding CFGs • Generating strings and trees • Short assignment #15

  3. Note: the following is mathematical notation. It is not Python code.

  4. Alphabet • Let Σ be a finite alphabet. • The elements of Σ may be multi-character symbols • Examples: • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } • LETTERS = { A, B, C, …, X, Y, Z } • ENGLISH_WORDS = { a, aa, …, zyzzogeton, zyzzyva }

  5. Strings • A string is a sequence of zero or more symbols taken from an alphabet. • Example: • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } • s1 = 0123456789 • s2 = 321654

  6. Empty string • The empty string, ε, consists of zero symbols • Pronounced as “epsilon” • Example: • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } • s1 = 0123456789 • s2 = 321654 • s3 = ε

  7. Concatenation • A string is produced by concatenation of symbols from an alphabet. The concatenation of symbols is written by placing two symbols immediately next to each other. • Strings can also be concatenated. • Example: • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } • 1 ∈ DIGITS and 2 ∈ DIGITS. Concatenate 1 and 2 to produce the string 12 • Concatenate 123 and 456 to produce the string 123456

  8. Repetition of a string • Use an “exponent” with an integer value >= 0 to indicate the number of times a symbol is repeated in a string • Zero repetitions is equivalent to ε • Examples: • a^5 = aaaaa • a^1 = a • a^0 = ε

  9. Repetition: use parentheses for grouping • ab^2 = abb • (ab)^2 = abab • abc(def)^2 = abcdefdef • abc(def)^1 = abcdef • abc(def)^0 = abcε = abc
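Although the notation on these slides is mathematical and not Python, Python's string operators happen to behave analogously when every symbol is a single character. The following is a small sketch under that assumption (multi-character symbols would need lists of symbols instead of plain strings); the variable names are illustrative only.

```python
# Python strings mirror the notation: + concatenates, * repeats.
a5 = 'a' * 5                 # a^5
ab2 = 'a' + 'b' * 2          # ab^2: the exponent applies only to b
ab_2 = ('a' + 'b') * 2       # (ab)^2: parentheses group ab
zero = 'abc' + 'def' * 0     # abc(def)^0 = abc, since x^0 is the empty string

print(a5, ab2, ab_2, zero)   # aaaaa abb abab abc
```

Note that the empty string ε corresponds to `''` in Python, so concatenating it changes nothing.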

  10. Repetition of alphabet • An “exponent” on an alphabet indicates a set of strings, such that each string consists of symbols from that alphabet and has the specified length • Zero repetitions is equivalent to { ε } • Example: • Σ = { 1, 2, 3 } • Σ^0 = { ε } • Σ^1 = { 1, 2, 3 } • Σ^2 = { 11, 12, 13, 21, 22, 23, 31, 32, 33 }

  11. Closure of an alphabet • An “exponent” on an alphabet indicates a set of strings, such that each string consists of symbols from that alphabet and has the specified length • Zero repetitions is equivalent to { ε } • Closure: * indicates zero or more symbols: Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ … • Example: • Σ = { 1, 2, 3 } • Σ^0 = { ε } • Σ^1 = { 1, 2, 3 } • Σ^2 = { 11, 12, 13, 21, 22, 23, 31, 32, 33 } • Σ* = { ε } ∪ { 1, 2, 3 } ∪ { 11, 12, 13, 21, 22, 23, 31, 32, 33 } ∪ … = { ε, 1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33, … }
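The sets of fixed-length strings over an alphabet, and a finite piece of the closure, can be enumerated in Python. This is only a sketch: the function names `sigma_power` and `sigma_star_up_to` are made up for illustration, symbols are assumed to be single characters, and since the closure is infinite the code stops at a chosen maximum length.

```python
from itertools import product

def sigma_power(alphabet, n):
    """Return all strings of exactly n symbols over alphabet, sorted."""
    return sorted(''.join(p) for p in product(sorted(alphabet), repeat=n))

def sigma_star_up_to(alphabet, max_len):
    """Finite approximation of the closure: lengths 0 through max_len."""
    result = []
    for n in range(max_len + 1):
        result.extend(sigma_power(alphabet, n))
    return result

SIGMA = {'1', '2', '3'}
print(sigma_power(SIGMA, 0))             # [''] (only the empty string)
print(sigma_power(SIGMA, 2))             # the 9 strings 11, 12, ..., 33
print(len(sigma_star_up_to(SIGMA, 3)))   # 1 + 3 + 9 + 27 = 40
```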

  12. I’m not talking about regular expressions • Finite repetition written as an exponent, such as a^5, is not part of basic regular expression syntax • In a regular expression, write aaaaa instead (many practical regex engines, such as Python's re module, also accept the shorthand a{5}) • Closure on a single symbol, such as a* in a regular expression, is not available within the formal language theory notation used here • By “a*”, one probably means { ε, a, aa, aaa, … }, but there is another way to specify this in formal language theory

  13. Languages • A language is a set of strings • May be finite or infinite • Examples: • L1 = { apples, bananas, pears } • L2 = { 0, 1, 2, 3, 4, … } • L3 = { ε, a, b, aa, ab, ba, bb, aaa, aab, aba, … }

  14. “…” is imprecise • Finite sets are easy to describe because you can list the elements. • With infinite languages, the use of “…” is not precise. We need better mathematical notation to describe these sets. • Example: • L4 = { 1, 2, 3, … } • L4 could be { 1, 2, 3, 4, 5, 6, … }, or it could be { 1, 2, 3, 11, 12, 13, … }, or many other sets

  15. Predicative definitions of languages • Describe a set through a pattern and remarks on the properties of that pattern • Read “|” as “such that” • Examples: • { x | x is a positive integer } = { 1, 2, 3, 4, 5, … } • { a^n | n is an integer >= 0 } = { ε, a, aa, aaa, … } • { a^n | n is an integer and 1 <= n <= 5 } = { a, aa, aaa, aaaa, aaaaa } • { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … } • { a^n b^m | n >= 0 and m > n } = { b, bb, …, abb, abbb, …, aabbb, aabbbb, aabbbbb, …, aaabbbb, aaabbbbb, … }
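A predicative definition translates naturally into a membership test: a function that returns True exactly when a string has the stated property. The sketch below checks two of the example languages; the function names `in_anbn` and `in_anbm` are hypothetical and chosen only for this illustration.

```python
def in_anbn(s):
    """True if s is n a's followed by n b's, for some n >= 0."""
    n = len(s) // 2
    return s == 'a' * n + 'b' * n

def in_anbm(s):
    """True if s is n a's followed by m b's, with n >= 0 and m > n."""
    n = len(s) - len(s.lstrip('a'))   # number of leading a's
    rest = s[n:]                      # everything after the a's
    return rest == 'b' * len(rest) and len(rest) > n

print(in_anbn('aabb'), in_anbn('aab'))   # True False
print(in_anbm('abb'), in_anbm('ab'))     # True False
```

Such a test decides membership, but it does not by itself enumerate the language.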

  16. Computational descriptions of languages • A predicate definition of a language tells us about the properties of the strings in a language, but it does not tell us a procedure by which the language can be generated • Alternatives: • Inductive definitions • Grammar-based definitions

  17. Inductive definitions • State base case(s). At least one base case. • State inductive case(s). Zero or more. • Example: • L = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … } • Base case: ε ∈ L • Inductive case: if x ∈ L, then a x b ∈ L

  18. Work through example • Base case: ε ∈ L • Inductive case: if x ∈ L, then a x b ∈ L • L = { } • Begin with the empty set. • From the base case, ε ∈ L. So we place ε in L • ε is a string in L. From the inductive case, a ε b = ab ∈ L • ab is a string in L. From the inductive case, a ab b = aabb ∈ L • etc. • This inductive definition generates L = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }
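The worked example can be mirrored in a few lines of Python: start from the base case and apply the inductive case a fixed number of times. This is a sketch, and the function name `inductive_anbn` is made up for illustration.

```python
def inductive_anbn(steps):
    """Build the first strings of L by applying the inductive case `steps` times."""
    language = ['']                      # base case: the empty string is in L
    for _ in range(steps):
        x = language[-1]                 # most recently derived string
        language.append('a' + x + 'b')   # inductive case: a x b is in L
    return language

print(inductive_anbn(3))   # ['', 'ab', 'aabb', 'aaabbb']
```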

  19. Grammar-based definitions of languages • A grammar is a formalism (with rules for computation) that generates a set of strings • Examples: • Regular expression • Finite-state automaton • Context-free grammar • Pushdown automaton • etc.

  20. Differences in generative capacity of grammars
      • Some languages can(not) be generated by certain types of grammars
      • Examples:
        L1 = { a^n | n >= 0 } = { ε, a, aa, aaa, … }
          • Can be generated by a regular expression
          • Can be generated by a context-free grammar
        L2 = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }
          • Cannot be generated by a regular expression
          • Can be generated by a context-free grammar
        L3 = { a^n b^n c^n | n >= 0 } = { ε, abc, aabbcc, aaabbbccc, … }
          • Cannot be generated by a regular expression
          • Cannot be generated by a context-free grammar
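As a rough illustration of the first two cases, the sketch below uses Python's `re` module. The pattern `a*` describes L1 exactly. For L2, the natural attempt `a*b*` accepts every string of L2 but also strings outside it, because the pattern has no way to require equal counts. The variable names are illustrative only.

```python
import re

a_star = re.compile(r'a*')             # L1: zero or more a's
a_star_b_star = re.compile(r'a*b*')    # a superset of L2, not L2 itself

print(bool(a_star.fullmatch('aaa')))            # True
print(bool(a_star_b_star.fullmatch('aabb')))    # True
print(bool(a_star_b_star.fullmatch('aab')))     # True, although aab is not in L2
```

A context-free rule set such as S -> a S b together with an empty rule for S would generate exactly L2.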

  21. Outline • Formal language theory: strings and languages • Coding CFGs • Generating strings and trees • Short assignment #15

  22. Write a CFG in a multi-line string

      toy_cfg_str = """
      S -> NP VP
      NP -> DT N
      VP -> V
      DT -> a
      N -> flight
      V -> left
      V -> arrived
      """

  23. Arithmetic expression grammar

      expr_cfg_str = """
      E -> ( E )
      E -> E + E
      E -> E - E
      E -> E * E
      E -> E / E
      E -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      """

  24. General format of a CFG
      • Example:

        NONTERM1 -> a b
        NONTERM2 -> c
        NONTERM2 -> d
        NONTERM3 -> a | a NONTERM3

      • LHS (left-hand side) of a rule is a nonterminal
      • RHS (right-hand side) is the sequence of symbols that the rule rewrites as
      • There may be multiple rules for a single nonterminal (NONTERM2)
      • Multiple rules may be written as a disjunction (NONTERM3)

  25. Represent CFG as a dictionary • Map each (LHS) nonterminal to a list of a list of strings • List of strings: symbols (nonterminals and terminals) on RHS of a particular rule • List of list of strings: there may be multiple rules for a single nonterminal

  26. Example
      • Grammar:

        NONTERM1 -> a b
        NONTERM2 -> c
        NONTERM2 -> d
        NONTERM3 -> a | a NONTERM3

      • Represent as dictionary:

        cfg = {}
        cfg['NONTERM1'] = [['a', 'b']]
        cfg['NONTERM2'] = [['c'], ['d']]
        cfg['NONTERM3'] = [['a'], ['a', 'NONTERM3']]
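One consequence of this representation is that the nonterminals are exactly the dictionary keys, so any right-hand-side symbol that is not a key must be a terminal. The sketch below uses that observation; `get_terminals` is a hypothetical helper introduced here for illustration and is not part of the lecture code.

```python
cfg = {
    'NONTERM1': [['a', 'b']],
    'NONTERM2': [['c'], ['d']],
    'NONTERM3': [['a'], ['a', 'NONTERM3']],
}

def get_terminals(cfg):
    """Collect every right-hand-side symbol that is not a dictionary key."""
    terminals = set()
    for rhs_rules in cfg.values():       # list of rules for one nonterminal
        for symbols in rhs_rules:        # list of symbols in one rule
            for sym in symbols:
                if sym not in cfg:
                    terminals.add(sym)
    return terminals

print(sorted(get_terminals(cfg)))   # ['a', 'b', 'c', 'd']
print(len(cfg['NONTERM2']))         # 2 rules for NONTERM2
```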

  27. Parse a CFG string • Want to read string for CFG, return dictionary representation of the grammar • Parsing code will also allow for: • Empty lines in the grammar • Comments • Apply split to RHS of a rule to get disjunction of rules

  28. Splitting multi-line strings: code will allow for empty lines in the grammar
      • In this grammar, S is the start symbol

      >>> s = """
      ... S -> NP VP
      ...
      ... V -> eat
      ... """
      >>> s
      '\nS -> NP VP\n\nV -> eat\n'
      >>> s.split()          # don't want this as the result
      ['S', '->', 'NP', 'VP', 'V', '->', 'eat']
      >>> s.split('\n')      # each rule is in a separate string
      ['', 'S -> NP VP', '', 'V -> eat', '']

  29. Read list of rules on RHS
      • When there is a disjunction of rules in a line, need to get a list of strings for each rule

      >>> s = 'a | a NONTERM3'
      >>> s.split('|')
      ['a ', ' a NONTERM3']
      >>> for x in s.split('|'):     # split again to get
      ...     print(x.split())       # rid of whitespace
      ...
      ['a']
      ['a', 'NONTERM3']

  30. Parse a CFG string

      def read_grammar(cfg_str):
          cfg = {}
          for ln in cfg_str.split('\n'):
              comment_idx = ln.find('#')         # ignore comments
              if comment_idx != -1:
                  ln = ln[:comment_idx]
              ln = ln.strip()                    # drop surrounding whitespace
              if ln == '':
                  continue                       # ignore empty lines
              arrow_idx = ln.index('->')         # split the line
              lhs = ln[:arrow_idx].strip()       # into lhs and rhs
              rhs = ln[arrow_idx + 2:]
              rhs_rules = cfg.get(lhs, [])       # assign rules to lhs
              for symbols in rhs.split('|'):     # disjunction of rhs
                  rhs_rules.append(symbols.split())
              cfg[lhs] = rhs_rules
          return cfg
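The per-line work in read_grammar (drop the comment, skip empty lines, split at the arrow, split the alternatives) can also be expressed with `str.partition`, which splits a string around the first occurrence of a separator. The helper below is a sketch of that alternative for a single line; the name `parse_rule_line` is hypothetical and not part of the lecture code.

```python
def parse_rule_line(line):
    """Parse one grammar line into (lhs, list of RHS symbol lists), or None."""
    line = line.partition('#')[0].strip()   # drop any comment, then whitespace
    if not line:
        return None                          # empty or comment-only line
    lhs, arrow, rhs = line.partition('->')
    if not arrow:
        raise ValueError('missing "->" in rule: ' + line)
    return lhs.strip(), [alt.split() for alt in rhs.split('|')]

print(parse_rule_line('NONTERM3 -> a | a NONTERM3  # recursive rule'))
# ('NONTERM3', [['a'], ['a', 'NONTERM3']])
print(parse_rule_line('   # only a comment'))
# None
```

Stripping the left-hand side matters: without it the dictionary key would carry a trailing space and later lookups by nonterminal name would fail.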

  31. Print a CFG

      def print_cfg(cfg):
          for (lhs, rhs_rules) in cfg.items():
              for symbols in rhs_rules:
                  s = '{} -> {}'.format(lhs, ' '.join(symbols))
                  print(s)

  32. Testing code

      toy_cfg = read_grammar(toy_cfg_str)
      expr_cfg = read_grammar(expr_cfg_str)
      print('toy cfg:')
      print_cfg(toy_cfg)
      print('\nexprcfg:')
      print_cfg(expr_cfg)

  33. Output

      toy cfg:
      N -> flight
      DT -> a
      VP -> V
      S -> NP VP
      V -> left
      V -> arrived
      NP -> DT N

  34. Output

      exprcfg:
      E -> ( E )
      E -> E + E
      E -> E - E
      E -> E * E
      E -> E / E
      E -> 0
      E -> 1
      E -> 2
      E -> 3
      E -> 4
      E -> 5
      E -> 6
      E -> 7
      E -> 8
      E -> 9

  35. Show the nonterminals

      def get_nonterminals(cfg):
          return set(cfg.keys())

      toy_cfg_nonterms = get_nonterminals(toy_cfg)
      expr_cfg_nonterms = get_nonterminals(expr_cfg)
      print('\ntoy_cfg_nonterms:', toy_cfg_nonterms)
      print('expr_cfg_nonterms:', expr_cfg_nonterms)

      # Output (Python 3; set order may vary):
      # toy_cfg_nonterms: {'N', 'DT', 'VP', 'S', 'V', 'NP'}
      # expr_cfg_nonterms: {'E'}

  36. Outline • Formal language theory: strings and languages • Coding CFGs • Generating strings and trees • Short assignment #15

  37. Generate a random string from a CFG
      • Procedure:
        • For a given nonterminal, choose a rule at random
        • For each symbol in the RHS of that rule:
          • If it is a terminal symbol, add it to the string
          • If it is a nonterminal, recursively generate a string from that nonterminal
      • Generate a string from the grammar through the start symbol

  38. Example of recursive string generation
      • Generate a string from this CFG:

        A -> x Y z
        Y -> c

      • Begin with start symbol A
      • Look up rule for nonterminal: A -> x Y z
      • Generate items on right-hand side of rule
        • Generate terminal: x
        • Recursively generate string for nonterminal Y
          • Look up rule for nonterminal: Y -> c
          • Generate items on right-hand side of rule
            • Generate terminal: c
        • Generate terminal: z
      • Final string generated: x c z
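The trace above can be reproduced with a short deterministic sketch. Because each nonterminal in this small grammar has exactly one rule, no random choice is needed. The function name `expand` and the `trace` list are illustrative assumptions, not part of the lecture code.

```python
cfg = {'A': [['x', 'Y', 'z']], 'Y': [['c']]}

def expand(cfg, symbol, trace):
    """Expand symbol using its first rule; record each step in trace."""
    if symbol not in cfg:                  # terminal: emit it as is
        trace.append('terminal ' + symbol)
        return [symbol]
    rule = cfg[symbol][0]                  # each nonterminal here has one rule
    trace.append(symbol + ' -> ' + ' '.join(rule))
    out = []
    for sym in rule:                       # generate items on the RHS in order
        out.extend(expand(cfg, sym, trace))
    return out

trace = []
result = ' '.join(expand(cfg, 'A', trace))
print(result)    # x c z
for step in trace:
    print(step)
```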

  39. Generate a string

      import random

      def generate_string(cfg, lhs, nonterms):
          s = ''
          nonterm_rules = cfg[lhs]
          # randomly choose a rule
          r_idx = random.randint(0, len(nonterm_rules) - 1)
          rule = nonterm_rules[r_idx]
          for sym in rule:            # step through symbols in rule
              if sym in nonterms:     # recursive case
                  s += generate_string(cfg, sym, nonterms) + ' '
              else:                   # terminal symbol, concatenate to string
                  s += sym + ' '
          return s[:-1]               # remove last ' '
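A possible variation, offered only as a sketch: collect the generated symbols in a list and join them once at the end, which avoids having to trim a trailing space, and use a seeded `random.Random` object so that runs are repeatable. The function name `generate_tokens` is hypothetical, and the toy grammar is written out by hand as a dictionary so that the example is self-contained.

```python
import random

def generate_tokens(cfg, lhs, rng):
    """Like generate_string, but returns the terminals as a list of tokens."""
    tokens = []
    for sym in rng.choice(cfg[lhs]):       # pick one rule uniformly at random
        if sym in cfg:                     # nonterminal: recurse
            tokens.extend(generate_tokens(cfg, sym, rng))
        else:                              # terminal: keep as is
            tokens.append(sym)
    return tokens

toy_cfg = {
    'S': [['NP', 'VP']], 'NP': [['DT', 'N']], 'VP': [['V']],
    'DT': [['a']], 'N': [['flight']], 'V': [['left'], ['arrived']],
}
rng = random.Random(0)                     # fixed seed: repeatable output
sentences = {' '.join(generate_tokens(toy_cfg, 'S', rng)) for _ in range(50)}
print(sorted(sentences))
```

Here nonterminals are recognized as dictionary keys, which plays the role of the `nonterms` argument in the lecture version.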

  40. Testing code

      print('\nRandom strings generated by toy cfg:')
      for i in range(10):
          print(generate_string(toy_cfg, 'S', toy_cfg_nonterms))

      print('\nRandom strings generated by exprcfg:')
      for i in range(10):
          print(generate_string(expr_cfg, 'E', expr_cfg_nonterms))

  41. Output

      Random strings from toy cfg:
      a flight left
      a flight left
      a flight arrived
      a flight left
      a flight left
      a flight arrived
      a flight arrived
      a flight arrived
      a flight arrived
      a flight left

  42. Output

      Random strings from exprcfg:
      ( 6 ) - 5 * 5
      5
      9
      5 / 5
      1
      2 - ( 0 - 0 + 3 ) + 1
      4
      3 + 9 / 0 / 5 / 0 + 7
      3
      2

  43. Generate some long expressions

      print('\nLong strings generated by exprcfg:')
      for i in range(1000):
          s = generate_string(expr_cfg, 'E', expr_cfg_nonterms)
          if len(s) > 50:
              print(s)

      Output: random strings from exprcfg:
      3 * 9 * 6 / ( 0 + 6 - ( 9 / 8 - 2 ) + 3 / 1 - 4 ) * 4
      4 + 5 * 5 + 1 / 5 - ( 1 ) + ( 4 ) / ( 9 ) + 6 + 2 - 1

  44. Generate random trees • Let’s modify our tree representation to allow an arbitrary number of children: (value, list-of-children) • Parent node: (nonterminal, list-of-child-nodes) • Leaf node: (terminal, [])

  45. Example • A is the parent of b and C and d. C is the parent of e. ('A', [('b', []), ('C', [('e', [])]), ('d', [])]) • Parent node: (nonterminal, list-of-child-nodes) • Leaf node: (terminal, [])
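To get a feel for this representation, here is a small sketch of two recursive functions over (value, list-of-children) tuples: one collects the leaves from left to right, the other counts nodes. The names `leaves` and `count_nodes` are illustrative assumptions and are unrelated to the homework function.

```python
tree = ('A', [('b', []), ('C', [('e', [])]), ('d', [])])

def leaves(node):
    """Return the leaf values of a tree, left to right."""
    value, children = node
    if not children:                  # leaf node: (terminal, [])
        return [value]
    result = []
    for child in children:            # parent node: visit each child in order
        result.extend(leaves(child))
    return result

def count_nodes(node):
    """Count this node plus all of its descendants."""
    value, children = node
    return 1 + sum(count_nodes(child) for child in children)

print(leaves(tree))        # ['b', 'e', 'd']
print(count_nodes(tree))   # 5
```

For a tree generated from a grammar, joining the leaves with spaces gives the string that the tree derives.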

  46. Generate a tree

      import random

      def generate_tree(cfg, lhs, nonterms):
          # randomly choose a rule
          nonterm_rules = cfg[lhs]
          r_idx = random.randint(0, len(nonterm_rules) - 1)
          rule = nonterm_rules[r_idx]

          children = []
          for sym in rule:
              if sym in nonterms:     # recursive case
                  ch_node = generate_tree(cfg, sym, nonterms)
                  children.append(ch_node)
              else:                   # base case: leaf node
                  children.append((sym, []))
          parent = (lhs, children)
          return parent

  47. Print out a tree

      def pretty_print(node):
          pass    # this is your homework problem

  48. Testing code

      toy_tree = generate_tree(toy_cfg, 'S', toy_cfg_nonterms)
      pretty_print(toy_tree)

      expr_tree = generate_tree(expr_cfg, 'E', expr_cfg_nonterms)
      pretty_print(expr_tree)

  49. Output

      S
        NP
          DT
            a
          N
            flight
        VP
          V
            arrived

  50. Output

      E
        E
          E
            0
          +
          E
            E
              4
            *
            E
              1
        +
        E
          E
            8
          /
          E
            7
