##### LING 408/508: Computational Techniques for Linguists


Lecture 23, 10/15/2012

**Outline**

• Formal language theory: strings and languages
• Coding CFGs
• Generating strings and trees
• Short assignment #15

**Note: the following is mathematical notation. It is not Python code.**

**Alphabet**

• Let Σ be a finite alphabet.
• The elements of Σ may be multi-character symbols.
• Examples:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • LETTERS = { A, B, C, …, X, Y, Z }
  • ENGLISH_WORDS = { a, aa, …, zyzzogeton, zyzzyva }

**Strings**

• A string is a sequence of zero or more symbols taken from an alphabet.
• Example:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • s1 = 0123456789
  • s2 = 321654

**Empty string**

• The empty string, ε, consists of zero symbols.
  • Pronounced as “epsilon”
• Example:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • s1 = 0123456789
  • s2 = 321654
  • s3 = ε

**Concatenation**

• A string is produced by concatenation of symbols from an alphabet. The concatenation of symbols is written by placing two symbols immediately next to each other.
• Strings can also be concatenated.
• Example:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • 1 ∈ DIGITS and 2 ∈ DIGITS.
    Concatenate 1 and 2 to produce the string 12
  • Concatenate 123 and 456 to produce the string 123456

**Repetition of a string**

• Use an “exponent” with an integer value >= 0 to indicate the number of times a symbol is repeated in a string.
• Zero repetitions is equivalent to ε.
• Examples:
  • $a^5$ = aaaaa
  • $a^1$ = a
  • $a^0$ = ε

**Repetition: use parentheses for grouping**

• $ab^2$ = abb
• $(ab)^2$ = abab
• $abc(def)^2$ = abcdefdef
• $abc(def)^1$ = abcdef
• $abc(def)^0$ = abcε = abc

**Repetition of alphabet**

• An “exponent” on an alphabet indicates a set of strings, such that each string consists of symbols from that alphabet and has the specified length.
• Zero repetitions is equivalent to { ε }.
• Example:
  • Σ = { 1, 2, 3 }
  • $\Sigma^0$ = { ε }
  • $\Sigma^1$ = { 1, 2, 3 }
  • $\Sigma^2$ = { 11, 12, 13, 21, 22, 23, 31, 32, 33 }

**Closure of an alphabet**

• Closure: * indicates zero or more symbols
  $\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \Sigma^3 \cup \dots$
• Example:
  • Σ = { 1, 2, 3 }
  • $\Sigma^*$ = { ε } ∪ { 1, 2, 3 } ∪ { 11, 12, 13, 21, 22, 23, 31, 32, 33 } ∪ … = { ε, 1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33, … }

**I’m not talking about regular expressions**

• Finite repetition such as $a^5$ is not available within regular expression syntax.
  • In a regular expression, specify it as aaaaa instead.
• Closure on a symbol, such as a* (in a regular expression), is not available within formal language theory.
  • By “a*”, one probably means { ε, a, aa, aaa, … }, but there is another way to specify this in formal language theory.

**Languages**

• A language is a set of strings.
• May be finite or infinite.
• Examples:
  • L1 = { apples, bananas, pears }
  • L2 = { 0, 1, 2, 3, 4, … }
  • L3 = { ε, a, b, aa, ab, ba, bb, aaa, aab, aba, … }

**“…” is imprecise**

• Finite sets are easy to describe because you can list the elements.
• With infinite languages, the use of “…” is not precise. We need better mathematical notation to describe these sets.
• Example:
  • L4 = { 1, 2, 3, … }
  • L4 could be { 1, 2, 3, 4, 5, 6, … }, or it could be { 1, 2, 3, 11, 12, 13, … }, or many other sets

**Predicative definitions of languages**

• Describe a set through a pattern and remarks on the properties of that pattern.
• Read “|” as “such that”.
• Examples:
  • { x | x is a positive integer } = { 1, 2, 3, 4, 5, … }
  • { $a^n$ | n is an integer >= 0 } = { ε, a, aa, aaa, … }
  • { $a^n$ | n is an integer and 1 <= n <= 5 } = { a, aa, aaa, aaaa, aaaaa }
  • { $a^n b^n$ | n >= 0 } = { ε, ab, aabb, aaabbb, … }
  • { $a^n b^m$ | n >= 0 and m > n } = { b, bb, …, abb, abbb, …, aabbb, aabbbb, aabbbbb, …, aaabbbb, aaabbbbb, … }

**Computational descriptions of languages**

• A predicate definition of a language tells us about the properties of the strings in a language, but it does not tell us a procedure by which the language can be generated.
• Alternatives:
  • Inductive definitions
  • Grammar-based definitions

**Inductive definitions**

• State base case(s). At least one base case.
• State inductive case(s). Zero or more.
• Example:
  • L = { $a^n b^n$ | n >= 0 } = { ε, ab, aabb, aaabbb, … }
  • Base case: ε ∈ L
  • Inductive case: if x ∈ L, then a x b ∈ L

**Work through example**

• Base case: ε ∈ L
• Inductive case: if x ∈ L, then a x b ∈ L
• L = { }
  • Begin with empty set.
• From the base case, ε ∈ L. So we place ε in L.
• ε is a string in L. From the inductive case, a ε b = ab ∈ L.
• ab is a string in L. From the inductive case, a ab b = aabb ∈ L.
• etc.
• This inductive definition generates L = { $a^n b^n$ | n >= 0 } = { ε, ab, aabb, aaabbb, … }
  • Strings placed in L so far: ε, ab, aabb

**Grammar-based definitions of languages**

• A grammar is a formalism (with rules for computation) that generates a set of strings.
• Examples:
  • Regular expression
  • Finite-state automaton
  • Context-free grammar
  • Pushdown automaton
  • etc.

**Differences in generative capacity of grammars**

• Some languages can(not) be generated by certain types of grammars.
• Examples:
  • L1 = { $a^n$ | n >= 0 } = { ε, a, aa, aaa, … }
    • Can be generated by regular expression
    • Can be generated by context-free grammar
  • L2 = { $a^n b^n$ | n >= 0 } = { ε, ab, aabb, aaabbb, … }
    • Cannot be generated by regular expression
    • Can be generated by context-free grammar
  • L3 = { $a^n b^n c^n$ | n >= 0 } = { ε, abc, aabbcc, aaabbbccc, … }
    • Cannot be generated by regular expression
    • Cannot be generated by context-free grammar

**Outline**

• Formal language theory: strings and languages
• Coding CFGs
• Generating strings and trees
• Short assignment #15

**Write a CFG in a multi-line string**

    toy_cfg_str = """
    S -> NP VP
    NP -> DT N
    VP -> V
    DT -> a
    N -> flight
    V -> left
    V -> arrived
    """

**Arithmetic expression grammar**

    expr_cfg_str = """
    E -> ( E )
    E -> E + E
    E -> E - E
    E -> E * E
    E -> E / E
    E -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    """

**General format of a CFG**

• Example:

    NONTERM1 -> a b
    NONTERM2 -> c
    NONTERM2 -> d
    NONTERM3 -> a | a NONTERM3

• LHS (left-hand side) of a rule is a nonterminal.
• RHS (right-hand side) is the sequence of symbols that the rule rewrites as.
• There may be multiple rules for a single nonterminal (NONTERM2).
• Multiple rules may be written as a disjunction (NONTERM3).

**Represent CFG as a dictionary**

• Map each (LHS) nonterminal to a list of lists of strings.
  • List of strings: symbols (nonterminals and terminals) on the RHS of a particular rule
  • List of lists of strings: there may be multiple rules for a single nonterminal

**Example**

• Grammar:

    NONTERM1 -> a b
    NONTERM2 -> c
    NONTERM2 -> d
    NONTERM3 -> a | a NONTERM3

• Represent as dictionary:

    cfg = {}
    cfg['NONTERM1'] = [['a', 'b']]
    cfg['NONTERM2'] = [['c'], ['d']]
    cfg['NONTERM3'] = [['a'], ['a', 'NONTERM3']]

**Parse a CFG string**

• Want to read the string for a CFG and return a dictionary representation of the grammar.
• Parsing code will also allow for:
  • Empty lines in the grammar
  • Comments
• Apply split to the RHS of a rule to get the disjunction of rules.

**Splitting multi-line strings: code will allow for empty lines in the grammar**

• S is the start symbol.

    >>> s = """
    S -> NP VP

    V -> eat
    """
    >>> s
    '\nS -> NP VP\n\nV -> eat\n'
    >>> s.split()        # don't want this as the result
    ['S', '->', 'NP', 'VP', 'V', '->', 'eat']
    >>> s.split('\n')    # each rule is in a separate string
    ['', 'S -> NP VP', '', 'V -> eat', '']

**Read list of rules on RHS**

• When there is a disjunction of rules in a line, need to get a list of strings for each rule.

    >>> s = 'a | a NONTERM3'
    >>> s.split('|')
    ['a ', ' a NONTERM3']
    >>> for x in s.split('|'):    # split again to get
    ...     print(x.split())      # rid of whitespace
    ...
    ['a']
    ['a', 'NONTERM3']

**Parse a CFG string**

    def read_grammar(cfg_str):
        cfg = {}
        for ln in cfg_str.split('\n'):
            comment_idx = ln.find('#')        # ignore comments
            if comment_idx != -1:
                ln = ln[:comment_idx]
            ln = ln.strip()                   # drop surrounding whitespace
            if ln == '':
                continue                      # ignore empty lines
            arrow_idx = ln.index('->')        # split the line
            lhs = ln[:arrow_idx].strip()      # into lhs and rhs
            rhs = ln[arrow_idx + 2:]
            rhs_rules = cfg.get(lhs, [])      # assign rules to lhs
            for symbols in rhs.split('|'):    # disjunction of rhs
                rhs_rules.append(symbols.split())
            cfg[lhs] = rhs_rules
        return cfg

**Print a CFG**

    def print_cfg(cfg):
        for (lhs, rhs_rules) in cfg.items():
            for symbols in rhs_rules:
                s = '{} -> {}'.format(lhs, ' '.join(symbols))
                print(s)

**Testing code**

    toy_cfg = read_grammar(toy_cfg_str)
    expr_cfg = read_grammar(expr_cfg_str)
    print('toy cfg:')
    print_cfg(toy_cfg)
    print('\nexpr cfg:')
    print_cfg(expr_cfg)

**Output**

• The order in which nonterminals are printed follows the dictionary's iteration order, so it may differ between Python versions.

    toy cfg:
    N -> flight
    DT -> a
    VP -> V
    S -> NP VP
    V -> left
    V -> arrived
    NP -> DT N

**Output**

    expr cfg:
    E -> ( E )
    E -> E + E
    E -> E - E
    E -> E * E
    E -> E / E
    E -> 0
    E -> 1
    E -> 2
    E -> 3
    E -> 4
    E -> 5
    E -> 6
    E -> 7
    E -> 8
    E -> 9

**Show the nonterminals**

    def get_nonterminals(cfg):
        return set(cfg.keys())

    toy_cfg_nonterms = get_nonterminals(toy_cfg)
    expr_cfg_nonterms = get_nonterminals(expr_cfg)
    print('\ntoy_cfg_nonterms:', toy_cfg_nonterms)
    print('expr_cfg_nonterms:', expr_cfg_nonterms)

    # Output (the order of elements in a set may vary):
    # toy_cfg_nonterms: {'N', 'DT', 'VP', 'S', 'V', 'NP'}
    # expr_cfg_nonterms: {'E'}

**Outline**

• Formal language theory: strings and languages
• Coding CFGs
• Generating strings and trees
• Short assignment #15

**Generate a random string from a CFG**

• Procedure:
  • For a given nonterminal, choose a rule at random.
  • For each symbol in the RHS of that rule:
    • If it is a terminal symbol, add it to the string.
    • If it is a nonterminal, recursively generate a string from that nonterminal.
• Generate a string from the grammar through the start symbol.

**Example of recursive string generation**

• Generate a string from this CFG:

    A -> x Y z
    Y -> c

• Begin with start symbol A.
• Look up rule for nonterminal: A -> x Y z
• Generate items on right-hand side of rule:
  • Generate terminal: x
  • Recursively generate string for nonterminal Y
    • Look up rule for nonterminal: Y -> c
    • Generate items on right-hand side of rule:
      • Generate terminal: c
  • Generate terminal: z
• Final string generated: x c z

    import random

    def generate_string(cfg, lhs, nonterms):
        s = ''
        nonterm_rules = cfg[lhs]
        # randomly choose a rule
        r_idx = random.randint(0, len(nonterm_rules) - 1)
        rule = nonterm_rules[r_idx]
        for sym in rule:          # step through symbols in rule
            if sym in nonterms:   # recursive case
                s += generate_string(cfg, sym, nonterms) + ' '
            else:                 # terminal symbol, concatenate to string
                s += sym + ' '
        return s[:-1]             # remove last ' '

**Testing code**

    print('\nRandom strings generated by toy cfg:')
    for i in range(10):
        print(generate_string(toy_cfg, 'S', toy_cfg_nonterms))

    print('\nRandom strings generated by expr cfg:')
    for i in range(10):
        print(generate_string(expr_cfg, 'E', expr_cfg_nonterms))

**Output**

    Random strings generated by toy cfg:
    a flight left
    a flight left
    a flight arrived
    a flight left
    a flight left
    a flight arrived
    a flight arrived
    a flight arrived
    a flight arrived
    a flight left

**Output**

    Random strings generated by expr cfg:
    ( 6 ) - 5 * 5
    5
    9
    5 / 5
    1
    2 - ( 0 - 0 + 3 ) + 1
    4
    3 + 9 / 0 / 5 / 0 + 7
    3
    2

**Generate some long expressions**

    print('\nLong strings generated by expr cfg:')
    for i in range(1000):
        s = generate_string(expr_cfg, 'E', expr_cfg_nonterms)
        if len(s) > 50:
            print(s)

Output:

    Long strings generated by expr cfg:
    3 * 9 * 6 / ( 0 + 6 - ( 9 / 8 - 2 ) + 3 / 1 - 4 ) * 4
    4 + 5 * 5 + 1 / 5 - ( 1 ) + ( 4 ) / ( 9 ) + 6 + 2 - 1

**Generate random trees**

• Let’s modify our tree representation to allow an arbitrary number of children: (value, list-of-children)
  • Parent node: (nonterminal, list-of-child-nodes)
  • Leaf node: (terminal, [])

**Example**

• A is the parent of b and C and d. C is the parent of e.

    ('A', [('b', []), ('C', [('e', [])]), ('d', [])])

• Parent node: (nonterminal, list-of-child-nodes)
• Leaf node: (terminal, [])

    import random

    def generate_tree(cfg, lhs, nonterms):
        # randomly choose a rule
        nonterm_rules = cfg[lhs]
        r_idx = random.randint(0, len(nonterm_rules) - 1)
        rule = nonterm_rules[r_idx]
        children = []
        for sym in rule:
            if sym in nonterms:   # recursive case
                ch_node = generate_tree(cfg, sym, nonterms)
                children.append(ch_node)
            else:                 # base case: leaf node
                children.append((sym, []))
        parent = (lhs, children)
        return parent

**Print out a tree**

    def pretty_print(node):
        pass    # this is your homework problem

**Testing code**

    toy_tree = generate_tree(toy_cfg, 'S', toy_cfg_nonterms)
    pretty_print(toy_tree)
    expr_tree = generate_tree(expr_cfg, 'E', expr_cfg_nonterms)
    pretty_print(expr_tree)

**Output**

    S
      NP
        DT
          a
        N
          flight
      VP
        V
          arrived

**Output**

    E
      E
        E
          0
        +
        E
          E
            4
          *
          E
            1
      +
      E
        E
          8
        /
        E
          7
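
The (value, list-of-children) tree representation above can be checked by reading the terminals back off a tree. The following is a minimal sketch, not part of the lecture: the helper name `tree_yield` is hypothetical, and it assumes the grammar has no empty rules, so that a node with an empty child list is always a terminal leaf.

```python
def tree_yield(node):
    """Return the terminal symbols of a tree, left to right.

    Assumes the (value, list-of-children) representation from the slides,
    where a leaf node is (terminal, []).
    """
    value, children = node
    if not children:              # leaf node: (terminal, [])
        return [value]
    leaves = []
    for child in children:        # parent node: visit children in order
        leaves.extend(tree_yield(child))
    return leaves

# The example tree from the slides: A is the parent of b, C and d;
# C is the parent of e.
example_tree = ('A', [('b', []), ('C', [('e', [])]), ('d', [])])
print(' '.join(tree_yield(example_tree)))    # prints: b e d
```

Applied to a tree returned by `generate_tree`, joining the result with spaces should give a string in the same format that `generate_string` produces for the same sequence of rule choices.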