LING 408/508: Computational Techniques for Linguists

Lecture 23
10/15/2012

Note: the formal language theory notation below is mathematical notation. It is not Python code.

Outline
  • Formal language theory: strings and languages
  • Coding CFGs
  • Generating strings and trees
  • Short assignment #15
Alphabet
  • Let Σ be a finite alphabet.
    • The elements of Σ may be multi-character symbols
  • Examples:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • LETTERS = { A, B, C, …, X, Y, Z }
  • ENGLISH_WORDS = { a, aa, …, zyzzogeton, zyzzyva }
Strings
  • A string is a sequence of zero or more symbols taken from an alphabet.
  • Example:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • s1 = 0123456789
  • s2 = 321654
Empty string
    • The empty string, ε, consists of zero symbols
    • Pronounced as “epsilon”
  • Example:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • s1 = 0123456789
  • s2 = 321654
  • s3 = ε
Concatenation
  • A string is produced by concatenation of symbols from an alphabet. The concatenation of symbols is written by placing two symbols immediately next to each other.
  • Strings can also be concatenated.
  • Example:
  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • 1 ∈ DIGITS and 2 ∈ DIGITS. Concatenate 1 and 2 to produce the string 12
  • Concatenate 123 and 456 to produce the string 123456
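Although the notation on these slides is mathematical, a rough Python analog can make concatenation concrete. The sketch below assumes a string is modeled as a tuple of symbols, so that multi-character symbols stay intact; the helper name `concat` is illustrative and not from the lecture.

```python
# A string over an alphabet is modeled here as a tuple of symbols,
# so that a multi-character symbol such as "flight" counts as one symbol.

def concat(x, y):
    """Concatenate two strings, each given as a tuple of symbols."""
    return tuple(x) + tuple(y)

s1 = ("1", "2", "3")
s2 = ("4", "5", "6")
print(concat(s1, s2))          # ('1', '2', '3', '4', '5', '6')

# With an alphabet of English words, each word is a single symbol.
sentence = concat(("a", "flight"), ("left",))
print(len(sentence))           # 3 symbols, not 11 characters

# The empty string is the empty tuple; concatenating it changes nothing.
print(concat((), s1) == s1)    # True
```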
Repetition of a string
  • Use an “exponent” with an integer value >= 0 to indicate the number of times a symbol is repeated in a string
  • Zero repetitions is equivalent to ε
  • Examples:
  • a^5 = aaaaa
  • a^1 = a
  • a^0 = ε
Repetition: uses parentheses for grouping
  • ab^2 = abb
  • (ab)^2 = abab
  • abc(def)^2 = abcdefdef
  • abc(def)^1 = abcdef
  • abc(def)^0 = abcε = abc
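The exponent notation is not Python, but Python's `*` operator on `str` behaves the same way when every symbol is a single character. A minimal sketch under that assumption, with a hypothetical helper `repeat`:

```python
def repeat(s, n):
    """Return s concatenated with itself n times; n = 0 gives the empty string."""
    if n < 0:
        raise ValueError("n must be >= 0")
    return s * n

print(repeat("a", 5))                      # aaaaa
print("a" + repeat("b", 2))                # abb        (the exponent binds only to b)
print(repeat("ab", 2))                     # abab       (parentheses group ab)
print("abc" + repeat("def", 2))            # abcdefdef
print("abc" + repeat("def", 0) == "abc")   # True       (zero repetitions is the empty string)
```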
Repetition of alphabet
  • An “exponent” on an alphabet indicates a set of strings, such that each string consists of any symbols from that alphabet, and is of the length specified
  • Zero repetitions is equivalent to { ε }
  • Example:
  • Σ = { 1, 2, 3 }
  • Σ^0 = { ε }
  • Σ^1 = { 1, 2, 3 }
  • Σ^2 = { 11, 12, 13, 21, 22, 23, 31, 32, 33 }
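As a sketch of how this set could be computed, the code below builds all strings of a given length with `itertools.product`. The function name `alphabet_power` is an assumption for illustration, and symbols are taken to be single characters.

```python
from itertools import product

def alphabet_power(sigma, n):
    """Return the set of all strings of length n over the alphabet sigma."""
    return {"".join(symbols) for symbols in product(sorted(sigma), repeat=n)}

SIGMA = {"1", "2", "3"}
print(alphabet_power(SIGMA, 0))           # {''}  (only the empty string)
print(sorted(alphabet_power(SIGMA, 1)))   # ['1', '2', '3']
print(sorted(alphabet_power(SIGMA, 2)))
# ['11', '12', '13', '21', '22', '23', '31', '32', '33']
print(len(alphabet_power(SIGMA, 3)))      # 27
```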
Closure of an alphabet
  • An “exponent” on an alphabet indicates a set of strings, such that each string consists of any symbols from that alphabet, and is of the length specified
  • Zero repetitions is equivalent to { ε }
  • Closure: * indicates zero or more symbols

Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ …

  • Example:
  • Σ = { 1, 2, 3 }
  • Σ^0 = { ε }
  • Σ^1 = { 1, 2, 3 }
  • Σ^2 = { 11, 12, 13, 21, 22, 23, 31, 32, 33 }
  • Σ* = { ε } ∪ { 1, 2, 3 } ∪ { 11, 12, 13, 21, 22, 23, 31, 32, 33 } ∪ …

= { ε, 1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33, … }
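Since Σ* is infinite, it cannot be built as a Python set, but it can be enumerated lazily. The generator below is a sketch, assuming a non-empty alphabet of single-character symbols; the name `closure` is illustrative.

```python
from itertools import count, islice, product

def closure(sigma):
    """Yield the strings of sigma* in order of increasing length."""
    symbols = sorted(sigma)
    for n in count(0):                          # n = 0, 1, 2, ...
        for combo in product(symbols, repeat=n):
            yield "".join(combo)

# Take only the first 13 strings: 1 of length 0, 3 of length 1, 9 of length 2.
first = list(islice(closure({"1", "2", "3"}), 13))
print(first)
# ['', '1', '2', '3', '11', '12', '13', '21', '22', '23', '31', '32', '33']
```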

I’m not talking about regular expressions
  • Finite repetition written as an exponent, such as a^5, is not available within basic regular expression syntax
    • In a regular expression, instead specify it as aaaaa (many practical regex engines, including Python's re, also accept the counted form a{5})
  • Closure on symbols, such as a* (in a regular expression), is not available within formal language theory
    • By “a*”, one probably means { ε, a, aa, aaa, … }, but there is another way to specify this in formal language theory
Languages
  • A language is a set of strings
    • May be finite or infinite
  • Examples:
  • L1 = { apples, bananas, pears }
  • L2 = { 0, 1, 2, 3, 4, … }
  • L3 = { ε, a, b, aa, ab, ba, bb, aaa, aab, aba, … }
“…” is imprecise
  • Finite sets are easy to describe because you can list the elements.
  • With infinite languages, the use of “…” is not precise. We need better mathematical notation to describe these sets.
  • Example:
  • L4 = { 1, 2, 3, … }
  • L4 could be { 1, 2, 3, 4, 5, 6, … }, or it could be { 1, 2, 3, 11, 12, 13, … }, or many other sets
Predicative definitions of languages
  • Describe a set through a pattern and remarks on the properties of that pattern
    • Read “|” as “such that”
  • Examples:
  • { x | x is a positive integer } = { 1, 2, 3, 4, 5, … }
  • { a^n | n is an integer >= 0 } = { ε, a, aa, aaa, … }
  • { a^n | n is an integer and 1 <= n <= 5 } = { a, aa, aaa, aaaa, aaaaa }
  • { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }
  • { a^n b^m | n >= 0 and m > n } = { b, bb, …, abb, abbb, …, aabbb, aabbbb, aabbbbb, …, aaabbbb, aaabbbbb, … }
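A predicative definition translates naturally into a membership test: a function that returns True exactly when a string has the stated property. The two functions below are a sketch for the last two examples; the names `in_anbn` and `in_anbm` are assumptions made for illustration.

```python
def in_anbn(s):
    """True if s is in { a^n b^n | n >= 0 }."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

def in_anbm(s):
    """True if s is in { a^n b^m | n >= 0 and m > n }."""
    n = len(s) - len(s.lstrip("a"))     # number of leading a's
    m = len(s) - n                      # number of remaining symbols
    return s == "a" * n + "b" * m and m > n

print([in_anbn(s) for s in ["", "ab", "aabb", "aab", "ba"]])
# [True, True, True, False, False]
print([in_anbm(s) for s in ["b", "abb", "aabbb", "ab", ""]])
# [True, True, True, False, False]
```

Note that these functions test membership only; they do not say how to generate the strings of the language.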
Computational descriptions of languages
  • A predicate definition of a language tells us about the properties of the strings in a language, but it does not tell us a procedure by which the language can be generated
  • Alternatives:
    • Inductive definitions
    • Grammar-based definitions
Inductive definitions
  • State base case(s). At least one base case.
  • State inductive case(s). Zero or more.
  • Example:
  • L = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }
  • Base case: ε ∈ L
  • Inductive case: if x ∈ L, then a x b ∈ L
Work through example
  • Base case: ε ∈ L
  • Inductive case: if x ∈ L, then a x b ∈ L
  • L = { }
  • Begin with empty set.
  • From the base case, ε ∈ L. So we place ε in L
  • ε is a string in L. From the inductive case, a ε b = a b ∈ L
  • ab is a string in L. From the inductive case, a ab b = aabb ∈ L
  • etc.
  • This inductive definition generates

L = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }

  • Contents of L after the steps above: ε, ab, aabb
Grammar-based definitions of languages
  • A grammar is a formalism (with rules for computation) that generates a set of strings
  • Examples:
    • Regular expression
    • Finite-state automaton
    • Context-free grammar
    • Pushdown automaton
    • etc.
Differences in generative capacity of grammars
  • Some languages can(not) be generated by certain types of grammars
  • Examples

L1 = { a^n | n >= 0 } = { ε, a, aa, aaa, … }

      • Can be generated by regular expression
      • Can be generated by context-free grammar

L2 = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }

      • Cannot be generated by regular expression
      • Can be generated by context-free grammar

L3 = { a^n b^n c^n | n >= 0 } = { ε, abc, aabbcc, aaabbbccc, … }

      • Cannot be generated by regular expression
      • Cannot be generated by context-free grammar
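To illustrate the difference (this is not a proof), the sketch below uses Python's `re` module with patterns built only from concatenation and star. The pattern `a*` describes L1 exactly, while the natural attempt at L2, `a*b*`, accepts strings with unequal counts. The helper names are assumptions for illustration.

```python
import re

def in_L1(s):
    """Membership in L1 = { a^n | n >= 0 } using the regular expression a*."""
    return re.fullmatch(r"a*", s) is not None

def matches_a_star_b_star(s):
    """A regular expression that one might try for L2; it overgenerates."""
    return re.fullmatch(r"a*b*", s) is not None

def in_L2(s):
    """Direct check for L2 = { a^n b^n | n >= 0 }, by counting."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

for s in ["", "ab", "aabb", "aab", "abb"]:
    print(repr(s), matches_a_star_b_star(s), in_L2(s))
# 'aab' and 'abb' match a*b* but are not in L2
```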
Write a CFG in a multi-line string

toy_cfg_str = """
S -> NP VP
NP -> DT N
VP -> V
DT -> a
N -> flight
V -> left
V -> arrived
"""

Arithmetic expression grammar

expr_cfg_str = """
E -> ( E )
E -> E + E
E -> E - E
E -> E * E
E -> E / E
E -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
"""

General format of a CFG
  • Example:

NONTERM1 -> a b
NONTERM2 -> c
NONTERM2 -> d
NONTERM3 -> a | a NONTERM3

  • LHS (left-hand side) of a rule is a nonterminal
  • RHS (right-hand side) is the sequence of symbols that the LHS is rewritten as
    • There may be multiple rules for a single nonterminal (NONTERM2)
    • Multiple rules may be written as a disjunction (NONTERM3)
Represent CFG as a dictionary
  • Map each (LHS) nonterminal to a list of lists of strings
    • List of strings: symbols (nonterminals and terminals) on RHS of a particular rule
    • List of lists of strings: there may be multiple rules for a single nonterminal
Example
  • Grammar:

NONTERM1 -> a b
NONTERM2 -> c
NONTERM2 -> d
NONTERM3 -> a | a NONTERM3

  • Represent as dictionary:

cfg = {}
cfg['NONTERM1'] = [['a', 'b']]
cfg['NONTERM2'] = [['c'], ['d']]
cfg['NONTERM3'] = [['a'], ['a', 'NONTERM3']]

Parse a CFG string
  • Goal: read the string for a CFG and return a dictionary representation of the grammar
  • Parsing code will also allow for:
    • Empty lines in the grammar
    • Comments
  • Apply split to the RHS of a rule to get the disjunction of rules
Splitting multi-line strings: code will allow for empty lines in the grammar

>>> s = """
... S -> NP VP # comment: S is the start symbol
...
... V -> eat
... """
>>> s
'\nS -> NP VP # comment: S is the start symbol\n\nV -> eat\n'
>>> s.split()        # don't want this as the result
['S', '->', 'NP', 'VP', '#', 'comment:', 'S', 'is', 'the', 'start', 'symbol', 'V', '->', 'eat']
>>> s.split('\n')    # each rule is in separate string
['', 'S -> NP VP # comment: S is the start symbol', '', 'V -> eat', '']

Read list of rules on RHS
  • When there is a disjunction of rules in a line, need to get list of strings for each rule

>>> s = 'a | a NONTERM3'
>>> s.split('|')
['a ', ' a NONTERM3']
>>> for x in s.split('|'):     # split again to get
...     print(x.split())       # rid of whitespace
...
['a']
['a', 'NONTERM3']

Parse a CFG string

def read_grammar(cfg_str):
    cfg = {}
    for ln in cfg_str.split('\n'):
        comment_idx = ln.find('#')        # ignore comments
        if comment_idx != -1:
            ln = ln[:comment_idx]
        ln = ln.strip()                   # drop surrounding whitespace
        if ln == '':                      # ignore empty lines
            continue
        arrow_idx = ln.index('->')        # split the line
        lhs = ln[:arrow_idx].strip()      # into lhs and rhs; strip so the key is 'S', not 'S '
        rhs = ln[arrow_idx + 2:]
        rhs_rules = cfg.get(lhs, [])      # assign rules to lhs
        for symbols in rhs.split('|'):    # disjunction of rhs
            rhs_rules.append(symbols.split())
        cfg[lhs] = rhs_rules
    return cfg

Print a CFG

def print_cfg(cfg):
    for (lhs, rhs_rules) in cfg.items():
        for symbols in rhs_rules:
            s = '{} -> {}'.format(lhs, ' '.join(symbols))
            print(s)

Testing code

toy_cfg = read_grammar(toy_cfg_str)
expr_cfg = read_grammar(expr_cfg_str)

print('toy cfg:')
print_cfg(toy_cfg)

print('\nexprcfg:')
print_cfg(expr_cfg)

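Output
  • Rule order follows dictionary iteration order, so it may differ between Python versions (Python 3.7+ prints rules in the order they were read)

toy cfg:
N -> flight
DT -> a
VP -> V
S -> NP VP
V -> left
V -> arrived
NP -> DT N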

Output

exprcfg:
E -> ( E )
E -> E + E
E -> E - E
E -> E * E
E -> E / E
E -> 0
E -> 1
E -> 2
E -> 3
E -> 4
E -> 5
E -> 6
E -> 7
E -> 8
E -> 9

Show the nonterminals

def get_nonterminals(cfg):
    return set(cfg.keys())

toy_cfg_nonterms = get_nonterminals(toy_cfg)
expr_cfg_nonterms = get_nonterminals(expr_cfg)

print('\ntoy_cfg_nonterms:', toy_cfg_nonterms)
print('expr_cfg_nonterms:', expr_cfg_nonterms)

# Output (Python 3; the order of set elements may vary):
# toy_cfg_nonterms: {'N', 'DT', 'VP', 'S', 'V', 'NP'}
# expr_cfg_nonterms: {'E'}
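The complementary set, the terminals, can be collected in a similar way. The helper below is a hypothetical addition (it is not part of the lecture code) and assumes that any RHS symbol that never appears as a dictionary key is a terminal. The grammar is written out by hand in the dictionary format so that the block runs on its own.

```python
def get_terminals(cfg):
    """Return the set of RHS symbols that never appear as an LHS."""
    nonterms = set(cfg.keys())
    terminals = set()
    for rhs_rules in cfg.values():
        for symbols in rhs_rules:
            for sym in symbols:
                if sym not in nonterms:
                    terminals.add(sym)
    return terminals

# The toy grammar in the dictionary representation, typed in directly.
example_cfg = {
    'S': [['NP', 'VP']],
    'NP': [['DT', 'N']],
    'VP': [['V']],
    'DT': [['a']],
    'N': [['flight']],
    'V': [['left'], ['arrived']],
}
print(sorted(get_terminals(example_cfg)))   # ['a', 'arrived', 'flight', 'left']
```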

Generate a random string from a CFG
  • Procedure:
    • For a given nonterminal, choose a rule at random
    • For each symbol in the RHS of that rule:
      • If it is a terminal symbol, add to string
      • If it is a nonterminal, recursively generate a string from that nonterminal
  • Generate a string from the grammar through the start symbol
Example of recursive string generation
  • Generate a string from this CFG

A -> x Y z
Y -> c

  • Begin with start symbol A
  • Look up rule for nonterminal: A -> x Y z
  • Generate items on right-hand side of rule
    • Generate terminal: x
    • Recursively generate string for nonterminal Y
      • Look up rule for nonterminal: Y -> c
      • Generate items on right-hand side of rule
        • Generate terminal: c
    • Generate terminal: z
  • Final string generated: x c z
import random

def generate_string(cfg, lhs, nonterms):
    s = ''
    nonterm_rules = cfg[lhs]
    # randomly choose a rule
    r_idx = random.randint(0, len(nonterm_rules) - 1)
    rule = nonterm_rules[r_idx]
    for sym in rule:             # step through symbols in rule
        if sym in nonterms:      # recursive case
            s += generate_string(cfg, sym, nonterms) + ' '
        else:                    # terminal symbol, concatenate to string
            s += sym + ' '
    return s[:-1]                # remove last ' '
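Random generation recurses for as long as recursive rules keep being chosen, so a grammar whose rules always recurse would eventually raise a RecursionError. The variant below is a sketch of one possible safeguard, not part of the lecture code: the name `generate_string_bounded` and the parameter `max_depth` are assumptions, and it treats every dictionary key as a nonterminal.

```python
import random

def generate_string_bounded(cfg, lhs, max_depth=20):
    """Generate a random string from lhs, or return None if the
    derivation nests more than max_depth rules deep."""
    if max_depth == 0:
        return None
    rule = random.choice(cfg[lhs])            # randomly choose a rule
    words = []
    for sym in rule:
        if sym in cfg:                        # nonterminal: recursive case
            sub = generate_string_bounded(cfg, sym, max_depth - 1)
            if sub is None:                   # depth budget exhausted below
                return None
            words.append(sub)
        else:                                 # terminal symbol
            words.append(sym)
    return ' '.join(words)

# Hypothetical grammar whose only rule is recursive: generation never finishes.
looping_cfg = {'X': [['a', 'X']]}
print(generate_string_bounded(looping_cfg, 'X'))   # None
```

A caller can simply retry when the result is None.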

Testing code

print('\nRandom strings generated by toy cfg:')
for i in range(10):
    print(generate_string(toy_cfg, 'S', toy_cfg_nonterms))

print('\nRandom strings generated by exprcfg:')
for i in range(10):
    print(generate_string(expr_cfg, 'E', expr_cfg_nonterms))

Output

Random strings generated by toy cfg:
a flight left
a flight left
a flight arrived
a flight left
a flight left
a flight arrived
a flight arrived
a flight arrived
a flight arrived
a flight left

Output

Random strings generated by exprcfg:
( 6 ) - 5 * 5
5
9
5 / 5
1
2 - ( 0 - 0 + 3 ) + 1
4
3 + 9 / 0 / 5 / 0 + 7
3
2

Generate some long expressions

print('\nLong strings generated by exprcfg:')
for i in range(1000):
    s = generate_string(expr_cfg, 'E', expr_cfg_nonterms)
    if len(s) > 50:
        print(s)

Output:

Long strings generated by exprcfg:
3 * 9 * 6 / ( 0 + 6 - ( 9 / 8 - 2 ) + 3 / 1 - 4 ) * 4
4 + 5 * 5 + 1 / 5 - ( 1 ) + ( 4 ) / ( 9 ) + 6 + 2 - 1

Generate random trees
  • Let’s modify our tree representation to allow an arbitrary number of children:

(value, list-of-children)

  • Parent node: (nonterminal, list-of-child-nodes)
  • Leaf node: (terminal, [])
Example
  • A is the parent of b and C and d. C is the parent of e.

('A', [('b', []),
       ('C', [('e', [])]),
       ('d', [])])

  • Parent node: (nonterminal, list-of-child-nodes)
  • Leaf node: (terminal, [])
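To show how this representation is traversed, the sketch below collects the leaves of a tree from left to right (its "yield"). The function name `tree_yield` is an assumption, and the code treats any node with an empty child list as a leaf.

```python
def tree_yield(node):
    """Return the list of leaf values (terminals) of a tree, left to right."""
    value, children = node
    if not children:                 # leaf node: (terminal, [])
        return [value]
    leaves = []
    for child in children:           # parent node: visit each child in order
        leaves.extend(tree_yield(child))
    return leaves

tree = ('A', [('b', []),
              ('C', [('e', [])]),
              ('d', [])])
print(tree_yield(tree))              # ['b', 'e', 'd']
```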
import random

def generate_tree(cfg, lhs, nonterms):
    # randomly choose a rule
    nonterm_rules = cfg[lhs]
    r_idx = random.randint(0, len(nonterm_rules) - 1)
    rule = nonterm_rules[r_idx]
    children = []
    for sym in rule:
        if sym in nonterms:      # recursive case
            ch_node = generate_tree(cfg, sym, nonterms)
            children.append(ch_node)
        else:                    # base case: leaf node
            children.append((sym, []))
    parent = (lhs, children)
    return parent

Print out a tree

def pretty_print(node):
    pass    # this is your homework problem

Testing code

toy_tree = generate_tree(toy_cfg, 'S', toy_cfg_nonterms)
pretty_print(toy_tree)

expr_tree = generate_tree(expr_cfg, 'E', expr_cfg_nonterms)
pretty_print(expr_tree)

Output

S
  NP
    DT
      a
    N
      flight
  VP
    V
      arrived

Output

E
  E
    E
      0
    +
    E
      E
        4
      *
      E
        1
  +
  E
    E
      8
    /
    E
      7

Next time: parsing
  • Given a string (sentence) generated by an arbitrary CFG, determine its parse tree
    • Or parse trees, if the string (sentence) is ambiguous
Due 8/17
  • Download short assignment from course web page
  • #1: i, l, m, q
  • #2: a, b, c, g
  • #3: a, b, e, f