



LING 408/508: Computational Techniques for Linguists

Lecture 23

10/15/2012



Outline

  • Formal language theory: strings and languages

  • Coding CFGs

  • Generating strings and trees

  • Short assignment #15



  • Note: the following is mathematical notation. It is not Python code.



Alphabet

  • Let Σ be a finite alphabet.

    • The elements of Σ may be multi-character symbols

  • Examples:

  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

  • LETTERS = { A, B, C, …, X, Y, Z }

  • ENGLISH_WORDS = { a, aa, …, zyzzogeton, zyzzyva }



Strings

  • A string is a sequence of zero or more symbols taken from an alphabet.

  • Example:

  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

  • s1 = 0123456789

  • s2 = 321654



Empty string

  • The empty string, ε, consists of zero symbols

  • Pronounced as “epsilon”

  • Example:

  • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

  • s1 = 0123456789

  • s2 = 321654

  • s3 = ε



    Concatenation

    • A string is produced by concatenation of symbols from an alphabet. The concatenation of symbols is written by placing two symbols immediately next to each other.

    • Strings can also be concatenated.

    • Example:

    • DIGITS = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

    • 1 ∈ DIGITS and 2 ∈ DIGITS. Concatenate 1 and 2 to produce the string 12

    • Concatenate 123 and 456 to produce the string 123456


    Repetition of a string

    • Use an “exponent” with an integer value >= 0 to indicate the number of times a symbol is repeated in a string

    • Zero repetitions is equivalent to ε

    • Examples:

    • a^5 = aaaaa

    • a^1 = a

    • a^0 = ε


    Repetition: uses parentheses for grouping

    • ab^2 = abb

    • (ab)^2 = abab

    • abc(def)^2 = abcdefdef

    • abc(def)^1 = abcdef

    • abc(def)^0 = abcε = abc
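The notation above is mathematical, not Python, but Python's string operators happen to mirror it closely. The following is only a rough sketch: the helper name `repeat` and the constant `EPSILON` are hypothetical, and it assumes every symbol is a single character, so that a Python `str` can stand in for a string over the alphabet.

```python
EPSILON = ''   # assumption: the empty Python string stands in for the empty string ε


def repeat(s, n):
    """Return s concatenated with itself n times (n >= 0)."""
    if n < 0:
        raise ValueError('the exponent must be >= 0')
    return s * n


# Concatenation is +; the "exponent" applies only to what it is grouped with.
print('a' + repeat('b', 2))                # exponent on b only:   abb
print(repeat('ab', 2))                     # exponent on (ab):     abab
print('abc' + repeat('def', 2))            # exponent on (def):    abcdefdef
print('abc' + repeat('def', 0) == 'abc')   # zero repetitions add nothing: True
```

If the alphabet had multi-character symbols, a tuple or list of symbols would be a safer stand-in than a `str`.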


    Repetition of alphabet

    • An “exponent” on an alphabet indicates a set of strings, such that each string consists of any symbols from that alphabet, and is of the length specified

    • Zero repetitions is equivalent to { ε }

    • Example:

    • Σ = { 1, 2, 3 }

    • Σ^0 = { ε }

    • Σ^1 = { 1, 2, 3 }

    • Σ^2 = { 11, 12, 13, 21, 22, 23, 31, 32, 33 }


    Closure of an alphabet

    • An “exponent” on an alphabet indicates a set of strings, such that each string consists of any symbols from that alphabet, and is of the length specified

    • Zero repetitions is equivalent to { ε }

    • Closure: * indicates zero or more symbols

      Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ …

    • Example:

    • Σ = { 1, 2, 3 }

    • Σ^0 = { ε }

    • Σ^1 = { 1, 2, 3 }

    • Σ^2 = { 11, 12, 13, 21, 22, 23, 31, 32, 33 }

    • Σ* = { ε } ∪ { 1, 2, 3 } ∪ { 11, 12, 13, 21, 22, 23, 31, 32, 33 } ∪ …

      = { ε, 1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33, … }
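The sets of fixed-length strings over an alphabet, and their union into the closure, can be enumerated mechanically. This is a minimal sketch, assuming single-character symbols; the function names `sigma_power` and `sigma_star` are hypothetical. Because the closure is infinite, it is produced lazily by a generator and only a finite prefix is ever materialized.

```python
from itertools import count, islice, product


def sigma_power(alphabet, n):
    """Return all strings of exactly n symbols over the alphabet, as a sorted list."""
    return [''.join(p) for p in product(sorted(alphabet), repeat=n)]


def sigma_star(alphabet):
    """Lazily yield the closure in order of increasing length (never terminates)."""
    for n in count(0):
        for s in sigma_power(alphabet, n):
            yield s


SIGMA = {'1', '2', '3'}
print(sigma_power(SIGMA, 0))   # [''], i.e. the set containing only the empty string
print(sigma_power(SIGMA, 2))   # ['11', '12', '13', '21', '22', '23', '31', '32', '33']
print(list(islice(sigma_star(SIGMA), 13)))   # lengths 0, 1, and 2, in that order
```

`islice` takes the first 13 strings, which is exactly 1 + 3 + 9: the strings of length 0, 1, and 2.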


    I’m not talking about regular expressions

    • Finite repetition such as a^5 is not available within the basic regular expression syntax

      • In a regular expression, write aaaaa instead

    • Closure on a symbol, such as a* in a regular expression, is not available within formal language theory

      • By “a*”, one probably means { ε, a, aa, aaa, … }, but there is another way to specify this in formal language theory: for example, Σ* with Σ = { a }



    Languages

    • A language is a set of strings

      • May be finite or infinite

    • Examples:

    • L1 = { apples, bananas, pears }

    • L2 = { 0, 1, 2, 3, 4, … }

    • L3 = { ε, a, b, aa, ab, ba, bb, aaa, aab, aba, … }



    “…” is imprecise

    • Finite sets are easy to describe because you can list the elements.

    • With infinite languages, the use of “…” is not precise. We need better mathematical notation to describe these sets.

    • Example:

    • L4 = { 1, 2, 3, … }

    • L4 could be { 1, 2, 3, 4, 5, 6, … }, or it could be { 1, 2, 3, 11, 12, 13, … }, or many other sets


    Predicative definitions of languages

    • Describe a set through a pattern and remarks on the properties of that pattern

      • Read “|” as “such that”

    • Examples:

    • { x | x is a positive integer } = { 1, 2, 3, 4, 5, … }

    • { a^n | n is an integer >= 0 } = { ε, a, aa, aaa, … }

    • { a^n | n is an integer and 1 <= n <= 5 } = { a, aa, aaa, aaaa, aaaaa }

    • { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }

    • { a^n b^m | n >= 0 and m > n } = { b, bb, …, abb, abbb, …, aabbb, aabbbb, aabbbbb, …, aaabbbb, aaabbbbb, … }
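A predicative definition translates naturally into a membership test: a function that answers whether a given string has the stated property. The sketch below covers the last two example languages; the function names `in_anbn` and `in_anbm` are hypothetical, and the symbols are assumed to be the single characters `a` and `b`.

```python
def in_anbn(s):
    """True if s consists of n a's followed by exactly n b's, for some n >= 0."""
    n = len(s) // 2
    return s == 'a' * n + 'b' * n


def in_anbm(s):
    """True if s consists of n a's followed by m b's, with n >= 0 and m > n."""
    n = len(s) - len(s.lstrip('a'))   # number of leading a's
    m = len(s) - n                    # everything after them must be b's
    return m > n and s == 'a' * n + 'b' * m


print([w for w in ['', 'ab', 'aabb', 'aab', 'ba'] if in_anbn(w)])
print([w for w in ['b', 'abb', 'ab', 'aabbb', 'abab'] if in_anbm(w)])
```

Such a test can check any candidate string, but it does not by itself enumerate the language.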


    Computational descriptions of languages

    • A predicative definition of a language tells us about the properties of the strings in the language, but it does not give us a procedure by which the language can be generated

    • Alternatives:

      • Inductive definitions

      • Grammar-based definitions


    Inductive definitions

    • State base case(s). At least one base case.

    • State inductive case(s). Zero or more.

    • Example:

    • L = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }

    • Base case: ε ∈ L

    • Inductive case: if x ∈ L, then a x b ∈ L


    Work through example

    • Base case: ε ∈ L

    • Inductive case: if x ∈ L, then a x b ∈ L

    • Begin with the empty set: L = { }

    • From the base case, ε ∈ L. So we place ε in L: L = { ε }

    • ε is a string in L. From the inductive case, a ε b = ab ∈ L: L = { ε, ab }

    • ab is a string in L. From the inductive case, a ab b = aabb ∈ L: L = { ε, ab, aabb }

    • etc.

    • This inductive definition generates

      L = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }
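The worked example is already a procedure, so it can be run directly. This is a minimal sketch with a hypothetical function name, `inductive_language`, that starts from the base case and applies the inductive case a fixed number of times; the empty Python string stands in for ε.

```python
def inductive_language(steps):
    """Return the strings obtained from the base case plus `steps` inductive steps."""
    x = ''                    # base case: the empty string is in L
    language = [x]
    for _ in range(steps):
        x = 'a' + x + 'b'     # inductive case: if x is in L, so is a x b
        language.append(x)
    return language


print(inductive_language(3))   # ['', 'ab', 'aabb', 'aaabbb']
```

Because this language has a single inductive case, each step adds exactly one new string; a definition with several inductive cases would need to apply every case to every string found so far.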



    Grammar-based definitions of languages

    • A grammar is a formalism (with rules for computation) that generates a set of strings

    • Examples:

      • Regular expression

      • Finite-state automaton

      • Context-free grammar

      • Pushdown automaton

      • etc.


    Differences in generative capacity of grammars

    • Some languages can(not) be generated by certain types of grammars

    • Examples:

      L1 = { a^n | n >= 0 } = { ε, a, aa, aaa, … }

      • Can be generated by regular expression

      • Can be generated by context-free grammar

      L2 = { a^n b^n | n >= 0 } = { ε, ab, aabb, aaabbb, … }

      • Cannot be generated by regular expression

      • Can be generated by context-free grammar

      L3 = { a^n b^n c^n | n >= 0 } = { ε, abc, aabbcc, aaabbbccc, … }

      • Cannot be generated by regular expression

      • Cannot be generated by context-free grammar



    Outline

    • Formal language theory: strings and languages

    • Coding CFGs

    • Generating strings and trees

    • Short assignment #15


    Write a CFG in a multi-line string

    toy_cfg_str = """
    S -> NP VP
    NP -> DT N
    VP -> V
    DT -> a
    N -> flight
    V -> left
    V -> arrived
    """


    Arithmetic expression grammar

    expr_cfg_str = """
    E -> ( E )
    E -> E + E
    E -> E - E
    E -> E * E
    E -> E / E
    E -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    """



    General format of a CFG

    • Example:

      NONTERM1 -> a b

      NONTERM2 -> c

      NONTERM2 -> d

      NONTERM3 -> a | a NONTERM3

    • LHS (left-hand side) of a rule is a nonterminal

    • RHS (right-hand side) is the sequence of symbols that the LHS is rewritten as

      • May be multiple rules for a single nonterminal (NONTERM2)

      • Multiple rules may be written as a disjunction (NONTERM3)


    Represent CFG as a dictionary

    • Map each (LHS) nonterminal to a list of lists of strings

      • List of strings: symbols (nonterminals and terminals) on RHS of a particular rule

      • List of lists of strings: there may be multiple rules for a single nonterminal



    Example

    • Grammar:

      NONTERM1 -> a b

      NONTERM2 -> c

      NONTERM2 -> d

      NONTERM3 -> a | a NONTERM3

    • Represent as dictionary:

      cfg = {}

      cfg['NONTERM1'] = [['a', 'b']]

      cfg['NONTERM2'] = [['c'], ['d']]

      cfg['NONTERM3'] = [['a'], ['a', 'NONTERM3']]



    Parse a CFG string

    • Want to read string for CFG, return dictionary representation of the grammar

    • Parsing code will also allow for:

      • Empty lines in the grammar

      • Comments

    • Apply split to RHS of a rule to get disjunction of rules


    Splitting multi-line strings: code will allow for empty lines in the grammar

    >>> s = """
    ... S -> NP VP
    ...
    ... V -> eat
    ... """
    >>> s
    '\nS -> NP VP\n\nV -> eat\n'
    >>> s.split()       # don't want this as the result
    ['S', '->', 'NP', 'VP', 'V', '->', 'eat']
    >>> s.split('\n')   # each rule is in a separate string
    ['', 'S -> NP VP', '', 'V -> eat', '']


    Read list of rules on RHS

    • When there is a disjunction of rules in a line, need to get list of strings for each rule

      >>> s = 'a | a NONTERM3'
      >>> s.split('|')
      ['a ', ' a NONTERM3']
      >>> for x in s.split('|'):   # split again to get
      ...     print(x.split())     # rid of whitespace
      ...
      ['a']
      ['a', 'NONTERM3']


    Parse a CFG string

    def read_grammar(cfg_str):
        cfg = {}
        for ln in cfg_str.split('\n'):
            comment_idx = ln.find('#')         # ignore comments
            if comment_idx != -1:
                ln = ln[:comment_idx]
            ln = ln.strip()                    # drop surrounding whitespace
            if ln == '':                       # ignore empty lines
                continue
            arrow_idx = ln.index('->')         # split the line
            lhs = ln[:arrow_idx].strip()       # into lhs and rhs
            rhs = ln[arrow_idx + 2:]
            rhs_rules = cfg.get(lhs, [])       # assign rules to lhs
            for symbols in rhs.split('|'):     # disjunction of rhs
                rhs_rules.append(symbols.split())
            cfg[lhs] = rhs_rules
        return cfg


    Print a CFG

    def print_cfg(cfg):
        for (lhs, rhs_rules) in cfg.items():
            for symbols in rhs_rules:
                s = '{} -> {}'.format(lhs, ' '.join(symbols))
                print(s)



    Testing code

    toy_cfg = read_grammar(toy_cfg_str)

    expr_cfg = read_grammar(expr_cfg_str)

    print('toy cfg:')

    print_cfg(toy_cfg)

    print('\nexprcfg:')

    print_cfg(expr_cfg)



    Output

    toy cfg:

    N -> flight

    DT -> a

    VP -> V

    S -> NP VP

    V -> left

    V -> arrived

    NP -> DT N



    Output

    exprcfg:

    E -> ( E )

    E -> E + E

    E -> E - E

    E -> E * E

    E -> E / E

    E -> 0

    E -> 1

    E -> 2

    E -> 3

    E -> 4

    E -> 5

    E -> 6

    E -> 7

    E -> 8

    E -> 9


    Show the nonterminals

    def get_nonterminals(cfg):
        return set(cfg.keys())

    toy_cfg_nonterms = get_nonterminals(toy_cfg)
    expr_cfg_nonterms = get_nonterminals(expr_cfg)
    print('\ntoy_cfg_nonterms:', toy_cfg_nonterms)
    print('expr_cfg_nonterms:', expr_cfg_nonterms)

    # Output (Python 3; the order of set elements may vary):
    # toy_cfg_nonterms: {'N', 'DT', 'VP', 'S', 'V', 'NP'}
    # expr_cfg_nonterms: {'E'}
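The terminals can be recovered from the same dictionary: they are the right-hand-side symbols that never appear as a left-hand side. This is a small sketch of that idea; the function name `get_terminals` is hypothetical, and the toy grammar is written out as a literal dictionary so that the example runs on its own.

```python
def get_terminals(cfg):
    """Return every RHS symbol that is not itself a key (nonterminal) of cfg."""
    nonterms = set(cfg)
    return {sym
            for rhs_rules in cfg.values()
            for rule in rhs_rules
            for sym in rule
            if sym not in nonterms}


# The toy grammar in its dictionary representation.
toy_cfg = {
    'S': [['NP', 'VP']],
    'NP': [['DT', 'N']],
    'VP': [['V']],
    'DT': [['a']],
    'N': [['flight']],
    'V': [['left'], ['arrived']],
}
print(sorted(get_terminals(toy_cfg)))   # ['a', 'arrived', 'flight', 'left']
```

Note that a symbol used on a right-hand side but never defined by a rule would also be reported as a terminal, so this doubles as a quick check for misspelled nonterminals.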



    Outline

    • Formal language theory: strings and languages

    • Coding CFGs

    • Generating strings and trees

    • Short assignment #15



    Generate a random string from a CFG

    • Procedure:

      • For a given nonterminal, choose a rule at random

      • For each symbol in the RHS of that rule:

        • If it is a terminal symbol, add to string

        • If it is a nonterminal, recursively generate a string from that nonterminal

    • Generate a string from the grammar through the start symbol



    Example of recursive string generation

    • Generate a string from this CFG

      A -> x Y z

      Y -> c

    • Begin with start symbol A

    • Look up rule for nonterminal: A -> x Y z

    • Generate items on right-hand side of rule

      • Generate terminal: x

      • Recursively generate string for nonterminal Y

        • Look up rule for nonterminal: Y -> c

        • Generate items on right-hand side of rule

          • Generate terminal: c

      • Generate terminal: z

    • Final string generated: x c z
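The trace above can be reproduced by a function that prints each step as it recurses. This is only an illustrative sketch: the name `trace_generate` is hypothetical, and it always takes the first rule for a nonterminal instead of choosing one at random, which is enough for this grammar because every nonterminal has exactly one rule.

```python
def trace_generate(cfg, lhs, depth=0):
    """Expand lhs with its first rule, printing each step; return the terminals."""
    rule = cfg[lhs][0]            # assumption: no random choice, first rule only
    indent = '  ' * depth
    print('{}Look up rule for nonterminal: {} -> {}'.format(indent, lhs, ' '.join(rule)))
    generated = []
    for sym in rule:
        if sym in cfg:            # nonterminal: recurse one level deeper
            generated.extend(trace_generate(cfg, sym, depth + 1))
        else:                     # terminal: add it to the output
            print('{}  Generate terminal: {}'.format(indent, sym))
            generated.append(sym)
    return generated


cfg = {'A': [['x', 'Y', 'z']], 'Y': [['c']]}
print('Final string generated:', ' '.join(trace_generate(cfg, 'A')))   # x c z
```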


    import random

    def generate_string(cfg, lhs, nonterms):
        s = ''
        nonterm_rules = cfg[lhs]
        # randomly choose a rule
        r_idx = random.randint(0, len(nonterm_rules) - 1)
        rule = nonterm_rules[r_idx]
        for sym in rule:           # step through symbols in rule
            if sym in nonterms:    # recursive case
                s += generate_string(cfg, sym, nonterms) + ' '
            else:                  # terminal symbol, concatenate to string
                s += sym + ' '
        return s[:-1]              # remove last ' '


    Testing code

    print('\nRandom strings generated by toy cfg:')
    for i in range(10):
        print(generate_string(toy_cfg, 'S', toy_cfg_nonterms))

    print('\nRandom strings generated by exprcfg:')
    for i in range(10):
        print(generate_string(expr_cfg, 'E', expr_cfg_nonterms))


    Output

    Random strings generated by toy cfg:

    a flight left

    a flight left

    a flight arrived

    a flight left

    a flight left

    a flight arrived

    a flight arrived

    a flight arrived

    a flight arrived

    a flight left


    Output

    Random strings generated by exprcfg:

    ( 6 ) - 5 * 5

    5

    9

    5 / 5

    1

    2 - ( 0 - 0 + 3 ) + 1

    4

    3 + 9 / 0 / 5 / 0 + 7

    3

    2


    Generate some long expressions

    print('\nLong strings generated by exprcfg:')
    for i in range(1000):
        s = generate_string(expr_cfg, 'E', expr_cfg_nonterms)
        if len(s) > 50:
            print(s)

    Output:

    Long strings generated by exprcfg:
    3 * 9 * 6 / ( 0 + 6 - ( 9 / 8 - 2 ) + 3 / 1 - 4 ) * 4
    4 + 5 * 5 + 1 / 5 - ( 1 ) + ( 4 ) / ( 9 ) + 6 + 2 - 1
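Nothing in the random generator limits how deep the recursion goes, so the length of an expression is left to chance. One possible variation, which is not part of the lecture code, is to cap the depth: once the budget is used up, only rules without nonterminals are eligible. The function name `generate_bounded` and the `max_depth` parameter are hypothetical, and the expression grammar is written out as a literal dictionary so that the sketch is self-contained.

```python
import random


def generate_bounded(cfg, lhs, max_depth):
    """Generate a random string from lhs with at most max_depth nested expansions."""
    rules = cfg[lhs]
    if max_depth <= 0:
        # Out of budget: prefer rules whose RHS contains terminals only.
        flat = [r for r in rules if not any(sym in cfg for sym in r)]
        rules = flat or rules     # fall back if the nonterminal has no such rule
    rule = random.choice(rules)
    parts = []
    for sym in rule:
        if sym in cfg:            # nonterminal: recurse with a smaller budget
            parts.append(generate_bounded(cfg, sym, max_depth - 1))
        else:                     # terminal symbol
            parts.append(sym)
    return ' '.join(parts)


expr_cfg = {'E': [['(', 'E', ')'], ['E', '+', 'E'], ['E', '-', 'E'],
                  ['E', '*', 'E'], ['E', '/', 'E']] + [[d] for d in '0123456789']}
for _ in range(5):
    print(generate_bounded(expr_cfg, 'E', 3))
```

The fallback keeps the function usable for grammars such as the toy grammar, where some nonterminals have no terminal-only rule; for a grammar whose recursion can never bottom out, this sketch would still not terminate.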



    Generate random trees

    • Let’s modify our tree representation to allow an arbitrary number of children:

      (value, list-of-children)

    • Parent node: (nonterminal, list-of-child-nodes)

    • Leaf node: (terminal, [])


    Example

    • A is the parent of b and C and d. C is the parent of e.

      ('A', [('b', []),
             ('C', [('e', [])]),
             ('d', [])])

    • Parent node: (nonterminal, list-of-child-nodes)

    • Leaf node: (terminal, [])
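To see how this representation is traversed, here is a short sketch that collects the leaves of a tree from left to right, that is, the string the tree derives. The function name `tree_yield` is hypothetical, and the sketch assumes that any node with an empty child list is a terminal leaf.

```python
def tree_yield(node):
    """Return the terminals at the leaves of a (value, children) tree, left to right."""
    value, children = node
    if not children:              # leaf node: (terminal, [])
        return [value]
    leaves = []
    for child in children:        # parent node: visit the children in order
        leaves.extend(tree_yield(child))
    return leaves


tree = ('A', [('b', []),
              ('C', [('e', [])]),
              ('d', [])])
print(tree_yield(tree))   # ['b', 'e', 'd']
```

The same two-case recursion (leaf versus parent) is the pattern any function over these trees follows.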


    import random

    def generate_tree(cfg, lhs, nonterms):
        # randomly choose a rule
        nonterm_rules = cfg[lhs]
        r_idx = random.randint(0, len(nonterm_rules) - 1)
        rule = nonterm_rules[r_idx]
        children = []
        for sym in rule:
            if sym in nonterms:    # recursive case
                ch_node = generate_tree(cfg, sym, nonterms)
                children.append(ch_node)
            else:                  # base case: leaf node
                children.append((sym, []))
        parent = (lhs, children)
        return parent


    Print out a tree

    def pretty_print(node):
        pass
        # this is your homework problem



    Testing code

    toy_tree = generate_tree(toy_cfg, 'S', toy_cfg_nonterms)

    pretty_print(toy_tree)

    expr_tree = generate_tree(expr_cfg, 'E',expr_cfg_nonterms)

    pretty_print(expr_tree)


    Output

    S
      NP
        DT
          a
        N
          flight
      VP
        V
          arrived


    Output

    E
      E
        E
          0
        +
        E
          E
            4
          *
          E
            1
      +
      E
        E
          8
        /
        E
          7



    Next time: parsing

    • Given a string (sentence) generated by an arbitrary CFG, determine its parse tree

      • Or parse trees, if the string (sentence) is ambiguous



    Outline

    • Formal language theory: strings and languages

    • Coding CFGs

    • Generating strings and trees

    • Short assignment #15



    Due 8/17

    • Download short assignment from course web page

    • #1: i, l, m, q

    • #2: a, b, c, g

    • #3: a, b, e, f

