
Advanced Topics in NLP



  1. Advanced Topics in NLP: Finite State Machinery and the Xerox Tools

  2. Finite State Methods
  • Many Domains of Application:
  • Tokenization
  • Sentence breaking
  • Spelling correction
  • Morphology (analysis/generation)
  • Phonological disambiguation (Speech Recognition)
  • Morphological disambiguation (“Tagging”)
  • Pattern matching (“Named Entity Recognition”)
  • Shallow Parsing

  3. The Xerox Approach
  • Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
  • Meta-languages for describing regular languages and regular relations.
  • Compiler for mapping meta-language "programs" into efficient FS machinery.
  • Several tools and applications.

  4. xerox tools
  • xfst: Xerox Finite-State Tool
  • lexc: Finite-State Lexicon Compiler
  • twolc: Two-Level Rule Compiler

  5. xerox tools
  • All of these applications are built around a central library, now written in C, called c-fsm.
  • The library defines the data structures, provides the input/output routines, and implements the fundamental operations on finite-state networks.
  • All based on long-term Xerox research, originated by Ronald M. Kaplan and Martin Kay at PARC in the early 1980s.

  6. Textbook
  • CSLI Publications, Studies in Computational Linguistics series.
  • See also the www.fsmbook.com website.

  7. xfst
  • xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers.
  • xfst and the other Xerox tools employ a special "xfst notation" (more powerful than the regular-expression notation used in Unix tools, Perl, C#, etc.).
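  For orientation, a minimal interactive session might look like the sketch below (assuming the Xerox tools are installed and xfst is on the PATH; the bracketed number in the prompt is the depth of the stack, introduced on later slides):
  xfst[0]: read regex [c a t | d o g];
  xfst[1]: print words
  Here print words should list just the two words cat and dog.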

  8. Simple Regular Expressions
  • Atomic Expressions
  • Complex Expressions

  9. Atomic Expressions
  • The simplest kind of RE is a symbol. Typically, a symbol is the sort of item that can appear on the arc of a network.
  • For example, the symbol a is an RE that designates the language containing the string "a" and nothing else.
  • Multicharacter symbols such as Plur are also symbols; they happen to have multicharacter print names.
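  As a quick check, a sketch of an xfst session (output not reproduced here):
  xfst[0]: read regex a;
  xfst[1]: print words
  xfst[1]: pop stack
  xfst[0]: read regex Plur;
  xfst[1]: print words
  The first network accepts only the string "a"; the second accepts only the single multicharacter symbol Plur.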

  10. Special Atomic Expressions
  • The epsilon symbol, written 0, denotes the empty string language {""}.
  • The ANY symbol ? denotes the language of all single-symbol strings.
  • The empty string is not included in ?.
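  A small sketch of how these look as networks (assuming standard xfst behaviour):
  xfst[0]: read regex 0;
  xfst[1]: print net
  xfst[1]: pop stack
  xfst[0]: read regex ?;
  xfst[1]: print net
  The first network should have a single final state and no arcs (it accepts only the empty string); the second should have a single arc labelled with the ANY symbol, accepting every one-symbol string but not the empty string.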

  11. Complex REs: Union
  • If A and B are arbitrary REs, [A | B] is the union of A and B, which denotes the union of the languages denoted by A and B respectively.
  • If A is an arbitrarily complex RE, [A] is equivalent to A.
  • Checkpoint: Write down the strings in the language denoted by [a | b | ab].
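  The checkpoint can be verified mechanically (a sketch):
  xfst[0]: read regex [a | b | ab];
  xfst[1]: print words
  The language contains exactly the strings "a", "b" and "ab" (in xfst, ab written without a space is a single multicharacter symbol, but the string it spells is still "ab").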

  12. Complex REs: Intersection
  • If A and B are arbitrary REs, [A & B] is the intersection of A and B, which denotes the intersection of the languages denoted by A and B respectively.
  • Checkpoint: Write down the strings in the language denoted by [a | b | c | d | e] & [d | e | f | g].
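  Again, a sketch of checking the checkpoint in xfst:
  xfst[0]: read regex [a | b | c | d | e] & [d | e | f | g];
  xfst[1]: print words
  Only d and e belong to both operands, so they are the only words in the intersection.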

  13. Complex REs: Concatenation
  • If A and B are arbitrary REs, [A B] is the concatenation of A and B.
  • Checkpoint: note the difference between
  • [d o g]
  • dog
  • [d og]
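  A sketch that makes the difference visible by putting the three networks on the stack:
  xfst[0]: read regex [d o g];
  xfst[1]: read regex dog;
  xfst[2]: read regex [d og];
  xfst[3]: print stack
  All three accept the single string "dog", but the first is built from the three single-character symbols d, o, g, the second from the one multicharacter symbol dog, and the third from d followed by the multicharacter symbol og, so the networks differ in their symbol alphabets and arc counts.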

  14. Concatenation over Regular Expressions and Languages
  Regular expressions:  E1 = [a|b]    E2 = [c|d]    E1 E2 = [a|b] [c|d]
  Languages:            L1 = {"a", "b"}    L2 = {"c", "d"}    L1 L2 = {"ac", "ad", "bc", "bd"}
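  The same concatenation, checked in xfst (a sketch):
  xfst[0]: read regex [a | b] [c | d];
  xfst[1]: print words
  This should list exactly the four words ac, ad, bc and bd.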

  15. Concatenation over FS Automata
  [Figure: two small automata, one with arcs a and b and one with arcs c and d, joined by concatenation into a single network.]

  16. Complex REs: Closures
  • A+ denotes the concatenation of A with itself one or more times.
  • A* (Kleene Star) denotes [A+ | 0], i.e. zero or more concatenations of A.
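  The closure languages are infinite and cannot be listed directly, but a sketch like the following shows the idea by intersecting with a length filter:
  xfst[0]: read regex [a+] & [? | ? ?];
  xfst[1]: print words
  a+ contains a, aa, aaa, ...; restricted to strings of length one or two, the result is just a and aa. With [a*] the unrestricted language would additionally contain the empty string.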

  17. Other Operations
  • Minus: [A - B] denotes the set difference of the languages denoted by A and B ([A - B] = [A & ~B]).
  • Checkpoint: What is the language denoted by [dog | cat | elephant] - [elephant | horse | cow]?
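  Checking the checkpoint (a sketch):
  xfst[0]: read regex [dog | cat | elephant] - [elephant | horse | cow];
  xfst[1]: print words
  dog and cat remain: elephant is subtracted, and horse and cow were never in the first language.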

  18. Some Other Conventions
  A*    Closure (Kleene Star)
  (A)   Optional element
  ?     Any symbol
  \b    Any symbol other than b
  ~A    Complement (= [?* - A])
  0     Empty string language
  $A    Contains A (= [?* A ?*])
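  Two of these conventions in combination (a sketch; $[a] means "contains an a somewhere"):
  xfst[0]: read regex [b a d | b i d | b u d] & $[a];
  xfst[1]: print words
  Only bad contains an a, so it is the only word printed; similarly ~[b a d] & [b a d | b i d] would leave just bid.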

  19. Simple Commands
  • In addition to the language there are also commands:
  • define: give a name to an RE
  • print: print information
  • read: read information
  • various stack operations
  • file interaction
  • various command line options

  20. define command
  • define name regexp
  xfst[0]: define foo [d o g] | [c a t];
  xfst[0]: define R1 [a | b | c | d];
  xfst[0]: define R2 [d | e | f | g];
  xfst[0]: define R3 [f | g | h | i | j];

  21. print command
  • print words name - see the words in the language called name
  • print net name - see detailed information about the network name
  xfst[0]: print words foo;
  xfst[0]: define baz R1 & R2;
  xfst[0]: print net baz;

  22. Exercise
  • Compute the words in:
  • R1 minus R2
  • R2 intersect R1
  • Define a network that contains the words "eeny", "meeny", "miny", "mo".
  • Determine how many states there are in each result. (A possible starting point is sketched below.)
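  One possible way to start on the exercise, reusing the definitions from slide 20 (a sketch):
  xfst[0]: read regex R1 - R2;
  xfst[1]: print words
  xfst[1]: print net
  xfst[1]: pop stack
  xfst[0]: read regex [e e n y | m e e n y | m i n y | m o];
  xfst[1]: print words
  xfst[1]: print net
  print net reports the number of states and arcs in each result; for example, R1 - R2 denotes {"a", "b", "c"}.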

  23. Basic Stack Operations
  • read regex: push a network onto the stack
  • print stack: list the items on the stack
  • print net: detailed info on the top stack item
  • pop stack: remove the top item from the stack
  • define name: set name to the value of the top stack item

  24. Stack Operations
  • Normally the stack is loaded with suitable arguments, and then a command is issued that requires N arguments.
  • These are popped from the stack, the operation is performed, and the result is written back onto the stack.
  • For correct results, items should be pushed onto the stack in reverse order.

  25. Stack Demo 1
  xfst[0]: clear stack
  xfst[0]: read regex [d | c | e | b | w];
  xfst[1]: read regex [b | s | h | w];
  xfst[2]: read regex [s | d | c | f | w];
  xfst[3]: print stack
  xfst[3]: intersect net
  xfst[1]: print stack
  xfst[1]: print net
  xfst[1]: print words

  26. Stack Exercise 2
  xfst[0]: clear stack
  xfst[0]: read regex [e d | i n g | s | []];
  xfst[1]: read regex [t a l k | k i c k];
  xfst[2]: print stack
  xfst[2]: print net
  xfst[2]: print words
  xfst[2]: concatenate net
  xfst[1]: print words

  27. lexc
  [Diagram: source file -> lexc -> compiled network]
  • lexc is a high-level programming language and compiler that is well suited for defining NL lexicons.
  • The output is a compiled finite-state network in a format identical to that of the other Xerox tools (xfst, twolc).

  28. lexc source file
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  ! ex0-lex.txt
  LEXICON Root
  dine #;
  dines #;
  dined #;
  line #;
  lines #;
  lined #;
  END
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  29. lexc
  ! ex1-lex.txt
  LEXICON Root
  Noun;
  Verb;

  LEXICON Noun
  line NounSuffix;

  LEXICON Verb
  dine VerbSuffix;
  line VerbSuffix;

  LEXICON NounSuffix
  s #;
  #;

  LEXICON VerbSuffix
  s #;
  d #;
  #;

  30. Running lexc
  lexc> compile-source ex1-lex.txt
  Opening 'ex1-lex.txt'...
  Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
  Building lexicon...Minimizing...Done!
  SOURCE: 6 states, 7 arcs, 6 words
  lexc>

  31. lexc
  • The resulting lexicon contains the same six words.
  • The form lines actually gets constructed twice, once as a verb and once as a noun; after minimization, only one of them remains.
  • The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures into a single network, which is determinized and minimized.
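  To see that the compiled lexicon is an ordinary finite-state network, the same six-word language can be written directly as a regular expression (a sketch; xfst normally keeps compiled networks determinized and minimized):
  xfst[0]: read regex [d i n e | l i n e] [s | d | 0];
  xfst[1]: print words
  xfst[1]: print net
  print net should again report 6 states, 7 arcs and 6 words: the same minimal network that lexc produced.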

  32. Resulting FSA
  [Figure: the minimized six-state network, with initial arcs d and l, a shared i-n-e path, and final arcs s and d for the optional suffixes.]
