
CSA405: Advanced Topics in NLP


jfair


Presentation Transcript


  1. CSA405: Advanced Topics in NLP. Computational Morphology IV: xfst. CSA4050: Computational Morphology IV

  2. What is xfst?
     • xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.
     • xfst and other Xerox tools employ a notation very close to the notation we have been using so far.
     • For full documentation on the syntax and semantics of Xerox REs, see http://www.fsmbook.com

  3. Simple Commands
     • command line (via babe)> xfst
     • define: give a name to an RE
     • print: print information
     • read: read information
     • various stack operations
     • file interaction

  4. define command
     • define name regexp ;
       xfst[0]: define foo [d o g] | [c a t];
       xfst[0]: define R1 [a | b | c | d];
       xfst[0]: define R2 [d | e | f | g];
       xfst[0]: define R3 [f | g | h | i | j];
       xfst[0]: define baz R1 & R2;

  5. print words
     • print words name: see the words in the language called name
       xfst[0]: print words R1
       d
       c
       b
       a
       xfst[0]:

  6. print net
     • print net name: see detailed information about the network name.
       xfst[0]: define z R1 & R2;
       xfst[0]: define baz R1 & R2;
       xfst[0]: print net z
       Sigma: a b c d e f g
       Size: 7
       Net: FC370
       Flags: deterministic, pruned, minimized, epsilon_free, loop_free
       Arity: 1
       s0: d -> fs1.
       fs1: (no arcs)
       xfst[0]:
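
As a rough illustration of why the network for z has a single arc labelled d, the sketch below models the languages from the define slide as plain Python sets and the xfst operator & as set intersection. This is an analogy written in Python, not xfst; only the names R1, R2, R3 and baz come from the slides.

```python
# Model the one-symbol languages from the "define command" slide
# as finite sets of strings (an analogy, not an xfst network).
R1 = {"a", "b", "c", "d"}
R2 = {"d", "e", "f", "g"}
R3 = {"f", "g", "h", "i", "j"}

# In xfst notation, R1 & R2 denotes the intersection of the two languages.
baz = R1 & R2

# The only string shared by R1 and R2 is "d", which matches the
# single arc "s0: d -> fs1." shown by print net.
print(sorted(baz))
```

Intersecting R1 with R3 in the same way gives the empty set, since the two languages share no symbol.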

  7. Some Properties of Networks
     • epsilon free: there are no arcs labeled with the epsilon symbol.
     • deterministic: no state has more than one outgoing arc with the same label.
     • minimised: there is no other network with exactly the same paths that has fewer states.
     • These make sense for FSAs, not necessarily for FSTs.
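
The first two properties can be checked mechanically. The sketch below is a minimal illustration, assuming an automaton is represented as a list of (source, label, target) arcs and reading "deterministic" as "no state has two outgoing arcs with the same label". The representation, the EPSILON marker and the function names are assumptions made for this example, not part of xfst.

```python
# Assumed representation: an automaton is a list of arcs
# (source state, label, target state). EPSILON is a stand-in marker.
EPSILON = ""

def is_epsilon_free(arcs):
    """True if no arc is labelled with the epsilon marker."""
    return all(label != EPSILON for _, label, _ in arcs)

def is_deterministic(arcs):
    """True if no state has two outgoing arcs with the same label."""
    seen = set()
    for source, label, _ in arcs:
        if (source, label) in seen:
            return False
        seen.add((source, label))
    return True

# The network printed for z: one arc labelled d from s0 to fs1.
z_arcs = [("s0", "d", "fs1")]

# A hypothetical network in which two arcs labelled a leave s0.
nondet_arcs = [("s0", "a", "s1"), ("s0", "a", "s2")]

print(is_epsilon_free(z_arcs), is_deterministic(z_arcs))
print(is_deterministic(nondet_arcs))
```

Checking minimality is a harder problem and is deliberately left out of this sketch.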

  8. Equivalent?
     [Figure: two transducers, A and B, built from arcs labelled a:0, a and b.]
     • no. of states?
     • no. of paths?
     • relation encoded?

  9. Remarks
     • A and B encode the same relation {<"aa","a">, <"ab","ab">}.
     • They are both deterministic and minimal.
     • They have different numbers of states.
     • Arcs labeled with a pair containing an epsilon on one side can sometimes be redistributed or eliminated, reducing the number of states.
     • This situation does not occur with FSAs.

  10. FST Determinism: Sequential vs. Unambiguous
     • Unambiguous: for any input there is at most one output.
     • Transducer A is unambiguous in either direction.
     • Sequential: no state has more than one arc with the same symbol on the input side.
     • Transducer A is not sequential in one direction.
     • A transducer is sequentiable if the relation it encodes is unambiguous and all the local ambiguities resolve themselves in a fixed number of steps.
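
The definition of "sequential" can be tested directly on a list of arcs. In the sketch below the transducer is hypothetical: it is one possible network for the relation {<"aa","a">, <"ab","ab">} and is not claimed to be the transducer A drawn on the slides. The arc representation and the function name are likewise assumptions, and epsilon (written "0") is treated as an ordinary symbol for the purpose of the check.

```python
# Assumed representation: arcs are (source, upper symbol, lower symbol, target).
# "0" stands for epsilon, as in the xfst pair a:0.
UPPER, LOWER = 1, 2

def is_sequential(arcs, side):
    """True if no state has two arcs with the same symbol on the given side."""
    seen = set()
    for arc in arcs:
        key = (arc[0], arc[side])
        if key in seen:
            return False
        seen.add(key)
    return True

# Hypothetical transducer for {<"aa","a">, <"ab","ab">}:
# path 1 maps a to epsilon and then a to a; path 2 maps a to a and then b to b.
arcs = [
    ("s0", "a", "0", "s1"), ("s1", "a", "a", "f"),
    ("s0", "a", "a", "s2"), ("s2", "b", "b", "f"),
]

print(is_sequential(arcs, UPPER))  # two arcs from s0 read a on the upper side
print(is_sequential(arcs, LOWER))  # lower-side symbols differ at every state
```

Read from the upper side, this network has to guess at s0 which a-arc to take, and the guess is only confirmed by the next symbol. That is the kind of local ambiguity that resolves itself in a fixed number of steps.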

  11. Basic Stack Operations
     • read regex: push a network onto the stack
     • print stack: list the items on the stack
     • print net: detailed info on the top stack item
     • pop stack: remove the top item from the stack
     • define name: set name to the value of the top stack item

  12. Stack Operations: intersect net, union net, etc.
     • Load the stack with N suitable arguments.
     • Ensure that the arguments are pushed onto the stack in the correct (reverse) order.
     • Issue the command, e.g. intersect net.
     • The arguments are popped from the stack, the operation is performed, and the result is pushed back onto the stack.

  13. Stack Example 1
       xfst[0]: clear stack;
       xfst[0]: read regex [d | c | e | b | w];
       xfst[1]: read regex [b | s | h | w];
       xfst[2]: read regex [s | d | c | f | w];
       xfst[3]: print stack
       xfst[3]: intersect net
       xfst[1]: print stack
       xfst[1]: print net
       xfst[1]: print words
     x1
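
To see why the prompt drops from xfst[3] to xfst[1], the stack discipline can be mimicked with a Python list of sets. This is only a model of the behaviour described on the slides (all arguments popped, one result pushed); the function name intersect_net is chosen for the example.

```python
from functools import reduce

# Each "read regex" pushes one language onto the stack (end of list = top).
stack = []
stack.append({"d", "c", "e", "b", "w"})
stack.append({"b", "s", "h", "w"})
stack.append({"s", "d", "c", "f", "w"})

def intersect_net(stack):
    """Pop every language, intersect them, and push the single result."""
    result = reduce(lambda left, right: left & right, stack)
    stack.clear()
    stack.append(result)

intersect_net(stack)
print(len(stack), stack[0])
```

Only w occurs in all three languages, so a single one-word language remains on the stack.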

  14. Stack Example 2
       xfst[0]: clear stack;
       xfst[0]: read regex [e d | i n g | s | []];
       xfst[1]: read regex [t a l k | k i c k];
       xfst[2]: print stack
       xfst[2]: print net
       xfst[2]: print words
       xfst[2]: concatenate net
       xfst[1]: print words
     x2/a
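
The reason the suffixes are pushed before the stems can be sketched with the same list-of-sets model. The sketch assumes, as the "reverse order" remark on the earlier slide suggests, that the top of the stack supplies the leftmost part of each word. It is an illustration in Python, not the xfst implementation.

```python
# End of list = top of stack. The suffixes were pushed first, the stems last.
suffixes = {"ed", "ing", "s", ""}   # "" models the empty string []
stems = {"talk", "kick"}
stack = [suffixes, stems]

def concatenate_net(stack):
    """Concatenate all languages from top to bottom and push the result."""
    result = {""}
    for language in reversed(stack):  # walk from the top of the stack down
        result = {left + right for left in result for right in language}
    stack[:] = [result]

concatenate_net(stack)
print(sorted(stack[0]))
```

Pushing the two languages in the opposite order would instead produce strings such as "edtalk".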

  15. Creating Relations
     • A simple example of a transducer can be shown using the crossproduct operator:
       xfst[0]: clear stack
       xfst[0]: define Y [d o g | c a t];
       xfst[0]: define Z [c h i e n | c h a t];
       xfst[0]: read regex Y .x. Z;
     • We can now use apply up and apply down to test the transducer’s behaviour.
     x3ab

  16. apply up; apply down
     • applyup(arg,R) = {x | <x,arg> in R}
     • applydown(arg,R) = {x | <arg,x> in R}
       xfst[0]: read regex [d o g | c a t] .x. [c h i e n | c h a t];
       xfst[1]: apply up chien
       dog
       cat
       xfst[1]: apply down cat
       chien
       chat
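
The two set definitions above translate almost word for word into Python if a finite relation is modelled as a set of (upper, lower) pairs. The sketch below is such a model. The names apply_up and apply_down mirror the xfst commands but are ordinary Python functions written for this example.

```python
# The crossproduct Y .x. Z pairs every string of Y with every string of Z.
Y = {"dog", "cat"}
Z = {"chien", "chat"}
relation = {(upper, lower) for upper in Y for lower in Z}

def apply_up(arg, relation):
    """applyup(arg, R) = {x | <x, arg> in R}"""
    return {x for (x, y) in relation if y == arg}

def apply_down(arg, relation):
    """applydown(arg, R) = {x | <arg, x> in R}"""
    return {y for (x, y) in relation if x == arg}

print(sorted(apply_up("chien", relation)))
print(sorted(apply_down("cat", relation)))
```

Because the crossproduct relates every upper string to every lower string, both lookups return two answers, which is why this relation is not yet a correct translation.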

  17. Exercise for .x.
     • What RE would perform the correct translations?
     • Define it in xfst.
     • Define an RE in xfst which relates the surface forms "sing", "sang" and "sung" to the lexical form "sing".
     x3c

  18. Replace Rules
     • Xerox RE notation includes replace rules.
     • Replace rules do not increase the descriptive power of REs; however, they do provide a powerful abbreviated rule-like notation.
     • There are two main types of replace rules: unconditional and conditional.

  19. Unconditional Replace Rules
     • The most straightforward kind of unconditional replace rule is: a -> b
     • This denotes an FS relation in which every symbol a in the upper language corresponds to a symbol b in the lower language.
     • Checkpoint: how does this differ from a:b? What is the FST that computes this relation?

  20. Unconditional Replace
     e.g.
       xfst[0]: read regex c -> r;
       xfst[1]: apply down cat
       xfst[1]: apply down dog
     • Where there is no match, the string is identity mapped.
     • The general pattern for simple replace rules is A -> B, where A and B are REs denoting arbitrarily complex languages (not relations).
     x4ab
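
For the single-symbol rule c -> r, the downward behaviour can be imitated with ordinary string replacement. The sketch covers only this simple case and only the apply down direction. The function name is made up for the example.

```python
def apply_down_c_to_r(upper):
    """Imitate 'apply down' for the rule c -> r: every c becomes r,
    and strings without a c are mapped to themselves."""
    return upper.replace("c", "r")

for word in ["cat", "dog"]:
    print(word, "->", apply_down_c_to_r(word))
```

The second word shows the identity mapping: with no c to match, the string passes through unchanged.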

  21. Definition of A -> B
     • A -> B = [no_A [A .x. B]]* no_A, where no_A = ~$[A - 0]
     • N.B. If upper (here A) does not contain the empty string, then ~$[upper - 0] = ~$[upper]; otherwise ~$[upper] is null, whereas ~$[upper - 0] contains at least the empty string.

  22. Conditional Replace Rules
     • More complex replace rules can also specify left and right context, as in A -> B || L _ R
     • Each lexical substring A is related to a substring B when the left context ends with L and the right context starts with R.
     • A, B, L and R are REs denoting languages, not relations.
     x4c

  23. Special Cases
     • The symbol .#. refers to the absolute beginning or end of the string in left and right contexts. For example: e -> i || .#. p _ r
     • Checkpoint: write a replace rule that brings lexical "go" into correspondence with surface "went".
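
The effect of the boundary symbol in e -> i || .#. p _ r can be approximated with a Python regular expression, where ^ plays the role of .#. in the left context and a lookahead plays the role of the right context. This is a sketch of the downward mapping for this one rule, using Python's re module rather than xfst. The function name is an assumption.

```python
import re

def apply_down_e_to_i(upper):
    """Imitate e -> i || .#. p _ r: an e becomes i only when it follows
    a string-initial p and is immediately followed by r."""
    # ^p is the left context (string start, then p); (?=r) is the right context.
    return re.sub(r"^pe(?=r)", "pi", upper)

for word in ["per", "perper", "super", "pet"]:
    print(word, "->", apply_down_e_to_i(word))
```

Only an e in the second position of the string can be affected, so a later "per" inside the same word is left alone.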

  24. The kaNpat exercise
     • Suppose we have a language in which kaNpat is a lexical string consisting of the morpheme kaN concatenated with the suffix pat.
     • N just before p gets realised as m.
     • p occurring just after an m is realised as m.

  25. kaNpat rules
     • We can write the following two rules to account for this behaviour:
       Rule 1. [N -> m || _ p]
     • Notice that the left-hand context is empty, meaning that any context will do.
       Rule 2. [p -> m || m _]
     • Note that the linguist must keep track of the order in which rules are applied.

  26. Derivation of kammat
       Lexical:      kaNpat
                     apply [N -> m || _ p]
       Intermediate: kampat
                     apply [p -> m || m _]
       Surface:      kammat
     • The first rule feeds the second.
     • Checkpoint: what happens if the rules are applied in reverse order?
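
The derivation above can be replayed as a cascade of two string rewrites. In this sketch each rule is imitated with a Python regular expression whose lookahead or lookbehind stands in for the rule's context. The function names rule1 and rule2 are chosen for the example, and only the downward direction is modelled.

```python
import re

def rule1(s):
    """Imitate [N -> m || _ p]: N becomes m when the next symbol is p."""
    return re.sub(r"N(?=p)", "m", s)

def rule2(s):
    """Imitate [p -> m || m _]: p becomes m when the previous symbol is m."""
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpat"
intermediate = rule1(lexical)
surface = rule2(intermediate)
print(lexical, "->", intermediate, "->", surface)
```

Swapping the two calls, that is computing rule1(rule2(lexical)), is a quick way to explore the checkpoint question about rule order.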

  27. Composing the Relations
     • Each rule describes a certain relation: call these R1 and R2.
     • If R1 maps X to Y and R2 maps Y to Z, then there must exist a single relation which maps directly from X to Z without passing through Y.
     • Mathematically, that relation is the composition of R1 and R2.
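
For finite relations the composition can be written as a one-line set comprehension. The sketch below uses tiny hand-written fragments of the two rule relations (one pair each) purely for illustration. The real R1 and R2 are infinite relations, and the compose function is a name chosen for this example.

```python
def compose(r1, r2):
    """Return {(x, z) | (x, y) in r1 and (y, z) in r2} for finite relations."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

# Hand-picked fragments of the two rule relations (assumed for illustration).
R1 = {("kaNpat", "kampat")}   # fragment of N -> m || _ p
R2 = {("kampat", "kammat")}   # fragment of p -> m || m _

print(compose(R1, R2))
```

The intermediate string kampat no longer appears in the result, which is the sense in which the composed relation maps X to Z directly.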

  28. Composing the Rules
     • Each rule is compiled into an FST.
     • If Rule 1 compiles to F1, and Rule 2 to F2, then there must be an F3 which computes the composition of F1 and F2.
     • Checkpoint: write the RE corresponding to the composition of the original two rules.

  29. Testing the kaNpat grammar
     • First get the rules onto the stack:
       xfst[0]: read regex [N -> m || _ p] .o. [p -> m || m _];
     • Try the following and explain:
       apply down (kaNpat; kampat; kammat)
       apply up kammat
     • Try the above but with the rules in reverse order.
     X5ab
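
One way to explore these experiments without xfst is to imitate the composed rules in Python and find the "apply up" answers by brute force. The sketch below assumes a small alphabet (k, a, N, p, m, t) and relies on the fact that both rules preserve string length, so only candidates of the same length need to be tried. A real transducer needs neither assumption, and the function names are invented for the example.

```python
import re
from itertools import product

def apply_down(upper):
    """Imitate [N -> m || _ p] .o. [p -> m || m _] in the downward direction."""
    intermediate = re.sub(r"N(?=p)", "m", upper)
    return re.sub(r"(?<=m)p", "m", intermediate)

def apply_up(lower, alphabet="kaNpmt"):
    """Brute-force inverse: every same-length string over the assumed
    alphabet whose downward image is the given lower string."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=len(lower)))
    return sorted(x for x in candidates if apply_down(x) == lower)

for word in ["kaNpat", "kampat", "kammat"]:
    print(word, "->", apply_down(word))
print(apply_up("kammat"))
```

Swapping the two re.sub lines gives the reversed rule order for the last part of the exercise.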

  30. Practical use of xfst
     • Regular expression files (text)
       xfst[0]: read regex < regexpfile
     • Binary files (compiled networks)
       xfst[1]: save stack binfile
       xfst[0]: load stack binfile
     • Scripts (xfst commands)
       xfst[0]: source scriptfile
       % xfst -f myscript
       % xfst -l myscript

  31. A’ is the sequentiable
     [Figure: transducer A (arcs labelled a:0, a and b) and transducer A’ (arcs labelled a, a:0, 0:b and b:a).]
     • no. of states?
     • no. of paths?
     • relation encoded?
