
Regular expressions Day 2

Regular expressions Day 2. LING 681.02 Computational Linguistics Harry Howard Tulane University. Course organization. Regular expressions. SLP 2.1. Questions. What is a string? A sequence of symbols. In text, a sequence of alphanumeric characters.


Presentation Transcript


  1. Regular expressions Day 2 LING 681.02 Computational Linguistics Harry Howard Tulane University

  2. Course organization LING 681.02, Prof. Howard, Tulane University

  3. Regular expressions SLP 2.1

  4. Questions
     • What is a string?
       • A sequence of symbols.
       • In text, a sequence of alphanumeric characters.
     • What is a regular expression (RE or regex)?
       • A language for specifying text search strings, requiring a pattern to search for and a corpus to search through.
     • What is an algebra?
       • A set of elements and a group of operations defined for them,
       • e.g. the set of real numbers and the operations +, –, *, and /.
     • What is a false positive?
       • A string that is incorrectly matched, which decreases accuracy.
     • What is a false negative?
       • A string that is incorrectly excluded, which decreases coverage.
     • What is precedence?
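The false positive and false negative definitions above can be illustrated with a small search for the word "the". This is a minimal sketch using Python's re module; the sample sentence and the variable names are assumptions made up for illustration, not part of the slides.

```python
import re

# Hypothetical sample text, chosen to contain "the" inside another word
# and a capitalized "The".
text = "The other day the cat sat on the mat."

# Naive pattern: finds "the" inside "other" (a false positive)
# and misses the capitalized "The" (a false negative).
naive = re.findall(r"the", text)

# Refined pattern: word boundaries remove the false positive,
# and [tT] recovers the capitalized form.
refined = re.findall(r"\b[tT]he\b", text)

print(naive)
print(refined)
```

Both lists have three items, but only the refined pattern returns the three actual occurrences of the word.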

  5. Notation in Perl
     • *     0 or more occurrences of the previous character or RE
     • +     1 or more occurrences of the previous character or RE
     • -     the two ends of a range (inside brackets)
     • ^     not (negation, inside brackets) or beginning of line; the "caret"
     • ?     the previous character is optional
     • .     any character
     • |     either … or; the "pipe"
     • ()    grouping, or put in a register
     • {n}   n occurrences of the previous character or RE
     • \b    word boundary
     • \s    white space
     • $     end of line
     • \1    refers back to the text matched by the RE in register 1
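Each operator in the notation slide can be tried out in Python, whose re module follows the Perl notation. The sketch below pairs one small pattern per operator with a made-up test string and the list that re.findall returns for it; all of the example strings are assumptions chosen for illustration.

```python
import re

# (pattern, text, result of re.findall) for each operator on the slide.
EXAMPLES = [
    (r"ab*", "a ab abbb", ["a", "ab", "abbb"]),        # *   zero or more b
    (r"ab+", "a ab abbb", ["ab", "abbb"]),             # +   one or more b
    (r"[0-9]", "a1b22", ["1", "2", "2"]),              # -   range inside brackets
    (r"[^a-z]", "a1b!", ["1", "!"]),                   # ^   negation inside brackets
    (r"^\d+", "42 cats", ["42"]),                      # ^   beginning of line
    (r"colou?r", "color colour", ["color", "colour"]), # ?   optional u
    (r"c.t", "cat cot ct", ["cat", "cot"]),            # .   any character
    (r"cat|dog", "cat dog cow", ["cat", "dog"]),       # |   either ... or
    (r"a{3}", "aa aaa", ["aaa"]),                      # {n} exactly three a
    (r"\bthe\b", "the other", ["the"]),                # \b  word boundary
    (r"\s", "a b\tc", [" ", "\t"]),                    # \s  white space
    (r"cats$", "42 cats", ["cats"]),                   # $   end of line
]

for pattern, text, expected in EXAMPLES:
    assert re.findall(pattern, text) == expected

# () and \1: the group stores a word in register 1, and \1 requires
# the same text to appear again.
repeated = re.search(r"\b(\w+) \1\b", "say the the word")
print(repeated.group(0))
```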

  6. Exercise 2.1: REs
     • The set of all alphabetic strings.
       • [a-zA-Z][a-zA-Z]*
       • [a-zA-Z]+
     • The set of all lower case alphabetic strings ending in a b.
       • [a-z]*b
     • The set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and “the the” but not “the bug” or “the big bug”).
       • ([a-zA-Z]+)\s+\1
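The three answers above can be checked in Python along the following lines. This is a sketch, not the course's own solution: the helper function names are hypothetical, and the \b anchors around the repeated-word pattern are an added assumption, since without them the pattern also matches across word fragments such as "the theory".

```python
import re

ALPHABETIC = re.compile(r"[a-zA-Z]+")
LOWER_ENDING_IN_B = re.compile(r"[a-z]*b")
# Assumption: \b added on both sides so only whole words count as repeats.
REPEATED_WORD = re.compile(r"\b([a-zA-Z]+)\s+\1\b")


def is_alphabetic(s: str) -> bool:
    """True if the whole string consists of letters only."""
    return ALPHABETIC.fullmatch(s) is not None


def is_lower_ending_in_b(s: str) -> bool:
    """True if the whole string is lower-case letters ending in b."""
    return LOWER_ENDING_IN_B.fullmatch(s) is not None


def has_repeated_word(s: str) -> bool:
    """True if the string contains the same word twice in a row."""
    return REPEATED_WORD.search(s) is not None


for phrase in ["Humbert Humbert", "the the", "the bug", "the big bug"]:
    print(phrase, has_repeated_word(phrase))
```

fullmatch is used for the first two because they describe sets of whole strings, while search is used for the third because the repeated words may occur anywhere in the string.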

  7. Exercise 2.1: REs, cont.
     • The set of all strings from the alphabet a, b such that each a is immediately preceded by and immediately followed by a b.
       • (b+(ab+)*)?
     • All strings that start at the beginning of the line with an integer and that end at the end of the line with a word.
       • ^\d+\b.*\b[a-zA-Z]+$
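A quick way to test these two answers in Python is sketched below. The first pattern is written with (ab+)* so that strings containing no a at all, such as "bb", are accepted, since they satisfy the condition vacuously. The function names and test strings are hypothetical.

```python
import re

# Every a must have a b directly before and directly after it.
B_AROUND_EACH_A = re.compile(r"(b+(ab+)*)?")
# Line starts with an integer and ends with a word.
INTEGER_TO_WORD = re.compile(r"^\d+\b.*\b[a-zA-Z]+$")


def each_a_between_bs(s: str) -> bool:
    """True if s is over {a, b} and every a is flanked by b on both sides."""
    return B_AROUND_EACH_A.fullmatch(s) is not None


def starts_with_integer_ends_with_word(line: str) -> bool:
    """True if the line begins with an integer and ends with a word."""
    return INTEGER_TO_WORD.search(line) is not None


for s in ["", "bb", "bab", "babab", "baab", "ab"]:
    print(repr(s), each_a_between_bs(s))
```

Note that "42nd street" is rejected by the second pattern, because \b after \d+ requires the integer to end at a word boundary.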

  8. Exercise 2.1: REs, cont.
     • All strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto).
       • \bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b
     • Write a pattern that places the first word of an English sentence in a register. Deal with punctuation.
       • ^[^a-zA-Z]*([a-zA-Z]+)
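The grotto/raven pattern and the first-word pattern can be exercised in Python as follows. This is a sketch under the assumption that each input is a single line, since . does not match a newline by default; the function names and sample sentences are hypothetical.

```python
import re
from typing import Optional

BOTH_WORDS = re.compile(r"\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b")
FIRST_WORD = re.compile(r"^[^a-zA-Z]*([a-zA-Z]+)")


def has_grotto_and_raven(line: str) -> bool:
    """True if both whole words occur in the line, in either order."""
    return BOTH_WORDS.search(line) is not None


def first_word(sentence: str) -> Optional[str]:
    """Return the first word, skipping leading punctuation, or None."""
    match = FIRST_WORD.match(sentence)
    # group(1) is the register filled by the parenthesized part.
    return match.group(1) if match else None


print(has_grotto_and_raven("the raven flew into the grotto"))
print(first_word('"Hello," she said.'))
```

The alternation is needed because a single pattern with .* only covers one order of the two words.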

  9. Exercise 2.2
     • patterns
       • (r"\b(i'm|i am)\b", "YOU ARE"),
       • (r"\b(i|me)\b", "YOU"),
       • (r"\b(my)\b", "YOUR"),
       • (r"\b(well,?) ", ""),
       • (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
       • (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
       • (r".* all .*", "IN WHAT WAY"),
       • (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
       • (r"[%s]" % re.escape(string.punctuation), ""),
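One way to run these patterns is to apply them in order with re.sub, as in the ELIZA-style sketch below. The slide gives only the pattern list, so the driver function respond is hypothetical, and three steps in it are assumptions: lower-casing the input so the lower-case patterns apply, padding it with spaces so the " YOU ARE " and " all " patterns can match at the edges, and upper-casing the final reply.

```python
import re
import string

# Pattern list from the slide, in the slide's order.
PATTERNS = [
    (r"\b(i'm|i am)\b", "YOU ARE"),
    (r"\b(i|me)\b", "YOU"),
    (r"\b(my)\b", "YOUR"),
    (r"\b(well,?) ", ""),
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    # Same regex as the rule above, so under sequential substitution
    # this alternative reply never fires.
    (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r"[%s]" % re.escape(string.punctuation), ""),
]


def respond(utterance: str) -> str:
    """Hypothetical driver: apply every substitution once, in order."""
    # Assumptions: lower-case the input and pad it with spaces.
    text = " " + utterance.lower() + " "
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text)
    # Assumption: replies are shown in upper case.
    return text.strip().upper()


for line in ["Well, I am depressed today",
             "They are all alike",
             "He is always bugging me",
             "My boyfriend made me come here"]:
    print(respond(line))
```

Because punctuation is stripped last, an input such as "I am sad." does not trigger the "sad" rule in this sketch; moving the punctuation rule earlier would change that, at the cost of also stripping the apostrophe in "i'm".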

  10. NLPP

  11. REs in Python
     • The re module provides Perl-type regular expression patterns; see http://www.amk.ca/python/howto/regex/
     • NLPP goes into REs in §3.4, p. 97ff.

  12. Next time
     • SLP Automata: §2.2-end & Ex. 2.3-end
     • NLPP: finish §1, do as many of the exercises as you can
