
Unix Programming Environment Part 3-4 Regular Expression and Pattern Matching




  1. Unix Programming Environment Part 3-4 Regular Expression and Pattern Matching Prepared by Xu Zhenya( xzy@buaa.edu.cn ) Draft – Xu Zhenya( 2002/10/01 ) Rev1.0 – Xu Zhenya( 2002/10/09 )

  2. Agenda
     • 1. An Introduction to Regular Expression in UNIX
       • BRE & ERE, GNU-RE
     • 2. grep, egrep, fgrep
     • 3. sed
     • Chapter 4 (4.1 & 4.2)

  3. An Introduction to Regular Expression (1)
     • UNIX commands combined with REs allow us to perform three tasks:
       • Pattern matching
         • Search for a particular pattern. The RE specifies the pattern.
         • UNIX commands that search: ed ex sed vi grep awk
       • Modify
         • Search for a particular pattern and change it. The RE specifies the pattern and sometimes how to change it.
         • UNIX commands that search and modify: ed ex sed vi
       • Programming
         • awk provides a programming language which can use REs.
         • Perl, Python, Tcl, etc.
         • lex (flex in GNU tools)
         • POSIX defines a standard regular expression library, included in libc: regcomp, regexec…
     • Three different regular expression definitions:
       • The shells: for filename/pathname expansion
       • Simple/Basic (BRE): grep, sed, vi
       • Extended (ERE): egrep, awk, perl, etc.
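
The three definitions listed on this slide can be compared side by side. The following is a minimal sketch, assuming a POSIX shell and grep; the sample names (report.txt, notes.md) and the variable names are illustrative only.

```shell
# Shell pattern (glob): the shell itself does the matching, here via `case`.
name="report.txt"
case "$name" in
  *.txt) glob_result="match" ;;
  *)     glob_result="no match" ;;
esac

# BRE with grep: `.` is any character, `*` repeats the preceding item,
# and `\.` is a literal dot.
bre_result=$(printf 'report.txt\nnotes.md\n' | grep 'rep.*\.txt')

# ERE with grep -E: unescaped parentheses group and `|` alternates.
ere_count=$(printf 'report.txt\nnotes.md\n' | grep -E -c '\.(txt|md)$')
```

Note that `*` means "any string" in a shell pattern but "repeat the preceding item" in a BRE or ERE, which is why the same text cannot be reused across the three definitions without care.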

  4. Meta-characters
     • Meta-characters can be divided into three categories:
       • Matching characters: the primary building blocks of REs
       • Grouping and repetition characters
       • Tagging and back-referencing
     • Some of the RE meta-characters are also shell meta-characters. For example:
       $ egrep r* /etc/passwd    # the shell applies filename expansion to r* first
     • => So RE meta-characters should be quoted.
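
The quoting problem on this slide can be observed directly. The sketch below assumes a POSIX shell with mktemp available; the scratch directory and the file names readme and results are made up for the demonstration.

```shell
# Create a scratch directory holding two files whose names start with "r".
workdir=$(mktemp -d)
touch "$workdir/readme" "$workdir/results"

# Unquoted: the shell replaces r* with the matching file names
# before the command ever sees it.
unquoted=$(cd "$workdir" && echo r*)

# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$workdir" && echo 'r*')

rm -rf "$workdir"
```

With grep or egrep in place of echo, the unquoted form would search for the first file name as the pattern, which is rarely what was intended. Single quotes are the safest choice for REs.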

  5. Matching Characters ( 1 )

  6. Matching Characters ( 2 )
     • Example REs
       • hello          Match the string hello
       • ...\....       Match any three characters, followed by a ., followed by three more characters
       • ^...\....      Match the same as the previous one, but it must appear at the start of the line.
       • ^hello$        Match any line which contains hello ONLY
       • [a-z][^a-z]    Match any two characters where the first one is between a-z and the second isn't.
       • \[\]\\         Match the characters '[', ']', '\' in order and contiguous.
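
The example REs above can be tried against a few lines of input. This is a small sketch assuming a POSIX grep; the sample text is invented for illustration.

```shell
# Five sample lines, one per case of interest.
sample='hello
say hello there
abc.def
xabc.defx
a1'

# ^hello$ : only the line that is exactly "hello".
only_hello=$(printf '%s\n' "$sample" | grep '^hello$')

# ...\.... : three characters, a literal dot, three characters, anywhere.
dotted=$(printf '%s\n' "$sample" | grep -c '...\....')

# ^...\.... : the same, but anchored at the start of the line.
anchored=$(printf '%s\n' "$sample" | grep -c '^...\....')

# A lowercase letter followed by one character that is not a lowercase letter.
letter_then_other=$(printf '%s\n' "$sample" | grep '^[a-z][^a-z]$')
```

The difference between the second and third counts comes from xabc.defx, which contains the pattern but does not start with it.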

  7. Grouping and Repetition Characters( 1 )

  8. Grouping and Repetition Characters ( 2 )
     • Examples
       • OO+                   Match two or more Os
       • /bin/(tcsh|bash)$     Match /bin/tcsh or /bin/bash occurring at the end of the line.
       • ^[^:]*:[^:]*:[0-9]:
         • the start of a line (^), followed by
         • 0 or more characters which aren't :s ([^:]*), followed by
         • a : (:), followed by
         • 0 or more characters which aren't :s ([^:]*), followed by
         • a : (:), followed by
         • a single digit ([0-9]), followed by
         • a : (:)
       • (\+|-)?[0-9]+
       • [+-]?[0-9]+
         • Both match an optionally signed integer (a plus or minus or nothing, followed by one or more digits).
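
These grouping and repetition examples are EREs, so they need grep -E (or egrep). The sketch below uses made-up passwd-style lines rather than the real /etc/passwd, and writes the third pattern with the trailing colon that the slide's walkthrough describes.

```shell
# Made-up passwd-style records: name:password:uid:gid:comment:home:shell
passwd_sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/tcsh'

# OO+ : an O followed by one or more Os, i.e. two or more in a row.
two_os=$(printf 'O\nOO\nfOOOd\n' | grep -E -c 'OO+')

# Alternation inside a group, anchored at the end of the line.
shell_users=$(printf '%s\n' "$passwd_sample" | grep -E -c '/bin/(tcsh|bash)$')

# Third field (the uid) is exactly one digit.
single_digit_uid=$(printf '%s\n' "$passwd_sample" | grep -E -c '^[^:]*:[^:]*:[0-9]:')

# Optionally signed integers, anchored so the whole line must be a number.
signed=$(printf '+42\n-7\n13\nabc\n' | grep -E -c '^[+-]?[0-9]+$')
```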

  9. Tagging and back-referencing (1)
     • \( \) is used to tag/remember the part of the RE you wish to back-reference.
     • Contents are placed into numbered registers 1, 2, ...
     • Access the contents of a register using \N, where N is the number of the register.
     • Examples
       • \(hello\) \1    Matches hello hello
       • \([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1    Matches patterns like:
           12,55,34 34,55,12
           1023,5321,934 934,5321,1023
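
Both back-reference examples can be checked with grep, which uses BRE syntax by default. This is a minimal sketch with invented input lines.

```shell
# \1 must repeat exactly what \(hello\) matched.
doubled=$(printf 'hello hello\nhello world\n' | grep '\(hello\) \1')

# Three numbers, then the same three numbers in reverse order.
# The second input line repeats them in the same order and is rejected.
mirrored=$(printf '12,55,34 34,55,12\n12,55,34 12,55,34\n' |
  grep '\([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1')
```

A back-reference matches the text that the group captured, not the group's pattern, so \1 after \([0-9]*\) will not accept a different number.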

  10. Tagging and back-referencing (2)
      $ sed -e 's/\([a-zA-Z]\+\) \([a-zA-Z]\+\)/\2, \1/'
      • First \([a-zA-Z]\+\): put "Programming" to memory 1
      • Second \([a-zA-Z]\+\): put "UNIX" to memory 2
      • \2 in the replacement: extract memory 2
      • \1 in the replacement: extract memory 1
      • Input:  Programming UNIX
      • Output: UNIX, Programming
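
The \+ used in this command is a GNU extension to BRE. A hedged alternative, assuming only POSIX BRE, is to write the repetition as the interval \{1,\}; the -E form shown second is accepted by GNU and BSD sed. Variable names are illustrative.

```shell
# POSIX BRE: \{1,\} means "one or more" and works without GNU extensions.
swapped_bre=$(printf 'Programming UNIX\n' |
  sed -e 's/\([a-zA-Z]\{1,\}\) \([a-zA-Z]\{1,\}\)/\2, \1/')

# ERE via -E: groups and + need no backslashes; \1 and \2 work as before.
swapped_ere=$(printf 'Programming UNIX\n' |
  sed -E -e 's/([a-zA-Z]+) ([a-zA-Z]+)/\2, \1/')
```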

  11. BRE

  12. ERE

  13. Conclusion
      • 1. A few metacharacters are common, both in representation and meaning, to all three definitions.
        • The backslash (\)
        • The square brackets ([])
        • Two metacharacters within the brackets are also common among all three forms of RE.
          • If the caret (^) is the first character within the brackets, the complement of the given set of characters is meant (in shell patterns POSIX specifies ! for the complement; many shells also accept ^).
          • If the hyphen (-) occurs within the brackets, it indicates a range of characters.
      • 2. In addition to BRE, extended REs add:
        • the parentheses (()) for grouping.
        • the vertical bar (|) (also called pipe) for alternation.
        • the plus sign (+), meaning repeat the preceding item at least once.
        • the question mark (?), meaning match the preceding item either zero or one times.

  14. Conclusion
      • In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). (This describes GNU grep; \?, \+ and \| are GNU extensions and are not part of POSIX BRE.)
      • Large repetition counts in the {m,n} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
      • Backreferences are very slow, and may require exponential time.
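
The difference in how BRE and ERE treat the same characters shows up clearly with + and with interval counts. The following sketch assumes a POSIX grep; the inputs are invented.

```shell
# In a BRE, a bare + is an ordinary character: only the line "a+b" matches.
bre_plus=$(printf 'a+b\naab\n' | grep 'a+b')

# In an ERE, + repeats the preceding item: "aab" matches, "a+b" does not.
ere_plus=$(printf 'a+b\naab\n' | grep -E 'a+b')

# Interval counts: backslashed braces in BRE, plain braces in ERE.
bre_interval=$(printf 'ab\naab\naaab\n' | grep -c '^a\{2,3\}b$')
ere_interval=$(printf 'ab\naab\naaab\n' | grep -E -c '^a{2,3}b$')
```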

  15. Overview of Text Manipulation Utils
      • ed was a very early UNIX line editor. It included a number of commands to manipulate the file being edited.
      • Other UNIX commands like sed, ex, vi were built on ed and use the same commands.
      • Ex/vi command syntax: [ from_address [ , to_address ] ] command [ parameters ]
      • Read the textbook: Appendix A
      • Supplementary Reading:
        • An Introduction to Display Editing with Vi, William Joy, Mark Horton
        • VI (and Clones) Editor Reference Manual, Miles O'Neal

  16. 2. grep, egrep, fgrep
      • grep: BRE
      • egrep: ERE
      • fgrep: f = fixed string
      • Option: -r searches recursively
      • Reading the textbook: p72
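
The three commands differ only in how they read the pattern; grep -E and grep -F are the equivalent spellings of egrep and fgrep. The sketch below assumes a grep that supports -r (GNU and BSD grep do); the input lines and the scratch directory are made up.

```shell
input='a.c
abc
a|c'

# BRE: the dot matches any character, so all three lines match.
bre_count=$(printf '%s\n' "$input" | grep -c 'a.c')

# Fixed string: every character is literal, so only "a.c" matches.
fixed_line=$(printf '%s\n' "$input" | grep -F 'a.c')

# ERE: | separates two alternatives, a literal "a.c" or "abc".
ere_count=$(printf '%s\n' "$input" | grep -E -c 'a\.c|abc')

# -r: search every file below a directory; -l prints only file names.
tree=$(mktemp -d)
mkdir -p "$tree/sub"
printf 'needle\n' > "$tree/sub/file.txt"
recursive_hit=$(grep -r -l 'needle' "$tree")
rm -rf "$tree"
```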

  17. 3. sed
      • sed(1) is a (s)tream (ed)itor, which manipulates the data according to given rules.
      • The sed command line syntax is:
        $ sed [OPTIONS] -e 'INSTRUCTION' [-e 'INSTRUCTION' ..] FILE
        $ sed [OPTIONS] -f SCRIPT.sed FILE
        $ cat FILE | sed -f SCRIPT.sed -e "COMMANDS"
      • Some of the most used options:
        • -f SCRIPT.sed : Read commands from the file SCRIPT.sed.
        • -e "SED-EXPRESSION" : The expression follows immediately. You can give this option multiple times.
        • -n : Do not print lines unless the p command is used.
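
The three options can be exercised on a couple of input lines. This is a minimal sketch assuming a POSIX sed and mktemp; the script file and its one-line contents are invented for the example.

```shell
# -e given twice: the instructions run in order on each line.
with_e=$(printf 'cat\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# -f: read the instructions from a script file instead.
script=$(mktemp)
printf 's/cat/dog/\n' > "$script"
with_f=$(printf 'cat\n' | sed -f "$script")
rm -f "$script"

# -n: suppress the automatic printing, so only lines hit by p appear.
with_n=$(printf 'cat\ncow\n' | sed -n -e '/cow/p')
```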

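  18. sed (2)
      • The INSTRUCTION can take the format [OPTION]/RE/COMMAND
        • OPTION: optional, can be left out. g = global option: do the COMMAND for all lines.
        • RE: the regular expression to search for
        • COMMAND: what to do
          • p = print
          • d = delete
      • Note: the g prefix is the global command of ed/ex. sed takes no such prefix; it already applies /RE/COMMAND to every input line that matches.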

  19. sed (3)
      • Commands and options
        • 1. The delete line command
          $ sed -e '/this/d' text1.txt
        • 2. The print lines command
          $ sed -n -e '/this/p' text1.txt
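
The two commands on this slide can be tried without a text1.txt file by piping in a few lines. The sample text below is invented; the last command illustrates why -n normally accompanies p.

```shell
text='this is line one
that is line two
this is line three'

# d: lines matching /this/ are removed; everything else is printed.
after_delete=$(printf '%s\n' "$text" | sed -e '/this/d')

# p with -n: only the matching lines are printed.
printed_count=$(printf '%s\n' "$text" | sed -n -e '/this/p' | wc -l | tr -d ' ')

# p without -n: every line is printed once, matching lines a second time.
without_n_count=$(printf '%s\n' "$text" | sed -e '/this/p' | wc -l | tr -d ' ')
```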

  20. sed (4)
      • 3. The substitution of text in the line
        • The most used sed command is (s)ubstitute, and its syntax is a bit different:
          [address]s/RE/replacement/[flag]
        • Examples:
          s/this-word//g
          s/UPE/UNIX Programming Environment/g
        • The (g)lobal flag causes the regular expression search to continue to the end of the line, so all matches on the line will be replaced.
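
The effect of the g flag is easiest to see on a line with two matches. This sketch assumes a POSIX sed; the input lines are made up.

```shell
# Without g: only the first match on the line is replaced.
first_only=$(printf 'UPE and UPE\n' |
  sed -e 's/UPE/UNIX Programming Environment/')

# With g: every match on the line is replaced.
every_match=$(printf 'UPE and UPE\n' |
  sed -e 's/UPE/UNIX Programming Environment/g')

# An empty replacement deletes the matched text (the spaces around it stay).
removed=$(printf 'drop this-word here\n' | sed -e 's/this-word//g')
```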

  21. sed (5)
      • Address in the substitute command
        • The [address] says where the command does its work. It can be:
          • a numeric address
          • a special marker, like $, which denotes the last line of the file
          • a regular expression, to limit the command to certain lines
      • #1. We refer to explicit lines by number here:
        1s/BJ/Beijing/g        Do the substitution only at line 1.
        10,20s/BJ/Beijing/g    RANGE: from line 10 to line 20.
      • #2. We can mix the number with the other address markers:
        50,$s/BJ/Beijing/g     From line 50 to the end of the file.
        1,/^$/s/BJ/Beijing/g   From line 1 to the next empty line.
      • #3. Or use only regular expressions to delimit the line area:
        $ sed -e '/BEGIN/,/END/s/variable1/variable2/g' some.code
        Contents of some.code:
          BEGIN
          line1;
          line2
          variable1 = variable1 + variableX;
          END
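
Each address form can be checked by counting how many lines were actually changed. The sketch below assumes a POSIX sed and grep; the five-line inputs are invented and shorter than the line numbers used on the slide.

```shell
lines='BJ 1
BJ 2

BJ 4
BJ 5'

# Numeric address: only line 1 is changed.
line_one=$(printf '%s\n' "$lines" | sed -e '1s/BJ/Beijing/g' | grep -c 'Beijing')

# Number to $: from line 4 to the last line.
to_end=$(printf '%s\n' "$lines" | sed -e '4,$s/BJ/Beijing/g' | grep -c 'Beijing')

# Number to RE: from line 1 to the first empty line (line 3 here).
to_blank=$(printf '%s\n' "$lines" | sed -e '1,/^$/s/BJ/Beijing/g' | grep -c 'Beijing')

# RE to RE: only lines between BEGIN and END are touched.
code='variable1 = 0;
BEGIN
variable1 = variable1 + variableX;
END
variable1 = 9;'
ranged=$(printf '%s\n' "$code" | sed -e '/BEGIN/,/END/s/variable1/variable2/g')
inside=$(printf '%s\n' "$ranged" | sed -n -e '3p')
outside=$(printf '%s\n' "$ranged" | sed -n -e '1p')
```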

  22. sed (6)
      • Examples:
        $ sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/g'
          Input:  sweeping blade of flashing steel
          Output: sweeping flashing blade of steel
        $ sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/g'
          Input:  sweeping blade of flashing steel
          Output: evil sweeping blade of flashing steel
      • & : specifies the entire text matched by the RE
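
Both substitutions can be run on the sample line to confirm the outputs. In this sketch the first replacement ends in "of steel", so that the word steel, which is part of the match but outside both groups, is kept in the output. Variable names are illustrative.

```shell
line='sweeping blade of flashing steel'

# \1 holds "blade" and \2 holds "flashing"; the replacement swaps them.
reordered=$(printf '%s\n' "$line" |
  sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/g')

# & stands for the whole matched text, here the entire line.
prefixed=$(printf '%s\n' "$line" |
  sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/g')
```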

  23. Supplementary Readings
      • Regular Expression in Unix
        • In two parts. The first part is a simple and incomplete introduction to regular expressions in UNIX. The other is the RE specification excerpted from SUSv2, including the formal grammar for BRE and ERE.
      • External Filters, Programs and Commands in Unix, Mendel Cooper
