
Advanced Text Processing


bikita

Presentation Transcript


  1. Advanced Text Processing

  2. Lecture Overview • Character manipulation commands • cut, paste, tr • Line manipulation commands • sort, uniq, diff • Regular expressions and grep • Text replacement using sed

  3. Cutting Lines – cut • The cut command extracts sections from each line of the input file • Usage: cut options [files] • Command line options for cut: • -c – output only these characters • -f – output only these fields • -d – use this character as the field delimiter (the default is a tab)

  4. Cutting Lines – cut • With cut, exactly one of the selection options (-c or -f) must be specified • The value given with -c or -f can be: • A number – specifies a single character or field position, counted from 1 • A range – specifies a sequence of positions • A comma separated list – specifies multiple positions or ranges
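
As a rough sketch of the three forms of value, the commands below apply each one to a single record. The sample line and the colon delimiter are assumptions chosen for illustration, not part of the lecture's data.

```shell
# Hypothetical colon-separated record used only for this illustration.
line='alice:x:1001:Alice Smith:/home/alice'

# A single number: only the 1st character.
printf '%s\n' "$line" | cut -c1            # a

# A range: characters 1 through 5.
printf '%s\n' "$line" | cut -c1-5          # alice

# A comma-separated list of fields and ranges, with ':' as the delimiter.
printf '%s\n' "$line" | cut -d: -f1,3-4    # alice:1001:Alice Smith
```

Note that the selected fields are printed joined by the same delimiter that was used to split them.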

  5. cut – Examples • Given a file called 'my_phones.txt':
     ADAMS, Andrew 7583
     BARRETT, Bruce 6466
     BAYES, Ryan 6585
     BECK, Bill 6346
     BENNETT, Peter 7456
     GRAHAM, Linda 6141
     HARMER, Peter 7484
     MAKORTOFF, Peter 7328
     MEASDAY, David 6494
     NAKAMURA, Satoshi 6453
     REEVE, Shirley 7391
     ROSNER, David 6830

  6. cut – Examples
     head -3 my_phones.txt | cut -c3-16
       AMS, Andrew 75
       RRETT, Bruce 6
       YES, Ryan 6585
     head -3 my_phones.txt | cut -d" " -f2
       Andrew
       Bruce
       Ryan
     head -3 my_phones.txt | cut -c1-3,10,12,15-18
       ADAde7583
       BARBu 646
       BAYa 85

  7. Merging Files – paste • The paste command merges multiple files by concatenating corresponding lines • Usage: paste [options] [files] • Command line options for paste: • -d – provide a list of separator characters (the default separator is a tab) • -s – paste one file at a time instead of in parallel (each file becomes a single line)
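
A minimal sketch of the difference between the parallel and the serial (-s) mode. The file names letters.txt and digits.txt are made up for this illustration, and the commands run in a scratch directory.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1

# Two small hypothetical input files.
printf 'a\nb\nc\n' > letters.txt
printf '1\n2\n3\n' > digits.txt

# Parallel merge with ',' as the separator: one output line per input line.
paste -d, letters.txt digits.txt
# a,1
# b,2
# c,3

# Serial merge: each file becomes one output line.
paste -s -d, letters.txt digits.txt
# a,b,c
# 1,2,3
```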

  8. paste – Examples • Assume that we are given 3 input files:
     first.txt   last.txt    num.txt
     Andrew      ADAMS       7583
     Bruce       BARRETT     6466
     Ryan        BAYES       6585
     Bill        BECK        6346
     Peter       BENNETT     7456
     Linda       GRAHAM      6141
     Peter       HARMER      7484
     Peter       MAKORTOFF   7328
     David       MEASDAY     6494
     Satoshi     NAKAMURA    6453

  9. paste – Examples
     paste first.txt last.txt num.txt | head -3
       Andrew ADAMS 7583
       Bruce BARRETT 6466
       Ryan BAYES 6585
     paste -d" :" first.txt last.txt num.txt | head -3
       Andrew ADAMS:7583
       Bruce BARRETT:6466
       Ryan BAYES:6585
     paste -s last.txt first.txt num.txt | cut -f1-5,10
       ADAMS BARRETT BAYES BECK BENNETT NAKAMURA
       Andrew Bruce Ryan Bill Peter Satoshi
       7583 6466 6585 6346 7456 6453

  10. Translating Characters – tr • The tr command is used to translate between one character set and another • Usage: tr [options] set1 [set2] • Input is read from standard input and written to standard output (no files) • With no options, tr accepts two character sets with equal lengths, and replaces each character with the corresponding one

  11. Deleting or Squeezing Characters – tr • Sets contain literal characters, or character ranges, such as: 'a-z' or 'DEFa-z' • With command line options, tr can also be used to delete or squeeze characters • Command line options for tr: • -d – delete characters in set1 • -s – replace each run of a repeated character from the set with a single occurrence

  12. Defining Sets for tr • tr has some interpreted sequences to simplify the definition of sets: • [:alpha:] – all letters • [:digit:] – all digits • [:alnum:] – all letters and digits • [:space:] – all whitespace • [:punct:] – all punctuation characters • [CHAR*REPEAT] – REPEAT copies of CHAR • [CHAR*] – in set2, copies of CHAR until the length of set1 is reached
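
The following sketch tries a few of these sequences on a made-up input string. [:lower:] and [:upper:] are further standard class names that do not appear in the list above.

```shell
# Translate with named classes: lower case to upper case.
printf 'room 101\n' | tr '[:lower:]' '[:upper:]'   # ROOM 101

# Delete every digit (the space before the digits is kept).
printf 'room 101\n' | tr -d '[:digit:]'

# Pad set2 with [CHAR*]: every digit maps to '#'.
printf 'room 101\n' | tr '[:digit:]' '[#*]'        # room ###
```

The sets are quoted so that the shell does not try to interpret the square brackets as a filename pattern.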

  13. tr – Examples • Change lower case to capital, and replace the digits 6, 7, 8 with the letters x, y, z
      head -3 padded_phones.txt
        ADAMS Andrew 7583
        BARRETT Bruce 6466
        BAYES Ryan 6585
      head -3 padded_phones.txt | tr 'a-z678' 'A-Zxyz'
        ADAMS ANDREW y5z3
        BARRETT BRUCE x4xx
        BAYES RYAN x5z5

  14. tr – Examples
      • Squeeze sequences of spaces into one:
        head -3 padded_phones.txt | tr -s " "
          ADAMS Andrew 7583
          BARRETT Bruce 6466
          BAYES Ryan 6585
      • Delete spaces, and digits 7 and 8:
        head -3 padded_phones.txt | tr -d " 78"
          ADAMSAndrew53
          BARRETTBruce6466
          BAYESRyan655

  15. Reading from Standard Input • Many UNIX commands accept one or more input files listed in the command line (tr is one of the few that don't) • If no input file is given, these commands will read from the standard input • Alternatively, if the file list contains a '-', the standard input will be inserted in its place
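
A minimal sketch of the '-' convention, using cat and two made-up files (top.txt and bottom.txt are assumptions for this illustration):

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1
printf 'first line\n' > top.txt
printf 'last line\n' > bottom.txt

# cat reads top.txt, then standard input (the '-'), then bottom.txt.
printf 'from the pipe\n' | cat top.txt - bottom.txt
# first line
# from the pipe
# last line
```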

  16. Standard Input – Example
      cat last.txt | tr "A-Z" "a-z" | \
        paste -d"_" first.txt - number.txt | head -10
        Andrew_adams_7583
        Imelda_aguilar_6518
        Daniel_albers_7540
        Pierre_amaudruz_7567
        Friedhelm_ames_7581
        Willy_andersson_6238
        Andrei_andreyev_6491
        Jonathan_aoki_6820
        Donald_arseneau_6295
        Danny_ashery_6188

  17. Lecture Overview • Character manipulation commands • cut, paste, tr • Line manipulation commands • sort, uniq, diff • Regular expressions and grep • Text replacement using sed

  18. Sorting Files – sort • The sort command reorders the lines in a file (or files), and sends the result to the standard output • Usage: sort [options] [files] • Command line options for sort: • -f – ignore case (fold lowercase to uppercase) • -r – sort in reverse order • -n – sort in numeric order

  19. Sorting Files – sort • With no options given, the input is sorted in the collating order of the current locale, which is ASCII code order in the C locale • The sort command has many more options for selecting which fields to sort by, and for changing the way input is treated • As always, you should read the man pages for the full details
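
As one example of the field-selection options mentioned here, the sketch below uses -k, a standard sort option that the slides do not cover. The three records are made up, loosely modelled on the phone list used earlier.

```shell
# Hypothetical records: surname, first name, extension.
phones='BECK Bill 6346
ADAMS Andrew 7583
BAYES Ryan 6585'

# Sort numerically on the third whitespace-separated field only.
printf '%s\n' "$phones" | sort -k3,3n
# BECK Bill 6346
# BAYES Ryan 6585
# ADAMS Andrew 7583

# Force plain byte (ASCII) ordering regardless of the user's locale.
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# A
# B
# a
# b
```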

  20. sort – Example: Using Ignore-Case
      Input     sort      sort -f
      Bruce     Andrew    Andrew
      Ryan      Bruce     bill
      peter     Ryan      Bruce
      Andrew    bill      peter
      bill      peter     Ryan

  21. sort – Example: Sorting Numbers
      Input      sort       sort -n
      38         1256875    18
      18         18         38
      1256875    38         66
      66         575        575
      575        66         1256875

  22. Removing Duplicate Lines – uniq • The uniq command removes adjacent duplicate lines from its input file • If the input is sorted, this removes all duplicate lines • Command line options for uniq: • -i – ignore case • -c – prefix lines by the number of occurrences • -d – only print duplicate lines • -u – only print unique lines
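
A small sketch of why the input usually has to be sorted first; the three names are made-up sample data.

```shell
# 'Peter' appears twice, but the two copies are not adjacent.
names='Peter
David
Peter'

# Without sorting, nothing is adjacent, so nothing is removed.
printf '%s\n' "$names" | uniq
# Peter
# David
# Peter

# After sorting, the duplicates are adjacent and collapse into one line.
printf '%s\n' "$names" | sort | uniq
# David
# Peter

# Only the lines that occur more than once.
printf '%s\n' "$names" | sort | uniq -d
# Peter
```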

  23. uniq – Example
      Input     uniq      uniq -c
      Andrew    Andrew    1 Andrew
      Bill      Bill      1 Bill
      David     David     2 David
      David     Peter     3 Peter
      Peter     Ryan      1 Ryan
      Peter
      Peter
      Ryan

  24. uniq – Example
      Input     uniq -d   uniq -u
      Andrew    David     Andrew
      Bill      Peter     Bill
      David               Ryan
      David
      Peter
      Peter
      Peter
      Ryan

  25. Example – File Processing Using Pipes • Task – go over the book "War and Peace" and count the appearances of each word
      • Step 1: remove all punctuation marks
        cat war_and_peace.txt | tr -d '[:punct:]'
      • Step 2: put each word in a separate line
        cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n"
      • Step 3: sort words
        cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort

  26. Example – File Processing Using Pipes
      • Step 4: count appearances of each word
        cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c
      • Step 5: sort result by number of appearances
        cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr
      • Step 6: write output to file
        cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr > words.txt
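
The six steps can be tried on a short sentence instead of the whole book. This is a sketch with two deliberate deviations from the slides: the pipeline is wrapped in a function with the made-up name word_freq that reads standard input, and step 2 uses tr -s so that repeated spaces do not produce empty "words".

```shell
# word_freq is a hypothetical helper name; it reads text on standard input.
word_freq() {
    tr -d '[:punct:]' |      # step 1: drop punctuation
        tr -s ' ' '\n' |     # step 2: one word per line, squeezing repeats
        sort |               # step 3: bring equal words together
        uniq -c |            # step 4: count each word
        sort -nr             # step 5: most frequent first
}

# "to" and "be" each appear twice, "or" and "not" once.
printf 'to be, or not  to be.\n' | word_freq
```

Redirecting the output of word_freq to a file corresponds to step 6. The pipeline is still case-sensitive, so "The" and "the" would be counted separately.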

  27. Comparing Text Files – diff • The diff command takes two input files, and compares them • The output contains only the lines that differ, with their line numbers • Command line options for diff: • -i – ignore case • -b – ignore changes in amount of white space • -B – ignore insertion or deletion of blank lines

  28. diff – Examples (the two BAYES lines differ only in the amount of white space)
      First file            Second file
      ADAMS Andrew 7583     ADAMS Andrew 7583
      BARRETT Bruce 6466    BARRETT Bruce 3333
      BAYES Ryan 6585       BAYES Ryan 6585
      BECK Bill 6346        BECK Bill 6346
      BENNETT Peter 7456    Bennett peter 7456
      Output of diff:
        2,3c2,3
        < BARRETT Bruce 6466
        < BAYES Ryan 6585
        ---
        > BARRETT Bruce 3333
        > BAYES Ryan 6585
        5c5
        < BENNETT Peter 7456
        ---
        > Bennett peter 7456

  29. diff – Examples
      First file            Second file
      ADAMS Andrew 7583     ADAMS Andrew 7583
      BARRETT Bruce 6466    BARRETT Bruce 3333
      BAYES Ryan 6585       BAYES Ryan 6585
      BECK Bill 6346        BECK Bill 6346
      BENNETT Peter 7456    Bennett peter 7456
      Output of diff -b:
        2c2
        < BARRETT Bruce 6466
        ---
        > BARRETT Bruce 3333
        5c5
        < BENNETT Peter 7456
        ---
        > Bennett peter 7456
      Output of diff -bi:
        2c2
        < BARRETT Bruce 6466
        ---
        > BARRETT Bruce 3333
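
Because the transcript does not preserve runs of spaces, here is a small sketch in which the white-space difference is explicit. The file names old.txt and new.txt and their two-line contents are assumptions for this illustration.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1

# Line 1 differs only in spacing, line 2 only in letter case.
printf 'BAYES Ryan 6585\nBENNETT Peter 7456\n' > old.txt
printf 'BAYES   Ryan   6585\nBennett peter 7456\n' > new.txt

diff old.txt new.txt      # reports both lines as changed (1,2c1,2)
diff -b old.txt new.txt   # reports only line 2 (2c2): spacing is ignored
diff -bi old.txt new.txt  # reports nothing: case is ignored as well
```

diff exits with status 0 when it finds no differences and 1 when it finds some, which makes it usable in shell conditions.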

  30. Maintaining Output Consistency • During program development, assume that we have reached the correct output • We want to verify that it does not change
      • Create reference output file:
        prog > prog.out
      • After changing the program, compare output:
        prog | diff - prog.out
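
A minimal sketch of this check as a script. Here prog is a stand-in shell function with made-up output, not a real program.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1

# prog is a stand-in for the program under development.
prog() { printf 'total: 42\n'; }

# Record the known-good output once.
prog > prog.out

# Later: diff prints nothing and exits with status 0 while the output is unchanged.
if prog | diff - prog.out; then
    echo "output unchanged"
else
    echo "output changed"
fi
```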

  31. Lecture Overview • Character manipulation commands • cut, paste, tr • Line manipulation commands • sort, uniq, diff • Regular expressions and grep • Text replacement using sed

  32. Searching For Matching Patterns – grep • The grep command searches files for patterns, and prints matching lines • Usage: grep [options] regexp [files] • The mandatory regexp argument defines a regular expression • A regular expression is a formula for matching strings that follow some pattern

  33. Searching For Matching Patterns – grep • The simplest regular expression is just a sequence of characters • This regular expression matches only a single string – itself • The following command prints all lines from any of the files that contain word: grep word files

  34. Searching For Matching Patterns – grep • The power of grep lies in using more sophisticated regular expressions • Command line options for grep: • -v – print all lines that don't match • -c – print only a count of matched lines • -n – print line numbers • -h – don't print file names (for multiple files) • -l – print file name but not matching line
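
The options can be tried on two tiny files. This is a sketch; fruit.txt and veg.txt and their contents are made up for the illustration.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1
printf 'apple\nbanana\ncherry\n' > fruit.txt
printf 'carrot\npea\n' > veg.txt

grep -c 'an' fruit.txt            # 1 (number of matching lines)
grep -n 'an' fruit.txt            # 2:banana
grep -v 'an' fruit.txt            # apple and cherry
grep -l 'pea' fruit.txt veg.txt   # veg.txt
grep -h 'a' fruit.txt veg.txt     # apple, banana, carrot, pea, with no file-name prefix
```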

  35. Regular Expressions • Regular expressions are a powerful tool for searching and selecting text • On UNIX they first appeared in the ed editor, from which grep takes its name (g/re/p); their roots go further back, to automata theory • They have since been copied into many other tools and languages such as awk, sed, perl and Java

  36. Regular Expressions vs. Filename Expansion • Note that regular expressions are different from filename expansion • Filename expansion uses some regular expression concepts and symbols, but: • Filename expansion is done by the shell • Regular expressions are passed as arguments to specific commands or utilities
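
A short sketch of the distinction, run in a scratch directory with three made-up file names:

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1
touch main.c util.h notes.txt

# Filename expansion: the shell replaces *.c before echo ever runs.
echo *.c                 # main.c

# Regular expression: quoted so the shell leaves it alone; grep interprets it.
ls | grep '\.[ch]$'      # main.c and util.h
```

Quoting the regular expression matters: an unquoted pattern containing '*' or '[' could be expanded by the shell before grep sees it.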

  37. Matching a Single Character • A period (.) matches any single character • For example:
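
A small sketch with a made-up word list (an assumption for illustration, not the lecture's own example):

```shell
# Four sample words, one per line.
words='cat
cot
ct
coat'

# 'c.t' needs exactly one character between c and t.
printf '%s\n' "$words" | grep 'c.t'
# cat
# cot
```

Neither "ct" (nothing between the letters) nor "coat" (two characters between them) matches.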

  38. Matching a Character Class • Square brackets ([]) match any single character within the brackets • If the first character following the left bracket is a '^', the expression matches any character not in the brackets • A '-' can be used to indicate a range, such as: [a-z]
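
The three bracket forms, sketched on a made-up list of four strings:

```shell
# Four sample strings, one per line.
samples='bag
big
bug
b4g'

printf '%s\n' "$samples" | grep 'b[ai]g'     # bag and big
printf '%s\n' "$samples" | grep 'b[^ai]g'    # bug and b4g
printf '%s\n' "$samples" | grep 'b[0-9]g'    # b4g
```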

  39. Matching a Character Class

  40. Matching a Character Class • The same predefined character classes used for tr can also be used here • For portability reasons, [:alpha:] is generally preferable to [A-Za-z] • Note: the brackets are part of the symbolic names, and must be included in addition to the enclosing brackets, i.e. [[:alpha:]]
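
A brief sketch of the double-bracket notation on made-up input:

```shell
# Three sample strings, one per line.
tokens='abc
123
a1'

# Any line containing a digit.
printf '%s\n' "$tokens" | grep '[[:digit:]]'               # 123 and a1

# A letter immediately followed by a digit.
printf '%s\n' "$tokens" | grep '[[:alpha:]][[:digit:]]'    # a1
```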

  41. Matching Repetitions • An asterisk (*) represents zero or more matches of the regular expression it follows
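
For instance, in the sketch below (made-up input), 'ab*c' means an 'a', then zero or more 'b' characters, then a 'c':

```shell
# Four sample strings, one per line.
strings='ac
abc
abbbc
adc'

printf '%s\n' "$strings" | grep 'ab*c'
# ac
# abc
# abbbc
```

"adc" does not match because 'd' is neither a 'b' nor the 'c' that must follow.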

  42. Matching Special Characters • Sometimes we want to literally match a character that has a special meaning, such as '*' or '[' • There are two ways to do that: • Precede the character with a '\' • Use square brackets – most characters inside, including '*', '.' and '[', are taken literally
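
Both ways of matching a literal '*', sketched on two made-up lines, with the unescaped pattern for comparison:

```shell
# One line with a literal asterisk, one without.
exprs='2*3
223'

printf '%s\n' "$exprs" | grep '2\*3'    # 2*3
printf '%s\n' "$exprs" | grep '2[*]3'   # 2*3
printf '%s\n' "$exprs" | grep '2*3'     # both lines: here * means "zero or more 2s"
```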

  43. Matching Special Characters

  44. Matching the Beginning or the End of a Line • A regular expression that begins with a caret (^) can match a string only at the beginning of a line • Similarly, a regular expression that ends with a dollar sign ($) can match a string only at the end of a line
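
A sketch of both anchors on three made-up lines:

```shell
# The word "bag" at the start, at the end, and as a whole line.
lines='bag of tricks
grab bag
bag'

printf '%s\n' "$lines" | grep '^bag'    # "bag of tricks" and "bag"
printf '%s\n' "$lines" | grep 'bag$'    # "grab bag" and "bag"
printf '%s\n' "$lines" | grep '^bag$'   # only the line that is exactly "bag"
```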

  45. Matching the Beginning or the End of a Line

  46. Using Regular Expressions with grep – Examples
      cat bugs.txt
        big boy
        bad bug
        bag
        bigger bag
        better
        boogie nights
      grep 'b.g' bugs.txt
        big boy
        bad bug
        bag
        bigger bag
      grep 'b.*g.' bugs.txt
        big boy
        bigger bag
        boogie nights
      grep 'b.g.' bugs.txt
        big boy
        bigger bag

  47. Using Regular Expressions with grep – Examples
      cat f.txt
        ADAMS,
        Andrew
        7583
        BARRETT,
        Bruce
        6466
        BAYES,
        Ryan
        6585
      grep '[[:alpha:]],' f.txt
        ADAMS,
        BARRETT,
        BAYES,
      grep '^[^[:alpha:]0-3]*$' f.txt
        6466
        6585
      grep '^[C-Z][[:lower:]]*$' f.txt
        Ryan

  48. Pipes and Regular Expressions – Example • Task: create a file containing the names of all source files in the current directory, sorted by the number of lines in each file
      • Step 1: count lines in each file
        wc -l *
      • Step 2: leave only '.c' and '.h' files
        wc -l * | grep '\.[ch]$'
      • Step 3: sort in reverse order (largest first)
        wc -l * | grep '\.[ch]$' | sort -nr

  49. Pipes and Regular Expressions – Example
      • Step 4: squeeze leading spaces (into one)
        wc -l * | grep '\.[ch]$' | sort -nr | tr -s " "
      • Step 5: remove number field
        wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3
      • Step 6: write output to file
        wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3 > sorted_source_files.txt

  50. Which grep to Use? • In addition to grep itself, there are two more variants of it: egrep and fgrep (equivalent to grep -E and grep -F, the forms that POSIX specifies) • Use grep for most standard text finding tasks • Use egrep for complex tasks, where basic regular expressions are just not enough, and you need to use extended regular expressions • Use fgrep when only fixed strings are searched, and speed is of the essence
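
A sketch comparing the three behaviours on made-up input, written with the -E and -F options rather than the egrep and fgrep command names:

```shell
# Two ordinary words and one string containing a literal dot.
animals='cat
dog
c.t'

# Extended regular expression: '|' means "either side".
printf '%s\n' "$animals" | grep -E 'cat|dog'   # cat and dog

# Fixed string: the '.' is an ordinary character.
printf '%s\n' "$animals" | grep -F 'c.t'       # c.t only

# Basic regular expression: the '.' matches any character.
printf '%s\n' "$animals" | grep 'c.t'          # cat and c.t
```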
