Text Processing

Searching Inside Files
• grep - searches for patterns within files
• grep [options] [[-e] pattern] filename [...]
• -n shows line numbers
• -A NUM prints each match and NUM lines after it
• -B NUM prints each match and NUM lines before it
• -C NUM prints each match and NUM lines before and after it (older versions of GNU grep defaulted NUM to 2 for -C; current versions require a value)
• -i performs a case-insensitive match
• -v inverts the match; prints the lines that do not match
• --color highlights the matched string in color
• The grep command in Linux searches one or more files for a pattern and, by default, prints the lines containing matches.
• This default behavior is shown in the following example, where grep returns the entire line that contains the pattern nobody:
$ grep nobody /etc/passwd
nobody:x:99:99:Nobody:/:

grep Examples
1. Consider the following very simple file:
$ cat file
mouse
cat
dog
bear
2. Print all lines that contain the letter e, including their line numbers:
$ grep -n e file
1:mouse
4:bear
3. Print all lines that do not contain the letter e, including their line numbers:
$ grep -nv e file
2:cat
3:dog
4. Print lines containing the pattern cat, plus one line preceding each match:
$ grep -B 1 cat file
mouse
cat
5. Print lines containing the case-insensitive pattern BEAR:
$ grep -i BEAR file
bear
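The context options -A and -C, and the combination of -i with -v, are not exercised by the examples above. The following is a minimal sketch that rebuilds the same four-line file in a scratch directory (the directory and variable names are assumptions made for illustration) and captures each result in a variable before printing it.

```shell
# Recreate the four-line sample file in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
printf 'mouse\ncat\ndog\nbear\n' > "$workdir/file"

# -A 1: each matching line plus one line after it.
after=$(grep -A 1 cat "$workdir/file")
echo "$after"

# -n with -C 1: one line of context on each side of the match.
# Matching lines are marked with ':' and context lines with '-'.
context=$(grep -n -C 1 dog "$workdir/file")
echo "$context"

# -i and -v together: lines that contain neither "e" nor "E".
no_e=$(grep -iv E "$workdir/file")
echo "$no_e"

rm -r "$workdir"
```

With -n, the separator after the line number distinguishes real matches (3:dog) from context lines (2-cat, 4-bear), which helps when the output is read by another program.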

The Streaming Editor
• sed - a [s]tream [ed]itor
• sed [options] 'script' [filename ...]
  • performs edits on a stream of text (usually the output of another program)
  • often used to automate edits on many files quickly
  • small and very efficient
  • -i switch for in-place edits with modern versions
• Example:
$ cat letter
I love Windows. Windows is my favorite operating system.
• Then sed works its magic, fixing the statement with a simple search-and-replace command:
$ sed 's/Windows/Linux/g' letter
I love Linux. Linux is my favorite operating system.
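The slide mentions in-place edits across many files without showing one. The sketch below is one way it might look; the file names and contents are made up for illustration, and the attached-suffix form -i.bak is used on the assumption that the installed sed is GNU or BSD sed, both of which accept it and keep a backup copy.

```shell
# Two made-up files in a scratch directory (hypothetical names and contents).
workdir=$(mktemp -d)
printf 'I love Windows. Windows is my favorite operating system.\n' > "$workdir/letter1"
printf 'Windows boots quickly.\n' > "$workdir/letter2"

# Without -i, sed writes the edited text to STDOUT and leaves the file untouched.
preview=$(sed 's/Windows/Linux/g' "$workdir/letter1")
echo "$preview"

# With -i.bak, sed rewrites each file in place and saves the original as NAME.bak.
sed -i.bak 's/Windows/Linux/g' "$workdir/letter1" "$workdir/letter2"

edited=$(cat "$workdir/letter2")
backup=$(cat "$workdir/letter2.bak")
echo "$edited"
echo "$backup"

rm -r "$workdir"
```

Running the command without -i first is a cheap way to preview a substitution before it touches any file.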

Text Processing with awk
• awk - pattern scanning and processing language
• $ awk -f awk_script_name /path/to/file
  • Turing-complete programming language
  • splits lines into fields (like cut)
  • regex pattern matching (like grep)
  • math operations, control statements, variables, IO...
awk Command Examples
• Print the lines that end with the string bash:
$ awk '/bash$/' /etc/passwd
. . . output omitted . . .
• Print the user name (field one) from each line that ends with the string bash:
$ awk -F: '/bash$/ {print $1}' /etc/passwd
. . . output omitted . . .
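Because the output of the examples above depends on the local /etc/passwd, here is a small sketch that runs on inline data instead. The four records are invented and only mimic the /etc/passwd layout; the second command illustrates the math, variable, and END-block features listed above.

```shell
# Made-up records in /etc/passwd layout (name:password:UID:GID:comment:home:shell).
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
bob:x:1001:1001:Bob:/home/bob:/bin/zsh'

# Pattern plus action: print user name and UID for lines ending in bash.
bash_users=$(printf '%s\n' "$sample" | awk -F: '/bash$/ {print $1, $3}')
echo "$bash_users"

# Numeric comparison on field 3, a counter variable, and an END block
# that prints the total once all input has been read.
regular=$(printf '%s\n' "$sample" | awk -F: '$3 >= 1000 {n++} END {print n}')
echo "$regular"
```

The first command prints root 0 and alice 1000; the second prints 2, the number of records whose UID is at least 1000.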

Replacing Text Characters
• tr - translates, squeezes & deletes characters
• tr [options] [set1] [set2]
  • translates one set of characters into another; commonly used to convert lower case into upper case
    $ tr a-z A-Z
  • squeeze collapses runs of a repeated character into one; commonly used on newlines, which removes blank lines
    $ tr -s '\n'
  • deletes a set of characters; commonly used to delete special characters
    $ tr -d '\000'
• To display the contents of the lower.txt file and convert all lower-case characters to upper case:
$ cat lower.txt | tr a-z A-Z
THESE ARE CHARACTERS THAT WERE TYPED INTO THIS FILE IN LOWER CASE
• To display the contents of the lower.txt file and delete all occurrences of the letter e:
$ cat lower.txt | tr -d e
ths ar charactrs that wr typd into this fil in lowr cas
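The three modes of tr can be compared side by side on small inline inputs. This is only a sketch: the input strings are invented, and deleting carriage returns is used as one plausible instance of "special characters".

```shell
# Translate: map lower-case letters to upper case.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')
echo "$upper"

# Squeeze: collapse each run of newlines into a single newline,
# so the two blank lines between "one" and "two" disappear.
squeezed=$(printf 'one\n\n\ntwo\n' | tr -s '\n')
echo "$squeezed"

# Delete: strip carriage returns from a DOS-style line ending.
clean=$(printf 'dos line\r\n' | tr -d '\r')
echo "$clean"
```

Quoting the sets ('a-z' rather than a-z) keeps the shell from expanding them as file name patterns if matching files happen to exist in the current directory.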

Text Sorting
• sort - sorts text
• sort [options] filename [...]
  • can sort on different columns
  • by default sorts in lexicographical order
    1, 2, 234, 265, 29, 3, 4, 5
  • can be told to sort numerically
    1, 2, 3, 4, 5, 29, 234, 265
  • can merge and sort multiple files simultaneously
  • can sort in reverse order
  • often used to prepare input for the uniq command
• -n  sort numerically
• -r  sort in reverse order
• -m  do not sort, only merge; this is faster but only works if the input is already sorted
• -t separator  use this as a column separator
• -k number  sort by this column number, counting the first column as 1
• -o filename  output to the specified file instead of STDOUT
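The difference between the default and numeric orderings, and the use of -t with -k, can be checked with a short sketch. The number list is a subset of the one on the slide; the name:number records are invented for illustration.

```shell
# A subset of the numbers from the slide, in arbitrary order.
nums='29
3
234
1
265
2'

# Default ordering compares the lines as text, character by character.
lexical=$(printf '%s\n' "$nums" | sort)
echo "$lexical"

# -n compares the lines as numbers.
numeric=$(printf '%s\n' "$nums" | sort -n)
echo "$numeric"

# Made-up name:number records; sort on column 2, numerically, largest first.
records='carol:503
alice:1001
bob:42'
by_number=$(printf '%s\n' "$records" | sort -t : -k 2 -n -r)
echo "$by_number"
```

The last command prints alice:1001, carol:503, bob:42. Without -n, the text comparison would place 503 ahead of 1001 in reverse order, because "5" sorts after "1".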

Duplicate Removal Utility
• uniq - removes duplicate lines from sorted text (only adjacent duplicates are detected, which is why the input is usually sorted first)
• uniq [options] [input_filename [output_filename]]
  • cleanly combines lists of overlapping but not identical information
  • -c prefixes each line of output with a number indicating the number of occurrences
  • taking this output and performing a reverse sort produces a sorted list based on number of occurrences
• -i  ignore case, i.e. b is equivalent to B
• -D  print all duplicated lines
• -d  only print duplicated lines, one per group
• -u  only print unique lines
• -c  prefix lines by the number of occurrences

uniq Command Examples
Consider the following example, which shows several of the features of the uniq command:
$ cat /tmp/file
mouse
mouse
Mouse
Cat
dog
cat
$ uniq -c /tmp/file
2 mouse
1 Mouse
1 Cat
1 dog
1 cat
$ uniq /tmp/file
mouse
Mouse
Cat
dog
cat
$ uniq -d /tmp/file
mouse
$ uniq -i /tmp/file
mouse
Cat
dog
cat
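The note about feeding uniq -c into a reverse sort can be illustrated with a short pipeline. This is a sketch on an invented, unsorted word list; the variable names are assumptions.

```shell
# A made-up, unsorted list in which the duplicates are not adjacent.
animals='mouse
cat
mouse
dog
cat
mouse'

# uniq on its own removes nothing here, because no two equal lines are adjacent.
unsorted=$(printf '%s\n' "$animals" | uniq)
echo "$unsorted"

# sort groups equal lines together, uniq -c counts each group, and
# sort -rn orders the groups by count, largest first.
counts=$(printf '%s\n' "$animals" | sort | uniq -c | sort -rn)
echo "$counts"
```

The second pipeline reports mouse 3 times, cat 2 times, and dog once. The amount of padding in front of each count differs between uniq implementations, so scripts that parse this output should split on whitespace rather than rely on fixed columns.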

Extracting Columns of Text
• cut - extracts selected fields from a line of text
• cut [options] [filename] [...]
  • can specify which fields you want to extract
  • uses tabs as default delimiter
  • -d option to specify a different delimiter
  • most useful on structured input (text with columns)
• -b range  select only the bytes in this range
• -c range  select only the characters in this range
• -f range  select only the fields in this range
• -d delimiter  use this delimiter instead of the default
• 6-39  from 6 to 39
• -12  from 1 (the beginning of the line) to 12
• 30-  from 30 to the end of the line
$ cat /etc/passwd | cut -d : -f 1,3
foo:501
bar:502
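The three range forms (N-M, -M, and N-) apply to -c and -f alike. The sketch below tries each on a single invented record in /etc/passwd layout; the record and variable names are assumptions for illustration.

```shell
# One made-up record in /etc/passwd layout.
line='foo:x:501:501:Foo User:/home/foo:/bin/bash'

# A field list: fields 1 and 3, colon-delimited.
fields=$(printf '%s\n' "$line" | cut -d : -f 1,3)
echo "$fields"

# An open-start range: characters 1 through 3.
first3=$(printf '%s\n' "$line" | cut -c -3)
echo "$first3"

# A closed range: characters 5 through 9.
middle=$(printf '%s\n' "$line" | cut -c 5-9)
echo "$middle"

# An open-end range: field 6 through the last field.
rest=$(printf '%s\n' "$line" | cut -d : -f 6-)
echo "$rest"
```

When several fields are selected, cut joins them with the same delimiter it split on, so the last command prints /home/foo:/bin/bash.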

Merging Multiple Files
• paste - merges text from multiple files to STDOUT
• paste [options] [filename] [...]
  • -s option to merge files serially
  • uses tabs as default delimiter
$ cat file1
A
B
C
D
$ cat file2
one
two
three
four
$ cat file3
1
2
3
4
$ paste file1 file2 file3
A one 1
B two 2
C three 3
D four 4
$ paste -s file1 file2 file3
A B C D
one two three four
1 2 3 4
