awk

• awk is a file-processing programming language.
• It makes text manipulation tasks easy to perform.
• It is used for:
  • Generating reports
  • Matching patterns
  • Validating data
  • Filtering data for transmission
• An awk program is a sequence of statements of the form
    pattern { action }
• awk scans the input lines in order, one at a time.
• It tests each pattern against the line, and if the pattern matches, the corresponding action is performed.
• Each statement of the awk program is evaluated for each line of input.

awk
    BEGIN               Executed once, before any input is read
    main control loop   Executed for each line of input (fed by the input lines)
    END                 Executed once, after all input is read

awk programming model
• An awk program consists of a main input loop. You don't write the loop yourself; awk supplies it and the main program runs inside it.
• The main routine reads one line of input from a file and makes it available for processing. The main loop executes as many times as there are lines in the input.
• Preprocessing before the main loop and postprocessing after the loop are done with BEGIN and END.
• The routine is applied to each input line, one line at a time.
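The three phases of the model can be sketched with a tiny program. The three-line input below is made up for illustration; the point is only that BEGIN runs before the first line, the unlabeled rule runs once per line, and END runs after the last line.

```shell
# Hypothetical input: three lines, one number each.
out=$(printf '10\n20\n30\n' | awk '
BEGIN { print "start" }                 # runs once, before any input is read
      { sum += $1 }                     # runs once for every input line
END   { print NR " lines, sum " sum }   # runs once, after the last line
')
echo "$out"
```

NR still holds the number of the last record inside END, which is why it can be used there as a line count.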

awk
Two ways to present the program to awk:
• Make the program the first argument on the command line, if the program is short.
    awk 'program' [filename ...]
  Examples (a file name of - means standard input):
    % awk '/Smith/ {print}' people
    % awk '/Smith/ {print}' -
• Put the program in a separate file and tell awk to apply that program file to the input files.
  Example:
    awk -f awkprog file1 file2

Keywords and some important functions
    BEGIN, END, FILENAME, FS, NF, NR, OFS, ORS, OFMT, RS
    break, close, continue, exit, exp, for, getline, if, in, index, int, length,
    log, next, number, print, printf, split, sprintf, sqrt, string, substr, while

Operators
    Assignment, compound assignment, arithmetic, relational, logical and regular expression matching operators.

Some Regular Expression Metacharacters
• \ - escapes any metacharacter that follows, including itself.
• ^ - anchors the following regular expression to the beginning of the string.
• $ - anchors the preceding regular expression to the end of the string.
• . (dot) - matches any character, including newline.
• [...] - matches any one of the class of characters enclosed between the brackets.
• [^...] - a circumflex as the first character inside [] reverses the match to all characters except those listed in the [].
• r1|r2 - placed between two regular expressions r1 and r2, allows either of them to be matched.
• r* - matches any number (including zero) of occurrences of the regular expression that precedes it.
• r+ - matches one or more occurrences of the regular expression that precedes it.
• r? - matches 0 or 1 occurrences of the regular expression that precedes it.
• () - groups regular expressions.
• \{n,m\} - matches a range of occurrences of the single character that precedes it: any number of occurrences between n and m. May not be available in very old versions. (awk uses extended regular expressions, where the interval is written {n,m}; \{n,m\} is the grep and sed spelling.)
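A few of these metacharacters can be tried directly from the shell. This is a small sketch on made-up words, not part of the original slides; each rule prints the line together with a label when its pattern matches.

```shell
# Hypothetical test words, one per line.
out=$(printf 'color\ncolour\ncolouur\nabab\na1\n' | awk '
/^colou?r$/ { print $0 ": optional u" }      # ? : zero or one u
/^(ab)+$/   { print $0 ": repeated group" }  # () groups, + repeats the group
/[0-9]$/    { print $0 ": ends in a digit" } # [] class, $ anchors at the end
')
echo "$out"
```

The word colouur prints nothing because the anchored pattern allows at most one u.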

Writing Regular Expressions
• Writing regular expressions involves three steps:
  • Specification: knowing what you want to match.
  • Coding: writing an expression to describe what you want to match.
  • Testing: testing the pattern to see what it matches.
• Testing your regular expression may result in:
  • Hits: lines you wanted to match.
  • Misses: lines you did not want to match.
  • Omissions: lines you wanted to match but did not.
  • False alarms: lines you matched but did not want to match.
• Eliminate false alarms by limiting the matches, and capture the omissions by expanding the possible matches.

Some Examples
What do they match?
• [a-zA-Z?+!] -
• [a-zA-Z][?+!] -
• [-+*/] -
• AB\{2,4\}C -
• UNIX|LINUX -
• Compan(y|ies) -
• [0-9][0-9]*\.\{2,\}[a-z][a-z]* -

Multiline Records
• FS: the default value is a single space. FS can be set to a single character. When more than one character is given, it is interpreted as a regular expression.
• RS: the default value is a newline. The default value can be changed.
• Example:
    BEGIN { RS = ""; FS = "\n" }    # record separator is a blank line
    {
        print "Name ", $1
        print "Zip ", $NF
    }
  Input file:
    John Smith
    235 Alameda
    Santa Clara CA
    95053
  Output:
    Name  John Smith
    Zip  95053

Examples
    % cat prog1.awk
    # test for integer, string or a blank line
    # + metacharacter: one or more
    /[0-9]+/    { print $0 ": An integer" }
    /[A-Za-z]+/ { print $0 ": A String" }
    /^$/        { print "A Blank line" }

    % cat testfile
    1234
    This is a test
    789 Hello
    (followed by two blank lines)

    % awk -f prog1.awk testfile
    1234: An integer
    This is a test: A String
    789 Hello: An integer
    789 Hello: A String
    A Blank line
    A Blank line

Examples
    % cat prog2.awk
    BEGIN { FS = "," }    # comma is the field separator
    {
        print $1
        print $2
        print $3
    }

    % cat prog3.awk
    BEGIN { FS = "," }
    /CA/      { print $1 "," $3 }    # matches CA anywhere in the record
    $3 ~ /CA/ { print $1 "," $3 }    # matches CA in the third field only

    % cat testfile2
    John Smith, Santa Clara, CA
    Mary Jones, Red Bank, NJ
    Susan Wang, Denver, CO

    % awk -f prog2.awk testfile2
    What is the output?

If more than one character is specified as the field separator, it is interpreted as a regular expression.
Examples:
    FS = "\t+"
    How many fields are in the following line?  IJK\t\tXYZ
    FS = "[':,\t]"

Examples
    $ cat prog4.awk
    BEGIN { printf("Scores\n") }
    { print $0; total = total + $2 }
    # NR: number of input records that have been read
    END { print "Average score is ", total / NR }

    $ cat scores
    Smith 80
    Jones 97
    Chan 95
    King 78

    $ awk -f prog4.awk scores
    Scores
    Smith 80
    Jones 97
    Chan 95
    King 78
    Average score is  87.5

Passing Parameters into awk script
• Parameters can be passed from the command line into an awk script. One or more variables are set on the command line and can then be accessed from the awk script.
• Parameters passed in this way are not available in BEGIN. They become available to the script only when awk reaches them among the command-line operands, just before it opens the input file that follows them.
• Example: param.awk
    BEGIN { print "Passing Parameters" }
    {
        print "arg1 = ", arg1
        print "arg2 = ", arg2
    }
  From the command line, invoke
    awk -f param.awk arg1=100 arg2=200 datafile
• A shell script's command-line arguments can be passed in as follows. Assume that the following line is in a shell script called awktest.sh:
    awk -f param.awk arg1="$1" arg2="$2" datafile
  $1 and $2 are the positional parameters given as arguments on the command line when awktest.sh is invoked as
    awktest.sh 100 200
  Each assignment must be a separate argument. Quoting both together as "arg1=$1 arg2=$2" would pass a single operand and assign the whole string "100 arg2=200" to arg1.
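The timing difference can be seen side by side with the standard -v option, which assigns a variable before BEGIN runs. This is a sketch; the variable names early and late are made up for illustration.

```shell
# early is set with -v (visible in BEGIN); late is a command-line operand
# assignment (visible only once awk reaches it, before reading "-", i.e. stdin).
out=$(printf 'x\n' | awk -v early=1 '
BEGIN { print "BEGIN early=" early " late=" late }
      { print "main early=" early " late=" late }
' late=2 -)
echo "$out"
```

In BEGIN the value of late is still empty, while the main rule sees both values.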

Patterns Using Regular Expressions
    # print lines ending with ia
    awk '/ia$/ {print}' countries

    # print countries ending with ia
    awk '$1 ~ /ia$/ {print $1}' countries

    # select lines where the third field matches Asia
    # or begins with North or South
    $3 ~ /Asia|^North|^South/ {print}

    # pattern ranges
    /Russia/,/Brazil/ {print}

    # replace USA by United States
    /USA/ {$1 = "United States"; print}

    % cat countries
    Australia 3000 Australia
    USA 3615 North America
    Argentina 1072 South America
    India 1270 Asia
    Russia 8650 Asia
    China 3692 Asia
    Brazil 3286 South America

Associative Arrays
• Arrays in awk are associative arrays, where the index can be a number or a string.
• The order in which the items are retrieved by a for (item in array) loop is unspecified and may look random.
    % cat prog6.awk
    { x[$1] = $2 }
    END {
        for (item in x)
            print item, x[item]
    }

    % awk -f prog6.awk scores
    Jones 89
    Smith 65
    Chen 100
    King 120
    Lowel 200
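A common use of string subscripts is counting. The sketch below uses made-up input words and pipes the result through sort, because the traversal order of for (w in count) cannot be relied on.

```shell
# Hypothetical input: one word per line.
out=$(printf 'apple\npear\napple\nfig\npear\napple\n' | awk '
{ count[$1]++ }                             # element is created on first use
END { for (w in count) print w, count[w] }  # order of w is unspecified
' | sort)
echo "$out"
```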

Example: Computing Grades
    % cat prog7.awk
    BEGIN { OFS = "\t" }
    {   # main loop applied to all input lines
        total = 0
        for (i = 2; i <= NF; ++i)
            total += $i
        average = total / (NF - 1)
        # store each student average
        stAvg[NR] = average
        avgByName[$1] = average
        # determine the letter grade
        if (average >= 90) grade = "A"
        else if (average >= 80) grade = "B"
        else if (average >= 70) grade = "C"
        else grade = "F"
        # store a count of the letter grades
        ++classGrade[grade]
        # print the per-student line that appears in the sample output
        print $1, average, grade
    }

    # class statistics
    END {
        # calculate class average
        for (x = 1; x <= NR; x++)
            classTotal += stAvg[x]
        classAve = classTotal / NR
        print "Class Average = " classAve
        # determine how many above or below average
        # print number of students per letter grade
        print "Enter name "
        getline name < "-"
        print name ": " avgByName[name]
        for (letterGrade in classGrade)
            print letterGrade ":" classGrade[letterGrade] | "sort"
    }

    % cat grades
    Smith 90 80 50
    Jones 20 0 70
    Wang 67 90 80
    Wolf 70 100 90
    Pratt 90 88 92

    % awk -f prog7.awk grades
    Smith   73.3333 C
    Jones   30      F
    Wang    79      C
    Wolf    86.6667 B
    Pratt   90      A
    Class Average = 71.8
    Enter name
    Smith
    Smith: 73.3333
    A:1
    B:1
    C:2
    F:1

Multidimensional arrays
    # awk offers a syntax for subscripts that simulates a reference to
    # multidimensional arrays
    {
        for (i = 1; i <= NF; ++i)
            table[NR, i] = $i
    }
    END {
        for (k = 1; k <= NR; ++k) {
            for (i = 1; i <= 4; ++i) {
                total += table[k, i]
                printf("%d ", table[k, i])
            }
            printf("\n")
        }
        print "Total = " total
    }
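Behind the table[NR, i] syntax there is still a one-dimensional associative array: awk joins the subscripts into one string using the built-in variable SUBSEP. The sketch below, on a made-up 2x2 input, shows the membership test for a pair of subscripts and rebuilds the same key by hand.

```shell
# Hypothetical 2x2 input table.
out=$(printf '1 2\n3 4\n' | awk '
{ for (i = 1; i <= NF; i++) table[NR, i] = $i }
END {
    if ((2, 1) in table)                       # membership test with two subscripts
        print "table[2,1] = " table[2, 1]
    n = split("2" SUBSEP "1", parts, SUBSEP)   # the key is really "2" SUBSEP "1"
    print n, parts[1], parts[2], table["2" SUBSEP "1"]
}')
echo "$out"
```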

next and getline
• next causes the next input line to be read.
• The next statement passes control back to the top of the script.
    % cat prog9.awk
    NF == 2 { next }    # skips to the next record and starts the program
                        # from the beginning
    /USA/ { $4 = "United States Of America"; print $0 }
    { print NR }

    % cat countries
    Japan Asia
    2: UK Europe
    3: Brazil S.America
    Egypt Africa
    5: USA N.America
    Canada N.America

    % awk -f prog9.awk countries
    2
    3
    5: USA N.America United States Of America
    5

Using getline
    # using the getline function to read the next line of input
    /^\/+/ {
        getline
        print $1
    }

    # get input from the command line
    BEGIN {
        printf "Enter your name: "
        getline name < "-"
        print name
    }

    /Smith/ {
        getline
        print $1
    }

    # reading from a pipe using getline
    {
        while (("who" | getline) > 0)    # > 0 stops at end of output or on error
            terminal[$1] = $2
    }
    END {
        for (item in terminal)
            print item, terminal[item]
    }

Example - A word lookup
    # reads a file with acronyms and their expansions,
    # handles users' queries
    BEGIN {
        FS = "\t"; OFS = "\t"
        printf("Enter a word for lookup: ")
    }
    # load the file named acronyms
    FILENAME == "acronyms" {
        wordList[$1] = $2
        next
    }

Example - A word lookup (cont)
    # scan for command to exit program
    $0 ~ /^(quit|[qQ]|[Xx]|exit)$/ { exit }
    # process any non-empty line
    $0 != "" {
        if ($0 in wordList)
            print wordList[$0]
        else
            print $0 " not found"
    }
    # prompt user to enter another word
    { printf("Enter another word or q|Q to quit: ") }

  The program is run with the input arguments acronyms - (the acronyms file first, then standard input for the user's queries).

split()
• split() is a built-in function that can parse any string into elements of an array.
• Syntax:
    numberOfElements = split(string, array, separator)
  If no separator is specified, FS is used as the separator.
• Example:
    {
        n = split($0, days)
        for (j = 1; j <= n; ++j)
            print days[j]
    }
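When the third argument is given, it overrides FS for that one call. A minimal sketch on a made-up date string:

```shell
# Hypothetical input: a date in year-month-day form.
out=$(echo '2024-03-15' | awk '{
    n = split($0, part, "-")     # "-" is used instead of FS for this call only
    print n
    for (j = 1; j <= n; j++)     # elements are numbered from 1 to n
        print j, part[j]
}')
echo "$out"
```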

next
• The next statement forces awk to immediately stop processing the current record and go on to the next record. The rest of the current rule's action is not executed either.
• If you think of the main body of an awk program as a loop, the next statement is analogous to a continue statement: it skips to the end of the body of this implicit loop and executes the increment (which reads another record).
• Note: the getline function causes awk to read the next record immediately, but it does not alter the flow of control in any way. So the rest of the current action executes with a new input record.
• For example, if your awk program works only on records with four fields, and you don't want it to fail when given bad input, you might put a rule near the beginning of the program that tests NF and calls next for any record that does not have four fields.
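The four-field guard mentioned in the last bullet could look like the following. This is only a sketch on made-up input, and the skipped counter is a hypothetical addition for the demonstration.

```shell
# Hypothetical input: the second line has only two fields.
out=$(printf 'a b c d\nbad line\ne f g h\n' | awk '
NF != 4 { skipped++; next }   # stop here; later rules never see this record
        { print $4 }          # safe: every record reaching this rule has 4 fields
END     { print "skipped " skipped }
')
echo "$out"
```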

Example:
    FILENAME == "names.txt" {
        count += 1
        next
    }
    { print $0 }
    END { print count }
    # Counts the lines of the file names.txt without printing them;
    # lines from any other input file are printed.


getline
• getline is used to read the next line of input from the current input file, from a specified file, or from a pipe.
• The getline command can be used without arguments to read input from the current input file.
• It reads the next input record and splits it up into fields. This is useful if you've finished processing the current record, but you want to continue processing from the next record.
• Note: the new value of $0 is used in testing the patterns of any subsequent rules. The original value of $0 that triggered the rule which executed getline is lost.

Example:
    /^[0-9]+/ { print "Line number ", NR, ":", "starts with a number" }
    /^\/\*/   { getline }
    { print NR ":" $0 }
  Input:
    This is a cat
    1234 a cat
    A test
    /* A comment line */
    990 is the score
  Output:
    1:This is a cat
    Line number  2 : starts with a number
    2:1234 a cat
    3:A test
    5:990 is the score

getline
• Using getline to read a line into a variable
• You can use `getline variable' to read the next record from awk's input into the variable variable. No other processing is done.
• For example, suppose the next line is a comment, or a special string, and you want to read it without triggering any rules. This form of getline allows you to read that line and store it in a variable, so that the main read-a-line-and-check-each-rule loop of awk never sees it.
• The getline command used in this way sets only the variable itself and the variables NR and FNR.
• The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.
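The effect on $0 and NR can be observed with a short sketch. The title/body input below is made up; each title line pulls the following line into a variable named body, which is an illustrative name only.

```shell
# Hypothetical input: title lines, each followed by one body line.
out=$(printf 'title one\nbody one\ntitle two\nbody two\n' | awk '
/^title/ {
    if ((getline body) > 0)                 # $0 and NF still hold the title line
        print NR ": " $2 " -> " body        # NR has advanced to the body line
}')
echo "$out"
```

The body lines never reach the pattern matching, so no rule is triggered for them.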

What is the output of the following program on the input file given below?
    /^[A-Za-z]/ {
        getline tmp
        print tmp
    }
    { print $0 }
  Input file:
    ABCD
    1234
    EFGH
    5678

getline
• Using `getline < file' to read the next record from the file file.
• Here file is a string-valued expression that specifies the file name. `< file' is called a redirection since it directs input to come from a different place.
• For example, the following program reads its input record from the file `input.dat' when it encounters a first field with a value equal to 10 in the current input file.
    awk '{
        if ($1 == 10) {
            getline < "input.dat"
            print
        } else
            print
    }'
• Since the main input stream is not used, the values of NR and FNR are not changed. But the record read is split into fields in the normal manner, so the values of $0 and the other fields are changed. So is the value of NF.
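The claim that NR stays put while $0 changes can be checked with a temporary file. This is a sketch: the file contents and the variable name f are made up for the demonstration, and the file is created with mktemp rather than the input.dat of the slide.

```shell
tmp=$(mktemp)
printf 'from file\n' > "$tmp"
# Hypothetical main input: the second line has 10 in its first field.
out=$(printf 'a\n10\nb\n' | awk -v f="$tmp" '
$1 == 10 { if ((getline < f) <= 0) print "cannot read " f }  # replaces $0 and NF
         { print NR ": " $0 }                                # NR is untouched
')
rm -f "$tmp"
echo "$out"
```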

Using getline to read the output of a command from a pipe:
• You can pipe the output of a command into getline, using `command | getline'. In this case, the string command is run as a shell command and its output is piped into awk to be used as input. This form of getline reads one record at a time from the pipe.
• For example, the following program copies its input to its output, except for lines that begin with `@execute', which are replaced by the output produced by running the rest of the line as a shell command:
    awk '{
        if ($1 == "@execute") {
            tmp = substr($0, 10)
            while ((tmp | getline) > 0)
                print
            close(tmp)
        } else
            print
    }' input
• The close function is called to ensure that if two identical `@execute' lines appear in the input, the command is run for each one.

close()
• close() allows you to close open files and pipes.
• There may be a limit on the number of files and pipes that can be open at the same time.
• Closing a pipe allows you to run the same command twice.
• Example:
    close("who")
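The "run the same command twice" point can be sketched as follows. The command echo once is a stand-in chosen so the result is predictable; the counter n records how many lines were read in total.

```shell
out=$(awk 'BEGIN {
    cmd = "echo once"
    while ((cmd | getline line) > 0) n++
    while ((cmd | getline line) > 0) n++   # pipe is already at end of output
    print "before close: " n
    close(cmd)
    while ((cmd | getline line) > 0) n++   # same string again: command is rerun
    print "after close: " n
}')
echo "$out"
```

The argument to close() must be the same string that was used to open the pipe.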

What is the output for the given input file?
    Jsmith
    Mjones
    @execute who
    TWolf

Using getline to read the output of a command from a pipe into a variable:
• When you use `command | getline var', the output of the command command is sent through a pipe to getline and into the variable var.
• Example:
    awk 'BEGIN {
        "date" | getline current_time
        close("date")
        print "Report printed on " current_time
    }'
• In this version of getline, none of the built-in variables are changed, and the record is not split into fields.
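The same form can be tried with a command whose output is predictable. In this sketch echo stands in for date, and NR is printed to show that it is left alone.

```shell
out=$(awk 'BEGIN {
    cmd = "echo hello from the shell"
    if ((cmd | getline line) > 0)    # only the variable line is set
        print "read: " line
    close(cmd)
    print "NR is still " NR          # no main-input record was consumed
}')
echo "$out"
```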

Using system()
• The system() function executes a command supplied as an expression.
• The output generated by the command is not available within the program for processing.
• system() returns the exit status of the command that was executed.
• Example:
    #!/bin/awk -f
    BEGIN {
        status = system("mkdir temp")
        if (status != 0)
            print "command failed"
    }
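The return value can be checked without touching the file system by using the standard true and false utilities. A minimal sketch:

```shell
out=$(awk 'BEGIN {
    ok  = system("true")      # exits with status 0
    bad = system("false")     # exits with a nonzero status
    print "true -> " ok
    print "false is nonzero: " (bad != 0)
}')
echo "$out"
```

The exact nonzero value is not printed here because it is up to the command; only the zero or nonzero distinction is relied on.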

User-defined functions
• A function definition can be anywhere that a pattern-action rule can be.
• Inputs to the function are passed as a list of parameters.
• Example:
    # inserts a string, insertStr, after position in aString
    function insertString(aString, position, insertStr) {
        before = substr(aString, 1, position)
        after = substr(aString, position + 1)
        return before insertStr after
    }
    { print insertString($1, 5, "BBBB") }
    # No space is allowed between the function name and the left
    # parenthesis when the function is called.

• All the variables in the parameter list are local to the function.
• Any other variable assigned in the body of the function is a global variable.
• Therefore temporary variables are declared by adding them to the end of the parameter list.
• Example (after is local, before is global):
    function insertString(aString, position, insertStr, after) {
        before = substr(aString, 1, position)
        after = substr(aString, position + 1)
        return before insertStr after
    }
    { print insertString($1, 5, "BBBB") }
    { print aString }
    { print "before: " before }
    { print "after: " after }
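The local-versus-global rule can be reduced to a two-variable sketch. The function f and the names tmp and leaked are made up for illustration; the extra spaces before tmp in the parameter list are only a convention for marking local variables.

```shell
out=$(awk '
function f(x,    tmp) {      # tmp is an extra parameter, so it is local
    tmp = x * 2
    leaked = x * 3           # not in the parameter list, so it is global
    return tmp
}
BEGIN {
    r = f(5)
    print r
    print "tmp=[" tmp "]"          # empty: the local tmp is gone after the call
    print "leaked=[" leaked "]"    # still set: the assignment was global
}')
echo "$out"
```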

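    % cat testFile
    HelloWorld
    This is a test
    XYZ1234567890

    % awk -f fun2.awk testFile
    HelloBBBBWorld

    before: Hello
    after: 
    ThisBBBB

    before: This
    after: 
    XYZ12BBBB34567890

    before: XYZ12
    after: 
  The empty lines come from print aString: the parameter aString does not exist outside the function.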

Functions
• Arrays are passed by reference.
    #!/bin/awk -f
    # swaps the smallest element of LIST into position 1
    function moveSmallest(LIST, SIZE,    temp, small, smallIndex) {
        small = LIST[1]
        smallIndex = 1    # so the swap is harmless if LIST[1] is already smallest
        for (i = 2; i <= SIZE; ++i) {
            if (LIST[i] < small) {
                small = LIST[i]
                smallIndex = i
            }
        }
        LIST[smallIndex] = LIST[1]
        LIST[1] = small
        return
    }
    END {
        array[1] = 12; array[2] = 0; array[3] = -1; array[4] = 100
        moveSmallest(array, 4)
        for (i = 1; i <= 4; ++i) {
            print array[i]
        }
    }

Some built-in Functions
• Arithmetic functions:
    cos, exp, int, log, sin, sqrt, atan2, rand, srand
• Some useful string functions:
    index, length, split, sub, substr, tolower, toupper
• gsub(regExp, replaceWithString, inString): globally substitutes replaceWithString for regExp in inString.
• match(string, regExp): returns the position where regExp is found in string, or 0 if no occurrences are found.
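Several of these string functions can be exercised on one made-up sentence. The sketch also shows RSTART and RLENGTH, the two variables that match() sets as a side effect.

```shell
out=$(echo 'the cat sat on the mat' | awk '{
    line = $0
    n = gsub(/at/, "AT", line)       # returns the number of replacements made
    print n, line
    pos = match($0, /s[a-z]+/)       # also sets RSTART and RLENGTH
    print pos, RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    print index($0, "on"), length($0), toupper(substr($0, 1, 3))
}')
echo "$out"
```

gsub works on the copy in line, so $0 is still the original sentence when match and index are called.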

Passing parameters into a script
• Input is passed into an awk script by setting variables on the command line.
• Example:
    awk -f awkprog x=1 y=2 inputfile
• The variables x and y can be accessed in the main loop (not in the BEGIN section).
• The system variables ARGC and ARGV can be used to access the command-line arguments.
• Example:
    BEGIN { print "BEGIN: " n }
    NR == 1 {
        print ARGC
        print n
        for (i = 0; i < ARGC; ++i) {
            print ARGV[i]
        }
    }

    % awk -f param.awk n=20 testfile
    BEGIN: 
    3
    20
    awk
    n=20
    testfile

An array of Environment variables
    #!/bin/awk -f
    BEGIN {
        for (env in ENVIRON) {
            print env "=" ENVIRON[env]
        }
        print "Logname = ", ENVIRON["LOGNAME"]
    }
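A single entry can be looked up, and tested for existence, without looping over the whole array. In this sketch GREETING and NO_SUCH_VAR_12345 are made-up variable names; the first is set only for the one awk command.

```shell
out=$(GREETING=hello awk 'BEGIN {
    if ("GREETING" in ENVIRON)                 # membership test, no element created
        print "GREETING=" ENVIRON["GREETING"]
    if (!("NO_SUCH_VAR_12345" in ENVIRON))
        print "NO_SUCH_VAR_12345 is unset"
}')
echo "$out"
```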
