3b: Gawk for Web Log Analysis


  1. 3b: Gawk for Web Log Analysis
     152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
     252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
     252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
     252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

  2. Gawk - introduction
     • A very powerful text processing and pattern matching language
     • gawk is the GNU version of awk
     • Syntax similar to C
     • See http://www.gnu.org/software/gawk/ for the manual
     • Many awk/gawk tutorials, e.g.
       • http://www.cs.hmc.edu/qref/awk.html
       • http://www.cs.ucsb.edu/~sherwood/awk/
     Note: The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977.

  3. Gawk - running
     • Several ways of running from the Unix prompt:
       % gawk 'commands' file
       % cat file | gawk 'commands'
       % cat file | gawk -f prog.gawk

  4. Gawk – fields and records
     • Gawk divides the file into records and fields
     • Each line is a record (by default)
     • Fields are delimited by a special character
       • Default: white space (blank or tab)
       • Can be changed with the -F option
       • E.g. to use a comma as the delimiter: gawk -F"," 'commands' file.csv
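As a minimal sketch of the -F option, the following splits two made-up comma-separated records (not taken from the lecture data) and prints the second field of each. The AWK variable is an assumption added here so that the sketch still runs where only a plain awk is installed.

```shell
# Use gawk if available; otherwise fall back to any POSIX awk (assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Two hypothetical CSV records; -F"," makes the comma the field delimiter.
second=$(printf 'alice,200,15140\nbob,404,0\n' | "$AWK" -F"," '{print $2}')
echo "$second"
```

Without -F",", each record would contain no white space and so would be a single field, leaving $2 empty.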

  5. Gawk fields and variables
     Fields are accessed with the $ prefix:
     • $1 is the first field, $2 is the second, …
     • $0 is a special field which is the entire line
     Special variables:
     • NF is the number of fields in the current record
     • NR is the current record number
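To see how these fields map onto a web log line, here is a small sketch that prints NR, NF, and a few fields for two lines copied from the sample log shown in this lecture. The temporary file and the AWK fallback variable are assumptions made only to keep the sketch self-contained.

```shell
# Use gawk if available; otherwise fall back to any POSIX awk (assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Two lines from the sample log, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /robots.txt HTTP/1.0" 200 173 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip2283.unr - - [16/Nov/2005:00:01:02 -0500] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" 200 8090 "http://www.google.com/search?hl=en&q=use+of+data+cleaning+in+data+mining&spell=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
EOF

# Record number, field count, host, method, requested file, status code.
fields=$("$AWK" '{print NR, NF, $1, $6, $7, $9}' "$tmp")
echo "$fields"
rm -f "$tmp"
```

Two things stand out in the output: $6 still carries the opening double quote of the request, and NF differs between the lines because the user-agent string contains a varying number of blanks.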

  6. Gawk conditions
       gawk -F"d" 'condition' file
     • gawk processes each line of file, using the delimiter d (default is whitespace) to split each line into fields.
     • For each line where the condition is true, the default action is to print the entire line.

  7. Sample log file
     • We will use the file d100.log, the first 100 lines from the Nov 16, 2005 KDnuggets log file.
     • We will give useful code examples; for a full introduction to gawk, see the manual and tutorials.
     • You are encouraged to try the code examples in this lecture on this file.
     • You should get the same answers!

  8. Sample log file d100.log
     ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /robots.txt HTTP/1.0" 200 173 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
     ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" 200 14199 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
     ip2283.unr - - [16/Nov/2005:00:01:02 -0500] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" 200 8090 "http://www.google.com/search?hl=en&q=use+of+data+cleaning+in+data+mining&spell=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
     ip2283.unr - - [16/Nov/2005:00:01:03 -0500] "GET /dmcourse/dm.css HTTP/1.1" 200 155 "http://www.kdnuggets.com/dmcourse/data_mining_course/assignments/assignment-3.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
     ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/sld021.htm HTTP/1.1" 200 1385 "http://www.google.com/search?hs=JnE&hl=en&lr=&client=opera&rls=en&q=lift+curve&btnG=Search" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
     ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/img021.gif HTTP/1.1" 200 7465 "http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
     ip1389.net - - [16/Nov/2005:00:02:47 -0500] "GET /favicon.ico HTTP/1.1" 200 899 "http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
     ip1946.com - - [16/Nov/2005:00:02:49 -0500] "GET /news/2001/n10/15i.html HTTP/1.0" 200 4214 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
     …

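  9. Example 1: Lines with status not equal to 200
     • The status code is field $9 in the log file
     • How many lines had a status code other than 200?
       % gawk '$9 != 200' d100.log | wc
       Result: 27 (the line count, which is the first number wc prints; wc -l prints only that number)
     Note: to count lines with status code equal to 200, use '$9 == 200', not '$9 = 200' (which assigns 200 to $9 instead of comparing)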
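The difference between == and = can be checked on a tiny input. The sketch below uses three made-up log lines (hypothetical hosts and paths, with one 404 among two 200s) and counts matches inside awk instead of piping to wc. The AWK fallback variable and the temporary file are assumptions for self-containment.

```shell
# Use gawk if available; otherwise fall back to any POSIX awk (assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Three hypothetical log lines: statuses 200, 404, 200.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
hostA - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 173 "-" "agent"
hostB - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 0 "-" "agent"
hostC - - [16/Nov/2005:00:00:03 -0500] "GET /jobs/ HTTP/1.1" 200 8090 "-" "agent"
EOF

# n+0 forces a numeric 0 when no line matched.
not_ok=$("$AWK" '$9 != 200 {n++} END {print n+0}' "$tmp")
ok=$("$AWK" '$9 == 200 {n++} END {print n+0}' "$tmp")
# Assignment, not comparison: $9 becomes 200 on every line, so every line matches.
assigned=$("$AWK" '($9 = 200) {n++} END {print n+0}' "$tmp")

echo "$not_ok $ok $assigned"   # prints: 1 2 3
rm -f "$tmp"
```

The third count equals the total number of lines, which is the symptom to look for when = was typed in place of ==.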

  10. Example 2: Count referrals from Google
      • Gawk has powerful pattern matching: variable ~ "pattern"
      • Example: how many log lines had a referral (field $11 in the log line) from google?
        % gawk '$11 ~ "google"' d100.log | wc
        Result: 2

  11. Example 3: complex condition
      • How many hits had the GET method and status 404? (Status 404 is an error code.)
      • The method is field $6 in the log, but the request is surrounded by " ", so $6 is "GET with a leading double quote. We therefore match the pattern instead of testing for equality:
        % gawk '$6 ~ "GET" && $9 == 404' d100.log | wc
        Result: 1

  12. Example 4a: Counting ".html" requests
      • The requested file is field $7. We can use this condition to match files that end in .html
      • Note: $ in the pattern matches the end of the string
        % gawk '$7 ~ ".html$"' d100.log | wc
        Result: 21
      • Note: an unescaped . matches any character; to match only a literal dot, use the regular expression /\.html$/

  13. Example 4b: Counting htm or html requests
      • Some files may also end in .htm, so we can use
        % gawk '$7 ~ ".html$|.htm$"' d100.log | wc
        Result: 22

  14. Example 4c: Counting directory requests
      • Some requests can be for a directory, e.g. a request for the homepage www.kdnuggets.com/ would contain the string "GET / HTTP/1.1".
      • We can count these requests by
        % gawk '$7 ~ "/$"' d100.log | wc
        Result: 6

  15. Example 4d: Counting all HTML pages
      • Or count html, htm, and directory pages by
        % gawk '$7 ~ "(html|htm|/)$"' d100.log | wc
        Result: 28
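The patterns from Examples 4a to 4d can be tried on a short list of request paths. In this sketch the paths are fed in directly, so $1 stands in for field $7 of a real log line; most paths come from the sample log, while /docs/xhtml is a made-up path added to show what the unescaped dot lets through. The AWK fallback variable is also an assumption.

```shell
# Use gawk if available; otherwise fall back to any POSIX awk (assumption for portability).
AWK=$(command -v gawk || command -v awk)

# One request path per line; /docs/xhtml is hypothetical.
paths='/robots.txt
/gpspubs/sigkdd-kdd99-panel.html
/dmcourse/dm.css
/gpspubs/kdd99-est-ben-lift/sld021.htm
/
/news/2001/n10/15i.html
/docs/xhtml'

# Unescaped dot: any character followed by "html" at the end of the string.
loose=$(printf '%s\n' "$paths" | "$AWK" '$1 ~ ".html$" {n++} END {print n+0}')
# Escaped dot: only a literal ".html" suffix.
strict=$(printf '%s\n' "$paths" | "$AWK" '$1 ~ /\.html$/ {n++} END {print n+0}')
# .html, .htm, or a trailing slash (directory request).
pages=$(printf '%s\n' "$paths" | "$AWK" '$1 ~ /(\.html|\.htm|\/)$/ {n++} END {print n+0}')

echo "$loose $strict $pages"   # prints: 3 2 4
```

The loose pattern counts /docs/xhtml as an .html request, and the strict one does not. Whether that distinction matters depends on the paths that actually occur in the log.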

  16. Gawk computations
      • A more general form of gawk statements is
        gawk '{statements;…}' file
      • The statements are executed for each line of file
      • Statements include the usual conditionals, loops, etc.
      • Details are in the gawk manual and tutorials

  17. Example 5: External referrers
      • Example: print referrers to html pages, excluding direct access (where the referrer is "-"); piping the output to wc counts them
      • Note: field $11 includes its surrounding double quotes, so to test whether $11 is "-" we need to escape each double quote as \"
      • Code (all on one line):
        % gawk '{if ($7~"html$" && $11!="\"-\"") print $11}' d100.log | wc
        Result: 7
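A quick way to check the escaped-quote comparison is to run the same statement on three lines copied from the sample log: one html request with referrer "-", one html request referred by Google, and one stylesheet request. The temporary file and the AWK fallback variable are assumptions for self-containment.

```shell
# Use gawk if available; otherwise fall back to any POSIX awk (assumption for portability).
AWK=$(command -v gawk || command -v awk)

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" 200 14199 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip2283.unr - - [16/Nov/2005:00:01:02 -0500] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" 200 8090 "http://www.google.com/search?hl=en&q=use+of+data+cleaning+in+data+mining&spell=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip2283.unr - - [16/Nov/2005:00:01:03 -0500] "GET /dmcourse/dm.css HTTP/1.1" 200 155 "http://www.kdnuggets.com/dmcourse/data_mining_course/assignments/assignment-3.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
EOF

# Keep html requests whose referrer field is not the quoted dash "-".
refs=$("$AWK" '{if ($7 ~ "html$" && $11 != "\"-\"") print $11}' "$tmp")
echo "$refs"
rm -f "$tmp"
```

Only the Google referrer is printed, still wrapped in the double quotes that are part of field $11.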

  18. Gawk statements: BEGIN, END
      • To execute statements before the first line is read, we use the BEGIN keyword
      • To execute statements after the last line is read, we use the END keyword
        gawk 'BEGIN{stat1;…}{stat2;…}END{stat3;…}' file

  19. Example 6
      • Sum all the object sizes (field $10) for status code 200
        % gawk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' d100.log
        Result: 396460
      Note: we did not initialize sumsize; uninitialized variables default to zero in numeric context (and to the empty string in string context)
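BEGIN, a per-line rule, and END can be combined in one program. The sketch below is a variation on Example 6, not the lecture's own code: it uses a condition-action rule in place of the if statement, initializes its counters explicitly in BEGIN, and runs on three made-up log lines. The variable names hits and bytes and the AWK fallback are assumptions.

```shell
# Use gawk if available; otherwise fall back to any POSIX awk (assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Three hypothetical log lines: statuses 200, 404, 200 with sizes 173, 0, 8090.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
hostA - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 173 "-" "agent"
hostB - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 0 "-" "agent"
hostC - - [16/Nov/2005:00:00:03 -0500] "GET /jobs/ HTTP/1.1" 200 8090 "-" "agent"
EOF

summary=$("$AWK" '
BEGIN     { hits = 0; bytes = 0 }                        # runs once, before the first line
$9 == 200 { hits++; bytes += $10 }                       # runs for each line with status 200
END       { printf "hits=%d bytes=%d\n", hits, bytes }   # runs once, after the last line
' "$tmp")

echo "$summary"   # prints: hits=2 bytes=8263
rm -f "$tmp"
```

The explicit initialization in BEGIN is optional, as the note in Example 6 explains. Using printf with %d in END also guarantees a 0 is printed when no line has status 200.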
