
Logfile-Preprocessing using WUMprep




  1. Logfile-Preprocessing using WUMprep: A Perl script suite that does more than just filter raw data

  2. About WUMprep (1)
  • WUMprep is part of the open-source project HypKnowSys and was written by Carsten Pohle
  • it covers logfile preprocessing in two ways:
    • filtering
    • adding meaning to websites (taxonomies)
  • it can be used both stand-alone and in conjunction with other mining tools (e.g. WUM)

  3. Configuring WUMprep (1)
  • wumprep.conf defines the basic settings needed by each script
  • just give your domain and your input log; that will do for the moment
  • before running removeRobots.pl you can define the seconds threshold used with the timestamp. Question: which value is appropriate?
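  For orientation, a minimal wumprep.conf might start out roughly like this. This is a hypothetical sketch: the section and key names are illustrative and may differ in your WUMprep version, so check the comments in the file shipped with the suite.

    [global]
    # domain of the analyzed site (key name illustrative)
    domain = www.example.org
    # raw input log to preprocess (key name illustrative)
    inputLogs = /logs/access.log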

  4. Next step: logfileTemplate (config 2)
  • the four basic web server log formats are defined in WUMprep's logFormat.txt
  • you arrange logfileTemplate according to the given format
  • basically anything goes, but if the log is queried from a MySQL database, remember that host, timestamp and agent are mandatory (and the referrer is at least helpful)¹
  ¹ see Nicolas Michael's presentation for details concerning problems with the basic algorithm

  5. Usage of logfileTemplate (config 3)
  You have this format:
  koerting.hannover.kkf.net - - [01/Mar/2003:00:34:41 -0700] "GET /css/styles.css HTTP/1.1" 200 7867 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
  you take this template:
  @host_dns@ @auth_user@ @ident@ [@ts_day@/@ts_month@/@ts_year@:@ts_hour@:@ts_minutes@:@ts_seconds@ @tz@] "@method@ @path@ @protocol@" @status@ @sc_bytes@ @referrer@ @agent@
  You have this format:
  200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 00:00:00 2399027
  you take this template:
  @host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@ @ts_hour@:@ts_minutes@:@ts_seconds@ @dummy@
  NB: Have a close look at your logfile and arrange logfileTemplate by following exactly the given format
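  To make the "follow the format exactly" advice concrete, here is a minimal Perl sketch (not WUMprep's actual parser) of how such a template can be turned into a regular expression. The per-field sub-patterns in %pat are my own addition; without them, a generic catch-all pattern would mis-parse the agent field, which contains spaces.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Per-field sub-patterns; unknown fields fall back to a non-greedy match.
    my %pat = (
        host_ip    => '(\S+)',
        ts_year    => '(\d+)', ts_month   => '(\d+)', ts_day     => '(\d+)',
        ts_hour    => '(\d+)', ts_minutes => '(\d+)', ts_seconds => '(\d+)',
    );

    # The MySQL-style template from above
    my $template = '@host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@'
                 . ' @ts_hour@:@ts_minutes@:@ts_seconds@ @dummy@';

    my @fields;
    my $regex = quotemeta $template;   # escape the literal parts
    # replace each escaped \@field\@ marker with its capturing sub-pattern
    $regex =~ s{\\\@(\w+)\\\@}{push @fields, $1; $pat{$1} // '(.+?)'}ge;

    my $line = '200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'
             . ' 2001-10-20 00:00:00 2399027';

    if (my @values = $line =~ /^$regex$/) {
        my %entry;
        @entry{@fields} = @values;     # hash slice: field name => value
        print "$_ = $entry{$_}\n" for @fields;
    }

  If the template deviates from the log line by even one separator, the anchored match fails, which is exactly why the template must mirror the format character by character.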

  6. Dealing with an unstandardized format (config 4)
  • the last slide's example is taken from a MySQL database:
  200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 00:00:00 2399027
  • you have an unusual timestamp format (handled by adapting the template, see the last slide) and a missing referrer, even though sessionize.pl may look e.g. for foreign referrers to start a new session (see the next slide for the fix)

  7. Configuring wumprep.conf (5)
  • go to [sessionizeSettings] in wumprep.conf
  • comment out everything that deals with referrers. It'll look like this:
  # Set to true if the sessionizer should insert dummy hits to the
  # referring document at the beginning of each session.
  #sessionizeInsertReferrerHits = true
  # Name of the GET query parameter denoting the referrer (leave blank
  # if not applicable)
  #sessionizeQueryReferrerName = referrer
  # Should a foreign referrer start a new session?
  #sessionizeForeignReferrerStartsSession = 0

  8. We're ready to go: sessionize the log
  • if no cookie ID is given, sessionize.pl will look at the host and the timestamp. There is a threshold q = 1800 sec. in wumprep.conf. Let t0 be the timestamp of the first entry; then a session is computed by taking every request whose timestamp t satisfies t − t0 ≤ q as a subsequent request of t0 (see the sketch below)
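  A stripped-down Perl sketch of this heuristic (illustrative only; the real sessionize.pl is more general, and implementations of this heuristic sometimes measure the gap between consecutive requests instead of the distance to the session start):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $q = 1800;    # session timeout in seconds, as set in wumprep.conf

    # Toy input: [host, unix timestamp, path], assumed sorted by time
    my @log = (
        ['200.11.240.17', 1003536000, '/index.html'],
        ['200.11.240.17', 1003536300, '/cp_.htm?fall=1'],
        ['200.11.240.17', 1003539901, '/index.html'],  # > q after t0: new session
    );

    my (%t0, %session_no);
    for my $entry (@log) {
        my ($host, $ts, $path) = @$entry;
        # open a new session on first sight or when t - t0 exceeds q
        if (!exists $t0{$host} or $ts - $t0{$host} > $q) {
            $t0{$host}         = $ts;
            $session_no{$host} = ($session_no{$host} // 0) + 1;
        }
        print "$host:$session_no{$host} $ts $path\n";
    }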

  9. detectRobots
  • there are two types of robots: ethical and non-ethical (let's say three: the good, the bad and the very ugly ;-)
  • the first type acts according to the Robots Exclusion Standard and looks first into a file called robots.txt to see where to go and where not
  • removing them is done via the robot database indexers.lst. Additionally, detectRobots.pl flags IPs as robots when they have accessed robots.txt (sketched below)
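  The robots.txt part of this check is simple enough to sketch in Perl (illustrative only; detectRobots.pl combines it with the indexers.lst lookup and further heuristics, and the file name access.log is an assumption):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $logfile = 'access.log';   # assumed input, common log format

    # Pass 1: remember every host that ever requested robots.txt
    my %is_robot;
    open my $in, '<', $logfile or die "$logfile: $!";
    while (<$in>) {
        my ($host, $path) = /^(\S+).*?"(?:GET|HEAD)\s+(\S+)/ or next;
        $is_robot{$host} = 1 if $path =~ m{^/robots\.txt$}i;
    }
    close $in;

    # Pass 2: keep only the lines of hosts that never fetched robots.txt
    open $in, '<', $logfile or die "$logfile: $!";
    while (<$in>) {
        my ($host) = /^(\S+)/ or next;
        print unless $is_robot{$host};
    }
    close $in;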

  10. detectRobots (2)
  • the second type, whose IP and agent look like those of a human, is difficult to detect and requires a sessionized log
  • there is (besides two others) a time-based heuristic to remove them: too many HTML requests within a given time very likely come from a robot. The default value in wumprep.conf is 2 sec. (see the sketch below)
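  As a toy illustration of the time-based idea (the session data is made up, and detectRobots.pl's actual heuristic may differ in detail), a session whose average interval between requests falls below the threshold is flagged:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $min_interval = 2;    # seconds; robot threshold as in wumprep.conf

    # Made-up sessions: id => sorted timestamps of the session's HTML requests
    my %sessions = (
        'host-a:1' => [0, 1, 2, 3, 4, 5],      # one request per second
        'host-b:1' => [0, 40, 95, 180, 400],   # human-paced browsing
    );

    for my $id (sort keys %sessions) {
        my @ts = @{ $sessions{$id} };
        next if @ts < 2;                       # need at least two requests
        my $avg = ($ts[-1] - $ts[0]) / (@ts - 1);  # mean gap between requests
        printf "%s: %.1f s/request -> %s\n",
            $id, $avg, $avg < $min_interval ? 'robot candidate' : 'ok';
    }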

  11. detectRobots (3)
  • you can add entries to indexers.lst by taking a larger log and typing at the command line:
  grep "bot" logfile | awk '{print "robot-host: ", $1}' | sort | uniq >> indexers.lst
  • in my logs, detectRobots.pl removed about 6% of the entries before and about 17% after extending indexers.lst this way (2668 KB vs. 2360 KB out of 2821 KB, comparing xyz.nobots with xyz.sess)
  • there will always remain some uncertainty about robot detection; further research is necessary

  12. Further Data Cleaning is thankfully much easier
  • logFilter.pl applies the filter rules from wumprep.conf. You can define your own filter rules or add them to wumprep.conf (a sketch of the filtering idea follows below):
  \.ico
  \.gif
  \.jpg
  \.jpeg
  \.css
  \.js
  \.GIF
  \.JPG
  # @mydomainFilter.txt
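  In spirit, the filtering step boils down to this Perl sketch, which reads a log on standard input and drops every line matching one of the rules above (logFilter.pl's real rule handling, driven by wumprep.conf, is richer; here /i simply covers the upper-case variants):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Filter rules as on the slide
    my @rules = (qr/\.ico/, qr/\.gif/i, qr/\.jpe?g/i, qr/\.css/, qr/\.js/);

    while (my $line = <STDIN>) {
        my $hit = grep { $line =~ $_ } @rules;   # does any rule match?
        print $line unless $hit;
    }

  Invoked e.g. as "perl filter_sketch.pl < mysite.sess > mysite.filtered" (file names made up).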

  13. Taxonomies
  • taxonomies are built using regular expressions: map your site according to a taxonomy, and mapReTaxonomies.pl uses your predefined regexes to overwrite the requests in the log with your site concepts
  • it'll look something like this (see the sketch below):
  HOME          www\.c-o-k\.de\/$
  METHODS       \/cp_\.htm\?fall=3\/
  TOOLS         \/cp_\.htm\?fall=1\/
  FIELDSTUDIES  \/cp_\.htm\?fall=2\/
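  The mapping idea itself fits in a few lines of Perl (a sketch, not mapReTaxonomies.pl; the real script rewrites whole log lines rather than bare URLs): the concept regexes are tried in order, and the first match wins:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # concept => regex pairs, loosely following the slide's taxonomy
    my @taxonomy = (
        [ HOME         => qr{www\.c-o-k\.de/$}  ],
        [ METHODS      => qr{/cp_\.htm\?fall=3} ],
        [ TOOLS        => qr{/cp_\.htm\?fall=1} ],
        [ FIELDSTUDIES => qr{/cp_\.htm\?fall=2} ],
    );

    sub map_concept {
        my ($url) = @_;
        for my $pair (@taxonomy) {
            my ($concept, $re) = @$pair;
            return $concept if $url =~ $re;   # first matching concept wins
        }
        return $url;                          # unmapped requests stay as-is
    }

    print map_concept('http://www.c-o-k.de/cp_.htm?fall=1'), "\n";  # TOOLS
    print map_concept('http://www.c-o-k.de/'), "\n";                # HOME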

  14. Taxonomies II
  • this is what mapReTaxonomies.pl does with it (aggregation):
  117858:1|80.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 1406 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
  117858:1|80.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 10301 "http://edoc.hu-berlin.de/conferences/conf2/Kuehne-Hartmut-2002-09-08/HTML/kuehne-ch1.html" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
  • this data aggregation is a necessary step before working with WUM

  15. Taxonomies III
  • beyond that, Carsten Pohle wants to use taxonomies as a filter for the uninteresting patterns one usually gets out of association rules: any pattern that matches the taxonomy (via mapReTaxonomies.pl) is most likely to be uninteresting

  16. Further Reading
  • Berendt, Mobasher, Spiliopoulou, Wiltshire: Measuring the Accuracy of Sessionizers for Web Usage Analysis
  • Pang-Ning Tan, Vipin Kumar: Discovery of Web Robot Sessions Based on Their Navigational Patterns. In: Data Mining and Knowledge Discovery 6 (1) (2002), pp. 9-35
  • Nicolas Michael: Erkennen von Web-Robotern anhand ihres Navigationsmusters (Detecting Web Robots by Their Navigation Patterns; talk in Berendt's Web Mining seminar, summer term 2003)
  • Gebhard Dettmar: Knowledge Discovery in Databases - Methodik und Anwendungsbereiche (Methodology and Application Areas), Knowledge Discovery in Databases, Part II - Web Mining

  17. Logfile-Preprocessing via WUMprep. Thanks for listening!
