Utilizing Regular Expressions for Data Processing
Learn how to extract and analyze data using regular expressions. Discover how to identify traffic patterns in web server data and process information efficiently.
Presentation Transcript
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
Regular expressions {week 04}
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
Storage and retrieval
• Computers store and retrieve information
• Retrieval first requires finding information
• Once we find the data, we often must extract what we need...
Identifying traffic patterns
• Weblogs record each and every access to the Web server
• Use the data to answer questions:
  • Which pages are the most popular?
  • How much spam is the site experiencing?
  • Are certain days/times busier than others?
  • Are there any missing pages (bad links)?
  • Where is the traffic coming from?
Weblogs (not blogs!)
• Apache records an access_log file:
    75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] "GET /cis460/wordfreq.php HTTP/1.1" 200 566
• The fields, from left to right:
  • 75.194.143.61: requesting IP (or host)
  • - -: username/password
  • [26/Sep/2011:22:38:12 -0400]: access timestamp
  • "GET /cis460/wordfreq.php HTTP/1.1": HTTP request
  • 200: server response code
  • 566: size in bytes of data returned
• For server response codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
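The field breakdown above can be sketched as a single regular expression with one capturing group per field. This is a minimal sketch, not the course's own code: the class name `AccessLogLine`, the `parse` helper, and the exact pattern are assumptions chosen to fit the one sample line shown, and a real log may contain lines this pattern rejects.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogLine {
    // Assumed pattern for the sample line: host, two user fields,
    // [timestamp], "request", 3-digit status, size (or "-").
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    // Returns the seven captured fields, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] "
            + "\"GET /cis460/wordfreq.php HTTP/1.1\" 200 566";
        String[] f = parse(line);
        System.out.println("host      = " + f[0]);
        System.out.println("timestamp = " + f[3]);
        System.out.println("request   = " + f[4]);
        System.out.println("status    = " + f[5]);
        System.out.println("bytes     = " + f[6]);
    }
}
```

Returning null for a non-matching line keeps the sketch short; a production parser would more likely report or count the rejected lines.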
What do we do with the data?
• We have many options for using the data in the access_log:
  • summarized file (e.g. csv or tsv)
  • spreadsheet
  • database
  • web site
How do we process the data?
• Regardless of what we do with the data, we must first parse or extract the data
• We could write specific code to process the data and programmatically extract the desired information
• Or we can use regular expressions to simplify processing
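As an illustration of how a regular expression can simplify this kind of processing, the sketch below counts requests per page, which addresses the "most popular pages" question. The class name `PageCounter`, the `countPages` helper, and the pattern are assumptions; the second and third log lines are made-up sample data in the same format as the access_log line from the slides.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageCounter {
    // Assumed pattern: captures the path inside the quoted HTTP request,
    // e.g. "GET /cis460/wordfreq.php HTTP/1.1".
    static final Pattern REQUEST =
        Pattern.compile("\"[A-Z]+ (\\S+) HTTP/[\\d.]+\"");

    // Returns a sorted map from requested path to number of requests.
    static Map<String, Integer> countPages(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = REQUEST.matcher(line);
            if (m.find()) { // find() searches within the line
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            // From the slides:
            "75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] \"GET /cis460/wordfreq.php HTTP/1.1\" 200 566",
            // Made-up sample lines (hypothetical hosts, times, and pages):
            "192.0.2.10 - - [26/Sep/2011:22:40:03 -0400] \"GET /cis460/wordfreq.php HTTP/1.1\" 200 566",
            "192.0.2.11 - - [26/Sep/2011:22:41:47 -0400] \"GET /cis460/missing.html HTTP/1.1\" 404 209"
        };
        System.out.println(countPages(lines));
        // prints {/cis460/missing.html=1, /cis460/wordfreq.php=2}
    }
}
```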
Regular expressions (i)
• A regular expression is an expression in a “mini language” designed specifically for textual pattern matching
• Support for regular expressions is available in many languages, including Java, JavaScript, C, C++, PHP, etc.
Regular expressions (ii)
• A pattern contains numerous character groupings and is specified as a string
• Patterns to match a phone number include:
    [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
    [0-9]{3}-[0-9]{3}-[0-9]{4}
    \d\d\d-\d\d\d-\d\d\d\d
    \d{3}-\d{3}-\d{4}
    \(\d\d\d\) \d\d\d-\d\d\d\d
• In the last pattern, the parentheses are escaped so that they match literal ( and ) characters; unescaped parentheses only group part of the pattern
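The phone-number patterns can be checked against sample input with a short Java sketch. Several things here are assumptions rather than part of the slides: the class name `PhonePatterns`, the fictional numbers, the use of a plain ASCII hyphen as the separator, and the escaped parentheses in the last pattern (on the assumption that it is meant to match an area code written in literal parentheses).

```java
import java.util.regex.Pattern;

class PhonePatterns {
    // Four equivalent ways to describe ddd-ddd-dddd.
    static final String[] DASHED = {
        "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
        "[0-9]{3}-[0-9]{3}-[0-9]{4}",
        "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d",
        "\\d{3}-\\d{3}-\\d{4}"
    };

    // Area code in literal parentheses, e.g. (518) 555-0123.
    static final String PARENS = "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d";

    // True if every dashed pattern matches the whole input.
    static boolean allDashedMatch(String input) {
        for (String p : DASHED) {
            if (!Pattern.matches(p, input)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allDashedMatch("518-555-0123"));            // true
        System.out.println(allDashedMatch("518-555-012"));             // false
        System.out.println(Pattern.matches(PARENS, "(518) 555-0123")); // true
        System.out.println(Pattern.matches(PARENS, "518 555-0123"));   // false
    }
}
```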
Regular expressions in Java (i)
• The String class in Java provides a pattern matching method called matches()
• Unlike pattern matching in many other languages, matches() requires the pattern to match the entire string:
    String s = "Pattern matching in Java!";
    String p = "\\w+\\s\\w+\\s\\w{2}\\s\\w+!";
    if ( s.matches( p ) )
    {
      System.out.println( "MATCH!" );
    }
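The whole-string requirement is easiest to see next to a search that accepts a partial match. The sketch below contrasts String.matches() with Matcher.find() from java.util.regex; the class name `MatchVersusFind` and its two helper methods are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class MatchVersusFind {
    // True only if the pattern matches the entire string.
    static boolean wholeMatch(String s, String p) {
        return s.matches(p);
    }

    // True if the pattern matches anywhere inside the string.
    static boolean partialMatch(String s, String p) {
        return Pattern.compile(p).matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "Pattern matching in Java!";
        System.out.println(wholeMatch(s, "Java"));      // false
        System.out.println(wholeMatch(s, ".*Java.*"));  // true
        System.out.println(partialMatch(s, "Java"));    // true
    }
}
```

Wrapping a pattern in .* on both sides is one way to get substring behavior out of matches(); using find() is usually clearer.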
Regular expressions in Java (ii)
• Additional pattern-matching methods:
• Use the replaceFirst() and replaceAll() methods to replace a pattern with a string:
    String s = "<title>Cool Web Site</title>";
    String p = "</?\\w+>";
    String result = s.replaceAll( p, "" );
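A runnable version of the tag-stripping example makes the difference between the two methods visible. This is a minimal sketch: the class name `TagStripper` and its helpers are hypothetical, and the pattern only handles simple tags without attributes, so it is not a general HTML parser.

```java
class TagStripper {
    // Matches a simple opening or closing tag such as <title> or </title>.
    // The backslash is doubled because this is a Java string literal.
    static final String TAG = "</?\\w+>";

    // Removes every tag that matches the pattern.
    static String stripAllTags(String html) {
        return html.replaceAll(TAG, "");
    }

    // Removes only the first tag that matches the pattern.
    static String stripFirstTag(String html) {
        return html.replaceFirst(TAG, "");
    }

    public static void main(String[] args) {
        String s = "<title>Cool Web Site</title>";
        System.out.println(stripAllTags(s));   // Cool Web Site
        System.out.println(stripFirstTag(s));  // Cool Web Site</title>
    }
}
```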
Regular expressions in Java (iii)
• Additional pattern-matching methods:
• Use the split() method to split a string into an array of substrings:
    String s = "The Legend of Sleepy Hollow";
    String[] words = s.split( "\\s+" );
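The split() call can be wrapped in a small helper to show its result. One detail worth knowing is that a string beginning with whitespace produces an empty first element when split on "\\s+", so this sketch trims the input first; the class name `WordSplitter` and the `words` helper are hypothetical.

```java
class WordSplitter {
    // Splits on runs of whitespace; trim() avoids an empty first element
    // when the input starts with whitespace.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = words("The Legend of Sleepy Hollow");
        System.out.println(words.length);  // 5
        for (String w : words) {
            System.out.println(w);
        }
    }
}
```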