
Web Scraping and Regex


coby

Presentation Transcript


  1. Web Scraping and Regex

  2. Ruby Gems • Add-on libraries • Don’t reinvent the wheel • Syntax: • gem install _______

  3. Scraping • Parsing HTML with Nokogiri

  4. Strings Strings are a sequence of characters denoted by single or double quotes. • "a" • "puts" • "John's book" • "12+100" • 'To be or not to be, that is the question...'

  5. String methods • "abc".upcase #=> "ABC" • "DEF".downcase #=> "def" • "abcdef".reverse #=> "fedcba" • "ABCdef".capitalize #=> "Abcdef" • "dog park".length #=> 8

  6. Mixing datatypes • puts 2 + 2 #=> 4 • puts "2" + "2" #=> 22

  7. Conversion Methods • "42".to_i + 42 #=> 84 • "42" + 42.to_s #=> "4242"
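As a rough illustration of the conversion methods above, the sketch below wraps them in two small helper methods. The names add_as_numbers and join_as_strings are made up for this example and are not part of the original slides.

```ruby
# Hypothetical helpers: convert explicitly before combining values.
def add_as_numbers(a, b)
  a.to_i + b.to_i          # both operands become Integers
end

def join_as_strings(a, b)
  a.to_s + b.to_s          # both operands become Strings
end

puts add_as_numbers("42", 42)    # prints 84
puts join_as_strings("42", 42)   # prints 4242
puts "abc".to_i                  # prints 0, because the string has no leading digits
```

Note that to_i never raises on bad input; it simply returns 0 when it cannot find a number at the start of the string.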

  8. What class is it? • 42.class # Integer (Fixnum in Ruby versions before 2.4) • 42.0.class # Float • "42".class # String

  9. Combining strings and numbers • puts "The result of 7 + 7 is " + (7 + 7).to_s • #=> The result of 7 + 7 is 14

  10. Interpolation – another way • puts "The result of 7 + 7 is #{7+7}" • #=> The result of 7 + 7 is 14 • puts "#{10 * 10} is greater than #{9 * 11}" • #=> 100 is greater than 99

  11. Interpolation • Notice how the expressions inside the #{ } are evaluated before being included in the string. • Two requirements here: • The string must be enclosed in double-quotes • Use a pound sign # followed by curly braces {} to enclose the Ruby code.
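The two requirements can be seen side by side in a short sketch. The method name describe_sum is an assumption made for this example.

```ruby
# Hypothetical helper: builds a sentence with interpolation.
def describe_sum(a, b)
  "The result of #{a} + #{b} is #{a + b}"   # double quotes: code inside #{ } is evaluated
end

puts describe_sum(7, 7)                      # prints The result of 7 + 7 is 14
puts 'Single quotes: #{7 + 7}'               # prints the #{7 + 7} text literally
```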

  12. Exercise Write the following strings using interpolation: • "1 + 1 is: " + (1+1).to_s • "There were 12 cases of a dozen eggs each (" + (12 * 12).to_s + ")" • "His name is " + "jon".capitalize

  13. Solution • "1 + 1 is: #{1+1}" • "There were 12 cases of a dozen eggs each (#{12 * 12})" • "His name is #{"jon".capitalize}"

  14. Backslash is the “escape character” • "He asked Scarlet, \"Frankly my dear, do I give a damn?\" To which Scarlet responded, \"No.\""

  15. New Line puts "Doe, a deer, a female deer.\nRay, a drop of golden\nsun." #=> Doe, a deer, a female deer. #=> Ray, a drop of golden #=> sun.

  16. Common uses of escaped characters • \n • a newline • \t • a tab space • \" • a literal quotation mark • \' • a literal apostrophe • \\ • a literal backslash
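A small sketch of these escapes in action follows. The method name escape_demo and the sample text are invented for illustration.

```ruby
# Hypothetical example string using \t, \n, \" and \\ in double quotes.
def escape_demo
  "Name:\tScarlet\nQuote:\t\"No.\"\nPath:\tC:\\temp"
end

puts escape_demo          # prints three lines, each with a tab after the label

# In single quotes only \' and \\ are treated as escapes, so \n stays as two characters.
puts "a\nb".length        # prints 3
puts 'a\nb'.length        # prints 4
```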

  17. What will display? • puts "He's a good doctor, and thorough." • puts '"I\'ve been at sea."' • puts 'Maude said to him: "He's a good doctor, and thorough"' • puts 'Out of order'.upcase • puts "We're going to #{'sea world'.upcase}" • puts "There were #{12*2/4} sheep and #{"three" + " sheepdogs"} out over at #{"Cherry".upcase} Creek." • puts '#{2*2}score and #{"7"} years ago'

  18. What will display? (answers) • He's a good doctor, and thorough. • "I've been at sea." • There is an unterminated String here, so this line is a syntax error. Because it is a single-quoted String, it ends at the apostrophe in He's. A new, unterminated String then begins at the ' after thorough". • OUT OF ORDER • We're going to SEA WORLD • There were 6 sheep and three sheepdogs out over at CHERRY Creek. • #{2*2}score and #{"7"} years ago [string interpolation isn't done in single-quoted strings]

  19. String Substitution puts "The cat and the hat".sub("hat", "rat") #=> The cat and the rat puts "Another brick in the wall".sub("brick in the ", "") #=> Another wall

  20. Global Substitution puts "I own an iPad, iPhone and an iPod".gsub('i', 'my') #=> I own an myPad, myPhone and an myPod Note that character case matters.
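To show why case matters, here is a minimal sketch comparing a plain string pattern with a case-insensitive regex. Both method names are hypothetical.

```ruby
# Hypothetical helpers contrasting case-sensitive and case-insensitive substitution.
def replace_i(text)
  text.gsub('i', 'my')      # only lowercase i is replaced
end

def replace_i_any_case(text)
  text.gsub(/i/i, 'my')     # the trailing i flag makes the regex ignore case
end

puts replace_i("I own an iPad")            # prints I own an myPad
puts replace_i_any_case("I own an iPad")   # prints my own an myPad
```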

  21. Regular Expressions • Programming Ruby 1.9 by Dave Thomas (more commonly known as the Pickaxe book) sums up what you do with regular expressions in three words: test, extract, change.

  22. AKA Regex • You've probably used your word processor's find-and-replace to do substitutions, such as: • Replace all occurrences of "NYC" with "New York City". • With a regular expression, you can do the same find-and-replace action but catch "N.Y.C", "N.Y.", "NY, NY", "nyc" and any other slight variations in spelling and capitalizations, all in one go.
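As a rough sketch of that idea in Ruby, the pattern below catches "NYC", "N.Y.C" and "nyc" in one pass. It is an assumption made for illustration and deliberately does not try to cover "N.Y." or "NY, NY". The names NYC_PATTERN and expand_nyc are hypothetical.

```ruby
# Hypothetical pattern: N, Y, C with optional periods after N and Y, any letter case.
NYC_PATTERN = /\bN\.?Y\.?C\b/i

def expand_nyc(text)
  text.gsub(NYC_PATTERN, "New York City")
end

puts expand_nyc("NYC, N.Y.C and nyc")
# prints New York City, New York City and New York City
```

The \b word boundaries keep the pattern from firing inside longer words such as "sync".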

  23. ^\s*\n • ^ • The caret stands for the start of the line. It indicates that we are interested in a pattern from the very beginning of a given line. This is also referred to as an anchor. • \s* • The \s stands for any whitespace character. The asterisk * indicates that we are looking for 0 or more of these whitespace characters. So the regex will match whether there is no whitespace or a lot of whitespace at the beginning of the line. • \n • This is a special character for a newline
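One common use of this pattern is deleting blank lines. The sketch below assumes that goal; the method name remove_blank_lines is made up for the example.

```ruby
# Hypothetical helper: delete lines that contain nothing but whitespace.
def remove_blank_lines(text)
  text.gsub(/^\s*\n/, '')   # in Ruby, ^ matches at the start of every line
end

puts remove_blank_lines("a\n\n  \nb\n")
# prints a and b on consecutive lines
```

Because \s also matches newlines, a run of several blank lines is removed in a single match.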

  24. ^ + • ^ • Again, this is the beginning-of-the-line anchor. • The empty space is just a literal empty space. We could've also used \s • + • The plus + is a quantifier, greedy by default, that looks for one or more of the previous token, which in our example is a space.
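A typical use for this pattern is stripping indentation. The following is a minimal sketch with a hypothetical method name, strip_leading_spaces.

```ruby
# Hypothetical helper: remove one or more spaces at the start of every line.
def strip_leading_spaces(text)
  text.gsub(/^ +/, '')
end

puts strip_leading_spaces("  a\n    b\nc")
# prints a, b and c, each flush with the left margin
```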

  25. \[\d+\] • \[ • Square brackets are special characters in regexes. But we don't want that special meaning. We just want a literal square bracket, so we escape it using a backslash \ • \d+ • The \d represents any numerical digit. Thus, when followed by the + quantifier, the \d+ matches one or more numerical digits. • \] • This just matches the literal closing square bracket
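This pattern is handy for removing bracketed reference numbers from copied text. The sketch assumes that use case; the method name and sample sentence are invented.

```ruby
# Hypothetical helper: delete markers such as [1] or [23].
def strip_reference_marks(text)
  text.gsub(/\[\d+\]/, '')
end

puts strip_reference_marks("Regexes are terse[1] and powerful[23].")
# prints Regexes are terse and powerful.
puts strip_reference_marks("Keep [abc] as it is")
# prints Keep [abc] as it is, because \d+ needs digits between the brackets
```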

  26. Years as 4-digit 3-10-2010 11-7-06 1-6-2007 4-14-08 7-10-2011 1-11-09 12-9-11 6-1-10 5-6-2009

  27. Regex • Open up your editor's find-and-replace and in the Find box, type in: -(\d{2})$ • In the Replace box (your text editor's flavor of regexes may use a backslash \ instead of a dollar sign), type: -20$1

  28. -(\d{2})$ • - • This is just a normal, i.e. literal, i.e. "non-special" hyphen. • () • Parentheses are special regex characters that capture the pattern within them for later use (in the Replace field). In our current example, we want to use whatever the current year value is (e.g. 07 or 11) and prepend a 20 to it. • \d • A d would normally just match the letter "d". But with a backslash, this becomes a special regex character that matches any numerical digit. • {2} • Curly braces allow you to specify the exact number of occurrences of the pattern preceding the braces. Therefore, the regex {2} will match whatever pattern precedes it exactly two times • $ • The dollar sign $ will match the end of the line. We use it in our dates example because we want only to match the last digits of each line. Otherwise, the regex would match the day values because they also begin with a hyphen (ex. 8-20-10).

  29. 20$1 • The only thing special here is the $1 (again, your text editor may use backslashes instead of dollar signs, e.g. \1). • Remember those parentheses we used in the Find pattern? The characters matched by the pattern they encompassed are considered a captured group. • They can be retrieved for use, in this case in the Replace field, by using a dollar sign and the captured group's numerical order. • We only had one set of parentheses, so $1 grabs the first (and only) set. If we had used two sets of parentheses, $2 would retrieve the value between the second set of ()
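The same find-and-replace can be sketched in Ruby with sub. In a Ruby replacement string the captured group is written as \1 rather than $1. The method name expand_year is hypothetical, and the dates are taken from the sample list.

```ruby
# Hypothetical helper: turn a trailing two-digit year into a four-digit one.
def expand_year(date)
  date.sub(/-(\d{2})$/, '-20\1')   # single quotes keep \1 as a backreference
end

dates = %w[3-10-2010 11-7-06 1-6-2007 4-14-08]
dates.each { |d| puts expand_year(d) }
# prints 3-10-2010, 11-7-2006, 1-6-2007 and 4-14-2008, one per line
```

Dates that already end in four digits are left alone, because no hyphen in them is followed by exactly two digits and then the end of the line.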

  30. Regex in Ruby puts "My cat eats catfood".sub("cat", "dog") # => My dog eats catfood • If you passed in /cat/, you'd get the same result as above, as the letters cat match their literal values: puts "My cat eats catfood".sub(/cat/, "dog") # => My dog eats catfood

  31. gsub puts "My cat eats catfood".gsub("cat", "dog") # => My dog eats dogfood

  32. We need regex str = "My cat goes catatonic when I concatenate his food with Muscat grapes" puts str.gsub("cat", "dog") # => My dog goes dogatonic when I condogenate his food with Musdog grapes

  33. With regex str = "My cat gets catatonic when I attempt to concatenate his food with Muscat grapes" puts str.gsub(/\bcat\b/, 'dog') # => My dog gets catatonic when I attempt to concatenate his food with Muscat grapes

  34. String.match contract = "Hughes Missile Systems Company, Tucson, Arizona, is being awarded a $7,311,983 modification to a firm fixed price contract for the FY94 TOW missile production buy, total 368 TOW 2Bs. Work will be performed in Tucson, Arizona, and is expected to be completed by April 30, 1996. Of the total contract funds, $7,311,983 will expire at the end of the current fiscal year. This is a sole source contract initiated on January 14, 1991. The contracting activity is the U.S. Army Missile Command, Redstone Arsenal, Alabama (DAAH01-92-C-0260)."

  35. mtch = contract.match(/\$[\d,]+/) puts mtch #=> $7,311,983 • Note that match returns only the first match in the string.

  36. [\d,]+ • The \$ matches a literal dollar sign. The [\d,] matches any character that is either a numerical digit or a comma. • The plus sign + is a quantifier, greedy by default, and it will match the pattern that precedes it one or more times. Therefore, [\d,]+ will match any of the following strings: • 12,000 • 42 • 912,345,200 • ,,342134,,3,4,5

  37. Match dates mtch = contract.match(/\w+ \d{1,2}, \d{4}/) puts mtch #=> April 30, 1996 • As before, match returns only the first date found. • \w matches any alphanumeric character or an underscore. If you want to be more precise in matching the month names, you can use a character set, such as [A-Za-z]
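Since match stops at the first hit, String#scan is one way to collect every dollar amount or date. The sketch below uses a shortened excerpt of the contract text, and the method names find_amounts and find_dates are hypothetical.

```ruby
# Hypothetical helpers: scan returns an array of every match, not just the first.
def find_amounts(text)
  text.scan(/\$[\d,]+/)
end

def find_dates(text)
  text.scan(/[A-Z][a-z]+ \d{1,2}, \d{4}/)   # capitalized word, day, comma, 4-digit year
end

excerpt = "is being awarded a $7,311,983 modification and is expected to be " \
          "completed by April 30, 1996. Of the total contract funds, $7,311,983 " \
          "will expire. This contract was initiated on January 14, 1991."

p find_amounts(excerpt)   # prints ["$7,311,983", "$7,311,983"]
p find_dates(excerpt)     # prints ["April 30, 1996", "January 14, 1991"]
```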

  38. If Else if my_bank_account_balance > 50.00 puts "I'm eating steak!" else puts "I'm eating ramen :(" end

  39. Examples if val > 10 puts "Big" end if val <= 10 && val >= 0 puts "Small" end

  40. An Exercise • Use regex • Ask the user for an email address and use a regex to check that it is valid. • Change the names of a group of files • http://www.rexegg.com/regex-uses.html
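One possible starting point for the email exercise is sketched below. The pattern is a deliberately loose assumption (something, an @, something, a dot, something) and is far from a full email validator. The names EMAIL_PATTERN and plausible_email? are hypothetical, and Regexp#match? needs Ruby 2.4 or later.

```ruby
# Hypothetical loose check: no spaces, exactly one @, and a dot in the domain part.
EMAIL_PATTERN = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

def plausible_email?(email)
  EMAIL_PATTERN.match?(email)
end

["user@example.com", "not-an-email", "a@b"].each do |candidate|
  if plausible_email?(candidate)
    puts "#{candidate} looks valid"
  else
    puts "#{candidate} does not look valid"
  end
end
```

To turn this into the interactive version the exercise asks for, the candidate string could come from gets.chomp instead of a fixed list.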
