
Web Scraping Lecture9 - Requests

Web Scraping Lecture9 - Requests. Topics The Requests library Readings: Chapter 9. January 26, 2017. Overview. Last Time: Lecture 8 Slides 1-29 BeautifulSoup revisited Crawling Today: Chapter 3: Lecture 6 Slides 29-40 3-crawlSite.py - 4-getExternalLinks.py –





Presentation Transcript


  1. Web Scraping Lecture9 - Requests • Topics • The Requests library • Readings: • Chapter 9 January 26, 2017

  2. Overview • Last Time: Lecture 8 Slides 1-29 • BeautifulSoup revisited • Crawling • Today: • Chapter 3: Lecture 6 Slides 29-40 • 3-crawlSite.py • 4-getExternalLinks.py • 5-getAllExternalLinks.py • Warnings • Chapter 4 • APIs • JSON • JavaScript • References • Scrapy site:

  3. References • https://medium.mybridge.co/python-top-10-articles-for-the-past-year-v-2017-6033ae8c65c9#.60zvzhw7u

  4. Logging into Websites • So far we have only issued GET requests • To log in, we need to pass information to the server, such as a username/password pair

  5. Forms example http://pythonscraping.com/pages/files/form.html • <h2>Tell me your name!</h2> • <form method="post" action="processing.php"> • First name: <input type="text" name="firstname"><br> • Last name: <input type="text" name="lastname"><br> • <input type="submit" value="Submit" id="submit"> • </form>
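The `name` attributes in the form above are the keys the server expects in the POST body, and the `action` attribute says where to send them. The following is a minimal sketch, using only the standard library's `html.parser`, of pulling those pieces out of the form markup. The class name `InputNameCollector` is made up for illustration and is not part of the lecture code.

```python
from html.parser import HTMLParser

# The form markup from the slide above.
FORM_HTML = """
<h2>Tell me your name!</h2>
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit" id="submit">
</form>
"""


class InputNameCollector(HTMLParser):
    """Hypothetical helper: records the form's method/action and named inputs."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.method = attrs.get("method", "get").lower()
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            # Inputs without a name (the submit button here) are not submitted.
            self.field_names.append(attrs["name"])


collector = InputNameCollector()
collector.feed(FORM_HTML)
print(collector.method, collector.action, collector.field_names)
```

The collected field names are the keys to use in the dictionary passed as `data=` when posting to the form's action URL.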

  6. Old Code: urllib2 (urllib.request in Python 3) • #!/usr/bin/env python • # -*- coding: utf-8 -*- • import urllib2 • gh_url = 'https://api.github.com' • req = urllib2.Request(gh_url) • password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm() • password_manager.add_password(None, gh_url, 'user', 'pass') • auth_manager = urllib2.HTTPBasicAuthHandler(password_manager) • opener = urllib2.build_opener(auth_manager) http://engineering.hackerearth.com/2014/08/21/python-requests-module/

  7. urllib2.install_opener(opener) • handler = urllib2.urlopen(req) • print handler.getcode() • print handler.headers.getheader('content-type') • # ------ • # 200 • # 'application/json' http://engineering.hackerearth.com/2014/08/21/python-requests-module/

  8. Logging into Websites • Requests: HTTP for Humans

  9. Now with Requests • import requests • r = requests.get('https://api.github.com', auth=('user', 'pass')) • print(r.status_code) • print(r.headers['content-type']) • # ------ • # 200 • # 'application/json'

  10. requests.get() • >>> r = requests.get('https://github.com/timeline.json') • We now have a Response object called r, from which we can read everything the server sent back. • r.status_code • r.headers • r.text • …

  11. POST and Other Request Types • >>> r = requests.post("http://httpbin.org/post") • Other HTTP request types: PUT, DELETE, HEAD and OPTIONS • >>> r = requests.put("http://httpbin.org/put") • >>> r = requests.delete("http://httpbin.org/delete") • >>> r = requests.head("http://httpbin.org/get") • >>> r = requests.options("http://httpbin.org/get")

  12. Real Signup – for O’Reilly Newsletter

  13. Now Submitting • import requests • params = {'email_addr': 'ryan.e.mitchell@gmail.com'} • r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi", data=params) • print(r.text) • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3659-3664). O'Reilly Media. Kindle Edition.

  14. Chapter 9: 1-simpleForm.py • import requests • params = {'firstname': 'Ryan', 'lastname': 'Mitchell'} • r = requests.post("http://pythonscraping.com/files/processing.php", data=params) • print(r.text) • _________________________________________________________ • wdir='C:/Users/mmm.ENGINEERING/Documents/COURSES_Local/590WebScraping/Code/python-scraping/python-scraping-master/chapter9') • Hello there, Ryan Mitchell!

  15. Radio Buttons, Checkboxes, Etc. • URL of the form: • http://domainname.com?thing1=foo&thing2=bar • This corresponds to a form of this type: • <form method="GET" action="someProcessor.php"> • <input type="someCrazyInputType" name="thing1" value="foo" /> • <input type="anotherCrazyInputType" name="thing2" value="bar" /> • <input type="submit" value="Submit" /> • </form> • Which corresponds to the Python parameter object: • {'thing1':'foo', 'thing2':'bar'}
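The mapping between the query string and the Python parameter object on the slide above can be checked with the standard library. This is a small sketch using `urllib.parse`; the variable names are illustrative only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The parameter object from the slide.
params = {'thing1': 'foo', 'thing2': 'bar'}

# Dict -> query string appended to the URL.
url = "http://domainname.com?" + urlencode(params)
print(url)

# Going the other way: recover the parameter dict from a URL.
query = urlsplit(url).query
recovered = {key: values[0] for key, values in parse_qs(query).items()}
print(recovered)
```

With Requests, the same dictionary can be passed as `params=` to `requests.get()`, which builds the query string for you.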

  16. Submitting Files

  17. Chapter 9: 2-fileSubmission.py • import requests • files = {'uploadFile': open('../files/Python-logo.png', 'rb')} • r = requests.post("http://pythonscraping.com/pages/processing2.php", files=files) • print(r.text)
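To see what the `files=` argument produces without touching the network, one option is to prepare the request and inspect it instead of sending it. This is only a sketch: the `io.BytesIO` object is a stand-in for the real PNG file, and the bytes inside it are made up.

```python
import io

import requests

# Stand-in for open('../files/Python-logo.png', 'rb'); the content is fake.
fake_png = io.BytesIO(b"\x89PNG fake image bytes")

req = requests.Request(
    "POST",
    "http://pythonscraping.com/pages/processing2.php",
    files={"uploadFile": ("Python-logo.png", fake_png)},
)
prepared = req.prepare()  # builds headers and body, sends nothing

# File uploads are encoded as multipart/form-data rather than a plain form body.
print(prepared.headers["Content-Type"])
print(b'name="uploadFile"' in prepared.body)
```

The key in the `files` dictionary (`uploadFile`) plays the same role as the `name` attribute of a file input in the HTML form.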

  18. Staying logged in • How do you stay logged into websites as you move from page to page?

  19. Chapter 9: 3-cookies.py • import requests • params = {'username': 'Ryan', 'password': 'password'} • r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params) • print("Cookie is set to:") • print(r.cookies.get_dict()) • print("-----------") • print("Going to profile page...") • r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies) • print(r.text) Mitchell, Ryan. Web Scraping with Python

  20. 3-cookies.py output produced • wdir='C:/Users/mmm.ENGINEERING/Documents/COURSES_Local/590WebScraping/Code/python-scraping/python-scraping-master/chapter9') • Cookie is set to: • {'username': 'Ryan', 'loggedin': '1'} • ----------- • Going to profile page... • Hey Ryan! Looks like you're still logged into the site!

  21. Sessions

  22. Chapter 9: 4-sessionCookies.py • import requests • session = requests.Session() • params = {'username': 'username', 'password': 'password'} • s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params) • print("Cookie is set to:") • print(s.cookies.get_dict()) • print("-----------") • print("Going to profile page...") • s = session.get("http://pythonscraping.com/pages/cookies/profile.php") • print(s.text)

  23. Cookie is set to: • {'username': 'username', 'loggedin': '1'} • ----------- • Going to profile page... • Hey username! Looks like you're still logged into the site!
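The difference between a session and a one-off request is that the session stores cookies and attaches them to later requests automatically. The sketch below shows this offline: the cookie is set by hand to stand in for what the login response would normally set, and the request is only prepared, never sent.

```python
import requests

profile_url = "http://pythonscraping.com/pages/cookies/profile.php"

session = requests.Session()
# Assumption for illustration: pretend the login response set this cookie.
session.cookies.set("loggedin", "1", domain="pythonscraping.com", path="/")

# Prepared through the session: stored cookies are attached automatically.
with_session = session.prepare_request(requests.Request("GET", profile_url))
print(with_session.headers.get("Cookie"))

# Prepared without a session: nothing is carried over.
without_session = requests.Request("GET", profile_url).prepare()
print(without_session.headers.get("Cookie"))
```

This is why 4-sessionCookies.py does not need to pass `cookies=` explicitly, whereas 3-cookies.py does.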

  24. HTTP Basic Access Authentication • import requests • from requests.auth import AuthBase • from requests.auth import HTTPBasicAuth • auth = HTTPBasicAuth('ryan', 'password') • r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth) • print(r.text)
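HTTP Basic authentication amounts to sending an `Authorization` header whose value is `Basic ` followed by the base64 encoding of `username:password`. A minimal standard-library sketch of that header follows; `basic_auth_header` is a hypothetical helper for illustration, not a Requests function.

```python
import base64


def basic_auth_header(username, password):
    """Illustrative helper: build the value of a Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


header = basic_auth_header("ryan", "password")
print(header)

# base64 is reversible: anyone who sees the header can recover the password.
print(base64.b64decode(header.split(" ", 1)[1]))
```

Because base64 is an encoding and not encryption, Basic authentication should only be used over HTTPS.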

  25. CAPTCHAs (covered later)

  26. Software Architecture

  27. Architecture of Systems • Python • re • urllib.request • BeautifulSoup • Requests • Scrapy • Selenium WebDriver

  28. Testing • 26.4. unittest — Unit testing framework • Source code: Lib/unittest/__init__.py • The unittest unit testing framework was originally inspired by JUnit and has a similar flavor as major unit testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework. • To achieve this, unittest supports some important concepts in an object-oriented way:

  29. test fixture • A test fixture represents the preparation needed to perform one or more tests, and any associated cleanup actions. This may involve, for example, creating temporary or proxy databases, directories, or starting a server process. • test case • A test case is the individual unit of testing. It checks for a specific response to a particular set of inputs. unittest provides a base class, TestCase, which may be used to create new test cases. • test suite • A test suite is a collection of test cases, test suites, or both. It is used to aggregate tests that should be executed together. • test runner • A test runner is a component which orchestrates the execution of tests and provides the outcome to the user. The runner may use a graphical interface, a textual interface, or return a special value to indicate the results of executing the tests.
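The four concepts above can be seen together in a few lines. This is a small sketch, not code from the lecture: `build_params` is a hypothetical function under test that builds the form payload used earlier.

```python
import unittest


def build_params(first, last):
    """Hypothetical function under test: builds the form payload."""
    return {"firstname": first.strip(), "lastname": last.strip()}


class BuildParamsTest(unittest.TestCase):  # test case
    def setUp(self):
        # Test fixture: preparation that runs before each test method.
        self.first, self.last = " Ryan ", "Mitchell"

    def test_keys_match_form_fields(self):
        params = build_params(self.first, self.last)
        self.assertEqual(set(params), {"firstname", "lastname"})

    def test_whitespace_is_stripped(self):
        params = build_params(self.first, self.last)
        self.assertEqual(params["firstname"], "Ryan")


# Test suite: collect the tests. Test runner: execute them and report.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BuildParamsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

In a standalone test file, calling `unittest.main()` is the more common way to get the loader and runner in one step.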

  30. Next time – Selenium Web Driver
