Web Scraping Lecture 10 - Selenium

Web Scraping Lecture 10 - Selenium. Topics: Selenium WebDriver, ChromeDriver, PhantomJS. Readings: Chapter 10. January 26, 2017. Overview: last time, Lecture 8 slides 1-29 and Chapter 9 (the Requests library, filling out forms): 1-simpleForm.py, 2-fileSubmission.py, 3-cookies.py

barr

Presentation Transcript


  1. Web Scraping Lecture 10 - Selenium • Topics • Selenium WebDriver • ChromeDriver, PhantomJS • Readings: • Chapter 10 • January 26, 2017

  2. Overview • Last Time: Lecture 8 Slides 1-29 • Chapter 9: the Requests Library – filling out forms • 1-simpleForm.py • 2-fileSubmission.py • 3-cookies.py • 4-sessionCookies.py • 5-BasicAuth.py • Software Architecture of systems • Today: • Chapter 13 • References: Chapter 13, websites

  3. Selenium Web Driver Big Picture • Big Picture = Software Architecture – how components of the software fit together

  4. References • Windows Installation • YouTube video • https://www.youtube.com/watch?v=V69wc4Tmwjc • Linux Installation • http://blog.likewise.org/2015/01/setting-up-chromedriver-and-the-selenium-webdriver-python-bindings-on-ubuntu-14-dot-04/ • Chrome Driver • https://sites.google.com/a/chromium.org/chromedriver/getting-started • PhantomJS • Selenium Site

  5. JavaScript • <script> alert("This creates a pop-up using JavaScript"); </script> • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3813-3814). O'Reilly Media. Kindle Edition.

  6. Examples of JavaScript

  7. jQuery • jQuery is an extremely common library, • used by 70% of the most popular Internet sites and • about 30% of the rest of the Internet. • A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as: • <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script> • jQuery dynamically creates HTML content that appears only after the JavaScript is executed.
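The "readily identifiable" check above can be sketched in a few lines. This is only an illustration using the standard library's html.parser; the names ScriptSrcCollector, uses_jquery, and SAMPLE_PAGE are made up for this example, and matching the substring "jquery" in a script URL is a rough heuristic, not a reliable detector.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def uses_jquery(html):
    """Heuristic: True if any script URL mentions jquery (assumption, not proof)."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return any("jquery" in src.lower() for src in collector.sources)


# Hypothetical page containing the import shown on the slide
SAMPLE_PAGE = """
<html><head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
</head><body><p>Hello</p></body></html>
"""

print(uses_jquery(SAMPLE_PAGE))
print(uses_jquery("<html><body><script>alert('hi');</script></body></html>"))
```

A page that passes this check probably builds some of its HTML in the browser, which is the case where a plain HTTP fetch is not enough and Selenium becomes useful.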

  8. Google analytics

  9. Google Maps • Embedded in websites

  10. Executing JavaScript with Selenium

  11. Selenium Self Service Carolina Demo

  12. Ajax and Dynamic HTML

  13. Installation • pip alone is not enough here: a separate ChromeDriver executable forms the interface between your Python program using Selenium and the browser (in this case Chrome)
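Because the driver is a separate executable, a quick way to see whether Python can find it is to search the PATH. The sketch below uses only the standard library; find_driver is a hypothetical helper name, and "chromedriver" is assumed to be the executable's name on your system.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    # Not on PATH: you would need to give Selenium the executable's location yourself
    print("chromedriver not found on PATH")
else:
    print("chromedriver found at", path)
```

If the lookup returns None, either add the unzipped driver's folder to PATH or pass its location explicitly when creating the driver, as the demo code later in the deck does.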

  14. ChromeDriver - WebDriver for Chrome • Latest Release: ChromeDriver 2.27 • https://sites.google.com/a/chromium.org/chromedriver/downloads • Pick your OS • Unzip and remember where it is

  15. PhantomJS – headless WebDriver • http://phantomjs.org/download.html

  16. Setting Up ChromeDriver and the Selenium-WebDriver Python bindings on Ubuntu 14.04 • install Google Chrome for Debian/Ubuntu: • sudo apt-get install libxss1 libappindicator1 libindicator7 • wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb • sudo dpkg -i google-chrome*.deb • sudo apt-get install -f • install xvfb so we can run Chrome headlessly: • sudo apt-get install xvfb • Source: https://christopher.su/2015/selenium-chromedriver-ubuntu/

  17. ChromeDriver – Ubuntu 14.04 • sudo apt-get install unzip • wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip • unzip chromedriver_linux64.zip • chmod +x chromedriver • sudo mv -f chromedriver /usr/local/share/chromedriver • sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver • sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver • Source: https://christopher.su/2015/selenium-chromedriver-ubuntu/

  18. Install Selenium and pyvirtualdisplay • pip install pyvirtualdisplay selenium • Now, we can do stuff like this with Selenium in Python: • from pyvirtualdisplay import Display • from selenium import webdriver • display = Display(visible=0, size=(800, 600)) • display.start() • driver = webdriver.Chrome() • driver.get('http://christopher.su') • print(driver.title)

  19. Selenium Selectors

  20. You can still use BeautifulSoup

  21. from selenium.webdriver.common.by import By

  22. By Selection strategies
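A locator in Selenium is a (strategy, value) pair, and the By constants are plain strings naming the strategy. The sketch below collects the locators used in this lecture's demo code; LOGIN_LOCATORS is a made-up name, and the fallback class is only a stand-in so the example runs where Selenium is not installed.

```python
try:
    from selenium.webdriver.common.by import By
except ImportError:
    class By:
        """Stand-in mirroring the string values of Selenium's By constants."""
        ID = "id"
        NAME = "name"
        XPATH = "xpath"
        LINK_TEXT = "link text"
        PARTIAL_LINK_TEXT = "partial link text"
        TAG_NAME = "tag name"
        CLASS_NAME = "class name"
        CSS_SELECTOR = "css selector"

# Locators taken from the Self Service Carolina demo slides
LOGIN_LOCATORS = {
    "sign_in_link": (By.PARTIAL_LINK_TEXT, "Sign in to"),
    "username_box": (By.NAME, "username"),
    "password_box": (By.ID, "vipid-password"),
}

for label, (strategy, value) in LOGIN_LOCATORS.items():
    print(f"{label}: find by {strategy!r} with value {value!r}")
```

Keeping locators together like this is one possible design choice: when the site's markup changes, there is a single place to update.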

  23. PhantomJS – headless WebDriver Again • http://phantomjs.org/download.html

  24. XPath Syntax • XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document. • introduced by the W3C in 1999 • used in languages such as Python, Java, and C# when dealing with XML documents. • Although BeautifulSoup does not support XPath, many of the other libraries in this book do. • It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular. • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 4051-4056). O'Reilly Media. Kindle Edition.
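To get a feel for XPath without a browser, the standard library's xml.etree.ElementTree can be used, with the caveat that it supports only a limited subset of XPath. The document and the expressions below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document (ElementTree needs valid XML)
DOC = """
<html>
  <body>
    <div id="content">
      <a href="/page1">First</a>
      <a href="/page2">Second</a>
    </div>
    <div id="footer">
      <a href="/about">About</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Roughly the CSS selector div#content > a, written as an XPath expression
content_links = root.findall(".//div[@id='content']/a")
hrefs = [a.get("href") for a in content_links]
print(hrefs)

# Every <a> element anywhere in the document
all_links = root.findall(".//a")
print(len(all_links))
```

The same style of expression can be passed to Selenium with the By.XPATH strategy, where the browser's full XPath engine is available.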

  25. XPATH

  26. XPATH

  27. Selenium Self Service Carolina Demo • if __name__ == "__main__": • driver = init_driver() • password = "MyPassword" • #password = input("Enter MySC password: ") • lookup(driver, "Selenium") • time.sleep(5) • driver.quit()

  28. import time • from selenium import webdriver • from selenium.webdriver.common.by import By • from selenium.webdriver.support.ui import WebDriverWait • from selenium.webdriver.support import expected_conditions as EC • from selenium.common.exceptions import TimeoutException • from bs4 import BeautifulSoup • def init_driver(): • driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe") • driver.wait = WebDriverWait(driver, 5) • return driver

  29. def lookup(driver, query): • driver.get("https://my.sc.edu/") • print ("SSC opened") • try: • link = driver.wait.until(EC.presence_of_element_located( • (By.PARTIAL_LINK_TEXT, "Sign in to"))) • #https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu • print ("Found link", link) • link.click() • print ("Clicked link") • #button = driver.wait.until(EC.element_to_be_clickable( • # (By.NAME, "btnK"))) • #box.send_keys(query) • #button.click() • except TimeoutException: • print("Houston we have a problem First Page")

  30. # Now try to login • try: • user_box = driver.wait.until(EC.presence_of_element_located( • (By.NAME, "username"))) • #https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu • print ("Found box", user_box) • user_box.send_keys("01069379") • print ("ID entered") • passwd_box = driver.wait.until(EC.presence_of_element_located( • (By.ID, "vipid-password"))) • print ("Found password box", passwd_box) • passwd_box.send_keys(password) • print ("password entered") • button = driver.wait.until(EC.element_to_be_clickable( • (By.NAME, "submit"))) • print ("Found submit button", button) • #box.send_keys(query) • button.click() • except TimeoutException: • print("Houston we have a problem Login Page")
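The demo relies on explicit waits: keep checking a condition until it holds or a timeout expires. The idea can be sketched with the standard library alone. This is not Selenium's implementation; wait_until, WaitTimeout, and element_present are hypothetical names that only mimic the pattern of driver.wait.until(...) followed by except TimeoutException.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for TimeoutException)."""


def wait_until(condition, timeout=5.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulate an element that "appears" 0.3 seconds after the page starts loading
start = time.monotonic()


def element_present():
    return "login box" if time.monotonic() - start > 0.3 else None


try:
    print("Found", wait_until(element_present, timeout=2))
except WaitTimeout:
    print("Houston we have a problem")
```

Polling with a deadline is usually preferable to a fixed time.sleep, because the script continues as soon as the element shows up and fails with a clear error when it does not.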
