
590 Web Scraping – testing

Learn about web scraping, testing websites, and cleaning natural language data using NLTK. Topics include Scrapy, logging spider, openAllLinks, and removing common words.


Presentation Transcript


  1. 590 Web Scraping – testing • Topics • Chapter 13 – Testing • Readings: • Text – Chapter 13 • April 4, 2017

  2. Today • Scrapers from the Scrapy documentation • loggingSpider.py • openAllLinks.py • Cleaning natural language data with NLTK • Removing common words • Testing in Python • unittest • Testing websites

  3. Rest of the semester • Tuesday April 4 • Thursday April 6 • Tuesday April 11 • Thursday April 13 – Test 2 • Tuesday April 18 • Thursday April 20 • Tuesday April 25 – Reading Day • Tuesday May 2 – 9:00 a.m.  EXAM

  4. Test 2 • 50% in class • 50% take-home

  5. Exam – Scraping project • Proposal statement (April 11) – one-sentence description • Project description (April 18) • Demo (May 2)

  6. Cleaning Natural Language Data • Removing common words • Corpus of Contemporary American English (COCA) • http://corpus.byu.edu/coca • From the site: "In addition to this online interface, you can also download extensive data for offline use – full-text, word frequency, n-grams, and collocates data. You can also access the data via WordAndPhrase (including the ability to analyze entire texts that you input)."

  7. Most common words in English • The first 25 make up about 1/3 of English text • The first 100 make up about 1/2 • common = ['the', 'be', …] • if isCommon(word): … • (a concrete sketch follows below)
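To make the isCommon idea concrete, here is a minimal sketch using NLTK's stopwords corpus; the common list and isCommon helper mirror the slide's pseudocode, and using NLTK's list rather than a hand-typed one is my own substitution.

    import nltk
    nltk.download('stopwords')          # one-time corpus download
    from nltk.corpus import stopwords

    # NLTK's equivalent of the slide's common = ['the', 'be', ...]
    common = set(stopwords.words('english'))

    def isCommon(word):
        # Hypothetical helper matching the slide's pseudocode
        return word.lower() in common

    words = "the quick brown fox jumps over the lazy dog".split()
    print([w for w in words if not isCommon(w)])
    # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']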

  8. More Scrapy • Logging spider • openAllLinks • LxmlLinkExtractor

  9. loggingSpider.py (Scrapy documentation, page 36)

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            self.logger.info('A response from %s just arrived!', response.url)
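A usage note: a standalone spider file like this can be run without creating a full Scrapy project, using Scrapy's runspider command. (The example.com URLs are placeholders from the documentation, so this will not fetch real pages.)

    scrapy runspider loggingSpider.py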

  10. openAllLinks.py – multiple Requests and items from a single callback (Scrapy documentation, page 36)

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            # … (remaining URLs elided on the slide)
        ]

        def parse(self, response):
            # Yield an item for every h3 heading on the page
            for h3 in response.xpath('//h3').extract():
                yield {"title": h3}
            # Follow every link on the page with this same callback
            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

  11. LxmlLinkExtractor (Scrapy documentation)

    class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(
        allow=(),
        deny=(),
        allow_domains=(),
        deny_domains=(),
        deny_extensions=None,
        restrict_xpaths=(),
        restrict_css=(),
        tags=('a', 'area'),
        attrs=('href',),
        canonicalize=True,
        unique=True,
        process_value=None)

  12. allow (a regular expression, or list of) – a single regular expression (or list) that (absolute) URLs must match in order to be extracted • deny (a regular expression, or list of) – a single regular expression (or list) that (absolute) URLs must match in order to be excluded (i.e., not extracted) • allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links • deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links • deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package. (Scrapy documentation)

  13. restrict_xpaths (str or list) – an XPath (or list of XPaths) defining regions inside the response from which links should be extracted. If given, only the text selected by those XPaths will be scanned for links. See examples below. • restrict_css (str or list) – a CSS selector (or list of selectors) defining regions inside the response from which links should be extracted. • tags (str or list) – a tag or list of tags to consider when extracting links. Defaults to ('a', 'area'). • attrs (list) – an attribute or list of attributes to consider when looking for links to extract (only for the tags listed in the tags parameter). Defaults to ('href',). (Scrapy documentation)

  14. canonicalize (boolean) – canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to True. • unique (boolean) – whether duplicate filtering should be applied to extracted links. • process_value (callable) – a function which receives each value extracted from the scanned tags and attributes, and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x. (Scrapy documentation)
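To make these parameters concrete, here is a minimal sketch (my own, not from the slides) of a spider that uses a link extractor; recent Scrapy versions expose LxmlLinkExtractor under the alias scrapy.linkextractors.LinkExtractor. The spider name and start URL are chosen for illustration.

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class WikiLinkSpider(scrapy.Spider):
        name = 'wiki_links'  # hypothetical spider name
        start_urls = ['http://en.wikipedia.org/wiki/Monty_Python']

        def parse(self, response):
            extractor = LinkExtractor(
                allow=r'/wiki/[^:]*$',                       # article-style URLs only
                restrict_xpaths='//div[@id="bodyContent"]',  # links in the article body only
            )
            # extract_links returns Link objects with url and text attributes
            for link in extractor.extract_links(response):
                yield {'url': link.url, 'text': link.text}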

  15. Chapter 13 – Testing • 1-wikiUnitTest.py • 2-wikiSeleniumTest • 3-interactiveTest • 4-dragAndDrop • 5-takeScreenshot • 6-combinedTest • ghostdriver

  16. Unit Testing – JUnit (http://tutorials.jenkov.com/java-unit-testing/simple-test.html)

    public class MyUnit {
        public String concatenate(String one, String two) {
            return one + two;
        }
    }

  17. (http://tutorials.jenkov.com/java-unit-testing/simple-test.html)

    import org.junit.Test;
    import static org.junit.Assert.*;

    public class MyUnitTest {

        @Test
        public void testConcatenate() {
            MyUnit myUnit = new MyUnit();
            String result = myUnit.concatenate("one", "two");
            assertEquals("onetwo", result);
        }
    }

  18. Python unittest • Comes standard with Python • Import and extend unittest.TestCase • setUp – run before each test to initialize the test case • tearDown – run after each test • Provides several types of assert methods • Runs all functions whose names begin with test_ as unit tests

  19. unittest example

    import unittest

    class TestStringMethods(unittest.TestCase):

        def test_upper(self):
            self.assertEqual('foo'.upper(), 'FOO')

        def test_isupper(self):
            self.assertTrue('FOO'.isupper())
            self.assertFalse('Foo'.isupper())

        def test_split(self):
            s = 'hello world'
            self.assertEqual(s.split(), ['hello', 'world'])
            # check that s.split fails when the separator is not a string
            with self.assertRaises(TypeError):
                s.split(2)

    if __name__ == '__main__':
        unittest.main()

  20. 1-wikiUnitTest.py

    from urllib.request import urlopen
    from urllib.parse import unquote
    import random
    import re
    from bs4 import BeautifulSoup
    import unittest

    class TestWikipedia(unittest.TestCase):
        bsObj = None
        url = None

  21. 1-wikiUnitTest.py (continued)

        def test_PageProperties(self):
            global bsObj
            global url
            url = "http://en.wikipedia.org/wiki/Monty_Python"
            # Test the first 100 pages we encounter
            for i in range(1, 100):
                bsObj = BeautifulSoup(urlopen(url))
                titles = self.titleMatchesURL()
                self.assertEqual(titles[0], titles[1])
                self.assertTrue(self.contentExists())
                url = self.getNextLink()
            print("Done!")

  22. 1-wikiUnitTest.py (continued)

        def titleMatchesURL(self):
            global bsObj
            global url
            pageTitle = bsObj.find("h1").get_text()
            urlTitle = url[(url.index("/wiki/")+6):]
            urlTitle = urlTitle.replace("_", " ")
            urlTitle = unquote(urlTitle)
            return [pageTitle.lower(), urlTitle.lower()]

  23. 1-wikiUnitTest.py (continued)

        def contentExists(self):
            global bsObj
            content = bsObj.find("div", {"id":"mw-content-text"})
            if content is not None:
                return True
            return False

        def getNextLink(self):
            global bsObj
            links = bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
            link = links[random.randint(0, len(links)-1)].attrs['href']
            print("Next link is: "+link)
            return "http://en.wikipedia.org"+link

  24. 1-wikiUnitTest.py (continued)

    if __name__ == '__main__':
        unittest.main()
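A brief usage note: because of the __main__ guard, running the file directly (python 1-wikiUnitTest.py) executes every test_ method. The leading digit in the filename makes it an invalid module name, so python -m unittest cannot load it by name; running the file directly is the practical option.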

  25. 2-wikiSeleniumTest

    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs-1.9.8-macosx/bin/phantomjs')
    driver.get("http://en.wikipedia.org/wiki/Monty_Python")
    assert "Monty Python" in driver.title, "Monty Python was not in the title"
    driver.close()
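PhantomJS development has since been discontinued. A sketch of the same test using headless Chrome instead (my own substitution; assumes chromedriver is installed and on the PATH):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')   # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    driver.get("http://en.wikipedia.org/wiki/Monty_Python")
    assert "Monty Python" in driver.title, "Monty Python was not in the title"
    driver.close()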

  26. 3-interactiveTest

    from selenium import webdriver
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver import ActionChains

    # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
    driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
    #driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')

    driver.get("http://pythonscraping.com/pages/files/form.html")

  27. 3-interactiveTest (continued)

    firstnameField = driver.find_element_by_name("firstname")
    lastnameField = driver.find_element_by_name("lastname")
    submitButton = driver.find_element_by_id("submit")

    ### METHOD 1 ###
    firstnameField.send_keys("Ryan")
    lastnameField.send_keys("Mitchell")
    submitButton.click()

  28. 3-interactiveTest (continued)

    ### METHOD 2 ###
    actions = ActionChains(driver).click(firstnameField).send_keys("Ryan") \
        .click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
    actions.perform()
    ################

    print(driver.find_element_by_tag_name("body").text)
    driver.close()

  29. 4-dragAndDrop

    from selenium import webdriver
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.webdriver import ActionChains

    # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
    driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
    #driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')

    driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')
    print(driver.find_element_by_id("message").text)

  30. 4-dragAndDrop (continued)

    element = driver.find_element_by_id("draggable")
    target = driver.find_element_by_id("div2")
    actions = ActionChains(driver)
    actions.drag_and_drop(element, target).perform()
    print(driver.find_element_by_id("message").text)

  31. 5-takeScreenshot

    from selenium import webdriver
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.webdriver import ActionChains

    # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
    driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
    driver.implicitly_wait(5)
    driver.get('http://www.pythonscraping.com/')
    driver.get_screenshot_as_file('tmp/pythonscraping.png')

  32. 6-combinedTest

    from selenium import webdriver
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.webdriver import ActionChains
    import unittest

  33. 6-combinedTest (continued)

    class TestAddition(unittest.TestCase):
        driver = None

        def setUp(self):
            global driver
            # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
            driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
            #driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
            url = 'http://pythonscraping.com/pages/javascript/draggableDemo.html'
            driver.get(url)

  34. 6-combinedTest (continued)

        def tearDown(self):
            print("Tearing down the test")

        def test_drag(self):
            global driver
            element = driver.find_element_by_id("draggable")
            target = driver.find_element_by_id("div2")
            actions = ActionChains(driver)
            actions.drag_and_drop(element, target).perform()
            self.assertEqual("You are definitely not a bot!", driver.find_element_by_id("message").text)

    if __name__ == '__main__':
        unittest.main()
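A brief usage note: unittest calls setUp before and tearDown after each test_ method, so every test starts with a freshly loaded page. Running the file directly (python 6-combinedTest.py) executes the drag test and prints the teardown message whether the assertion passes or fails.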
