Scrapy Best Practices: Settings, Middleware, and Items in Web Scraping

590 Scraping – NER shape features • Topics • Scrapy – items.py • Readings: • Scrapy documentation • April 4, 2017

Today • Scrapers from scrapy_documentation • loggingSpider.py • openAllLinks.py • Cleaning NLTK data • Removing common words • Testing in Python • unittest • Testing websites

Scrapy notes • Focused narrow scrape (one domain) • Broad scrapes – better suited to • Dealing with JavaScript in Scrapy

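Selenium and Scrapy

from scrapy.http import HtmlResponse
from selenium import webdriver


class JSMiddleware(object):
    def process_request(self, request, spider):
        # Render the page in a headless browser so that its JavaScript runs
        driver = webdriver.PhantomJS()
        driver.get(request.url)
        body = driver.page_source
        url = driver.current_url
        driver.quit()  # release the browser process before returning
        # Returning a response here skips Scrapy's normal download of the request
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)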
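A downloader middleware like the one on this slide only runs if the project settings enable it. The fragment below is a minimal sketch of that step; the module path `conifers.middlewares.JSMiddleware` is an assumption about where the class is saved, and the order value 543 is just a common placeholder.

```python
# settings.py (fragment)
# Assumption: JSMiddleware lives in conifers/middlewares.py.
# The number sets the middleware's position in the downloader middleware chain.
DOWNLOADER_MIDDLEWARES = {
    'conifers.middlewares.JSMiddleware': 543,
}
```

Note that PhantomJS support has been dropped from recent Selenium releases, so a headless Chrome or Firefox driver may be needed instead.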

Cfg file • Populating the settings • Settings can be populated using different mechanisms, each with a different precedence. Here they are in decreasing order of precedence: • Command line options (highest precedence) • Settings per-spider • Project settings module • Default settings per-command • Default global settings (lowest precedence)
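The precedence list above can be sketched in plain Python. This is not Scrapy's own code: `LayeredSettings` and the source names are hypothetical, and the numeric levels are illustrative values chosen to match the ordering on the slide.

```python
# Illustrative only: a higher-precedence source overrides a lower one.
PRIORITIES = {
    "default": 0,   # default global settings
    "command": 10,  # default settings per-command
    "project": 20,  # project settings module (settings.py)
    "spider": 30,   # settings per-spider
    "cmdline": 40,  # command line options (-s NAME=value)
}


class LayeredSettings:
    """Keep, for each setting name, the value from the highest-precedence source."""

    def __init__(self):
        self._values = {}  # name -> (priority, value)

    def set(self, name, value, source):
        priority = PRIORITIES[source]
        current = self._values.get(name)
        # Overwrite only if the new source has equal or higher precedence.
        if current is None or priority >= current[0]:
            self._values[name] = (priority, value)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[1]


settings = LayeredSettings()
settings.set("DOWNLOAD_DELAY", 0, "default")
settings.set("DOWNLOAD_DELAY", 3, "project")
settings.set("DOWNLOAD_DELAY", 0.25, "cmdline")
settings.set("DOWNLOAD_DELAY", 1, "spider")  # ignored: the command line value wins
print(settings.get("DOWNLOAD_DELAY"))  # prints 0.25
```

The order in which sources are applied does not matter in this sketch; only the precedence level decides which value survives.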

Command line settings • scrapy crawl myspider -s LOG_FILE=scrapy.log

DEPTH_LIMIT • Default: 0 • Scope: scrapy.spidermiddlewares.depth.DepthMiddleware • The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

DEPTH_PRIORITY • Default: 0 • Scope: scrapy.spidermiddlewares.depth.DepthMiddleware • An integer that is used to adjust the request priority based on its depth: • if zero (default), no priority adjustment is made from depth • a positive value will decrease the priority, i.e., higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO) • a negative value will increase the priority, i.e., higher depth requests will be processed sooner (DFO) • See also: "Does Scrapy crawl in breadth-first or depth-first order?" about tuning Scrapy for BFO or DFO.
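To see how DEPTH_LIMIT and DEPTH_PRIORITY shape the crawl order, here is a toy simulation in plain Python. It is a sketch, not Scrapy's scheduler: the link graph and `crawl_order` are made up, priority is adjusted as `priority - depth * DEPTH_PRIORITY`, and ties are broken first-in-first-out, which is an assumption (Scrapy's behaviour for equal priorities depends on its queue classes).

```python
import heapq
from itertools import count

# Toy link graph (hypothetical): page -> links found on that page
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def crawl_order(start, depth_priority=0, depth_limit=0):
    """Return the order in which pages are visited under a depth-based priority."""
    tie = count()
    # heapq is a min-heap, so store the negated priority to pop the highest first.
    queue = [(0, next(tie), start, 0)]  # (-priority, tie-breaker, page, depth)
    seen = {start}
    order = []
    while queue:
        _, _, page, depth = heapq.heappop(queue)
        order.append(page)
        child_depth = depth + 1
        if depth_limit and child_depth > depth_limit:
            continue  # links beyond the limit are dropped
        for link in LINKS[page]:
            if link in seen:
                continue
            seen.add(link)
            priority = -child_depth * depth_priority
            heapq.heappush(queue, (-priority, next(tie), link, child_depth))
    return order


print(crawl_order("A", depth_priority=1))   # breadth-first: A B C D E
print(crawl_order("A", depth_priority=-1))  # depth-first:   A B D C E
print(crawl_order("A", depth_limit=1))      # stops after depth 1: A B C
```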

DOWNLOAD_DELAY • Default: 0 • The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example: • DOWNLOAD_DELAY = 0.25 # 250 ms of delay • This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
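The randomized interval described above amounts to drawing a value uniformly between half and one and a half times the configured delay. A minimal sketch follows; `next_delay` is a hypothetical helper, not a Scrapy function.

```python
import random


def next_delay(download_delay, randomize=True):
    """Return the wait, in seconds, before the next request to the same site."""
    if randomize and download_delay:
        # Random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY
        return random.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay


random.seed(0)  # fixed seed only so the example is repeatable
delays = [next_delay(0.25) for _ in range(5)]
print([round(d, 3) for d in delays])  # every value falls between 0.125 and 0.375
```

With `DOWNLOAD_DELAY = 0.25`, each wait therefore lands somewhere between 125 ms and 375 ms rather than at exactly 250 ms.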

Conifers

Items.py

import scrapy


class ConifersItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    genus = scrapy.Field()
    species = scrapy.Field()

Middleware.py

from scrapy import signals


class ConifersSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

BOT_NAME = 'conifers'

SPIDER_MODULES = ['conifers.spiders']
NEWSPIDER_MODULE = 'conifers.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'conifers (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

Pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ConifersPipeline(object):
    def process_item(self, item, spider):
        return item
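The generated pipeline passes items through unchanged. As a sketch of what a pipeline could do for this project, the hypothetical `CleanConifersPipeline` below collapses list-valued fields into stripped strings. It uses a plain dict in place of a `ConifersItem`, and the sample values are made up for illustration.

```python
class CleanConifersPipeline(object):
    """Hypothetical pipeline: turn lists of extracted strings into clean strings."""

    fields = ("name", "genus", "species")

    def process_item(self, item, spider):
        for field in self.fields:
            value = item.get(field)
            if isinstance(value, (list, tuple)):
                # extract() returns a list of matches; join them into one string
                value = " ".join(v.strip() for v in value)
            elif isinstance(value, str):
                value = value.strip()
            item[field] = value
        return item


# A plain dict stands in for a ConifersItem; the spider argument is unused here
raw = {"name": [" Korean fir "], "genus": ["Abies"], "species": ["koreana "]}
cleaned = CleanConifersPipeline().process_item(raw, spider=None)
print(cleaned)  # {'name': 'Korean fir', 'genus': 'Abies', 'species': 'koreana'}
```

To run in a real project, a pipeline like this would also need an entry in the ITEM_PIPELINES setting, as the comment on the slide says.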

coniferSpider

import scrapy

from conifers.items import ConifersItem


class ConiferSpider(scrapy.Spider):
    name = "conifer"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['http://greatplantpicks.org/by_plant_type/conifer']

    def parse(self, response):
        # Save the raw page to disk so it can be inspected offline
        #filename = response.url.split("/")[-2] + '.html'
        filename = 'conifers' + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

import scrapy

from conifers.items import ConifersItem
#from scrapy.selector import Selector
#from scrapy.http import HtmlResponse


class ConifersextractSpider(scrapy.Spider):
    name = "conifersExtract"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['http://www.greatplantpicks.org/plantlists/by_plant_type/conifer']

    def parse(self, response):
        # One item per table row
        for sel in response.xpath('//tbody/tr'):
            item = ConifersItem()
            item['name'] = sel.xpath('td[@class="common-name"]/a/text()').extract()
            item['genus'] = sel.xpath('td[@class="plantname"]/a/span[@class="genus"]/text()').extract()
            item['species'] = sel.xpath('td[@class="plantname"]/a/span[@class="species"]/text()').extract()
            yield item
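The row-by-row extraction in parse() can be tried without running a crawl. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, in place of Scrapy selectors. The HTML snippet is made up to resemble the table the spider targets; it is not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet shaped like the plant table
SAMPLE = """
<table><tbody>
  <tr>
    <td class="plantname"><a><span class="genus">Abies</span> <span class="species">koreana</span></a></td>
    <td class="common-name"><a>Korean fir</a></td>
  </tr>
  <tr>
    <td class="plantname"><a><span class="genus">Pinus</span> <span class="species">mugo</span></a></td>
    <td class="common-name"><a>mugo pine</a></td>
  </tr>
</tbody></table>
"""

root = ET.fromstring(SAMPLE.strip())
rows = []
for tr in root.findall(".//tbody/tr"):
    # Paths are relative to the current row, as in the spider's sel.xpath() calls
    rows.append({
        "name": tr.findtext("td[@class='common-name']/a"),
        "genus": tr.findtext("td[@class='plantname']/a/span[@class='genus']"),
        "species": tr.findtext("td[@class='plantname']/a/span[@class='species']"),
    })
print(rows)
```

Unlike findtext(), which returns a single string, the spider's extract() returns a list of matches for each field, so each item field on the slide holds a list.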
