
590 Scraping – NER shape features

590 Scraping – NER shape features. Topics: Scrapy – items.py. Readings: Scrapy documentation. April 4, 2017. Today: scrapers from scrapy_documentation (loggingSpider.py, openAllLinks.py); cleaning NLTK data; removing common words; testing in Python (unittest); testing websites. Scrapy notes.

fredpruitt


Presentation Transcript


  1. 590 Scraping – NER shape features • Topics • Scrapy – items.py • Readings: • Scrapy documentation • April 4, 2017

  2. Today • Scrapers from scrapy_documentation • loggingSpider.py • openAllLinks.py • Cleaning NLTK data • Removing common words • Testing in Python • unittest • Testing websites

  3. Scrapy notes • Focused narrow scrape (one domain) • Broad scrapes – better suited to • Dealing with JavaScript in Scrapy

  4. Selenium and Scrapy
      from scrapy.http import HtmlResponse
      from selenium import webdriver
      class JSMiddleware(object):
          def process_request(self, request, spider):
              # Load the page in a headless browser so that its JavaScript runs.
              driver = webdriver.PhantomJS()
              driver.get(request.url)
              body = driver.page_source
              # Hand the rendered HTML back to Scrapy as the response.
              return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
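
A downloader middleware such as JSMiddleware only runs once it is enabled in the project settings. The fragment below is a minimal sketch: the dotted path assumes the class was saved in conifers/middlewares.py, and the priority value 543 is an arbitrary illustrative choice; neither appears in the slides.

```python
# settings.py fragment (sketch).
# Assumption: JSMiddleware lives in conifers/middlewares.py; adjust the
# dotted path to wherever the class is actually defined.
DOWNLOADER_MIDDLEWARES = {
    # The number orders this middleware relative to the others;
    # lower values sit closer to the engine.
    "conifers.middlewares.JSMiddleware": 543,
}
```

Because process_request returns a response object, Scrapy uses that rendered page and skips its own download for the request.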

  5. Cfg file • Populating the settings • Settings can be populated using different mechanisms, each of which has a different precedence. Here they are in decreasing order of precedence: • Command line options (highest precedence) • Settings per-spider • Project settings module • Default settings per-command • Default global settings (lowest precedence)
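
The precedence order can be pictured as a stack of dictionaries searched from the top down. The sketch below uses collections.ChainMap from the standard library to model that lookup; it is not how Scrapy implements its settings, and every value in the layers is made up for illustration.

```python
from collections import ChainMap

# One plain dict per layer; all values are invented for this sketch.
default_global = {"LOG_LEVEL": "DEBUG", "DOWNLOAD_DELAY": 0, "DEPTH_LIMIT": 0}
default_per_command = {}
project_settings = {"DOWNLOAD_DELAY": 3, "LOG_LEVEL": "INFO"}
per_spider = {"DOWNLOAD_DELAY": 0.25}
command_line = {"LOG_LEVEL": "WARNING"}

# ChainMap searches its maps left to right, so the layer with the
# highest precedence goes first.
settings = ChainMap(
    command_line,
    per_spider,
    project_settings,
    default_per_command,
    default_global,
)

# Flatten the layers into the values a crawl would actually see.
resolved = dict(settings)
```

Here LOG_LEVEL comes from the command line layer, DOWNLOAD_DELAY from the per-spider layer, and DEPTH_LIMIT falls through to the global default.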

  6. Command line settings • scrapy crawl myspider -s LOG_FILE=scrapy.log

  7. DEPTH_LIMIT • Default: 0 • Scope: scrapy.spidermiddlewares.depth.DepthMiddleware • The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
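
To make the "zero means no limit" rule concrete, here is a small offline sketch of a depth-limited crawl over an in-memory link graph. The graph, the crawl function, and its parameter names are all hypothetical; the start page counts as depth 0.

```python
from collections import deque

# Hypothetical link graph standing in for a website: page -> linked pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl(start, depth_limit=0):
    """Return pages in visit order; depth_limit=0 means no limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            # A link found on a page at `depth` would be fetched at depth + 1.
            if depth_limit and depth + 1 > depth_limit:
                continue  # too deep: drop the request
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With this graph, crawl("/", depth_limit=1) stops after the pages linked directly from the start page, while crawl("/") visits all five pages.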

  8. DEPTH_PRIORITY • Default: 0 • Scope: scrapy.spidermiddlewares.depth.DepthMiddleware • An integer that is used to adjust the request priority based on its depth: • if zero (default), no priority adjustment is made from depth • a positive value will decrease the priority, i.e., higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO) • a negative value will increase priority, i.e., higher depth requests will be processed sooner (DFO) • See also the Scrapy FAQ entry "Does Scrapy crawl in breadth-first or depth-first order?" about tuning Scrapy for BFO or DFO.
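
The effect of the sign can be worked through with a toy priority queue. The sketch below assumes the adjustment is priority = base - depth * DEPTH_PRIORITY, which matches the description above; the function names and the pending requests are hypothetical, and the scheduler's queue configuration, which also matters in a real crawl, is left out.

```python
import heapq


def adjusted_priority(base_priority, depth, depth_priority):
    """Priority after the depth adjustment: base - depth * DEPTH_PRIORITY."""
    return base_priority - depth * depth_priority


def processing_order(requests, depth_priority):
    """Order (name, depth) pairs by adjusted priority, highest first."""
    heap = []
    for index, (name, depth) in enumerate(requests):
        priority = adjusted_priority(0, depth, depth_priority)
        # heapq is a min-heap, so negate the priority; index keeps ties stable.
        heapq.heappush(heap, (-priority, index, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical pending requests as (name, depth) pairs.
pending = [("deep", 3), ("shallow", 1), ("middle", 2)]
```

A positive value such as 1 puts the shallow request first, a negative value such as -1 puts the deep request first, and 0 leaves the requests in arrival order.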

  9. DOWNLOAD_DELAY • Default: 0 • The amount of time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example: • DOWNLOAD_DELAY = 0.25 # 250 ms of delay • This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
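
For DOWNLOAD_DELAY = 0.25 the randomized wait therefore falls between 0.125 and 0.375 seconds. The sketch below reproduces that arithmetic with the standard library; next_delay and its parameters are hypothetical names, not part of Scrapy.

```python
import random


def next_delay(download_delay, randomize=True, rng=random):
    """Seconds to wait before the next request to the same site."""
    if randomize and download_delay:
        # Random interval between 0.5x and 1.5x the configured delay.
        return rng.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay


# Draw many delays for DOWNLOAD_DELAY = 0.25 using seeded generators.
samples = [next_delay(0.25, rng=random.Random(seed)) for seed in range(1000)]
```

Passing randomize=False models RANDOMIZE_DOWNLOAD_DELAY being switched off, in which case the delay is returned unchanged.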

  10. Conifers

  11. Items.py
      import scrapy
      class ConifersItem(scrapy.Item):
          # define the fields for your item here like:
          name = scrapy.Field()
          genus = scrapy.Field()
          species = scrapy.Field()

  12. Middleware.py
      from scrapy import signals
      class ConifersSpiderMiddleware(object):
          # Not all methods need to be defined. If a method is not defined,
          # scrapy acts as if the spider middleware does not modify the
          # passed objects.
          @classmethod
          def from_crawler(cls, crawler):
              # This method is used by Scrapy to create your spiders.
              s = cls()
              crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
              return s
          def process_spider_input(self, response, spider):
              # Called for each response that goes through the spider
              # middleware and into the spider.
              # Should return None or raise an exception.
              return None

  13. Middleware.py (continued)
          def process_spider_output(self, response, result, spider):
              # Called with the results returned from the Spider, after
              # it has processed the response.
              # Must return an iterable of Request, dict or Item objects.
              for i in result:
                  yield i
          def process_spider_exception(self, response, exception, spider):
              # Called when a spider or process_spider_input() method
              # (from other spider middleware) raises an exception.
              # Should return either None or an iterable of Response, dict
              # or Item objects.
              pass

  14. Middleware.py (continued)
          def process_start_requests(self, start_requests, spider):
              # Called with the start requests of the spider, and works
              # similarly to the process_spider_output() method, except
              # that it doesn’t have a response associated.
              # Must return only requests (not items).
              for r in start_requests:
                  yield r
          def spider_opened(self, spider):
              spider.logger.info('Spider opened: %s' % spider.name)

  15. Settings.py
      BOT_NAME = 'conifers'
      SPIDER_MODULES = ['conifers.spiders']
      NEWSPIDER_MODULE = 'conifers.spiders'
      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      #USER_AGENT = 'conifers (+http://www.yourdomain.com)'
      # Obey robots.txt rules
      ROBOTSTXT_OBEY = True

  16. Settings.py (continued)
      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      #CONCURRENT_REQUESTS = 32
      # Configure a delay for requests for the same website (default: 0)
      # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      #DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      #CONCURRENT_REQUESTS_PER_DOMAIN = 16
      #CONCURRENT_REQUESTS_PER_IP = 16

  17. Pipelines.py
      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
      class ConifersPipeline(object):
          def process_item(self, item, spider):
              return item
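
The generated pipeline passes items through untouched. As a sketch of what a non-trivial process_item might do, the hypothetical CleanConifersPipeline below flattens list-valued fields (selectors' .extract() returns lists) and trims whitespace. The class name and the cleaning rules are assumptions for illustration; it is exercised here with a plain dict rather than a real ConifersItem.

```python
class CleanConifersPipeline:
    """Hypothetical pipeline: flatten list-valued fields and trim whitespace."""

    fields = ("name", "genus", "species")

    def process_item(self, item, spider):
        for field in self.fields:
            value = item.get(field)
            if isinstance(value, list):
                # .extract() yields a list; keep the first match, if any.
                value = value[0] if value else ""
            if isinstance(value, str):
                item[field] = value.strip()
        return item


# Exercise the pipeline offline with a dict standing in for an item.
raw = {"name": [" alpine fir "], "genus": ["Abies"], "species": []}
cleaned = CleanConifersPipeline().process_item(raw, spider=None)
```

To use such a pipeline in a project it would also need an entry in the ITEM_PIPELINES setting, for example under the assumed path conifers.pipelines.CleanConifersPipeline.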

  18. coniferSpider
      import scrapy
      from conifers.items import ConifersItem
      class ConiferSpider(scrapy.Spider):
          name = "conifer"
          allowed_domains = ["greatplantpicks.org"]
          start_urls = ['http://greatplantpicks.org/by_plant_type/conifer']
          def parse(self, response):
              # Save the raw page body to a local file for inspection.
              #filename = response.url.split("/")[-2] + '.html'
              filename = 'conifers' + '.html'
              with open(filename, 'wb') as f:
                  f.write(response.body)
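
The commented-out filename line is worth tracing by hand, because its result depends on whether the URL ends in a slash. The sketch below wraps that expression in a hypothetical helper, filename_from_url, and applies it to the start URL from the slide.

```python
def filename_from_url(url, index=-2):
    """Build '<segment>.html' from one '/'-separated piece of the URL."""
    return url.split("/")[index] + ".html"


# Start URL as given on the slide (no trailing slash).
start_url = "http://greatplantpicks.org/by_plant_type/conifer"

second_to_last = filename_from_url(start_url)        # uses index -2
last_segment = filename_from_url(start_url, index=-1)
```

For this URL, index -2 selects the parent segment by_plant_type rather than conifer; index -2 only selects the final path segment when the URL ends with a slash.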

  19. ConifersextractSpider
      import scrapy
      from conifers.items import ConifersItem
      #from scrapy.selector import Selector
      #from scrapy.http import HtmlResponse
      class ConifersextractSpider(scrapy.Spider):
          name = "conifersExtract"
          allowed_domains = ["greatplantpicks.org"]
          start_urls = ['http://www.greatplantpicks.org/plantlists/by_plant_type/conifer']

  20. ConifersextractSpider (continued)
          def parse(self, response):
              # Each table row describes one plant; build one item per row.
              for sel in response.xpath('//tbody/tr'):
                  item = ConifersItem()
                  item['name'] = sel.xpath('td[@class="common-name"]/a/text()').extract()
                  item['genus'] = sel.xpath('td[@class="plantname"]/a/span[@class="genus"]/text()').extract()
                  item['species'] = sel.xpath('td[@class="plantname"]/a/span[@class="species"]/text()').extract()
                  yield item
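
The row-by-row extraction can be tried offline before running a crawl. The sketch below swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports a limited XPath subset, and runs equivalent paths against invented sample markup that mimics the table layout the spider expects. The markup and the parse_rows helper are assumptions for illustration, not content from the real site.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that mimics the table layout the XPath expects.
SAMPLE = """
<table><tbody>
  <tr>
    <td class="common-name"><a href="#">alpine fir</a></td>
    <td class="plantname"><a href="#"><span class="genus">Abies</span>
      <span class="species">lasiocarpa</span></a></td>
  </tr>
  <tr>
    <td class="common-name"><a href="#">Korean fir</a></td>
    <td class="plantname"><a href="#"><span class="genus">Abies</span>
      <span class="species">koreana</span></a></td>
  </tr>
</tbody></table>
"""


def parse_rows(markup):
    """Return one dict per table row, mirroring the spider's item fields."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.findall(".//tbody/tr"):
        rows.append({
            "name": tr.findtext("td[@class='common-name']/a"),
            "genus": tr.findtext("td[@class='plantname']/a/span[@class='genus']"),
            "species": tr.findtext("td[@class='plantname']/a/span[@class='species']"),
        })
    return rows


rows = parse_rows(SAMPLE)
```

Unlike .extract(), which returns a list of matches for each field, findtext returns the text of the first match as a single string, so the values here are plain strings.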
