
Web Scraping Lecture 8 – Storing Data

Presentation Transcript


  1. Web Scraping Lecture 8 – Storing Data • Topics • Storing data • Downloading • CSV, MySQL • Readings: • Chapters 5 and 4 • February 2, 2017

  2. Overview • Last Time: Lecture 6 slides 30 to end; Lecture 7 slides 1-31 • Crawling from Chapter 3: Lecture 6 slides 29-40 • Getting code again: https://github.com/REMitchell/python-scraping • 3-crawlSite.py • 4-getExternalLinks.py • 5-getAllExternalLinks.py • Chapter 4 • APIs • JSON • Today: • Iterators, generators and yield • Chapter 4 • APIs • JSON • JavaScript • References: Scrapy site / user manual

  3. Reg Expressions – Lookahead patterns • (?=...) Matches if ... matches next, but doesn’t consume any of the string. • This is called a lookahead assertion. • For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'. • (?!...) Matches if ... doesn’t match next. • This is a negative lookahead assertion. • For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'. • (?#...) A comment; the contents of the parentheses are simply ignored.
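As a quick illustration (not from the original slides), here is how the positive and negative lookahead behave in Python's re module:

    import re

    # Positive lookahead: 'Isaac ' matches only when followed by 'Asimov'
    print(re.search(r"Isaac (?=Asimov)", "Isaac Asimov"))   # <re.Match ... match='Isaac '>
    print(re.search(r"Isaac (?=Asimov)", "Isaac Newton"))   # None

    # Negative lookahead: 'Isaac ' matches only when NOT followed by 'Asimov'
    print(re.search(r"Isaac (?!Asimov)", "Isaac Newton"))   # <re.Match ... match='Isaac '>
    print(re.search(r"Isaac (?!Asimov)", "Isaac Asimov"))   # None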

  4. Chapter 4: Using APIs • API - In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. • https://en.wikipedia.org/wiki/Application_programming_interface • A web API is an application programming interface (API) for either a web server or a web browser. • Requests are made over HTTP • Responses come back as XML or JSON

  5. Authentication • Identify users – for charges etc. • http://developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>%20&name=guns%20n%27%20roses&format=json&start=0&results=100 • Using urlopen • token = "<your api key>" • webRequest = urllib.request.Request("http://myapi.com", headers={"token": token}) • html = urlopen(webRequest)

  6. Google Developers APIs

  7. Mining the Social Web; Twitter to come later

  8. Yield in Python • # A generator method: it yields child nodes one at a time instead of returning a list • def _get_child_candidates(self, distance, min_dist, max_dist): • if self._leftchild and distance - max_dist < self._median: • yield self._leftchild • if self._rightchild and distance + max_dist >= self._median: • yield self._rightchild • # The caller (inside a search method) consumes the generator: • result, candidates = list(), [self] • while candidates: • node = candidates.pop() • distance = node._get_dist(obj) • if distance <= max_dist and distance >= min_dist: • result.extend(node._values) • candidates.extend(node._get_child_candidates(distance, min_dist, max_dist)) • return result • https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/ • http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do

  9. Iterators and Generators • When you create a list, you can read its items one by one. Reading its items one by one is called iteration: • >>> mylist = [1, 2, 3] • >>> for i in mylist: • ... print(i) • Generators are iterators, but you can only iterate over them once. It's because they do not store all the values in memory, they generate the values on the fly: • >>> mygenerator = (x*x for x in range(3)) • >>> for i in mygenerator: • ... print(i) https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/

  10. Yield • Yield is a keyword that is used like return, except the function will return a generator. • >>> def createGenerator(): • ... mylist = range(3) • ... for i in mylist: • ... yield i*i • ... • >>> mygenerator = createGenerator() # create a generator • >>> print(mygenerator) # mygenerator is an object! • <generator object createGenerator at 0xb7555c34> • >>> for i in mygenerator: • ... print(i) https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
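To connect yield back to scraping (not on the original slide): a crawler can be written as a generator that hands out links as it finds them, which is essentially how Scrapy callbacks yield items and requests. A minimal sketch, assuming urlopen and BeautifulSoup as used elsewhere in these slides:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def iterLinks(url):
        # Yield href values one at a time instead of building the whole list in memory
        bsObj = BeautifulSoup(urlopen(url), "html.parser")
        for link in bsObj.findAll("a", href=True):
            yield link["href"]

    for href in iterLinks("http://www.pythonscraping.com"):
        print(href)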

  11. #Chapter 4: 4-DecodeJson.py • #Web Scraping with Python by Ryan Mitchell • import json • from urllib.request import urlopen • def getCountry(ipAddress): • response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8') • responseJson = json.loads(response) • return responseJson.get("country_code") • print(getCountry("50.78.253.58"))

  12. #Chapter 4: 5-jsonParsing.py • import json • jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}], "arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}' • jsonObj = json.loads(jsonString) • print(jsonObj.get("arrayOfNums")) • print(jsonObj.get("arrayOfNums")[1]) • print(jsonObj.get("arrayOfNums")[1].get("number")+jsonObj.get("arrayOfNums")[2].get("number")) • print(jsonObj.get("arrayOfFruits")[2].get("fruit"))
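For reference (not shown on the slide), running this script under Python 3 should print:

    [{'number': 0}, {'number': 1}, {'number': 2}]
    {'number': 1}
    3
    pear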

  13. Wiki Editing Histories – from where?

  14. The Map --

  15. #Chapter 4: 6-wikiHistories.py • from urllib.request import urlopen • from urllib.error import HTTPError • from bs4 import BeautifulSoup • import datetime • import json • import random • import re • random.seed(datetime.datetime.now()) • def getLinks(articleUrl): • html = urlopen("http://en.wikipedia.org"+articleUrl) • bsObj = BeautifulSoup(html, "html.parser") • return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

  16. def getHistoryIPs(pageUrl): • #Format of revision history pages is: • #http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history • pageUrl = pageUrl.replace("/wiki/", "") • historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history" • print("history url is: "+historyUrl) • html = urlopen(historyUrl) • bsObj = BeautifulSoup(html, "html.parser") • #finds only the links with class "mw-anonuserlink" which has IP addresses • #instead of usernames • ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"}) • addressList = set() • for ipAddress in ipAddresses: • addressList.add(ipAddress.get_text()) • return addressList

  17. def getCountry(ipAddress): • try: • response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8') • except HTTPError: • return None • responseJson = json.loads(response) • return responseJson.get("country_code") • links = getLinks("/wiki/Python_(programming_language)")
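The slide ends with the first call to getLinks; the rest of 6-wikiHistories.py is not reproduced in this transcript. A minimal sketch of the missing driver loop (an approximation, not the book's exact code), assuming the functions defined on the previous slides and the links list initialized above: it visits each linked article, collects the anonymous editors' IP addresses from its revision history, and prints the country each IP resolves to.

    while len(links) > 0:
        for link in links:
            print("-------------------")
            historyIPs = getHistoryIPs(link.attrs["href"])
            for historyIP in historyIPs:
                country = getCountry(historyIP)
                if country is not None:
                    print(historyIP + " is from " + country)
        # Hop to a random linked article and repeat
        newLink = links[random.randint(0, len(links) - 1)].attrs["href"]
        links = getLinks(newLink)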

  18. Output

  19. Chapter 5 - Storing Data • Files • CSV files • JSON, XML

  20. Downloading images: to copy or not • As you are scraping, do you download images or just store links? • Advantages to not copying? • Scrapers run much faster, and require much less bandwidth, when they don’t have to download files. • You save space on your own machine by storing only the URLs. • It is easier to write code that only stores URLs and doesn’t need to deal with additional file downloads. • You can lessen the load on the host server by avoiding large file downloads.

  21. Disadvantages to not copying? • Embedding these URLs in your own website or application is known as hotlinking, and doing it is a very quick way to get you in hot water on the Internet. • You do not want to use someone else’s server cycles to host media for your own applications. • The file hosted at any particular URL is subject to change. This might lead to embarrassing effects if, say, you’re embedding a hotlinked image on a public blog. • If you’re storing the URLs with the intent to store the file later, for further research, it might eventually go missing or be changed to something completely irrelevant at a later date. • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1877-1882). O'Reilly Media. Kindle Edition.

  22. Downloading the “logo” (image) • from urllib.request import urlretrieve • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://www.pythonscraping.com") • bsObj = BeautifulSoup(html, "html.parser") • imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"] • urlretrieve(imageLocation, "logo.jpg") • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1889-1896). O'Reilly Media. Kindle Edition.

  23. #Chapter 5: 1-getPageMedia.py • #Web Scraping with Python by Ryan Mitchell • #Chapter 5: 1-getPageMedia.py • import os • from urllib.request import urlretrieve • from urllib.request import urlopen • from bs4 import BeautifulSoup • downloadDirectory = "downloaded" • baseUrl = "http://pythonscraping.com"

  24. def getAbsoluteURL(baseUrl, source): • if source.startswith("http://www."): • url = "http://"+source[11:] • elif source.startswith("http://"): • url = source • elif source.startswith("www."): • url = "http://"+source[4:] • else: • url = baseUrl+"/"+source • if baseUrl not in url: • return None • return url
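A few illustrative calls (not from the slides), assuming the function above and baseUrl = "http://pythonscraping.com":

    print(getAbsoluteURL(baseUrl, "http://www.pythonscraping.com/img/logo.jpg"))
    # -> http://pythonscraping.com/img/logo.jpg
    print(getAbsoluteURL(baseUrl, "img/lrg/logo.jpg"))
    # -> http://pythonscraping.com/img/lrg/logo.jpg
    print(getAbsoluteURL(baseUrl, "http://example.com/image.png"))
    # -> None (off-site files are skipped)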

  25. def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory): • path = absoluteUrl.replace("www.", "") • path = path.replace(baseUrl, "") • path = downloadDirectory+path • directory = os.path.dirname(path) • if not os.path.exists(directory): • os.makedirs(directory) • return path

  26. html = urlopen("http://www.pythonscraping.com") • bsObj = BeautifulSoup(html, "html.parser") • downloadList = bsObj.findAll(src=True) • for download in downloadList: • fileUrl = getAbsoluteURL(baseUrl, download["src"]) • if fileUrl is not None: • print(fileUrl) • urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

  27. Run with Caution • Warnings on downloading unknown files from the internet • This script downloads EVERYTHING!! • Bash scripts, .exe files, malware • Never scrape as root • Image downloading to ../../../../usr/bin/python • And the next time someone runs Python!?!?

  28. Storing to CSV files • #Chapter 5: 2-createCsv.py • import csv • #from os import open • csvFile = open("../files/test.csv", 'w+', newline='') • try: • writer = csv.writer(csvFile) • writer.writerow(('number', 'number plus 2', 'number times 2')) • for i in range(10): • writer.writerow( (i, i+2, i*2)) • finally: • csvFile.close()
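As a quick check (not on the original slide), the file written above can be read back with csv.reader; each row comes back as a list of strings:

    import csv

    with open("../files/test.csv", newline='') as csvFile:
        for row in csv.reader(csvFile):
            print(row)
    # ['number', 'number plus 2', 'number times 2']
    # ['0', '2', '0']
    # ['1', '3', '2']
    # ...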

  29. Retrieving HTML tables • Doing it once → use Excel and save as CSV • Doing it 50 times → write a Python script

  30. #Chapter 5: 3-scrapeCsv.py • import csv • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors") • bsObj = BeautifulSoup(html, "html.parser") • #The main comparison table is currently the first table on the page • table = bsObj.findAll("table",{"class":"wikitable"})[0] • rows = table.findAll("tr")

  31. csvFile = open("files/editors.csv", 'wt', newline='', encoding='utf-8') • writer = csv.writer(csvFile) • try: • for row in rows: • csvRow = [] • for cell in row.findAll(['td', 'th']): • csvRow.append(cell.get_text()) • writer.writerow(csvRow) • finally: • csvFile.close()

  32. Storing in Databases • MySQL • Microsoft’s SQL Server • Oracle’s DBMS • Why use MySQL? • Free • Used by the big boys: YouTube, Twitter, Facebook • So ubiquity, price, “out of the box usability”

  33. Relational Databases

  34. SQL – Structured Query Language? • SELECT * FROM users WHERE firstname = "Ryan"

  35. Installing • $ sudo apt-get install mysql-server
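Note (not on the original slide): the Python examples on the later slides also need the PyMySQL driver, which is a separate install from the MySQL server itself; assuming pip is available:

    $ pip install PyMySQL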

  36. Some Basic MySQL commands • CREATE DATABASE scraping; • USE scraping; • CREATE TABLE pages; • error: a table must have at least one column, so MySQL rejects this statement • CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id)); • DESCRIBE pages;

  37. > INSERT INTO pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long."); • Of course, we can override these defaults: INSERT INTO pages (id, title, content, created) VALUES (3, "Test page title", "This is some test page content. It can be up to 10,000 characters long.", "2014-09-21 10:25:32"); • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2097-2101). O'Reilly Media. Kindle Edition.

  38. SELECT * FROM pages WHERE id = 2; • SELECT * FROM pages WHERE title LIKE "%test%"; • SELECT id, title FROM pages WHERE content LIKE "%page content%"; • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2106-2115). O'Reilly Media. Kindle Edition.

  39. Read once; decode; treat as file line by line

  40. #Chapter 5: 3-mysqlBasicExample.py • import pymysql • conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', • user='root', passwd=None, db='mysql') • cur = conn.cursor() • cur.execute("USE scraping") • cur.execute("SELECT * FROM pages WHERE id=1") • print(cur.fetchone()) • cur.close() • conn.close()

  41. # 5-storeWikiLinks.py • from urllib.request import urlopen • from bs4 import BeautifulSoup • import re • import datetime • import random • import pymysql • conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd=None, db='mysql', charset='utf8') • cur = conn.cursor() • cur.execute("USE scraping")

  42. def getLinks(articleUrl): • html = urlopen("http://en.wikipedia.org"+articleUrl) • bsObj = BeautifulSoup(html, "html.parser") • title = bsObj.find("h1").get_text() • content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text() • store(title, content) • return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
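getLinks calls a store() helper that does not appear on these slides. A minimal sketch (an assumption, not the book's exact code), consistent with the pages table from slide 36 and the conn/cur objects opened on slide 41:

    def store(title, content):
        # Hypothetical helper: insert the scraped title and content, then commit
        cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)", (title, content))
        conn.commit()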

  43. links = getLinks("/wiki/Kevin_Bacon") • try: • while len(links) > 0: • newArticle = links[random.randint(0, len(links)-1)].attrs["href"] • print(newArticle) • links = getLinks(newArticle) • finally: • cur.close() • conn.close()

  44. Next Time Requests Library and DB • Requests: HTTP for Humans • >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass')) • >>> r.status_code • 200 • >>> r.headers['content-type'] • 'application/json; charset=utf8' • >>> r.encoding • 'utf-8' • >>> r.text • u'{"type":"User"...' • >>> r.json() • {u'private_gists': 419, u'total_private_repos': 77, ...}

  45. Python-Tips • https://pythontips.com/2013/09/01/best-python-resources/ • …
