Learn about storing data using CSV, MySQL, iterators, generators, and APIs in web scraping. Understand JSON, yield, and decoding JSON responses. Discover how to retrieve data with Python.
Web Scraping Lecture 8 – Storing Data • Topics • Storing data • Downloading • CSV, MySQL • Readings: • Chapters 4 and 5 • February 2, 2017
Overview • Last Time: Lecture 6 slides 30–end; Lecture 7 slides 1-31 • Crawling from Chapter 3: Lecture 6 slides 29-40 • Getting code again: https://github.com/REMitchell/python-scraping • 3-crawlSite.py • 4-getExternalLinks.py • 5-getAllExternalLinks.py • Chapter 4 • APIs • JSON • Today: • Iterators, generators and yield • Chapter 4 • APIs • JSON • JavaScript • References: Scrapy site/user manual
Reg Expressions – Lookahead patterns • (?=...) Matches if ... matches next, but doesn’t consume any of the string. • This is called a lookahead assertion. • For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'. • (?!...) Matches if ... doesn’t match next. • This is a negative lookahead assertion. • For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'. • (?#...) A comment; the contents of the parentheses are simply ignored.
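A quick illustration of both assertions with Python's re module (a minimal sketch; the sample string is invented for the example):

import re

text = "Isaac Asimov, Isaac Newton"
# Positive lookahead: matches 'Isaac ' only when 'Asimov' follows
print(re.findall(r"Isaac (?=Asimov)", text))   # ['Isaac ']
# Negative lookahead: matches 'Isaac ' only when 'Asimov' does NOT follow
print(re.findall(r"Isaac (?!Asimov)", text))   # ['Isaac ']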
Chapter 4: Using APIs • API - In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. • https://en.wikipedia.org/wiki/Application_programming_interface • A web API is an application programming interface (API) for either a web server or a web browser. • Requests are made over HTTP • Responses are returned in XML or JSON
Authentication • Identify users – for charges, rate limits, etc. • http://developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>%20&name=guns%20n%27%20roses&format=json&start=0&results=100 • Using urlopen:

import urllib.request
from urllib.request import urlopen

token = "<your api key>"
webRequest = urllib.request.Request("http://myapi.com", headers={"token": token})
html = urlopen(webRequest)
Yield in Python

def _get_child_candidates(self, distance, min_dist, max_dist):
    # Generator: yields zero, one, or two child nodes without building a list
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

# The caller (from a separate method) consumes the generator:
result, candidates = list(), [self]
while candidates:
    node = candidates.pop()
    distance = node._get_dist(obj)
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
return result

https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/ http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
Iterators and Generators • When you create a list, you can read its items one by one. Reading its items one by one is called iteration: • >>> mylist = [1, 2, 3] • >>> for i in mylist: • ...     print(i) • Generators are iterators, but you can only iterate over them once, because they do not store all the values in memory; they generate the values on the fly: • >>> mygenerator = (x*x for x in range(3)) • >>> for i in mygenerator: • ...     print(i) https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
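To see the "only once" part concretely (a small REPL sketch, not from the slides): a second pass over the same generator produces nothing, because its values were consumed the first time.

>>> mygenerator = (x*x for x in range(3))
>>> list(mygenerator)
[0, 1, 4]
>>> list(mygenerator)   # already exhausted
[]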
Yield • Yield is a keyword that is used like return, except the function will return a generator. • >>> def createGenerator(): • ...     mylist = range(3) • ...     for i in mylist: • ...         yield i*i • ... • >>> mygenerator = createGenerator() # create a generator • >>> print(mygenerator) # mygenerator is an object! • <generator object createGenerator at 0xb7555c34> • >>> for i in mygenerator: • ...     print(i) https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
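Worth stressing (a small sketch, not on the original slide): calling createGenerator() does not execute the function body; code runs only as values are requested, for example with next(), and StopIteration is raised once the generator is exhausted.

>>> gen = createGenerator()   # nothing in the body has run yet
>>> next(gen)                 # runs up to the first yield
0
>>> next(gen)
1
>>> next(gen)
4
>>> next(gen)                 # exhausted
Traceback (most recent call last):
  ...
StopIteration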
#Chapter 4: 4-DecodeJson.py • #Web Scraping with Python by Ryan Mitchell

import json
from urllib.request import urlopen

def getCountry(ipAddress):
    response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    responseJson = json.loads(response)
    return responseJson.get("country_code")

print(getCountry("50.78.253.58"))
#Chapter 4: 5-jsonParsing.py • import json • jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}], "arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}' • jsonObj = json.loads(jsonString) • print(jsonObj.get("arrayOfNums")) • print(jsonObj.get("arrayOfNums")[1]) • print(jsonObj.get("arrayOfNums")[1].get("number")+jsonObj.get("arrayOfNums")[2].get("number")) • print(jsonObj.get("arrayOfFruits")[2].get("fruit"))
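For reference, the expected output of this script:

[{'number': 0}, {'number': 1}, {'number': 2}]
{'number': 1}
3
pear

(The third line is 1 + 2, the "number" values at indexes 1 and 2; the last line is index 2 of arrayOfFruits.)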
#Chapter 4: 6-wikiHistories.py

from urllib.request import urlopen
from urllib.request import HTTPError
from bs4 import BeautifulSoup
import datetime
import json
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
def getHistoryIPs(pageUrl):
    # Format of revision history pages is:
    # http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Finds only the links with class "mw-anonuserlink", which have IP addresses
    # instead of usernames
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList
def getCountry(ipAddress):
    try:
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    except HTTPError:
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

links = getLinks("/wiki/Python_(programming_language)")
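The slide stops at the initial getLinks() call. A sketch of the script's driving loop, reconstructed along the lines of the book's example (details may differ from the original): it walks random article links, collects the anonymous editors' IP addresses, and prints each address with its country code.

while len(links) > 0:
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP + " is from " + country)
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)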
Chapter 5 - Storing Data • Files • CSV files • JSON, XML
Downloading images: to copy or not • As you are scraping, do you download images or just store links? • Advantages to not copying: • Scrapers run much faster, and require much less bandwidth, when they don’t have to download files. • You save space on your own machine by storing only the URLs. • It is easier to write code that only stores URLs and doesn’t need to deal with additional file downloads. • You can lessen the load on the host server by avoiding large file downloads.
Disadvantages to not copying (arguments for downloading): • Embedding these URLs in your own website or application is known as hotlinking, and doing it is a very quick way to get you in hot water on the Internet. • You do not want to use someone else’s server cycles to host media for your own applications. • The file hosted at any particular URL is subject to change. This might lead to embarrassing effects if, say, you’re embedding a hotlinked image on a public blog. • If you’re storing the URLs with the intent to store the file later, for further research, it might eventually go missing or be changed to something completely irrelevant at a later date. • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1877-1882). O'Reilly Media. Kindle Edition.
Downloading the “logo” (image)

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")

Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1889-1896). O'Reilly Media. Kindle Edition.
#Chapter 5: 1-getPageMedia.py • #Web Scraping with Python by Ryan Mitchell

import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"
def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://"+source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        # Strip the leading "www." and add the protocol back on
        url = "http://"+source[4:]
    else:
        url = baseUrl+"/"+source
    if baseUrl not in url:
        return None
    return url
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory+path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.findAll(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))
Run with Caution • Warnings on downloading unknown files from the internet • This script just downloads EVERYTHING!! • Bash scripts, .exe files, malware • Never scrape as root • An “image” could download to ../../../../usr/bin/python • … and the next time someone runs Python!?!? • One simple mitigation is to filter by file type before downloading, as sketched below.
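A minimal sketch of that mitigation (not from the book; the helper name and extension list are illustrative only): check each URL against a whitelist of extensions before handing it to urlretrieve.

import os

# Hypothetical whitelist of media extensions we are willing to download
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".css", ".js"}

def isSafeToDownload(fileUrl):
    # Drop any query string, then compare the file extension to the whitelist
    path = fileUrl.split("?")[0]
    extension = os.path.splitext(path)[1].lower()
    return extension in ALLOWED_EXTENSIONS

# Usage inside the download loop of 1-getPageMedia.py:
#     if fileUrl is not None and isSafeToDownload(fileUrl):
#         urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))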
Storing to CSV files • #Chapter 5: 2-createCsv.py

import csv
#from os import open

csvFile = open("../files/test.csv", 'w+', newline='')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()
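To check the result (a quick sketch, not part of the book's example), the same csv module reads the file back row by row; note that every value comes back as a string.

import csv

with open("../files/test.csv", newline='') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        print(row)   # ['number', 'number plus 2', 'number times 2'], then ['0', '2', '0'], ...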
Retrieving HTML tables • Doing it once: use Excel and save as CSV • Doing it 50 times: write a Python script
#Chapter 5: 3-scrapeCsv.py

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
#The main comparison table is currently the first table on the page
table = bsObj.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")
csvFile = open("files/editors.csv", 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
Storing in Databases • MySQL • Microsoft SQL Server • Oracle Database • Why use MySQL? • Free • Used by the big boys: YouTube, Twitter, Facebook • So: ubiquity, price, “out of the box” usability
SQL – Structured Query Language? • SELECT * FROM users WHERE firstname = "Ryan"
Installing • $ sudo apt-get install mysql-server
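The Python examples later in this lecture talk to MySQL through the PyMySQL driver, which is a separate package from the server itself; it is typically installed with pip (for example, pip install PyMySQL).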
Some Basic MySQL commands • CREATE DATABASE scraping; • USE scraping; • CREATE TABLE pages; • → error: a table needs at least one column defined • CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id)); • DESCRIBE pages;
> INSERT INTO pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long."); • Of course, we can override these defaults: INSERT INTO pages (id, title, content, created) VALUES (3, "Test page title", "This is some test page content. It can be up to 10,000 characters long.", "2014-09-21 10:25:32"); • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2097-2101). O'Reilly Media. Kindle Edition.
SELECT * FROM pages WHERE id = 2; • SELECT * FROM pages WHERE title LIKE "%test%"; • SELECT id, title FROM pages WHERE content LIKE "%page content%"; • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2106-2115). O'Reilly Media. Kindle Edition.
#Chapter 5: 3-mysqlBasicExample.py

import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',
                       user='root', passwd=None, db='mysql')
cur = conn.cursor()
cur.execute("USE scraping")
cur.execute("SELECT * FROM pages WHERE id=1")
print(cur.fetchone())
cur.close()
conn.close()
# 5-storeWikiLinks.py

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',
                       user='root', passwd=None, db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE scraping")
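The slides omit the store() helper that getLinks() calls below. A minimal sketch consistent with the pages table created earlier (the book's version may differ in details; a parameterized query is used here so the driver handles quoting):

def store(title, content):
    # Insert one scraped article; the driver escapes the parameters
    cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)", (title, content))
    cur.connection.commit()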
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()
Next Time Requests Library and DB • Requests: HTTP for Humans • >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass')) • >>> r.status_code • 200 • >>> r.headers['content-type'] • 'application/json; charset=utf8' • >>> r.encoding • 'utf-8' • >>> r.text • u'{"type":"User"...' • >>> r.json() • {u'private_gists': 419, u'total_private_repos': 77, ...}
Python-Tips • https://pythontips.com/2013/09/01/best-python-resources/ • …