590 Web Scraping – Handling Images

  1. 590 Web Scraping – Handling Images
  • Topics
  • CAPTCHAs
  • Pillow
  • Tesseract -- OCR
  • Readings:
  • Text – chapter 11
  April 11, 2017

  2. CAPTCHA
  • A CAPTCHA (a backronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human. [1]
  • The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.
  https://en.wikipedia.org/wiki/CAPTCHA

  3. Computer Vision
  Mitchell, Ryan. Web Scraping with Python

  4. Optical Character Recognition
  • Extracting information from scanned documents
  • Python is a fantastic language for:
  • image processing and reading,
  • image-based machine learning, and
  • even image creation.
  • Libraries for image processing:
  • Pillow and Tesseract
  • http://pillow.readthedocs.org/en/3.0.x/ and
  • https://pypi.python.org/pypi/pytesseract
  Mitchell, Ryan. Web Scraping with Python
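
  As a quick orientation before the book's examples, here is a minimal sketch of the two libraries working together. It is an illustrative sketch only: it assumes Pillow and pytesseract are installed, and the file name scanned_page.png is a placeholder.

  # Minimal OCR sketch: open an image with Pillow, hand it to Tesseract
  # via the pytesseract wrapper, and print the recognized text.
  # "scanned_page.png" is a placeholder file name.
  from PIL import Image
  import pytesseract

  text = pytesseract.image_to_string(Image.open("scanned_page.png"))
  print(text)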

  5. Pillow
  • Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations. (A hedged pixel-level sketch follows; the book's filter example is on the next slide.)
  Mitchell, Ryan. Web Scraping with Python
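
  A minimal sketch of a pixel-specific transformation, assuming a local image file named photo.jpg (a placeholder name):

  # Grayscale conversion plus a pixel-level transform with point(),
  # which applies a function to every pixel value.
  from PIL import Image

  img = Image.open("photo.jpg").convert("L")    # grayscale copy
  inverted = img.point(lambda x: 255 - x)       # invert every pixel
  inverted.save("photo_inverted.jpg")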

  6. Chapter 11 -- 1-basicImage.py

  from PIL import Image, ImageFilter

  kitten = Image.open("../files/kitten.jpg")
  blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
  blurryKitten.save("kitten_blurred.jpg")
  blurryKitten.show()

  Mitchell, Ryan. Web Scraping with Python
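
  A small follow-up on the example above, assuming the same kitten.jpg file: GaussianBlur can also be instantiated with an explicit radius if a stronger or weaker blur is wanted (the radius of 5 below is only an example):

  from PIL import Image, ImageFilter

  kitten = Image.open("../files/kitten.jpg")
  blurryKitten = kitten.filter(ImageFilter.GaussianBlur(radius=5))  # example radius
  blurryKitten.save("kitten_blurred_r5.jpg")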

  7. Tesseract
  • Tesseract is an OCR library.
  • It is sponsored by Google, a company well known for its OCR and machine-learning technologies.
  • Tesseract is widely regarded as the best, most accurate, open source OCR system available.

  8. Chapter 11 -- 2-cleanImage.py

  from PIL import Image
  import subprocess

  def cleanFile(filePath, newFilePath):
      image = Image.open(filePath)

      # Set a threshold value for the image, and save
      image = image.point(lambda x: 0 if x < 143 else 255)
      image.save(newFilePath)

  Mitchell, Ryan. Web Scraping with Python

  9. (2-cleanImage.py, continued inside cleanFile)

      # Call tesseract to do OCR on the newly created image
      subprocess.call(["tesseract", newFilePath, "output"])

      # Open and read the resulting data file
      outputFile = open("output.txt", 'r')
      print(outputFile.read())
      outputFile.close()

  cleanFile("text_2.png", "text_2_clean.png")

  Mitchell, Ryan. Web Scraping with Python
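
  As a hedged alternative to the subprocess call above, the pytesseract wrapper mentioned earlier can run Tesseract without writing an intermediate output.txt. The sketch below assumes pytesseract is installed and reuses the book's threshold and file name:

  from PIL import Image
  import pytesseract

  def cleanAndRead(filePath):
      image = Image.open(filePath)
      # Same threshold as in 2-cleanImage.py
      image = image.point(lambda x: 0 if x < 143 else 255)
      return pytesseract.image_to_string(image)

  print(cleanAndRead("text_2.png"))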

  10. Installing Tesseract
  • For Windows users there is a convenient executable installer. As of this writing, the current version is 3.02, although newer versions should be fine as well.
  • Linux users can install Tesseract with apt-get: $ sudo apt-get install tesseract-ocr
  • Installing Tesseract on a Mac is slightly more complicated, although it can be done easily with third-party package managers such as Homebrew.
  Mitchell, Ryan. Web Scraping with Python
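
  As a hedged quick reference (the package names below are the standard ones, but check the current documentation): Mac users with Homebrew can typically run $ brew install tesseract, and the Python libraries used in these examples can usually be installed with $ pip install pillow pytesseract.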

  11. NumPy again
  Mitchell, Ryan. Web Scraping with Python
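
  The slide's figure is not reproduced in this transcript. As a hedged illustration of where NumPy typically enters this workflow, the sketch below converts a Pillow image to a NumPy array so the same thresholding can be done over the whole pixel array at once; it assumes numpy is installed and reuses the earlier file name text_2.png:

  import numpy as np
  from PIL import Image

  # Load as grayscale, then threshold every pixel in one vectorized step
  pixels = np.array(Image.open("text_2.png").convert("L"))
  binary = np.where(pixels < 143, 0, 255).astype(np.uint8)
  Image.fromarray(binary).save("text_2_numpy_clean.png")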

  12. Well-formatted text
  • Well-formatted text:
  • Is written in one standard font (excluding handwriting fonts, cursive fonts, or excessively "decorative" fonts)
  • If copied or photographed, has extremely crisp lines, with no copying artifacts or dark spots
  • Is well aligned, without slanted letters
  • Does not run off the image, and has no cut-off text or margins at the edges of the image
  A preprocessing sketch that approximates these conditions follows.
  Mitchell, Ryan. Web Scraping with Python
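
  A hedged preprocessing sketch that pushes a photographed page toward the conditions listed above before OCR. The file name photo_of_text.jpg is a placeholder, and the threshold mirrors the book's examples:

  from PIL import Image, ImageOps

  img = Image.open("photo_of_text.jpg").convert("L")    # one channel, no color noise
  img = ImageOps.autocontrast(img)                      # sharpen faint, low-contrast copies
  img = img.point(lambda x: 0 if x < 143 else 255)      # crisp black-on-white text
  img.save("photo_of_text_clean.png")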

  13. 3-Read-Web-Images

  14. Chapter 11 -- 3-readWebImages.py

  import time
  from urllib.request import urlretrieve
  import subprocess
  from selenium import webdriver

  #driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs-1.9.8-macosx/bin/phantomjs')
  driver = webdriver.Chrome()
  driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
  time.sleep(2)

  driver.find_element_by_id("img-canvas").click()

  # The easiest way to get exactly one of every page
  imageList = set()

  Mitchell, Ryan. Web Scraping with Python

  15. # Wait for the page to load
  time.sleep(10)
  print(driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"))

  while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
      # While we can click on the right arrow, move through the pages

  Mitchell, Ryan. Web Scraping with Python

  16. (3-readWebImages.py, continued: inside the while loop)

      driver.find_element_by_id("sitbReaderRightPageTurner").click()
      time.sleep(2)

      # Get any new pages that have loaded (multiple pages can load at once)
      pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
      for page in pages:
          image = page.get_attribute("src")
          imageList.add(image)

  driver.quit()

  Mitchell, Ryan. Web Scraping with Python

  17. # Start processing the images we've collected URLs for with Tesseract

  for image in sorted(imageList):
      urlretrieve(image, "page.jpg")
      p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
      p.wait()
      f = open("page.txt", "r")
      print(f.read())

  Mitchell, Ryan. Web Scraping with Python
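
  A side note on the Selenium calls above: recent Selenium releases (4.x) removed the find_element_by_id / find_elements_by_xpath helpers. A minimal sketch of the equivalent locator-based calls, assuming Selenium 4 is installed:

  from selenium.webdriver.common.by import By

  # Same lookups as in the book's code, written with the newer locator API
  driver.find_element(By.ID, "img-canvas").click()
  pages = driver.find_elements(By.XPATH, "//div[@class='pageImage']/div/img")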

  18. 4-CAPTCHA

  19. Chapter 11 -- 4-solveCaptcha.py

  from urllib.request import urlretrieve
  from urllib.request import urlopen
  from bs4 import BeautifulSoup
  import subprocess
  import requests
  from PIL import Image
  from PIL import ImageOps

  def cleanImage(imagePath):
      image = Image.open(imagePath)
      image = image.point(lambda x: 0 if x < 143 else 255)
      borderImage = ImageOps.expand(image, border=20, fill='white')
      borderImage.save(imagePath)

  Mitchell, Ryan. Web Scraping with Python

  20. html = urlopen("http://www.pythonscraping.com/humans-only")
  bsObj = BeautifulSoup(html, "html.parser")

  # Gather prepopulated form values
  imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
  formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
  captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]
  captchaToken = bsObj.find("input", {"name": "captcha_token"})["value"]

  captchaUrl = "http://pythonscraping.com" + imageLocation
  urlretrieve(captchaUrl, "captcha.jpg")
  cleanImage("captcha.jpg")

  Mitchell, Ryan. Web Scraping with Python

  21. p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  p.wait()
  f = open("captcha.txt", "r")

  # Clean any whitespace characters
  captchaResponse = f.read().replace(" ", "").replace("\n", "")
  print("Captcha solution attempt: " + captchaResponse)

  Mitchell, Ryan. Web Scraping with Python

  22. if len(captchaResponse) == 5:
      params = {"captcha_token": captchaToken, "captcha_sid": captchaSid,
                "form_id": "comment_node_page_form", "form_build_id": formBuildId,
                "captcha_response": captchaResponse, "name": "Ryan Mitchell",
                "subject": "I come to seek the Grail",
                "comment_body[und][0][value]": "...and I am definitely not a bot"}

  Mitchell, Ryan. Web Scraping with Python

  23. (4-solveCaptcha.py, continued: still inside the if block)

      r = requests.post("http://www.pythonscraping.com/comment/reply/10", data=params)
      responseObj = BeautifulSoup(r.text, "html.parser")
      if responseObj.find("div", {"class": "messages"}) is not None:
          print(responseObj.find("div", {"class": "messages"}).get_text())
  else:
      print("There was a problem reading the CAPTCHA correctly!")

  Mitchell, Ryan. Web Scraping with Python

  24. Mitchell, Ryan. Web Scraping with Python

  25. Mitchell, Ryan. Web Scraping with Python
