
Web Scraping using Python | Web Screen Scraping

Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.

Imagine you have to pull a large amount of data from websites and you want to do it as quickly as possible. How would you do it without manually going to each website and getting the data? Well, “Web Scraping” is the answer. Web scraping just makes this job easier and faster.


Presentation Transcript


  1. Web Scraping Using Python Python has become the most popular language for web scraping for many reasons. These include its flexibility, ease of coding, dynamic typing, a large collection of libraries to manipulate data, and support for the most common scraping tools, such as Scrapy, Beautiful Soup, and Selenium.

  2. 1 What is Web Scraping? Web scraping is a software technique for extracting data from different websites. It focuses on transforming unstructured data on the web (typically HTML) into structured data that can be stored and analyzed.

  3. 2 Why We Scrape? • Web pages contain a wealth of data, designed mostly for human consumption • Static websites • Interfacing with a third party that offers no API access • Websites are more important than APIs • The data is already available • No rate limiting • Anonymous access

  4. 3 Fetch The Data • Involves finding the endpoint – a URL or URLs • Sending an HTTP request to the server • Using the Requests library: import requests data = requests.get('http://google.com/') html = data.content

  5. 4 Processing • Avoid using regular expressions • Reasons not to use them: they are fragile, really hard to maintain, and handle improper HTML and encodings poorly
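The fragility claim above can be illustrated: a regex tuned to one markup style silently misses the same data once the markup shifts. The HTML snippets and pattern below are made up for illustration.

```python
import re

# A regex written against one specific markup style: double quotes,
# href as the first attribute
pattern = re.compile(r'<a href="([^"]+)">')

page_v1 = '<a href="/p/1">Widget</a>'
# Same data, slightly different markup: single quotes, extra attribute
page_v2 = "<a class='link' href='/p/1'>Widget</a>"

print(pattern.findall(page_v1))  # ['/p/1']
print(pattern.findall(page_v2))  # [] – the regex silently misses it
```

A parser-based approach (next slide) handles both variants identically, which is why it is preferred over regular expressions.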

  6. 5 Use Beautiful Soup For Parsing • Provides simple methods to search, navigate, and select • Deals with broken web pages really well • Auto-detects encoding
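A minimal parsing sketch with Beautiful Soup (the `bs4` package); the HTML snippet, class names, and URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet standing in for a fetched page
html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/p/1">Widget</a></li>
    <li class="item"><a href="/p/2">Gadget</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: all list items carrying the class "item"
items = soup.find_all("li", class_="item")

# Navigate/select: pull the link text and href from each item
products = [(li.a.get_text(), li.a["href"]) for li in items]
print(products)  # [('Widget', '/p/1'), ('Gadget', '/p/2')]
```

The same `find_all` call keeps working even when the markup is slightly broken or attribute quoting changes, which is the practical advantage over regular expressions.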

  7. 6 Export The Data • Database (relational or non-relational) • File (XML, YAML, CSV, JSON, etc.) • APIs
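A minimal sketch of the file-export option using only the standard library; the records, field names, and filenames are illustrative.

```python
import csv
import json

# Sample scraped records; field names are illustrative
rows = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 19.99},
]

# Export to a JSON file
with open("products.json", "w") as f:
    json.dump(rows, f, indent=2)

# Export to a CSV file with a header row
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```

For database or API export the shape is the same: the scraper produces a list of dicts, and only the sink changes.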

  8. 7 Challenges • External sites can change without warning • Figuring out the crawl frequency is difficult • Changes can break scrapers easily • Bad HTTP status codes • Example: using 200 OK to signal an error • You cannot always trust your HTTP library's default behavior • Messy HTML markup
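One defensive pattern for the 200-OK-as-error problem above is to validate the response body as well as the status code. A minimal sketch; the error-marker string is made up and would need to be tailored to the target site.

```python
def looks_like_error(status_code, body):
    """Treat a response as failed if the status is not 200 OR the body
    contains an error marker, since some sites return 200 OK even for
    error pages. The marker string here is illustrative."""
    if status_code != 200:
        return True
    return "temporarily unavailable" in body.lower()

# A 200 response whose body is actually an error page
print(looks_like_error(200, "Service Temporarily Unavailable"))  # True
print(looks_like_error(200, "<html>real content</html>"))        # False
print(looks_like_error(404, "Not Found"))                        # True
```

The same idea generalizes to checking for redirects to login pages or unexpectedly short bodies before trusting a response.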

  9. 8 Scrapy – A Framework For Web Scraping • Uses XPath to select elements • Interactive shell scripting • Using Scrapy: define a model to store items, create your spider to extract items, write a pipeline to store them

  10. Thank You
