
Session 3 Wharton Summer Tech Camp

Presentation Transcript


  1. Session 3 Wharton Summer Tech Camp. Data Acquisition: Companies & Wharton Data • Basic web scraping • Using APIs

  2. Setup problems • Mac – mostly no problems due to its Linux-like environment and great support • Windows – use MobaXterm; you can use apt-cyg to install everything: • apt-cyg install python • apt-cyg install idle • apt-cyg install idlex

  3. REGEX CHALLENGE! • 3 regex challenges • 1 is from a well-known t-shirt joke (if you know this, don’t say anything) • 2 are song lyrics (I tried to find well-known songs) • Raise your hand to say the answer

  4. A t-shirt people wear. Difficulty: *. Hint: it’s a phrase. r"(bb|[^b]{2})"

  5. A t-shirt people wear. Difficulty: *. Hint: it’s a phrase. r"(bb|[^b]{2})" “To be or not to be”

  6. Challenge 2. Difficulty: *****. Hint: this is literally the entire lyric of the song. r"(\w+ [a-z]{3} w..ld ){144}"

  7. Challenge 2. Difficulty: ****. Hint: this is literally the entire lyric of the song. Hint 2: it’s a song by the music duo who created the latest Record of the Year. r"(ar\w{3} [a-z]{3} w..ld ){144}"

  8. Challenge 2. Difficulty: *****. Hint: this is literally the entire lyric of the song. Hint 2: it’s a song by the music duo who created the latest Record of the Year. r"(\w+ [a-z]{3} w..ld ){144}" “Around the World” – by Daft Punk

  9. Challenge 3. Difficulty: **. Hint: lyric of an old song. r"ah, ((ba ){4} (bar){2}a an{2} \s)+"

  10. Challenge 3. Difficulty: **. r"ah, ((ba ){4} (bar){2}a an{2} \s)+" Ah, Ba bababa Barbara Ann~ Ah, Ba bababa Barbara Ann~

  11. Song Phrases • Ever since I learned regex, I’ve thought that many Daft Punk songs are optimized for regex. You can capture the entire lyric of a song with one simple regex: • r"(Around the world ){144}" – Around the World • r"((buy|use|break|fix|trash|change) it )+ now upgrade it" – Technologic • r"(((work|make|do|makes|more) (it|us|than) (harder|better|faster|stronger|ever))+ hour after our work is never over. \s)+" – Harder, Better, Faster, Stronger
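
  A minimal sanity check (a sketch in plain Python, not one of the course scripts) that the first pattern really is the whole lyric, i.e. the phrase repeated 144 times:

    import re

    # The slide's pattern: "Around the world " repeated exactly 144 times.
    pattern = r"(Around the world ){144}"
    lyric = "Around the world " * 144

    print(bool(re.fullmatch(pattern, lyric)))  # prints: True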

  12. THE BIGGEST concern for doctoral students doing empirical work (years 2-4): "WHERE AND HOW DO I GET THE DATA?!" Mr. Data: "I believe what you are experiencing is frustration"

  13. Data sources • Companies • Wharton organizations • Web scraping • APIs: Application Programming Interfaces

  14. DATA SOURCES • Companies – HARD, UNIQUE • Hardest, but once you get a good company, you are set for a paper or two or more… • Wharton organizations • WRDS (EASY, COMMON – great for auxiliary data): other people can also easily access this data, so it has probably been used already • WCAI (EASY, UNIQUE): the data is actually pretty great, and only a few select teams get it after a proposal review process • Web scraping (WGET/REGEX/tools) – MEDIUM, MEDIUM • Relatively easy, but painful for big projects and sometimes not allowed, depending on the website • APIs (Application Programming Interfaces) – EASY, COMMON • Easy, but restricted to what the company has made available

  15. Resources for Public Data • There are many lists of lists for public data • Find a link to a list of lists for data on the course website under “resources for learning” • If you have a good source, please email me so I can link it on the website

  16. Companies

  17. Quick tips • Don’t be afraid to contact random companies • Attend conferences and network like an MBA - think of it like a game • Send a short 2-3 page proposal suggesting a research collaboration • Read about the company you are contacting and make sure to offer something that interests them • Low success probability – among the many proposals I’ve sent (about 30+ if you count emails): • Mostly no response • 1 company I was working with for 10 months just dropped the ball because the CTO changed twice • 4 provided easy-to-get data that was not useful or suitable for research • 2 provided very useful data that I am currently using/working with • 1 company is disputing the NDA • NDAs: you can request help from the UPenn legal team here • https://medley05.isc-seo.upenn.edu/researchInventory/jsp/fast2.do?bhcp=1

  18. NDAs are super important • A horror story I heard: • A student worked with a company for over a year, and then the company decided the result was too good to publish - they wanted it to be a trade secret/IP • The NDA that was signed was bad • No publication • Most NDAs are OK, but some are not. If yours is bad, get help from the link above and negotiate • Look out for “work for hire” type NDAs

  19. Wharton Specific

  20. Wharton Specific You probably heard about these organizations at the Wharton doctoral orientation. • WRDS: Wharton Research Data Services • https://wrds-web.wharton.upenn.edu/wrds/ • WCAI: Wharton Customer Analytics Initiative • http://www.wharton.upenn.edu/wcai/ • Other organizations exist, but mostly for conferences and not for data • http://www.wharton.upenn.edu/faculty/research-centers-and-initiativ.cfm

  21. Basic Web Scraping

  22. Caveats • I spent time writing and testing a scraping script for this course: you input a list of music artists in CSV format, and the script queries allmusic.com to obtain information such as the genres associated with each artist • Written in March of 2013 • In July, it broke because allmusic.com updated their website…  • This is one problem with scraping: you never know when it will stop working and you have to rewrite

  23. Outline of basic scraping • CRAWLING: Instead of using a web browser, use scripts to access the HTML (XML, etc.), or crawl through a website recursively and download all the HTML/text files (WGET, Python, or any language such as PHP) • PATTERN SEARCHING: Look at the raw HTML output, find where the required data lives, and figure out the pattern (Developer’s Toolbox in Firefox) • EXTRACTION: Use a text-extraction tool to pull out the information and store it; if it’s a structured format such as XML, use the appropriate tools for that format (REGEX, Apache Lucene, SED, AWK, etc.) • Go publish papers with the data (a minimal sketch of this pipeline follows below)
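
  A minimal sketch of the crawl / pattern-search / extract pipeline, assuming a hypothetical page URL and a made-up HTML pattern (not a production crawler):

    import re
    import urllib.request

    # CRAWLING: fetch the raw HTML with a script instead of a browser.
    url = "https://example.com/artists.html"  # hypothetical page
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

    # PATTERN SEARCHING: suppose the browser's developer tools showed that each
    # genre sits inside <span class="genre">...</span>.
    # EXTRACTION: pull every match out with a regex and store it.
    genres = re.findall(r'<span class="genre">(.*?)</span>', html)
    print(genres)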

  24. Alternatives • Want something easier or with a GUI? • MOZENDA: Wharton has a license and it’s cheap • More advanced scraping: we will cover this next week with Scrapy • There are many other tools and packages for this • http://en.wikipedia.org/wiki/Web_crawler • http://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use

  25. Tools used in our examples • WGET + Python • REGEX • HTML/DOM inspector • Firefox has the Web Developer’s Toolbox, an add-on you can download • This is useful for finding the pattern of the data you want to extract

  26. Scraping Example 1 • Facebook SEC filing exploration • Purpose: exploration before research • What this toy example does: get SEC filings for Facebook and extract certain parts • I am interested in reading a few words before and after “shares” wherever it is mentioned (a sketch of the extraction step is below)
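
  A hedged sketch of what the extraction step might look like (the real code is in extractPhrase.py; the filename here is hypothetical and should be whatever wget saved for you):

    import re

    # Read one downloaded filing (hypothetical filename).
    with open("fb_filing.txt", encoding="utf-8", errors="ignore") as f:
        text = f.read()

    # Up to five words of context before and after each "shares", case-insensitive.
    pattern = r"(?:\S+\s+){0,5}shares(?:\s+\S+){0,5}"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        print(match.group())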

  27. DOWNLOAD HTMLS/TXT/JPG/ETC • WGET “GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.” Fire up edgarFBarchive.sh and extractPhrase.py

  28. WGET FB’s SEC filings wget -r -l1 -H -t1 -nd -N -np -A.txt -e robots=off http://www.sec.gov/Archives/edgar/data/1326801/ • -r -H -l1 -np: tell wget to download recursively (one level deep, spanning hosts, without ascending to the parent directory) • -nd: no directories; keep everything downloaded in one folder • -A.txt: only download .txt files • -e robots=off: ignore robots.txt (avoid this option if wget works without it, and make sure to use the --wait option if you do use it, or your IP may get banned)

  29. Caveats • WGET only works well for certain websites. You can use it to download all the photos, etc., but if your script makes too many requests, they may ban your IP. You can specify delayed requests. • Once the website gets fancy, you have to use other tools such as PHP or Python packages • ASP • POST (as opposed to the GET method in HTTP) • JavaScript-generated sites • AJAX sites • This is a toy example for learning. You can still use this method for simple scraping, but consider learning pro tools (we’ll cover the basics of such a tool next week)

  30. Scraping Example 2 • Jambase.com concert venues • This example gets a list of artists and queries jambase.com to get concert venue information. • Another toy example

  31. Fire up getConcertVenue.py
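
  A hedged sketch of the idea behind getConcertVenue.py, assuming a hypothetical jambase.com search URL and page markup (the real script targets the site's actual pages, which may have changed since):

    import csv
    import re
    import time
    import urllib.parse
    import urllib.request

    # Read artist names from a one-column CSV file.
    with open("artists.csv", newline="", encoding="utf-8") as f:
        artists = [row[0] for row in csv.reader(f)]

    for artist in artists:
        # Hypothetical search URL and markup, for illustration only.
        url = "https://www.jambase.com/search?q=" + urllib.parse.quote(artist)
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        venues = re.findall(r'<span class="venue">(.*?)</span>', html)
        print(artist, venues)
        time.sleep(2)  # be polite: wait between requests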

  32. API (Application Programming Interface)

  33. Programmable Web • programmableweb.com • Search engine for freely available APIs online • http://blog.programmableweb.com/2012/02/15/40-real-estate-apis-zillow-trulia-walk-score/ • Usage examples • Usually, you have to apply for API keys from the website or the company offering the data • Mostly free (limited queries)

  34. Idea behind an API • You obtain a key from the company offering the data • You make requests for data (in many different ways, depending on the API) • The company’s server grants you the data • Data analysis

  35. Commonly Used Protocol in APIs • REST (REpresentational State Transfer) – guidelines for client-server interaction for exchanging data, as opposed to the alternative, SOAP • I recommend this funny explanation of REST vs SOAP (diagram involving Martin Lawrence) • http://stackoverflow.com/questions/209905/representational-state-transfer-rest-and-simple-object-access-protocol-soap • Based on HTTP • You request data via the HTTP GET method (http://www.w3schools.com/tags/ref_httpmethods.asp) and the server gives you data • HTTP-URL?QueryStrings • QueryStrings: Field=Value pairs separated by & • E.g. http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s • v: stands for video = some value • t: stands for start time = some value • Usual data formats • XML: eXtensible Markup Language http://www.w3schools.com/xml/ • JSON: JavaScript Object Notation http://www.w3schools.com/json/
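
  A minimal sketch of a REST-style GET request with query strings, using the requests package (httpbin.org simply echoes back what you send, so it is handy for seeing how Field=Value pairs become a URL):

    import requests

    params = {"v": "5pidokakU4I", "t": "0m38s"}  # Field=Value pairs
    resp = requests.get("https://httpbin.org/get", params=params)

    print(resp.url)     # .../get?v=5pidokakU4I&t=0m38s
    print(resp.json())  # the server's JSON reply, parsed into a Python dict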

  36. XML Example
  <CATALOG>
    <PLANT>
      <COMMON>Bloodroot</COMMON>
      <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
      <ZONE>4</ZONE>
      <LIGHT>Mostly Shady</LIGHT>
      <PRICE>$2.44</PRICE>
      <AVAILABILITY>031599</AVAILABILITY>
    </PLANT>
    <PLANT>
      <COMMON>Columbine</COMMON>
      <BOTANICAL>Aquilegia canadensis</BOTANICAL>
      <ZONE>3</ZONE>
      <LIGHT>Mostly Shady</LIGHT>
      <PRICE>$9.37</PRICE>
      <AVAILABILITY>030699</AVAILABILITY>
    </PLANT>
  </CATALOG>
  Many XML-related packages: http://wiki.python.org/moin/PythonXml
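
  A small sketch of parsing the catalog above with Python's standard library (xml.etree.ElementTree), abbreviated to two fields per plant:

    import xml.etree.ElementTree as ET

    xml_text = """<CATALOG>
      <PLANT><COMMON>Bloodroot</COMMON><PRICE>$2.44</PRICE></PLANT>
      <PLANT><COMMON>Columbine</COMMON><PRICE>$9.37</PRICE></PLANT>
    </CATALOG>"""

    root = ET.fromstring(xml_text)
    for plant in root.findall("PLANT"):
        print(plant.find("COMMON").text, plant.find("PRICE").text)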

  37. JSON Example (almost like a Python dict, except that JSON writes false/true/null rather than False/True/None) newObject = { "first": "Ted", "last": "Logan", "age": 17, "sex": "M", "salary": 0, "registered": false, "interests": ["Van Halen", "Being Excellent", "Partying"] } Main Python module: import json
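
  Parsing that JSON with the standard json module (note how JSON's false comes back as Python's False):

    import json

    text = '{"first": "Ted", "last": "Logan", "age": 17, "registered": false}'
    obj = json.loads(text)
    print(obj["first"], obj["registered"])  # prints: Ted False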

  38. Yahoo Finance Data Example

  39. Python Package Wrapper • Yahoo provides a simple web interface for anyone to download stock information via a URL • http://finance.yahoo.com/d/quotes.csv?s=%s&f=%s • s: symbol, e.g. “GOOG” • f: stat (e.g. l1 means last trade price) • http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=l1 • More info here • http://www.gummy-stuff.org/Yahoo-data.htm (ordered to be taken down) • http://web.archive.org/web/20140325063520/http://www.gummy-stuff.org/Yahoo-data.htm
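
  A sketch of hitting that quotes.csv URL directly (note: Yahoo has since retired this endpoint, so treat this as illustrative of the pattern rather than something guaranteed to work today):

    import urllib.request

    symbol, stat = "GOOG", "l1"  # l1 = last trade price
    url = "http://finance.yahoo.com/d/quotes.csv?s=%s&f=%s" % (symbol, stat)
    print(urllib.request.urlopen(url).read().decode().strip())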

  40. This Wrapper Package does it for you • ystockquote • https://pypi.python.org/pypi/ystockquote/0.2.3 • https://github.com/cgoldberg/ystockquote • See the simple source code to learn • Open up ystock.py
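
  Minimal usage, assuming the 0.2.x interface documented on the PyPI page (ystockquote just wraps the quotes.csv URL from the previous slide):

    import ystockquote

    print(ystockquote.get_price("GOOG"))  # last trade price, as a string
    print(ystockquote.get_all("GOOG"))    # dict of all available fields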

  41. Example: YQL • http://developer.yahoo.com/yql/ • APIs are written by individual companies and support different I/O and usually different languages • Yahoo Query Language is a simple interface that Yahoo has made available to developers, combining several APIs • “Yahoo! Query Language (YQL) enables you to access Internet data with SQL-like commands.” • Apply for your API key • http://developer.yahoo.com/yql/

  42. Our example: BBYOPEN • https://bbyopen.com/bbyopen-apis-overview • Retail information • Archive query - returns a single file containing all attributes for all items exposed by the given API • Basic query - returns information about a single item • Advanced query - returns information about one or more items according to your specifications • Store availability query - returns information about products available at specific stores • Best Buy is providing this API • API overview • https://developer.bestbuy.com/get-started

  43. Basic Query Basic query structure: http://api.remix.bestbuy.com/API/Item.Format?show=&apiKey=Key • API - one of {products, stores, reviews, categories} • Item - the value of the fundamental attribute for the selected API: products - sku; stores - storeId; reviews - id; categories - id • Format - one of {xml, json} • show= - (optional) the item attributes you want displayed • Key - your API key • Note: show= and Key can be specified in either order
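
  A hedged sketch of a basic products query following the structure above (the SKU is arbitrary and YOUR_API_KEY is a placeholder; the endpoint details may have changed since this was written):

    import requests

    sku = 1780275             # hypothetical product SKU
    api_key = "YOUR_API_KEY"  # your own key goes here

    # API=products, Item=sku, Format=json, per the structure on this slide.
    url = "http://api.remix.bestbuy.com/products/%d.json" % sku
    resp = requests.get(url, params={"show": "sku,name,salePrice", "apiKey": api_key})
    print(resp.json())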

  44. Basic Query Examples

  45. API example • Open up bestbuyAPI.py

  46. Lab session • For the next 10-15 minutes, choose your favorite website and try to scrape a few items • We’ll do this again with Scrapy

  47. Data isn’t impossibly hard to get after all. There are many routes, but it could take a LONG time (especially if you are going the company route). START EARLY and you’ll get that data. DATA!

  48. Next Session • Hugh will be speaking about HPCC • After that, we will learn the basics of Scrapy • Brush up on your HTML and look into XPath • W3Schools.com is the best • Intro to Big Data and Empirical Business Research
