
BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database

BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database. 2013 Open Seminar Series 6 Open Geospatial Informatics  Cheng-Ying Liu (Sean) bermuda@citi.sinica.edu.tw http://bermuda.citi.sinica.edu.tw. BIG Data & Twitter. WHAT IS BIG DATA ?.

Presentation Transcript


  1. BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database 2013 Open Seminar Series 6 Open Geospatial Informatics  Cheng-Ying Liu (Sean) bermuda@citi.sinica.edu.tw http://bermuda.citi.sinica.edu.tw

  2. BIG Data & Twitter

  3. WHAT IS BIG DATA ? In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. 《Wikipedia Big data》 Source: http://en.wikipedia.org/wiki/Big_data

  4. WHAT IS BIG DATA ? • In 2001, Doug Laney used the 3V model to describe Big Data • Volume: amount of data • Velocity: speed of data in and out • Variety: range of data types and sources • Veracity: truthfulness of data (a fourth V that others later added to Laney's model)

  5. WHAT IS BIG DATA ? • In 2012, Gartner updated the definition • Still advocates the 3V model for describing data • Requires new forms of processing to enable: • Enhanced decision making • Insight discovery • Process optimization

  6. HOW BIG IS BIG DATA ? • Beyond the ability of commonly used software tools • A few dozen terabytes (1 TB = 10^12 bytes) to many petabytes (1 PB = 10^15 bytes) • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day

  7. NEW TECHNOLOGY FOR BIG DATA • Hadoop • Developed by the Apache Software Foundation • Derived from Google's MapReduce & Google File System • Able to process petabyte-scale data • NoSQL (Not Only SQL) • Relational databases are not suitable for all cases • NoSQL is a newer choice for non-relational databases • Adopted by Google, Facebook, Twitter, etc.

  8. WHAT IS TWITTER? • The fastest, simplest way to communicate • More than 140M active users • Majority of traffic comes from mobile • 60% of users are outside the U.S. • More than 400M twitter.com visitors • More than 400M tweets/day (peak: 25K/sec) • 1,000 employees (majority in San Francisco) • 50% of employees are engineers • Expected to reach nearly $1 billion in global ad revenue in 2014, according to eMarketer

  9. TWITTER HISTORY • Evan Williams on the genesis of Twitter, ICWSM, April 2007: • A side project started from Jack Dorsey’s idea Oct, 2006 • Wanted a ubiquitous status message • A community of people answering the question “what are you doing?” • Exploded at SXSW, SF earthquakes (2011) • Good for collective “backchanneling” • High “Ambient intimacy” • Huge API usage was unexpected, as was the rise of the @ sign for replies

  10. HOW BIG IS TWITTER ? Source: http://blog.twitter.com/2011/06/200-million-tweets-per-day.html

  11. IT’S NOT JUST BIG! IT’S FRESH! Source: http://xkcd.com/723/

  12. WHAT IS TWEET ?

  13. TWITTER TOWN HALL July 6, 2011

  14. TWITTER STATS Mapping the global Twitter heartbeat: The geography of Twitter, May 2013 Source: http://www.sgi.com/go/twitter/images/hires/figure4.png

  15. TWITTER STATS

  16. TWITTER STATS Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20-February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users. **Represents significant difference compared with all other rows in group.

  17. TWITTER STATS

  18. TWITTER STATS

  19. Twitter Dev

  20. TWITTER ACCOUNT • Register a Twitter account (required)

  21. REGISTER A TWITTER APPLICATION • Twitter developer web site: https://dev.twitter.com/ • Select “My applications”

  22. REGISTER A TWITTER APPLICATION • Click “Create a new application” Application List

  23. REGISTER A TWITTER APPLICATION • Fill the required information 1. 2. 3.

  24. REGISTER A TWITTER APPLICATION • Agree developer rules and fill captcha 1. 2.

  25. REGISTER A TWITTER APPLICATION • Go back to application list and click your application • Click “Settings”

  26. REGISTER A TWITTER APPLICATION • Select “Read, Write and Access direct messages” • Click “Update this Twitter application’s settings”

  27. REGISTER A TWITTER APPLICATION • Click “Create my access token”

  28. REGISTER A TWITTER APPLICATION

  29. Twitter API Resource

  30. REST API Source: https://dev.twitter.com/docs/streaming-apis

  31. STREAMING API Source: https://dev.twitter.com/docs/streaming-apis

  32. TWEET CRAWL API Source: https://dev.twitter.com/docs/api/1.1 Source: https://dev.twitter.com/docs/rate-limiting/1.1/limits

  33. tmhOAuth LIBRARY • Website: https://github.com/themattharris/tmhOAuth • $ git clone https://github.com/themattharris/tmhOAuth.git • Current version: 0.8.2 • Author: Matt Harris @themattharris • Goals: • Support OAuth 1.0A • Use authorization headers instead of query string or POST parameters • Allow uploading of images • Provide enough information to assist with debugging

  34. CRAWLING WITH REST API • Create an OAuth object that contains the authentication tokens • Set the parameters for the API call • Use the Twitter REST API to obtain tweets
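
The steps on this slide can be sketched in Python (the slides themselves use the PHP tmhOAuth library, whose code is not part of this transcript). This sketch only builds the request URL for the v1.1 `search/tweets` endpoint cited on slide 38 and shows how `max_id` pages backwards through older results. The function names and defaults are illustrative assumptions, and OAuth signing of the request is left out.

```python
from urllib.parse import urlencode

# v1.1 search endpoint referenced by the slides' source links.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100, max_id=None, lang=None):
    """Return the GET URL for one page of search results.

    `count` is capped at 100 per page by the v1.1 search API.
    """
    params = {"q": query, "count": count}
    if max_id is not None:
        params["max_id"] = max_id      # only return tweets with id <= max_id
    if lang is not None:
        params["lang"] = lang
    # Sorting keeps the URL stable, which makes it easy to log and compare.
    return SEARCH_URL + "?" + urlencode(sorted(params.items()))


def next_max_id(statuses):
    """Given one page of tweets (dicts with an integer 'id'), return the
    max_id to request for the next, older page, or None if the page is empty."""
    if not statuses:
        return None
    return min(status["id"] for status in statuses) - 1
```

A crawler would call `build_search_url`, fetch and decode the page, then feed the page's tweets to `next_max_id` until it returns `None`.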

  35. CRAWLING WITH STREAMING API • Create an OAuth object that contains the authentication tokens • Set the parameters for the API call • Open a connection to the Twitter server
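
Once the streaming connection is open, the server keeps sending data over it: each message is a JSON object on its own line, and blank lines act as keep-alives. A minimal Python sketch of the reading side follows, assuming the connection is exposed as an iterable of raw byte lines. The function name is an assumption, and the slides' own PHP code is not in this transcript.

```python
import json


def iter_tweets(lines):
    """Yield decoded tweet dicts from an iterable of raw stream lines (bytes)."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # blank keep-alive line
        try:
            message = json.loads(line.decode("utf-8"))
        except ValueError:
            continue                      # truncated or corrupt line: skip it
        # Only tweets carry both 'id' and 'text'; other messages
        # (for example delete or limit notices) are skipped here.
        if isinstance(message, dict) and "id" in message and "text" in message:
            yield message
```

Keeping the parser independent of the network code makes it easy to test with canned lines before pointing it at a live connection.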

  36. WHAT IS OAuth ? • OAuth = Open Authorization • What is OAuth: • An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications. • Endpoints used in the OAuth flow: • Request token URL • Authorize URL • Access token URL
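
Every request in the OAuth 1.0a flow carries an `oauth_signature`. The following Python sketch shows how an HMAC-SHA1 signature is typically computed: percent-encode and sort the parameters, build the signature base string, and sign it with the two secrets. It illustrates the protocol and is not tmhOAuth's code; the function names are assumptions.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode per RFC 3986: only unreserved characters stay literal."""
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64 HMAC-SHA1 signature for an OAuth 1.0a request.

    `params` must hold every query, body and oauth_* parameter
    except oauth_signature itself.
    """
    # 1. Encode keys and values, then sort by encoded key (and value).
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    # 2. Base string: METHOD & encoded URL & encoded parameter string.
    base_string = "&".join([method.upper(), pct(url), pct(param_string)])
    # 3. Signing key: consumer secret & token secret (empty before a token exists).
    signing_key = pct(consumer_secret) + "&" + pct(token_secret)
    digest = hmac.new(signing_key.encode("ascii"),
                      base_string.encode("ascii"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")
```

The token secret is empty when calling the request token URL and is filled in once the access token has been issued.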

  37. NORMAL SEARCH OPERATORS

  38. SEARCH PARAMETERS (REST) Source: https://dev.twitter.com/docs/api/1.1/get/search/tweets

  39. SEARCH PARAMETERS (STREAMING) Source: https://dev.twitter.com/docs/api/1.1/post/statuses/filter

  40. WHAT DOES A TWEET LOOK LIKE?

  41. CRAWLING EFFICIENCY • Duration: May 6th to June 30th, 2012 (55 days) • REST API • Maximum TPS: 450 100 15 60 50 (tweets/sec) • Streaming API • Randomly returns tweets containing a specific search keyword • The total quantity never exceeds 1% of the whole public stream
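
To put the 1% cap in perspective, here is a rough worked example in Python. It assumes the figure of about 400M tweets/day quoted on slide 8 (which may not match the 2012 crawl period) and reads the cap as 1% of all public tweets. Both are assumptions, so the result is an order-of-magnitude estimate, not a measured rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sampled_rate(tweets_per_day, sample_fraction=0.01):
    """Return (tweets per day, average tweets per second) under the cap."""
    per_day = tweets_per_day * sample_fraction
    return per_day, per_day / SECONDS_PER_DAY


per_day, per_sec = sampled_rate(400_000_000)
# About 4 million tweets/day, i.e. roughly 46 tweets/sec on average.
upper_bound_55_days = per_day * 55   # ceiling for a 55-day crawl at that rate
```

Under these assumptions a single stream could deliver at most around 220 million tweets over the 55-day window.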

  42. LARGE-SCALE CRAWLING

  43. Twitter + MySQL

  44. SINGLE NODE CRAWLING TYPE • Guideline for single node crawling: • Each streaming connection needs to authenticate itself • Total data size seems bounded (i.e. the number of tweets delivered to the crawler is limited) • Avoid connecting to the Twitter server too aggressively • Crawling with different Twitter accounts is recommended (Diagram: Twitter Server sends Tweets Streaming A, B, C, … to a single Tweet Crawler)
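
One common way to avoid hammering the server is to wait longer after each failed reconnect. Below is a minimal Python sketch of capped exponential backoff. The 5-second start, 320-second cap and the function names are assumed values for illustration and do not come from the slides.

```python
def backoff_delays(initial=5.0, cap=320.0, factor=2.0):
    """Yield reconnect wait times: initial, initial*factor, ... capped at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_backoff(connect, sleep, max_attempts=8):
    """Call connect() until it succeeds or the attempts run out.

    `connect` is any callable that raises OSError on failure;
    `sleep` is injected (e.g. time.sleep) so the logic is testable.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                raise                     # give up and surface the error
            sleep(next(delays))           # wait before the next attempt
```

In a real crawler `sleep` would be `time.sleep` and `connect` would open the streaming connection.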

  45. MULTI-NODE CRAWLING TYPE • Guideline for multi-node crawling: • Automatically check connection status • Automatically update the databases' summary information • Give the crawl program a good log-file reporting function • Design a good database schema for distributed access (Diagram: Twitter Server sends Tweets Streaming A and B to separate Tweet Crawlers)
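
Automatic connection checking can be as simple as noticing when a stream has gone silent. A small Python sketch follows, assuming each crawler calls `touch()` whenever any data arrives, keep-alive lines included. The class name and the 90-second threshold are illustrative assumptions.

```python
import time


class StallDetector:
    """Flags a stream as stalled when nothing has arrived for `timeout` seconds."""

    def __init__(self, timeout=90.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock               # injectable so tests can fake time
        self.last_seen = clock()

    def touch(self):
        """Record that data (a tweet or a keep-alive line) just arrived."""
        self.last_seen = self.clock()

    def is_stalled(self):
        """True when the silence has lasted longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout
```

A supervising loop on each node could poll `is_stalled()`, write the event to the log file, and trigger a reconnect.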

  46. DESIGN TWEET TABLE

  47. SETTING ENVIRONMENT • Install packages • # apt-get install php5 php5-curl • # apt-get install mysql-client mysql-server • # apt-get install phpmyadmin • Select Apache2 as the web server when installing phpMyAdmin

  48. SETTING ENVIRONMENT • Create the database and table for tweet crawling • Create a *.sql file that defines the database schema • Change directory to where that file is located • # mysql -h {$HOST} -u {$USER} -p{$PASSWORD} • mysql> \. {$SQL_FILE}

  49. SETTING ENVIRONMENT • Check the database with phpMyAdmin • Open a browser and go to http://localhost/phpmyadmin • Select the database and check its structure

  50. CRAWLING REAL-TIME TWEETS • Connect to the database • Save each tweet into the database
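
The two steps can be sketched as follows. To keep the example runnable without a server it uses Python's built-in `sqlite3` as a stand-in for MySQL. With MySQL the same idea would go through a MySQL connector and `INSERT IGNORE`. The table layout is hypothetical (the schema from slide 46 is not in this transcript), as are the function names.

```python
import sqlite3
from datetime import datetime


def open_db(path=":memory:"):
    """Connect and make sure the (hypothetical) tweets table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               tweet_id    INTEGER PRIMARY KEY,  -- tweet id doubles as dedup key
               created_at  TEXT NOT NULL,
               screen_name TEXT NOT NULL,
               text        TEXT NOT NULL
           )"""
    )
    return conn


def save_tweet(conn, tweet):
    """Insert one decoded tweet; return True if a new row was stored."""
    # created_at arrives in the form 'Mon May 06 20:01:29 +0000 2013'.
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    cur = conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
        (tweet["id"], created.isoformat(),
         tweet["user"]["screen_name"], tweet["text"]),
    )
    conn.commit()
    return cur.rowcount == 1
```

Using the tweet id as the primary key means that several crawlers receiving the same tweet cannot store it twice, which matters once more than one node writes to the same table.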
