This presentation outlines our planned use of Twitter's Streaming API, which provides near real-time access to public Twitter statuses. It covers the types of obtainable data, including location and user information, as well as the necessary infrastructure and expected data volume. The API is currently in alpha testing, with access to enhanced resources granted only after acceptance of an additional TOS. Our tentative work plan includes testing the environment, making our download process resistant to downtime, and exploring database integration for the parsed content.
Update by: Brian Klug, Li Fan

Presentation Overview:
• API we plan to use (syntax and commands)
• Obtainable data types (location, text, time, user, reply)
• Infrastructure (hardware, storage requirements, design)
• Tentative work plan (timeline and schedule)
API: Streaming API
• Enables near real-time access to a subset of public Twitter statuses.
• Currently in alpha test.
• Access to further restricted resources is extremely limited and granted only after acceptance of an additional TOS document.
• We have applied for credentials that grant access to these increased resources (namely a larger sample, more statuses).
• http://apiwiki.twitter.com/Streaming-API-Documentation
• Features of the Streaming API:
  • A continual connection that streams statuses over HTTP; it is opened indefinitely and requires only basic authentication at the most basic access level.
  • Output data is in XML or JSON format, both of which are easy to parse.
  • Can focus on tracking predicates that, when specific enough, return all occurrences from the full Firehose stream.
  • E.g., with a file named "tracking" containing "track=basketball,football,baseball,footy,soccer", execute (a Python equivalent is sketched below):
    curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
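A minimal sketch of the same connection from Python, for when we move beyond curl. The endpoint, keywords, and placeholder credentials come from the slide above; the use of the third-party requests library is our own choice:

    import json
    import requests  # third-party HTTP library

    STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json"

    # Basic authentication is all the lowest access level requires.
    resp = requests.post(
        STREAM_URL,
        data={"track": "basketball,football,baseball,footy,soccer"},
        auth=("AnyTwitterUser", "Password"),
        stream=True,  # hold the connection open and iterate as statuses arrive
    )

    for line in resp.iter_lines():
        if line:  # the stream sends blank keep-alive lines when idle
            status = json.loads(line)
            print(status["text"])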
Streaming API Data
• Example data (one status as delivered on the wire):
  {"truncated":false,"text":"@FreedomProject Can you bring the script tomorrow? We can write in the APE if you're not busy.","favorited":false,"in_reply_to_screen_name":"FreedomProject","source":"<a href=\"http://www.tweetdeck.com/\" rel=\"nofollow\">TweetDeck</a>","created_at":"Fri Nov 20 06:37:58 +0000 2009","in_reply_to_user_id":20688076,"in_reply_to_status_id":5882468251,"geo":null,"user":{"favourites_count":0,"verified":false,"notifications":null,"profile_text_color":"34da43","time_zone":"Tijuana","profile_link_color":"e98907","description":"I'm a Robot created in Mexican soil, therefore my name is Mexican Robot","profile_background_image_url":"http://a3.twimg.com/profile_background_images/4329659/d2e513deb84e6fdc10de6ac70ef2f637f8f62f26.jpg","created_at":"Mon Dec 22 07:34:02 +0000 2008","profile_sidebar_fill_color":"b03636","profile_background_tile":false,"location":"Surfin' tubular Innernetwaves","following":null,"profile_sidebar_border_color":"050e61","protected":false,"profile_image_url":"http://a3.twimg.com/profile_images/515614231/jessicaavvy_normal.png","statuses_count":946,"followers_count":59,"name":"MexicanRobot","friends_count":173,"screen_name":"MexicanRobot","id":18303131,"geo_enabled":false,"utc_offset":-28800,"profile_background_color":"000000","url":"http://sharkwithwheels.webs.com"},"id":5882552501}
• Data classes (extracting these fields is sketched below):
  • Who the message is in response to, if anyone
  • Client user agent
  • Location-tagged geo-aware data, if any
  • Time of creation and time zone of poster
  • Information about avatar, background, and profile
  • User metrics: statuses posted, followers, friends
  • User description: a short user-defined string
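A short sketch of pulling the data classes above out of a single raw status line. The field names are taken from the example output; the summarize helper itself is our own illustration:

    import json

    def summarize(raw_line):
        """Reduce one raw status line to the data classes we care about."""
        status = json.loads(raw_line)
        user = status["user"]
        return {
            "in_reply_to": status.get("in_reply_to_screen_name"),  # who it answers, if anyone
            "source": status.get("source"),                        # client user agent
            "geo": status.get("geo"),                              # geo-tagged location, if any
            "created_at": status.get("created_at"),                # time of creation
            "time_zone": user.get("time_zone"),                    # poster's time zone
            "statuses_posted": user.get("statuses_count"),         # user metrics
            "followers": user.get("followers_count"),
            "friends": user.get("friends_count"),
            "description": user.get("description"),                # short user-defined string
        }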
Infrastructure
• Streaming API expected volume: 3-4 million entries/day
• Storage considerations:
  • Average size of the example JSON output: ~1,400 characters
  • Messages are UTF-8; we'll assume most characters are 1 byte, so ~1,400 bytes/status
  • 1,400 bytes/status × 3.5 million statuses/day ≈ 4.9 GB/day (about 4.56 GiB/day)
  • 1 year ≈ 1.6 TiB
• Currently working on getting at least one server running Ubuntu Server in a VM to begin downloading data
• May require additional public IP addresses depending on rate limits, and additional servers depending on load
• Download first, parse later (see the sketch below)
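A rough sketch of the "download first, parse later" approach, reusing the endpoint and placeholder credentials above. The day-stamped filenames and the reconnect backoff are our own assumptions about how to survive Twitter downtime during the alpha test:

    import datetime
    import time
    import requests

    STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json"

    def download_forever():
        backoff = 1  # seconds; doubled after each failure
        while True:
            try:
                resp = requests.post(
                    STREAM_URL,
                    data={"track": "basketball,football,baseball,footy,soccer"},
                    auth=("AnyTwitterUser", "Password"),
                    stream=True,
                )
                # One raw file per day, sized near the ~4.9 GB/day estimate.
                name = "statuses-%s.json" % datetime.date.today().isoformat()
                with open(name, "ab") as out:
                    for line in resp.iter_lines():
                        if line:  # skip keep-alive lines
                            out.write(line + b"\n")  # store raw; parse later
                backoff = 1
            except (requests.RequestException, IOError):
                time.sleep(backoff)  # downtime: wait, then reconnect
                backoff = min(backoff * 2, 240)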
Tentative Timeline
• Work plan:
  • Continue investigating the use of RSS to download status updates from further back than the 15,000 statuses the Streaming API allows us to reach
  • 1-2 weeks: test our environment and make sure everything is working well
  • Make sure our methodology for downloading from the stream is resistant to Twitter downtime as features are rolled in and out of the alpha test
  • Await a possible response from Twitter regarding access to additional restricted resources (an even higher-rate firehose)
  • 2 weeks: explore how to parse the content into a DB, and whether this can realistically be done in real time in a separate process (see the sketch below)
  • Additional time for data mining, research topics, etc.
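A minimal sketch of the DB step, reading a raw day file produced by the downloader above. The SQLite schema and filename are our own illustration; a real-time variant would tail the file from a separate process instead of reading it whole:

    import json
    import sqlite3

    conn = sqlite3.connect("tweets.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS statuses (
                        id INTEGER PRIMARY KEY,
                        created_at TEXT,
                        screen_name TEXT,
                        text TEXT,
                        in_reply_to TEXT)""")

    # A day file written by download_forever(); the name is illustrative.
    with open("statuses-2009-11-20.json", encoding="utf-8") as raw:
        for line in raw:
            s = json.loads(line)
            conn.execute(
                "INSERT OR IGNORE INTO statuses VALUES (?, ?, ?, ?, ?)",
                (s["id"], s["created_at"], s["user"]["screen_name"],
                 s["text"], s.get("in_reply_to_screen_name")),
            )
    conn.commit()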