Twitter, Big Data, and Other Ramblings


    Presentation Transcript
    1. Twitter, Big Data, and Other Ramblings Robert Dittmer

    2. Perspective on those V words • Volume: 1% of the Twitter stream for roughly one month was about 68 million Tweets. Multiply that by 100 for the full stream: about 6.8 billion Tweets a month. Facebook has the same problem. • Velocity: How do you analyze thousands of data points in real time? SQL Server sure isn’t going to do that. • Variety: Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic, External • Think about what goes into Amazon recommending you a book or movie • Veracity: It all means nothing if it’s not at least somewhat clean
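The Volume figure can be worked through as simple arithmetic. This is a minimal sketch that assumes a 30-day month (the slide only says "roughly one month"); the variable names are illustrative.

```python
# Scale the 1% sample up to the full Twitter stream.
SAMPLE_TWEETS = 68_000_000   # from the slide: about 68 million Tweets in the sample
SAMPLE_PERCENT = 1           # from the slide: the sample is 1% of the stream
DAYS = 30                    # assumption: "roughly one month" taken as 30 days

full_stream_tweets = SAMPLE_TWEETS * 100 // SAMPLE_PERCENT
tweets_per_second = full_stream_tweets / (DAYS * 24 * 60 * 60)

print(f"{full_stream_tweets:,} Tweets per month")                  # 6,800,000,000
print(f"{tweets_per_second:,.0f} Tweets per second on average")    # 2,623
```

The per-second average is what makes Velocity a problem in its own right: the data never stops arriving while you analyze it.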

    3. What do you do with a Tweet? • Sentiment Analysis assigns a numerical value to a word or piece of text • Positive, Negative, or Neutral connotation • Methods for performing Sentiment Analysis • “Dumb” Method: Break the text down into individual words and compare each with a sentiment dictionary. AKA “Bag of Words” • “Smart” Method: Use a natural language processing tool to analyze parts of speech and calculate sentiment based on context • Example Tweet • “The Apple iPad sucks. The new Google Nexus 7 is awesome!”
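The "Dumb" method can be sketched in a few lines. The two-entry dictionary and the function names below are assumptions made for illustration; a real sentiment dictionary has thousands of scored words.

```python
import re

# Hypothetical sentiment dictionary: word -> score.
SENTIMENT = {"sucks": -1, "awesome": 1}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words_score(text):
    """Sum the dictionary score of every word; unknown words count as 0 (neutral)."""
    return sum(SENTIMENT.get(word, 0) for word in tokenize(text))

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"
print(bag_of_words_score(tweet))  # -1 + 1 = 0
```

The example Tweet nets out to 0, which shows the weakness of this method: it cannot tell that the negative word belongs to the iPad and the positive word to the Nexus 7. That is the gap the "Smart" method tries to close.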

    4. Collecting Tweets • Twitter offers an HTTP streaming API for receiving Tweets as they are posted • Steps to start streaming your own Tweets • Go to and create an application • Generate your OAuth credentials • Find an open-source Twitter library • Tweepy (Python) • Tweetinvi (C#) • Plug your credentials in and modify the example

    5. The Tweet, the Whole Tweet, and Nothing but the Tweet • JSON Format (Key-Value Pair) • Notable Fields • ID • CreatedAt • Text • Entities • Hashtags • URLs • Latitude, Longitude

    6. What does a Tweet look like? • {"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)","geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":330043889589288960,"source":"<a href=\"\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu May 02 19:39:29 +0000 2013","in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960","place":{"id":"0613276b16c0d59f","bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},"place_type":"city","name":"WestDevon","attributes":{},"country_code":"GB","url":"","country":"United Kingdom","full_name":"West Devon, Devon"},"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC","profile_banner_url":"","id":503242961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,"profile_image_url_https":"","geo_enabled":true,"profile_background_image_url":"","profile_background_image_url_https":"","follow_request_sent":null,"url":"","utc_offset":0,"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill","id_str":"503242961","profile_image_url":"","listed_count":0,"is_translator":false},"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}
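Pulling the notable fields out of a Tweet like this one takes only the standard JSON parser. The sketch below uses a trimmed copy of the sample Tweet; the function name and the output keys are illustrative, and reading hashtag and URL entities as objects with "text" and "url" keys is an assumption, since both lists are empty in the sample.

```python
import json

# Trimmed copy of the sample Tweet, keeping only the notable fields.
raw = """{"id": 330043889589288960,
 "created_at": "Thu May 02 19:39:29 +0000 2013",
 "text": "Iron man 3 was awesome =)",
 "entities": {"symbols": [], "urls": [], "hashtags": [], "user_mentions": []},
 "coordinates": {"type": "Point", "coordinates": [-4.00720746, 50.73529254]}}"""

def notable_fields(tweet):
    """Flatten one parsed Tweet into the fields listed on the previous slide."""
    entities = tweet.get("entities") or {}
    point = tweet.get("coordinates")  # null when the Tweet is not geotagged
    # In the sample, "coordinates" is ordered [longitude, latitude],
    # while the "geo" field is ordered [latitude, longitude].
    longitude, latitude = point["coordinates"] if point else (None, None)
    return {
        "id": tweet["id"],
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        # Assumption: each entity is an object carrying a "text" or "url" key.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["url"] for u in entities.get("urls", [])],
        "latitude": latitude,
        "longitude": longitude,
    }

fields = notable_fields(json.loads(raw))
print(fields["text"], fields["latitude"], fields["longitude"])
```

Note the coordinate ordering: reading the wrong field, or the right field in the wrong order, silently swaps latitude and longitude.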

    7. My Tweet Collection • Collected for roughly one month • Lots of trial and error • Originally used Tweepy, but ran into errors • Switched to Tweetinvi and it worked • About 68 million Tweets • Apple • Amazon • Google • Microsoft • Netflix • Tesla • Ford (Probably should have used a different car company)

    8. Yahoo! Finance Detour • Use an HTTP request to get stock data • • Create a metric with stock data and compare the sentiment of a company to its stock performance
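The slide does not say which metric was used. One possible choice, shown here purely as an assumption, is the Pearson correlation between a company's daily average sentiment and its daily stock return. The numbers below are made-up placeholders, not collected data.

```python
from math import sqrt

def daily_returns(closes):
    """Fractional change from each closing price to the next."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only.
closes = [100.0, 102.0, 101.0, 104.0]   # four daily closing prices
sentiment = [0.4, -0.1, 0.5]            # average sentiment on the last three days

print(pearson(sentiment, daily_returns(closes)))
```

A value near 1 would mean sentiment and returns moved together over the period, near -1 that they moved in opposite directions, and near 0 that there was no linear relationship.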

    9. Big Data (and regular data) Tools • Talend Open Studio • Hadoop • SAP HANA

    10. Talend Open Studio • Open Source ETL Tool • Built on Eclipse • Data Quality and Format Issues • Even though I saved Tweets in delimited format, issues remained • Iterated through all 12,736 files with 5,000 Tweets each • Verified each row against a schema • Mapped to different output files • Tweet (Fact table) • Tracks • User Mentions • Hashtags • URLs • Demo Time!
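The "verify each row against a schema" step is done in Talend in the talk. The same idea can be sketched in plain Python. The three-column schema, the pipe delimiter, and the function names are assumptions made for illustration.

```python
import csv

# Hypothetical schema: (column name, converter). A converter raises ValueError on bad input.
SCHEMA = [("tweet_id", int), ("created_at", str), ("text", str)]

def validate_row(row):
    """Return a dict of typed values, or None if the row does not fit the schema."""
    if len(row) != len(SCHEMA):
        return None
    try:
        return {name: convert(value) for (name, convert), value in zip(SCHEMA, row)}
    except ValueError:
        return None

def split_rows(lines, delimiter="|"):
    """Route each delimited line to the valid or the rejected output."""
    valid, rejected = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        record = validate_row(row)
        if record is None:
            rejected.append(row)
        else:
            valid.append(record)
    return valid, rejected

sample = [
    "1|Thu May 02 19:39:29 +0000 2013|Iron man 3 was awesome =)",
    "not-a-number|Thu May 02 19:40:00 +0000 2013|bad id",
    "2|too few columns",
]
valid, rejected = split_rows(sample)
print(len(valid), "valid,", len(rejected), "rejected")  # 1 valid, 2 rejected
```

Keeping the rejected rows rather than dropping them makes it possible to inspect what went wrong, which matters when the input is millions of Tweets with free-form text.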

    11. Hadoop Overview • Based on the Hadoop Distributed File System (HDFS) and MapReduce • MapReduce is a way of parallelizing batch processing across a cluster • Map finds the data you’re looking for • Reduce aggregates that data (count, sum, average) • Embarrassingly parallel processing • Each server in a Hadoop cluster is referred to as a Node • NameNode • DataNode • Blocks of data are replicated to three nodes by default • Extremely fault tolerant
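The map and reduce steps can be imitated on a single machine to show the shape of the computation. This is a plain-Python sketch of a word count, not Hadoop code; the function names and sample texts are illustrative.

```python
from collections import defaultdict

def map_phase(tweet_text):
    """Map: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in tweet_text.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values for one key (here, a count)."""
    return key, sum(values)

tweets = ["Hadoop is awesome", "Hive is on Hadoop"]
pairs = [pair for text in tweets for pair in map_phase(text)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key. That independence is what makes the work embarrassingly parallel: on a cluster, the calls can run on whichever nodes hold the data.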

    12. More Hadoop • Open-source technology • Cloudera vs. Hortonworks • Intel, IBM, MapR, Amazon EMR • Cloudera and Hortonworks are the two biggest faces of Hadoop • Intel actively contributes to optimize it for Xeon Processors • IBM and MapR also involved • Big companies and entities use it

    13. Hadoop Projects • Hive • Data Warehouse on top of Hadoop • Uses HiveQL (essentially SQL with a few extras) to query data • Abstracts MapReduce processes • Has an ODBC connector that allows it to talk to anything that talks to databases • Pig • Uses a language called Pig Latin to analyze data • Data flow language abstracts MapReduce for easy use by data analysts • HBase • Billions of rows and millions of columns • Distributed column-oriented data store

    14. Hadoop Trivia Time • Who created Hadoop? • Why is it called Hadoop? • Who developed the concept of MapReduce? • What does Facebook Messenger use to store its data? • Who created Hive? • What is Accumulo and who created it?

    15. 2nd Generation Hadoop • Much faster than previous versions • Hive 0.12 is up to 50X faster than previous versions • Hortonworks Stinger project aims for 100X performance improvement • Projects like Spark are moving towards real-time analysis • In-memory cluster computing • Stream processing with routines written in Python and Scala • Shark is an implementation of Hive using Spark instead of MapReduce

    16. Hadoop Sentiment Analysis • Used the “Dumb” method of Sentiment Analysis • Import the data into HDFS and create Hive tables • Tweet • Sentiment Dictionary • Explode words in each tweet to create a view with TweetID and Word • Join with the Sentiment Dictionary on the word to get sentiment value • Demo Time!
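The explode-and-join steps run in Hive in the talk. The same data flow can be sketched in plain Python to show what each step produces. The sample rows and the two dictionary entries are placeholders for illustration.

```python
from collections import defaultdict

tweets = [  # (tweet_id, text): illustrative rows standing in for the Tweet table
    (1, "Iron man 3 was awesome"),
    (2, "this phone sucks"),
    (3, "no scored words here"),
]
sentiment_dictionary = {"awesome": 1, "sucks": -1}  # hypothetical entries

def explode(rows):
    """Produce one (tweet_id, word) row per word, like the exploded view."""
    for tweet_id, text in rows:
        for word in text.lower().split():
            yield tweet_id, word

def join_and_sum(word_rows, dictionary):
    """Inner join on the word, then sum the sentiment value per Tweet."""
    totals = defaultdict(int)
    for tweet_id, word in word_rows:
        if word in dictionary:
            totals[tweet_id] += dictionary[word]
    return dict(totals)

print(join_and_sum(explode(tweets), sentiment_dictionary))  # {1: 1, 2: -1}
```

Tweet 3 is missing from the result because none of its words match the dictionary. With an inner join, such Tweets drop out rather than scoring 0, which is worth remembering when averaging sentiment later.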

    17. SAP HANA • In-Memory, Column-Store database • Loads all data into main memory • Analyze billions of rows with sub-second response time • Column-store table structure • Allows for much better compression and parallelization than row-store • Used for real-time analytics • Available as an on-premises appliance or a cloud-based VM
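Why a column layout compresses well can be shown with a toy example. The sketch below uses run-length encoding, one common column compression technique; the slides do not say which techniques SAP HANA applies, so treat the choice as an assumption. The sample rows are illustrative.

```python
from itertools import groupby

rows = [  # row-store view: illustrative (company, sentiment) records
    ("Apple", 1), ("Apple", -1), ("Apple", 1), ("Google", 1), ("Google", 1),
]

# Column-store view: one list per column instead of one tuple per row.
company_column = [company for company, _ in rows]
sentiment_column = [sentiment for _, sentiment in rows]

def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

print(run_length_encode(company_column))  # [('Apple', 3), ('Google', 2)]
print(sum(sentiment_column))              # 3: an aggregate touches only one column
```

Values in a single column tend to repeat, so they collapse into a few runs. An aggregate over one column also never has to read the others, and separate columns can be scanned in parallel.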

    18. Why is SAP HANA Awesome? • Column-stores are naturally very good at parallelization • In-memory means no waiting on disk I/O; main memory is still hundreds of times faster than an SSD • Feature rich • Text analytics • Predictive Analysis Library • Application Server • It is an actual database and does everything a database does • Demo Time!

    19. SAP HANA Sentiment Analysis • Sentiment is calculated when creating a full-text index on the text of the Tweet • Creates a sentiment value for each Tweet • Analyze by different dimensions • Aggregate sentiment by hour • Demo Time!
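Aggregating sentiment by hour is done inside SAP HANA in the talk. The same aggregation can be sketched in plain Python, assuming each Tweet already has a sentiment value. The timestamp format matches the `created_at` field of the sample Tweet; the scored rows are placeholders.

```python
from collections import defaultdict
from datetime import datetime

# Format of the created_at field, e.g. "Thu May 02 19:39:29 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def hour_bucket(created_at):
    """Parse a created_at string and truncate it to the start of its hour."""
    timestamp = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return timestamp.replace(minute=0, second=0)

def average_by_hour(scored_tweets):
    """Average the sentiment of all Tweets that fall into the same hour."""
    buckets = defaultdict(list)
    for created_at, sentiment in scored_tweets:
        buckets[hour_bucket(created_at)].append(sentiment)
    return {hour: sum(values) / len(values) for hour, values in buckets.items()}

scored = [  # (created_at, sentiment): placeholder scores for illustration
    ("Thu May 02 19:39:29 +0000 2013", 1),
    ("Thu May 02 19:05:00 +0000 2013", -1),
    ("Thu May 02 20:01:10 +0000 2013", 1),
]
for hour, average in sorted(average_by_hour(scored).items()):
    print(hour.isoformat(), average)
```

Swapping `hour_bucket` for a function that returns the company or the location would aggregate along a different dimension with the same code.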

    20. Other Text Analysis Options • Python Natural Language Toolkit • Analyze parts of speech and context • Should be possible to integrate with Hadoop (The Google did not help)

    21. Other Big Data Problems • A GE engine on a transatlantic flight generates 2 TB of sensor data • There are four engines on a 747 • What does CERN do with the 15 petabytes of data the LHC creates annually? • How does the NSA store a yottabyte of data? • How does a small online gaming company analyze its customer base to increase retention and margins?

    22. How is Sentiment Analysis Being Used? • Companies ingest their social media feeds into these systems • If a Tweet or Facebook post meets certain criteria, an automated or human response can be triggered

    23. Hot vs. Cold Data • Hot data is the recent data you are most interested in • Keep this data in SAP HANA for real-time processing • Archive it after a period of time: 1 month, 3 months, 6 months, etc. • Cold data is your historical data • Data warehouses that can handle massive volumes of data are EXPENSIVE!!!! • Use Hadoop and Hive as your data warehouse • The only cost is the hardware • Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
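The hot/cold split comes down to a cutoff date. This is a minimal sketch, assuming a 90-day window (the "3 months" option) and records that carry a timezone-aware `created_at`; the names and sample records are illustrative, and the actual movement of data between SAP HANA and Hadoop is not shown.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=90)  # assumption: archive after roughly 3 months

def split_hot_cold(records, now):
    """Keep records newer than the cutoff as hot; everything else is cold."""
    cutoff = now - HOT_WINDOW
    hot = [record for record in records if record["created_at"] >= cutoff]
    cold = [record for record in records if record["created_at"] < cutoff]
    return hot, cold

now = datetime(2013, 9, 1, tzinfo=timezone.utc)  # fixed "now" so the example is repeatable
records = [
    {"id": 1, "created_at": datetime(2013, 8, 15, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2013, 5, 2, tzinfo=timezone.utc)},
]
hot, cold = split_hot_cold(records, now)
print([r["id"] for r in hot], [r["id"] for r in cold])  # [1] [2]
```

Run on a schedule, the cold list is what would be moved to the cheaper store, while the hot list stays in memory for real-time queries.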