Civil Unrest Events Analysis on Twitter

Fang Jin, Yao Zhang, Hang Zhang, Shashidhar Sundareisan. Outline: Motivation & Introduction, Sentiment Analysis, Geocoding, Influential User, Visualization, Conclusion.


Presentation Transcript


  1. Civil Unrest Events Analysis on Twitter Fang Jin, Yao Zhang, Hang Zhang, Shashidhar Sundareisan

  2. Outline • Motivation & Introduction • Sentiment Analysis • Geocoding • Influential user • Visualization • Conclusion

  3. Motivation • Twitter is a tool for expressing ideas; can we find common sentiment distributions in tweets? • When news spreads through the Twitter network, can we find who is most influential? • Can we infer the location of a tweet even when it carries no location information? • For a given topic, can we track how the regions of interest evolve on Twitter? • Goal: • Identify patterns and extract useful information from tweets • Gain insight into civil unrest events

  4. Data sets

  5. Civil unrest events data • Mexico election protest, 07-22-2012 • Boston Marathon explosion, 04-15-2013 • Colombia Medellín protest, 10-04-2012 • Venezuela strike, 08-03-2012 • Argentina “violencia” protest, 12-21-2012

  6. Tweet filtering method • Collect tweets using the Twitter API (Python) • For each topic, create a list of keywords unique to that topic • Filter tweet content by the keyword list or by hashtags • Insert all related tweets into a database (SQLite) • Challenges: • Depending on the size of the topic, the data set for a single topic can be quite sparse • The location information is limited
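
The filtering steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the keyword list, the table layout, and the function names `matches_topic` and `store_matching` are all assumptions, and the collection step through the Twitter API is left out.

```python
import sqlite3

# Hypothetical keyword/hashtag list for one topic; the slides do not give the real lists.
KEYWORDS = {"protesta", "marcha", "#yosoy132"}

def matches_topic(text, keywords=KEYWORDS):
    """Return True if any keyword or hashtag appears as a token of the tweet text."""
    tokens = {tok.strip(".,!?:;\"'").lower() for tok in text.split()}
    return not tokens.isdisjoint(keywords)

def store_matching(tweets, conn):
    """Insert the (id, user, text) tuples that match the topic; return how many matched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, user TEXT, text TEXT)"
    )
    rows = [t for t in tweets if matches_topic(t[2])]
    # INSERT OR IGNORE skips tweets whose id is already stored.
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [(1, "ana", "Gran marcha hoy!"), (2, "luis", "Buen dia a todos")]
    print(store_matching(sample, conn))
```

Matching on whole tokens rather than substrings keeps short keywords from firing inside unrelated words, at the cost of missing inflected forms.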

  7. Outline • Motivation & Introduction • Sentiment Analysis • Geocoding • Influential user • Visualization • Conclusion

  8. Sentiment • Task: identify whether a tweet expresses an opinion and, if so, classify it as positive, negative, or neutral based on the overall sentiment expressed by the opinion holder • Methods: • Dictionary-based: counting opinion words – Positive: wonderful, elegant, amazing – Negative: horrible, disgusting, poor • Rule-based method – Simple rules can be created manually – E.g., “There is not one thing I hate about it” • “not … negative” → positive • “never … negative” → positive • Emoticon method
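
A minimal sketch of how word counting and the “not … negative → positive” rule might be combined. The word lists come from the slide, except that “hate” is added here so the slide's example sentence can be scored; the window of five tokens after a negator and the function name `classify` are assumptions, not details from the presentation.

```python
POSITIVE = {"wonderful", "elegant", "amazing"}
NEGATIVE = {"horrible", "disgusting", "poor", "hate"}  # "hate" added for the slide's example
NEGATORS = {"not", "never"}
WINDOW = 5  # assumed: how many following tokens a negator can still flip

def classify(text):
    """Label a text positive, negative, or neutral by counting opinion words."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    score = 0
    negate_left = 0  # tokens remaining in the current negation window
    for tok in tokens:
        if tok in NEGATORS:
            negate_left = WINDOW
            continue
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        if polarity and negate_left > 0:
            polarity = -polarity  # "not ... negative" counts as positive
            negate_left = 0
        elif negate_left > 0:
            negate_left -= 1
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

if __name__ == "__main__":
    print(classify("There is not one thing I hate about it"))
```

Without the negation rule, the example sentence would be counted as negative because of “hate”, which is the weakness of plain counting that the rule-based method addresses.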

  9. Positive & Negative Dictionary

  10. Emoticon Analysis in Twitter

  11. Sentiment result • Argentina protest event sentiment distribution

  12. More sentiment results • Mexico protest event sentiment distribution

  13. More sentiment results • Boston Marathon bombing sentiment distribution

  14. More sentiment results • Positive & negative sentiment for Venezuela event, Oct 17

  15. Mexico event sentiment result • Sentiment analysis: country vs. hour, country vs. sentiment

  16. More sentiment results • We split the sentiment into finer categories: angry, surprise, question, and sad

  17. More sentiment results

  18. More sentiment results • Detailed sentiment for Venezuela event, Oct 17

  19. Sentiment summary • Counting sentiment words • Pros • Simple and easy • Cons • Cannot handle negation, like “not at all” • Cannot handle phrases, like “cannot wait to” • Rule-based • Pros • More accurate • Cons • Cannot discern different opinion holders, though this still works for tweets • To get more refined sentiment • Define more refined dictionaries • Make use of emoticons in tweets

  20. Outline • Motivation & Introduction • Sentiment Analysis • Geocoding • Influential user • Visualization • Conclusion

  21. Geocoding • Purpose: • Find the location of each tweet • Understand where a given event happens • Example: Boston Marathon bombing

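  22. Geocoding Framework (flowchart) • Twitter dataset → rule-based algorithm → does geo-info exist? • Yes: return geo-info; these tweets form the training set • No: these tweets form the test set • Training set → generate classifier → predict on the test set → return geo-info • Store geo-info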

  23. Geocoding • Rule Based Algorithm • Extract Locations from Tweets • City level • Geo information in a tweet is required • Classification Based Algorithm • Used for those tweets with no geo information • Using classification methods to predict locations • Country level • Two Complementary Methods

  24. Rule Based Algorithm • Idea: a tweet can contain a good deal of geo information that helps predict where the user posted it • Geo information in Twitter • Geo-tag: location • GPS tag • Twitter place information: • Directly indicates the city and country name • Coordinates (longitude and latitude) • Mentioned places • City or country mentioned in the tweet • User profile • Current place of residence

  25. Rule Based Algorithm • Procedure • First, check the “geotag” information of the tweet. • If there is a coordinate geotag, find the city closest to the coordinate and return it. • Second, check the Twitter place information. • If the tweet has city information, return the city name directly. • If the tweet has a coordinate, find the closest city and return it. • Third, check for mentions of a country and/or city in the tweet. • Return the corresponding city or country. • If there is no geo information, return none.
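
The priority order in this procedure could be sketched as below. Everything concrete here is an assumption made for illustration: the three-row gazetteer, the dictionary keys `geotag`, `place`, and `mentioned` (these are not the Twitter API's field names), and the use of great-circle distance to pick the closest city.

```python
import math

# Tiny illustrative gazetteer: (city, country, latitude, longitude), approximate values.
GAZETTEER = [
    ("Medellín", "Colombia", 6.2442, -75.5812),
    ("Bogotá", "Colombia", 4.7110, -74.0721),
    ("Quito", "Ecuador", -0.1807, -78.4678),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_city(lat, lon):
    """Return (city, country) of the gazetteer entry closest to the coordinate."""
    city, country, _, _ = min(GAZETTEER, key=lambda g: haversine_km(lat, lon, g[2], g[3]))
    return (city, country)

def geocode(tweet):
    """Apply the rules in priority order: geotag, place info, mentioned location."""
    if tweet.get("geotag"):                      # 1. coordinate geotag
        return nearest_city(*tweet["geotag"])
    place = tweet.get("place") or {}
    if place.get("city"):                        # 2a. place names the city directly
        return (place["city"], place.get("country"))
    if place.get("coord"):                       # 2b. place carries a coordinate
        return nearest_city(*place["coord"])
    if tweet.get("mentioned"):                   # 3. city/country found in the text
        return tweet["mentioned"]
    return None                                  # no geo information

if __name__ == "__main__":
    print(geocode({"geotag": (6.25, -75.56)}))
```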

  26. Rule Based Algorithm • Details of checking for mentions of a country and/or city in a tweet • If we find only a country, return this country. • If we find a city, return the city and its country. • If the city name is not unique, we assume that the largest city by population is the one mentioned, and return it. • If ambiguity still remains, we return just the country.
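
One way to read these disambiguation rules is sketched below. The lookup tables are hypothetical, the population figures are rough or (for “Villanueva”) made up to force a tie, and the fallback “return the country mentioned in the tweet, if any” is an interpretation of the last rule rather than something the slide spells out. Multi-word place names are not handled.

```python
# Hypothetical gazetteer: lowercase name -> [(city, country, population)].
CITIES = {
    "medellín": [("Medellín", "Colombia", 2_500_000)],
    "córdoba": [("Córdoba", "Argentina", 1_300_000), ("Córdoba", "Spain", 320_000)],
    "villanueva": [("Villanueva", "Colombia", 30_000), ("Villanueva", "Honduras", 30_000)],
}
COUNTRIES = {"colombia": "Colombia", "argentina": "Argentina", "spain": "Spain"}

def resolve_mention(text):
    """Return (city, country), (country,), or None for the places named in a tweet."""
    tokens = [t.strip(".,!?:;").lower() for t in text.split()]
    countries = [COUNTRIES[t] for t in tokens if t in COUNTRIES]
    for tok in tokens:
        candidates = CITIES.get(tok, [])
        if not candidates:
            continue
        # Non-unique name: assume the most populous city is the one meant.
        top = max(pop for _, _, pop in candidates)
        largest = [(city, country) for city, country, pop in candidates if pop == top]
        if len(largest) == 1:
            return largest[0]
        break  # population does not settle it; fall back to the country
    return (countries[0],) if countries else None

if __name__ == "__main__":
    print(resolve_mention("Marcha en Córdoba"))
```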

  27. Rule Based Algorithm • Result has three formats: • (country) • (city, country) • (city, state, country) • Examples: • (Medellín, Colombia) • (Quito, Ecuador) • (Ocozocoautla, Chiapas, Mexico)

  28. Rule Based Algorithm • Conclusion • We can extract geo information at the city level • Geo information in tweets is required • Discussion • Priority of geo-info: geotag > Twitter place info > mentioned city/country • Different cities with the same name in the same country: select one at random • Question • How can we predict the location of a tweet that has no geo information?

  29. Classification Based Algorithm • Motivation: • A large percentage of tweets have no geo information • The rule-based algorithm cannot give the location of those tweets • Example: • In our “Mexico” dataset, almost 50% of tweets have no geo information! • 594 instances have location information • 577 instances don’t have location information

  30. Classification Based Algorithm • Basic Idea: • Tweets sent by the same user about an event may have the same location • Tweets sent at the same time about an event tend to have the same location (Region of Interest) • Tweets with the same sentiment tend to have the same location (Sentiment Analysis) • We can combine the results of sentiment analysis and region of interest!

  31. Classification Based Algorithm • Tools: Weka • A collection of machine learning algorithms, implemented in Java, for solving data mining problems

  32. Procedure Overview • Data preprocessing • Extracting tweets from the database • Attribute selection • Ranking the attributes using Information Gain • Model training • Selecting different classifiers • Making predictions • Predicting the country name using the trained model

  33. Procedure • Data preprocessing • We extract five attributes for each tweet • User id • Time • Sentiment • Klout • Country (class): 14 countries • Attribute selection • Information Gain, based on information entropy • An attribute with higher information gain separates the classes better • Result: Time > Sentiment > Klout > UserId
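
For reference, information gain for an attribute A with respect to the class C is the standard quantity IG(C, A) = H(C) − Σ_v (|C_v| / |C|) · H(C_v), where H is Shannon entropy and C_v holds the class labels of the rows with A = v. The sketch below computes it on a four-row toy table; the rows and the attribute name `hour_bucket` are invented for illustration and do not reproduce the ranking reported on the slide.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target="country"):
    """Entropy of the class minus the weighted entropy after splitting on attr."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

# Toy data, not from the authors' dataset.
ROWS = [
    {"sentiment": "neg", "hour_bucket": "am", "country": "Mexico"},
    {"sentiment": "neg", "hour_bucket": "pm", "country": "Mexico"},
    {"sentiment": "pos", "hour_bucket": "am", "country": "Colombia"},
    {"sentiment": "pos", "hour_bucket": "pm", "country": "Colombia"},
]

if __name__ == "__main__":
    ranked = sorted(["sentiment", "hour_bucket"],
                    key=lambda a: information_gain(ROWS, a), reverse=True)
    print(ranked)
```

In this toy table `sentiment` splits the two countries perfectly (gain of 1 bit) while `hour_bucket` tells nothing about the country (gain of 0).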

  34. Procedure • Model training • We tried different types of classifiers: Naive Bayes, Bayes Network, Logistic, Decision Tree (J48), and Random Forest • Result • 10-fold cross-validation
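
The evaluation protocol, 10-fold cross-validation, can be illustrated without Weka. The sketch below is a plain-Python stand-in, not the authors' setup: folds are assigned by interleaving indices rather than by shuffled or stratified sampling, and `train_majority` is a trivial baseline used only so the harness has something to run; it is not one of the classifiers listed on the slide. The helper names are assumptions.

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Split range(n) into k interleaved folds; assumes n >= k so no fold is empty."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(rows, labels, train, k=10):
    """Mean accuracy over k folds; train(X, y) must return a predict(x) function."""
    accuracies = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        predict = train([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(rows[i]) == labels[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k

def train_majority(X, y):
    """Baseline stand-in: always predict the most common class in the training fold."""
    top = Counter(y).most_common(1)[0][0]
    return lambda x: top

if __name__ == "__main__":
    rows = list(range(20))
    labels = ["Mexico"] * 15 + ["Colombia"] * 5
    print(cross_validate(rows, labels, train_majority))
```

Each tweet is held out exactly once, so the reported figure reflects performance on data the model did not see during training.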

  35. Classification Based Algorithm • Conclusion • Tree-based classifiers give better results • Random Forest gives the best result • The importance of attributes • Time > Sentiment > Klout > UserId • The better result of tree-based classifiers also supports the result of attribute selection • Discussion • Geo information in tweets is not required • Country level only • Accuracy: best precision of 41.4% over 14 countries

  36. Geocoding summary • Rule-based Algorithm • Pros • Accuracy is relatively high • City level • Cons • If different countries have a city with the same name, it may give a wrong result • Cannot predict the location of tweets that have NO geo information • Classification-based Algorithm • Pros • Geo information is not required • The overall accuracy is known • Cons • Country level only • Needs a training set • Accuracy is an issue • Two Complementary Methods!

  37. Geocoding for Visualization • We have three formats of geocoding: • (country) • (city, country) • (city, state, country) • Using the Google Maps API • (country): only the center of the country is marked • (city, country) and (city, state, country) map to accurate coordinates (longitude, latitude)

  38. Outline • Motivation & Introduction • Sentiment Analysis • Geocoding • Influential user • Visualization • Conclusion

  39. Influential User • Motivation • Find the most influential users during civil unrest events • Study the characteristics of influential users • Predict potential unrest events

  40. Mention Graph Construction • If a tweet contains “@somebody” or “RT something”, we add an edge from this tweet to the target. • We calculate each user’s contribution to a cascade according to the user’s position in it. • A time factor is also considered: the earlier the post, the greater its contribution. • A power law is used to express the time factor in the influence score.
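
A rough sketch of the two ingredients described here: extracting edges from “@” and “RT @” patterns, and weighting posts in a cascade with a power-law decay. The slides do not give the exponent or the exact scoring formula, so the exponent `ALPHA = 1.0`, the use of a post's rank in time order as the decay variable, and the function names are all assumptions.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")  # also matches the user in "RT @user: ..."
ALPHA = 1.0  # assumed power-law exponent; not stated in the slides

def mention_edges(author, text):
    """Return (author, target) edges for every @username in the tweet text."""
    return [(author, target.lower()) for target in MENTION.findall(text)]

def time_weight(rank, alpha=ALPHA):
    """Power-law decay by position in the cascade; rank 1 is the earliest post."""
    return rank ** -alpha

def cascade_scores(posts):
    """posts: list of (timestamp, author) in one cascade -> {author: score}."""
    scores = defaultdict(float)
    for rank, (_, author) in enumerate(sorted(posts), start=1):
        scores[author] += time_weight(rank)
    return dict(scores)

if __name__ == "__main__":
    print(mention_edges("ana", "RT @Luis: vamos @eva"))
    print(cascade_scores([(30, "eva"), (10, "ana"), (20, "luis")]))
```

With this weighting the first poster in a cascade always receives the largest share, which matches the stated intent that earlier posts contribute more.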

  41. Construct Retweet Network • Venezuela: 198 • Argentina: 425 • Mexico: 554 • Example tweets: • “No dejes de dar Amor aunque te contesten con violencia, deja de mirarte a vos mismo y preocupate por el otro. Amor y Paz para todos!” (Do not stop giving love even if they answer you with violence; stop looking at yourself and care about the other person. Love and peace to all!) • “piscis Hoy no renuncies a tu felicidad solo por complacer a alguien mas, debes ser firme y hacer escuchar tu voz de protesta” (Pisces: today, do not give up your happiness just to please someone else; you must be firm and make your voice of protest heard.) • “Mi corazón. Con la gente en la marcha.” (My heart. With the people at the march.)

  42. Methods to extract the influential user • Eigenvector method • Nodes with the largest contribution to the largest eigenvalue are identified as the influential users • PageRank • Measures influence using only the link structure of the network; allows influence to propagate along the network • Klout method • A score from the company Klout that measures users’ degree of activity across different social networks
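
To make the PageRank idea concrete, here is a small power-iteration sketch on a directed edge list, where an edge (a, b) means “a retweeted or mentioned b”. It is the textbook formulation with a damping factor of 0.85 and uniform handling of dangling nodes; the presentation does not say which implementation or parameters were used, so treat these as assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
               for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank

if __name__ == "__main__":
    edges = [("ana", "luis"), ("eva", "luis"), ("luis", "ana")]
    scores = pagerank(edges)
    print(max(scores, key=scores.get))
```

In the toy graph, “luis” is retweeted by two users and ends up with the highest score, while “eva”, whom nobody points to, keeps only the baseline share.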

  43. Methods to extract the influential user • Indegree influence • Number of followers; in-degree represents the popularity of a user • Retweet influence • Identified by @username; retweets represent the content value of one’s tweets • Mention influence • Measured by the number of mentions containing one’s name; mentions represent the name value of a user

  44. PageRank is not a very good algorithm for identifying influential users in this case.

  45. Influential User • Klout score distribution • Topic specificity • Network topology may follow a normal distribution instead of a power law

  46. Analysis of one topic’s evolution on Twitter • We are interested in how topics evolve for a specific civil unrest or unexpected event • We take the Boston Marathon bombing as an example • We carried out sentiment analysis and geocoding on a day-by-day basis
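
A day-by-day breakdown of this kind amounts to grouping tweets by calendar date before counting. The sketch below assumes each tweet has already been reduced to an ISO-8601 timestamp string and a sentiment label; that input shape, the sample rows, and the function name `daily_sentiment` are illustrative assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime

def daily_sentiment(tweets):
    """tweets: iterable of (iso_timestamp, label) -> {"YYYY-MM-DD": Counter of labels}."""
    days = defaultdict(Counter)
    for stamp, label in tweets:
        day = datetime.fromisoformat(stamp).date().isoformat()
        days[day][label] += 1
    return dict(days)

if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the output.
    sample = [
        ("2013-04-15T15:10:00", "negative"),
        ("2013-04-15T18:42:00", "negative"),
        ("2013-04-16T09:05:00", "positive"),
    ]
    for day, counts in sorted(daily_sentiment(sample).items()):
        print(day, dict(counts))
```

The same grouping key can be reused for the geocoding results, so that sentiment and location can be compared day by day.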

  47. Outline • Motivation & Introduction • Sentiment Analysis • Geocoding • Influential user • Visualization • Conclusion

  48. Visualization • Google Maps shows the geolocation • Locations • Tweet counts • Cluster effect • Word cloud shows the most frequent words in tweets • Sentiment distribution • Influential user distribution

  49. Colombia protest - geocoding

  50. Colombia protest – influential users, sentiment
