
Social Media Analytics : Digital Footprints


Presentation Transcript


  1. Social Media Analytics : Digital Footprints Funded by: Sandhya Krishnan Dr. Anupam Joshi CHMPR IAB 2012

  2. Introduction • Social media has greatly changed the way we communicate. With approximately 3,000 tweets/sec (13K/sec around the Super Bowl) and 2.5 billion updates a day, it is an effective way to disseminate information to users across the world. • However, such a tool can also be used to spread misinformation quickly and efficiently, which can have a harmful impact in scenarios such as national security or business/marketing, and hence needs to be kept in check. • Our approach is to create a social footprint of users which can be used to distinguish real accounts from imposter or compromised accounts on social media.

  3. Introduction • Social media is a great way to disseminate information to users across the world. • 200 million active users and 340 million tweets/day (December 2012) • 1.11 billion users as of May 2013 • But what about disinformation (intentionally false or inaccurate information spread deliberately)?

  4. Motivation • February 2013: @flydeltaassist and @deltaassist both claim to be the same account; the fake promised free tickets to its first several thousand followers. • March 2013: two accounts both claim to be Pope Francis; the fake was tweeting against the Church’s anti-gay policy. • Slide labels: real (Twitter verified account), fake (account banned by Twitter).

  5. Motivation • @theUSpresident, @BarakObama, @BarackObama: which one is real?

  6. Motivation: August 2012 • @pmoindia, @pm0india and @dryumyumsingh each claim to be PMO India • 6 fake PMO India profiles • Tweeting content which was: • Misrepresenting violence against Muslims in Burma • Instigating riots in the North-Eastern Region of India

  7. Motivation: News/Business Scenarios • Hacked accounts: April 2013, February 2013

  8. #Twithackery: Some Recent Hacking Episodes, 2012-2013

  9. Objective • Which one is real? @BarakObama, @BarakObama__, @theUSpresident, @Obamanews, @BarackObama44, @ThePresObama • Is this account compromised? @BarackObama

  10. Success Criteria • Build a prototype system that performs a joint content and network structure analysis, demonstrating the feasibility of distinguishing real and fake profiles. • Achieve high accuracy in identifying the real accounts of “famous people”. • Evaluate further by filtering down the social media network to check the validity of accounts belonging to a layman.

  11. Solution Overview: What is a digital footprint? • Digital footprint of @barackobama • Content: words in tweets, hashtags, URLs • Network structure: mentions, following, re-tweets, followers, replies • Metadata: name, verified account?, location, created_at
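
The three parts of the footprint listed on this slide can be pictured as a simple record. The following is a minimal sketch only: the class and field names (`DigitalFootprint`, `Content`, `Network`, `Metadata`) are hypothetical, and placing mentions under the network part is one reading of the slide, not a schema confirmed by the authors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Content:
    # What the account writes (assumed grouping).
    words: List[str] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)


@dataclass
class Network:
    # Handles the account is connected to, by type of connection.
    mentions: List[str] = field(default_factory=list)
    following: List[str] = field(default_factory=list)
    followers: List[str] = field(default_factory=list)
    retweets: List[str] = field(default_factory=list)
    replies: List[str] = field(default_factory=list)


@dataclass
class Metadata:
    name: str = ""
    verified: bool = False
    location: Optional[str] = None
    created_at: Optional[str] = None


@dataclass
class DigitalFootprint:
    handle: str
    content: Content = field(default_factory=Content)
    network: Network = field(default_factory=Network)
    metadata: Metadata = field(default_factory=Metadata)
```

Each account gets its own independent lists, so footprints for a suspected fake and a trusted handle can be filled in separately and compared field by field.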

  12. Solution Overview: Create Digital Footprint, Content Module (example: @barackobama) • Extract tweets (content) via the Twitter user_timeline API • Clean the text and create a bag-of-words model • For each word, compute its TF-IDF score • Compute two groups of words: frequently occurring and rarely occurring
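
The content-module steps above (clean text, bag of words, TF-IDF, two groups) can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: each tweet is treated as one document, the stop-word list is a tiny placeholder, and the slide does not say how the two groups are cut, so the sketch simply splits at the median score and leaves the mapping of the two halves onto "frequent" and "rare" open.

```python
import math
import re
from collections import Counter
from statistics import median

# Tiny illustrative stop-word list (assumption, not from the slides).
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "for", "on", "rt"}


def clean(tweet):
    """Lowercase, drop URLs and @mentions, keep alphabetic tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOPWORDS]


def tfidf_scores(tweets):
    """Score each term as (share of all tokens) * log(N / tweets containing it)."""
    docs = [clean(t) for t in tweets]
    term_counts = Counter(w for d in docs for w in d)
    doc_freq = Counter(w for d in docs for w in set(d))
    total = sum(term_counts.values())
    n_docs = len(docs)
    return {
        w: (c / total) * math.log(n_docs / doc_freq[w])
        for w, c in term_counts.items()
    }


def split_by_score(scores):
    """Split terms at the median score (assumed cut-off) into two groups."""
    if not scores:
        return set(), set()
    cut = median(scores.values())
    higher = {w for w, s in scores.items() if s >= cut}
    return higher, set(scores) - higher
```

A term that appears in every tweet gets a score of zero under this weighting, which is why it lands in the lower-scoring group.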

  13. Solution Overview: Create Digital Footprint, Network Module (example: @barackobama) • From the Twitter user_timeline API, extract users in ‘re-tweets’ and ‘replies’ • Extract users who ‘mention’ the current user • Form the close social network
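
A rough sketch of how the close social network might be assembled from tweet text. The input shapes are assumptions made for illustration (plain strings for the user's own tweets, `(author, text)` pairs for other users' tweets), as is the convention that a re-tweet starts with `RT @handle` and a reply starts with `@handle`; a real API response carries this information in structured fields instead.

```python
import re

MENTION = re.compile(r"@(\w+)")
# A leading "@handle" is read as a reply, a leading "RT @handle" as a re-tweet.
LEADING = re.compile(r"(?:RT\s+)?@(\w+)", flags=re.IGNORECASE)


def close_network(user, own_tweets, others_tweets):
    """Return outgoing handles, incoming handles and the full node set."""
    user = user.lower()

    # Outgoing edges: people the user re-tweets or replies to.
    outgoing = set()
    for text in own_tweets:
        match = LEADING.match(text.strip())
        if match:
            outgoing.add(match.group(1).lower())

    # Incoming edges: authors whose tweets mention the user.
    incoming = {
        author.lower()
        for author, text in others_tweets
        if user in (h.lower() for h in MENTION.findall(text))
    }

    outgoing.discard(user)
    incoming.discard(user)
    return {"out": outgoing, "in": incoming, "nodes": outgoing | incoming | {user}}
```

Keeping outgoing and incoming handles separate preserves the direction of each interaction, which the degree counts used later depend on.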

  14. Solution Overview: Digital Footprint • The content module and the network module together produce the digital signature/footprint (example: @barackobama)

  15. Solution Overview: Authenticate Digital Footprint • What content is similar? • % of terms common between tweets and news articles • How similar are they? • Average difference between the TF-IDF scores of such terms • The above two metrics are computed for rare and frequent terms in both contexts, tweets and news articles (rare and frequent terms are indicated by TF-IDF)
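
The two metrics on this slide can be written down compactly. The sketch below assumes each side is already a `term -> TF-IDF score` dictionary, takes the percentage relative to the number of tweet terms, and uses the absolute difference of scores; the slide does not specify the denominator or the sign convention, so both are assumptions. The function name `content_similarity` is hypothetical.

```python
def content_similarity(tweet_scores, news_scores):
    """Return (% of tweet terms also found in news, mean |score difference|).

    The second value is None when the two vocabularies share no terms.
    """
    common = tweet_scores.keys() & news_scores.keys()
    pct_common = 100.0 * len(common) / len(tweet_scores) if tweet_scores else 0.0
    if not common:
        return pct_common, None
    avg_diff = sum(abs(tweet_scores[t] - news_scores[t]) for t in common) / len(common)
    return pct_common, avg_diff
```

To follow the slide, the function would be called once for the group of frequent terms and once for the group of rare terms.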

  16. Solution Overview: Authenticate Digital Footprint, Network Module • Network characteristics of the close social network • To understand trust propagation in social networks, we record: • Number of Twitter ‘verified’ users in the current user’s network • In some scenarios we also use: • Network intersection with a trusted user • Number of hops required to reach the current user from the trusted user in the network • Number of nodes in the network • Out-degree: from the user’s replies and re-tweets • In-degree: the user’s @mentions, in addition to @replies directed to the user and re-tweets (RT) of the user’s tweets
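
The network characteristics listed above can be computed from a small directed graph. This sketch assumes the graph is a dictionary mapping each handle to the set of handles it replies to, re-tweets or mentions, and that hop count means the shortest directed path found by breadth-first search; `network_features` and its key names are hypothetical.

```python
from collections import deque


def hops(graph, source, target):
    """Shortest number of directed hops from source to target, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def neighbours(graph, user):
    """Handles linked to the user in either direction."""
    incoming = {src for src, dsts in graph.items() if user in dsts}
    return set(graph.get(user, ())) | incoming


def network_features(graph, user, trusted, verified):
    """Collect the per-user characteristics named on the slide."""
    network = neighbours(graph, user)
    return {
        "verified_in_network": len(network & verified),
        "intersection_with_trusted": len(network & neighbours(graph, trusted)),
        "hops_from_trusted": hops(graph, trusted, user),
        "nodes": len(network) + 1,  # neighbours plus the user
        "out_degree": len(graph.get(user, ())),
        "in_degree": sum(1 for dsts in graph.values() if user in dsts),
    }
```

An unreachable user yields `None` for the hop count, which a downstream rule could treat as weak evidence of trust.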

  17. Results • Ground truth • Twitter ‘verified’ real accounts • If the above tagging is absent, manual observation of the account • Analysis done to identify real and fake profiles of “famous people” • Analysis done to identify “less famous people” • Corporate accounts • Hacked/compromised accounts • Analysis done for a specific time period or the 3500 most recent tweets, whichever is relevant

  18. Results I: “Famous People” • Digital signature/footprint built from the content module and the network module • “Famous people on Twitter”: people about whom enough information from reliable web sources is available on a day-to-day basis

  19. Results I: President Obama [1st May 2013] • Content module • [Figures: Graph 1, Graph 2]

  20. Results I: President Obama • Network module • System: @barackobama is real • Ground truth: @barackobama is the Twitter ‘verified’ real account

  21. Results I: Conclusion, “Famous People” • Total Twitter handles: 31 • Number of real handles: 18 • Number of fake handles: 13 • [Table: predicted vs. actual]

  22. Results II: “Less Famous People” • Digital signature/footprint built from the content module and the network module • “Less famous people on Twitter”: • People about whom enough information from reliable web sources is not available on a regular day-to-day basis • Information may be available on some days or in spurts (when such users are in the news for a particular event, development, etc.) • Continuous availability of web content about such users is not reliable; hence we look at the social network structure of such users

  23. Results II • A good mix of highly sought users in music, acting, fashion, journalism, media and business • US Senators • Celebrities popular in the USA • Members of Parliament, India • Celebrities popular in India

  24. Results II: Senators, USA • Digital signature/footprint from the network module • Trusted user: @barackobama

  25. Results II: Senators, USA • System: @chuckgrassley is real • Ground truth: @chuckgrassley is the Twitter ‘verified’ real account

  26. Results II: Celebrities, USA • Digital signature/footprint from the network module • Trusted users: @youtube, @justinbieber, @shakira, @kimkardashian and @cnnbrk

  27. Results II: Celebrities, USA • (Close) social network analysis • [Figures: Graph 1, Graph 2]

  28. Results II: Celebrities, USA • System: @lindsaylohan is real • Ground truth: @lindsaylohan is the Twitter ‘verified’ real account

  29. Results II: Conclusion, “Less Famous People” • Total Twitter handles: 350 • Number of real handles: 278 • Number of fake handles: 72 • [Table: predicted vs. actual]

  30. Results III: “Corporate Accounts” • Digital signature/footprint built from the content module and the network module • Handles: @bostonmarathon, @bostonmarathons, @_bostonmarathon

  31. Results IV: Detect hacked/compromised accounts on Twitter • Phase I of evaluation • Digital signature/footprint of the Twitter handle, built from the content module and the network module • Phase II of evaluation • Content comparison is also done between tweets of the compromised account and content from: • Other similar Twitter accounts • Previous content posted by the account over a significant period of time

  32. ‘@AP’ Hacked: Phase I Results • Content module • Tweet: “Breaking: Two Explosions in the White House and Barack Obama is injured” • The terms which are absent in news articles but present in the tweets of @AP:

  33. ‘@AP’ Hacked: Phase I Results • Content module • Tweet: “Breaking: Two Explosions in the White House and Barack Obama is injured” • The terms that are common between tweets and news but have a high difference in TF-IDF scores (average difference is 0.6):

  34. ‘@AP’ Hacked: Phase II • Question: on a regular day, how similar is @AP to @breakingnews, @cnn, @foxnews, @washingtonpost and @Nationnow? • Solution approach • 3500 most recent tweets of each handle • Run the content analysis module over this data set • Compute: • % of common terms between @AP and the other account handles • Average difference in TF-IDF scores between such terms • Results • 40-45% of the topics covered by these news channel accounts coincide • These topics showed very high similarity, i.e. a lower difference in TF-IDF scores • Uncommon topics were observed to be specific stories followed by the individual channels

  35. ‘@AP’ Hacked: Phase II Results • Are the terms in this tweet mentioned by the majority of news channel accounts? • Tweet: “Breaking: Two Explosions in the White House and Barack Obama is injured”
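
The question posed on this slide amounts to a majority vote over peer accounts. A minimal sketch, assuming each peer account has already been reduced to a set of terms and that "majority" means strictly more than half of the peers; the function name `majority_support` is hypothetical and the slides do not describe the exact rule used.

```python
def majority_support(tweet_terms, peer_terms):
    """Split tweet terms into those used by a majority of peers and the rest.

    peer_terms maps each peer handle to the set of terms it has tweeted.
    """
    needed = len(peer_terms) // 2 + 1  # strict majority of peer accounts
    supported = {
        term
        for term in tweet_terms
        if sum(term in terms for terms in peer_terms.values()) >= needed
    }
    return supported, set(tweet_terms) - supported
```

A tweet whose terms are mostly unsupported by the peer accounts would be a candidate for flagging; where to set that cut-off is a design choice left open here.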

  36. Other ‘Hacking’ Episodes: Successfully Caught • @48hours and @60minutes were caught accurately with the same Phase I and Phase II analysis as used for @AP

  37. Other ‘Hacking’ Episodes: Successfully Caught • Compare tweets from the day of the attack with the handle’s tweets from the past 10 days

  38. Conclusion • The content module and the network module together produce the digital signature/footprint • Authenticate this footprint to flag the account as real or fake/compromised

  39. Conclusion • Applicability of the system demonstrated in three flavors: • Authenticating ‘famous’ Twitter users • Content and network analysis modules are both extremely useful • Authenticating ‘less famous’ Twitter users • Network analysis module is more relevant • Detecting if an existing account is hacked/compromised • Only content analysis is relevant in this context • For compromised accounts, content comparison is done between tweets of the compromised account and content from: • Reliable web sources • Other similar Twitter accounts • Content posted by the account over a significant period of time

  40. Future Work • For the three flavors in which our system is usable, some immediate tasks planned are: • Authenticating ‘famous’ Twitter users • Implement a sentiment analysis module in addition to the text analysis module • Authenticating ‘less famous’ Twitter users • Incorporate context to understand who is the “famous” and hence “trusted” user in the context of the current user • Detecting if an existing account is hacked/compromised • Build an online system which will: • Constantly monitor accounts tweeting similar content • Flag if one such account tweets content very different from the others

  41. Future Work • Gather larger data sets and perform evaluations in each of the above categories • Extend the system so that it is more applicable to classifying a layman’s account as real or fake/compromised

  43. Thank you! Questions?

  44. Questions?
