Network Analytics meets Text Mining for Social Media Analysis

Presentation Transcript


  1. Network Analytics meets Text Mining for Social Media Analysis Dr. Rosaria Silipo

  2. Social Media Data: Water, Water Everywhere, and Not a Drop to Drink

  3. Social Media Data: Water, Water Everywhere, and Not a Drop to Drink • What companies do with it: • Download and keep • Topic [Shift] Detection (email content routing, detecting shifts in market interest, clinical studies, querying unstructured DBs, ...) • Sentiment Analysis (marketing, polls, elections, ...) • Connection Analysis (influencers, risk analysis, ...) • ...

  4. Social Media Data: Water, Water Everywhere, and Not a Drop to Drink • The Analysis Tools: • Web Crawlers • Visual Exploration • Topic Detection (Text Mining, NLP, Ontologies) • Sentiment Score (Text Mining, NLP) • Influence Score (Network Analytics) • Find Groups (Predictive Analytics)

  5. Case Study Example: Slashdot Data • Basic Numbers: • 24532 users • 491 threads, each with 15 – 843 responses and 12 – 507 users • 113505 posts • 60 main topics • Selected Topic: Politics • (figure labels: Post, Comments)

  6. Case Study Example: Slashdot • Very rich data sources about customers! • We want to establish: • How users feel about the discussed topic (Sentiment Analysis) • Whether it matters how users feel (Network Analytics) • A more general abstraction of the results (Clustering)

  7. Sentiment Analysis (workflow annotations) • Remove anonymous users, group by PostID • Words Tagging: MPQA Corpus, positive words, negative words • BoW, Entity Filter, Word Frequency, Attitude Calculation by Document • Total Attitude by User • User Bins • Word cloud for selected users
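The annotations on this slide describe a lexicon-based scoring chain: tag words as positive or negative, count them per document, total the counts per user, and bin the users. The following is a minimal Python sketch of that idea, not the KNIME workflow itself. The tiny word sets stand in for the MPQA lexicon, the `"Anonymous"` label, the sample posts, and the rule "more positive than negative words means a positive user" are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mini-lexicon standing in for the MPQA positive/negative word lists.
POSITIVE = {"good", "great", "support", "freedom"}
NEGATIVE = {"bad", "corrupt", "fail", "waste"}


def attitude(text):
    """Return (positive_count, negative_count) for one post, bag-of-words style."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg


def attitude_by_user(posts):
    """Sum the per-post counts for every user, skipping anonymous posts."""
    totals = defaultdict(lambda: [0, 0])
    for user, text in posts:
        if user == "Anonymous":  # assumed label for anonymous users
            continue
        pos, neg = attitude(text)
        totals[user][0] += pos
        totals[user][1] += neg
    return {user: tuple(counts) for user, counts in totals.items()}


def user_bin(pos, neg):
    """Assumed binning rule: compare the positive and negative word counts."""
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


# Made-up posts, purely for illustration.
posts = [
    ("alice", "Great law, good for freedom."),
    ("bob", "A corrupt waste. Bad idea!"),
    ("Anonymous", "bad bad bad"),
    ("carol", "No opinion here."),
]
totals = attitude_by_user(posts)
bins = {user: user_bin(*counts) for user, counts in totals.items()}
```

Like the approach described in the Notes slide, this sketch does no sentence splitting and no negation handling, so "not good" still counts as one positive word.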

  8. Slashdot – Text Mining • Most Negative User: pNutz

  9. Slashdot – Text Mining • Most Positive User: dada21

  10. Slashdot – Sentiment Analysis • 16016 positive users • 7107 negative users • Most positive user: dada21 (2838 positive / 1725 negative words) • Most negative user: pNutz (43 positive / 109 negative words) • Which topics do positive users have in common? • Government • People • Law/s • Money • Market • Parties

  11. Network Creation (figure: example network of User1, User2, User3, User4, User5, User6)

  12. Topic Graphs

  13. Topic Graph: NASA

  14. Topic Graph: Sci-Fi

  15. Hubs & Authorities • Hubs = Followers • Authorities = Leaders • Workflow annotations: • Users with hub and authority weights and other features • Centrality index to define hub weight and authority weight • Filtering anonymous users and creating network
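Hub and authority weights are commonly computed with the iterative HITS scheme. The sketch below shows that scheme in plain Python, assuming HITS is a fair stand-in for the centrality index named on the slide. The edge direction (from the user who replies to the user being replied to) and the sample edges are assumptions for illustration, not taken from the Slashdot data.

```python
import math


def hits(edges, iterations=50):
    """Iterative hub/authority scores for a directed graph of (source, target) pairs."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of everyone pointing at the node.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {node: v / norm for node, v in auth.items()}
        # Hub update: sum the authority scores of everyone the node points at.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {node: v / norm for node, v in hub.items()}
    return hub, auth


# Assumed convention: an edge (a, b) means user a replied to a post by user b.
edges = [
    ("user2", "user1"),
    ("user3", "user1"),
    ("user4", "user1"),
    ("user4", "user5"),
]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy network user1 collects the most replies and ends up with the highest authority weight, while user4, who replies to both user1 and user5, gets the highest hub weight.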

  16. Hubs & Authorities (plot labels: dada21, Carl Bialik from the WSJ, Tube Steak, Doc Ruby, pNutz, 99BottlesOfBeerInMyF)

  17. KNIME: Bringing it all together • Network Analysis: users with hub and authority weights and other features • Text Analysis: user bins (positive, negative, neutral)

  18. (plot labels: dada21, Carl Bialik from the WSJ, Tube Steak, Catbeller, Doc Ruby, WebHosting Guy, 99BottlesOfBeerInMyF, pNutz)

  19. What we have found ... • The positive leaders • The neutral leaders • The negative leaders • The inactive users • What identifies each group? • How do I identify a new user? • How do I handle each user?

  20. Why Clustering? • No a priori knowledge (not even on a subset of users) • Prediction and interpretation capabilities required • k-Means algorithm
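As a reminder of what the k-Means algorithm does, here is a minimal plain-Python sketch of its two alternating steps, assignment and centroid update. The two-feature user vectors (a sentiment score and an authority weight, both scaled to 0..1) are made up for illustration, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=20):
    """Plain k-Means on tuples of floats; returns (centroids, labels)."""
    # Simplification: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical users described by (sentiment score, authority weight).
users = [
    (0.90, 0.80), (0.10, 0.10),
    (0.80, 0.90), (0.00, 0.20),
    (0.85, 0.85), (0.05, 0.10),
]
centroids, labels = kmeans(users, k=2)
```

The toy data uses k = 2 only because it contains two obvious groups. Because the method needs no labelled users, it fits the "no a priori knowledge" situation, and the resulting centroids can be inspected to interpret each cluster.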

  21. Re-sampling the Training Set • k = 10

  22. The k-Means Clusters

  23. The k-Means Clusters (cluster labels: Superfans, Neutral users, Fans, Negative users)

  24. Additional Discoveries • There are only very few real leaders! Authority and hub scores identify active participants rather than leaders. • Superfans can be found in cluster_3 • Negative and (sigh!) active users are collected in cluster_1. • Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8) • Positive users with different degrees of activity are scattered across the remaining clusters.

  25. The Operational Workflow • Cluster Extraction • Pre-processing • Assignment of new data
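Once clusters have been extracted, a new user can be assigned by applying the same pre-processing as the training data and picking the nearest cluster centroid. The sketch below illustrates this under stated assumptions: min-max normalization as the pre-processing step, Euclidean distance, and invented centroids, cluster names, and feature values.

```python
import math


def fit_min_max(rows):
    """Record (min, max) per feature column from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalize(row, params):
    """Scale one row to 0..1 with the ranges learned on the training rows."""
    return tuple(
        (value - lo) / (hi - lo) if hi > lo else 0.0
        for value, (lo, hi) in zip(row, params)
    )


def assign(row, centroids):
    """Return the name of the centroid closest to the (normalized) row."""
    return min(centroids, key=lambda name: math.dist(row, centroids[name]))


# Hypothetical training rows: (total attitude, authority weight).
training = [(120, 0.9), (-40, 0.7), (0, 0.0), (60, 0.3)]
params = fit_min_max(training)

# Hypothetical centroids in normalized feature space.
centroids = {
    "cluster_a": (0.9, 0.9),
    "cluster_b": (0.1, 0.8),
    "cluster_c": (0.3, 0.05),
}

new_user = normalize((100, 0.85), params)
label = assign(new_user, centroids)
```

Reusing the ranges learned on the training set, rather than recomputing them on the new data, keeps new users in the same feature space as the centroids.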

  26. Notes • MPQA Corpus: publicly available Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html) • User Characterization is Sum -> Mean • NLP: no sentence splitting, no negation identification • For a more refined syntax-based sentiment analysis -> "External Tool" node

  27. External Tool Node • The "External Tool" node executes any external program from the command line • Writes the input data to an input file • Calls the tool with the input file and command line options, and has it write its results to an output file • Reads the output file and presents the data at the output port
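The write, call, read cycle described on this slide can be sketched outside KNIME as follows. This is only an illustration of the pattern, not the node's implementation. The function name `run_external_tool`, the CSV file format, and the stand-in "tool" (a Python one-liner that upper-cases its input file) are all assumptions.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(rows, command):
    """Write rows to a file, run a command on it, and read back its output file."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "input.csv"
        out_path = Path(tmp) / "output.csv"
        # 1. Write the input data to an input file.
        with in_path.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        # 2. Call the tool with its options plus the input and output paths.
        subprocess.run([*command, str(in_path), str(out_path)], check=True)
        # 3. Read the output file and hand the data back to the caller.
        with out_path.open(newline="") as handle:
            return list(csv.reader(handle))


# Stand-in for a real command-line tool: copies the input file in upper case.
TOOL = [
    sys.executable,
    "-c",
    "import sys; src, dst = sys.argv[1:3]; "
    "open(dst, 'w').write(open(src).read().upper())",
]

result = run_external_tool([["dada21", "good"], ["pNutz", "bad"]], TOOL)
```

This pattern only works with tools that run non-interactively, which is why the interactive mode of a tool matters in the next slide.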

  28. Alternative Sentiment Analysis • No free, non-interactive command line tools for sentiment analysis were found • SentiStrength v2.2 (still interactive) • External Tool and Generic Web Service Client

  29. Web Crawling Workflow • Community Web Crawler Node • XML Parsing Nodes

  30. Next Steps • Integrate topic information • Integrate user demographic and behavioural information • Discover [time series] patterns for early detection of negative users and superfans • Try other techniques, maybe even on manually segmented data, to discover new user segments

  31. Where do I find more? • Whitepaper: rosariasilipo@yahoo.com • Complete Workflows + Data: www.knime.com • textmining • networkmining • combinedanalysis • (note: the three workflows above process huge data and require 16G memory) • clustering • Open Source Software: KNIME, www.knime.com

  32. Next Appointment • User Day US Boston (free) • October 22nd 2013, 10:00 - 17:00 • Microsoft New England R&D Center (NERD) • One Memorial Drive, Suite 100, Cambridge • http://www.knime.com/user-day-boston-2013

  33. Hands-on Session • 1. Download KNIME from www.knime.com

  34. Hands-on Session • 2. Install Extensions • Help -> Install New Software • Select: • KNIME & Extensions • In KNIME Labs Extensions, select: • KNIME Network Mining • KNIME Textprocessing

  35. Hands-on Session • 3. Get workflows and Slashdot data • Get workflows from USB stick (KNIMEBoston2013.zip) • Text Mining • Network Analytics • Text and Network Mining • Social Media Clustering • Slashdot Raw Data is included in the downloaded workflows • A smaller set of data is available, Slashdot Reduced Data, for lower memory requirements • Both data sets are available from USB Stick

  36. Hands-on Session • 4. Import Workflows

  37. Hands-on Session • Memory Increase in knime.ini:
      -startup
      plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
      --launcher.library
      plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502
      -vmargs
      -Xmx2G
      -XX:MaxPermSize=256m
      -server
      -Dsun.java2d.d3d=false
      -Dosgi.classloader.lock=classname
      -XX:+UnlockDiagnosticVMOptions
      -XX:+UnsyncloadClass
      -Dknime.enable.fastload=true
      -Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64

  38. Hands-on Session • 5. Improve Workflows: Text Mining (workflow annotations) • Data Preprocessing • Data Reading • Scoring and Tag Cloud • Tagging Words • Reading Tag Corpus • BoW

  39. Hands-on Session • 6. Improve Workflows: Network Analytics (workflow annotations) • Visualize Network • Create Network Object • Data Reading and Preprocessing • Clean up Network

  40. zoomba

  41. nahdude812
