
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma. Mentor: Andy Schlaikjer

Twist: User Timeline Tweets Classifier. Goal: auto-classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology. Input: user timeline tweets.


Presentation Transcript


  1. Twist: User Timeline Tweets Classifier • Team: Priya Iyer, Vaidy Venkat, Sonali Sharma • Mentor: Andy Schlaikjer

  2. Goal • Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology • Input: user timeline tweets • Output: list of auto classified tweets

  3. Rationale • Twitter allows users to create custom Friend Lists based on the user handles.

  4. Rationale (contd.) • Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.

  5. Approach • Step 1: Data collection • Step 2: Text mining • Step 3: Creation of the training file for the library • Step 4: Evaluation of several classifiers • Step 5: Selecting the best classifier • Step 6: Validating the classification • Step 7: Tuning the parameters • Step 8: Repeat until the classification is correct

  6. Text Mining Process • Remove special characters • Tokenize • Remove redundant letters in words • Spell Check • Stemming • Language Identification • Remove Stop Words • Generate bigrams and change to lower case

  7. Example • Input: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D • After stop-word removal: SF Giants! amaazzzing feelin’!!!! \/ :D • After special-character removal: SF Giants amaazzzing feelin • After spell check: SF Giants amazing feeling • After stemming: SF Giants amazing feel
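A minimal sketch of part of this cleaning pipeline, covering only stop-word removal, special-character removal, lower-casing and bigram generation. Spell checking, stemming and language identification are left out. The stop-word list, the emoticon filter and the function names `clean_tweet` and `bigrams` are assumptions made for illustration and do not come from the slides.

```python
import re

# Assumed, deliberately tiny stop-word list; the slides do not give the real one.
STOP_WORDS = {"go", "such", "an", "a", "the", "is", "to", "of"}


def clean_tweet(text):
    """Return lower-cased tokens with stop words and special characters removed."""
    tokens = []
    for raw in text.split():
        # Crude emoticon filter (an assumption): tokens such as ":D" or "\m/"
        # start with a symbol, so they are skipped entirely.
        if not raw[0].isalnum():
            continue
        # Keep ASCII letters and digits only, then lower-case.
        word = re.sub(r"[^A-Za-z0-9]", "", raw).lower()
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


tokens = clean_tweet("Go SF Giants! Such an amaazzzing feelin'!!!! \\m/ :D")
print(tokens)
print(bigrams(tokens))
```

Because the spell-check step is omitted, the elongated word "amaazzzing" survives unchanged here; in the slide it is corrected to "amazing" before stemming.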

  8. Choice of ML technique • Logistic Regression Classifier • Reasons: • Most popular linear classification technique for text classification • Ability to handle multiple categories with ease • Gave the best cross-validation accuracy and precision-recall score • Library: LIBLINEAR for Python
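To illustrate the technique named on this slide, here is a from-scratch sketch of multi-class (softmax) logistic regression over boolean term features, trained with plain stochastic gradient descent. It is not the LIBLINEAR API and not the team's code: the function names, the toy data, the learning rate and the epoch count are all assumptions, and LIBLINEAR uses its own solvers and regularization.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score(w, active):
    """Linear score: bias (index 0) plus the weights of the active features."""
    return w[0] + sum(w[j] for j in active)


def train(rows, labels, n_features, classes, epochs=200, lr=0.5):
    """Fit one weight vector per class with stochastic gradient descent."""
    weights = {c: [0.0] * (n_features + 1) for c in classes}
    for _ in range(epochs):
        for active, label in zip(rows, labels):
            probs = softmax([score(weights[c], active) for c in classes])
            for c, p in zip(classes, probs):
                # Gradient of the log loss with respect to the class score.
                grad = p - (1.0 if c == label else 0.0)
                weights[c][0] -= lr * grad
                for j in active:
                    weights[c][j] -= lr * grad
    return weights


def predict(weights, classes, active):
    """Return the class with the highest linear score."""
    return max(classes, key=lambda c: score(weights[c], active))


# Toy data (hypothetical): each row lists the indices of the terms present.
CLASSES = [1, 2]  # e.g. 1 = sports, 2 = finance
rows = [[1, 2], [1, 3], [4, 5], [4, 6]]
labels = [1, 1, 2, 2]
weights = train(rows, labels, n_features=6, classes=CLASSES)
print(predict(weights, CLASSES, [1]))
```

The same structure extends to the four categories by listing four class labels; each class simply gets its own weight vector.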

  9. Creation of LIBLINEAR training input • Tokens: SF Giants amazing feel • Indexing: SF → 1, Giants → 2, amazing → 3, feel → 4 • Boolean features: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1) • Training input for the SVM: 1 1:1 2:1 3:1 4:1
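The indexing and boolean encoding on this slide can be sketched as follows. The helper names `build_vocabulary` and `to_liblinear_line` are hypothetical; the only format details relied on are the ones visible in the slide's own example line (`label index:value ...`) plus the requirement that feature indices appear in ascending order.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_liblinear_line(label, tokens, vocab):
    """Format one tweet as 'label index:1 ...' with ascending feature indices."""
    # A set gives boolean (presence/absence) features: repeats count once.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


# One sports tweet (label 1), already cleaned and stemmed as on the slide.
tweets = [(1, ["SF", "Giants", "amazing", "feel"])]
vocab = build_vocabulary(tokens for _, tokens in tweets)
lines = [to_liblinear_line(label, tokens, vocab) for label, tokens in tweets]
print(lines[0])  # 1 1:1 2:1 3:1 4:1
```

Terms that are missing from the vocabulary are silently dropped, which is one simple way to handle unseen words at test time.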

  10. Demo

  11. THANK YOU Andy, Marti & The Twitter Team

  12. Questions?

  13. Data Collection Challenges – Backup Slides • Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business” • Tweets were not purely “Sports” or “Business” related • Personal messages were prominent • Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
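The slides do not say how the weights were computed, so the following is only one plausible reading: score each tweet by the fraction of its tokens found in a category term list and keep tweets above a threshold. The term lists, the threshold and the function names are all hypothetical.

```python
# Hypothetical term lists; the slides do not publish the actual corpus.
CATEGORY_TERMS = {
    "Sports": {"giants", "game", "score", "league", "coach"},
    "Business": {"stocks", "market", "earnings", "shares", "ipo"},
}


def category_weight(tokens, category):
    """Fraction of the tweet's tokens that appear in the category's term list."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in CATEGORY_TERMS[category])
    return hits / len(tokens)


def filter_on_topic(tweets, category, threshold=0.25):
    """Keep only tweets whose weight for the category reaches the threshold."""
    return [t for t in tweets if category_weight(t, category) >= threshold]


collected = [["giants", "game"], ["happy", "birthday", "mom"]]
print(filter_on_topic(collected, "Sports"))
```

A filter of this kind would discard personal messages that happen to come from a sports or business account, which is the problem the slide describes.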

  14. Text Mining Challenges • Noise in the data: inconsistent formats, many meaningless words, misspellings, and highly individual expression. For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo. Solution: regular expressions and an NLP toolkit • Different words, same root: playing, plays, playful → play. Solution: stemming
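Two of these fixes can be sketched with the standard library alone. The regular expression collapses elongated letter runs, and `naive_stem` is a toy suffix stripper that stands in for a real stemmer. It is not Porter's algorithm, and the suffix list and minimum stem length are assumptions chosen to reproduce the slide's example.

```python
import re


def squeeze_repeats(word):
    """Collapse any run of three or more identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


# Assumed suffix list, checked in order; a real stemmer has many more rules.
SUFFIXES = ("ing", "ful", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(squeeze_repeats("BAAAAAAAAAAAASSKEttt!!!!"))
print([naive_stem(w) for w in ["Playing", "plays", "playful"]])
```

Squeezing runs to two characters rather than one keeps legitimate double letters intact, leaving the remaining misspelling for a spell checker to resolve.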

  15. Sample LIBLINEAR input format (Train)

  16. LIBLINEAR output for a test file of 20 tweets • Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets • Comma-separated values of the categories that each tweet is assigned to • Accuracy here is 94%. Precision: 0.89. Recall: 0.89 • Experiment with different kernels for better accuracy
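For reference, accuracy and per-category precision and recall can be computed from the true and predicted labels as sketched below. The label lists are made-up toy data, not the team's test file, and the slides do not say how the reported precision and recall were averaged across the four categories.

```python
def accuracy(y_true, y_pred):
    """Share of tweets whose predicted category matches the true one."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single category label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology.
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 4]
print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, label=2))
```

Here one sports tweet is misfiled as finance, so finance keeps full recall but loses precision, while sports keeps full precision but loses recall.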

  17. Summary: Data Source/Software/Tools • Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests • Coding done in Python • Database: sqlite3 • ML tool: LIBSVM • Stemming: Porter’s stemmer • NLP toolkit
