Twist: User Timeline Tweets Classifier

Team: Priya Iyer, Vaidy Venkat, Sonali Sharma

Mentor: Andy Schlaikjer

Goal
  • Automatically classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
  • Input: user timeline tweets
  • Output: list of automatically classified tweets
Rationale
  • Twitter allows users to create custom Friend Lists based on the user handles.
Rationale (contd.)
  • Our application is a twist on this Twitter feature: we automatically classify the tweets on the user’s timeline based only on the terms that occur in each tweet.
Approach
  • Step 1: Data Collection
  • Step 2: Text mining
  • Step 3: Creation of the training file for the library
  • Step 4: Evaluation of several classifiers
  • Step 5: Selecting the best classifier
  • Step 6: Validating the classification
  • Step 7: Tuning the parameters
  • Step 8: Repeat until the classification is correct
Text Mining Process
  • Remove special characters
  • Tokenize
  • Remove redundant letters in words
  • Spell Check
  • Stemming
  • Language Identification
  • Remove Stop Words
  • Generate bigrams and change to lower case
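
A minimal sketch of some of the steps listed above, using only the Python standard library. It is an illustration, not the team's code: the stop-word list is a tiny made-up sample, lower-casing is done early so the stop-word lookup works, and spell check, stemming and language identification are left out because they need external resources.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"go", "such", "an", "a", "the", "is", "me"}


def clean_tweet(text):
    """Return (unigrams, bigrams) for one tweet after basic cleaning."""
    # Remove special characters: keep only letters and whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Tokenize on whitespace and change to lower case.
    tokens = text.lower().split()
    # Remove redundant letters: shrink runs of 3+ identical letters to 2.
    # A spell checker would still be needed to fix what remains.
    tokens = [re.sub(r"(.)\1{2,}", r"\1\1", tok) for tok in tokens]
    # Drop stop words and single-letter leftovers from emoticons such as ":D".
    tokens = [tok for tok in tokens if len(tok) > 1 and tok not in STOP_WORDS]
    # Generate bigrams from adjacent tokens.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens, bigrams


tokens, bigrams = clean_tweet(r"Go SF Giants! Such an amaazzzing feelin'!!!! \m/ :D")
print(tokens)
print(bigrams)
```

Collapsing repeated letters to two rather than one keeps legitimate double letters (as in "feeling") intact, at the cost of leaving some misspellings for a later spell-check step.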
  • Original tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
  • After stop-word removal: SF Giants! amaazzzing feelin’!!!! \/ :D
  • After special-character removal: SF Giants amaazzzing feelin
  • After spell check: SF Giants amazing feeling
  • After stemming: SF Giants amazing feel me
  • After a second stop-word pass: SF Giants amazing feel

Choice of ML technique
  • Logistic Regression Classifier
  • Reasons:
    • A widely used linear technique for text classification
    • Handles multiple categories with ease
    • Gave the best cross-validation accuracy and precision-recall score among the classifiers evaluated
  • Library: LIBLINEAR for Python
Creation of LIBLINEAR training input

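  • Cleaned tweet: SF Giants amazing feel
  • Indexing: SF → 1, Giants → 2, amazing → 3, feel → 4
  • Boolean features (1 = term present): SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
  • Training input line (class label first, then index:value pairs; the same sparse format LIBSVM uses): 1 1:1 2:1 3:1 4:1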
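
The indexing and Boolean encoding on this slide can be sketched as follows. The function names `build_vocabulary` and `to_sparse_line` are hypothetical helpers written for this illustration; only the output format (label followed by ascending `index:value` pairs) comes from the slide.

```python
def build_vocabulary(tokenized_tweets):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in tokenized_tweets:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_sparse_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' using Boolean term features."""
    # A term contributes 1 if it occurs at all; repeats are not counted.
    # Indices are sorted because the sparse format expects ascending order.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}".strip()


tweets = [["SF", "Giants", "amazing", "feel"]]
vocab = build_vocabulary(tweets)
# Label 1 stands for the sports category, as on the output slide.
print(to_sparse_line(1, tweets[0], vocab))  # 1 1:1 2:1 3:1 4:1
```

Terms that are not in the vocabulary are silently skipped, which is one simple way to handle unseen words at prediction time.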

THANK YOU

Andy, Marti & The Twitter Team

Data Collection Challenges – Backup Slides
  • Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
  • Tweets were not purely “Sports” or “Business” related
  • Personal messages were prominent
  • Solution: compared each tweet against a corpus of sports/business-related terms and assigned weights accordingly
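
One way such a weighting could work is sketched below. The slides do not show the actual term corpus or weighting formula, so both the term lists and the "fraction of matching tokens" score are assumptions made for illustration.

```python
# Hypothetical term lists; the slides do not show the actual corpus.
CATEGORY_TERMS = {
    "sports": {"giants", "game", "score", "team"},
    "business": {"market", "stocks", "earnings", "shares"},
}


def category_weights(tokens):
    """Return, per category, the fraction of tokens found in that category's term list."""
    if not tokens:
        return {name: 0.0 for name in CATEGORY_TERMS}
    lowered = [t.lower() for t in tokens]
    return {
        name: sum(t in terms for t in lowered) / len(lowered)
        for name, terms in CATEGORY_TERMS.items()
    }


print(category_weights(["SF", "Giants", "amazing", "game"]))
```

A tweet whose weight is low for every category, such as a personal message, could then be down-weighted or dropped from the training set.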
Text Mining Challenges
  • Noise in the data:
      • Tweets are in an inconsistent format
      • Lots of meaningless words
      • Misspellings
      • A lot of individual expression
      • For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo
      • Solution: regular expressions and an NLP toolkit
  • Different words, same root:
      • Playing, plays, playful → play
      • Solution: stemming
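
To make the idea of stemming concrete, here is a deliberately crude suffix stripper. It is not Porter's algorithm, which the project lists as its stemmer and which applies a much larger set of ordered rules; the suffix list below is an assumption chosen only to cover the example words.

```python
# Suffixes checked in order; a toy list, far smaller than Porter's rule set.
SUFFIXES = ("ing", "ful", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("Playing", "plays", "playful"):
    print(w, "->", crude_stem(w))
```

The minimum stem length stops very short words such as "is" from being cut down to a single letter.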

LIBLINEAR output for a test file of 20 tweets
  • Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
  • Output: comma-separated values of the category assigned to each tweet
  • Accuracy here is 94%. Precision: 0.89, Recall: 0.89
  • Experiment with different kernels for better accuracy
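
For reference, accuracy, precision and recall over the four integer labels can be computed as sketched below. The slides do not say how precision and recall were averaged across categories, so macro-averaging is an assumption here, and the labels in the example are toy values, not the project's test file.

```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, macro_precision, macro_recall) for integer class labels."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        predicted = sum(p == cls for _, p in pairs)
        actual = sum(t == cls for t, _ in pairs)
        # A class that is never predicted (or never present) contributes 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    n = len(precisions)
    return accuracy, sum(precisions) / n, sum(recalls) / n


# Toy labels for illustration only (1=sports, 2=finance, 3=entertainment, 4=technology).
truth = [1, 1, 2, 2, 3, 4]
guess = [1, 2, 2, 2, 3, 4]
print(evaluate(truth, guess))
```

Macro-averaging treats every category equally regardless of how many tweets it contains; a micro-average would instead weight each tweet equally.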
Summary: Data Source/Software/Tools
  • Category-based tweets from
    • https://twitter.com/i/#!/who_to_follow/interests
  • Coding done in Python
  • Database – sqlite3
  • ML tool – LIBSVM
  • Stemming – Porter’s stemming
  • NLP toolkit