Crime/Event Detection on Twitter

Crime/Event Detection on Twitter Data Sciences Summer Institute 2011 Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign

Our Team • Team members: Elisee Habimana, Jicong Wang, Sridevi Maharaj, Ronald Doku, Mingjia Zhang, Tobias Kin Hou Lei, Ravi Khadiwala, Duber Gomez, Rui Yang • Project leaders: Yizhou Sun, Rui Li

Motivation - why Twitter? • Real Time • Wide Coverage

Motivation - An Example • An earthquake happened in Chile at 03:34 local time, Sat Feb 27, 2010 • Traditional communication almost impossible for 2-3 hours, first video image available 6-7 hours after quake Source: <Information Credibility on Twitter>, by Carlos Castillo et al.

Motivation - Another Example • Tweet posted at 2:22pm, June 28th, 20 minutes after the shot, while first news report appears almost 3 hours later

Motivation • Twitter reshapes the way people spread and receive information • Its real-time nature makes Twitter a good source of breaking news • Official and verified accounts on Twitter provide reliable information • We propose to build a web application that provides reliable, real-time, crime related information

Demo


Table of Contents • Major Challenges • Crime Focused Crawling • Tweet Classification • Event Extraction • Tweet Ranking • Clustering • Tools • Summary

Major Challenges • Most tweet content is not useful for our purposes • Pointless babble – 40% • Conversational – 38% • Pass-along value – 9% • Self-promotion – 6% • Spam – 4% • News – 4% • Crime related – 0.005% • Roughly 10,000 crime related tweets each day • Information like location and time is not always explicit • Display only the most important tweets • Present results in an organized fashion Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)

Project Flowchart

Crime Focused Crawling • Crawling crime related tweets from Twitter • Presented by Jicong Wang

A Snapshot of Twitter Data
USERID: 43893075
ID: 68542312782905344
TEXT: Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315
LOCATION: GeoLocation latitude=-6.196612, longitude=106.829552
PLACE:
TIME: Thu May 12 00:05:35 CDT 2011
URLS: url=http://lockerz.com/s/100883315
MentionedEntities: 37623286 66072730
Hashtags:
Also: number of Followers, number of Friends, name of User, etc.

NOT ALL TWEETS ARE CRIME RELATED! ONLY about 0.005%!

Observation

Iteratively Refining Rules • Repeat the above procedures until an ideal rule is obtained

Problem However, there are STILL many "fake" crime tweets

Refine the Rules • Single Keyword, e.g. crime, kill, death, police, cop, shot • Combination of Keywords, e.g. found shot OR died OR injured OR body; armed OR unarmed robbery • Key Phrases, e.g. police on scene of
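
The three rule types can be sketched as a small filter. This is only an illustration built from the slide's example keywords. The rule tables, the function names, and the reading of "found shot OR died OR injured OR body" as "found" plus any one of the other words are assumptions, not the team's actual crawler rules.

```python
import re

# Hypothetical rule tables modeled on the slide's examples.
SINGLE_KEYWORDS = {"crime", "kill", "death", "police", "cop", "shot"}
# Each combination is (words that must all appear, words of which one must appear).
COMBINATIONS = [
    ({"found"}, {"shot", "died", "injured", "body"}),
    ({"robbery"}, {"armed", "unarmed"}),
]
KEY_PHRASES = ["police on scene of"]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def match_level(text):
    """Return the strictest rule type the tweet satisfies, or None."""
    lowered = text.lower()
    tokens = tokenize(text)
    if any(phrase in lowered for phrase in KEY_PHRASES):
        return "phrase"
    for required, any_of in COMBINATIONS:
        if required <= tokens and tokens & any_of:
            return "combination"
    if tokens & SINGLE_KEYWORDS:
        return "single"
    return None
```

A crawler could keep only tweets whose level is "combination" or "phrase" and treat "single" matches as low-precision candidates.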

Result
Keyword type    Proportion of crime related tweets
Single          < 5%
Combination     50% among results from single keywords
• Improved crawling result: • About 25,000 crawled tweets per day • Over 13,000 users per day

Tweet Classification • Determine whether a tweet is related to an event • Presented by Tobias Kin Hou Lei

Are these tweets related to crime?

A Classification approach

Feature Engineering - Basic Features • Concept clusters • Natural disaster: {earthquake, tornado, ...} • Weapon: {weapon, weapons, gun, guns, gunshot, ...} • Injure: {...} • Burglar: {...} • ... • Non-Fire: {hilarious, weather, red, moon, sun, ..., musician, pizza, cook, music, dance, justin bieber} • Can generalize to unseen words, e.g. a model trained on "tornado warning" could predict "earthquake warning".
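
A minimal sketch of how concept clusters could turn words into shared features. It uses only the two clusters the slide spells out. The names `CONCEPT_CLUSTERS` and `cluster_features` are hypothetical, and the real system used about 100 clusters.

```python
import re
from collections import Counter

# Toy clusters copied from the slide's examples (assumed subset).
CONCEPT_CLUSTERS = {
    "NATURAL_DISASTER": {"earthquake", "tornado"},
    "WEAPON": {"weapon", "weapons", "gun", "guns", "gunshot"},
}
# Reverse index: word -> cluster name.
WORD_TO_CLUSTER = {
    word: name for name, words in CONCEPT_CLUSTERS.items() for word in words
}


def cluster_features(text):
    """Count tokens after replacing clustered words with their cluster name."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(WORD_TO_CLUSTER.get(token, token) for token in tokens)
```

Because "tornado" and "earthquake" map to the same feature, `cluster_features("tornado warning")` and `cluster_features("earthquake warning")` are identical, which is the generalization the slide describes.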

Traditional Classification Features • Text-only classification • But tweets are short and noisy: • at most 140 characters • contain noisy words • contain URLs and tags

Feature Engineering - Social Features • Special tags: • #hpd • #breaking news

Feature Engineering - Social Features • User as a feature • List of verified police departments on Twitter • URL • Date • Number

Feature Engineering - Social Features

Classification Model • Naive Bayes • A simple model that performs well for online classification. • With many meaningful features and enough training data, different classification models produce similar results.
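
For illustration, a multinomial Naive Bayes classifier with add-one smoothing fits in a few lines of plain Python. This is a sketch under assumptions: the class name, the smoothing constant, and the feature lists (concept clusters, hashtags, plain words) are made up for the example and are not taken from the team's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes that can be updated one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant
        self.class_counts = Counter()               # examples seen per label
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.vocab = set()

    def update(self, features, label):
        """Add one labeled example (suitable for online training)."""
        self.class_counts[label] += 1
        for feature in features:
            self.feature_counts[label][feature] += 1
            self.vocab.add(feature)

    def log_scores(self, features):
        """Return the unnormalized log posterior for every known label."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)
            for feature in features:
                score += math.log((counts[feature] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)
```

The gap between the two log scores can double as a rough confidence value, the kind of signal a later ranking step could consume.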

Training Data • Crawled from Twitter over different time periods • Manually labeled by our team • 2000 samples for training: • 60% positive samples • 40% negative samples • 1000 samples for testing: • 65% positive samples • 35% negative samples

Summary • About 100 concept clusters cover different areas of the feature space • Average accuracy on the test set is 83.788%

Event Extraction • Extracting event information and grouping • Presented by Ravi Khadiwala

Event Extraction • The text of an individual tweet may contain information not found during data crawling • This information is often useful to the user • Allows the user to visualize where a crime occurred • Allows the user to filter the view by category • Decreases the number of raw tweets the user must read • This information is also useful for improving performance • Ranking • Clustering • Improves accuracy

The Social Location Web

Temporal/Spatial Information • Five potential sources of locations, listed in descending order of perceived usefulness: • GPS tagged tweets: latitude=57.8433342, longitude=12.6506338 • 'Place' tagged tweets: (57.6190897,12.427637), (57.6190897,12.7635394), (57.8653997,12.7635394), (57.8653997,12.427637) • User location • Textual Location Extraction • Named Entity Recognition • Regular Expressions

Temporal/Spatial Information • Location information is hierarchically structured based on reliability • Use Named Entity Recognition • Succeeds on: "I just witnessed a robbery in Champaign" • Fails on: "Breaking and entering at 128 Maple St." • Use regular expressions to recognize common formatting of addresses, highways, etc. • Time is based on the tweet time
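
The regular-expression fallback might look like the sketch below. Both patterns are hypothetical and deliberately narrow (a few US street suffixes and highway prefixes). The slides do not show the expressions the team actually used.

```python
import re

# Hypothetical pattern: house number, 1-3 capitalized words, a street suffix.
STREET_ADDRESS = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}(?:St|Ave|Rd|Blvd|Dr|Ln|Way|Ct)\b\.?"
)
# Hypothetical pattern: a few common ways of writing US highways.
HIGHWAY = re.compile(
    r"\b(?:I-\d{1,3}|US-\d{1,3}|Highway\s+\d{1,3}|Hwy\s+\d{1,3})\b"
)


def extract_locations(text):
    """Return street addresses and highway names found in the text."""
    found = [m.group(0).rstrip(".") for m in STREET_ADDRESS.finditer(text)]
    found += [m.group(0) for m in HIGHWAY.finditer(text)]
    return found
```

On the slide's failing NER example, `extract_locations("Breaking and entering at 128 Maple St.")` picks up the street address, while the Champaign sentence yields nothing and is left to Named Entity Recognition.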

Location Disambiguation • Look up extracted locations in a city-to-GPS lookup table • Many American city names are repeated (Atlanta, IL vs Atlanta, GA) • Check for well-formatted locations (city, state) • If not, resolve by selecting the matching city with the largest population • Give preference to other location sources (like user location and GPS) when there are multiple matches
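
The disambiguation steps above can be sketched as follows. The table rows, including coordinates and populations, are rough illustrative values rather than authoritative data, and `resolve_city` with its `hint_state` parameter (standing in for a state taken from the user location) is a hypothetical helper.

```python
# Hypothetical lookup rows: (city, state, latitude, longitude, population).
# All numbers are rough illustrative values, not authoritative data.
CITY_TABLE = [
    ("Atlanta", "GA", 33.749, -84.388, 420000),
    ("Atlanta", "IL", 40.259, -89.233, 1700),
    ("Champaign", "IL", 40.116, -88.243, 81000),
]


def resolve_city(mention, hint_state=None):
    """Resolve a city mention to one table row, or None if unknown.

    A well-formatted "city, state" mention wins. Otherwise a state hint from
    another source is tried, then the largest matching city is chosen.
    """
    parts = [part.strip() for part in mention.split(",")]
    city = parts[0].lower()
    state = parts[1].upper() if len(parts) > 1 and parts[1] else hint_state
    candidates = [row for row in CITY_TABLE if row[0].lower() == city]
    if state:
        exact = [row for row in candidates if row[1] == state]
        if exact:
            candidates = exact
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[4])
```

With this table, a bare "Atlanta" resolves to the Georgia entry, while "Atlanta, IL" or a user-location hint of "IL" selects the Illinois one.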

Categorization • Would like categories with finer granularity than crime or not crime • Based on keyword partitions corresponding to categories, ex: • Robbery/Theft: {robbed,robbery,burglar,theft...} • Natural Disaster: {tornado,typhoon,earthquake...} • Keyword based crawling guarantees presence of words that convey meaningful category information
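
A keyword-partition categorizer could be as simple as the sketch below. It uses only the two partitions listed on the slide. The "Other" fallback and the tie-breaking by dictionary order are assumptions added for the example.

```python
import re

# Keyword partitions from the slide (elided entries are left out).
CATEGORY_KEYWORDS = {
    "Robbery/Theft": {"robbed", "robbery", "burglar", "theft"},
    "Natural Disaster": {"tornado", "typhoon", "earthquake"},
}


def categorize(text):
    """Return the category with the most keyword hits, or "Other" if none hit."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "Other"
```

Since the crawler already selects tweets by keyword, most crawled tweets should hit at least one partition, so the fallback would be rare in practice.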

Ranking • Scoring and Ordering Tweets based on Importance • Presented by Ravi Khadiwala

Ranking • We only want to display best "n" tweets • Nature of twitter may result in an extremely variable amount of data • Serves as another way to filter non-crime tweets • May be able to highlight important events • Summarize the most important data points • Avoid overwhelming the user with results

Learning to Rank • Goal: learn a function f: X -> r, where X is a vector of features and r is an importance score • Strategy: take a pointwise approach and use a sample of manually scored data to find the curve that fits our labeled data • We use linear regression with the simple least squares method to find weights such that r = w1x1 + w2x2 + w3x3 + ... + wnxn
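
As a rough sketch of the pointwise approach, the weights can be obtained by solving the normal equations (X^T X) w = X^T y. The code below does this with plain Gaussian elimination and no intercept term, matching the formula on the slide. The function names and the `top_n` helper are hypothetical, and a real system would more likely call a numerical library.

```python
import heapq


def fit_least_squares(X, y):
    """Return weights w minimizing sum((w . x - r)^2) via the normal equations."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def score(weights, x):
    """Importance score r = w1*x1 + ... + wn*xn."""
    return sum(w * xi for w, xi in zip(weights, x))


def top_n(weights, rows, n):
    """Return the n feature vectors with the highest predicted score."""
    return heapq.nlargest(n, rows, key=lambda x: score(weights, x))
```

If two features are perfectly correlated, X^T X is singular and the solver divides by zero, which is one practical reason to prune redundant features before fitting.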

Determine Ranking Features • Selected from a large pool of potential features • Social • Number of hashtags,urls,@ (indicates a reply), retweet count • Contextual • Tweet length, category, mentioned locations • User Credibility • Age of user account, friends, followers, status count, verification • Classifier Confidence

Ranking Features and Weights • Labeled ~500 tweets with a ranking (integer from 1 to 5) • Linear regression on all features (normalized) • Examined correlation coefficients • Examined weights • Pruned features • Repeated until we had an adequate feature set with logical weights

Ranking Features and Weights
Weight             Feature
-0.996904004778    category
 2.87974471144     account age
 1.71671010105     favorites
 1.17242993534     status count
 2.67005302808     followers
-3.97882564778     confidence

Clustering • Geographical location: determinant for grouping tweets together • Presented by Ronald Doku

Clustering tweets • Clustering tweets means grouping overlapping tweets found in the same location into one group.

Why is tweet clustering important? • Clustered tweets inform the user about where most events are happening at a particular time. • The sizes of the clusters also convey how relevant or important the tweets are. • E.g. a user may want to find out how far a wildfire is spreading or has spread. Clustered tweets about the wildfire on the map show the user where the fire is or has spread to.

Clustered tweets: high level overview

Clustered tweets: after click (California)

How do we cluster tweets? Also by defining at which zoom-levels each tweet should appear, we cluster the tweets to reduce the number shown at a time. We call this hierarchical clustering.
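
One simple way to get zoom-dependent grouping is a grid whose cell size halves with each zoom level. The sketch below is a stand-in for illustration, not the team's method. The cell-size formula and the function name are assumptions.

```python
from collections import defaultdict


def cluster_at_zoom(points, zoom):
    """Group (lat, lon) points into grid cells sized for the given zoom level.

    Returns one (centroid_lat, centroid_lon, count) marker per occupied cell.
    """
    cell = 360.0 / (2 ** zoom)  # assumed: cell width in degrees halves per zoom level
    cells = defaultdict(list)
    for lat, lon in points:
        key = (int((lat + 90) // cell), int((lon + 180) // cell))
        cells[key].append((lat, lon))
    return [
        (
            sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members),
            len(members),
        )
        for members in cells.values()
    ]
```

At a low zoom level, nearby cities collapse into one marker with a large count. Zooming in splits the marker into smaller clusters, which gives the hierarchical effect described on the slide.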
