
Crime/Event Detection on Twitter

Crime/Event Detection on Twitter. Data Sciences Summer Institute 2011, Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign. Our Team. Team members: Elisee Habimana, Jicong Wang, Sridevi Maharaj, Ronald Doku, Mingjia Zhang, Tobias Kin Hou Lei


Presentation Transcript


  1. Crime/Event Detection on Twitter. Data Sciences Summer Institute 2011, Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign

  2. Our Team. Team members: Elisee Habimana, Jicong Wang, Sridevi Maharaj, Ronald Doku, Mingjia Zhang, Tobias Kin Hou Lei, Ravi Khadiwala, Duber Gomez, Rui Yang. Project leaders: Yizhou Sun, Rui Li

  3. Motivation - Why Twitter? • Real time • Wide coverage

  4. Motivation - An Example • An earthquake happened in Chile at 03:34 local time, Sat Feb 27, 2010 • Traditional communication almost impossible for 2-3 hours, first video image available 6-7 hours after quake Source: <Information Credibility on Twitter>, by Carlos Castillo et al.

  5. Motivation - Another Example • A tweet was posted at 2:22pm on June 28th, 20 minutes after the shot, while the first news report appeared almost 3 hours later

  6. Motivation • Twitter reshapes the way people spread and receive information • Its real-time nature makes Twitter a good source of breaking news • Official and verified accounts on Twitter provide reliable information • We propose to build a web application that provides reliable, real-time, crime-related information

  7. Demo

  8. Crime/Event Detection on Twitter. Data Sciences Summer Institute 2011, Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign

  9. Table of Contents • Major Challenges • Crime Focused Crawling • Tweet Classification • Event Extraction • Tweet Ranking  • Clustering • Tools • Summary

  10. Major Challenges • Most tweet contents are useless for us • Pointless babble – 40% • Conversational – 38% • Pass-along value – 9% • Self-promotion – 6% • Spam – 4% • News – 4% • Crime related - 0.005% • Roughly 10,000 crime related tweets each day • Information like location and time not always explicit • Display only the most important tweets • Present results in an organized fashion Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)

  11. Project Flowchart

  12. Crime Focused Crawling: Crawling crime-related tweets from Twitter. Presented by Jicong Wang

  13. A Snapshot of Twitter Data • USERID: 43893075 • ID: 68542312782905344 • TEXT: Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315 • LOCATION: GeoLocation latitude=-6.196612, longitude=106.829552 • PLACE: (empty) • TIME: Thu May 12 00:05:35 CDT 2011 • URLS: url=http://lockerz.com/s/100883315 • MentionedEntities: 37623286, 66072730 • Hashtags: (empty) • Also: number of followers, number of friends, name of user, etc.

  14. NOT ALL TWEETS ARE CRIME RELATED! ONLY about 0.005%!

  15. Observation

  16. Iteratively Refining Rules • Repeat the above procedures until an ideal rule is obtained

  17. Problem However, there are STILL many "fake" crime tweets

  18. Refine the Rules • Single keyword, e.g. crime, kill, death, police, cop, shot • Combination of keywords, e.g. found (shot OR died OR injured OR body); (armed OR unarmed) robbery • Key phrases, e.g. police on scene of
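The three rule types on this slide can be sketched as a small filter. This is a minimal illustration, not the team's crawler: the names `SINGLE_KEYWORDS`, `COMBINATIONS`, `KEY_PHRASES`, and `match_level` are hypothetical, and the rule contents are only the examples listed on the slide.

```python
import re

# Hypothetical rule set built from the slide's examples only.
SINGLE_KEYWORDS = {"crime", "kill", "death", "police", "cop", "shot"}
# Each combination is a list of word groups; one word from every group must appear.
COMBINATIONS = [
    [{"found"}, {"shot", "died", "injured", "body"}],
    [{"armed", "unarmed"}, {"robbery"}],
]
KEY_PHRASES = ["police on scene of"]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def match_level(text):
    """Return the strictest rule type the tweet satisfies, or None."""
    words = tokenize(text)
    lowered = text.lower()
    if any(phrase in lowered for phrase in KEY_PHRASES):
        return "phrase"
    if any(all(words & group for group in combo) for combo in COMBINATIONS):
        return "combination"
    if words & SINGLE_KEYWORDS:
        return "single"
    return None
```

A tweet that only reaches the "single" level is the kind the slides describe as mostly noise, so a crawler built this way would likely keep only "combination" and "phrase" matches.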

  19. Result • Proportion of crime-related tweets by keyword type: single keyword < 5%; combination 50% (among results from single keywords) • Improved crawling result: about 25,000 crawled tweets per day, from over 13,000 users per day

  20. Tweet Classification: Determine whether a tweet is about a related event. Presented by Tobias Kin Hou Lei

  21. Are these tweets related to crime?

  22. A Classification Approach

  23. Feature Engineering - Basic Features • Concept clusters • Natural disaster: {earthquake, tornado, ...} • Weapon: {weapon, weapons, gun, guns, gunshot, ...} • Injure: {...} • Burglar: {...} • ... • Non-Fire: {hilarious, weather, red, moon, sun, ..., musician, pizza, cook, music, dance, justin bieber} • Clusters let the classifier handle unseen words, e.g. after training on "tornado warning" it can predict "earthquake warning".
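A minimal sketch of how concept clusters let a classifier generalize to unseen words: every word in a cluster is replaced by the cluster name before counting. The cluster contents below are a heavily abbreviated, hypothetical subset, and `cluster_features` is an illustrative name, not the team's code.

```python
import re
from collections import Counter

# Hypothetical, abbreviated clusters based on the slide's examples.
CONCEPT_CLUSTERS = {
    "natural_disaster": {"earthquake", "tornado"},
    "weapon": {"weapon", "weapons", "gun", "guns", "gunshot"},
    "non_crime": {"hilarious", "weather", "pizza", "music"},
}
# Invert the clusters into a word -> cluster-name lookup.
WORD_TO_CLUSTER = {
    word: name for name, words in CONCEPT_CLUSTERS.items() for word in words
}


def cluster_features(text):
    """Count features, mapping clustered words to their cluster name.

    Words outside every cluster are kept unchanged.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(WORD_TO_CLUSTER.get(tok, tok) for tok in tokens)
```

Under this mapping "tornado warning" and "earthquake warning" produce identical feature vectors, which is why training on one carries over to the other.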

  24. Traditional Classification Features • Text-only classification • But tweets are short and noisy: • at most 140 characters • contain noisy words • contain URLs and tags

  25. Feature Engineering - Social Features • Special tags: • #hpd • #breaking news

  26. Feature Engineering - Social Features • User as a feature • List of verified police departments on Twitter • URL • Date • Number

  27. Feature Engineering - Social Features

  28. Classification Model • Naive Bayes • A simple model that performs well for online classification • With many meaningful features and enough training data, different classification models produce similar results.
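A minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written so it can be updated one labeled tweet at a time (one reading of "online classification"). The class name, the token features, and the four training examples are made up for illustration and are not the team's model or data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token features."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, tokens, label):
        # Incremental update, so labeled tweets can be added one at a time.
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_scores(self, tokens):
        """Return {label: log prior + sum of smoothed log likelihoods}."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy, made-up training examples (not the team's labeled data).
model = NaiveBayes()
model.update(["weapon", "robbery", "police"], "crime")
model.update(["shooting", "police", "injured"], "crime")
model.update(["pizza", "music", "hilarious"], "other")
model.update(["weather", "sun", "music"], "other")
```

In practice the token lists would be the concept-cluster and social features described on the earlier slides rather than raw words.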

  29. Training Data • Crawled from Twitter over different time periods • Manually labeled by our team • 2000 samples for training, among them: • 60% positive samples • 40% negative samples • 1000 samples for testing, among them: • 65% positive samples • 35% negative samples

  30. Summary • About 100 concept clusters cover different areas of the feature space • Average accuracy on the test set is 83.788%

  31. Event Extraction: Extracting event information and grouping. Presented by Ravi Khadiwala

  32. Event Extraction • The text of an individual tweet may contain information not previously found through data crawling • This information is often useful to the user • Allows the user to visualize where a crime occurred • Allows the user to filter by category • Decreases the number of raw tweets the user must read • This information is also useful to improve performance • Ranking • Clustering • Improves accuracy

  33. The Social Location Web

  34. Temporal/Spatial Information. Five potential sources of locations, listed in descending order of perceived usefulness: • GPS-tagged tweets: latitude=57.8433342, longitude=12.6506338 • 'Place'-tagged tweets: (57.6190897,12.427637), (57.6190897,12.7635394), (57.8653997,12.7635394), (57.8653997,12.427637) • User location • Textual location extraction: Named Entity Recognition • Textual location extraction: Regular Expressions
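The ordering of location sources can be sketched as a simple fallback chain. This is an illustration under assumptions: the `tweet` dict and its keys (`gps`, `place_bbox`, `user_location`, `text_location`) are hypothetical and are not Twitter API field names, and reducing a 'place' bounding box to its center is one possible choice, not necessarily the team's.

```python
def bbox_center(corners):
    """Average the corner coordinates of a 'place' bounding box."""
    lats = [lat for lat, _ in corners]
    lons = [lon for _, lon in corners]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def pick_location(tweet):
    """Return (source, value) from the most trusted source that is present.

    `tweet` is a hypothetical dict; its keys are illustrative only.
    """
    if tweet.get("gps"):
        return ("gps", tweet["gps"])
    if tweet.get("place_bbox"):
        return ("place", bbox_center(tweet["place_bbox"]))
    if tweet.get("user_location"):
        return ("user", tweet["user_location"])
    if tweet.get("text_location"):
        return ("text", tweet["text_location"])
    return (None, None)


# The bounding box shown on the slide.
SLIDE_BBOX = [
    (57.6190897, 12.427637),
    (57.6190897, 12.7635394),
    (57.8653997, 12.7635394),
    (57.8653997, 12.427637),
]
```

For the slide's bounding box, `bbox_center(SLIDE_BBOX)` lands roughly at latitude 57.742, longitude 12.596.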

  35. Temporal/Spatial Information • Location information is structured hierarchically based on reliability • Use Named Entity Recognition • Succeeds on: "I just witnessed a robbery in Champaign" • Fails on: "Breaking and entering at 128 Maple St." • Use regular expressions to recognize common formatting of addresses, highways, etc. • Time is based on tweet time
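A minimal sketch of the regular-expression fallback for street addresses that Named Entity Recognition misses. The suffix list here is a short hypothetical subset chosen for illustration, and `extract_address` is an illustrative name. The street-name words must be capitalized, while the suffix is matched case-insensitively.

```python
import re

# Short, hypothetical subset of street suffixes.
SUFFIXES = ["ST", "STREET", "AVE", "AVENUE", "BLVD", "RD", "ROAD",
            "DR", "DRIVE", "LN", "HWY"]

# House number, one or more capitalized words, then a suffix.
# (?i:...) makes only the suffix part case-insensitive.
ADDRESS_RE = re.compile(
    r"\b\d+ (?:[A-Z][A-Za-z]* )+(?i:" + "|".join(SUFFIXES) + r")\b"
)


def extract_address(text):
    """Return the first address-like span in the text, or None."""
    match = ADDRESS_RE.search(text)
    return match.group(0) if match else None
```

On the slide's two examples, this pattern picks up "128 Maple St" from the tweet NER fails on and finds nothing in the "Champaign" tweet, which is left to NER.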

  36. Regex Example "[0-9]+ ([A-Z][A-Za-z]* )+ (ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|

  37. Location Disambiguation • Look up extracted locations in a city-to-GPS lookup table • Many American city names are repeated (Atlanta, IL vs Atlanta, GA) • Check for well-formatted locations (city, state) • If not well formatted, resolve by selecting the matched city with the largest population • Give preference to other location sources (like user location and GPS) when there are multiple matches
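The disambiguation steps can be sketched as follows. This is a minimal illustration, assuming a small in-memory table: `CITY_TABLE`, `resolve_city`, and the `hint_state` parameter are hypothetical names, and the coordinates and populations are rough illustrative values rather than authoritative data.

```python
# Hypothetical lookup table: (city, state) -> (latitude, longitude, population).
# Values are rough and for illustration only.
CITY_TABLE = {
    ("atlanta", "GA"): (33.75, -84.39, 420000),
    ("atlanta", "IL"): (40.26, -89.23, 1700),
    ("champaign", "IL"): (40.12, -88.24, 81000),
}


def resolve_city(name, state=None, hint_state=None):
    """Resolve a city name to a (city, state) key, or None if unknown."""
    city = name.strip().lower()
    candidates = [key for key in CITY_TABLE if key[0] == city]
    if not candidates:
        return None
    # 1. A well-formatted "city, state" mention wins outright.
    if state and (city, state.upper()) in CITY_TABLE:
        return (city, state.upper())
    # 2. Otherwise prefer a state suggested by another source
    #    (for example the user's profile location or GPS).
    if hint_state:
        hinted = [key for key in candidates if key[1] == hint_state.upper()]
        if hinted:
            return hinted[0]
    # 3. Fall back to the most populous match.
    return max(candidates, key=lambda key: CITY_TABLE[key][2])
```

With this table, a bare "Atlanta" resolves to Georgia by population, while the same name from a user whose profile points to Illinois resolves to Atlanta, IL.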

  38. Categorization • We would like categories with finer granularity than crime or not crime • Based on keyword partitions corresponding to categories, e.g.: • Robbery/Theft: {robbed, robbery, burglar, theft, ...} • Natural Disaster: {tornado, typhoon, earthquake, ...} • Keyword-based crawling guarantees the presence of words that convey meaningful category information
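A minimal sketch of keyword-partition categorization: each category owns a set of keywords, and a tweet is assigned to the category with the most hits. The two partitions are the slide's examples, and picking the category by hit count is an assumption made for this sketch.

```python
import re

# Keyword partitions seeded with the slide's examples only.
CATEGORY_KEYWORDS = {
    "Robbery/Theft": {"robbed", "robbery", "burglar", "theft"},
    "Natural Disaster": {"tornado", "typhoon", "earthquake"},
}


def categorize(text):
    """Return the category with the most keyword hits, or None if none match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = {
        category: sum(tok in keywords for tok in tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None
```

Because the crawler only collects tweets that already contain crime keywords, most crawled tweets should hit at least one partition.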

  39. Ranking: Scoring and ordering tweets based on importance. Presented by Ravi Khadiwala

  40. Ranking • We only want to display best "n" tweets • Nature of twitter may result in an extremely variable amount of data • Serves as another way to filter non-crime tweets • May be able to highlight important events • Summarize the most important data points • Avoid overwhelming the user with results

  41. Learning to Rank • Goal: learn a function f: X -> r, where X is a vector of features and r is an importance score • Strategy: take a pointwise approach and use a sample of manually scored data to find the curve that fits our labeled data • We use linear regression with the simple least squares method to find weights such that r = w1x1 + w2x2 + w3x3 + ... + wnxn
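A minimal sketch of the pointwise least squares fit, written in plain Python by solving the normal equations (X^T X) w = X^T r. It assumes the feature columns are linearly independent and, like the slide's formula, uses no intercept term. The function names and the toy data are illustrative only.

```python
def solve(a, b):
    """Solve the square system a*w = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(xs, rs):
    """Return weights minimizing the sum of squared errors (r - w.x)^2."""
    n = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]
    xtr = [sum(x[i] * r for x, r in zip(xs, rs)) for i in range(n)]
    return solve(xtx, xtr)


def score(weights, x):
    """Importance score r = w1*x1 + ... + wn*xn."""
    return sum(w * xi for w, xi in zip(weights, x))


# Toy data generated from r = 2*x1 + 3*x2 (made up for illustration).
toy_xs = [[1, 0], [0, 1], [1, 1], [2, 1]]
toy_rs = [2, 3, 5, 7]
toy_weights = fit_least_squares(toy_xs, toy_rs)
```

On the toy data the fit recovers weights close to 2 and 3, since the labels were generated exactly from that linear function.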

  42. Determine Ranking Features • Selected from a large pool of potential features • Social • Number of hashtags,urls,@ (indicates a reply), retweet count • Contextual • Tweet length, category, mentioned locations • User Credibility • Age of user account, friends, followers, status count, verification • Classifier Confidence

  43. Ranking Features and Weights • Labeled ~500 tweets with a ranking (integer from 1 to 5) • Linear regression on all features (normalized) • Examined correlation coefficients • Examined weights • Pruned features • Repeated until we had an adequate feature set with logical weights

  44. Ranking Features and Weights (feature: weight) • category: -0.996904004778 • account age: 2.87974471144 • favorites: 1.71671010105 • status count: 1.17242993534 • followers: 2.67005302808 • confidence: -3.97882564778
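A minimal sketch of how learned weights like these could be used to keep only the best "n" tweets. The weights are copied from the slide; how each feature was encoded and normalized is not stated, so the feature values, the dict layout, and the function names below are assumptions.

```python
import heapq

# Weights as listed on the slide; feature values are assumed already normalized.
WEIGHTS = {
    "category": -0.996904004778,
    "account age": 2.87974471144,
    "favorites": 1.71671010105,
    "status count": 1.17242993534,
    "followers": 2.67005302808,
    "confidence": -3.97882564778,
}


def importance(features):
    """Weighted sum of features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def top_n(tweets, n):
    """Return the n (tweet_id, features) pairs with the highest importance."""
    return heapq.nlargest(n, tweets, key=lambda item: importance(item[1]))
```

`heapq.nlargest` avoids sorting the full list, which helps when the number of crawled tweets varies widely from hour to hour.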

  45. Clustering: Geographical location as the determinant for grouping tweets together. Presented by Ronald Doku

  46. Clustering tweets • Clustering tweets means grouping overlapping tweets found in the same location into one group.

  47. Why is tweet clustering important? • Clustered tweets inform the user about where most events are happening at a particular time. • The sizes of the clusters also convey how relevant or important the tweets are. • E.g. a user may want to find out how far a wildfire is spreading or has spread. Clustered tweets about the wildfire on the map show the user where the fire is or has spread to.

  48. Clustered tweets: high level overview

  49. Clustered tweets: after click (California)

  50. How do we cluster tweets? By defining the zoom levels at which each tweet should appear, we cluster the tweets to reduce the number shown at a time. We call this hierarchical clustering.
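One way to picture zoom-dependent clustering is a grid whose cells shrink as the user zooms in, so nearby tweets merge at low zoom and separate at high zoom. This is a sketch under assumptions and not the team's implementation: the cell-size rule and the function names are hypothetical.

```python
from collections import defaultdict


def cell_size_degrees(zoom):
    """Hypothetical rule: grid cells halve in size with every zoom level."""
    return 360.0 / (2 ** zoom)


def cluster_by_zoom(points, zoom):
    """Group (lat, lon) points by grid cell; return {cell: [points]}."""
    size = cell_size_degrees(zoom)
    clusters = defaultdict(list)
    for lat, lon in points:
        # Shift to non-negative coordinates before bucketing.
        cell = (int((lat + 90.0) // size), int((lon + 180.0) // size))
        clusters[cell].append((lat, lon))
    return dict(clusters)


# Two made-up points near Los Angeles and one near New York.
sample_points = [(34.05, -118.24), (34.10, -118.30), (40.71, -74.00)]
```

Because each cell at one zoom level splits into four cells at the next, the clusters nest, which is what makes the scheme hierarchical: the sample points form one cluster at zoom 0, two at zoom 2, and three at zoom 14.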
