Geo-spatial Event Detection in the Twitter Stream

Geo-spatial Event Detection in the Twitter Stream. Michael Kaisser, AGT International. Berlin Buzzwords, June 3, 2013

Outline • Introduction & Context • Social Media Analysis in a C2 Center • The “Avalanche” event detection approach • Identify posting “hot spots” • Evaluate post clusters with Machine Learning approach • Evaluation • Future work

Background: Social Data • Social media continuously creates massive amounts of data • E.g. 500 million tweets each day: ~300 GB raw data • Nature of the data: • time-stamped • textual (many languages, lingos & slang, spelling mistakes are rife, only a few words per tweet) • links to pictures • links to newspaper articles (more text) • sometimes geo-spatial (contains coordinates) • Creating real actionable insights from this isn’t an easy problem • This talk gives one specific example of how this can be done

Use case: Urban Management & Public Safety • Cities today are complex and need to be organized • Administration is responsible for keeping the population safe • emergency services • health services • firefighters • police [Figure: Command & Control Center]

Urban Management & Public Safety • Why is Social Media relevant in this context?

Urban Management & Public Safety • Why is Social Media relevant in this context? “There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy”

Urban Management & Public Safety • Why is Social Media relevant in this context? “De tering, wat een hel!!! 1,4 miljoen mensen op dat terrein! #loveparade” (Dutch: “Damn, what hell!!! 1.4 million people on that site! #loveparade”)

Urban Management & Public Safety • Why is Social Media relevant in this context? “#Hoboken is on fire. Building above Hoboken Farm Corporation at 300 Washington is all smoked out” → Social media can help create a situational awareness picture

Context: Social Media in a C2 Center

Avalanche: Event detection in a C2 Center

How is it done? • Two-step approach: • Identify locations with high tweet activity • Collect geo-spatial tweet clusters • Evaluate clusters with a Machine Learning approach • Do these clusters constitute a real-world event that the tweeters are witnessing first-hand? • Work in progress: • Classify events according to type

Machine Learning – What is the task? [Map figure; each marker = a geo-located social media post (tweet)]

Machine Learning – What is the task? Good: • Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X • Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2 • Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura • Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf vs. Bad: • RT @refinery29: This image of Madeleine Albright playing the drums will be the best thing you'll see today: http://t.co/rGwQ5RdG • «@_PrettyPoison Guess ill fill out more job apps today» make punna fill out some 2! • The Glamour & Glitz at the 2012 Emmy's that we loved! http://t.co/CiTFszfL • @IszwanieSyahira: i'm happy and i hope u feel the same too. weeeee ~.~ • How to prepare yourself for Friday's apocalypse http://cnet.co/lPU We need to automatically determine which of the tweet clusters (tweets issued close to each other in a short time frame) represent real-world events and which are just random chatter.

Architecture • We look for geo-spatial clusters of tweets (e.g. 3 or more tweets in a 200m radius, posted within 30 mins) • These become “event candidates” • Event candidates are evaluated with a Machine Learning scheme. • We currently use C4.5 decision trees.
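
The candidate search described on this slide could be sketched as follows. This is a minimal illustration, not the Avalanche implementation: the `Tweet` record, the `event_candidates` function, and the brute-force seed-based grouping are all assumptions made for the example. Only the thresholds (3 tweets, 200 m, 30 minutes) come from the slide.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


@dataclass(frozen=True)
class Tweet:  # hypothetical record type, not from the talk
    user: str
    text: str
    lat: float
    lon: float
    ts: float  # posting time in seconds


def haversine_m(a: Tweet, b: Tweet) -> float:
    """Great-circle distance between two tweets in metres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def event_candidates(tweets, radius_m=200.0, window_s=30 * 60, min_tweets=3):
    """Group tweets around a seed tweet in space and time.

    Brute-force O(n^2) scan (an assumption of this sketch); a real
    stream would need a spatial index and a sliding time window.
    """
    ordered = sorted(tweets, key=lambda t: t.ts)
    candidates, used = [], set()
    for i, seed in enumerate(ordered):
        if i in used:
            continue
        members = [
            j for j in range(i, len(ordered))
            if j not in used
            and ordered[j].ts - seed.ts <= window_s
            and haversine_m(seed, ordered[j]) <= radius_m
        ]
        if len(members) >= min_tweets:
            used.update(members)
            candidates.append([ordered[j] for j in members])
    return candidates


# Three tweets near Grand Central within ten minutes, one unrelated tweet far away.
tweets = [
    Tweet("a", "Suspicious package", 40.7527, -73.9772, 0),
    Tweet("b", "trains are closed", 40.7530, -73.9770, 300),
    Tweet("c", "Something happened", 40.7525, -73.9775, 600),
    Tweet("d", "job apps today", 40.8000, -73.9500, 400),
]
clusters = event_candidates(tweets)
```

Here the three nearby tweets form one event candidate and the distant tweet is left out. The candidate would then be passed on to the classifier.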

Machine Learning - Features • Tweet cluster: • Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X • Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2 • Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura • Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf

Scalable Machine Learning … with Weka! [Diagram legend: blue = training, green = runtime] In offline ML, we train once, but use the predictive model possibly millions of times a day. It’s okay if training isn’t fast as lightning. But during execution every CPU cycle can count.

Scalable Machine Learning … with Weka! … which can be optimized further in various ways. See e.g. Nima Asadi, Jimmy Lin, Arjen P. de Vries. Runtime Optimizations for Tree-Based Machine Learning Models. IEEE Transactions on Knowledge and Data Engineering, 2013.

Machine Learning - Evaluation • Evaluation setup: • 1,000 hand-labeled tweet clusters. • 319 good, 681 bad. • 10-fold cross validation.
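
To make the evaluation setup concrete, the sketch below splits 1,000 labels (319 good, 681 bad) into 10 folds and computes the accuracy of always answering "bad". The slide does not say whether the folds were stratified; stratification and the `stratified_folds` helper are assumptions of this example.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split item indices into k folds with a roughly equal class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = ["good"] * 319 + ["bad"] * 681
folds = stratified_folds(labels)

# Accuracy of a trivial classifier that always answers "bad".
majority_baseline = labels.count("bad") / len(labels)
```

Each fold serves once as the test set while the other nine are used for training. The majority baseline of 68.1% is the accuracy any trained model has to beat on this data.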


Machine Learning - Evaluation [Scatter plot: Common Theme score (0 to 1) against Unique Posters score (0 to 1); blue: event, red: no event]

(Somewhat simplified) Summary • If there are several tweets … • from roughly the same location • at roughly the same time • from different users • that nevertheless use the same words • … chances are good that we have detected an event.
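
The "different users" and "same words" criteria could be turned into numeric features roughly as follows. The talk names a Unique Posters score and a Common Theme score but does not define them, so the formulas, the stopword list, and the function names below are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the talk).
STOPWORDS = {"the", "in", "on", "a", "an", "of", "to", "and", "is", "are", "not"}


def tokens(text):
    """Lowercased word set of a tweet, without URLs, mentions, or stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return {w for w in re.findall(r"[a-z0-9]+", text) if w not in STOPWORDS and len(w) > 2}


def unique_posters_score(users):
    """Share of tweets in the cluster that come from distinct users."""
    return len(set(users)) / len(users)


def common_theme_score(texts):
    """Share of tweets that contain the cluster's most widely used word."""
    counts = Counter(word for text in texts for word in tokens(text))
    if not counts:
        return 0.0
    return counts.most_common(1)[0][1] / len(texts)


good_cluster = [
    "Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X",
    "Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2",
    "Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura",
    "Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf",
]
theme = common_theme_score(good_cluster)  # 0.75: three of four tweets mention "nyc"
```

Under these assumed definitions, a cluster of unrelated chatter scores low on the common theme, and a cluster produced by a single account scores low on unique posters.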

Outlook – work in progress and future work • Derive more coordinates • from shared pictures • from toponyms in posts • use image sharing sites directly • Make use of posts without coordinates • and add them to already existing clusters • Explore real-time TF-IDF • to get rid of the Kardashians & Beliebers • Evaluate system with real-world data • Because recall numbers are currently somewhat misleading
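
One possible reading of "real-time TF-IDF" is to keep running document frequencies over the stream, so that words everybody is tweeting about receive a low weight. The sketch below follows that reading; the `StreamingIdf` class and the smoothed IDF formula are assumptions, not something the talk specifies.

```python
import math
from collections import Counter


class StreamingIdf:
    """Running document frequencies over a stream of tokenized posts."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()

    def observe(self, words):
        """Update the statistics with one post, given as a list of words."""
        self.doc_count += 1
        self.doc_freq.update(set(words))

    def idf(self, word):
        # Smoothed so that unseen words do not cause a division by zero.
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[word])) + 1.0

    def tf_idf(self, words):
        """Weight each word of a post by its frequency and its rarity so far."""
        tf = Counter(words)
        return {w: (n / len(words)) * self.idf(w) for w, n in tf.items()}


stream = StreamingIdf()
for post in (["bieber", "concert"], ["bieber", "love"], ["bieber", "fire"]):
    stream.observe(post)
weights = stream.tf_idf(["bieber", "fire"])
```

In this toy stream "bieber" appears in every post and ends up with a lower weight than "fire", which is the effect the slide is after.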

Machine Learning – Relevance Feedback (work in progress) [Diagram: Machine Learning Model; documents (e.g. tweets, post clusters) rated good or bad; users (journalists, C2 operators)] • Users implicitly rate documents by how they interact with them • User performs follow-up actions → relevant • User clicks document away → irrelevant • → System learns to present more relevant documents • → System can adapt to changing needs over time
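
The implicit rating rule on this slide can be sketched as a mapping from user interactions to training labels. The action names and both helper functions are hypothetical; the slide only describes the two behaviours (follow-up action, clicking a document away).

```python
from typing import Optional

# Hypothetical action names; the slides only describe the two behaviours.
RELEVANT_ACTIONS = {"follow_up"}
IRRELEVANT_ACTIONS = {"click_away"}


def implicit_label(action: str) -> Optional[str]:
    """Map a user interaction to a training label, or None if it says nothing."""
    if action in RELEVANT_ACTIONS:
        return "good"
    if action in IRRELEVANT_ACTIONS:
        return "bad"
    return None


def collect_feedback(interactions):
    """Turn (document_id, action) pairs into labeled examples for retraining."""
    labeled = []
    for doc_id, action in interactions:
        label = implicit_label(action)
        if label is not None:
            labeled.append((doc_id, label))
    return labeled


feedback = collect_feedback([
    ("cluster-1", "follow_up"),
    ("cluster-2", "click_away"),
    ("cluster-3", "scroll"),  # uninformative interaction, ignored
])
```

Retraining the model on such labels at regular intervals is one way the system could adapt to changing needs over time.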

Image Analysis of shared pictures (work in progress) • Example: explosion in an image [Figure: tweet “OMG!!! http://t.co/maiAgHoh”; explosion detected with image analysis] • Problem: • Not all tweets contain useful textual information • Shared text might be hard to analyze • Solution: • ~35% of tweets contain linked images • Images provide a wealth of information that can be analyzed • Objects, events, persons • coordinates

Thank you!
