Evaluation, Prediction, and Visualization of Spatio-Temporal Crime Patterns in Washington D.C. Area Date: 2012-12-10 Armin Ashoury Rad, Youngjib Ham, and Yuhyun Song
Introduction • Data mining is the intersection of statistics and computer science, used to explore huge data sets • Using a crime dataset for the Washington D.C. area, we provide users with useful information such as the safety factor of a location
Problem Statements and Objective of the Project • Many crimes in the area near Washington D.C. are related to Larceny, Larceny auto, and Larceny F/auto. Focusing on larceny incidents, we offer a safety factor and visual information about crime history. • Predict the likely type of crime and find associations between types of crime • Objective: share crime information with users about the spatial distribution of the different crimes that occurred in the Washington D.C. area
Outline • Classification rules on crime data in Washington D.C. • Detection of association rules between the types of crime • Supervised and unsupervised spatio-temporal clustering on stolen cars
Classification Rules: General Idea • We can predict the type of crime using the built classifier. • Use 70% of the data as a training set and the remaining 30% as a test set to find the best classifier, i.e., the one with the lowest misclassification rate. • Fit three different classification models using crime data in Washington D.C. from 2006 to 2009 • Used variables: • Class variable: type of crime (Larceny, Larceny F/auto, Larceny auto; other types excluded) • Predictors: Latitude, Longitude, Month, Date • Random Forest, QDA, and KNN implemented
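The 70/30 split described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline; the record layout and coordinate values are hypothetical:

```python
import random

def train_test_split(records, train_frac=0.7, seed=42):
    """Shuffle the records and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Each record: (latitude, longitude, month, date, crime_type) -- made-up values
crimes = [(38.90 + i * 0.001, -77.03, (i % 12) + 1, (i % 28) + 1, "LARCENY")
          for i in range(10)]
train, test = train_test_split(crimes)
print(len(train), len(test))  # 7 3
```

Fixing the seed makes the split reproducible, so different classifiers can be compared on the same test set.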
Classifiers: Random Forest, KNN, and QDA • Random Forest • Main idea: grow an ensemble of decision trees that vote for the most popular class • K-Nearest Neighbor (KNN) • Main idea: classify objects based on the closest training examples in the feature space; k-NN is a type of instance-based learning. • An object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors • QDA • Main idea: Quadratic discriminant analysis (QDA) is closely related to linear discriminant analysis (LDA); both assume that the measurements from each class are normally distributed. Unlike LDA, however, QDA does not assume that the covariance of each class is identical.
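The KNN majority-vote idea can be sketched in a few lines. This is an illustrative toy, not the authors' R implementation; the points and labels are invented:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of ((lat, lon), label) pairs; distance is Euclidean."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two spatial groups with different crime types (hypothetical)
train = [((38.90, -77.03), "LARCENY"), ((38.91, -77.03), "LARCENY"),
         ((38.95, -77.10), "ROBBERY"), ((38.96, -77.11), "ROBBERY")]
print(knn_predict(train, (38.905, -77.03), k=3))  # LARCENY
```

The query point sits between the two LARCENY points, so two of its three nearest neighbors vote LARCENY.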
Confusion matrix from random forest • The total misclassification rate when using Longitude, Latitude, Month, and Date is 57.05% • Use the random forest algorithm to decide which variables are important for classifying the type of crime.
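The total misclassification rate is one minus the diagonal of the confusion matrix divided by its grand total. A minimal sketch, with hypothetical counts (not the deck's actual matrix):

```python
def misclassification_rate(confusion):
    """Overall misclassification rate from a confusion matrix given as
    confusion[true_class][predicted_class] = count."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c].get(c, 0) for c in confusion)
    return 1 - correct / total

# Hypothetical counts for the three larceny categories
cm = {"LARCENY":        {"LARCENY": 50, "LARCENY F/AUTO": 30, "LARCENY AUTO": 20},
      "LARCENY F/AUTO": {"LARCENY": 25, "LARCENY F/AUTO": 40, "LARCENY AUTO": 35},
      "LARCENY AUTO":   {"LARCENY": 30, "LARCENY F/AUTO": 30, "LARCENY AUTO": 40}}
print(round(misclassification_rate(cm), 4))  # 0.5667
```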
Variable Selection in Random Forest • To group the types of crime, location variables are the key variables • Location attributes (Longitude, Latitude) are more important than time variables
Comparison: KNN vs. QDA • (Plots: KNN on the training set vs. QDA on the training set) • KNN performs better than QDA
KNN’s apparent misclassification rate using test data • Apparent misclassification rate: 0.659168
From Classification Rules… • From random forest, the geographical information about a crime is more important than its time attributes. • KNN showed better performance than random forest and QDA for classifying the type of crime; however, it produced a large misclassification rate on the test data when predicting the type of crime
Association rules analysis • Different sets of crimes happen each day • When crime X occurs, crime Y also occurs on the same day • Police can use this information to: • Understand why some kinds of crime happen simultaneously • Gain insight into crime patterns: which crimes happen together • Take action: predict specific kinds of crimes, and proactively take steps against the crimes with high probability
Use of Association Rules • Definitions • An item: a kind of crime • A transaction: the set of crimes that happen in one day • Data • Area: Washington D.C. area • Duration: 2006 ~ 2009 (until Sep 27), 1366 days • Crime data
Use of Association Rules • Data trends
Mining association rules • Brute-force approach: • List all possible association rules • Prune the rules based on the minsup and minconf thresholds • Computationally expensive • Reducing the number of candidates: • The Apriori algorithm, probably the best-known algorithm • Find all itemsets that have minimum support • Use frequent itemsets to generate rules • Compute k-itemsets by merging (k−1)-itemsets
Mining association rules • Apriori principle • Frequent itemset generation • Confidence(A → B) ≥ minConf • Support(A → B) ≥ minSup • The property: • If {A, B} is a frequent itemset, then {A} and {B} must be frequent itemsets as well • In general: if X is a frequent k-itemset, then all (k−1)-item subsets of X are also frequent • Modified the ‘ARMADA’ data mining tool from MATLAB Central (http://www.mathworks.com/matlabcentral/fileexchange/3016)
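The level-wise generation with subset pruning can be sketched in a few lines. This is a toy illustration, not the modified ARMADA tool; the daily transactions below are invented:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all itemsets whose support (fraction of transactions containing
    them) is at least min_sup, generated level by level as in Apriori."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent, k = {}, 1
    while level:
        level = [s for s in level if support(s) >= min_sup]
        frequent.update({s: support(s) for s in level})
        # Merge k-itemsets into (k+1)-candidates; prune any candidate with an
        # infrequent k-subset (the Apriori principle).
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))]
    return frequent

# Each set is one day's crimes (made-up data)
days = [{"LARCENY", "ROBBERY"}, {"LARCENY", "ROBBERY", "BURGLARY"},
        {"LARCENY"}, {"ROBBERY", "BURGLARY"}]
freq = apriori(days, min_sup=0.5)
print(freq[frozenset({"LARCENY", "ROBBERY"})])  # 0.5
```

Rule confidence follows directly: conf(X → Y) = support(X ∪ Y) / support(X), e.g. freq[{ROBBERY, BURGLARY}] / freq[{ROBBERY}] = 0.5 / 0.75 ≈ 0.67.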
Association rules analysis • Types of association rules • Actionable rules: contain high-quality, actionable information • Trivial rules: information that is already well known and familiar • Inexplicable rules: have no explanation and do not suggest action • Trivial and inexplicable rules occur most often
Association rules analysis • Example rules • {ASSAULT OFFENSES, BURGLARY, LARCENY} → {ROBBERY} (s=0.76, c=1.0) • {LARCENY AUTO} → {LARCENY F/AUTO} (s=0.89, c=1.0) • {ROBBERY} → {LARCENY AUTO} (s=0.87, c=0.89) • {LARCENY} → {BURGLARY} (s=0.79, c=0.91) • High support and high confidence: such obvious rules tend to be uninteresting
Association rules analysis • Example rules • {ASSAULT OFFENSES, BURGLARY, HOMICIDE OFFENSES} → {SEX OFFENSES ABUSE} (s=0.16, c=0.63) • {ARSON, ASSAULT OFFENSES, HOMICIDE OFFENSES} → {SEX OFFENSES ABUSE} (s=0.02, c=0.67) • {HOMICIDE OFFENSES, SEX OFFENSES ABUSE} → {ARSON} (s=0.02, c=0.12) • Simple interesting patterns: ARSON, ASSAULT OFFENSES, SEX OFFENSES ABUSE, and HOMICIDE OFFENSES are likely to happen together
Maps and Charts • Web app coded in ASP.NET • For maps, we used Google Fusion Tables, imported into the app using an iframe • All charts were prepared using the Google Charts tool
Safety factor • Cluster all data points using k-means with 10 clusters • Create random points a short distance from the actual data points • Re-cluster the random points using the previous centroids • k-means clustering package in R
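The clustering step above can be sketched with plain Lloyd's k-means. This is a minimal pure-Python sketch (the deck's actual runs used the k-means package in R), and the coordinates are hypothetical:

```python
import math

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; initial centroids are the first k points
    (a real run would use random restarts)."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest centroid...
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # ...then move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Two obvious spatial groups of incidents (made-up coordinates)
pts = [(38.90, -77.03), (38.91, -77.04), (38.90, -77.04),
       (38.96, -77.10), (38.97, -77.11), (38.96, -77.11)]
centroids, labels = kmeans(pts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The "re-cluster with previous centroids" step from the slide amounts to running only the assignment line on the new random points, keeping the centroids fixed.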
Framework • Google Maps • MarkerClusterer • Geocoder • 500 points • Google Street View • Google Earth
Prediction model and results • Spatio-temporal regression model • Predict the number of crimes in different categories one step ahead, using today's number of crimes in a region and its neighbors • Prediction using a Poisson regression model • glm function in R
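Once the Poisson GLM is fitted, the one-step-ahead prediction is the exponential of the linear predictor. A sketch with hypothetical coefficients (as a fit such as R's glm(..., family=poisson) would return; the values below are invented):

```python
import math

def predict_poisson(beta, x):
    """Expected count under a fitted Poisson regression: lambda = exp(beta . x).
    beta[0] is the intercept; x holds today's crime counts as predictors."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(eta)

# Hypothetical coefficients and today's counts: [own cell, neighbor average]
beta = [0.2, 0.10, 0.05]
today = [8, 5]
print(round(predict_poisson(beta, today), 2))  # 3.49 expected crimes tomorrow
```

The log link guarantees the predicted count is positive, which is why Poisson regression suits count data like daily crime totals.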
Prediction on the app • Google script • Graphical user interface
Organized Larcenies: 3D Clustering • Organized larceny: a person or group that steals more than one car of one specific model from one region in a short period of time • Normalize the latitude, longitude, and time of the crimes • 3D clustering using k-means with 20 clusters • k-means package in R
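Normalizing the three dimensions first keeps any one of them (e.g. time measured in days) from dominating the Euclidean distance that k-means uses. A min-max sketch with made-up theft records:

```python
def minmax_normalize(rows):
    """Scale each column of (lat, lon, time) rows to [0, 1] so that no single
    dimension dominates the Euclidean distance used by k-means."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [max(c) - min(c) or 1 for c in cols]  # guard against zero spread
    return [tuple((v - l) / s for v, l, s in zip(r, lo, span)) for r in rows]

# (latitude, longitude, day-of-study) for three hypothetical thefts
thefts = [(38.90, -77.03, 10), (38.92, -77.05, 12), (38.94, -77.07, 20)]
print(minmax_normalize(thefts)[0])  # (0.0, 1.0, 0.0)
```

After this step, the normalized triples can be fed to any k-means routine as ordinary 3-D points.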
2-Step Clustering • Two-stage clustering procedure • First on the location of the larcenies: 10 clusters using k-means • Then on the time of the larcenies: 10 clusters within each location cluster • 100 clusters in total
Conclusion • Summary • Extracted previously unknown, valid, comprehensible information from the crime dataset (2006 ~ 2009) for the Washington D.C. area • Evaluation and prediction of crime patterns: classification, association analysis, cluster analysis • Visualization of the results: web-based app
Conclusion • Contributions • Predicting crime patterns • Proactively taking measures against crimes with high probability • Helping to deploy limited police manpower effectively • Future work • Predicting other cities’ crime trends • Studying crime datasets with a wider range of variables: demographics, education, etc.