
Using Digg to Save Lives

Using Digg to Save Lives. Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir Khakpour, and John Felix. Outline. Explanation. Digg is a social web-media discovery tool based on user-submitted content. 1 or 2 submissions a minute


Presentation Transcript


  1. Using Digg to Save Lives Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir Khakpour, and John Felix

  2. Outline

  3. Explanation • Digg is a social web-media discovery tool based on user-submitted content. • 1 or 2 submissions a minute • Half-life of “interest” is about a day • Digg aggregates “interesting” content. • But how do we find interesting Events and know their Themes?

  4. Motivation • Collaborative nature of Social Media can scour the WWW very thoroughly. • But, this generates A LOT of data (you’ll see). • It would be cool to find emergencies, or critical situations based on this collaborative media. • Apple seems like a pretty good starting point.

  5. Approach

  6. Preprocessing • Digg API • REST API • http://services.digg.com/stories/topic/apple?count=10 • XML response • <?xml version="1.0" encoding="utf-8" ?><users timestamp="1176998598" total="1" offset="0" count="1"> <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" /></users> • Limitations • 100 results per request • 1 hour of time-series data • Requests can’t go too fast, or else.
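
A minimal Python sketch of the paging and parsing this slide implies, using only the standard library. The endpoint and the sample XML come from the slide; the `offset` query parameter is an assumption inferred from the `offset` attribute in the sample response, and the helper names `page_urls` and `parse_response` are hypothetical. No network call is made, since the service is assumed to be unavailable.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the slide; assumed to be unreachable today.
BASE_URL = "http://services.digg.com/stories/topic/apple"
PAGE_SIZE = 100  # per-request cap stated on the slide


def page_urls(total, page_size=PAGE_SIZE):
    """Return one URL per page needed to cover `total` results.

    The `offset` parameter is an assumption based on the `offset`
    attribute that appears in the sample response.
    """
    urls = []
    for offset in range(0, total, page_size):
        query = urlencode({"count": page_size, "offset": offset})
        urls.append(BASE_URL + "?" + query)
    return urls


def parse_response(xml_bytes):
    """Return (paging attributes of the root, list of child attribute dicts)."""
    root = ET.fromstring(xml_bytes)
    keys = ("timestamp", "total", "offset", "count")
    paging = {key: int(root.get(key)) for key in keys}
    return paging, [dict(child.attrib) for child in root]


# Sample response copied from the slide.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<users timestamp="1176998598" total="1" offset="0" count="1">
  <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png"
        registered="1135702996" profileviews="3104" />
</users>"""

paging, items = parse_response(SAMPLE)
print(paging["total"], items[0]["name"])
print(page_urls(250))
```

In a real crawl, a pause between requests (for example `time.sleep`) would be one way to respect the "can't go fast" limitation; the right delay is not given in the slides.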

  7. Preprocessing • Time Series • Each digg is the event (only 100 at a time) • Rows • Each story’s digg count • Columns • Every hour (2,207 of them from August 08 – November 08) • Clustering • Rows • Each story that was dugg at any point in the time series • Columns • The words in the title and description of this story
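
The time-series layout above (one row per story, one column per hour) can be sketched as follows. This assumes each digg arrives as a `(story_id, unix_timestamp)` pair and that a cell holds the number of diggs in that hour rather than a running total; the slides do not say which. `hourly_matrix` is a hypothetical helper name.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_matrix(diggs, start, n_hours):
    """Bin digg events into per-story hourly counts.

    diggs   : iterable of (story_id, unix_timestamp) pairs (assumed format)
    start   : unix timestamp of the first hour's left edge
    n_hours : number of hourly columns
    Returns (story_ids, rows), where rows[i][h] is the number of diggs
    story_ids[i] received during hour h.
    """
    counts = defaultdict(lambda: [0] * n_hours)
    for story_id, timestamp in diggs:
        column = (timestamp - start) // SECONDS_PER_HOUR
        if 0 <= column < n_hours:  # ignore events outside the window
            counts[story_id][column] += 1
    story_ids = sorted(counts)
    return story_ids, [counts[s] for s in story_ids]


events = [("a", 0), ("a", 10), ("a", 3600), ("b", 7300), ("b", 999999)]
story_ids, rows = hourly_matrix(events, start=0, n_hours=3)
print(story_ids, rows)
```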

  8. Preprocessing - Challenges • SLOW • Really Dirty Data • Different Formats of Data • REALLY SLOW

  9. Introduction to Document Clustering • Challenges of clustering text documents, unlike structured data: • Volume • Dimensionality • Sparsity • Complex semantics • In information retrieval and text mining, text data is represented in a common representation model, e.g. Vector Space Model (VSM) • Huge sparse matrix, so we store only the non-zero values • Text documents are converted to a matrix A of size m × n, for m documents and n total words (or phrases), where each element x_{i,j} is the frequency of the jth term in the ith document.
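
A small sketch of the Vector Space Model representation described above, storing only non-zero term frequencies. The tokenizer (lowercased alphanumeric runs) and the dict-per-row layout are assumptions made for illustration; the slides do not describe how the presenters tokenized or stored their matrix. `build_tf` is a hypothetical name.

```python
import re
from collections import Counter


def build_tf(docs):
    """Build a sparse term-frequency matrix.

    Returns (vocab, rows): vocab maps term -> column index j, and
    rows[i] is a dict {j: frequency} holding only the non-zero x_{i,j}.
    """
    vocab = {}
    rows = []
    for text in docs:
        # Assumed tokenizer: lowercase runs of letters and digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        row = {}
        for term, freq in counts.items():
            column = vocab.setdefault(term, len(vocab))
            row[column] = freq
        rows.append(row)
    return vocab, rows


docs = ["Apple releases new iPhone", "New Apple MacBook, new price"]
vocab, rows = build_tf(docs)
nonzero = sum(len(row) for row in rows)
density = nonzero / (len(rows) * len(vocab))
print(len(vocab), nonzero, density)
```

Each story's title and description would be concatenated into one entry of `docs`; with tens of thousands of terms, almost every cell is zero, which is why only the non-zero entries are kept.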

  10. Clustering • Dataset • Number of stories (m) : 25470 • Total number of unique words (n): 55557 • Nonzero values: 469323 (0.03214%) • Clustering using CLUTO software • Using K-means, bisecting K-means • Calculating centroids and SSE • A C++ program is run on “black”
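
To make the terms on this slide concrete, here is a toy Python sketch of K-means, bisecting K-means, and SSE on small dense vectors with squared Euclidean distance. It is not CLUTO and not the presenters' C++ program; the function names, the random initialization, and the choice to split the cluster with the largest SSE are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(sq_dist(p, center) for p in points)
    return total


def kmeans(points, k, rng, iters=50):
    """Lloyd's algorithm from k randomly chosen points; returns non-empty clusters."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return [g for g in groups if g]


def bisecting_kmeans(points, k, rng):
    """Repeatedly split the cluster with the largest SSE using 2-means."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: sse([clusters[i]]))
        target = clusters.pop(worst)
        halves = kmeans(target, 2, rng) if len(target) >= 2 else [target]
        clusters.extend(halves)
        if len(halves) < 2:  # nothing left to split
            break
    return clusters


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5, 20], [6, 20]]
rng = random.Random(0)
flat = kmeans(data, 3, rng)
bisected = bisecting_kmeans(data, 3, rng)
print(sse([data]), sse(flat), sse(bisected))
```

Comparing `sse(flat)` with `sse(bisected)` mirrors the K-means versus bisecting K-means comparison named in the Experiments slide, though on real text data a cosine-based criterion would normally replace Euclidean distance.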

  11. Document Clustering by Optimizing Criterion Functions • According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it to find the clusters: • Internal Criterion Functions (I) • Maximizing the internal similarity function: • External Criterion Functions (E) • Minimizing the external similarity function: • Hybrid Criterion Functions (H) • Maximizing
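
For reference, the criterion functions in Zhao and Karypis's work on document clustering are commonly written as below. Which variants the presenters used is not stated, so treat this as an assumption. Here documents are unit-length vectors, S_r is the r-th of k clusters with n_r documents, D_r is the sum of the vectors in S_r, D is the sum of all document vectors, and C_r and C are the cluster and collection centroids.

```latex
\begin{align*}
\mathcal{I}_1 &: \ \text{maximize } \sum_{r=1}^{k} \frac{1}{n_r}
      \sum_{d_i, d_j \in S_r} \cos(d_i, d_j)
      = \sum_{r=1}^{k} \frac{\lVert D_r \rVert^2}{n_r} \\
\mathcal{I}_2 &: \ \text{maximize } \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r)
      = \sum_{r=1}^{k} \lVert D_r \rVert \\
\mathcal{E}_1 &: \ \text{minimize } \sum_{r=1}^{k} n_r \cos(C_r, C)
      \;\Longleftrightarrow\; \text{minimize } \sum_{r=1}^{k} n_r \,
      \frac{D_r^{\top} D}{\lVert D_r \rVert} \\
\mathcal{H}_1 &: \ \text{maximize } \frac{\mathcal{I}_1}{\mathcal{E}_1}
      \qquad
\mathcal{H}_2 : \ \text{maximize } \frac{\mathcal{I}_2}{\mathcal{E}_1}
\end{align*}
```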

  12. Experiments • SSE for I (K-Means vs Bisecting K-Means)

  13. Visualization • What we used • jQuery • JavaScript library for DOM manipulation, event handling, and Ajax requests • PHP/MySQL • Scripting language and database backend • Google Visualization API • Time Series Graph • Zoomable • Timepedia Chronoscope • Clickable

  14. Conclusions • Success? • Of course we think so • Future Work • Save lives? • Better clustering • Cleaner data • More data • Make it scalable, and dynamic • On-line and on the fly?
