National Technical University of Athens

Clustering Documents using the 3-Gram Graph Representation Model. SUPER Social sensors for secUrity Assessments and Proactive EmeRgencies management. National Technical University of Athens.


Presentation Transcript


  1. Clustering Documents using the 3-Gram Graph Representation Model. SUPER: Social sensors for secUrity Assessments and Proactive EmeRgencies management. National Technical University of Athens. John Violos, Konstantinos Tserpes, Athanasios Papaoikonomou, Magdalini Kardara, Theodora Varvarigou. PCI 2014, 18th Panhellenic Conference in Informatics, 3 / 10 / 2014

  2. SUPER

  3. SUPER
     • Hurricane Sandy, 2012: 20 million tweets; 10 pictures per second on Instagram.
     • Virginia, U.S., 2011 (5.8 Richter): 40,000 tweets in the first minute.

  4. Topic Communities: Detect topic communities in social networks, using the texts of users, the social graph, and user actions (likes, follows).

  5. Text Clustering: Users write texts about their interests, their habits, and events in their lives. Clustering the texts into topics therefore clusters their writers into topic communities.

  6. LDA (Latent Dirichlet Allocation): What is its weakness? It is a bag-of-words model, so it ignores the order of the words.

  7. Sequence of Words: The sequence of words is valuable information. Furthermore, derivatives of a word are similar words. We need a representation model that keeps the information of the word sequence and captures the similarity between derivatives of words. A good solution is the N-gram graph.

  8. Overview: Basic Steps
     • Input: a corpus of texts and the number of clusters k.
     • Build an N-gram graph that represents the corpus.
     • Build an N-gram graph that represents each text.
     • Partition the corpus graph into k subgraphs.
     • Compare each text graph with every partition.
     • Allocate each text to the cluster with the highest comparison result.
     • Output: k clusters, each with the texts it includes.
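
The last two steps (comparison and allocation) can be sketched as follows. This is a minimal illustration rather than the authors' Java implementation: graphs are reduced to plain sets of edges, and `edge_overlap` is a hypothetical placeholder for whichever graph similarity is actually used.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]


def edge_overlap(text_edges: Set[Edge], part_edges: Set[Edge]) -> float:
    """Placeholder similarity (an assumption): share of the text's edges found in the partition."""
    if not text_edges:
        return 0.0
    return len(text_edges & part_edges) / len(text_edges)


def allocate(texts: Dict[str, Set[Edge]],
             partitions: List[Set[Edge]],
             similarity: Callable[[Set[Edge], Set[Edge]], float] = edge_overlap) -> Dict[str, int]:
    """Assign every text to the index of the partition it is most similar to."""
    return {
        name: max(range(len(partitions)), key=lambda c: similarity(edges, partitions[c]))
        for name, edges in texts.items()
    }


# Two toy partitions and two toy text graphs.
partitions = [{("hom", "ome"), ("me_", "ome")}, {("hon", "pho"), ("hon", "one")}]
texts = {
    "t1": {("hom", "ome")},
    "t2": {("hon", "pho"), ("hon", "one"), ("hom", "ome")},
}
clusters = allocate(texts, partitions)
```

Here `t1` shares its only edge with the first partition, while `t2` shares two of its three edges with the second, so they end up in different clusters.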

  9. N-Grams (1): What is an N-gram? An N-gram is a contiguous sequence of N items from a given sequence of text. The items can be phonemes, syllables, letters, or words. In our research we use letters and N = 3. Example: “home_phone” yields “hom”, “ome”, “me_”, “e_p”, “_ph”, “pho”, “hon”, “one”.
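
As a minimal sketch, letter 3-grams can be extracted with a sliding window over the string. The function name `char_ngrams` is illustrative and does not come from the presentation.

```python
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return all contiguous character n-grams of `text`, in order of appearance."""
    # A text shorter than n has no n-grams, so the range is empty in that case.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("home_phone"))
# ['hom', 'ome', 'me_', 'e_p', '_ph', 'pho', 'hon', 'one']
```

A string of length L produces L - N + 1 N-grams, which is why the 10-character example yields 8 of them.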

  10. N-Grams (2): N-grams are used in many applications: approximate string matching; finding likely candidates for the correct spelling of a misspelled word; language identification; species identification from a small sequence of DNA.

  11. N-Gram Graph: The nodes are all the N-grams of a text. Edges join only neighboring N-grams; a threshold defines how many neighbors each N-gram is joined to. Edges can be weighted or unweighted, and directed or undirected.

  12. Example of a 3-Gram Graph: the 3-gram graph of “home_phone”. In this example the graph is undirected and weighted, each node is a 3-gram, and the threshold of neighbor nodes is 3.
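
One way to build such a graph is sketched below. It assumes that the threshold means "join each 3-gram to the next 3 3-grams in the text" and that an edge weight counts how often the two 3-grams occur within that window; the slides do not spell out either detail, and `ngram_graph` is an illustrative name.

```python
from collections import Counter
from typing import Dict, Tuple

Edge = Tuple[str, str]


def ngram_graph(text: str, n: int = 3, window: int = 3) -> Dict[Edge, int]:
    """Build an undirected, weighted n-gram graph as a mapping edge -> weight."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges: Counter = Counter()
    for i, left in enumerate(grams):
        # Join this n-gram to the next `window` n-grams (assumed meaning of the threshold).
        for right in grams[i + 1:i + 1 + window]:
            # Sorting the pair makes (a, b) and (b, a) the same undirected edge.
            edges[tuple(sorted((left, right)))] += 1
    return dict(edges)


graph = ngram_graph("home_phone")
print(len(graph))  # 18 edges over 8 nodes
```

Because every 3-gram of “home_phone” is distinct, all 18 edges have weight 1 here; repeated 3-grams in a longer text would raise the weights.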

  13. Graph Comparison: How do we compare two N-gram graphs? With Containment Similarity (unweighted) or Value Similarity (weighted).
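
The slide gives no formulas, so the sketch below follows the definitions commonly used for N-gram graphs and should be read as an assumption: containment similarity is the number of shared edges divided by the size of the smaller graph, and value similarity sums, over shared edges, the ratio of the smaller to the larger weight and divides by the size of the larger graph. Graphs are assumed to be dictionaries mapping an edge to a positive weight.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]


def containment_similarity(g1: Graph, g2: Graph) -> float:
    """Share of common edges, ignoring weights (assumed definition)."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1: Graph, g2: Graph) -> float:
    """Like containment, but each shared edge counts by how close its two weights are."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(w, g2[edge]) / max(w, g2[edge])
                for edge, w in g1.items() if edge in g2)
    return total / max(len(g1), len(g2))


a = {("ab", "bc"): 2, ("bc", "cd"): 1}
b = {("ab", "bc"): 1, ("xx", "yy"): 1, ("yy", "zz"): 3}
cs = containment_similarity(a, b)  # 1 shared edge / min(2, 3)
vs = value_similarity(a, b)        # (1 / 2) / max(2, 3)
```

Both measures equal 1.0 for identical graphs and 0.0 for graphs with no common edge.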

  14. Graph Partitioning: Split the graph into k subgraphs with the minimum number of edges between them. A graph partition can represent a topic. There are many graph partitioning algorithms:
      • Kernighan–Lin algorithm
      • Edge betweenness centrality
      • Fast Kernel-based Multilevel Algorithm for Graph Clustering

  15. Fast Kernel-based Multilevel Algorithm: Start from a random initial partitioning. For each node i, compute the cost of assigning node i to each cluster, and assign node i to the cluster with the minimum cost. Iterate until no node changes cluster.
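
The loop described above matches the shape of kernel k-means, so a flat (single-level) version is sketched below. Several things here are assumptions not stated in the slides: the cost is the kernel-space distance to the cluster centroid, the kernel is K = sigma * I + A with A the weighted adjacency matrix, and the multilevel coarsening and refinement phases are left out entirely. The iteration cap `max_iter` is a safeguard, since convergence is only guaranteed when the kernel is positive semidefinite.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def kernel_kmeans_partition(graph: Dict[Edge, float], k: int, sigma: float = 1.0,
                            seed: int = 0, max_iter: int = 100) -> Dict[str, int]:
    """Partition the nodes of an undirected weighted graph into at most k clusters."""
    nodes = sorted({node for edge in graph for node in edge})
    # Assumed kernel: K = sigma * I + A, stored sparsely as nested dictionaries.
    kernel: Dict[str, Dict[str, float]] = {v: {v: sigma} for v in nodes}
    for (a, b), weight in graph.items():
        if a != b:  # self-loops are ignored in this sketch
            kernel[a][b] = kernel[a].get(b, 0.0) + weight
            kernel[b][a] = kernel[b].get(a, 0.0) + weight

    rng = random.Random(seed)
    label = {v: rng.randrange(k) for v in nodes}  # random initial partitioning

    for _ in range(max_iter):
        members: List[List[str]] = [[v for v in nodes if label[v] == c] for c in range(k)]
        # Sum of kernel entries inside each cluster.
        within = [sum(kernel[u].get(v, 0.0) for u in m for v in m) for m in members]

        def cost(v: str, c: int) -> float:
            # Squared kernel-space distance from node v to the centroid of cluster c.
            m = members[c]
            link = sum(kernel[v].get(u, 0.0) for u in m)
            return kernel[v][v] - 2.0 * link / len(m) + within[c] / len(m) ** 2

        new_label = {}
        for v in nodes:
            # Start from the current cluster so that ties do not move the node.
            best, best_cost = label[v], cost(v, label[v])
            for c in range(k):
                if members[c] and cost(v, c) < best_cost - 1e-12:
                    best, best_cost = c, cost(v, c)
            new_label[v] = best
        if new_label == label:  # no node changed cluster
            break
        label = new_label
    return label


# Two triangles joined by one weak edge.
toy = {("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
       ("x", "y"): 5, ("x", "z"): 5, ("y", "z"): 5, ("c", "x"): 1}
labels = kernel_kmeans_partition(toy, k=2, seed=1)
```

The outcome depends on the random start, so different seeds can give different partitions, and a cluster that becomes empty stays empty in this simplified version.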

  16. Experimental Results: Reuters-21578, the most widely used test collection for text categorization research. Data set: 18457 documents belonging to 428 labels. Multi-label documents: each document belongs to between 0 and 29 labels. The complete method was implemented in Java SE.

  17. Experimental Results: The 3-gram graph method recognizes the clusters that include many documents. LDA produces small clusters that look like broken parts of the gold-standard clusters.

  18. Precision & Recall

  19. Advantages of 3-Gram Graph Clustering
      • The method can capture the sequence of words.
      • Derivatives of a word are not handled as different words.
      • Big clusters can be recognized and more documents can be assigned to them.
      • It supports partial document matching and soft membership.
      • It can capture the writing characteristics of a writer.

  20. General Notes: Our method is not case sensitive. Punctuation and numbers are omitted. The partitions depend on the random initial clustering. The maximum number of nodes in a 3-gram graph is 19683.
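
The node limit is consistent with an alphabet of 27 symbols, since 27^3 = 19683. The sketch below assumes those symbols are the 26 lowercase Latin letters plus a single space; the slides do not list the alphabet, and collapsing runs of whitespace is likewise an assumption. The name `normalize` is illustrative.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only the letters a-z and single spaces."""
    text = text.lower()                    # the method is not case sensitive
    text = re.sub(r"[^a-z\s]", "", text)   # drop punctuation, digits, and other symbols
    return re.sub(r"\s+", " ", text).strip()


ALPHABET_SIZE = 26 + 1          # assumed: 26 letters plus the space
MAX_NODES = ALPHABET_SIZE ** 3  # number of distinct 3-grams over that alphabet

print(normalize("Hello, World 2014!"))  # hello world
print(MAX_NODES)                        # 19683
```

Under this assumption the graph of any text, however long, can never have more than 19683 nodes, which bounds the memory needed for the corpus graph.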

  21. Future Work
      • Experiment with 4-grams, 5-grams, and 6-grams.
      • Experiment with various threshold sizes.
      • Experiment with various graph similarity functions.
      • Experiment with various graph partitioning algorithms.
      • Remove stop words.
      • Filter out edges that do not provide useful information.

  22. SUPER: Thank you for your attention!
