
Development and Implementation of Classification and Clustering Methods for Unstructured Document Collections

This paper discusses the development and implementation of classification and clustering methods for unstructured document collections. It explains the methods used and provides details on the information retrieval system used for experiments.

caitlinj


Presentation Transcript


  1. Development and implementation of classification and clustering methods for unstructured document collections, by Osipova Nataly, St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology

  2. Contents • Introduction • Methods description • Information Retrieval System • Experiments

  3. Contextual Document Clustering was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

  4. Definitions • Document • Terms dictionary • Dictionary • Cluster • Word context • Context or document conditional probability distribution • Entropy

  5. Document conditional probability distribution

     Document x
     y        word1   word2   word3   …   wordn
     tf(y)    5       10      6       …   16
     p(y|x)   5/m     10/m    6/m     …   16/m

     y – words
     tf(y) – frequency of y in document x
     p(y|x) – conditional probability of y in document x
     m – size of document x
     (5/m, 10/m, 6/m, …, 16/m) – document conditional probability distribution
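The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `document_distribution` and the toy document are hypothetical, and a document is assumed to be already tokenized into a list of words.

```python
from collections import Counter

def document_distribution(words):
    """Return {y: tf(y) / m} for a document given as a list of words."""
    tf = Counter(words)   # tf(y): frequency of word y in the document
    m = len(words)        # m: document size, counted in words
    return {y: count / m for y, count in tf.items()}

# Toy document (hypothetical data): "a" occurs twice, "b" and "c" once each.
doc = ["a", "b", "a", "c"]
p = document_distribution(doc)
```

The values of `p` sum to 1, as a probability distribution must.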

  6. Word context

     Word w occurs in Document x1, Document x2, …, Document xk.

     Document x1
     y         word1   word2   …   wordn1
     tf(y)     5       10      …   16
     p(y|x1)   5/m1    10/m1   …   16/m1

     Document x2
     y         word1   word3   …   wordn2
     tf(y)     7       12      …   4
     p(y|x2)   7/m2    12/m2   …   4/m2

     …

     Document xk
     y         word1   word4   …   wordnk
     tf(y)     20      9       …   3
     p(y|xk)   20/mk   9/mk    …   3/mk

     Context of word w
     y         word1         word2   word3   …   wordnk
     tf(y)     5+7+20=32     10      12      …   3
     p(y|w)    32/m          10/m    12/m    …   3/m

     (32/m, 10/m, 12/m, …, 3/m) – context conditional probability distribution
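A minimal sketch of how a word context could be pooled from the documents that contain the word. Two points are assumptions rather than statements from the slides: that the context of w covers every document containing w, and that m is the total size of those documents. The name `context_distribution` and the toy collection are hypothetical.

```python
from collections import Counter

def context_distribution(w, documents):
    """p(y|w): pooled word distribution over the documents that contain w."""
    pooled = Counter()
    for words in documents:
        if w in words:
            pooled.update(words)   # add tf(y) of this document to the context
    m = sum(pooled.values())       # assumed: total size of those documents
    return {y: count / m for y, count in pooled.items()}

# Toy collection (hypothetical data): "a" occurs in the first two documents.
docs = [["a", "b", "a"], ["a", "c"], ["b", "c"]]
p_a = context_distribution("a", docs)
```

Here the pooled counts are a: 3, b: 1, c: 1 with m = 5, mirroring the way the slide sums 5 + 7 + 20 for word1.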

  7. Contents • Introduction • Methods description • Information Retrieval System • Experiments

  8. Methods • document clustering method • dictionary build methods • document classification method using training set Information retrieval methods: • keyword search method • cluster based search method • similar documents search method

  9. Contextual Documents Clustering (diagram): Documents, Dictionary, Narrow context words, Distances calculation, Clusters

  10. Entropy. The context conditional probability distribution of y is (p1, p2, …, pn), with p1 + p2 + … + pn = 1. Its entropy, $H(y) = -\sum_{i=1}^{n} p_i \log p_i$, is an uncertainty measure; here it is used to characterize the commonness (narrowness) of the word context.

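  11. Contextual Document Clustering $\max H(y) = H\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) = \log n$ (entropy is largest for the uniform distribution)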

  12. Entropy [plot: entropy H as a function of α, with α running from 0 to 1 and 0.5 marked on the axis]
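As an illustration of how entropy separates narrow contexts from common ones, here is a small sketch. Shannon entropy with a base-2 logarithm is assumed (the slides do not state the base), and the top-k selection rule in `narrowest_contexts` is a hypothetical simplification, not the selection rule of the original system.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def narrowest_contexts(contexts, k):
    """Return the k words whose context distributions have the lowest entropy."""
    return sorted(contexts, key=lambda w: entropy(contexts[w]))[:k]

# Hypothetical contexts: one spread evenly over four words, one concentrated.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
peaked = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}
narrow = narrowest_contexts({"common_word": uniform, "narrow_word": peaked}, 1)
```

The uniform distribution reaches the maximum entropy for four outcomes, while the concentrated one scores much lower and is therefore picked as the narrow context.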

  13. Word Context - Document Distance. Distributions involved: the context conditional probability distribution of y, the conditional probability distribution of document x, and their average conditional probability distribution.

  14. Word Context - Document Distance $JS[p_1, p_2] = H\left(\frac{p_1 + p_2}{2}\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$

  15. Jensen-Shannon divergence
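A minimal sketch of the Jensen-Shannon divergence between a word context and a document, computed as the entropy of the average distribution minus the average of the two entropies. The base-2 logarithm is an assumption, and `nearest_context`, which attaches a document to its closest context, is a hypothetical reading of how the distance calculation leads to clusters.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def js_divergence(p1, p2):
    """JS[p1, p2] = H((p1 + p2) / 2) - 0.5 * H(p1) - 0.5 * H(p2)."""
    keys = set(p1) | set(p2)
    avg = {y: 0.5 * (p1.get(y, 0.0) + p2.get(y, 0.0)) for y in keys}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def nearest_context(doc_dist, contexts):
    """Return the context word whose distribution is closest to the document."""
    return min(contexts, key=lambda w: js_divergence(contexts[w], doc_dist))

# Hypothetical narrow contexts and one document distribution.
contexts = {
    "football": {"ball": 0.5, "goal": 0.5},
    "banking": {"bank": 0.5, "rate": 0.5},
}
document = {"ball": 0.75, "team": 0.25}
best = nearest_context(document, contexts)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no words in common).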

  16. Dictionary construction Why: - big volumes: 60,000 documents, 50,000 words => 15,000 words in a context - importance of narrow context words

  17. Dictionary construction Delete words with 1. High or low frequency 2. High or low document frequency 3. Both 1 and 2
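The pruning rules above can be sketched as a filter on two counts per word. The function `build_dictionary` and its four threshold parameters are hypothetical; the slides do not give the cut-off values that were actually used.

```python
from collections import Counter

def build_dictionary(documents, min_tf, max_tf, min_df, max_df):
    """Keep words whose frequency and document frequency fall inside the bounds."""
    tf = Counter()   # total occurrences of each word in the collection
    df = Counter()   # number of documents that contain each word
    for words in documents:
        tf.update(words)
        df.update(set(words))
    return {
        w for w in tf
        if min_tf <= tf[w] <= max_tf and min_df <= df[w] <= max_df
    }

# Toy collection (hypothetical data).
docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "cat"]]
dictionary = build_dictionary(docs, min_tf=2, max_tf=10, min_df=2, max_df=2)
```

In this toy run "the" is dropped because it occurs in every document, and "dog" because it occurs only once, which corresponds to applying rules 1 and 2 together.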

  18. Retrieval algorithms • keyword search method • cluster based search method • search by example method

  19. Keyword search method

      Document 1: word 1, word 2, word 3, …, word n1
      Document 2: word 10, word 25, word 30, …, word n2
      Document 3: word 15, word 2, word 32, …, word n3
      Document 4: word 11, word 21, word 3, …, word n4

      Request: word 2
      Result set: document 1, document 3
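The keyword search example can be reproduced with an inverted index. This is a sketch only: the in-memory index and the names `build_index` and `keyword_search` are assumptions, and only the words listed explicitly on the slide are included (the elided ones are left out).

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index

def keyword_search(index, word):
    """Return the sorted ids of the documents that contain the word."""
    return sorted(index.get(word, set()))

documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
index = build_index(documents)
result = keyword_search(index, "word 2")
```

The request "word 2" returns document 1 and document 3, matching the result set on the slide.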

  20. Cluster based search method

      Cluster context words (each cluster also holds its documents):
      Cluster 1: word 1, word 2, …, word n1
      Cluster 2: word 12, word 26, …, word n2
      Cluster 3: word 1, word 23, …, word n3

      Request: word 1
      Result set: Cluster 1, Cluster 3
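A sketch of the cluster based search example: the request word is matched against the context words of each cluster, and the matching clusters are returned with their documents. The function name `cluster_search` and the document memberships are hypothetical; only the context words listed explicitly on the slide are used.

```python
def cluster_search(cluster_words, cluster_documents, word):
    """Return {cluster: documents} for clusters whose context words include word."""
    return {
        name: cluster_documents.get(name, [])
        for name, words in cluster_words.items()
        if word in words
    }

cluster_words = {
    "Cluster 1": {"word 1", "word 2"},
    "Cluster 2": {"word 12", "word 26"},
    "Cluster 3": {"word 1", "word 23"},
}
# Hypothetical memberships, added only so the result has something to show.
cluster_documents = {
    "Cluster 1": ["document 1", "document 4"],
    "Cluster 2": ["document 2"],
    "Cluster 3": ["document 3"],
}
hits = cluster_search(cluster_words, cluster_documents, "word 1")
```

The request "word 1" selects Cluster 1 and Cluster 3, as in the slide's result set.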

  21. Similar documents search [figure: a cluster, labeled with the cluster name, drawn as a minimal spanning tree over document 1 to document 7]
      Request: document 3
      Result set: document 6, document 7
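One way to picture the similar documents search is to build a minimal spanning tree over the documents of a cluster and return the tree neighbours of the requested document. The sketch below uses Prim's algorithm; the choice of Prim, the neighbour rule, and the toy one-dimensional distance are all assumptions made for illustration.

```python
def minimal_spanning_tree(nodes, dist):
    """Prim's algorithm; dist(a, b) is a symmetric distance. Returns tree edges."""
    nodes = list(nodes)
    if not nodes:
        return []
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects the tree to a node outside it.
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: dist(*edge),
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges

def similar_documents(edges, doc):
    """Documents directly linked to doc in the tree."""
    return sorted({b if a == doc else a for a, b in edges if doc in (a, b)})

# Toy cluster (hypothetical data): documents placed on a line.
position = {"d1": 0, "d2": 1, "d3": 3, "d4": 7}
tree = minimal_spanning_tree(position, lambda a, b: abs(position[a] - position[b]))
neighbours = similar_documents(tree, "d3")
```

In the real system the distance would come from the stored document distances rather than from positions on a line.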

  22. Document classification: method 1. Diagram elements: training set; test documents; clusters; list of topics; topic contexts; distances between topic contexts and cluster contexts. Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30
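Method 1 might be sketched as follows, under stated assumptions: each topic context is built by pooling the words of the training documents labeled with that topic, and each cluster receives the topic whose context is nearest in Jensen-Shannon divergence. The pooling step, the function names, and the toy data are hypothetical; the slide only names the elements involved.

```python
import math
from collections import Counter

def distribution(words):
    """Relative word frequencies of a list of words."""
    m = len(words)
    return {y: c / m for y, c in Counter(words).items()}

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def js(p1, p2):
    """Jensen-Shannon divergence between two distributions."""
    avg = {y: 0.5 * (p1.get(y, 0.0) + p2.get(y, 0.0)) for y in set(p1) | set(p2)}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def topic_contexts(training_set):
    """training_set: list of (topic, words). Pool the words of each topic."""
    pooled = {}
    for topic, words in training_set:
        pooled.setdefault(topic, []).extend(words)
    return {topic: distribution(words) for topic, words in pooled.items()}

def classify_clusters(cluster_contexts, topics):
    """Assign to each cluster the topic with the nearest context."""
    return {
        cluster: min(topics, key=lambda t: js(topics[t], ctx))
        for cluster, ctx in cluster_contexts.items()
    }

# Toy training set and cluster contexts (hypothetical data).
training = [
    ("sport", ["ball", "goal", "ball"]),
    ("finance", ["bank", "rate"]),
    ("sport", ["goal", "team"]),
]
clusters = {
    "cluster 1": distribution(["bank", "rate", "rate"]),
    "cluster 2": distribution(["ball", "team"]),
}
labels = classify_clusters(clusters, topic_contexts(training))
```

Every test document then inherits the topic of the cluster it belongs to, which yields a result of the form "cluster 1 – topic 10".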

  23. Document classification: method 2. Diagram elements: training set; test documents; set of all documents; topics list; clusters. Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30

  24. Contents • Introduction • Methods description • Information Retrieval System • Experiments

  25. Information Retrieval System • Architecture • Features • Use

  26. Information Retrieval System architecture: database, server, client.

  27. IRS architecture: database; database server (MS SQL Server 2000); local area network; “thick” client (C#)

  28. IRS architecture DBMS MS SQL Server 2000: • High-performance • Scalable • Secure • Handles huge volumes of data • T-SQL • Stored procedures

  29. IRS features In the IRS the following problems are solved: • document clustering • keyword search method • cluster based search method • similar documents search method • document classification with the use of training set

  30. DB structure The database of the IRS consists of the following tables: • documents • all words dictionary • dictionary • table of relations between documents and words: document-word • words contexts • words with narrow contexts • clusters • intermediate tables used to build the main tables and to implement retrieval

  31. Algorithms implementation. Diagram elements: Documents; All words dictionary; Dictionary; Table “document-word”; Words contexts; Words with narrow contexts; Clusters; Centroid. Retrieval methods shown: Keyword search; Cluster based search; Similar documents search

  32. Similar documents search [figure: a cluster drawn as a graph over document1 to document5, with edge weights ranging from 0,1011 to 0,98154]

  33. Minimal Spanning Tree [figure: the cluster, labeled with the cluster name, drawn as a tree over document 1 to document 5]

  34. Similar documents search: Clusters table, Distances table, Tree table

  35. IRS use

  36. IRS use

  37. IRS use

  38. IRS use

  39. IRS use

  40. IRS use

  41. Contents • Introduction • Methods description • Information Retrieval System • Experiments

  42. Experiments The test goals were: • testing the accuracy of the algorithm • comparing different classification methods • evaluating the efficiency of the algorithm

  43. Experiments • 60,000 documents • 100 topics • Training set volume = 5% of the collection size

  44. Experiments

  45. Result analysis - Russian Information Retrieval Evaluation Seminar - The following macro-averaged measures were calculated: • recall • precision • F-measure
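For reference, macro-averaged measures can be computed by scoring each topic separately and then averaging over topics. The sketch below makes two assumptions that the slides do not confirm: one label per document, and a macro F-measure defined as the mean of the per-topic F-measures (another common convention combines the macro precision and macro recall instead). Topics are taken from the true labels only.

```python
def macro_scores(true_labels, predicted_labels):
    """Return macro-averaged (precision, recall, F-measure) over the true topics."""
    topics = sorted(set(true_labels))
    if not topics:
        return 0.0, 0.0, 0.0
    precisions, recalls, f_measures = [], [], []
    for t in topics:
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for y, z in pairs if y == t and z == t)   # correct hits
        predicted = sum(1 for z in predicted_labels if z == t)
        actual = sum(1 for y in true_labels if y == t)
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f_measures.append(f)
    n = len(topics)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n

# Toy labels (hypothetical data): four documents, two topics.
true = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
precision, recall, f_measure = macro_scores(true, pred)
```

Macro-averaging gives every topic the same weight regardless of how many documents it contains, so small topics influence the score as much as large ones.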

  46. Recall

  47. Precision

  48. F-measure

  49. Result analysis List of some of the topics into which test documents were classified

  50. Result analysis Recall results for every category. The best result for each category is shown in bold. All results are given as percentages.
