Development and Implementation of Classification and Clustering Methods for Unstructured Document Collections

Classification and clustering methods development and implementation for unstructured document collections, by Osipova Nataly, St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology

Contents • Introduction • Methods description • Information Retrieval System • Experiments

Contextual Document Clustering was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

Definitions • Document • Terms dictionary • Dictionary • Cluster • Word context • Context or document conditional probability distribution • Entropy

Document conditional probability distribution
Document x
y         word1   word2   word3   …   wordn
tf(y)     5       10      6       …   16
p(y|x)    5/m     10/m    6/m     …   16/m
• y: words
• tf(y): frequency of y
• p(y|x): conditional probability of y in document x
• m: size of document x
• (5/m, 10/m, 6/m, …, 16/m): document conditional probability distribution
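The table above can be sketched in a few lines of Python. This is only an illustration: the function name `document_distribution` and the toy document are hypothetical, and the document is assumed to be already split into words.

```python
from collections import Counter

def document_distribution(words):
    """Return p(y|x) = tf(y) / m for every word y of document x."""
    tf = Counter(words)   # tf(y): how often y occurs in the document
    m = len(words)        # m: size of the document in words
    return {y: count / m for y, count in tf.items()}

# Hypothetical toy document, not taken from the presentation's collection.
doc_x = "cluster document cluster word document cluster".split()
p = document_distribution(doc_x)
print(p)
```

The values of the returned dictionary sum to 1, so it can be used directly as the document conditional probability distribution.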

Word context
Word w: Document x1, Document x2, …, Document xk
Document x1
y          word1   word2   …   wordn1
tf(y)      5       10      …   16
p(y|x1)    5/m1    10/m1   …   16/m1
Document x2
y          word1   word3   …   wordn2
tf(y)      7       12      …   4
p(y|x2)    7/m2    12/m2   …   4/m2
Document xk
y          word1   word4   …   wordnk
tf(y)      20      9       …   3
p(y|xk)    20/mk   9/mk    …   3/mk
Context of word w
y          word1       word2   word3   …   wordnk
tf(y)      5+7+20=32   10      12      …   3
p(y|w)     32/m        10/m    12/m    …   3/m
(32/m, 10/m, 12/m, …, 3/m): context conditional probability distribution
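A minimal sketch of how such a context could be computed. It assumes that the context of w pools the term frequencies of every document containing w and that m is the total size of those documents; the slide does not define m for the context explicitly, and `word_context` and the toy documents are hypothetical.

```python
from collections import Counter

def word_context(w, documents):
    """Return p(y|w), pooled over every document that contains w."""
    pooled = Counter()
    for words in documents:
        if w in words:
            pooled.update(words)   # add tf(y) of this document
    m = sum(pooled.values())       # assumed: total size of the pooled documents
    return {y: count / m for y, count in pooled.items()} if m else {}

# Hypothetical toy collection.
docs = [
    ["entropy", "context", "context"],
    ["entropy", "cluster"],
    ["search", "cluster"],         # does not contain "entropy", so it is skipped
]
ctx = word_context("entropy", docs)
print(ctx)
```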

Methods
• document clustering method
• dictionary construction methods
• document classification method using a training set
Information retrieval methods:
• keyword search method
• cluster based search method
• similar documents search method

Contextual Document Clustering: Documents → Dictionary → Narrow context words → Distances calculation → Clusters

Entropy
For the context conditional probability distribution (p1, p2, …, pn) over words y, with p1+p2+…+pn=1:
H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log p_i
Entropy is an uncertainty measure; here it is used to characterize the commonness (narrowness) of the word context.

Contextual Document Clustering: max H(y) = H(1/n, 1/n, …, 1/n) = \log n

Entropy [Figure: entropy plot; horizontal axis marked 0, α, 0.5, 1; the arguments of the three H( ) labels were lost in extraction]
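The entropy criterion can be sketched as follows. The slides say only that entropy characterizes how narrow a word context is; selecting narrow contexts by a fixed entropy threshold, the base-2 logarithm, and the names `narrow_context_words` and `threshold` are assumptions made for this illustration.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p_i log2 p_i of a distribution given as a dict."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def narrow_context_words(contexts, threshold):
    """Keep the words whose context entropy is below a (hypothetical) threshold."""
    return [w for w, p in contexts.items() if entropy(p) < threshold]

# Hypothetical contexts: a focused one and a spread-out one.
contexts = {
    "kernel": {"kernel": 0.5, "linux": 0.5},                  # H = 1 bit
    "thing": {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25},    # H = 2 bits
}
narrow = narrow_context_words(contexts, threshold=1.5)
print(narrow)
```

A uniform distribution over n words gives the largest possible entropy, log2(n), so common words with broad contexts end up near the top of the entropy range.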

Word Context - Document Distance
• context conditional probability distribution over words y
• document x conditional probability distribution
• average conditional probability distribution

Word Context - Document Distance
JS[p_1, p_2] = H\left(\frac{p_1 + p_2}{2}\right) - 0.5\,H(p_1) - 0.5\,H(p_2)

Jensen-Shannon divergence
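A small sketch of the Jensen-Shannon divergence between a word context and a document, with distributions stored as dictionaries. The helper `nearest_context` reflects an assumption that a document is attached to the narrow context closest to it; the slides list "Distances calculation" before "Clusters" but do not spell out the assignment rule. All names and toy values are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as a dict."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def js_divergence(p1, p2):
    """JS[p1, p2] = H((p1 + p2) / 2) - 0.5 H(p1) - 0.5 H(p2)."""
    words = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def nearest_context(document, contexts):
    """Assumed assignment rule: pick the context word with the smallest distance."""
    return min(contexts, key=lambda w: js_divergence(contexts[w], document))

# Hypothetical narrow contexts and one document distribution.
contexts = {
    "kernel": {"kernel": 0.5, "linux": 0.5},
    "recipe": {"recipe": 0.5, "sugar": 0.5},
}
document = {"linux": 0.5, "kernel": 0.25, "driver": 0.25}
print(nearest_context(document, contexts))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no words in common).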

Dictionary construction
Why:
- large volumes: 60,000 documents, 50,000 words => 15,000 words in a context
- importance of narrow context words

Dictionary construction
Delete words with:
1. High or low frequency
2. High or low document frequency
3. Both 1 and 2
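Option 3 above (filtering on both frequencies) could look roughly like this. The cut-off ranges are hypothetical parameters; the slides give no concrete thresholds.

```python
from collections import Counter

def build_dictionary(documents, tf_range, df_range):
    """Keep words whose total frequency and document frequency fall inside the given ranges."""
    tf = Counter()   # total number of occurrences of each word
    df = Counter()   # number of documents containing each word
    for words in documents:
        tf.update(words)
        df.update(set(words))
    tf_lo, tf_hi = tf_range
    df_lo, df_hi = df_range
    return sorted(
        w for w in tf
        if tf_lo <= tf[w] <= tf_hi and df_lo <= df[w] <= df_hi
    )

# Hypothetical toy collection: "the" is too frequent, "dog" and "rare" too infrequent.
docs = [["the", "cat", "the"], ["the", "dog", "cat"], ["the", "rare"]]
dictionary = build_dictionary(docs, tf_range=(2, 3), df_range=(2, 2))
print(dictionary)
```

Options 1 and 2 correspond to applying only one of the two range checks.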

Retrieval algorithms • keyword search method • cluster based search method • search by example method

Keyword search method
Document 1: word 1, word 2, word 3, …, word n1
Document 2: word 10, word 25, word 30, …, word n2
Document 3: word 15, word 2, word 32, …, word n3
Document 4: word 11, word 21, word 3, …, word n4
Request: word 2
Result set: document 1, document 3
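The example above can be reproduced with a simple inverted index. This is an in-memory sketch only; in the system described later the word-document relation is stored in a database table, and the function names here are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index

def keyword_search(index, word):
    """Return the documents containing the requested word, in sorted order."""
    return sorted(index.get(word, set()))

# The four documents from the slide (the elided words are left out).
documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
index = build_index(documents)
result = keyword_search(index, "word 2")
print(result)
```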

Cluster based search method
Cluster context words (each cluster has its own documents):
Cluster 1: word 1, word 2, …, word n1
Cluster 2: word 12, word 26, …, word n2
Cluster 3: word 1, word 23, …, word n3
Request: word 1
Result set: Cluster 1, Cluster 3
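A sketch of the same lookup at cluster level: the request is matched against the context words of each cluster rather than against document contents. The data structure and the name `cluster_search` are assumptions for illustration.

```python
def cluster_search(cluster_words, word):
    """Return the clusters whose context words include the requested word."""
    return sorted(name for name, words in cluster_words.items() if word in words)

# Cluster context words from the slide (the elided words are left out).
cluster_words = {
    "Cluster 1": {"word 1", "word 2"},
    "Cluster 2": {"word 12", "word 26"},
    "Cluster 3": {"word 1", "word 23"},
}
clusters = cluster_search(cluster_words, "word 1")
print(clusters)
```

The documents of the returned clusters would then form the result shown to the user.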

Similar documents search
[Figure: minimal spanning tree of a cluster, with the cluster name and document 1 to document 7 as nodes]
Request: document 3
Result set: document 6, document 7
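One way to read this slide is that the documents of a cluster are joined by a minimal spanning tree over their pairwise distances, and the documents adjacent to the requested one in the tree are returned. The sketch below follows that reading, which is an assumption; it uses Prim's algorithm and hypothetical one-dimensional positions in place of real document distances.

```python
def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm; dist(a, b) is a symmetric distance. Returns an adjacency dict."""
    nodes = list(nodes)
    tree = {n: set() for n in nodes}
    if not nodes:
        return tree
    in_tree = {nodes[0]}
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the part of the tree built so far.
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: dist(edge[0], edge[1]),
        )
        tree[a].add(b)
        tree[b].add(a)
        in_tree.add(b)
    return tree

def similar_documents(tree, doc):
    """Assumed result rule: the neighbours of the requested document in the tree."""
    return sorted(tree[doc])

# Hypothetical positions on a line; distance is the absolute difference.
position = {"document 1": 0.0, "document 2": 1.0, "document 3": 5.0,
            "document 6": 5.5, "document 7": 4.0}
tree = minimum_spanning_tree(position, lambda a, b: abs(position[a] - position[b]))
similar = similar_documents(tree, "document 3")
print(similar)
```

Building the tree once per cluster and storing it makes each later request a simple neighbour lookup.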

Document classification: method 1
• Training set
• Test documents
• Clusters
• List of topics
• Topics contexts
• Distances between topics and clusters contexts
Classification result: cluster 1: topic 10, cluster 2: topic 3, …, cluster n: topic 30
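A rough sketch of method 1 under several assumptions that the slide does not confirm: a topic context is the pooled word distribution of the training documents of that topic, the distance between a topic context and a cluster context is the Jensen-Shannon divergence, and each cluster receives the closest topic. All names and toy data are hypothetical.

```python
import math
from collections import Counter

def distribution(words):
    """Relative word frequencies of a list of words."""
    tf = Counter(words)
    m = sum(tf.values())
    return {y: count / m for y, count in tf.items()}

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def js_divergence(p1, p2):
    words = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def label_clusters(cluster_contexts, training_set):
    """Assign to every cluster the topic whose (assumed) context is closest."""
    pooled = {}
    for topic, words in training_set:          # training_set: (topic, words) pairs
        pooled.setdefault(topic, []).extend(words)
    topic_contexts = {t: distribution(ws) for t, ws in pooled.items()}
    return {
        cluster: min(topic_contexts,
                     key=lambda t: js_divergence(ctx, topic_contexts[t]))
        for cluster, ctx in cluster_contexts.items()
    }

# Hypothetical training documents and cluster contexts.
training_set = [("topic 10", ["sql", "server", "sql"]),
                ("topic 3", ["entropy", "context"])]
cluster_contexts = {"cluster 1": {"sql": 0.5, "table": 0.5},
                    "cluster 2": {"entropy": 1.0}}
labels = label_clusters(cluster_contexts, training_set)
print(labels)
```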

Document classification: method 2
• Training set
• Test documents
• All documents set
• Topics list
• Clusters
Classification result: cluster 1: topic 10, cluster 2: topic 3, …, cluster n: topic 30

Information Retrieval System • Architecture • Features • Use

Information Retrieval System architecture: database, server, client.

IRS architecture: database, database server (MS SQL Server 2000), local area network, "thick" client (C#)

IRS architecture
DBMS MS SQL Server 2000:
• High-performance
• Scalable
• Secure
• Handles huge volumes of data
• T-SQL
• Stored procedures

IRS features In the IRS the following problems are solved: • document clustering • keyword search method • cluster based search method • similar documents search method • document classification with the use of training set

DB structure
The database of the IRS consists of the following tables:
• documents
• all words dictionary
• dictionary
• table of relations between documents and words: document-word
• words contexts
• words with narrow contexts
• clusters
• intermediate tables for building the main tables and for implementing retrieval

Algorithms implementation
• Data: documents, all words dictionary, dictionary, table "document-word", words contexts, words with narrow contexts, clusters, centroid
• Search methods: keyword search, cluster based search, similar documents search

Similar documents search [Figure: cluster of document 1 to document 5 drawn as a graph whose edges are labelled with values between 0,1011 and 0,98154]

Minimal Spanning Tree [Figure: tree of a cluster, with the cluster name and document 1 to document 5 as nodes]

Similar documents search: Clusters table, Distances table, Tree table

IRS use

Experiments
The test goals were:
• testing algorithm accuracy
• comparing different classification methods
• evaluating algorithm efficiency

Experiments • 60,000 documents • 100 topics • Training set volume = 5% of the collection size

Experiments

Result analysis
- Russian Information Retrieval Evaluation Seminar
- Macro-averaged recall, precision and F-measure were calculated.

Recall

Precision

F-measure
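The formulas of these three slides did not survive extraction. As a reminder of the usual definitions, the sketch below computes macro-averaged recall, precision and F-measure for single-label classification. It assumes that the macro F-measure is the mean of the per-topic F values; the exact variant used in the evaluation is not stated in the slides, and the toy labels are hypothetical.

```python
def macro_scores(true_topics, predicted_topics):
    """Return (macro recall, macro precision, macro F-measure) over all topics."""
    topics = sorted(set(true_topics) | set(predicted_topics))
    pairs = list(zip(true_topics, predicted_topics))
    recalls, precisions, f_measures = [], [], []
    for t in topics:
        tp = sum(1 for a, b in pairs if a == t and b == t)   # correctly assigned to t
        fn = sum(1 for a, b in pairs if a == t and b != t)   # missed documents of t
        fp = sum(1 for a, b in pairs if a != t and b == t)   # wrongly assigned to t
        r = tp / (tp + fn) if tp + fn else 0.0
        p = tp / (tp + fp) if tp + fp else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        recalls.append(r)
        precisions.append(p)
        f_measures.append(f)
    n = len(topics)
    return sum(recalls) / n, sum(precisions) / n, sum(f_measures) / n

# Hypothetical true and predicted topics of four test documents.
true_topics = ["a", "a", "b", "b"]
predicted_topics = ["a", "b", "b", "b"]
recall, precision, f_measure = macro_scores(true_topics, predicted_topics)
print(recall, precision, f_measure)
```

Macro-averaging gives every topic the same weight regardless of how many documents it contains, which matters when the 100 topics differ strongly in size.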

Result analysis: list of some of the topics into which test documents were classified

Result analysis: recall results for every category. The best result for each category is shown in bold. All results are given in percent.
