Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue

Lobachevsky State University of Nizhny Novgorod
Faculty of Computational Mathematics and Cybernetics
Chair of the Mathematical Support of Computer
Prepared by Fedor Vladimirovich Borisyuk
Scientific adviser: Doctor of Technical Sciences, Professor Vladimir Ivanovich Shvetsov
Nizhny Novgorod, 2010

Problem definition
Initial data: a collection of text documents.
Purpose: automatically construct a hierarchical catalogue that reflects the thematic areas of the given initial collection.

Examples of web catalogues: Yandex Catalogue
• Yandex Catalogue stores information on tens of thousands of Russian websites.
• It uses 16 major topics, each no more than six levels deep.
• Yandex Catalogue was compiled and is updated manually.

Examples of web catalogues: eLIBRARY.RU
• eLIBRARY.RU is a Russian scientific electronic library holding 12 million scientific articles.
• It uses the library classifier GRNTI (State Rubricator of Scientific and Technical Information).
• The catalogue is maintained by a group of experts.
• The depth of the catalogue is no more than 3.

Why construct an electronic catalogue automatically
• Large amounts of text data have accumulated and keep growing.
• Most catalogues are maintained with the support of human experts, which means high labor costs and subjective judgments.
• Traditional, manually prepared catalogues and classifiers cannot keep up with the rapid growth of information in the required areas.

Related works
• Tao Li and Shenghuo Zhu used a linear discriminant projection approach to transform the document space into a lower-dimensional space, and then clustered the documents into a hierarchy using the hierarchical agglomerative clustering algorithm.
• O. Peskova developed a modification of Ayvazyan's layerwise clustering method. The developed method showed a 4% advantage in average F-measure over the hierarchical agglomerative clustering algorithm.

Mechanism of automatic construction of electronic catalogue
Inputs: unclassified text collection; parameters of the clustering algorithm.
Stages: preparation of document images for clustering; hierarchical clustering by areas; post-processing of the hierarchical structure.
Output: hierarchical structure of the electronic catalogue.

Preparation of document images: suggested keyword selection algorithm
• For every word of the document, the stem is extracted using the Porter algorithm.
• Stop words are removed, along with words whose frequency is above a predefined maximum or below a predefined minimum.
• The weight of stem i in document D is calculated using a modified TFxIDF formula.
• No more than 300 stems with the highest weights are selected as keywords to represent the document.
• The number of keywords is then reduced using the suggested selective feature reduction algorithm.

Weighting formula
TFxIDF weighting formulas. Notation:
• TFi: term frequency of stem i in document D
• MaxStemFreqD: maximum frequency among all stems in D
• TDN: total number of documents in the collection
• DNi: number of documents in which the stem occurs
• IDFi: inverse document frequency
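The formulas themselves did not survive in this transcript, so the following is only a sketch under a stated assumption: the weight is taken as (TFi / MaxStemFreqD) * log(TDN / DNi), a common max-normalized TFxIDF variant that uses exactly the quantities listed above. It may differ from the author's modified formula. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_stems, doc_freq, total_docs):
    """Weight every stem of one document.

    ASSUMPTION: weight_i = (TF_i / MaxStemFreq_D) * log(TDN / DN_i).
    The slides name these quantities but do not show the exact formula.
    """
    tf = Counter(doc_stems)
    if not tf:
        return {}
    max_stem_freq = max(tf.values())              # MaxStemFreq_D
    weights = {}
    for stem, freq in tf.items():
        idf = math.log(total_docs / doc_freq[stem])   # IDF_i from TDN and DN_i
        weights[stem] = (freq / max_stem_freq) * idf
    return weights


def top_keywords(weights, limit=300):
    """Keep at most `limit` stems with the highest weights (300 in the slides)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Toy collection of already stemmed documents (hypothetical data).
docs = [["cluster", "tree", "cluster"], ["tree", "catalogue"], ["cluster", "catalogue", "web"]]
doc_freq = Counter(stem for d in docs for stem in set(d))   # DN_i for every stem
weights = tfidf_weights(docs[0], doc_freq, len(docs))
```

In this toy collection "cluster" and "tree" occur in the same number of documents, so the higher within-document frequency of "cluster" alone gives it the larger weight.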

Suggested selective feature space reduction
Purpose: select the keywords with the best discriminating power with respect to the possible catalogue areas.
Selective feature space reduction algorithm:
• Cluster the document collection using the modified hierarchical by areas algorithm. Each area in the tree is characterized by a keyword vector.
• Run the keyword extraction algorithm on the areas' keyword vectors to select the keywords of each area in relation to the other areas.
• Remove from the documents those keywords that are not present in the areas' feature space.

Basics of hierarchical by areas clustering
[Figure: tree of areas with nodes A to H]
• The object of clustering is a text document.
• A document is characterized by a vector of keywords.
• Tree of areas.

• Protection against failures of compute nodes
• Distributed file system
• High performance
• Automatic load balancing
• Apache Hadoop, an open-source implementation of MapReduce

Characteristics of an area
• An area is characterized by a vector of keywords, built from the keywords of the documents in the area. Each keyword has a weight.
• An area holds the documents that belong to it. There is a limit on the number of documents in an area.
• An area can have children. There is a limit on the number of children.
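A minimal sketch of how such an area might be represented as a data structure. The class name, the field names and the two default limits are hypothetical; the slides do not give concrete values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # areas are compared by identity, not by content
class Area:
    # Weighted keyword vector built from the documents of this area.
    keywords: Dict[str, float] = field(default_factory=dict)
    # Documents of the area, each stored as its own keyword vector.
    documents: List[Dict[str, float]] = field(default_factory=list)
    children: List["Area"] = field(default_factory=list)
    parent: Optional["Area"] = None
    max_documents: int = 50   # hypothetical limit on documents per area
    max_children: int = 5     # hypothetical limit on children per area

    def is_crowded(self) -> bool:
        """True when the document limit is exceeded (triggers a divide)."""
        return len(self.documents) > self.max_documents

    def has_too_many_children(self) -> bool:
        """True when the children limit is exceeded (triggers an integrate)."""
        return len(self.children) > self.max_children

    def add_child(self, child: "Area") -> None:
        child.parent = self
        self.children.append(child)
```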

Startup
• Suppose there is an incoming flow of documents.
• The area tree is built incrementally from the incoming documents.

Hierarchical by areas clustering
Step 1. Set Area = root area. Verify that the document Doc can be inserted; put Doc into the RecycleBin if its proximity is less than the minimum.
Step 2. Search for the child of Area closest to Doc.
Step 3. If Child is closer to Doc than to Area, set Area = Child and go to step 2.
Step 4. Insert the document into Area.
Step 5. Verify limits: if Area is crowded, divide it; if the number of children is more than the limit, integrate them.
Step 6. Update the keyword sets of the areas located on the path to the resulting area.
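The descent through the tree can be sketched as follows. Several choices here are assumptions, not taken from the slides: cosine similarity as the proximity measure, the threshold value, a single entry check at the root, and summing weights as the keyword update rule. Step 5 (divide and integrate) is left out, and the class and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (assumed proximity measure)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Area:
    def __init__(self, keywords=None):
        self.keywords = dict(keywords or {})
        self.documents = []
        self.children = []


def insert(root, doc, recycle_bin, min_proximity=0.1):
    """Insert doc into the area tree; return the chosen area or None."""
    # Step 1: start at the root; reject documents that are too far away.
    # An empty root accepts anything (assumption, so the tree can start).
    area = root
    if area.keywords and cosine(doc, area.keywords) < min_proximity:
        recycle_bin.append(doc)
        return None
    path = [area]
    while area.children:
        # Step 2: find the child closest to the document.
        child = max(area.children, key=lambda c: cosine(doc, c.keywords))
        # Step 3: descend if the child is closer to the document than to
        # its parent area (one reading of the rule in the slides).
        if cosine(child.keywords, doc) > cosine(child.keywords, area.keywords):
            area = child
            path.append(area)
        else:
            break
    # Step 4: store the document in the chosen area.
    area.documents.append(doc)
    # Step 5 (divide / integrate) is omitted in this sketch.
    # Step 6: update keywords along the path (assumed rule: add the weights).
    for node in path:
        for k, w in doc.items():
            node.keywords[k] = node.keywords.get(k, 0.0) + w
    return area
```

A document whose keywords do not overlap with the root at all has proximity 0 and ends up in the recycle bin; a document that matches one child much better than that child matches its parent descends into that child.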

RecycleBin
[Figure: area tree with nodes A, R, B, C, X]
• Documents that do not meet the entry criteria of the areas at a certain level are temporarily stored in a special area on the same level, the RecycleBin.
• When the number of objects in the RecycleBin exceeds the predefined limit, the RecycleBin is divided and the detached area is connected to the current level.

Divide operation
[Figure: tree before and after the division, with nodes A, B, C, D]
• Reason: too many documents in the area.
• Divide the area into two parts using the K-means algorithm.
• Connect the resulting areas C and D to the tree. Area B will hold the integrated characteristics.
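A minimal pure-Python sketch of the two-way K-means split, operating on documents stored as sparse keyword vectors. The squared Euclidean distance, the deterministic seeding (first document and the document farthest from it) and the iteration cap are assumptions made for illustration; the slides only state that K-means is used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two sparse keyword vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for k, w in v.items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(vectors) for k, w in total.items()}


def split_in_two(docs, iterations=10):
    """K-means with k = 2; returns (parts, centers)."""
    # Deterministic seeding (assumption): the first document and the
    # document farthest from it.
    first = docs[0]
    farthest = max(docs, key=lambda d: sq_dist(d, first))
    centers = [dict(first), dict(farthest)]
    parts = ([], [])
    for _ in range(iterations):
        parts = ([], [])
        for d in docs:
            nearer = 0 if sq_dist(d, centers[0]) <= sq_dist(d, centers[1]) else 1
            parts[nearer].append(d)
        if not parts[0] or not parts[1]:
            break                      # degenerate split, e.g. identical documents
        new_centers = [mean(parts[0]), mean(parts[1])]
        if new_centers == centers:
            break                      # assignments have converged
        centers = new_centers
    return parts, centers


docs = [{"a": 1.0}, {"a": 0.9}, {"b": 1.0}, {"b": 1.1}]
parts, centers = split_in_two(docs)
```

The two parts would become the new areas C and D, with the returned centers as their initial keyword vectors. When all documents are identical, one part stays empty and the caller has to skip the division.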

Integrate operation
[Figure: tree before and after the integration, with nodes A, B, C, D, X]
• Reason: the number of children is more than the predefined limit.
• Find the two closest areas (B and C) and unite them under one parent area.
• The center of the parent area D is the average of the keyword vectors of the two integrated areas B and C.
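The integrate step can be sketched as below. Areas are represented as plain dictionaries with hypothetical "keywords" and "children" entries, and squared Euclidean distance is an assumed closeness measure; averaging the two keyword vectors for the new parent follows the slide.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse keyword vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def average(a, b):
    """Component-wise average of two sparse keyword vectors."""
    keys = set(a) | set(b)
    return {k: (a.get(k, 0.0) + b.get(k, 0.0)) / 2 for k in keys}


def integrate(children):
    """Unite the two closest children under a new parent.

    Expects at least two children and returns the new list of children
    for the area whose limit was exceeded.
    """
    i, j = min(
        combinations(range(len(children)), 2),
        key=lambda p: sq_dist(children[p[0]]["keywords"], children[p[1]]["keywords"]),
    )
    b, c = children[i], children[j]
    # The new parent's center is the average of the two keyword vectors.
    parent = {"keywords": average(b["keywords"], c["keywords"]), "children": [b, c]}
    rest = [area for k, area in enumerate(children) if k not in (i, j)]
    return rest + [parent]


children = [
    {"keywords": {"x": 1.0}, "children": []},
    {"keywords": {"x": 0.8}, "children": []},
    {"keywords": {"y": 1.0}, "children": []},
]
merged = integrate(children)
```

Each call reduces the number of direct children by one, so it can be repeated until the limit is satisfied.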

The area tree is filled like a champagne pyramid

Post-processing of the generated hierarchical structure
• Set each area's title to the first three keywords of the area's keyword vector sorted by weight.
• Across the whole tree, create links between areas on the same level if the distance between their keyword vectors is greater than a calculated distance. Purpose: cross-referencing similar or related rubrics.
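Title generation is simple enough to show directly. This sketch assumes the keyword vector is a mapping from stem to weight and joins the three heaviest stems with spaces; the function name, the separator and the alphabetical tie-break are hypothetical.

```python
def area_title(keywords, size=3):
    """Return the `size` heaviest keywords of an area as its title."""
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(keywords.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(stem for stem, _ in ranked[:size])


title = area_title({"cluster": 0.9, "tree": 0.4, "web": 0.7, "doc": 0.1})
```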

Test collections

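External clustering evaluation
Recall, precision and F-measure:
\[
\text{Recall} = \frac{tp}{tp + fn}, \qquad
\text{Precision} = \frac{tp}{tp + fp}, \qquad
\text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Precision} + \text{Recall}}
\]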
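As a small worked example, the three metrics can be computed from the counts of true positives (tp), false positives (fp) and false negatives (fn), assuming the standard definitions of precision and recall. The function name and the sample counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f_measure) for the given counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 6 correct assignments, 2 spurious, 4 missed.
precision, recall, f_measure = precision_recall_f(tp=6, fp=2, fn=4)
```

With these counts precision is 6/8 = 0.75, recall is 6/10 = 0.6, and the F-measure is 2 * 0.45 / 1.35, roughly 0.667.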

Computational experiments: Evaluation of average metrics

Top levels of catalogue generated by the hierarchical clustering by areas

Conclusions
• An effective method of keyword extraction from text documents for the purpose of text clustering is presented.
• Computational experiments showed the efficiency of the suggested approach, which uses the hierarchical clustering by areas algorithm for automatic construction of an electronic catalogue.

Thank you! Questions?
fedorvb@gmail.com - Fedor Vladimirovich Borisyuk
shvetsov@unn.ru - Vladimir Ivanovich Shvetsov
