A text mining approach on automatic generation of web directories and hierarchies

This paper presents a text mining approach for automatically generating web directories and hierarchies from a corpus of web pages. The method uses the self-organizing map learning algorithm to cluster web pages and to identify important words for directory labeling. Experimental results show that the proposed method produces comprehensible and reasonable web directories and hierarchies.

Presentation Transcript


  1. A text mining approach on automatic generation of web directories and hierarchies • Advisor: Dr. Hsu • Reporter: Chun Kai Chen • Authors: Hsin-Chang Yang and Chung-Hong Lee • Expert Systems with Applications, 2004, pp. 645-663

  2. Outline • Motivation • Objective • The text mining process • Automatic generation of web directories • Experimental Results • Summary

  3. Motivation • The classification of web pages into proper directories and the organization of directory hierarchies are generally performed by human experts.

  4. Objective • In this work, we provide a corpus-based method that applies text mining techniques to a corpus of web pages to automatically create web directories and organize them into hierarchies.

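  5. [Figure: overview of the method, with stages marked S1, S2, S3 and Si, Si+1. Labels: the text mining process (web pages, extract document text, SOM (DCM), SOM (WCM)); automatic generation of web directories; generation of directory hierarchies (two-level hierarchy); generation of directories (web directories).]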

  6. Automatic generation of web directories • Generation of directory hierarchies • The super-cluster generation process algorithm • Generation of directories • identify cluster themes by examining the WCM • select the word that is the most important to a super-cluster [Figure labels: stop criteria, DCM, WCM]

  7. Experimental Results • The experiments show that our method can produce comprehensible and reasonable web directories and hierarchies.

  8. Introduction(1/3) • Information finding is thus a serious problem for the web since most users find it hard to obtain the information using current information retrieval strategies. • Two kinds of strategies are now adopted by the web communities, namely searching and browsing.

  9. Introduction(2/3) • Since the link structures may be considered static during browsing • the selection of starting pages plays the most important role when a user tries to find his goal in minimum time • Therefore, many commercial or academic web sites actively collect web pages and sort them into web directories • to provide users the starting points in the browsing process

  10. Introduction(3/3) • Most existing web directories were created manually by human specialists. • Yahoo! • This limitation is mainly caused by the enormous number of web pages that have been, and are still being, produced

  11. Related work • category hierarchy • predefined category hierarchy (Yahoo!) • automatically developing category hierarchy • topic identification • mutually related text excerpts • Self-organizing map algorithm

  12. The text mining process(1/2) • The method is based on the self-organizing map learning algorithm and requires no human intervention during the construction of web directories and hierarchies. [Figure: the text mining process. Labels: web pages, extract document text, SOM (WCM), SOM (DCM)]

  13. The text mining process(2/2) • labeling process • each document is associated with a neuron in the map. We record these associations to form the document cluster map (DCM). • In the DCM, each neuron is labeled by a list of documents which • are considered similar and • are in the same cluster. • In the same manner, we assign each word to a neuron in the map to form the word cluster map (WCM).
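The labeling process on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the choice of Euclidean distance are hypothetical, and the rule used for the WCM (assign each word to the neuron whose weight vector has the largest component for that word) is an assumption, since the slide only says that words are labeled "in the same manner".

```python
import math
from collections import defaultdict

def nearest_neuron(vector, weights):
    """Index of the neuron whose weight vector is closest (Euclidean) to `vector`."""
    return min(range(len(weights)), key=lambda j: math.dist(vector, weights[j]))

def build_dcm(doc_vectors, weights):
    """DCM: map each neuron index to the list of document ids labeled to it."""
    dcm = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        dcm[nearest_neuron(vec, weights)].append(doc_id)
    return dict(dcm)

def build_wcm(vocabulary, weights):
    """WCM: map each neuron index to the words labeled to it.

    Assumption: word n goes to the neuron with the largest n-th weight component.
    """
    wcm = defaultdict(list)
    for n, word in enumerate(vocabulary):
        best = max(range(len(weights)), key=lambda j: weights[j][n])
        wcm[best].append(word)
    return dict(wcm)

# Toy data (hypothetical): 3 words, 2 trained neurons, 3 documents.
vocabulary = ["music", "sport", "travel"]
weights = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.7]]
doc_vectors = [[1, 0, 0], [0, 1, 1], [0, 1, 0]]

dcm = build_dcm(doc_vectors, weights)
wcm = build_wcm(vocabulary, weights)
print(dcm)
print(wcm)
```

Documents that land on the same neuron form one cluster, which is what lets the DCM serve as the starting point for directory generation.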

  14. Generation of directory hierarchies(1/3) • The two-level hierarchy generation process • the parent node is the constructed super-cluster • the child nodes are the clusters that compose the super-cluster • the process can be further applied to every super-cluster to establish the next level of the hierarchy • The overall hierarchy • is built by applying this top-down approach iteratively • until a stop criterion is satisfied

  15. Generation of directory hierarchies(2/3) • To form a super-cluster • the distance between two clusters (distance between their coordinates in the two-dimensional map) • the dissimilarity between two clusters (similarity of their neuron vectors) • the supporting cluster similarity • we can determine the significance of a cluster by examining the overall similarity that is contributed by its neighboring clusters. • doc(i): the number of documents associated with neuron i • Bi: the indices of the neurons neighboring neuron i • F: a monotonically increasing function • The dominating clusters • have locally maximal supporting cluster similarity • serve as the centroid of a super-cluster, which contains several child clusters
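The slide lists the ingredients of the supporting cluster similarity (doc(i), Bi, F, map distance, and vector dissimilarity), but the formula itself did not survive in the transcript. The sketch below therefore uses one plausible combination, which is an assumption and should be checked against the paper: for neuron i, average over its neighbors j the value doc(i)·doc(j) / F(D(i, j)·‖wi − wj‖), where D is the distance on the map grid. The function names, the choice F(x) = 1 + x, and the toy data are all hypothetical.

```python
import math

def grid_neighbors(i, rows, cols):
    """Indices of the neurons adjacent to neuron i on a rows x cols grid (B_i)."""
    r, c = divmod(i, cols)
    return [rr * cols + cc
            for rr in range(max(0, r - 1), min(rows, r + 2))
            for cc in range(max(0, c - 1), min(cols, c + 2))
            if (rr, cc) != (r, c)]

def supporting_similarity(i, weights, doc_counts, rows, cols, F=lambda x: 1.0 + x):
    """Assumed form: mean over j in B_i of doc(i)*doc(j) / F(grid_dist * dissimilarity)."""
    neighbors = grid_neighbors(i, rows, cols)
    if not neighbors:
        return 0.0
    r_i, c_i = divmod(i, cols)
    total = 0.0
    for j in neighbors:
        r_j, c_j = divmod(j, cols)
        grid_dist = math.hypot(r_i - r_j, c_i - c_j)   # distance on the 2-D map
        dissim = math.dist(weights[i], weights[j])      # neuron vector dissimilarity
        total += doc_counts[i] * doc_counts[j] / F(grid_dist * dissim)
    return total / len(neighbors)

def dominating_clusters(scores, rows, cols):
    """Neurons whose score is strictly larger than that of every neighbor."""
    return [i for i in range(rows * cols)
            if all(scores[i] > scores[j] for j in grid_neighbors(i, rows, cols))]

# Toy 1 x 3 map (hypothetical): neuron 2 has a very different weight vector.
rows, cols = 1, 3
weights = [[0.0], [0.0], [3.0]]
doc_counts = [1, 5, 2]

scores = [supporting_similarity(i, weights, doc_counts, rows, cols)
          for i in range(rows * cols)]
dominating = dominating_clusters(scores, rows, cols)
print(scores, dominating)
```

Because F is monotonically increasing, neighbors that are far away on the map or dissimilar in weight space contribute less, so a neuron scores highly only when it sits among well-populated, similar clusters.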

  16. Generation of directory hierarchies(3/3) • In Step 3 of the super-cluster generation process algorithm we set three stop criteria. • The first criterion stops the search for super-clusters • if there is no neuron left for selection. • The second criterion limits the number of dominating clusters, which constrains the breadth of the hierarchy. • The third criterion constrains the depth of the hierarchy.
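To show how the three stop criteria could interact, here is a small top-down sketch. The slides refer to Step 3 of the super-cluster generation algorithm without reproducing its steps, so the loop below is an assumption: it repeatedly picks the remaining neuron with the highest supporting cluster similarity, excludes that neuron's neighbors from further selection, and then attaches every neuron to its nearest dominating cluster. The names `build_hierarchy`, `max_breadth`, and `max_depth` are hypothetical, and the scores are reused at each level for brevity.

```python
import math

def build_hierarchy(members, scores, neighbors, coords,
                    max_breadth, max_depth, depth=1):
    """Return a nested dict describing super-clusters over the neurons in `members`."""
    node = {"members": sorted(members), "children": []}
    # Criterion 3: constrain the depth of the hierarchy.
    if depth > max_depth or len(members) < 2:
        return node
    candidates = set(members)
    dominating = []
    # Criterion 1: stop when no neuron is left for selection.
    # Criterion 2: stop when the breadth limit on dominating clusters is reached.
    while candidates and len(dominating) < max_breadth:
        best = max(candidates, key=lambda n: (scores[n], -n))
        dominating.append(best)
        candidates -= {best, *neighbors[best]}
    if len(dominating) < 2:
        return node  # a single super-cluster would just repeat the parent
    # Attach every neuron to its nearest dominating cluster on the map.
    groups = {d: [] for d in dominating}
    for n in members:
        nearest = min(dominating, key=lambda d: math.dist(coords[n], coords[d]))
        groups[nearest].append(n)
    for d in dominating:
        child = build_hierarchy(groups[d], scores, neighbors, coords,
                                max_breadth, max_depth, depth + 1)
        child["centroid"] = d
        node["children"].append(child)
    return node

# Toy 1 x 4 map (hypothetical scores and layout).
coords = [(0, 0), (0, 1), (0, 2), (0, 3)]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = [1.0, 5.0, 2.0, 4.0]

tree = build_hierarchy([0, 1, 2, 3], scores, neighbors, coords,
                       max_breadth=2, max_depth=1)
print(tree)
```

With `max_depth=1` the root is split once, which corresponds to the two-level hierarchy; raising it applies the same split to each super-cluster in turn.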

  17. [Figure: stages S1, S2, and S3 of the method; only the stage labels survive in this transcript]

  18. Generation of directories • In this work, we try to identify cluster themes, i.e. directory labels, by examining the WCM. • The method selects the word that is the most important to a super-cluster
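The label selection step can be illustrated as follows. The slide does not define how a word's importance to a super-cluster is measured, so this sketch assumes a simple score: the sum of the word's weight components over the neurons that belong to the super-cluster. The function name and the toy data are hypothetical.

```python
def label_super_cluster(member_neurons, weights, vocabulary):
    """Pick the word with the largest summed weight over the member neurons.

    Assumption: summed weight stands in for "importance to the super-cluster".
    """
    totals = [sum(weights[j][n] for j in member_neurons)
              for n in range(len(vocabulary))]
    best = max(range(len(vocabulary)), key=lambda n: totals[n])
    return vocabulary[best]

# Toy data (hypothetical): 3 words, 3 neurons, two super-clusters.
vocabulary = ["music", "sport", "travel"]
weights = [[0.9, 0.1, 0.0],
           [0.1, 0.8, 0.7],
           [0.2, 0.6, 0.9]]

label_a = label_super_cluster([1, 2], weights, vocabulary)
label_b = label_super_cluster([0], weights, vocabulary)
print(label_a, label_b)
```

Any other importance measure can be swapped in by changing how `totals` is computed; the rest of the selection stays the same.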

  19. Summary • In this paper, we present a method to automatically generate • web directory hierarchies and identify directory labels. • Experiments show that our method could • successfully cluster the documents into directories, • reveal the hierarchical structure among these directories, • and assign a label to each directory. • However, a fully automatic process may not provide the best solutions for tasks that involve human beings so closely. • Thus, in our opinion, a semi-automatic process that uses the proposed method as a preprocessing stage should be able to meet the general requirements.

  20. Personal Opinion • Application • such as text categorization, thesaurus construction, ontology learning, multilingual information retrieval • Advantage • a fully automatic process, which can create web directory hierarchies without human intervention • Disadvantage • may not provide the best solutions
