
Methodology to Find Web Site Keywords

This presentation describes a methodology that combines web usage mining and web content mining techniques to determine the most important keywords for a web site. Keywords are obtained by clustering visitor sessions and analyzing the pages belonging to each cluster. The proposed method aims to help attract and retain visitors by understanding their browsing behavior and interests.


Presentation Transcript


  1. National Yunlin University of Science and Technology • A Methodology to Find Web Site Keywords • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Juan D. Velasquez, Richard Weber, Hiroshi Yasuda • Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service, IEEE 2004

  2. Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Related Work • WUM, WCM, TFIDF, cosine, SOM • Experimental • Conclusions • Personal Opinion • Review

  3. Motivation • What in many cases makes the difference between success and failure of e-business is the potential of the respective web site to attract and retain visitors.

  4. Objective • We propose a method to determine the set of the most important words in a web site from the visitor’s point of view. • This is done by combining usage information with web page content, arriving at a set of keywords determined implicitly by the site’s visitors.

  5. Introduction • We use web page content, especially free text, together with patterns from web usage, as input for clustering visitor sessions. • Web usage mining • Web content mining • By analyzing the pages that belong to each of the clusters found, we can determine the most important words for each cluster and consequently for each type of visitor. • A clustering algorithm is used to find groups of similar visitor sessions.

  6. Related Work • Web mining techniques are categorized into three sub-areas: • Web Structure Mining (WSM) • Web Content Mining (WCM) • Web Usage Mining (WUM) • In this paper, we propose a combination of WCM and WUM techniques.

  7. Web Content Mining • The goal is to find useful information in web content. • TFIDF: R is the number of words, Q is the number of documents

  8. Web Usage Mining • The goal is pattern discovery using different kinds of data mining techniques, such as statistical analysis, association rules, clustering, classification…

  9. Combining WUM and WCM • By applying WUM we can understand visitor browsing behavior, but we cannot discover which content is interesting to the visitor. • A similarity measure has been suggested that allows comparing the behavior of different visitors through the analysis of visitor preferences.

  10. Web page processing • HTML Tags • Stop words • Word stemming
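
The three processing steps listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the `naive_stem` suffix stripper, and the `preprocess` function name are all assumptions made for the example, and a real system would use a full stop-word list and a proper stemmer.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Return the stemmed, stop-word-free tokens of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


page = "<html><body><h1>Opening accounts</h1><p>The bank is offering loans.</p></body></html>"
tokens = preprocess(page)
print(tokens)  # ['open', 'account', 'bank', 'offer', 'loan']
```

The output of this step is a list of tokens per page, which is the input a word-weighting scheme such as TFIDF needs.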

  11. TFIDF • Let R be the number of different words in a web site and Q be the number of its pages. • Based on the traditional method, we propose a variation that incorporates the influence of special words, i.e., words that have different levels of importance for a visitor, e.g., words in italic font, referrer words, words associated with the page title…
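
A minimal sketch of such a weighting, under stated assumptions: the base weight is the standard TF-IDF form (term count times log(Q / n_i), where n_i is the number of pages containing word i), and special words are boosted by a multiplicative factor (1 + extra weight). The slide does not show the exact formula, so the boost form, the `special_weight` dictionary, and the function name `tfidf_matrix` are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(pages, special_weight=None):
    """Return (vocabulary, weights); weights[j][i] is the weight of word i in page j.

    pages: list of token lists, one per page (Q = len(pages)).
    special_weight: hypothetical dict mapping a word to its extra weight.
    """
    special_weight = special_weight or {}
    q = len(pages)
    vocabulary = sorted({w for page in pages for w in page})  # the R words
    # n_i: number of pages that contain word i at least once.
    doc_freq = Counter(w for page in pages for w in set(page))
    weights = []
    for page in pages:
        counts = Counter(page)
        row = []
        for word in vocabulary:
            tf = counts[word]
            idf = math.log(q / doc_freq[word])
            boost = 1.0 + special_weight.get(word, 0.0)  # assumed boost form
            row.append(tf * boost * idf)
        weights.append(row)
    return vocabulary, weights


pages = [["loan", "rate", "loan"], ["rate", "account"], ["account", "card"]]
vocabulary, weights = tfidf_matrix(pages, special_weight={"loan": 0.5})
print(vocabulary)     # ['account', 'card', 'loan', 'rate']
print(weights[0][2])  # weight of "loan" in the first page: 2 * 1.5 * log(3)
```

Each row of `weights` is a word vector for one page, with R entries; a word that appears on every page gets weight zero because log(Q / Q) = 0.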

  12. Definition 1

  13. Definition 2 • From the visitor behavior vector we want to select the most important pages, assuming that importance is correlated with the relative time spent on each page.
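
The selection described here can be sketched as below. The session representation (a list of page identifier and seconds pairs), the function name `important_pages`, and the choice to merge repeated visits to the same page are assumptions for illustration; the slide does not specify them.

```python
def important_pages(session, n_pages):
    """Pick the n_pages pages with the largest share of session time.

    session: list of (page_id, seconds_spent) in visit order.
    Returns a list of (page_id, fraction_of_total_time), largest share first.
    """
    total = sum(seconds for _, seconds in session)
    if total <= 0:
        return []
    # Merge repeated visits to the same page (an assumption of this sketch).
    spent = {}
    for page_id, seconds in session:
        spent[page_id] = spent.get(page_id, 0) + seconds
    ranked = sorted(spent.items(), key=lambda item: item[1], reverse=True)
    return [(page_id, seconds / total) for page_id, seconds in ranked[:n_pages]]


session = [("home", 10), ("loans", 60), ("contact", 5), ("rates", 25)]
top = important_pages(session, 2)
print(top)  # [('loans', 0.6), ('rates', 0.25)]
```

Using the fraction of total time rather than raw seconds makes sessions of different lengths comparable.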

  14. Definition 3

  15. In the similarity measure between visitor sessions, the first element indicates the visitor’s interest in the pages visited. • The second element is the distance between pages.
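
One way to combine these two elements is sketched below. The formula itself is not shown on the slide, so everything here is an assumption: the interest term is taken as the smaller of the two time ratios, the page term is the cosine between the pages' word vectors (the outline lists "cosine"), and the result is averaged over the compared page positions. The names `cosine` and `session_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two word-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def session_similarity(a, b):
    """Compare two sessions given as equal-length lists of
    (time_fraction, page_vector) pairs, one pair per important page."""
    if not a or len(a) != len(b):
        raise ValueError("sessions must be non-empty and of equal length")
    total = 0.0
    for (time_a, vec_a), (time_b, vec_b) in zip(a, b):
        if time_a <= 0 or time_b <= 0:
            continue  # no time spent on one side contributes nothing
        # First element: how similar the interest (relative time) is.
        interest = min(time_a / time_b, time_b / time_a)
        # Second element: how close the two pages' contents are.
        total += interest * cosine(vec_a, vec_b)
    return total / len(a)


s1 = [(0.6, [1.0, 0.0]), (0.4, [0.0, 1.0])]
s2 = [(0.3, [1.0, 0.0]), (0.7, [1.0, 1.0])]
print(session_similarity(s1, s1))  # identical sessions score 1.0
print(session_similarity(s1, s2))
```

With these choices the measure is symmetric and lies between 0 and 1 for non-negative word weights, which makes it usable as input for clustering.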

  16. Clustering visitor sessions • We use a clustering algorithm in order to find groups of similar visitor sessions. Based on this information we determine the most important words for each cluster.
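
To illustrate grouping sessions with a pairwise similarity, here is a small k-medoids sketch. Note that this is a substitute technique chosen for brevity: the outline lists SOM, and the slide does not describe the algorithm further. The function name `k_medoids`, the deterministic seeding, and the toy one-dimensional data are assumptions of the example.

```python
def k_medoids(items, k, similarity, max_iter=20):
    """Group items around k medoids using a pairwise similarity.

    Deterministic: the first k items seed the medoids.
    Returns a list of cluster labels, one per item.
    """
    if not 0 < k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    medoids = list(range(k))
    labels = []
    for _ in range(max_iter):
        # Assign each item to its most similar medoid.
        labels = [
            max(range(k), key=lambda c: similarity(item, items[medoids[c]]))
            for item in items
        ]
        # Re-pick each medoid as the member most similar to its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            best = max(
                members,
                key=lambda i: sum(similarity(items[i], items[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


# Toy data: numbers stand in for sessions; similarity decays with distance.
values = [1.0, 10.0, 1.2, 9.5, 0.8]
labels = k_medoids(values, 2, lambda a, b: 1.0 / (1.0 + abs(a - b)))
print(labels)  # [0, 1, 0, 1, 0]
```

Because the function only needs a `similarity` callable, any session-level similarity measure could be passed in place of the toy one.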

  17. Identifying web site keywords • We propose the following method to determine the most important keywords and their importance in each cluster. • A measure (the geometric mean) is used to calculate the importance of each word relative to each cluster.
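
A minimal sketch of ranking words by geometric mean, assuming each cluster is represented by the word-weight vectors of its pages. The function name `cluster_keywords`, the `top_n` parameter, and the example weights are hypothetical; the slide only states that a geometric mean is used.

```python
import math


def cluster_keywords(vocabulary, page_vectors, top_n):
    """Rank words by the geometric mean of their weights over a cluster's pages.

    vocabulary: list of R words.
    page_vectors: list of weight vectors (length R), one per page in the cluster.
    """
    if not page_vectors:
        return []
    n = len(page_vectors)
    scores = []
    for i, word in enumerate(vocabulary):
        weights = [vec[i] for vec in page_vectors]
        if any(w <= 0 for w in weights):
            score = 0.0  # a word missing from any page scores zero
        else:
            # Geometric mean computed in log space for numerical stability.
            score = math.exp(sum(math.log(w) for w in weights) / n)
        scores.append((word, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


vocabulary = ["account", "loan", "rate"]
cluster_pages = [[0.0, 2.0, 1.0], [1.0, 8.0, 1.0]]
keywords = cluster_keywords(vocabulary, cluster_pages, 2)
print([word for word, _ in keywords])  # ['loan', 'rate']
```

Compared with an arithmetic mean, the geometric mean favors words that are important on all pages of the cluster: a single zero weight removes the word from the ranking.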

  18. Experimental

  19. Experimental

  20. Experimental

  21. Experimental

  22. Experimental

  23. Experimental

  24. Conclusions • We proposed a way to find the most important pages for the visitor, assuming that the time spent on each page is proportional to the visitor's interest. • The method identifies the most important pages visited and the time spent on each of them, ordered by time. • The similarity measure introduced can be very useful for increasing knowledge about visitor preferences on the web and for identifying keywords that attract and retain visitors.

  25. Personal Opinion

  26. Review • WUM • WCM • TFIDF • Important pages vector • Clustering • Identify the important keywords.
