Refined and Incremental Centroid-based Approach for Genre Categorization of Web Pages

Chaker JEBARI • King Saud University, College of Computer & Information Sciences, Computer Science Department • jebarichaker@yahoo.fr • WWW’2008 Conference, NLPIX’2008 Workshop, April 22, Beijing, China

Overview • Introduction • Related works • Centroid-based categorization • My approach • Experiments • Comparison

Introduction • Web page categorization is becoming more and more useful for enhancing search engine results • As the number of web pages increases every day, topic categorization alone becomes insufficient • Genre is another criterion used to classify web pages (Jebari and Ounelli, 2007)

Introduction • The genre of web pages (cybergenre) is characterized by the triple <content, form, functionality> (Shepherd and Watters, 1999). • Web genres change over time • I propose a refined and incremental approach for genre categorization of web pages

Related works

Centroid-based categorization • Finds a description (centroid or prototype) that summarizes all documents belonging (or not) to a given category. • The time and memory required by centroid-based models are proportional to the number of categories rather than to the number of training documents, unlike other machine learning techniques (Naïve Bayes, K nearest neighbors, decision trees, etc.).

Centroid-based categorization • Centroid-based models can take in more training documents and easily recalculate their centroids. • Many models have been proposed to calculate centroids (Rocchio, average, sum, normalized sum, etc.). • The normalized sum is one of the most powerful models

Centroid-based categorization The centroid Cj for a category cj is defined as follows: A document dj is assigned to the category with the highest similarity, calculated as follows:
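
The two formulas on this slide are not preserved in the transcript. The following is a minimal sketch of centroid-based categorization, assuming that the normalized sum centroid is the sum of a category's document vectors divided by its Euclidean norm and that similarity is the cosine measure named later in the talk. The sparse dictionary representation, the function names and the toy data are illustrative assumptions, not the author's implementation.

```python
import math
from collections import defaultdict


def norm(vec):
    """Euclidean length of a sparse vector (dict: term -> weight)."""
    return math.sqrt(sum(w * w for w in vec.values()))


def normalized_sum_centroid(docs):
    """Sum the document vectors of one category and scale the sum to unit length."""
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    length = norm(total)
    return {t: w / length for t, w in total.items()} if length else dict(total)


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(w * b.get(t, 0.0) for t, w in a.items()) / (na * nb)


def classify(doc, centroids):
    """Assign the document to the category whose centroid is most similar."""
    return max(centroids, key=lambda genre: cosine(doc, centroids[genre]))


# Toy training data (hypothetical term weights, two genres).
training = {
    "shop": [{"price": 1.0, "cart": 1.0}, {"price": 2.0}],
    "help": [{"faq": 1.0, "question": 1.0}],
}
centroids = {g: normalized_sum_centroid(docs) for g, docs in training.items()}
predicted = classify({"price": 1.0, "faq": 0.2}, centroids)
```

Only one centroid per category is stored, which is why time and memory grow with the number of categories rather than with the number of training documents.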

My approach [Figure: overview of the approach. Elements shown: training web pages, pre-processing, construction of centroids, centroids, new web page, URL, logical structure, hypertext structure, categorization, combination]

Construction of centroids • c = {c1, … , ck}: set of k predefined categories • C = {C1, …, Cj, …, Ck}: set of genre centroids, computed using the normalized sum formula • I discard the web pages whose similarity with their centroid is less than a predefined threshold s0 (noisy web pages). • For each category cj, I calculate a new set of training web pages sj as follows:

Construction of centroids where pi is a web page and sim is the cosine similarity. The centroid Sj obtained after refining, using the normalized sum formula, is defined as follows:
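
The refinement step described on these two slides can be sketched as follows. This is a hedged reconstruction: it assumes that sj keeps the pages pi of category cj with sim(pi, Cj) >= s0, and that Sj is the normalized sum over sj. Dense lists are used as vectors for brevity, and falling back to the initial centroid when every page is discarded is an assumption of this sketch, not something stated in the talk.

```python
import math


def norm(v):
    """Euclidean length of a dense vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def normalized_sum(vectors):
    """Component-wise sum of the vectors, scaled to unit length."""
    total = [sum(col) for col in zip(*vectors)]
    length = norm(total)
    return [x / length for x in total] if length else total


def refine(pages, s0):
    """Drop noisy pages (similarity to the initial centroid below s0)
    and rebuild the centroid from the pages that remain."""
    initial = normalized_sum(pages)
    kept = [p for p in pages if cosine(p, initial) >= s0]
    # Assumption: keep the initial centroid if no page passes the threshold.
    refined = normalized_sum(kept) if kept else initial
    return kept, initial, refined


# Toy category with one outlier page (hypothetical weights).
pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept, initial, refined = refine(pages, s0=0.6)
```

In this toy run the third page has a similarity of about 0.5 with the initial centroid, so it is treated as noise and the refined centroid moves toward the first two pages.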

Pre-Processing • Feature extraction: URL, logical structure (the content of the title and Hn tags) and hypertext structure (the content of anchors) • Remove special characters and stop words • Stem the remaining words • Weight terms using NormTFIDF with parameter values 0.5, -1 and -0.5 (Lertnattee and Theeramunkong, 2004)
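
As an illustration of the feature extraction and cleaning steps listed above, here is a minimal sketch using only the Python standard library. The class name, the tiny stop-word list and the example page are hypothetical. Stemming and NormTFIDF weighting are deliberately left out, since the talk does not say which stemmer was used and the weighting formula is not reproduced in the transcript.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on"}


class FeatureExtractor(HTMLParser):
    """Collects text from title/Hn tags (logical) and anchors (hypertext)."""

    LOGICAL_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.logical = []
        self.hypertext = []
        self._open = []  # currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag in self.LOGICAL_TAGS or tag == "a":
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the most recently opened matching tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        if any(t in self.LOGICAL_TAGS for t in self._open):
            self.logical.append(data)
        if "a" in self._open:
            self.hypertext.append(data)


def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def extract_features(url, html):
    """Return one token list per feature: URL, logical and hypertext structure."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": tokenize(url),
        "logical": tokenize(" ".join(parser.logical)),
        "hypertext": tokenize(" ".join(parser.hypertext)),
    }


page = ("<html><head><title>Course Home</title></head><body>"
        "<h1>The Syllabus</h1><p>ignored text</p>"
        "<a href='x.html'>Lecture notes</a></body></html>")
features = extract_features("http://www.example.edu/courses/index.html", page)
```

Each of the three token lists would then feed its own classifier, which matches the three-classifier setup described later in the talk.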

Categorization of a new page • New web pages are categorized one by one (incremental categorization). • For each new web page p, I calculate its cosine similarity with all centroids. • I refine the centroids whose similarity with the page p is greater than or equal to the threshold s0.

Categorization of new pages • The refining step consists of adding the new page p to the normalized centroid of the corresponding genre and renormalizing the centroid. • Each normalized centroid Sj is associated with a non-normalized centroid NSj. • Refinement of the centroid Sj can be performed by the following two operations:
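
The two update operations are not preserved in the transcript. A plausible reading, given the description above, is NSj ← NSj + p followed by Sj ← NSj / ||NSj||. The sketch below implements that reading as an assumption; the class and function names are hypothetical, and returning the most similar genre as the prediction is a simplification of this sketch.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


class IncrementalCentroid:
    """Keeps the non-normalized sum NS_j next to the normalized centroid S_j,
    so a new page can be folded in without revisiting old training pages."""

    def __init__(self, pages):
        self.raw = [sum(col) for col in zip(*pages)]   # NS_j
        self.unit = self._normalize(self.raw)          # S_j

    @staticmethod
    def _normalize(v):
        length = norm(v)
        return [x / length for x in v] if length else list(v)

    def add(self, page):
        # Assumed update: NS_j <- NS_j + p, then S_j <- NS_j / ||NS_j||.
        self.raw = [r + x for r, x in zip(self.raw, page)]
        self.unit = self._normalize(self.raw)


def categorize_and_refine(page, centroids, s0):
    """Score the page against every centroid, then refine each centroid
    whose similarity with the page reaches the threshold s0."""
    sims = {g: cosine(page, c.unit) for g, c in centroids.items()}
    for genre, sim in sims.items():
        if sim >= s0:
            centroids[genre].add(page)
    return max(sims, key=sims.get), sims


centroids = {
    "shop": IncrementalCentroid([[2.0, 0.0]]),
    "help": IncrementalCentroid([[0.0, 2.0]]),
}
genre, sims = categorize_and_refine([1.0, 0.2], centroids, s0=0.5)
```

Keeping NSj alongside Sj makes each update cost proportional to the vector length only, which is what makes one-by-one categorization cheap.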

Combination • The aim is to combine the outputs of three homogeneous classifiers, which use the URL, the logical structure and the hypertext structure, respectively (Jebari, 2007). • I used decision templates for the combination (Kuncheva et al., 2001)
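
For readers unfamiliar with decision templates, the following is a minimal sketch of the general technique from Kuncheva et al. (2001), not necessarily the exact variant used in this work. A decision profile is a matrix with one row per classifier and one column per genre; the template of a genre is the mean profile of its training pages; a new page gets the genre whose template is closest to its profile. Squared Euclidean distance is assumed here as the (dis)similarity measure, and all scores are made up for the example.

```python
def decision_templates(profiles, labels):
    """Average the decision profiles of the training pages of each genre.

    profiles: list of matrices (one row per classifier, one column per genre)
    labels:   true genre of each training page
    """
    sums, counts = {}, {}
    for dp, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [[0.0] * len(row) for row in dp]
            counts[label] = 0
        for i, row in enumerate(dp):
            for k, value in enumerate(row):
                sums[label][i][k] += value
        counts[label] += 1
    return {
        label: [[v / counts[label] for v in row] for row in matrix]
        for label, matrix in sums.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two matrices of the same shape."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def combine(profile, templates):
    """Pick the genre whose decision template is closest to the profile."""
    return min(templates, key=lambda g: squared_distance(profile, templates[g]))


# Rows: URL, logical, hypertext classifiers. Columns: scores for (shop, help).
train_profiles = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
    [[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]],
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
]
train_labels = ["shop", "shop", "help"]
templates = decision_templates(train_profiles, train_labels)
decision = combine([[0.6, 0.4], [0.4, 0.6], [0.8, 0.2]], templates)
```

In the toy example the logical-structure classifier leans toward "help", but the full profile is still much closer to the "shop" template, so the combined decision is "shop".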

Experiments
Corpora: KI-04 and WebKB

KI-04
Genre                    # of web pages
Article                  127
Discussion               127
Download                 151
Help                     139
Link collection          205
Non private portrayal    163
Private portrayal        126
Shop                     167

WebKB
Genre                    # of web pages
Student                  1541
Faculty                  1063
Staff                    126
Department               170
Project                  474
Course                   875

Experiments • Aims: • Measure the effect of vocabulary size on genre categorization of web pages • Measure the usefulness of refining, incrementing and combination in genre categorization of web pages • Compare with other works and machine learning techniques • Experimental setup: • I used micro-averaged accuracy as the performance measure • I used the 5×2 cross-validation methodology

Results Effect of vocabulary size: micro-averaged accuracy for each feature on the KI-04 (a) and WebKB (b) corpora, obtained by varying the number of terms between 5 and 3000. [Figure: panels (a) KI-04 and (b) WebKB]

Results Usefulness of refining: micro-averaged accuracy for each feature on the KI-04 (a) and WebKB (b) corpora, obtained by varying the refining threshold between 0 and 1 in steps of 0.1. [Figure: panels (a) KI-04 and (b) WebKB]

Results Usefulness of incrementing: micro-averaged accuracy for each feature on the KI-04 (a) and WebKB (b) corpora, obtained by varying the proportion of testing web pages between 10% and 90% in steps of 10%. [Figure: panels (a) KI-04 and (b) WebKB]

Results Usefulness of combination: micro-averaged accuracy for each classifier (URL, logical, hypertext and combined classifiers) on the KI-04 (a) and WebKB (b) corpora. [Figure: panels (a) KI-04 and (b) WebKB]

Comparison • Problems: • No publicly available, standard benchmark corpora exist for the genre categorization task • There is no agreed definition of web page genres, and each study focuses on a different set of genres • Comparison with other works: • Only Kanaris and Stamatatos (2007) report good micro-averaged accuracy on the KI-04 corpus, because their approach is based on structural information, as is mine. • (Jebari and Ounalli, 2004)

Comparison
Author         KI-04    WebKB
[14]           0.70     -
[1]            0.75     0.80
[9]            0.84     -
[17]           0.70     -
My approach    0.96     0.98

Comparison • Comparison with other machine learning techniques: • I compared my approach with other categorization techniques implemented in the program Rainbow (http://www.cs.cmu.edu/~mccallum/bow/rainbow/) • I used Rocchio, Naïve Bayes (NB), K Nearest Neighbors (KNN) with K=30, SVM with Fisher kernel and TreeNode because they are widely used in genre categorization of documents.

Comparison To show that the obtained results are really meaningful and not due to chance, I used the 5×2 cross-validation t-test (Dietterich, 1998)

KI-04
Technique    URL    Logical    Hypertext
SVM          <<     ~          <<
ROCCHIO      <<     <          <<
NB           <<     <<         <<<
KNN          <<     <<         <<<
TreeNode     <<<    <<<        <<<

WebKB
Technique    URL    Logical    Hypertext
SVM          ~      <<         ~
ROCCHIO      <<     <          <<
NB           <<     <<         <<<
KNN          <<<    <<         <<<
TreeNode     <<<    <<<        <<<
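
For reference, the statistic of the 5×2 cross-validation paired t-test (Dietterich, 1998) can be computed as in the sketch below. It takes, for each of the five replications, the difference in error rate between two classifiers on each of the two folds. The function name and the numbers in the example are hypothetical; they are not results from this study.

```python
import math


def five_by_two_cv_t(differences):
    """5x2cv paired t statistic.

    differences: five (p1, p2) pairs, where p1 and p2 are the differences in
    error rate between the two classifiers on fold 1 and fold 2 of one
    2-fold replication.
    """
    if len(differences) != 5:
        raise ValueError("expected exactly five replications")
    variances = []
    for p1, p2 in differences:
        mean = (p1 + p2) / 2.0
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    denominator = math.sqrt(sum(variances) / 5.0)
    if denominator == 0:
        raise ValueError("zero variance: the statistic is undefined")
    # The numerator is the difference from the first fold of the first replication.
    return differences[0][0] / denominator


# Hypothetical error-rate differences (competitor minus proposed approach).
t_stat = five_by_two_cv_t([
    (0.10, 0.06), (0.08, 0.12), (0.09, 0.07), (0.11, 0.09), (0.07, 0.09),
])
# Under the null hypothesis the statistic follows a t distribution with
# 5 degrees of freedom; the two-sided 5% critical value is about 2.571.
significant = abs(t_stat) > 2.571
```

With these made-up differences the statistic is roughly 4.77, so the null hypothesis of equal error rates would be rejected at the 5% level.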

Comparison • Time is a very important aspect for comparison. • The following figures compare the time that each classification technique needs for the training and classification phases, for each corpus and each feature.

Comparison Training and testing time for the URL feature on the KI-04 (a) and WebKB (b) corpora. [Figure: panels (a) KI-04 and (b) WebKB]

Comparison Training and testing time for the logical structure on the KI-04 (a) and WebKB (b) corpora. [Figure: panels (a) KI-04 and (b) WebKB]

Comparison Training and testing time for the hypertext structure on the KI-04 (a) and WebKB (b) corpora. [Figure: panels (a) KI-04 and (b) WebKB]

Conclusion • The approach proposed in this paper uses three new features (the URL address, the logical structure and the hypertext structure). • My approach implements three new aspects (refinement, incrementing and combination) which were not explored in previous studies on genre categorization. • The experiments conducted show the usefulness of each aspect in genre categorization. • The comparison with other approaches shows that my approach is the fastest and outperforms many known categorization techniques.

References • Jebari, C., and Ounalli, H. The usefulness of Logical Structure in Flexible Document Categorization. International Journal of Information Technology, 2004, vol. 1, no. 3, pp. 117-121. • Jebari, C. Combining Classifiers for web page genre categorization. In "Towards Genre-Enabled Search Engines: The Impact of NLP", International Workshop held in conjunction with the International Conference on Recent Advances in Natural Language Processing (RANLP07), Borovets, Bulgaria, 2007. • Jebari, C., and Ounelli, H. Genre Categorization of web pages, IEEE Computer Society, 2007. ACM Press. • Shepherd, M., and Watters, C. The functionality attribute of cybergenres. In Proceedings of the 32nd Hawaii International Conference on System Sciences, January 1999, Hawaii.

References • Meyer zu Eissen, S., and Stein, B. Genre Classification of Web Pages: User Study and Feasibility Analysis. In Biundo, S., Frühwirth, T., and Palm, G. (eds.), KI 2004: Advances in Artificial Intelligence, Springer, Berlin-Heidelberg-New York, pp. 256-269, 2004. • Kennedy, A., and Shepherd, M. Automatic Identification of Home Pages. In Proceedings of the 38th Hawaii International Conference on System Sciences, 2005. • Boese, E. S., and Howe, A. E. Effect of web document evolution on genre classification. Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 632-639, 2005. • Santini, M. Automatic identification of genre in web pages. Ph.D. Thesis, University of Brighton, UK, 2007.

References • Kanaris, I., and Stamatatos, E. Webpage Genre Identification Using Variable-length Character n-grams. Proceedings of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence, 2007. • Lertnattee, V., and Theeramunkong, T. Effect of term distributions on centroid-based text categorization. Journal of Information Sciences, 2004, vol. 158, no. 1, pp. 89-115. • Kuncheva, L. I., Bezdek, J. C., and Duin, R. P. W. Decision templates for multiple classifier fusion. Pattern Recognition, 34(2), 2001, pp. 299-314. • Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1998, pp. 1895-1923.

Thank you for your attention
