Final project: Web Page Classification



Presentation Transcript


  1. Final project: Web Page Classification By: Xiaodong Wang, Yanhua Wang, Haitang Wang University of Cincinnati

  2. Content • Problem formulation • Algorithms • Implementation • Results • Discussion and future work

  3. Problem • The World Wide Web can be clustered into different subsets and labeled accordingly; search engine users can then restrict their keyword searches to these specific subsets. • Clustering of web pages can also be used to post-process search results. • Efficient clustering of web pages is therefore important: • Clustering accuracy depends on feature selection and on exploiting web-specific structure • The algorithm must be fast

  4. Web clustering • Clustering is done based on similarity between web pages • Clustering can be done in supervised or unsupervised mode • In this project we focus on unsupervised classification (no sample category labels are provided) and compare the efficiency of algorithms and features for clustering web pages.

  5. Project overview • In this project, a platform for unsupervised clustering is implemented: • The vector space model is used • TFIDF weighting (term frequency-inverse document frequency) • Text, meta information, links, and linked content can be configured as features • Similarity measures: • Cosine similarity • Euclidean similarity • Clustering algorithms: • K-means • HAC (Hierarchical Agglomerative Clustering) • For a given link list, clustering accuracy and algorithm efficiency are compared. • It is implemented in Java and can be extended easily.

  6. User interface

  7. Major functionalities • Web page preprocessing • Downloading • Parsing: link, meta, and text extraction • Filtering of non-content words: stop-word removal and stemming • Collecting terms into a term pool (sketched below) • Clustering
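
To make the preprocessing step concrete, here is a minimal Java sketch of tokenization, stop-word removal, and stemming. The stop-word list and the crude suffix-stripping stemmer are illustrative assumptions; the project's actual parser and stemmer are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocessor {
    // Hypothetical stop-word list; the project's real list is not given in the slides.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of", "to", "in"));

    /** Tokenize extracted page text, drop stop words, stem, and return the term pool. */
    public static List<String> toTerms(String pageText) {
        List<String> terms = new ArrayList<>();
        for (String token : pageText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    // Crude suffix-stripping stand-in for a real stemmer (e.g., Porter).
    private static String stem(String term) {
        if (term.endsWith("ing") && term.length() > 5) return term.substring(0, term.length() - 3);
        if (term.endsWith("s") && term.length() > 3)   return term.substring(0, term.length() - 1);
        return term;
    }
}
```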

  8. Feature selection • First, a naïve approach borrowed from the ranking of query results is used: • All the unique terms (after text extraction and filtering) form the feature terms. That is, if there are 1,000 unique terms in total, the vector dimension will be 1,000 (see the sketch below). • This approach works for small sets of links. • Then all the unique terms appearing as meta information in web pages are used as feature terms. • The dimension can be reduced dramatically: for 30 links, the dimension is 2,384 with the naïve method but drops to 408 when using meta terms. • Hyperlink exploitation • Links in a web page can also be features • The content or meta information of linked web pages can be treated as local content.
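
As an illustration of how the feature dimension arises, a sketch that maps each unique term to a vector dimension; feeding it only meta-tag terms shrinks the dimension exactly as described above. The class and method names are hypothetical, not the project's.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Vocabulary {
    /** Map each unique term across all pages to a vector dimension index. */
    public static Map<String, Integer> build(List<List<String>> termsPerPage) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (List<String> terms : termsPerPage) {
            for (String term : terms) {
                index.putIfAbsent(term, index.size()); // new term gets the next dimension
            }
        }
        return index; // index.size() is the vector dimension
    }
}
```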

  9. TFIDF-based vector space model • TFIDF(i,j) = TF(i,j) × IDF(i) • TF(i,j): the number of times word i occurs in document j • DF(i): the number of documents in which word i occurs at least once • IDF(i) can be calculated from the document frequency: IDF(i) = log(N / DF(i)), where N is the total number of documents
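
A minimal sketch of this TFIDF weighting for one document, assuming the standard IDF(i) = log(N / DF(i)) form reconstructed above; the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Tfidf {
    /** TFIDF weights for one document, given per-term document frequencies over numDocs documents. */
    public static Map<String, Double> weigh(List<String> docTerms,
                                            Map<String, Integer> df, int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : docTerms) {
            tf.merge(term, 1, Integer::sum); // TF(i,j): occurrences of term i in document j
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Assumes df contains every term seen in the collection.
            double idf = Math.log((double) numDocs / df.get(e.getKey())); // IDF(i) = log(N / DF(i))
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}
```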

  10. Similarity measure • Euclidean similarity: given the vector space defined by all terms, compute the Euclidean distance between each pair of documents, then take the reciprocal. • Cosine similarity = numerator / denominator • Numerator: the inner product of the two vectors • Denominator: the product of the Euclidean lengths of the two document vectors
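
Both measures are straightforward to compute over TFIDF vectors, as in the sketch below. The slide only says the reciprocal of the Euclidean distance is taken; the 1 + distance in the denominator here is an added assumption to avoid division by zero for identical documents.

```java
public class Similarity {
    /** Cosine similarity: inner product over the product of Euclidean lengths. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean similarity: reciprocal of the distance (1 + distance avoids division by zero). */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }
}
```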

  11. Cluster algorithms: Hierarchical Agglomerative Clustering (HAC) • It starts with each document in its own cluster and successively merges the most similar clusters into groups within which inter-document similarity is high
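
A minimal sketch of HAC over document vectors, reusing the cosine similarity above. The slides do not say which linkage criterion the project used; single-link is assumed here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class Hac {
    /** Merge the two most similar clusters until k clusters remain. */
    public static List<List<double[]>> cluster(List<double[]> docs, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] d : docs) {                             // start: one cluster per document
            List<double[]> c = new ArrayList<>();
            c.add(d);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {       // O(n^2) pairwise scan
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = linkSimilarity(clusters.get(i), clusters.get(j));
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            }
            clusters.get(bi).addAll(clusters.remove(bj));     // merge the most similar pair
        }
        return clusters;
    }

    // Single-link: similarity of the most similar document pair across the two clusters.
    private static double linkSimilarity(List<double[]> a, List<double[]> b) {
        double best = Double.NEGATIVE_INFINITY;
        for (double[] x : a)
            for (double[] y : b)
                best = Math.max(best, Similarity.cosine(x, y));
        return best;
    }
}
```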

  12. Cluster algorithms: K-means • K-means clustering is a nonhierarchical method • The final required number of clusters is chosen in advance • Each component in the population is examined and assigned to the cluster at minimum distance • The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters
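
A sketch of K-means over document vectors, reusing the cosine similarity above. Note that the slide describes an incremental variant that updates the centroid after each assignment; the batch (Lloyd-style) form shown here recomputes centroids once per round. The random seeding is an illustrative choice.

```java
import java.util.Random;

public class KMeans {
    /** Assign each document to its nearest centroid and recompute centroids until stable. */
    public static int[] cluster(double[][] docs, int k, int maxRounds) {
        int n = docs.length, dim = docs[0].length;
        double[][] centroids = new double[k][];
        Random rnd = new Random(42);
        for (int c = 0; c < k; c++) {                 // seed centroids with random documents
            centroids[c] = docs[rnd.nextInt(n)].clone();
        }
        int[] assign = new int[n];
        for (int round = 0; round < maxRounds; round++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {             // assign to the most similar centroid
                int best = 0;
                double bestSim = Double.NEGATIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double s = Similarity.cosine(docs[i], centroids[c]);
                    if (s > bestSim) { bestSim = s; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                      // converged: no assignment moved
            for (int c = 0; c < k; c++) {             // recompute each centroid as the cluster mean
                double[] mean = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assign[i] == c) {
                        for (int d = 0; d < dim; d++) mean[d] += docs[i][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < dim; d++) mean[d] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return assign;
    }
}
```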

  13. Complexity analysis • HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²). • In K-means, each round compares the n documents against the k centroids, which takes O(kn) time per round and is more efficient than O(n²) HAC. For example, with n = 1,000 documents and k = 5 clusters, HAC computes about 500,000 pairwise similarities, while one K-means round computes only 5,000 document-centroid distances. • In our experiments, however, we found that the clustering results of HAC made more sense than those of K-means

  14. Conclusion • Unique features of web pages should be exploited • Links and meta information • HAC is better than K-means in clustering accuracy. • Correct and robust parsing of web pages is important for web page clustering • Our parser doesn’t work well on all web pages tested. • The overall performance of our implementation is not satisfactory • The dimension is still large • The space requirement is high • Parsing accuracy is limited, and some pages don’t have meta information
