Final project: Web Page Classification



Presentation Transcript


  1. Final project: Web Page Classification By: Xiaodong Wang, Yanhua Wang, Haitang Wang University of Cincinnati

  2. Content • Problem formulation • Algorithms • Implementation • Results • Discussion and future work

  3. Problem • The World Wide Web can be clustered into different subsets and labeled accordingly; search engine users can then restrict their keyword searches to these specific subsets. • Clustering of web pages can also be used to post-process search results. • Efficient clustering of web pages is therefore important: • Clustering accuracy depends on feature selection and on exploiting web-specific structure • The algorithm must be fast

  4. Web clustering • Clustering is done based on similarity between web pages • Clustering can be done in supervised or unsupervised mode • In this project we focus on unsupervised classification (no sample category labels are provided) and compare the efficiency of algorithms and features for clustering web pages.

  5. Project overview • In this project, a platform for unsupervised clustering is implemented: • The vector space model is used • TFIDF weighting (term frequency-inverse document frequency) • Text, meta information, links, and linked content can be configured as features • Similarity measures: • Cosine similarity • Euclidean similarity • Clustering algorithms: • K-means • HAC (Hierarchical Agglomerative Clustering) • For a given link list, clustering accuracy and algorithm efficiency are compared. • It is implemented in Java and can be extended easily.

  6. User interface

  7. Major functionalities • Web page preprocessing • Downloading • Parsing: link, meta, and text extraction • Filtering of non-content words: stop-word removal and stemming • Collecting terms into a term pool (sketched below) • Clustering
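
To make the preprocessing step concrete, here is a minimal Java sketch of tokenization, stop-word removal, and stemming. The stop-word list and the crude suffix-stripping stemmer are illustrative assumptions; the project's actual parser and stemmer are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocessor {
    // Hypothetical stop-word list; the project's real list is not given in the slides.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of", "to", "in"));

    /** Tokenize extracted page text, drop stop words, stem, and return the term pool. */
    public static List<String> toTerms(String pageText) {
        List<String> terms = new ArrayList<>();
        for (String token : pageText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    // Crude suffix-stripping stand-in for a real stemmer (e.g., Porter).
    private static String stem(String term) {
        if (term.endsWith("ing") && term.length() > 5) return term.substring(0, term.length() - 3);
        if (term.endsWith("s") && term.length() > 3)   return term.substring(0, term.length() - 1);
        return term;
    }
}
```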

  8. Feature selection • First, a naïve approach borrowed from the ranking of query results is used: • All the unique terms (after text extraction and filtering) form the feature terms. That is, if there are 1,000 unique terms in total, the vector dimension will be 1,000 (see the sketch below). • This approach works for small sets of links. • Then all the unique terms appearing as meta information in web pages are used as feature terms. • The dimension can be reduced dramatically: for 30 links, the dimension is 2,384 with the naïve method but drops to 408 when using meta terms. • Hyperlink exploitation • Links in a web page can also be features • The content or meta information of linked web pages can be treated as local content.
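
As an illustration of how the feature dimension arises, a sketch that maps each unique term to a vector dimension; feeding it only meta-tag terms shrinks the dimension exactly as described above. The class and method names are hypothetical, not the project's.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Vocabulary {
    /** Map each unique term across all pages to a vector dimension index. */
    public static Map<String, Integer> build(List<List<String>> termsPerPage) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (List<String> terms : termsPerPage) {
            for (String term : terms) {
                index.putIfAbsent(term, index.size()); // new term gets the next dimension
            }
        }
        return index; // index.size() is the vector dimension
    }
}
```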

  9. TFIDF-based vector space model • TFIDF(i,j) = TF(i,j) × IDF(i) • TF(i,j): the number of times word i occurs in document j • DF(i): the number of documents in which word i occurs at least once • IDF(i) can be calculated from the document frequency: IDF(i) = log(N / DF(i)), where N is the total number of documents
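
A minimal sketch of this TFIDF weighting for one document, assuming the standard IDF(i) = log(N / DF(i)) form reconstructed above; the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Tfidf {
    /** TFIDF weights for one document, given per-term document frequencies over numDocs documents. */
    public static Map<String, Double> weigh(List<String> docTerms,
                                            Map<String, Integer> df, int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : docTerms) {
            tf.merge(term, 1, Integer::sum); // TF(i,j): occurrences of term i in document j
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Assumes df contains every term seen in the collection.
            double idf = Math.log((double) numDocs / df.get(e.getKey())); // IDF(i) = log(N / DF(i))
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}
```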

  10. Similarity measure • Euclidean similarity: given the vector space defined by all terms, compute the Euclidean distance between each pair of documents, then take the reciprocal. • Cosine similarity = numerator / denominator • Numerator: the inner product of the two vectors • Denominator: the product of the Euclidean lengths of the two document vectors
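
Both measures are straightforward to compute over TFIDF vectors, as in the sketch below. The slide only says the reciprocal of the Euclidean distance is taken; the 1 + distance in the denominator here is an added assumption to avoid division by zero for identical documents.

```java
public class Similarity {
    /** Cosine similarity: inner product over the product of Euclidean lengths. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean similarity: reciprocal of the distance (1 + distance avoids division by zero). */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }
}
```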

  11. Cluster algorithms: Hierarchical Agglomerative Clustering (HAC) • It starts with each document in its own cluster and successively merges the most similar clusters into groups within which inter-document similarity is high
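
A minimal sketch of HAC over document vectors, reusing the cosine similarity above. The slides do not say which linkage criterion the project used; single-link is assumed here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class Hac {
    /** Merge the two most similar clusters until k clusters remain. */
    public static List<List<double[]>> cluster(List<double[]> docs, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] d : docs) {                             // start: one cluster per document
            List<double[]> c = new ArrayList<>();
            c.add(d);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {       // O(n^2) pairwise scan
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = linkSimilarity(clusters.get(i), clusters.get(j));
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            }
            clusters.get(bi).addAll(clusters.remove(bj));     // merge the most similar pair
        }
        return clusters;
    }

    // Single-link: similarity of the most similar document pair across the two clusters.
    private static double linkSimilarity(List<double[]> a, List<double[]> b) {
        double best = Double.NEGATIVE_INFINITY;
        for (double[] x : a)
            for (double[] y : b)
                best = Math.max(best, Similarity.cosine(x, y));
        return best;
    }
}
```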

  12. Cluster algorithms: K-means • K-means clustering is a nonhierarchical method • The final required number of clusters is chosen in advance • Each component in the population is examined and assigned to the cluster at minimum distance • The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters
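
A sketch of K-means over document vectors, reusing the cosine similarity above. Note that the slide describes an incremental variant that updates the centroid after each assignment; the batch (Lloyd-style) form shown here recomputes centroids once per round. The random seeding is an illustrative choice.

```java
import java.util.Random;

public class KMeans {
    /** Assign each document to its nearest centroid and recompute centroids until stable. */
    public static int[] cluster(double[][] docs, int k, int maxRounds) {
        int n = docs.length, dim = docs[0].length;
        double[][] centroids = new double[k][];
        Random rnd = new Random(42);
        for (int c = 0; c < k; c++) {                 // seed centroids with random documents
            centroids[c] = docs[rnd.nextInt(n)].clone();
        }
        int[] assign = new int[n];
        for (int round = 0; round < maxRounds; round++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {             // assign to the most similar centroid
                int best = 0;
                double bestSim = Double.NEGATIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double s = Similarity.cosine(docs[i], centroids[c]);
                    if (s > bestSim) { bestSim = s; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                      // converged: no assignment moved
            for (int c = 0; c < k; c++) {             // recompute each centroid as the cluster mean
                double[] mean = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assign[i] == c) {
                        for (int d = 0; d < dim; d++) mean[d] += docs[i][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < dim; d++) mean[d] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return assign;
    }
}
```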

  13. Complexity analysis • HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²). • In K-means, each round compares the n documents against the k centroids, which takes O(kn) time per round and is more efficient than O(n²) HAC. For example, with n = 1,000 documents and k = 5 clusters, HAC computes about 500,000 pairwise similarities, while one K-means round computes only 5,000 document-centroid distances. • In our experiments, however, we found that the clustering results of HAC made more sense than those of K-means

  14. Conclusion • Unique features of web pages should be exploited • Links and meta information • HAC is better than K-means in clustering accuracy. • Correct and robust parsing of web pages is important for web page clustering • Our parser doesn’t work well on all web pages tested. • The overall performance of our implementation is not satisfactory • The dimension is still large • The space requirement is high • Parsing accuracy is limited, and some pages don’t have meta information
