A Combined Approach for Classification of Web Results based on Ranking & Clustering

A Combined Approach for Classification of Web Results based on Ranking & Clustering. Presented by: Apeksha Khabia Guided by: Dr. M. B. Chandak. Contents. Introduction Page rank, weighted page rank Document clustering Algorithm for clustering and ranking Conclusion Future scope.

Presentation Transcript


  1. A Combined Approach for Classification of Web Results based on Ranking & Clustering Presented by: Apeksha Khabia Guided by: Dr. M. B. Chandak

  2. Contents • Introduction • Page rank, weighted page rank • Document clustering • Algorithm for clustering and ranking • Conclusion • Future scope

  3. Introduction • The web is the most valuable place for information retrieval and knowledge discovery • Retrieving information through search-engine queries is tedious • The solution is web mining • web content mining • web structure mining • web usage mining

  4. How to Generate Web Results?

  5. Page Rank (PR) • Order the search results so that important documents move up the list and less important ones move down • If a page has important incoming links, then its outgoing links also become important

  6. Page Rank (PR) • The rank score of a page p is divided evenly among its outgoing links • PR is modified in view of the random surfer model: not all users follow direct links on the WWW
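
The formula itself does not appear in the slide text. Assuming it is the standard PageRank definition under the random surfer model (an assumption, although it matches the form of the worked example on slide 7), it would read:

```latex
PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}
```

Here $B(u)$ is the set of pages that link to $u$, $N_v$ is the number of outgoing links of page $v$, and $d$ is the damping factor, i.e. the probability that the surfer follows a link rather than jumping to another page.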

  7. Example of Page Rank (PR) • PR(A) = (1-d) + d(PR(B)/2 + PR(C)/2) • PR(B) = (1-d) + d(PR(A)/1 + PR(C)/2) • PR(C) = (1-d) + d(PR(B)/2) • If d = 0.5: PR(A) = 1.2, PR(B) = 1.2, PR(C) = 0.8 • Table: Iterative method of page rank
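
The iterative method named on this slide can be sketched in Python. The link structure below (A→B, B→A, B→C, C→A, C→B) is inferred from the denominators in the three equations and is an assumption, as are the function and variable names. Note that solving the equations exactly as written with d = 0.5 gives PR(A) = 1.0 rather than the quoted 1.2, while PR(B) = 1.2 and PR(C) = 0.8 agree, so either an equation or the quoted value may have been altered in transcription.

```python
def pagerank(out_links, d=0.5, iterations=100):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v) over pages v linking to u."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}  # initial guess; the iteration converges regardless
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Each page q that links to p passes on an equal share of its rank.
            incoming = sum(pr[q] / len(out_links[q])
                           for q in pages if p in out_links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Assumed graph, read off the equations on the slide.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A", "B"]}
ranks = pagerank(links, d=0.5)
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```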

  8. Weighted Page Rank (WPR) • Assigns larger rank values to more important pages instead of dividing a page's rank evenly among its outgoing links • Each outlinked page receives a share of the rank according to its popularity
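
A minimal sketch of this idea, assuming the commonly published WPR formulation in which popularity is measured by in-link and out-link counts: WPR(u) = (1-d) + d · Σ WPR(v) · W_in(v,u) · W_out(v,u). The slide does not give the formula, so this form, the sample graph, and all names are assumptions.

```python
def weighted_pagerank(out_links, d=0.5, iterations=100):
    """Sketch of WPR: a page's rank is shared among its outlinks by popularity.

    Assumes every linked-to page has at least one outlink, so that the
    W_out denominator is never zero.
    """
    pages = list(out_links)
    in_count = {p: sum(p in out_links[q] for q in pages) for p in pages}
    out_count = {p: len(out_links[p]) for p in pages}

    def w_in(v, u):
        # Share of u's in-links among the in-links of all pages v points to.
        return in_count[u] / sum(in_count[p] for p in out_links[v])

    def w_out(v, u):
        # Share of u's out-links among the out-links of all pages v points to.
        return out_count[u] / sum(out_count[p] for p in out_links[v])

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            u: (1 - d) + d * sum(rank[v] * w_in(v, u) * w_out(v, u)
                                 for v in pages if u in out_links[v])
            for u in pages
        }
    return rank


# Same assumed three-page graph as in the PageRank example.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A", "B"]}
wpr_ranks = weighted_pagerank(links)
for page, score in sorted(wpr_ranks.items()):
    print(page, round(score, 3))
```

Unlike plain PageRank, the share a page passes to each outlink here depends on how popular the target is relative to its siblings, not on an even split.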

  9. Document Clustering • Used for automatic document organization, topic extraction, and fast information retrieval or filtering • Documents are grouped based on a measure of similarity of their content or of their hyperlink structure • Clustering divides the results of a search for "cell" into groups like "biology," "battery," and "prison."

  10. Document Clustering • Examples: K-means, hierarchical • Clustering may be based on content alone, on links alone, or on both content and links • Two ways to define content-based similarity between documents • Resemblance • Containment
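
The slides do not define resemblance and containment. A minimal sketch, assuming the usual shingle-based definitions (resemblance is the overlap of the two documents' shingle sets divided by their union, containment is the overlap divided by the first document's shingle set); the shingle width and the sample strings are illustrative only.

```python
def shingles(text, w=2):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=2):
    """|S(a) & S(b)| / |S(a) | S(b)|: symmetric, 1.0 for identical shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=2):
    """|S(a) & S(b)| / |S(a)|: how much of a is contained in b (asymmetric)."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


doc_a = "the cell is the basic unit of life"
doc_b = "the cell is the basic unit"
print(resemblance(doc_a, doc_b))   # 5 shared shingles out of 7 in the union
print(containment(doc_b, doc_a))   # every shingle of doc_b appears in doc_a
```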

  11. Limitations of Ranking Approach • Ranking approaches emphasize the links of the result pages • No algorithm exists to combine the link score and content score of a page into a single score • Existing approaches return millions of documents in an ordered list • Rank-based approaches give equal emphasis to the inlinks and outlinks of pages

  12. Introduction of Combined Approach • This mechanism takes advantage of the greater importance of inlinks over outlinks • With it, the user can organize search results into a hierarchy of query-related clusters • The documents in each cluster can also be ranked so that they are presented according to their relevance • Such organization enables users to effectively limit their search area

  13. Algorithm for Clustering and Ranking

  14. Algorithm: Steps • Step 1: Get the URLs of the pages • Step 2: Assign a similarity value sim(q, p) to each returned document • Step 3: Use sim(q, p) to cluster the documents • Step 4: Assign a rank score WSR(p) to the documents of each cluster

  15. Output • Clusters of web documents are formed based on similarity • The documents in each cluster are also ranked so that they are presented according to their relevance

  16. Similarity Calculation between Web Pages • Similarity of the document with the query means: • which query terms are present in the document? • where are they present? • how many times? • Calculated as the cosine between the vector of query terms and the vector of the document
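
The cosine calculation can be sketched as follows. This minimal version assumes plain whitespace tokenization and raw term counts as vector weights; it captures which terms occur and how often, but not where they occur, which the slides mention without giving a weighting scheme. All names are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between term-count vectors of query and document."""
    q_vec = Counter(query.lower().split())
    d_vec = Counter(document.lower().split())
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q_vec[term] * d_vec[term] for term in q_vec)
    norm_q = math.sqrt(sum(c * c for c in q_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in d_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


query = "cell battery"
print(cosine_similarity(query, "a battery cell stores energy in a cell"))
print(cosine_similarity(query, "prison reform"))
```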

  17. Rank Calculation of Web pages • WSR: Weight and Similarity based Rank • Back-links contribute more to the importance of a page than forward links do • WSR therefore gives more importance to the inlinks of a page • Importance of the backlink page v of a page u, given by

  18. Rank Calculation of Web pages • Redefined formula for rank is given by

  19. Clustering of Web pages • The clustering is purely based on the similarity values of the pages with respect to the user query • The number of clusters is not predefined • The maximum number of pages that can be in a cluster should be decided

  20. Clustering of Web pages • The lower and upper similarity values are identified from the range of similarity values • The complete page set is divided into a number of sets according to the similarity values lying within the partitioned ranges • Rank(p) = WSR(p) + sim(q, p)
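
One possible reading of these clustering slides, sketched under stated assumptions: pages are sorted by sim(q, p) and cut into consecutive similarity ranges holding at most `max_pages` pages each, so the number of clusters is not fixed in advance; within each cluster, pages are ordered by Rank(p) = WSR(p) + sim(q, p). The slides do not say how the ranges are partitioned, and the WSR and similarity values below are made-up inputs, not results.

```python
def cluster_and_rank(pages, max_pages=2):
    """pages maps a page name to (sim, wsr); returns a list of ranked clusters."""
    # Order by similarity so each chunk covers one contiguous similarity range.
    ordered = sorted(pages.items(), key=lambda kv: kv[1][0], reverse=True)
    clusters = []
    for start in range(0, len(ordered), max_pages):
        chunk = ordered[start:start + max_pages]
        # Final score inside a cluster: Rank(p) = WSR(p) + sim(q, p).
        ranked = sorted(((name, wsr + sim) for name, (sim, wsr) in chunk),
                        key=lambda item: item[1], reverse=True)
        clusters.append(ranked)
    return clusters


# Hypothetical (sim, WSR) pairs for five result pages.
pages = {
    "p1": (0.9, 0.2),
    "p2": (0.8, 0.6),
    "p3": (0.5, 0.1),
    "p4": (0.4, 0.3),
    "p5": (0.1, 0.5),
}
clusters = cluster_and_rank(pages, max_pages=2)
for i, cluster in enumerate(clusters, start=1):
    print(i, [name for name, _ in cluster])
```

In this sketch p2 outranks p1 inside the first cluster even though p1 is more similar to the query, because its higher link-based score dominates the sum.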

  21. Applications of Clustering and Ranking • Readability assessment - automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system • Genre classification - automatically determining the genre of a text

  22. Tools Available • Weka • Rapid Miner • KNIME • Orange

  23. Conclusion • Ranking and clustering together give a way to organize search results into clusters; the pages in each cluster are further ranked so that the most relevant and important pages appear at the top of the cluster • The user's search space shrinks, so the required content can be found in less time

  24. Future Ideas • Different data mining tools (KNIME, Rapid Miner, Weka) can be used to analyze the result for classification • Search query results for classification can be incorporated from multiple search engines

  25. References • Parneet Kaur, Sawtantar Singh Khurmi and Gurpreet Singh Josan, "Analysis for Classification of Similar Documents among Various Websites using Rapid Miner". In the Proceedings of IEEE International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), 2014. • Neelam Duhan and A. K. Sharma, "A Novel Approach for Organizing Web Search Results using Ranking and Clustering". International Journal of Computer Applications, vol. 5, no. 10, pp. 1-9, August 2010. • O. Zamir, O. Etzioni, "Web document clustering: A feasibility demonstration". Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pp. 46-54, 1998.

  26. References • Miguel Gomes da Costa Júnior, Zhiguo Gong, "Web Structure Mining: An Introduction". Proceedings of the IEEE International Conference on Information Acquisition, 2005, China. • Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk, "Evaluating strategies for similarity search on the Web". WWW2002, May 2002, Honolulu, Hawaii, USA. ACM 1-58113-449-5/02/0005.

  27. Thank You!!
