Using PageRank and Naïve Bayes Models for Extracting Data on Good Web Page Design • Archit Baweja, Daniel Moyer, Doug Traher
Introduction • The web is no longer a set of interconnected text-only pages. • Presentation of content is equally important. • Web 2.0 • Good page design is subjective • Re-phrase • Can PageRank be used to classify web pages for a given category? • The present solution requires a domain expert • What if we could extract domain knowledge using stochastic techniques? • The World Wide Web is our pool of domain knowledge
Related Work • Studies on inferences from PageRank or similar ranking algorithms • People’s attention in the blogosphere [4] • LinkedIn connections to find powerful people [5] • CodeRank for software metrics [6] • Visual impact of political websites on user trust • Reed and Groth analyze the visual impact of political websites [3] • McKnight et al. on the impact of initial consumer trust on intentions to transact with a website [7]
Background • Google’s PageRank [1] • Naïve Bayes Model [2]
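As a reminder of what PageRank computes, here is a minimal power-iteration sketch. It is not Google's production ranking and not the authors' code; the function name `pagerank`, the toy graph `toy_web`, and the damping factor of 0.85 (the value commonly quoted for the original formulation) are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every other page splits its rank evenly among the pages it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Pages that collect links from many (or highly ranked) pages end up with the largest share of rank, which is the ordering signal the approach relies on.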
Approach • Use the Google API to sort web pages for a given subject • Extract web page features • Train a Naïve Bayes classifier • The classifier helps answer various questions • Does a given web page belong to a given class? • What features should my web page have to belong to a class of web pages?
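The training and query steps above can be sketched with a small Naïve Bayes classifier. This is an illustration under stated assumptions, not the authors' implementation: the multinomial model with add-one smoothing, the class name `NaiveBayes`, the feature tokens, and the labels `class_a` and `class_b` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over feature tokens."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for features, label in zip(samples, labels):
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        return self

    def log_scores(self, features):
        """Log prior plus summed log likelihoods for every known class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)

    def top_features(self, label, k=2):
        """Features seen most often in pages of the given class."""
        return [f for f, _ in self.feature_counts[label].most_common(k)]


# Hypothetical training pages, each reduced to a list of feature tokens.
pages = [
    ["serif_font", "dark_background", "sidebar"],
    ["serif_font", "sidebar"],
    ["sans_font", "light_background", "top_menu"],
    ["sans_font", "top_menu"],
]
labels = ["class_a", "class_a", "class_b", "class_b"]

model = NaiveBayes().fit(pages, labels)
prediction = model.predict(["sans_font", "light_background"])
typical = model.top_features("class_a")
print(prediction, typical)
```

`predict` addresses the first question (does a page belong to a class), while `top_features` is one simple way to approach the second (which features are typical of a class).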
Evaluation • Experiment details • We used our approach to classify political websites • Colors as the basis of classification, set • Reasons • Fits well with the Naïve Bayes model of classification. • Colors are an easy parameter for political websites in the United States of America. • Results Discussion • Promising. • See final report.
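The slides do not say how colors were extracted, so the following is only one possible sketch: it scans page source for six-digit hex color codes and buckets them into coarse color names that could serve as classifier features. The regular expression, the thresholds, the bucket names, and the function names `coarse_color` and `color_features` are all assumptions; three-digit hex codes, named colors, and images are ignored.

```python
import re
from collections import Counter

# Matches six-digit hex colors such as #B22234 (assumed input format).
HEX_COLOR = re.compile(r"#([0-9a-fA-F]{6})\b")


def coarse_color(hex_digits):
    """Map a six-digit hex string to a coarse color name (assumed buckets)."""
    r, g, b = (int(hex_digits[i:i + 2], 16) for i in (0, 2, 4))
    if max(r, g, b) - min(r, g, b) < 40:
        # Nearly equal channels: treat as a shade of gray.
        return "white" if r > 200 else "black" if r < 60 else "gray"
    if r >= g and r >= b:
        return "red"
    if b >= r and b >= g:
        return "blue"
    return "green"


def color_features(page_source):
    """Count coarse colors found in inline styles or CSS text."""
    return Counter(coarse_color(h) for h in HEX_COLOR.findall(page_source))


# Hypothetical page fragment.
sample = (
    '<body style="background:#FFFFFF">'
    '<h1 style="color:#B22234">Vote</h1>'
    '<a style="color:#3C3B6E">Donate</a></body>'
)
features = color_features(sample)
print(features)
```

Counts like these can be fed to a Naïve Bayes classifier as discrete features, which is one reason color fits that model well.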
Conclusions • Approach is viable • Basis for extracting domain knowledge when there is a lack of domain experts • Limitations of technologies used • Rank source is inferred by Google • Independence assumption of the Naïve Bayes model
Future Work • Need more experiments • Newer features of web pages: images, Flash, widgets • Use a more controlled PageRank implementation • Experiment with other classification techniques (Bayesian networks, neural networks, Markov decision processes) • Apply to other fields • Basis for extracting domain knowledge when there is a lack of domain experts
References • [1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999. • [2] S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (2nd Edition). Prentice Hall, December 2002. • [3] K.N. Reed and D.P. Groth. Looking good on the web: evaluating the visual impact of political websites. In CHI ’08 Extended Abstracts on Human Factors in Computing Systems, pages 3753-3758, New York, NY, USA, 2008. ACM. • [4] L. Kirchhoff, A. Bruns, and T. Nicolai. Investigating the impact of the blogosphere: Using PageRank to determine the distribution of attention. Association of Internet Researchers, 2007. • [5] F. van Puffelen. Using PageRank to determine the most powerful people on LinkedIn. http://frank.vanpuffelen.net/2008/07/using-pagerank-to-determine-most.html • [6] B. Neate, W. Irwin, and N. Churcher. CodeRank: A new family of software metrics. In Australian Software Engineering Conference, April 2006. • [7] D.H. McKnight, V. Choudhury, and C. Kacmar. The impact of initial consumer trust on intentions to transact with a web site: A trust building model. Journal of Strategic Information Systems, 2002.