Artface (Automated reorganization to fit approximate client expectations)
Mike Venzke
Artface Goals
• Provide a method for determining the approximate expectation of a web client
• Examine the feasibility of using this information in an automated manner
Description
• Using Open Directory categories, create a model for classifying web pages.
• Fetch, parse, and classify the referring page of each local web hit.
• The result is the approximate expectation people have when they arrive at different parts of your website.
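The first step above, finding the referring page of each local hit, can be sketched as follows. This is a minimal illustration, assuming the web logs are in Apache's Combined Log Format; the slides do not state the log format, and the regular expression, the function name `referrer_hits`, and the sample lines are all hypothetical.

```python
import re

# Assumed layout (Combined Log Format):
#   host ident user [time] "METHOD path protocol" status bytes "referrer" "agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)

def referrer_hits(lines):
    """Yield (local_path, referrer_url) for every hit that carries a referrer."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a line in the assumed format
        referrer = match.group("referrer")
        if referrer and referrer != "-":  # "-" means no referrer was sent
            yield match.group("path"), referrer

# Made-up sample lines, for illustration only.
sample = [
    '1.2.3.4 - - [10/Oct/2004:13:55:36 -0400] "GET /courses/ai.html HTTP/1.0" '
    '200 2326 "http://example.org/links.html" "Mozilla/4.0"',
    '1.2.3.5 - - [10/Oct/2004:13:56:01 -0400] "GET /index.html HTTP/1.0" '
    '200 512 "-" "Mozilla/4.0"',
]
hits = list(referrer_hits(sample))
```

Each resulting pair links a local page to the external page that would then be fetched and classified.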
Classification Categories
• Used DMOZ categories
  • Pages are already classified, which provides good training data.
• Went 3 levels deep in the directory
  • Wanted the approximate expectation, not categories so specific that very similar items are considered different.
  • Time and other constraints
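Cutting a directory path down to three levels might look like the sketch below. It assumes categories are written as slash-separated paths with an optional leading `Top` component that is not counted as a level; the slides do not say how levels were counted, and `truncate_category` is a hypothetical helper.

```python
def truncate_category(path, depth=3):
    """Keep only the first `depth` levels of a slash-separated category path."""
    parts = [part for part in path.split("/") if part]
    # Assumption: a leading "Top" is the directory root, not a level of its own.
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return "/".join(parts[:depth])

label = truncate_category("Top/Computers/Programming/Languages/Python")
```

Under these assumptions, every page filed anywhere below `Computers/Programming/Languages` shares one training label, which is what keeps very similar items from being treated as different classes.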
Page Fetching
• Used the SGMLParser class from Python's sgmllib module
  • Good at filtering out irrelevant data
  • Fast enough
  • Easy to use
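The parsing step can be sketched as below. Note the swap: `sgmllib` was removed in Python 3, so this example uses the standard library's `html.parser.HTMLParser` instead of SGMLParser. Treating script and style contents as the "irrelevant data" to drop is an assumption, and `TextExtractor` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text from a page, ignoring markup and script/style bodies."""

    SKIP = {"script", "style"}  # assumption: these count as irrelevant data

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    """Return the words of a page as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><head><title>Linen</title><style>p{color:red}</style></head>"
        "<body><p>Fine <b>cotton</b> sheets</p><script>var x = 1;</script>"
        "</body></html>")
text = visible_text(page)
```

The extracted string is the kind of input a text classifier would receive in place of raw HTML.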
Classification
• Rainbow: LGPL’d naïve Bayes text classifier
• Used ~9000 documents as training data, with the expanded category as the class label.
• ~7000 test pages taken from the web logs of www.cs.rpi.edu and www.linenplace.com
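To illustrate the technique Rainbow applies, here is a minimal multinomial naïve Bayes classifier with add-one smoothing. It is not Rainbow's code or interface (Rainbow is a separate command-line tool); the class name, the whitespace tokenization, and the two training documents with their category labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best

# Hypothetical training documents and labels, for illustration only.
model = NaiveBayes()
model.train("Computers/Programming/Languages", "python java compiler code")
model.train("Shopping/Home/Linens", "linen sheets towels cotton")
guess = model.classify("cotton sheets sale")
```

Working in log space avoids numeric underflow on long pages, and the smoothing keeps a word unseen in one category from ruling that category out entirely.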
Data Results
• Fairly accurate results
• http://webgraph.canbelearned.com
Automation Possibilities
• Determine ‘good’ categories by self-site classification or user input
• Track traffic from ‘good’ categories and provide higher-level links to local pages.
• Set of bad categories is small and generally universal.
• Take action against local sites based on how they’re being used, not what they have.
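One way to read "track traffic from ‘good’ categories" is to count, per local page, the hits whose referrer was classified into a good category, then surface the most visited pages as candidates for higher-level links. The sketch below assumes the hits have already been classified; `pages_to_promote` and the sample data are hypothetical.

```python
from collections import Counter

def pages_to_promote(classified_hits, good_categories, top_n=3):
    """Rank local pages by the number of hits arriving from good categories.

    classified_hits: iterable of (local_path, referrer_category) pairs.
    """
    counts = Counter(
        page for page, category in classified_hits
        if category in good_categories
    )
    return [page for page, _ in counts.most_common(top_n)]

# Made-up classified hits, for illustration only.
hits = [
    ("/research/ml.html", "Computers/Artificial_Intelligence/Machine_Learning"),
    ("/courses/ai.html", "Computers/Artificial_Intelligence/Machine_Learning"),
    ("/research/ml.html", "Computers/Artificial_Intelligence/Machine_Learning"),
    ("/photos/party.html", "Recreation/Humor/Pictures"),
]
good = {"Computers/Artificial_Intelligence/Machine_Learning"}
promote = pages_to_promote(hits, good)
```

The same counts, restricted to a set of bad categories instead, would identify pages that are being used in unwanted ways.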
Automation Possibilities (contd)
• Provide custom pages based on what the user expected, rather than what the page contains.
  • User may not have found what they wanted.
  • User may be interested in a broader topic.
Process Enhancement Ideas
• More training data
  • Use all levels of DMOZ data, but push classification up to threshold level.
• Handle more page errors
  • Scripting, authentication errors provide false data.
• Remove or special-parse ‘classless’ information pages
  • Search engines
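One possible reading of "push classification up to threshold level" is to train on categories of any depth, then add each deep category's probability to its ancestor at the chosen depth and pick the best ancestor. The sketch below assumes the classifier returns a probability per category; `roll_up` and the category names are hypothetical.

```python
from collections import defaultdict

def roll_up(deep_scores, depth=3):
    """Sum per-category probabilities into their ancestors at `depth`.

    deep_scores: dict mapping a slash-separated category path to a probability.
    Returns (best_ancestor, totals_by_ancestor).
    """
    totals = defaultdict(float)
    for category, probability in deep_scores.items():
        ancestor = "/".join(category.split("/")[:depth])
        totals[ancestor] += probability
    best = max(totals, key=totals.get)
    return best, dict(totals)

# Made-up classifier output, for illustration only.
scores = {
    "Science/Math/Algebra/Groups": 0.3,
    "Science/Math/Algebra/Rings": 0.3,
    "Science/Math/Geometry": 0.4,
}
best, totals = roll_up(scores)
```

In this made-up case the deepest single winner is `Science/Math/Geometry`, but two sibling subcategories together outweigh it once rolled up, which is the kind of situation this enhancement would address.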