
Automatic Classification of Bookmarked Web Pages

This project focuses on automating the classification of bookmarked web pages using clustering algorithms, improving the user interface, and evaluating the performance of the system. The aim is to offer users a recommended category for each bookmark, allow them to select a different category, and let them create new categories if needed. The evaluation includes categorising random bookmarks and reconstructing the bookmark file in the order it was created by the user.


Presentation Transcript


  1. Automatic Classification of Bookmarked Web Pages. Chris Staff, Third Talk, February 2007

  2. Tasks • Representation of bookmark categories • Two clustering/similarity algorithms • Extra utility • User interface • Evaluation • Write up the report

  3. Overview • User Interface • To replace the built-in ‘Bookmark this Page’ menu item and keyboard command • To display a new dialog box that offers users the recommended category and the last category used, and allows the user to select some other category or create a new category

  4. Overview • Extra Utility: How can the classification of web pages to be bookmarked be improved? • What particular interests do you have, and how can they be used to improve classification? • E.g., synonym detection, automatic reorganisation of bookmarks, improved interface, …

  5. Overview • Evaluation • Will be standard and automated • For testing purposes, download test_eval.zip from the home page • Contains 2x8 bookmark files (.html) and one URL file (.txt) • Bookmark files are ‘real’ files collected one year ago • The URL file contains a number of lines with the following format: • Bk file ID, URL of bookmarked page, home category, exact entry from bookmark file (with date created, etc.)
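As a rough illustration of how an evaluation harness might read the URL file described above, here is a minimal sketch. The field order (bookmark file ID, URL, home category, exact bookmark-file entry) follows the slide, but the delimiter handling and the `UrlRecord` names are assumptions, since the exact file layout is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class UrlRecord:
    bk_file_id: str      # which bookmark file the entry came from
    url: str             # URL of the bookmarked page
    home_category: str   # the category the user actually filed it under
    raw_entry: str       # exact entry from the bookmark file (date created, title, ...)

def load_url_file(path):
    """Parse the evaluation URL file, one record per line.

    Assumes comma-separated fields; only the first three commas are treated
    as delimiters, because the raw bookmark entry itself may contain commas.
    """
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            bk_id, url, category, raw = line.split(",", 3)
            records.append(UrlRecord(bk_id.strip(), url.strip(),
                                     category.strip(), raw.strip()))
    return records
```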

  6. Overview • Evaluation (contd.) • The challenge is also to ‘re-create’ the bookmark file in the order that it was created by the users • Eventually, close to the end of the APT, the evaluation test data set will be made available • About 20 unseen bookmark files and one URL file • Same format as before • You’ll get the bookmark files early to prepare representations, but the classification run (URL file) will be part of a demo session

  7. User Interface • Graphical, as part of the Web Browser • Command-line based, or equivalent, for evaluation purposes

  8. User Interface • Graphical • Can be built using Bugeja’s HyperBK as a framework • The user needs to be able to select the clustering algorithm • When the system is idle, recalculate the centroids of the different categories • The user needs to be able to switch the extra utility on/off • Whenever the user bookmarks a page, the system kicks in, performs its functions, and presents a dialog box to the user
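One plausible way to represent a category centroid and to score a newly bookmarked page against it is a term-frequency vector with cosine similarity. The sketch below illustrates that general idea only; the actual clustering/similarity algorithms used in the project (including HDIFT) are specified elsewhere, and the function names here are illustrative.

```python
import math
from collections import Counter

def centroid(term_vectors):
    """Average a list of term-frequency Counters into a category centroid."""
    total = Counter()
    for vec in term_vectors:
        total.update(vec)
    n = max(len(term_vectors), 1)
    return {term: count / n for term, count in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend_category(page_vector, category_centroids):
    """Return the category whose centroid is most similar to the page."""
    return max(category_centroids,
               key=lambda c: cosine(page_vector, category_centroids[c]))
```

Recalculating centroids while the system is idle then amounts to re-running `centroid()` over each category's current members.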

  9. User Interface • Graphical • From the dialog box, the user should be able to • Use the recommended category • Use a different category • Create a new category • Store the bookmark in the last category *used* to store a bookmark • It needs to be user-friendly!

  10. User Interface • Command line: • To enable the evaluation to take place without user intervention • Essentially, call the program with the location of the bookmark files directory (which contains the bookmark files and the URL file), the clustering algorithm to use, extra utility on/off, where to store results, and whether logging is on/off • If this is not practical, then embed the call inside the web browser
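A batch entry point along those lines might look like the following sketch; the flag names and defaults are invented for illustration and are not part of the project specification.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Batch evaluation of bookmark classification (illustrative flags only)")
    parser.add_argument("bookmark_dir",
                        help="directory containing the bookmark files and the URL file")
    parser.add_argument("--algorithm", choices=["hdift", "own"], default="hdift",
                        help="clustering/similarity algorithm to use")
    parser.add_argument("--extra-utility", action="store_true",
                        help="switch the extra utility on")
    parser.add_argument("--results", default="results.txt",
                        help="where to store the evaluation results")
    parser.add_argument("--log", action="store_true",
                        help="switch logging on")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Evaluating {args.bookmark_dir} with {args.algorithm}, "
          f"extra utility={'on' if args.extra_utility else 'off'}")
```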

  11. Extra Utility • What inefficiencies or problems are there with the current methods? • How can they be improved? • E.g., better term selection (e.g., synonym detection); anything you like, but I need to approve it • Or how can the overall system be improved? • E.g., automatic re-organisation of the bookmark file to classify unclassified bookmarks, an improved interface • You need to work independently, but the utility does not need to be unique

  12. Evaluation • Two types of evaluation • One to determine if randomly selected bookmarks can be placed into the “correct” category • Another to attempt to re-build the bookmark file in the order the user created it (only for bookmarks in categories) • Both types must run in “batch” mode (via a command-line interface or equivalent)

  13. Evaluation • You each have test_eval.zip (from the APT’s home page) • This is the test set • The format of the files was explained earlier • Later, you’ll get an evaluation set to use to prepare category representations • You will run the random evaluation under lab conditions, as part of the demo • You will *not* run the re-build evaluation on these files, but you must provide a mechanism for me to run it

  14. Evaluation • Categorising random bookmarks • Bookmarks will be selected from user-created categories • Classifying a bookmark into the correct category is a ‘hit’ • Otherwise, it counts as a ‘miss’ • On average, you’re aiming for a minimum of 80% accuracy (using either classification algorithm) during the evaluation run • Report results as percentage accuracy, and report the statistical significance of your results • Also report the average time overhead per URL, from *once* the page has been downloaded until it is classified
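For reporting, a hit/miss tally can be turned into percentage accuracy and a simple significance figure. The sketch below uses an exact one-sided binomial test against a chance baseline, here assumed to be picking one of the user's categories uniformly at random; that baseline, the test choice, and the example numbers are all illustrative.

```python
from math import comb

def accuracy(hits, total):
    return 100.0 * hits / total if total else 0.0

def binomial_p_value(hits, total, chance):
    """P(X >= hits) under Binomial(total, chance): the probability of doing
    at least this well by guessing. Small values suggest significance."""
    return sum(comb(total, k) * chance**k * (1 - chance)**(total - k)
               for k in range(hits, total + 1))

# Hypothetical run: 85 of 100 bookmarks classified correctly, 12 categories.
hits, total, n_categories = 85, 100, 12
print(f"accuracy = {accuracy(hits, total):.1f}%")
print(f"p-value vs. random guessing = {binomial_p_value(hits, total, 1 / n_categories):.3g}")
```

The average time overhead per URL can be measured by wrapping only the classification step (after the page download completes) with time.perf_counter() and averaging across the run.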

  15. Evaluation • Re-building the bookmark file (1) • Each entry in the bookmark file is time-stamped • You can determine the order in which URLs were bookmarked overall • You may ignore uncategorised bookmarks (or you can suggest a category for them, but don’t count them in the evaluation) • If a URL is the first bookmark in a category, then assume the user created the category and placed the URL in it manually • For all other bookmarks, a correct classification is a hit; otherwise it’s a miss
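Bookmark files in the usual Netscape/Mozilla HTML export format carry an ADD_DATE attribute (a Unix timestamp) on each entry, which is what makes this ordering possible. The sketch below extracts and sorts those entries; the regular expression encodes an assumption about the exact attribute layout and may need adjusting for real files.

```python
import re

# Matches entries such as: <DT><A HREF="http://..." ADD_DATE="1169812345">Title</A>
ENTRY_RE = re.compile(
    r'<A\s+HREF="(?P<url>[^"]+)"[^>]*\bADD_DATE="(?P<ts>\d+)"[^>]*>(?P<title>.*?)</A>',
    re.IGNORECASE | re.DOTALL)

def bookmarks_in_creation_order(html):
    """Return (timestamp, url, title) tuples sorted by creation time."""
    entries = [(int(m.group("ts")), m.group("url"), m.group("title"))
               for m in ENTRY_RE.finditer(html)]
    return sorted(entries)
```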

  16. Evaluation • Re-building the bookmark file (2) • After each URL classification, bookmark the URL into the correct (user-determined) category, and re-calculate the category centroid • Continue for all bookmarks • What is the overall success (hit) rate? What is the hit rate with just 1 (user-located) bookmark in a category, with 3, 5, 10, 15+? (Does the hit rate improve as the number of bookmarks increases?) What’s the statistical significance? • Remember to use both algorithms (HDIFT and your own), with and without the extra utility (if possible) • Report the average time overhead
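Putting the earlier sketches together, the re-build evaluation might be structured as follows: process bookmarks in creation order, treat the first bookmark in each category as a manual seed, classify the rest, record hits bucketed by how many bookmarks the true category already held, then file the bookmark in its correct category and refresh that centroid. All function and variable names here are illustrative; the `centroid` and `similarity` arguments could be the ones sketched earlier.

```python
from collections import defaultdict

def rebuild_evaluation(ordered_bookmarks, page_vector, centroid, similarity):
    """ordered_bookmarks: (timestamp, url, true_category) tuples in creation order.
    page_vector(url) -> term vector; centroid(list of vectors) -> centroid vector;
    similarity(vector, centroid_vector) -> score (e.g. cosine)."""
    members = defaultdict(list)                  # category -> term vectors filed so far
    centroids = {}                               # category -> current centroid
    hits_by_size = defaultdict(lambda: [0, 0])   # category size -> [hits, attempts]

    for _, url, true_cat in ordered_bookmarks:
        vec = page_vector(url)
        if true_cat not in centroids:
            # First bookmark in a category: assume the user created it manually.
            members[true_cat].append(vec)
            centroids[true_cat] = centroid(members[true_cat])
            continue
        size = len(members[true_cat])            # bookmarks already in the true category
        predicted = max(centroids, key=lambda c: similarity(vec, centroids[c]))
        hits_by_size[size][1] += 1
        if predicted == true_cat:
            hits_by_size[size][0] += 1
        # File under the user-determined category and re-calculate its centroid.
        members[true_cat].append(vec)
        centroids[true_cat] = centroid(members[true_cat])

    # Exact sizes can be grouped afterwards into the 1 / 3 / 5 / 10 / 15+ buckets.
    return hits_by_size
```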

  17. Evaluation • Write up the results based on the test data: each algorithm, with and without the extra utility (if the utility is designed to improve results), for both the random and the re-build evaluations • You won’t be able to write up results for the actual evaluation (because it is likely to take place after you submit the report!), but your program will report its results

  18. Evaluating the Extra Utility • If your extra utility does not directly improve classification, but performs some other function (e.g., categorises unclassified bookmarks when a file is imported into Firefox), then explain how you would evaluate it.

  19. More Pitfalls • Frames • Pages that are no longer on-line
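A hedged sketch of how a page fetcher might cope with these two pitfalls: follow frame sources so a frameset page still yields indexable text, and fail gracefully when a page is no longer on-line. The timeout value and the frame-handling strategy are illustrative choices, not part of the project specification.

```python
import re
import urllib.request
from urllib.parse import urljoin

FRAME_SRC_RE = re.compile(r'<i?frame[^>]+src="([^"]+)"', re.IGNORECASE)

def fetch(url, timeout=10):
    """Return the page text, or None if the page is no longer on-line."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:          # covers URLError, HTTPError and timeouts
        return None

def fetch_with_frames(url):
    """Fetch a page; if it uses frames, pull in the frame contents too."""
    html = fetch(url)
    if html is None:
        return None
    parts = [html]
    for src in FRAME_SRC_RE.findall(html):
        frame_html = fetch(urljoin(url, src))
        if frame_html:
            parts.append(frame_html)
    return "\n".join(parts)
```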
