CS347 Project Section 2 April 18, 2001 Brent D. Miller
Today’s agenda • More sample project ideas • Using VDK packaged utilities • Step through a VDK API call • Project/VDK Q&A
Sample project 1 (from section 1) • email indexer and categorization • queries fed into app defining categories • either archive and index emails as they arrive, or keep current state of mbox • viewing of emails in tree format by category, and search over email • live collection
Sample project 2 (from section 1) • browsing companion • indexing/search of visited sites integrated into browser • sort html content by language?
More sample project concepts • identifying near-duplicate docs (throwing out ephemeral text) • spam detection • query log spamming • falsified hub/authority web structures • query log mining • adjusting doc relevance scores • identifying common problem queries (no results, too many clicks after search)
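One way the near-duplicate idea could be prototyped is sketched below. It is not from the section: it swaps in word shingling with Jaccard similarity as the comparison technique, and treats dates and clock times as the "ephemeral text" to throw out. The function names, the regular expressions, and the shingle size of 3 are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for "ephemeral" text: clock times and slash-style dates.
EPHEMERAL_PATTERNS = (
    r"\d{1,2}:\d{2}(:\d{2})?",
    r"\d{1,2}/\d{1,2}/\d{2,4}",
)


def strip_ephemeral(text, patterns=EPHEMERAL_PATTERNS):
    """Remove text that changes between otherwise identical copies of a doc."""
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    return text


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=3):
    """Compare two docs after throwing out ephemeral text."""
    return jaccard(
        shingles(strip_ephemeral(doc_a), k),
        shingles(strip_ephemeral(doc_b), k),
    )
```

A project would still need to pick a similarity threshold for calling two docs near-duplicates; that choice is left open here.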
Indexing documents with mkvdk • http://www.stanford.edu/class/cs347/project/vdkdocs/coll/02_cbg2.htm#915094
    % mkvdk -collection foo -insert docpath [docpath2 …]
    % mkvdk -collection foo -insert @listfile
• listfile contains names of docs, 1 per line
    % mkvdk -collection foo -bulk -insert bulkfile
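A listfile for the `@listfile` form can be generated with a short script. The sketch below is only an illustration: the function name `write_listfile`, the default extensions, and the UTF-8 encoding are assumptions, and the only property taken from the slide is "names of docs, 1 per line".

```python
import os


def write_listfile(doc_root, listfile_path, extensions=(".txt", ".html")):
    """Write every matching file under doc_root to listfile_path, one per line.

    Returns the list of paths written, in sorted order.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            # str.endswith accepts a tuple of suffixes
            if name.lower().endswith(extensions):
                paths.append(os.path.join(dirpath, name))
    paths.sort()
    with open(listfile_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

The resulting file would then be passed as `% mkvdk -collection foo -insert @listfile`, as shown on the slide.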
bulk submits • http://www.stanford.edu/class/cs347/project/vdkdocs/coll/03_cbg1.htm#314015 • entire submission is inserted into a single partition in one transaction • must use bulk insert file
bulk file format
    VdkVgwKey: /path/docname (or use url)
    Field1: value1
    Field2: value2
    ...
    <<EOD>>
    ...
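A bulk file in the layout above could be produced by a small writer like the following sketch. It follows only what the slide shows (a `VdkVgwKey:` line, `Field: value` lines, and an `<<EOD>>` terminator per document). The function names, the choice to collapse multi-line values onto one line, and the UTF-8 encoding are assumptions, so the exact syntax should be checked against the linked VDK documentation before relying on it.

```python
def format_bulk_entry(key, fields):
    """Format one document entry: key line, field lines, then <<EOD>>."""
    lines = ["VdkVgwKey: " + key]
    for name, value in fields.items():
        # Assumption: each value must fit on one line, so collapse whitespace.
        flat = " ".join(str(value).split())
        lines.append(f"{name}: {flat}")
    lines.append("<<EOD>>")
    return "\n".join(lines) + "\n"


def write_bulk_file(path, docs):
    """Write (key, fields) pairs to path; returns the number of entries."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for key, fields in docs:
            out.write(format_bulk_entry(key, fields))
            count += 1
    return count
```

For example, `write_bulk_file("bulkfile", [("/path/docname", {"Field1": "value1"})])` writes one entry that could then be submitted with `% mkvdk -collection foo -bulk -insert bulkfile`.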
Searching collections with rcvdk
• basic usage: rcvdk collpath
• basic commands:
    search QUERY • executes search over entire collection
    results • displays results of previous search
    view DOCNUM • displays specified document
Advanced rcvdk commands • results list can be customized to display desired fields • Use “x” command (expert mode); then:
    rcvdk> fields field1 numchars field2 numchars …
• subsequent results lists will show numchars chars of the specified fields for each doc
Advanced packaged VDK tools
• didump - displays word table in a partition
    % didump collpath/parts/00000001.did
• browse - peruse entries in a vdb
• use to examine docs’ field values stored in a partition
• can also examine .dids
    % browse collpath/parts/00000003.ddd
Using spiders • no spider included with VDK • extra credit for group which gets spider code from public domain & makes it build VDK bulk files for class to use • must have throttling of spidering rate • should adhere to robot files • should be able to limit scope of spidering
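The three spider requirements (throttling, robot files, limited scope) can be kept separate from the fetching code. The sketch below shows one possible arrangement using Python's standard `urllib.robotparser`; the class name `PolitenessPolicy`, the user-agent string, and the single-host scope rule are hypothetical choices, and the robots.txt lines are passed in by the caller rather than fetched, so the policy can be tried without network access.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse


class PolitenessPolicy:
    """Decides whether a URL may be fetched and paces successive fetches."""

    def __init__(self, allowed_host, path_prefix="/", min_delay=1.0,
                 robots_lines=(), user_agent="cs347-spider",
                 clock=time.monotonic, sleep=time.sleep):
        self.allowed_host = allowed_host
        self.path_prefix = path_prefix
        self.min_delay = min_delay
        self.user_agent = user_agent
        self._clock = clock
        self._sleep = sleep
        self._last_fetch = None
        # Parse robots.txt content supplied by the caller (already downloaded).
        self._robots = robotparser.RobotFileParser()
        self._robots.parse(list(robots_lines))

    def in_scope(self, url):
        """Limit spidering to one host and one path prefix."""
        parts = urlparse(url)
        path = parts.path or "/"
        return (parts.scheme in ("http", "https")
                and parts.hostname == self.allowed_host
                and path.startswith(self.path_prefix))

    def may_fetch(self, url):
        """True only if the URL is in scope and robots.txt permits it."""
        return self.in_scope(url) and self._robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until at least min_delay seconds have passed since the last fetch."""
        now = self._clock()
        if self._last_fetch is not None:
            remaining = self.min_delay - (now - self._last_fetch)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_fetch = now
```

A crawl loop would call `may_fetch(url)` before queuing a link and `wait_turn()` immediately before each download. The `clock` and `sleep` parameters exist only so the pacing can be exercised with a fake clock.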
A typical VDK call
• declare handles and arguments
    VdkCollection collection = NULL;
    VdkCollectionOpenArgRec collopen;
• Fill the “arg rec”
    VdkStructInit(&collopen);
    collopen.path = collectionPathname;
    collopen.serviceLevel = VdkServiceType_Search;
• Make the call
    VdkCollectionOpen(session, &collection, &collopen);
Q&A • see the video for in-class Q&A session
Section next week? • NO section next week • office hours or email if you need help
Your assignments • Send me a list of your group members if you haven’t done so • Send me project ideas and try to narrow project to one proposal • make sure to have completed the VDK installation by Monday • Office hours 4/23 5:30pm in Gates B26B
The End • section leader: Brent Miller • students: you