CS347

CS347 Project Section 2. April 18, 2001. Brent D. Miller.

Presentation Transcript


  1. CS347 Project Section 2 April 18, 2001 Brent D. Miller

  2. Today’s agenda
     • More sample project ideas
     • Using VDK packaged utilities
     • Step through a VDK API call
     • Project/VDK Q&A

  3. Sample project 1 (from section 1)
     • email indexer and categorization
     • queries fed into app defining categories
     • either archive and index emails as they arrive, or keep current state of mbox
     • viewing of emails in tree format by category, and search over email
     • live collection

  4. Sample project 2 (from section 1)
     • browsing companion
     • indexing/search of visited sites integrated into browser
     • sort html content by language?

  5. More sample project concepts
     • identifying near-duplicate docs (throwing out ephemeral text)
     • spam detection
     • query log spamming
     • falsified hub/authority web structures
     • query log mining
     • adjusting doc relevance scores
     • identifying common problem queries (no results, too many clicks after search)
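
One possible starting point for the near-duplicate idea is word shingling with Jaccard similarity. This technique is an assumption here, not something the slides prescribe. The sketch hashes every run of three consecutive words and compares the resulting hash sets. All names (`jaccard_similarity`, `MAX_WORDS`, `SHINGLE_WORDS`) are hypothetical, and stripping ephemeral text such as dates or banners is left out.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_WORDS 4096      /* hypothetical cap on words per document */
#define SHINGLE_WORDS 3     /* words per shingle (an arbitrary choice) */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Fill out[] with the sorted, de-duplicated shingle hashes of text. */
static size_t shingle_hashes(const char *text, uint32_t *out, size_t cap)
{
    uint32_t words[MAX_WORDS];
    size_t nwords = 0, n = 0, uniq = 0;
    const unsigned char *p = (const unsigned char *)text;

    /* FNV-1a hash of each lowercased alphanumeric word. */
    while (*p && nwords < MAX_WORDS) {
        while (*p && !isalnum(*p))
            p++;
        if (!*p)
            break;
        uint32_t h = 2166136261u;
        while (*p && isalnum(*p))
            h = (h ^ (uint32_t)tolower(*p++)) * 16777619u;
        words[nwords++] = h;
    }
    /* Combine each window of SHINGLE_WORDS word hashes into one hash. */
    for (size_t i = 0; i + SHINGLE_WORDS <= nwords && n < cap; i++) {
        uint32_t h = 2166136261u;
        for (size_t j = 0; j < SHINGLE_WORDS; j++)
            h = (h ^ words[i + j]) * 16777619u;
        out[n++] = h;
    }
    /* Sort and drop repeats so the array behaves like a set. */
    qsort(out, n, sizeof out[0], cmp_u32);
    for (size_t i = 0; i < n; i++)
        if (uniq == 0 || out[i] != out[uniq - 1])
            out[uniq++] = out[i];
    return uniq;
}

/* Jaccard similarity of the two shingle sets: |A and B| / |A or B|. */
double jaccard_similarity(const char *a, const char *b)
{
    static uint32_t ha[MAX_WORDS], hb[MAX_WORDS];
    size_t na = shingle_hashes(a, ha, MAX_WORDS);
    size_t nb = shingle_hashes(b, hb, MAX_WORDS);
    size_t i = 0, j = 0, common = 0;

    /* Merge-style walk over the two sorted arrays. */
    while (i < na && j < nb) {
        if (ha[i] == hb[j]) { common++; i++; j++; }
        else if (ha[i] < hb[j]) i++;
        else j++;
    }
    size_t total = na + nb - common;
    return total == 0 ? 0.0 : (double)common / (double)total;
}
```

A project would still need to pick a similarity threshold empirically. Documents shorter than three words produce no shingles and score 0.0 in this sketch.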

  6. Indexing documents with mkvdk
     • http://www.stanford.edu/class/cs347/project/vdkdocs/coll/02_cbg2.htm#915094
       % mkvdk -collection foo -insert docpath [docpath2 …]
       % mkvdk -collection foo -insert @listfile
     • listfile contains names of docs, 1 per line
       % mkvdk -collection foo -bulk -insert bulkfile

  7. bulk submits
     • http://www.stanford.edu/class/cs347/project/vdkdocs/coll/03_cbg1.htm#314015
     • entire submission is inserted into a single partition in one transaction
     • must use bulk insert file

  8. bulk file format
       VdkVgwKey: /path/docname (or use url)
       Field1: value1
       Field2: value2
       ...
       <<EOD>>
       ...
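
A program that generates bulk files (for example, from spider output) only needs to emit records in the layout above. The following is a minimal sketch in plain C, assuming each entry sits on its own line. The `BulkField` type and the `write_bulk_record` function are hypothetical helpers, not part of VDK.

```c
#include <stdio.h>

/* One "Name: value" pair in a bulk-file record (hypothetical helper type). */
typedef struct {
    const char *name;
    const char *value;
} BulkField;

/*
 * Write one document record in the layout from the slide:
 * a VdkVgwKey line, one "Name: value" line per field, then <<EOD>>.
 * Returns 0 on success, -1 on bad arguments or a failed write.
 */
int write_bulk_record(FILE *out, const char *key,
                      const BulkField *fields, size_t nfields)
{
    if (out == NULL || key == NULL)
        return -1;
    if (fprintf(out, "VdkVgwKey: %s\n", key) < 0)
        return -1;
    for (size_t i = 0; i < nfields; i++) {
        if (fprintf(out, "%s: %s\n", fields[i].name, fields[i].value) < 0)
            return -1;
    }
    if (fprintf(out, "<<EOD>>\n") < 0)
        return -1;
    return 0;
}
```

Calling this once per document on the same `FILE *` yields a file that could then be passed to `mkvdk -collection foo -bulk -insert bulkfile`. The field names themselves are an assumption and must match whatever the collection actually defines.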

  9. Searching collections with rcvdk
     • basic usage: rcvdk collpath
     • basic commands:
       search QUERY
       • executes search over entire collection
       results
       • displays results of previous search
       view DOCNUM
       • displays specified document

  10. Advanced rcvdk commands
      • results list can be customized to display desired fields
      • Use “x” command (expert mode); then:
        rcvdk> fields field1 numchars field2 numchars …
      • subsequent results lists will show numchars chars of the specified fields for each doc

  11. Advanced packaged VDK tools
      • didump - displays word table in a partition
        % didump collpath/parts/00000001.did
      • browse - peruse entries in a vdb
      • use to examine docs’ field values stored in a partition
      • can also examine .dids
        % browse collpath/parts/00000003.ddd

  12. Using spiders
      • no spider included with VDK
      • extra credit for group which gets spider code from public domain & makes it build VDK bulk files for class to use
      • must have throttling of spidering rate
      • should adhere to robot files
      • should be able to limit scope of spidering
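
The three spider requirements (throttling, robot files, scope limits) each reduce to a small check made before every fetch. The sketch below shows one way those checks might look in plain C. All function names are hypothetical, the Disallow prefixes are assumed to have been parsed from robots.txt already, and fetching itself is not shown.

```c
#include <string.h>
#include <time.h>

/* Scope limit: accept only URLs that begin with the allowed prefix. */
int url_in_scope(const char *url, const char *allowed_prefix)
{
    return strncmp(url, allowed_prefix, strlen(allowed_prefix)) == 0;
}

/*
 * Robot files: a path is blocked if it starts with any Disallow prefix.
 * An empty Disallow value blocks nothing, so empty entries are skipped.
 */
int path_disallowed(const char *path, const char *const *disallow, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(disallow[i]);
        if (len > 0 && strncmp(path, disallow[i], len) == 0)
            return 1;
    }
    return 0;
}

/*
 * Throttling: seconds the spider should still wait before fetching
 * from the same host again, given a minimum interval between requests.
 */
double throttle_wait(time_t last_fetch, time_t now, double min_interval)
{
    double elapsed = difftime(now, last_fetch);
    return elapsed >= min_interval ? 0.0 : min_interval - elapsed;
}
```

A crawl loop would skip any URL that fails `url_in_scope` or matches `path_disallowed`, and sleep for `throttle_wait(...)` seconds before each request to a host. The minimum interval is a policy choice that the slides do not specify.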

  13. A typical VDK call
      • declare handles and arguments
          VdkCollection collection = NULL;
          VdkCollectionOpenArgRec collopen;
      • Fill the “arg rec”
          VdkStructInit(&collopen);
          collopen.path = collectionPathname;
          collopen.serviceLevel = VdkServiceType_Search;
      • Make the call
          VdkCollectionOpen(session, &collection, &collopen);

  14. Q&A
      • see the video for in-class Q&A session

  15. Section next week?
      • NO section next week
      • office hours or email if you need help

  16. Your assignments
      • Send me list of your group members if you haven’t done so
      • Send me project ideas and try to narrow project to one proposal
      • make sure to have completed the VDK installation by Monday
      • Office hours 4/23 5:30pm in Gates B26B

  17. The End
      • section leader: Brent Miller
      • students: you