Deep Web Crawling

Mathy Vanhoef. Deep Web Crawling. Co-presentation. Values for generic text boxes: (1) initial seed keywords are extracted from the form page; (2) a query template with only the generic text box is submitted; (3) additional keywords are extracted from the resulting page; (4) keywords not representative for the page are discarded (TF-IDF rank).

Presentation Transcript


  1. Deep Web Crawling • Co-presentation by Mathy Vanhoef

  2. Values for generic text boxes • (1) Initial seed keywords are extracted from the form page • (2) A query template with only the generic text box is submitted • (3) Additional keywords are extracted from the resulting page • (4) Keywords not representative for the page are discarded (TF-IDF rank) • Runs until a sufficient number of keywords has been extracted
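
The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: `submit_query` is a hypothetical callable standing in for submitting the form with one keyword in the generic text box, and `background_docs` is an assumed small reference corpus used only to compute the IDF part of the TF-IDF rank.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_top(page_text, background_docs, k):
    """Return up to k words of page_text, ranked by TF-IDF.

    background_docs is an assumed reference corpus; words that occur in
    every document score 0 and are discarded as not representative.
    """
    tf = Counter(tokenize(page_text))
    doc_sets = [set(tokenize(d)) for d in background_docs]
    n_docs = len(doc_sets) + 1  # the page itself counts as one document
    scores = {}
    for word, count in tf.items():
        df = 1 + sum(word in s for s in doc_sets)
        scores[word] = count * math.log(n_docs / df)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [w for w in ranked if scores[w] > 0][:k]


def probe(form_page, submit_query, background_docs,
          target=5, per_page=3, max_queries=20):
    """Iteratively collect keywords for a generic text box.

    submit_query(word) is a hypothetical function that submits the query
    template and returns the result page text ("" if nothing was found).
    """
    keywords = []
    queue = tfidf_top(form_page, background_docs, per_page)  # initial seed
    tried = set()
    queries = 0
    while queue and len(keywords) < target and queries < max_queries:
        word = queue.pop(0)
        if word in tried:
            continue
        tried.add(word)
        queries += 1
        result = submit_query(word)
        if not result:
            continue  # keyword produced no results, so drop it
        keywords.append(word)
        # Extract additional candidates from the result page.
        for candidate in tfidf_top(result, background_docs, per_page):
            if candidate not in tried and candidate not in queue:
                queue.append(candidate)
    return keywords
```

The `target` and `max_queries` limits stand in for "a sufficient number of keywords"; the slide does not say how that threshold is chosen.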

  3. Initial Seed • What if page has no keywords? • No good keywords → No useful indexing

  4. Current Solution • Pages with long list of static links • Provide example searches: • Has to be manually maintained

  5. Can we improve this?

  6. New Idea! • Scan incoming links for keywords • Example: • Discussion on Fourier transform on forum • Contains link to Wolfram Alpha • Extract “Fourier” as a keyword
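
One way the idea on this slide might look in code is sketched below. The slide does not say where in the incoming link the keyword is found, so this sketch assumes two places: the anchor text and the URL query string. The function name, the stopword list, and the `top_k` cap are all illustrative choices, not part of the presentation.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlparse

# Illustrative stopword list (an assumption, not from the slides).
STOPWORDS = frozenset({"the", "and", "for", "this", "that", "with", "here", "click"})


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def keywords_from_incoming_links(referring_pages, target_host, top_k=10):
    """Return the most frequent words found in links pointing at target_host.

    referring_pages is a list of HTML strings from pages that link to the
    site; top_k caps the number of extracted keywords.
    """
    counts = Counter()
    for html in referring_pages:
        parser = LinkCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            url = urlparse(href)
            if url.hostname != target_host:
                continue  # link does not point at the site being crawled
            query_text = " ".join(value for _, value in parse_qsl(url.query))
            words = re.findall(r"[a-z]+", (text + " " + query_text).lower())
            counts.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]


forum_post = (
    '<p>See <a href="https://www.wolframalpha.com/input?i=fourier+transform">'
    "the Fourier transform</a> for details.</p>"
)
print(keywords_from_incoming_links([forum_post], "www.wolframalpha.com"))
```

Ranking by frequency and keeping only the top few words is one simple way to avoid extracting too many keywords for heavily linked sites.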

  7. Analysis • Automatic • Extracts commonly used keywords • Must avoid extracting too many keywords • Especially for heavily linked sites

  8. Questions?
