Mathy Vanhoef Deep Web Crawling Co-presentation
Values for generic text boxes • 1. Initial seed keywords are extracted from the form page • 2. A query template with only the generic text box is submitted • 3. Additional keywords are extracted from the resulting page • 4. Discard keywords not representative for the page (TF-IDF rank) • Runs until a sufficient number of keywords has been extracted
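The probing loop on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the crawler's actual implementation: `submit_query` is a hypothetical callback that submits one keyword through the generic text box and returns the text of the result page, `background_docs` is an assumed reference corpus for the IDF part, and the smoothed TF-IDF formula, the `per_page` cut-off and the `max_queries` budget are choices made only for this example.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]{3,}")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters."""
    return TOKEN.findall(text.lower())


def tfidf_rank(page_text, background_docs, top_k):
    """Return the top_k words of page_text ranked by a smoothed TF-IDF score."""
    tf = Counter(tokenize(page_text))
    doc_sets = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_sets)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for words in doc_sets if word in words)
        # Smoothed IDF (an assumption): always >= 1, higher for rarer words.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = count * idf
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def probe_keywords(form_page_text, submit_query, background_docs,
                   target=20, per_page=5, max_queries=50):
    """Iteratively collect keywords for a generic text box.

    submit_query is a hypothetical callback: keyword -> result page text.
    """
    # Step 1: seed keywords come from the form page itself.
    keywords = tfidf_rank(form_page_text, background_docs, per_page)
    seen = set(keywords)
    queue = list(keywords)
    queries = 0
    while queue and len(keywords) < target and queries < max_queries:
        # Step 2: submit a query that fills in only the generic text box.
        result_text = submit_query(queue.pop(0))
        queries += 1
        # Steps 3 and 4: extract candidates from the result page and keep
        # only those that rank highly for that page.
        for word in tfidf_rank(result_text, background_docs, per_page):
            if word not in seen:
                seen.add(word)
                keywords.append(word)
                queue.append(word)
    return keywords[:target]
```

If the form page yields no usable seed keywords, the queue starts empty and the loop never runs, which is the weakness the next slide raises.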
Initial Seed • What if the page has no keywords? • No good keywords → no useful indexing
Current Solution • Pages with long list of static links • Provide example searches: • Has to be manually maintained
New Idea! • Scan incoming links for keywords • Example: • Discussion on the Fourier transform on a forum • Contains a link to Wolfram Alpha • Extract “Fourier” as a keyword
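One way to sketch this idea in Python is shown below. The slide does not say which part of the referring page is scanned, so this example assumes the anchor text of links pointing at the target site. The function name `keywords_from_incoming_links`, the small stop-word list and the `max_keywords` cap are all hypothetical choices for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOPWORDS = {"the", "and", "for", "here", "this", "that", "link", "click"}


class AnchorTextCollector(HTMLParser):
    """Collects the text inside <a> tags that point at target_host."""

    def __init__(self, target_host):
        super().__init__()
        self.target_host = target_host
        self.in_target_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc.lower()
            self.in_target_link = (
                host == self.target_host
                or host.endswith("." + self.target_host)
            )

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target_link = False

    def handle_data(self, data):
        if self.in_target_link:
            self.texts.append(data)


def keywords_from_incoming_links(referring_pages, target_host, max_keywords=10):
    """Rank words by the number of referring pages whose links use them."""
    page_counts = Counter()
    for html in referring_pages:
        collector = AnchorTextCollector(target_host)
        collector.feed(html)
        collector.close()
        words = set()
        for text in collector.texts:
            words.update(re.findall(r"[a-z]{3,}", text.lower()))
        # Count each word once per page so one page cannot dominate.
        page_counts.update(words - STOPWORDS)
    ranked = sorted(page_counts, key=lambda w: (-page_counts[w], w))
    # Cap the result: heavily linked sites would otherwise yield too many.
    return ranked[:max_keywords]
```

Counting pages instead of raw occurrences and capping the result with `max_keywords` is one possible way to keep the keyword set small for heavily linked sites.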
Analysis • Automatic • Extracts commonly used keywords • Must avoid extracting too many keywords • Especially for heavily linked sites