
Keywords Selection Problem in Hidden Web Crawling



  1. Keywords Selection Problem in Hidden Web Crawling Ka Cheung Sia, Richard • March 15, 2004

  2. Agenda • What is Hidden Web? • How to crawl the Hidden Web? • Problem formalization • Searching for “best” keyword • Greedy • Tree searching • Pruning • Experiments & results • Conclusion

  3. What is Hidden Web? • Hidden • Unreachable by following hyperlinks • Dynamically generated • Accessible only through a search interface • Informative • Examples • http://citeseer.ist.psu.edu/ - CS research papers • http://www.pubmed.org – medical research papers • http://catalog.loc.gov – Library of Congress catalog

  4. What is Hidden Web? • Search interface • http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
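For concreteness, a query like the one in this URL can be issued programmatically. The snippet below is only a sketch: it assumes the third-party `requests` library and reuses the parameter names visible in the URL above, and the endpoint itself may no longer be live.

```python
import requests  # third-party HTTP client (pip install requests)

def query_citeseer(keyword):
    """Issue one keyword query against the CiteSeer search interface,
    using the same URL parameters as on this slide."""
    params = {"q": keyword, "submit": "Search Documents", "cs": "1"}
    resp = requests.get("http://citeseer.ist.psu.edu/cis", params=params, timeout=10)
    resp.raise_for_status()
    return resp.text  # raw HTML result page; a crawler would parse out the document links

if __name__ == "__main__":
    print(len(query_citeseer("heuristic search")), "bytes of result HTML")
```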

  5. What is Hidden Web? • Result

  6. What is Hidden Web? • Document

  7. How to crawl the Hidden Web • http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1 • Our task: figure out a keyword → query the Hidden Web → collect the results
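The loop implied by this slide (figure out a keyword, query the hidden-web site, collect the results, repeat) can be sketched with a toy in-memory "hidden web". Everything here, the `HIDDEN_DOCS` corpus and the helper names, is invented for illustration and is not the authors' code; the next-keyword rule anticipates the greedy strategy discussed on slide 12.

```python
# Toy "hidden web": documents reachable only through keyword queries.
HIDDEN_DOCS = {
    1: "heuristic search algorithms for planning",
    2: "database search interface design",
    3: "crawling the hidden web with keyword queries",
    4: "greedy heuristic for set cover",
}

def issue_query(keyword):
    """Simulated search interface: ids of documents containing the keyword."""
    return {d for d, text in HIDDEN_DOCS.items() if keyword in text.split()}

def next_keyword(retrieved, used):
    """Greedy rule (slide 12): the word occurring in the most retrieved
    documents that has not been issued as a query yet."""
    counts = {}
    for text in retrieved.values():
        for word in set(text.split()):
            if word not in used:
                counts[word] = counts.get(word, 0) + 1
    return max(counts, key=counts.get) if counts else None

def crawl(seed, budget):
    retrieved, used, keyword = {}, set(), seed
    for _ in range(budget):
        if keyword is None:
            break
        used.add(keyword)
        for doc_id in issue_query(keyword):
            retrieved.setdefault(doc_id, HIDDEN_DOCS[doc_id])  # "download" the result
        keyword = next_keyword(retrieved, used)
    return retrieved, used

if __name__ == "__main__":
    docs, queries = crawl("search", budget=3)
    print("queries issued:", queries)
    print("documents retrieved:", sorted(docs))
```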

  8. Problem formalization • Set-cover • Vertex – documents • Hyper-edges – query words

  9. Goal • Maximize the number of unique documents retrieved with the minimum number of query words
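Viewed offline, the formalization on these two slides is a set-cover-style problem: each query word covers the set of documents containing it, and we want a few words that together cover as many documents as possible. A minimal sketch with made-up posting lists:

```python
# Hypothetical set-cover instance: query word -> ids of documents containing it.
POSTINGS = {
    "linux":  {1, 2, 3, 4, 5},
    "debian": {1, 2},
    "redhat": {3, 4},
    "music":  {6, 7},
}

def greedy_cover(postings, num_queries):
    """Pick query words one at a time, each time choosing the word that adds
    the most not-yet-covered documents (the standard greedy set-cover rule)."""
    covered, chosen = set(), []
    for _ in range(num_queries):
        word = max(postings, key=lambda w: len(postings[w] - covered))
        if not postings[word] - covered:
            break  # no remaining word adds anything new
        chosen.append(word)
        covered |= postings[word]
    return chosen, covered

if __name__ == "__main__":
    words, docs = greedy_cover(POSTINGS, num_queries=2)
    print(words, "cover", len(docs), "documents")  # ['linux', 'music'] cover 7 documents
```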

  10. Problem formalization • P(qi) • portion of unique documents retrieved by issuing query word qi (portion of documents containing qi) • P(qi v qj) • portion of unique documents retrieved by issuing query words qi and qj (portion of documents containing qi or qj) • P(qi | qj) • portion of documents containing qi among the documents retrieved by issuing query word qj
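These quantities can be computed directly from posting sets. A small sketch, where the collection size and posting sets are invented for illustration:

```python
# Invented posting sets over a 10-document collection: word -> docs containing it.
N_DOCS = 10
DOCS = {
    "linux":  {1, 2, 3, 4},
    "debian": {1, 2, 5},
}

def p(word):
    """P(q): fraction of the collection containing the word."""
    return len(DOCS[word]) / N_DOCS

def p_or(w1, w2):
    """P(q1 v q2): fraction of the collection containing w1 or w2."""
    return len(DOCS[w1] | DOCS[w2]) / N_DOCS

def p_given(w1, w2):
    """P(q1 | q2): fraction of the documents retrieved by w2 that also contain w1."""
    return len(DOCS[w1] & DOCS[w2]) / len(DOCS[w2])

if __name__ == "__main__":
    print(p("linux"), p_or("linux", "debian"), p_given("linux", "debian"))
    # 0.4, 0.5, 0.666... -- note P(q1 v q2) = P(q1) + P(q2) - P(q2) * P(q1 | q2)
```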

  11. Problem formalization • What is the next “best” query word? • P((q1 v … v qi-1) v qi) = P(q1 v … v qi-1) + P(qi) – P((q1 v … v qi-1) ^ qi) = P(q1 v … v qi-1) + P(qi) – P(q1 v … v qi-1) P(qi | q1 v … v qi-1) • P(q1 v … v qi-1) – known • P(qi | q1 v … v qi-1) – known • P(qi) – unknown • Approximate P(qi) using P(qi | q1 v … v qi-1)
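A sketch of this estimate in code. The two inputs are the known quantities; P(qi) over the whole hidden collection is unknown, so it is approximated by the conditional probability measured on the documents retrieved so far, exactly as the slide suggests. The function and parameter names are my own.

```python
def estimated_coverage(p_prev, p_qi_given_prev):
    """Estimate P(q1 v ... v q_{i-1} v q_i): the fraction of the hidden
    collection covered after also issuing q_i.

    p_prev          -- P(q1 v ... v q_{i-1}), known from documents already retrieved
    p_qi_given_prev -- P(q_i | q1 v ... v q_{i-1}), measured on the retrieved documents
    The unknown P(q_i) is approximated by p_qi_given_prev.
    """
    p_qi = p_qi_given_prev                      # the approximation on this slide
    return p_prev + p_qi - p_prev * p_qi_given_prev

if __name__ == "__main__":
    # If 40% of the collection is already covered and a candidate word appears
    # in 25% of the retrieved documents, we expect roughly
    # 0.40 + 0.25 - 0.40 * 0.25 = 0.55 coverage after issuing it.
    print(estimated_coverage(0.40, 0.25))
```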

  12. Search for best query word • Greedy: choose the word occurring most frequently in the documents retrieved so far as the next query • Choose qi with maximum P(qi | q1 v … v qi-1) • For the set-cover problem, greedy is proven to give a solution within a logarithmic factor of optimal

  13. Search for best query word • Can we do better? • Intuition: correlation of keywords • E.g. “linux” vs. debian, redhat, suse, knoppix, fedora, etc. • If we have already issued those, we might save the query word “linux”!

  14. Search for best query word • [Venn diagram: whole document collection; documents retrieved by qi, qj, and qk; already-retrieved documents]

  15. Search for best query word • linux, debian, redhat • f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlaps between “redhat, linux”, “debian, linux”, and “redhat, debian”
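On the retrieved sample this objective can be computed exactly with set operations instead of expanding the inclusion-exclusion terms by hand. A sketch with invented posting sets:

```python
# Invented posting sets measured on the documents retrieved so far.
POSTINGS = {
    "linux":  {1, 2, 3, 4, 5, 6},
    "debian": {1, 2, 7},
    "redhat": {3, 4, 8},
}

def f(words):
    """Number of distinct documents obtained by issuing all the given query
    words: the individual result sets minus their overlaps, i.e. simply the
    size of their union."""
    union = set()
    for w in words:
        union |= POSTINGS[w]
    return len(union)

if __name__ == "__main__":
    print(f(["linux", "debian", "redhat"]))   # 8 distinct documents
    print(f(["debian", "redhat"]))            # 6 -- adding "linux" would only contribute docs 5 and 6
```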

  16. Search for best query word • The search tree is huge (large branching factor) • We look ahead over the 10 most frequent keywords • We only search up to depth 6 • Pruning

  17. Search for best query word • DFBnB (depth-first branch and bound): prune a sub-tree when the number of documents retrieved, even assuming no overlap between keywords, is less than the current best solution
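A sketch of the branch-and-bound idea described here: an optimistic bound assumes the remaining candidate words do not overlap at all, and a sub-tree is pruned when even that bound cannot beat the best solution found so far. The posting sets, candidate ordering, and depth are illustrative, not the exact experimental setup.

```python
# Invented posting sets for the candidate keywords.
POSTINGS = {
    "linux":  {1, 2, 3, 4, 5, 6},
    "debian": {1, 2, 7},
    "redhat": {3, 4, 8},
    "suse":   {5, 6},
    "music":  {9, 10},
}

def dfbnb(candidates, depth, covered=frozenset(), chosen=(), best=None):
    """Depth-first branch and bound: explore subsets of up to `depth` query
    words and keep the subset covering the most distinct documents."""
    if best is None or len(covered) > best[0]:
        best = (len(covered), chosen)
    if depth == 0 or not candidates:
        return best
    # Optimistic bound for this sub-tree: current coverage plus the `depth`
    # largest remaining result sets, as if they never overlapped.
    sizes = sorted((len(POSTINGS[w]) for w in candidates), reverse=True)
    if len(covered) + sum(sizes[:depth]) <= best[0]:
        return best  # prune: even with zero overlap we cannot beat the best so far
    for i, word in enumerate(candidates):
        best = dfbnb(candidates[i + 1:], depth - 1,
                     covered | POSTINGS[word], chosen + (word,), best)
    return best

if __name__ == "__main__":
    print(dfbnb(list(POSTINGS), depth=2))   # best coverage reachable with 2 query words
```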

  18. Experiment • Document collection: ~100K front pages of randomly selected websites • Query interface: an inverted index (a program that returns the documents containing a given query word) • Methods • Greedy • DFS search (look ahead over 10 words, up to depth 6) • DFS search with pruning (DFBnB)
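The query interface used in the experiments is described as an inverted index over the local collection. A minimal sketch of such an index; the sample documents are made up and stand in for the ~100K front pages.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def query(index, word):
    """Query interface: return the documents containing the given word."""
    return index.get(word.lower(), set())

if __name__ == "__main__":
    docs = {  # made-up front pages standing in for the real collection
        1: "Welcome to our local events page",
        2: "Enter your email for the latest events",
        3: "High quality maps of the world",
    }
    index = build_inverted_index(docs)
    print(query(index, "events"))   # {1, 2}
```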

  19. Results • Does searching help? • Two keyword sequences and the number of documents each query word retrieved: • Sequence 1: provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77 • Sequence 2: provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77

  20. Results • Does searching help?

  21. Results • How much does pruning save? • Without pruning, 187300 nodes are examined: 187300 = (10)+(10*9)+(10*9*8)+(10*9*8*7)+(10*9*8*7*6)+(10*9*8*7*6*5) • With pruning, 5558 nodes are examined on average (when we choose the most frequent keyword to expand) • DFBnB saves a factor of roughly 30

  22. Conclusion • Searching helps little “in this problem” • DFBnB is “really effective” at pruning the search tree

  23. End

  24. More results • Prior information helps

  25. Results

  26. Results

  27. Search & Greedy

  28. Search with prune & Greedy

  29. Search for best query word • base = q1 v … v qi • P(base v qi+1 v qi+2) = P(base v qi+1) + P(qi+2) – P((base v qi+1) ^ qi+2) • P((base v qi+1) ^ qi+2) = P((base ^ qi+2) v (qi+1 ^ qi+2)) = P(base ^ qi+2) + P(qi+1 ^ qi+2) – P(base ^ qi+1 ^ qi+2)

  30. 2 words overlapping

  31. 3 words overlapping
