
Keywords Selection Problem in Hidden Web Crawling



  1. Keywords Selection Problem in Hidden Web Crawling Ka Cheung Sia, Richard • March 15, 2004

  2. Agenda • What is Hidden Web? • How to crawl the Hidden Web? • Problem formalization • Searching for “best” keyword • Greedy • Tree searching • Pruning • Experiments & results • Conclusion

  3. What is Hidden Web? • Hidden • Unreachable by following hyperlinks • Dynamically generated • Accessible only through a search interface • Informative • Examples • http://citeseer.ist.psu.edu/ - CS research papers • http://www.pubmed.org – medical research papers • http://catalog.loc.gov – Library of Congress catalog

  4. What is Hidden Web? • Search interface • http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
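For concreteness, a query like the one in this URL can be issued programmatically. The snippet below is only a sketch: it assumes the third-party `requests` library and reuses the parameter names visible in the URL above, and the endpoint itself may no longer be live.

```python
import requests  # third-party HTTP client (pip install requests)

def query_citeseer(keyword):
    """Issue one keyword query against the CiteSeer search interface,
    using the same URL parameters as on this slide."""
    params = {"q": keyword, "submit": "Search Documents", "cs": "1"}
    resp = requests.get("http://citeseer.ist.psu.edu/cis", params=params, timeout=10)
    resp.raise_for_status()
    return resp.text  # raw HTML result page; a crawler would parse out the document links

if __name__ == "__main__":
    print(len(query_citeseer("heuristic search")), "bytes of result HTML")
```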

  5. What is Hidden Web? • Result

  6. What is Hidden Web? • Document

  7. How to crawl the Hidden Web • http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1 • Our task: figure out a keyword → query the Hidden Web → collect the results
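The loop implied by this slide (figure out a keyword, query the hidden-web site, collect the results, repeat) can be sketched with a toy in-memory "hidden web". Everything here, the `HIDDEN_DOCS` corpus and the helper names, is invented for illustration and is not the authors' code; the next-keyword rule anticipates the greedy strategy discussed on slide 12.

```python
# Toy "hidden web": documents reachable only through keyword queries.
HIDDEN_DOCS = {
    1: "heuristic search algorithms for planning",
    2: "database search interface design",
    3: "crawling the hidden web with keyword queries",
    4: "greedy heuristic for set cover",
}

def issue_query(keyword):
    """Simulated search interface: ids of documents containing the keyword."""
    return {d for d, text in HIDDEN_DOCS.items() if keyword in text.split()}

def next_keyword(retrieved, used):
    """Greedy rule (slide 12): the word occurring in the most retrieved
    documents that has not been issued as a query yet."""
    counts = {}
    for text in retrieved.values():
        for word in set(text.split()):
            if word not in used:
                counts[word] = counts.get(word, 0) + 1
    return max(counts, key=counts.get) if counts else None

def crawl(seed, budget):
    retrieved, used, keyword = {}, set(), seed
    for _ in range(budget):
        if keyword is None:
            break
        used.add(keyword)
        for doc_id in issue_query(keyword):
            retrieved.setdefault(doc_id, HIDDEN_DOCS[doc_id])  # "download" the result
        keyword = next_keyword(retrieved, used)
    return retrieved, used

if __name__ == "__main__":
    docs, queries = crawl("search", budget=3)
    print("queries issued:", queries)
    print("documents retrieved:", sorted(docs))
```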

  8. Problem formalization • Set-cover • Vertex – documents • Hyper-edges – query words

  9. Goal • Maximize the number of unique documents retrieved with the minimum number of query words
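Viewed offline, the formalization on these two slides is a set-cover-style problem: each query word covers the set of documents containing it, and we want a few words that together cover as many documents as possible. A minimal sketch with made-up posting lists:

```python
# Hypothetical set-cover instance: query word -> ids of documents containing it.
POSTINGS = {
    "linux":  {1, 2, 3, 4, 5},
    "debian": {1, 2},
    "redhat": {3, 4},
    "music":  {6, 7},
}

def greedy_cover(postings, num_queries):
    """Pick query words one at a time, each time choosing the word that adds
    the most not-yet-covered documents (the standard greedy set-cover rule)."""
    covered, chosen = set(), []
    for _ in range(num_queries):
        word = max(postings, key=lambda w: len(postings[w] - covered))
        if not postings[word] - covered:
            break  # no remaining word adds anything new
        chosen.append(word)
        covered |= postings[word]
    return chosen, covered

if __name__ == "__main__":
    words, docs = greedy_cover(POSTINGS, num_queries=2)
    print(words, "cover", len(docs), "documents")  # ['linux', 'music'] cover 7 documents
```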

  10. Problem formalization • P(qi) • portion of unique documents retrieved by issuing query word qi (portion of documents containing qi) • P(qi v qj) • portion of unique documents retrieved by issuing query words qi and qj (portion of documents containing qi or qj) • P(qi | qj) • portion of documents containing qi among the documents retrieved by issuing query word qj
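These quantities can be computed directly from posting sets. A small sketch, where the collection size and posting sets are invented for illustration:

```python
# Invented posting sets over a 10-document collection: word -> docs containing it.
N_DOCS = 10
DOCS = {
    "linux":  {1, 2, 3, 4},
    "debian": {1, 2, 5},
}

def p(word):
    """P(q): fraction of the collection containing the word."""
    return len(DOCS[word]) / N_DOCS

def p_or(w1, w2):
    """P(q1 v q2): fraction of the collection containing w1 or w2."""
    return len(DOCS[w1] | DOCS[w2]) / N_DOCS

def p_given(w1, w2):
    """P(q1 | q2): fraction of the documents retrieved by w2 that also contain w1."""
    return len(DOCS[w1] & DOCS[w2]) / len(DOCS[w2])

if __name__ == "__main__":
    print(p("linux"), p_or("linux", "debian"), p_given("linux", "debian"))
    # 0.4, 0.5, 0.666... -- note P(q1 v q2) = P(q1) + P(q2) - P(q2) * P(q1 | q2)
```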

  11. Problem formalization • What is the next “best” query word? • P((q1 v … v qi-1) v qi) = P(q1 v … v qi-1) + P(qi) – P((q1 v … v qi-1) ^ qi) = P(q1 v … v qi-1) + P(qi) – P(q1 v … v qi-1) P(qi | q1 v … v qi-1) • P(q1 v … v qi-1) – known • P(qi | q1 v … v qi-1) – known • P(qi) – unknown • Approximate P(qi) using P(qi | q1 v … v qi-1)
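A sketch of this estimate in code. The two inputs are the known quantities; P(qi) over the whole hidden collection is unknown, so it is approximated by the conditional probability measured on the documents retrieved so far, exactly as the slide suggests. The function and parameter names are my own.

```python
def estimated_coverage(p_prev, p_qi_given_prev):
    """Estimate P(q1 v ... v q_{i-1} v q_i): the fraction of the hidden
    collection covered after also issuing q_i.

    p_prev          -- P(q1 v ... v q_{i-1}), known from documents already retrieved
    p_qi_given_prev -- P(q_i | q1 v ... v q_{i-1}), measured on the retrieved documents
    The unknown P(q_i) is approximated by p_qi_given_prev.
    """
    p_qi = p_qi_given_prev                      # the approximation on this slide
    return p_prev + p_qi - p_prev * p_qi_given_prev

if __name__ == "__main__":
    # If 40% of the collection is already covered and a candidate word appears
    # in 25% of the retrieved documents, we expect roughly
    # 0.40 + 0.25 - 0.40 * 0.25 = 0.55 coverage after issuing it.
    print(estimated_coverage(0.40, 0.25))
```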

  12. Search for best query word • Greedy: choose the word occurring most frequently in the documents retrieved so far as the next query • Choose qi with maximum P(qi | q1 v … v qi-1) • For the set-cover problem, greedy is proven to give a solution within a logarithmic factor of optimal

  13. Search for best query word • Can we do better? • Intuition: correlation of keywords • E.g. “linux” vs. debian, redhat, suse, knoppix, fedora, etc. • If we have already issued those, we might save the query word “linux”!

  14. Search for best query word • [Venn diagram: whole document collection; documents retrieved by qi, qj, and qk; already-retrieved documents]

  15. Search for best query word • linux, debian, redhat • f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlaps between “redhat, linux”, “debian, linux”, and “redhat, debian”
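On the retrieved sample this objective can be computed exactly with set operations instead of expanding the inclusion-exclusion terms by hand. A sketch with invented posting sets:

```python
# Invented posting sets measured on the documents retrieved so far.
POSTINGS = {
    "linux":  {1, 2, 3, 4, 5, 6},
    "debian": {1, 2, 7},
    "redhat": {3, 4, 8},
}

def f(words):
    """Number of distinct documents obtained by issuing all the given query
    words: the individual result sets minus their overlaps, i.e. simply the
    size of their union."""
    union = set()
    for w in words:
        union |= POSTINGS[w]
    return len(union)

if __name__ == "__main__":
    print(f(["linux", "debian", "redhat"]))   # 8 distinct documents
    print(f(["debian", "redhat"]))            # 6 -- adding "linux" would only contribute docs 5 and 6
```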

  16. Search for best query word • The search tree is huge (large branching factor) • We look ahead over the 10 most frequent keywords • We only search up to depth 6 • Pruning

  17. Search for best query word • DFBnB (depth-first branch and bound): prune a sub-tree when the number of documents retrieved, even assuming no overlap between keywords, is less than the current best solution
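A sketch of the branch-and-bound idea described here: an optimistic bound assumes the remaining candidate words do not overlap at all, and a sub-tree is pruned when even that bound cannot beat the best solution found so far. The posting sets, candidate ordering, and depth are illustrative, not the exact experimental setup.

```python
# Invented posting sets for the candidate keywords.
POSTINGS = {
    "linux":  {1, 2, 3, 4, 5, 6},
    "debian": {1, 2, 7},
    "redhat": {3, 4, 8},
    "suse":   {5, 6},
    "music":  {9, 10},
}

def dfbnb(candidates, depth, covered=frozenset(), chosen=(), best=None):
    """Depth-first branch and bound: explore subsets of up to `depth` query
    words and keep the subset covering the most distinct documents."""
    if best is None or len(covered) > best[0]:
        best = (len(covered), chosen)
    if depth == 0 or not candidates:
        return best
    # Optimistic bound for this sub-tree: current coverage plus the `depth`
    # largest remaining result sets, as if they never overlapped.
    sizes = sorted((len(POSTINGS[w]) for w in candidates), reverse=True)
    if len(covered) + sum(sizes[:depth]) <= best[0]:
        return best  # prune: even with zero overlap we cannot beat the best so far
    for i, word in enumerate(candidates):
        best = dfbnb(candidates[i + 1:], depth - 1,
                     covered | POSTINGS[word], chosen + (word,), best)
    return best

if __name__ == "__main__":
    print(dfbnb(list(POSTINGS), depth=2))   # best coverage reachable with 2 query words
```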

  18. Experiment • Document collection: ~100K front pages of randomly selected websites • Query interface: an inverted index (a program that returns the documents containing a given query word) • Methods • Greedy • DFS search (look ahead over 10 words, up to depth 6) • DFS search with pruning (DFBnB)
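The query interface used in the experiments is described as an inverted index over the local collection. A minimal sketch of such an index; the sample documents are made up and stand in for the ~100K front pages.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def query(index, word):
    """Query interface: return the documents containing the given word."""
    return index.get(word.lower(), set())

if __name__ == "__main__":
    docs = {  # made-up front pages standing in for the real collection
        1: "Welcome to our local events page",
        2: "Enter your email for the latest events",
        3: "High quality maps of the world",
    }
    index = build_inverted_index(docs)
    print(query(index, "events"))   # {1, 2}
```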

  19. Results • Does searching help? • Two keyword sequences and the number of documents each query word retrieved: • Sequence 1: provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77 • Sequence 2: provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77

  20. Results • Does searching help?

  21. Results • How much does pruning save? • Without pruning, 187300 nodes are examined: 187300 = (10)+(10*9)+(10*9*8)+(10*9*8*7)+(10*9*8*7*6)+(10*9*8*7*6*5) • With pruning, 5558 nodes are examined on average (when we choose the most frequent keyword to expand) • DFBnB saves a factor of roughly 30

  22. Conclusion • Searching helps little “in this problem” • DFBnB is “really effective” at pruning the search tree

  23. End

  24. More results • Prior information helps

  25. Results

  26. Results

  27. Search & Greedy

  28. Search with prune & Greedy

  29. Search for best query word • base = q1 v … v qi • P(base v qi+1 v qi+2) = P(base v qi+1) + P(qi+2) – P((base v qi+1) ^ qi+2) • P((base v qi+1) ^ qi+2) = P((base ^ qi+2) v (qi+1 ^ qi+2)) = P(base ^ qi+2) + P(qi+1 ^ qi+2) – P(base ^ qi+1 ^ qi+2)

  30. 2 words overlapping

  31. 3 words overlapping
