

  1. DEEP WEB Shubhangi Agrawal (08305044) Jayalekshmy S. Nair (08305056) CS621 : Seminar-2008

  2. Introduction • Deep Web : the part of the Web that is not part of the surface Web. • Surface Web : the part of the World Wide Web that is crawled and indexed by conventional search engines. • The Deep Web is estimated to hold about 91,000 terabytes of data, whereas the surface Web holds only about 167 terabytes.

  3. Contextual View Of The Deep Web

  4. What Constitutes Deep Web • Dynamic content : pages generated in response to a submitted query. • Unlinked content : pages that are not linked from any other page. • Private Web : sites that require registration and login.

  5. What Constitutes Deep Web • Limited-access content : sites that restrict access to their pages by technical means. • Scripted content : pages that are reachable only through links produced by JavaScript. • Non-HTML/text content : textual content encoded in multimedia (image or video) files or in file formats that search engines do not handle.

  6. Why Is The Information Not Accessible • Conventional search engines use programs called spiders or crawlers. • When a crawler reaches a page, it captures the text on that page, indexes it, and follows any static hyperlinks on it. • It cannot crawl and index information held in databases, because those pages have no static URL: they exist only as responses to submitted queries (see the sketch below).
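
A minimal sketch of that link-following loop in standard-library Python; the seed URL and page limit are illustrative. Because the crawler only discovers URLs that appear literally in the page HTML, database content behind a query form is never fetched:

```python
# Minimal sketch of a conventional crawler. It fetches a page, would hand
# the text to an indexer, and follows only static <a href> links; any page
# that exists only as a response to a form query is never reached.
import re
import urllib.request
from urllib.parse import urljoin

def crawl(seed, max_pages=10):
    seen, frontier = set(), [seed]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urllib.request.urlopen(url, timeout=5)
            html = page.read().decode("utf-8", errors="ignore")
        except Exception:
            continue            # unreachable page: skip it
        # index(html) would run here; below, only static links are extracted
        for href in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urljoin(url, href))
    return seen
```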

  7. Why Use The Deep Web • Very vast : an estimated 550 times the size of the surface Web • Quality of content / higher level of authority • Comprehensiveness • Focused content • Timeliness • Much of the material is not available elsewhere on the Web

  8. How To Access Contents Of Deep Web • Manually search the individual databases • Human-guided crawlers (Web Harvesting) • Federated Search

  9. Web Harvesting • Web harvesting is an implementation of a Web crawler that uses human expertise or machine guidance to direct the crawler to URLs which compose a specialized collection or set of knowledge. • Web harvesting can be thought of as focused or directed Web crawling.

  10. Process • Identify and specify, as input to a computer program, a list of URLs that defines a specialized collection or set of knowledge • The program then downloads the pages on this list • A crawl depth can be defined, and the crawling need not be recursive • The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application (the loop is sketched below)
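
A minimal sketch of that harvesting loop, assuming hypothetical `fetch(url)` and `extract_links(page)` helpers supplied by the harvesting application; the explicit `max_depth` shows how crawling can be bounded or made non-recursive (`max_depth=0` downloads the seed list only):

```python
# Sketch of the harvesting loop: a human-curated seed list, an explicit
# crawl depth, and a handoff of downloaded content to the indexer.
# `fetch` and `extract_links` are assumed helpers, not a real library API.
from collections import deque

def harvest(seed_urls, fetch, extract_links, max_depth=1):
    queue = deque((url, 0) for url in seed_urls)
    collected = {}
    while queue:
        url, depth = queue.popleft()
        if url in collected:
            continue
        page = fetch(url)          # download one page of the collection
        collected[url] = page      # later indexed by the search application
        if depth < max_depth:      # depth 0 = seeds only (non-recursive)
            for link in extract_links(page):
                queue.append((link, depth + 1))
    return collected
```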

  11. Limitations • The amount of human intervention needed is high. • Some sites are very slow, particularly during busy periods, so getting all the information needed within a limited time window may be impossible.

  12. Federated Search • A simultaneous search of multiple online databases • The user enters the query in a single interface • The query is sent to the different databases associated with the search engine • The results are presented in a form suitable to the user

  13. Process • Transform the query and broadcast it to a group of databases with the appropriate syntax for each • Merge the results collected from the databases • Present them in a unified format with minimal duplication • Provide a means, applied either automatically or by the portal user, to sort the merged result set (sketched below)
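
A minimal sketch of that pipeline, assuming each database is wrapped in a `search(query)` callable returning result dicts with "url" and "score" keys (all of these names are assumptions). The sources are queried in parallel, merged with URL-based de-duplication, and sorted:

```python
# Sketch of a federated-search pipeline: broadcast, merge, de-duplicate,
# sort. Each element of `sources` is an assumed search(query) callable.
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, sources):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda search: search(query), sources))
    merged, seen = [], set()
    for results in result_lists:
        for hit in results:
            if hit["url"] not in seen:   # minimal de-duplication by URL
                seen.add(hit["url"])
                merged.append(hit)
    # One possible sort of the merged set: by each source's own score.
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)
```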

  14. Federated Search contd... • Advantage : the results are as current as the information sources, because the sources are searched in real time • Example : WorldWideScience • It covers about 40 information sources, several of which are federated search portals themselves

  15. Limitations • Scalability • The vast amount of incoming information can be a problem • Not all databases can be covered • Either the entire database is searched, or user intervention is required • The results depend on the user supplying the correct keywords

  16. Automatic Information Discovery From The Invisible Web A system that maintains information about the specialized search engines in the invisible Web. When a query arrives, the system not only finds the most appropriate specialized engines, but also redirects the query automatically, so that the user directly receives the appropriate query results. Characteristics : • Database of specialized search engines • Automatic search engine selection • Data mining for better query specification and search

  17. System Architecture

  18. System Overview 1. Populate the search engine database • Crawlers identify search engines by looking for form tags (see the sketch below) • Along with the URL, an engine description is also stored in the database 2. Query pre-processing • Send the query keywords to some general search engines and collect the top results • Based on those results, find words and phrases that frequently appear together with the search keywords
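
A minimal sketch of the form-tag heuristic in step 1, using Python's standard-library HTMLParser. The rule used here (flag any form that contains a text input) is an assumption; the actual system may apply stricter criteria:

```python
# Sketch of detecting a specialized search engine by its HTML form tags.
# Flagging any <form> that contains a text <input> is an assumed heuristic.
from html.parser import HTMLParser

class FormDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_search_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # an <input> with no type attribute defaults to a text field
            if dict(attrs).get("type", "text") == "text":
                self.has_search_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_search_engine(html):
    detector = FormDetector()
    detector.feed(html)
    return detector.has_search_form
```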

  19. System Overview 3. Engine selection • Each keyword or phrase generated in the pre-processing step is matched against the search engine descriptions in the database (see the sketch below) 4. Query execution and result post-processing • Once the search engines are selected, the system automatically sends the query to all of them and waits for the results • Based on the information stored in the database, the system can automatically generate the query string and send the appropriate query to each website
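
A minimal sketch of the matching in step 3, assuming the engine database is a mapping from engine URL to its stored textual description; the simple term-overlap score stands in for whatever matching the paper actually uses:

```python
# Sketch of engine selection: score each stored engine description against
# the expanded keyword set and keep the best matches. The term-overlap
# scoring and the top_k cutoff are assumptions for illustration.
def select_engines(keywords, engine_db, top_k=3):
    scored = []
    for url, description in engine_db.items():
        terms = set(description.lower().split())
        overlap = sum(1 for kw in keywords if kw.lower() in terms)
        if overlap:
            scored.append((overlap, url))
    scored.sort(reverse=True)
    return [url for _, url in scored[:top_k]]
```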

  20. Conclusion The Deep Web constitutes a large repository of information which is getting deeper and bigger all the time. There are various possible ways in which its information can be accessed. There has been continuous improvement in this field, but more efficient methods still need to be commercially implemented.

  21. References • Bergman, M. K. (2001). The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). Retrieved from http://www.press.umich.edu/jep/07-01/bergman.html • Lin, King-Ip, and Chen, Hui. "Automatic Information Discovery from the 'Invisible Web'." International Conference on Information Technology: Coding and Computing (ITCC), p. 332, 2002. • www.wikipedia.com • http://worldwidescience.org/ • http://science.gov/

  22. Queries ???
