

  1. DEEP WEB Shubhangi Agrawal (08305044) Jayalekshmy S. Nair (08305056) CS621 : Seminar-2008

  2. Introduction • Deep Web : the part of the Web that is not part of the surface Web. • Surface Web : the part of the World Wide Web that is crawled and indexed by conventional search engines. • The Deep Web is estimated to hold about 91,000 terabytes of data, whereas the surface Web holds only about 167 terabytes.

  3. Contextual View Of The Deep Web

  4. What Constitutes Deep Web • Dynamic content : pages generated in response to a submitted query. • Unlinked content : pages that are not linked from any other page. • Private Web : sites that require registration and login.

  5. What Constitutes Deep Web • Limited-access content : sites that restrict access to their pages by technical means. • Scripted content : pages that are reachable only through links produced by JavaScript. • Non-HTML/text content : textual content encoded in multimedia (image or video) files or in file formats that search engines do not handle.

  6. Why Is The Information Not Accessible • Conventional search engines use programs called spiders or crawlers. • When a crawler reaches a page, it captures the text on that page, indexes it, and follows any static hyperlinks on it. • It cannot crawl and index information held in databases, because those pages have no static URL: they exist only as responses to submitted queries (see the sketch below).
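
A minimal sketch of that link-following loop in standard-library Python; the seed URL and page limit are illustrative. Because the crawler only discovers URLs that appear literally in the page HTML, database content behind a query form is never fetched:

```python
# Minimal sketch of a conventional crawler. It fetches a page, would hand
# the text to an indexer, and follows only static <a href> links; any page
# that exists only as a response to a form query is never reached.
import re
import urllib.request
from urllib.parse import urljoin

def crawl(seed, max_pages=10):
    seen, frontier = set(), [seed]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urllib.request.urlopen(url, timeout=5)
            html = page.read().decode("utf-8", errors="ignore")
        except Exception:
            continue            # unreachable page: skip it
        # index(html) would run here; below, only static links are extracted
        for href in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urljoin(url, href))
    return seen
```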

  7. Why Use The Deep Web • Very vast : an estimated 550 times the size of the surface Web • Quality of content / higher level of authority • Comprehensiveness • Focused content • Timeliness • Much of the material is not available elsewhere on the Web

  8. How To Access Contents Of Deep Web • Manually search the individual databases • Human-guided crawlers (Web Harvesting) • Federated Search

  9. Web Harvesting • Web harvesting is an implementation of a Web crawler that uses human expertise or machine guidance to direct the crawler to URLs which compose a specialized collection or set of knowledge. • Web harvesting can be thought of as focused or directed Web crawling.

  10. Process • Identify and specify, as input to a computer program, a list of URLs that defines a specialized collection or set of knowledge • The program then downloads the pages on this list • A crawl depth can be defined, and the crawling need not be recursive • The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application (the loop is sketched below)
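
A minimal sketch of that harvesting loop, assuming hypothetical `fetch(url)` and `extract_links(page)` helpers supplied by the harvesting application; the explicit `max_depth` shows how crawling can be bounded or made non-recursive (`max_depth=0` downloads the seed list only):

```python
# Sketch of the harvesting loop: a human-curated seed list, an explicit
# crawl depth, and a handoff of downloaded content to the indexer.
# `fetch` and `extract_links` are assumed helpers, not a real library API.
from collections import deque

def harvest(seed_urls, fetch, extract_links, max_depth=1):
    queue = deque((url, 0) for url in seed_urls)
    collected = {}
    while queue:
        url, depth = queue.popleft()
        if url in collected:
            continue
        page = fetch(url)          # download one page of the collection
        collected[url] = page      # later indexed by the search application
        if depth < max_depth:      # depth 0 = seeds only (non-recursive)
            for link in extract_links(page):
                queue.append((link, depth + 1))
    return collected
```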

  11. Limitations • The amount of human intervention needed is high. • Some sites are very slow, particularly during busy periods, so getting all the information needed within a limited time window may be impossible.

  12. Federated Search • A simultaneous search of multiple online databases • The user enters the query in a single interface • The query is sent to the different databases associated with the search engine • The results are presented in a form suitable to the user

  13. Process • Transform the query and broadcast it to a group of databases with the appropriate syntax for each • Merge the results collected from the databases • Present them in a unified format with minimal duplication • Provide a means, applied either automatically or by the portal user, to sort the merged result set (sketched below)
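
A minimal sketch of that pipeline, assuming each database is wrapped in a `search(query)` callable returning result dicts with "url" and "score" keys (all of these names are assumptions). The sources are queried in parallel, merged with URL-based de-duplication, and sorted:

```python
# Sketch of a federated-search pipeline: broadcast, merge, de-duplicate,
# sort. Each element of `sources` is an assumed search(query) callable.
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, sources):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda search: search(query), sources))
    merged, seen = [], set()
    for results in result_lists:
        for hit in results:
            if hit["url"] not in seen:   # minimal de-duplication by URL
                seen.add(hit["url"])
                merged.append(hit)
    # One possible sort of the merged set: by each source's own score.
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)
```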

  14. Federated Search contd... • Advantage : the results are as current as the information sources, because the sources are searched in real time • Example : WorldWideScience • It covers about 40 information sources, several of which are federated search portals themselves

  15. Limitations • Scalability • The vast amount of incoming information can be a problem • Not all databases can be covered • Either the entire database is searched, or user intervention is required • The results depend on the user supplying the correct keywords

  16. Automatic Information Discovery From The Invisible Web A system that maintains information about the specialized search engines in the invisible Web. When a query arrives, the system not only finds the most appropriate specialized engines, but also redirects the query automatically, so that the user directly receives the appropriate query results. Characteristics : • Database of specialized search engines • Automatic search engine selection • Data mining for better query specification and search

  17. System Architecture

  18. System Overview 1. Populate the search engine database • Crawlers identify search engines by looking for form tags (see the sketch below) • Along with the URL, an engine description is also stored in the database 2. Query pre-processing • Send the query keywords to some general search engines and collect the top results • Based on those results, find words and phrases that frequently appear together with the search keywords
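
A minimal sketch of the form-tag heuristic in step 1, using Python's standard-library HTMLParser. The rule used here (flag any form that contains a text input) is an assumption; the actual system may apply stricter criteria:

```python
# Sketch of detecting a specialized search engine by its HTML form tags.
# Flagging any <form> that contains a text <input> is an assumed heuristic.
from html.parser import HTMLParser

class FormDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_search_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # an <input> with no type attribute defaults to a text field
            if dict(attrs).get("type", "text") == "text":
                self.has_search_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_search_engine(html):
    detector = FormDetector()
    detector.feed(html)
    return detector.has_search_form
```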

  19. System Overview 3. Engine selection • Each keyword or phrase generated in the pre-processing step is matched against the search engine descriptions in the database (see the sketch below) 4. Query execution and result post-processing • Once the search engines are selected, the system automatically sends the query to all of them and waits for the results • Based on the information stored in the database, the system can automatically generate the query string and send the appropriate query to each website
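
A minimal sketch of the matching in step 3, assuming the engine database is a mapping from engine URL to its stored textual description; the simple term-overlap score stands in for whatever matching the paper actually uses:

```python
# Sketch of engine selection: score each stored engine description against
# the expanded keyword set and keep the best matches. The term-overlap
# scoring and the top_k cutoff are assumptions for illustration.
def select_engines(keywords, engine_db, top_k=3):
    scored = []
    for url, description in engine_db.items():
        terms = set(description.lower().split())
        overlap = sum(1 for kw in keywords if kw.lower() in terms)
        if overlap:
            scored.append((overlap, url))
    scored.sort(reverse=True)
    return [url for _, url in scored[:top_k]]
```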

  20. Conclusion The Deep Web constitutes a large repository of information which is getting deeper and bigger all the time. There are various possible ways in which its information can be accessed. There has been continuous improvement in this field, but more efficient methods still need to be commercially implemented.

  21. References • Bergman, M. K. (2001). The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). Retrieved from http://www.press.umich.edu/jep/07-01/bergman.html • Lin, King-Ip, and Chen, Hui. "Automatic Information Discovery from the 'Invisible Web'." International Conference on Information Technology: Coding and Computing (ITCC), p. 332, 2002. • www.wikipedia.com • http://worldwidescience.org/ • http://science.gov/

  22. Queries ???
