Harnessing the Deep Web
Paper authors: Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy

Presentation Transcript


  1. Harnessing the Deep Web. Paper authors: Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy

    Paper published: January 2009. Speaker: Aditya Vardhan
  2. Deep Web: Bigger Picture. The Deep Web refers to content hidden behind HTML forms. The Deep Web is believed to be the biggest source of structured data on the Web, and hence accessing its contents has been a long-standing challenge. Typical search engines use web crawlers that discover pages by following hyperlinks. This technique is ideal for discovering resources on the surface web, but it is ineffective at finding deep-web resources, which sit behind form submissions rather than links (a minimal crawler sketch follows below).
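To make the crawling limitation concrete, here is a minimal sketch of a surface-web crawler, assuming the requests and beautifulsoup4 libraries and a placeholder seed URL. Because it discovers pages only by following hyperlinks, content reachable only through form submissions never enters its frontier.

```python
# Minimal sketch of a surface-web crawler: it discovers pages only by
# following <a href> hyperlinks, which is why form-guarded (deep-web)
# content is invisible to it. Assumes requests + beautifulsoup4.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100):
    seen, frontier = {seed_url}, deque([seed_url])
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        yield url

# Example with a placeholder seed:
# for page in crawl("http://example.com/"):
#     print(page)
```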
  3. Deep Web: Bigger Picture. The Deep Web is one significant source of data, much of it structured, but it is not the only one: dynamic content, unlinked content, the private Web, and scripted content are also hidden from standard crawlers. There are two main approaches to exposing deep-web content, discussed later: (1) virtual integration and (2) the surfacing approach. Semantic servers can be built to aggregate the structured data, which helps in extraction efforts and in reformulating web search queries.
  4. Landscape of Structured Data. There is a common misconception that the Deep Web and structured data on the Web are synonymous. It is important to first consider the search task we expect to use the structured data for. Roughly, these tasks fall into one of the following classes:
     1) Unstructured queries: identical to the popular current mode of searching the Web. Users pose keyword queries and get a listing of URLs to web pages as the result.
     2) Single-page structured queries: more precise queries posed to the search engine, possibly through an interface that lets users submit structured queries (e.g., searching for jobs, cars, or houses).
     3) Multi-page structured queries: the search engine finds answers derived by combining (e.g., via a join or union) data from multiple structured (or unstructured) sources on the Web. For example, combining data from the Internet Movie Database with websites listing movie playing times to find local playing times of movies directed by Woody Allen (a small join sketch follows below).
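To illustrate the multi-page case, here is a minimal sketch of the implied join, using small hypothetical in-memory stand-ins for the two sources (movie metadata and local showtimes); the real task would combine live data from independent websites.

```python
# Hypothetical stand-ins for two structured sources: a movie database
# and a showtimes site. A multi-page structured query joins them.
movies = [
    {"title": "Annie Hall", "director": "Woody Allen"},
    {"title": "Alien", "director": "Ridley Scott"},
]
showtimes = [
    {"title": "Annie Hall", "theater": "Rialto", "time": "19:30"},
    {"title": "Alien", "theater": "Grand", "time": "21:00"},
]

# Join: local playing times of movies directed by Woody Allen.
allen_titles = {m["title"] for m in movies if m["director"] == "Woody Allen"}
for s in showtimes:
    if s["title"] in allen_titles:
        print(s["title"], s["theater"], s["time"])
```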
  5. What is available? Structured data on the Web comes mainly in two forms: data that is itself already structured as tables, and data whose query interface is structured. The first type was analyzed by the WebTables project, which collected the HTML tables on the Web, extracted the ones offering high-quality data, and offered a search interface over them. That work showed there are on the order of 150 million high-quality relational tables on the Web, restricting attention to English pages (a minimal extraction sketch follows below).
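As a rough illustration of WebTables-style extraction (not the project's actual pipeline), the following sketch pulls all HTML tables from a page and keeps the ones that look plausibly relational. It assumes the pandas library; the URL and the quality heuristic are placeholders.

```python
# A minimal sketch of WebTables-style extraction: parse every <table>
# element on a page and keep ones that look relational (several rows
# and at least two columns). Real quality filtering is far richer.
import pandas as pd

def relational_tables(url, min_rows=3):
    tables = pd.read_html(url)          # parses every HTML <table>
    return [t for t in tables
            if len(t) >= min_rows and t.shape[1] >= 2]

# Example with a placeholder URL:
# for t in relational_tables("http://example.com/stats.html"):
#     print(t.head())
```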
  6. The second kind of structured data is available through structured query interfaces. HTML-form interfaces enable users to pose specific (but templated) queries in particular domains and obtain results embedded in HTML pages (see the sketch below). Deep-web data is clearly useful for unstructured queries and single-page structured queries that return web pages as results. Providing precise answers from deep-web content is significantly harder, because of the additional extraction effort required and the fact that extraction is hard to do in a site-independent way across a large number of sites.
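A minimal sketch of posing such a templated query, assuming the requests library; the endpoint and parameter names are hypothetical, since every form defines its own.

```python
# A minimal sketch of querying a deep-web source through its HTML
# form interface. The endpoint and parameter names are hypothetical.
import requests

params = {"make": "Honda", "model": "Civic", "zip": "94043"}
resp = requests.get("http://example.com/search", params=params)

# The results come back embedded in an HTML page, not as structured
# records, which is why site-independent extraction is hard.
print(resp.url)  # e.g. http://example.com/search?make=Honda&model=Civic&zip=94043
```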
  7. Access to the Deep Web. There are two main approaches to providing access to deep-web content: virtual integration and surfacing. The virtual integration approach is based on constructing mediator systems, potentially one per domain (e.g., used cars, real estate, or books). To create an application with this approach, one analyzes forms, identifies the domain of their underlying content, and creates semantic mappings from the inputs of each form to the elements of that domain's mediated schema. Queries over the mediated schema can then be reformulated as queries over each of the underlying forms (a reformulation sketch follows below). Virtual integration is a very attractive option when designing vertical search engines.
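The following is a minimal sketch of the reformulation step under assumed names: a hand-written mapping from a hypothetical used-cars mediated schema to the input names of two hypothetical site forms.

```python
# A minimal sketch of virtual integration, under assumed names: a
# mediated schema for the used-cars domain and hand-written mappings
# from its fields to each form's input names. A query over the
# mediated schema is reformulated into per-form parameter sets.
MAPPINGS = {
    "siteA": {"make": "mk", "model": "mdl", "max_price": "price_to"},
    "siteB": {"make": "manufacturer", "model": "model", "max_price": "maxPrice"},
}

def reformulate(mediated_query):
    """Translate one mediated-schema query into per-form queries."""
    return {site: {m[field]: value for field, value in mediated_query.items()}
            for site, m in MAPPINGS.items()}

print(reformulate({"make": "Honda", "model": "Civic", "max_price": 10000}))
# {'siteA': {'mk': 'Honda', 'mdl': 'Civic', 'price_to': 10000},
#  'siteB': {'manufacturer': 'Honda', 'model': 'Civic', 'maxPrice': 10000}}
```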
  8. Access to the Deep Web. Drawbacks of virtual integration: it is not suitable in the context of a general search engine, since building a mediator is only viable when the domain has high commercial value, and managing semantic mappings for millions of useful forms is very challenging.
  9. Access to the Deep Web. The surfacing approach focuses on pre-computing the most relevant form submissions, i.e., queries, for all interesting HTML forms. The URLs resulting from these submissions are generated offline and indexed in a search engine like any other HTML page (a URL-generation sketch follows below). Surfacing sidesteps the problem of building models that map user queries to forms: the IR index is built by analyzing the contents of the pages resulting from form submissions.
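A minimal sketch of offline URL generation for surfacing, with a hypothetical form action and input values; real systems must also select which combinations are worth materializing.

```python
# A minimal sketch of the surfacing approach: pre-compute form
# submissions offline by enumerating combinations of input values,
# emitting one GET URL per combination for the crawler to index.
# The action URL and input names are hypothetical.
from itertools import product
from urllib.parse import urlencode

ACTION = "http://example.com/search"
INPUTS = {
    "state": ["CA", "NY", "TX"],      # values known from a select-menu
    "category": ["books", "music"],
}

def surface_urls(action, inputs):
    names = list(inputs)
    for combo in product(*(inputs[n] for n in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))

for url in surface_urls(ACTION, INPUTS):
    print(url)    # 6 URLs: one per (state, category) pair
```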
  10. Drawbacks of the surfacing approach:
     1) We have to identify values that are suitable for the various form inputs. For select-menus the values are already known, but for text inputs they must be predicted.
     2) We have to minimize the number of queries over each form, so as not to pose an unreasonable load during offline analysis.
     3) Surfacing cannot be applied to HTML forms that use the POST method, since those submissions do not produce a distinct URL that can be indexed.
  11. The Role of Semantics. Semantics plays a crucial role in the virtual-integration approach to accessing the Deep Web and in extending the coverage of deep-web surfacing.
     1) Semantics of form inputs: designing a mediated schema with lists of values associated with its different elements, and applying the generated schema differently depending on the type of input field (typed text boxes versus generic search boxes).
  12. The Role of Semantics. 2) Correlated inputs: most useful forms have multiple inputs, and ignoring dependencies between them can lead to ineffective surfacing results. Prior surfacing approaches either considered forms with a single input or, when considering multiple inputs, queried them only one at a time or ignored any dependencies between them.
  13. The Role of Semantics. Correlated inputs (contd.): we outline two kinds of correlations between inputs that are especially important, ranges and database selection.
     1) Ranges: forms often have pairs of inputs defining a range over a single numeric property, one input each to restrict the maximum and minimum values of the property in the retrieved results. Analysis indicates that as many as 20% of English forms hosted in the US have input pairs that are likely to be ranges. Not recognizing such pairs can lead to wasteful generation of URLs. Resolving the ranges issue takes two steps: first, identify pairs of inputs that are very likely to correspond to ranges; second, determine which pairs of values would be most effective to use for the ranges. Once ranges are identified, even simple strategies for picking value pairs can significantly reduce the total number of URLs generated (see the sketch below).
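A minimal sketch of one such simple strategy, using illustrative price boundaries: query consecutive non-overlapping ranges rather than the cross product of all minimum and maximum values.

```python
# A minimal sketch of range handling: once a (min, max) input pair is
# recognized, query non-overlapping consecutive ranges instead of the
# cross product of all min values with all max values. The boundary
# values are illustrative assumptions.
PRICE_POINTS = [0, 5000, 10000, 20000, 50000]

def range_pairs(points):
    """Consecutive (lo, hi) pairs that cover the space without overlap."""
    return list(zip(points, points[1:]))

print(range_pairs(PRICE_POINTS))
# [(0, 5000), (5000, 10000), (10000, 20000), (20000, 50000)]
# 4 queries instead of the 5 x 5 = 25 naive min/max combinations.
```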
  14. The Role of Semantics. 2) Database selection: this pattern typically consists of two inputs, one text box and one select-menu. The value in the text box poses keyword queries, while the select-menu identifies which underlying database the query is targeted at. The best set of values for the text box therefore varies with the value in the select-menu. Generating different sets of keywords per selection is easy, provided we can reliably identify such patterns in inputs, which remains an open challenge (a sketch follows below). One resolution is to add a JavaScript emulator to the analysis of forms, with which such correlations can be identified; the canonical example of correlated inputs, a pair of inputs specifying the make and model of cars (where the make restricts the possible models), is typically handled in a form by JavaScript.
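A minimal sketch of the pattern, with hypothetical databases and seed keywords: the keyword set sent to the text box is chosen per select-menu value rather than shared across all of them.

```python
# A minimal sketch of database selection: the productive keyword set
# for the text box depends on which underlying database the
# select-menu picks. Databases and seed keywords are hypothetical.
SEED_KEYWORDS = {
    "jobs":    ["engineer", "nurse", "driver"],
    "resumes": ["python", "logistics", "sales"],
}

def correlated_queries(select_input="db", text_input="q"):
    for db, words in SEED_KEYWORDS.items():
        for w in words:
            yield {select_input: db, text_input: w}

for q in correlated_queries():
    print(q)   # e.g. {'db': 'jobs', 'q': 'engineer'}
```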
  15. Analysis of Surfaced Content. Two problem areas that are prime candidates for future deep-web research concern the interpretation of pages resulting from surfacing.
     Semantics and extraction: when structured data in the Deep Web is surfaced, the structure, and hence the semantics, of the data is lost. The loss of semantics is also a lost opportunity for query answering. The challenge is to find the right kind of annotation that the IR-style index can use most effectively.
     Coverage of the surfaced content: this is an open research problem that does not yet have a satisfying solution. Most approaches to extracting deep-web content employ greedy algorithms that try to maximize coverage but provide no guarantees (see the sketch below). Note that a natural goal for a surfacing algorithm is to minimize the number of surfaced pages while maximizing coverage.
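Since greedy coverage maximization is named above, here is a minimal set-cover-style sketch of that idea; the mapping from candidate queries to the records their result pages contain is a hypothetical stand-in for probing results.

```python
# A minimal sketch of the greedy strategy: repeatedly pick the
# candidate form query whose result pages add the most still-uncovered
# records, stopping when the marginal gain is negligible. The
# candidate-to-records map is a hypothetical stand-in.
CANDIDATES = {
    "q=a": {1, 2, 3, 4},
    "q=b": {3, 4, 5},
    "q=c": {5, 6},
    "q=d": {1, 6},
}

def greedy_surface(candidates, min_gain=1):
    covered, chosen = set(), []
    while True:
        best = max(candidates, key=lambda q: len(candidates[q] - covered))
        if len(candidates[best] - covered) < min_gain:
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

print(greedy_surface(CANDIDATES))
# (['q=a', 'q=c'], {1, 2, 3, 4, 5, 6}): full coverage in 2 queries,
# but greedy selection offers no general optimality guarantee.
```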
  16. Aggregating Structured Data on the Web. Work on structured data on the Web has focused mostly on providing users access to the data. However, experience has shown that significant value can be obtained from analyzing collections of meta-data on the Web. From forms and HTML pages we can extract several artifacts, such as:
     - a collection of forms;
     - a collection of several million schemata for tables;
     - a collection of columns, each having values in the same domain.
  17. Pros and Cons.
     Pros: a deep-web search's chief advantage is the depth and thoroughness of its results. Deep-web searches also prove more efficient, retrieving higher-quality, more relevant content. They analyze each page's entire content, ensuring that returned results have higher relevance to the desired search string.
     Cons: deep-web searches tend to be slower than standard searches. They may also return sensitive personal information from normally restricted databases. Exploring the Deep Web can be dangerous (users enjoy complete anonymity and may partake in illegal activities).
  18. Relation to Current Coursework in Content Extraction.
     - Given the set of resources, i.e., sites and databases, automate the information retrieval.
     - Analyze the deep-web content scenario.
     - Implement distributed crawling approaches.
     - Determine the real-life implications of a crawler's complications.
  19. Conclusion. The lines between search-engine content and the Deep Web have begun to blur, as search services start to provide access to part or all of once-restricted content. An increasing amount of deep-web content is opening up to free search as publishers and libraries make agreements with large search engines. In the future, deep-web content may be defined less by opportunity for search than by access fees or other types of authentication.