
Harnessing the Deep Web: Present and Future


Presentation Transcript


  1. Harnessing the Deep Web: Present and Future. Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy. January 7, 2009. Presented by Tushar Mhaskar.

  2. Accessing Structured Data Using Queries. 1) Unstructured queries, e.g. the current mode of searching for information on the Web. 2) Single-page structured queries, e.g. entering a precise query via an interface such as an HTML form. 3) Multi-page structured queries, e.g. a query that retrieves data from more than one source, as in mashups.

  3. Virtual Integration Approach • The best option for building a vertical search engine, i.e. one that searches data related to a particular domain. • Uses a mediated schema, which is an aggregation of attributes from all source schemas. • HTML forms are analyzed to identify the domain of the underlying content. • The inputs of each form are semantically mapped to elements of the mediated schema for that domain. • An input query is therefore mapped to the mediated schema, which routes it to the specific sources according to the mappings established earlier. [Diagram: a USER QUERY is posed over the Mediated Schema, which maps to several Source Schemas. Note: a source schema is the structure of the data underlying an HTML form.]
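The routing step above can be pictured with a minimal sketch (not from the paper; the schema elements, source names, and mappings below are invented for illustration):

```python
# Minimal sketch of query routing through a mediated schema.
# All schema elements, mappings, and source names are hypothetical.

# The mediated schema aggregates attributes from all source schemas
# in one domain (used cars here).
MEDIATED_SCHEMA = {"make", "model", "year", "price", "zip"}

# Each source maps its own form-input names to mediated-schema elements.
SOURCE_MAPPINGS = {
    "cars-site-a": {"make": "mk", "model": "mdl", "zip": "location"},
    "cars-site-b": {"make": "brand", "year": "yr", "price": "max_price"},
}

def route_query(query: dict) -> dict:
    """Reformulate a query over the mediated schema into per-source queries."""
    per_source = {}
    for source, mapping in SOURCE_MAPPINGS.items():
        translated = {mapping[attr]: value
                      for attr, value in query.items()
                      if attr in mapping}
        if translated:  # route only to sources that understand some attribute
            per_source[source] = translated
    return per_source

# A query over the mediated schema is rewritten for each relevant source.
print(route_query({"make": "Ford", "year": "1993"}))
# {'cars-site-a': {'mk': 'Ford'}, 'cars-site-b': {'brand': 'Ford', 'yr': '1993'}}
```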

  4. Virtual Integration Approach • “Results retrieved from the specific sources are combined and ranked before being presented to the user.” • Challenges • Data cannot be restricted to a particular domain, since defining the boundaries of a domain can be tricky; a piece of data can relate to more than one domain, so routing the query to the appropriate domain becomes a challenge. • Once a user enters a keyword in the search box, the system implementing the virtual integration approach must identify the forms relevant to that keyword and, if necessary, reformulate the keyword to fit a specific form input. • All of this happens at run time, so the relevant set of forms must be identified efficiently and quickly.

  5. Surfacing Approach • Deep-web content is surfaced by simulating form submissions, i.e. pre-computing queries, fetching the resulting web pages, and putting them into the web index. • The resulting web pages are not confined to a particular domain, as they are in the virtual integration approach. • A deep-web source is accessed only when a user selects a web page that was crawled from that source. Likewise, the query-routing issue is mostly avoided, because web search is performed over HTML pages as before. • “Pages surfaced by this approach from the top 10,000 forms (ordered by the number of search engine queries they impacted) accounted for only 50% of deep-web results on Google.com, while even the top 100,000 forms only accounted for 85%.”
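A minimal sketch of the pre-computation step, assuming a GET form whose action URL and candidate input values are hypothetical:

```python
# Minimal sketch of surfacing: pre-compute form submissions as GET URLs
# that a crawler can later fetch and index. The form URL, input names,
# and candidate values are all hypothetical.
from itertools import product
from urllib.parse import urlencode

FORM_ACTION = "http://example.com/search"   # hypothetical form action
CANDIDATE_VALUES = {                        # candidate values per input
    "make": ["ford", "honda"],
    "year": ["1993", "1994"],
}

def surface_urls(action, candidates):
    """Enumerate the cross product of candidate values as crawlable URLs."""
    names = sorted(candidates)
    for combo in product(*(candidates[n] for n in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))

for url in surface_urls(FORM_ACTION, CANDIDATE_VALUES):
    print(url)
# http://example.com/search?make=ford&year=1993
# ... one URL per combination; real systems must prune this cross product,
# which is exactly the limitation the next slide raises.
```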

  6. Surfacing Approach • The main advantage of the surfacing approach is the ability to reuse existing indexing technology; no additional indexing structures are necessary. • Further, search does not depend on the run-time characteristics of the underlying sources, because form submissions can be simulated offline and fetched by a crawler over time. • Limitations • 1) The semantics associated with the surfaced pages are lost once the HTML pages are put into the web index. Still, at least the pages are in the index and can be retrieved for many searches. • 2) It is not always possible to enumerate the data values that make sense for a particular form, and it is easy to create many form submissions that are not relevant to a particular source.

  7. Role of Semantics in Surfacing the Deep Web Semantics of form inputs • Lists of values can be created for the various elements of a mediated schema; when an input query matching an element of the mediated schema arrives, it can be routed to the appropriate form to retrieve the underlying content. • To apply this approach we need to distinguish between two types of form inputs. • 1) Search boxes • Generate “seed” words from already-indexed web pages and iteratively search for similar terms to build up the list of values for elements of the mediated schema. • 2) Typed text boxes • If we can infer the data type of a text box in an HTML form, many meaningless queries to irrelevant forms can be avoided. • E.g. US zip codes, dates, prices, etc.
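A rough sketch of the iterative seed-word idea for search boxes; the fetch_results stand-in and the word-extraction heuristic are assumptions for illustration, not the paper's algorithm:

```python
# Minimal sketch of iterative probing for a search box: start from seed
# words, submit them, harvest new candidate words from the result pages,
# and probe again with those.
import re

def fetch_results(keyword: str) -> str:
    """Hypothetical stand-in: submit the search box with `keyword` and
    return the result-page text. Canned here so the sketch runs."""
    return "used ford focus sedan listings and a honda civic for sale"

def probe(seeds, rounds=3, per_round=10):
    keywords = set(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for kw in frontier[:per_round]:
            text = fetch_results(kw)
            # Words appearing on result pages become new probe candidates.
            for word in re.findall(r"[a-z]{4,}", text.lower()):
                if word not in keywords:
                    keywords.add(word)
                    next_frontier.append(word)
        frontier = next_frontier
    return keywords

print(sorted(probe(["ford"])))

# For typed text boxes, a simple type check avoids meaningless probes,
# e.g. only 5-digit strings are plausible US zip codes:
ZIP_CODE = re.compile(r"^\d{5}$")
print(bool(ZIP_CODE.match("94043")), bool(ZIP_CODE.match("ford")))  # True False
```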

  8. Role of Semantics in Surfacing the Deep Web Correlated inputs • Ignoring dependencies between different search inputs can lead to retrieving irrelevant data. • Two kinds of correlations: • 1) Ranges • HTML forms often have pairs of inputs defining minimum and maximum values. By respecting this correlation, we can retrieve data that is more relevant to what the user searched for. • E.g. the minimum and maximum budget when looking for apartments on a housing site. • “Analysis indicates that as many as 20% of the English forms hosted in the US have input pairs that are likely to be ranges.” • 2) Database selection • Select menus can help identify which particular database a query should be routed to once the user has entered the query string.
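A small sketch of how min/max range pairs might be detected from input names alone; the name heuristics and input names are assumptions:

```python
# Minimal sketch of detecting min/max range pairs among form inputs by
# name heuristics, so correlated values (min <= max) can be submitted
# together rather than independently. Requires Python 3.9+.
def find_range_pairs(input_names):
    """Pair inputs like ('min_price', 'max_price') by a shared suffix."""
    pairs = []
    for name in input_names:
        low = name.lower()
        if low.startswith(("min", "from")):
            stem = low.removeprefix("min").removeprefix("from").strip("_")
            for other in input_names:
                o = other.lower()
                if o.startswith(("max", "to")) and o.endswith(stem):
                    pairs.append((name, other))
    return pairs

print(find_range_pairs(["min_price", "max_price", "keywords"]))
# [('min_price', 'max_price')] -- only value combinations with
# min <= max would then be surfaced for this pair.
```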

  9. Analysis of Surfaced Content • Semantics and extraction • By surfacing the structured data in the Deep Web, the semantics of the data are lost. • E.g. “Suppose a user were to search for ‘used ford focus 1993’. Suppose there is a surfaced used-car listing page for Honda Civics, which has a 1993 Honda Civic for sale, but with a remark ‘has better mileage than the Ford Focus’. A simple IR index can very well consider such a surfaced web page a good result. Such a scenario can be avoided if the surfaced page had the annotation that the page was for used-car listings of Honda Civics and the search engine were able to exploit such annotations.” • Hence, proper annotations that the index can use are required.
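A toy illustration of how such annotations could be exploited at query time; the index records and annotation fields below are invented:

```python
# Minimal sketch of how structured annotations on surfaced pages could
# filter out the false match in the slide's example. The records and
# annotation fields are hypothetical.
INDEX = [
    {"url": "site/a",
     "text": "1993 Honda Civic for sale, has better mileage than the Ford Focus",
     "annotations": {"domain": "used-car listings", "make": "Honda", "model": "Civic"}},
    {"url": "site/b",
     "text": "1993 Ford Focus, low mileage",
     "annotations": {"domain": "used-car listings", "make": "Ford", "model": "Focus"}},
]

def search(query_make, query_model, query_year):
    # A plain IR index would match both pages on keywords alone; the
    # annotations keep only pages whose listing is actually about the
    # requested make and model.
    return [r["url"] for r in INDEX
            if r["annotations"]["make"].lower() == query_make
            and r["annotations"]["model"].lower() == query_model
            and query_year in r["text"]]

print(search("ford", "focus", "1993"))   # ['site/b'] -- the Civic page is excluded
```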

  10. Analysis of Surfaced Content • Coverage of the surfaced content • What portion of the web site has been surfaced? • For a candidate surfacing algorithm, coverage is stated probabilistically: with probability M%, more than N% of the site’s content is exposed when the algorithm is used. • Greedy algorithms are used to maximize the coverage of the surfaced data. • The web pages we surface must be useful for the search engine index. • “Pages we extract should neither have too many results on a single surfaced page nor too few.”
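The greedy idea can be sketched as a set-cover-style selection over candidate form submissions; the queries and record IDs below are hypothetical:

```python
# Minimal sketch of greedily choosing form submissions to maximize
# coverage: at each step, pick the query whose result set adds the
# most records not yet covered.
def greedy_select(query_results, budget):
    """query_results: {query: set of record ids it returns}."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(query_results, key=lambda q: len(query_results[q] - covered))
        gain = query_results[best] - covered
        if not gain:            # no remaining query adds anything new
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered

queries = {"q=ford": {1, 2, 3}, "q=honda": {3, 4}, "q=civic": {4}}
print(greedy_select(queries, budget=2))
# (['q=ford', 'q=honda'], {1, 2, 3, 4})
```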

  11. Aggregating Structured Data on the Web • Structured data can be aggregated by considering the metadata of collections on the web, and these collections can be used to derive artifacts. • The artifacts can be used to build a set of semantic services: for example, given an entity or an attribute, we should be able to derive other attributes, or sets of values for those attributes.
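One plausible such service, sketched under the assumption that schemas have been harvested as attribute sets; the collected schemas here are invented:

```python
# Minimal sketch of a semantic service over aggregated metadata: given
# an attribute, suggest attributes that co-occur with it across the
# collected schemas.
from collections import Counter

SCHEMAS = [                       # metadata harvested from web collections
    {"make", "model", "year", "price"},
    {"make", "model", "mileage"},
    {"title", "author", "isbn"},
]

def related_attributes(attribute, top=3):
    counts = Counter()
    for schema in SCHEMAS:
        if attribute in schema:
            counts.update(schema - {attribute})
    return [a for a, _ in counts.most_common(top)]

print(related_attributes("make"))   # e.g. ['model', 'year', 'price']
```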

  12. Conclusions and Understandings from the Paper • The paper shows how deep-web data is extracted and what the challenges are in surfacing relevant data efficiently. • It is difficult to cover all of the hidden data behind forms. • We can retrieve structured data in two settings: when the data itself is already structured, e.g. tables, or when the query interface to the data is structured, e.g. HTML forms. • The virtual integration approach can be used to extract hidden data pertaining to a particular domain with the help of mediated schemas. • The surfacing approach avoids the problem of mapping queries to forms and uses the web search engine’s index to retrieve the data. It suits web search, where keyword queries span all possible domains and the expected results are ranked lists of web pages. • Just extracting and presenting the data is not enough; we also need to understand its semantics, which can be inferred from value ranges and from input types such as text boxes and select menus.

  13. Relation of This Paper to the Lectures • The surfacing approach described in the paper indexes the URLs generated from relevant form submissions. • The section on the role of semantics describes how a query can be optimized to infer appropriate results. • The paper also touches on characteristics of crawling, such as: • Politeness, i.e. placing minimal load on a website while crawling it. • Spanning the maximum number of web pages while crawling. • Generating relevant results for the search query.
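A minimal sketch of the politeness property, assuming a fixed per-host delay (the delay value is an assumption, not from the paper):

```python
# Minimal sketch of crawler politeness: enforce a per-host delay between
# fetches so surfacing puts little load on any one site.
import time
from urllib.parse import urlparse

class PoliteFetcher:
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_fetch = {}      # host -> time of last request

    def wait_turn(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_fetch.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_fetch[host] = time.time()

fetcher = PoliteFetcher()
fetcher.wait_turn("http://example.com/search?q=ford")   # then fetch the URL
```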

  14. Pros and Cons of the Paper • Pros • The authors clearly describe which approaches exist for extracting deep-web data, taking into consideration the domain and structure of the data. • The paper identifies the challenges of extracting deep-web data when the semantics of the input keywords are not considered. • Cons • How the mediated schema is constructed in the virtual integration approach is not described in detail. • The authors also do not state how form elements are mapped to the mediated schema. • In the surfacing approach, the authors do not describe how pages are indexed offline or how values for the input form elements are predicted. • The paper does not describe future work in this field in detail.
