Learning Based Web Query Processing

Learning Based Web Query Processing Yanlei Diao Computer Science Department Hong Kong U. of Science & Technology

Outline • Background • Learning Based Web Query Processing • FACT: A Prototype System • Preliminary System Evaluation • Conclusions • Demonstration

Searching the Web Want to find a piece of information on the Web? • Heterogeneity • Huge Size • Lack of Structure • Diversified User Bases • Ever-Changing

Search Engines • Maintain indices, keyword input, match input keywords with indices, return relevant documents. • Problems • Large hit lists with low precision. Users find relevant documents by browsing. • URLs but not the required information are returned. Users read the pages for the required information.

Web Information Retrieval • IR: Vector-space model, search and browse capabilities • Web IR: Web navigation, indexing, query languages, query-document matching, output ranking, user relevance feedback • Recent Improvement: Hierarchical classification, better presentation of results, hypertext study, metasearching...

Web IR for Query Processing Problems • A list of URLs or documents is returned. Users browse a lot to find information. • It asks users for precise query requirements, which is hard for casual users. • It lacks a well-defined underlying model. Vector-space model does not convey as much as Hypertext. Large hit lists with low precision, rely on input queries

Intelligent Agents The agents learn user profiles/models from their search behaviors and employ the knowledge to predict URLs of interest to the user. • Some rely on search engines and heuristics to find targets of a specific type: e.g. papers or homepages • Some help users in an interactive mode: They learn while users are browsing. • Some adaptive agents work autonomously: They use heuristics, recommend pages of interest and take user feedback to improve.

Agents for Query Processing Problems • Recommending pages of interest, but not information of interest to the user • Using vector-space model or converting HTML to text documents • Requiring a priori knowledge, such as user profiles, or using heuristics for a particular domain Not well suited for ad hoc queries

Database Approaches • The Web is a directed graph: nodes are Web pages and edges are hyperlinks between pages. • Query languages: 1st generation combines content-based and structure-based queries. 2nd generation accesses structure of Web objects and creates complex objects. • Wrappers and mediators: they present an integrated view of the resources.

DB Approaches for Query Processing Problems • Wrapper generation is only feasible for a number of sites in a domain. The Web is growing very fast! • Web query languages require knowledge of the Web sites (content and linkage) and the language syntax. They are hard to use. Not scalable, good for Web site management but not queries on the entire Web.

Our Goal A Web query processing system for any Web user that • processes ad hoc queries on HTML pages • automatically extracts succinct and precise query results (a result may take the form of a table, a list or a paragraph). Learn the knowledge for query processing from the User!

Proposed Approach An approach with learning capabilities: • Keyword input (probably not precise) • Search engines return a URL list • During browsing, learns from users • to navigate through the web pages • to identify the required information on a web page • Processes the remaining URLs automatically • Returns succinct and precise results

Unique Features • Returning succinct and precise results, i.e. segments of pages; • No a priori knowledge or preprocessing, suited for ad hoc queries; • Exploiting page formatting and linkage information simultaneously, making good use of the rich information conveyed by HTML.

Benefits from Learning • Bridging the gap between keyword input and real query requirements • Capable of navigating in the neighborhoods of documents returned by search engines • Automating the processing of all possibly relevant documents in one query • Almost imperceptible to users, user-friendly

Modeling a Web Page • Segment: a group of tag-delimited elements, the unit in query processing, e.g. paragraph, table, list; segments are nested (from atomic segments up to the document), forming a Segment Tree • Attributes of a segment • content: text in the scope of the segment • description: summary of the content • Hyperlink: represented as a segment to be comparable • content: URL • description: anchor text • associated with the parent segment
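The segment model above can be sketched as a small tree structure. This is only an illustration under assumptions: the class name `Segment`, the `kind` field and the helper methods are hypothetical and not taken from FACT; the sample values come from the hotel page used on the next slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Segment:
    """Hypothetical node of a Segment Tree (names are assumptions)."""
    kind: str                  # e.g. "document", "paragraph", "table", "list", "link"
    content: str = ""          # text in the scope of the segment; the URL for a link
    description: str = ""      # summary of the content; the anchor text for a link
    children: List["Segment"] = field(default_factory=list)
    parent: Optional["Segment"] = field(default=None, repr=False)

    def add(self, child: "Segment") -> "Segment":
        """Attach a child segment (or a link) to this segment."""
        child.parent = self
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Segment"]:
        """Yield this segment and all nested segments, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Build the tree for the sample hotel page.
doc = Segment("document", description="Hotel")
doc.add(Segment("paragraph", content="1999 Room Rates"))
table = doc.add(Segment("table", content="Special Promotion"))
lst = table.add(Segment("list"))
lst.add(Segment("link", content="ac01a.html", description="Guest Room"))
lst.add(Segment("link", content="ac02a.html", description="Executive Suite"))
```

Storing links as ordinary nodes with the same two attributes is what lets a later step rank links and segments in one list.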

A Sample
HTML source: <html><head> <title> … Hotel </title></head> <body><p>1999 Room Rates</p> <table><tr><td><ul> <li><a href="ac01a.html"> Guest Room</a></li> <li><a href="ac02a.html"> Executive Suite</a></li></ul></td> <td> Special Promotion <br> <table><tr><td>Room Type</td> <td>Single/Double (HK$)</td></tr> <tr><td>Standard</td> <td>1000</td></tr> <tr><td>Executive Suite</td> <td>2750</td> </tr></table></td></tr></table> </body></html>
Segment tree: • Document (content: includes the contents of the child paragraph and table) • Paragraph (content: "1999 Room Rates") • Table (content: "Special Promotion" & the content of the child table) • List (content: links 1. ac01a.html, 2. ac02a.html) • Table (content: "Room TypeSingle/Double (HK$)Standard1000Executive Suite2750")

Modeling a Web Site • A Web site is modeled as hyperlink-connected segment trees, called a Segment Graph. • Ignore backward links, links pointing to themselves, and links outside a site. [Figure: Segment Graph with segments S1, S11, S12, S13, S131, S2, S21, S3, S31, S32, S4, S41 and hyperlinks L1, L2, L3, L4. Sijk: segment; Lm: hyperlink.]
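The link-pruning rule for the Segment Graph might look like the sketch below. It rests on assumptions: the function name `forward_links` is hypothetical, "outside a site" is approximated by a different host name, and "backward link" is read as a link to a page that was already visited. The slides do not define either term precisely.

```python
from urllib.parse import urljoin, urlparse


def forward_links(page_url, hrefs, visited):
    """Return absolute URLs of the links a Segment Graph would keep.

    Assumed reading of the slide: drop off-site links, links that point
    back to the page itself, and links to already visited pages.
    """
    site = urlparse(page_url).netloc
    here = page_url.split("#")[0]
    kept = []
    for href in hrefs:
        target = urljoin(page_url, href).split("#")[0]  # resolve and drop fragment
        if urlparse(target).netloc != site:   # link outside the site
            continue
        if target == here:                    # link pointing to itself
            continue
        if target in visited:                 # backward link (assumption)
            continue
        kept.append(target)
    return kept


# Hypothetical example URLs, for illustration only.
page = "http://hotel.example/index.html"
hrefs = ["ac01a.html", "index.html", "http://other.example/x.html", "/home.html", "#top"]
kept = forward_links(page, hrefs, visited={"http://hotel.example/home.html"})
```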

Knowledge for the Locating Task The locating task is to find a segment in the Segment Graph of a site as the query result. 1) Exhaustive search simplifies it, but is impractical. 2) Navigation in the graph should terminate if a segment answers the query well enough or a conclusion of irrelevancy can be drawn. A decision to follow a link or to choose a segment should be made on each page. Segments and links on a page should be comparable!

Two Types of Knowledge Segments and links on a page are not comparable by content! A link conveys a description of the pointed page, while a queried segment contains both the description and the result itself. Two types of knowledge are needed! • One only concerns descriptive information and helps find the navigational path. • The other checks if a segment meets query requirements on both descriptive information and the result.

Navigation Knowledge • concerns descriptive information and helps find the navigational path • a set of (term, weight) pairs • Term: a selected word from the description of segments and links on the navigational path • Weight: indicates the importance of the term in leading to the queried segment

Learning Navigation Knowledge • Navigational path: (link)* segment, e.g. L2 L4 S41. • Extended navigational path: ((segment)* link)* ((segment)* segment), e.g. (S1 S11 L2) (S3 S31 L4) (S4 S41). Step 1. Assign a weight to each component on the path, e.g. L2, S31, S41. The closer to the target, the higher the weight. Step 2. Assign a weight to each term in the description of a component on the path. The weight of a term can be summed up over navigational paths. The set of (term, weight) pairs is stored into the navigation knowledge base.
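A minimal sketch of the two learning steps follows. The slides do not give the actual weighting function or the term selection rule, so both are assumptions here: a component at position i of an n-component path gets weight (i + 1) / n, every word of its description inherits that weight, and weights are summed over paths. The function name `learn_navigation` is hypothetical.

```python
from collections import defaultdict


def learn_navigation(paths, knowledge=None):
    """Accumulate (term, weight) pairs from navigational paths.

    Each path is a list of component descriptions ordered from the
    start page to the queried segment.
    """
    knowledge = defaultdict(float) if knowledge is None else knowledge
    for path in paths:
        n = len(path)
        for i, description in enumerate(path):
            # Step 1 (assumed scheme): closer to the target, higher weight.
            component_weight = (i + 1) / n
            # Step 2 (assumed scheme): each distinct term gets the component weight.
            for term in set(description.lower().split()):
                knowledge[term] += component_weight
    return knowledge


# Two hypothetical training paths built from terms that appear in the demo.
paths = [
    ["Hotel Rates", "Reservation", "single double standard room"],
    ["Room Rates", "standard room"],
]
nav = learn_navigation(paths)
```

With this scheme, terms that describe the queried segment itself ("standard", "room") end up heavier than terms on links far from it ("hotel").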

Classification Knowledge • Checks if a segment meets query requirements on both descriptive information and the result. • Cast in the Bayesian learning framework. • Set of triples: (feature, NP, NN) • Feature: word, integer, real, symbol, …, date, time, email address, …, contained in a segment • NP: # occurrences of the feature in positive samples • NN: # occurrences of the feature in negative samples

Learning Classification Knowledge The queried segment is a positive sample. All other segments on the same page are negative samples. The content of each segment is parsed into a set of features of either simple or complex types. Count NP and NN cumulatively for each feature over all samples. Store all triples (feature, NP, NN) into the classification knowledge base.
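The counting step can be sketched as below. The feature parser is a deliberately crude assumption (plain words plus two type features, `<integer>` and `<real>`); the slides list richer types such as date, time and email address without saying how they are recognized. The names `features` and `learn_classification` are hypothetical.

```python
import re
from collections import defaultdict


def features(content):
    """Parse segment content into a list of features (assumed, simplified)."""
    feats = []
    for token in content.lower().split():
        token = token.strip(".,;:()")
        if not token:
            continue
        if re.fullmatch(r"\d+", token):
            feats.append("<integer>")
        elif re.fullmatch(r"\d[\d,]*\.\d+", token):
            feats.append("<real>")
        else:
            feats.append(token)
    return feats


def learn_classification(page_segments, queried_index, knowledge=None):
    """Update feature -> [NP, NN] counts from one browsed page.

    The queried segment is the positive sample; every other segment on
    the same page is a negative sample.
    """
    knowledge = defaultdict(lambda: [0, 0]) if knowledge is None else knowledge
    for i, content in enumerate(page_segments):
        slot = 0 if i == queried_index else 1   # 0 counts NP, 1 counts NN
        for feat in features(content):
            knowledge[feat][slot] += 1
    return knowledge


segments = [
    "Holiday Inn Golden Mile is your number one choice",
    "standard room 999.00 1,039.00",
]
kb = learn_classification(segments, queried_index=1)
```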

Query Processing Using Learned Knowledge • After a Web page is retrieved, the segment graph is built • For each segment and link, a score is computed by applying the navigation knowledge (ApplyNavigation). • Segments/links are sorted on the score • If a link has the highest score, the system navigates through the link • If a segment has the highest score, all segments on the page are checked to see if there is a queried segment • The process is repeated until either a segment is found or it can be concluded that the site does not contain the queried information.

Locating Algorithm On each page, if the result is not found, choose an unprocessed component with the highest score: • if a link is chosen: navigate through the link • if a segment is chosen: check all segments on the page (ApplyClassification) [Figure: Segment Graph with segments S1, S11, S12, S13, S131, S2, S21, S3, S31, S32, S4, S41 and hyperlinks L1, L2, L3, L4. Sijk: segment; Lm: hyperlink.]
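The locating loop can be sketched as a recursive best-first walk. Everything concrete here is an assumption for illustration: the toy `site` dictionary, the `score` and `confidence` stand-ins, the 0.5 threshold, and the choice to stop only when all components are exhausted (the slides do not say how irrelevancy is concluded).

```python
def locate(url, site, score, confidence, threshold, visited=None):
    """Return (url, content) of the queried segment, or None if not found."""
    visited = set() if visited is None else visited
    visited.add(url)
    # Rank links and segments together by their navigation score.
    ranked = sorted(site[url], key=lambda c: score(c["description"]), reverse=True)
    classified = False
    for comp in ranked:
        if comp["kind"] == "link":
            target = comp["content"]
            if target in site and target not in visited:
                found = locate(target, site, score, confidence, threshold, visited)
                if found is not None:
                    return found
        elif not classified:
            # A segment is chosen: check all segments on this page once.
            classified = True
            segs = [c for c in site[url] if c["kind"] != "link"]
            best = max(segs, key=lambda c: confidence(c["content"]))
            if confidence(best["content"]) >= threshold:
                return url, best["content"]
    return None


# Toy site and stand-in knowledge, for illustration only.
site = {
    "index.html": [
        {"kind": "segment", "description": "address", "content": "57 - 73 Lockhart Road"},
        {"kind": "link", "description": "dining and banqueting", "content": "dining.html"},
        {"kind": "link", "description": "hotel rates", "content": "rates.html"},
    ],
    "dining.html": [{"kind": "segment", "description": "restaurants", "content": "open daily"}],
    "rates.html": [{"kind": "segment", "description": "room rates", "content": "standard room 999.00"}],
}
weights = {"rates": 3.0, "room": 1.0}


def score(description):
    return max((weights.get(t, 0.0) for t in description.split()), default=0.0)


def confidence(content):
    return 1.0 if "999.00" in content else 0.0


seen = set()
result = locate("index.html", site, score, confidence, threshold=0.5, visited=seen)
```

In this toy run the "hotel rates" link outranks everything else on the start page, so the dining page is never retrieved.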

Applying Learned Knowledge • Application of Navigation Knowledge: • extracts terms in the description of a link/segment • reads the weights of the terms and assigns a score to the link/segment by a certain function (max currently) • sorts all links and segments by their scores • Application of Classification Knowledge: • computes the confidence C to classify a segment as the queried result • chooses the segment on a page with the largest C. If the largest C is over a threshold, returns the segment
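Both applications can be sketched in a few lines. The navigation score uses max over term weights, as the slide states. The confidence formula is not given in the slides beyond "Bayesian", so the version below is an assumption: a naive Bayes posterior with add-one smoothing computed from the (feature, NP, NN) counts. Function names and the sample counts are hypothetical.

```python
import math


def navigation_score(description, nav_knowledge):
    """Score a link or segment as the max weight of its description terms."""
    terms = description.lower().split()
    return max((nav_knowledge.get(t, 0.0) for t in terms), default=0.0)


def confidence(feats, cls_knowledge, prior_pos=0.5):
    """Assumed naive Bayes posterior that a segment is the queried result.

    cls_knowledge maps feature -> (NP, NN).
    """
    total_p = sum(np_ for np_, _ in cls_knowledge.values())
    total_n = sum(nn for _, nn in cls_knowledge.values())
    vocab = len(cls_knowledge) + 1          # +1 reserves mass for unseen features
    log_p = math.log(prior_pos)
    log_n = math.log(1.0 - prior_pos)
    for feat in feats:
        np_, nn = cls_knowledge.get(feat, (0, 0))
        log_p += math.log((np_ + 1) / (total_p + vocab))
        log_n += math.log((nn + 1) / (total_n + vocab))
    top = max(log_p, log_n)                 # subtract the max to avoid underflow
    p, n = math.exp(log_p - top), math.exp(log_n - top)
    return p / (p + n)


# Hypothetical counts, for illustration only.
kb = {"<real>": (6, 0), "room": (3, 0), "holiday": (0, 2), "inn": (0, 2)}
c_rates = confidence(["<real>", "room"], kb)
c_intro = confidence(["holiday", "inn"], kb)
```

A segment would then be returned only if the largest C on its page exceeds the chosen threshold.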

forward Hotel 1 3 Hotel 2 User browses it! done

User clicks here!

Room information User marks it!

Generating Navigation Knowledge • The navigation path looks like: Hotel Reservation->single hk$ double hk$ standard room deluxe room +executive room • By our weighting scheme, a weight is assigned to each term

Generating Classification Knowledge • Training Samples • Occurrences of each feature are counted
Negative: Holiday Inn Golden Mile In the heart of Tsim Sha Tsui - Kowloon, Holiday Inn Golden Mile is your number one choice for accommodation, dining, meetings and banquets. Ideally situated in the heart of ...
Positive:
                   single hk$   double hk$
  standard room    999.00       1,039.00
  deluxe room      1,199.00     1,239.00
  +executive room  1,399.00     1,499.00

back Fact starts here!

Applying Navigation Knowledge The page contains Navigation knowledge shows Paragraph 57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC Paragraph Located in the hub of Wanchai, the Wharney Hotel is within walking distance of the Hong Kong Arts Centre, Convention and Exhibition Centre, busy commercial complexes and shopping malls. ... Paragraph TEL: (852) 2861-1000 FAX: (852) 2865-6023 Links Main Features & Services Dining and Banqueting Hotel Rates Reservation ...

0.285714 0.392857 0.230769 0.392857 0 0 Current 0.0666667 0 3.0 0.25 0 Navigation Knowledge assigns scores Fact chooses it!

Navigation Knowledge assigns scores Table: 0.586447; Paragraph: 3.0; Paragraph: 0.25; List: 0.25. Visited: 0.0666667, 0. Current: 0.25, 0.

Classification Knowledge computes confidence Apply Classification Knowledge to all segments: C=1.0, C=0.3569, C=2.5e-007, C=6.3e-008, C=0.0001

Fact finds it!

A Query Processing System A learning based query processing system: • User Interface: accepts user queries, presents query results, a browser capable of capturing user actions • Query Analyzer: analyzes and transforms user queries • Session Controller: coordinates learning and locating • Learner: generates knowledge from captured user actions • Locator: applies knowledge and locates query results • Retriever & Parser: retrieves pages and parses them into trees • Knowledge Base: stores learned knowledge

Reference Architecture [Figure: components User, User Interface, Query Analyzer, Session Controller, Learner, Locator, Knowledge Base, Retriever & Parser, Search Engine, Web.]

A Query Session [Figure labels: Learning Process, Scripts, Learner, Browser, User Actions, Session Controller, URLs, Knowledge Base, Result Buffer, Training Strategy, Segment Graph, Query Results, Checking, Locating Process, Locator, Query Result Presenter.]

Training Strategies • Sequential • First n sites: user browses and system learns • Next N-n sites: system processes • Random • Randomly choose n sites: user browses and system learns • The system processes the rest • Interleaved • First n0 sites: user browses and system learns • Next n-n0 sites: system makes decisions. For incorrect ones, user browses and system re-learns • Next N-n sites: system processes
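The sequential and random strategies amount to different ways of splitting the N returned sites into a training set of n sites and a remainder that the system processes on its own. The helper below is a hypothetical sketch (the name `split_sites` and its parameters are assumptions); the interleaved strategy is left out because it depends on user feedback during the session.

```python
import random


def split_sites(urls, n, strategy="sequential", seed=None):
    """Split N site URLs into (training sites, sites processed automatically)."""
    if not 0 <= n <= len(urls):
        raise ValueError("n must be between 0 and the number of sites")
    if strategy == "sequential":
        return list(urls[:n]), list(urls[n:])
    if strategy == "random":
        rng = random.Random(seed)
        chosen = set(rng.sample(range(len(urls)), n))
        train = [u for i, u in enumerate(urls) if i in chosen]
        rest = [u for i, u in enumerate(urls) if i not in chosen]
        return train, rest
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical site list, for illustration only.
urls = [f"http://hotel{i}.example/" for i in range(1, 11)]
seq_train, seq_rest = split_sites(urls, 3, "sequential")
rnd_train, rnd_rest = split_sites(urls, 3, "random", seed=7)
```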

System Evaluation • System Capabilities • Performance • Effectiveness: precision, recall, correctness • Efficiency: in a site, how many pages the system visits to find a result or to recognize the irrelevancy • Training efficiency: how many training samples are needed • Key Issues • Effectiveness of the knowledge • Effectiveness of training strategies • Tests on A Range of Queries

A System Output Sample

System Capabilities • The system returns segments of the Web pages • The segments may not contain any input keyword but meet the requirement of room rates. • The system learned the query requirement from the user! • Segments can be from pages whose URLs are not directly returned by Yahoo!. • The system learned how to follow the hyperlinks to the queried segment!

System Evaluation - Effectiveness • Given a set of URLs in a query session, the system makes N decisions, N = N1 + N2 + N3 + N4 • Precision = N1 / (N1 + N3) • Recall = N1 / (# sites that contain results) • Correctness = (N1 + N2) / N
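The three measures can be computed directly from the four decision counts. In the sketch below the function name and the example counts are hypothetical; the counts are made-up inputs for illustration, not results from the evaluation.

```python
def effectiveness(n1, n2, n3, n4, sites_with_results):
    """Return (precision, recall, correctness) from the decision counts."""
    n = n1 + n2 + n3 + n4
    precision = n1 / (n1 + n3) if (n1 + n3) else 0.0
    recall = n1 / sites_with_results if sites_with_results else 0.0
    correctness = (n1 + n2) / n if n else 0.0
    return precision, recall, correctness


# Made-up counts: N = 8 + 6 + 2 + 4 = 20 decisions, 10 sites contain results.
precision, recall, correctness = effectiveness(8, 6, 2, 4, sites_with_results=10)
```

Note that recall is divided by the number of sites that contain results, as on the slide, rather than by a sum of the decision counts.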
