Learning Based Web Query Processing

Learning Based Web Query Processing. Yanlei Diao, Computer Science Department, Hong Kong U. of Science & Technology.


Learning Based Web Query Processing

Yanlei Diao

Computer Science Department

Hong Kong U. of Science & Technology

Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration
Searching the Web

Want to find a piece of information on the Web?

  • Heterogeneity
  • Huge size
  • Lack of structure
  • Diversified user bases
  • Ever-changing

Search Engines
  • Maintain indices, keyword input, match input keywords with indices, return relevant documents.
  • Problems
    • Large hit lists with low precision. Users find relevant documents by browsing.
    • URLs but not the required information are returned. Users read the pages for the required information.
Web Information Retrieval
  • IR: Vector-space model, search and browse capabilities
  • Web IR: Web navigation, indexing, query languages, query-document matching, output ranking, user relevance feedback
  • Recent Improvement: Hierarchical classification, better presentation of results, hypertext study, metasearching...
Web IR for Query Processing

Problems

  • A list of URLs or documents is returned. Users browse a lot to find information.
  • It asks users for precise query requirements, which is hard for casual users.
  • It lacks a well-defined underlying model. Vector-space model does not convey as much as Hypertext.

Large hit lists with low precision, rely on input queries

Intelligent Agents

The agents learn user profiles/models from their search behaviors and employ the knowledge to predict URLs of interest to the user.

  • Some rely on search engines and heuristics to find targets of a specific type: e.g. papers or homepages
  • Some help users in an interactive mode: They learn while users are browsing.
  • Some adaptive agents work autonomously: They use heuristics, recommend pages of interest and take user feedback to improve.
Agents for Query Processing

Problems

  • Recommending pages of interest, but not information of interest to the user
  • Using vector-space model or converting HTML to text documents
  • Requiring a priori knowledge, such as user profiles, or using heuristics for a particular domain

Not well suited for ad hoc queries

Database Approaches
  • The Web is a directed graph: nodes are Web pages and edges are hyperlinks between pages.
  • Query languages: 1st generation combines content-based and structure-based queries. 2nd generation accesses structure of Web objects and creates complex objects.
  • Wrappers and mediators: they present an integrated view of the resources.
DB Approaches for Query Processing

Problems

  • Wrapper generation is only feasible for a number of sites in a domain. The Web is growing very fast!
  • Web query languages require knowledge of the Web sites (content and linkage) and the language syntax. They are hard to use.

Not scalable, good for Web site management but not queries on the entire Web.

Our Goal

A Web query processing system for any Web user that

  • processes ad hoc queries on HTML pages
  • automatically extracts succinct and precise query results (a result may take the form of a table, a list or a paragraph).

Learn the knowledge for query processing from the user!

Proposed Approach

An approach with learning capabilities:

  • Keyword input (probably not precise)
  • Search engines return a URL list
  • During browsing, learns from users
    • to navigate through the web pages
    • to identify the required information on a web page
  • Processes the remaining URLs automatically
  • Returns succinct and precise results
Unique Features
  • Returning succinct and precise results, i.e. segments of pages;
  • No a priori knowledge or preprocessing, suited for ad hoc queries;
  • Exploiting page formatting and linkage information simultaneously, making good use of the rich information conveyed by HTML.
Benefits from Learning
  • Bridging the gap between keyword input and real query requirements
  • Capable of navigating in the neighborhoods of documents returned by search engines
  • Automating the processing of all possibly relevant documents in one query
  • Almost imperceptible to users, user-friendly
Modeling a Web Page
  • Segment: a group of tag-delimited elements, the unit in query processing, e.g. paragraph, table, list. Segments are nested (from atomic segments up to the document), forming a Segment Tree
  • Attributes of a segment
    • content: text in the scope of the segment
    • description: summary of the content
  • Hyperlink: represented as segments to be comparable
    • content: URL
    • description: anchor text
    • associated with the parent segment
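
The page model above can be sketched as a small tree structure. This is a minimal illustration, not FACT's actual data structure: the class names `Segment` and `Link`, the field names, and the way `content()` joins text are all assumptions made for the example. The page built at the end mirrors the hotel sample used in the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    url: str          # "content" of a hyperlink
    anchor_text: str  # "description" of a hyperlink


@dataclass
class Segment:
    kind: str                  # e.g. "document", "paragraph", "table", "list"
    description: str = ""      # summary of the content
    text: str = ""             # text directly inside this segment
    children: List["Segment"] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)  # links attached to this (parent) segment

    def content(self) -> str:
        """Text in the scope of the segment: its own text plus that of all descendants."""
        parts = [self.text] + [child.content() for child in self.children]
        return " ".join(p for p in parts if p)


# Hypothetical reconstruction of the hotel sample page as a segment tree.
inner = Segment("table", text="Room Type Single/Double (HK$) Standard 1000 Executive Suite 2750")
menu = Segment("list", links=[Link("ac01a.html", "Guest Room"),
                              Link("ac02a.html", "Executive Suite")])
outer = Segment("table", text="Special Promotion", children=[menu, inner])
page = Segment("document", children=[Segment("paragraph", text="1999 Room Rates"), outer])
```

Calling `outer.content()` yields "Special Promotion" followed by the content of the child table, which matches how the slides describe the content of a nested segment.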
A Sample

<html><head>
<title> … Hotel </title></head>
<body><p>1999 Room Rates</p>
<table><tr><td><ul>
<li><a href="ac01a.html">Guest Room</a></li>
<li><a href="ac02a.html">Executive Suite</a></li></ul></td>
<td> Special Promotion <br>
<table><tr><td>Room Type</td>
<td>Single/Double (HK$)</td></tr>
<tr><td>Standard</td>
<td>1000</td></tr>
<tr><td>Executive Suite</td>
<td>2750</td>
</tr></table></td></tr></table>
</body></html>

Segment tree of the page:

Document (content: … & contents of child paragraph and table)
  Paragraph (content: "1999 Room Rates")
  Table (content: "Special Promotion" & the content of the child table)
    List (links: 1. ac01a.html, 2. ac02a.html)
    Table (content: "Room Type Single/Double (HK$) Standard 1000 Executive Suite 2750")
Modeling a Web Site

A Web site is modeled as hyperlink-connected segment trees, called a Segment Graph. Backward links, links pointing to themselves, and links outside the site are ignored.

Figure: a segment graph with segments S1, S11, S12, S13, S131, S2, S21, S3, S31, S32, S4, S41 and hyperlinks L1, L2, L3, L4. Definition: Sijk is a segment; Lm is a hyperlink.
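
The rule for which links enter the Segment Graph can be sketched as a filter. This is only an illustration under stated assumptions: the function name `site_links` is hypothetical, "outside a site" is approximated by comparing host names, and a "backward link" is taken to mean a link to a page that has already been visited.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def site_links(page_url, hrefs, visited):
    """Return the absolute URLs of links worth keeping as edges of the graph."""
    here, _ = urldefrag(page_url)
    site = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        target, _ = urldefrag(urljoin(page_url, href))  # resolve and drop "#fragment"
        if urlparse(target).netloc != site:
            continue  # link outside the site
        if target == here:
            continue  # link pointing to the page itself
        if target in visited:
            continue  # backward link (assumption: target was already visited)
        kept.append(target)
    return kept


kept = site_links(
    "http://hotel.example/rates/index.html",
    ["ac01a.html", "#top", "../index.html", "http://other.example/x"],
    visited={"http://hotel.example/index.html"},
)
```

In this example only `ac01a.html` survives; the fragment link, the link back to the home page and the off-site link are dropped.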

Knowledge for the Locating Task

The locating task is to find a segment in the Segment Graph of a site as the query result.

1) Exhaustive search simplifies it, but is impractical.

2) Navigation in the graph should terminate if a segment answers the query well enough or if irrelevancy can be concluded.

A decision to follow a link or to choose a segment should be made on each page.

Segments and links on a page should be comparable!

Two Types of Knowledge

Segments and links on a page are not comparable by content! A link conveys a description of the page it points to, while a queried segment contains both the description and the result itself.

Two types of knowledge are needed!

  • One only concerns descriptive information and helps find the navigational path.
  • The other checks if a segment meets query requirements on both descriptive information and the result.

Navigation Knowledge
  • concerns descriptive information and helps find the navigational path
  • a set of (term, weight) pairs
    • Term: a selected word from the description of segments and links on the navigational path
    • Weight: indicates the importance of the term in leading to the queried segment
Learning Navigation Knowledge

Navigational path: (link)* segment, e.g. L2 L4 S41.

Extended navigational path: ((segment)* link)* ((segment)* segment), e.g. (S1 S11 L2) (S3 S31 L4) (S4 S41).

Step 1. Assign a weight to each component on the path, e.g. L2, S31, S41. The closer to the target, the higher the weight.

Step 2. Assign a weight to each term in the description of a component on the path.

The weight of a term can be summed up over navigational paths. The set of (term, weight) pairs is stored in the navigation knowledge base.
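
The two steps can be sketched as follows. The slides do not give the actual weighting scheme, so this example assumes a simple one: the i-th of n components on a path gets weight i/n (closer to the target means higher), and every term in that component's description inherits the component weight. The function name `learn_navigation` and the whitespace tokenizer are also assumptions.

```python
from collections import defaultdict


def learn_navigation(paths, knowledge=None):
    """Accumulate (term, weight) pairs over navigational paths.

    Each path is a list of component descriptions ordered from the start
    of the path to the queried segment.
    """
    knowledge = defaultdict(float) if knowledge is None else knowledge
    for path in paths:
        n = len(path)
        for i, description in enumerate(path, start=1):
            component_weight = i / n  # Step 1 (assumed linear scheme)
            for term in set(description.lower().split()):
                knowledge[term] += component_weight  # Step 2, summed over paths
    return knowledge


kb = learn_navigation([["Hotel Rates", "Room Rates"]])
```

Here "rates" appears in both components of the path and ends up with weight 1.5, while "hotel" (far from the target) gets 0.5 and "room" gets 1.0.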

Classification Knowledge
  • Checks if a segment meets query requirements on both descriptive information and the result.
  • Cast in the Bayesian learning framework.
  • Set of triples: (feature, NP, NN)
    • Feature: word, integer, real, symbol, …, date, time, email address, …, contained in a segment
    • NP: # occurrences of the feature in positive samples
    • NN: # occurrences of the feature in negative samples
Learning Classification Knowledge

The queried segment is a positive sample. All other segments on the same page are negative samples.

The content of each segment is parsed into a set of features, of either simple or complex types.

Count NP and NN cumulatively for each feature over all samples. Store all triples (feature, NP, NN) in the classification knowledge base.
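
A minimal sketch of this counting step is shown below. The feature parser is deliberately crude and is an assumption of the example: plain words are kept as they are, and anything that looks like a number is mapped to a single typed feature `<number>`. The names `features` and `learn_classification` are hypothetical.

```python
import re
from collections import defaultdict


def features(content):
    """Parse segment content into a list of features (words and a typed number feature)."""
    feats = []
    for token in content.lower().split():
        if re.fullmatch(r"\d[\d,]*(\.\d+)?", token):
            feats.append("<number>")  # complex type: integer or real
        else:
            feats.append(token)       # simple type: word
    return feats


def learn_classification(pages, kb=None):
    """pages: list of (segment_contents, index_of_marked_segment) pairs."""
    kb = defaultdict(lambda: [0, 0]) if kb is None else kb  # feature -> [NP, NN]
    for segments, marked in pages:
        for i, content in enumerate(segments):
            slot = 0 if i == marked else 1  # marked segment is positive, others negative
            for f in features(content):
                kb[f][slot] += 1
    return kb


kb = learn_classification([
    (["Holiday Inn Golden Mile", "standard room 999.00 1,039.00"], 1),
])
```

After this single training page, `<number>` has NP = 2 and NN = 0, which is the kind of evidence that lets the classifier recognise a price table even when no input keyword appears in it.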

Query Processing Using Learned Knowledge
  • After a Web page is retrieved, the segment graph is built
  • For each segment and link, a score is computed by applying the navigation knowledge (ApplyNavigation).
  • Segments/links are sorted by score
    • If a link has the highest score, the system navigates through the link
    • If a segment has the highest score, all segments on the page are checked to see if there is a queried segment
  • The process is repeated until either a queried segment is found or it can be concluded that the site does not contain the queried information.
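
The loop described above can be sketched roughly as below. This is not FACT's implementation: the function `locate`, its parameters, the recursive backtracking and the toy site are all assumptions for illustration. In particular, this sketch simply gives up when every component has been tried, whereas the slides call for stopping earlier once irrelevancy can be concluded.

```python
def locate(url, fetch, nav_score, confidence, threshold=0.5, visited=None):
    """Return the queried segment reachable from url, or None.

    fetch(url) -> (segments, links); segments are strings and links are
    (target_url, anchor_text) pairs.
    """
    visited = set() if visited is None else visited
    if url in visited:
        return None
    visited.add(url)
    segments, links = fetch(url)
    # Score every component of the page with the navigation knowledge.
    components = [(nav_score(s), "segment", s) for s in segments]
    components += [(nav_score(text), "link", target) for target, text in links]
    components.sort(key=lambda c: c[0], reverse=True)  # highest score first
    checked = False
    for _score, kind, item in components:
        if kind == "link":
            found = locate(item, fetch, nav_score, confidence, threshold, visited)
            if found is not None:
                return found
        elif not checked:
            checked = True  # all segments on the page are classified at once
            best = max(segments, key=confidence)
            if confidence(best) >= threshold:
                return best
    return None


# Toy site and toy knowledge, invented for the example.
SITE = {
    "home": (["Welcome to the hotel"], [("rates", "Hotel Rates"), ("dining", "Dining")]),
    "rates": (["standard room 999.00", "Call us for details"], []),
    "dining": (["Our restaurant"], []),
}
WEIGHTS = {"rates": 1.0, "room": 2.0, "hotel": 0.5}


def toy_nav_score(description):
    return max((WEIGHTS.get(t, 0.0) for t in description.lower().split()), default=0.0)


def toy_confidence(segment):
    return 1.0 if "999.00" in segment else 0.0


seen = set()
result = locate("home", SITE.get, toy_nav_score, toy_confidence, visited=seen)
```

On the toy site the "Hotel Rates" link outscores the welcome paragraph, so the sketch follows it and returns the price segment after visiting two pages, without ever opening the dining page.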
Locating Algorithm

On each page, if the result is not found, choose the unprocessed component with the highest score:

  • if a link is chosen, follow the link
  • if a segment is chosen, check all segments on the page (ApplyClassification)

Figure: the locating algorithm traced on a segment graph with segments S1, S11, S12, S13, S131, S2, S21, S3, S31, S32, S4, S41 and hyperlinks L1, L2, L3, L4. Definition: Sijk is a segment; Lm is a hyperlink.

Applying Learned Knowledge
  • Application of Navigation Knowledge:
    • extracts terms in the description of a link/segment
    • reads the weights of the terms and assigns a score to the link/segment by a certain function (currently max)
    • sorts all links and segments by their scores
  • Application of Classification Knowledge:
    • computes the confidence C to classify a segment as the queried result
    • chooses the segment on a page with the largest C. If the largest C is over a threshold, returns the segment
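
Both applications can be sketched in a few lines. The navigation score uses max, as the slide states. The confidence formula is not given in the slides, so the example substitutes a standard multinomial naive Bayes posterior with add-one smoothing; that choice, the function names and the toy knowledge bases are assumptions.

```python
import math


def navigation_score(description, knowledge):
    """Score a link or segment as the max weight among the terms of its description."""
    weights = [knowledge.get(t, 0.0) for t in description.lower().split()]
    return max(weights, default=0.0)


def confidence(feats, kb, prior_pos=0.5):
    """Posterior that a segment is the queried result, from (feature -> (NP, NN)) counts."""
    vocab = len(kb)
    total_pos = sum(np_ for np_, _ in kb.values())
    total_neg = sum(nn for _, nn in kb.values())
    log_pos, log_neg = math.log(prior_pos), math.log(1.0 - prior_pos)
    for f in feats:
        np_, nn = kb.get(f, (0, 0))
        log_pos += math.log((np_ + 1) / (total_pos + vocab))  # add-one smoothing
        log_neg += math.log((nn + 1) / (total_neg + vocab))
    return 1.0 / (1.0 + math.exp(log_neg - log_pos))


nav_kb = {"rates": 3.0, "hotel": 0.25}
cls_kb = {"<number>": (4, 0), "room": (3, 1), "inn": (0, 3), "located": (0, 2)}
c_table = confidence(["room", "<number>"], cls_kb)
c_intro = confidence(["inn", "located"], cls_kb)
```

With these toy counts the price-like segment gets a confidence of about 0.89 and the introductory paragraph about 0.06, so only the former would pass a threshold of 0.5.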
Figure: screenshots of a training session. Callouts: "Hotel 1", "Hotel 2", "User browses it!", "Room information", "User marks it!".

Generating Navigation Knowledge
  • The navigation path looks like:

Hotel Reservation->single hk$ double hk$ standard room deluxe room +executive room

  • By our weighting scheme, a weight is assigned to each term
Generating Classification Knowledge
  • Training Samples
  • Occurrences of each feature are counted

Negative sample:

Holiday Inn Golden Mile
In the heart of Tsim Sha Tsui - Kowloon, Holiday Inn Golden Mile is your number one choice for accommodation, dining, meetings and banquets.
Ideally situated in the heart of ...

Positive sample:

                   single hk$    double hk$
standard room      999.00        1,039.00
deluxe room        1,199.00      1,239.00
+executive room    1,399.00      1,499.00

Applying Navigation Knowledge

FACT starts here! The page contains:

  • Paragraph: 57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC
  • Paragraph: Located in the hub of Wanchai, the Wharney Hotel is within walking distance of the Hong Kong Arts Centre, Convention and Exhibition Centre, busy commercial complexes and shopping malls.
  • ...
  • Paragraph: TEL: (852) 2861-1000 FAX: (852) 2865-6023
  • Links: Main, Features & Services, Dining and Banqueting, Hotel Rates, Reservation, ...

Figure: a second panel, titled "Navigation knowledge shows", sits beside the page contents.

Figure: navigation knowledge assigns scores to the components of a page: 0.285714, 0.392857, 0.230769, 0.392857, 0, 0; labeled "Current": 0.0666667, 0, 3.0, 0.25, 0. Callout: "FACT chooses it!"

Figure: navigation knowledge assigns scores. Segments: Table 0.586447, Paragraph 3.0, Paragraph 0.25, List 0.25. Labeled "Visited": 0.0666667, 0; labeled "Current": 0.25, 0.

Figure: classification knowledge is applied to all segments and computes the confidence: C=1.0, C=0.3569, C=2.5e-007, C=6.3e-008, C=0.0001.

A Query Processing System

A learning based query processing system:

  • User Interface: accepts user queries, presents query results, and includes a browser capable of capturing user actions
  • Query Analyzer: analyzes and transforms user queries
  • Session Controller: coordinates learning and locating
  • Learner: generates knowledge from captured user actions
  • Locator: applies knowledge and locates query results
  • Retriever & Parser: retrieves pages and parses them into trees
  • Knowledge Base: stores learned knowledge
Reference Architecture

Figure: reference architecture connecting the User, User Interface, Query Analyzer, Session Controller, Learner, Knowledge Base, Locator, Retriever & Parser, Search Engine and the Web.
A Query Session

Figure: a query session, split into a Learning Process and a Locating Process. Labels in the diagram: User, Browser, Actions, Scripts, Learner, Session Controller, URLs, Knowledge Base, Training Strategy, Result Buffer, Segment Graph, Checking, Locator, Query results, Query Result Presenter.
Training Strategies
  • Sequential
    • First n sites: user browses and system learns
    • Next N-n sites: system processes
  • Random
    • Randomly choose n sites: user browses and system learns
    • The system processes the rest
  • Interleaved
    • First n0 sites: user browses and system learns
    • Next n-n0 sites: system makes decisions. For incorrect ones, user browses and system re-learns
    • Next N-n sites: system processes
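
The three strategies differ only in which sites feed the learner. A minimal sketch follows; the function names, the callback parameters `learn`, `decide` and `is_correct`, and the use of a seeded random generator are assumptions of the example.

```python
import random


def split_sites(sites, n, strategy="sequential", seed=None):
    """Return (training_sites, remaining_sites) for the sequential or random strategy."""
    if strategy == "sequential":
        return sites[:n], sites[n:]
    if strategy == "random":
        chosen = set(random.Random(seed).sample(range(len(sites)), n))
        train = [s for i, s in enumerate(sites) if i in chosen]
        rest = [s for i, s in enumerate(sites) if i not in chosen]
        return train, rest
    raise ValueError(f"unknown strategy: {strategy}")


def interleaved(sites, n0, n, learn, decide, is_correct):
    """Interleaved training; returns the sites left for fully automatic processing."""
    for site in sites[:n0]:
        learn(site)                       # user browses, system learns
    for site in sites[n0:n]:
        if not is_correct(site, decide(site)):
            learn(site)                   # wrong decision: user browses, system re-learns
    return sites[n:]


learned = []
remaining = interleaved(list("abcde"), 1, 3, learned.append,
                        decide=lambda s: s,
                        is_correct=lambda s, d: s != "b")
```

In the usage example the system is wrong only on site "b", so it learns from "a" and "b" and processes "d" and "e" on its own.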
System Evaluation
  • System Capabilities
  • Performance
    • Effectiveness: precision, recall, correctness
    • Efficiency: in a site, how many pages the system visits to find a result or to recognize the irrelevancy
    • Training efficiency: how many training samples are needed
  • Key Issues
    • Effectiveness of the knowledge
    • Effectiveness of training strategies
  • Tests on A Range of Queries
System Capabilities
  • The system returns segments of the Web pages
  • The segments may not contain any input keyword but meet the requirement of room rates.
    • The system learned the query requirement from the user!
  • Segments can be from pages whose URLs are not directly returned by Yahoo!.
    • The system learned how to follow the hyperlinks to the queried segment!
System Evaluation - Effectiveness
  • Given a set of URLs in a query session, the system makes N decisions

N = N1 + N2 + N3 + N4

Precision = N1 / (N1 + N3)

Recall = N1 / (# sites that contain results)

Correctness = (N1 + N2) / N
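
The three metrics translate directly into code. The transcript does not preserve the definitions of N1 to N4, so the meanings in the comments are inferred from the formulas and should be read as assumptions; the function name and the sample numbers are invented for the example.

```python
def effectiveness(n1, n2, n3, n4, sites_with_results):
    """Compute precision, recall and correctness from the four decision counts.

    Assumed meanings (inferred from the formulas, not stated in the source):
      n1: correct results returned
      n2: sites correctly judged to contain no result
      n3: incorrect results returned
      n4: remaining incorrect decisions
    """
    n = n1 + n2 + n3 + n4
    return {
        "precision": n1 / (n1 + n3),
        "recall": n1 / sites_with_results,
        "correctness": (n1 + n2) / n,
    }


scores = effectiveness(n1=8, n2=9, n3=2, n4=1, sites_with_results=10)
```

For these invented counts precision and recall are both 0.8 and correctness is 0.85. The sketch does not guard against empty denominators.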

System Evaluation - Efficiency
  • How efficiently does the system find a queried segment in a site?

Level of a Queried Segment = the length of the shortest path to find it

Absolute Path Length = # visited pages

Relative Path Length = # visited pages / Level of the Queried Segment
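
One way to compute these measures is a breadth-first search for the level, followed by a simple ratio. The sketch assumes that the level is counted in pages, so a segment on the start page has level 1; the function names and the toy link map are hypothetical.

```python
from collections import deque


def level_of(start, target, links):
    """Number of pages on the shortest path from start to the page holding the segment."""
    queue, seen = deque([(start, 1)]), {start}
    while queue:
        page, depth = queue.popleft()
        if page == target:
            return depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # target not reachable from start


def relative_path_length(visited_pages, level):
    """Absolute path length (# visited pages) divided by the level."""
    return len(visited_pages) / level


toy_links = {"home": ["about", "rates"], "rates": ["promo"]}
level = level_of("home", "promo", toy_links)
ratio = relative_path_length(["home", "about", "rates", "promo"], level)
```

A ratio of 1 means the system walked straight to the queried segment; here one detour through "about" gives 4/3.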

Basic Performance
  • Q11: Hong Kong Hotel Room Rate
  • Q12: Hong Kong Hotel

Sequential training

Effectiveness of Knowledge

Two other systems were implemented for comparison.

  • Classification Knowledge Only: treats links and segments the same, using the Bayes classifier
    • Learning: training samples are taken from user actions

      Action            Positive       Negative
      click a link      the link       other links on the page
      mark a segment    the segment    other segments on the page

    • Locating: classify all segments and links. If a link has the highest confidence, follow the link; if a segment has the highest confidence and passes the threshold, return it.

Effectiveness of Knowledge
  • Navigation Knowledge Only: only checks the descriptive information of links and segments
    • Learning: navigational path -> navigation knowledge
    • Locating: assign scores to all links and segments using navigation knowledge. If a link has the highest score, follow the link; if a segment has the highest score, return it.

Effectiveness of Knowledge

  • Navigation Knowledge Only: bad filtering capability! Navigation only checks the description; nearly unworkable.
  • Classification Knowledge Only: poor navigation capability! Only works for results on the first page.
Effects of Training Strategies

Query Q12

Training Size 3-10

Effects of Training Strategies
  • Random training performs badly, low in recall
  • As the training size increases, interleaved training outperforms sequential training
  • Best accuracy reaches or exceeds 90% in all metrics when the interleaved training strategy is used
  • Enlarging the training size for random and sequential training is not effective

Improved Performance

Interleaved training

A Range of Queries
  • Hotel room rates: targets prices, which are easy to identify
  • Admission requirements for graduate students: includes items such as degree, GPA, GRE, etc. that are not easy to specify in keywords but easy to show by marking
  • Data mining researcher: a subjective concept; evidence includes research interests, projects, professional activities, etc.
Results of a Range of Queries

Interleaved training

Figure: results for the range of queries; callouts mark results as "More precise".

Performance for the Queries
  • Effectiveness
    • first 4 queries: accuracy is 80% to above 90%
    • the last query: still capable of filtering out irrelevant sites
  • Efficiency
    • relative path length to locate a queried segment is close to 1
    • absolute path length to conclude irrelevancy is no more than 2.5 pages.
  • The performance is not affected much by how precise the keyword query is. The system learns the query requirements.
Conclusions
  • Proposed and implemented learning based Web query processing with the following features
    • Returning succinct results: segments of pages;
    • No a priori knowledge or preprocessing, suited for ad hoc queries;
    • Exploiting page formatting and linkage information simultaneously.
  • The preliminary results are promising
Future Work
  • Better segmentation for HTML documents
  • Better knowledge, the key factor that affects system performance
    • other weighting schemes for navigation knowledge
    • other implementations of classification knowledge
  • More system evaluation
  • Dynamic web pages