Learning Based Web Query Processing
Yanlei Diao
Computer Science Department
Hong Kong U. of Science & Technology

Outline
Background
Learning Based Web Query Processing
FACT: A Prototype System
Preliminary System Evaluation
Conclusions
Demonstration

Searching the Web
Want to find a piece of information on the Web?
Large hit lists with low precision, rely on input queries
The agents learn user profiles/models from their search behaviors and employ the knowledge to predict URLs of interest to the user.
Not well suited for ad hoc queries
Not scalable; good for Web site management but not for queries on the entire Web.
A Web query processing system for any Web users that
Learn the knowledge for query processing from the User!
An approach with learning capabilities:
<title> … Hotel </title></head>
<body><p>1999 Room Rates</p>
<td> Special Promotion <br>
& contents of child paragraph and table
"Special Promotion" & the content of the child table
"1999 Room Rates"
"Room TypeSingle /Double (HK$)Standard1000Executive Suite2750"
2) Navigation in the graph should terminate if a segment answers the query well enough or if a conclusion of irrelevance can be drawn.
A decision of following a link or choosing a segment should be made on each page.
Segments and links on a page should be comparable!

Knowledge for the Locating Task
The locating task is to find a segment in the Segment Graph of a site as the query result.
Two types of knowledge are needed!
A link conveys a description of the page it points to, while a queried segment contains both a description and the result itself.
Navigational path: (link)*segment, e.g. L2L4S41.
Extended navigational path: ((segment)*link)*((segment)*segment), e.g. (S1S11L2)(S3S31L4)(S4S41).
Step 1. Assign a weight to each component on the path, e.g. L2, S31, S41. The closer to the target, the higher the weight.
Step 2. Assign a weight to each term in the description of a component on the path.
The weight of a term can be summed up over navigational paths. The set of (term, weight) pairs is stored into the navigation knowledge base.
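
The steps above can be sketched roughly as follows. The slides only say that components closer to the target get higher weights and that term weights are summed over paths; the linear weighting (position / path length), the even split of a component's weight among its terms, and the name learn_navigation_knowledge are assumptions made for illustration.

```python
from collections import defaultdict


def learn_navigation_knowledge(paths):
    """Build a {term: weight} navigation knowledge base.

    paths: list of navigational paths. Each path is a list of component
    descriptions (strings), ordered from the start page to the target.
    """
    knowledge = defaultdict(float)
    for path in paths:
        n = len(path)
        for position, description in enumerate(path, start=1):
            # Step 1 (assumed scheme): weight grows linearly towards the target.
            component_weight = position / n
            terms = description.lower().split()
            for term in terms:
                # Step 2 (assumed scheme): split the component weight evenly
                # among its terms, then sum over all paths.
                knowledge[term] += component_weight / len(terms)
    return dict(knowledge)


# Hypothetical training paths, not taken from the slides.
example_paths = [
    ["reservation", "room rates"],
    ["room rates"],
]
navigation_kb = learn_navigation_knowledge(example_paths)
```

With these two hypothetical paths, terms that appear near the target on both paths ("room", "rates") end up with a larger weight than a term seen only at the start of one path ("reservation").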
The queried segment is a positive sample. All other segments on the same page are negative samples.
The content of each segment is parsed into a set of features of either simple or complex type.
Count NP and NN accumulatively for each feature over all samples. Store all triples (feature, NP, NN) into the classification knowledge base.
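
A minimal sketch of this counting step is given below. It assumes that NP and NN are the numbers of positive and negative samples in which a feature occurs, and it uses plain word tokens as stand-ins for the simple and complex feature types; the function names are hypothetical.

```python
from collections import defaultdict


def new_classification_kb():
    """Empty knowledge base: feature -> [NP, NN]."""
    return defaultdict(lambda: [0, 0])


def add_page_samples(kb, page_segments, queried_index):
    """Update the counts from one training page.

    page_segments: list of feature sets, one per segment on the page.
    queried_index: index of the segment the user marked (positive sample);
                   all other segments on the page are negative samples.
    """
    for index, features in enumerate(page_segments):
        slot = 0 if index == queried_index else 1  # 0 = NP, 1 = NN
        for feature in set(features):
            kb[feature][slot] += 1
    return kb


def as_triples(kb):
    """Return the stored (feature, NP, NN) triples, sorted by feature."""
    return sorted((feature, np_, nn) for feature, (np_, nn) in kb.items())


# Hypothetical page with three segments; the first one was marked by the user.
kb = new_classification_kb()
add_page_samples(kb, [{"room", "rates", "hk$"}, {"dining", "room"}, {"fax"}], 0)
triples = as_triples(kb)
```

Calling add_page_samples once per training page accumulates the counts over all samples.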
User browses it!
User marks it!
Hotel Reservation->single hk$ double hk$ standard room deluxe room +executive room
Holiday Inn Golden Mile
In the heart of Tsim Sha Tsui - Kowloon, Holiday Inn Golden Mile is your number one choice for accommodation, dining, meetings and banquets.
Ideally situated in the heart of ...
Positive single hk$ double hk$
standard room 999.00 1,039.00
deluxe room 1,199.00 1,239.00
+executive room 1,399.00 1,499.00
FACT starts here!
The page contains
Navigation knowledge shows
57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC
Located in the hub of Wanchai, the Wharney Hotel is within walking distance of the Hong Kong Arts Centre, Convention and Exhibition Centre, busy commercial complexes and shopping malls.
TEL: (852) 2861-1000 FAX: (852) 2865-6023
Features & Services
Dining and Banqueting
FACT chooses it!
A learning based query processing system:
N = N1 + N2 + N3 + N4
Precision = N1 / (N1 + N3)
Recall = N1 / # sites that contain results
Correctness = (N1 + N2) / N
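
The three measures can be computed directly from the four counts. The sketch below simply transcribes the formulas; the function name and the example counts are hypothetical, and returning None for a zero denominator is an assumption.

```python
def evaluation_metrics(n1, n2, n3, n4, sites_with_results):
    """Compute precision, recall and correctness from the counts N1..N4."""
    n = n1 + n2 + n3 + n4

    def ratio(numerator, denominator):
        # Assumption: report None when a measure is undefined.
        return numerator / denominator if denominator else None

    return {
        "precision": ratio(n1, n1 + n3),
        "recall": ratio(n1, sites_with_results),
        "correctness": ratio(n1 + n2, n),
    }


# Hypothetical counts for ten test sites, eight of which contain results.
metrics = evaluation_metrics(n1=6, n2=2, n3=2, n4=0, sites_with_results=8)
```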
Level of a Queried Segment = the length of the shortest path to find it
Absolute Path Length = # Visited pages
Relative Path Length = # Visited pages / Level of the Queried Segment
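
As an illustration, the level can be found with a breadth-first search over a site's page graph. The sketch assumes the level is counted in pages on the shortest path, start page included, so that a relative path length of 1 means no detour; the slides do not state the counting convention, and the site graph below is hypothetical.

```python
from collections import deque


def segment_level(links, start_page, target_page):
    """Number of pages on the shortest path from start_page to target_page.

    links: dict mapping a page to the list of pages it links to.
    Returns None if the target page cannot be reached.
    """
    queue = deque([(start_page, 1)])
    seen = {start_page}
    while queue:
        page, depth = queue.popleft()
        if page == target_page:
            return depth
        for next_page in links.get(page, []):
            if next_page not in seen:
                seen.add(next_page)
                queue.append((next_page, depth + 1))
    return None


def path_lengths(visited_pages, level):
    """Return (absolute path length, relative path length)."""
    return visited_pages, visited_pages / level


# Hypothetical site: the queried segment sits on the "rates" page.
site = {"home": ["rooms", "dining"], "rooms": ["rates"], "dining": []}
level = segment_level(site, "home", "rates")
absolute, relative = path_lengths(visited_pages=4, level=level)
```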
Other two systems implemented for comparison
Action            Positive        Negative
Click a link      the link        other links on the page
Mark a segment    the segment     other segments on the page
Classify all segments and links
If a link has the highest confidence, follow the link;
If a segment has the highest confidence and passes the threshold, return it.
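
The per-page decision can be sketched as below. The confidences are assumed to come from a classifier that is not shown here, and stopping when the best segment falls below the threshold is an assumption, since the slides do not say what happens in that case. The names decide_on_page and the candidate tuples are hypothetical.

```python
def decide_on_page(candidates, threshold):
    """Choose the next action on one page.

    candidates: list of (kind, item, confidence) tuples, where kind is
                "link" or "segment" and confidence is comparable across both.
    Returns ("follow", link), ("return", segment) or ("stop", None).
    """
    if not candidates:
        return ("stop", None)
    kind, item, confidence = max(candidates, key=lambda c: c[2])
    if kind == "link":
        return ("follow", item)
    if confidence >= threshold:
        return ("return", item)
    # Assumption: a best segment below the threshold ends the navigation.
    return ("stop", None)


# Hypothetical candidates on two pages.
first_page = [("link", "Hotel Reservation", 0.8), ("segment", "Dining", 0.3)]
second_page = [("link", "Features & Services", 0.2), ("segment", "Room rates", 0.9)]
first_action = decide_on_page(first_page, threshold=0.5)
second_action = decide_on_page(second_page, threshold=0.5)
```

Because links and segments are scored on the same scale, a single max over all candidates is enough to pick between following and returning.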
Navigational path Navigation Knowledge
Assigns scores to all links and segments using
If a link has the highest score, follow the link;
If a segment has the highest score, return it.
Training Size 3-10