Block-based Web Search

Block-based Web Search Deng Cai1*, Shipeng Yu2*, Ji-Rong Wen*, Wei-Ying Ma* SIGIR’04 *Microsoft Research Asia Beijing, China {jrwen, wyma}@microsoft.com 1Tsinghua University Beijing, China cai_deng@yahoo.com 2Institute for Computer Science University of Munich Yushipeng@yahoo.com

Introduction • Passage retrieval is a research topic with a long history in IR. • It is particularly useful when documents contain multiple, drifting subjects. • The content of a web page is usually diverse and encompasses multiple regions with unrelated topics. • We argue that the characteristics of web pages make passage-level retrieval a more effective mechanism for IR. • A highly relevant region may be obscured by the low overall relevance of the page. • It is therefore necessary to segment a web page into semantically independent units.

Introduction • In document retrieval, the similarity measure is sensitive to document length. • Some measures (e.g., the cosine measure) favor short documents. • Web pages suffer from the same problem. • We compare four page segmentation approaches for improving web IR and show that, unlike the fixed-window approach, semantic partitioning can be easier and more accurate.

Web Page Segmentation • Passages can be categorized into three classes: • Discourse passages rely on the logical structure of the document, as marked by punctuation. • Semantic passages partition a document into topics according to its semantic structure. • Fixed-length passages are defined to contain a fixed number of words. • Web pages have new characteristics. • Two-dimensional logical structure: each region can be related to neighbors in four directions. • Instead of “passage”, we prefer to use “block” to denote a region of a web page.

Web Page Segmentation • There has been some research on web page segmentation: • Traditional passages: the results are not encouraging. • DOM: not targeted at web IR and thus difficult to evaluate. • We introduce VIPS (Vision-based Page Segmentation), a method that uses visual cues to obtain a more accurate content structure at the semantic level. • VIPS blocks still have the varying-length problem. • We therefore introduce a combined algorithm that takes advantage of both visual layout and length normalization.

The Four Methods • Fixed-length Page Segmentation (FixedPS) • For web documents it is identical to the traditional window approach, except that all HTML tags are removed. • DOM-based Page Segmentation (DomPS) • Partitions pages based on their pre-defined syntactic structure, i.e., the HTML tags. • There is no consistent way to do this, and little work has been done on it for web IR. • The DOM is still a linear structure, so visually adjacent blocks may be far from each other in it. • The DOM reflects presentation more than content.

The Four Methods • Vision-based Page Segmentation (VIPS) • A closely packed block within a web page is very likely to be about a single topic. • The blocks obtained are based on the semantic structure of the web page. • Traditional content analysis is discarded; blocks are produced from visual cues. • The DOM structure and visual information are used iteratively to generate a vision-based content structure.

The Four Methods • A Combined Approach (CombPS) • With VIPS on the WT10g dataset, the distribution of block lengths is very diverse. • Since the fixed-length window deals very consistently with the varying-length problem, we propose this combined approach. • After applying the VIPS method, apply fixed-length block extraction to each visual block. • The first window starts at the first word of the block, and each subsequent window half-overlaps the preceding one, until the end of the block. • Visual blocks shorter than the pre-defined length are output directly.
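
The second step of CombPS described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `comb_windows` and the representation of a block as a list of words are assumptions made for the example.

```python
def comb_windows(block_words, window=200):
    """Split one visual block into half-overlapping fixed-length windows.

    block_words: list of words of a single VIPS block (assumed representation).
    window: pre-defined window length in words.
    """
    # Blocks no longer than the window are output directly.
    if len(block_words) <= window:
        return [block_words]

    step = window // 2  # each window half-overlaps the preceding one
    windows = []
    start = 0
    while True:
        windows.append(block_words[start:start + window])
        if start + window >= len(block_words):
            break  # this window already reaches the end of the block
        start += step
    return windows
```

With a 500-word block and `window=200`, the windows start at words 0, 100, 200 and 300; the last window may be shorter than `window` when the block length is not a multiple of the step.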

VIPS: a Vision-based Page Segmentation Algorithm Deng Cai Shipeng Yu Ji-Rong Wen Wei-Ying Ma Microsoft Research

Visual Block Extraction • Whether a DOM node can be divided is judged based on: • The properties of the DOM node itself. • HTML tag, background color, size, shape… • The properties of the children of the DOM node. • Same as above; the number of different kinds of children is also considered. • Definitions: • Inline node: a node with an inline text HTML tag, such as <B>, <FONT>,… • Line-break node: any other node. • Valid node: a node that can be seen in the browser (width and height not zero). • Text node: a node corresponding to free text. • Virtual text node: an inline node whose children are all text nodes or virtual text nodes.
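
The node definitions above translate into simple predicates. The sketch below assumes a hypothetical `Node` class with a tag, a rendered size and a list of children; a real implementation would read these from the browser's DOM. The set `INLINE_TAGS` is only an illustrative subset, since the slide lists just <B> and <FONT>.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative subset of inline text tags (assumption; the source names only B and FONT).
INLINE_TAGS = {"b", "font", "i", "em", "strong", "u"}


@dataclass
class Node:
    """Hypothetical DOM node: tag is None for free text."""
    tag: Optional[str]
    width: int = 0
    height: int = 0
    children: List["Node"] = field(default_factory=list)


def is_text(n: Node) -> bool:
    return n.tag is None


def is_inline(n: Node) -> bool:
    return n.tag is not None and n.tag.lower() in INLINE_TAGS


def is_line_break(n: Node) -> bool:
    # Any tagged node that is not an inline node.
    return n.tag is not None and not is_inline(n)


def is_valid(n: Node) -> bool:
    # Visible in the browser: width and height are both non-zero.
    return n.width > 0 and n.height > 0


def is_virtual_text(n: Node) -> bool:
    # Recursive definition: an inline node whose children are all
    # text nodes or virtual text nodes (at least one child assumed).
    return (is_inline(n) and bool(n.children)
            and all(is_text(c) or is_virtual_text(c) for c in n.children))
```

For example, a <FONT> node containing a <B> node that wraps free text is a virtual text node, while a <DIV> wrapping the same content is a line-break node.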

Visual Block Extraction • Important cues for the heuristics: tag cue, color cue, text cue, and size cue. A DoC (degree of coherence) value is assigned to each block at the same time. • When <TABLE> is met, trace into the <TR> node (R2: if the DOM node has only one valid child and the child is not a text node, divide this node). • Only three of the five children are valid. • The <TR> node is split (R8: if the background color of the node differs from that of one of its children, divide this node; that child is not divided in this round). • The second and fourth children of the <TR> node are not valid (R1: if the node is not a text node and has no valid children, the node is cut). • The third and fifth children of <TR> are not divided in this round (R11: if the previous sibling node has not been divided, do not divide this node).

Visual Separator Detection • If a block is contained in, crosses, or covers a separator, the separator is split, updated, or removed, respectively. • The weight of a separator is assigned based on the visual difference between its neighboring blocks: • Distance between the blocks on either side of the separator. • Overlap with certain HTML tags (e.g., <HR>). • Difference in background color between the two sides. • Difference in font properties, for horizontal separators.
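
The split/update/remove rule can be illustrated in one dimension for horizontal separators. The sketch assumes that blocks and separators are represented as `(top, bottom)` intervals along the vertical axis and that detection starts from a single separator spanning the whole page; the function name and these representations are choices made for the example, and separator weights are not computed here.

```python
def detect_separators(blocks, page_top, page_bottom):
    """Return the horizontal gaps left after placing all blocks.

    blocks: list of (top, bottom) intervals, one per visual block.
    """
    seps = [(page_top, page_bottom)]  # start with one page-wide separator
    for top, bottom in blocks:
        updated = []
        for s, e in seps:
            if bottom <= s or top >= e:
                updated.append((s, e))        # block does not touch this separator
            elif s < top and bottom < e:
                updated.append((s, top))      # block contained in separator: split
                updated.append((bottom, e))
            elif top <= s and bottom >= e:
                continue                      # block covers separator: remove
            elif top <= s:
                updated.append((bottom, e))   # block crosses the upper edge: update
            else:
                updated.append((s, top))      # block crosses the lower edge: update
        seps = updated
    # Drop degenerate separators with no height.
    return [(s, e) for s, e in seps if e > s]
```

For blocks at `(0, 100)`, `(120, 200)` and `(230, 300)` on a page from 0 to 300, the remaining separators are the gaps `(100, 120)` and `(200, 230)`.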

Visual Separator Detection • Six blocks are put in the pool and five separators are detected. • S23 and S45 get a higher weight (different font).

Content Structure Construction • Construction starts from the separators with the lowest weight. • Blocks are merged until the separators with the maximum weight are met. • Each leaf node is checked to see whether it meets the granularity requirement. If not, go back to the Visual Block Extraction step to construct a sub content structure within that node. • In the first iteration, the first, third and fifth separators are chosen to form VB_2_2_1, VB_2_2_2, and CB_2_2_3, and so on.
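
The merging order can be sketched as a loop over separator weights. The example below is an assumption-laden illustration: blocks are plain labels, `weights[i]` is the weight of the separator between `blocks[i]` and `blocks[i + 1]`, merged blocks are represented as nested lists, and the granularity (DoC) check on leaf nodes is left out.

```python
def build_structure(blocks, weights):
    """Merge adjacent blocks across the lowest-weight separators first.

    Stops when only separators of the maximum weight remain.
    """
    assert len(weights) == len(blocks) - 1
    blocks, weights = list(blocks), list(weights)
    while weights and min(weights) < max(weights):
        low = min(weights)
        new_blocks, new_weights = [], []
        group = [blocks[0]]
        for w, b in zip(weights, blocks[1:]):
            if w == low:
                group.append(b)  # lowest-weight separator: merge neighbors
            else:
                # Heavier separator: close the current group and keep the separator.
                new_blocks.append(group[0] if len(group) == 1 else group)
                new_weights.append(w)
                group = [b]
        new_blocks.append(group[0] if len(group) == 1 else group)
        blocks, weights = new_blocks, new_weights
    return blocks
```

With six blocks and weights `[1, 3, 1, 3, 1]`, which mirrors the slide's example where the first, third and fifth separators are chosen first, one iteration yields three merged blocks separated by the two heavier separators.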

Three steps: block extraction, separator detection and content structure construction.

Experiment Setup • Four page segmentation methods are evaluated: • FixedPS: window length set to 200 words. • DomPS: traverse the DOM tree looking for structural tags. If there are no more structural tags within the current structural tag, a block is constructed. • VIPS: the permitted degree of coherence is set to 0.6. • CombPS: in the second step, window length is set to 200 words. • A full-document approach, in which no segmentation is performed, is also implemented for comparison. • Block Retrieval verifies whether page segmentation helps to deal with the length normalization and multiple-topic problems.

Experiment Setup • Query Expansion tests whether page segmentation can benefit the selection of query expansion terms. • The experiments are based on the Web Tracks of TREC 2001 and 2002. • Okapi is chosen as the IR system, with BM2500 as the weighting function. • Precision at 10 (P@10) is the main evaluation metric; average precision is also evaluated for TREC 2001, since that track is closer to ad-hoc retrieval.

Block Retrieval • The experiments are conducted in three steps: initial retrieval, page segmentation, and block retrieval. • We obtain the document rank (DR), and pages can be re-ranked based on the single best-ranked block within each page (BR). • The final rank of each page combines DR and BR, weighted by a parameter α. • Table 3 shows the results. FullDoc is not listed since it is the baseline. The last column shows the results of combining block and document rank, with α being optimal for each method. • The dependency between P@10 and α is illustrated in Figure 4.
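
The re-ranking step can be sketched as below. The exact combination formula is not given on the slide, so the linear form `alpha * DR + (1 - alpha) * BR` used here is an assumption, as are the function names and the dictionary-based representation of ranks (1 is best).

```python
def page_block_rank(ranked_blocks):
    """Rank pages by their single best-ranked block.

    ranked_blocks: list of (page_id, block_id) pairs, best block first.
    """
    rank = {}
    for page_id, _block_id in ranked_blocks:
        if page_id not in rank:
            rank[page_id] = len(rank) + 1  # first appearance = best block
    return rank


def combine_ranks(doc_rank, block_rank, alpha):
    """Order pages by an assumed linear mix of document and block rank.

    Both dictionaries are assumed to cover the same set of pages.
    """
    combined = {p: alpha * doc_rank[p] + (1 - alpha) * block_rank[p]
                for p in doc_rank}
    # Lower combined value is better; ties fall back to the document rank.
    return sorted(combined, key=lambda p: (combined[p], doc_rank[p]))
```

Under this assumed form, `alpha = 1` reproduces the document ranking and `alpha = 0` ranks by blocks only, which corresponds to the DR and BR settings compared in the experiments.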

Block Retrieval • With BR only, DomPS performs worst, and on TREC 2002 no method exceeds the baseline. • With BR+DR, all four methods improve significantly and all beat the baseline; this shows the effect of rank combination, as in passage retrieval. • DomPS is still the worst; VIPS and CombPS are still better, and the relative ordering of the methods is similar to that without combination.

Block Retrieval • In Figure 4, the winner on each dataset shows a consistent improvement over the other methods. • On TREC 2001, CombPS wins for almost every value of α; on TREC 2002, CombPS shows a rather similar trend when α > 0.4.

Block Retrieval • DomPS is always the worst, partly because the blocks it produces are too fine-grained and cannot be mapped to a single semantic part. • FixedPS performs very well in AvP, which confirms that varying length is still an important factor in web IR. • FixedPS gives way to VIPS and CombPS when P@10 is the main concern, partly because it lacks semantic partitioning and fails to recognize the best semantic blocks. • FixedPS and VIPS have different advantages and should be selected for different purposes. • By combining VIPS and FixedPS, CombPS aims for a tradeoff and gets very good and stable results (the best or very close to the best).

Query Expansion • After the block ranks are obtained, a fourth and a fifth step are executed: • Expansion term selection: all terms in the selected blocks, except the original query terms, are weighted by the term selection value TSV = w(1) * r/R, where R is the number of selected blocks and r is the number of those blocks that contain the term. The top 10 terms are selected. • Final retrieval: each original query term gets the new weight tf * 3, and each expansion term gets the weight 1-(n-1)/m, where n is the term's rank by TSV and m is the number of expansion terms, i.e., 10 in our experiments. • Figure 5 illustrates the P@10 values for different numbers of blocks, Table 4 lists the best P@10 value for each method, and Figure 6 shows the AvP comparison for TREC 2001.
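
The term selection and weighting steps above can be sketched as follows. Several things are assumptions for the sake of the example: blocks are lists of terms, the w(1) weights are supplied by the caller as a dictionary `w1`, and n is read as the rank of a term when sorted by TSV (1 for the highest TSV).

```python
def select_expansion_terms(blocks, query_terms, w1, top_k=10):
    """Pick expansion terms by TSV = w(1) * r / R and assign their weights.

    blocks: selected top-ranked blocks, each a list of terms.
    w1: hypothetical mapping term -> w(1) weight, computed elsewhere.
    Returns {term: weight} with weight = 1 - (n - 1) / m.
    """
    R = len(blocks)
    exclude = set(query_terms)

    # r: number of selected blocks that contain each candidate term.
    counts = {}
    for block in blocks:
        for term in set(block) - exclude:
            counts[term] = counts.get(term, 0) + 1

    tsv = {t: w1.get(t, 0.0) * r / R for t, r in counts.items()}

    # Highest TSV first; alphabetical order breaks ties deterministically.
    ranked = sorted(tsv, key=lambda t: (-tsv[t], t))[:top_k]
    m = len(ranked)
    return {t: 1 - (n - 1) / m for n, t in enumerate(ranked, start=1)}
```

With this reading, the top-ranked expansion term gets weight 1 and the last of ten gets 0.1, so expansion terms always weigh less than the original query terms, whose weight is tf * 3.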

Query Expansion • DomPS is still unstable and sometimes even worse than the baseline. • VIPS and FixedPS perform similarly, except that VIPS does better in AvP; CombPS is always the best. • Since TREC 2002 targets topic distillation, query expansion seems to bring little improvement over the baseline. Although CombPS wins, its improvement is not significant.

Query Expansion • Since the baseline is very low, many top-ranked documents are actually irrelevant, so FullDoc obtains low results in all experiments. • DomPS shows no significant improvement, partly because its segmentation is too fine-grained (the average block length is 540 bytes) and a block usually does not provide complete information. • VIPS considers more visual information and is more likely to obtain a semantic partition of a web page. VIPS tends to reach its best performance with a small number of blocks, which means that the top blocks are of very good quality. • FixedPS also achieves good performance. In some cases it can deal with “badly” presented pages that VIPS cannot handle. Because it gives no priority to short blocks, FixedPS is very steady.

Conclusion • We verified that page segmentation can significantly improve IR by dealing with the multiple-topic and varying-length problems of web pages. • By integrating semantic and fixed-length properties, we can overcome both problems and achieve the best performance.
