A densitometric approach to web page segmentation

Toyama Lab (遠山研), M-rin (M輪) • Brice Pesci

About the paper • « A densitometric approach to web page segmentation » • Leibniz Universität Hannover • Germany • CIKM 2008 • Conference on Information and Knowledge Management

Introduction • It is increasingly difficult to retrieve distinct information elements on the Web • Menus • Text ads • Snippets... • Need to identify the informative sections • Remove the noise

Web Page Segmentation • Goals • De-duplication • Abstract content from layout • Content extraction • Remove noise, increase classifier performance... • Keyword-based web search • Accuracy of the results

Related work • DOM tree analysis • Mine block-specific patterns • Determine template blocks • Shingles, element frequencies, isotonic regression, ... • Entropies, common DOM subtrees, ... • Vision-based • Still renders the DOM • Graph-theoretic perspective

Segmentation as... (1) • A visual problem • Heterogeneity • Various kinds of layouts • Various ways to generate the same layout • DOM-level rule-based algorithms are bound to fail • High complexity • Too slow? • Relationship to image segmentation • Image recognition

Segmentation as... (2) • A linguistic problem • Statistical measures to identify structure patterns in plain text documents • Subtopics, ... • $P_x = g(x)\,P_{x-1}$: the probability of class x depends only on the probability of the neighboring lower class • Examine the statistical properties of subsequent blocks with respect to their quantitative properties

Segmentation as... (2) • A linguistic problem • Distribution of document lengths • Zipf’s law: $y = C \cdot x^{-b}$, where y is the frequency of objects of a class, x is the rank of the class, and C and b are constants • Reasonable for segmenting text within a document? • Sentence length • Stochastic process • « The sentence lengths change along with the text flow » • Occurrence probability of a sentence length x: hyperpascal distribution with parameter q

Segmentation as... (3) • A densitometric problem • Atomic text portion: text without any element tag • Gap: a sequence of text portions interleaved by opening/closing element tag(s) • Which gaps separate? Which do not? • Most likely caused by a change in text flow • Short sentences: navigational menu... • We cannot use « sentences » because of templates
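
The notion of atomic text portions can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module and treats every opening or closing tag as a gap; the names `TextPortionParser` and `text_portions` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TextPortionParser(HTMLParser):
    """Collects atomic text portions: runs of text not interrupted by any tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.portions = []  # finished text portions, in document order
        self._buffer = []   # character data seen since the last tag

    def _flush(self):
        # Normalize whitespace and keep the portion only if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.portions.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        self._flush()  # every opening tag is a gap

    def handle_endtag(self, tag):
        self._flush()  # every closing tag is a gap

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def text_portions(html):
    """Return the list of atomic text portions of an HTML string."""
    parser = TextPortionParser()
    parser.feed(html)
    parser.close()
    return parser.portions


page = "<ul><li>Home</li><li>News</li></ul><p>The quick <b>brown</b> fox.</p>"
print(text_portions(page))
```

In this sketch the sentence inside `<p>` is split into three portions because `<b>` counts as a gap like any other tag, which is exactly why the question of which gaps really separate blocks matters.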

Segmentation as... (3) • A densitometric problem • Text density • Number of words within a 2D area • Similar to the intensity of a region in computer vision • Word-wrapped text: w_max = 80 characters per line • English: 5.1 chars/word, which bounds the number of words per line • French: 5.13 chars/word • German: 6.26 chars/word

Segmentation as... (3) • A densitometric problem • Need to remove the last line (it might not be complete) • Text density becomes: $\rho(b_x) = \frac{|T'(b_x)|}{|L(b_x)| - 1}$ if $|L(b_x)| > 1$, and $\rho(b_x) = |T(b_x)|$ otherwise • Where $b_x$ is a block, $L(b_x)$ its set of wrapped lines, $T(b_x)$ its set of tokens, and $T'(b_x)$ the tokens in all lines but the last • Not influenced by the number of additional tokens • Does not measure lexical/grammatical properties • Studies on language show that this may be sufficient
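
A minimal sketch of the text density computation, assuming whitespace-separated words as tokens and Python's `textwrap.wrap` as the word-wrapping step. The function name `text_density` and the handling of single-line blocks (density equals the token count) are assumptions made for illustration.

```python
import textwrap


def text_density(text, w_max=80):
    """Average number of tokens per wrapped line, ignoring the last line.

    Assumptions: tokens are whitespace-separated words, and a block that
    fits on a single line gets its token count as density.
    """
    lines = textwrap.wrap(text, width=w_max)
    if not lines:
        return 0.0
    if len(lines) == 1:
        return float(len(lines[0].split()))
    full_lines = lines[:-1]  # the last line might not be complete
    tokens = sum(len(line.split()) for line in full_lines)
    return tokens / len(full_lines)


menu_item = "Home"
paragraph = " ".join(["densitometric"] * 50)
print(text_density(menu_item), text_density(paragraph))
```

Short navigational items end up with a low density while running text fills its lines, so the two kinds of content separate along this single number.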

Segmentation as... (4) • A 1-dimensional problem • Detecting block-separating gaps on a web page • Finding neighboring text portions with a significant change in text density • Ex:

On to the block fusion algorithm • A greedy approach is plausible • Thanks to the relation between text flow, sentence length, and text density • Based on the Block Growing algorithm • From computer vision • Slope delta between 2 blocks • Surrounding blocks dominate enclosed ones • If the densities of the previous and next blocks are identical and higher than that of the enclosed block, we fuse the 3 of them

The Block Fusion algorithm • Plain • We fuse two adjacent blocks if the slope is below a threshold • Smoothed • Surrounding blocks dominate • If the densities of the previous and next blocks are identical and higher than that of the enclosed block, we fuse the 3 of them
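
The plain and smoothed variants can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: blocks are plain strings, the slope delta is taken to be the relative density difference `|a - b| / max(a, b)`, the threshold is left to the caller, and the helper names (`text_density`, `slope_delta`, `fuse_enclosed`, `block_fusion`) are hypothetical.

```python
import textwrap


def text_density(text, w_max=80):
    """Tokens per wrapped line, ignoring the possibly incomplete last line."""
    lines = textwrap.wrap(text, width=w_max)
    if len(lines) <= 1:
        return float(len(text.split()))
    return sum(len(line.split()) for line in lines[:-1]) / (len(lines) - 1)


def slope_delta(a, b):
    """Relative density difference between two blocks (assumed definition)."""
    top = max(a, b)
    return abs(a - b) / top if top > 0 else 0.0


def fuse_enclosed(blocks, w_max):
    """Smoothing step: fuse a block with its two neighbors when both
    neighbors have identical density and that density is higher."""
    out, i, fused = [], 0, False
    while i < len(blocks):
        if i + 2 < len(blocks):
            prev_d = text_density(blocks[i], w_max)
            mid_d = text_density(blocks[i + 1], w_max)
            next_d = text_density(blocks[i + 2], w_max)
            if prev_d == next_d and mid_d < prev_d:
                out.append(" ".join(blocks[i:i + 3]))
                i += 3
                fused = True
                continue
        out.append(blocks[i])
        i += 1
    return out, fused


def block_fusion(blocks, threshold, w_max=80, smoothed=False):
    """Greedily fuse adjacent blocks until no pair is below the threshold."""
    blocks = [b for b in blocks if b.split()]  # drop empty portions
    changed = True
    while changed:
        changed = False
        if smoothed:
            blocks, changed = fuse_enclosed(blocks, w_max)
        merged = []
        for block in blocks:
            if merged and slope_delta(
                text_density(merged[-1], w_max), text_density(block, w_max)
            ) <= threshold:
                merged[-1] = merged[-1] + " " + block  # fuse with previous
                changed = True
            else:
                merged.append(block)
        blocks = merged
    return blocks


paragraph = " ".join(["word"] * 60)
segments = block_fusion(["Home", "News", "About", paragraph, paragraph], threshold=0.6)
print(len(segments))
```

Each fusion reduces the number of blocks, so the loop terminates. For simplicity the sketch recomputes densities on every pass, so it makes no claim about matching the stated complexity.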

A few notes • 2 parameters (the line width w_max and the slope threshold): not document-specific • Input blocks B • Complexity: O(n) • About the gaps: • <h1> produces the same gap as <b>! • Version with rules: • T_force-gap (block-level elements): H1-H6, UL, DL, OL, HR, TABLE, ADDRESS, IMG, SCRIPT • T_no-gap (inline elements): A, B, BR, EM, FONT, I, S, SPAN, STRONG, SUB, SUP, U, TT
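
The rule-based variant can be sketched by treating tags differently when collecting text portions. The sketch below assumes Python's standard `html.parser`; the class name `RuleBasedParser`, the function `rule_based_portions`, and the `(text, forced)` output format are hypothetical. A `forced` flag of `True` marks a portion preceded by a force-gap tag, across which a later fusion step should not merge.

```python
from html.parser import HTMLParser

# Tag sets as listed on the slide.
FORCE_GAP = {"h1", "h2", "h3", "h4", "h5", "h6", "ul", "dl", "ol", "hr",
             "table", "address", "img", "script"}
NO_GAP = {"a", "b", "br", "em", "font", "i", "s", "span", "strong",
          "sub", "sup", "u", "tt"}


class RuleBasedParser(HTMLParser):
    """Collects (text, forced) pairs, honoring force-gap and no-gap tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.portions = []
        self._buffer = []
        self._forced = False  # force-gap tag seen since the last portion?

    def _boundary(self, tag):
        if tag in NO_GAP:
            return  # inline tag: the text keeps flowing
        text = " ".join("".join(self._buffer).split())
        if text:
            self.portions.append((text, self._forced))
            self._forced = False
        self._buffer = []
        if tag in FORCE_GAP:
            self._forced = True  # applies to the next portion

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._boundary(None)  # flush trailing text


def rule_based_portions(html):
    parser = RuleBasedParser()
    parser.feed(html)
    parser.close()
    return parser.portions


page = "<h1>Title</h1><p>The quick <b>brown</b> fox.</p><div>Footer</div>"
print(rule_based_portions(page))
```

With these rules `<b>` no longer splits the sentence, while the heading is kept apart from the paragraph regardless of density.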

Experiments (0) • WebSpam UK-2007 • 106 million pages from 115,000 hosts • 111 non-spam pages from 102 different sites • Manual results compared to: • Word wrap: every line is a segment • Tag gap: text portions between tags (except A) • BF-plain / BF-smoothed / BF-rulebased • JustRules • GCuts

Experiments (1) • Statistical properties of web page text • Text density / sentence length? • Adjacent blocks with the same density: one block • Not what we expected, but still holds • Also, the number of tokens in a segment follows Zipf’s law

Experiments (2) • Segmentation accuracy • 2 cluster correlation metrics between 0 and 1 • Adjusted Rand Index • Normalized Mutual Information • Figures: BF-plain/smoothed, BF-rulebased
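
As an illustration of the first metric, here is a minimal sketch of the Adjusted Rand Index using its standard pair-counting formula. It assumes each token is labeled with the id of the segment it belongs to, once by the manual segmentation and once by the algorithm; the function name and the return value for degenerate inputs are choices made for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    # Pairs of items grouped together in both, in A, and in B.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both use a single segment
    return (sum_ij - expected) / (max_index - expected)


manual = [0, 0, 0, 1, 1, 1]     # segment id per token, manual segmentation
automatic = [0, 0, 1, 1, 1, 1]  # segment id per token, algorithm output
print(adjusted_rand_index(manual, automatic))
```

A value of 1 means the two segmentations agree exactly. The index is corrected for chance, so unrelated segmentations score near 0 and worse-than-chance agreement can score below 0.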

Experiments (2) • Segmentation: • BF-plain

Experiments (2) • Segmentation: • BF-smoothed

Experiments (2) • Segmentation: • BF-rulebased

Experiments (3) • Average accuracy: • Performance • Most of the error gets removed after the first iteration • On a standard laptop: 15 ms per page

Experiments (4) • Effect of w_max • Confirms the relation between the language-specific average and the line width • Stable between 80 and 100

Experiments (5) • On near-duplicate detection • Using the LYRICS dataset • 2359 web pages of song lyrics by 6 artists • Very effective for near-duplicate detection • Narrow winner: JustRules

Conclusion • Web page segmentation • Token-level text density is an effective property • New method is inspired by quantitative linguistics and computer vision

End • Thank you for listening
