Toyama Lab (遠山研), Master's seminar (M輪). Brice Pesci. A densitometric approach to web page segmentation. About the paper. « A densitometric approach to web page segmentation » Leibniz Universität Hannover, Germany. CIKM 2008, Conference on Information and Knowledge Management. Introduction.
Toyama Lab, Master's seminar • Brice Pesci • A densitometric approach to web page segmentation
About the paper • « A densitometric approach to web page segmentation » • Leibniz Universität Hannover • Germany • CIKM 2008 • Conference on Information and Knowledge Management
Introduction • It is increasingly difficult to retrieve distinct information elements on the Web • Menus • Text ads • Snippets... • Need to identify the informative sections • Remove the noise
Web Page Segmentation • Goals • De-duplication • Abstract content from layout • Content extraction • Remove noise, increase classifier performance... • Keyword-based web search • Accuracy of the results
Related work • DOM tree analysis • Mine block-specific patterns • Determine template blocks • Shingles, element frequencies, isotonic regression, ... • Entropies, common DOM subtrees, ... • Vision-based • Still renders the DOM • Graph-theoretic perspective
Segmentation as... (1) • A visual problem • Heterogeneity • Various kinds of layouts • Various ways to generate the same layout • DOM-level rule-based algorithms are bound to fail • High complexity • Too slow? • Relationship to image segmentation • Image recognition
Segmentation as... (2) • A linguistic problem • Statistical measures to identify structural patterns in plain-text documents • Subtopics, ... • $P_x = g(x)\,P_{x-1}$: the probability of class x depends only on the probability of the neighboring lower class • Examine the statistical properties of subsequent blocks with respect to their quantitative properties
Segmentation as... (2) • A linguistic problem • Distribution of document lengths • Zipf's law: $y = C \cdot x^{-b}$, with y the frequency of objects of a class, x the rank of the class, and C, b constants • Reasonable for segmenting the text within a document? • Sentence length • Stochastic process • « The sentence lengths change along with the text flow » • Occurrence probability of a sentence length x: hyperpascal distribution (with parameter q)
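To make the Zipf's-law bullet concrete, here is a minimal sketch that estimates the exponent from observed frequencies by a least-squares fit in log-log space. It assumes the usual power-law form y = C · x^(-b); the function name `fit_zipf` and the fitting method are illustrative choices, not something taken from the slides or the paper.

```python
import math


def fit_zipf(frequencies):
    """Fit y = C * x**(-b) to frequencies ranked in decreasing order.

    Hypothetical helper: a plain least-squares line through the points
    (log rank, log frequency). Returns the pair (C, b).
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # In log space: log y = log C - b * log x, so b is minus the slope.
    return math.exp(mean_y - slope * mean_x), -slope


if __name__ == "__main__":
    # Synthetic frequencies that follow y = 1000 / rank exactly.
    c, b = fit_zipf([1000.0 / rank for rank in range(1, 51)])
    print(round(c, 3), round(b, 3))
```

An exponent b close to 1 on real data would be the classic Zipf shape; a poor fit in log-log space suggests the data does not follow a power law.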
Segmentation as... (3) • A densitometric problem • Atomic text portion: a run of text without any element tag • Gap: the opening/closing element tag(s) that interleave a sequence of text portions • Which gaps separate? Which do not? • A separating gap is most likely caused by a change in text flow • Short sentences: navigational menu... • We cannot use « sentences » because of templates
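The notion of atomic text portions separated by tag gaps can be sketched with Python's standard `html.parser`. This is only an illustration under simple assumptions: every tag closes the current portion, whitespace is normalized, and the names `TextPortionParser` and `atomic_text_portions` are hypothetical. Script and style contents are not filtered out here.

```python
from html.parser import HTMLParser


class TextPortionParser(HTMLParser):
    """Collects runs of text that contain no element tag."""

    def __init__(self):
        super().__init__()
        self.portions = []
        self._buffer = []

    def _flush(self):
        # Close the current portion; collapse whitespace, skip empty runs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.portions.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        self._flush()  # any opening tag is (part of) a gap

    def handle_endtag(self, tag):
        self._flush()  # any closing tag is (part of) a gap

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def atomic_text_portions(html):
    parser = TextPortionParser()
    parser.feed(html)
    parser.close()
    return parser.portions


if __name__ == "__main__":
    page = "<ul><li>Home</li><li>News</li></ul><p>A long <b>paragraph</b> here.</p>"
    print(atomic_text_portions(page))
```

Note how the inline `<b>` splits the paragraph into three portions: deciding that this gap should not separate blocks is exactly the question the slide raises.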
Segmentation as... (3) • A densitometric problem • Text density • Number of words within a 2D area • Similar to the intensity of a region in computer vision • Word-wrapped text: wmax = 80 • English: 5.1 chars / word, which bounds the number of words / line • French: 5.13 chars / word • German: 6.26 chars / word
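As a rough illustration of "number of words within a 2D area", the sketch below wraps a text portion at wmax = 80 characters with the standard `textwrap` module and divides the token count by the number of wrapped lines. Whitespace tokenization and the name `naive_density` are assumptions made for the example.

```python
import textwrap


def naive_density(text, w_max=80):
    """Tokens per wrapped line, using whitespace-separated tokens."""
    lines = textwrap.wrap(text, width=w_max)
    if not lines:
        return 0.0
    return len(text.split()) / len(lines)


if __name__ == "__main__":
    menu_item = "Home"
    paragraph = " ".join(["word"] * 40)  # wraps into 3 lines at width 80
    print(naive_density(menu_item), naive_density(paragraph))
```

A short navigation label gets a low density, while running text approaches the per-line maximum implied by the average word length.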
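Segmentation as... (3) • A densitometric problem • Need to remove the last line (it might not be complete) • Text density becomes: $\rho(b_x) = \frac{|T'(b_x)|}{|L(b_x)| - 1}$ if $|L(b_x)| > 1$, and $\rho(b_x) = |T(b_x)|$ otherwise • Where $b_x$ is a block, $L(b_x)$ its set of wrapped lines, $T(b_x)$ its set of tokens, and $T'(b_x)$ the tokens on all lines except the last • Not influenced by the number of additional tokens • Does not measure lexical/grammatical properties • Studies on language show that this may be sufficient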
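The "remove the last line" refinement can be sketched as follows. The example assumes whitespace tokens and wrapping via `textwrap`, and it assumes that a block that fits on a single line simply falls back to its token count; the function name `text_density` is illustrative.

```python
import textwrap


def text_density(text, w_max=80):
    """Tokens per wrapped line, ignoring the (possibly incomplete) last line."""
    lines = textwrap.wrap(text, width=w_max)
    if len(lines) > 1:
        full_lines = lines[:-1]  # drop the last line, it may not be full
        tokens = sum(len(line.split()) for line in full_lines)
        return tokens / len(full_lines)
    # Assumed fallback: a one-line (or empty) block keeps its raw token count.
    return float(len(text.split()))


if __name__ == "__main__":
    paragraph = " ".join(["word"] * 40)  # lines of 16, 16 and 8 tokens
    print(text_density(paragraph))       # only the two full lines count
    print(text_density("Home About Contact"))
```

With the last line dropped, adding a few more tokens to the tail of a long paragraph does not change its density, which matches the "not influenced by the number of additional tokens" bullet.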
Segmentation as... (4) • A 1-dimensional problem • Detecting block-separating gaps on a web page • Finding neighboring text portions with a significant change in text density • Ex: [figure]
Onto the block fusion algorithm • A greedy approach is plausible • Thanks to the relation between text flow, sentence length and text density • Based on the Block Growing algorithm • From computer vision • Slope delta between 2 blocks • Surrounding blocks dominate enclosed ones • If the densities of the previous and next blocks are identical and higher, we fuse the 3 of them
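The slide names the slope delta between two blocks without giving its formula. One natural definition, assumed here rather than taken from the slide, is the relative difference between the text densities $\rho$ of two adjacent blocks:

```latex
\Delta\rho(b_i, b_{i+1}) =
  \frac{\left|\rho(b_i) - \rho(b_{i+1})\right|}
       {\max\bigl(\rho(b_i),\, \rho(b_{i+1})\bigr)}
```

Under this assumption the value lies between 0 (identical densities) and 1, so a single threshold can be applied regardless of the absolute density level.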
The Block Fusion algorithm • Plain • We fuse if the slope delta is below a threshold • Smoothed • Surrounding blocks dominate • If the densities of the previous and next blocks are identical and higher, we fuse the 3 of them
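The plain and smoothed variants described above can be sketched as a greedy loop over adjacent blocks. This is a simplified illustration, not the authors' implementation: the slope delta is assumed to be the relative density difference, the default threshold of 0.5 is arbitrary, blocks are plain strings, and all function names are hypothetical.

```python
import textwrap


def text_density(text, w_max=80):
    """Tokens per wrapped line, ignoring the last (possibly partial) line."""
    lines = textwrap.wrap(text, width=w_max)
    if len(lines) > 1:
        return sum(len(l.split()) for l in lines[:-1]) / (len(lines) - 1)
    return float(len(text.split()))


def slope_delta(d1, d2):
    """Assumed definition: relative difference of two densities."""
    top = max(d1, d2)
    return abs(d1 - d2) / top if top > 0 else 0.0


def fuse_pass(blocks, threshold, w_max, smoothed):
    """One left-to-right pass; returns (new_blocks, merged_anything)."""
    fused, merged_any, i = [], False, 0
    while i < len(blocks):
        cur = blocks[i]
        if fused:
            prev_d = text_density(fused[-1], w_max)
            cur_d = text_density(cur, w_max)
            if smoothed and i + 1 < len(blocks):
                next_d = text_density(blocks[i + 1], w_max)
                # Surrounding blocks dominate an enclosed, sparser block.
                if prev_d == next_d and cur_d < prev_d:
                    fused[-1] = " ".join([fused[-1], cur, blocks[i + 1]])
                    merged_any = True
                    i += 2
                    continue
            if slope_delta(prev_d, cur_d) <= threshold:
                fused[-1] = fused[-1] + " " + cur
                merged_any = True
                i += 1
                continue
        fused.append(cur)
        i += 1
    return fused, merged_any


def block_fusion(blocks, threshold=0.5, w_max=80, smoothed=False):
    """Repeat passes until no two neighbors can be fused any more."""
    blocks = [b for b in blocks if b.split()]
    while True:
        blocks, merged_any = fuse_pass(blocks, threshold, w_max, smoothed)
        if not merged_any:
            return blocks


if __name__ == "__main__":
    long_a = " ".join(["word"] * 40)
    long_b = " ".join(["word"] * 48)
    print(len(block_fusion(["Home", "News", "Contact", long_a, long_b], 0.6)))
    print(len(block_fusion([long_a, "short note", long_b], 0.6, smoothed=True)))
```

In the first call the three menu items fuse into one block and the two paragraphs into another; in the second, the short note enclosed by two equally dense paragraphs is absorbed only when `smoothed=True`.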
A few notes • 2 parameters • Fusion threshold: not document-specific • Input blocks B • Complexity: O(n) • About the gaps: • <h1> produces the same gap as <b>! • Version with rules: • Tforce gap (block-level elements): H1-H6, UL, DL, OL, HR, TABLE, ADDRESS, IMG, SCRIPT • Tno gap (inline elements): A, B, BR, EM, FONT, I, S, SPAN, STRONG, SUB, SUP, U, TT
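The rule-based variant can be pictured as a lookup on the tag that forms a gap. The two tag sets below are copied from the slide; the function `gap_rule` and its three return labels are hypothetical names chosen for this sketch.

```python
# Tags that always separate blocks (block-level elements on the slide).
FORCE_GAP = {
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "dl", "ol", "hr", "table", "address", "img", "script",
}

# Tags that never separate blocks (inline elements on the slide).
NO_GAP = {
    "a", "b", "br", "em", "font", "i", "s",
    "span", "strong", "sub", "sup", "u", "tt",
}


def gap_rule(tag):
    """Classify a tag: 'force' a gap, allow 'none', or defer to 'density'."""
    name = tag.lower()
    if name in FORCE_GAP:
        return "force"
    if name in NO_GAP:
        return "none"
    return "density"  # not covered by a rule: let text density decide


if __name__ == "__main__":
    for tag in ("H1", "b", "div"):
        print(tag, gap_rule(tag))
```

This addresses the "<h1> produces the same gap as <b>" problem: a heading always splits, bold text never does, and everything else is left to the density comparison.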
Experiments (0) • WebSpam UK-2007 • 106 million pages from 115,000 hosts • 111 non-spam pages from 102 different sites • Manual results compared to: • Word wrap: every line is a segment • Tag gap: text portions between tags (except A) • BF-plain / smoothed / rulebased • JustRules • GCuts
Experiments (1) • Statistical properties of web page text • Text density / sentence length? • Adjacent blocks with the same density: one block • Not what we expected, but still holds • Also, the number of tokens in a segment follows Zipf's law
Experiments (2) • Segmentation accuracy • 2 cluster correlation metrics between 0 and 1 • Adjusted Rand Index • Normalized Mutual Information • [Plots: BF-plain/smoothed, BF-rulebased]
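To show how a segmentation can be scored as a clustering, the sketch below computes the Adjusted Rand Index between two labelings that assign each token to a segment id. It uses the standard pair-counting formula in pure Python; representing a segmentation as one label per token is an assumption of this example, and the function name is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    # Pairs of items placed together in both, in A only, in B only.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0
    return (together - expected) / (maximum - expected)


if __name__ == "__main__":
    manual = [0, 0, 0, 1, 1, 1]     # hand-made segmentation, one id per token
    automatic = [5, 5, 5, 9, 9, 9]  # same grouping, different segment ids
    print(adjusted_rand_index(manual, automatic))
```

Segment ids only matter up to renaming, so the two labelings above count as identical. Note that the index can fall below 0 when the agreement is worse than chance.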
Experiments (2) • Segmentation: • BF-plain
Experiments (2) • Segmentation: • BF-smoothed
Experiments (2) • Segmentation: • BF-rulebased
Experiments (3) • Average accuracy: • Performance • Most of the error gets removed after the first iteration • On a standard laptop: 15 ms per page
Experiments (4) • Effect of wmax • Confirms the relation between the language-specific average and line width • Stable between 80 and 100
Experiments (5) • On near-duplicate detection • Using the LYRICS dataset • 2359 web pages of song lyrics by 6 artists • Very effective on near-duplicate detection • Narrow winner: JustRules
Conclusion • Web page segmentation • Token-level text density is an effective property • The new method is inspired by quantitative linguistics and computer vision
End • Thank you for listening