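Web Noises Detection and Elimination. PengBo, Dec 3, 2010. What are Web Noises? Navigation (NavGuide). Main topic (Topic). Advertisement (Adv). Call them Noises. Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction.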
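Web Noises Detection and Elimination • PengBo • Dec 3, 2010
[Figure: example Web page with three labeled regions: navigation (NavGuide), main topic (Topic), advertisement (Adv)]
Call them Noises • Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction. • Noises hamper automated information gathering and Web data mining (“Template Detection via Data Mining and its Applications”)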
Non-Relevant Data on the Web • A fundamental problem on the Web: many pages contain lots of non-relevant data • “non-relevant”: not directly related to the main topic / functionality of the page • Local (intra-page) noise • Irrelevant items within a Web page • E.g., banner ads, navigational guides
Duplicate data on the Web • Another problem on the Web: there is much duplicate or near-duplicate data • Mirrors, copied news, etc. • Global noise • Redundant objects • Larger than an individual page • E.g., mirror sites, duplicated Web pages
Why does noise matter? • Hypertext IR Principles (principles behind all link-based IR tools): • Relevant Linkage Principle • p links to q ⇒ q is relevant to p • Topical Unity Principle • q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other • Lexical Affinity Principle • The closer the links to q1 and q2 are, the stronger the relation between them.
Violations of Relevant Linkage Principle • Navigational links • http://www.ibm.com/ • Download links • http://www.beethoven.com/ • Advertisement links • http://www.yahoo.com/ • Endorsement links • http://www.ebay.com/ • Spam links
Violations of Topical Unity Principle • Violations of the Relevant Linkage Principle • Bookmark pages • http://bookmark.yinsha.com/ (online bookmarks) • General resource lists • http://sewm.pku.edu.cn/IR-Guide.txt (IR Guide) • Personal homepages • http://www.cse.iitb.ac.in/~soumen/ (Soumen’s Home Page)
Violations of Lexical Affinity Principle • Alphabetical index lists • Computer and Communication Companies ("M" entries) • HTML representation • Adjacent cells in the same column are far from each other in the HTML text
IR Tool Problems • Generalization • Search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites • Topic drift • Search for “Finite Model Theory” and get SF 49’ers fan web sites • Irrelevance • Get “Yahoo” as a result regardless of the query • Bias • Search for “computing companies” and get Microspy highly ranked
Hypertext Improvement Problem • Main Goal: develop hypertext processing techniques that: • automatically improve hypertext data (remove violations of the Hypertext IR principles) • are efficient and scalable (quickly process millions of pages)
Hypertext Cleaning [Figure: processing pipeline with the components Web, Crawler, Hypertext Cleaner, IR Tool]
DOM Tree • Template
Templates Detection • Semantic Definition: • A template is a master HTML shell page that is used as a basis for composing new pages • Content of new pages plugged into template shell • All pages share common look & feel • Usually controlled by a central authority • Not necessarily confined to a single site • May include variety of data • Navigational bars • Advertisements • Company info and policies
[Figure: example page divided into pagelets: search pagelet, ad pagelet, navigation pagelet, services pagelet, company info pagelet]
Pagelets • Semantic Definition: • A pagelet is a maximal region of a page that has a single topic or functionality • Not too large • has only one topic / functionality • Not too small • any larger region that contains it has other topics / functionalities
IR with Pagelets • Main Idea 1: use pagelets rather than pages as atomic units for information retrieval • Main Idea 2: eliminate pagelets belonging to templates
Pagelets: Syntactic Definition • A pagelet is a node in the HTML parse tree of a page satisfying the following: • Its HTML tag is one of the following: • <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, … • None of its children contains more than k hyperlinks • None of its ancestors is a pagelet
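The three syntactic conditions can be read as a top-down tree walk. The sketch below is one possible reading, not the authors' implementation: the `Node` class, the `find_pagelets` name and the lowercase tag set are illustrative assumptions, and the slide's tag list is elided ("…"), so only the six named tags are used.

```python
from dataclasses import dataclass, field

# Only the tags named on the slide; its list ends with "…", so this set is partial.
PAGELET_TAGS = {"table", "ol", "ul", "area", "p", "dl"}

@dataclass
class Node:
    """Hypothetical minimal parse-tree node: a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)

def count_links(node):
    """Number of <a> elements in the subtree rooted at node."""
    return (node.tag == "a") + sum(count_links(c) for c in node.children)

def find_pagelets(node, k):
    """Return the nodes that satisfy the three syntactic conditions."""
    qualifies = (
        node.tag in PAGELET_TAGS
        and all(count_links(c) <= k for c in node.children)
    )
    if qualifies:
        # Every descendant now has a pagelet ancestor, so do not descend.
        return [node]
    found = []
    for child in node.children:
        found.extend(find_pagelets(child, k))
    return found

page = Node("body", [
    Node("table", [
        Node("tr", [Node("ul", [Node("a"), Node("a"), Node("a")])]),
        Node("tr", [Node("a")]),
    ]),
    Node("p", [Node("a")]),
])
print([n.tag for n in find_pagelets(page, k=2)])
```

With k = 2 the outer table is rejected because one of its rows holds three links, so the walk descends and reports the inner list and the paragraph instead.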
Templates: Syntactic Definition • A template is a collection T = (p1,…,pk) of pagelets satisfying: • Similarity: p1,…,pk are identical or almost identical • Connectivity: every two pages owning pagelets in T are reachable from each other (ignoring link direction) through other pages owning pagelets in T. • Template Recognition Problem: given a set of pages S, find all the templates in S.
Template Recognition in Large Sets • Calculate shingle(p) for each pagelet p ∈ S • Cluster pagelets in S according to shingle • Discard clusters of size 1 • For each remaining cluster C: • Construct graph G_C of pages that own pagelets in C • Find undirected connected components of G_C • Output components of size > 1
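The recognition steps can be sketched as below. This is a simplified illustration under stated assumptions: `shingle` here is a stand-in (a SHA-1 hash of normalized text, which only groups identical pagelets, not "almost identical" ones), and the input format (`pagelets` as `(page_id, text)` pairs, `links` as page-id pairs) is hypothetical.

```python
import hashlib
from collections import defaultdict

def shingle(text):
    """Stand-in fingerprint (assumption): hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def find_templates(pagelets, links):
    """pagelets: list of (page_id, text); links: list of (page_id, page_id)."""
    # Cluster by shingle, keeping the set of pages that own each cluster's pagelets.
    clusters = defaultdict(set)
    for page, text in pagelets:
        clusters[shingle(text)].add(page)

    templates = []
    for pages in clusters.values():
        if len(pages) < 2:
            continue  # a cluster owned by a single page cannot form a template
        # Undirected graph G_C restricted to the pages of this cluster.
        adj = {p: set() for p in pages}
        for a, b in links:
            if a in adj and b in adj and a != b:
                adj[a].add(b)
                adj[b].add(a)
        # Connected components by depth-first search.
        seen = set()
        for start in sorted(pages):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                p = stack.pop()
                if p in component:
                    continue
                component.add(p)
                stack.extend(adj[p] - component)
            seen |= component
            if len(component) > 1:
                templates.append(component)
    return templates

pagelets = [
    ("a", "Home | About"), ("b", "home |  about"), ("c", "Home | About"),
    ("a", "Article one"), ("b", "Article two"), ("c", "Article three"),
]
links = [("a", "b")]
print(find_templates(pagelets, links))
```

Page c carries the same navigation pagelet but is not linked to a or b, so it falls into a component of size 1 and is not reported as part of the template.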
Evaluation • Question: • How to evaluate the performance/effectiveness of this cleaning algorithm?
Cleaning via feature weighting • In a given Web site • Noisy blocks: share common contents or presentation styles • Meaningful (or main) blocks: diverse in contents and presentation styles • Weighting features makes cleaning automatic (nothing is eliminated) • Source: “Eliminating noisy information in Web pages for data mining”
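DOM trees • HTML source: <BODY bgcolor=WHITE> <TABLE width=800 height=200> … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE> </BODY> • Corresponding DOM tree: root → BODY (bc=white) → children TABLE (width=800, height=200), IMG (width=800), TABLE (bc=red)
Build Site style tree (SST) [Figure: building the SST from the DOM trees of a site's pages; label: common]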
SST • Style Node S = (ELEMENTs, n) • ELEMENTs: a sequence of element nodes • n: number of pages that have this style • Element Node E = (Tag, Attr, STYLEs) • Tag: tag name, e.g., TABLE, IMG • Attr: display attributes of Tag, e.g., bgcolor=RED • STYLEs: style nodes below E
Quantify the importance • Inner node • Leaf node
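A minimal sketch of the two node types, plus one plausible way to merge page DOM trees into the SST. The merge rule (reuse a style node when the child sequence of tags and attributes matches, otherwise add a new one) is an assumption for illustration; `DomNode`, `signature` and `merge` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class DomNode:
    """Hypothetical DOM node: tag, display attributes, children."""
    tag: str
    attr: tuple = ()
    children: list = field(default_factory=list)

@dataclass
class StyleNode:
    elements: list            # ELEMENTs: sequence of ElementNode
    n: int = 0                # number of pages that have this style

@dataclass
class ElementNode:
    tag: str                  # Tag
    attr: tuple = ()          # Attr
    styles: list = field(default_factory=list)  # STYLEs below this element

def signature(nodes):
    """Sequence of (tag, attr) pairs identifying a presentation style."""
    return [(x.tag, x.attr) for x in nodes]

def merge(element, dom):
    """Merge the children of DOM node `dom` into SST element node `element`."""
    if not dom.children:
        return
    sig = signature(dom.children)
    for style in element.styles:
        if signature(style.elements) == sig:
            break  # this presentation style already exists
    else:
        style = StyleNode([ElementNode(t, a) for t, a in sig])
        element.styles.append(style)
    style.n += 1
    for e, child in zip(style.elements, dom.children):
        merge(e, child)

def page(*body_children):
    body = DomNode("body", (("bgcolor", "white"),), list(body_children))
    return DomNode("root", children=[body])

sst = ElementNode("root")
for p in (page(DomNode("table"), DomNode("img")),
          page(DomNode("table"), DomNode("img")),
          page(DomNode("table"), DomNode("table"))):
    merge(sst, p)
body = sst.styles[0].elements[0]
print(sst.styles[0].n, [s.n for s in body.styles])
```

All three pages share the same BODY, so the root has one style with n = 3, while BODY ends up with two styles used by 2 pages and 1 page.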
Weighting policy • Inner Node Importance (1): $NodeImp(E) = -\sum_{i=1}^{l} p_i \log_m p_i$ • l = |E.STYLEs| • m = number of pages containing E, i.e. E.parent.n • $p_i$: percentage of tag nodes (in E.parent.n) using the i-th presentation style • Inner NodeImp(E) measures the diversity of presentation styles
$NodeImp(Body) = -1 \cdot \log_{100} 1 = 0$ • $NodeImp(Table) = -(0.35 \log_{100} 0.35 + 2 \cdot 0.25 \log_{100} 0.25 + 0.15 \log_{100} 0.15) = 0.29 > 0$
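The two values above can be checked numerically. A small sketch, assuming the importance is the entropy of the style distribution with logarithm base m (as the worked numbers use base 100 for 100 pages); the function name and the choice to reject m ≤ 1 are illustrative.

```python
import math

def inner_node_importance(style_page_counts, m):
    """Entropy (log base m) of how m pages are split across presentation styles."""
    if m <= 1:
        raise ValueError("log base m needs m > 1 in this sketch")
    importance = 0.0
    for count in style_page_counts:
        if count > 0:          # a style used by no page contributes nothing
            p = count / m
            importance -= p * math.log(p, m)
    return importance

# BODY: one style shared by all 100 pages.
print(inner_node_importance([100], 100))
# TABLE: four styles used by 35, 25, 25 and 15 of the 100 pages.
print(round(inner_node_importance([35, 25, 25, 15], 100), 2))
```

A single shared style gives importance 0, while the more evenly the pages are spread over styles, the closer the value gets to its maximum.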
Weighting policy • Features (terms) of Leaf Node • Importance of Leaf Node’s Features (3): $H_E(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}$ • m = number of pages containing E, i.e. E.parent.n • $p_{ij}$: probability that $a_i$ appears in E of page j • $H_E(a_i)$: information entropy of $a_i$ • the higher $H_E(a_i)$, the less important $a_i$
Weighting policy • Leaf Node Importance (2): $NodeImp(E) = \frac{1}{N}\sum_{i=1}^{N} \left(1 - H_E(a_i)\right)$ • N: number of features in E • $a_i$: a feature of content in E • $(1 - H_E(a_i))$: information contained in $a_i$ • Leaf NodeImp(E) measures the content diversity of E
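Example [Figure: SST fragment (root, Ep, IMG, TABLE, leaf node E) with page count 3] • Contents of E on the three pages: t1: PCMag, samsung • t2: PCMag, epson • t3: PCMag, canon • m = 3 • N = |{PCMag, samsung, epson, canon}| = 4 • $H_E(PCMag) = -3 \cdot \left(\tfrac{1}{3} \log_3 \tfrac{1}{3}\right) = 1$ • $H_E(samsung) = H_E(epson) = H_E(canon) = -(0 + 0 + 1 \cdot \log_3 1) = 0$ • $NodeImp(E) = \frac{(1-1) + 3 \cdot (1-0)}{4} = 0.75$
Transitive weighting policy: Composite Importance [Figure: SST annotated with importance values 0, 0.29, 0 and 0.75]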
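The leaf-node example can be reproduced in a few lines. This sketch assumes that the probabilities for a feature are its occurrence counts normalized across the m pages, which is the reading that matches the numbers above; the function names are illustrative.

```python
import math
from collections import Counter

def feature_entropy(counts_per_page):
    """Entropy (log base m) of one feature's occurrences spread over m pages."""
    m = len(counts_per_page)
    total = sum(counts_per_page)
    entropy = 0.0
    for c in counts_per_page:
        if c:                  # treat 0 * log 0 as 0
            p = c / total
            entropy -= p * math.log(p, m)
    return entropy

def leaf_node_importance(pages):
    """pages: one list of features (terms) per page containing the leaf node."""
    if len(pages) <= 1:
        raise ValueError("log base m needs m > 1 in this sketch")
    per_page = [Counter(p) for p in pages]
    features = set().union(*pages)
    information = sum(
        1 - feature_entropy([c[a] for c in per_page]) for a in features
    )
    return information / len(features)

pages = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]
print(round(leaf_node_importance(pages), 2))
```

"PCMag" appears on every page and so carries no information, while each brand name is unique to one page and contributes fully.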
Page noise • Noisy element node: • For an element node E in the SST, if E and all of its descendants have composite importance less than a specified threshold t, then we say element node E is noisy. • Maximal noisy element node • Meaningful element node: • If an element node E in the SST does not contain any noisy descendant, we say that E is meaningful. • Maximal meaningful element node
Web page cleaning via block elimination • We can use the SST (site style tree) to identify and eliminate noisy content blocks in a page. • Build the SST from sample pages crawled from a site. • Compute an importance value for each block, using a specified threshold t to decide whether it is noisy. • Given a new page, match its blocks against the noisy and non-noisy blocks in the tree.
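The noisy-node definition is recursive, so it maps directly onto a tree walk. In the sketch below the tree is simplified (style nodes are flattened away) and the composite importance values are made up for illustration; `Elem`, `is_noisy` and `maximal_noisy` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Elem:
    """Simplified SST element node with a precomputed composite importance."""
    name: str
    comp_imp: float
    children: list = field(default_factory=list)

def is_noisy(e, t):
    """E is noisy if E and all of its descendants fall below threshold t."""
    return e.comp_imp < t and all(is_noisy(c, t) for c in e.children)

def maximal_noisy(e, t):
    """Names of noisy nodes that have no noisy ancestor."""
    if is_noisy(e, t):
        return [e.name]
    found = []
    for c in e.children:
        found.extend(maximal_noisy(c, t))
    return found

tree = Elem("root", 0.5, [
    Elem("nav", 0.05, [Elem("link", 0.0)]),
    Elem("main", 0.6, [Elem("text", 0.75), Elem("ad", 0.1)]),
])
print(maximal_noisy(tree, t=0.2))
```

Only the topmost noisy nodes are returned, which is what block elimination needs: removing "nav" removes its noisy descendants with it.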
Noise Detection and Elimination [Figure: DOM tree of a page before cleaning, with root, Body, Table, Img, Tr, P, A and Text nodes]
After simplification [Figure: the tree reduced to root, Body, Table, Img, Table, Table, Tr, Tr, Text]
Summary of the technique • Evaluate commonality and diversity of contents and styles • DOM trees → SST • Information-entropy-based evaluation • Node Importance • Composite Importance • Noise detection and automatic matching
Syntactic clustering of the web contents • WWW6, 1997
Document Representation • How to represent a document? • Represent document content by a feature set, in preparation for computing resemblance or similarity. • For document D, extract its feature set as S(D)
Defining similarity of documents • How to express the concept “roughly the same” precisely? • Quantitative definition: resemblance • The resemblance of two documents A and B is a number between 0 and 1.
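One common way to obtain such a number, and the one used in the syntactic clustering work cited above, is the Jaccard coefficient of the two feature sets, $r(A, B) = |S(A) \cap S(B)| / |S(A) \cup S(B)|$, with S(D) taken as the set of contiguous w-word shingles. The sketch below assumes whitespace tokenization and w = 4; both are illustrative choices.

```python
def shingles(text, w=4):
    """S(D): set of contiguous w-word sequences (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, a number in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0             # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)

print(len(shingles("a rose is a rose is a rose")))
print(resemblance("a b c d e", "a b c d f"))
```

Identical documents score 1, documents with no shingle in common score 0, and a one-word change at the end of a five-word text leaves one of three distinct shingles shared.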