Web Noises Detection and Elimination

Web Noises Detection and Elimination PengBo Dec 3, 2010

What are Web Noises?

Navigation guide (NavGuide) • Main topic (Topic) • Advertisements (Adv)

Call them Noises • Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction. • Noises hamper automated information gathering and Web data mining (“Template Detection via Data Mining and its Applications”)

Non-Relevant Data on the Web • A fundamental problem on the Web: • “non-relevant” – not directly related to the main topic / functionality of the page • Local (intra-page) noise • Irrelevant items within a Web page. • E.g., banner ads, navigational guides Many pages contain lots of non-relevant data

Duplicate data on the Web • Another problem on the Web: • Mirrors, news copies, etc. • Global noise • Redundant objects • Larger than an individual page • E.g., mirror sites, duplicated Web pages • There is much duplicate or near-duplicate data

Why does noise matter? • Hypertext IR Principles (principles underlying all link-based IR tools): • Relevant Linkage Principle • p links to q ⇒ q is relevant to p • Topical Unity Principle • q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other • Lexical Affinity Principle • The closer the links to q1 and q2 are, the stronger the relation between them.

Violations of Relevant Linkage Principle • Navigational links • http://www.ibm.com/ • Download links • http://www.beethoven.com/ • Advertisement links • http://www.yahoo.com/ • Endorsement links • http://www.ebay.com/ • Spam links

Violations of Topical Unity Principle • Violations of the Relevant Linkage Principle • Bookmark pages • http://bookmark.yinsha.com/ (online bookmarks) • General resource lists • http://sewm.pku.edu.cn/IR-Guide.txt IR Guide • Personal homepages • http://www.cse.iitb.ac.in/~soumen/ Soumen’s Home Page

Violations of Lexical Affinity Principle • Alphabetical index lists • Computer and Communication Companies ("M" entries) • HTML representation • Adjacent cells in the same column are far from each other in the HTML text

IR Tool Problems • Generalization • Search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites • Topic drift • Search for “Finite Model Theory” and get SF 49’ers fan web sites • Irrelevance • Get “Yahoo” as a result regardless of the query • Bias • Search for “computing companies” and get Microspy highly ranked

Hypertext Improvement Problem • Remove violations of the Hypertext IR principles • Process millions of pages quickly • Main Goal: develop hypertext processing techniques that: • automatically improve hypertext data • are efficient and scalable

Hypertext Cleaner (figure labels: Web, Crawler, Hypertext Cleaning, IR Tool)

Template detection

DOM Tree • Template

Templates

Templates Detection • Semantic Definition: • A template is a master HTML shell page that is used as a basis for composing new pages • Content of new pages plugged into template shell • All pages share common look & feel • Usually controlled by a central authority • Not necessarily confined to a single site • May include variety of data • Navigational bars • Advertisements • Company info and policies

Search pagelet Ad pagelet Navigation pagelet Services pagelet Company info pagelet

Pagelets • Semantic Definition: • A pagelet is a maximal region of a page that has a single topic or functionality • Not too large • has only one topic / functionality • Not too small • any larger region that contains it has other topics / functionalities

IR with Pagelets • Main Idea 1: Use pagelets rather than pages as atomic units for information retrieval • Main Idea 2: Eliminate pagelets belonging to templates

Pagelets: Syntactic Definition • A pagelet is a node in the HTML parse tree of a page satisfying the following: • Its HTML tag is one of the following: • <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, … • None of its children contains more than k hyperlinks • None of its ancestors is a pagelet
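The syntactic definition above can be read as a top-down search over the parse tree. The following is only a minimal sketch under stated assumptions: `Node`, `count_links` and `find_pagelets` are hypothetical names, the tree is a toy stand-in for a real HTML parse tree, and the tag set is limited to the tags the slide lists before its "…".

```python
from dataclasses import dataclass, field

# Tags allowed to form a pagelet. The slide's list ends with an ellipsis,
# so this set is knowingly partial.
PAGELET_TAGS = {"table", "ol", "ul", "area", "p", "dl"}


@dataclass
class Node:
    """Toy parse-tree node (hypothetical stand-in for a real HTML parser's node)."""
    tag: str
    children: list = field(default_factory=list)


def count_links(node):
    """Number of <a> elements in the subtree rooted at node (node included)."""
    own = 1 if node.tag == "a" else 0
    return own + sum(count_links(child) for child in node.children)


def find_pagelets(node, k):
    """Return the topmost nodes that satisfy the tag and hyperlink conditions."""
    qualifies = node.tag in PAGELET_TAGS and all(
        count_links(child) <= k for child in node.children
    )
    if qualifies:
        # Stop here: any descendant would have a pagelet as an ancestor.
        return [node]
    found = []
    for child in node.children:
        found.extend(find_pagelets(child, k))
    return found


# A navigation table with three links in one row, and a paragraph with one link.
nav = Node("table", [Node("tr", [Node("a"), Node("a"), Node("a")])])
text = Node("p", [Node("a")])
page = Node("body", [nav, text])
```

With `k=3` both the table and the paragraph qualify as pagelets; with `k=2` the table's row holds too many links, so only the paragraph qualifies.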

Templates: Syntactic Definition • A template is a collection T = (p1,…,pk) of pagelets satisfying: • Similarity: p1,…,pk are identical or almost identical • Connectivity: every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T. • Template Recognition Problem: given a set of pages S, find all the templates in S.

Template Recognition in Large Sets • Calculate shingle(p) for each pagelet p ∈ S • Cluster pagelets in S according to shingle • Discard clusters of size 1 • For each remaining cluster C: • Construct graph G_C of pages that own pagelets in C • Find undirected connected components of G_C • Output components of size > 1
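The steps on this slide (fingerprint each pagelet, cluster by fingerprint, drop singleton clusters, then take connected components of the owning pages) might be sketched as follows. This is an illustration only: `fingerprint` uses a SHA-1 hash of whitespace-normalized text as a stand-in for the shingle, so it groups identical pagelets but not almost identical ones, and the input layout (`pagelets_by_page`, `links`) is an assumption.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Stand-in for shingle(p): hash of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_templates(pagelets_by_page, links):
    """pagelets_by_page: {page_id: [pagelet text, ...]}; links: iterable of (page, page)."""
    # Cluster pagelets by fingerprint, remembering which page owns each one.
    clusters = defaultdict(list)
    for page, pagelets in pagelets_by_page.items():
        for text in pagelets:
            clusters[fingerprint(text)].append(page)

    templates = []
    for owners in clusters.values():
        if len(owners) < 2:  # discard clusters of size 1
            continue
        pages = set(owners)
        # Build the undirected graph of pages owning pagelets in this cluster.
        adjacency = {p: set() for p in pages}
        for a, b in links:
            if a in pages and b in pages and a != b:
                adjacency[a].add(b)
                adjacency[b].add(a)
        # Connected components by depth-first search.
        seen = set()
        for start in pages:
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                p = stack.pop()
                if p not in component:
                    component.add(p)
                    stack.extend(adjacency[p] - component)
            seen |= component
            if len(component) > 1:  # output components of size > 1
                templates.append(component)
    return templates


example_pages = {
    "a": ["Home | Products | Contact", "Story about printers"],
    "b": ["Home | Products |  Contact", "Story about cameras"],
    "c": ["Home | Products | Contact", "Unrelated mirror text"],
}
example_links = [("a", "b")]  # page c shares the pagelet but is not linked
templates = find_templates(example_pages, example_links)
```

In this toy input the navigation pagelet appears on three pages, but only `a` and `b` are connected, so the single reported template covers those two pages.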

Evaluation • Question: • How to evaluate the performance/effectiveness of this cleaning algorithm?

Benefits of template detection

Cleaning via feature weighting

Cleaning via feature weighting • In a given Web site • Noisy blocks — Share common contents or presentation styles • Meaningful (or main) blocks — diverse in contents and presentation style • Weighting features makes cleaning automatic (nothing is eliminated) “Eliminating noisy information in Web pages for data mining”

DOM trees • Example HTML: <BODY bgcolor=WHITE> <TABLE width=800 height=200> … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE> </BODY> • Corresponding DOM tree: root → BODY (bgcolor=white) → TABLE (width=800, height=200), IMG (width=800), TABLE (bgcolor=red)

Build Site style tree (SST) common

SST • Style Node S = (ELEMENTs, n) • ELEMENTs: a sequence of element nodes • n: number of pages that have this style • Element Node E = (Tag, Attr, STYLEs) • Tag: tag name, e.g., TABLE, IMG • Attr: display attributes of Tag, e.g., bgcolor=RED • STYLEs: style nodes below E

Inner Node Leaf Node Quantify the importance

Weighting policy • Inner Node Importance (1) • l = |E.STYLEs| • m = number of pages containing E, |E.parent.n| • p_i: percentage of tag nodes (in E.parent.n) using the i-th presentation style • Inner NodeImp(E): diversity of presentation styles

NodeImp(Body) = -1 × log_100(1) = 0 • NodeImp(Table) = -(0.35 log_100 0.35 + 2 × 0.25 log_100 0.25 + 0.15 log_100 0.15) = 0.29 > 0
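The two worked values above are consistent with an entropy of the style distribution taken in base m, that is NodeImp(E) = -Σ p_i log_m p_i summed over the l styles below E. The sketch below assumes that form (it is inferred from the worked numbers, since the slide's formula image did not survive) and reproduces both results; `inner_node_importance` is a hypothetical name, and the case m = 1 is rejected because the slide does not cover it.

```python
import math


def inner_node_importance(style_fractions, m):
    """Base-m entropy of the fractions of pages using each presentation style."""
    if m <= 1:
        raise ValueError("m must be greater than 1 for a base-m logarithm")
    # Zero fractions contribute nothing to the entropy and are skipped.
    return sum(-p * math.log(p, m) for p in style_fractions if p > 0)


# Body: all 100 pages share one style, so there is no diversity.
body_imp = inner_node_importance([1.0], m=100)

# Table: four styles used by 35%, 25%, 25% and 15% of the 100 pages.
table_imp = inner_node_importance([0.35, 0.25, 0.25, 0.15], m=100)
```

Here `body_imp` is 0 and `table_imp` rounds to 0.29, matching the slide.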

Weighting policy • Features (terms) of Leaf Node • Importance of Leaf Node’s Features (3) • m = number of pages containing E, |E.parent.n| • p_ij: probability that a_i appears in E of page j • H_E(a_i): information entropy of a_i • The higher H_E(a_i), the less important a_i

Weighting policy • Leaf Node Importance (2) • N: number of features in E • a_i: a feature of content in E • (1 - H_E(a_i)): information contained in a_i • Leaf NodeImp(E): content diversity of E

Example • SST fragment (figure labels: root, Ep, IMG, TABLE, E; 3 pages) • Features of E: t1: PCMag, samsung • t2: PCMag, epson • t3: PCMag, canon • m = 3 • N = |{PCMag, samsung, epson, canon}| = 4 • H_E(PCMag) = -3 × (1/3 × log_3(1/3)) = 1 • H_E(samsung) = H_E(epson) = H_E(canon) = -(0 + 0 + 1 × log_3 1) = 0 • NodeImp(E) = ((1-1) + 3 × (1-0)) / 4 = 0.75
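The PCMag example can be checked with a few lines of code. This sketch assumes H_E(a_i) = -Σ_j p_ij log_m p_ij and NodeImp(E) = Σ_i (1 - H_E(a_i)) / N, which is what the worked numbers imply, and it assumes p_ij is the share of a feature's occurrences that fall in page j. The function names are hypothetical.

```python
import math


def feature_entropy(counts_per_page):
    """Base-m entropy of one feature's distribution over the m pages containing E."""
    m = len(counts_per_page)
    if m <= 1:
        raise ValueError("need more than one page for a base-m logarithm")
    total = sum(counts_per_page)
    # p_ij is assumed to be the share of the feature's occurrences in page j.
    return sum(-(c / total) * math.log(c / total, m) for c in counts_per_page if c > 0)


def leaf_node_importance(pages):
    """pages: one list of features per page, taken from the same leaf node E."""
    features = sorted({f for page in pages for f in page})
    information = 0.0
    for f in features:
        counts = [page.count(f) for page in pages]
        information += 1 - feature_entropy(counts)
    return information / len(features)


example = [
    ["PCMag", "samsung"],  # t1
    ["PCMag", "epson"],    # t2
    ["PCMag", "canon"],    # t3
]
importance = leaf_node_importance(example)
```

"PCMag" appears on every page and so carries no information, while each brand name appears on one page only, which gives an importance of about 0.75 as on the slide.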

Transitive Weighting policy • Composite Importance (figure: SST annotated with node values 0, 0.29, 0 and 0.75)

Page noise • Noisy element node • For an element node E in the SST, if all of its descendants and E itself have composite importance less than a specified threshold t, then we say element node E is noisy. • Maximal noisy element node • Meaningful element node: • If an element node E in the SST does not contain any noisy descendant, we say that E is meaningful. • Maximal meaningful element node
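The noisy-node definition is recursive and can be sketched directly. The code below is illustrative only: it assumes composite importance values are already attached to each node, it flattens the SST's alternation of element and style nodes into a plain tree, and `ElementNode`, `is_noisy` and `maximal_noisy` are hypothetical names. A maximal noisy node is taken here to be a noisy node whose parent is not noisy. The importance values in the example are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    """Simplified element node carrying a precomputed composite importance."""
    name: str
    comp_imp: float
    children: list = field(default_factory=list)


def is_noisy(node, t):
    """True if the node and all of its descendants fall below the threshold t."""
    return node.comp_imp < t and all(is_noisy(child, t) for child in node.children)


def maximal_noisy(node, t):
    """Names of the topmost noisy nodes, found by a top-down walk."""
    if is_noisy(node, t):
        return [node.name]  # everything below is noisy too, so stop here
    found = []
    for child in node.children:
        found.extend(maximal_noisy(child, t))
    return found


# Hypothetical importance values, not taken from the slides.
tree = ElementNode("body", 0.40, [
    ElementNode("nav_table", 0.05, [ElementNode("img", 0.0)]),
    ElementNode("main_table", 0.30, [ElementNode("text", 0.75)]),
])
noisy_blocks = maximal_noisy(tree, t=0.2)
```

With threshold 0.2 only `nav_table` (together with its image) is reported, so a cleaner would remove that block and keep the rest.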

Web page cleaning via block elimination • We can use the SST (site style tree) to identify and eliminate noisy content blocks in a page. • Build the SST from sample pages crawled from a site. • Compute an importance value for each block, and use a specified threshold t to decide whether it is noisy. • Given a new page, match its blocks against the noisy and non-noisy blocks in the tree.

Noise Detection and Elimination (figure: DOM tree of a page before cleaning, with nodes root, Body, Table, Img, Tr, P, A, Text)

After simplification (figure: remaining tree with nodes root, Body, Table, Img, Table, Table, Tr, Tr, Text)

Summary of the technique • Evaluate commonality and diversity of content and styles • DOM trees → SST • Information entropy based evaluation • Node Importance • Composite Importance • Noise detection and automatic matching

Near duplicate detection

Syntactic clustering of the web contents (WWW6, 1997)

Document Representation • How to represent a document? • Represent document content by a feature set, in preparation for computing resemblance or similarity. • For document D, extract its feature set as S(D)

Defining similarity of documents • How to express the concept “roughly the same” precisely? • Quantitative definition: resemblance • The resemblance of two documents A and B is a number between 0 and 1.
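One common way to obtain such a number, and the one used in the WWW6 paper cited above, is to take S(D) as the set of contiguous w-word shingles of D and define resemblance as |S(A) ∩ S(B)| / |S(A) ∪ S(B)|. The sketch below assumes that definition; the whitespace tokenization, the default w = 4 and the function names are illustrative choices.

```python
def shingles(text, w=4):
    """Set of contiguous w-word sequences; short texts yield one shingle."""
    tokens = text.lower().split()
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, a value between 0 and 1."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)


doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a rose"
score = resemblance(doc_a, doc_b)
```

`doc_a` has three distinct 4-word shingles and `doc_b` has two of them, so the score is 2/3; identical documents score 1 and documents with no shared shingle score 0.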
