Web Noises Detection and Elimination

Presentation Transcript

  1. Web Noises Detection and Elimination PengBo, Dec 3, 2010

  2. What are Web Noises?

  3. Navigation (NavGuide) • Topic • Advertisement (Adv)

  4. Call them Noises • Although this information is useful to people browsing the Web, it often harms automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction. • They hamper automated information gathering and Web data mining (“Template Detection via Data Mining and its Applications”)

  5. Non-Relevant Data on the Web • A fundamental problem on the Web: • “Non-relevant”: not directly related to the main topic / functionality of the page • Local (intra-page) noise • Irrelevant items within a Web page • E.g., banner ads, navigational guides • Many pages contain lots of non-relevant data

  6. Duplicate data on the Web • Another problem on the Web: • Mirrors, copied news, etc. • Global noise • Redundant objects • Larger than an individual page • E.g., mirror sites, duplicated Web pages • There is much duplicate or near-duplicate data

  7. Why does it matter? • Hypertext IR Principles (the principles behind all link-based IR tools): • Relevant Linkage Principle • p links to q ⇒ q is relevant to p • Topical Unity Principle • q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other • Lexical Affinity Principle • The closer the links to q1 and q2 are, the stronger the relation between them.

  8. Violations of Relevant Linkage Principle • Navigational links • Download links • Advertisement links • Endorsement links • Spam links

  9. Violations of Topical Unity Principle • Violations of the Relevant Linkage Principle • Bookmark pages • Online bookmarks • General resource lists • IR Guide • Personal homepages • Soumen’s Home Page

  10. Violations of Lexical Affinity Principle • Alphabetical index lists • Computer and Communication Companies ("M" entries) • HTML representation • Adjacent cells in the same column are far from each other in the HTML text

  11. IR Tool Problems • Generalization • Search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites • Topic drift • Search for “Finite Model Theory” and get SF 49’ers fan web sites • Irrelevance • Get “Yahoo” as a result regardless of the query • Bias • Search for “computing companies” and get Microspy highly ranked

  12. Hypertext Improvement Problem • Main goal: develop hypertext processing techniques that: • automatically improve hypertext data (remove violations of the Hypertext IR principles) • are efficient and scalable (quickly process millions of pages)

  13. Hypertext Cleaning [Diagram labels: Web, Crawler, Hypertext Cleaner, IR Tool]

  14. Template detection

  15. DOM Tree • Template

  16. Templates

  17. Template Detection • Semantic Definition: • A template is a master HTML shell page that is used as a basis for composing new pages • Content of new pages is plugged into the template shell • All pages share a common look & feel • Usually controlled by a central authority • Not necessarily confined to a single site • May include a variety of data • Navigational bars • Advertisements • Company info and policies

  18. [Figure: an example page divided into pagelets: Search pagelet, Ad pagelet, Navigation pagelet, Services pagelet, Company info pagelet]

  19. Pagelets • Semantic Definition: • A pagelet is a maximal region of a page that has a single topic or functionality • Not too large • has only one topic / functionality • Not too small • any larger region that contains it has other topics / functionalities

  20. IR with Pagelets • Main Idea 1: Use pagelets rather than pages as atomic units for information retrieval • Main Idea 2: Eliminate pagelets belonging to templates

  21. Pagelets: Syntactic Definition • A pagelet is a node in the HTML parse tree of a page satisfying the following: • Its HTML tag is one of the following: • <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, … • None of its children contains more than k hyperlinks • None of its ancestors is a pagelet
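
The syntactic definition above can be read as a top-down walk over the parse tree: the first qualifying node on each path becomes a pagelet, and its subtree is not searched further, which enforces the "no ancestor is a pagelet" rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, `count_links`, and `find_pagelets` are hypothetical names, only the tags spelled out on the slide are used (the slide's list ends with "…"), and a hyperlink is taken to be an `<a>` element.

```python
from dataclasses import dataclass, field

# Only the tags named on the slide; the slide's list is open-ended ("…").
PAGELET_TAGS = {"table", "ol", "ul", "area", "p", "dl"}


@dataclass
class Node:
    """Hypothetical, minimal stand-in for an HTML parse-tree node."""
    tag: str
    children: list = field(default_factory=list)


def count_links(node):
    """Count <a> elements in the subtree rooted at node (node included)."""
    own = 1 if node.tag == "a" else 0
    return own + sum(count_links(child) for child in node.children)


def find_pagelets(node, k):
    """Return the topmost nodes that satisfy the slide's three conditions."""
    tag_ok = node.tag in PAGELET_TAGS
    children_ok = all(count_links(child) <= k for child in node.children)
    if tag_ok and children_ok:
        # Stop here: descendants cannot be pagelets once an ancestor is one.
        return [node]
    found = []
    for child in node.children:
        found.extend(find_pagelets(child, k))
    return found


# Tiny example tree: a navigation table with three links and a paragraph with one.
nav = Node("table", [Node("tr", [Node("a"), Node("a"), Node("a")])])
story = Node("p", [Node("a")])
page = Node("body", [Node("div", [nav, story])])

pagelets_k3 = [n.tag for n in find_pagelets(page, k=3)]
pagelets_k2 = [n.tag for n in find_pagelets(page, k=2)]
```

With `k=3` both the table and the paragraph qualify; with `k=2` the table's row holds too many links, so only the paragraph is returned. Note that a node with no children passes the child condition trivially in this sketch.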

  22. Templates: Syntactic Definition • A template is a collection T = (p1,…,pk) of pagelets satisfying: • Similarity: p1,…,pk are identical or almost identical • Connectivity: every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T. • Template Recognition Problem: given a set of pages S, find all the templates in S. [Figure: pagelets p1 to p5]

  23. Template Recognition in Large Sets • Calculate shingle(p) for each pagelet p ∈ S • Cluster pagelets in S according to shingle • Discard clusters of size 1 • For each remaining cluster C: • Construct graph G_C of pages that own pagelets in C • Find undirected connected components of G_C • Output components of size > 1
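
The steps on this slide can be sketched in a few lines. This is an illustrative sketch only: the slide does not say how the shingle is computed, so `shingle` here is a stand-in (a hash of normalized text, which only groups exactly matching pagelets), and `find_templates`, `connected_components`, and the input shapes (`pagelets_by_page`, `links`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingle(pagelet_text):
    """Stand-in fingerprint (assumption): hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(pagelet_text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def connected_components(nodes, edges):
    """Undirected connected components of the graph restricted to `nodes`."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            page = stack.pop()
            component.add(page)
            for neighbour in adjacency[page]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def find_templates(pagelets_by_page, links):
    """Cluster pagelets by shingle, then keep linked groups of owning pages."""
    owners_by_shingle = defaultdict(set)
    for page, pagelets in pagelets_by_page.items():
        for text in pagelets:
            owners_by_shingle[shingle(text)].add(page)
    templates = []
    for fingerprint, owners in owners_by_shingle.items():
        if len(owners) < 2:
            continue  # a cluster owned by one page cannot yield a component of size > 1
        for component in connected_components(owners, links):
            if len(component) > 1:
                templates.append((fingerprint, component))
    return templates


# Three pages share a navigation bar, but only "a" and "b" link to each other.
pages = {
    "a": ["Home | News | About", "Story A"],
    "b": ["home | news |  about", "Story B"],
    "c": ["Home | News | About", "Story C"],
}
links = {("a", "b")}
result = find_templates(pages, links)
```

In this example the navigation bar is reported as a template for pages `a` and `b` only; page `c` carries the same pagelet but is not connected to the others, so the connectivity condition excludes it.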

  24. Evaluation • Question: • How to evaluate the performance/effectiveness of this cleaning algorithm?

  25. Benefits of template detection

  26. Cleaning via feature weighting

  27. Cleaning via feature weighting • In a given Web site: • Noisy blocks share common contents or presentation styles • Meaningful (or main) blocks are diverse in contents and presentation style • Weighting features makes cleaning automatic (nothing is eliminated) • Source: “Eliminating noisy information in Web pages for data mining”

  28. DOM trees • HTML: <BODY bgcolor=WHITE> <TABLE width=800 height=200> … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE> </BODY> • Tree: root → BODY (bc=white) → TABLE (width=800, height=200), IMG (width=800), TABLE (bc=red)

  29. Build the site style tree (SST) [Figure label: common]

  30. SST • Style Node S = (ELEMENTs, n) • ELEMENTs: a sequence of element nodes • n: number of pages that have this style • Element Node E = (Tag, Attr, STYLEs) • Tag: tag name, e.g., TABLE, IMG • Attr: display attributes of Tag, e.g., bgcolor=RED • STYLEs: style nodes below E

  31. Quantify the importance • Inner Node • Leaf Node

  32. Weighting policy • Inner Node Importance (Eq. 1): NodeImp(E) = $-\sum_{i=1}^{l} p_i \log_m p_i$ • l = |E.STYLEs| • m = E.parent.n, the number of pages containing E • p_i: percentage of tag nodes (in E.parent.n) using the i-th presentation style • Inner NodeImp(E) measures the diversity of presentation styles

  33. NodeImp(Body) = $-1 \cdot \log_{100} 1 = 0$ • NodeImp(Table) = $-(0.35 \log_{100} 0.35 + 2 \cdot 0.25 \log_{100} 0.25 + 0.15 \log_{100} 0.15) = 0.29 > 0$
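
The two worked values above are consistent with reading the inner node importance as an entropy with logarithm base m, that is, the negated sum of p_i · log_m(p_i) over the l presentation styles. A minimal sketch that reproduces them follows; `inner_node_importance` is a hypothetical helper name, and returning 1 when m = 1 is an assumption of this sketch (a base-1 logarithm is undefined), not something stated on the slides.

```python
import math


def inner_node_importance(style_fractions, m):
    """Entropy of the style distribution, using logarithm base m (number of pages)."""
    if m <= 1:
        # Assumption: with a single page the base-m logarithm is undefined.
        return 1.0
    # Styles with p = 0 contribute nothing; skipping them avoids log(0).
    return 0.0 - sum(p * math.log(p, m) for p in style_fractions if p > 0)


# Body: every one of the 100 pages uses the same style.
body_importance = inner_node_importance([1.0], m=100)

# Table: four styles used by 35%, 25%, 25% and 15% of the 100 pages.
table_importance = inner_node_importance([0.35, 0.25, 0.25, 0.15], m=100)
```

A node whose children always look the same scores 0, while a node with varied presentation scores higher, which matches the slide's intuition that uniform styling signals noise.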

  34. Weighting policy • Features (terms) of a Leaf Node • Importance of a Leaf Node’s Features (Eq. 3): $H_E(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}$ • m = E.parent.n, the number of pages containing E • p_ij: probability that a_i appears in E of page j • H_E(a_i): information entropy of a_i • The higher H_E(a_i), the less important a_i

  35. Weighting policy • Leaf Node Importance (Eq. 2): NodeImp(E) = $\frac{1}{N} \sum_{i=1}^{N} (1 - H_E(a_i))$ • N: number of features in E • a_i: a feature of the content in E • (1 - H_E(a_i)): information contained in a_i • Leaf NodeImp(E) measures the content diversity of E

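  36. Example [Figure: SST with root, E_p, IMG, TABLE, and leaf node E appearing in 3 pages] • t1: PCMag, samsung • t2: PCMag, epson • t3: PCMag, canon • m = 3 • N = |{PCMag, samsung, epson, canon}| = 4 • $H_E(\text{PCMag}) = -3 \cdot (\tfrac{1}{3} \log_3 \tfrac{1}{3}) = 1$ • $H_E(\text{samsung}) = H_E(\text{epson}) = H_E(\text{canon}) = -(0 + 0 + 1 \cdot \log_3 1) = 0$ • NodeImp(E) = ((1 - 1) + 3 · (1 - 0)) / 4 = 0.75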
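
The PCMag example can be checked numerically. The sketch below assumes, as the worked numbers suggest, that p_ij is the share of a feature's occurrences in E that fall on page j, that the entropy uses logarithm base m, and that the leaf importance is the average of 1 - H_E(a_i) over the N distinct features. The function names are hypothetical.

```python
import math


def feature_entropy(feature, pages):
    """Entropy (base m) of how a feature's occurrences spread over the m pages."""
    m = len(pages)
    if m <= 1:
        raise ValueError("need at least two pages for a base-m logarithm")
    counts = [page.count(feature) for page in pages]
    total = sum(counts)
    # Pages where the feature is absent contribute 0 (the slide writes them as 0 terms).
    return 0.0 - sum((c / total) * math.log(c / total, m) for c in counts if c > 0)


def leaf_node_importance(pages):
    """Average information (1 - entropy) over the distinct features of the leaf node."""
    features = {term for page in pages for term in page}
    if not features:
        raise ValueError("leaf node has no features")
    return sum(1 - feature_entropy(f, pages) for f in features) / len(features)


# Contents of leaf node E in the three pages from the slide.
pages = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]

h_pcmag = feature_entropy("PCMag", pages)
h_samsung = feature_entropy("samsung", pages)
importance = leaf_node_importance(pages)
```

The term that appears on every page ("PCMag") carries no information, while the page-specific terms carry full information, giving an importance close to the slide's 0.75 (up to floating-point rounding).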

  37. Transitive weighting policy: Composite Importance [Figure: SST annotated with importance values 0, 0.29, 0 and 0.75]

  38. Page noise • Noisy element node: for an element node E in the SST, if E and all of its descendants have composite importance less than a specified threshold t, then E is noisy. • Maximal noisy element node • Meaningful element node: if an element node E in the SST does not contain any noisy descendant, then E is meaningful. • Maximal meaningful element node
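
The noisy-node definition is recursive, so it maps naturally onto a small tree walk. The sketch below is a simplification under stated assumptions: it takes composite importance values as already computed, flattens the SST's alternating style and element nodes into a plain tree, and uses hypothetical names (`ElementNode`, `is_noisy`, `maximal_noisy_nodes`) and made-up importance values.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    """Hypothetical element node carrying a precomputed composite importance."""
    name: str
    composite_importance: float
    children: list = field(default_factory=list)


def is_noisy(node, t):
    """A node is noisy if it and all of its descendants score below threshold t."""
    return node.composite_importance < t and all(is_noisy(c, t) for c in node.children)


def maximal_noisy_nodes(node, t):
    """Return noisy nodes that have no noisy ancestor (the topmost noisy subtrees)."""
    if is_noisy(node, t):
        return [node]
    found = []
    for child in node.children:
        found.extend(maximal_noisy_nodes(child, t))
    return found


# Illustrative values only: a low-scoring navigation block and a content block
# whose own score is low but which contains an informative text node.
nav = ElementNode("nav", 0.05, [ElementNode("nav-link", 0.0)])
content = ElementNode("content", 0.1, [ElementNode("text", 0.75)])
root = ElementNode("root", 0.5, [nav, content])

noisy_names = [n.name for n in maximal_noisy_nodes(root, t=0.2)]
```

Here only the navigation subtree is reported: the content block scores below the threshold itself, but its informative descendant keeps it from being noisy.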

  39. Web page cleaning via block elimination • We can use the SST (site style tree) to identify and eliminate noisy content blocks in a page. • Build the SST from sample pages crawled from a site. • Compute an importance value for each block, and use a specified threshold t to decide whether it is noisy. • Given a new page, match its blocks against the noisy and non-noisy blocks in the tree.

  40. Noise Detection and Elimination [Figure: tree for an example page, with nodes root, Body, Table, Img, P, Tr, Text and A]

  41. After simplification [Figure: simplified tree with nodes root, Body, Table, Img, Table, Table, Tr, Tr, Text]

  42. Summary of the technique • Evaluate commonality and diversity of content and styles • DOM trees → SST • Information-entropy-based evaluation • Node Importance • Composite Importance • Noise detection and automatic matching

  43. Near duplicate detection

  44. Syntactic clustering of the web contents (WWW6, 1997)

  45. Document Representation • How to represent a document? • Represent document content by a feature set, in preparation for computing resemblance or similarity. • For document D, extract its feature set as S(D)

  46. Defining similarity of documents • How to express the concept “roughly the same” precisely? • Quantitative definition: resemblance • The resemblance of two documents A and B is a number between 0 and 1.
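
One common way to obtain such a number, and the one used in the shingling literature this section cites, is the Jaccard coefficient of the two feature sets: |S(A) ∩ S(B)| / |S(A) ∪ S(B)|. The slide itself does not spell out the formula, so the sketch below is an assumption-laden illustration: the features are taken to be word-level w-shingles, and `shingles`, `resemblance`, and the default `w=4` are hypothetical choices.

```python
def shingles(text, w=4):
    """Feature set S(D): all runs of w consecutive lowercase word tokens."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Assumption: a very short document is represented by one short shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(doc_a, doc_b, w=4):
    """Jaccard coefficient of the two shingle sets, a number between 0 and 1."""
    set_a, set_b = shingles(doc_a, w), shingles(doc_b, w)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty documents are treated as identical
    return len(set_a & set_b) / len(union)


doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a flower which is a rose"

same = resemblance(doc_a, doc_a)
different = resemblance(doc_a, doc_b)
```

Identical documents score 1, and the two example sentences share only one of their eight distinct 4-shingles, so their resemblance is 1/8.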