Tim Weninger, Department of Computer Science, University of Illinois Urbana-Champaign

Presentation Transcript

  1. Exploring Structure and Content on the Web: Extraction and Integration of the Semi-Structured Web • Tim Weninger, Department of Computer Science, University of Illinois Urbana-Champaign • weninge1@illinois.edu

  2. Rules of this tutorial • Ask questions • Ask lots of questions • If something is not clear, ask a question

  3. The Web • Social Networks • Early Messenger Networks • Social Media • Gaming Networks • Professional Networks • Hyperlink Networks • Blog Networks • Wiki-networks • Web-at-large • Internal links • External links

  4. The Web is a Hyperlink Network

  5. Ranking on the Web Query:

  6. Clustering on the Web Sim(

  7. This tutorial is about the structure and content of the Web • Name, Phone, Office, Age, Gender, Email, Author, Dateline, Topic, Persons, Location

  8. Imagine what we could do… • Search • Show structured information in response to query • Automatically rank and cluster entities • Reasoning on the Web • Who are the people at some company? • What are the courses in some college department? • Analysis • Expand the known information of an entity • What is a professor’s phone number, email, courses taught, research, etc?

  9. Outline • Preliminaries • Information Extraction • Break (30 min) • Information Integration • Web Information Networks

  10. Databases and Schemas • Databases usually have a well-defined schema

  11. Databases and Schemas • Databases usually have a well-defined schema

  12. XML – a data description language • XML Schema

  13. XML – a data description language • XML Instance

  14. HTML and Semi-Structured data

  15. HTML and Semi-Structured data • What’s the schema?

  16. HTML and Semi-Structured data • HTML has no schema! • HTML is a markup language • A description for a browser to render • HTML describes how the data should be displayed • HTML was never meant to describe the data.

  17. HTML and Semi-Structured data • HTML was never meant to describe the data. • But there is so much data on the Web • …we have to try

  18. Document Object Model • HTML -> DOM • The DOM is a tree model of the HTML markup
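
The HTML -> DOM step can be illustrated with a minimal sketch. It uses Python's standard `html.parser` rather than a real browser DOM, and the `Node`, `TreeBuilder`, `build_dom`, and `outline` names are hypothetical helpers introduced only for this illustration. It assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified DOM-like tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    # Elements that never have a closing tag (a partial list, for the sketch).
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling `outline(build_dom(page_source))` shows the nesting of tags only; as the next slide notes, nothing in this tree says what the data means.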

  19. What the DOM is not • From the W3C: • The Document Object Model does not define what information in a document is relevant or how information in a document is structured. For XML, this is specified by the W3C XML Information Set [Infoset]. The DOM is simply an API to this information set.

  20. Web page rendering • HTML -> DOM -> Web page • Web page rendering according to Web standards • Uses the box model

  21. Web databases • LOTS of pages on the Web are database interfaces

  22. Web databases • Some pages are not database interfaces • …but they could be

  23. Relational Databases on the Web • WebPages can have relational data

  24. Data can be hidden in text too!

  25. HTML and Semi-Structured data • Our goal is to extract information from the Web • …and make sense out of it!

  26. Outline • Preliminaries • Information Extraction from text • Break (30 min) • Information Extraction from tables and lists • Web Information Networks

  27. Content Extraction

  28. Web Content Extraction • Extract only the content of a page • Example taken from The Hutchinson News on 8/14/2008

  29. Web Content Extraction • Two Approaches • Heuristic approaches: work one “document-at-a-time” • Template detection approaches: require multiple documents that contain the same template • Benefits of content extraction • Reduce the noise in the document • Reduce document size • Better indexing, search processing • Easier to fit on small screens

  30. Wrapper Generation • Documents on the Web are made from templates • Popularity of Content Management Systems • Database queries are used to “fill out” HTML content • Templates are the framework of the Web page(s) • The structure is very similar (nearly identical) among templated Web pages. • Cluster similarly structured documents • Generate Wrappers • Extract Information
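
One simple way to measure how similarly two pages are structured, before clustering them, is to compare their sets of root-to-element tag paths. The sketch below is an illustrative assumption, not a method from the slides: `PathCollector`, `tag_paths`, and `structural_similarity` are hypothetical names, and Jaccard overlap is just one possible similarity measure.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collects every distinct root-to-element tag path in a page."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # partial list

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard overlap of the two pages' tag-path sets, in [0, 1]."""
    paths_a, paths_b = tag_paths(html_a), tag_paths(html_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)
```

Two pages filled in from the same template score close to 1.0 regardless of their text, which is the property a clustering step would rely on.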

  31. Wrapper Generation • Documents on the Web are made from templates • Database query “fills in” the content • Separate AJAX/HTTP calls “fill in” content

  32. Locating Web page templates • Bar-Yossef and Rajagopalan ‘02 first proposed a template recognition algorithm using DOM tree segmentation • Template detection via data mining and its applications • Lin and Ho ‘02 developed InfoDiscoverer, which uses the heuristic that template-generated contents appear more frequently. • Discovering informative content blocks from web documents • Debnath et al. ‘05 developed ContentExtractor, which also includes features like image or script elements. • Automatic extraction of informative blocks from webpages

  33. Locating Web page templates • Yi, Liu and Li ‘03 use the Site Style Tree (SST) approach, which finds that identically formatted DOM sub-trees denote the template • Eliminating noisy information in web pages for data mining • Crescenzi et al. ’01 developed RoadRunner, which uses the Align, Collapse under Mismatch, and Extract (ACME) approach to generate wrappers. • Towards Automatic Data Extraction from Large Web Sites. • Buttler ‘04 proposes the path shingling approach, which makes use of the shingling technique. • A short survey of document structure similarity algorithms

  34. Wrapper Generation • Generate extraction rules • //div[@class="content"]/table[1]/tr/td[2]/text() • Sample article text: A home away from school • Day care has after-school duties as some clients start academic year • By Kristen Roderick – The Hutchinson News – kroderick@hutchnews.edu • The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of…
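
A minimal sketch of applying an extraction rule like the one on this slide, using only Python's standard `xml.etree.ElementTree`. Several things here are assumptions: `SAMPLE_PAGE` is an invented, well-formed stand-in for a templated page, `apply_wrapper` is a hypothetical helper, and because ElementTree supports only a subset of XPath, the trailing `/text()` step is replaced by reading each element's `.text`.

```python
import xml.etree.ElementTree as ET

# Invented sample page; real HTML is rarely well-formed XML and would
# need a tolerant HTML parser instead.
SAMPLE_PAGE = """<html><body>
<div class="nav"><table><tr><td>Home</td><td>Sports</td></tr></table></div>
<div class="content"><table>
<tr><td>Headline</td><td>A home away from school</td></tr>
<tr><td>Byline</td><td>By Kristen Roderick</td></tr>
</table></div>
</body></html>"""

# The slide's rule, rewritten in the XPath subset ElementTree understands.
RULE = ".//div[@class='content']/table[1]/tr/td[2]"


def apply_wrapper(page_source, rule):
    """Return the text of every element the rule selects, in document order."""
    root = ET.fromstring(page_source)
    return [element.text for element in root.findall(rule) if element.text]


print(apply_wrapper(SAMPLE_PAGE, RULE))
```

Renaming the `content` class in the sample page makes the same rule return an empty list, which shows how tightly a wrapper is bound to its template.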

  35. Wrapper Generation • Advantages • Easy to implement and learn • Can have perfect precision and recall • Disadvantages • Web sites change their templates often • Any small change breaks the wrapper • Need several examples to learn the wrapper • Called “domain-centric” approaches

  36. Single Document Content Extraction • Look at a single document at a time • Use heuristics and data mining principles to find main content. • No template detection • No extraction rule learning • Called “Web-centric” approaches

  37. Early Content Extraction Approaches • Body Text Extraction (BTE) • Interprets HTML document as word and tag tokens • Identifies a single, continuous region which contains most words while excluding most tags. • Document Slope Curves (DSC) • Extension of BTE that looks at several document regions. • Link Quota Filters (LQF) • Remove DOM elements which consist mainly of text occurring in hyperlink anchors.
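
The Body Text Extraction idea can be sketched as follows, assuming the usual formulation: pick the token span that maximizes the number of tags outside the span plus the number of words inside it. The regular-expression tokenizer and the brute-force quadratic search are simplifications chosen for readability, and `tokenize` and `body_text_extraction` are hypothetical names.

```python
import re

# A token is either a whole tag or a run of non-space, non-tag characters.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    return TOKEN.findall(html)


def body_text_extraction(html):
    """Return the words of the single region with most words and fewest tags."""
    tokens = tokenize(html)
    is_tag = [token.startswith("<") for token in tokens]
    n = len(tokens)

    # tags_before[k] = number of tag tokens among tokens[0:k]
    tags_before = [0] * (n + 1)
    for k, tag in enumerate(is_tag):
        tags_before[k + 1] = tags_before[k] + (1 if tag else 0)
    total_tags = tags_before[n]

    best_score, best_span = -1, (0, 0)
    for i in range(n):
        for j in range(i, n):
            tags_inside = tags_before[j + 1] - tags_before[i]
            words_inside = (j - i + 1) - tags_inside
            tags_outside = total_tags - tags_inside
            score = tags_outside + words_inside
            if score > best_score:
                best_score, best_span = score, (i, j)

    i, j = best_span
    return " ".join(t for t, tag in zip(tokens[i:j + 1], is_tag[i:j + 1]) if not tag)
```

Because only one continuous region is kept, content split across several blocks is partly lost, which is the limitation Document Slope Curves address by considering several regions.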

  38. Tag Ratios Content Extraction • Two algorithms • Same time, same conference • Same concept • Gottron, et al. ‘07 Content Code Blurring • Weninger, et al. ‘07 Content Extraction via Tag Ratios

  39. Text to Tag Ratio • Text: 21 - Tags: 8 -> TTR: 2.63 • Text: 22 - Tags: 8 -> TTR: 2.75 • Text: 298 - Tags: 6 -> TTR: 49.67 • Text: 0 - Tags: 0 -> TTR: 0 • Text: 0 - Tags: 1 -> TTR: 0 • http://www2010.org/www/2010/04/program-guide/
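
The per-line ratios above can be reproduced with a short sketch. It assumes that a line with zero tags is divided by 1 instead of 0, which is consistent with the "Text: 0 - Tags: 0 -> TTR: 0" row. Counting text as the non-tag characters of the line with surrounding whitespace stripped is also an assumption, and `ratio`, `line_counts`, and `text_to_tag_ratios` are hypothetical names.

```python
import re

TAG = re.compile(r"<[^>]*>")


def ratio(text_chars, tag_count):
    """Text-to-tag ratio; a line without tags is divided by 1 instead of 0."""
    return text_chars / max(tag_count, 1)


def line_counts(line):
    """Return (non-tag character count, tag count) for one line of HTML."""
    tag_count = len(TAG.findall(line))
    text_chars = len(TAG.sub("", line).strip())
    return text_chars, tag_count


def text_to_tag_ratios(html):
    """One ratio per source line, in order."""
    return [ratio(*line_counts(line)) for line in html.splitlines()]
```

For instance, `ratio(298, 6)` is about 49.67, matching the third row of the slide; long runs of text with few tags stand out sharply against navigation and markup-heavy lines.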

  40. Text to Tag Ratio Histogram

  41. Histogram Clustering in 2-Dimensions • Looks for jumps in the moving average of TTR

  42. Histogram Clustering in 2-Dimensions • Absolute value gives insight
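
A rough sketch of the two ideas on these slides: smooth the per-line TTR values with a moving average, then take the absolute change between neighbouring smoothed values so that both the rise into a content block and the drop out of it show up as large numbers. The centred window, its `radius` parameter, and the function names are assumptions for illustration, not the tutorial's exact formulation.

```python
def moving_average(values, radius=1):
    """Centred moving average; the window shrinks at the ends of the list."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius):i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def absolute_jumps(values, radius=1):
    """Absolute difference between consecutive smoothed values."""
    smoothed = moving_average(values, radius)
    return [abs(smoothed[i + 1] - smoothed[i]) for i in range(len(smoothed) - 1)]
```

Pairing each line's smoothed TTR with its absolute jump gives the two coordinates needed for a two-dimensional view of the page.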

  43. Histogram Clustering in 2-Dimensions • Make a scatterplot

  44. Modified k-Means

  45. Single Document Content Extraction • Advantages • Only need a single document at a time • Unsupervised • No training required • Disadvantages • Precision and recall vary • Depending on the (1) algorithm, (2) parameters, (3) Web page

  46. Rule Extraction

  47. Textual Extraction • Web text holds good information, but full NLP understanding is difficult • Two flavors of text extraction • Domain-at-a-time • Web-at-large (domain-agnostic) • Very different techniques required for each

  48. Domain at a time • Documents on the Web are made from templates • A single domain has similar language

  49. Domain at a time text extraction • If we know the schema/domain, we know the rules • BBC Business – “owned by”, “sales of”, “CEO of”, etc.

  50. Known Domains: Rule Learning • User provides initial data • Algorithm searches for terms, then induces rules. • “Servers at Microsoft’s headquarters in Redmond…” • “The Armonk-based IBM has introduced…” • “Intel, Santa Clara, cut prices of its Pentium…” • [ORGANIZATION]’s headquarters in [LOCATION] • [LOCATION]-based [ORGANIZATION] • [ORGANIZATION], [LOCATION]
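
The rule induction step can be sketched roughly as follows: given a seed (organization, location) pair and a sentence containing both, keep the text between them as the rule, then reuse that text to find new pairs. This is an illustrative simplification. `SLOT` is a crude assumption that treats any run of capitalized words as a candidate entity, where a real system would use a proper entity recognizer, and `induce_rule` and `apply_rule` are hypothetical names.

```python
import re

# Crude stand-in for an entity recognizer: one or more capitalized words.
SLOT = r"[A-Z]\w*(?: [A-Z]\w*)*"


def induce_rule(sentence, org, loc):
    """Turn a sentence with known seed terms into (slot, middle text, slot)."""
    i, j = sentence.find(org), sentence.find(loc)
    if i < 0 or j < 0:
        return None  # the seed pair does not occur in this sentence
    if i < j:
        return ("ORG", sentence[i + len(org):j], "LOC")
    return ("LOC", sentence[j + len(loc):i], "ORG")


def apply_rule(rule, text):
    """Return (organization, location) pairs matched by an induced rule."""
    first, middle, second = rule
    pattern = re.compile(
        "(?P<%s>%s)%s(?P<%s>%s)" % (first, SLOT, re.escape(middle), second, SLOT)
    )
    return [(m.group("ORG"), m.group("LOC")) for m in pattern.finditer(text)]


rule = induce_rule("The Armonk-based IBM has introduced", "IBM", "Armonk")
print(rule)
print(apply_rule(rule, "Shares of Cupertino-based Apple rose."))
```

A very short rule such as the third pattern on the slide, where the middle text is just a comma and a space, would match far too much with this naive slot definition, which is one reason induced rules need to be scored or filtered.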