Exploring Structure and Content on the Web Extraction and Integration of the Semi-Structured Web Tim Weninger Department of Computer Science University of Illinois Urbana-Champaign firstname.lastname@example.org
Rules of this tutorial • Ask questions • Ask lots of questions • If something is not clear, ask a question
The Web • Social Networks • Early Messenger Networks • Social Media • Gaming Networks • Professional Networks • Hyperlink Networks • Blog Networks • Wiki-networks • Web-at-large • Internal links • External links
Ranking on the Web Query:
This Tutorial is about the structure and content of the Web Name Phone Office Age Gender Email Author Dateline Topic Persons Location
Imagine what we could do… • Search • Show structured information in response to query • Automatically rank and cluster entities • Reasoning on the Web • Who are the people at some company? • What are the courses in some college department? • Analysis • Expand the known information of an entity • What is a professor’s phone number, email, courses taught, research, etc?
Outline • Preliminaries • Information Extraction • Break (30 min) • Information Integration • Web Information Networks
Databases and Schemas • Databases usually have a well defined schema
XML – a data description language • XML Schema
XML – a data description language • XML Instance
HTML and Semi-Structured data What’s the schema?
HTML and Semi-Structured data • HTML has no schema! • HTML is a markup language • A description for a browser to render • HTML describes how the data should be displayed • HTML was never meant to describe the data.
HTML and Semi-Structured data • HTML was never meant to describe the data. • But there is so much data on the Web • …we have to try
Document Object Model • HTML -> DOM • DOM is a tree model of the HTML markup
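The HTML -> DOM step can be illustrated with a minimal sketch that uses only Python's standard `html.parser`. The `Node`, `TreeBuilder`, `build_tree`, and `outline` names are hypothetical helpers introduced here for illustration; a real browser DOM handles malformed markup, attributes, and many node types that this sketch ignores.

```python
from html.parser import HTMLParser

# Elements that never have children (a small, incomplete list; an assumption).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element in a simplified DOM-like tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from start/end tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print("\n".join(outline(build_tree(page))))
```

The indented outline makes the nesting explicit: each tag becomes a node, and the text sits at the leaves.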
What the DOM is not • From the W3C: • The Document Object Model does not define what information in a document is relevant or how information in a document is structured. For XML, this is specified by the W3C XML Information Set [Infoset]. The DOM is simply an API to this information set.
Web page rendering • HTML -> DOM -> WebPage • Web page rendering according to Web standards • Uses the Box Model
Web databases • LOTS of pages on the Web are database interfaces
Web databases • Some pages are not database interfaces • …but they could be
Relational Databases on the Web • WebPages can have relational data
HTML and Semi-Structured data • Our goal is to extract information from the Web • …and make sense out of it!
Outline • Preliminaries • Information Extraction from text • Break (30 min) • Information Extraction from tables and lists • Web Information Networks
Web Content Extraction Extract only the content of a page Taken from The Hutchinson News on 8/14/2008
Web Content Extraction • Two Approaches • Heuristic approaches: work one “document-at-a-time” • Template detection approaches: require multiple documents that contain the same template • Benefits of content extraction • Reduce the noise in the document • Reduce document size • Better indexing, search processing • Easier to fit on small screens
Wrapper Generation • Documents on the Web are made from templates • Popularity of Content Management Systems • Database queries are used to “fill out” HTML content • Templates are the framework of the Web page(s) • The structure is very similar (near identical) among template Web pages. • Cluster similarly structured documents • Generate Wrappers • Extract Information
Wrapper Generation • Documents on the Web are made from templates • Database query “fills in” the content • Separate AJAX/HTTP calls “fill in” content
Locating Web page templates • Bar-Yossef and Rajagopalan ‘02 first proposed a template recognition algorithm using DOM tree segmentation • Template detection via data mining and its applications • Lin and Ho ‘02 developed InfoDiscoverer, which uses the heuristic that template-generated content appears more frequently. • Discovering informative content blocks from web documents • Debnath et al. ‘05 developed ContentExtractor, which also includes features like image or script elements. • Automatic extraction of informative blocks from webpages
Locating Web page templates • Yi, Liu and Li ‘03 use the Site Style Tree (SST) approach, which finds that identically formatted DOM sub-trees denote the template • Eliminating noisy information in web pages for data mining • Crescenzi et al. ’01 developed RoadRunner, which uses the Align, Collapse under Mismatch, and Extract (ACME) approach to generate wrappers. • Towards Automatic Data Extraction from Large Web Sites. • Buttler ‘04 proposes the path shingling approach, which makes use of the shingling technique. • A short survey of document structure similarity algorithms
Wrapper Generation • Generate extraction rules • //div[@class ="content"]/table/tr/td/text() A home away from school Day care has after-school duties as some clients start academic year By Kristen Roderick – The Hutchinson News – email@example.com The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of…
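As a rough illustration of applying an extraction rule like the one above, the sketch below uses Python's standard `xml.etree.ElementTree`. Two assumptions apply: the page is well-formed markup (real HTML usually is not, and would need an HTML-aware parser), and ElementTree supports only a subset of XPath, so the trailing `text()` step is replaced by reading each matched cell's `.text`. The `PAGE` sample and the `apply_wrapper` name are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page modeled on the slide's rule.
PAGE = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="content"><table>
<tr><td>A home away from school</td></tr>
<tr><td>Day care has after-school duties</td></tr>
</table></div>
</body></html>"""


def apply_wrapper(page):
    """Return the text of every cell matched by the extraction rule."""
    root = ET.fromstring(page)
    # ElementTree equivalent of //div[@class="content"]/table/tr/td
    cells = root.findall(".//div[@class='content']/table/tr/td")
    return [c.text.strip() for c in cells if c.text and c.text.strip()]


if __name__ == "__main__":
    print(apply_wrapper(PAGE))
    # A small template change (renamed class) makes the same rule match nothing.
    print(apply_wrapper(PAGE.replace('class="content"', 'class="article"')))
```

The second call shows how brittle such a rule is: renaming one class attribute leaves the wrapper with no matches.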
Wrapper Generation • Advantages • Easy to implement and learn • Can have perfect precision and recall • Disadvantages • Web sites change their templates often • Any small change breaks the wrapper • Need several examples to learn the wrapper • Called “domain-centric” approaches
Single Document Content Extraction • Look at a single document at a time • Use heuristics and data mining principles to find main content. • No template detection • No extraction rule learning • Called “Web-centric” approaches
Early Content Extraction Approaches • Body Text Extraction (BTE) • Interprets HTML document as word and tag tokens • Identifies a single, continuous region which contains most words while excluding most tags. • Document Slope Curves (DSC) • Extension of BTE that looks at several document regions. • Link Quota Filters (LQF) • Remove DOM elements which consist mainly of text occurring in hyperlink anchors.
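The BTE idea of a single continuous region with many words and few tags can be sketched as a maximum-sum subarray problem: score each word token +1 and each tag token -1, then find the contiguous run with the highest total. This is a simplified reading of the objective described above, not the original implementation; the tokenizer regex and the `body_text_extraction` name are assumptions made for illustration.

```python
import re

# A token is either a whole tag or a run of non-space, non-tag characters.
TOKEN = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    return TOKEN.findall(html)


def body_text_extraction(html):
    """Return the words of the contiguous token region maximizing words minus tags."""
    tokens = tokenize(html)
    scores = [-1 if t.startswith("<") else 1 for t in tokens]

    best_sum, best_start, best_end = 0, 0, 0  # empty region by default
    cur_sum, cur_start = 0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running total cannot help; restart the region here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1

    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


if __name__ == "__main__":
    page = (
        "<html><body><div><a href='/'>Home</a><a href='/n'>News</a></div>"
        "<p>The doors at Hadley Day Care opened Wednesday afternoon</p>"
        "<p>and children scurried in with tales</p>"
        "<div><a href='/c'>Contact</a></div></body></html>"
    )
    print(body_text_extraction(page))
```

Because the region must be continuous, navigation links that sit between two content paragraphs would either be included or split the content, which is the limitation DSC tries to address by looking at several regions.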
Tag Ratios Content Extraction • Two algorithms • Same time, same conference • Same concept • Gottron, et al. ‘07 Content Code Blurring • Weninger, et al. ‘07 Content Extraction via Tag Ratios
Text to Tag Ratio • Text: 21 - Tags: 8 -> TTR: 2.63 • Text: 22 - Tags: 8 -> TTR: 2.75 • Text: 298 - Tags: 6 -> TTR: 49.67 • Text: 0 - Tags: 0 -> TTR: 0 • Text: 0 - Tags: 1 -> TTR: 0 • Example page: http://www2010.org/www/2010/04/program-guide/
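A per-line text-to-tag ratio like the figures above can be computed roughly as follows. This is a sketch under stated assumptions: "text" is counted as non-whitespace characters outside tags, tags are matched with a simple regex, and a line with zero tags is divided by 1 instead of 0 (which agrees with the "Text: 0 - Tags: 0 -> TTR: 0" case). The function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Ratio of non-tag, non-whitespace characters to tag count for one line."""
    tag_count = len(TAG.findall(line))
    text = TAG.sub("", line)                  # drop the tags
    text_length = len(re.sub(r"\s+", "", text))  # ignore whitespace (assumption)
    return text_length / max(tag_count, 1)    # avoid division by zero


def ttr_per_line(html):
    """One ratio per line of the HTML source."""
    return [text_to_tag_ratio(line) for line in html.splitlines()]


if __name__ == "__main__":
    sample = "\n".join([
        '<li><a href="#">Home</a></li>',
        "<p>The doors at Hadley Day Care opened Wednesday afternoon</p>",
        "<br>",
        "",
    ])
    for ratio in ttr_per_line(sample):
        print(round(ratio, 2))
```

Navigation-style lines tend to score low because most of their characters are markup, while lines of body text score high.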
Histogram Clustering in 2-Dimensions Looks for jumps in the moving average of TTR
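One simple way to look for jumps in a moving average is sketched below: smooth the per-line ratios with a centered window, then report the line indices where the smoothed value changes sharply from the previous line. The window `radius`, the `min_jump` threshold, and the function names are illustrative assumptions, not parameters taken from the tutorial.

```python
def moving_average(values, radius=1):
    """Centered moving average; the window shrinks at the ends of the list."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def find_jumps(values, radius=1, min_jump=5.0):
    """Indices where the smoothed series changes by at least min_jump."""
    smoothed = moving_average(values, radius)
    return [
        i
        for i in range(1, len(smoothed))
        if abs(smoothed[i] - smoothed[i - 1]) >= min_jump
    ]


if __name__ == "__main__":
    # Hypothetical ratios: low (navigation), high (content), low (footer).
    ratios = [2.0, 2.0, 2.0, 50.0, 50.0, 50.0, 2.0, 2.0]
    print(moving_average(ratios))
    print(find_jumps(ratios))
```

Smoothing spreads each boundary over a few lines, so a content block shows up as a run of rising jumps followed by a run of falling ones.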
Histogram Clustering in 2-Dimensions Absolute value gives insight
Histogram Clustering in 2-Dimensions Make a scatterplot
Single Document Content Extraction • Advantages • Only need a single document at a time • Unsupervised • No training required • Disadvantages • Precision and recall vary • Depending on the (1) algorithm, (2) parameters, (3) Web page
Textual Extraction • Web text holds good information, but full NLP understanding is difficult • Two flavors of text extraction • Domain-at-a-time • Web-at-large (domain-agnostic) • Very different techniques required for each
Domain at a time • Documents on the Web are made from templates • A single domain has similar language
Domain at a time text extraction • If we know the schema/domain, we know the rules BBC Business – “owned by”, “sales of”, “CEO of”, etc.
Known Domains: Rule Learning • User provides initial data • Algorithm searches for terms, then induces rules. “Servers at Microsoft’s headquarters in Redmond…” “The Armonk-based IBM has introduced…” “Intel, Santa Clara, cut prices of its Pentium…” [ORGANIZATION]’s headquarters in [LOCATION] [LOCATION]-based [ORGANIZATION] [ORGANIZATION], [LOCATION]