
Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis



OC Working Group – 21.01.2014. Serge Tymaniuk. Overview: Introduction, Methodology, Results, Questions.

Presentation Transcript

Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

OC Working Group – 21.01.2014

Serge Tymaniuk


Overview

  • Introduction

  • Methodology

  • Results

  • Questions


Introduction

  • Paper written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1)

    • (1) Data and Web Science Group, University of Mannheim, Germany

    • (2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands

  • Features:

    • Analysis of RDFa, Microdata, and Microformats adoption on the Web

    • Based on a large public Web crawl of 3 billion HTML pages

    • Aims to reveal the main topical areas of the published data and the different vocabularies used within each topical area

    • Examines structural richness (which properties are used to describe popular types of entities)


Web Crawl

  • The Web crawl is provided by the Common Crawl Foundation and is available as ARC files from Amazon S3.

  • 3,005,626,093 unique HTML pages from 40.6 million pay-level domains.

  • Crawling was conducted between January and June 2012.

  • The compressed size of the corpus is 48 TB.

  • Page selection by the crawler relies on the PageRank algorithm.
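
The corpus statistics above are counted per pay-level domain, so every page URL has to be mapped to its domain first. The following is a minimal sketch of that mapping, not the method used in the study. The function names and the tiny `PUBLIC_SUFFIXES` set are assumptions made for illustration; a real analysis would load the complete Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny suffix set (assumption for illustration).
# A real analysis would load the full Public Suffix List instead.
PUBLIC_SUFFIXES = {"com", "org", "de", "nl", "uk", "co.uk"}


def pay_level_domain(url):
    """Return the pay-level domain of a URL, or None if it cannot be derived."""
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.lower().split(".")
    # Scan from the longest candidate suffix to the shortest, then keep
    # exactly one more label to the left of the matching public suffix.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def count_pay_level_domains(urls):
    """Count the distinct pay-level domains among the given URLs."""
    return len({pld for pld in map(pay_level_domain, urls) if pld})
```

For example, `pay_level_domain("http://news.bbc.co.uk/sport")` yields `bbc.co.uk`, so all subdomains of a website are grouped under one entry.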


Data Extraction Process

  • The parsing framework is executed on Amazon EC2

  • Relies on the Anything To Triples (Any23, http://any23.apache.org/) parsing library from Apache

  • The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
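
To give a rough idea of what detecting the three formats in a page involves, here is a simplified sketch built only on the Python standard library. It is not Any23 and does not extract triples; it only flags which markup styles appear. The marker sets and the names `FormatSniffer` and `detect_formats` are assumptions chosen for illustration and are far from complete.

```python
from html.parser import HTMLParser

# Assumed, incomplete marker sets (illustration only).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "vocab"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hrecipe", "hproduct"}


class FormatSniffer(HTMLParser):
    """Collects the structured-data formats hinted at by a page's start tags."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.formats.add("Microdata")
        if names & RDFA_ATTRS:
            self.formats.add("RDFa")
        for name, value in attrs:
            # Microformats are signalled through well-known class names.
            if name == "class" and value:
                if set(value.split()) & MICROFORMAT_CLASSES:
                    self.formats.add("Microformats")


def detect_formats(html):
    """Return the set of format names detected in an HTML string."""
    sniffer = FormatSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.formats
```

A call such as `detect_formats('<meta property="og:title" content="T">')` reports RDFa, because Open Graph Protocol markup is expressed through the `property` attribute.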


Results: Overall Picture

  • Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
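
The two percentages can be rechecked from the figures quoted in this transcript. The sketch below uses the rounded counts from the slides (369M pages, 2.29M domains), so the results are approximate; the variable and function names are illustrative.

```python
def share(part, total):
    """Return part / total as a percentage, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Rounded counts as quoted in the slides; the exact counts are not given here.
pages_with_data = 369_000_000
pages_total = 3_005_626_093
domains_with_data = 2_290_000
domains_total = 40_600_000

page_share = share(pages_with_data, pages_total)        # roughly 12.3
domain_share = share(domains_with_data, domains_total)  # roughly 5.64

print(f"pages: {page_share}%  domains: {domain_share}%")
```

The page-level share is more than twice the domain-level share, which suggests that websites publishing structured data contribute more pages per website to the corpus than the average website does.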


Results: Deployment by Format

* PLDs – Pay-Level Domains (i.e. websites)

* URLs – HTML pages


Results: Deployment by Popularity

* According to the Alexa Internet Inc. (AL) list of the most frequently visited websites


Results: Deployment by Domains


Results: Deployment on the Same Website

  • 93.5% of all websites that contain structured data use only a single format
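
A share like this can be derived from a mapping of websites to the set of formats found on them. The sketch below shows one way to compute it; the function name and the sample mapping are hypothetical and are not data from the study.

```python
def single_format_share(formats_by_site):
    """Percentage of sites with structured data that use exactly one format."""
    # Ignore sites on which no structured data was found at all.
    sites = [formats for formats in formats_by_site.values() if formats]
    if not sites:
        return 0.0
    single = sum(1 for formats in sites if len(formats) == 1)
    return 100.0 * single / len(sites)


# Hypothetical sample, not data from the study.
sample_sites = {
    "a.example": {"RDFa"},
    "b.example": {"Microdata"},
    "c.example": {"RDFa", "Microformats"},
    "d.example": set(),  # no structured data, excluded from the base
}
sample_share = single_format_share(sample_sites)
```

Sites without any structured data are left out of the denominator, matching the wording of the slide ("of all websites that contain structured data").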


Results: Deployment of RDFa

(Table: most frequently used properties co-occurring with all four of the most frequently used OGP classes)

(Table: most frequently used RDFa classes)

  • Alexa top 100 websites that use RDFa:

    • IMDB

    • Microsoft News Portal

    • BBC


Results: Deployment of Microdata

(Table: most frequently used Microdata classes)

  • Alexa top 100 websites that use Microdata:

    • eBay

    • Microsoft Corp.

    • Apple Inc.


Results: Deployment of Microformats

  • Alexa top 100 websites that use Microformats:

    • Wikipedia

    • Adobe

    • Taobao marketplace

(Table: most frequently used Microformats classes)


Results: Topical Domains

  • Dominant topical domains of the published data:

    • Persons and Organizations (by all 3 formats)

    • Blog- and CMS-related metadata (by RDFa and Microdata)

    • Navigational metadata (by RDFa and Microdata)

    • Product data (by all 3 formats)

    • Event data (by Microformats)


Results: Structural Richness

  • Only a small set of generic properties is used to describe entities:

    • Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases

    • Instances of the Schema.org class “Product” are described largely only by name and description

  • Additional extraction techniques have to be employed for a deeper understanding
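
Structural richness of this kind can be measured by counting, for each class, how many of its entities use each property. The sketch below illustrates the idea on a small in-memory list. The entity layout (a dictionary with `type` and `properties` keys), the function name, and the sample values are assumptions for illustration, not the study's data model or its RapidMiner workflow.

```python
from collections import Counter


def property_usage(entities, class_name):
    """Map each property to the share of class_name entities that use it."""
    matching = [e for e in entities if e["type"] == class_name]
    if not matching:
        return {}
    counts = Counter()
    for entity in matching:
        # Count each property once per entity, even if it is repeated.
        counts.update(set(entity["properties"]))
    return {prop: n / len(matching) for prop, n in counts.items()}


# Hypothetical sample, not data from the study.
sample_entities = [
    {"type": "Product", "properties": ["name", "description"]},
    {"type": "Product", "properties": ["name", "description", "price"]},
    {"type": "Product", "properties": ["name"]},
    {"type": "Person", "properties": ["name", "url"]},
]
usage = property_usage(sample_entities, "Product")
```

In this sample, `name` is used by every "Product" entity while `price` appears on only one in three, the kind of skew toward generic properties that the slide describes.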


Sources

Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2013). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf


Thank you for your attention!

Questions?

