Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

OC Working Group – 21.01.2014

Serge Tymaniuk


Overview

  • Introduction

  • Methodology

  • Results

  • Questions


Introduction

  • Written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1)

    • (1) Data and Web Science Group, University of Mannheim, Germany

    • (2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands

  • Features:

    • Analysis of RDFa, Microdata, and Microformats adoption on the Web

    • Based on a large public Web crawl of 3 billion HTML pages

    • Aims to reveal the main topical areas of the published data and the different vocabularies used within each topical area

    • Examines structural richness (which properties are used to describe popular types of entities)


Web Crawl

  • Web crawl provided by the Common Crawl Foundation, available as ARC files on Amazon S3

  • 3,005,626,093 unique HTML pages from 40.6 million pay-level domains (PLDs)

  • Crawl conducted between January and June 2012

  • Compressed size of the corpus is 48 TB

  • The crawler relies on the PageRank algorithm to decide which pages to retrieve


Data Extraction Process

  • Parsing framework is executed on Amazon EC2

  • Relies on the Apache Anything To Triples (Any23) parsing library (http://any23.apache.org/)

  • The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
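
To make the extraction step more concrete, the following is a minimal Python sketch of how a page could be checked for the three markup formats. It is not the Any23 pipeline used in the study: it uses only the Python standard library, and the names `FormatDetector`, `detect_formats`, and `MICROFORMAT_CLASSES`, as well as the chosen subset of Microformats class names, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Root class names of a few common Microformats.
# This subset is an assumption for illustration, not the list used in the study.
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hrecipe", "hproduct"}


class FormatDetector(HTMLParser):
    """Records which structured-data formats appear in one HTML page."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        # Microdata marks items with itemscope/itemtype/itemprop attributes.
        if names & {"itemscope", "itemtype", "itemprop"}:
            self.formats.add("microdata")
        # RDFa uses attributes such as typeof, property, about, and vocab.
        if names & {"typeof", "property", "about", "vocab"}:
            self.formats.add("rdfa")
        # Microformats reuse the ordinary HTML class attribute.
        for name, value in attrs:
            if name == "class" and value:
                if MICROFORMAT_CLASSES & set(value.split()):
                    self.formats.add("microformats")


def detect_formats(html):
    """Return the set of formats detected in the given HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return detector.formats


page = (
    '<div itemscope itemtype="http://schema.org/Product">'
    '<span itemprop="name">Example</span></div>'
    '<meta property="og:title" content="Example">'
    '<div class="vcard"><span class="fn">A. Person</span></div>'
)
print(sorted(detect_formats(page)))
```

A real extractor such as Any23 goes further and emits RDF triples for each marked-up item; this sketch only answers the simpler question of which formats a page uses.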


Results: Overall picture

  • Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
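
The two percentages follow directly from the counts on the slides. A short Python check, using the rounded figures quoted above (the helper name `share` is an assumption for illustration):

```python
# Counts as quoted on the slides; the "M" figures are rounded.
pages_with_data = 369_000_000     # pages containing structured data
total_pages = 3_005_626_093       # unique HTML pages in the crawl
domains_with_data = 2_290_000     # domains containing structured data
total_domains = 40_600_000        # pay-level domains in the crawl


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


page_share = share(pages_with_data, total_pages)
domain_share = share(domains_with_data, total_domains)
print(f"{page_share:.1f}% of pages, {domain_share:.2f}% of domains")
```

The page-level share is roughly twice the domain-level share.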


Results: Deployment by FORMAT

* PLDs – pay-level domains (i.e. websites)

* URLs – HTML pages


Results: Deployment by POPULARITY

* According to the Alexa Internet Inc. (AL) list of the most frequently visited websites


Results: Deployment by domains


Results: Deployment on the same Website

  • 93.5% of all websites that contain structured data use only a single format


Results: Deployment of RDFa

Most frequently used properties co-occurring with all the 4 most frequently used OGP classes:

Most frequently used RDFa classes:

  • Alexa top 100 websites that use RDFa:

    • IMDB

    • Microsoft News Portal

    • BBC


Results: Deployment of Microdata

Most frequently used Microdata classes:

  • Alexa top 100 websites that use Microdata:

    • eBay

    • Microsoft Corp.

    • Apple Inc.


Results: Deployment of Microformats

  • Alexa top 100 websites that use Microformats:

    • Wikipedia

    • Adobe

    • Taobao marketplace

Most frequently used Microformats classes:


Results: Topical Domains

  • Dominant Domains of the published data:

    • Persons and Organizations (by all 3 formats)

    • Blog- and CMS-related metadata (by RDFa and Microdata)

    • Navigational metadata (by RDFa and Microdata)

    • Product data (by all 3 formats)

    • Event data (by Microformats)


Results: Structural Richness

  • Only a small set of generic properties is used to describe entities:

    • Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases

    • Instances of the schema.org class “Product” are described largely by name and description only

  • Additional extraction techniques have to be employed for a deeper understanding
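
Structural richness can be measured by counting, for each class, how often each property co-occurs with it. The sketch below shows one way to do this in Python. The function name `property_usage` and the toy entities are invented for illustration; they are not data or code from the study, which used RapidMiner for this analysis.

```python
from collections import Counter


def property_usage(entities, class_name):
    """Return {property: share of class_name entities that use it}."""
    matching = [e for e in entities if e["type"] == class_name]
    if not matching:
        return {}
    # Count each property at most once per entity.
    counts = Counter(p for e in matching for p in set(e["properties"]))
    return {p: n / len(matching) for p, n in counts.items()}


# Toy entities, invented for illustration only (not data from the study).
entities = [
    {"type": "Product", "properties": ["name", "description"]},
    {"type": "Product", "properties": ["name", "description", "price"]},
    {"type": "Product", "properties": ["name"]},
    {"type": "Person", "properties": ["name", "url"]},
]

usage = property_usage(entities, "Product")
for prop, fraction in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{prop}: {fraction:.0%}")
```

A class whose usage shares drop off sharply after a few generic properties, as in this toy example, is described shallowly, which is the pattern the slide reports for “Product”.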


Sources

Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2013). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf


Thank you for your attention!

Questions?

