Achieving Efficient Access to Large Integrated Sets of Semantic Data in Web Applications

Pieter Bellekens, Kees van der Sluijs, William van Woensel,

Sven Casteleyn, Geert-Jan Houben

ICWE 2008, 07/16/2008

Introduction

Context

The Semantic Web (SW) keeps growing

Ever more SW datasets are becoming available

Legacy data can usually be transformed into SW terms

A class of Web applications is emerging that wants to exploit the potential of these Web-sized SW datasets

Practical Application

iFanzy, a commercial personalized EPG (electronic program guide)

Uses large sets of integrated SW data

Runner-up in the Semantic Web Challenge; showed that the class of large-scale SW applications needs more research

iFanzy

Personalized EPG

Multi-platform

Set-top box, focus on experience

Web, focus on interactivity

Central server architecture

Integrating data from various heterogeneous sources

TV grabbers, IMDb, movie trailers, etc.

Context-sensitive recommendations

Recommendations based on semantic structure, e.g. the genre ontology, connections between programs, etc.

iFanzy Datasets

(Live) Heterogeneous data sources

Online TV guides, XMLTV format

sample: 1,278,718 triples, updated daily

Online movie databases, IMDb text dumps

currently 53,268,369 triples (full), 7,986,199 (trimmed)

Trailers from Videodetective.com (API)

Broadcast descriptions, BBC Backstage, TV-Anytime format (domain model)

sample: 91,447 triples, updated daily

Various vocabularies and ontologies

Converting TV Metadata to RDF/OWL

Input source 1:

<program title="Match of the Day">
  <channel>BBC One</channel>
  <start>2008-03-09T19:45:00Z</start>
  <duration>PT01H15M00S</duration>
  <genre>sport</genre>
</program>

Input source 2:

<program channel="NED1">
  <source>http://foo.bar/</source>
  <title>Sportjournaal</title>
  <start>20080309184500</start>
  <end>20080309190000</end>
  <genre>sport nieuws</genre>
</program>

Translation to TV-Anytime in RDF/OWL

<TVA:ProgramInformation ID="crid://foo.bar/0001">
  <hasTitle>Sportjournaal</hasTitle>
  <hasGenre rdf:resource="TVAGenres:3.1.1.9"/>
</TVA:ProgramInformation>

<TVA:Schedule ID="TVA:Schedule_0001">
  <serviceIDRef>NED1</serviceIDRef>
  <hasProgram crid="crid://foo.bar/0001"/>
  <startTime rdf:resource="TIME:TimeDesc_0001"/>
</TVA:Schedule>

<TIME:TimeDescription ID="TIME:TimeDesc_0001">
  <year>2008</year>
  <month>3</month>
  <day>9</day>
  <hour>18</hour>
  <minute>45</minute>
  <second>0</second>
</TIME:TimeDescription>
Converting Vocabularies to RDF/OWL
<Term termID="3.1">
  <Name xml:lang="en">NON-FICTION/INFORMATION</Name>
  <Term termID="3.1.1">
    <Name xml:lang="en">News</Name>
    <Term termID="3.1.1.9">
      <Name xml:lang="en">Sport News</Name>
      <Definition xml:lang="en">News of sports</Definition>
    </Term>
  </Term>
</Term>

<Term termID="3.2">
  <Name xml:lang="en">SPORTS</Name>
  <Term termID="3.2.1">
    <Name xml:lang="en">Athletics</Name>
    <Term termID="3.2.1.1">
    </Term>
  </Term>
</Term>

Translation of TV-Anytime genres to RDF/OWL using SKOS

<TVAGenres:genre ID="TVAGenres:3.1.1.9">
  <rdfs:label>Sport News</rdfs:label>
  <skos:broader rdf:resource="TVAGenres:3.1.1"/>
  <skos:related rdf:resource="TVAGenres:3.2"/>
</TVAGenres:genre>

<TVAGenres:genre ID="TVAGenres:3.2">
  <rdfs:label>Sport</rdfs:label>
  <skos:related rdf:resource="TVAGenres:3.1.1.9"/>
</TVAGenres:genre>

<TVAGenres:genre ID="TVAGenres:3.1.1">
  <rdfs:label>News</rdfs:label>
  <skos:narrower rdf:resource="TVAGenres:3.1.1.9"/>
  <skos:broader rdf:resource="TVAGenres:3.1"/>
</TVAGenres:genre>
Aligning and Enriching Vocabularies

Alignment of Genre vocabularies

The content sources use several different genre vocabularies

Semantic enrichment of Genre vocabulary

Via SKOS narrower, broader and related relations

Enrichment of the user model

Import of social network profile adds interests in programs, persons (actors, directors,...), locations, etc.

Example genre alignments:

XMLTV:documentaire -> TVA:"Documentary"

IMDB:Thriller -> TVA:"Thriller"

IMDB:Sci-Fi -> TVA:"Science Fiction"

Example enrichment relations, with their provenance (the partial-label-match heuristic is sketched below):

News -skos:narrower-> Sport News (original Term hierarchy)

Sport News -skos:related-> Sport (partial label match)

Skating -skos:related-> 'Ice skating' (partial label match)

'American Football' -skos:related-> Rugby (domain expert)
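As a rough illustration of the partial-label-match heuristic above, here is a minimal Java sketch; the tokenization and the share-a-token criterion are our assumptions, not necessarily the authors' exact algorithm:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LabelMatcher {

    // Lower-case a label and split it into word tokens.
    static Set<String> tokens(String label) {
        return new HashSet<String>(Arrays.asList(label.toLowerCase().split("\\W+")));
    }

    // Two labels "partially match" when they share at least one token,
    // e.g. "Sport News" / "Sport", or "Skating" / "Ice skating".
    static boolean partialMatch(String a, String b) {
        Set<String> shared = tokens(a);
        shared.retainAll(tokens(b));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(partialMatch("Sport News", "Sport"));     // true
        System.out.println(partialMatch("Skating", "Ice skating")); // true
        System.out.println(partialMatch("News", "Athletics"));      // false
    }
}

Matches found this way only propose skos:related links; as the slide notes, links such as 'American Football'-Rugby still need a domain expert.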
Aligning and Enriching Vocabularies cont.

Semantic enrichment of TV metadata with IMDB movie descriptions

Programs are matched across sources

Use part-of relations in a geographical hierarchy to relate locations in the different sources

Alignment of date/time descriptions to Time ontology concepts to allow temporal reasoning

Example: a date/time literal aligned with Time ontology fields:

"2006-01-01T12:00:00" becomes

<time:year>2006</time:year>
<time:day>01</time:day>
<time:hour>12</time:hour>

Example: matching program titles across sources:

"Buono, il brutto, il cattivo, Il (1966)" matches "The Good, the Bad and the Ugly"

Example: locations in the geographical part-of hierarchy:

"White Plains" -> "New York" -> "USA"
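The date/time alignment is mechanical; below is a minimal Java sketch, assuming ISO-8601 input as in the example (the time:* field names follow the slide, the output serialization is ours):

import java.time.LocalDateTime;

public class TimeAligner {
    public static void main(String[] args) {
        // Parse the ISO-8601 literal coming from the source data.
        LocalDateTime t = LocalDateTime.parse("2006-01-01T12:00:00");

        // Emit the Time-ontology-style fields shown on the slide.
        System.out.printf("<time:year>%d</time:year>%n", t.getYear());
        System.out.printf("<time:day>%02d</time:day>%n", t.getDayOfMonth());
        System.out.printf("<time:hour>%d</time:hour>%n", t.getHour());
    }
}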

Using the Semantic Graph

Recommendations

are generated based on usage data, the RDF/OWL graph and behavior analysis

Search functionality

uses the graph to show connections between items

Showing semantically related content

by following the relationships

Interface visualization

genres and locations in the interface can be browsed based on their relations to other concepts

Scalability & Performance Issues
  • Large-scale SW applications face performance issues with current-day SW tools
    • Current RDF databases are not mature performance-wise
      • Especially for complex queries
    • Inference is time-consuming or space-intensive
    • RDF databases are generic; they do not use specific knowledge about the sources
  • Target: efficient access to our data
    • Low latency: users expect quick responses from Web applications
      • Web 2.0 allows asynchronous updates
    • We need to be able to scale to thousands of users
Technologies and Strategies

Technological choices

  • RDF database: Sesame (versions 1 and 2)
  • Query language: SeRQL (a minimal access sketch follows this list)
  • We looked at different data decomposition strategies
    • Vertical decomposition
    • Horizontal decomposition
  • We applied several application-specific optimizations
    • Using relational databases where possible
    • Using a free-text search engine
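As a rough illustration of this stack, the sketch below opens a Sesame 2 repository and evaluates a SeRQL query. It is a minimal sketch assuming the Sesame 2 openrdf API; the query and the tva namespace are placeholders, not code from the paper:

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.memory.MemoryStore;

public class SesameDemo {
    public static void main(String[] args) throws Exception {
        // In-memory store for the sketch; a real deployment would use a persistent back-end.
        Repository repo = new SailRepository(new MemoryStore());
        repo.initialize();

        // Placeholder SeRQL query and namespace.
        String serql =
            "SELECT Title FROM {Program} tva:hasTitle {Title} " +
            "USING NAMESPACE tva = <http://foo.bar/tva#>";

        RepositoryConnection con = repo.getConnection();
        try {
            TupleQueryResult result =
                con.prepareTupleQuery(QueryLanguage.SERQL, serql).evaluate();
            while (result.hasNext()) {
                BindingSet bs = result.next();
                System.out.println(bs.getValue("Title"));
            }
            result.close();
        } finally {
            con.close();
        }
    }
}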
Natural Solution: One big dataset
  • All sources in one repository
    • Pro:
      • Data is highly integrated
      • One query to get all data
    • Con:
      • Maintenance can be hard
      • The bigger the store, the longer the query execution times (even for simple queries)
  • Some typical iFanzy queries together with execution times (a SeRQL sketch of Query 1 follows the list):

Query 1: all programs with genre 'drama' (or one of its subgenres)

Query 2: all programs with genre 'drama' and a keyword in the program metadata (title, synopsis and keywords)

Query 3: all programs with a keyword in the program metadata (title, synopsis and keywords)

Query 4: all programs with genre 'drama' and a keyword in the program metadata or the person metadata (person name)
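The deck does not show its queries, but Query 1 could look roughly as follows in SeRQL. The tva namespace and the genre URI for 'drama' are assumptions; note that the skos:broader pattern in the second branch only reaches direct subgenres, which is exactly the inference problem discussed on a later slide:

SELECT Program, Title
FROM {Program} tva:hasTitle {Title};
               tva:hasGenre {<http://foo.bar/TVAGenres#3.4>}

UNION

SELECT Program, Title
FROM {Program} tva:hasTitle {Title};
               tva:hasGenre {Genre},
     {Genre} skos:broader {<http://foo.bar/TVAGenres#3.4>}

USING NAMESPACE
    tva  = <http://foo.bar/tva#>,
    skos = <http://www.w3.org/2004/02/skos/core#>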

Decomposition

A table X can be decomposed into x1, x2, …, xn

Vertical decomposition (splitting properties)

The n query results from the decomposition need to be combined to find the final result

Building the final result gets more complicated as more tables are involved

  • Horizontal decomposition (splitting instances)
    • The n query results from the decomposition need to be united via a UNION to find the final result
    • If the result set needs to be ordered, the ordering has to be done after all queries have executed
Vertical Decomposition

Splitting the data sources based on properties

Genres, Geo and Synonyms (WordNet) are split off

Relations between sources are not broken due to uniqueness of URIs

  • The result of one query is the input of the next in the query pipeline (a pipeline sketch follows this list)
    • E.g. synonyms found in WordNet are used to query the program data
    • Different strategies influence performance greatly (see table)
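A minimal sketch of such a two-step pipeline against two Sesame 2 repositories. The wn:inSynset and tva:hasKeyword properties and the namespaces are placeholders for whatever the actual conversion uses; the query strings are built naively, which is fine for a sketch:

import java.util.ArrayList;
import java.util.List;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class QueryPipeline {

    // Step 1: look up synonyms of the keyword in the split-off WordNet store.
    static List<String> synonyms(Repository wordnet, String keyword) throws Exception {
        List<String> labels = new ArrayList<String>();
        labels.add(keyword);
        RepositoryConnection con = wordnet.getConnection();
        try {
            String q =
                "SELECT Label FROM {W} rdfs:label {\"" + keyword + "\"}; " +
                "wn:inSynset {S}, {W2} wn:inSynset {S}; rdfs:label {Label} " +
                "USING NAMESPACE wn = <http://foo.bar/wordnet#>";
            TupleQueryResult r = con.prepareTupleQuery(QueryLanguage.SERQL, q).evaluate();
            while (r.hasNext()) {
                labels.add(r.next().getValue("Label").stringValue());
            }
            r.close();
        } finally {
            con.close();
        }
        return labels;
    }

    // Step 2: feed every label into a query against the TV-program store.
    static void findPrograms(Repository programs, List<String> labels) throws Exception {
        RepositoryConnection con = programs.getConnection();
        try {
            for (String label : labels) {
                String q =
                    "SELECT Program FROM {Program} tva:hasKeyword {\"" + label + "\"} " +
                    "USING NAMESPACE tva = <http://foo.bar/tva#>";
                TupleQueryResult r = con.prepareTupleQuery(QueryLanguage.SERQL, q).evaluate();
                while (r.hasNext()) {
                    System.out.println(r.next().getValue("Program"));
                }
                r.close();
            }
        } finally {
            con.close();
        }
    }
}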
Horizontal Decomposition

Splitting the data sources based on instances

The BBC and XMLTV datasets (which have identical structures) are separated into two tables

Joining the results is a simple UNION

Retrieve from one source until enough results are found

Queries to the split sources can be executed in parallel (a parallel-execution sketch follows below)
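A sketch of that parallel querying over the two split repositories, assuming the Sesame 2 API; the final list is simply the UNION of both result sets:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class ParallelUnion {

    // Evaluate one SeRQL query against one repository, collecting the Program column.
    static List<String> run(Repository repo, String serql) throws Exception {
        List<String> rows = new ArrayList<String>();
        RepositoryConnection con = repo.getConnection();
        try {
            TupleQueryResult r = con.prepareTupleQuery(QueryLanguage.SERQL, serql).evaluate();
            while (r.hasNext()) {
                rows.add(r.next().getValue("Program").stringValue());
            }
            r.close();
        } finally {
            con.close();
        }
        return rows;
    }

    // Fire the same query at both split stores in parallel and unite the results.
    static List<String> union(final Repository bbc, final Repository xmltv,
                              final String serql) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<List<String>> f1 = pool.submit(new Callable<List<String>>() {
            public List<String> call() throws Exception { return run(bbc, serql); }
        });
        Future<List<String>> f2 = pool.submit(new Callable<List<String>>() {
            public List<String> call() throws Exception { return run(xmltv, serql); }
        });
        List<String> all = new ArrayList<String>(f1.get()); // blocks until done
        all.addAll(f2.get());                               // the UNION step
        pool.shutdown();
        return all;
    }
}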

Horizontal Decomposition cont.
  • The biggest data source (the IMDb set) also accounts for the biggest delay in responsiveness
  • While it contains nearly one million movies, only a fraction is known to the general public
  • Indicator: the more votes a movie has received, the better known it is
  • Trimming the IMDb database based on the number of votes (see table; a trimming sketch follows this list)
  • Keeping only the movies with more than 500 votes resulted in 11,500 movies, or 7,986,199 triples in the database
  • Query times were also severely reduced
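The trimming itself is a plain filter over vote counts. A minimal sketch, assuming a simplified title-TAB-votes input file rather than the real IMDb dump layout (the 500-vote threshold comes from the slide):

import java.io.BufferedReader;
import java.io.FileReader;

public class TrimImdb {
    public static void main(String[] args) throws Exception {
        int threshold = 500; // from the slide: keep movies with more than 500 votes
        BufferedReader in = new BufferedReader(new FileReader("votes.tsv"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");   // parts[0] = title, parts[1] = votes
            if (Integer.parseInt(parts[1]) > threshold) {
                System.out.println(parts[0]);    // this movie survives the trim
            }
        }
        in.close();
    }
}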
Reasoning optimization
  • In RDF, we can reason over facts to deduce new facts
  • Inference can be pre-calculated
    • → more triples in the database
  • Inference can be taken into account while querying
    • → much more complex queries
  • Inference for sublocations ("California": 8877 sublocations)
  • Inference for subgenres ("Action": 10 subgenres)
  • Both variants are sketched below
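For illustration, the subgenre case in SeRQL (the namespaces and the Action genre URI are assumptions). With the skos:broader closure pre-calculated, one extra pattern suffices; without it, every level of the hierarchy needs its own UNION branch, since SeRQL has no transitive path operator (two levels shown):

Pre-calculated (closure materialized):

SELECT Program
FROM {Program} tva:hasGenre {Genre},
     {Genre} skos:broader {<http://foo.bar/TVAGenres#Action>}
USING NAMESPACE
    tva  = <http://foo.bar/tva#>,
    skos = <http://www.w3.org/2004/02/skos/core#>

Query-time inference:

SELECT Program
FROM {Program} tva:hasGenre {Genre},
     {Genre} skos:broader {<http://foo.bar/TVAGenres#Action>}

UNION

SELECT Program
FROM {Program} tva:hasGenre {Genre},
     {Genre} skos:broader {Mid},
     {Mid} skos:broader {<http://foo.bar/TVAGenres#Action>}

USING NAMESPACE
    tva  = <http://foo.bar/tva#>,
    skos = <http://www.w3.org/2004/02/skos/core#>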

Further optimization
  • Different types of databases
    • Some well-structured data repositories can be stored in relational databases
    • Different versions of Sesame, or different triple-store back-ends for Sesame, can have a severe impact on performance
  • Using the LIMIT clause where applicable
    • At the interface side, users usually browse results in chunks of 20, and more can be retrieved on request
    • If the result set needs to be ordered, this advantage is lost
  • Keyword indices
    • Keeping indices over all strategic metadata fields helps when the user searches for keywords
    • Using a Lucene index also tolerates misspellings, while maintaining high speed (a fuzzy-search sketch follows this list)
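A minimal sketch of the keyword-index idea, assuming a Lucene 2.x-era API (the deck does not show iFanzy's indexing code); the FuzzyQuery is what tolerates misspellings:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class KeywordIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();

        // Index a strategic metadata field (here: the program title).
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("title", "Sportjournaal",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // A fuzzy query still matches despite the misspelled keyword.
        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs hits = searcher.search(
            new FuzzyQuery(new Term("title", "sportjournal")), 20);
        System.out.println(hits.totalHits + " hit(s)");
        searcher.close();
    }
}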
Conclusions

General conclusions

  • The maturing SW provides lots of available data
  • Data integration and alignment are necessary; they involve some manual transformation, but it is surmountable
  • Many options exist to tune performance by combining techniques

iFanzy Conclusions

  • A concrete product built with bleeding-edge technology; we showed that it can be done
  • SW technologies have matured enough to become commercially interesting: iFanzy is starting to sell
Current and Future Work
  • Benchmarking with many alternative SW data backends
    • OWLIM, Oracle, Jena, Mulgara, SWI-Prolog, etc
  • Parallelization using top-end servers
    • Scaling to thousands of users
  • Further research on Semantic recommendation algorithms
    • Combining SW techniques with collaborative filtering
  • User tests and product refinement
    • First, 300 customers have been selected for user tests
    • User interviews and usage data as feedback to improve application

Questions…

http://www.ifanzy.nl
