Path processing using solid state storage

Path Processing using Solid State Storage. Manos Athanassoulis, DIAS, EPFL*; Mustafa Canim, IBM Watson Research Labs; Kenneth Ross, IBM Watson Research Labs, Columbia University; Bishwaranjan Bhattacharjee, IBM Watson Research Labs. *Work done during an internship at IBM.

Path Processing using Solid State Storage

Manos Athanassoulis, DIAS, EPFL*

Mustafa Canim, IBM Watson Research Labs

Kenneth Ross, IBM Watson Research Labs, Columbia University

Bishwaranjan Bhattacharjee, IBM Watson Research Labs

*work done during an internship at IBM.

Why Path Processing?

  • Applications use linkage information

    • Social

    • Scientific

    • Government

    • Financial

    • Knowledge

      • Watson (Jeopardy Champ)

  • Graph processing is not enough

    • Link type modeled by RDF

Why Solid State Storage (SSS)?

  • Increasing capacity

    • Exponential increase

    • Follows Moore’s law

  • Read performance

    • Orders of magnitude faster than disks

    • Random read performance

      • Crucial for path processing

  • New technologies

    • Flash already mature

    • Phase Change Memory (PCM)

    • … more technologies are coming

Path processing

1) Cannot prefetch

2) Retrieve-data-then-follow-link

3) A lot of useless data is retrieved

How can Solid State Storage help?

Path processing (and Solid State Storage)

1) Small access latency

2) Read mostly useful data

3) Efficient random IO accesses

4) Can we do something better?

Build SSS-aware systems

In the rest of the talk …

  • RDF data model and systems

  • Solid State Storage for Path Processing

    • Technology

    • Flash vs PCM

  • Storing and managing RDF data over Solid State Storage

  • Conclusions

Resource Description Framework (RDF) meta-data model

  • Data is represented as statements, each consisting of a triple

    • Statement: <Subject, Predicate, Object>

  • Each statement describes a property of a subject:

    • <“IBM”, “is-a”, “Corporation”>

  • or a connection between two objects:

    • <“Manos”, “interned-at”, “IBM”>

  • or a value of a Property of a Subject:

    • <“Manos”, “born-in”, “1984”>

  • The notation is more complex:

    • Subjects are Uniform Resource Identifiers (URIs)

    • Predicates are URIs

    • Objects are either URIs or literals
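The statement model above can be sketched in a few lines of Python. This is only an illustration: the `Triple` type, the `example.org` namespace, and the convention that literals are quoted strings are assumptions made here, not part of the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <Subject, Predicate, Object>."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"

statements = [
    Triple(EX + "IBM", EX + "is-a", EX + "Corporation"),   # property of a subject
    Triple(EX + "Manos", EX + "interned-at", EX + "IBM"),  # connection between two objects
    Triple(EX + "Manos", EX + "born-in", '"1984"'),        # literal value
]


def is_literal(term: str) -> bool:
    """In this sketch literals are quoted strings; everything else is a URI."""
    return term.startswith('"')


def objects_of(subject: str, predicate: str) -> list:
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in statements
            if t.subject == subject and t.predicate == predicate]
```

Calling `objects_of(EX + "Manos", EX + "interned-at")` returns the URI for IBM, which can in turn be used as the subject of the next lookup.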

RDF data management

  • Two alternatives are used to store data

  • Relational RDF storage

    • Use existing relational stores

    • Create relational tables

    • Basic approach: A triple-store

      • One big table with three columns

  • Native RDF storage

    • Tailored to the needs of the specific workload

    • No underlying system assumed
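To make the basic relational approach concrete, here is a minimal sketch of a triple-store as one three-column table, using Python's built-in `sqlite3` module. The table and column names are assumptions chosen for the example; the talk does not prescribe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("IBM", "is-a", "Corporation"),
        ("Manos", "interned-at", "IBM"),
        ("Manos", "born-in", "1984"),
    ],
)

# Following one link needs a self-join: the object of the first
# statement becomes the subject of the second.
rows = conn.execute(
    """
    SELECT t1.subject, t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t1.object = t2.subject
    WHERE t1.predicate = 'interned-at' AND t2.predicate = 'is-a'
    """
).fetchall()
```

Each additional hop in a path adds another self-join on the same table, which is one reason path queries are a demanding workload for this layout.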

Can we take the best of both worlds?



Solid State Storage facts

  • We have access to a PCI-based PCM prototype (compared with fusionIO)

  • PCM prototype vs Flash state-of-the-art

*Very early Micron PCM prototype

Exploiting Solid State Storage for path processing

  • Path-processing involves link-following queries

    • Access latency is critical

  • Solid State Storage is tailored for path-processing:

    • Orders of magnitude lower read latency than traditional storage

    • Very fast random-read performance

  • PCM is expected to outperform Flash in read performance

  • Next in this talk:

    • PCM vs Flash when running link-following queries

    • Storing and managing RDF data on Solid State Storage

PCM vs Flash in path processing

  • Prototype implementation of link-following queries

  • Workload: Given a randomly generated graph, execute link-following queries of variable length without buffering

  • Graph generation

    • 5GB synthetic data with a random number of edges (between 3 and 30 edges per vertex)

  • Querying parameters

    • Number of threads (1, 2, 4, 8, 16, 32, 64, 96, 128, 192)

    • Page size (4K, 8K, 16K, 32K)

    • Length of the query (2, 4, 10, 100 accesses per query)

  • Hypothesis: PCM can offer important performance improvements
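The shape of this workload can be sketched as a small in-memory simulation. It is not the authors' prototype: the function names, the mapping of 16 vertices to a page, and the uniform choice of the next edge are all assumptions made for illustration. It only shows why every read depends on the previous one.

```python
import random


def make_graph(num_vertices, min_edges=3, max_edges=30, seed=0):
    """Build adjacency lists with a random out-degree per vertex."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_vertices)
         for _ in range(rng.randint(min_edges, max_edges))]
        for _ in range(num_vertices)
    ]


def follow_links(graph, start, length, rng, vertices_per_page=16):
    """Run one link-following query and return the sequence of pages read.

    The next vertex is only known after the current vertex's page has
    been read, so the accesses cannot be prefetched or issued in parallel.
    """
    pages = []
    vertex = start
    for _ in range(length):
        pages.append(vertex // vertices_per_page)  # one unbuffered page read
        vertex = rng.choice(graph[vertex])         # pick a link to follow
    return pages


graph = make_graph(1000)
trace = follow_links(graph, start=0, length=100, rng=random.Random(1))
```

With one device read per entry in `trace`, the time of a single query is roughly its length multiplied by the read latency, which is why read latency dominates in this kind of experiment.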

PCM vs Flash

Query length: 100 hops

PCM performs consistently better for smaller page granularities


An RDF repository for Solid State Storage


Building an SSS-aware RDF repository

  • We focused on building a graph-based RDF repository

  • We need to design a new system which:

    • Takes into account the graph-structure of the data

    • Supports any RDF-based query

  • We introduce Pythia, a new RDF repository, which uses:

    • The notion of RDF-tuple

    • New internal structures

    • New data layout

RDF-tuple



<Predicate1>, {<Object1_1>, <Object1_2>, …},

<Predicate2>, {<Object2_1>, <Object2_2>, …},

<PredicateN>, {<ObjectN_1>, <ObjectN_2>, …},

  • The RDF-tuple design:

    • allows us to locate within a page the most important information of a Subject.

    • allows us to avoid repeating redundant information (Subject and Predicate resources)

      • This is further optimized by the URL Dictionary
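One way to picture the RDF-tuple is as triples grouped first by subject and then by predicate, with predicates replaced by small dictionary IDs. The sketch below assumes that reading. `build_rdf_tuples`, the in-memory dictionaries, and the extra `"Employer"` object are hypothetical, not Pythia's actual structures or data.

```python
from collections import defaultdict


def build_rdf_tuples(triples):
    """Group triples by subject, then by predicate.

    Returns ({subject: {predicate_id: [objects, ...]}}, predicate_ids),
    so each subject and each predicate string is stored only once.
    """
    predicate_ids = {}  # hypothetical dictionary: predicate -> small integer
    tuples = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        pid = predicate_ids.setdefault(p, len(predicate_ids))
        tuples[s][pid].append(o)
    return tuples, predicate_ids


triples = [
    ("IBM", "is-a", "Corporation"),
    ("IBM", "is-a", "Employer"),      # made-up second object for illustration
    ("Manos", "interned-at", "IBM"),
    ("Manos", "born-in", "1984"),
]
rdf_tuples, predicate_ids = build_rdf_tuples(triples)
```

After grouping, everything known about `"IBM"` sits in one record, so a query that reaches this subject needs a single lookup rather than one per statement.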





[Figure: Pythia architecture diagram. Components shown: Query Engine, URL Dictionary, two Hash Indexes, Repository for Very Large Objects, Main storage (S, P, O), Aux storage (O, P, S)]

Data layout on Pythia

Tuple 0

Tuple Metadata

Subject Resource

Predicates dictionary IDs

Objects: (if literal) Literal dictionary ID

Objects: (else) Object Resource and pageID, tupleID

Tuple 1

Tuple 2

Tuple 3
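The layout above can be approximated by a byte-level encoding of one tuple. The sketch uses Python's `struct` module. The field widths (32-bit little-endian integers), the one-byte tag that tells literals from resources, and the name `pack_tuple` are assumptions made here; the slides do not give the real on-page format.

```python
import struct

PAGE_SIZE = 4096  # 4K pages, one of the page sizes used in the talk


def pack_tuple(subject_id, predicates):
    """Serialize one RDF-tuple into bytes.

    predicates: list of (predicate_id, objects). Each object is either
    ("lit", literal_id) or ("ref", resource_id, page_id, tuple_id).
    """
    # Tuple metadata (number of predicates) and the subject resource.
    out = [struct.pack("<II", subject_id, len(predicates))]
    for predicate_id, objects in predicates:
        out.append(struct.pack("<II", predicate_id, len(objects)))
        for obj in objects:
            if obj[0] == "lit":
                # Literal object: tag 0 followed by the literal dictionary ID.
                out.append(struct.pack("<BI", 0, obj[1]))
            else:
                # Resource object: tag 1, resource ID, then (pageID, tupleID).
                out.append(struct.pack("<BIII", 1, obj[1], obj[2], obj[3]))
    return b"".join(out)


record = pack_tuple(7, [(0, [("ref", 99, 12, 3)]), (1, [("lit", 42)])])
```

Storing the (pageID, tupleID) pair next to each resource object means a link can be followed by reading the target page directly, without an index lookup in between.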

Storing Yago2 using Pythia

  • Yago2 is a semantic knowledge base, introduced by the Max Planck Institute in 2007, derived from Wikipedia, WordNet, and GeoNames (currently ~10M entries, 460M facts).

    Yago2 in Pythia

  • Initial data: 2.3GB

  • Main DB files: 1.3GB

  • Large objects: 192MB

    • Can be aggressively decreased with page-level compression (tuples will move to main file as well)

  • Indexes: 121MB (hash-based, in memory)

  • Dictionaries: 569MB

    • Possible optimization: Take into account type of literal (now string)

  • More than 99% of the SPO tuples can fit in a single 4K page

    Evaluating Pythia (Setup & Dataset)

    • Prototype C++ implementation

    • System Setup

      • 24-core Intel Xeon X560 with Linux x86_64 (2.6.32-28)

      • 32GB of memory

      • 12GB PCM card (Micron prototype card)

      • 74GB Flash card (fusionIO)

    • Workload: Yago2

    • Queries: a mix of 6 queries with randomized parameters

    How often can you ask Pythia?

    How fast does Pythia answer?

    Pythia vs RDF-3X

    • RDF-3X is the de facto research state of the art

    • Data is kept in a virtual table and accessed through compressed indexes

      • 6 indexes (all permutations of S,P,O) and 3 aggregate indexes
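The idea of indexing every permutation of S, P, O can be sketched with plain sorted lists. This is only an illustration of the six orderings; it does not model RDF-3X's compression or its aggregate indexes, and the sample triples are reused from the earlier RDF slide.

```python
from itertools import permutations

triples = [
    ("IBM", "is-a", "Corporation"),
    ("Manos", "interned-at", "IBM"),
    ("Manos", "born-in", "1984"),
]

# One sorted copy of the data per ordering of (S, P, O):
# SPO, SOP, PSO, POS, OSP, OPS.
indexes = {}
for order in permutations(range(3)):
    name = "".join("SPO"[i] for i in order)
    indexes[name] = sorted(tuple(t[i] for i in order) for t in triples)
```

With all six orderings available, any triple pattern with some bound components can be answered by a range scan on the index whose prefix matches the bound positions, at the cost of storing the data six times.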

    Pythia vs RDF-3X

    • Q1: Find all male citizens of Greece.

    • Q2: Find all OECD member economies that Switzerland deals with.

    • Q3: Find all mafia films that Al Pacino acted in.

    • Size on disk for Yago2: Raw data 2.3GB

    • Pythia: 2.2GB (no compression)

      1.5GB db files (on disk)

      0.7GB dictionaries/indexes (loaded in memory during startup)

    • RDF-3X: 2.2GB (aggressive compression)

      2.2GB a single file (on disk)



    Conclusions

    • Solid State Storage is naturally tailored for path processing

      • PCM, Flash and more new technologies

    • PCM's comparative advantage over Flash is lower read latency

      • 1.5x-2.5x speedup in a workload with dependent reads

    • Pythia: A solid-state-storage-aware path-processing system

      • 1.5x – 2.5x higher bandwidth on PCM compared to Flash

      • 1.5x – 2.0x lower response times on PCM compared to Flash

      • Competitive against state-of-the-art (RDF-3X)

    Thank you!

    Pythia (Greek: Πυθία; IPA pɪθiːɑː), commonly known as the Oracle of Delphi, was the priestess at the Temple of Apollo at Delphi, located on the slopes of Mount Parnassus, delivering prophecies.
