Path Processing using Solid State Storage


Path Processing using Solid State Storage. Manos Athanassoulis, DIAS, EPFL*; Mustafa Canim, IBM Watson Research Labs; Kenneth Ross, IBM Watson Research Labs, Columbia University; Bishwaranjan Bhattacharjee, IBM Watson Research Labs. *Work done during an internship at IBM.

Presentation Transcript

Path Processing using Solid State Storage

Manos Athanassoulis, DIAS, EPFL*

Mustafa Canim, IBM Watson Research Labs

Kenneth Ross, IBM Watson Research Labs, Columbia University

Bishwaranjan Bhattacharjee, IBM Watson Research Labs

*work done during an internship at IBM.

Why Path Processing?

  • Applications use linkage information
    • Social
    • Scientific
    • Government
    • Financial
    • Knowledge
      • Watson (Jeopardy Champ)
  • Graph processing alone is not enough
    • Link type modeled by RDF

Why Solid State Storage (SSS)?

  • Increasing capacity
    • Exponential increase
    • Follows Moore’s law
  • Read performance
    • Orders of magnitude faster than disks
    • Random read performance
      • Crucial for path processing
  • New technologies
    • Flash already mature
    • Phase Change Memory (PCM)
    • … more technologies are coming
Path processing

1) Cannot prefetch

2) Retrieve-data-then-follow-link

3) A lot of useless data are retrieved

How can Solid State Storage help?

Path processing (and Solid State Storage)

1) Small access latency

2) Read mostly useful data

3) Efficient random IO accesses

4) Can we do something better?

Build SSS-aware systems

In the rest of the talk …
  • RDF data model and systems
  • Solid State Storage for Path Processing
    • Technology
    • Flash vs PCM
  • Storing and managing RDF data over Solid State Storage
  • Conclusions
Resource Description Framework (RDF) meta-data model
  • Data is represented as statements, each consisting of a triple
    • Statement: <Subject, Predicate, Object>
  • Each statement describes a property of a subject:
    • <“IBM”, “is-a”, “Corporation”>
  • or a connection between two objects:
    • <“Manos”, “interned-at”, “IBM”>
  • or a value of a property of a subject:
    • <“Manos”, “born-in”, “1984”>
  • The actual notation is more complex:
    • Subjects are Uniform Resource Identifiers (URIs)
    • Predicates are URIs
    • Objects are either URIs or literals
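
As a small illustration of the model above, the three example statements can be written as subject, predicate, object records. This is only a sketch: the `Triple` type, the `objects_of` helper and the `http://example.org/` prefix are placeholders introduced here, not part of the talk.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <Subject, Predicate, Object>."""
    subject: str                  # always a URI
    predicate: str                # always a URI
    obj: str                      # a URI or a literal
    obj_is_literal: bool = False


# Placeholder URI prefix, used only for illustration.
EX = "http://example.org/"

triples = [
    Triple(EX + "IBM", EX + "is-a", EX + "Corporation"),
    Triple(EX + "Manos", EX + "interned-at", EX + "IBM"),
    Triple(EX + "Manos", EX + "born-in", "1984", obj_is_literal=True),
]


def objects_of(subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]
```

Calling `objects_of(EX + "Manos", EX + "interned-at")` returns the URI standing for IBM, which is itself the subject of another statement; following that reference is one step of a path query.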
RDF data management
  • Two alternatives are used to store data
  • Relational RDF storage
    • Use existing relational stores
    • Create relational tables
    • Basic approach: A triple-store
      • One big table with three columns
  • Native RDF storage
    • Tailored to the needs of the specific workload
    • No underlying system assumed

Can we take the best of both worlds?
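
To make the triple-store option concrete, here is a minimal sketch of "one big table with three columns" using Python's built-in `sqlite3` module. The table name, column names and the `two_hop` helper are assumptions made for this illustration; the talk does not prescribe a schema.

```python
import sqlite3

# In-memory database holding a single three-column triple table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("IBM", "is-a", "Corporation"),
        ("Manos", "interned-at", "IBM"),
        ("Manos", "born-in", "1984"),
    ],
)


def two_hop(conn, start, first_predicate, second_predicate):
    """Follow two links from `start`; each extra hop is one more self-join."""
    rows = conn.execute(
        """
        SELECT t2.object
        FROM triples AS t1
        JOIN triples AS t2 ON t1.object = t2.subject
        WHERE t1.subject = ? AND t1.predicate = ? AND t2.predicate = ?
        """,
        (start, first_predicate, second_predicate),
    ).fetchall()
    return [row[0] for row in rows]
```

The sketch shows why a plain triple store is awkward for path queries: every additional hop adds another self-join over the same large table.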

Outline
  • RDF data model and systems
  • Solid State Storage for Path Processing
    • Technology
    • Flash vs PCM
  • Storing and managing RDF data over Solid State Storage
  • Conclusions
Solid State Storage facts
  • We have access to a PCI-based PCM prototype (compared with fusionIO)
  • PCM prototype vs Flash state-of-the-art

*Very early Micron PCM prototype

Exploiting Solid State Storage for path processing
  • Path-processing involves link-following queries
    • Access latency is critical
  • Solid State Storage is tailored for path-processing:
    • Orders of magnitude lower read latency than traditional storage
    • Very fast random-read performance
  • PCM is expected to outperform Flash in read performance
  • Next in this talk:
    • PCM vs Flash when running link-following queries
    • Storing and managing RDF data on Solid State Storage
PCM vs Flash in path processing
  • Prototype implementation of link-following queries
  • Workload: given a randomly generated graph, execute link-following queries of variable length without buffering
  • Graph generation
    • 5GB of synthetic data with a random number of edges (between 3 and 30 edges per vertex)
  • Querying parameters
    • Number of threads (1, 2, 4, 8, 16, 32, 64, 96, 128, 192)
    • Page size (4K, 8K, 16K, 32K)
    • Length of the query (2, 4, 10, 100 accesses per query)

  • Hypothesis: PCM can offer important performance improvements
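
The workload above can be sketched in a few lines. This is an in-memory stand-in written for illustration, not the authors' prototype: the function names are hypothetical, each list lookup stands for one unbuffered page read, and threads and page sizes are left out.

```python
import random
from typing import List


def generate_graph(num_vertices: int, min_edges: int = 3, max_edges: int = 30,
                   seed: int = 0) -> List[List[int]]:
    """Random graph: every vertex gets between min_edges and max_edges out-links."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_vertices)
         for _ in range(rng.randint(min_edges, max_edges))]
        for _ in range(num_vertices)
    ]


def link_following_query(graph: List[List[int]], start: int, length: int,
                         rng: random.Random) -> List[int]:
    """Perform `length` dependent accesses, one per visited vertex."""
    visited = []
    current = start
    for _ in range(length):
        neighbours = graph[current]       # stands for one page read
        visited.append(current)
        current = rng.choice(neighbours)  # next address is known only now
    return visited
```

Because the next vertex is chosen only after the current one has been read, the accesses of a single query cannot be issued in parallel or prefetched, so the total time of a query is roughly its length multiplied by the device's read latency.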
PCM vs Flash

Query length: 100 hops

PCM performs consistently better for smaller page granularities

Building a SSS-aware RDF repository
  • We focused on building a graph-based RDF repository
  • We need to design a new system which:
    • Takes into account the graph-structure of the data
    • Supports any RDF-based query
  • We introduce Pythia, a new RDF repository, which uses:
    • The notion of RDF-tuple
    • New internal structures
    • New data layout
RDF-tuple

<Subject>,

<Predicate1>, {<Object1_1>, <Object1_2>, …},

<Predicate2>, {<Object2_1>, <Object2_2>, …},

<PredicateN>, {<ObjectN_1>, <ObjectN_2>, …},

  • The RDF-tuple design:
    • allows us to locate within a page the most important information of a Subject.
    • allows us to avoid repeating redundant information (Subject and Predicate resources)
      • This is further optimized by the URL Dictionary
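
The grouping behind the RDF-tuple can be sketched as follows. The helper name `to_rdf_tuples` is hypothetical, and the statement with object "Employer" is made up here only to show a predicate with more than one object.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def to_rdf_tuples(
    triples: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Group flat triples into subject -> predicate -> [objects]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    # Convert the nested defaultdicts into plain dictionaries.
    return {s: dict(preds) for s, preds in grouped.items()}


example = to_rdf_tuples([
    ("Manos", "born-in", "1984"),
    ("Manos", "interned-at", "IBM"),
    ("IBM", "is-a", "Corporation"),
    ("IBM", "is-a", "Employer"),   # hypothetical statement, for illustration
])
```

After grouping, each subject is stored once and each of its predicates once, no matter how many objects they have, which is the redundancy saving the slide describes.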
Pythia (architecture diagram)

  • DRAM: Query Engine, Literals Dictionary, URL Dictionary, two Hash Indexes
  • SSS: Main storage (S, P, O), Aux storage (O, P, S), Repository for Very Large Objects

Data layout on Pythia

Tuple 0

Tuple Metadata

Subject Resource

Predicates dictionary IDs

Objects: (if literal) Literal dictionary ID

Objects: (else) Object Resource and pageID, tupleID

Tuple 1

Tuple 2

Tuple 3
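
One way to read the layout above: an object that is itself a resource is stored together with the pageID and tupleID of its own RDF-tuple, so following a link is a single direct page read. The sketch below models that idea in memory. All class and field names are assumptions made for illustration, and the slide does not give the actual on-page encoding.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class LiteralObject:
    literal_id: int      # ID in the literals dictionary


@dataclass
class ResourceObject:
    resource: str        # the object resource
    page_id: int         # page that holds the object's own RDF-tuple
    tuple_id: int        # position of that tuple inside the page


@dataclass
class StoredTuple:
    subject: str
    # predicate dictionary ID -> objects stored for that predicate
    objects: Dict[int, List[Union[LiteralObject, ResourceObject]]]


# Hypothetical dictionaries and pages, for illustration only.
predicate_dict = {0: "interned-at", 1: "is-a", 2: "born-in"}
literal_dict = {0: "1984"}

pages: List[List[StoredTuple]] = [
    [StoredTuple("Manos", {0: [ResourceObject("IBM", 1, 0)],
                           2: [LiteralObject(0)]})],
    [StoredTuple("IBM", {1: [ResourceObject("Corporation", 1, 1)]}),
     StoredTuple("Corporation", {})],
]


def follow(ref: ResourceObject) -> StoredTuple:
    """Jump straight to the tuple a resource object points at."""
    return pages[ref.page_id][ref.tuple_id]


manos = pages[0][0]
ibm = follow(manos.objects[0][0])   # one hop: Manos, interned-at, IBM
```

In this reading, a hop needs no index lookup once the first tuple has been found; the index is only needed to locate the starting subject.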

Storing Yago2 using Pythia
    • Yago2 is a semantic knowledge base, introduced by the Max Planck Institute in 2007, derived from Wikipedia, WordNet, and GeoNames (currently ~10M entries, 460M facts).

Yago2 in Pythia

  • Initial data: 2.3GB
  • Main DB files: 1.3GB
  • Large objects: 192MB
    • Can be aggressively decreased with page-level compression (tuples will move to main file as well)
  • Indexes: 121MB (hash-based, in memory)
  • Dictionaries: 569MB
    • Possible optimization: Take into account type of literal (now string)
  • More than 99% of the SPO tuples can fit in a single 4K page
Evaluating Pythia (Setup & Dataset)
  • Prototype C++ implementation
  • System Setup
    • 24-core Intel Xeon X560 running Linux x86_64 (2.6.32-28)
    • 32GB of memory
    • 12GB PCM card (Micron prototype card)
    • 74GB Flash card (fusionIO)
  • Workload: Yago2
  • Queries: a mix of 6 queries with randomized parameters
Pythia vs RDF-3X
  • RDF-3X is the de facto research state of the art
  • Data is kept in a virtual table and accessed through compressed indexes
    • 6 indexes (all permutations of S,P,O) and 3 aggregate indexes
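
The six index orders can be enumerated directly. The snippet below is only an illustration of why every permutation is useful: whichever components of a triple pattern are bound, some index has them as its prefix. The `usable_indexes` helper is hypothetical and is not RDF-3X's query planner.

```python
from itertools import permutations
from typing import List

# All orderings of Subject, Predicate, Object.
index_orders = ["".join(p) for p in permutations("SPO")]


def usable_indexes(bound: str) -> List[str]:
    """Index orders whose leading components are exactly the bound ones.

    `bound` lists the constant positions of a triple pattern, e.g. "PO"
    for the pattern <?, predicate, object>.
    """
    k = len(bound)
    return [order for order in index_orders if set(order[:k]) == set(bound)]
```

For example, `usable_indexes("PO")` returns the POS and OPS orders, either of which can answer "which subjects have this predicate and object" with a prefix lookup.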
Pythia vs RDF-3X
  • Q1: Find all male citizens of Greece.
  • Q2: Find all OECD member economies that Switzerland deals with.
  • Q3: Find all mafia films that Al Pacino acted in.
  • Size on disk for Yago2 (raw data: 2.3GB)
    • Pythia: 2.2GB (no compression)
      • 1.5GB db files (on disk)
      • 0.7GB dictionaries/indexes (loaded in memory during startup)
    • RDF-3X: 2.2GB (aggressive compression)
      • 2.2GB in a single file (on disk)

Conclusions
  • Solid State Storage is naturally tailored for path processing
    • PCM, Flash and more new technologies
  • PCM’s comparative advantage over Flash is lower read latency
    • 1.5x-2.5x speedup in a workload with dependent reads
  • Pythia: A solid-state-storage-aware path-processing system
    • 1.5x – 2.5x higher bandwidth on PCM compared to Flash
    • 1.5x – 2.0x lower response times on PCM compared to Flash
    • Competitive against state-of-the-art (RDF-3X)
Thank you!

Pythia (Greek: Πυθία; IPA pɪθiːɑː), commonly known as the Oracle of Delphi, was the priestess at the Temple of Apollo at Delphi, located on the slopes of Mount Parnassus, delivering prophecies.
