A 3-Dimensional Data Model for Large Time-Series Dataset Analysis in HBase

Dan Han, Eleni Stroulia, University of Alberta. Outline: Background and Motivation; Related Work; A 3-Dimensional Data Model in HBase; Case Study and Experiment Results; Discussion; Conclusions and Future Work.

Presentation Transcript


  1. Dan Han, Eleni Stroulia • University of Alberta • A 3-Dimensional Data Model for Large Time-Series Dataset Analysis in HBase • MESOCA 2012

  2. Outline • Background and Motivation • Related Work • A 3-Dimensional Data Model in HBase • Case Study and Experiment Results • Discussion • Conclusions and Future Work

  3. Migrating Applications To the Cloud • The cloud is an attractive computing platform • Elasticity, excellent scalability, high availability, low operating cost • Applications are moving to the cloud • Social networking, online shopping, monitoring systems • Time-series data: grows monotonically over time • Analysis of large-scale time-series data • May lead to new knowledge • May lead to improvements of existing services • Successful adoption of this migration requires a new model of storage

  4. Migrating RDBMS Content To NoSQL • From RDBMS to NoSQL storage systems • Enable the storage of big data, sorted by row key • Scale horizontally across storage nodes easily • Not much data-organization support • Migration challenges • Few experiences and principles to follow • Steep learning curve for programming • Much experimentation is required before deployment • Much time is spent in designing the data schema • The “wrong” schema may lead to inefficient, high-latency queries

  5. We need Design Patterns for HBase Schemas • Our objective is to develop a systematic method for • Guiding data organization in NoSQL databases, given • the types of data stored, • the amount of data, • its usage patterns • We start our investigation with HBase • A NoSQL database offering, built on top of Hadoop • Parallel Distributed Computation • MapReduce Framework • Coprocessor Framework

  6. Related Work • Talks at HBaseCon 2012, held in May • Data schema and Coprocessor are the two main topics • Experience from 30 enterprises, such as Facebook, YapMap, eBay, Adobe • Organizing time-series data into period-specific “buckets” • OpenTSDB: a distributed, scalable time-series database, written on top of HBase • A data model in Cassandra, another NoSQL database offering • Applied in our case study
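The period-specific “buckets” idea from slide 6 can be sketched in a few lines. This is only an illustration under assumptions: the function name `bucket_row_key`, the one-hour default bucket, and the `<series>-<bucket start>` key layout are hypothetical and are not taken from the talk or from OpenTSDB's actual schema.

```python
from datetime import datetime, timezone


def bucket_row_key(series_id, ts, bucket_seconds=3600):
    """Map a reading's timestamp to (row key, offset inside the bucket).

    Hypothetical layout: all readings of one series that fall in the same
    period share a row key, and the offset tells them apart inside the row.
    """
    epoch = int(ts.timestamp())
    bucket_start = epoch - (epoch % bucket_seconds)  # round down to the period start
    offset = epoch - bucket_start                    # seconds since the period start
    # Zero-pad so that lexicographic key order matches chronological order.
    return f"{series_id}-{bucket_start:010d}", offset


if __name__ == "__main__":
    reading_time = datetime(2012, 9, 24, 10, 15, 30, tzinfo=timezone.utc)
    print(bucket_row_key("station-1", reading_time))
```

Zero-padding the bucket start matters because a store that sorts rows by key compares them as strings, so unpadded numbers would sort out of time order.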

  7. Data Organization in HBase • Cell in HBase • (Row, Family:Column, Version) => (X, Y, Z) = value • [Figure: axis diagrams, X-Y versus X-Y-Z]
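The (X, Y, Z) view of a cell on slide 7 can be sketched with a small in-memory model. This is plain Python, not the HBase client API: the class `CellStore` and its methods are hypothetical names, and the sketch assumes that the version dimension holds an integer such as a snapshot number or a timestamp.

```python
from collections import defaultdict


class CellStore:
    """Toy model of the (row, family:column, version) -> value mapping.

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self):
        # (row, column) -> {version: value}; the inner dict is the Z axis.
        self._cells = defaultdict(dict)

    def put(self, row, column, version, value):
        self._cells[(row, column)][version] = value

    def get(self, row, column, version=None):
        """Return the value at an exact version, or the newest if version is None."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def history(self, row, column):
        """Return all (version, value) pairs of one cell, oldest first."""
        return sorted(self._cells.get((row, column), {}).items())


if __name__ == "__main__":
    store = CellStore()
    store.put("particle-42", "prop:mass", 1, 0.5)
    store.put("particle-42", "prop:mass", 2, 0.7)
    print(store.get("particle-42", "prop:mass"))
    print(store.history("particle-42", "prop:mass"))
```

Keeping every version under one (row, column) pair is what lets a single lookup return the whole time series of a value, which is the property the later slides rely on.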

  8. Case study: The Datasets • Cosmology Dataset • Product of an N-body simulation • Three types of particles: dark matter, gas and star • Particles evolve over a series of discrete timestamps • Each snapshot records the properties of all particles at the time of the snapshot • 9 snapshots, consisting of 321,065,547 particles • Bixi Dataset • Data from a bicycle-renting service in the city of Montreal • Every minute, statistics about bike usage at a station are collected by a sensor • 96,842 data points involved

  9. Three Schemas for the Cosmology Dataset • Schema1 Schema2 Schema3 • Region 24-2-33446666 2-33446666 2-00005533 • Region 64-2-33559999 2-33550000 2-66664433 • Region 84-2-33550000 2-33559999 2-99995533 • [Figure: schema diagrams on X, Y, Z axes]

  10. Three Schemas for the Bixi Dataset • Schema1, Schema2, Schema3 • [Figure: schema diagrams with axes labeled X, Time, and metrics]

  11. Experiment Results • Experiment Environment • Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support) • A four-node cluster on virtual machines • Queries for each dataset • Three queries on the Cosmology dataset, from related research • One query on the Bixi dataset, from business requirements • Query processing implementation • Native Java API • User-level Coprocessor implementation

  12. Query1 of Cosmology Dataset • Get all the particles of a given type in a given snapshot whose property matches a given expression
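Query1 is a filter over three conditions. A minimal sketch, assuming each particle record is a dictionary with hypothetical `id`, `type`, and `snapshot` keys plus its properties, and that the expression is passed as a Python predicate; it shows the logic only, not how the talk's HBase implementation evaluates it:

```python
def select_particles(particles, ptype, snapshot, predicate):
    """Return the ids of particles of one type in one snapshot that satisfy predicate.

    particles: iterable of dicts with hypothetical keys 'id', 'type', 'snapshot'
    and one key per property (for example 'mass').
    """
    return [
        p["id"]
        for p in particles
        if p["type"] == ptype and p["snapshot"] == snapshot and predicate(p)
    ]


if __name__ == "__main__":
    sample = [
        {"id": 1, "type": "gas", "snapshot": 1, "mass": 0.5},
        {"id": 2, "type": "gas", "snapshot": 1, "mass": 2.0},
    ]
    print(select_particles(sample, "gas", 1, lambda p: p["mass"] > 1.0))
```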

  13. Query2 of Cosmology Dataset • Get all the particles added/destroyed between snapshots S1 and S2
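Query2 amounts to two set differences over particle identifiers. The sketch below assumes the identifiers present in each snapshot have already been read into memory; the function name `added_and_destroyed` is hypothetical and the code says nothing about how the talk's implementation scans HBase.

```python
def added_and_destroyed(ids_s1, ids_s2):
    """Compare the particle ids of two snapshots.

    Returns (added, destroyed): ids only in the later snapshot S2,
    and ids only in the earlier snapshot S1.
    """
    before, after = set(ids_s1), set(ids_s2)
    added = after - before       # present in S2 but not in S1
    destroyed = before - after   # present in S1 but not in S2
    return added, destroyed


if __name__ == "__main__":
    print(added_and_destroyed([1, 2, 3], [2, 3, 4]))
```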

  14. Query3 of Cosmology Dataset • Get the values of the property for the given set of particles across the selected snapshots.
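Query3 reads naturally as a lookup along the version axis. A minimal sketch, assuming a hypothetical in-memory layout in which `cells` maps `(particle_id, property)` to a dictionary of `{snapshot: value}`; this mirrors the idea of storing snapshots as versions but is not HBase code:

```python
def property_across_snapshots(cells, particle_ids, prop, snapshots):
    """Collect one property for several particles over selected snapshots.

    cells: dict mapping (particle_id, prop) -> {snapshot: value}
    (hypothetical layout, with the snapshot playing the role of the version).
    Snapshots in which a particle has no value are skipped.
    """
    result = {}
    for pid in particle_ids:
        versions = cells.get((pid, prop), {})
        result[pid] = {s: versions[s] for s in snapshots if s in versions}
    return result


if __name__ == "__main__":
    cells = {("p1", "mass"): {1: 0.5, 2: 0.6, 3: 0.7}}
    print(property_across_snapshots(cells, ["p1"], "mass", [1, 3]))
```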

  15. Bixi Query • For a given list of stations and a time, get their average bike usage for the last 1, 2, 4, 8 and 16 days
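The Bixi query can be sketched as a windowed average. The slide does not define “average bike usage”, so this sketch assumes it is the mean of the per-reading usage values that fall inside each window; the function name `average_usage` and the `(station_id, timestamp, usage)` tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def average_usage(readings, stations, end, windows_days=(1, 2, 4, 8, 16)):
    """Average usage per station over the windows ending at `end`.

    readings: list of (station_id, timestamp, usage) tuples (assumed layout).
    Returns {days: {station_id: mean usage}}; a station with no readings
    in a window is left out of that window's result.
    """
    wanted = set(stations)
    result = {}
    for days in windows_days:
        start = end - timedelta(days=days)
        sums, counts = {}, {}
        for station, ts, usage in readings:
            # Window is half-open: after start, up to and including end.
            if station in wanted and start < ts <= end:
                sums[station] = sums.get(station, 0.0) + usage
                counts[station] = counts.get(station, 0) + 1
        result[days] = {s: sums[s] / counts[s] for s in sums}
    return result


if __name__ == "__main__":
    data = [
        ("s1", datetime(2012, 9, 16, 12), 4.0),
        ("s1", datetime(2012, 9, 15, 12), 8.0),
    ]
    print(average_usage(data, ["s1"], datetime(2012, 9, 17)))
```

The sketch rescans all readings for every window for clarity. Because the windows are nested, a real implementation could reuse the sums of the shorter windows when computing the longer ones.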

  16. Discussion • “Qualitative” versus “Quantitative” Suggestions • Dynamic Data versus Static Data • Historical Datasets versus Real-Time Datasets • Supported versus Non-Supported Datasets

  17. Conclusion • A 3-dimensional data model • Improved performance can be obtained from data schemas that use the version dimension of HBase • Fits “write-once, read-many” systems • Monitoring systems • Sensor-based systems • Version-based analysis

  18. Future Work • More evaluation of this data model • Scalability, elasticity, and utilization • How to design data models for other datasets • Spatial datasets • Graphic datasets

  19. Questions? Thank you
