
A fast time series data server



Presentation Transcript


  1. A fast time series data server. Bob Weigel, George Mason University. Status: in development.

  2. Motivation
  • Want to do fast, large-scale analysis on time series data.
  • Data volume and data processing speed often do matter!
  • Speed enables many services.
  • If you want users to contribute data, provide:
    • $,
    • free storage,
    • better organization and search than their OS and local file system provide, or
    • services on their data that are better than what they can do on their local machine.

  3. Demo

  4. The problems
  • Heliophysics "databases":
    • The "granule" paradigm: the fundamental unit is the granule (a file), which contains many parameters.
    • The "small-box" paradigm: given a user request, return a list of granule URLs that match, and the user must do the rest. This can slow query response times by up to a factor of 100!
  • The fundamental unit exposed to the scientist should be the data set. This requires "aggregation", which can be client-side or server-side.
  • Well-known and widely available RDBMSs do not work well for time series ("column-based" versus "row-based" storage).

  5. Approaches for Large-Scale Analysis
  0. Let the user do the "aggregation".
  • Service: the "run-on-demand" paradigm. A reader (or "accessor") is developed for each data provider that downloads data to the user's computer, extracts the relevant parts, and puts the data in a uniform form in an array or structure in the user's analysis program.
    • Disadvantages: requires high server reliability (servers are typically run by scientists …). Higher server load and higher data transfer volume.
    • Advantages: no additional disk space; always up to date.
  • Service: the "pre-caching" approach. The data are stored in a uniform manner on an intermediate server, and the user makes a request to a single server. (A sketch of this cache-first pattern follows below.)
    • Disadvantages: more disk space; the cache may be out of date.
    • Advantages: 5-100x speed-ups in response; reliability (errors and server problems are caught ahead of time); many new services will be enabled.
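In essence, pre-caching places a cache-first lookup in front of a run-on-demand reader. A minimal sketch of that pattern in Java, where the class name, cache directory, and fetch logic are hypothetical illustrations rather than part of the actual codebase:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CacheFirstAccessor {
        // Hypothetical local cache directory; not part of the actual server layout.
        private static final Path CACHE_DIR = Paths.get("cache");

        // Return cached data if present; otherwise fall back to run-on-demand
        // retrieval from the original provider and populate the cache.
        public static Path getDataSet(String id, URL providerUrl) throws IOException {
            Path cached = CACHE_DIR.resolve(id + ".bin");
            if (Files.exists(cached)) {
                return cached;                       // pre-cached: fast path
            }
            Files.createDirectories(CACHE_DIR);
            try (InputStream in = providerUrl.openStream()) {
                Files.copy(in, cached);              // run-on-demand: slow path, fills the cache
            }
            return cached;
        }
    }

Either way the caller sees a single uniform local copy of the data, which is what makes the two approaches composable.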

  6. Ideal Approach
  • Pre-caching requires "run-on-demand" solutions, but takes the data a step further.
  • A "run-on-demand" approach will eventually develop a caching layer anyway, so it is better to develop caching as a separate component.
  • Therefore: use "pre-caching" for reasonably sized data sets, and use "run-on-demand" for large data sets and for filling the cache.
  • A significant portion of heliophysics data could be pre-cached.

  7. Question: why hasn’t this been done before?
  • It looks like data centralization.
  • Without an improved database, improvements using the existing infrastructure are incremental.

  8. Only one data type
  • Focus on only one data type: time series.
  • Defined as a time-ordered sequence of values:
    • Scalar: x(t), x(t+1), …
    • Vector: Bx(t), By(t), Bx(t+1), By(t+1), … (see the sketch after this list)
    • Spectrogram: A1(t), A2(t), …, AN(t), A1(t+1), A2(t+1), …, AN(t+1), …
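For the vector (and spectrogram) cases, the components or channels are interleaved in time order, so a two-component field sampled at N times flattens to a single sequence of length 2N. A small illustrative sketch in Java; the array names and values are made up:

    public class VectorLayout {
        public static void main(String[] args) {
            // Example Bx and By samples at three times (values are made up).
            double[] bx = {1.0, 1.1, 1.2};
            double[] by = {5.0, 5.1, 5.2};

            // Interleave components in time order: Bx(t), By(t), Bx(t+1), By(t+1), ...
            double[] flat = new double[2 * bx.length];
            for (int t = 0; t < bx.length; t++) {
                flat[2 * t]     = bx[t];
                flat[2 * t + 1] = by[t];
            }
            System.out.println(java.util.Arrays.toString(flat));
            // prints [1.0, 5.0, 1.1, 5.1, 1.2, 5.2]
        }
    }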

  9. Development history • Developed as a part of ViRBO • Built on OPeNDAP

  10. Codebase
  • Java
  • OPeNDAP
  • Have written an "I/O Service Provider" for the data files.
  • Added the ability to pass time constraint expressions.
  • Added the ability to output data as an ASCII table.
  • Added basic filters.

  11. Technical details
  • Each time series is stored as a single flat binary file of IEEE 754 floating point values (a sketch of this layout follows the list).
  • Time series that are close to being on a uniform grid are re-gridded, with fill values inserted for missing samples.
  • All time series use a single fill value: NaN.
  • Files are stored on a compressed file system that supports fast random access to compressed files; access is about 6x slower, but the compression ratio is usually about 8.
  • Files are stored on a versioning file system, so only differences are stored.
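A minimal sketch of writing such a file in Java, with NaN as the fill value for a missing sample on the uniform grid. The file name and values are made up, and the big-endian byte order produced by DataOutputStream is an assumption of this sketch rather than a documented property of the server:

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class FlatBinaryWriter {
        public static void main(String[] args) throws IOException {
            // Values on a uniform time grid; the gap is represented by the single fill value NaN.
            double[] values = {1.3, 2.7, Double.NaN, 4.1, 3.9};

            // DataOutputStream writes IEEE 754 doubles, 8 bytes each, big-endian.
            try (DataOutputStream out =
                     new DataOutputStream(new FileOutputStream("TimeSeries.bin"))) {
                for (double v : values) {
                    out.writeDouble(v);
                }
            }
        }
    }

Because every record has a fixed width, the byte offset of any sample is directly computable, which is what makes the byte-range API on the next slide practical.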

  12. API – lowest level
  • HTTP byte-range requests against two files per time series (see the client sketch below):
    • http://timeseries.org/data/TimeSeries.ncml (contains data structure information and a URL to the science metadata)
    • http://timeseries.org/data/TimeSeries.bin (just a time-ordered set of values: Bx(t), By(t), Bx(t+1), By(t+1), …)
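Because TimeSeries.bin is just a sequence of fixed-width IEEE 754 values, a client can fetch any contiguous run of records with a standard HTTP Range header and decode the bytes directly. A sketch under the assumptions above (a scalar series of 8-byte big-endian doubles; the chosen record offsets are arbitrary):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ByteRangeClient {
        public static void main(String[] args) throws IOException {
            // Request records 100-199 of a scalar series; each value is 8 bytes.
            long first = 100, count = 100;
            long startByte = first * 8;
            long endByte = (first + count) * 8 - 1;

            URL url = new URL("http://timeseries.org/data/TimeSeries.bin");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Range", "bytes=" + startByte + "-" + endByte);

            // A 206 Partial Content response contains only the requested bytes.
            try (DataInputStream in = new DataInputStream(conn.getInputStream())) {
                for (long i = 0; i < count; i++) {
                    System.out.println(in.readDouble());
                }
            }
        }
    }

The mapping from a time range to record indices would presumably come from the structure information in the .ncml file.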

  13. API – highest level
  • DAP protocol (builds on HTTP), with constraint expressions in the query string (a client sketch follows below):
    • http://timeseries.org/data/TimeSeries.{ascii,bin,dods,dat,etc.}?time<1999:01:01
    • http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10
    • http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10&filter=5minboxcar
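From the client's side, the highest-level API is just an HTTP GET with a constraint expression in the query string; the .ascii suffix asks for the result as an ASCII table. A sketch of such a client, with the constraint string copied from the slide; note that some HTTP stacks require the comparison operators to be percent-encoded (%3C for <, %3E for >):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DapAsciiClient {
        public static void main(String[] args) throws IOException {
            // Constraint expression taken verbatim from the slide.
            String request = "http://timeseries.org/data/TimeSeries.ascii"
                    + "?time<1999:01:01&value>10&filter=5minboxcar";

            HttpURLConnection conn =
                    (HttpURLConnection) new URL(request).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // the server returns an ASCII table
                }
            }
        }
    }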

  14. Future
  • Add a submission API.
  • Implement the versioning file system.
  • Implement a suite of filters.
  • Add the ability to scale.
  • Implement a suite of applications.
  • Connect to the Universal Reader Library.
  • Connect to QDataSet.
