Analysis with Extremely Large Datasets. CHEP’2012 New York, USA. Jacek Becla SLAC National Accelerator Laboratory. Outline. Background. Techniques. SciDB, Qserv and XLDB-2012. The Main Challenge. Scale. & data complexity. & usage. Storing petabytes is easy.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
New York, USA
SLAC National Accelerator Laboratory
& data complexity
Storing petabytes is easy.
It is what you do with them that matters
multiple data sources
Lots of commonalities between science and industry
Data access drives the architecture more than anything else!
$0 billion market for analytics
Analytics as service / private clouds
One size does not fit all
In database + near database processing
Easy crude cost estimate
Easy intermediate results
Chunking = indexing
“download now, try it out, and give us feedback and feature requests”
– SciDB CEO
Intercepting user queries
Metadata, result cache
Worker dispatch, query fragmentation generation, spatial indexing, query recovery, optimizations, scheduling, aggregation
MySQL dispatch, shared scanning, optimizations, scheduling
Single node RDBMS
Auto-load balancing between MySQL servers, auto fail-over
Worker failure recovery
carry no state
Two copies of data on different workers