Big Data Open Source Software and Projects ABDS in Summary VIII: Level 11A

Big Data Open Source Software and Projects ABDS in Summary VIII: Level 11A. I590 Data Science Curriculum August 15 2014. Geoffrey Fox [email protected] School of Informatics and Computing Digital Science Center Indiana University Bloomington. HPC-ABDS Layers.

Big Data Open Source Software and Projects ABDS in Summary VIII: Level 11A

I590 Data Science Curriculum

August 15 2014

Geoffrey Fox

[email protected]

School of Informatics and Computing

Digital Science Center

Indiana University Bloomington

HPC-ABDS Layers

Here are 17 functionalities. Technologies are presented in this order:

4 cross-cutting functionalities at the top

13 in the order of the layered diagram, starting at the bottom

  • Message Protocols

  • Distributed Coordination:

  • Security & Privacy:

  • Monitoring:

  • IaaS Management from HPC to hypervisors:

  • DevOps:

  • Interoperability:

  • File systems:

  • Cluster Resource Management:

  • Data Transport:

  • SQL / NoSQL / File management:

  • In-memory databases & caches / Object-relational mapping / Extraction Tools

  • Inter-process communication: collectives, point-to-point, publish-subscribe

  • Basic Programming model and runtime, SPMD, Streaming, MapReduce, MPI:

  • High level Programming:

  • Application and Analytics:

  • Workflow-Orchestration:

File management and Formats


  • The Integrated Rule-Oriented Data System (iRODS) is BSD-licensed open source data management software in use at research organizations and government agencies worldwide.

  • iRODS is a production-level distribution aimed at deployment in mission-critical environments.

  • It functions independently of storage resources and abstracts data control away from storage devices and device location, allowing users to take control of their data.

  • iRODS abstracts data services from data storage to facilitate executing services across heterogeneous, distributed storage systems.

  • iRODS empowers data stewards by partitioning policies, rules, and services they develop from repository management.

  • iRODS executes your data policies on your schedule.

  • iRODS virtualizes data policy by separating data management policy enforcement from repository management.

Scientific Data/File Formats I: NetCDF, CDF

  • NetCDF (Network Common Data Form) is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project homepage is hosted by the University Corporation for Atmospheric Research (UCAR). This work started in 1989 and was based on CDF.

  • Common Data Format (CDF) is a library and toolkit that was developed by the National Space Science Data Center (NSSDC) at NASA starting in 1985. The software is an interface for the storage and manipulation of multi-dimensional data sets.

Scientific Data/File Formats II: HDF

  • Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of numerical data. Originally developed at the National Center for Supercomputing Applications, it is supported by the non-profit HDF Group, whose mission is to ensure continued development of HDF5 technologies, and the continued accessibility of data stored in HDF.

  • In keeping with this goal, the HDF format, libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms, including Java, MATLAB/Scilab, Octave, IDL, Python, and R. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).

  • The current version, HDF5, differs significantly in design and API from the major legacy version HDF4.

  • HDF5 simplifies the file structure to include only two major types of object:

    • Datasets, which are multidimensional arrays of a homogeneous type

    • Groups, which are container structures which can hold datasets and other groups

  • This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are even accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.

  • In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.

  • The latest version of NetCDF, version 4, is based on HDF5.
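
The two HDF5 object types and the path-style access described above can be illustrated with a toy in-memory model. This is only a sketch of the data model, not the HDF5 library or any of its language bindings; the class names `Group` and `Dataset`, the helper `resolve`, and the sample names and values are all hypothetical.

```python
# Toy in-memory model of the HDF5 object layout; this is NOT the HDF5 API.
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """A homogeneous array (one-dimensional here) plus named attributes."""
    data: List[float]
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Group:
    """A container holding datasets and other groups, plus named attributes."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, str] = field(default_factory=dict)


def resolve(root: Group, path: str) -> Union[Group, Dataset]:
    """Walk a POSIX-like path such as /run1/temperature from the root group."""
    node: Union[Group, Dataset] = root
    for name in filter(None, path.split("/")):
        if not isinstance(node, Group):
            raise KeyError(f"{name!r} lies below a dataset, not a group")
        node = node.children[name]
    return node


# Hypothetical file contents: one group holding one dataset.
root = Group()
root.children["run1"] = Group(attrs={"instrument": "hypothetical-sensor"})
root.children["run1"].children["temperature"] = Dataset(
    data=[280.5, 281.0, 279.8], attrs={"units": "K"}
)

temperature = resolve(root, "/run1/temperature")
print(temperature.data, temperature.attrs["units"])
```

Metadata lives in the `attrs` dictionaries rather than in the array itself, which mirrors the attribute mechanism the slide describes.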

Scientific Data/File Formats III: OPeNDAP

  • OPeNDAP ("Open-source Project for a Network Data Access Protocol") is a data transport architecture and protocol widely used by earth scientists. The protocol is based on HTTP and the current specification is OPeNDAP 2.0 draft.

    • OPeNDAP includes standards for encapsulating structured data, annotating the data with attributes and adding semantics that describe the data.

    • The protocol is maintained by a publicly funded non-profit organization that also provides free reference implementations of OPeNDAP servers and clients.

  • An OPeNDAP client could be an ordinary browser or a specialized visualization/analysis engine.

  • An OPeNDAP client sends requests to an OPeNDAP server, and receives various types of documents or binary data as a response. One such document is called a DDS (received when a DDS request is sent), that describes the structure of a data set. A data set, seen from the server side, may be a file, a collection of files or a database. Another document type that may be received is DAS, which gives attribute values on the fields described in the DDS. Binary data is received when the client sends a DODS request.

  • An OPeNDAP server can serve an arbitrarily large collection of data. Data on the server is often in HDF or NetCDF format, but can be in any format including a user-defined format. Compared to ordinary file transfer protocols (e.g. FTP), a major advantage using OPeNDAP is the ability to retrieve subsets of files, and also the ability to aggregate data from several files in one transfer operation.

  • OPeNDAP is widely used by governmental agencies such as NASA and NOAA to serve satellite, weather and other observed earth science data.
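
The request types and subsetting described above can be sketched as URL construction. This assumes the commonly used DAP2 convention of appending `.dds`, `.das`, or `.dods` to the dataset URL and selecting array subsets with `[start:stride:stop]` index ranges; the server URL, the variable name `sst`, and the helper `dap2_url` are hypothetical, and no network request is made.

```python
from urllib.parse import quote

# Hypothetical dataset URL; this is not a real server.
BASE = "http://example.org/opendap/sst.nc"


def dap2_url(base, response, variable=None, slices=()):
    """Build a DAP2-style request URL.

    response: "dds" (structure), "das" (attributes), or "dods" (binary data).
    slices:   one (start, stride, stop) index triple per array dimension.
    """
    if response not in ("dds", "das", "dods"):
        raise ValueError(f"unknown response type: {response}")
    url = f"{base}.{response}"
    if variable is not None:
        # Constraint expression: variable name followed by one range per dimension.
        hyperslab = "".join(f"[{a}:{s}:{b}]" for a, s, b in slices)
        url += "?" + quote(variable + hyperslab, safe="[]:,")
    return url


print(dap2_url(BASE, "dds"))
print(dap2_url(BASE, "das"))
# Ask only for time step 0 and every second point of indices 10..20.
print(dap2_url(BASE, "dods", "sst", [(0, 1, 0), (10, 2, 20)]))
```

The last URL shows the advantage over plain file transfer: only the requested slice of the variable would be sent, not the whole file.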

Scientific Data/File Formats IV: FITS

  • Flexible Image Transport System (FITS) is an open astronomy standard defining a digital file format useful for storage, transmission and processing of scientific and other images.

    • FITS is the most commonly used digital file format in astronomy.

    • Unlike many image formats, FITS is designed specifically for scientific data and hence includes many provisions for describing photometric and spatial calibration information, together with image origin metadata.

  • The FITS format was first standardized in 1981; it has evolved gradually since then, and the most recent version (3.0) was standardized in 2008.

  • FITS was designed with an eye towards long-term archival storage, and the maxim "once FITS, always FITS" represents the requirement that developments to the format must be backwards compatible.

  • A major feature of the FITS format is that image metadata is stored in a human-readable ASCII header, so that an interested user can examine the headers to investigate a file of unknown provenance.

  • The header information lets a reader calculate the byte offset of a given item in the subsequent data unit, which supports direct access to the data cells.

  • Each FITS file consists of one or more headers containing ASCII card images (80 character fixed-length strings) that carry keyword/value pairs, interleaved between data blocks. The keyword/value pairs provide information such as size, origin, coordinates, binary data format, free-form comments, history of the data, and anything else the creator desires: while many keywords are reserved for FITS use, the standard allows arbitrary use of the rest of the name-space.
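
The card layout above can be made concrete with a simplified reader. It assumes 80-character cards with the keyword in the first 8 characters and a `= ` value indicator after it, an `END` card closing the header, and padding to 2880-byte blocks. Quoted string values and continuation cards are deliberately ignored, the sample header is made up, and the helper names are hypothetical.

```python
CARD_LEN = 80      # each header card is a fixed 80-character ASCII string
BLOCK_LEN = 2880   # header and data units are padded to 2880-byte blocks


def make_card(keyword, value, comment=""):
    """Format one keyword/value card (simplified: no quoted string values)."""
    text = f"{keyword:<8}= {value:>20}"
    if comment:
        text += f" / {comment}"
    return text.ljust(CARD_LEN)[:CARD_LEN]


def parse_header(raw):
    """Return (keyword -> value string, byte offset of the data unit)."""
    text = raw.decode("ascii")
    cards = {}
    for start in range(0, len(text), CARD_LEN):
        card = text[start:start + CARD_LEN]
        keyword = card[:8].strip()
        if keyword == "END":
            header_len = start + CARD_LEN
            # Round the header length up to the next block boundary.
            data_offset = -(-header_len // BLOCK_LEN) * BLOCK_LEN
            return cards, data_offset
        if card[8:10] == "= ":  # value indicator; other cards are commentary
            cards[keyword] = card[10:].split("/", 1)[0].strip()
    raise ValueError("no END card found")


# Made-up header for a 100 x 50 image of 16-bit integers.
header = "".join([
    make_card("SIMPLE", "T", "conforms to FITS standard"),
    make_card("BITPIX", 16, "bits per pixel"),
    make_card("NAXIS", 2),
    make_card("NAXIS1", 100),
    make_card("NAXIS2", 50),
    "END".ljust(CARD_LEN),
])
raw = header.ljust(BLOCK_LEN).encode("ascii")

cards, data_offset = parse_header(raw)
data_bytes = abs(int(cards["BITPIX"])) // 8 * int(cards["NAXIS1"]) * int(cards["NAXIS2"])
print(cards, data_offset, data_bytes)
```

Because the header alone gives the data offset and the array shape, a reader can seek directly to any pixel without scanning the data unit.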

RCFile Hive Data Format I

  • See Hive’s RCFile (Record Columnar File), introduced in 2011

  • Row groups are distributed among nodes as in a row store, but data within each group is organized by columns

  • Minimize unnecessary I/O operations

    • In a row group, table is partitioned by columns

    • Only read needed columns from disks

  • Minimize network costs in row construction

    • All columns of a row are located in same HDFS block

  • Comparable data loading speed to Row-Store

    • Only adding a vertical-partitioning operation in the data loading procedure of Row-Store

  • Applying efficient data compression algorithms

    • Can use compression schemes used in Column-store
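
The layout idea above (horizontal partitioning into row groups, then column-wise storage inside each group) can be sketched in a few lines. This is an in-memory illustration only, not the RCFile on-disk format; the function names, the tiny group size, and the sample rows are hypothetical.

```python
def to_row_groups(rows, group_size):
    """Split rows into row groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) turns a list of rows into a list of columns.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def read_column(groups, col_index):
    """Read one column, touching only that column's data in each group."""
    return [value for group in groups for value in group[col_index]]


def rebuild_row(groups, group_size, row_index):
    """Reassemble a row; all of its columns live in the same row group."""
    group_index, offset = divmod(row_index, group_size)
    return tuple(column[offset] for column in groups[group_index])


rows = [(1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0), (4, "d", 4.0), (5, "e", 5.0)]
groups = to_row_groups(rows, group_size=2)

print(read_column(groups, 1))
print(rebuild_row(groups, 2, 4))
```

Reading a column skips the other columns in every group, while rebuilding a row never leaves its group, which corresponds to the I/O and network savings listed above.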

RCFile Hive Data Format II

[Figure: "RCFile: Combined row-stores and column-store". Panel labels: "Unnecessary I/O transfers" and "Unnecessary network transfers (MBytes)".]

ORC Hive Data Format


  • Hive’s RCFile has been the standard format for storing Hive data since 2011. However, RCFile has limitations because it treats each column as a binary blob without semantics.

  • Hive 0.11 adds a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding, resulting in dramatically smaller files.

  • Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes.

  • Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that are irrelevant to the query.

  • Finally, ORC works together with the query vectorization work, providing a high-bandwidth reader/writer interface.
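
The min/max index and filter pushdown described above can be sketched as follows. This is a simplified single-column illustration under stated assumptions, not ORC's actual index structure or reader API; the function names and the sample data are hypothetical, and only a "greater than" filter is handled.

```python
def build_index(values, stride=10_000):
    """Record (start, min, max) for each set of `stride` consecutive rows."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append((start, min(chunk), max(chunk)))
    return index


def scan_greater_than(values, index, stride, threshold):
    """Return matching values and the number of row sets actually read."""
    hits, sets_read = [], 0
    for start, _low, high in index:
        if high <= threshold:
            continue  # no row in this set can match, so skip it entirely
        sets_read += 1
        hits.extend(v for v in values[start:start + stride] if v > threshold)
    return hits, sets_read


values = list(range(50_000))          # hypothetical sorted column
index = build_index(values)
hits, sets_read = scan_greater_than(values, index, 10_000, 44_999)
print(len(hits), "matches found after reading", sets_read, "of", len(index), "row sets")
```

Skipping works best when values are clustered, as in this sorted example; on randomly ordered data most row sets would overlap the filter range and still be read.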


Apache Parquet IO/Format I


  • In its relatively short lifetime (co-founded by Twitter and Cloudera in July 2013), Parquet has already become the de facto standard for efficient columnar storage of Apache Hadoop data,

    • with native support in Impala, Apache Hive, Apache Pig, Apache Spark, MapReduce, Apache Tajo, Apache Drill, Apache Crunch, Cascading, Scalding, Kite, Presto, and Shark.

  • Supports data models: Apache Avro, Thrift, and Google Protocol Buffers

  • Based on the Google Dremel paper



Apache Parquet IO/Format II


  • Parquet makes the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

  • Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

  • Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
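
The idea of choosing a compression scheme per column can be illustrated with standard-library codecs. This is only a sketch of the concept: the JSON serialization is a stand-in and not Parquet's encoding, and the codec table, function names, column names, and sample rows are all hypothetical.

```python
import bz2
import json
import zlib

# Codec name -> (compress, decompress); "none" stores the bytes unchanged.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "none": (lambda b: b, lambda b: b),
}


def write_columns(rows, schema):
    """Store each column separately, compressed with its own codec.

    schema maps column name -> codec name.
    """
    stored = {}
    for name, codec in schema.items():
        raw = json.dumps([row[name] for row in rows]).encode("utf-8")
        stored[name] = (codec, CODECS[codec][0](raw))
    return stored


def read_column(stored, name):
    """Decompress and decode a single column without touching the others."""
    codec, blob = stored[name]
    return json.loads(CODECS[codec][1](blob).decode("utf-8"))


rows = [{"country": "US", "clicks": i} for i in range(1000)]
stored = write_columns(rows, {"country": "zlib", "clicks": "none"})

for name, (codec, blob) in stored.items():
    print(name, codec, len(blob), "bytes")
```

The highly repetitive `country` column shrinks sharply under zlib, while `clicks` is left uncompressed; recording the codec next to each column is what lets the choice vary per column.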