Database File Systems in Support of eScience

Philip A. Adams – LLNL/National Ignition Facility

John C. Hax – Oracle Corporation


Science – A product of data analysis

“Science does not result from the launch of a mission or the collection of data. Rather, science only occurs through the analysis and understanding of that data.”

- Philosophy of the NASA Science Mission Directorate (SMD)


Questions to Ask

  • Are we building IT systems that support research and analysis, or infrastructure that supports the collection of data?


Scientific Computing History

Scientific Systems – Scientific (minimal data shared)

  • Raw data
  • Decentralized/desktop management
  • Open source software
  • Low quality of support/service
    • Best effort
    • Mission-critical operations
  • Primarily file based – HDF5, Lustre
    • Millions of files
    • Write once, read many
  • Background processing
    • Pipelines
    • Computationally intensive applications
    • Long-running transactions
    • Output of large data sets
  • Single application profile

vs.

Commercial Relational Databases – Enterprise (all data shared)

  • Metadata
  • Centralized management
  • Industrial-strength software
  • High quality of support/service
    • SLA guarantees
    • Mission-critical operations
  • Databases & files
    • Read and update
    • Enforced data integrity
  • Interactive processing
    • Interactive workflows
    • Transactional, intensive applications
    • Short-running transactions (<8 hours)
    • Output of individual rows
  • Mixed application profile


Filesystems and Legacy Databases – The Gap

Filesystem Benefits

  • Provide maximum scalability to meet data volume and ingestion requirements
    • HDF5
    • GFS (Google Filesystem)
    • Lustre
  • Ubiquity of filesystem access
    • Number of protocols: NFS, SMB, CIFS, and FTP
    • Access to the data right from the OS: Windows, Mac, Linux, Solaris, HP/UX
    • Application programming interfaces support native access: file open (f_open), file close (f_close), the java.io package, the ifstream/ofstream C++ file I/O classes

vs.

Database Benefits

  • Superior query/search capability over filesystems – the SQL standard (see the example after this list)
  • Easy manipulation of data: functions, PL/SQL, Java, C, PHP, Perl
  • Low-latency, interactive data access suited for application access
  • A structured way of storing data and ensuring data integrity: tables/constraints
  • Superior backup and recovery capabilities: RMAN, redo/archive logging, block and point-in-time recovery, block-level corruption detection
  • Institutional resources
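As a concrete illustration of the query advantage, here is a minimal sketch. The table, columns, and diagnostic name are hypothetical, not part of any schema described in the slides; the point is that once metadata lives in the database, a search that would otherwise mean walking millions of files becomes a single indexed query.

SELECT shot_id, diagnostic_name, captured_at
  FROM diagnostic_image                    -- hypothetical metadata table
 WHERE diagnostic_name = 'XRAY_STREAK'     -- hypothetical diagnostic
   AND captured_at >= DATE '2009-01-01'
 ORDER BY captured_at;

Against a pure filesystem of raw HDF5 output, the equivalent search means opening and inspecting each file, or maintaining a hand-rolled index alongside the data.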


Data Challenges

  • Physical Limitations

    • I/O Intensive - limitations on max IOPS

    • Network speeds - time to ship data to compute nodes

  • Multiple Data Silos

    • Governance issues

      • Pedigree of the data

      • Multiple access policies to get to the data

      • Duplicate data stored in each silo

    • Need to scale disparate systems as data grows

  • Increased effort required for Scientists, Developers, Administrators

    • Correlating the data across data silos

    • Coordinated backup and recovery plan

    • Multiple Data Aggregation Efforts


The Result: The Split Architecture – a step in the wrong direction

  • The drawbacks of the split architecture include, but are not limited to:

    • Data curation

    • Security

    • Availability

    • Recoverability

    • Manageability

  • Because no common database and filesystem access protocol was available, the burden shifted to the application developers and scientific researchers to make sense of the two silos of information


How much of an issue is this?

  • Level 0 (Raw) data is typically enriched with data from other sources.

  • What happens when/if a diagnostic is found to have incorrect calibration data?

  • Without strict relationships, this could be a nightmare. It may be easier to rerun analysis to reproduce the Level 1, 2 and 3 data. However, an unknown quantity of Level 4 content has been generated from this data and is stored on many researchers’ workstations and file shares.

Lack of pedigree in data analysis can result in instrument/machine damage, increased financial costs, or embarrassment to scientific researchers who rely on the data


Future of Scientific Computing and Analysis

Data Intensive + Collaborative = Data Intensive Collaborative Science


Data Intensive Collaborative Science

  • Drivers of collaboration: Cost, Complexity, Knowledge Base, Interdependence
  • Enablers of collaboration: Network Capacity, Standards, The Web, Clustering/Grid Technologies, Moore's Law


What’s driving the data volumes?

Better and more diverse instrumentation

Flexible optics

Coordinated multi-instrument observatories

Increased Precision

Genomics

Diverse types of data generated: SQL/Scalar, XML, Image, Monte Carlo simulations, Audio/Video, telemetry, and spectrometers


Database Filesystems

Bridge the Gap between Filesystems and Relational Database Systems

Maintain Filesystem Performance

Leverage multiple access methods

Single Security Mechanism

Unified Administrative Tools

Data Pedigree

Unified Architecture and Skill sets

Leverage Institutional Resources for IT

Enabling Collaboration around Data

Optimized for Data Access



Modern databases have much to offer in the realm of data analysis

  • RDF/OWL can allow semantic searching of data

  • Predictive Analytics

  • Spatial Data Analysis

  • Text Mining of Unstructured Content (see the sketch below)
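For example, the text-mining capability is exposed through Oracle Text. The query below is only a sketch: the REPORT table, its BODY column, and the CONTEXT index assumed on that column are hypothetical, not from the slides.

SELECT report_id, SCORE(1) AS relevance
  FROM report                              -- hypothetical table with a CONTEXT index on BODY
 WHERE CONTAINS(body, 'ignition AND (target OR hohlraum)', 1) > 0
 ORDER BY SCORE(1) DESC;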


Some of the native data mining techniques and algorithms available

Techniques and their algorithms:

  • Classification: Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree
  • Regression: Multiple Regression
  • Attribute Importance: Minimum Description Length
  • Anomaly Detection: One-Class Support Vector Machine
  • Clustering: Enhanced K-Means, Orthogonal Partitioning Clustering
  • Association: Apriori
  • Feature Extraction: Non-negative Matrix Factorization
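These algorithms are driven through the DBMS_DATA_MINING PL/SQL package. The sketch below is illustrative only: the settings table, the SHOT_TRAINING and SHOT_CANDIDATES tables, their columns, and the model name are hypothetical, not from the slides.

-- Pick Naive Bayes for a classification model
CREATE TABLE nb_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

INSERT INTO nb_settings VALUES ('ALGO_NAME', 'ALGO_NAIVE_BAYES');
COMMIT;

-- Build the model against a hypothetical training table
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'SHOT_OUTCOME_NB',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'SHOT_TRAINING',
    case_id_column_name => 'SHOT_ID',
    target_column_name  => 'OUTCOME',
    settings_table_name => 'NB_SETTINGS');
END;
/

-- Score new rows with the standard SQL PREDICTION operator
SELECT shot_id, PREDICTION(shot_outcome_nb USING *) AS predicted_outcome
  FROM shot_candidates;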


Key Components of SecureFiles Architecture

Delta Update

Write Gather Cache

Transformation Management

Inode Management

Space management

I/O Management

Finally, the database can accept both structured and unstructured data in an efficient manner.



UCRL-PRES-236394

National Ignition Facility and 11g SecureFiles

NLIT 2009

Philip A. Adams

Sr. Systems Architect, National Ignition Facility

Lawrence Livermore National Laboratory

June 1-3 2009

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344


Overview of the National Ignition Facility

The National Ignition Facility (NIF) is known as the world’s largest and most energetic laser

When fully operational, its 192 beams will converge 1.8 MJ of laser energy onto a single target to achieve thermonuclear ignition

NIF will enable experiments that produce temperatures and densities like those in the Sun or in an exploding nuclear weapon



Overview of the National Ignition Facility

The 192 laser beams of NIF will generate:

A peak power of 500 trillion watts, 1000 times the electric generating power of the United States

A pulse energy of 1.8 million joules of ultraviolet light

A pulse length of three to twenty billionths of a second



The Optics make NIF work

Optical components:

7500 large optics including 3072 laser glass slabs as well as large lenses, mirrors, and crystals

More than 15,000 small optical components

Precision optics:

Total area of 33,000 square feet (3/4 of an acre)

More than 40 times the total precision optical surface in the world’s largest telescope (Keck Observatory, Hawaii)



Example of Optic Damage

[Figure: optic damage images, labeled 3 ns and 2 µm]



On high-quality optical surfaces, initiated damage sites are very small



Performance Gains found in NIF with 11g SecureFiles

Test Environment

Database Server

HP Blade Server w/ 4-way AMD Opteron CPUs

RHEL 4 – 32-bit kernel

11g Oracle Database – 32-bit version

Single Instance

ASM

Dual Port Fibre Channel Mezzanine Card (2 Gbit)

Application Server

Dell PowerEdge 2650 w/ 2-way Intel Xeon CPUs

RHEL 4 – 32-bit kernel

10g Oracle Application Server

10g Oracle CMSDK (Content Management Software Development Kit)



Performance Gains found in NIF with 11g SecureFiles

Test Environment

SAN Storage

3PAR S400

Production Environment

11g RAC Environment

10g CMSDK Clustered Application Server Environment



Measure the throughput of the environment

Perform dd tests to the disks to establish the theoretical max:

WRITE

> dd if=/dev/zero of=/dev/raw/raw6 count=10000 bs=1M

READ

> dd if=/dev/raw/raw6 of=/dev/null count=10000 bs=1M

MONITOR

> iostat -xdk 3 100

We saw 180 MB/sec Read/Write throughput to the disks

Warning: Be sure not to perform dd write tests on your ASM configured storage or else you’ll damage it



Create a few test tables

Create a test table for BasicFiles and a test table for SecureFiles:

BasicFile example:

CREATE TABLE "FOO_BASICFILE_TABLE" (
  "PKEY"     NUMBER(4) NOT NULL,
  "DOCUMENT" BLOB)
TABLESPACE "LOB_DEMO"
LOB ("DOCUMENT") STORE AS BASICFILE (TABLESPACE "LOB_DEMO");

SecureFile example:

CREATE TABLE "FOO_SECUREFILE_TABLE" (
  "PKEY"     NUMBER(4) NOT NULL,
  "DOCUMENT" BLOB)
TABLESPACE "LOB_DEMO"
LOB ("DOCUMENT") STORE AS SECUREFILE (TABLESPACE "LOB_DEMO");
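One way to confirm which storage format each LOB column uses is the USER_LOBS dictionary view (its SECUREFILE column reports YES or NO in 11g):

SELECT table_name, column_name, securefile
  FROM user_lobs
 WHERE table_name IN ('FOO_BASICFILE_TABLE', 'FOO_SECUREFILE_TABLE');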



Throughput Results of Table Tests

  • Speed tests from database server (Oracle 11.1.0 DB, using Oracle jdk 1.5.0_11 in $OH/jdk, using ojdbc5.jar)

  • Inserting twenty 32MB image files per test (a server-side load sketch follows below)
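The slides describe a Java/JDBC harness but do not reproduce it. As a rough server-side equivalent, the PL/SQL sketch below loads twenty files into the SecureFile test table with DBMS_LOB; the IMG_DIR directory object and the file names are hypothetical.

DECLARE
  l_bfile BFILE;
  l_blob  BLOB;
  l_dest  INTEGER;
  l_src   INTEGER;
BEGIN
  FOR i IN 1 .. 20 LOOP
    -- 'IMG_DIR' and the file names are placeholders for the test images
    l_bfile := BFILENAME('IMG_DIR', 'image_' || i || '.dat');
    INSERT INTO foo_securefile_table (pkey, document)
      VALUES (i, EMPTY_BLOB())
      RETURNING document INTO l_blob;
    DBMS_LOB.OPEN(l_bfile, DBMS_LOB.LOB_READONLY);
    l_dest := 1;
    l_src  := 1;
    DBMS_LOB.LOADBLOBFROMFILE(l_blob, l_bfile, DBMS_LOB.LOBMAXSIZE, l_dest, l_src);
    DBMS_LOB.CLOSE(l_bfile);
  END LOOP;
  COMMIT;
END;
/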


SecureFile vs. BasicFile Server Results



Measure the throughput of the network

Used a tool called iperf available at:

http://sourceforge.net/projects/iperf/

On Server run:

./iperf -s -f M

On Client run:

./iperf -f M -c blackstone

------------------------------------------------------------

Client connecting to blackstone.llnl.gov, TCP port 5001

TCP window size: 0.06 MByte (default)

------------------------------------------------------------

[ 5] local XXX.XXX.XXX.XXX port 58590 connected with XXX.XXX.XXX.XXX port 5001

[ ID] Interval Transfer Bandwidth

[ 5] 0.0-10.0 sec 1120 MBytes 112 MBytes/sec



Throughput Results of Client-Server Tests

  • Speed tests from database server (Oracle 10.1.2 Client, using jdk 1.5.0_11 and ojdbc14.jar)

  • Inserting twenty 32MB image files per test



SecureFile Performance Benefits

During our testing, we’ve seen a 2-20 times increase in performance using SecureFiles over traditional BasicFiles

We’ve seen equivalent or better performance using SecureFiles compared with writing the same file to our NFS-mounted NetApp



Database Tuning to optimize for SecureFiles

Create a separate tablespace for your LOB data

Use Uniform Extents – 1M seems best overall

Tried 32M/64M extents with no performance increase; your mileage may vary

Enable Automatic Segment Space Management on the tablespace

Create large enough redo log files

We used 200M – 1024M redo logs to reduce log file switches during heavy loads; a tablespace and redo log sketch follows below
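A sketch of a dedicated LOB tablespace along these lines follows; the '+DATA' disk group, datafile size, and redo log size are placeholders rather than the values used at NIF.

CREATE TABLESPACE lob_demo
  DATAFILE '+DATA' SIZE 10G                  -- placeholder ASM disk group and size
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M    -- uniform 1M extents
  SEGMENT SPACE MANAGEMENT AUTO;             -- ASSM

-- Larger redo logs reduce log switches during heavy loads (size is illustrative)
ALTER DATABASE ADD LOGFILE SIZE 512M;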



Database Tuning to optimize for SecureFiles

Take AWR snapshots before and after a SecureFile load and note the wait events

SQL> EXECUTE dbms_workload_repository.create_snapshot(); 

PL/SQL procedure successfully completed

Run the AWR report

$ORACLE_HOME/rdbms/admin/awrrpt.sql
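The report can also be pulled without the interactive script by querying DBMS_WORKLOAD_REPOSITORY directly; the DBID, instance number, and snapshot IDs below are placeholders you would read from V$DATABASE, V$INSTANCE, and DBA_HIST_SNAPSHOT.

-- Find the snapshots that bracket the load
SELECT snap_id, begin_interval_time
  FROM dba_hist_snapshot
 ORDER BY snap_id;

-- Generate the text report for that interval
SELECT output
  FROM TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(
         1234567890,   -- dbid (placeholder)
         1,            -- instance number (placeholder)
         101,          -- begin snapshot id (placeholder)
         102));        -- end snapshot id (placeholder)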



Conclusion

  • The ultimate goal of science is to create new knowledge and new discoveries.

  • Database Filesystems have a number of features that can benefit the scientific community and ease the burden of pedigree, data management, and analysis.

  • Using a database filesystem will enable data-intensive collaborative science.

  • As new discoveries are made and data volumes increase, it is imperative to have a robust database system that is not only capable of managing the pedigree of that data, but also of serving as a knowledge repository for the future.


For More Information

Search for "SecureFiles" at http://search.oracle.com

or visit http://www.oracle.com/

