
Addressing the Challenges of the Scientific Data Deluge



  1. Addressing the Challenges of the Scientific Data Deluge Kenneth Chiu, SUNY Binghamton

  2. Outline • Overview of collaborative projects that I’m working on. • Discussion of challenges and approaches. • Technical overview of specific projects.

  3. Autoscaling Project • Traditional sensor-network research focuses on energy, routing, etc. • In “environmental observatories”, management is the problem. • Adding a sensor takes a lot of manual reconfiguration. • Calibration, recalibration. • QA/QC is also a major issue. • What corrections have been applied to the data, and what calibrations/maintenance have been applied to the sensor? • With U. Wisconsin, SDSC, and Indiana University.

  4. Motivation • Adding a sensor requires a great deal of manual effort. • Reconfiguring datalogger • Reconfiguring data acquisition software • Reconfiguring QA/QC triggers • Reconfiguring database tables • QA/QC is not very automated • Result: Sensor networks are not very scalable. • Goal: Automate.

  5. Metadata for each final table • Metadata describes each final table. • It is used to dynamically generate forms for data retrieval from the website. • It is entered manually.

  6. Approach • Use an agent-based, bottom-up approach. • Agents coordinate among themselves as much as possible. • Unify communications: all communications are done via data streams. • Data streams are represented as content-based publish-subscribe systems (a minimal sketch follows).
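A minimal sketch of the content-based publish-subscribe idea above, in Python. The broker API and the event fields are invented for illustration; the project's actual middleware is not shown here.

```python
# Content-based publish-subscribe: subscribers register predicates over
# message *content*; the broker delivers each event to every subscriber
# whose predicate matches. All names here are illustrative.
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]

class Broker:
    def __init__(self) -> None:
        self._subs: List[Tuple[Callable[[Event], bool],
                               Callable[[Event], None]]] = []

    def subscribe(self, predicate, callback) -> None:
        self._subs.append((predicate, callback))

    def publish(self, event: Event) -> None:
        # Routing is decided by the event's content, not by a topic name.
        for predicate, callback in self._subs:
            if predicate(event):
                callback(event)

broker = Broker()
# An agent interested only in high wind speeds, from any station:
broker.subscribe(lambda e: e.get("type") == "wind_speed" and e["value"] > 20.0,
                 lambda e: print("alert:", e))
broker.publish({"type": "wind_speed", "station": "buoy-3", "value": 27.5})
```

Because agents both publish and subscribe on the same streams, a new sensor can join by emitting events; no central reconfiguration is required.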

  7. Long-Term Ecological Research (LTER)

  8. Agents • Characteristics • Autonomous • Bottom up • Distributed coordination • Independence/loosely-coupled • Can be thought of as a “style” for implementing distributed systems.

  9. Sensor Metadata • Each sensor has intrinsic and extrinsic properties. • Intrinsic properties are type, model number, etc. • Static: cannot be changed. • Dynamic: the SDI-12 address. • Extrinsic properties are location, sampling rate, etc. • Use code generation techniques to generate the proper code from the sensor metadata (a sketch follows).
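A minimal sketch of metadata-driven code generation as described on this slide. The field names and the datalogger "program" syntax are assumptions for illustration, not the project's actual formats.

```python
from dataclasses import dataclass

@dataclass
class SensorMetadata:
    # Intrinsic, static: fixed by the device itself.
    model: str
    sensor_type: str
    # Intrinsic, dynamic: can be reconfigured on the device.
    sdi12_address: str
    # Extrinsic: properties of the deployment, not the device.
    location: str
    sampling_interval_s: int

def generate_logger_program(sensors):
    """Emit a hypothetical datalogger program fragment from metadata."""
    lines = [f"SCAN_INTERVAL {min(s.sampling_interval_s for s in sensors)}"]
    for s in sensors:
        lines.append(f"SDI12_MEASURE addr={s.sdi12_address}"
                     f"  ' {s.model} ({s.sensor_type}) at {s.location}")
    return "\n".join(lines)

print(generate_logger_program([
    SensorMetadata("CS215", "temperature", "0", "buoy-3", 60),
    SensorMetadata("05103", "anemometer", "1", "tower-1", 10),
]))
```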

  10. Automatic Sensor Detection and Inventory [Architecture diagram spanning a field station (sensor, datalogger, acquisition computer, instrument agent) and a data center (sensor metadata repository, database, web service). Labeled steps: 1: detection event; 4: generate datalogger program; 5: upload; 6: data; the remaining numbered arrows are request/response exchanges.]

  11. QA/QC • Malfunctioning anemometer detected as an abnormal occurrence of zero wind-speed values. [Chart: frequency of zero hourly-average wind-speed values per month.]
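A minimal sketch of this QA/QC check: flag any month whose fraction of zero hourly-average wind speeds is far above the long-term norm, suggesting a stuck anemometer. The threshold rule (mean plus k standard deviations) is an assumption, not necessarily what the project used.

```python
from statistics import mean, stdev

def zero_fraction(hourly_means):
    """Fraction of hourly-average wind speeds that are exactly zero."""
    return sum(1 for v in hourly_means if v == 0.0) / len(hourly_means)

def flag_months(monthly_series, k=3.0):
    """Indices of months whose zero-fraction exceeds mean + k * stdev."""
    fracs = [zero_fraction(month) for month in monthly_series]
    mu, sigma = mean(fracs), stdev(fracs)
    return [i for i, f in enumerate(fracs) if f > mu + k * sigma]

# Eleven normal months, then one with a stuck sensor:
months = [[3.1, 0.0, 5.2, 4.4]] * 11 + [[0.0, 0.0, 0.0, 4.4]]
print(flag_months(months))  # -> [11]
```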

  12. Another Example • Buoy was pulled down in the water by the ice. [Chart: water temperature (deg C) over time, with regions labeled “normal winter” and “sensors displaced”. Source: Hu and Benson.]

  13. Crystal Grid Framework • Seeks to develop standards and middleware for integrating instrument and sensor data into wide-area infrastructures, such as grid computing. • With Indiana University.

  14. Motivation • The process of collecting and generating data is often critical. • Current mechanisms for monitoring and control either require physical presence or use ad hoc protocols and formats. • Instruments and sensors are already “wired”. • Usually via obscure, or perhaps proprietary, protocols. • Using standard mechanisms and protocols can give these devices a grid presence. • Benefit from a single, unified paradigm and terminology. • Single set of standards; exploit existing grid standards. • Simplifies end-to-end provenance tracking. • Faster, seamless interactions between data acquisition and data processing. • Greater interoperability and compatibility. Philosophy: Push grid standards as close to the instrument or sensor as possible (but no further!). Deal with “impedance mismatches” close to the instrument, so as to localize complexity.

  15. Goals • Develop a set of standard grid services for accessing and controlling instruments. • Based on Web standards such as WSDL, SOAP, XML, etc. • Develop an instrument ontology for describing instruments. • Applications use the description to interact. • The goal is to develop middleware that abstracts and layers functionality. • Minor differences in instruments should result in only minor loss of functionality to the application. • Move metadata and provenance as close to the instrument as possible.

  16. Overview

  17. Distributed X-Ray Crystallography • Crystallographer, chemist, and technician may be separated. • Large resources such as synchrotrons • Convenience and productivity • Expanding usage to smaller institutions • Data collection, analysis, and curation may be separated. • Approximate data requirements: 1-10 TB/year. • Currently stored at IU. • Real-time data collection and control. • Collaboration with IU, Sydney, JCU, Southampton.

  18. X-Ray Crystallography • Scientists are (understandably) very reluctant to install your software on the acquisition machine. • Use a proxy box to access files via CIFS or NFS. • Scan for files that indicate activity (a sketch follows). • Unfortunately, scientists can manually create files, which can confuse the scanner. No ideal solution. • For sensor data, request-response is not ideal. • Push data using one-way messages. • In WSDL 2.0, consider “connecting” out-only services to in-only services.
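A minimal sketch of the proxy-box scanner described above: poll a CIFS/NFS-mounted directory for new files that signal acquisition activity, and push a one-way notification for each. The mount path, filename pattern, and notify hook are all assumptions.

```python
import os
import time

def scan_for_activity(mount_point, notify, suffix=".frm", interval_s=10):
    """Poll a mounted acquisition directory; call notify() for new files.

    Caveat from the slide: a scientist manually creating files under the
    mount can confuse this heuristic; there is no ideal solution.
    """
    seen = set()
    while True:
        for name in os.listdir(mount_point):
            if name.endswith(suffix) and name not in seen:
                seen.add(name)
                notify(name)  # e.g. publish a one-way (out-only) message
        time.sleep(interval_s)

# scan_for_activity("/mnt/diffractometer", lambda f: print("new frame:", f))
```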

  19. X-Ray Crystallography [Deployment diagram: at Argonne National Labs and Indiana University, an acquisition machine receives data from the diffractometer, and a proxy box exposes instrument services over CIFS; the data archive, instrument manager, and portal run at Indiana University. A parallel deployment links the University of Southampton and the University of Sydney. The legend distinguishes non-grid services, persistent grid services, and non-persistent services.]

  20. TASCS: Center for Technology for Advanced Scientific Component Software • Multi-institution DOE project. • Seeks to develop a common component architecture for scientific components. • My focus within it is to develop a BabelRMI/Proteus implementation. • And develop C++ reflection techniques to improve dynamic connection abilities. • With LLNL and many other institutions.

  21. Babel • Language interoperability toolkit developed at LLNL. • Allows writing objects in a number of languages, including non-OOP ones such as Fortran. • Began as a purely in-process tool, now includes an RMI interface.

  22. Proteus • Started off as a unification API for messaging over multiple standards and implementations, such as CORBA, JMS, SOAP. • Moving towards focusing on multiprotocol web services. • Though almost always bound to SOAP, WSDL actually fully supports almost any protocol.

  23. [Stack diagram of the Babel-Proteus generated runtime, showing user code, the implementation, and the generated library: generated C++ stubs and skeletons, IORs, serializable objects, the Babel RMI stub, Babel-Proteus adapters (B-PAdapter), WSIT, and Proteus on each side.]

  24. [Diagram: multiprotocol communication. In each of two processes, a client sits on Proteus, which selects between Provider A and Provider B to speak Protocol A or Protocol B across the network.]
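A minimal sketch of the multiprotocol picture above: one client-side call API, with per-protocol providers plugged in underneath and chosen per endpoint. All class and scheme names are invented; Proteus's real API is not shown.

```python
from abc import ABC, abstractmethod
from typing import Dict

class Provider(ABC):
    """One transport/protocol binding (e.g. SOAP, JMS, CORBA)."""
    @abstractmethod
    def send(self, endpoint: str, payload: bytes) -> bytes: ...

class LoopbackProvider(Provider):
    """Stand-in transport for the sketch: just echoes the payload."""
    def send(self, endpoint: str, payload: bytes) -> bytes:
        return payload

class Client:
    def __init__(self, providers: Dict[str, Provider]):
        self._providers = providers   # URL scheme -> provider

    def call(self, endpoint: str, payload: bytes) -> bytes:
        scheme = endpoint.split(":", 1)[0]
        return self._providers[scheme].send(endpoint, payload)

client = Client({"loopback": LoopbackProvider()})
assert client.call("loopback://service", b"ping") == b"ping"
```

The point is that the calling code is identical whichever provider carries the message, which is what lets WSDL-described services be bound to protocols other than SOAP.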

  25. Lake Sunapee • Most e-Science/cyberinfrastructure R&D is for institutional science. • It assumes significant resources and expertise. • There is much less work on CI for citizen science, non-profit organizations, etc. • This project explores how to engage them in the development of cyberinfrastructure and e-Science. • Also with a focus on how to use e-Science to engage and educate K-12. • Also with a focus on how to train CS students to better engage scientists. • With U. Wisconsin, U. Michigan, LSPA, and IES.

  26. Hold a series of workshops to understand needs. • Research and develop systems that give them accessible means to interpret the sensor data. • Course component: a seminar/project course where students work with citizen scientists in small groups to define and implement e-Science projects with the lake association.

  27. Semantic publish-subscribe. • Content-based publish-subscribe needs a content model. • Semantic web/description logics provide an ideal content model.
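A minimal sketch of what "semantic" adds over plain content matching: a subscription names a class, and an event matches if its type is that class or any subclass of it, so matching is subsumption rather than string equality. The tiny ontology is an assumption for illustration.

```python
# child class -> parent class (a toy ontology, illustration only)
SUBCLASS = {
    "Anemometer": "WindSensor",
    "WindSensor": "Sensor",
}

def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS.get(cls)
    return False

def matches(event_type, subscription_type):
    return is_a(event_type, subscription_type)

assert matches("Anemometer", "Sensor")      # subsumption, not equality
assert not matches("Sensor", "Anemometer")
```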

  28. Many Small Datasets • Much ecological data is characterized not by a few large datasets, but by many small datasets. • e-Science has so far focused mostly on a few large datasets.

  29. Flexible Electronics and Nanotechnology • Work with Howard Wang in BU ME. • “Ontologies” for materials science processes (internal). • Undergraduate education project (NSF).

  30. Material Processes • The product of materials science research is the characterization of a process (vibration, heating, chemical, electrical, etc.). • Applying such research means finding a sequence of processes that will transform a material A (with certain properties, such as particle size) into a material B (with certain other properties). • It is very difficult to search the research literature for this. • Also, this is a type of path-finding problem (see the sketch below).
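A minimal sketch of the path-finding framing in the last bullet: treat material states (property sets) as nodes and characterized processes as edges, then search for a route from material A to material B. The processes and properties are made up.

```python
from collections import deque

# (process name, required input properties, resulting output properties)
PROCESSES = [
    ("ball-milling", frozenset({"coarse"}), frozenset({"fine"})),
    ("annealing",    frozenset({"fine"}),   frozenset({"fine", "crystalline"})),
]

def find_route(start, goal):
    """Breadth-first search for a process sequence turning start into goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, route = queue.popleft()
        if state == goal:
            return route
        for name, inputs, outputs in PROCESSES:
            if inputs <= state and outputs not in seen:
                seen.add(outputs)
                queue.append((outputs, route + [name]))
    return None

print(find_route(frozenset({"coarse"}), frozenset({"fine", "crystalline"})))
# -> ['ball-milling', 'annealing']
```

A realistic version would weight edges by cost or confidence extracted from the literature, turning this into a weighted shortest-path search.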

  31. [RDF-style graph for an annealing process: an anonymous node, which only serves to “bind” the other nodes together and can be thought of as representing the process as a whole, has a hasName edge to “annealing” and tempSchedule edges to “a schedule” and “a different schedule”.] Conceptually, the schedule is just a function that gives the temperature as output given the time as input. One question is whether to attempt to represent it partially in the graph model, or to treat its representation as completely outside the model. For example, a function can be represented as a table, a Fourier series, wavelets, etc. Information is sparse.

  32. Undergraduate Education • Groups of nanotechnology students develop senior design projects with CS students.

  33. Programs: Australia, Canada, China, Finland, Florida, New Zealand, Israel, South Korea, Taiwan, United Kingdom, Wisconsin. First meeting: San Diego, March 7-9, 2005. Source: T. Kratz

  34. Vision and Driving Rationale for GLEON • A global network of hundreds of instrumented lakes, data, researchers, and students. • Predict lake ecosystems’ responses to natural and anthropogenically mediated events • through improved data inputs to simulation models • to better plan and preserve freshwater resources on the planet. • More or less a grass-roots organization. • Led by Peter Arzberger at SDSC, and with U. Wisconsin.

  35. Why develop such a network? • Global e-science is becoming increasingly possible. • Developments in sensors and sensor networks allow some key measurements to be automated. [Image: journal cover, July 2005 issue; Porter, Arzberger, C. Lin, F. P. Lin, Kratz, et al. (2005). Source: T. Kratz]

  36. Outline • Overview of collaborative projects that I’m working on. • Discussion of challenges and approaches. • Technical overview of specific projects.

  37. Research Challenges • The biggest challenge is data. • Much time and effort is spent managing data in time-consuming and human-intensive ways. • Often stored in Excel, text files, SAS. • Metadata in notebooks and gray matter. • No incentives to make data reusable. • Providing data is not valued academically. • Too much manual work involved in acquisition. • This means much is not captured automatically and semantically. • Standardization of things such as ontologies is very slow, and tends to be top-down. • Can we first build a system that provides some benefit without forcing them to go through a painful standardization process?

  38. Cyberinfrastructure and e-Science • There have been huge improvements in hardware. • There have been huge local improvements in software. • Not so many improvements in large-scale integration and interoperability.

  39. Data, Data, and More Data! • Data is the driver of science. • Recent advances in technology have given us the ability to acquire and generate prodigious amounts of data. • Processing power, disk, memory have increased at exponential rates.

  40. It’s Not a Few Huge Datasets • Huge datasets get more attention. • More glamorous. • Traditional type of CS problem. • Easier to think about. • But it’s the number of different datasets that is the real problem. • If you have one big one, you can concentrate efforts on the problem. • Not very amenable to traditional CS “thinking”, since there is a very significant “human-in-the-loop” component. • The best CS research is useless if the human ignores it.

  41. We Are The Same! (More or Less) Technology advances fast. People advance slowly! People compose our institutions, our organizations, our modes of practice. Result: The old ways of doing things don’t cut it. But we haven’t yet figured out the new ways.

  42. Technology Impacts Slowly • Technologies often require many systemic changes to bring benefits. • Sometimes they require other complementary technologies to be invented. • The steam engine was invented in 1712 but did not become a huge economic success until the 1800s. • The motor and generator were invented in the early 1800s. • Real benefits did not occur until the 1900s.

  43. Steam To Electric • Steam-powered factories built around a single large engine. • Belts and other mechanical drives distributed power. • If you brought a motor to a factory foreman: • His factory wasn’t built for it. • He might not be able to power it. • Chicken-and-egg problem. • He doesn’t even know how to use it. • It took decades. • Similarly, I believe we are in the early stages when it comes to computer technology.

  44. Socio-Technical Problem • What will it take to figure out how to use all this data? • Not a pure CS problem; people’s actions affect how easy it is to use all the data. • Many problems these days are sociotechnical in nature. • Password security is technically a solved problem. • Interoperability is technically a solved problem. • Figuring out how to use data is even harder than power, since power distribution is physical and easy to see. • Data/information flow is hard to see.

  45. A Vision • A scientist sits in his office. • He wonders: “I wonder if children who live closer to cell towers have higher rates of autism?” • How much time would it take a scientist to test this hypothesis? • Find the data. • Reformat the data, convert it, etc. • Run some analysis tools. Maybe find time on a large resource. • But the data is out there! • There are many hypotheses that are never tested because it would take too much work.

  46. This vision also applies to business, military, medicine, industry, management, etc. • There are a million sources of data out there. • Real-time data streams, archived data, scientific publications, etc. • How can we build a flexible infrastructure that will allow analyses to be composed and answered on the fly? • How do we go from data+computation to knowledge?

  47. RDF-like Data Model • We hypothesize that part of the problem is that RDBMS are based on data models that do not fit scientific data well. • This “impedance” mismatch is a barrier. • Thus, develop models that more closely resemble the mental model that scientists use when thinking about data. • The less a priori structure imposed on the data, the better.

  48. Goals • Allow some common subset of code and design to be used for many kinds of scientific data and applications. • Suggest a data and information architecture for querying and storage. • Provide some fundamental semantics. Each discipline would then refine these semantics. • Don’t get bogged down in trying to figure out everything. Just try to find some lowest common denominator (LCD). • This is a logical model of data. We also need a “physical” model to handle transport, archiving, etc., and then a mapping from the physical model to the logical model. For example, an image file has more than just the raw intensities, but some metadata may not be in the file. We don’t want the logical model to be concerned with how the data is actually arranged. • Promote bottom-up, grass-roots approaches to building standards.

  49. One Person’s Metadata Is Another Person’s Data • Distinction between data and metadata is artificial and problematic. • What is metadata in one context becomes data in another. For example, suppose you are taking the temperature at a set of locations (determined via GPS). So for each reading, the temperature is the data, and the metadata is the location. But now suppose that you need the error of the location. So now the error becomes the metametadata of the location metadata? • A made-up example based loosely on crystallography: The spatial correction is based on a calibration image obtained from a brass plate. So the calibration image is metadata for the set of frames. Now suppose that they need the temperature of the brass plate when the image was made. So now the temperature is metametadata.
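A minimal sketch of the point above: if everything is stored uniformly, say as triples, then "metadata" is just more data and can itself be described, with no special meta-tiers. The identifiers and property names are illustrative only.

```python
# One flat pool of triples; "data" vs "metadata" is just a point of view.
triples = [
    ("reading-42", "hasValue",   21.3),     # the "data"
    ("reading-42", "hasUnit",    "degC"),
    ("reading-42", "atLocation", "loc-7"),  # the location "metadata"
    ("loc-7",      "lat",        43.38),
    ("loc-7",      "lon",        -72.05),
    ("loc-7",      "gpsError_m", 4.0),      # "meta-metadata", same mechanism
]

def describe(subject):
    """Everything said about a subject, whatever 'level' it lives at."""
    return [(p, o) for s, p, o in triples if s == subject]

print(describe("loc-7"))  # the location's description: no special tier needed
```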
