Leonid Kalinichenko, Sergey Stupnikov, Victor Zakharov, Vladimir Budzko, Vadim Korolev

Unifying mediation of knowledge, data and services in a subject domain for problem solving over heterogeneous information resources
Leonid Kalinichenko, Sergey Stupnikov, Victor Zakharov, Vladimir Budzko, Vadim Korolev
Institute of Informatics Problems, Russian Academy of Sciences
Declaration of Intent. Draft by IPI RAN. SkTech.RC/IT/Madnick

Outline • State of the art in subject mediation reached at IPI RAN • Directions of research and development suggested for use in the proposal SkTech.RC/IT/Madnick • Investigation of the application-driven approach to scientific problem solving in the subject mediator environment • Heterogeneous multidialect mediator infrastructure for semantic integration of data, knowledge and services • Mediation of databases with non-traditional data models • Storage of very large volumes of data [Zakharov] • Cyber security issues [Budzko, Korolev] • Self-assessment • Coverage by the DoI of the content of the three themes (Scientific Dataspace, Data Quality and Big Data) declared by Prof. Stuart Madnick

State of the art in subject mediation reached at IPI RAN

Basic principles • Subject mediation technology aims to fill the widening gap between users (applications) and heterogeneous distributed information resources • independence of the problem domain definition (the mediator definition) from the existing information resources • definition of a mediator as the result of consolidated efforts of the respective scientific community • independence of user interfaces from the multiple information resources involved • information about new resources can be published at any time, independently of the mediators operating at that time • GLAV-based setting for integrating relevant information resources at the mediator • integrated access to the information resources in the process of problem solving • recursive structure of mediators

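[Figure: Canonical information model synthesis. The canonical model consists of a kernel with extensions E1, E2 and E3; each extension is connected by a "refines" relation to the corresponding resource information model R1, R2 and R3.]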

Resources identification and integration • Identification of relevant resources • metadata model (capabilities) • ontological model (concepts and their relationships) • canonical model (structure and behavior) • Integration of relevant resources in a mediator (registration) • GLAV = Local As View (LAV) + Global As View (GAV) • GAV: provides for reconciliation of various conflicts between resource and mediator specifications • LAV: resource schemas are registered in the mediator as materialized views over virtual classes of the mediator • the application problem specification remains stable under any modifications of resources • mediators scale with respect to the number of resources
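The GLAV registration outlined on this slide can be illustrated with a deliberately small Python sketch. It is not the mediator's actual rule language: the `Star` class, the two catalogs, their field names and the magnitude ranges are all hypothetical. The GAV-style part is a function that reconciles naming and unit conflicts between a resource and the mediator schema; the LAV-style part describes each resource as a view over the virtual class (here simply a magnitude range), which lets the mediator skip resources that cannot contribute to a query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Star:
    """Virtual class of a hypothetical mediator schema."""
    ra_deg: float
    dec_deg: float
    mag: float


@dataclass
class RegisteredResource:
    """A resource registered in the mediator (illustrative structure only)."""
    name: str
    rows: List[Dict[str, float]]                      # data in the resource's own schema
    to_mediator: Callable[[Dict[str, float]], Star]   # GAV-style conflict reconciliation
    mag_range: Tuple[float, float]                    # LAV-style view: Star with lo <= mag <= hi


def query_by_mag(resources, lo, hi):
    """Return stars with lo <= mag <= hi and the names of the resources consulted."""
    answers = set()
    consulted = []
    for res in resources:
        view_lo, view_hi = res.mag_range
        if view_hi < lo or view_lo > hi:
            continue  # the view definition shows this resource holds no matching objects
        consulted.append(res.name)
        for row in res.rows:
            star = res.to_mediator(row)
            if lo <= star.mag <= hi:
                answers.add(star)  # identical objects from several resources collapse
    return sorted(answers, key=lambda s: (s.ra_deg, s.dec_deg)), consulted


catalog_a = RegisteredResource(
    name="catalog_a",
    rows=[{"ra_h": 1.0, "dec": 10.0, "vmag": 9.5},
          {"ra_h": 2.0, "dec": -5.0, "vmag": 4.0}],
    # this catalog stores right ascension in hours, so convert to degrees
    to_mediator=lambda r: Star(r["ra_h"] * 15.0, r["dec"], r["vmag"]),
    mag_range=(0.0, 10.0),
)
catalog_b = RegisteredResource(
    name="catalog_b",
    rows=[{"RAJ2000": 100.0, "DEJ2000": 20.0, "mag": 12.0},
          {"RAJ2000": 15.0, "DEJ2000": 10.0, "mag": 9.5}],
    # this catalog already uses degrees; only the field names differ
    to_mediator=lambda r: Star(r["RAJ2000"], r["DEJ2000"], r["mag"]),
    mag_range=(8.0, 20.0),
)
resources = [catalog_a, catalog_b]

bright, used_for_bright = query_by_mag(resources, 9.0, 10.0)
faint, used_for_faint = query_by_mag(resources, 11.0, 13.0)
```

In this sketch the query function never mentions a concrete catalog, so registering a further resource means adding one more `RegisteredResource` and leaving the problem specification untouched, which is the stability and scalability property the slide refers to.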

Subject mediation: results obtained at IPI RAN (I) • A prototype of the subject mediation infrastructure used for problem solving over multiple distributed information resources (specifically, in the astronomy problem domain) [slide 8] • Methods and tools for mapping and transformation of information models of heterogeneous resources, intended for their unification in mediation middleware • The Model Unifier prototype tool, aimed at partial automation of the unification of heterogeneous information models, has been implemented • The first version is based on term-rewriting technology • The second version, an Eclipse platform application based on model transformation languages, is under implementation [slide 9] • Methods for supporting semantic interoperability of information resources in the context of an application problem domain • Tools for identification of resources relevant to a problem on the basis of ontological descriptions of the problem domain • Tools for registration of the relevant resources in the mediator

Subject mediation infrastructure

Model Unifier architecture

Subject mediation: results obtained at IPI RAN (II) • Methods and tools for rewriting non-recursive mediator programs into resource partial programs, oriented on object schemas of resources and mediators and on typed GLAV-views • A method for optimized planning of the execution of resource partial programs in a distributed environment • takes into account capabilities of the resources • assigns the places of execution of operations on the basis of estimative samples • Methods for dispersed organization of problem solving in the mediation environment • An implementation of a problem in the mediation environment may be dispersed among programming systems, mediators, GLAV-views, wrappers and resources • Methods and tools for representing, manipulating and estimating the efficiency of a dispersed organization • Algorithms for constructing an efficient dispersed organization • An original approach to binding programming languages with the declarative mediator rule language • The approach combines static and dynamic binding, overcoming impedance mismatch and allowing dynamic result types

Directions of research and development: Application-driven approach for scientific problem solving

Application-driven approach for scientific problem solving • Approaches to the integrated representation of multiple information resources for problem solving: • Resource-driven: an integrated representation of multiple resources is created independently of the problem • Application-driven: a description of the subject domain of a problem class is created, into which the resources relevant to the problem are mapped • The application-driven approach assumes the creation of a subject mediator that supports interaction between a user and resources

Experience of applying the application-driven approach • The problem of searching for secondary standards for photometric calibration of optical components of gamma-ray bursts, formulated by the Institute of Space Research of RAS • The problem was formalized and implemented by applying subject mediation: • A glossary of the problem domain was manually extracted from the textual specification • An ontology required for problem solving was constructed • Data structures, methods and functions constituting the problem domain schema were defined • Resources relevant to the problem were identified in the Astrogrid and VizieR information grids • SDSS, USNO B-1, 2MASS, GSC, UCAC, VSX, ASAS, GCVS, NSVS • Resources were registered in the mediator and the corresponding GLAV-views were obtained • The problem was formulated as a program consisting of a set of declarative rules over the mediator schema • The implemented mediator is used by an application that monitors, in real time, the e-mails reporting gamma-ray bursts. The application extracts standards located in the area of a burst and e-mails them to subscribers.
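The last step, extracting standards located in the area of a burst, can be pictured as a cone search around the burst position. The Python sketch below is only an assumption about what such a selection might look like: the slide does not describe how the area is defined, and the record fields (`name`, `ra`, `dec`), the sample values and the search radius are hypothetical. The angular separation uses the standard haversine formula.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Angular distance in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # haversine form, which stays accurate for small separations
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def standards_near_burst(standards, burst_ra, burst_dec, radius_deg):
    """Keep the standards that lie within radius_deg of the burst position."""
    return [s for s in standards
            if angular_separation_deg(s["ra"], s["dec"], burst_ra, burst_dec) <= radius_deg]


# hypothetical candidate standards, as they might be returned by the mediator
candidates = [
    {"name": "std-1", "ra": 10.2, "dec": 0.1},
    {"name": "std-2", "ra": 14.0, "dec": 0.0},
]
selected = standards_near_burst(candidates, burst_ra=10.0, burst_dec=0.0, radius_deg=0.5)
```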

Issues requiring further investigation • Semantic identification of resources relevant to a mediator • Construction of semantic source-to-target schema mappings in the presence of constraints reflecting the specificity of various data models • Development of mediator program rewriting algorithms in the presence of source and mediator constraints over the classes of objects

Directions of research and development: Heterogeneous multidialect mediator infrastructure for data, knowledge and services semantic integration

An approach for the infrastructure • W3C recently adopted the Rule Interchange Format (RIF) standard, oriented on interoperability of declarative programs • Objective • integration of • multilanguage knowledge representations and rule-based declarative programs, • heterogeneous databases and services • built on the basis of unified languages and a multidialect mediation infrastructure • Idea • Combining the RIF standard paradigm and • the GLAV approach built on the extensible canonical information model

Modular mediator infrastructure • The multidialect construction of the canonical model • Mediators are represented as a functional composition of declarative module specifications • Each module is based on its own dialect with an appropriate semantics • Mediator modules as peers: • Rule-based modules become mediator components alongside the GLAV-based modules • Interoperability of the modules is based on P2P and W3C RIF techniques. • Combination of integration and interoperability • Information resource integration can be provided within the scope of an individual mediator module • The integration approaches in different modules can be different. • Rule-based specifications on different levels of the infrastructure • Declarative programming over the mediators • Various modules of a mediator • Schema mapping for semantic integration of the information resources in the mediator • etc.

Example of problem solving in the multidialect mediation infrastructure • The problem of finding an optimal assignment of applicants among universities • A set of n applicants is to be assigned among m universities, where q_i is the quota of the i-th university • Applicants rank the universities, and universities rank the applicants, in order of preference • The aim is to find an optimal assignment given the quotas of the universities and the two sets of orderings • An assignment is unstable if there are two applicants α and β who are assigned to universities A and B, respectively, although β prefers A to B and A prefers β to α; otherwise the assignment is stable • A stable assignment is called optimal if every applicant is at least as well off under it as under any other stable assignment • The program calculating the assignment is defined in DLV (ASP) • The required information resources are integrated in a subject mediator • OntoBroker communicates with the users and, applying its ontologies, formulates queries to the mediator; after collecting the required data, it initiates a program in DLV
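The definitions on this slide can be made concrete with a short Python sketch. It is not the DLV (ASP) program mentioned above; it uses the classical applicant-proposing deferred acceptance procedure of Gale and Shapley, which is known to yield a stable assignment that is optimal for the applicants. The sketch assumes that every university ranks every applicant; the applicant and university names, preferences and quotas are made up for illustration. `is_stable` checks exactly the instability condition stated on the slide.

```python
def deferred_acceptance(applicant_prefs, university_prefs, quotas):
    """Applicant-proposing deferred acceptance; returns {university: [admitted applicants]}."""
    # rank[u][a] is the position of applicant a in university u's ordering (lower is better)
    rank = {u: {a: i for i, a in enumerate(order)} for u, order in university_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # index of the next university to try
    admitted = {u: [] for u in university_prefs}    # tentative admissions, best first
    free = list(applicant_prefs)
    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue  # a has been rejected everywhere and stays unassigned
        u = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        admitted[u].append(a)
        admitted[u].sort(key=lambda x: rank[u][x])
        if len(admitted[u]) > quotas[u]:
            free.append(admitted[u].pop())  # reject the least preferred tentative admit
    return admitted


def is_stable(assignment, applicant_prefs, university_prefs):
    """True unless some applicant beta prefers A to their own university and A prefers beta to one of its admits."""
    rank = {u: {a: i for i, a in enumerate(order)} for u, order in university_prefs.items()}
    university_of = {a: u for u, admits in assignment.items() for a in admits}
    for beta, own in university_of.items():
        for preferred in applicant_prefs[beta]:
            if preferred == own:
                break  # the remaining universities are less preferred than beta's own
            for alpha in assignment[preferred]:
                if rank[preferred][beta] < rank[preferred][alpha]:
                    return False
    return True


# hypothetical instance: three applicants, two universities
applicant_prefs = {"alpha": ["A", "B"], "beta": ["A", "B"], "gamma": ["B", "A"]}
university_prefs = {"A": ["beta", "alpha", "gamma"], "B": ["alpha", "gamma", "beta"]}
quotas = {"A": 1, "B": 2}

optimal = deferred_acceptance(applicant_prefs, university_prefs, quotas)
unstable_example = {"A": ["alpha"], "B": ["beta", "gamma"]}
```

In the made-up instance, `unstable_example` is rejected by `is_stable` because beta prefers A to B and A prefers beta to alpha, which is precisely the situation the slide calls unstable.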

Optimal assignment problem infrastructure
[Figure: components shown are OntoBroker with its ontologies, the Multi-Layered Broker, RIF-BLD (via XML) translators (BLD → OB, OB → BLD, BLD → Synthesis, Synthesis → BLD, BLD → DLV, DLV → BLD), the Synthesis mediation environment, DLV (ASP facilities) and the resources. Requests 2 and 3 and responses 2 and 3 are exchanged with Synthesis; requests 1 and 4 and responses 1 and 4 are exchanged with DLV.]
Requests
1. OB2DLV: GetProgram(Loc, Name [Params])
2. OB2SYNTH: GetSchema(Loc, Name [Params])
3. OB2SYNTH: SendExec(Loc,Name,Prog [Pars])
4. OB2DLV: SendExec(Loc, Name, Prog [Pars])
Responses
1. DLV2OB: DLV Program (without IDB)
2. SYNTH2OB: Synthesis Schema
3. SYNTH2OB: Result of OB program execution.
4. DLV2OB: Result of DLV program execution.

Issues to be investigated and prototyped • Approaches to constructing mappings between rule-based dialects • Methods for justifying that the mappings preserve semantics • Approaches to modular representation of knowledge in the multidialect mediation environment • Approaches to providing interoperability of the multidialect mediator modules • Infrastructure design and prototyping • Solving real problems in the chosen scientific subject domains • Expansion of the experience into the Semantic Web area

Directions of research and development: Mediation of databases with non-traditional data models

Non-traditional data models • NoSQL data models oriented on the support of extra large volumes of data, applying a “key-value” technology for vertical storage • Dynamo, BigTable, HBase, Cassandra, MongoDB, CouchDB. • Graph data models • Neo4j, InfiniteGraph, DEX, InfoGrid, HyperGraphDB, Trinity, supporting flexible data structures. • Triple-based data model (expressible in RDF, RDFS) • Virtuoso, OWLIM, 5Store, Bigdata. • OWL 2 QL profile, oriented on the support of ontological modeling over relational databases and expressed by data dependencies used together with Datalog • “Scientific” data models • SciDB, applying a multidimensional array data model • Prof. Pentland's Connection Science-oriented data models • For most of these data models, standards do not yet exist • Most of these data models and systems are oriented on “big data” support, applying massively parallel techniques of the MapReduce kind

Planned research results • Information-preserving methods for mapping and transforming various classes of non-traditional data models into the canonical one • Mappings and transformations for specific data models, and adequate extensions of the canonical data model • Techniques for interpreting the canonical model DML in the DMLs of different classes of non-traditional data models, and approaches to their implementation • Architectural decisions on implementing massively parallel techniques at the level of mediators, and evaluation of the achievable performance growth • Evaluation of the suitability and efficiency of integrating non-traditional data models of different classes in the GLAV mediation infrastructure for various problem domains

Directions of research and development: Storage of very large volumes of data [Zakharov]

Storage of very large volumes of data [Zakharov] • The objective is to develop a novel distributed parallel fault-tolerant file system possessing the following capabilities: • storage of data volumes of petabyte scale • unlimited period of storage • scalability • efficient multiuser access support in different kinds of networks • usage of different storage types (e.g., HDD and flash memory) • The experience of existing file system vendors should be taken into account: • ReFS (Windows Server 8) by Microsoft • VMFS by VMware • Lustre • ZFS by Sun Microsystems • zFS (z/OS) by IBM • OneFS by Isilon

Directions of research and development: Cyber security issues [Budzko, Korolev]

Cyber security issues [Budzko, Korolev] • Information integrity and availability support for large-scale data gathering and mining • Security analysis of technical architectures (network protocols, architectures, operating systems, DBMSs, etc.) • Vulnerability analysis • Development of threat models • Protection from insiders in personal information data centers

Self-assessment

Self-assessment (I) • Relevance • Semantic integration of resources in the context of an application • Mediation of knowledge • Mediation of non-traditional databases • Semantic Web and Big Data orientation. • Novelty • An intelligent executable level for declarative, conceptual-level specification of problems in terms of the application domain, for problem solving over diverse resources • Methods for information-preserving data model mappings and for their implementation • Schema mapping and query rewriting methods in the presence of constraints reflecting the specificity of diverse data models, etc. • Breadth of scope • Relevant to a broad area of application domains, technologies and research issues.

Self-assessment (II) • Challengability • Hard theoretical and implementation problems need to be overcome • Entrepreneurship possibilities • Areas of possible application are very diverse • Serious investment is required to reach a proper level of commercialization • Educational potential • Very broad; various courses can be proposed for master's students • Many challenging research topics for PhD research

Coverage of the content of the proposed themes

Scientific DataSpace • Large-scale federated data architecture • Semantic integration of heterogeneous information • Context mediation • Semantic Web • Architecture for semantic mediation and integration of heterogeneous resources • Infrastructures: semantic layer for grids and clouds, P2P heterogeneous knowledge-based mediator infrastructures • Data model transformation, data model unification, declarative canonical model extension and synthesis • Justification of the correctness of data model transformation; sets of dependencies (constraints) extending the canonical model core should be decidable and tractable • Information resources: semantic description, canonical modeling, wrappers, registries, metadata • Problem domains: conceptual description, ontologies, metadata, multidomains, context mediation • Semantics-based information resource discovery • Semantic schema mapping for data exchange and integration

Data Quality • Recognizing and resolving heterogeneous data semantics • Effective integration of data from multiple and disparate data sources • Semantic schema mapping • Justification of the correctness of data model (schemas and operations) transformation • Dispersed implementation of problems in the subject mediation environment

Big Data • Data extraction and gathering from the web • Federated data systems • Parallel infrastructures for high-performance big data manipulation and analysis • Large-scale and novel “big data” applications • Novel approaches to development of large-scale data warehouses • Mediation infrastructure including grids and clouds • Integration of non-traditional data models into the canonical data model • Parallel infrastructures at the mediation level • Distributed parallel fault-tolerant file system

International Cyber Security • Secure information architectures • Techniques for assessment of threats and vulnerabilities • Cyber security issues
