Introduction to e-Science and Semantic Web

Introduction to e-Science and Semantic Web Professors Deborah McGuinness and Joanne Luciano (With Li Ding and Peter Fox) CSCI-6962-01 Week 1, August 30, 2010

Admin info (keep/ print this slide) • Class: CSCI-6962-01 • Hours: 1pm-3:50pm Mondays (except one) • Location: Winslow 1140 • Instructors: Deborah McGuinness, Joanne Luciano, with Peter Fox and Li Ding • Instructor contacts: dlm@cs.rpi.edu, jluciano@cs.rpi.edu • Contact locations: Winslow 2104 (DLM), 2143 (JSL) • Wiki: http://tw.rpi.edu/wiki/Semantic_e-Science_%282010_Fall%29

Introductions • Who are we? • Who are you? • Why are you here? • What do you want to get out of the class? • Will you make the class (on time) each week and do you have any other conflicts or issues we should know about?

“Knowledge is the common wealth of humanity”* • In the Earth and space sciences and elsewhere, ready and open access to the vast and growing collections of cross-disciplinary digital information is the key to understanding and responding to complex Earth system phenomena that influence human survival. • We have a shared responsibility to create and implement strategies to realise the full potential of digital information and services for present and future generations. *Adama Samassekou, Convener of the UN World Summit on the Information Society

What do we need to achieve Semantic eScience? (in-class brainstorming exercise) • organization, leadership, management strategies, roles and assignment of roles • dissemination strategy • communication of ideas - machine level - human level • conflict resolution • cross-disciplinary collaboration • flexible, adaptable, feedback • extensible • ability to filter information • usage/application of resources, optimization • facts, knowledge (domain knowledge) • context, domain, scope • goals, use cases • metadata - data to describe data • ability to link information • ability to understand information • ability to capture and represent conflicting ideas • provenance - where data come from • trust - reliable • ability to capture intent (humanitarian aspect / responsibility) • credibility of information • interesting and appealing • standardization • education and outreach • methods and metrics • criteria for evaluation

Outline of the course • Topics for Semantic e-Science/ Foundations: • Semantic Methodologies • Knowledge Representation for e-Science • Ontology Engineering and Re-Use for e-Science • Knowledge Integration for e-Science • Semantic Data Integration • Semantic Web Languages, Tools and Services • Semantic Infrastructure and Architecture for e-Science • Semantic Grid Middleware • Ontology Evolution for e-Science • Knowledge Management for e-Science • e-Science Workflow Management • Data life-cycle for e-Science • Data Mining and Knowledge Discovery

Contents • Outline of the course • Background • e-Science • Examples • Informatics • Semantics • Elements of Semantic e-Science (SeS) • What we expect • Logistics summary

The Information Era: Interoperability Modern information and communications technologies are creating an “interoperable” information era in which ready access to data and information can be truly universal. Open access to data and services enables us to meet the new challenges of understanding the Earth and its space environment as a complex system: • managing and accessing large data sets • higher space/time resolution capabilities • rapid response requirements • data assimilation into models • crossing disciplinary boundaries

One Real World Example Influenza Ontology Development to Support Research, Surveillance and Monitoring

Ontology Support for Influenza Research and Surveillance Joanne Luciano, PhD, Lynette Hirschman, PhD, Marc Colosimo, PhD Approved for Public Release; Distribution Unlimited. 28 April 2008 Case Number 08-0738

Case Study: Indonesia • Possible human-to-human transmission of H5N1 (May 2006) • Samples were collected and epidemiological data obtained • Know who got sick and their relationship to each other • Know when they got sick and if they died • Have some public sequence data from that time • It is not known if these samples are from these people! Public Sequence Data, 30 Aug 2006: A/Indonesia/CDC595/2006 (2006-05-09), A/Indonesia/CDC594/2006 (2006-05-10), …, A/Indonesia/CDC625L/2006 (2006-05-22), A/Indonesia/CDC644/2006 (2006-05-30) • Example annotation: isolation_source="gender:M; age:32; Lung Aspirate" • Same person? • Figure labels: Metadata 23 May 2006; Metadata 12 Jul 2006; GenBank; Nature; WHO

Case Study: UK • Outbreak of H5N1 in the UK at a turkey farm Feb 1, 2007 • What is the source of the outbreak? • Contact with infected wild birds? • But turkeys were in an enclosed “biosecure” unit • No H5N1 detected in the region in the 2 previous months • Govt. veterinarian suggested turkey meat from Hungary might be source of infection • Turkey farm is adjacent to a poultry packing plant that had processed poultry products from Hungary • Hungary had reported an H5N1 outbreak 2 weeks earlier • Sequence data showed that strain infecting the turkeys was 99.96% identical to strain that had infected Hungarian birds • Conclusion: Infected Hungarian poultry was source of H5N1 infection • Open question (relevant to food defense): how did H5N1 spread from processing plant to live turkeys?

Research Agenda Create “Reference” Database to hold Influenza virus sequences and “Metadata” • What metadata to collect? • Where to find data and how to connect different sources (bridging the gap)?

Research Question: Bridging the Gap - Connecting Genomics and Epidemiology • Genomics: Genes of Pathogen • Epidemiology: Occurrence of Disease in Host • Influenza Ontology • Figure labels: Genomic Sequence Data; Systems Biology; Demographic data; Clinical data; Geospatial data; Temporal data; Pathogenicity; Host

Influenza Ontology: Development • Identify the right collaborators • Collect metadata terms • Identify resources that include these terms • Regularize metadata • Generate a controlled vocabulary (terms) • Validate subset with BioHealthBase CEIRS data • Iterate, review with community, publish • Integrate Influenza ontology into workflow

Influenza Ontology First Draft: Community • BioHealthBase: NIAID Influenza Database Point of Contact for • Centers of Excellence for Influenza Research and Surveillance (CEIRS) • Research: Emory, Mt Sinai, St. Jude, Univ. of Rochester • Surveillance: St. Jude, UCLA, Univ. of Minnesota • Los Alamos National Laboratory (LANL) • Gemina: Category A-C Pathogen Database Point of contact for • Children’s Hospital Boston • Johns Hopkins University • MITRE Collaboration with BioHealthBase and Gemina

Influenza Ontology First Draft: Identify metadata • ~200 controlled vocabulary terms • covering several fields

Influenza Ontology: Metadata resources Reuse of existing ontologies & metadata standards • OBI – Ontology of Biomedical Investigations • EnvO – Environmental Ontology (habitat of pathogen) • GAZ – Gazetteer (geographic locations) • FMA – Foundational Model of Anatomy • DC – Dublin Core (publication metadata) • PATO – Phenotype • SO – Sequence Ontology (sequence features) • Cell – Cell Ontology (types of cells) • DO – Disease Ontology • IDO – Infectious Disease Ontology

Influenza Ontology First Draft • Initial steps (Excel Spreadsheet): • Collect metadata terms • Map and align terms • Group related information • Identify and define relationships • Identify external ontologies • Formalize (OBO-Edit: Ontology Editing Tool): • Normalize terms into a CV • Issue unique identifiers • Instantiate class hierarchy • Define properties and values • Link to external ontology terms
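
The first two formalization steps on this slide (normalize terms into a CV, issue unique identifiers) can be illustrated with a minimal Python sketch. This is not the project's actual tooling: the `normalize` and `build_cv` helpers, the normalization rule, and the `FLU:` identifier prefix are all assumptions made for illustration. The sample terms echo the `isolation_source` annotation from the Indonesia case study.

```python
import re

def normalize(term: str) -> str:
    """Lower-case, trim, and collapse whitespace, underscores and hyphens
    so that spelling variants of one term compare equal."""
    return re.sub(r"[\s_\-]+", " ", term.strip().lower())

def build_cv(raw_terms, prefix="FLU"):
    """Map each distinct normalized term to an identifier, in first-seen order.
    The 'FLU' prefix and 7-digit numbering are hypothetical."""
    cv = {}
    for raw in raw_terms:
        key = normalize(raw)
        if key and key not in cv:
            cv[key] = f"{prefix}:{len(cv) + 1:07d}"
    return cv

# Raw metadata terms as they might be collected in a spreadsheet (illustrative).
raw = ["Lung Aspirate", "lung_aspirate", " Isolation  Source", "isolation-source", "Host"]
cv = build_cv(raw)
for term, ident in cv.items():
    print(ident, term)
```

Five raw spellings collapse to three controlled terms, each with one identifier that later steps (class hierarchy, properties, links to external ontologies) could refer to.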

Subsequent Work Ontology development • Complete formalization process • Validate subset with data from BioHealthBase • Circulate for review and comments • Use ontology to annotate influenza data

Team Note! (not mentioned in class): This collaboration became international in its third year, when the Canadian Government decided that they too needed an ontology to support integration of influenza data, recognizing that the spread of influenza does not stop at international borders. When they did their research, they found us and joined the collaboration, and we were grateful to have their help and expertise. • BioHealthBase (UT Southwestern Medical Center) • Burke Squires • Richard Scheuermann • Institute of Genome Sciences/Gemina (U. Maryland Baltimore) • Lynn Schriml • MITRE • Joanne Luciano • Lynette Hirschman • Marc Colosimo • British Columbia Cancer Agency (Vancouver, Canada) • Ryan Brinkman • Mélanie Courtot

Background Scientists should be able to access a global, distributed knowledge base of scientific data that: • appears to be integrated • appears to be locally available But… data is obtained by multiple means, using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) metadata. It may be inconsistent, incomplete, evolving, and distributed. And… there often exist significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology…

But data has Lots of Audiences • Figure labels: Information products have Information; More Strategic; Less Strategic; SCIENTISTS TOO • From “Why EPO (Education and Public Outreach)?”, a NASA internal report on science education, 2005

Shifting the Burden from the User to the Provider Fox CI and X-informatics - CSIG 2008, Aug 11

e-Science • Emphasis is on Science • Original narrative: One of the key drivers behind the search for such new scientific tools is the imminent deluge of data from new generations of scientific experiments and surveys (*). In order to exploit and explore the petabytes of scientific data that will arise from these high-throughput experiments, supercomputer simulations, sensor networks, and satellite surveys, scientists will need assistance from specialized search engines, data mining tools, and data visualization tools that make it easy to ask questions and understand answers. To create such tools, the data will need to be annotated with relevant "metadata" giving information as to provenance, content, conditions, and so on; and, in many instances, the sheer volume of data will dictate that this process be automated. Scientists will create vast distributed digital repositories of scientific data requiring management services similar to those of more conventional digital libraries, as well as other data-specific services. The ability to search, access, move, manipulate, and mine such data will be a central requirement for this new generation of collaborative science software applications. Hey and Trefethen, 2005

Evolving Science • Thousand years ago: science was empirical, describing natural phenomena • Last few hundred years: theoretical branch using models, generalizations • Last few decades: a computational branch simulating complex phenomena • Today: data exploration (eScience), synthesizing theory, experiment and computation with advanced data management and statistics: new algorithms!

Living in an Exponential World • Scientific data doubles every year • caused by successive generations of inexpensive sensors + exponentially faster computing • Changes the nature of scientific computing • Cuts across disciplines (eScience) • It becomes increasingly hard to extract knowledge • 20% of the world’s servers go into huge data centers run by the “Big 5” • Google, Microsoft, Yahoo, Amazon, eBay • So it is not only the scientific data!

Collecting Data • Very extended distribution of data sets: data on all scales! • Most datasets are small, and manually maintained (Excel spreadsheets) • Total amount of data dominated by the other end (large multi-TB archive facilities) • Most bytes today are collected via electronic sensors

Making Discoveries • Where are discoveries made? • At the edges and boundaries • Going deeper, collecting more data, using more colors…. • Metcalfe’s law • Utility of computer networks grows as the number of possible connections: O(N^2) • Federating data (the connections!!) • Federation of N archives has utility O(N^2) • Possibilities for new discoveries grow as O(N^2) • Many examples • Sky surveys – galaxy zoo… Very early discoveries from SDSS, 2MASS, DPOSS • Genomics+proteomics • Alzheimer’s article in reading
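
One common way to read "the number of possible connections" is as the count of distinct pairs among N archives, N(N-1)/2, which grows as O(N^2). The short Python sketch below tabulates that count; treating utility as proportional to pairwise links is an assumption made here for illustration, not a claim from the slide.

```python
def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n federated archives: n(n-1)/2."""
    return n * (n - 1) // 2

# Doubling the number of archives roughly quadruples the possible pairings.
for n in (5, 10, 20, 40):
    print(f"{n:>3} archives -> {pairwise_links(n):>4} possible pairwise connections")
```

Under this reading, each archive added to a federation contributes as many new potential cross-matches as there are archives already in it.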

Data Delivery: Hitting a Wall FTP and GREP are not adequate • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • Oh!, and 1 PB ~4,000 disks • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (~1 $/GB) • … 2 days and 1K$ • … 3 years and 1M$ • At some point you need indices to limit search • parallel data search and analysis • This is where databases can help • Take the analysis to the data!!
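
The GREP figures on this slide imply sequential scan rates of only a few megabytes per second, which is why indices and parallel search are needed at petabyte scale. The arithmetic can be checked with a small Python sketch. It assumes decimal units (1 MB = 10^6 bytes) and 365-day years; the `scan_seconds` helper and its even split across workers are simplifications introduced here.

```python
# Implied throughput behind the slide's GREP figures.
# Assumptions: decimal units (1 MB = 10**6 bytes) and 365-day years.
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
MINUTE, DAY = 60, 86_400
YEAR = 365 * DAY

scans = {
    "1 MB in a second": (MB, 1),
    "1 GB in a minute": (GB, MINUTE),
    "1 TB in 2 days": (TB, 2 * DAY),
    "1 PB in 3 years": (PB, 3 * YEAR),
}

def rate_mb_per_s(size_bytes: int, seconds: int) -> float:
    """Average scan rate, in MB/s, implied by a data size and an elapsed time."""
    return size_bytes / seconds / MB

def scan_seconds(size_bytes: int, rate_mb_s: float, workers: int = 1) -> float:
    """Full-scan time when the data is split evenly across parallel workers."""
    return size_bytes / (rate_mb_s * MB * workers)

for label, (size, secs) in scans.items():
    print(f"{label}: about {rate_mb_per_s(size, secs):.1f} MB/s")

print(f"1 PB at 10 MB/s, 1 worker: {scan_seconds(PB, 10) / YEAR:.2f} years")
print(f"1 PB at 10 MB/s, 1000 workers: {scan_seconds(PB, 10, 1000) / DAY:.2f} days")
```

All four figures correspond to rates between about 1 and 17 MB/s. At that speed a single scanner needs years for a petabyte, while an idealized thousand-way parallel scan of the same data finishes in about a day.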

Mind the Gap! • As a result of finding out who is doing what, sharing experience/ expertise, and substantial coordination: • There is/ was still a gap between science and the underlying infrastructure and technology that is available • Informatics - information science includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies. Wikipedia. • Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.

Progression after progression Informatics

World-Wide Emerging Technology Trends • Innovation will come from parts of the world other than the U.S. • The Chinese have skipped the Internet first generation. • Growth will occur in Asia, and continue to decrease in Western Europe. • U.S. Industry is compulsively outsourcing abroad. • Software is moving from forms-based applications to business processes. • Networks are migrating to IP and optical networking technologies.

Cyberinfrastructure • Data curation and storage • Federated access • Collaboration • New uses in High Performance Computing • Databases • Web servers, services (software as service) • Wiki • Visualization • All discipline neutral

Semantic Web Methodology and Technology Development Process • Establish and improve a well-defined methodology vision for Semantic Technology based application development • Leverage controlled vocabularies, etc. • Figure labels: Adopt Technology Approach; Leverage Technology Infrastructure; Science/Expert Review & Iteration; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Use Tools; Evaluation; Analysis; Use Case; Develop model/ontology; Small Team, mixed skills

Ex. 1: Virtual Observatories Make data and tools quickly and easily accessible to a wide audience. Operationally, virtual observatories need to find the right balance of data/model holdings, portals and client software that researchers can use without effort or interference, as if all the materials were available on their local computer in their preferred language: i.e. appear to be local and integrated. Likely to provide controlled vocabularies that may be used for interoperation in appropriate domains, along with database interfaces for access and storage -> thus part IT, part CI, part Informatics and all about doing new science

Figure labels (layered virtual observatory architecture): • Added value (labelled at each level): education, clearinghouses, other services, disciplines, etc.; semantic interoperability; semantic query, hypothesis and inference; query, access and use of data • Semantic mediation layer - mid-upper-level • VO Portal, Web Serv., VO API • Semantic mediation layer - VSTO - low level • Metadata, schema, data: DB1, DB2, DB3, …, DBn Mediation Layer • Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes • Maps queries to underlying data • Generates access requests for metadata, data • Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.

Science and technical use cases • Science use case: Find data which represents the state of the neutral atmosphere anywhere above 100km and toward the arctic circle (above 45N) at any time of high geomagnetic activity. • Extract information from the use-case - encode knowledge • Translate this into a complete query for data - inference and integration of data from instruments, indices and models • Technical use case: Provide semantically-enabled, smart data query services via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere Observatory that retrieve data, filtered by constraints on Instrument, Date-Time, and Parameter in any order and with constraints included in any combination.
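
The requirement that constraints on Instrument, Date-Time, and Parameter may be supplied "in any order and in any combination" can be sketched with optional keyword arguments. This Python example is an illustration only, not the observatory's SOAP interface: the `Record` type, the `query` function, and the sample catalog entries are hypothetical, and no ontology or reasoning is involved.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Record:
    """One catalog entry (hypothetical structure)."""
    instrument: str
    parameter: str
    time: datetime

def query(
    records: Sequence[Record],
    *,
    instrument: Optional[str] = None,
    parameter: Optional[str] = None,
    time_range: Optional[Tuple[datetime, datetime]] = None,
) -> List[Record]:
    """Return records matching every constraint that was supplied.
    A constraint left as None is simply not applied."""
    hits = []
    for r in records:
        if instrument is not None and r.instrument != instrument:
            continue
        if parameter is not None and r.parameter != parameter:
            continue
        if time_range is not None and not (time_range[0] <= r.time <= time_range[1]):
            continue
        hits.append(r)
    return hits

# Hypothetical sample data, for illustration only.
CATALOG = [
    Record("ISR", "neutral_temperature", datetime(2005, 1, 18, 6, 0)),
    Record("ISR", "ion_velocity", datetime(2005, 1, 18, 7, 0)),
    Record("FPI", "neutral_temperature", datetime(2005, 3, 2, 22, 0)),
]

january = (datetime(2005, 1, 1), datetime(2005, 1, 31, 23, 59))
print(len(query(CATALOG, parameter="neutral_temperature")))
print(len(query(CATALOG, time_range=january, parameter="neutral_temperature")))
```

Because the constraints are keyword-only and independent, callers can give them in any order and omit any of them. A semantic version would additionally use the ontology to expand or validate the constraint values before filtering.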

Inferred plot type and return required axes data

Semantic Web Benefits • Unified/abstracted query workflow: Parameters, Instruments, Date-Time • Decreased input requirements for query: in one case reducing the number of selections from eight to three • Generates only syntactically correct queries: which could not always be ensured in previous implementations without semantics • Semantic query support: by using background ontologies and a reasoner, our application has the opportunity to expose only coherent queries (portal and services) • Semantic integration: in the past users had to remember (and maintain codes) to account for numerous different ways to combine and plot the data, whereas now semantic mediation provides the level of sensible data integration required, exposed as smart web services • understanding of coordinate systems, relationships, data synthesis, transformations, etc. • returns independent variables and related parameters • A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)

But data has Lots of Audiences More Strategic Less Strategic From “Why EPO?”, a NASA internal report on science education, 2005

What is a Non-Specialist Use Case? Someone should be able to query a virtual observatory without having specialist knowledge. A teacher accesses the internet, goes to an Educational Virtual Observatory, and enters a search for “Aurora”.

What should the User Receive? Teacher receives four groupings of search results: 1) Educational materials: http://www.meted.ucar.edu/topics_spacewx.php and http://www.meted.ucar.edu/hao/aurora/ 2) Research, data and tools: via research VOs, but the search for brightness, or green/red line emission, is mediated for them 3) Did you know?: Aurora is a phenomenon of the upper terrestrial atmosphere (ionosphere), also known as Northern Lights 4) Did you mean?: Aurora Borealis or Aurora Australis, etc.

Semantic Information Integration: Concept map for educational use of science data in a lesson plan Fox CI and X-informatics - CSIG 2008, Aug 11

Semantic Web Basics • The triple: {subject-predicate-object}, e.g. Interferometer is-a optical instrument; Optical instrument has focal length • An ontology is a representation of this knowledge • W3C is the primary (but not sole) governing organization for languages, specifications, best practices, etc. • RDF - Resource Description Framework • OWL 1.0 - Web Ontology Language (OWL 2.0 on the way) • Encode the knowledge in triples in a triple-store; software is built to traverse the semantic network, which can be queried or reasoned upon • Put semantics between/in your interfaces, i.e. between layers and components in your architecture, i.e. between ‘users’ and ‘information’, to mediate the exchange
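
The two example triples on this slide are enough to show what "queried or reasoned upon" means. The Python sketch below stores them in a set, matches triple patterns, and follows is-a links so that Interferometer inherits the focal length property. It is a toy in-memory store written for illustration: the `match`, `ancestors`, and `inferred_properties` helpers are assumptions, not part of RDF, OWL, or any particular triple-store product.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# The two statements from the slide, encoded as triples.
triples: Set[Triple] = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "has", "focalLength"),
}

def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> Set[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

def ancestors(cls: str) -> Set[str]:
    """All classes reachable from cls by following is-a links upward."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for _, _, parent in match(s=current, p="is-a"):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def inferred_properties(cls: str) -> Set[str]:
    """Properties stated for cls or inherited from any of its ancestors."""
    props: Set[str] = set()
    for c in {cls} | ancestors(cls):
        props |= {o for _, _, o in match(s=c, p="has")}
    return props

print(match(p="is-a"))                        # query: which is-a statements exist?
print(inferred_properties("Interferometer"))  # reasoning: inherited via is-a
```

Nothing in the store says directly that an interferometer has a focal length; that fact is derived by traversing the is-a link, which is the kind of inference an OWL reasoner performs with far richer semantics.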

Terminology • Semantic Web • An extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation, www.semanticweb.org • Primer: http://www.ics.forth.gr/isl/swprimer/ • Semantic Grid • Semantic services to use the resources of many computers connected by a network to solve large scale computational/ data problems • Provenance • origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility. • Service-oriented architecture • Provision of a capability over the internet via a ‘remote-procedure-call’ using prescribed input, output and pre-conditions • Ontology (n.d.). The Free On-line Dictionary of Computing. http://dictionary.reference.com/browse/ontology • An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.

Terminology • Closed World - where complete knowledge is known (encoded), AI relied on this • Open World - where knowledge is incomplete/evolving, SW promotes this • Languages • OWL - Web Ontology Language (W3C) • RDF - Resource Description Framework (W3C) • OWL-S/SWSL - Web Services (W3C) • WSMO/WSML - Web Services (EC/W3C) • SWRL - Semantic Web Rule Language, RIF - Rule Interchange Format • PML - Proof Markup Language • Editors: Protégé, SWOOP, Medius, SWeDE, … • Reasoners • Pellet, Racer, Medius KBS, FaCT++, fuzzyDL, KAON2, MSPASS, QuOnto • Query Languages • SPARQL, XQuery, SeRQL, OWL-QL, RDFQuery • Other Tools for Semantic Web • Search: SWOOGLE swoogle.umbc.edu • Collaboration: www.planetont.org • Other: Jena, Sesame/SAIL, Mulgara, Eclipse, KOWARI • Semantic wiki: OntoWiki, SemanticMediaWiki • Emerging Semantic Standards for Earth Science • SWEET, VSTO, MMI, GeoSciML
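
The closed-world and open-world assumptions differ in how they treat a statement that is not in the knowledge base. The Python sketch below contrasts the two on a tiny fact set drawn from the tool names on this slide. The `ask_closed` and `ask_open` helpers and the use of `None` for "unknown" are illustrative assumptions, not how any listed reasoner exposes its answers.

```python
from typing import Optional, Set, Tuple

Fact = Tuple[str, str, str]

# What has been asserted so far (deliberately incomplete).
known_true: Set[Fact] = {
    ("Pellet", "is-a", "Reasoner"),
    ("Protege", "is-a", "Editor"),
}
# Statements explicitly asserted to be false.
known_false: Set[Fact] = {
    ("Protege", "is-a", "Reasoner"),
}

def ask_closed(fact: Fact) -> bool:
    """Closed world: anything not asserted as true is taken to be false."""
    return fact in known_true

def ask_open(fact: Fact) -> Optional[bool]:
    """Open world: True or False only when asserted, otherwise None (unknown)."""
    if fact in known_true:
        return True
    if fact in known_false:
        return False
    return None

question = ("Racer", "is-a", "Reasoner")  # true on the slide, but not yet recorded
print("closed world:", ask_closed(question))
print("open world:  ", ask_open(question))
```

The closed-world answer treats the missing fact as false, whereas the open-world answer leaves it undetermined until more knowledge arrives, which suits incomplete and evolving scientific data.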

Semantic Web Layers http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/

Application Areas for Semantics • Smart search • Annotation (even simple forms), smart tagging • Geospatial • Implementing logic (rules), e.g. in workflows • Data integration • Verification …. and the list goes on • Web services • Web content mining with natural language parsing • User interface development (portals) • Semantic desktop • Wikis - OntoWiki, SemanticMediaWiki • Sensor Web • Software engineering • Explanation
