Interoperability – is it feasible?
Peter Wittenburg

Presentation Transcript
Why care about interoperability?
  • e-Science & e-Humanities
    • “data is the currency of modern research”
    • thus need to get integrated access to many data sets
    • data sets are
      • scattered across many repositories => (virtual) integration
      • created by different research teams using different conventions (formats, semantics)
      • often in bad shape and of poor quality => curation
    • thus "interoperability" was the most-used word at the ICRI conference
  • Big Questions:
    • What is meant by interoperability?
    • How to remove interoperability barriers to analyze large heterogeneous and probably distributed data sets?
    • Is interoperability something we need/want to achieve?
What is interoperability?

Wikipedia: Interoperability is a property of a system, whose interfaces are completely understood, to work with other systems, present or future, without any restricted access or implementation.

IEEE: Interoperability is the ability of two or more systems or components to exchange information and to use the information that has been exchanged.

O’Brien/Marakas: Being able to accomplish end-user applications using different types of computer systems, operating systems, and application software, interconnected by different types of local and wide-area networks.

OSLC: To be interoperable one should actively be engaged in the ongoing process of ensuring that the systems, procedures and culture of an organization are managed in such a way as to maximise opportunities for exchange and re-use of information.

What is interoperability?
  • Technical Interoperability (techn. encoding, format, structure, API, protocol)
  • Semantic Interoperability
  • is it also about bridging understanding between two or more humans?

[Figure: the terms <köter>, <dog>, <hund> – different words for the same concept]

humans – humans: here we better speak about understanding

humans – machine: the same, or?

machine – machine: well, here interoperability makes sense

What is interoperability?
  • it seems that everyone speaks about technical systems when talking about interoperability
  • do we include feeding machines with some mapping rules specified by human users and then carrying out some automatic functions?
    • when linguists hear about mapping tag sets some immediately say that it is impossible and does not make sense
      • why: tags are part of a whole theory behind them
  • well, if you look at other disciplines (life sciences, earth observation sciences, etc.), that’s exactly what they do
  • why?
    • people want to work across collections and ignore theories
    • some see tag sets just as first help but want to work on raw data
    • some see the demand of politicians and society to come up with answers and not with statements about problems
    • AND there is much money (is it useless?)
Big Data in Natural Science
  • numbers in regular structures
  • how to find relevant data sets?
  • volcanology / earthquakes / tsunamis / etc.
    • X sensor data streams (seismology): (time, location, parameters)
    • X human observations (biodiversity): (time, location, number of frogs, etc.)
    • window extraction to transfer and manage data
    • interpret regular structures (even frogs)
    • time normalization, take care of dynamics, etc.
    • visualize things coherently
Big Data in Natural Science

interoperability looks simple enough:
just find patterns in sequences of numbers – the format is all you need to know
(well – not quite as simple, but ...)
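As an aside on how mechanical these operations are, here is a minimal Python sketch of the window-extraction and time-normalization steps over a timestamped sensor stream; the data layout and the function names are hypothetical illustrations, not part of any of the systems mentioned:

    from datetime import datetime, timedelta

    # assumption: a reading is (timestamp, station, value) – e.g. a seismic sample or a frog count
    readings = [
        (datetime(2012, 5, 1, 12, 0, 0), "station-A", 0.13),
        (datetime(2012, 5, 1, 12, 0, 10), "station-A", 0.42),
        (datetime(2012, 5, 1, 12, 5, 0), "station-A", 0.07),
    ]

    def extract_window(stream, start, length):
        """Keep only readings inside [start, start + length) – the window-extraction step."""
        end = start + length
        return [r for r in stream if start <= r[0] < end]

    def normalize_time(stream, origin):
        """Replace absolute timestamps by seconds since a common origin – the time-normalization step."""
        return [((t - origin).total_seconds(), station, value) for t, station, value in stream]

    window = extract_window(readings, datetime(2012, 5, 1, 12, 0, 0), timedelta(minutes=1))
    print(normalize_time(window, datetime(2012, 5, 1, 12, 0, 0)))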

Big Data in Environmental Sciences
  • many different types of observations
    • climate, weather, etc.
    • species and populations according to a multitude of classification systems and schools
  • grand challenge
    • how can all these observations be used to stabilize our environment
    • how can it all be used to maintain diversity
    • etc.
Big Data in Environmental Sciences

sounds similar to our field: interoperability is tough, but there are expected gains – and there is more money

intensive work also in the social sciences

many layers of interop: access

[Layer diagram: scientists, data curators, end users, and applications work on datasets accessed via repositories, supported by enabling technologies]

  • Discovery: metadata search resulting in Handles (PIDs) and some properties – needs a high degree of automation
  • Access: reference resolution, protocols, AAI – Handle (PID) resolution, and you get the data
  • Interpretation: get schemas and semantics – here linguistics is playing a role; what can be automated?
  • Reuse: get context information – here linguistics is playing an even bigger role

many layers of interop: management (mostly underestimated!!!)

[Layer diagram: data managers and data scientists work on datasets accessed via repositories, supported by enabling technologies]

  • Collections + properties: metadata gathering resulting in Handles (PIDs) and some properties – needs a high degree of automation
  • Access: reference resolution, protocols, AAI – Handle (PID) resolution, and you get the data
  • Policies: formalized policies executed by a workflow engine; formal rules manipulate properties of Handles and metadata and may generate new DOs – needs a high degree of automation
  • Assessment: check of rules and engine
  • it’s all about establishing trust

simple but essential example: PIDs
  • it’s similar to TCP/IP with all its core machinery that brought us the Internet and thus interoperability with respect to communication
  • the email system works when we abstract from the content and thus from the semantics of our human messages, and focus on the semantics of attributes, parameters, etc.
  • let’s assume that you want to use a certain file and first want to be sure that the file has not been modified
    • you look up in metadata
    • that automatically looks for the PID
    • the PID is resolved automatically and a checksum is retrieved
    • the checksum is automatically compared with the checksum of the file accessed
    • a warning is given automatically if the two don’t match
  • this would be a great service (and will come)
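A minimal Python sketch of that checksum service, assuming a Handle-style REST resolver and a PID record that carries a CHECKSUM value; the resolver URL, the record layout, and the function names are illustrative assumptions rather than a description of an existing service:

    import hashlib
    import json
    import urllib.request

    # assumption: a REST-style Handle resolver that returns the PID record as JSON
    HANDLE_RESOLVER = "https://hdl.handle.net/api/handles/"

    def checksum_from_pid(pid):
        """Resolve the PID and return the checksum stored in its record (assumed field name)."""
        with urllib.request.urlopen(HANDLE_RESOLVER + pid) as response:
            record = json.load(response)
        for value in record["values"]:
            if value["type"] == "CHECKSUM":          # assumption: checksum stored under this type
                return value["data"]["value"]
        raise KeyError("no CHECKSUM value in PID record for " + pid)

    def verify_file(path, pid):
        """Compare the registered checksum with the checksum of the file actually accessed."""
        with open(path, "rb") as f:
            local = hashlib.md5(f.read()).hexdigest()
        if local != checksum_from_pid(pid):
            print("WARNING: " + path + " differs from the object registered under " + pid)
            return False
        return True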
Internet machinery (collaboration CNRI and MPI)

[Protocol stack diagram]
  • Value Added Services: DNS, email, WWW, phone
  • application and transport protocols: SMTP, HTTP, RTP…, TCP, UDP…
  • Internet Protocol Suite: IP
  • Network Technology: ethernet, PPP…, CSMA, async, sonet…, copper, fiber, radio

all applications make use of the same basic protocol, where the “packet” is the basic object and where endpoints have addresses and names

Data machinery (collaboration CNRI and MPI)

[Layer diagram]
  • Value Added Services: persistent reference, analysis, citation, custom clients, apps, plug-ins
  • Resolution System / Typing: persistent identifiers (PIDs); the PID record’s attributes describe properties & context and point to instances (bit sequences)
  • Digital Objects: metadata attributes describe properties; objects point to each other
  • Data Sources: data sets, RDBMS, files, local storage, cloud, computed

all applications make use of the same basic protocol, where “data” is the basic object and where PID and metadata attributes describe object properties

Layers of interoperability
  • Protocols/APIs: defined formats, semantics, processes
    • SCSI: how to read/write/etc. blocks to/from SCSI disc
    • File System: how to read/write/etc. to/from logical entities
    • how to organize files on a machine (virtualization)
    • how to organize files across machines
    • OAI-PMH: how to serve metadata descriptions
    • SRU/CQL: how to do distributed content search
    • etc. etc.
  • all based on standards or widely accepted best practices
    • advantage: standards establish a 1:N relation constant over time
  • large number of standards/BP for metadata (structure, semantics)
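As a concrete illustration of one of these protocol layers, here is a minimal Python sketch of an OAI-PMH ListRecords request; the repository endpoint is a placeholder, while the verb, the metadataPrefix parameter, and the namespaces are defined by the OAI-PMH and Dublin Core standards themselves:

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # placeholder endpoint – any OAI-PMH-compliant repository would do
    ENDPOINT = "https://repository.example.org/oai"

    params = urllib.parse.urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
        tree = ET.parse(response)

    # standardized namespaces for OAI-PMH and simple Dublin Core
    ns = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }
    for record in tree.iterfind(".//oai:record", ns):
        for title in record.iterfind(".//dc:title", ns):
            print(title.text)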
back to linguistics
  • where are we in the linguistics domain?
    • what happened in some well-known projects
    • do we miss the big challenges that other disciplines have and that would force us to ignore schools, vanity, etc.?
  • 4 examples
    • metadata
    • DOBES
    • TDS
    • CLARIN
metadata is kind of easy
  • DC/OLAC – CMDI mapping examples:
    • DC:language => CMDI:languageIn
    • DC:language => CMDI:dominantLanguage
    • DC:language => CMDI:sourceLanguage
    • DC:language => CMDI:targetLanguage
    • DC:date => CMDI:creationDate
    • DC:date => CMDI:publicationDate
    • DC:date => CMDI:startYear
    • DC:date => CMDI:derivationDate
    • DC:format => CMDI:mediaType
    • DC:format => CMDI:mimeType
    • DC:format => CMDI:annotationFormat
    • DC:format => CMDI:characterEncoding
  • everyone accepts now: metadata is for pragmatic purposes and does not replace the one and only true categorization
  • mapping errors may influence recall and precision – but who really cares?

semantic mapping is doable due to the limited element sets and the now well-described semantics
(except for recursive machines such as TEI)

if the mapping is used for discovery – no problem
if the mapping is used for statistics – well ...

crucial for machine processing
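A minimal Python sketch of such a pragmatic element mapping; the candidate lists simply restate the DC => CMDI pairs above, while the preference mechanism is a hypothetical illustration of how a harvester might pick one candidate:

    # one DC element maps to several CMDI candidates – the 1:N situation shown above
    DC_TO_CMDI = {
        "DC:language": ["CMDI:languageIn", "CMDI:dominantLanguage",
                        "CMDI:sourceLanguage", "CMDI:targetLanguage"],
        "DC:date":     ["CMDI:creationDate", "CMDI:publicationDate",
                        "CMDI:startYear", "CMDI:derivationDate"],
        "DC:format":   ["CMDI:mediaType", "CMDI:mimeType",
                        "CMDI:annotationFormat", "CMDI:characterEncoding"],
    }

    def map_element(dc_element, prefer=None):
        """Return the CMDI candidates for a DC element, optionally narrowed by a pragmatic preference."""
        candidates = DC_TO_CMDI.get(dc_element, [])
        if prefer in candidates:
            return [prefer]
        return candidates

    # for discovery, returning all candidates is usually good enough;
    # for statistics, the choice matters – hence the caveat above
    print(map_element("DC:date"))
    print(map_element("DC:format", prefer="CMDI:mimeType"))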

truth in metadata usage – still !!

Rebecca Koskela: DataONE

DOBES – some facts
  • DOBES = Documentation of Endangered Languages
  • some facts
  • started in 2000 with 7 international teams and 1 archive team
  • 2012: now 68 documentation teams working almost everywhere
  • cross-disciplinary approach: linguists, ethnologists, musicologists, biologists, ship builders, etc.
  • every year one workshop and two training courses
DOBES Agreements
  • in the first 2-3 years quite a few joint agreements
    • formats to be stored in the archive – interoperability
    • principles of archiving such as PIDs
    • workflows determining the archive-team interaction
    • organizational principles to manage and manipulate data
    • metadata to be used to manage and find data
    • (pragmatics vs. theory)
    • joint agreement on Code of Conduct
  • short discussions on more linguistic aspects failed
    • agreement on joint tag set - NO
    • agreement on joint lexical structures - NO
    • etc.
    • good reason: the languages are so different
    • “bad” reason: agreements require effort
recent DOBES Questions
  • now after >10 years we have so much good data in the archive
  • what can we do with it ????
  • traditional: every researcher looks at his/her data and publishes, of course taking into account what has been published by others
  • new: can researcher teams come to new results while working on the raw and annotated data?
  • what does this require in case of “automatic” or “blind” procedures?
  • (remember that the researchers do not understand the language)
    • you need to know the tier labels to understand the type of annotation
    • you need to know the tags used to understand the results of the linguistic analysis work (morphological, syntactical, etc.)
a text example

what’s this?

  • Example from Kilivila (Trobriand Islands – New Guinea)
  • p1tr    Ambeya
  • p1en    Where do you go?
  • p2tr    Bala bakakaya
  • p2w-en  I will go I will take a bath
  • p2en    I will go to have a bath
  • p3tr    Bila bikakaya bike’ita bisisu bipaisewa
  • p3gl    3.Fut-go 3.Fut-bath 3.Fut-come back 3.Fut-be 3.Fut-work
  • p3w-en  He will go - he will have a bath - he will come back – he will stay - he will work.
  • p3en    He will take a bath and afterwards work with us.

big question: what can we do with searches and statistics – thus semi-automatic procedures – across different corpora?
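A minimal Python sketch of how such an interlinear example might be represented and searched; the tier names follow the labels above (tr, gl, w-en, en), but the data structure and the search function are hypothetical illustrations:

    # one annotated utterance from the Kilivila example, as a dictionary of tiers
    utterance = {
        "tr":   "Bila bikakaya bike'ita bisisu bipaisewa",
        "gl":   "3.Fut-go 3.Fut-bath 3.Fut-come back 3.Fut-be 3.Fut-work",
        "w-en": "He will go - he will have a bath - he will come back - he will stay - he will work.",
        "en":   "He will take a bath and afterwards work with us.",
    }

    def find_gloss(corpus, tag):
        """Return utterances whose gloss tier contains the given tag (e.g. '3.Fut')."""
        # without knowing the tier labels and the tag set used, this search is impossible –
        # which is exactly the cross-corpus problem raised here
        return [u for u in corpus if tag in u.get("gl", "")]

    print(find_gloss([utterance], "3.Fut"))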

Hum. Example: Multi-verb Expressions

mixed glossing

POS tagging

a multimodal example

Interaction study: 12 participants + experimenter; 7 tiers per participant

tier names from Toolbox

5 cross-corpora projects
  • demonstratives with exophoric reference (morpho-syntactic and discourse-pragmatic analysis incl. gestures)
  • discourse and prosody – convergence in information structure
  • relative frequencies of nouns, pronouns and verbs
  • cross-linguistic patterns in 3-participant events
  • one rather large program with 13 teams covering different languages
    • primary topic is “referentiality”
    • bigger question: how to do this kind of cross-corpus work
    • strategy: define new tag set and add a manually created tier
    • no agreed tags yet – a committee has been formed
    • now in a process to determine selection of corpora
    • question: will existing tags help to find spots of relevance?

in general:
additional tagging is based on specific agreements
are existing annotations of any help?
in the end, everyone works on his/her own data

TDS (LOT etc.)

Typology Database System – offering one semantic domain to look for phenomena in 11 different typological databases created independently and covering many languages.

[Architecture diagram]
  • layers: database schemata (any DDL) => local database ontologies (DTL) => global linguistic ontology (OWL) => topic taxonomies (SKOS)
  • roles: database developer, TDS knowledge engineer, domain expert
  • straight mapping vs. complex mapping
  • many descriptive parameters & differences in structure, terminology and theoretical assumptions

TDS (LOT etc.)

thus an ontology-based approach to interoperability, instead of an attempt to redo type specification (WALS)

good: get typology specs out of individual boxes

thus: TDS was also curation work

subject-verb agreement
  • Q1: which languages have subject-verb agreement?
    • db A: exactly this question with a Boolean answer
      • no distinction, thus simple
    • db B: a bundle of information
      • sole argument of an intransitive verb
      • agent/patient/recipient-like arguments of a transitive verb
      • in general “yes” for S and A cases (but not always clear)
  • Q2: which languages are of type A for transitive verbs?
    • db A: ambiguous – so give all languages or none
    • db B: simple answer
  • a pre-query stage allows the user to decide about options
  • what if several parameters are used to describe a phenomenon?
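A minimal Python sketch of this pre-query situation: one database answers the question directly, the other only via a bundle of finer-grained features that a unifying rule has to interpret; the database contents, field names, and the rule are hypothetical illustrations of the TDS idea, not its actual implementation:

    # db A answers the question directly with a Boolean
    db_a = {"lang-1": {"subject_verb_agreement": True},
            "lang-2": {"subject_verb_agreement": False}}

    # db B stores a bundle of finer-grained agreement features instead
    db_b = {"lang-3": {"agr_s": True, "agr_a": True, "agr_p": False},
            "lang-4": {"agr_s": True, "agr_a": False, "agr_p": False}}

    def has_sv_agreement_a(record):
        return record["subject_verb_agreement"]

    def has_sv_agreement_b(record, strict=False):
        """Unifying rule: 'yes' for S and A cases; the pre-query stage lets the user choose 'strict'."""
        if strict:
            return record["agr_s"] and record["agr_a"]
        return record["agr_s"] or record["agr_a"]

    languages = [lang for lang, rec in db_a.items() if has_sv_agreement_a(rec)]
    languages += [lang for lang, rec in db_b.items() if has_sv_agreement_b(rec, strict=True)]
    print(languages)   # ['lang-1', 'lang-3']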
Did TDS work?
  • let’s assume that
    • the local ontologies represent the conceptualization correctly
    • the global ontology forms a useful unifying conceptualization
    • (is there such an accepted unifying ontology?)
    • the 2-stage query interface offers proper help
    • THEN TDS sounds like an excellent, scalable approach
  • why has TDS not taken off yet?
    • TRs rely on papers and are not interested in databases ?
    • TRs don’t understand and rely on the formal semantics blurb ?
    • TRs would need to invest time – do they take it ?
    • (occasional usage, small community of experts)
  • what is WALS then – just a glossary for non-experts ?
What happens in CLARIN?
  • well, metadata is obvious => Virtual Language Observatory
    • harvesting and mapping is not the problem
    • bad quality is the problem (as for Europeana etc.)
  • planned is, for example, distributed content search (SRU/CQL)

what is comparatively easy?
  • what if we only look at Dutch or German texts?
    • searching just for textual patterns (collocations)
    • could make use of SUMO, WordNets, etc. to extend the query
    • but can/should we compete with Google?
  • what if we search across languages?
    • well – need some translation mechanism for textual patterns – could be trivial translation
    • does it make sense – will people use it?
    • AND: it is mainstream – so Google will do it

this is what is currently being worked on

not so inspiring

it seems that researchers are not really interested in this at the moment

what is more difficult and special?
  • assume some annotated texts, audios, videos
  • assume some standard type of linguistic annotations such as morphosyntax, POS, etc.

1. select corpora
2. select tag sets
3. formulate query
4. expand by rules (relations between tags)

this was rejected across country borders

but is this what we need?
is there a potential, or just a myth?
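A minimal Python sketch of step 4, expanding a query by rules over relations between tags; the rule table and the tag names are hypothetical illustrations:

    # assumed rule table: a query tag is expanded to corpus-specific tags
    # that are declared equivalent to or narrower than it
    EXPANSION_RULES = {
        "VERB": {"V", "VVFIN", "VVINF"},
        "NOUN": {"N", "NN", "NE"},
    }

    def expand_query(tags):
        """Return the query tags plus everything the rules relate to them."""
        expanded = set(tags)
        for tag in tags:
            expanded |= EXPANSION_RULES.get(tag, set())
        return expanded

    print(expand_query({"VERB"}))   # {'VERB', 'V', 'VVFIN', 'VVINF'} (order may vary)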

semantic bridges: how?
  • assume that we have two corpora: one encoded by STTS and the other one by CGN and assume that they have some linguistic annotation (morphosyntax, POS, etc.) to be used in a distributed search or statistics
  • (take care: searching != statistics)
  • what to do now to exploit both collections?
    • 1. do separate searches – well ...
    • 2. create rich umbrella ontology and complex refs
    • (comparable to TDS)
      • well – could become a never-ending story ...
      • people disagree on relations etc.
      • relations partly depend on pragmatic considerations
      • expensive, static, require experts, not understandable, etc.
Are flat category registries ok?
  • 3. flat registries of linguistic categories such as ISOcat (ISO 12620)
  • sounds like a solution for some tasks
    • easy mapping between two (or more) categories
    • users can easily create their own mappings or re-use existing ones
    • maintenance is easier and thus allows dynamics
    • etc.
    • so it seems that we could overcome the TDS barrier
  • but we are reducing accuracy and losing much information
  • too simple for statistics ??
  • sufficient for searches ??
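A minimal Python sketch of the flat-registry idea, using the STTS/CGN situation from the previous slide: each corpus-specific tag is linked to a shared data category (ISOcat-style), and a tag from one set is mapped to the other by going through that pivot; the identifiers and links are hypothetical illustrations, not actual ISOcat entries:

    # hypothetical links from corpus-specific tags to shared data categories
    STTS_TO_DATCAT = {"ADJD": "DC:adjective", "VVFIN": "DC:mainVerb"}
    CGN_TO_DATCAT  = {"ADJ":  "DC:adjective", "WW":    "DC:mainVerb"}

    def map_via_registry(tag, source, target):
        """Map a tag from one tag set to another via the shared (flat) category."""
        category = source.get(tag)
        return [t for t, c in target.items() if c == category]

    # search a CGN-annotated corpus with an STTS tag
    print(map_via_registry("ADJD", STTS_TO_DATCAT, CGN_TO_DATCAT))   # ['ADJ']
    # note what is lost: the finer distinctions carried by the original tags disappear,
    # which is why this may be sufficient for searches but too simple for statistics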
What about Jan’s examples?
  • e0: annotations are structured: “np\s/np”
  • e1: “JJR” -> “POS=adjective & degree=comparative”
  • e2: “Transitive” -> “thetavp=vp120 & synvps=[synNP] & caseAssigner=True”
  • e3: “VVIMP” -> “POS=verb & main verb & mood=imperative”
  • where to put annotation complexity if the “ontology” is simple?
  • complexity needs to be put into schemas
    • who can do it – is it feasible?
  • mapping must be between combinations of cats or graphs
    • who can do it – is it feasible?
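A minimal Python sketch of what mapping between combinations of categories amounts to for these examples: a compact tag such as “JJR” or “VVIMP” is decomposed into a feature bundle and matching is done over those bundles; the decomposition table restates e1 and e3 above, the rest is a hypothetical illustration:

    # decomposition of compact tags into feature bundles (restating e1 and e3)
    TAG_FEATURES = {
        "JJR":   {"POS": "adjective", "degree": "comparative"},
        "VVIMP": {"POS": "verb", "main verb": True, "mood": "imperative"},
    }

    def matches(tag, query):
        """True if the tag's feature bundle satisfies every feature/value pair in the query."""
        features = TAG_FEATURES.get(tag, {})
        return all(features.get(key) == value for key, value in query.items())

    print(matches("VVIMP", {"POS": "verb", "mood": "imperative"}))   # True
    print(matches("JJR",   {"POS": "verb"}))                         # False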
are there conclusions?
  • do we want/need cross-corpora operations?
    • for many other communities this is a MUST
    • don’t we have “society-relevant” challenges?
    • do they just get more money?
    • given all the regularity-finding machines – is linguistic annotation relevant at all?
  • is it more difficult for us to do?
    • well – that’s what everyone claims – I don’t believe that anymore
are there conclusions?
  • are we interested in trying it out?
    • well – as yet there are not so many people committed
      • is it not of relevance?
      • is it lack of money?
    • some are opposing strictly
      • is it a sense of reality?
      • is it lack of vision?
      • is it vanity?
  • if interested, how do we want to tackle things?
    • pragmatic – stepwise – simple first
    • will people use it then?
    • do we have evangelists?
useless Cloud debate

some just call for Cloud – what does it solve?

just collecting all content into one big pot as well

all the issues about interoperability remain the same

searching will be more efficient – no transport etc.

What about metadata?
  • TEI example 1
    • resp: annotation supervisor and developer
      • date: from="1997" to="2004"
      • name: Claudia Kunze
    • which date is it? need to interpret context
    • which role is it? need to interpret context
  • TEI example 2
    • name: Dan Tufiş
    • resp: Overall editorship
    • name: Ştefan Bruda
    • resp: Error correction and CES1 conformance
    • which role is it? need to interpret context
  • very simple examples show
    • meant to be read by humans
    • (too) much freedom
    • no controlled vocabulary for the responsibility role
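A minimal Python sketch of why such a fragment needs human interpretation; the XML is reconstructed from the fields in example 1 above and is only an illustration, not a quote from the original TEI header:

    import xml.etree.ElementTree as ET

    # reconstructed from the fields above – illustrative only
    respstmt = """
    <respStmt>
      <resp>annotation supervisor and developer</resp>
      <date from="1997" to="2004"/>
      <name>Claudia Kunze</name>
    </respStmt>
    """

    elem = ET.fromstring(respstmt)
    # the markup only says *that* there is a responsibility, a date and a name;
    # which date it is and which role it is can only be read off the free text –
    # i.e. a machine has to interpret context, as noted above
    print(elem.findtext("resp"), elem.findtext("name"), elem.find("date").attrib)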
just a bit of school

[Semiotic triangle: the Term symbolizes a Concept, the Concept refers to the Referent, and the Term stands for the Referent]

C.K. Ogden / I.A. Richards, The Meaning of Meaning: A Study of the Influence of Language upon Thought and of the Science of Symbolism, London 1923, 10th edition 1969

just a bit of school

[The same triangle illustrated with the example “Orange”: Concept – Term – Referent]

Slide adapted from (c) Key-Sun Choi for Pan Localization 2005
from the slide of [Bargmeyer, Bruce, Open Metadata Forum, Berlin, 2005]