Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it


Language Resources

(lexicons, corpora, ontologies, …)

Standards and language technologies (cont.)

… and Projects

Nicoletta Calzolari

Istituto di Linguistica Computazionale - CNR - Pisa

glottolo@ilc.cnr.it

With many others at ILC

Doctoral programme, Pisa, May 2009

SIMPLE Model for a BioLexicon

  • Design a representational model for a BioLexicon, a comprehensive lexical resource
    • able to integrate terminological, lexical and ontological information
    • compatible with international HLT standards (i.e. ISO)
    • able to meet domain-specific requirements
  • Implement a BioLexicon database, a container of lexical objects to be filled with data provided by “populators” (EBI, UoM & CNR-ILC)
    • able to be automatically extended with new terms and linguistic information extracted from texts

from Valeria Quochi


BioLexicon Building cycle

  • Term Repository: gather terms (EBI)
  • Bio-Lexicon Population: variants and syntactic information of terms (UoM)
  • Bio-Lexicon: conceptual model and physical DB (ILC)
  • Bio-events: extraction of bio-events (ILC)
  • Terminology to Ontology (Jena/Rennes/EBI)

from Valeria Quochi

The BioLexicon: where from

Incremental population process (diagram labels):

  • Existing repositories: chemical compounds, species names, diseases, enzymes, genes/proteins
  • Subclustering of term variants
  • MEDLINE, Named Entity Recognition, new gene/protein names, term mapping by normalisation
  • Verbs, nouns, adjectives, adverbs (variants, inflected forms, derivational relations, ...)
  • Manual curation
  • Linguistic pre-processing, subcategorisation extraction, syn-sem mapping
  • Manual annotation of a bio-event corpus, bio-event extraction
  • BioLexicon

from Simonetta Montemagni


BioLexicon Model: High-level lexical objects, Data Categories

e.g.

<feat att="POS" val="VVZ"/>

<feat att="ConfScore" val="0.9"/>

<feat att="source" val="UNIPROT"/>

……
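As a rough illustration, data categories expressed as `feat` elements can be read into a simple attribute/value map. This is a sketch only: the `<entry>` wrapper element and the `read_feats` helper are assumptions made for the example, not part of the BioLexicon format.

```python
import xml.etree.ElementTree as ET

# The <entry> wrapper is hypothetical; only the <feat> elements come from the slide.
FEATS = """
<entry>
  <feat att="POS" val="VVZ"/>
  <feat att="ConfScore" val="0.9"/>
  <feat att="source" val="UNIPROT"/>
</entry>
"""

def read_feats(xml_text):
    """Collect every <feat att=... val=...> as an attribute -> value mapping."""
    root = ET.fromstring(xml_text)
    return {feat.get("att"): feat.get("val") for feat in root.iter("feat")}

print(read_feats(FEATS))
```

Values are kept as strings here; a real loader would probably convert scores such as `ConfScore` to numbers.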

Syntax

Semantics

from Valeria Quochi

DC selection
GeneRegOnto – BioLexConcepts to Predicates

from Valeria Quochi


Diagram: from ontology concepts to lexicon predicates, for the example “NF-AT positively regulates IL2”

  • Lexical entries: regulate, regulation
  • Bio event concept: Regulation (PositiveProteinRegulation, NegativeProteinRegulation)
  • Predicative argument structure: PredRegulate with Arg0Regulate and Arg1Regulate
  • Bio semantic roles: regulator, regulatee
  • Bio entity concepts: TranscriptionFactor, Protein (NF-AT, IL2)
  • Bio relations: regulates, isregulatedby
  • Other legend labels: bio semantic entry, bio-specific qualia relations

from Valeria Quochi

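Diagram: lexical objects involved in the pattern

  • Syntax: SynBehaviourLesion1, SubcatFrame pp-of, SynArg Arg0 pp-of
  • Semantics: SenseLesion1, PredicateLESION, SemArg Arg0 Pat
  • Ontology labels: Activity, Protein
  • BioLexicon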

The pattern “lesion of PROTEIN” is not stored in the lexicon, but it can be computed by accessing information scattered over various lexical objects (i.e. the syntactic unit lesion heads a pp-of corresponding to the patient argument, which is restricted by the ontological node PROTEIN)

All lexical items labelled as PROTEIN are candidates to fill this argument slot. “Lesion of OmpC”, “lesion of OmpR”, etc. are all admitted instances of this predicate/pattern.
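The computation described above can be sketched in Python. The object names (SynBehaviourLesion1, PredicateLESION, Arg0) follow the slide, but the dictionary layout and the `derive_pattern` and `candidate_fillers` helpers are hypothetical and do not reflect the actual BioLexicon database schema.

```python
# Hypothetical in-memory stand-ins for BioLexicon lexical objects.
SYNTACTIC_BEHAVIOURS = {
    "SynBehaviourLesion1": {
        "lemma": "lesion",
        "subcat_frame": "pp-of",
        "syn_args": {"Arg0": "pp-of"},  # syntactic realisation of each argument
    },
}
PREDICATES = {
    "PredicateLESION": {
        "sem_args": {"Arg0": {"role": "Pat", "restriction": "PROTEIN"}},
    },
}
# Syn-sem mapping: which syntactic behaviour realises which predicate.
SYN_SEM_LINKS = {"SynBehaviourLesion1": "PredicateLESION"}
# Ontological labels of lexical items (illustrative entries only).
ONTOLOGY_LABELS = {"OmpC": "PROTEIN", "OmpR": "PROTEIN", "mouse": "SPECIES"}

def derive_pattern(behaviour_id):
    """Combine syntactic and semantic objects into a surface pattern."""
    behaviour = SYNTACTIC_BEHAVIOURS[behaviour_id]
    predicate = PREDICATES[SYN_SEM_LINKS[behaviour_id]]
    parts = [behaviour["lemma"]]
    for arg, realisation in behaviour["syn_args"].items():
        preposition = realisation.split("-", 1)[1]  # "pp-of" -> "of"
        restriction = predicate["sem_args"][arg]["restriction"]
        parts.append(f"{preposition} {restriction}")
    return " ".join(parts)

def candidate_fillers(label):
    """All lexical items carrying the given ontological label."""
    return sorted(item for item, lab in ONTOLOGY_LABELS.items() if lab == label)

print(derive_pattern("SynBehaviourLesion1"))  # lesion of PROTEIN
print(candidate_fillers("PROTEIN"))           # ['OmpC', 'OmpR']
```

The point of the design is that the pattern is never stored: it is assembled on demand from the syntactic behaviour, the predicate and the ontology.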

Good mapping of Relations

OBO Relations        Relations from Extended Qualia Structure
isA                  is_a
partOf               is_a_part_of
hasPart              has_as_part
GrainOf              …
hasGrain             …
componentOf          …
hasComponent         …
properPartOf         …
hasProperPart        …
locatedIn            …
locationOf           …
containedIn          …
contains             contains
adjacentTo           ?
derivesFrom          derived_from
precededBy           ?
participatesIn       ?
hasParticipant       ?
agentOf              …
hasAgent             ?
functionOf           is_the_activity_of
hasFunction          …
instanceOf           …

Qualia roles shown on the slide: Agentive, Formal, Telic, Constitutive
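A minimal sketch of how the explicit part of this mapping could be held in code. Only the pairs spelled out on the slide are included; entries shown as “…” are left out rather than guessed, and the `map_relation` helper is a hypothetical name.

```python
# OBO relation -> Extended Qualia relation, only where the slide gives both sides.
OBO_TO_QUALIA = {
    "isA": "is_a",
    "partOf": "is_a_part_of",
    "hasPart": "has_as_part",
    "contains": "contains",
    "derivesFrom": "derived_from",
    "functionOf": "is_the_activity_of",
}

# Relations marked "?" on the slide: no qualia counterpart identified.
OPEN_QUESTIONS = {"adjacentTo", "precededBy", "participatesIn",
                  "hasParticipant", "hasAgent"}

def map_relation(obo_relation):
    """Return the qualia relation for an OBO relation, or None if not mapped."""
    return OBO_TO_QUALIA.get(obo_relation)

print(map_relation("partOf"))
print(map_relation("adjacentTo"))
```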

Enhancing Semantic Relations

BelongsToSpecies

phosphoglycolate

mouse

from Valeria Quochi

How to link Bio-Ontology and Bio-Lexicon: Place(s) of Semantics in BootStrep
  • Bio-Ontology holds domain-specific as well as general semantics (in terms of classes and relations between classes)
  • Lexicon model comes with a semantic layer based on a linguistic ontology (SIMPLE-CLIPS Ontology)

Questions:

  • What is the relation between the bio-ontology and the linguistic ontology?
  • Do they overlap? What is the overlap/intersection? What is the difference?
  • Is a mapping possible? What could a mapping look like?

Aim:

  • Bringing lexical semantics and ontological semantics together


The BioLexicon Model & Standards

The BioLexicon is based on the MILE metamodel and on the more recent ISO proposal for a Lexical Markup Framework (LMF)

Data Categories are drawn as far as possible from already existing repositories and standards (e.g. morphosyntactic data categories)

There is a need, however, to define a set of Data Categories specific to the biology domain (e.g. semantic roles and relations)

ISO Meta-model & Data Categories

An ISO standard for NLP lexica

  • Definition of the Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description

Objectives

  • Design of the abstract lexical meta-model
  • Definition of the common set of related Data Categories

The field is mature

from Monica Monachini

ISO - LMF
  • Specifically designed to accommodate as many models of lexical representation as possible
  • Its pros:
    • Meta-model: a high-level specification (ISO 24613)
    • Data Category Registry: low-level specifications (ISO 12620)
  • Not a monolithic model, rather a modular framework
    • LMF library provides the hierarchy of lexical objects (with structural relations among them)
    • Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)

ISO LMF – Lexical Markup Framework

Builds also on EAGLES/ISLE

Structural skeleton, with the basic hierarchy of information in a lexical entry + various extensions

LMF specs comply with UML modelling principles; an XML DTD allows implementation

Related projects and initiatives shown: LIRICS, ICT KYOTO, NEDO Asian Lang., NICT Language-Grid Service Ontology

LMF: NLP Extension for Semantics


Lexical Entry

Lexical Entry: LE_protein
  • SyntacticBehaviour: SB_protein
  • Lemma: L_protein
    • Representation Frame: RF_protein (DC: writtenForm = protein)

<LexicalEntry rdf:ID="LE_protein">
  <hasSyntacticBehaviour rdf:resource="../../#SB_protein"/>
  <hasLemma>
    <Lemma rdf:ID="L_protein">
      <hasRepresentationFrame>
        <RepresentationFrame rdf:ID="RF_protein"/>
      </hasRepresentationFrame>
    </Lemma>
  </hasLemma>
</LexicalEntry>

Event Representation through SemanticPredicate

SemanticPredicate: SP_regulate
  • SemanticArgument: SP_TF_protein (DC: role=agent)
  • SemanticArgument: SP_Target Gene (DC: role=patient)
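The predicate and its arguments can be mirrored with two small Python classes. This is a sketch under the assumption that a semantic argument needs only an identifier and a `role` data category; the class layout and the `roles` helper are illustrative, not an LMF API.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticArgument:
    id: str
    role: str  # value of the "role" data category, e.g. "agent" or "patient"

@dataclass
class SemanticPredicate:
    id: str
    arguments: list = field(default_factory=list)

    def roles(self):
        """Map each semantic role to the identifier of the argument filling it."""
        return {arg.role: arg.id for arg in self.arguments}

# The event "regulate" as drawn on the slide.
regulate = SemanticPredicate("SP_regulate", [
    SemanticArgument("SP_TF_protein", role="agent"),
    SemanticArgument("SP_Target Gene", role="patient"),
])

print(regulate.roles())
```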


Sense Representation

Sense: activate_2
  • Synset: activate
  • SemanticFeature: SF_chemistry, SF_process
  • SemanticRelation: is_a: [SenseID]; Typical_of: [SenseID] S_protein
  • Other linked objects: PredicativeRepresentation, Collocation

<Sense rdf:ID="activate_2">
  <belongsToSynset rdf:resource="#activate"/>
  <hasSemanticRelation rdf:resource="#is_a_1"/>
  <hasSemanticRelation rdf:resource="#has_as_part_1"/>
  <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
  <hasSemanticFeature rdf:resource="#SF_chemistry"/>
  <hasSemanticFeature rdf:resource="#SF_process"/>
</Sense>


Example of Semantic Relation

<SemanticRelation rdf:ID="is_in">
  <hasSourceSense>
    <Sense rdf:ID="S_cox15">
      <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_cox15</id>
    </Sense>
  </hasSourceSense>
  <hasTargetSense>
    <Sense rdf:ID="S_chromosome19">
      <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_chromosome19</id>
    </Sense>
  </hasTargetSense>
  <relationName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">is_in</relationName>
</SemanticRelation>

Diagram: Sense S_cox15, SemanticRelation is_in, Sense S_chromosome19
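A relation encoded this way can be read back as a (source, relation, target) triple. The sketch below uses Python's standard `xml.etree.ElementTree`; the `rdf:RDF` wrapper, the placeholder namespace `http://example.org/lmf#` and the `relation_triples` helper are assumptions added to make the fragment parseable, not part of the original slides.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LMF = "http://example.org/lmf#"  # placeholder namespace, not taken from the slides

DOC = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns="{LMF}">
  <SemanticRelation rdf:ID="is_in">
    <hasSourceSense><Sense rdf:ID="S_cox15"/></hasSourceSense>
    <hasTargetSense><Sense rdf:ID="S_chromosome19"/></hasTargetSense>
    <relationName>is_in</relationName>
  </SemanticRelation>
</rdf:RDF>
"""

def relation_triples(xml_text):
    """Return (source sense, relation name, target sense) for each relation."""
    root = ET.fromstring(xml_text)
    triples = []
    for rel in root.iter(f"{{{LMF}}}SemanticRelation"):
        source = rel.find(f"{{{LMF}}}hasSourceSense/{{{LMF}}}Sense")
        target = rel.find(f"{{{LMF}}}hasTargetSense/{{{LMF}}}Sense")
        name = rel.findtext(f"{{{LMF}}}relationName")
        triples.append((source.get(f"{{{RDF}}}ID"), name, target.get(f"{{{RDF}}}ID")))
    return triples

print(relation_triples(DOC))
```

For real RDF data a dedicated RDF parser would be the safer choice; plain XML parsing is used here only to keep the example dependency-free.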

XML-based Abstract Lexicon Interchange Format: Mapping exercise

Major best practices:

  • OLIF
  • PAROLE/SIMPLE
  • LC-Star
  • WordNet - EuroWordNet
  • FrameNet
  • BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
  • …others on the way…

Entries from existing lexicons have been mapped to LMF to prove that the model is able to represent many best practices and achieve unification

from Monica Monachini

Lexical WEB & Content Interoperability  ‘Standards’
  • As a critical step for semantic mark-up in the SemWeb

Diagram: lexicons (NomLex, ComLex, WordNets, SIMPLE, FrameNet, Lex_x, Lex_y) connected through LMF, with intelligent agents

Standards for Interoperability: Enough??

Need of tools to make this vision operational & concrete

New prototype “LeXFlow”:

  • web-based collaborative environment for semi-automatic management/integration of lexical resources
    • enabling interoperability of distributed lexical resources
    • accessed by different types of agents
  • addressing semi-automatic integration of computational lexicons, with focus on linking and cross-lingual enrichment of distributed LRs
    • Case-study: cross-fertilization between Italian and Chinese WordNets
  • From Language Resources to Language Services

Our WN case study
  • ItalWordNet (Roventini et al., 2003)
  • Academia Sinica Bilingual Ontological WordNet (Sinica BOW, Huang et al., 2004)
  • Both connected to Princeton WordNet (although to different versions)
  • Same set of semantic relations (EWN ones)

Architecture for cooperative integration of lexicons

Diagram components:

  • Agents: Agent Role1, Agent Role2, Agent Role3, Agent Role4
  • Coordination
  • Application
  • Simple-Wordnet Relation Calculator, with web service interface
  • MultiWordnet Relation Calculator, with web service interface
  • ILI Mapper, Relation Mapper
  • Data: Italian Simple, Italian Wordnet, Chinese Wordnet

Basic assumptions behind MWN …
  • Interlingual level:
    • Interlingua provides an indirect linkage between different WordNets: the Interlingual Index (ILI), an unstructured version of WordNet used in EuroWordNet
    • Each synset in a wordnet WN_A is linked to at least one record of the ILI by means of a set of relations (eq_synonym, eq_near_synonym, …)
  • Synset correspondence:
    • If a synset S_A and a synset S_B point to the same ILI record, they correspond to each other
  • Relation correspondence:
    • If two synsets in WN_A are linked by a relation, the same relation holds between the corresponding synsets in WN_B
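The two correspondence assumptions can be sketched as follows. The synset and ILI identifiers are toy values invented for the example, each synset is linked to a single ILI record for simplicity (the slides allow more than one), and the function names are hypothetical.

```python
from collections import defaultdict

# Toy data (hypothetical IDs): synset -> ILI record it points to.
WN_A_TO_ILI = {"A:road": "ili-road", "A:bend": "ili-bend"}
WN_B_TO_ILI = {"B:road": "ili-road", "B:bend": "ili-bend"}

# Relations known in each wordnet, as (source, relation, target).
WN_A_RELATIONS = [("A:road", "MPT", "A:bend")]
WN_B_RELATIONS = []

def corresponding_synsets(a_to_ili, b_to_ili):
    """Synset correspondence: synsets pointing to the same ILI record."""
    ili_to_b = defaultdict(set)
    for synset_b, ili in b_to_ili.items():
        ili_to_b[ili].add(synset_b)
    return {synset_a: ili_to_b.get(ili, set()) for synset_a, ili in a_to_ili.items()}

def propose_relations(relations_a, correspondence, relations_b):
    """Relation correspondence: project relations of WN_A onto WN_B."""
    existing = set(relations_b)
    proposals = []
    for source, rel, target in relations_a:
        for b_source in sorted(correspondence.get(source, ())):
            for b_target in sorted(correspondence.get(target, ())):
                candidate = (b_source, rel, b_target)
                if candidate not in existing:
                    proposals.append(candidate)
    return proposals

correspondence = corresponding_synsets(WN_A_TO_ILI, WN_B_TO_ILI)
print(propose_relations(WN_A_RELATIONS, correspondence, WN_B_RELATIONS))
```

Relations already present in WN_B are skipped, so the output contains only new candidates that a curator would still need to validate.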


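Example: nominal synsets linked through the ILI (diagram)

ItalWordNet synsets:
  • parte, tratto (N#12348)
  • passaggio, strada, via (N#1290)
  • curvatura, svolta, curva (N#20944)
  • carreggiata (N#21225)
  • Relations shown: iperonimia (hypernymy)/HYP, iponimia (hyponymy)/HPO, meronimy/MPT

ILI records (1.5 and 1.6 identifiers):
  • road, route: ILI1.5-3001757-n, ILI1.6-3243979-n
  • stretch: ILI1.5-5691718-n, ILI1.6-???
  • passage: ILI1.5-2857000-n, ILI1.6-3092396-n
  • roadway: ILI1.5-3002522-n, ILI1.6-3245327-n
  • bend, crook, turn: ILI1.5-8488101-n, ILI1.6-9992072-n

Chinese synsets:
  • tong_dao (通道) (N#03092396)
  • che_dao (車道) (N#3245327)
  • dao_lu, dao, lu (道路, 道, 路) (N#03243979)
  • wan (彎) (N#9992072)
  • Relations shown: 上位(泛稱)詞_為/HYP, 下位(特指)詞_為/HPO, 部件_部份詞_為/MPT

Outcomes marked on the diagram: a new proposed mero relation; synonym derived; synonym reinforcement & validity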


Example: verbal and adjectival synsets linked through the ILI (diagram)

ILI records (identifier pairs as shown):
  • respective, several, various: 00462055-a, 00364361-a
  • acquire_knowledge: 00403772-v, 00335115-v
  • absorb, assimilate, ingest, take_in: 00406975-v, 00338206-v
  • imbibe: 00407124-v, 00338343-v
  • receive, have: 01513366-v, 01260836-v
  • Also shown: 00001533-v

Italian synsets:
  • V#32925: studiare_3, imparare_1, apprendere_2
  • V#39802: prendere_3
  • V#32080: assimilare_5, assorbire_3, accettare_2, recepire_1
  • AG#42011: relativo_4

Relations shown: HYP, HPO, CAU, eq_syn, eq_near_syn, has_hyponym, has_hyperonym, causes; one relation is marked as derived

For a Global WordNet Grid
  • This architecture for making distributed wordnets interoperable lends itself to different applications in LR processing:
    • Enrichment of existing lexical resources
    • Creation of new resources
    • Validation of existing resources
  • Can provide a platform for cooperative & collective creation & management of LRs, by providing a web-based environment for the collaboration & interaction of distributed agents and resources

Can be seen as the

  • Prototype of a web application supporting the GlobalWordNet Grid initiative, i.e. a shared multi-lingual knowledge base for cross-lingual processing based on distributed resources over the Grid

New project: KYOTO


Distributed, diverse & dynamic data

Diagram (numbered steps 1 to 6):

  • Citizens, governments, companies and environmental organizations maintain terms & concepts in Wikyoto
  • Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
  • Tybot: term yielding robot
  • Kybot: knowledge yielding robot
  • Ontology levels: Top, Middle, Domain; nodes shown: Abstract, Physical, Process, Substance, H2O, CO2
  • Wordnets; terms shown: CO2 emission, pollution, emission, greenhouse gas
  • Text & Fact Index, Semantic Search
  • Index facts: Process: Emission; Involves: CO2; Property: increase, sudden; When: 2008; Where: Europe
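The indexed fact in this example (“Process: Emission, Involves: CO2, Property: increase, sudden, When: 2008, Where: Europe”) can be pictured as a small record that a semantic search filters on. This is only a sketch: the `Fact` class and the `search` function are hypothetical and say nothing about how KYOTO actually stores its fact index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    process: str
    involves: str
    properties: tuple
    when: str
    where: str

# The single fact extracted from the captured text on the slide.
FACTS = [Fact("Emission", "CO2", ("increase", "sudden"), "2008", "Europe")]

def search(facts, **criteria):
    """Return the facts whose fields equal all the given criteria."""
    return [fact for fact in facts
            if all(getattr(fact, key) == value for key, value in criteria.items())]

print(search(FACTS, process="Emission", where="Europe"))
print(search(FACTS, when="2009"))
```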

from Piek Vossen


Diagram: KYOTO annotation layers and knowledge resources

  • TEXT
  • Linear annotation layers: MAF (Morphological Annotation), SYNAF (Syntactic Annotation), SEMAF (Semantic Annotation), DAF (Discourse Annotation)
  • Generic layers: TMF, FACTAF
  • Processes: Term Extraction (Tybot), Fact Extraction (Kybot)
  • Knowledge resources: Wordnet, Domain Wordnet, ontology, domain ontology, Domain Terms
  • APIs: LMF API, OWL API
  • Scope labels: language specific, language neutral, language neutral & specific

from Piek Vossen

System components
  • Wikyoto = wiki environment for a social group:
    • to model the terms and concepts of a domain and agree on their meaning, within group, across languages and cultures
    • to define the types of knowledge and facts of interest
  • Tybots = Term extraction robots, extract term data from a text corpus
  • Kybots = Knowledge yielding robots, extract facts from a text corpus
  • Linguistic processors:
    • tokenizers, segmentizers, taggers, grammars
    • named entity recognition
    • word sense disambiguation
    • generate a layered text annotation in Kyoto Annotation Format (KAF)

from Piek Vossen

KYOTO SYSTEM

Diagram labels:

  • Processes: semantic annotation, term extraction (Tybot), fact extraction (Kybot), domain editing (Wikyoto)
  • Formats: Linear SYNAF/SEMAF, Generic TMF, Linear SEMAF, Linear Generic FACTAF
  • Users: fact user, concept user
  • APIs: LMF API, OWL API
  • Resources: Wordnet, Domain Wordnet, Ontology, Domain ontology

from Piek Vossen


Fact mining by Kybots

  • Source documents are analysed by linguistic processors
  • Morpho-syntactic analysis: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
  • Fact analysis:
    • [the emission]NP: Process: e1
    • [of greenhouse gases]PP: Patient: s2
    • [in agricultural areas]PP: Location: a3
  • Resources shown: Ontology (logical expressions); Wordnets & linguistic expressions; Generic and Domain levels
  • Ontology nodes shown: Abstract, Physical, Substance, Process, Chemical Reaction, H2O, CO2
  • Domain terms shown: CO2, emission, water, pollution
  • Relation label shown: Patient

from Piek Vossen


Contribution of KYOTO
  • Hundreds of thousands of sources in the environment domain
    • in many different languages
    • spread all over the world
    • changing every day
  • KYOTO delivers a Web 2.0 environment for community-based control
    • Connects people across languages and cultures
    • Establishes consensus and knowledge transition
  • KYOTO learns terms and concepts from text documents
    • Stored as structures that people and computers understand
  • KYOTO enables semantic search and fact extraction
    • Software can partially understand language and exploit web 1 data
    • Understanding is helped by the terms and concepts defined for each language

Diagram labels: html, pdf and xls sources; TYBOT, KYBOT, WIKYOTO; wordnets with environment terms; ontology with environment concepts; environment facts

from Piek Vossen


A common representation format: WordNet-LMF

Class diagram (elements shown; the multiplicities on the associations are 0..1, 0..*, 1..1 or 1..*):

  • LexicalResource, GlobalInformation, Lexicon, SenseAxes
  • LexicalEntry, Lemma, Sense
  • Synset, Definition, Statement, SynsetRelations, SynsetRelation
  • SenseAxis, InterlingualExternalRefs, InterlingualExternalRef
  • MonolingualExternalRefs, MonolingualExternalRef
  • Meta
  • Data Categories

from Monica Monachini

Centralized WordNet DC Registry

A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid

Intra-WN

Inter-WN

from Monica Monachini


WordNet-LMF multilingual level - Cross-lingual synset relations

<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
    id      ID    #REQUIRED
    relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
    ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
    externalSystem    CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>

Example synsets:
  • IWN: <fuoco_1, fiamma_1> 00001251-n
  • SWN: <fuego_3, llama_1> 09686541-n
  • WN3.0: <fire_1 flame_1 flaming_1> 13480848-n

Annotations on the slide:
  • groups monolingual synsets corresponding to each other and sharing the same relations to English
  • specifies the type of correspondence
  • link to ontology/(ies)
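A minimal sketch of building a `SenseAxis` element that follows the DTD above, using Python's standard `xml.etree.ElementTree`. The `make_sense_axis` helper, the axis identifier `sa_fire` and the choice of `eq_synonym` as `relType` are assumptions for illustration; the optional `Meta` child is omitted, and whether real `Target` IDs carry a language prefix is not stated on the slide.

```python
import xml.etree.ElementTree as ET

ALLOWED_REF_TYPES = ("at", "plus", "equal")  # enumerated in the DTD

def make_sense_axis(axis_id, rel_type, synset_ids, external_refs=()):
    """Build <SenseAxis> with one <Target> per synset and optional external refs.

    external_refs is a sequence of (externalSystem, externalReference, relType)
    tuples, where relType may be None because the DTD marks it #IMPLIED.
    """
    if not synset_ids:
        raise ValueError("SenseAxis requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for synset_id in synset_ids:
        ET.SubElement(axis, "Target", {"ID": synset_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            attrs = {"externalSystem": system, "externalReference": reference}
            if ref_type is not None:
                if ref_type not in ALLOWED_REF_TYPES:
                    raise ValueError(f"relType must be one of {ALLOWED_REF_TYPES}")
                attrs["relType"] = ref_type
            ET.SubElement(refs, "InterlingualExternalRef", attrs)
    return axis

# Group the Italian, Spanish and English "fire" synsets from the example.
axis = make_sense_axis("sa_fire", "eq_synonym",
                       ["00001251-n", "09686541-n", "13480848-n"])
print(ET.tostring(axis, encoding="unicode"))
```

The helper enforces only the constraints visible in the DTD (at least one `Target`, enumerated `relType` on external references); full validation would need a DTD-aware parser.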

from Monica Monachini

Ultimate goal
  • Global standardization and anchoring of meaning such that:
    • Machines can start to approach text understanding -> semantic web connects to the current web
    • Communities can dynamically maintain knowledge, concepts and their terms in an easy to use system
    • Cross-linguistic and cross-cultural sharing and communication of knowledge is enabled
  • Comparable to a formalization of Wikipedia for humans AND machines across languages

from Piek Vossen

Some steps for a “new generation” of LRs
  • From huge efforts in building static, large-scale, general-purpose LRs
  • To non-static LRs rapidly built on demand, tailored to specific user needs
  • From closed, locally developed and centralized resources
  • To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
      • From Language Resources
      • To Language Services

Distributed Language Services

A long-term scenario implying

    • content interoperability standards,
    • supra-national cooperation and
    • development of architectures enabling accessibility
  • Create new resources on the basis of existing
  • Exchange and integrate information across repositories
  • Compose new services on demand
  • Collaborative & collective/social development and validation, cross-resource integration and exchange of information

Language Grid

Wiki

In the “Semantic Web” vision ...

…need to tackle the twofold challenge of

  • content availability &
  • multilinguality
  • Natural convergence with HLT:
    • multilingual semantic processing
    • ontologies
    • semantic-syntactic computational lexicons

Language Tech … & … Knowledge, Content

Ready???

Knowledge Markup

LT & LRs

Semantic Web

How to cooperate??

Content Interoperable LRs & LT

LR and the future of LT or Content Tech

The need for ever-growing and richer LRs for effective multilingual content processing requires a change of paradigm and the design of a new generation of LRs, based on open content interoperability standards

The Semantic Web notion may be used to shape the LRs of the future, in the vision of an open space of sharable knowledge available on the Web for processing

The effort of making available millions of “richly annotated words” for dozens of languages is not affordable by any single group

This objective can only be achieved by creating integrated Open and Distributed Linguistic Infrastructures

Not only linguistic experts can participate in these; they may also include designers, developers, users of content encoding practices, etc., in wiki mode

Is the LR/LT field mature enough to broaden and open itself to the concept of a cooperative effort of different sets of communities?

Could a sort of large “Language Genome” initiative be effective? Storing lots of (annotated) facts


Today, many vitality & success signs… for LRs

  • In Spoken, Written, Multimodal areas … … in new emerging areas
  • Statistical approaches…
  • Different dimensions & layers: Content (Ontologies), Emotion, Time, …
  • For Evaluation
  • For Training
      • LREC (> 900 submissions); many LRs at COLING and even at ACL!!
      • ELRA (self-sustaining) & LDC
      • LRE (new Journal: N. Ide & NC)
      • ISO-TC37-SC4/WG4 (International Standards for LRs)
      • AFNLP…
      • FLaReNet
      • ESFRI - CLARIN (also political & strategic role)
      • New calls or initiatives in EU, US, ASIA, on LRs, interoperability, cooperation, …

BUT … an important point

In the ’90s

  • There was a global vision of the field & its main components:
    • Standards
    • Creation of LRs
    • Distribution

Then:

    • Automatic acquisition

… towards the Infrastructure of LRs & LT

ELRA

LDC

While today:

    • There is an ever increasing set of initiatives for new LRs, basic robust technologies, models??, algorithms,
  • We have a LR community culture
  • BUT sort of scattered, opportunistic, not much coherence

Today …

The wealth of data & of basic technologies is such that:

  • We should reflect again on the field as a whole & ask if
    • Standards
    • Creation of LRs
    • Automatic acquisition
    • Distribution

are still “the” important components,

or how they have changed/must change

  • Content interoperability
  • Collaborative creation & Manag.
  • Dynamic LRs
  • Sharing

… Which new challenges towards a

new & more mature infrastructure of LRs & LTs??

These dimensions

could be at the basis of a

new Paradigm for LRs & LT

& of a new Infrastructure ??

  • Content interoperability
  • Collaborative creation & Manag.

Need more

  • Dynamic LRs

Technology exists

  • Sharing

+

  • Distributed architectures/infrastr


Many dimensions around the notion of language

finally

  • We need to put together
  • technical,
  • organisational,
  • strategic,
  • economic,
  • political issues of LRs

Two new European Infrastructural & Networking Initiatives

Multilingualism

Political issues

e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK

Need of bodies for

a broad research agenda & strategic actions for LT&LRs (W/S /MM)

based on all the dimensions

Interdisciplinarity &

Multidisciplinarity

  • Cultural issues
    • Language … and cultural identity
    • Language … and the Humanities
  • Economic,
  • social issues
    • Applications
    • Services

Technical issues

Which Communities?

Technologies exist, but the infrastructure that puts them together and sustains them is still missing

for

  • Humanities
  • Social Sciences
  • Digital Libraries
  • Cultural Heritage
  • Language Resources
  • Language Technologies
  • Standardisation

core

Enabling

infrastr

CLARIN

ResInfra

FLaReNet

Network

Multilinguality

on

  • Grid
  • Semantic Web
  • Ontologists
  • ICT

Focus on cooperation

  • Many application domains

(eculture, egovernment, ehealth, …)

for


ESFRI Research Infrastructures

CLARIN

Common Language Resources and Technologies Infrastructure

for the Humanities & Social Sciences

Large-scale pan-European collaborative effort (31+ countries)

  • Make LRs & LTs available & readily usable to scholars of humanities & social sciences (& all disciplines)
  • Need to overcome the present fragmented situation by harmonising structural and terminological differences
  • Basis is a Grid-type infrastructure and Semantic Web technology
  • The benefits of computer enhanced language processing become available only when a critical mass of coordinated effort is invested in building an enabling infrastructure, which can provide services in the form of provision of tools & resources as well as training & counseling across a wide span of domains
  • The infrastructure will be based on a number of resource, service and expertise centres


CLARIN Mission

  • Create a comprehensive and free-to-use distributed archive of LRs & LTs covering not only the languages of all member states, but also other languages studied and used in Europe
  • Through the fact that the tools & resources will be interoperable across languages & domains, contribute to preserving and supporting multilingual & multicultural European heritage
  • An operational open infrastructure of web services will introduce a new paradigm of distributed collaborative development
  • Allow many contributors to add all kinds of new services based on existing ones, thus ensuring reusability and allowing scaling up to suit individual needs

How can we tackle these challenges?
  • J. Taylor: “eScience is about global collaboration in key areas of science and the next generation of infrastructures that will enable it”
  • Need to build new types of platforms
    • to allow researchers to combine existing resources easily into new ones to tackle the big challenges
    • to increase the productivity of all interested researchers, since currently too much time is wasted on preparatory work

from P. Wittenburg


CLARIN establishes such a new generation of extended infrastructure

  • Thus CLARIN is not about creating and building new language resources and technology, but about
    • making them available and accessible
    • as services
    • in a stable and persistent infrastructure

to allow tackling the great challenges

  • CLARIN: http://www.clarin.eu
  • Grid Project: http://www.mpi.nl/dam-lr
  • ISO TC37/SC4: http://www.tc37sc4.org
  • Standards Project: http://lirics.loria.fr/

eScience Vision

from P. Wittenburg

We still have a long path …

& also a “new project”

in an e-Contentplus Call for a:

  • “Thematic Network on Language Resources”:

FLaReNet

    • To provide common recommendations (to the EC) for future actions
    • To give priorities
    • Need of ‘visions’

In a global context, in cooperation with CLARIN

& also with non-EU members

Which Communities?

LRs & LTs exist, but a global vision, policy and strategy

is still missing

for

  • Humanities
  • Social Sciences
  • Digital Libraries
  • Cultural Heritage
  • Language Resources
  • Language Technologies
  • Standardisation
  • Ontologists
  • Content

core

CLARIN

ResInf

EU

Forum

FLaReNet

Network

Multilinguality

Focus on cooperation

for

  • EC
  • Funding agencies
  • Many application domains

(eculture, egovernment, ehealth, intelligence, domotics, content industry, …)

for


Fostering Language Resources Network

eContentplus

http://www.flarenet.eu

A new European Network for Language Resources

Nicoletta Calzolari (coord.)

glottolo@ilc.cnr.it

FLaReNet Fostering Language Resources Network

A European forum

  • to facilitate interaction among LR stakeholders

The Network structure considers that LRs present various dimensions and must be approached from many perspectives:

  • technical, but also
  • organisational
  • economic
  • legal
  • political

Addresses also

  • multicultural and multilingual aspects, essential when facing access and use of digital content in today’s Europe

http://www.flarenet.eu/

Organised in Thematic Working Groups

A layered structure, with leading experts & groups (national and European institutions, SMEs, large companies) for all relevant LR areas (about 40 partners)

    • in collaboration with CLARIN
    • to ensure coherence of LR-related efforts in Europe

FLaReNet will

  • consolidate existing knowledge, presenting it analytically and visibly
  • contribute to structuring the area of LRs of the future by discussing new strategies to:
    • convert existing and experimental technologies related to LRs into useful economic and societal benefits
    • integrate so far partial solutions into broader infrastructures
    • consolidate areas mature enough for recommendation of best practices
    • anticipate the needs of new types of LRs

Thematic Areas
  • The Chart for the area of LRs in its different dimensions
  • Methods and models for LR building, reuse, interlinking and maintenance
  • Harmonisation of formats and standards
  • Definition of evaluation protocols and evaluation procedures
  • Methods for the automatic construction and processing of LRs

To build together:

  • Evolving RoadMap
  • Blueprint of actions and infrastructures

Objectives & expected results

The largest Network of LR and HLT players, with diverse approaches, efforts and technologies

  • Enable progress toward community consensus
  • Give an extended picture of LRs & recast its definition in the light of recent scientific, methodological, technological, social developments
  • Consolidate methods & approaches, common practices, frameworks and architectures
  • A “roadmap” identifying areas where consensus has been achieved or is emerging vs. areas where additional discussion and testing is required, together with an indication of priorities
  • Recommendations in the form of a plan of coherent actions for the EU and national organizations
  • A European model for the LRs of the next years

Ambitious!

Outcomes of FLaReNet

The outcomes will be of a directive nature

  • to help the EC and national funding agencies identify priority areas of LRs of major interest for the public that need public funding to develop or improve

A blueprint of actions will constitute input to policy development both at EU and national level

  • for identifying new language policies that support linguistic diversity in Europe
  • in combination with strengthening the language product market, e.g. for new products & innovative services, especially for less technologically advanced languages

These Initiatives, … together
  • Call for international cooperation also outside Europe

and will be relevant for

  • setting up a global worldwide Forum of Language Resources and Language Technologies
