Martin Wynne OUCS OeRC Linguistics martin.wynne@oucs.ox.ac.uk

CLARIN: A Pan-European Research Infrastructure for Language Resources and Technologies
Martin Wynne, OUCS, OeRC & Linguistics Faculty, University of Oxford, martin.wynne@oucs.ox.ac.uk

Language Resources and Technologies
CLARIN is concerned with language resources and technologies, e.g.:
- Linguistic corpus (a principled collection of texts sampled to be representative of a particular language variety for the purposes of empirical linguistic research)
- Audio and video corpora
- Lexical resources (wordlists, dictionaries, morphological tables, semantic resources, ontologies)
- Language documentation (e.g. field notes about endangered languages)
- Language processing tools (for annotation, analysis, linking, editing, speech recognition and synthesis, translation, summarisation, text mining, internet search, etc.)
- Processing environments and workflow management tools
- Other language resources...

Language Resources and Technologies
These resources are of use not only in linguistics, but across the Humanities and in many areas of the Social Sciences. Basic language resource kits (BLARKs) are essential: a BLARK is the precondition for building natural-language-aware tools and services for a given language. Numerous applications beyond academic research therefore also depend on these datasets and tools.

The problems
- Many archives are known only to certain communities
- Archives are mostly unconnected, and data are difficult to find
- Every archive has its own standards for storage and access
- There are insufficient incentives to share resources
- Resources are in different formats, follow different standards, and are described in differing ways
- Basic resources do not exist for all languages
- Tools are hard for non-specialists to use
- Tools and data are not available for online processing (only simple retrieval of files is possible)
- Many researchers are not aware of the potential benefits of using language and speech technology tools
- Many researchers are not aware of leading-edge computational infrastructures

The CLARIN Vision
A researcher in Zagreb can, from a desktop computer:
- use single sign-on with local authentication
- search for, find and obtain authorization to use data in Oxford, Warsaw and Bergen
- select the precise (composite) dataset to work on, and save that selection
- run semantic analysis tools from Budapest and statistical tools from Tübingen over the dataset
- use computational power from the local or national computing centre where necessary
- save the workflow and results of the analysis, and share those results with collaborators in Paris, Vienna and Helsinki
- discuss, iteratively adapt and re-run the analyses with collaborators
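To make the "save that selection" and "save the workflow" steps of the vision more concrete, here is a minimal sketch of how a composite dataset and a list of processing steps might be recorded and shared as plain JSON. Everything in it is an illustrative assumption: the class names (`Resource`, `Step`, `Workflow`), the identifiers and the tool names are hypothetical and do not describe any actual CLARIN service or API.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Resource:
    """One item held by an archive (identifiers here are made up)."""
    archive: str      # e.g. "Oxford"
    identifier: str   # placeholder ID, not a real persistent identifier


@dataclass
class Step:
    """One processing step: a tool assumed to be hosted at some centre."""
    tool: str         # hypothetical tool name
    host: str         # centre assumed to provide the tool


@dataclass
class Workflow:
    """A saved selection of resources plus the steps to run over it."""
    name: str
    dataset: list = field(default_factory=list)
    steps: list = field(default_factory=list)

    def select(self, resource: Resource) -> None:
        # Add a resource to the composite dataset, ignoring duplicates.
        if resource not in self.dataset:
            self.dataset.append(resource)

    def add_step(self, tool: str, host: str) -> None:
        self.steps.append(Step(tool, host))

    def archives(self) -> list:
        # Distinct archives the composite dataset draws on, sorted by name.
        return sorted({r.archive for r in self.dataset})

    def to_json(self) -> str:
        # Serialise the whole workflow so it can be saved or shared.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Workflow":
        # Rebuild a workflow from its saved JSON form.
        raw = json.loads(text)
        return cls(
            name=raw["name"],
            dataset=[Resource(**r) for r in raw["dataset"]],
            steps=[Step(**s) for s in raw["steps"]],
        )


# Build the scenario from the slide with placeholder identifiers.
wf = Workflow("demo-analysis")
for archive in ("Oxford", "Warsaw", "Bergen"):
    wf.select(Resource(archive, f"{archive.lower()}-corpus-1"))
wf.add_step("semantic-analysis", "Budapest")
wf.add_step("statistics", "Tübingen")

# A collaborator could reload the saved workflow and re-run or adapt it.
restored = Workflow.from_json(wf.to_json())
```

The sketch deliberately leaves out authentication, authorization and the actual execution of tools; it only shows that a selection and its processing steps can be captured as data, which is what makes saving, sharing and re-running possible.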

The CLARIN Mission
What? Create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially humanities and social sciences.
How?
- Unite existing digital archives into a federation of connected archives with unified web access
- Provide language and speech technology tools as web services operating on (language) data in archives
This represents the first coordinated and comprehensive attempt to address the technical, legal, administrative and financial barriers to the effective use of LRTs in academic research.

Who are we?
- CLARIN consortium: 32 partners from 22 EU and associated countries
- CLARIN community: 140-odd members in 32 countries
- Leading partners include:
  - Utrecht University (Steven Krauwer, coordinator)
  - Max Planck Institute Nijmegen (Peter Wittenburg)
  - Hungarian Academy of Sciences (Tamás Váradi)
  - Oxford University (Martin Wynne)
  - Tübingen University (Erhard Hinrichs)
  - Helsinki University (Kimmo Koskenniemi)
  - University of Copenhagen (Bente Maegaard)
  - plus many more

CLARIN technical work
Promoting collaboration and interoperability between European language resource repositories, particularly in relation to:
- Persistent identifiers
- Component metadata
- Trust domains
- Long-term preservation and access
- Service centres
- Virtual collections
- Standards and best practices
- Concept registry services
See the CLARIN Short Guides on these topics at http://www.clarin.eu/

CLARIN and the GRID
- CLARIN aims to enable e-Humanities
- CLARIN is currently an early adopter of infrastructure services in Europe (e.g. implementing access via Shibboleth, PIDs, metadata mappings)
- CLARIN aims to be a gateway to language resource collections and technology services in other institutions (e.g. digital libraries, commercial collections)
- CLARIN could potentially provide language technologies and tools for other applications and services (e.g. for information extraction)
- Much current work involves collaboration with other initiatives in Grid computing, research infrastructure, standards, Digital Humanities, etc., to help create a coherent and coordinated infrastructure

Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230
