Next Steps / Technical Details
Bruno Pouliquen & Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU
Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech
JRC-Ispra, 17.09.04, Slide 1

Eurovoc indexing – Extend language coverage
Display
• Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish
• Czech, Croatian, Latvian, Lithuanian, Polish, Slovak
Analysis
• Danish, Dutch, English, Finnish, French, German, (Greek), Italian, Portuguese, Spanish, Swedish, (Lithuanian), (Bulgarian), (Hungarian)
Soon also
• Albanian, Romanian, Russian, Slovene

Incentive for collaboration
• Mutual benefit
  • We can provide tools and results to you (to non-commercial Member State organisations)
  • JRC will be able to Eurovoc-index documents for news analysis, etc.
• No payments by the JRC are foreseen
• How to go ahead? / What to do next?
  • We need Eurovoc-indexed texts in your languages (or translations of Eurovoc-indexed texts!) (Acquis Communautaire)

Format to provide training texts to the JRC
Ideally:
• Plain text (not MS-Word, RTF, PDF, etc.)
• UTF-8 character encoding
• With CELEX code
• With Eurovoc descriptor code (mentioning Eurovoc version)
• XML format, structured
• Linguistically pre-processed and structured:
  • lemmatised
  • annexes / signatures separate
  • title separate
  • stop word lists
• MANY texts:
  • 80,000 English texts were enough to train ca. 3500 descriptors (out of 6000)!
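As an illustration of the wish list above, the following minimal sketch builds one training document as structured XML and serialises a corpus as UTF-8. The element and attribute names (`corpus`, `document`, `celex`, `eurovoc_version`, `descriptor`, `title`, `body`, `annex`) are assumptions made for this example only; they are not the format the JRC actually specified.

```python
import xml.etree.ElementTree as ET


def build_document(celex, eurovoc_version, descriptor_codes, title, body, annex=""):
    """Build one training document as an XML element (hypothetical schema)."""
    doc = ET.Element("document", {"celex": celex})
    # Record the Eurovoc version together with the manually assigned codes.
    descriptors = ET.SubElement(doc, "descriptors", {"eurovoc_version": eurovoc_version})
    for code in descriptor_codes:
        ET.SubElement(descriptors, "descriptor", {"code": code})
    # Title, body and annex are kept in separate elements.
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    if annex:
        ET.SubElement(doc, "annex").text = annex
    return doc


def serialise_corpus(documents):
    """Return the whole corpus as UTF-8 encoded XML bytes."""
    corpus = ET.Element("corpus")
    corpus.extend(documents)
    return ET.tostring(corpus, encoding="utf-8")
```

A call such as `serialise_corpus([build_document("CELEX-PLACEHOLDER", "4", ["D1"], "Title", "Text")])` yields bytes that can be written to a file directly; the CELEX value and descriptor code here are placeholders, not real identifiers.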

[Figure] Descriptor distribution in Spanish EP/EC texts (two slides)

[Figure] Descriptor distribution in Spanish Congress texts

[Figure] Descriptor distribution in Hungarian texts

Procedure
• You provide us with
  • A big XML file containing the documents
  • A stop word list
• We will give back to you
  • A subset of documents (evaluation set)
    • Same format
    • Additional information on the automatically assigned Eurovoc descriptors
  • Some statistics on descriptor usage frequency, etc.
  • An online browser interface to see the assignment results
  • A validation interface

[Diagram] Your corpus (<xml> </xml>) → pre-processing → split into a training set (95%) and an evaluation set (5%)
• Training set → training → descriptor profiles
• Evaluation set + descriptor profiles → assignment → descriptors
• Export: <xml> <assignment> Eurovoc Assignment </assignment> </xml>
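The 95% / 5% division in the diagram can be sketched as follows. The slide does not say how the evaluation documents are selected, so the seeded random shuffle used here is an assumption, and `split_corpus` is a hypothetical helper name.

```python
import random


def split_corpus(documents, eval_fraction=0.05, seed=0):
    """Split documents into (training_set, evaluation_set).

    Assumption: the evaluation set is a random sample; the seed makes
    the split reproducible.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    # Keep at least one evaluation document when the corpus is not empty.
    n_eval = max(1, round(len(docs) * eval_fraction)) if docs else 0
    return docs[n_eval:], docs[:n_eval]
```

Descriptor profiles would then be trained on the first list only, and the second list would be used for assignment and evaluation.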

XML format

Results of descriptor assignment - interface

Results of descriptor assignment - XML
<assignment>
  <descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
  <descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
  <descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
  <descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
  <descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
  ...
</assignment>
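A recipient of this output could read it with a few lines of code. The sketch below parses the five descriptors shown on the slide and sorts them by one of the two scores. The slide does not say how COSINE and OKAPI are combined into a final ranking, so sorting by a single score is an assumption, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ASSIGNMENT_XML = """<assignment>
<descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
<descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
<descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
<descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
<descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
</assignment>"""


def read_assignment(xml_text):
    """Parse an <assignment> element into a list of dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": d.get("ID"),
            "name": (d.text or "").strip(),
            "cosine": float(d.get("COSINE")),
            "okapi": float(d.get("OKAPI")),
        }
        for d in root.iter("descriptor")
    ]


def rank(descriptors, score):
    """Sort descriptors by the given score ('cosine' or 'okapi'), best first."""
    return sorted(descriptors, key=lambda d: d[score], reverse=True)


descriptors = read_assignment(ASSIGNMENT_XML)
print(rank(descriptors, "okapi")[0]["name"])   # PRESIDENT
print(rank(descriptors, "cosine")[0]["name"])  # PRESIDENCY OF THE EC COUNCIL
```

The two scores order the same descriptors differently, which is why both are kept in the output.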

Results of descriptor assignment - validation
Numeric feedback?

Arranging the collaboration of scientific partners
• The JRC will be able to provide the tool and indexing results.
• The JRC does not have specific funds to pay for this work.
• Possibilities for collaboration between parliament and scientists
  • informal collaboration without payment
  • formal collaboration (contract, payment)
  • apply for a project with national or EU funding (example: Hungary)
  • M.Sc. theses (e.g. Lithuanian), internships (e.g. Estonian), …
  • …
• We would like to have lemmatisers for the new languages.
• If necessary, we can train the system without linguistic pre-processing.

Pre-processing of the texts (by scientists?)
• Linguistic pre-processing, needed for each language:
  • General and corpus-specific list of stop words (several thousand!)
  • For highly inflected languages: some lemmatiser or stemmer
  • Multi-word term mark-up for disambiguation purposes?
• Further text processing
  • Some document structuring to separate title, text, footer and annex
  • Conversion to XML
  • Conversion to UTF-8
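Two of the simpler steps above, conversion to UTF-8 and stop word removal, can be sketched as follows. The whitespace tokenisation and lower-casing are simplifying assumptions (punctuation is not handled), the source encoding must be known in advance, and both function names are hypothetical. Lemmatisation and multi-word term mark-up are language-specific and not covered here.

```python
def to_utf8(raw_bytes, source_encoding):
    """Re-encode text bytes from a known legacy encoding to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def remove_stop_words(text, stop_words):
    """Lower-case the text, split on whitespace and drop stop words."""
    stop = {word.lower() for word in stop_words}
    return [token for token in text.lower().split() if token not in stop]


print(remove_stop_words("The Council of the European Union", ["the", "of"]))
# ['council', 'european', 'union']
```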

Dealing with different versions of Eurovoc
• Problem has not yet been solved: request for your input
• English training material was indexed with versions 3.1 and 4
• Challenge: new descriptors need new training material → delay
  • Re-training required

Dealing with different versions of Eurovoc (2)
Case 1: New descriptor
• Search old and new documents for related documents for re-training
Case 2: New name for old descriptor
• Replace the descriptor name: OLD_NAME → NEW_NAME
Case 3: New place in hierarchy
• No problem
Case 4: Disappearing descriptor
• Will no longer be assigned

Dealing with different versions of Eurovoc (2)
Case 5: Several descriptors are conflated
• No problem
• OLD_NAME_1, OLD_NAME_2, OLD_NAME_3 → NEW_NAME
Case 6: A descriptor is split into two or more
• Re-training required (see Case 1)
• OLD_NAME → NEW_NAME_1, NEW_NAME_2, NEW_NAME_3
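The cases on these two slides can be read as a small set of rules for updating the descriptors attached to existing training documents. The sketch below is one possible reading, not the JRC procedure: renames and conflations are applied automatically, disappearing descriptors are dropped, and split descriptors are flagged for re-training. Dropping the old label of a split descriptor is an assumption, and `migrate_descriptors` and its parameters are hypothetical names.

```python
def migrate_descriptors(assigned, renamed, merged, removed, split):
    """Update one document's descriptor names to a new Eurovoc version.

    Returns (new_descriptors, needs_retraining).
    """
    new_descriptors = []
    needs_retraining = set()
    for name in assigned:
        if name in removed:
            # Case 4: the descriptor disappears and is no longer assigned.
            continue
        if name in split:
            # Case 6: the old label cannot be mapped automatically (assumption:
            # drop it), so the new descriptors are flagged for re-training.
            needs_retraining.update(split[name])
            continue
        name = renamed.get(name, name)  # Case 2: new name for old descriptor.
        name = merged.get(name, name)   # Case 5: several descriptors conflated.
        if name not in new_descriptors:
            new_descriptors.append(name)
    return new_descriptors, needs_retraining
```

Case 3 (new place in the hierarchy) needs no rule, and Case 1 (a completely new descriptor) cannot be handled by a mapping at all because it requires new training documents.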

Dealing with different versions of Eurovoc (3)
• Changes between Eurovoc versions should not only be described in free text.
• They should be formalised in a machine-readable way (e.g. in XML, in table format, …).
• This should be done centrally for the thesaurus (i.e. for all thesaurus languages), rather than separately for each language!
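To make the request concrete, here is a minimal sketch of what a machine-readable change list could look like and how it could be read. The whole schema (`changes`, `rename`, `merge`, `split`, `remove`, `add` and their attributes) is invented for this illustration, and the identifiers are placeholders. Referring to descriptors by ID rather than by name is one way to keep a single change list valid for all thesaurus languages.

```python
import xml.etree.ElementTree as ET

# Hypothetical change list between two Eurovoc versions (invented schema).
CHANGES_XML = """
<changes from_version="3.1" to_version="4">
  <rename id="ID_A" old_name="OLD_NAME" new_name="NEW_NAME"/>
  <merge into="ID_M"><old id="ID_1"/><old id="ID_2"/><old id="ID_3"/></merge>
  <split id="ID_S"><new id="ID_S1"/><new id="ID_S2"/></split>
  <remove id="ID_R"/>
  <add id="ID_N"/>
</changes>
"""


def read_changes(xml_text):
    """Parse the hypothetical change list into plain dictionaries and sets."""
    root = ET.fromstring(xml_text)
    return {
        "versions": (root.get("from_version"), root.get("to_version")),
        "renamed": {e.get("id"): e.get("new_name") for e in root.iter("rename")},
        "merged": {o.get("id"): m.get("into")
                   for m in root.iter("merge") for o in m.iter("old")},
        "split": {s.get("id"): [n.get("id") for n in s.iter("new")]
                  for s in root.iter("split")},
        "removed": {e.get("id") for e in root.iter("remove")},
        "added": {e.get("id") for e in root.iter("add")},
    }
```

With such a file published centrally, each indexing system could update its training material automatically instead of interpreting free-text release notes.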

Appeal to Eurovoc community / EP / OPOCE
• Make Eurovoc available to the wider public in machine-readable form
  • Formalise the version differences (e.g. XML)
• Make Eurovoc-indexed texts available to the scientific community
  • Controlled by licences, if necessary
  • E.g. via the Evaluations and Language resources Distribution Agency (ELDA)
    • See http://www.elda.fr
    • “ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.”
• Wealth of ‘parallel texts’ to train multilingual text analysis applications
  • Machine Translation
  • Multilingual Named Entity Recognition
  • Multilingual classification
  • Multi-document summarisation
  • …
  • Automatic indexing
• The benefit is yours!
