European Language Resources Association (ELRA). HLT Evaluations.
Khalid CHOUKRI, ELRA/ELDA, 55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30
Email: [email protected]
http://www.elda.org/ or http://www.elra.info/
Evaluation to drive research progress
Human Language Technologies Evaluation(s)
What, why, for whom, how ….
(Some figures from TC-STAR)
Examples of Evaluation campaigns
Demo … (available afterwards)
Presentation Outline
2005: Extension of ELRA’s official mission to promote LRs and evaluation for the Human Language Technology (HLT) sector:
The mission of the Association is to promote language resources (henceforth LRs) and evaluation for the Human Language Technology (HLT) sector, in all their forms and all their uses.
ELRA: An efficient infrastructure to serve the HLT Community
Strategies for the next Decade … New ELRA status
Meeting points with technology development which have been …
Long term / high risk
Large return on investment
Choose between research alternatives
Identify promising technologies (market)
Benchmarking … state of the art
Share knowledge … dedicated workshops
Feedback … Funding agencies
Share Costs???
Why Evaluate?
What about good technology? …
Software industry
Technology performance & Applications
MT developers want to improve the “quality” of MT output
MT users (humans, or software such as CLIR systems) want to improve productivity by using the most suitable MT system (e.g. for multilinguality)
…
HLT Evaluations … For whom
Metric(s): Automatic, Human judgments … scoring software
Scale/range of performance to compare against (baseline)
Reliability assessment: an independent body
Participants: technology providers
Requirements for an evaluation campaign
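The "metric(s)" and "baseline" requirements above can be made concrete with a small sketch. The example below implements word error rate (WER), the standard automatic metric for ASR evaluation (errors divided by reference length); the sentences are invented for illustration, and real campaigns use scoring software such as NIST's sclite rather than a hand-rolled script.

```python
# Minimal sketch of an automatic scoring metric: word error rate (WER).
# WER = (substitutions + deletions + insertions) / reference word count,
# computed here via Levenshtein distance over word tokens.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate between a reference and a hypothesis transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One word deleted out of a six-word reference.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Comparing each participant's WER against a shared baseline system, on data scored by an independent body, is exactly what the requirements above describe.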
Activities by technology
Activities by geographical region
Evaluation Services
HLT Evaluation Portal … Pointers to projects
Let us list some well known campaigns
A.1) Face Detection
A.2) Visual Person Tracking
A.3) Visual Speaker Identification
A.4) Head Pose Estimation
A.5) Hand Tracking
B) Sound and Speech technologies
B.1) Close-Talking Automatic Speech Recognition
B.2) Far-Field Automatic Speech Recognition
B.3) Acoustic Person Tracking
B.4) Acoustic Speaker Identification
B.5) Speech Activity Detection
B.6) Acoustic Scene Analysis
C) Contents Processing technologies
C.1) Automatic Summarisation … Question Answering
Some of the technologies being evaluated within CHIL … http://chil.server.de/
more at the CHIL/CLEAR workshops
Technolangue/Evalda: the Evalda platform consists of 8 evaluation campaigns focusing on spoken and written language technologies for the French language:
ARCADE II: evaluation of bilingual corpora alignment systems.
CESART: evaluation of terminology extraction systems.
CESTA: evaluation of machine translation systems (Ar, Eng => Fr).
EASY: evaluation of parsers.
ESTER: evaluation of broadcast news automatic transcribing systems.
EQUER: evaluation of question answering systems.
EVASY: evaluation of speech synthesis systems.
MEDIA: evaluation of in-context and out-of-context dialog systems.
Evaluation Projects … The French scene
Some projects in NL, Italy, ...
TC-STAR
Some details from relevant projects
European Parliament Plenary Sessions (EPPS): English (En) and Spanish (Es),
Broadcast News (Voice of America, VoA): Mandarin Chinese (Zh) and English (En)
Back to Evaluation Tasks within TC-STAR (http://www.tc-star.org/)
Speech recognition
Improvement of SLT Performances (En => Es)
Input = text
The end-to-end evaluation is carried out for one translation direction: English-to-Spanish
Evaluation of the ASR (ROVER) + SLT (ROVER) + TTS (UPC) system
Same segments as for SLT human evaluation
Adequacy: comprehension test
Fluency: judgement test with several questions related to fluency and also usability of the system
End-to-End
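The end-to-end setup above chains three components: the audio is recognised (ASR), the transcript is translated (SLT), and the translation is synthesised (TTS), with human judges rating only the final audio. A minimal sketch of that chaining follows; all three component functions are invented stand-ins for illustration, not TC-STAR modules.

```python
# Hypothetical end-to-end pipeline: ASR -> SLT -> TTS.
# Each component below is a toy stand-in, not a real TC-STAR system.

def asr(audio: bytes) -> str:
    """Stand-in recogniser: pretend the audio decodes to this English text."""
    return "good morning ladies and gentlemen"

def slt(english: str) -> str:
    """Stand-in English-to-Spanish translation (tiny word lookup for the demo)."""
    lexicon = {"good": "buenos", "morning": "días",
               "ladies": "señoras", "and": "y", "gentlemen": "señores"}
    return " ".join(lexicon.get(word, word) for word in english.split())

def tts(spanish: str) -> bytes:
    """Stand-in synthesiser: returns placeholder 'audio' for the Spanish text."""
    return spanish.encode("utf-8")

def end_to_end(audio: bytes) -> bytes:
    # Judges rate only this final output, so errors in any stage
    # (recognition, translation, synthesis) all affect the scores.
    return tts(slt(asr(audio)))

print(end_to_end(b"<input audio>").decode("utf-8"))
```

The design point this illustrates is that end-to-end evaluation measures the composed system, which is why TC-STAR also evaluated each component (ASR, SLT, TTS) in isolation.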
1: Not at all … 5: Yes, absolutely
[Fluent Speech] Is the speech in good Spanish?
1: No, it is very bad … 5: Yes, it is perfect
[Effort] Rate the listening effort
1: Very high … 5: Low, as natural speech
[Overall Quality] Rate the overall quality of this audio sample
1: Very bad, unusable … 5: It is very useful
Fluency questionnaire
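Judgments on the 1–5 scales above are typically collected from several judges per sample and aggregated into a mean score per question per system. A minimal sketch of that aggregation, with invented ratings purely for illustration:

```python
# Minimal sketch of aggregating 1-5 fluency judgments per system.
# The ratings below are invented; real campaigns average many judges
# over many audio samples, per question (fluency, effort, quality).
from statistics import mean

# ratings[system] = list of 1-5 scores from individual judges
ratings = {
    "system_A": [4, 5, 3, 4],
    "system_B": [2, 3, 3, 2],
}

for system, scores in ratings.items():
    print(f"{system}: mean score = {mean(scores):.2f}")
```

Reporting means per question (rather than a single pooled number) preserves the distinction the questionnaire draws between fluency, listening effort, and overall quality.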
More results from the 2007 Campaign
Evaluation packages available
TC-STAR Tasks
It saves developers time and money
It helps assess progress accurately
It produces reusable evaluation packages
It helps to identify areas where more R&D is needed
Some concluding remarks on Technology evaluation