
ELSE



Presentation Transcript


  1. ELSE: Evaluation in Language and Speech Engineering, January 1998 - April 1999

  2. ELSE Participants
  • MIP - U. of Odense (Denmark)
  • UDS (Germany)
  • U. di Pisa (Italy)
  • EPFL (Switzerland)
  • XRCE (France)
  • U. of Sheffield (United Kingdom)
  • Limsi (CNRS) (France)
  • CECOJI (CNRS) (France)
  • ELRA & ELSNET

  3. Comparative Technology Evaluation Paradigm
  • Successfully used by US DARPA (since 1984)
  • Used on a shorter scale in Europe (Sqale, Grace…)
  • Choose a task / system or component
  • Gather participants
  • Organize the campaign (protocols / metrics / data)
  • Mandatory when the technology is insufficient for applications: MT, IR, summarization… (cf. speech recognition in the 80s)

  4. Knowledge gained from evaluation campaigns
  • Knowledge shared by participants in workshops:
    • How to get the best results?
    • Advantages / disadvantages of each methodology
  • Funding agencies (DARPA / others):
    • Level of technology / applications
    • Progress vs. investment
    • Set priorities

  5. Knowledge gained from evaluation campaigns
  • Industry:
    • Compare with the state of the art (developers)
    • Select technologies (integrators)
    • Easier market intelligence (SMEs)
    • Consider applications (end users)

  6. A powerful tool
  • Go deeper into the conceptual background: metrics, protocols...
  • Contrastive evaluation scheme
  • Accompanies research: a problem-solving approach
  • Of interest for both the speech and NL communities

  7. Resources & evaluation by-products
  • Training and test data
    • Must be of high quality (used for testing)
  • Evaluation toolkits
    • Expensive: of interest to all
  • Of interest for remote users (other domains, countries):
    • Compare with the state of the art
    • Induce participation in evaluation campaigns
    • Measure progress

  8. Relationship with usage-oriented evaluation
  • Technology evaluation:
    • Generic task
    • Attracts enough participants
    • Close enough to a practical application
  • Usage evaluation:
    • Specific application / specific language
    • User satisfaction criteria

  9. Relationship with usage-oriented evaluation
  • Technology insufficient: no application
  • Technology sufficient: possible applications
  • Usage evaluation requires a larger effort than technology evaluation
  • Technology evaluations (tens): generic, organized centrally
  • Usage evaluations (thousands): specific, organized by each application developer / user

  10. Relationship with long-term research
  • Different objectives / time scales
  • Meeting points placed in the future
  • LTR: a high-risk but high-profit investment

  11. ELSE results
  • What does ELSE propose?
    • An abstract architecture (generic IR/IE: profiling, querying and presentation)
    • Control tasks that:
      1) can easily be performed by a human
      2) allow arbitrary composite functionality
      3) come with a formalism for describing task results (sketched below)
      4) use measures that are easy to understand
    • 6 tasks or a global task to start with...
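The slides do not spell out the result-description formalism, so the following is only a hypothetical sketch of the idea: every control task reports (item, reference, hypothesis) records in one common shape, so that a generic scorer can produce an easy-to-understand measure. All names are illustrative, not an ELSE specification.

```python
# Hypothetical sketch only: ELSE does not prescribe this exact format.
# Idea: every control task reports its output as (item id, reference, hypothesis)
# records so a generic scorer can compute a simple, easy-to-understand measure.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str        # e.g. "word annotation (POS)"
    item_id: str     # identifier of the evaluated unit (word, sentence, document...)
    reference: str   # human-produced reference annotation
    hypothesis: str  # system output for the same unit

def accuracy(results: list[TaskResult]) -> float:
    """Fraction of items where the system output matches the reference."""
    correct = sum(r.hypothesis == r.reference for r in results)
    return correct / len(results) if results else 0.0

if __name__ == "__main__":
    demo = [
        TaskResult("word annotation (POS)", "s1-w1", "NOUN", "NOUN"),
        TaskResult("word annotation (POS)", "s1-w2", "VERB", "ADJ"),
    ]
    print(f"accuracy = {accuracy(demo):.2f}")  # 0.50
```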

  12. 6 control tasks to start with...
  1. Broadcast News Transcription
  2. Cross-Lingual IR / IE
  3. Text-To-Speech Synthesis
  4. Text Summarization
  5. Language Model Evaluation
  6. Word Annotation (POS, lemma, syntactic roles, senses, etc.)
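The slide does not fix a metric for each task, but for broadcast news transcription the measure conventionally used in such campaigns is the word error rate: substitutions, deletions and insertions against a reference transcript, divided by the reference length. A minimal sketch:

```python
# Sketch of word error rate (WER) for a transcription task:
# (substitutions + deletions + insertions) / reference length,
# computed with a standard edit-distance dynamic programme.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edit cost between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the news at nine", "the news at ten"))  # 0.25
```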

  13. ...or a global task to start with
  • "TV News on Demand" (NOD), inspired by BBN's "Rough'n'Ready":
    • segments radio and TV broadcasts
    • combines several recognition techniques (speaker identification, OCR, speech transcription, named entities, etc.)
    • detects topics
    • summarizes
    • searches / browses and retrieves information
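As an illustration of how the listed recognition techniques could be chained, here is a purely hypothetical sketch of such a pipeline; every component below is a trivial stand-in written for this example, not an ELSE or BBN interface.

```python
# Hypothetical sketch of the "TV News on Demand" chain described on the slide.
# All components are trivial stand-ins, only meant to show how the recognition
# techniques could be combined into one indexing pipeline.
def segment_broadcast(stream):            # audio/video segmentation (stub)
    return stream.split(" | ")

def transcribe_speech(segment):           # speech transcription (stub)
    return segment.lower()

def identify_speaker(segment):            # speaker identification (stub)
    return "anchor" if segment.startswith("Good evening") else "reporter"

def tag_named_entities(text):             # named-entity detection (stub)
    return [w for w in text.split() if w.istitle()]

def news_on_demand(stream):
    index = []
    for seg in segment_broadcast(stream):
        index.append({
            "speaker": identify_speaker(seg),
            "transcript": transcribe_speech(seg),
            "entities": tag_named_entities(seg),
        })
    return index   # topic detection, summarization and retrieval would follow

print(news_on_demand("Good evening from Paris | The Commission met in Brussels"))
```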

  14. Multilingualism
  • 15 countries
  • 2 possible solutions:
    1) A cross-lingual functionality requirement
    2) All participants evaluate on 2 languages: their own + one common pivot language (English?)

  15. Results computation
  • Multidimensional evaluation (multiple mixed evaluation criteria)
  • Baseline performance (contrastive)
  • Dual result computation (quality)
  • Reproducible (an automated evaluation toolkit is needed) - see the sketch below
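A minimal sketch of what contrastive, reproducible scoring could look like, assuming a hypothetical file format of one tab-separated reference/hypothesis pair per line: the same script scores a baseline run and the system run on the same test data, so anyone holding the raw outputs can reproduce the reported numbers.

```python
# Sketch of contrastive scoring under an assumed TSV format
# ("reference<TAB>hypothesis" per line): a system result is always reported
# next to a baseline scored on the same test data by the same script.
import sys

def score(path: str) -> float:
    """Exact-match accuracy over reference/hypothesis pairs in a TSV file."""
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            ref, hyp = line.rstrip("\n").split("\t")
            correct += (ref == hyp)
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    baseline_file, system_file = sys.argv[1], sys.argv[2]
    base, syst = score(baseline_file), score(system_file)
    print(f"baseline={base:.3f}  system={syst:.3f}  delta={syst - base:+.3f}")
```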

  16. Language resources
  • Human-built reference data (cost + consistency checks + guidelines)
  • Minimal size (chunk-based selective evaluation)
  • Minimal quality requirements
  • Representativity of language phenomena
  • Reusable & multilingual
  • By-products of evaluation become evaluation resources
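The slide calls for consistency checks on human-built reference data without naming one; a common choice (an assumption here, not an ELSE requirement) is inter-annotator agreement, for example Cohen's kappa between two annotators labelling the same items:

```python
# One common consistency check on human-built reference data (not prescribed
# by the slide): Cohen's kappa between two annotators over the same items.
from collections import Counter

def cohen_kappa(ann1: list, ann2: list) -> float:
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    # Chance agreement from each annotator's label distribution
    expected = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(c1) | set(c2))
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

print(cohen_kappa(["NOUN", "VERB", "NOUN", "ADJ"],
                  ["NOUN", "VERB", "ADJ", "ADJ"]))  # ~0.64
```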

  17. Actors in the infrastructure
  • European Commission
  • ELRA
  • Evaluators
  • Participants (EU / non-EU)
  • Language resource producers
  • Research
  • Industry
  • Citizens
  • Users & customers

  18. Need for a permanent infrastructure?
  • Problems with the Call for Proposals mechanism:
    • Limited duration (Framework Programmes) / cost shared by participants
  • A permanent organization:
    • General policy / strategy / ethical aspects
    • Scoring software
    • Label attribution / quality assurance & control
    • Production of language resources (dev, test)
    • Distribution of language resources (ELRA)
    • Cross-over between Framework Programmes

  19. Evaluation in the Call for Proposals
  • Evaluation campaigns: 2 years
  • Proactive scheme: select topics (research / industry), e.g. TV News on Demand or several tasks (BNT, CLIM, etc.)
  • Reactive scheme: select projects, identify generic technologies among projects (clusters?), resources contracted out of project budgets, a posteriori negotiation

  20. Multilinguality
  • Each participant should address at least two languages (their own + the common language)
  • One language common to all participants:
    • Compare technologies on the same language / data
    • Compare languages on the same technology
  • English: spoken by many people, large market, cooperation with the USA
  • Up to 4 languages for each consortium
  • Other languages in future actions

  21. Proactive vs. reactive?
  • ELSE's views:
    • Proactive
    • Single consortium
    • Permanent organization (Association + Agency)
    • English as the common language

  22. Estimated cost
  • 100% EC funding for the infrastructure organization and language resources
  • Participants: share of the system development cost
  • Reactive: extra funding for evaluation
  • Proactive: 600 Keuro on average per topic (3.6 Meuro total over the 6 topics):
    • 90 Keuro organization
    • 180 Keuro LR production
    • 300 Keuro participants (up to 10)
    • 30 Keuro supervision by the permanent organization

  23. Questions?
  • Are you interested in the concept?
  • Would you be interested in participating?
  • Would you be interested in providing data?
  • Would you be ready to pay to participate?
  • Would you be ready to pay for access to the results (and by-products, e.g. data and tools) of an evaluation?
  • Would you be interested in paying for specific evaluation services?
