1 / 1

A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch

A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch D. Binnenpoorte, F. de Vriend, J. Sturm, W. Daelemans, H. Strik, C. Cucchiarini. Feedback Feedback of the HLT field (academia and industry) was collected at a workshop with about 100 participants. 1: BLARK.

Download Presentation

A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch D. Binnenpoorte, F. de Vriend, J. Sturm, W. Daelemans, H. Strik, C. Cucchiarini Feedback Feedback of the HLT field (academia and industry) was collected at a workshop with about 100 participants. 1: BLARK 2: Inventory of available resources In defining the BLARK a distinction was made between: • Applications • Modules • Data A matrix (fragment in Figure 1) was drawn up describing • which modules are required for which applications; • which data are required for which modules; • what the relative importance is of the modules and data. Figure 1 (“+” = important, “++” = very important) Based on the full matrix a BLARK was defined. An inventory was made to establish which of the components - modules and data - that make up the BLARK are: • available; i.e. can be bought or are freely obtainable e.g. by open source; • (re-)usable. Inventory based on: • expert knowledge; • information found on the internet and in the literature; • personal communication with actors in the field. Components can only be considered usable if they are of sufficient quality  quality evaluation. Limited to a descriptive level: modules and data were checked against a list of evaluation criteria. A second matrix (fragment in figure 2) describes the availability of the components in the BLARK. Figure 2 (1 = ‘module or data set is unavailable’ to 10 = ‘module or data set is easily obtainable’). BLARK for language technology: Modules: • Robust modular text pre-processing • Morphological analysis and morpho-syntactic disambiguation • Syntactic analysis • Semantic analysis Data: • Monolingual lexicon • Annotated corpus of text (a treebank • Benchmarks for evaluation BLARK for speech technology: Modules: • Automatic speech recognition • Speech synthesis • Tools for calculating confidence measures • Tools for identification • Tools for (semi-) automatic annotation of speech corpora Data: • Speech corpora for specific applications • Multi-modal speech corpora • Multi-media speech corpora • Multi-lingual speech corpora • Benchmarks for evaluation Comparison 3: Priority list Requirements for prioritization: • the components should be relevant for a large number of applications; • the components should currently be either unavailable, inaccessible, or have insufficient quality; • developing the components should be feasible in the short term. Language technology: 1. Annotated corpus written Dutch 2. Syntactic analysis: robust recognition of sentence structure 3. Robust text pre-processing: tokenization and named entity recognition 4. Semantic annotations for the treebank mentioned above 5. Translation equivalents 6. Benchmarks for evaluation Speech technology: 1. Automatic speech recognition (including non-native ASR, robust ASR, adaptation, and prosody recognition) 2. Speech corpora for specific applications (e.g. directory assistance, CALL) 3. Multi-media speech corpora (speech corpora that also contain information from other media such as newspapers, WWW, etc.). 4. Tools for (semi-) automatic transcription of speech data 5. Speech synthesis (including tools for unit selection) 6. Benchmarks for evaluation

More Related