A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch
D. Binnenpoorte, F. de Vriend, J. Sturm, W. Daelemans, H. Strik, C. Cucchiarini

Feedback
Feedback from the HLT field (academia and industry) was collected at a workshop with about 100 participants.

1: BLARK (Basic Language Resource Kit)
In defining the BLARK a distinction was made between:
• Applications
• Modules
• Data
A matrix (fragment in Figure 1) was drawn up describing:
• which modules are required for which applications;
• which data are required for which modules;
• what the relative importance of the modules and data is.
Figure 1: fragment of the requirements matrix (“+” = important, “++” = very important).
Based on the full matrix a BLARK was defined.

2: Inventory of available resources
An inventory was made to establish which of the components (modules and data) that make up the BLARK are:
• available, i.e. can be bought or are freely obtainable, e.g. as open source;
• (re-)usable.
The inventory was based on:
• expert knowledge;
• information found on the internet and in the literature;
• personal communication with actors in the field.
Components can only be considered usable if they are of sufficient quality, which calls for quality evaluation. This evaluation was limited to a descriptive level: modules and data were checked against a list of evaluation criteria.
A second matrix (fragment in Figure 2) describes the availability of the components in the BLARK.
Figure 2: fragment of the availability matrix (scores range from 1 = ‘module or data set is unavailable’ to 10 = ‘module or data set is easily obtainable’).

BLARK for language technology
Modules:
• Robust modular text pre-processing
• Morphological analysis and morpho-syntactic disambiguation
• Syntactic analysis
• Semantic analysis
Data:
• Monolingual lexicon
• Annotated corpus of text (a treebank)
• Benchmarks for evaluation

BLARK for speech technology
Modules:
• Automatic speech recognition
• Speech synthesis
• Tools for calculating confidence measures
• Tools for identification
• Tools for (semi-)automatic annotation of speech corpora
Data:
• Speech corpora for specific applications
• Multi-modal speech corpora
• Multi-media speech corpora
• Multi-lingual speech corpora
• Benchmarks for evaluation

Comparison
The BLARK was compared with the inventory of available resources to establish which components are missing, inaccessible, or of insufficient quality.

3: Priority list
Requirements for prioritization:
• the components should be relevant for a large number of applications;
• the components should currently be either unavailable, inaccessible, or of insufficient quality;
• developing the components should be feasible in the short term.
A minimal illustrative sketch of how these criteria could be combined with the two matrices to rank components is given at the end of this section.

Language technology:
1. Annotated corpus of written Dutch
2. Syntactic analysis: robust recognition of sentence structure
3. Robust text pre-processing: tokenization and named entity recognition
4. Semantic annotations for the treebank mentioned above
5. Translation equivalents
6. Benchmarks for evaluation

Speech technology:
1. Automatic speech recognition (including non-native ASR, robust ASR, adaptation, and prosody recognition)
2. Speech corpora for specific applications (e.g. directory assistance, CALL)
3. Multi-media speech corpora (speech corpora that also contain information from other media such as newspapers, the WWW, etc.)
4. Tools for (semi-)automatic transcription of speech data
5. Speech synthesis (including tools for unit selection)
6. Benchmarks for evaluation
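The survey presents the requirements matrix, the availability scores, and the prioritization criteria only in prose and figures; it does not give an algorithm. The Python sketch below is one possible, non-authoritative way of combining them into a ranked list. Every component name, application, rating, availability score, and the scoring formula itself are invented for illustration and are not taken from the survey.

# Illustrative sketch only: all names, ratings, scores, and weights below
# are hypothetical examples, not data from the survey.

IMPORTANCE = {"+": 1, "++": 2}  # ratings used in the requirements matrix

# Fragment of a hypothetical requirements matrix: component -> {application: rating}.
requirements = {
    "syntactic analysis":  {"information extraction": "++", "machine translation": "++"},
    "monolingual lexicon": {"spelling correction": "+", "machine translation": "++"},
    "speech synthesis":    {"spoken dialogue systems": "++"},
}

# Hypothetical availability scores on the survey's scale:
# 1 = "module or data set is unavailable" ... 10 = "easily obtainable".
availability = {
    "syntactic analysis": 3,
    "monolingual lexicon": 8,
    "speech synthesis": 5,
}

# Components assumed (for this example) to be feasible to develop in the short term.
feasible_short_term = {"syntactic analysis", "monolingual lexicon", "speech synthesis"}

def priority_score(component: str) -> int:
    """Higher score = higher priority: relevant for many applications,
    poorly available, and feasible to develop in the short term."""
    if component not in feasible_short_term:
        return 0
    relevance = sum(IMPORTANCE[r] for r in requirements[component].values())
    gap = 10 - availability[component]  # 0 (easily obtainable) .. 9 (unavailable)
    return relevance * gap

if __name__ == "__main__":
    for comp in sorted(requirements, key=priority_score, reverse=True):
        print(f"{comp}: priority score {priority_score(comp)}")

With the example numbers above, syntactic analysis ranks first because it is rated very important for two applications and scored as poorly available; this loosely mirrors its high position on the language technology priority list.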