
Research on Human Language Technologies in Flanders



Presentation Transcript


  1. Research on Human Language Technologies in Flanders Walter Daelemans (Ed.) CNTS Language Technology Group University of Antwerp walter.daelemans@ua.ac.be

  2. Prehistory Computational Linguistics • Early isolated PhDs • Willy Martin (Leuven, 1970): Analysis of a vocabulary by means of a computer • Luc Steels (Antwerp, 1977): Aspects of a Modular Theory of Language • Early International MT involvement (Leuven) • EUROTRA (early eighties until 1994), METAL (mid-eighties until 1992) • Dutch language pairs

  3. Prehistory Computational Linguistics • Droste PhDs: F. G. Droste (1969) Translating with the computer, possibilities and problems. • Frank Van Eynde (Leuven, 1985): Meaning, translatability, and machine translation • Geert Adriaens (Leuven, 1986): Process Linguistics: The Theory and Practice of a Cognitive-Scientific Approach to Natural Language Understanding • Walter Daelemans (Leuven, 1987): Studies in Language Technology: an object-oriented model of Dutch morphophonology and its applications • First groups start late eighties, early nineties • Leuven CCL (Van Eynde) • Antwerp CNTS (Daelemans, Gillis)

  4. Prehistory Speech Technology • Early PhDs • Jean-Pierre Martens (Gent, 1982). Quality degrading aspects of filtered speech. • Luc Vanhove (Gent, 1984). Study and improvement of the linear prediction vocoder. • Werner Verhelst (Brussels, 1985). Short-time cepstra and LPC analysis-synthesis of speech. • Dirk Van Compernolle (Stanford, 1985). Speech Processing Strategies for a Multichannel Cochlear Prosthesis. • First speech technology groups start up in the mid 80’s • Gent (ELIS): Marc Vanwormhoudt, Jean-Pierre Martens • Leuven (ESAT): Dirk Van Compernolle • Brussels (ETRO): Oscar Steenhaut, Werner Verhelst

  5. Prehistory Speech Technology • First collaborations on national level • 1983-1989: First IWONL-IRSIA project on speech analysis / synthesis (with UGent, VUB, UCL, Bell Telephone, FPMS, ACEC, Philips, Correlative Systems) • 1987-1992: National stimulation program on artificial intelligence (with KULeuven, VUB, UCL, ULB, FPMS, UGent)

  6. Start of an organized field • Research Initiative on Dutch speech and language technology (1993-1994) • Flemish ministry of science and technology • To “improve and strengthen the position of Dutch” • Approx. 1 million euro • Speech recognition research, corpora (CoGeN, ANNO), pronunciation lexicon (Fonilex) • In preparation for a long-term research programme on speech and text translation to and from Dutch

  7. Start of an organized field (http://clif.esat.kuleuven.be) • Computational Linguistics in Flanders (CLIF) • Sponsor: Flemish National Science Foundation • Scientific Research community • 1995-2010 (2 renewals), 12,500 euro per year • Goals • Strengthen the integration of fundamental research on language and speech technology in Flanders to establish multidisciplinary, fundamental and applied research of natural language and Dutch in particular • Facilitate research activities of the participating research groups, improving (re-)usability of data for spoken and written language • Positive effects • Acts as a de facto spokesman of the academic research community for government, the Dutch Language Union, etc. • Fruitful environment for cooperation • All the main research groups are represented • International advisory board

  8. Start of an organized field • Flanders Language Valley campus • Ypres, the centre of Europe • Literally arising around Lernout & Hauspie Speech Products • 1985 founded, 1995 NASDAQ, 2001 bankrupt • 125 million euro investment capacity in FLV Fund • 1995-2005 (liquidation) • “Favorable place of business for HLT companies” • CELE Research group (Kristiina Jokinen, Dirk Frimout) • Cooperation with the University of Antwerp on long-term research on the CAM Brain Machine • (turned out to be short-term research)

  9. From CGN to STEVIN • What happened to the “long-term research programme on speech and text translation to and from Dutch”? • CLIF recommendation: extend and valorize the “short term research projects” • Opportunities for cooperation with The Netherlands • CGN (Spoken Dutch Corpus) • 1998-2004 • 10 million words, linguistically annotated and linked to the signal • Tools, protocols, interface • From spontaneous to read speech • 5 million euro, 1/3 Flanders

  10. From CGN to STEVIN • The position of Dutch in Language and Speech Technology (Bouma & Schuurman, 1998) • Conclusion: many weak spots and omissions in the available infrastructure • Advice: • Install Dutch-Flemish platform (coordinated by NTU) • Stimulate both fundamental and applied research • Set up an inter-university HLT education program in Flanders and reinforce the existing Dutch programs • Action plan for Dutch in language and speech technology: priorities for basic resources (Daelemans & Strik, 2002) • Prepared the contents and priorities for STEVIN. • STEVIN program (1/3 Flanders) • 2004-2011, 11.4 million euro

  11. HLT Research Overview

  12. Others • Research within companies • Language and Computing • Nuance • … • Karel De Grote university college (Antwerp) • Readability, subtitling, … • Lessius university college (Leuven) • Terminology extraction and management, translation tools • Erasmus university college (Brussels) • Terminology, translation tools, corpora

  13. Funding situation in Flanders • Flemish HLT research funding situation is reasonably good (except for basic research) • Basic research grants (hard to get) • FWO PhD and postdoc mandates • FWO research projects • (VNC Dutch-Flemish research projects) • University funding (Special Research Fund) • TOP, GOA, IUAP, … • Application-oriented research • IWT PhD projects • IWT SBO / GBOU / STWW projects • TETRA projects • European Research (Framework Programs)

  14. Research situation in Flanders • The joint research groups cover a large part of the field of language and speech technology research • Speech recognition and speech synthesis • MT, QA, Information Extraction, Summarization, Information Retrieval, Ontology and Terminology Learning • Machine Learning / statistical methods, knowledge-based / linguistic methods, hybrid methods • Text analysis (from morphology via syntax to semantics and discourse) • Corpus development and annotation • Less well-represented • Text generation from meaning representations • (Spoken) dialogue systems • Multimodality

  15. Speech

  16. ELIS-DSSP Jean-Pierre Martens Electronics & Information Systems University of Ghent https://speech.elis.ugent.be/

  17. ELIS-DSSP • Embedding • research group of dept. Electronics & Information Systems (ELIS) • Key dates • 1982: first PhD • 1986: first Flemish aid for speech impaired persons, working with speech synthesis • 1988: speech synthesis technology sold to L&H • 1997: creation of spin-off company (Technology & Integration) in domain of alternative communication with speech technology • Main research themes in speech technology • auditory model based speech and music analysis • acoustic and lexical modelling for ASR • speech segmentation and labelling as a pre-processing step in speech transcription systems • objective assessment of disordered speech

  18. ELIS-DSSP • Software development • auditory model embedding AMPEX pitch extractor • monophonic melody extractor • real-time audio indexing system comprising the isolation of speech intervals, speaker turn detection, gender and speaker clustering • AUTONOMATA grapheme-to-phoneme conversion toolkit supporting the development of error recovery from a baseline system • Tool for disordered speech intelligibility assessment • Resource development • CoGeN (Corpus Gesproken Nederlands): ELIS + ESAT • FONILEX (Phonetic Lexicon): together with UA, CCL • CGN (Spoken Dutch Corpus): project leader for Flanders • COST-278 multilingual broadcast news database

  19. Main research results • improved LVCSR by means of data-driven pronunciation variation modeling (ACCENT: FWO) • real-time audio segmentation algorithm that came out first in a multilingual evaluation campaign (ATRANOS: IWT, COST278: EU) • improved spontaneous speech recognition by the proper treatment of disfluencies (ATRANOS: IWT) • reliable prediction of disordered speech intelligibility by means of phonological features (SPACE: IWT) • improved proper name recognition by means of a phonological feature model (SPACE: IWT) • improved proper name synthesis by means of special purpose g2p converters that can be trained on very little transcribed data using a g2p-p2p approach (AUTONOMATA: STEVIN) • improved LVCSR by means of a data-driven compound composition and decomposition strategy (NBEST: STEVIN)

  20. ETRO-DSSP Werner Verhelst Electronics & Informatics University of Brussels http://www.etro.vub.ac.be/Research/DSSP/dssp.htm

  21. Laboratory for Digital Speech and Audio Processing – ETRO-DSSP • Embedding • research group of dept. Electronics & Informatics, ETRO of the Vrije Universiteit Brussel • Key dates • 1985: first PhD • 1988-1991: collaboration with Institute for Perception Research, The Netherlands • 2004: member of Interdisciplinary Institute for Broadband Telecommunication (IBBT) • 2006: joint research group for audiovisual speech processing with Northwestern Polytechnical University, Xi’an, China, and start of FWO-WOG Audiovisuele systemen • Main research themes in speech technology • speech modification • speech enhancement • expressive speech analysis and synthesis • audiovisual speech analysis and synthesis

  22. Main development work • Software development • system for automatic synchronization of studio dialogs with lip movements in video and film postproduction (IWOIB – EOS) • speech synthesis for feedback in reading tutor software (IWT – SPACE) • audiovisual text to speech synthesis system (Flemish and English) • sound management system for public address systems (IWT + ESAT + Televic)

  23. Main development work • Resource development • Audiovisual recording studio • Database with multi-sensor speech recordings (EU-SAFIR with Thales and Voice Insight) • Database for Flemish unit selection TTS • Audiovisual database with emotional speech (new project)

  24. Main research results • window and sampling effects in short-time cepstra of voiced speech (IWONL) • improved autocorrelative pitch detection with adaptive sign clipping (IWONL) • improved voicing source model for vocoders (VUB) • the WSOLA algorithm and its use for robust natural sounding time scaling (IWT) • perceptual speech and audio modeling with damped sinusoids (IRMUT – IWT with ESAT) • least squares theory and design of optimal noise shaping filters for speech and audio requantization (SMS4PA – IWT with ESAT and Televic) • first cross-database study for expressive speech classification (VIN - IBBT) • improved speech recognition in noisy environments with bone conducting microphones (SAFIR - FP6) • improved spelling and syllabification modes in text to speech synthesis (SPACE – IWT)

  25. ESAT/PSI-Spraak Patrick Wambacq, Hugo Van hamme, Dirk Van Compernolle Electrical Engineering, Center for Processing Speech and Images University of Leuven http://www.esat.kuleuven.be/psi/spraak

  26. ESAT/PSI-Spraak • Speech processing research at K.U.Leuven (Dept. Electrical Engineering, Center for Processing Speech and Images) since 1987 • Focus on speech recognition and its applications, using in-house developed large vocabulary continuous speech recognition system • 3 staff members: Dirk Van Compernolle, Hugo Van hamme, Patrick Wambacq, ≈ 10 researchers (some postdocs) • Extensive computing facilities • Cooperations at national and international level through research projects with both academia and industry, for fundamental and applied research • Current coordinator of CLIF research community

  27. ESAT/PSI-Spraak: research themes • ASR novel architectures (episodic, hybrid, layered approaches) • ASR robustness (noise, spontaneous speech, speaker variability, …) • speech modeling and representation • computational models of human language acquisition • applications: CALL, clinical applications, indexing, subtitling, … • tools and corpora for ASR

  28. ESAT/PSI-Spraak: FLaVoR • FLaVoR: “Flexible Large Vocabulary Recognition”: IWT funded, Oct. 2002 - Sept. 2006, with CNTS, University of Antwerp • Frustrated by the inflexibility of the traditional monolithic ASR architecture, we set out to • Incorporate Linguistic Knowledge Sources • That allow for efficient modeling of morphologically productive languages • That allow for modeling linguistic phenomena that are not well dealt with in a traditional left-to-right architecture • That allow for the modeling of both short and long term dependencies

  29. ESAT/PSI-Spraak: FLaVoR • Through a Modular Recognizer Architecture • That assures a better reusability of components • That relies on a high degree of independence between acoustic and linguistic processing • That allows for a faster decoder and hence makes computational resources available for the more complex linguistic modeling

  30. ESAT/PSI-Spraak: FLaVoR

  31. ESAT/PSI-Spraak: SPACE • SPACE: “SPeech Algorithms for Clinical and Educational applications”, IWT funded, Mar. 2005 - Feb. 2009, with ELIS-UGent, DSSP-VUB, ORTHO-KULeuven, COM-UAntwerp • Main goals: • evaluate user’s speech in educational and clinical applications • improve speech recognition and speech synthesis technologies to better support these applications: • provide accurate classification of uttered phonemes • provide corrective auditory feedback • particular focus points: children’s speech, disfluencies, mispronunciations, deviant speech (e.g. speech of the deaf, dysarthria) • demonstrate the benefits of speech technology based tools for these applications, with involvement of experienced user groups

  32. ESAT/PSI-Spraak: SPRAAK • SPRAAK: Speech Processing, Recognition and Automatic Annotation Kit: STEVIN project Dec. 2005 - June 2008, building on 15 years of in-house software development • make state-of-the-art LVCSR system available for research community (free for research purposes): • modular toolkit (plug&play) for research on speech recognition, allowing researchers to focus on one particular aspect only and forget about the rest, with access to deep internals of the system (using low-level API) • recognizer with simple interface, usable by non-experts through high-level API • http://www.spraak.org

  33. Text

  34. Centrum voor Computerlinguïstiek Frank Van Eynde Faculty of Arts University of Leuven http://www.ccl.kuleuven.be/

  35. Centrum voor Computerlinguïstiek • founded in 1991 at the Faculty of Arts of K.U.Leuven • building on the expertise that had been acquired in the 80s in the framework of the machine translation projects Eurotra and Metal • part of the research unit ‘Dutch, German and Computational Linguistics’ since 2005 • member of ELSNET since 1993 and of CLIF since 1995 • main objectives: (1) acquiring funds for carrying out research in formal and computational linguistics and its application in natural language processing; (2) teaching, training and dissemination

  36. Centrum voor Computerlinguïstiek • formal syntax and semantics (Head-driven Phrase Structure Grammar) • corpus annotation (tagging, treebanks, semantic annotation) • machine translation • multilingual information retrieval • teaching at K.U.Leuven and at international summer schools (ESSLLI, ELSNET, EMLS, LOT) • host of ESSLLI-90, TMI-95, CLAW-96, ELSNET-97, CLIN-98, EMLS-02, HPSG-04, CLIN-07 • http://www.ccl.kuleuven.be/

  37. Machine Translation: METIS-II (EU/FP6) 2004-2007 • Successor of Metis I (2003-2004) • Hybrid System DU-EN • Low Resources (no full parser, no parallel data) • BLEU scores about the same as SMT with IBM1 trained on Europarl • Succeeded by PaCo-MT (NTU) (2008-2011) • Hybrid system FR <-> DU <-> EN • Full Resources

  38. Corpus Annotation: CGN / D-Coi / Lassy (STEVIN) • Series of joint Flemish/Dutch projects • spoken Dutch (CGN 1998-2000), ±10M • written Dutch (D-Coi '05-'06, Lassy '06-'09), ±50M • PoS labels, lemmata, treebank • Parts manually corrected (e.g. 1M treebank in Lassy) • Succeeded by SoNaR 2008-2011, ±500M • semantic labels (corrected) for 1M (NE, coreference, semantic roles, spatiotemporal)

  39. CNTS Language Technology Group Walter Daelemans, Steven Gillis Faculty of Arts University of Antwerp http://www.cnts.ua.ac.be

  40. CNTS Center for Dutch Language and Speech • Founded in 1992 to promote research in Dutch (corpus) linguistics, psycholinguistics, and computational linguistics • Research Center of the department of Linguistics (Faculty of Arts) • Member of ELSNET, CLIF, Flarenet, Clarin, Pascal, CIL, … • Co-founded ACL SIG on Computational Language Learning (SIGNLL) and the associated CoNLL conference series and CoNLL shared tasks series • Resources and software development • Corpora (CGN, COREA, Knack-2002, …) • TiMBL, Tadpole (with ILK, Tilburg University) • Memory-Based Shallow Parser (MBSP) • Spin-off: www.textkernel.nl

  41. Research Topics • Computational Psycholinguistics • Computational models of human language acquisition and processing (phonology, morphology, syntax) • Machine Learning of Language • Memory-Based Learning; ML methodology • ML-based Text Analysis • Phonological and morphological analysis, Prosody and grapheme-to-phoneme, POS tagging, chunking, grammatical relations, pp-attachment, named-entity recognition, semantic role labeling, word sense disambiguation, coreference resolution, … • LT Applications • Biomedical information extraction; Summarization and sentence simplification; Ontology extraction from text; Question Answering • NL interface to graphical design packages, serious gaming, computational stylometry • African Language Technology
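Memory-based learning, named on the slide above, keeps training examples in memory and labels a new instance by looking up the most similar stored examples. The following is a minimal sketch of that idea only, assuming a plain overlap distance and unweighted majority voting; it is not the TiMBL implementation, and the function names, the `MEMORY` data, and the toy tagging task are all hypothetical illustrations.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(memory, instance, k=1):
    """Return the majority label among the k stored examples nearest to instance."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy memory:
# (previous word, focus word, next word) -> part-of-speech tag of the focus word
MEMORY = [
    (("the", "walk", "was"), "NOUN"),
    (("to", "walk", "home"), "VERB"),
    (("a", "run", "in"), "NOUN"),
    (("to", "run", "away"), "VERB"),
]

if __name__ == "__main__":
    print(classify(MEMORY, ("to", "walk", "away")))
```

No model is trained here: all the work happens at classification time, which is why this family of methods is also called lazy learning. Real systems typically add feature weighting and more refined distance metrics on top of this basic lookup.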

  42. Biomedical Text Mining (http://www.biograph.be) • BioMinT (EU FP5, Quality of Life, 2003-2005) • Information Retrieval and Information Extraction tool for human curators of SWISSPROT (protein database) • With SIB Geneva, University of Manchester, PharmaDM, University of Vienna, University of Geneva • Results CNTS • Adaptation of the Memory-Based Shallow Parser to biomedical language (tagger, tokenizer, NER, grammatical relations) • Biograph (University of Antwerp GOA project, 2007-2010) • Ranking genes implicated in diseases (schizophrenia, Alzheimer) using heterogeneous data (including text) • With data mining group and molecular biology group of University of Antwerp • Better text mining engine including analysis of modality and negation for better biomedical information extraction

  43. Computational Stylometry • Computational techniques for stylometry (FWO project, 2007-2010) • Goals • Develop feature construction, feature selection and supervised and unsupervised machine learning techniques for • Authorship, gender, date, and personality attribution from text • Stylistic analysis of literary texts • Develop and make available a tool and benchmark datasets • http://www.cnts.ua.ac.be/stylometry

  44. iTEC Piet Desmet Faculty of Arts University of Leuven, Campus Kortrijk http://www.itec-research.be

  45. iTEC Interdisciplinary research on Technology, Education & Communication: • Computer-assisted language learning • Corpora & digital libraries • Language testing • Authoring systems

  46. Recent projects • Lingu@tic • Language learning environment Dutch & French based on video extracts • www.kuleuven-kortrijk.be/linguatic • Medi@tic • Database of video extracts for language learners Dutch & French • www.kuleuven-kortrijk.be/mediatic • DPC • Dutch Parallel Corpus (10 million words Dutch, English & French) • www.kuleuven-kortrijk.be/dpc

  47. Lingu@tic • Development of a free language learning environment (Dutch & French) based on authentic broadcast video extracts • Use of half-open exercises (e.g. translation exercises with alternative answers) • www.franel.eu

  48. Medi@tic Development of a database of learning objects • Repository of free authentic video materials • Management tool for the description of audio and video assets • Exploration tool for selecting video materials which can be integrated in CALL applications

  49. DPC • Annotated, sentence-aligned corpus • 10 million words, NL-FR and NL-EN • Quality control • Compatible with D-Coi
