
CKL --- Center for Computational Linguistics

CKL --- Center for Computational Linguistics. Project MŠMT LC536 (LC05). Univerzita Karlova v Praze, ÚFAL MFF; Západočeská univerzita Plzeň, KKY FAV; Masarykova Univerzita Brno, FI; Ústav pro jazyk český AV ČR Praha. http://www.centrumkomputacnilingvistiky.cz


Presentation Transcript


  1. CKL---Center for Computational Linguistics Project MŠMT LC536 (LC05) Univerzita Karlova v Praze, ÚFAL MFF Západočeská univerzita Plzeň, KKY FAV Masarykova Univerzita Brno, FI Ústav pro jazyk český AV ČR Praha http://www.centrumkomputacnilingvistiky.cz

  2. Center’s Advisory Board Meeting, 31.1.2011, MFF UK, Malostranské nám. 25, Room S1, 4th floor • 10:00 Introduction to the Center, history, results (Jan Hajic) • 10:25 Charles University research and results (Jan Hajic) • 10:40 Break • 11:00 Institute for Czech Language research and results (Karel Oliva) • 11:15 Masaryk University research and results (Karel Pala) • 11:30 University of West Bohemia research and results (Pavel Ircing)

  3. The Center • Goals: • Research in all areas of computational linguistics and speech • Close cooperation in speech and language • Create annotated data • Algorithms and SW tools for NL analysis and generation • Create and integrate lexical resources

  4. History of the Center • Former Center for Computational Linguistics (program MŠMT LN) • 2000-2004 • UK, ÚJČ, ZČU: fundamental research type (B) • Now: Center for Computational Linguistics • (again) fundamental research, MŠMT LC • Masaryk University in Brno added, now 4 sites

  5. The Center: some figures • Budget and timeframe • 2.9 mil. €, 2005-2009[-2011] (6 yrs + 9 mos) • Staffing (2010): • 1 PI (professor) • 7 Co-PIs and key persons (full/assoc. prof.) • 11 postdocs (Ph.D.) • 9 of them graduated with CKL support • 24 graduate students • Reduced to about 2/3 for 2011

  6. The sites (1) • UK Praha (ÚFAL MFF / Charles University) • Formal language theory and algorithms • SW tools for NLU / NLG • Raw, Annotated data (incl. parallel) • ZČU Plzeň, KKY FAV (University of West Bohemia in Pilsen) • Speech recognition and TTS • Data collection and annotation

  7. The sites (2) • MU Brno, FI, NLP lab (Masaryk University) • Lexical issues • Lexical databases, incl. SW • ÚJČ AV ČR (Institute of the Czech Language, Academy of Sciences of the CR) • Digitization of historical data • Lexical databases

  8. 2005 • Start of work, after some “gap” • Apr. 1, 2005 – three months vacuum • [Got back the name…] • Reduced budget for 2005 (300k €) • Durable equipment / future computing cluster • Cooperation: • EU grant proposals • continuing work on Malach (U.S.) • Start of the PIRE NSF project (JHU, Brown Univ.)

  9. 2006 • First full year • Prague Dependency Treebank v2.0 finished (published at LDC) • Speech reconstruction project (UK, specification with PIRE/JHU) • Lexical issues (UK, MU, ÚJČ) • Speech (ASR, TTS - ZČU) • IR – CLEF test collection, CLEF shared task, 1st part • Digitization of historical material (ÚJČ) • Start of EU Integrated project „Companions“: UK, ZČU • More international cooperation: EU, USA (JHU, Brown, Univ. of Pennsylvania) • Organization of Treebanks and Linguistic Theories, Dec. 2006 (UK) • 40 „results“ in the government database („RIV“)

  10. 2007 • Mid-project • Lexical resources, new Czech language lexical database (MU+ÚJČ) • Added more students for English work, translation • English annotation specification, annotation (ZČU, UK) • Integration of ASR and TTS with NLU/NLG (UK, ZČU) • In the “Companions” project • SW tools for analysis and generation • Speech, language (UK, MU, ZČU) • International collaboration • EU (3 projects 6th FP: UK, UK+ZČU), USA (UK, UK+ZČU) • Local organisation of ACL 2007 and EMNLP 2007 • Still (2011) holds record in attendance (~1100 participants) • 66 results in “RIV” (16 journals, 39 in-proc., 5 SW/data etc.)

  11. 2008 • Slightly modified goals (stress on MT) • Lexical resources (MU, UK, ÚJČ) • SW tools • Semantics • detection of plagiarism (MU) • NLU (UK, MU), NLG (UK) • New algorithms for ASR • Prosody, language modeling, speech reconstruction • Data acquisition, annotation, corpus tools • Research (incl. data annotation) for machine translation • The TectoMT SW and data platform • Theoretical formal linguistics, language usage • Results (RIV): 64: 13 journal art., 32 in-proc., 5 books, 5 SW tools/data resources etc.

  12. 2009 • Should have been the last year of CKL… • Application for extension for 2010-11 • Granted for 2010 • Research: English data, MT, ASR, Dialog • Work on the parallel Czech-English treebank (PTB) • Companions project: integration work • Tight cooperation between UK and ZČU • PIRE project – workshops, students from US at UK • Euromatrix EU project on MT extended (-2012) • Organization of the CoNLL 2009 shared task • Organization of session at FET 2009 (EU conference) • Results: 62, journals: 8, in-proc.: 42, 3 books etc.

  13. 2010 • Last fully-funded year: ext. to 2011 granted in Nov. • Continuation of research along the same lines • Wrap-up in data annotation: PCEDT, PDTSx • Departures of people due to uncertainty • International cooperation: • Companions project finished (Nov. 2010) • PIRE continuing towards 2011, EuromatrixPlus renewed (UK) • New projects in 2010: • Univ. of Pennsylvania – discourse representation, annotation (UK) • Khresmoi (EU IP) – medical IR and IE, UK • Faust (STREP, machine translation, UK) • META-NET network of excellence in MT / data sharing • Chairing the ACL 2010 conference (Uppsala, Sweden) • Results (prelim.): ~60 (12 journal articles, ~40 in-proc.)

  14. Quantitative Summary of Results • RIV 2005-2009 (2010 pending) • 274 records (+ ~60 in 2010) • Mostly papers in proceedings of conferences and workshops • ACL, EACL, NAACL, Coling, CoNLL; workshops • > 95% international, > 85% abroad • Some journal articles • LNCS, IEEE Transactions, LRE, Czech ling. journals (PBML, SaS – now in WoS) • Software and data • Mostly „open source“; training, shared task (evaluation)

  15. Most valued publications • Papers • Semi-supervised POS tagging (EACL 2009) • Best results in POS tagging so far, incl. English • Now taggers available in 5 languages • Extension of HVS Semantic Parser by Allowing Left-Right Branching (ICASSP 2008) • New result, drawing from S. Young’s work • Large-scale Semantic Networks: Annotation and Evaluation • NAACL 2009; in cooperation with Google Research (Zurich, K. Hall) • CoNLL 2009 Shared Task, CoNLL 2009 • Overall task and system description • Book • Valenční slovník českých sloves (Valency Lexicon of Czech Verbs, Karolinum Press) • Electronic version available

  16. Most valued data • Corpora (language databases, publicly available) • Prague Dependency Treebank 2.0, Linguistic Data Consortium 2006 • Prague Czech-English Dependency Treebank, to appear in 2011 • Penn Treebank & translation to Czech, with PDT-style semantic annotation • Czech Wordnet 1.0 (ELRA, 2008) • Sign Language, Audiovisual (ELRA, 2008) • Test / shared task collections • CLEF 2006, 2007 • Multilingual cross-language search competitions • Machine Translation Open Competition – EuroMatrix/Plus 2006-10 • Czech-English, German, French, Italian, Hungarian, Spanish • CoNLL Shared Task 2007, 2009 • Dep. parsing, semantic role labeling (unified for 7 languages)

  17. Most valued SW tools • Software • Corpus manager (client/server) Bonito/Manatee • Worldwide use: ČNK, SNK; Hu, Hr, GB • Word Sketch Engine • Commercial use (Lexical Computing) • ComPOST • State-of-the-art POS tagger (Cz, En, Dutch, Swedish, Icelandic) • Syntactic dependency parser „MST“ (Czech) • With Univ. of Pennsylvania • Improved Czech ASR and Emotional TTS • Used in the Companions project • NLG and Dialogue Manager w/knowledge base • Also for the Companions project • The TectoMT SW and data handling platform • MT, dialogue systems (now any NLU/NLG processing -> “Treex”)

  18. The Center provided… • Material benefits • 3/4 of budget: personnel (mainly graduate students) • Generous travel money • Small equipment • Durable equipment – clusters (30-200 CPUs) • Only in 2005/6 – need for renewal • Small indirect costs (< 12%, contribution of inst.) • “intangible” benefits • (Sub)teams, even across institutions, flexible assignment of people to projects, • dissertations, one assoc. professor promotion

  19. The Center had to work under certain “restrictions” • Employment of graduate students, postdocs, supervision of graduate students • Now at all four sites (2009: 10/4/9/1) • Requirement: at least one site… → Check • Requirement: Participation of students (Bc./Mgr./Ph.D.) • Total: 41 students → Check • 7 nationalities • Students - after graduation - went to (e.g.)… • Petr Němec (UK): TextKernel, Hol.; Kiril Ribarov (UK): ČEZ • Jan Romportl, Aleš Pražák: SpeechTech (spinoff, ZČU) • Vladimír Kadlec (MU Brno): Acision (GB) • Petr Pajas (UK): Google (Zurich) • Václav Novák (UK): Ministry of Interior, then a small startup • Former CKL (LN, 00-04): M. Čmejrek, J. Cuřín (UK): IBM Research (Yorktown, Prague)

  20. “Restrictions” (cont’d) • Requirement: integration to EU “research space” • 9 EU projects, 6th and 7th FP • All types: IP, STREP, NoE; SSA, Dig. Libraries • Companions (IP) - ZČU, UK; • Khresmoi (IP) - UK • EuroMatrix, EuroMatrixPlus, Faust (STREP) - UK • Flarenet, META-NET (NoE) - UK • Clarin (SSA) - UK, MU, ÚJČ; • KYOTO (Dig. Libraries) - MU • USA • Malach (till 2007; UK, ZČU): USC, JHU, IBM, UMD • PIRE: speech recognition and machine translation (UK, indirectly ZČU): JHU, Brown Univ. • Discourse: Univ. of Pennsylvania • Treebanking: Univ. of Colorado → Check

  21. EU Project „Companions“ • Goal • Intelligent conversational companion • Over photographs (Cz), „how was your day“ (En) • Technologies • ASR, emotional TTS • Natural language understanding, NL generation • Naturalness of dialogue: „user studies“ / „evaluation“ • CKL • UK/ZČU: ASR, TTS, NLU, NLG, Dialogue management

  22. The Companions project

  23. Companions: System Diagram

  24. Other project demos

  25. Semantic annotation (UK) • „Některé kontury problému se však po oživení Havlovým projevem zdají být jasnější.“ (“Some contours of the problem, however, seem clearer after being revived by Havel’s speech.”)

  26. PDT 2.0: Annotation layers • „Byl by šel do lesa“ (“he’d go to the forest”) • Linked layers of annotation • Stand-off annotation • Scheme (Relax NG) • z-layer
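The idea of linked, stand-off annotation layers on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, identifiers (`w1`, `m1`, `t1`), lemma values, and the grouping of words under the higher-layer nodes are hypothetical, and the sketch does not reproduce the actual PDT 2.0 file format or its Relax NG schema. It shows the core principle only: each layer stores references to the layer below instead of copying its content.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    """Lowest layer: one surface word of the sentence."""
    id: str
    form: str


@dataclass(frozen=True)
class MorphNode:
    """Middle layer: holds a lemma and a reference to a token, not the token text."""
    id: str
    token_ref: str
    lemma: str


@dataclass(frozen=True)
class DeepNode:
    """Top layer: one node may point at several middle-layer nodes."""
    id: str
    lemma: str
    morph_refs: Tuple[str, ...]


WORDS = ["Byl", "by", "šel", "do", "lesa"]
LEMMAS = ["být", "být", "jít", "do", "les"]  # illustrative values (assumption)

TOKENS: Dict[str, Token] = {
    f"w{i}": Token(f"w{i}", form) for i, form in enumerate(WORDS, 1)
}
MORPH: Dict[str, MorphNode] = {
    f"m{i}": MorphNode(f"m{i}", f"w{i}", lemma) for i, lemma in enumerate(LEMMAS, 1)
}
# Hypothetical grouping: the verb complex and the prepositional phrase
# each collapse into a single higher-layer node.
DEEP: Dict[str, DeepNode] = {
    "t1": DeepNode("t1", "jít", ("m1", "m2", "m3")),
    "t2": DeepNode("t2", "les", ("m4", "m5")),
}


def surface_forms(node_id: str) -> List[str]:
    """Follow the references from a top-layer node down to the surface words."""
    node = DEEP[node_id]
    return [TOKENS[MORPH[ref].token_ref].form for ref in node.morph_refs]


if __name__ == "__main__":
    for node_id, node in DEEP.items():
        print(node_id, node.lemma, surface_forms(node_id))
```

Because the upper layers hold only identifiers, a lower layer can be corrected or re-annotated without touching the layers above it, which is the usual motivation for stand-off designs.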

  27. Speech reconstruction (UK, ZČU) • Goal: „Translation“ • Annotation • Spoken: SEM NEMOH SEM TO JIM DÁT TEN VOBRAZ (‘m couldn’t ‘m that them give the paintin’) • Reconstructed: Ten obraz jsem jim nemohl dát. (I could not give them the painting.) • ? Generation

  28. Speech Reconstruction Annotation • Edited transcript • All changes allowed • Manual annotation • Large data • Malach data • Companions proj. dialogues (> 100h)
