
Construction of Multilingual Terminology Bank of Computational Linguistics


Presentation Transcript


  1. Construction of Multilingual Terminology Bank of Computational Linguistics

  2. Abstract • In the 1990s, the Institute of Computational Linguistics of Peking University built a multilingual dictionary of computational linguistics covering English, Chinese, Japanese, and German. The dictionary contains more than 5,400 terms and contributed greatly to the development of the NLP field. • To build on that work, terms that emerged over the past two decades have been added to an expanded term bank (ETB), which holds about 13,000 English terms and extends the number of languages to seven. • The seven-language core term bank is now mostly complete. The paper describes the construction of the ETB in detail, including its scale, the sources of its terms, and the design of its database management system. The ETB is expected to promote the development of computational linguistics.

  3. 1. Introduction-1 • With the development of computers and the internet, a large body of academic literature appears every day, and many domain-specific terms are coined along with it. • Terminology is an important information resource, and standardized terminology is required for academic communication. A terminology bank offers a convenient way to share this resource and an important means of managing terms.

  4. 1. Introduction-2 • Professor Yu Shiwen (ICL-PKU) developed a multilingual comparative lexicon of English, Japanese, Chinese, and German that includes 5,415 computational linguistics terms coined up to the early 1990s. • Based on this work, Peking University Press published the English-Chinese Lexicon of Computational Linguistics, the first terminology dictionary in the field and one of the most important references for terminology translation.

  5. 1. Introduction-3 • To carry forward and extend these existing results, we collected new terms that emerged over the last two decades, and expanded both the scale of the terminology bank and the number of languages it covers. • The result is a new computational linguistics terminology bank with wide coverage, high quality, and multiple languages.

  6. 2. Terminology Sources • Because a term names a concept, most terms are nouns. In practice, however, some specialized verbs and adjectives are also included in the term base, such as “parse (句法剖析)” and “anaphoric (回指的)”.

  7. 2. Terminology Sources-2 • The expanded term bank (ETB) described in this paper enlarges the original “English-Chinese Lexicon of Computational Linguistics”; 5,415 terms in the ETB come from that dictionary (Yu, Zhu and E.Kaske, 1996). • Three books by Prof. Yu Shiwen and other ICL-PKU researchers: 325 terms come from “The Introduction to Computational Linguistics” (Yu, Chang and Zhan, 2003), 304 terms from “Preview of Computational Linguistics” (Yu and Huang, 2005), and 782 terms from “The Grammatical Knowledge-Base of Contemporary Chinese-A Complete Specification” (Yu, Zhu and Wang, 2003). • An English-French-Russian terminology dictionary of mathematics and computational linguistics by Y. Venev (Y.Venev, 1990): 3,900 terms. • “Natural Language Understanding” by James Allen: 602 terms come from the appendix of this book (James, 2005).

  8. 2. Terminology Sources-3 • “An Introduction to Information Retrieval” by Christopher D. Manning: 630 English terms from the IR field (Christopher, 2008). • Besides books, keywords from papers in the “Journal of Chinese Information Processing” were added to the ETB as terms. Journals have a shorter publishing cycle than books, so they contain many new terms (1,100 terms). • About 1,300 terms come from the internet and day-to-day research work.
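The per-source counts quoted on the slides can be tallied as a quick sanity check. The sketch below assumes that the listed sources are exhaustive and that each count was taken before duplicates were removed; the slides state neither, and two of the counts are only approximate. The labels are shortened for readability.

```python
# Term counts per source as quoted on the slides.
# The journal and internet figures are approximate in the original.
SOURCE_COUNTS = {
    "English-Chinese Lexicon of Computational Linguistics": 5415,
    "The Introduction to Computational Linguistics": 325,
    "Preview of Computational Linguistics": 304,
    "The Grammatical Knowledge-Base of Contemporary Chinese": 782,
    "Venev English-French-Russian dictionary": 3900,
    "Natural Language Understanding (appendix)": 602,
    "An Introduction to Information Retrieval": 630,
    "Journal of Chinese Information Processing keywords": 1100,
    "Internet and day-to-day research work": 1300,
}

# English term count reported for the current ETB on the conclusion slide.
ETB_ENGLISH_TERMS = 13016

raw_total = sum(SOURCE_COUNTS.values())
# Under the assumptions above, this is what checking and merging removed.
removed = raw_total - ETB_ENGLISH_TERMS

print(f"raw total across sources: {raw_total}")
print(f"removed by checking and merging (estimate): {removed}")
```

Under those assumptions, the gap between the raw total and the reported 13,016 English terms gives a rough upper estimate of how many entries the manual check removed as duplicates or errors.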

  9. 2. Terminology Sources-4 • The terms listed above were acquired by several methods, including automatic acquisition, manual input, and machine scanning. • All new terms were finally checked by hand. Drawing on different sources produced many duplicate and erroneous terms; duplicates were deleted and errors corrected. • In the ETB, the English term serves as the primary key. The bank currently holds about 13,000 English computational linguistics terms. Part of the translation from English into the other languages was done with the help of dictionaries. • In addition, the term sets from the different sources were intersected to form a core term bank (CTB), which to some extent represents the frequently used and important concepts.
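A minimal sketch of the merging step, under stated assumptions: the slides do not say how terms were normalized or how the intersection was defined, so here terms are compared after lower-casing and whitespace cleanup, and a term counts as "core" when it occurs in at least `min_sources` sources. The function names, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def merge_sources(sources: dict[str, list[str]], min_sources: int = 2):
    """Return (expanded, core) term lists built from per-source term lists."""
    source_count = Counter()
    for terms in sources.values():
        # A set keeps a term repeated within one source from counting twice.
        for key in {normalize(t) for t in terms if t.strip()}:
            source_count[key] += 1
    expanded = sorted(source_count)  # every distinct term, duplicates merged
    core = sorted(t for t, n in source_count.items() if n >= min_sources)
    return expanded, core


# Toy data for illustration only; not taken from the real sources.
toy_sources = {
    "lexicon": ["parse", "anaphoric", "Lexicon"],
    "textbook": ["Parse", "corpus", "lexicon "],
    "journal": ["corpus", "word segmentation"],
}
expanded, core = merge_sources(toy_sources)
print(expanded)
print(core)
```

A strict intersection over all sources would correspond to setting `min_sources` to the number of sources; a lower threshold is used here only because a strict intersection over many heterogeneous sources tends to be very small.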

  10. 2.2 Language Selection • Seven languages are included in the ETB: English, Chinese, Japanese, German, Russian, French, and Korean. These languages come from several different language families. • Professor Yu Shiwen invited experts and scholars from several countries to join the terminology translation work. A first version of the seven-language computational linguistics terminology bank has now been built.

  11. 3. The Characteristics of Multilingual Terminology Bank • The main file of the ETB is a single multilingual term comparison bank covering seven languages. • Each record uses the English term as its primary key. While each record has exactly one English term, the fields for the other languages may each hold several corresponding terms.

  12. 4. Multilingual Terminology Bank System • 4.1 Design of Base Table • The main file includes 9 fields, among them the ID number and the seven language fields. The table is sorted alphabetically by English term. The English term is the primary key: every record has exactly one English term but may have several terms in each of the other languages. Table 1. The structure of main file. Table 2. Examples from multi-lingual terminology bank.
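The main-file layout described above could be sketched with Python's standard `sqlite3` module. This is an illustration, not the system's actual schema: the slide mentions nine fields but names only the ID and the seven language fields, so the sketch uses just those eight, and the table and column names are assumptions. The two sample rows reuse the example terms quoted on the Terminology Sources slide.

```python
import sqlite3

# Column names are assumptions; the slides name the fields only in prose.
LANGUAGES = ["english", "chinese", "japanese", "german", "russian", "french", "korean"]


def create_main_table(conn: sqlite3.Connection) -> None:
    """Create the main file: one row per English term, one column per language."""
    other_columns = ", ".join(f"{lang} TEXT" for lang in LANGUAGES[1:])
    conn.execute(
        "CREATE TABLE main_file ("
        "id INTEGER NOT NULL UNIQUE, "
        "english TEXT PRIMARY KEY, "  # the English term is the primary key
        f"{other_columns})"
    )


conn = sqlite3.connect(":memory:")
create_main_table(conn)
conn.executemany(
    "INSERT INTO main_file (id, english, chinese) VALUES (?, ?, ?)",
    [(1, "parse", "句法剖析"), (2, "anaphoric", "回指的")],
)
# Ordering by English term is expressed as an ORDER BY on the key column.
rows = conn.execute(
    "SELECT id, english, chinese FROM main_file ORDER BY english"
).fetchall()
print(rows)
```

Because `english` is the primary key, inserting a second row with the same English term raises `sqlite3.IntegrityError`, which gives a basic form of duplicate checking at the storage level.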

  13. 4.1 Design of Base Table: Six Monolingual Term Index Tables • To find terms in the different languages conveniently and quickly, six index tables are built, one per language. Because English terms can be looked up in the main file, the system has no English index table. The field “ID” corresponds to the ID number in Table 1, and the field “Term” holds the monolingual term. Table 3. The structure of monolingual index table. • An additional field, “PinYin”, is added to the Chinese term index table, and that table is ordered by the PinYin of the terms. Table 4. Examples from Chinese term index table.
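As a rough illustration of how a Chinese index table could be derived from the main file, the sketch below produces one entry per Chinese term and sorts by pinyin. Several things are assumptions rather than facts from the slides: that multiple terms in one field are separated by `;`, that pinyin comes from a lookup table (here typed by hand, without tones), and every identifier used.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    term: str    # the monolingual term
    id: int      # ID number of the main-file record
    pinyin: str  # extra field used only by the Chinese index


def build_chinese_index(main_rows, pinyin_of, separator=";"):
    """Build Chinese index entries from main-file rows, ordered by pinyin."""
    entries = []
    for row in main_rows:
        # One main-file field may hold several terms; each gets its own entry.
        for term in (row.get("chinese") or "").split(separator):
            term = term.strip()
            if term:
                entries.append(IndexEntry(term, row["id"], pinyin_of[term]))
    return sorted(entries, key=lambda e: (e.pinyin, e.term))


main_rows = [
    {"id": 1, "english": "anaphoric", "chinese": "回指的"},
    {"id": 2, "english": "corpus", "chinese": None},  # not yet translated
    {"id": 3, "english": "parse", "chinese": "句法剖析"},
]
# Hand-typed, toneless pinyin for the two example terms.
pinyin_of = {"回指的": "huizhi de", "句法剖析": "jufa pouxi"}

chinese_index = build_chinese_index(main_rows, pinyin_of)
for entry in chinese_index:
    print(entry.pinyin, entry.term, entry.id)
```

The index tables for the other five languages would follow the same pattern without the pinyin field, sorted by the term itself.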

  14. 4.1 Design of Base Table: Seven Monolingual Basic Information Tables • A term bank is incomplete if it contains only the translations of terms into different languages. The basic information tables let researchers find further information such as definition, pinyin, synonyms, abbreviation, hypernyms, and hyponyms. • These information tables are still under construction.

  15. 4.1 Design of Base Table • From the term information tables, the index tables, and the main file, dictionaries such as monolingual and bilingual information dictionaries can be constructed automatically.
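Dictionary generation from the main file might look like the following sketch, which pairs the terms of any two language fields and skips records that lack a translation. The function name, the row format, and the `;` separator for multiple terms in one field are assumptions made for illustration.

```python
def bilingual_dictionary(main_rows, source, target, separator=";"):
    """Return sorted (source term, [target terms]) pairs for translated rows."""

    def split(value):
        # Empty or missing fields yield no terms.
        return [t.strip() for t in (value or "").split(separator) if t.strip()]

    entries = {}
    for row in main_rows:
        targets = split(row.get(target))
        if not targets:
            continue  # nothing to list for this record yet
        for term in split(row.get(source)):
            entries.setdefault(term, []).extend(targets)
    return sorted(entries.items())


main_rows = [
    {"id": 1, "english": "anaphoric", "chinese": "回指的"},
    {"id": 2, "english": "corpus", "chinese": None},  # not yet translated
    {"id": 3, "english": "parse", "chinese": "句法剖析"},
]

english_chinese = bilingual_dictionary(main_rows, "english", "chinese")
chinese_english = bilingual_dictionary(main_rows, "chinese", "english")
for headword, translations in english_chinese:
    print(headword, "=>", ", ".join(translations))
```

Swapping the `source` and `target` arguments yields the reverse dictionary from the same data, which is the main benefit of keeping all languages in one comparison table.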

  16. 4.2 The Functions of ETB Management System (1) Data maintenance (2) Data inquiry (3) Duplicate checking (4) Re-ordering (5) Automatic indexing (6) Dictionary generation

  17. 5. Conclusion and Future Work • Starting from the 5,415 terms of the Computational Linguistics Lexicon built by ICL-PKU in the early 1990s, a new expanded computational linguistics terminology bank, the ETB, has been constructed. • The number of languages has been expanded from four to seven. The CTB was built by intersecting the different term banks, and a database management system was also built. • The expansion and the construction of the CTB are finished, and the multilingual translation of the CTB is complete. Translation of the whole ETB is in progress. The current ETB contains 13,016 English terms, 11,290 Chinese terms, 8,415 Japanese terms, 6,250 German terms, 5,747 Russian terms, 4,583 French terms, and 779 Korean terms.
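The reported term counts give a rough picture of how far translation has progressed in each language. The sketch below divides each count by the number of English terms. It assumes each non-English count is comparable to the number of English entries; since one record can hold several terms in another language, the ratios are indicative only.

```python
# Term counts for the current ETB, as reported on the slide above.
TERM_COUNTS = {
    "English": 13016,
    "Chinese": 11290,
    "Japanese": 8415,
    "German": 6250,
    "Russian": 5747,
    "French": 4583,
    "Korean": 779,
}

english_total = TERM_COUNTS["English"]
# Share of the English term count reached by each other language.
coverage = {
    lang: count / english_total
    for lang, count in TERM_COUNTS.items()
    if lang != "English"
}

for lang, share in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<9}{share:6.1%}")
```

On these figures Chinese is the most complete and Korean the least, which matches the slide's statement that translation of the whole ETB is still in progress.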

  18. 5. Conclusion and Future Work • Completing the translation of the whole bank is the key future task. Further problems to address are how to translate, define, and arrange terms by combining manual and automatic methods, how to supply term information such as synonyms and hypernyms, and how to construct a term classification system.
