Chinese Core Ontology Construction from a Bilingual Term Bank

Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying. Department of Computing, The Hong Kong Polytechnic University.

Presentation Transcript


  1. Chinese Core Ontology Construction from a Bilingual Term Bank • Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying • Department of Computing, The Hong Kong Polytechnic University

  2. Outline • Introduction • Related Works • Algorithm Design: COCA • Performance Evaluation • Conclusion

  3. Introduction • What is a Core Ontology • A mid-level ontology • Bridges the gap between an upper ontology and a domain ontology

  4. Concepts and Terminology • Upper ontology • A general ontology that ensures reusability across different domains (e.g., Computer Program in SUMO) • Domain ontology • An ontology that conceptualizes a specific domain (e.g., Free Software in the IT domain) • More application-dependent; concepts defined more by extension • Mid-level ontology (core concepts) • Basic concepts of a domain • More application-independent; concepts defined more by intension • Core ontology (e.g., Software) • Frequently used and able to form other concepts • Core terms • Lexical units of core concepts

  5. Related Works • Manually constructed ontologies • SUMO • Well-known upper-level ontology work based on lexicons • CoreLex (Buitelaar, 1998) • EuroWordNet (Rodríguez, 1998) • Ontology harmonization: core ontology • “Towards a Core Ontology for Information Integration” (Doerr, 2003) • The most similar work • “Enriching Core Ontology with Domain Thesaurus through Concept and Relation Classification” (Huang, 2007) • Uses concept and relation classification to enrich a core ontology

  6. Our Previous Works • Chinese terminology extraction • Chinese core term extraction (Ji et al., 2007) • Preliminary work on automatic core ontology construction using an English-Chinese term bank (MRCOCA, Ontolex 2007; Chen, 2007) • Bilingual lexicon • Extended strings • Frequency information in synsets • Weights from extended strings are integrated into the final weight by simple addition • Mapping to synsets and SUMO achieves an accuracy of only about 50%

  7. Issues • What kinds of concepts should be included? • How to identify core concepts? • If through core terms, disambiguation is needed • What relations to identify, and how? • Making use of available resources • Chinese NLP resources are scarce • English NLP resources are abundant

  8. Requirements of a Core Ontology • The concepts must be widely accepted and commonly referenced • The corresponding core terms must be frequently used and productive • The concepts/terms can be mapped to an upper ontology, so that the core ontology inherits the attributes provided by the upper ontology

  9. Core Ontology Construction Algorithm (COCA) for Chinese • Extract Chinese core terms from a bilingual term bank • Map each core term T_C to English terms • Map the English terms to WordNet synsets • Map each synset to an upper ontology concept in SUMO

  10. COCA - Resources Used • ITCTerm • A domain-specific core term list (Chen, 2007) • CETBank • Chinese-English bilingual term bank • The 1,500 most productive core terms extracted can serve as suffixes to form more than 50% of the terms in CETBank • WordNet • SUMO • Mappings between WordNet and SUMO
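
The suffix-productivity claim above can be checked with a simple coverage count. The following is a minimal sketch, assuming the term bank is available as a plain list of Chinese strings; the function name `suffix_coverage` and the toy data are invented for illustration and are not taken from CETBank or the authors' code.

```python
def suffix_coverage(core_terms, term_bank):
    """Fraction of term-bank entries that end with a core term
    and are strictly longer than it (the core term acts as a suffix)."""
    core = set(core_terms)
    max_len = max((len(t) for t in core), default=0)
    covered = 0
    for term in term_bank:
        # Try every proper suffix, up to the length of the longest core term.
        for k in range(1, min(max_len, len(term) - 1) + 1):
            if term[-k:] in core:
                covered += 1
                break
    return covered / len(term_bank) if term_bank else 0.0


# Toy data, invented for illustration only.
toy_bank = ["自由软件", "系统软件", "局域网络", "数据库", "软件"]
toy_core = ["软件", "网络"]
coverage = suffix_coverage(toy_core, toy_bank)
print(f"{coverage:.0%} of toy entries are formed with a core-term suffix")
```

A bare core term such as “软件” is not counted as covered, since the slide speaks of core terms forming other terms.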

  11. The Framework of COCA

  12. COCA - Statistical Translation Module • Translation ambiguity: each Chinese core term T_C ∈ ITCTerm has a set of translations T_Set_E, with T_E ∈ T_Set_E • Objective • To estimate the likelihood of every translation using the extended terms of T_C • P(T_E | T_C) for all T_E ∈ T_Set_E
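
The transcript does not preserve the estimator used for P(T_E | T_C). One plausible reading, shown here purely as an assumption, is a relative-frequency estimate: count how often each candidate translation appears in the English side of the extended terms of T_C. The function name `translation_likelihood`, the uniform fallback, and the toy pairs are all hypothetical.

```python
from collections import Counter


def translation_likelihood(candidates, extended_pairs):
    """Relative-frequency estimate of P(T_E | T_C).

    candidates:     single-word English translations of the core term T_C
    extended_pairs: (Chinese extended term, English translation) pairs
                    whose Chinese side has T_C as its head
    """
    counts = Counter()
    for _zh, en in extended_pairs:
        words = en.lower().split()
        for cand in candidates:
            if cand.lower() in words:  # whole-word match only
                counts[cand] += 1
    total = sum(counts.values())
    if total == 0:
        # No evidence in the extended terms: fall back to a uniform guess.
        return {c: 1.0 / len(candidates) for c in candidates}
    return {c: counts[c] / total for c in candidates}


# Toy data, invented for illustration only.
pairs = [
    ("局域网络", "local area network"),
    ("计算机网络", "computer network"),
    ("神经网络", "neural net"),
]
probs = translation_likelihood(["network", "net"], pairs)
print(probs)
```

Matching whole words avoids counting "net" inside "network"; multi-word candidates would need phrase matching instead.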

  13. COCA - Sense Disambiguation Module • Maps a given T_C to a synset S through its translation set T_Set_E(T_C) • Mapping probability that an English term T_E takes a synset S, using frequency information in WordNet • Mapping probability that T_C takes a particular synset S via an English translation T_E
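
The two mapping probabilities can be sketched as follows. This assumes that P(S | T_E) is the normalized sense frequency of T_E and that the probability of one path T_C → T_E → S is the product P(T_E | T_C) · P(S | T_E); the slides' own formulas did not survive extraction, so both are assumptions. The sense counts and synset identifiers below are invented, not read from WordNet.

```python
def synset_given_english(sense_counts):
    """P(S | T_E) from per-synset frequency counts of one English term."""
    total = sum(sense_counts.values())
    if total == 0:
        # All counts are zero: treat every sense as equally likely.
        return {s: 1.0 / len(sense_counts) for s in sense_counts}
    return {s: c / total for s, c in sense_counts.items()}


def path_probabilities(p_te_given_tc, counts_by_term):
    """Probability of each path T_C -> T_E -> S, keyed by (T_E, S)."""
    paths = {}
    for te, p_te in p_te_given_tc.items():
        p_s = synset_given_english(counts_by_term[te])
        for synset, p in p_s.items():
            paths[(te, synset)] = p_te * p
    return paths


# Toy numbers, invented for illustration only.
p_te = {"network": 0.75, "net": 0.25}
counts = {
    "network": {"network.n.01": 3, "network.n.05": 1},
    "net": {"net.n.01": 2, "network.n.05": 2},
}
paths = path_probabilities(p_te, counts)
for key, value in sorted(paths.items()):
    print(key, round(value, 4))
```

In practice the counts could come from WordNet's tagged sense frequencies; many senses have a count of zero, which is why the uniform fallback is included.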

  14. COCA - Concept Selection Module • Combines three features • Multi-path feature • Hyponym feature • Part-of-speech feature • Using the union probability of independent events

  15. Feature 1 - Multi-Paths to Synset • Multiple paths are the paths between a Chinese core term and a synset via different English translations • The feature merges the probabilities of the multiple paths
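
A minimal sketch of the merge, assuming it uses the same union probability of independent events named on the concept selection slide: a synset reached by several paths with probabilities p_1, ..., p_n gets 1 − (1 − p_1)···(1 − p_n). The function name `merge_paths` and the toy path probabilities are invented for illustration.

```python
from collections import defaultdict


def merge_paths(path_probs):
    """Merge per-path probabilities into one score per synset.

    path_probs maps (english_term, synset) -> probability of that path.
    The score for a synset is the probability that at least one of its
    paths holds, treating the paths as independent events.
    """
    miss = defaultdict(lambda: 1.0)  # probability that every path fails
    for (_te, synset), p in path_probs.items():
        miss[synset] *= 1.0 - p
    return {synset: 1.0 - m for synset, m in miss.items()}


# Toy path probabilities, invented for illustration only.
toy_paths = {
    ("network", "network.n.01"): 0.5625,
    ("network", "network.n.05"): 0.1875,
    ("net", "net.n.01"): 0.125,
    ("net", "network.n.05"): 0.125,
}
merged = merge_paths(toy_paths)
print(merged)
```

Here "network.n.05" is reachable through both translations, so its merged score is higher than either single path, while single-path synsets keep their original probability.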

  16. Feature 2 - Hyponyms in Domain • Incorporates information from all the extended strings • An extended string uses the core term as its headword and is a hyponym of the core term • Length ratio • Union probability of independent events

  17. Feature 3 - Part of Speech • Probability of the POS tag pos(S) owned by a synset S given a core term T_C • POS tag estimation: heuristics for adjectives, verbs, and nouns based on position

  18. Integrate Features • Using Union Probability of Independent Events
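
The combination formula itself is not preserved in the transcript. A worked form, assuming the three feature scores are written P_1, P_2, P_3 and the final weight W (symbol names chosen here for illustration) and that the scores are treated as probabilities of independent events:

```latex
W(S \mid T_C) \;=\; 1 - \prod_{k=1}^{3} \bigl(1 - P_k(S \mid T_C)\bigr)
```

For example, feature scores of 0.5, 0.2 and 0.1 would give 1 − (0.5 × 0.8 × 0.9) = 0.64. Unlike simple addition, the result always stays within [0, 1].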

  19. Evaluation • Algorithm output • A pair <T_C_i, Synset_i> for each Chinese core term, taking the synset with the highest mapping weight • Evaluation standard • For each T_C_i, whether its mapping to a synset is the best match with respect to this domain • Answer preparation • Answers were prepared manually and independently by two IT-domain experts on the same data set
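
The output selection and scoring described above can be sketched as follows. This assumes a single gold synset per core term; how the two experts' answers were reconciled is not stated in the slides. The function names, weights, and synset identifiers are invented for illustration.

```python
def best_mapping(weights):
    """Pick, for each core term, the synset with the highest mapping weight."""
    return {
        term: max(candidates, key=candidates.get)
        for term, candidates in weights.items()
        if candidates
    }


def accuracy(predicted, gold):
    """Share of gold-annotated core terms whose predicted synset matches."""
    judged = [term for term in gold if term in predicted]
    if not judged:
        return 0.0
    correct = sum(predicted[term] == gold[term] for term in judged)
    return correct / len(judged)


# Toy weights and gold answers, invented for illustration only.
toy_weights = {
    "软件": {"software.n.01": 0.8, "package.n.02": 0.3},
    "网络": {"network.n.05": 0.4, "net.n.01": 0.6},
}
toy_gold = {"软件": "software.n.01", "网络": "network.n.05"}

predicted = best_mapping(toy_weights)
score = accuracy(predicted, toy_gold)
print(predicted, score)
```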

  20. Performance • The evaluation was conducted on the top N most frequent core terms • COCA achieves 71% accuracy (N is 28 in this paper) • By comparison, MRCOCA (Chen, 2007) achieved only about 50% • Two examples of core-term-to-synset mappings generated by the algorithm are given for “软件” (software) and “网络” (network)

  21. Conclusion • Evaluation of COCA on an English-Chinese bilingual term bank with more than 130K entries shows that • Accuracy improved by 42% relative to MRCOCA, our previous work (from about 50% to 71%) • The improvement comes from the three features and the new probability-based algorithm

  22. Conclusion (continued) • A term bank can help to quickly construct a domain core ontology by selecting the concept nodes and relations used in the domain • A bilingual term bank can further introduce a second-language realization of the core ontology effectively and automatically

  23. Future Works • Evaluation of the three features • How effective they are • How much they contribute to the final performance • Consideration of more features, such as abbreviations and the synset of the head word of a core term • Use of other resources

  24. Q&A
