Chinese Core Ontology Construction from a Bilingual Term Bank

Chinese Core Ontology Construction from a Bilingual Term Bank • Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying • Department of Computing, The Hong Kong Polytechnic University

Outline • Introduction • Related Works • Algorithm Design – COCA • Performance Evaluation • Conclusion

Introduction • What is a core ontology? • A mid-level ontology • Bridges the gap between an upper ontology and a domain ontology

Concepts and Terminologies • Upper Ontology • A general ontology that ensures reusability across different domains (e.g., Computer Program in SUMO) • Domain Ontology • An ontology that conceptualizes a specific domain (e.g., Free Software in the IT domain) • More application dependent; concepts have a broader extension • Mid-level Ontology (Core Concepts) • Basic concepts of a domain • More application independent; concepts have a richer intension • Core ontology (e.g., Software) • Frequently used, with the ability to form other concepts • Core Terms • Lexical units of core concepts

Related Works • Manually constructed ontologies • SUMO • Well-known upper-level ontology work based on lexicons • CoreLex (Buitelaar, P., 1998) • EuroWordNet (Rodríguez, 1998) • Ontology harmonization: core ontology • “Towards a Core Ontology for Information Integration” (M. Doerr, 2003) • The most similar work • “Enriching Core Ontology with Domain Thesaurus through Concept and Relation Classification” (Huang, 2007) • Uses concept and relation classification to enrich a core ontology

Our Previous Works • Chinese terminology extraction • Chinese core term extraction (Ji et al., 2007) • Preliminary work on automatic core ontology construction using an English-Chinese term bank (MRCOCA, Ontolex 2007; Chen, 2007) • Bilingual lexicon • Extended strings • Frequency information in synsets • Weights from extended strings are integrated into the final weight by simple addition • Mapping to synsets and SUMO achieves an accuracy of only about 50%

Issues • What kinds of concepts should be included? • How to identify core concepts? • If through core terms, disambiguation is needed • Which relations to identify, and how? • Making use of available resources • Chinese NLP resources are scarce • English NLP resources are abundant

Requirements of Core Ontology • The concepts must be widely accepted and commonly referenced • The corresponding core terms must be frequently used and productive • The concepts/terms can be mapped to an upper ontology, so the core ontology can inherit the attributes provided by the upper ontology

Core Ontology Construction Algorithm (COCA) for Chinese • Extract Chinese core terms from a bilingual term bank • Map each core term Tc to English terms • Map the English terms to WordNet synsets • Map each synset to an upper ontology concept in SUMO

COCA - Resources Used • ITCTerm • A domain-specific core term list (Chen, 2007) • CETBank • Chinese-English bilingual term bank • (The 1,500 most productive core terms extracted can serve as suffixes to form more than 50% of the terms in CETBank) • WordNet • SUMO • Mappings between WordNet and SUMO

The Framework of COCA

COCA – Statistical Translation Module • Translation ambiguity: each Chinese core term TC ∈ ITCTerm has a set of translations T_SetE, with TE ∈ T_SetE • Objective • To estimate the likelihood of every translation using the extended terms of TC • P(TE | TC) for all TE ∈ T_SetE
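The slide does not give the estimator itself, so the following is only a minimal sketch of one way P(TE | TC) could be estimated from extended terms: by relative frequency of each candidate translation on the English side of the extended-term entries. The function name, the token-matching rule, the uniform fallback, and the toy term-bank entries are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def estimate_translation_probs(candidates, extended_pairs):
    """Estimate P(TE | TC) by relative frequency (an assumed estimator).

    candidates: English translations of the core term (T_SetE).
    extended_pairs: (chinese_extended_term, english_translation) pairs
        whose Chinese side contains the core term TC.
    """
    counts = Counter()
    for _chinese, english in extended_pairs:
        tokens = english.lower().split()
        for te in candidates:
            # Count a candidate when it occurs as a token of the English side.
            if te.lower() in tokens:
                counts[te] += 1
    total = sum(counts.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return {te: 1.0 / len(candidates) for te in candidates}
    return {te: counts[te] / total for te in candidates}


# Hypothetical extended terms of the core term "软件" (toy data).
pairs = [
    ("自由软件", "free software"),
    ("系统软件", "system software"),
    ("软件包", "software package"),
    ("应用软件", "application program"),
]
probs = estimate_translation_probs(["software", "program"], pairs)
print(probs)
```

With this toy data, three of the four extended terms support "software" and one supports "program", so the estimate favours "software".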

COCA - Sense Disambiguation Module • Map a given TC to a synset S through its translation set T_SetE(TC) • Mapping probability of an English term TE taking a synset S, using frequency information in WordNet • Mapping probability of TC taking a particular synset S via an English translation TE
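The two mapping probabilities above can be sketched as follows, assuming that a path probability is the product P(TE | TC) · P(S | TE) and that P(S | TE) comes from sense frequency counts with add-one smoothing. Both the product form and the smoothing are assumptions; the synset identifiers and counts are hypothetical toy values, not real WordNet data.

```python
def synset_probs_for_english(sense_counts):
    """P(S | TE) from sense frequency counts, with add-one smoothing (assumed)."""
    total = sum(count + 1 for count in sense_counts.values())
    return {s: (count + 1) / total for s, count in sense_counts.items()}


def path_probs(translation_probs, sense_counts_by_term):
    """P(S | TC via TE) = P(TE | TC) * P(S | TE) for every (TE, S) path."""
    paths = {}
    for te, p_te in translation_probs.items():
        p_s_given_te = synset_probs_for_english(sense_counts_by_term.get(te, {}))
        for synset, p_s in p_s_given_te.items():
            paths[(te, synset)] = p_te * p_s
    return paths


# Hypothetical inputs for one core term with two English translations.
translation_probs = {"network": 0.8, "net": 0.2}
sense_counts_by_term = {
    "network": {"syn_a": 3, "syn_b": 1},
    "net": {"syn_c": 0, "syn_b": 1},
}
paths = path_probs(translation_probs, sense_counts_by_term)
for path, p in sorted(paths.items()):
    print(path, round(p, 4))
```

Because each P(S | TE) is normalized per English term, the path probabilities of one core term sum to 1 in this sketch.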

COCA - Concept Selection Module • Combining three features • Multi-path feature • Hyponyms feature • Part-of-speech feature • Using Union Probability of Independent Events

Feature 1 – Multi-Paths to Synset • Multiple paths are the paths between a Chinese core term and a synset via different English translations • The feature merges the probabilities of the multiple paths
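The slide does not state how the path probabilities are merged. A minimal sketch, assuming the merge uses the same "union probability of independent events" rule that the deck names for combining features, 1 − ∏(1 − p), might look like this. The function names and the toy path values are hypothetical.

```python
from collections import defaultdict
from math import prod


def union_probability(probs):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def merge_paths(paths):
    """Merge all paths that reach the same synset into one score (assumed rule)."""
    by_synset = defaultdict(list)
    for (_te, synset), p in paths.items():
        by_synset[synset].append(p)
    return {synset: union_probability(ps) for synset, ps in by_synset.items()}


# Hypothetical path probabilities keyed by (English translation, synset).
toy_paths = {
    ("network", "syn_a"): 0.5,
    ("net", "syn_a"): 0.2,
    ("net", "syn_b"): 0.1,
}
merged = merge_paths(toy_paths)
print(merged)
```

A synset reached through several translations scores higher than any single path alone, which is the intuition behind rewarding multiple paths.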

Feature 2 – Hyponyms in Domain • Incorporates information on all the extended strings • An extended string uses the core term as its headword and is a hyponym of the core term • Length ratio • Union Probability of Independent Events

Feature 3 – Part of Speech • Probability of the POS tag pos(S) owned by a synset S given a core term Tc • POS tag estimation: heuristics for adjective, verb, and noun based on position

Integrate Features • Using Union Probability of Independent Events
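The slide names the rule but the formula itself is not preserved here. Assuming the three feature scores are treated as probabilities of independent events, the standard union rule would give the following; the symbols W and P_k are introduced only for illustration.

```latex
W(S \mid T_c) \;=\; 1 - \prod_{k=1}^{3} \bigl(1 - P_k(S \mid T_c)\bigr)
% Worked example with hypothetical feature scores 0.6, 0.3 and 0.5:
% W = 1 - (0.4)(0.7)(0.5) = 1 - 0.14 = 0.86
```

Under this rule the combined weight is never lower than the largest single feature score, and a feature with score 0 leaves the result unchanged.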

Evaluation • Algorithm Output • A pair <Tc_i, Synset_i> for each Chinese core term, with the highest mapping weight • Evaluation Standard • For each Tc_i, whether its mapping to a synset is the best match with respect to this domain • Answer Preparation • Answers are prepared manually and independently by two IT domain experts on the same set of data
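The evaluation procedure above can be sketched as picking the highest-weight synset per core term and comparing it with an expert answer. This is a minimal sketch: the synset labels and weights are hypothetical, and using a single gold mapping is an assumption, since the slides do not say how the two experts' answers are reconciled.

```python
def best_mappings(weights):
    """For each core term, pick the synset with the highest mapping weight."""
    return {term: max(cands, key=cands.get) for term, cands in weights.items()}


def accuracy(predicted, gold):
    """Fraction of gold-standard core terms whose predicted synset matches."""
    correct = sum(1 for term, synset in gold.items() if predicted.get(term) == synset)
    return correct / len(gold)


# Hypothetical mapping weights and expert answers (toy data).
weights = {
    "软件": {"syn_software_1": 0.86, "syn_program_1": 0.40},
    "网络": {"syn_network_1": 0.55, "syn_net_1": 0.62},
}
gold = {"软件": "syn_software_1", "网络": "syn_network_1"}

predicted = best_mappings(weights)
score = accuracy(predicted, gold)
print(predicted, score)
```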

Performance • The evaluation was conducted on the top N most frequent core terms • COCA achieves 71% accuracy (N is 28 in this paper) • By comparison, MRCOCA (Chen, 2007) achieved only 50% • Two examples of core-term-to-synset mappings generated by the algorithm are given for “软件” (software) and “网络” (network).

Conclusion • Evaluation of COCA, repeated on an English-Chinese bilingual term bank with more than 130K entries, shows that the algorithm • Improves accuracy by 42% relative to MRCOCA (our previous work), from about 50% to 71% • The three features and the new probability-based algorithm account for the improvement

A term bank can help to quickly construct a domain core ontology by selecting the concept nodes and relations used in the domain • A bilingual term bank can further introduce a second-language realization of the core ontology effectively and automatically

Future Works • Evaluation of the three features • How effective they are • How much they contribute to the final performance • Consideration of more features, such as abbreviations and the synset of the head word of a core term • Use of other resources

Q&A
