
Corpus Exploitation from Wikipedia for Ontology Construction

Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen. The Department of Computing, The Hong Kong Polytechnic University.



  1. Corpus Exploitation from Wikipedia for Ontology Construction • Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen • The Department of Computing, The Hong Kong Polytechnic University

  2. Outline • Introduction • Related Work • Algorithm design • Classification Tree Traversal • Ranking nodes in the classification tree • Experiments and Evaluations • Conclusion and Future Work

  3. Background • Ontology Construction • Manual construction • Corpus is not necessary • Small scale • Automatic or semiautomatic construction • Domain specific corpus • Good domain knowledge coverage

  4. Related Work • Corpus Selection • Corpus by linguists • British National Corpus (BNC) [Collin F. Baker, et al., 1998] • Corpus from Publications • Reuters News Corpus [Latifur Khan, Feng Luo, 2002] • Corpus from Internet • Search Results from the Web as Corpus [P Cimiano, et al., 2004]

  5. Use of Wikipedia as a Resource • Statistics and analysis work • [A Lih, 2004], [Jakob Voss, 2005] • Link structure and cultural bias analysis of Wiki • [M Völkel, M Krötzsch, D Vrandecic, H Haller and R Studer, 2006], [F Bellomi and R Bonato, 2005] • Add semantic links • Add semantic links between concepts in Wiki pages • [M Völkel, 2006], [Michael Strube, Simone Paolo Ponzetto, 2006] • Corpus for XML retrieval • [L Denoyer, P Gallinari, 2006]

  6. Problems • Manually Selected Corpus • Domain experts needed • Time and labor intensive • Corpus Collection from Publications • Limitation in time and region • Internet Exploitation • Difficulty in domain specific data identification

  7. Wikipedia Overview • Established in 2001 • 500,000 articles in 2005 • 1 million articles in Nov. 2006 • More than 2 million articles to date • Different types of data • Abundance of domain specific data • Availability of category information • Too many reachable nodes

  8. Algorithm Design • Basic Idea • Make use of the classification tree to select only certain qualified reachable nodes • Classification Tree Traversal • Given a root node Pr (a category node) • Breadth-First-Search Algorithm • Initialization • Wr = 1 for the root node Pr • Wi = 0 if Pi is not on the current traversal path
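
The traversal step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `subcategories` mapping, the function names, and the recorded depth are assumptions made for the example. The slides only state that a breadth-first search starts from the root category Pr, with Wr = 1 and Wi = 0 for nodes not on the current traversal path.

```python
from collections import deque


def traverse_classification_tree(subcategories, root):
    """Breadth-first traversal from a root category node.

    `subcategories` is a hypothetical input format: a dict mapping a
    category name to the list of nodes it links to directly.
    Returns a dict mapping every reachable node to its BFS depth.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in subcategories.get(node, []):
            # The wiki category graph can contain cycles and shared
            # children, so each node is visited only once.
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def initial_weights(nodes, root):
    """Initialization from the slide: Wr = 1 for the root, 0 elsewhere."""
    return {node: (1.0 if node == root else 0.0) for node in nodes}


graph = {
    "IT": ["Software", "Hardware"],
    "Software": ["Compilers", "IT"],  # back-link to the root forms a cycle
    "Hardware": ["Compilers"],
}
reachable = traverse_classification_tree(graph, "IT")
weights = initial_weights(reachable, "IT")
```

How the weights are updated as the traversal proceeds is not recoverable from the transcript, so the sketch stops at initialization.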

  9. Tree traversal and weights • Wiki Graph • Classification Tree • In-edge • Out-edge • Nin(P) • Nout(P)
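
As a small illustration of the notation on this slide, the sketch below counts in-edges and out-edges per page from a list of directed links. It assumes that Nin(P) and Nout(P) denote the number of in-edges and out-edges of page P; the slide lists the symbols without defining them, and the `edges` format and function name are hypothetical.

```python
from collections import Counter


def edge_counts(edges):
    """Count in-edges and out-edges per node.

    `edges` is an iterable of (source, target) link pairs.
    Returns (n_in, n_out); both are Counters, so a node with no
    edges of a given kind reads as 0.
    """
    edges = list(edges)
    n_out = Counter(source for source, _ in edges)
    n_in = Counter(target for _, target in edges)
    return n_in, n_out


links = [("IT", "Software"), ("IT", "Hardware"), ("Software", "Compilers")]
n_in, n_out = edge_counts(links)
```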

  10. Ranking Schemes (1) • S1 • Considering the sum of scores of Pc's out-edges pointing to the classification tree against the total number of Pc's out-edges • The 1 in the denominator is to avoid it being 0
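
The formula for S1 did not survive in the transcript. One reading of the description above is S1(Pc) = (sum of the weights Wi of the classification-tree nodes that Pc links to) / (Nout(Pc) + 1). The sketch below implements that reading only; the function name and input formats are assumptions, and the original formula may differ in detail.

```python
def s1_score(out_links, tree_weights):
    """Score a candidate page Pc under one reading of scheme S1.

    `out_links`: targets of Pc's out-edges (hypothetical format).
    `tree_weights`: dict mapping classification-tree nodes to their
    weights Wi; nodes outside the tree contribute nothing.
    """
    in_tree_sum = sum(tree_weights.get(target, 0.0) for target in out_links)
    # The +1 keeps the denominator non-zero for pages with no out-edges.
    return in_tree_sum / (len(out_links) + 1)


example = s1_score(["A", "B", "C"], {"A": 1.0, "B": 0.5})
```

In the example, two of the three out-edges point into the tree with weights 1.0 and 0.5, so the score is 1.5 / 4 = 0.375.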

  11. Ranking Schemes (2) • S2 • Considering the summation of Pc's in-edges in the classification tree against the total number of in-edges of the nodes Pi, which are Pc's upper-level nodes

  12. Ranking Schemes (3) • S3 • Considering the summation of the out-edge nodes in the classification tree divided by both Pc's out-edge scores and the in-edge scores of its upper-level nodes Pi

  13. Data • Wikipedia Resource • English version in XML • 1,100,000 articles • Cutoff date: Nov. 30, 2006 • Domain Connected Branches • 549,486 nodes for IT • 549,433 nodes for biology

  14. Evaluation on Scheme Selection • Evaluation by sampling • For Top 20,000 nodes • 10 nodes in every 1,000 nodes • For Remaining nodes • 10 nodes in every 10,000 nodes • Corpus size • Top 20,000 • 98M for IT • 101M for Biology
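
The sampling plan on this slide can be sketched as follows. The slide gives the rates (10 nodes per 1,000 within the top 20,000, then 10 nodes per 10,000 for the rest) but not how the 10 nodes are picked. The sketch assumes uniform random sampling within consecutive blocks of the ranked list; the function name, parameter names, and seed are illustrative.

```python
import random


def sample_ranked_nodes(ranked, top_n=20_000, top_block=1_000,
                        rest_block=10_000, per_block=10, seed=0):
    """Draw evaluation samples from a list of nodes sorted by rank.

    Takes `per_block` nodes from each block of `top_block` nodes in
    the first `top_n` positions, then `per_block` nodes from each
    block of `rest_block` nodes in the remainder.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    samples = []

    def sample_blocks(start, stop, block_size):
        for lo in range(start, stop, block_size):
            chunk = ranked[lo:min(lo + block_size, stop)]
            # A short final block contributes at most its own length.
            samples.extend(rng.sample(chunk, min(per_block, len(chunk))))

    cut = min(top_n, len(ranked))
    sample_blocks(0, cut, top_block)
    sample_blocks(cut, len(ranked), rest_block)
    return samples


picked = sample_ranked_nodes(list(range(45_000)))
```

Sampling the top of the ranking more densely puts more manual checks where the corpus is actually taken from (the top 20,000 nodes), while still giving a coarse estimate for the long tail.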

  15. Sampling Results of Different Schemes • Table 1: Evaluation Results of Different Schemes in the IT Domain • Table 2: Evaluation Results of Different Schemes in the Biology Domain

  16. Overall Precision on sampled data

  17. Root Node Identification • Different root nodes lead to different classification structures • E.g. “Category: Electronics” • For electronics • For IT • Comparison with the Library of Congress Classification (abbreviated LACC in these slides) • Widely used library classification in most research and academic areas

  18. Comparisons with LACC • Comparisons of Classification Trees with Root Nodes from Respective Domains

  19. Comparisons with LACC (2) • Comparisons of Classification Tree Structures with LACC with Root Node: Electronics

  20. Conclusion • Acquire leaf nodes through qualified classification tree branches in Wiki • Best performance requires taking both in-edges and out-edges into consideration • Selection of a proper root node does affect the results • Pick the most common term as the root node

  21. Future Work • Improve Ranking Functions • Using page contents • Using hyperlinks in contexts of pages • Set different weight parameters for different domains

  22. Thanks! Q & A
