Enhancing Text Classification using Semantic Features from DBpedia
This paper presents an approach to text classification that uses semantic features derived from DBpedia. It compares the traditional "Bag of Words" representation with a "Bag of Conceptions" representation for document categorization. The experimental results indicate that semantic-level processing and normalization improve classification accuracy, especially for documents whose words refer to more than one category. The authors implemented a candidate expression detection algorithm and tested the method on a data set of closely related biology categories, where they report improved accuracy.
Presentation Transcript
AST2009 A Semantic Text Classification Based on DBpedia Rujiang Bai, Junhua Liao Shandong University of Technology Library Zibo 255049, China { brj, ljhbrj}@sdut.edu.cn
OUTLINE • 1.BACKGROUND • 2. DBpedia • 3.OUR PROPOSED METHODS • 4.EXPERIMENT • 5.CONCLUSION
1.BACKGROUND • “Bag of Words” (BOW) vs. “Bag of Conceptions” (BOC) • Semantic Features Representation
2. DBpedia DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
3.OUR PROPOSED METHODS • Definition 1 (Core Ontology). A core ontology is a structure O := (C, <_C) consisting of a set C, whose elements are called concept identifiers, and a partial order <_C on C, called the concept hierarchy or taxonomy. • Definition 2 (Subconcepts and Superconcepts). If c1 <_C c2 for c1, c2 ∈ C, then c1 is a subconcept (specialization) of c2 and c2 is a superconcept (generalization) of c1. If c1 <_C c2 and there exists no c3 ∈ C with c1 <_C c3 <_C c2, then c1 is a direct subconcept of c2, and c2 is a direct superconcept of c1, denoted by c1 ≺ c2.
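Definitions 1 and 2 can be illustrated with a small sketch. This is not the authors' implementation: the toy taxonomy, the dictionary layout and the function names (superconcepts, is_subconcept, is_direct_subconcept) are assumptions made for illustration, and the hierarchy is assumed to be acyclic so that <_C can be read as a strict order.

```python
from typing import Dict, Set

# Hypothetical toy taxonomy (not taken from DBpedia): each concept maps to
# some of its superconcepts. The edge Bacterium -> Organism is deliberately
# redundant so that "direct" has to be derived from Definition 2.
EDGES: Dict[str, Set[str]] = {
    "Bacterium": {"Microorganism", "Organism"},
    "Microorganism": {"Organism"},
    "Organism": set(),
}


def superconcepts(c: str, edges: Dict[str, Set[str]]) -> Set[str]:
    """Return every c2 with c <_C c2 (transitive closure of the edges)."""
    seen: Set[str] = set()
    stack = list(edges.get(c, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(edges.get(parent, ()))
    return seen


def is_subconcept(c1: str, c2: str, edges: Dict[str, Set[str]]) -> bool:
    """True if c1 <_C c2."""
    return c2 in superconcepts(c1, edges)


def is_direct_subconcept(c1: str, c2: str, edges: Dict[str, Set[str]]) -> bool:
    """True if c1 <_C c2 and no c3 satisfies c1 <_C c3 <_C c2."""
    if not is_subconcept(c1, c2, edges):
        return False
    return not any(
        is_subconcept(c3, c2, edges) for c3 in superconcepts(c1, edges)
    )
```

In this toy hierarchy, Bacterium is a subconcept of Organism but not a direct one, because Microorganism lies between them.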
3.OUR PROPOSED METHODS The candidate expression detection algorithm
Input: document d = {w1, w2, …, wn}, Lex = (SC, RefC) and window size k ≥ 1.
  i ← 1
  list Ls
  index-term s
  while i ≤ n do
    for j = min(k, n - i + 1) to 1 do
      s ← {wi … wi+j-1}
      if s ∈ SC then
        save s in Ls
        i ← i + j
        break
      else if j = 1 then
        i ← i + j
      end if
    end for
  end while
  return Ls
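The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name detect_candidates, the representation of the lexicon SC as a set of word tuples, and the sample lexicon entries are assumptions. Indices are 0-based, so min(k, n - i + 1) becomes min(k, n - i).

```python
from typing import List, Set, Tuple

Expression = Tuple[str, ...]


def detect_candidates(
    words: List[str], lexicon: Set[Expression], k: int
) -> List[Expression]:
    """Greedy longest-match scan with a window of at most k words.

    At each position the longest window is tried first. A match is saved
    and the scan jumps past it; if even the single word is not in the
    lexicon, the scan advances by one word.
    """
    if k < 1:
        raise ValueError("window size k must be at least 1")
    found: List[Expression] = []
    n = len(words)
    i = 0
    while i < n:
        for j in range(min(k, n - i), 0, -1):
            s = tuple(words[i:i + j])
            if s in lexicon:
                found.append(s)
                i += j
                break
            if j == 1:
                # No expression starts at this word; move on.
                i += 1
    return found


# Hypothetical lexicon entries, chosen only to exercise the function.
SAMPLE_LEXICON: Set[Expression] = {
    ("gene",),
    ("gene", "expression"),
    ("escherichia", "coli"),
}
SAMPLE_WORDS = "the gene expression of escherichia coli".split()
SAMPLE_RESULT = detect_candidates(SAMPLE_WORDS, SAMPLE_LEXICON, k=3)
```

Because longer windows are tried first, "gene expression" is kept as one expression rather than being split into the shorter entry "gene".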
4.EXPERIMENT • Datasets • Our goal is to obtain high performance for closely related categories. Therefore, to test our approach, we built a crawler to collect a data set from the Yahoo! website. It contains the closely related (ambiguous) categories under Science->Biology. The categories under Science->Biology used for training and testing are: Bio-Archaeology, Bio-Informatics, Genetics, Food Science and Microbiology.
4.EXPERIMENT Table 1. Confusion Matrix before Applying Semantic Processing
4.EXPERIMENT Table 2. Confusion Matrix after Applying Semantic Processing
4.EXPERIMENT Fig.3 Accuracy from Semantic Representation Terms vs. Bag of Words
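For readers unfamiliar with the metric, overall accuracy is conventionally obtained from a confusion matrix as the sum of the diagonal divided by the sum of all cells. The sketch below assumes rows are true categories and columns are predicted categories; the function name and the 2x2 counts are placeholders for illustration and are not values from Tables 1 or 2.

```python
from typing import List


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of documents on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no documents")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Placeholder counts only, NOT the experiment's data.
TOY_MATRIX = [
    [8, 2],
    [1, 9],
]
TOY_ACCURACY = accuracy(TOY_MATRIX)  # 17 correct out of 20 documents
```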
5.CONCLUSION • In this paper, we have discussed a novel approach that applies DBpedia’s background knowledge to represent documents, with the aim of boosting text categorization performance. • Our approach and experiments show that applying semantic-level processing and normalization helps achieve higher accuracy when classifying documents that contain words with cross-category references.
END THANKS!