This thesis explores automating the collection of training data for text-classification-based ontology mapping, in order to reduce human effort. It covers the Semantic Web, existing mapping approaches, and experimental results obtained with a Naïve Bayes text classifier.
Learning the Semantic Meaning of a Concept from the Web • Yang Yu • Master’s Thesis Defense • August 03, 2006
The Problem • Manually preparing training data for text classification based ontology mapping is expensive. • [Figure: the LIVING_THINGS ontology, with classes ANIMAL, PLANT, HUMAN, CAT, TREE, GRASS, MAN, WOMAN, ARBOR, FRUTEX]
The Thesis • Automatically collecting training data for the concepts defined in an ontology, using a web search engine (http://www.google.com/). • Benefits • Reduces the amount of human work • Enables fully automated ontology mapping
Overview • Background • The Semantic Web and ontology • Ontology Mapping • Proposal • System • Experimental Results • WEAPONS ontology • LIVING_THINGS ontology • Discussion and Conclusion
Semantic Web and Ontology • What is it? • “an extension of the current web” • An example: find all types of jets that are made in the USA • [Figure: a jet resource with property Made-in WA, where WA partOf USA]
Ontology Mapping • Interoperability problem • Independently developed ontologies for the same or overlapping domains • Mapping (see the sketch below) • r = f(Ci, Cj), where i = 1, …, n and j = 1, …, m • r ∈ {equivalent, subClassOf, superClassOf, complement, overlapped, other}
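As a minimal sketch only (not code from the thesis), the relation set and the mapping function's signature could be written as follows; all names are illustrative.

```python
from enum import Enum

class Relation(Enum):
    """The possible values of r in r = f(Ci, Cj)."""
    EQUIVALENT = "equivalent"
    SUBCLASS_OF = "subClassOf"
    SUPERCLASS_OF = "superClassOf"
    COMPLEMENT = "complement"
    OVERLAPPED = "overlapped"
    OTHER = "other"

def map_concepts(c_i: str, c_j: str) -> Relation:
    """r = f(Ci, Cj): decide how concept Ci of one ontology relates to
    concept Cj of another; filled in later by the text-classification approach."""
    raise NotImplementedError
```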
Approaches to Ontology Mapping • Manual mapping • String matching • Text classification • The semantic meaning of a concept is reflected in the training data that use the concept • Probabilistic feature model • Classification • Results depend heavily on the training data
Motivation • Preparing exemplars manually is costly • Billions of documents available on the web • Search engines
The Proposal • Using the concept defined in an ontology as a query and processing the search results to obtain exemplars • Verification • Build a prototype system • Check ontology mapping results
System overview – Part I • [Diagram: Ontology A → Parser → Queries → Retriever → Search Engine → Links to Web Pages; Retriever → WWW → HTML Docs → Processor → Text Files] • A sketch of the retrieval step follows below.
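A minimal sketch of the Retriever and Processor stages, assuming a hypothetical search(query) helper that returns result URLs (the thesis used Google); pages are downloaded and stripped to plain text.

```python
import urllib.request
from bs4 import BeautifulSoup  # third-party HTML parser, used only to strip tags

def fetch_exemplars(query: str, search) -> list[str]:
    """Retrieve the web pages returned for a query and convert each one
    into the contents of a plain-text exemplar file."""
    exemplars = []
    for url in search(query):                       # hypothetical search helper
        html = urllib.request.urlopen(url).read()   # download the HTML document
        text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
        exemplars.append(text)
    return exemplars
```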
The parser (Query expansion) • Each concept is turned into a query by concatenating the class labels on its path from the root, e.g. FOOD → FRUIT → APPLE yields the query FOOD+FRUIT+APPLE (a sketch follows below). • living things → living+things • animal → living+things+animal • plant → living+things+plant • cat → living+things+animal+cat • human → living+things+animal+human • man → living+things+animal+human+man • woman → living+things+animal+human+woman • tree → living+things+plant+tree • grass → living+things+plant+grass • frutex → living+things+plant+tree+frutex • arbor → living+things+plant+tree+arbor
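A minimal sketch of this path-concatenation step, assuming the ontology is available as a simple child-to-parent map; the function and variable names are illustrative, not the thesis code.

```python
def expand_query(concept: str, parent: dict[str, str]) -> str:
    """Build a search query by joining the labels on the path from the
    ontology's root down to the concept with '+'."""
    path = [concept]
    while path[-1] in parent:          # walk up to the root
        path.append(parent[path[-1]])
    return "+".join(label.replace(" ", "+") for label in reversed(path))

# Example with a fragment of the LIVING_THINGS ontology
parent = {"animal": "living things", "human": "animal", "man": "human"}
print(expand_query("man", parent))     # living+things+animal+human+man
```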
Naïve Bayes text classifier • Bow toolkit • McCallum, Andrew Kachites, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, http://www.cs.cmu.edu/~mccallum/bow, 1996. • rainbow -d model --index dir/* • rainbow -d model --query • Bayes Rule • Naïve Bayes text classifier
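A minimal sketch of the same index-then-classify workflow in Python, using scikit-learn purely as a stand-in for the Bow toolkit; the documents and class names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Exemplars retrieved from the web for two leaf classes of ontology A (illustrative)
docs = ["tanks are armored vehicles with heavy guns",
        "aircraft carriers launch fighter planes at sea"]
labels = ["TANK-VEHICLE", "AIRCRAFT-CARRIER"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # bag-of-words features (the index step)
clf = MultinomialNB().fit(X, labels)      # multinomial Naive Bayes model

# Classify an exemplar of a new concept from ontology B (the query step)
new_doc = vectorizer.transform(["an armored personnel carrier moves troops"])
print(dict(zip(clf.classes_, clf.predict_proba(new_doc)[0])))
```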
Bayes Rule • P(A | B) = P(B | A) * P(A) / P(B) • P(A | B) is the posterior, P(A) is the prior, and P(B) is the normalizing constant • Follows from P(B | A) = P(A, B) / P(A) and P(A | B) = P(A, B) / P(B) • (Mitchell, Tom, Machine Learning, McGraw Hill, 1997)
Naïve Bayes classifier • A text classification problem • “What’s the most probable classification of the new instance given the training data?” • vj: category j • (a1, a2, …, an): attributes of a new document • “So naïve”: the attributes are assumed to be conditionally independent given the category • (Mitchell, Tom, Machine Learning, McGraw Hill, 1997)
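For reference, the classifier's decision rule in Mitchell's (1997) formulation, with V the set of categories; the "naïve" conditional-independence assumption over the attributes gives the second line:

```latex
v_{MAP} = \operatorname*{argmax}_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
        = \operatorname*{argmax}_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)

v_{NB}  = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
```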
System overview – Part II • [Diagram: Text Files (A) → Rainbow → Model Builder → Feature Model for Ontology A; Text Files (B) → Rainbow → Calculator; the Calculator uses the Feature Model and both ontologies to produce the Mapping Results]
The model builder • Mutually exclusive and exhaustive • Leaf classes (a sketch of leaf-class extraction follows below) • C+ and C- • [Figure: the LIVING_THINGS ontology with its leaf classes CAT, MAN, WOMAN, GRASS, ARBOR, FRUTEX highlighted]
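A minimal sketch of selecting the leaf classes that serve as the mutually exclusive, exhaustive categories; the child-to-parent map is the same illustrative structure used earlier.

```python
def leaf_classes(parent: dict[str, str]) -> list[str]:
    """A class is a leaf if no other class names it as its parent."""
    non_leaves = set(parent.values())
    return [c for c in parent if c not in non_leaves]

parent = {"animal": "living things", "plant": "living things",
          "human": "animal", "cat": "animal",
          "man": "human", "woman": "human",
          "tree": "plant", "grass": "plant",
          "arbor": "tree", "frutex": "tree"}
print(leaf_classes(parent))
# ['cat', 'man', 'woman', 'grass', 'arbor', 'frutex']
```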
The calculator • The Naïve Bayes text classifier tends to give extreme values (probabilities close to 1 or 0) • Tasks • Feed exemplars to the classifier one by one • Keep a record of each classification result • Take averages and generate a report
An Example of the Calculator • 200 exemplars of the new class APC are fed to the classifier, which assigns each to a category in WeaponsA.n3 (a sketch follows below): • TANK-VEHICLE: 170 exemplars • AIR-DEFENSE-GUN: 20 exemplars • SAUDI-NAVAL-MISSILE-CRAFT: 10 exemplars • P(TANK-VEHICLE | APC) = 170 / 200 = 0.85 • P(AIR-DEFENSE-GUN | APC) = 0.10 • P(SAUDI-NAVAL-MISSILE-CRAFT | APC) = 0.05
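A minimal sketch of that calculation, assuming each exemplar receives a single predicted class; the counts mirror the example above, and the names are illustrative.

```python
from collections import Counter

def estimate_probabilities(predicted_classes: list[str]) -> dict[str, float]:
    """The fraction of a new concept's exemplars assigned to each class of
    ontology A serves as the estimate of P(class | concept)."""
    counts = Counter(predicted_classes)
    total = len(predicted_classes)
    return {cls: n / total for cls, n in counts.items()}

# 200 exemplars of APC classified against WeaponsA.n3's categories
predictions = (["TANK-VEHICLE"] * 170 + ["AIR-DEFENSE-GUN"] * 20
               + ["SAUDI-NAVAL-MISSILE-CRAFT"] * 10)
print(estimate_probabilities(predictions))
# {'TANK-VEHICLE': 0.85, 'AIR-DEFENSE-GUN': 0.1, 'SAUDI-NAVAL-MISSILE-CRAFT': 0.05}
```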
Experiments with the WEAPONS ontologies • Information Interpretation and Integration Conference (http://www.atl.lmco.com/projects/ontology/i3con.html) • WeaponsA.n3 and WeaponsB.n3 • Both define over 80 classes • More than 60 classes are leaf classes • Similar structure
Part of WeaponsA.n3 • [Figure: WEAPON → CONVENTIONAL-WEAPON, with subclasses ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, and WARPLANE, and classes including TANK-VEHICLE, AIRCRAFT-CARRIER, PATROL-CRAFT, and SUPER-ETENDARD]
Part of WeaponsB.n3 • [Figure: WEAPON → CONVENTIONAL-WEAPON, with subclasses ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, and WARPLANE, and classes including TANK-VEHICLE, LIGHT-TANK, APC, AIRCRAFT-CARRIER, LIGHT-AIRCRAFT-CARRIER, PATROL-WATERCRAFT, PATROL-BOAT, PATROL-BOAT-RIVER, FIGHTER-PLANE, FIGHTER-ATTACK-PLANE, and SUPER-ETENDARD-FIGHTER]
Expected Results • [Figure: expected mapping of the new WeaponsB.n3 classes (LIGHT-AIRCRAFT-CARRIER, PATROL-WATERCRAFT, PATROL-BOAT, PATROL-BOAT-RIVER, APC, LIGHT-TANK, FIGHTER-PLANE, FIGHTER-ATTACK-PLANE, SUPER-ETENDARD-FIGHTER) to the WeaponsA.n3 classes AIRCRAFT-CARRIER, PATROL-CRAFT, TANK-VEHICLE, and SUPER-ETENDARD]
A Typical Report • P(Ci | APC), where i = 1, …, 63: one conditional probability for each leaf class Ci of WeaponsA.n3
Classes with the highest conditional probability for each new class (whole file vs. sentences with keywords): • LIGHT-AIRCRAFT-CARRIER: whole file AIRCRAFT-CARRIER (0.65); keyword sentences AIRCRAFT-CARRIER (0.57) • APC: whole file SILKWORM-MISSILE-MOD (0.46); keyword sentences SELF-PROPELLED-ARTILLERY (0.36) • SUPER-ETENDARD-FIGHTER: whole file SILKWORM-MISSILE-MOD (0.66); keyword sentences MRBM (0.51) • FIGHTER-ATTACK-PLANE: whole file SILKWORM-MISSILE-MOD (0.83); keyword sentences MRBM (0.38) • PATROL-WATERCRAFT: whole file SILKWORM-MISSILE-MOD (0.28); keyword sentences PATROL-CRAFT (0.52) • PATROL-BOAT-RIVER: whole file SILKWORM-MISSILE-MOD (0.65); keyword sentences PATROL-CRAFT (0.54) • PATROL-BOAT: whole file SILKWORM-MISSILE-MOD (0.51); keyword sentences PATROL-CRAFT (0.66) • LIGHT-TANK: whole file SILKWORM-MISSILE-MOD (0.56); keyword sentences TANK-VEHICLE (0.3) • FIGHTER-PLANE: whole file AIRCRAFT-CARRIER (0.49); keyword sentences MRBM (0.38) • Conditional probabilities of the desired classes: P(TANK-VEHICLE | APC) = 0.28; P(SUPER-ETENDARD | SUPER-ETENDARD-FIGHTER) = 0.21
Different numbers of exemplars (whole file), Group-whole-50 vs. Group-whole-100: • LIGHT-AIRCRAFT-CARRIER: SILKWORM-MISSILE-MOD (0.60) vs. AIRCRAFT-CARRIER (0.65) • APC: SILKWORM-MISSILE-MOD (0.65) vs. SILKWORM-MISSILE-MOD (0.46) • SUPER-ETENDARD-FIGHTER: SILKWORM-MISSILE-MOD (0.74) vs. SILKWORM-MISSILE-MOD (0.66) • FIGHTER-ATTACK-PLANE: SILKWORM-MISSILE-MOD (0.83) vs. SILKWORM-MISSILE-MOD (0.83) • PATROL-WATERCRAFT: SILKWORM-MISSILE-MOD (0.64) vs. SILKWORM-MISSILE-MOD (0.28) • PATROL-BOAT-RIVER: SILKWORM-MISSILE-MOD (0.89) vs. SILKWORM-MISSILE-MOD (0.65) • PATROL-BOAT: SILKWORM-MISSILE-MOD (0.64) vs. SILKWORM-MISSILE-MOD (0.51) • LIGHT-TANK: SILKWORM-MISSILE-MOD (0.62) vs. SILKWORM-MISSILE-MOD (0.56) • FIGHTER-PLANE: SILKWORM-MISSILE-MOD (0.80) vs. AIRCRAFT-CARRIER (0.49)
Different numbers of exemplars (sentence), Group-sentence-50 vs. Group-sentence-100: • LIGHT-AIRCRAFT-CARRIER: AIRCRAFT-CARRIER (0.44) vs. AIRCRAFT-CARRIER (0.57) • APC: TANK-VEHICLE (0.54) vs. SELF-PROPELLED-ARTILLERY (0.36) • SUPER-ETENDARD-FIGHTER: HY-4-C-201-MISSILE (0.4) vs. MRBM (0.51) • FIGHTER-ATTACK-PLANE: ICBM (0.19) vs. MRBM (0.38) • PATROL-WATERCRAFT: PATROL-CRAFT (0.49) vs. PATROL-CRAFT (0.52) • PATROL-BOAT-RIVER: PATROL-CRAFT (0.36) vs. PATROL-CRAFT (0.54) • PATROL-BOAT: PATROL-CRAFT (0.37) vs. PATROL-CRAFT (0.66) • LIGHT-TANK: TANK-VEHICLE (0.59) vs. TANK-VEHICLE (0.3) • FIGHTER-PLANE: MRBM (0.38) vs. MRBM (0.38)
Comparison of mapping accuracy of different groups of experiments (accuracy judged by whether the desired class is mapped): • Group-whole-50: 0% • Group-whole-100: 11% • Group-sentence-50: 67% • Group-sentence-100: 56% • (The sentence-based groups also show a higher conditional probability for the desired classes)
Experiment with the LIVING_THINGS ontology • (1) Compute P(MAN | HUMAN) and P(WOMAN | HUMAN) • (2) Find a mapping for GIRL • [Figure: HUMAN with subclasses MAN and WOMAN]
Actual Experiment Results: L-1 • Results of experiment (1): • P(MAN | HUMAN): 0.75 (first 50 exemplars), 0.58 (first 100), 0.62 (first 200) • P(WOMAN | HUMAN): 0.24 (first 50 exemplars), 0.41 (first 100), 0.38 (first 200)
Actual Experiment Results: L-2 • With clustering on exemplars: P(ANIMAL | GIRL) = 0.83, P(PLANT | GIRL) = 0.17; P(HUMAN | GIRL) = 0.92, P(CAT | GIRL) = 0.08; P(WOMAN | GIRL) = 0.63, P(MAN | GIRL) = 0.37 • Without clustering on exemplars: P(ANIMAL | GIRL) = 0.76, P(PLANT | GIRL) = 0.23; P(HUMAN | GIRL) = 0.70, P(CAT | GIRL) = 0.30; P(MAN | GIRL) = 0, P(WOMAN | GIRL) = 1 • With additional classes: P(DOG | GIRL) = 0.56, P(CAT | GIRL) = 0.01, P(HUMAN | GIRL) = 0.43, P(PYCNOGONID | GIRL) = 0
Actual Experiment Results: L-3 • Comparison between different numbers of exemplars (sentence); values are for the first 50 / 100 / 200 exemplars: • P(ANIMAL | GIRL): 0.66 / 0.53 / 0.77 • P(PLANT | GIRL): 0.34 / 0.47 / 0.23 • P(HUMAN | GIRL): 0.86 / 0.56 / 0.43 • P(CAT | GIRL): 0.01 / 0.15 / 0.01 • P(DOG | GIRL): 0.13 / 0.29 / 0.56 • P(PYCNOGONID | GIRL): 0 / 0 / 0 • P(MAN | GIRL): 0.02 / 0.03 / 0 • P(WOMAN | GIRL): 0.98 / 0.97 / 1
Actual Experiment Results: Different Queries • Queries augmented with class properties: • living things → Living+things • animal → Living+things+animal+Animalia • plant → Living+things+plant+Plantae • cat → Living+things+animal+Animalia+cat+Felidae • human → Living+things+animal+Animalia+human+intelligent • man → Living+things+animal+Animalia+human+intelligent+man+male • woman → Living+things+animal+Animalia+human+intelligent+woman+female • tree → Living+things+plant+Plantae+tree • grass → Living+things+plant+Plantae+grass • frutex → Living+things+plant+Plantae+tree+Frutex • arbor → Living+things+plant+Plantae+tree+arbor
Actual Experiment Results: L-4 • Results of experiment (1) with new queries (whole file / keyword sentences): • P(MAN | HUMAN): 0.91 / 0.93 • P(WOMAN | HUMAN): 0.09 / 0.07 • Results of experiment (2) with new queries (whole file / keyword sentences): • P(ANIMAL | GIRL): 0.9 / 0.83 • P(PLANT | GIRL): 0.1 / 0.17 • P(HUMAN | GIRL): 0.78 / 0.83 • P(CAT | GIRL): 0.22 / 0.17 • P(MAN | GIRL): 0.14 / 0.16 • P(WOMAN | GIRL): 0.86 / 0.84
Limitation 1: An exemplar is not a sample of a concept • An exemplar is a combination of strings that represents some usage of a concept. • An exemplar is not an instance of a concept. • The way we calculate conditional probabilities is therefore only an estimate.
Limitation 2: Popularity does not equal relevancy • Limited by the search engine’s ranking algorithm (e.g., PageRank™) • Popularity does not equal relevancy • Weights cannot be specified for individual words in a search query
Limitation 3: Relevancy does not equal similarity • [Figure: the search results for concept A contain text for concept A (the desired exemplars), text related to concept A, text against concept A, and text for a related concept B]
Related Research • UMBC OntoMapper • Prasad, Sushama, Peng, Yun and Finin, Tim, A Tool for Mapping between Two Ontologies Using Explicit Information, AAMAS 2002 Workshop on Ontologies and Agent Systems, 2002. • CAIMEN • Lacher, Martin S. and Groh, Georg, Facilitating the Exchange of Explicit Knowledge through Ontology Mappings, Proceedings of the Fourteenth International FLAIRS Conference, 2001. • GLUE • Doan, AnHai, Madhavan, Jayant, Dhamankar, Robin, Domingos, Pedro, and Halevy, Alon, Learning to Match Ontologies on the Semantic Web, WWW2002, May 2002. • Google Conditional Probability • P(HUMAN | MAN) = 1.77 billion / 2.29 billion = 0.77 • P(HUMAN | WOMAN) = 0.6 billion / 2.29 billion = 0.26 • Wyatt, D., Philipose, M., and Choudhury, T., Unsupervised Activity Recognition Using Automatically Mined Common Sense, Proceedings of AAAI-05, pp. 21-27.
Conclusion and Future Work • Text retrieved from the web can be used as exemplars for text classification based ontology mapping • Many parameters affect the quality of the exemplars • There is noise in the processed documents • Future work • Clustering