Ontology Learning

Shalini Gupta - 07305R02, Apoorv Sharma - 07305913, Chirag Patel - 07305909, Shitanshu Verma - 07305037

Issue
• There is a lot of information
• Its current representation renders it uninterpretable for machines
• Consequence: most of the information remains undiscovered
• Big, popular search engines are able to search only 3-4% of the total information on the web

What is needed?
• Improved machine intelligence
• Machines that can read, understand, use and modify information
• With minimal human intervention

To achieve it?
• Enable machines to populate, enrich, evaluate and maintain their knowledge representation

What is an ontology?
• A representation format that conceptualizes a domain
• Captures classes, instances, attributes and relationships
• Provides a sound semantic grounding for machine-understandable descriptions of digital content
• Is used in various fields, e.g. SE and AI
• Is represented using languages such as OWL

What is ontology learning?
• The process of preparing and updating ontologies from sources such as documents in natural language
• With the help of dictionaries, thesauruses, etc.

Environment

The flow
• An initial ontology is given
• Information sources are given
• Machines work over the data sources to enrich the ontology
• Once the ontology is enriched, a consistency check and evaluation are done

Terms related to the process
• Ontology enrichment: improving an existing ontology
• Ontology population: creating a new ontology or adding new concepts to it
• Inconsistency resolution: resolving inconsistencies that come up while acquiring ontologies

Enrichment of Ontology
• Term identification
• Taxonomy extraction
• Non-taxonomic relationship extraction

Enrichment of Ontology
• Term identification: identify important terms in the text
• Taxonomy extraction: identify taxonomic relationships between the terms identified
• Non-taxonomic relationship extraction: identify other relationships

Review
• Ontology learning
• Ontology enrichment
• Term identification
• Taxonomy extraction
• Non-taxonomic relationship extraction

Term Identification: Basics
• Everything is a concept: an object, an idea, or a thing
• A term lexicalizes a concept
• A term is a word or multi-word string that conveys 'a single meaning' within a given community, e.g. company, Paris, man, cellphone, Red Hat, car parking
• Goal: find the representative concepts

Term Identification: Steps
• Term recognition: find the terms
• Term classification: cluster the terms that mean the same thing
• Term mapping: link the terms to well-defined concepts of referent data sources
• Various techniques exist for every step

Term Identification: Tokenizing
• Various combinations of linguistic techniques have been used to accomplish this step
• Tokenizing: scan the text in order to identify the boundaries of words and complex expressions

Term Identification: Tokenizing
• Remove the stop words like 'a', 'the', 'of', 'with'
• E.g. Check of the Electrical Bonding of External Composite Panels with a CORAS Resistivity-Continuity Test
• Terms: Check, Electrical Bonding, External Composite Panels, CORAS Resistivity-Continuity Test Set
• Generally, nouns are considered as candidate concepts
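The stop-word step can be sketched in a few lines of Python. This is only an illustration, not the presenters' tool: the stop-word list and the function name `candidate_terms` are assumptions, and the runs of words left between stop words are treated as candidate terms.

```python
import re

# Illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "with", "in", "on", "for", "and", "to"}

def candidate_terms(text):
    """Split the text at stop words and return the remaining word runs."""
    tokens = re.findall(r"[A-Za-z][A-Za-z\-]*", text)
    terms, current = [], []
    for token in tokens:
        if token.lower() in STOP_WORDS:
            # A stop word closes the current run of content words.
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current))
    return terms

title = ("Check of the Electrical Bonding of External Composite Panels "
         "with a CORAS Resistivity-Continuity Test")
terms = candidate_terms(title)
print(terms)
```

A part-of-speech tagger would normally be applied afterwards to keep only the noun phrases.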

Term Identification: Importance of a term
• The TF-IDF technique can be used to find the important keywords [6]
• It is a balanced measure: a word is more important if it appears several times in a target document and, at the same time, appears rarely in other documents
• Seed concepts can be taken from existing ontologies
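A minimal TF-IDF sketch follows, assuming the common weighting tf × log(N / df) with whitespace tokenization. The three documents and the function name `tf_idf` are made up for illustration; other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, scoring tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Hypothetical toy corpus.
docs = [
    "hotel room hotel price",
    "hostel room price",
    "train ticket price",
]
scores = tf_idf(docs)
print(max(scores[0], key=scores[0].get))
```

In this toy corpus 'price' occurs in every document and so scores zero, while 'hotel' is frequent in the first document and absent elsewhere.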

Term Identification: Importance of a term
• Multi-word terms
• The C/NC-value method [5] considers:
(1) the frequency of occurrence,
(2) the frequency of occurrence as a sub-string of other candidate terms,
(3) the number of candidate terms containing the given term as a sub-string,
(4) the number of words contained in the candidate term
• The relevant terms can also be determined by their mutual cohesiveness, using Mutual Expectation
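The four factors can be combined as in the C-value part of the method. The sketch below follows the usual statement of the measure in [5]: log2 of the term length times the frequency, reduced by the average frequency of the longer candidates that contain the term. The NC-value context weighting is omitted, and the candidate terms and counts are hypothetical.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_value(freq):
    """freq maps candidate terms (space-separated words) to corpus frequency."""
    words = {term: tuple(term.split()) for term in freq}
    result = {}
    for term, f in freq.items():
        # Frequencies of the longer candidate terms that contain this term.
        nesting = [freq[other] for other in freq
                   if contains(words[other], words[term])]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        result[term] = math.log2(len(words[term])) * f
    return result

# Hypothetical counts for illustration only.
freq = {"soft contact lens": 4, "contact lens": 10, "hard contact lens": 2}
values = c_value(freq)
print(values)
```

Because of the log2 factor, single-word candidates score zero, which is why the measure is used for multi-word terms.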

Term Identification: Morphological Analysis
• Uses morphological knowledge of a word [9]
• A technique which identifies a word stem from a full word form
• To identify small domain-specific units, it studies patterns of word formation and attempts to formulate rules using the word structure
• E.g. in the biomedical domain a word ending in "-ofilous" or "-itis" is very probably a bio-molecule or a medical term
• Advantage: can identify "background terms" even when they appear with low frequency

Term Identification: Named Entity Recognition
• Recognition of person, location and organization names as single complex entities
• Also complex date and time expressions, percentages and monetary values
• E.g. 'Merrill Lynch'
• The next step associates single words or complex expressions with concepts, e.g. 'Merrill Lynch' is related to the concept organization

Identifying Relationships
• More information for later steps
• Dependency relations:
• Between a word and its neighbours, the mind perceives connections, the totality of which forms the structure of the sentence
• Structural connections establish dependency relations between the words

Deriving Relationships from Dependency Relations
• Syntactic dependency relations coincide closely with semantic relations [3]
• E.g. 'France Telecom in Paris offers the new DSL technology.'
• Dependency relations give a link between France Telecom (organization) and Paris (city)
• From this we can derive a semantic relationship between organization and city

• Term Identification
• Identifying Relationships
• Taxonomic Relationships
• Non-Taxonomic Relationships

Taxonomy Construction
• A hierarchy of concepts
• Inclusion relations provide a tree view of the ontology and imply inheritance between super-concepts and sub-concepts
• E.g. 'living being' is a super-concept and 'mammal' is a sub-concept
• In an ontology, the root node is the most general concept for the domain of interest

Discovering taxonomic relations
• Based on lexico-syntactic patterns
• Can find inclusion relations between concepts through simple pattern matching on a set of documents
• E.g. the pattern: NP such as NP, NP, ..., and NP
• '...works by authors such as Herrick, Goldsmith, and Shakespeare' yields:
hyponym("author", Herrick)
hyponym("author", Goldsmith)
hyponym("author", Shakespeare)
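The 'such as' pattern can be approximated with a regular expression. This sketch makes two loud simplifications that are assumptions, not part of the method: instead of real noun-phrase chunking it takes the single word before 'such as' as the hypernym and capitalised words as hyponyms, and it singularises the hypernym by dropping a final 's'.

```python
import re

# Comma / "and" / "or" separated list of capitalised words (stand-in for NPs).
HYPONYM_LIST = (r"[A-Z]\w+(?:\s*,\s*(?:(?:and|or)\s+)?[A-Z]\w+"
                r"|\s+(?:and|or)\s+[A-Z]\w+)*")
SUCH_AS = re.compile(r"(\w+)\s+such\s+as\s+(" + HYPONYM_LIST + r")")
SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")

def naive_singular(word):
    """Very rough singularisation: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def extract_hyponyms(sentence):
    """Return (hypernym, hyponym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = naive_singular(match.group(1).lower())
        for hyponym in SEPARATOR.split(match.group(2)):
            pairs.append((hypernym, hyponym))
    return pairs

pairs = extract_hyponyms(
    "works by authors such as Herrick, Goldsmith, and Shakespeare")
print(pairs)
```

A production system would run the pattern over parsed noun phrases rather than raw strings.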

Discovering new patterns
• The idea is to use a pattern learner to generate new patterns
• Generated patterns can then be used to generate new information (new inclusion relations), as well as to assess the validity of extracted information
• E.g. starting from the pattern 'NP such as NP, NP, ..., and NP' we can generate new patterns like:
NP is NP
NP, NP, ..., and other NP
NP, especially NP, NP, ..., and NP

Algorithm for finding new patterns
1. Decide on a lexical relation R that is of interest, e.g. "group/member" or a hyponym relation like (author, Shakespeare).
2. Gather a list of terms/instances for which this relation holds.
3. Find places in the corpus where these terms/instances occur syntactically near one another and record the environment.
4. Find new patterns using these environments.
5. Once a new pattern has been positively identified, use it to gather more instances of the target relation and go to Step 2.

Multi-word concepts
• A concept may be represented by a multi-word term
• A concept 'A' is a hyponym of a concept 'B' if:
• A has more tokens than B
• all the tokens of B are present in A
• both terms have the same head
• E.g. the concepts 'private customer' and 'business customer' are hyponyms of the concept 'customer'
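The three conditions translate directly into a small check. The sketch assumes English terms whose head is the last token; that head rule and the function name `is_hyponym` are assumptions made for illustration.

```python
def is_hyponym(a, b):
    """True if term `a` is a hyponym of term `b` under the three token rules."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    if not b_tokens:
        return False
    return (
        len(a_tokens) > len(b_tokens)                     # A has more tokens than B
        and all(token in a_tokens for token in b_tokens)  # all tokens of B are in A
        and a_tokens[-1] == b_tokens[-1]                  # same head (assumed: last token)
    )

print(is_hyponym("private customer", "customer"))
print(is_hyponym("customer service", "customer"))
```

The second call fails the head test: 'customer service' contains the token 'customer' but its head is 'service'.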

Mining non-taxonomic relations
• Relationships other than is-a relationships
• E.g. linguistic processing may find that the word 'cost' occurs frequently with the words 'hotel', 'guest house' and 'youth hostel' in sentences like 'Costs at the youth hostel are $20 per night'
• The relations (cost, hotel), (cost, guest house) and (cost, youth hostel) exist
• The discovery algorithm finds support and confidence measures for these pairs, as well as for relationships at higher levels of abstraction, such as accommodation and costs

Finding non-taxonomic relations
• Based on the basic association rule algorithm [3]
• Basic association rule algorithm:
• Given a set of transactions, T
• Each transaction has a set of items i1, i2, ..., in
• Goal: compute association rules of the form i1 → i2
• Trick: exploits the fact that many items appear together, so the occurrence of one implies the occurrence of another with a high probability (confidence)

Association Rule Mining
• E.g. consider the transactions:
(bread, butter, jam, chips)
(bread, butter, jam, ketchup)
(ketchup, chips)
(bread, butter, jam, chips)
(bread, rice)
• Rule X → Y, e.g. bread → butter, jam
• Support = n(X ∪ Y) / N, e.g. Support = 3/5
• Confidence = n(X ∪ Y) / n(X), e.g. Confidence = 3/4
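The two measures can be checked against the five transactions above with a short script. The function names are illustrative; n(X ∪ Y) is read as the number of transactions containing every item of X and Y.

```python
transactions = [
    {"bread", "butter", "jam", "chips"},
    {"bread", "butter", "jam", "ketchup"},
    {"ketchup", "chips"},
    {"bread", "butter", "jam", "chips"},
    {"bread", "rice"},
]

def count(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return count(x | y, transactions) / count(x, transactions)

x, y = {"bread"}, {"butter", "jam"}
print(support(x, y, transactions))     # 3/5
print(confidence(x, y, transactions))  # 3/4
```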

Algorithm
1. Extend each transaction to include the ancestors of each item. E.g. include the word 'accommodation' in the transactions containing the word 'guest house'.
2. Determine association rules of the form Xk → Yk where |Xk| = 1 and |Yk| = 1.
3. Determine the confidence for all rules that exceed the user-determined support.
4. Prune the rules subsumed by ancestral rules. E.g. if we found the two rules (cost, accommodation) and (cost, hotel), we prune the latter rule (cost, hotel).
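The four steps can be sketched as follows. Everything concrete here is hypothetical: the `PARENT` taxonomy, the toy transactions, the thresholds and the function names. Pruning is done the simple way the example describes (drop a rule when a rule between ancestors of its items was also found), and pairs of an item with its own ancestor are skipped because they co-occur by construction.

```python
from itertools import permutations

# Hypothetical taxonomy: child -> parent.
PARENT = {"hotel": "accommodation", "guest house": "accommodation",
          "youth hostel": "accommodation"}

def ancestors(item):
    """All ancestors of an item, nearest first."""
    found = []
    while item in PARENT:
        item = PARENT[item]
        found.append(item)
    return found

def extend(transaction):
    """Step 1: add the ancestors of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

def mine_rules(transactions, min_support, min_confidence):
    extended = [extend(t) for t in transactions]
    n = len(extended)
    items = set().union(*extended)
    rules = {}
    # Steps 2 and 3: single-item rules X -> Y with enough support and confidence.
    for x, y in permutations(items, 2):
        if y in ancestors(x) or x in ancestors(y):
            continue  # an item and its ancestor always co-occur
        both = sum(1 for t in extended if x in t and y in t)
        only_x = sum(1 for t in extended if x in t)
        if both / n >= min_support and both / only_x >= min_confidence:
            rules[(x, y)] = (both / n, both / only_x)

    # Step 4: drop rules subsumed by a rule between ancestors.
    def subsumed(rule):
        x, y = rule
        general = {(gx, gy) for gx in [x] + ancestors(x)
                   for gy in [y] + ancestors(y)}
        general.discard(rule)
        return any(g in rules for g in general)

    return {r: v for r, v in rules.items() if not subsumed(r)}

transactions = [{"cost", "hotel"}, {"cost", "guest house"},
                {"cost", "youth hostel"}, {"cost", "hotel"}, {"hotel", "city"}]
rules = mine_rules(transactions, min_support=0.3, min_confidence=0.5)
print(rules)
```

With these toy numbers (cost, hotel) passes both thresholds but is pruned because (cost, accommodation) was also found.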

Statistics-based Extraction of Taxonomic Relations [12][13]
• Uses hierarchical clustering
• Groups similar terms in a bottom-up fashion
• Uses the cosine similarity function
• The cosine measure, or normalized correlation coefficient, between two vectors x and y is given by
$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$

Algorithm

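Computation of similarity function
• The similarity matrix supplies one vector per term, for example:
• Hotel vector = (0, 14, 7, 4, 6)
• Accommodation vector = (14, 0, 11, 2, 5)
• cos(Hotel, Accommodation) = (0·14 + 14·0 + 7·11 + 4·2 + 6·5) / (√297 · √346) = 115 / (√297 · √346) ≈ 0.36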
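The computation can be reproduced with the standard cosine definition, the dot product divided by the product of the Euclidean norms, applied to the two vectors given on the slide. The function name `cosine` is illustrative.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

hotel = (0, 14, 7, 4, 6)
accommodation = (14, 0, 11, 2, 5)
similarity = cosine(hotel, accommodation)
print(round(similarity, 3))  # 0.359
```

In bottom-up clustering the pair of terms (or clusters) with the highest similarity is merged first.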

Case study: Web-based Ontology Learning with ISOLDE
• ISOLDE (Information System for Ontology Learning and Domain Exploration) produces a domain ontology from a base ontology
• It uses the following:
• An unsupervised named entity recognition system
• Web resources like DWDS, Wikipedia and Wiktionary

Analysis steps used by ISOLDE
1. Named-entity recognition (NER): uses a domain-specific corpus, a base ontology and a general-purpose NER system (SProUT, see Drozdzynski et al. 2004) to find instances for the classes in the base ontology.
2. Linguistic pattern analysis: extracts class candidates from the context of the instances extracted in step 1, using lexico-syntactic patterns.
3. Collecting web-based knowledge: collects information on and between the extracted class candidates from online resources and integrates it into a new or extended taxonomy/ontology.

Architecture

Stage-wise Examples
• After step 1 we get named entities such as Ballack and Munich from the soccer corpus
• In the second step we find the class candidates for the named entities from sentences in the corpus, then filter the domain-specific candidates using the χ² method
• E.g. 'Ballack, the best midfielder in the German national team.' gives 'midfielder' as the class candidate of Ballack
• In the third step we search the web for the class candidates; the Wikipedia definition of midfielder is 'A midfielder is a player whose position of play is midway between the attacking strikers and the defenders'

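Example contd.
• We learn the relation 'midfielder is a player' (a taxonomic relationship)
• Relevance factor: χ²
• $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where O holds the observed frequencies and E the expected frequencies
• O matrix for 'striker' [matrix not reproduced]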
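A χ² relevance score can be computed from a table of observed counts using the standard Pearson statistic. This is a sketch under assumptions: the 2×2 layout (term vs. other words, domain corpus vs. reference corpus) and the counts are hypothetical and are not the 'striker' figures from the case study.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical counts: rows = (term, all other words),
# columns = (domain corpus, reference corpus).
observed = [[30, 10],
            [70, 890]]
print(chi_square(observed))
```

A high value indicates that the term is distributed very differently across the two corpora, which is what marks it as domain specific.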

Issues in Learning
• Human-understandable vs machine-understandable
• Learning higher-degree relations
• Mapping to a high-level ontology
• Evaluation benchmark
• Incremental ontology learning
• Multi-agent learning

Application
• The use of ontologies is ubiquitous in information systems [2]
• Improving the performance of information retrieval and reasoning
• Making data interoperable between different applications
• Ontology-type semantic descriptions of behaviors and services allow software agents in a multi-agent system to coordinate themselves better

References
[1] Elias Zavitsanos, Georgios Paliouras, and George Vouros. Ontology Learning and Evaluation: A Survey. Technical Report, 2006.
[2] Nicolas Weber and Paul Buitelaar. Web-based Ontology Learning with ISOLDE. DFKI GmbH, Language Technology Lab, Saarbrücken, Germany, 2006.
[3] Alexander Maedche and Steffen Staab. Mining Ontologies from Text. 2000.
[4] Alexander Maedche, Viktor Pekar, and Steffen Staab. Ontology Learning Part One: On Discovering Taxonomic Relations from the Web. 2003.

References
[5] K. Frantzi, S. Ananiadou, and H. Mima. Automatic recognition of multi-word terms: The C-value/NC-value method. 3(2):115–130, 2000.
[6] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[7] D.I. Moldovan and R.C. Girju. An interactive tool for the rapid development of knowledge bases. International Journal on Artificial Intelligence Tools (IJAIT), 10(1-2), 2001.

References
[8] J.D. Cohen. Highlights: Language and domain independent automatic indexing terms for abstracting. Journal of the American Society for Information Science, 46(3):162–174, 1995.
[9] U. Heid. A linguistic bootstrapping approach to the extraction of term candidates from German text. Terminology, 5(2):161–181, 1998.
[10] L.M. Iwanska, N. Mata, and K. Kruger. Fully Automatic Acquisition of Taxonomic Knowledge from Large Corpora of Texts, pages 335–345. MIT/AAAI Press, 2000.

References
[11] J.U. Kietz, A. Maedche, and R. Volz. A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet. Juan-Les-Pins, France, 2000.
[12] A. Maedche, V. Pekar, and S. Staab. Ontology learning part one: on discovering taxonomic relations from the web. In Proceedings of the Web Intelligence conference. Springer Verlag, 2002.
[13] Vincent Schickel-Zuber and Boi Faltings. Using hierarchical clustering for learning the ontologies used in recommendation systems. KDD 2007: 599-608.
[14] A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In Proceedings of ECAI 2000, IOS Press, Amsterdam, 2000.

Thank You
