Ontology Learning
Shalini Gupta - 07305R02
Apoorv Sharma - 07305913
Chirag Patel - 07305909
Shitanshu Verma - 07305037
Issue: there is a lot of information
Its current representation renders it uninterpretable for machines
Consequence: most of the information remains undiscovered
Big and popular search engines are able to search only 3-4% of the total information on the web.
Improved machine intelligence.
Make machines read, understand, use, and modify information.
With minimal human intervention.
Maintain their knowledge representation.
A representation format that conceptualizes a domain
Captures classes, instances, attributes, relationships
Provides a sound semantic ground for machine-understandable descriptions of digital content
Is used in various fields, e.g. SE, AI
Is represented using languages such as OWL
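As a rough illustration of what such a representation holds, the four kinds of elements listed above can be sketched as a small data structure. The class name `Ontology`, its field layout, and the `located_in` relation are assumptions made for this sketch; they are not OWL constructs.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    """Toy container for the four kinds of ontology elements (hypothetical layout)."""
    classes: set = field(default_factory=set)
    instances: dict = field(default_factory=dict)      # instance name -> class name
    attributes: dict = field(default_factory=dict)     # (instance, attribute) -> value
    relationships: set = field(default_factory=set)    # (subject, relation, object)

onto = Ontology()
onto.classes.update({"organization", "city"})
onto.instances["France Telecom"] = "organization"
onto.instances["Paris"] = "city"
onto.attributes[("Paris", "country")] = "France"
onto.relationships.add(("France Telecom", "located_in", "Paris"))

# Simple consistency check: every instance must belong to a declared class.
consistent = all(cls in onto.classes for cls in onto.instances.values())
```

A real ontology language such as OWL adds formal semantics on top of this kind of structure, which the sketch does not attempt to model.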
ontologies from sources such as
Documents in natural language
with the help of
Initial ontology is given
Information sources are given
Machines work over the data sources to
enrich the ontology
consistency check is done
Improving an existing ontology
Creating new ontology or adding new concepts to it
resolving inconsistencies that come up while acquiring ontologies
Identify important terms in the text
Identify taxonomical relationships between the identified terms
Non-taxonomical relationship extraction: identify other relationships
Everything is a concept.
An object, an idea, or a thing.
A term lexicalizes a concept.
A Word or Multi-word string that conveys 'a single meaning' within a given community
e.g. company, Paris, man, cellphone, Red Hat, car parking
Goal: Find out representative concepts.
Term Recognition: Find the terms.
Term Classification: Cluster terms that refer to the same concept.
Term Mapping: Link the terms to well-defined concepts of referent data sources.
Various techniques exist for every step.
Different combinations of linguistic techniques have been used to accomplish this step.
Scan the text in order to identify boundaries of words and complex expressions
Remove the stop words like 'a', 'the', 'of', 'with'
E.g. Check of the Electrical Bonding of External Composite Panels with a CORAS Resistivity-Continuity Test Set
Terms: Check, Electrical Bonding, External Composite Panels, CORAS Resistivity-Continuity Test Set.
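The two steps above can be sketched as follows: the text is split into tokens, and stop words act as boundaries between candidate terms. This is a minimal sketch, assuming whitespace tokenization and a tiny hand-written stop-word list; the function name `candidate_terms` is hypothetical, and the example sentence is written with the final term as "Test Set", as in the term list.

```python
STOP_WORDS = {"a", "an", "the", "of", "with"}  # tiny illustrative list

def candidate_terms(sentence):
    """Split a sentence into candidate terms, using stop words as boundaries."""
    terms, current = [], []
    for token in sentence.split():
        if token.lower() in STOP_WORDS:
            if current:                      # a stop word closes the current term
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:                              # flush the last term
        terms.append(" ".join(current))
    return terms

sentence = ("Check of the Electrical Bonding of External Composite Panels "
            "with a CORAS Resistivity-Continuity Test Set")
terms = candidate_terms(sentence)
print(terms)
```

Real systems typically combine this with part-of-speech tagging so that only noun phrases are kept.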
Generally nouns are considered as candidate concepts
TF-IDF technique can be used to find the important keywords 
a balanced measure stating that a word is more important if it appears several times in a target document and at the same time it appears rarely in other documents.
Seed-concepts can be used from existing ontologies.
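The TF-IDF weighting described above can be sketched in a few lines. This is one common variant (relative term frequency times the logarithm of the inverse document frequency), not necessarily the exact formula used in the cited work; the toy documents and the function name `tf_idf` are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))            # count each term once per document
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

docs = [
    ["hotel", "cost", "hotel", "room"],      # target document
    ["cost", "flight"],
    ["cost", "train"],
]
scores = tf_idf(docs)
# "cost" appears in every document, so its score is 0;
# "hotel" is frequent here and absent elsewhere, so it scores highest.
```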
The C/NC-value method scores each candidate term using:
(1) the frequency of occurrence,
(2) the frequency of occurrence as a sub-string of other candidate terms,
(3) the number of candidate terms containing the given term as a sub-string,
(4) the number of words contained in the candidate term
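The four quantities above can be combined as in the C-value part of the method. The sketch below follows the C-value formula as it is commonly stated: log2 of the term length times the frequency, with the average frequency of the longer terms that contain it subtracted for nested terms. The NC-value context weighting is left out, and the term counts are hypothetical.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_value(freq):
    """freq maps candidate terms (tuples of words) to corpus frequencies."""
    scores = {}
    for term, f in freq.items():
        nesting = [other for other in freq if contains(other, term)]
        if nesting:
            # discount occurrences that are explained by longer candidate terms
            f = f - sum(freq[other] for other in nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores

freq = {  # hypothetical counts
    ("youth", "hostel"): 6,
    ("cheap", "youth", "hostel"): 2,
    ("youth", "hostel", "cost"): 2,
}
scores = c_value(freq)
```

Note that with the log2 length factor a single-word term always scores 0, which is why the method targets multi-word terms.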
Relevant terms can also be determined by their mutual cohesiveness, measured using Mutual Expectation
Use of morphological knowledge of a word 
A technique which identifies a word-stem from a full word-form
To identify small domain-specific units
studies patterns of word-formation and attempts to formulate rules using the word structure.
e.g. In the biomedical domain a word ending in “-ofilous” or “-itis” is very probably a bio-molecule or a medical term
Advantage: Can identify “background terms” even with low frequency of appearance
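A suffix rule of this kind can be sketched as a simple string test. The suffix list is taken from the example above; the function name and the sample words are hypothetical, and real systems use much richer morphological rules.

```python
DOMAIN_SUFFIXES = ("itis", "ofilous")  # suffixes from the example above

def is_candidate_domain_term(word):
    """Flag a word as a likely domain term based on its ending alone."""
    return word.lower().endswith(DOMAIN_SUFFIXES)

words = ["arthritis", "hotel", "Hepatitis"]
candidates = [w for w in words if is_candidate_domain_term(w)]
```

Because the test does not depend on frequency, a term that occurs only once in the corpus is still picked up.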
person, location, organization names as single complex entities
Complex date and time expressions
percentage, monetary value
E.g. 'Merrill Lynch'
The next step associates single words or complex expressions with the concepts
e.g. 'Merrill Lynch' is related to the concept organization
Syntactic dependency relations coincide closely with semantic relations 
e.g. France Telecom in Paris offers the new DSL technology.
Dependency relations would give a link between France Telecom (organization) and Paris (city)
From this we can derive a semantic relationship between organization and city
Hierarchy of concepts
Inclusion relations provide a tree view of the ontology and imply inheritance between super-concepts and sub-concepts.
E.g. 'Living being' is a super-concept and 'mammal' is a sub-concept.
In an ontology, the root node is the most general concept for the domain of interest.
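Inheritance along inclusion relations can be sketched with a child-to-parent map. The concept 'dog' and the attribute names are hypothetical additions used only to make the inheritance visible.

```python
# child concept -> super-concept ("dog" is a hypothetical extra level)
PARENT = {"mammal": "living being", "dog": "mammal"}

# attributes declared directly on each concept (hypothetical values)
ATTRIBUTES = {"living being": {"alive"}, "mammal": {"warm-blooded"}}

def super_concepts(concept):
    """Walk up the tree from a concept to the root."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def inherited_attributes(concept):
    """A sub-concept has its own attributes plus those of all super-concepts."""
    attrs = set(ATTRIBUTES.get(concept, set()))
    for sup in super_concepts(concept):
        attrs |= ATTRIBUTES.get(sup, set())
    return attrs

chain = super_concepts("dog")
attrs = inherited_attributes("dog")
```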
Based on lexico-syntactic patterns
Can find inclusion relation between concepts through a simple pattern matching on a set of documents
E.g. NP such as NP, NP,..., and NP
...works by authors such as Herrick, Goldsmith, and Shakespeare
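The "such as" pattern above can be approximated with a regular expression. This is a minimal sketch that treats each NP as a single word and requires at least two listed items joined by "and"; a real implementation would run on noun-phrase chunks from a parser. The function name `hyponym_pairs` is hypothetical.

```python
import re

# NP such as NP, NP, ..., and NP  (each NP simplified to one word)
PATTERN = re.compile(r"(\w+) such as ((?:\w+, )*\w+,? and \w+)")

def hyponym_pairs(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        # split the enumeration on commas and on the final "and"
        for item in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(2)):
            if item:
                pairs.append((item, hypernym))
    return pairs

pairs = hyponym_pairs("works by authors such as Herrick, Goldsmith, and Shakespeare")
print(pairs)
```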
Idea is to use a pattern learner to generate new patterns
Generated patterns then can be used in order to generate new information (new inclusion relations), as well as to assess the validity of extracted information
E.g. we can generate new patterns like
NP is NP
NP, NP,..., and other NP
NP, especially NP, NP,..., and NP
From the pattern NP such NP as NP, NP,..., and NP
1. Decide on a lexical relation, R, that is of interest, e.g. "group/member" or a hyponym relation like (author, Shakespeare).
2. Gather a list of terms/instances for which this relation holds.
3. Find places in the corpus where these terms/instances occur syntactically near one another and record the environment.
4. Find new patterns using this.
5. Once a new pattern has been positively identified, use it to gather more instances of the target relation and go to Step 2.
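The step of recording the environment of known pairs can be sketched as follows: for every seed pair, the text between the two terms is captured and counted as a candidate pattern. The sentences, the seed pairs, and the 30-character window are assumptions for illustration; deciding which candidate patterns are reliable is left to a human or a separate scoring step.

```python
import re
from collections import Counter

def learn_patterns(sentences, seeds, max_gap=30):
    """Count the text found between each (hypernym, hyponym) seed pair."""
    patterns = Counter()
    for sentence in sentences:
        for hypernym, hyponym in seeds:
            gap = r"(.{1,%d}?)" % max_gap     # shortest text between the two terms
            match = re.search(re.escape(hypernym) + gap + re.escape(hyponym), sentence)
            if match:
                patterns["NP" + match.group(1) + "NP"] += 1
    return patterns

sentences = [  # hypothetical corpus
    "authors such as Shakespeare wrote plays",
    "authors, especially Shakespeare, are widely studied",
    "cities such as Paris attract tourists",
]
seeds = [("authors", "Shakespeare"), ("cities", "Paris")]
patterns = learn_patterns(sentences, seeds)
print(patterns.most_common())
```

Patterns that recur across several seed pairs, such as "NP such as NP" here, are the ones worth promoting to new extraction rules.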
A concept may be represented by multi-word terms
A concept 'A' is a hyponym of a concept 'B' if
A has more tokens than B
all the tokens of B are present in A
both terms have the same head
E.g. the concepts 'private customer' and 'business customer' are hyponyms of the concept 'customer'
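The three conditions above translate directly into code. This sketch assumes whitespace tokenization and takes the last token of an English noun phrase as its head; the function name `is_hyponym` is hypothetical.

```python
def is_hyponym(a, b):
    """True if multi-word term `a` is a hyponym of term `b` by the head rule."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    return (
        len(tokens_a) > len(tokens_b)                  # A has more tokens than B
        and all(t in tokens_a for t in tokens_b)       # all tokens of B are in A
        and tokens_a[-1] == tokens_b[-1]               # same head (last token)
    )

private_ok = is_hyponym("private customer", "customer")
business_ok = is_hyponym("business customer", "customer")
service_ok = is_hyponym("customer service", "customer")   # head is "service"
```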
Relationships other than is-a relationships
E.g. Linguistic processing may find that the word 'cost' occurs frequently with the words 'hotel', 'guest house', 'youth hostel' in sentences like 'Costs at the youth hostel are $20 per night'
Relations (cost, hotel), (cost, guest house) and (cost, youth hostel) exist
Discovery algorithm finds support and confidence measures for these pairs as well as relationships at higher levels of abstraction such as accommodation and costs
Based on basic Association Rule Algorithm 
Basic Association Rule Algorithm
Given a set of transactions, T
Each transaction has a set of items, i1, i2, ..., in
Goal: Compute association rules of the form i1 → i2
Trick: Exploits the fact that many items appear together, so the occurrence of one implies the occurrence of another with high probability (confidence)
E.g. consider the transactions
(bread, butter, jam, chips)
(bread, butter, jam, ketchup)
(bread, butter, jam, chips)
E.g. bread → butter, jam
Support = n(X ∪ Y)/N, where N is the number of transactions
E.g. Support = 3/5
Confidence = n(X ∪ Y)/n(X)
E.g. Confidence = 3/4
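The support and confidence of the rule bread → butter, jam can be checked with a short script. Only three transactions are listed above, so the last two transactions below are hypothetical, chosen so that the totals match the stated values of 3/5 and 3/4.

```python
transactions = [
    {"bread", "butter", "jam", "chips"},
    {"bread", "butter", "jam", "ketchup"},
    {"bread", "butter", "jam", "chips"},
    {"bread", "milk"},        # hypothetical transaction
    {"chips", "ketchup"},     # hypothetical transaction
]

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    return count(x | y) / len(transactions)

def confidence(x, y):
    return count(x | y) / count(x)

x, y = {"bread"}, {"butter", "jam"}
rule_support = support(x, y)        # 3 of 5 transactions contain bread, butter and jam
rule_confidence = confidence(x, y)  # 3 of the 4 transactions with bread also have butter and jam
```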
1. Extend each transaction to include the ancestor of a particular item
E.g. include the word 'Accommodation' in the transactions containing word 'guest house'
2. Determine association rules of the form Xk→Yk where |Xk| = 1 and |Yk| = 1
3. Determine confidence for all rules that exceed user determined support
4. Prune the rules subsumed by ancestral rules
E.g. if we found 2 rules, (cost, accommodation) and (cost, hotel), we prune the latter rule (cost, hotel)
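The four steps above can be sketched on a toy example. This is a simplified illustration, not the published algorithm: it works on unordered pairs, uses support only (the confidence step is omitted), and the taxonomy, transactions, and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy: child -> parent
TAXONOMY = {"hotel": "accommodation", "guest house": "accommodation",
            "youth hostel": "accommodation"}

def ancestors(item):
    found = set()
    while item in TAXONOMY:
        item = TAXONOMY[item]
        found.add(item)
    return found

def extend(transaction):
    """Step 1: add the ancestors of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

def frequent_pairs(transactions, min_support):
    """Steps 2-3 (support only): pairs of single items above the threshold."""
    counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(extend(t)), 2):
            if a in ancestors(b) or b in ancestors(a):
                continue                      # skip trivial item/ancestor pairs
            counts[(a, b)] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def prune(pairs):
    """Step 4: drop a pair if the same pair with an ancestor was also found."""
    def subsumed(a, b):
        general = [(x, b) for x in ancestors(a)] + [(a, y) for y in ancestors(b)]
        return any(tuple(sorted(g)) in pairs for g in general)
    return {p: s for p, s in pairs.items() if not subsumed(*p)}

transactions = [{"cost", "hotel"}, {"cost", "guest house"},
                {"cost", "youth hostel"}, {"hotel", "pool"}]
rules = prune(frequent_pairs(transactions, min_support=0.25))
print(rules)
```

In this run (cost, hotel) is pruned because the more general pair (accommodation, cost) was also found, mirroring the example above.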
Uses hierarchical clustering.
Groups up the similar terms in a bottom up fashion
Uses cosine similarity function
The cosine measure or normalized correlation coefficient between two vectors x and y is given by cos(x, y) = (x · y) / (‖x‖ ‖y‖)
The similarity matrix is given by
cos(Hotel, Accommodation) = (7*11 + 4*2 + 6*5) / (√101 * √150) ≈ 0.93
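The cosine computation can be verified with a few lines of code. The two vectors are inferred from the products in the worked example, on the assumption that Hotel = (7, 4, 6) and Accommodation = (11, 2, 5); the meaning of the three dimensions is not given in the source.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

hotel = [7, 4, 6]              # assumed co-occurrence counts
accommodation = [11, 2, 5]     # assumed co-occurrence counts
similarity = cosine(hotel, accommodation)
print(round(similarity, 2))
```

In bottom-up clustering, the pair of terms (or clusters) with the highest similarity is merged first, and the process repeats on the reduced set.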
ISOLDE (Information System for Ontology Learning and Domain Exploration) produces a domain ontology from a base ontology
Uses the following
An unsupervised named entity recognition system
Web resources like DWDS, Wikipedia and Wiktionary.
1. Named-entity recognition (NER): uses a domain-specific corpus, a base ontology and a general-purpose NER system (SProUT, see Drozdzynski et al. 2004) to find instances for the classes in the base ontology.
2. Linguistic pattern analysis: extracts class candidates from the context of the instances found in step 1, using lexico-syntactic patterns.
3. Collecting web-based knowledge: collects information on and between the extracted class candidates from online resources and integrates it into a new or extended taxonomy/ontology.
After step 1 we get named entities such as Ballack and Munich from the soccer corpus.
In the second step we find class candidates for the named entities from the sentences in the corpus, and then filter the domain-specific candidates using the χ² method.
E.g. "Ballack, the best midfielder in the German national team" gives midfielder as the class candidate of Ballack.
In the third step we search the web for the class candidates; e.g. the Wikipedia definition of midfielder is:
A midfielder is a player whose position of play is midway between the attacking strikers and the defenders
human understandable vs machine understandable
learning higher degree relation
mapping to high level ontology
incremental ontology learning
multi agent learning
is ubiquitous in information systems 
improving the performance of information retrieval and reasoning
making data between different applications interoperable
ontology-type semantic descriptions of behaviors and services allow software agents in a multi-agent system to better coordinate themselves
 Elias Zavitsanos, Georgios Paliouras, George Vouros, Ontology Learning and Evaluation: A Survey, Technical Report, 2006.
 Nicolas Weber, Paul Buitelaar, Web-based Ontology Learning with ISOLDE, DFKI GmbH - Language Technology Lab, Saarbrücken, Germany, 2006.
 Alexander Maedche and Steffen Staab, Mining Ontologies from Text, 2000.
 Alexander Maedche, Viktor Pekar, and Steffen Staab, Ontology Learning Part One-On Discovering Taxonomic Relations from the Web, 2003.
 K. Frantzi, S. Ananiadou, and H. Mima. Automatic recognition of multi-word terms: The c-value/nc-value method. 3(2):115–130, 2000.
 G. Salton, A. Wong and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
 D.I. Moldovan and R.C. Girju. An interactive tool for the rapid development of knowledge bases. International Journal on Artificial Intelligence Tools (IJAIT), 10(1-2), 2001
 J.D. Cohen. Highlights: Language and domain independent automatic indexing terms for abstracting. Journal of the American Society for Information Science, 46(3):162–174, 1995.
 U. Heid. A linguistic bootstrapping approach to the extraction of term candidates from german text. Terminology, 5(2):161–181, 1998.
 L.M. Iwanska, N. Mata, and K. Kruger. Fully Automatic Acquisition of Taxonomic Knowledge from Large Corpora of Texts, pages 335–345. MIT/AAAI Press, 2000.
 J.U. Kietz, A. Maedche, and R. Volz. A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet. , Juan-Les-Pins, France, 2000.
 A. Maedche, V. Pekar, and S. Staab. Ontology learning part one - on discovering taxonomic relations from the web. In Proceedings of the Web Intelligence conference. Springer Verlag, 2002.
 Vincent Schickel-Zuber, Boi Faltings: Using hierarchical clustering for learning the ontologies used in recommendation systems. KDD 2007: 599-608
 A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In Proceedings of ECAI 2000, IOS Press, Amsterdam, 2000.