Knowledge Management. Speaker: Prof. Sudeshna Sarkar, Computer Science & Engineering Department, Indian Institute of Technology Kharagpur (email@example.com). Indo-German Workshop on Language Technologies, AU-KBC Research Centre, Chennai. Research Activities.
Department of Computer Science & Engineering
College of Engineering, Guindy
Chennai – 600025
Participant: Dr. T. V. Geetha
Other members: Dr. Ranjani Parthasarathi
Mr. S. Swamynathan
Possible areas of cooperation: Knowledge Management, Semantic Web Retrieval
Contacted: Dr. Steffen Staab
University of Karlsruhe
Institute of Applied Informatics and Formal Description Methods
Core Competencies: Knowledge Management
Core Competencies: WordNet, Databases
Normal sentences with WSD
Computer Sc. & Engg. Department
KOLKATA – 700 032, INDIA.
Professor Sivaji Bandyopadhyay
The Search and Information Extraction Lab (SIEL) focuses on building technologies for personalized, customizable, and highly relevant information retrieval and extraction systems. Vertical (domain-specific) search, combined with personalization, can drastically improve the quality of search results.
Current work includes building search engines that are vertical portals, i.e. specific to a chosen domain, aiming to produce high-quality results (with high recall and precision). It has become clear in the recent past that it is very difficult to build a generic search engine that serves all kinds of documents and domains and still produces high-quality results. Building a domain-specific search engine involves, among other tasks, representing the domain as an ontology or taxonomy, and "deeply understanding" documents in that domain using techniques such as natural language processing, semantic representation, and context modeling. Another area of immediate interest for English is document summarization. Work is also going on in text categorization and clustering.
The development makes use of the basic technology already developed for English, as well as for Indian languages, pertaining to word analyzers, sentential parsers, dictionaries, statistical techniques, keyword extraction, etc. These have been woven into a novel architecture for information extraction.
Knowledge-based approaches are being experimented with. The emphasis is on combining automatic processing with handcrafted knowledge. Applications that match information extracted from documents against given specifications are also being explored. For example, a given job requirement could be matched with resumes (say, after information has been extracted from them).
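The requirement-matching idea above can be sketched as a simple field comparison. This is a hypothetical illustration, not the lab's actual system: the field names and scoring function are invented for the example.

```python
# Hypothetical sketch: matching information extracted from resumes
# against a job specification. Field names and scoring are illustrative.

def match_score(spec, resume):
    """Return the fraction of specification fields satisfied by a resume.

    Both arguments are dicts mapping a field name to a set of values
    (acceptable values for the spec, held values for the resume).
    """
    if not spec:
        return 0.0
    satisfied = sum(1 for field, wanted in spec.items()
                    if wanted & resume.get(field, set()))
    return satisfied / len(spec)

job = {"skills": {"java", "nlp"}, "degree": {"btech", "mtech"}}
resume = {"skills": {"nlp", "python"}, "degree": {"mtech"}, "city": {"chennai"}}
print(match_score(job, resume))  # both spec fields satisfied -> 1.0
```

A real matcher would of course weight fields and handle near-matches, but the point is that once extraction has filled a structured template, matching reduces to comparing fields.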
A number of sponsored projects from industry and government are running at the Center in this area. A major knowledge management initiative in the areas of eGovernance is also being planned.
We are building search engines and named entity extraction tools specifically for the Indian context. As a test bed, we are building an experimental system codenamed PSearch (http://nlp.iiit.net/~psearch).
SIEL is also actively developing proper name gazetteers to cover the commonly used names of people, places, organizations, etc. in the Indian news media for various languages. These resources will help in information extraction, categorization, and machine translation activities.
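How a gazetteer supports extraction can be shown with a minimal longest-match lookup. This is a sketch under invented data, not the SIEL gazetteers themselves; the entries and tag names are assumptions for illustration.

```python
# Hypothetical sketch of gazetteer-based named-entity tagging.
# The gazetteer entries and tag labels below are invented examples.

GAZETTEER = {
    "hyderabad": "PLACE",
    "chennai": "PLACE",
    "the hindu": "ORGANIZATION",
}

def tag_entities(text, gazetteer=GAZETTEER, max_len=3):
    """Greedy longest-match lookup of gazetteer entries in tokenized text."""
    tokens = text.lower().split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first so "the hindu" beats "the".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in gazetteer:
                entities.append((span, gazetteer[span]))
                i += n
                break
        else:
            i += 1
    return entities

print(tag_entities("A report in The Hindu about floods in Chennai"))
# -> [('the hindu', 'ORGANIZATION'), ('chennai', 'PLACE')]
```

Plain dictionary lookup like this gives high precision for known names; ambiguous or unseen names are where the statistical extraction mentioned above takes over.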
For further information and details, please email firstname.lastname@example.org
Natural Language Processing Lab
Centre for Development of Advanced Computing
(Ministry of Communications & Information Technology)
C 56/1 Sector 62, Noida – 201 307, India
'GyanNidhi', which stands for 'Knowledge Resource', is a parallel corpus in 12 languages, a project sponsored by TDIL, DIT, MC&IT, Govt. of India.
What is it? The multilingual parallel text corpus contains the same text translated into more than one language.
What does GyanNidhi contain? The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitize 1 million pages altogether, with at least 50,000 pages in each Indian language and English.
Source for Parallel Corpus
Platform : Windows
Data Encoding : XML, UNICODE
Portability of Data : Data in XML format supports various platforms
Applications of GyanNidhi
Automatic dictionary extraction
Creation of translation memory
Example Based Machine Translation (EBMT)
Language research study and analysis
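The first application above, automatic dictionary extraction, can be sketched with a naive co-occurrence count over a sentence-aligned parallel corpus. This is a toy illustration with invented word pairs, not the GyanNidhi pipeline; real systems use statistical alignment models (e.g. the IBM models) rather than raw co-occurrence.

```python
# A minimal sketch of dictionary extraction from a sentence-aligned
# parallel corpus: for each source word, pick the target word it
# co-occurs with most often across aligned sentence pairs.

from collections import Counter, defaultdict

def extract_dictionary(pairs):
    cooc = defaultdict(Counter)
    for src_sent, tgt_sent in pairs:
        for s in src_sent.split():
            for t in tgt_sent.split():
                cooc[s][t] += 1
    # Most frequent co-occurring target word as the translation candidate.
    return {s: c.most_common(1)[0][0] for s, c in cooc.items()}

# Toy romanized example pairs (invented for illustration).
corpus = [
    ("water is life", "pani jeevan hai"),
    ("cold water", "thanda pani"),
]
print(extract_dictionary(corpus)["water"])  # -> 'pani'
```

Because "water" appears in both sentence pairs and only "pani" appears opposite it both times, co-occurrence alone recovers the pair; with more data the same signal drives translation memories and EBMT fragments.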
[Figure: most frequent consonants in Hindi and in Nepali.] The graph shows that the distribution is almost equal in Hindi and Nepali in the sample text. Results also show that these six consonants constitute more than 50% of consonant usage.
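The frequency analysis behind the figure can be reproduced with a simple codepoint count over Devanagari text (the script shared by Hindi and Nepali). The snippet and its sample string are an illustrative sketch, not the study's actual data.

```python
# Sketch of a consonant-frequency count over Devanagari text.
# Devanagari consonants occupy the codepoint range U+0915 (KA) .. U+0939 (HA).

from collections import Counter

def consonant_frequencies(text):
    consonants = [ch for ch in text if '\u0915' <= ch <= '\u0939']
    return Counter(consonants), len(consonants)

# Toy sample text, not the corpus used in the study.
counts, total = consonant_frequencies("नमस्ते दुनिया")
top6 = counts.most_common(6)
print(top6, total)
```

Running the same count over comparable Hindi and Nepali samples and comparing the top-six shares is all the figure's analysis requires.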
Information Retrieval / Information Extraction
Information extraction on partially structured, domain-dependent data has been carried out for the IB.
The sample data was from the crime domain.
This is a rule-based system and the rules are handcrafted.
There are various dictionaries for places, events, and basic verbs, which are used by the rules.
The dictionary can be dynamically updated.
The template is pre-defined.
Event : An exchange of fire took place between the police and CPML-PW extremists ( 2 ) at Basheera ( Kamarpally mandal/district Nizamabad/January 9 ) resulting in the death of a DCM of the outfit . The police also recovered wireless sets ( 2 ) , hand-grenade ( 1 ) and revolver ( 1 ) from the site .
Participant 1 = police
Participant 2 = CPML-PW_extremists
No of Participant 2 = ( 2 )
Material = revolver
Date = January 9 2002
Police Station = Nizamabad
Mandal = Kamarpally
District = Nizamabad
Event = exchange of fire
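The handcrafted-rule approach behind the filled template above can be sketched with a small event dictionary plus regular-expression rules. The rules below are invented for illustration and recover only part of the template; they are not the actual system's rules.

```python
# Hypothetical sketch of rule-based template filling in the spirit of the
# system above: a small event dictionary plus regex rules over the
# tokenized sample sentence. Rules are illustrative, not the real ones.

import re

EVENT_DICT = ["exchange of fire", "landmine blast", "encounter"]

def extract_event(text):
    template = {}
    # Dictionary lookup fills the Event slot.
    for ev in EVENT_DICT:
        if ev in text:
            template["Event"] = ev
            break
    # Handcrafted pattern for the participants of an exchange of fire.
    m = re.search(r"between the (\w[\w-]*) and ([\w-]+ [\w-]*extremists)", text)
    if m:
        template["Participant 1"], template["Participant 2"] = m.group(1), m.group(2)
    # Handcrafted pattern for the parenthesized place/mandal/district/date.
    m = re.search(r"at (\w+) \( (\w+) mandal/district (\w+)/(\w+ \d+) \)", text)
    if m:
        template["Place"] = m.group(1)
        template["Mandal"] = m.group(2)
        template["District"] = m.group(3)
        template["Date"] = m.group(4)
    return template

sample = ("An exchange of fire took place between the police and CPML-PW "
          "extremists ( 2 ) at Basheera ( Kamarpally mandal/district "
          "Nizamabad/January 9 ) resulting in the death of a DCM of the outfit .")
print(extract_event(sample))
```

Brittleness to surface variation is exactly why such systems pair the handcrafted rules with dynamically updatable dictionaries, as noted above.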
Information extraction on unstructured, domain-dependent data has been carried out on online matrimonial advertisements.
The sample data was taken from The Hindu online matrimonial section.
This is a rule-based system and the rules are handcrafted. Linguistic rules as well as heuristic rules play a major role.
There are various dictionaries for caste, religion, language, etc., which are used by the system.
The template to be filled up is static and pre-defined.