Knowledge Management

Speaker: Prof. Sudeshna Sarkar
Computer Science & Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur
sudeshna@cse.iitkgp.ernet.in
Indo-German Workshop on Language Technologies, AU-KBC Research Centre, Chennai

Research Activities
Department of Computer Science & Engineering, College of Engineering, Guindy, Chennai – 600025
Participant: Dr. T. V. Geetha
Other members: Dr. Ranjani Parthasarathi, Ms. D. Manjula, Mr. S. Swamynathan

Knowledge Management, Semantic Web Retrieval - Possible Areas of Cooperation
• Semantic-based approaches to information retrieval and extraction
• Cognitive approaches to semantic search engines with user profiles and user perspective
• Multilingual semantic search engines: use of an intermediate representation like UNL
• Goal-based information extraction from semi-structured documents: use of ontology
• Information extraction and its visualization: development of timeline visualization of documents
Contacted: Dr. Steffen Staab, University of Karlsruhe, Institute of Applied Informatics and Formal Description Methods
Core Competencies: Knowledge Management

Knowledge Management, Web Services - Work Done in the Area
• Design and implementation of reactive web services using active databases
• Design and implementation of a rule engine
• Design and implementation of complex rules to tackle client-side and server-side semantics of the rule engine
• Development of intelligent web services for e-commerce
• Extension to tackle multiple and cooperative web service environments

Knowledge Management, Web Services - Possible Areas of Cooperation
• Formalization and description of web service semantics using the Semantic Web
• Introspection between web services
• Personalization of web services
• Rating of web services
Contacted: Dr. Steffen Staab, University of Karlsruhe, Institute of Applied Informatics and Formal Description Methods
Core Competencies: Knowledge Management

Natural Language Processing, Knowledge Representation - Possible Areas of Cooperation
• Knowledge representation architecture based on Indian logic
• Argumentative reasoning models based on Indian logic
• Knowledge representation and interpretation strategies based on Indian sastras like Mimamsa
• Building domain ontologies based on the above architecture
• Knowledge management based on the above approaches
Contacted: Prof. Dr. Gerd Unruh, University of Applied Sciences Furtwangen, Department of Informatics
Core Competencies: WordNet, Databases

Utkal University
We work on: Image Processing, Speech Processing, Knowledge Management

Knowledge Management
• Machine Translation: normal sentences with WSD (word sense disambiguation)
• Lexical Resources
• (A) e-Dictionary (Oriya-English-Hindi): IPR obtained; tested by SQTC, ETDC Bangalore; 27,000 Oriya, 30,000 English and 20,000 Hindi words.
• (B) Oriya WordNet with Morphological Analyzer: IPR obtained; tested by SQTC, ETDC Bangalore; lexicon size 1,000.
• (C) Ori-Spell (Oriya Spell Checker): IPR obtained; tested by SQTC, ETDC Bangalore; 1,70,000 words (root and derived).
• (D) Trilingual Word Processor (Hindi-English-Oriya): integrated with Spell Checker and Grammar Checker.

KM (Sanskrit)
• San-Net (Sanskrit WordNet)
• Developed using Navya-Nyaya philosophy and Paninian grammar
• Besides synonym, antonym, hypernym, hyponym, holonym and meronym, further relations have been introduced in San-Net: analogy, etymology, definition, nominal verb, nominal qualifier, verbal qualifier and verbal noun.
• San-Net can be used for understanding, translating, summarizing and generating Indian languages.
• A standard Knowledge Base (KB) has been developed for analyzing syntactic, semantic and pragmatic aspects of any lexicon.

Present Interest
• Sanskrit WordNet based Machine Translation System
• Morphological Analyser for Sanskrit
• Navya Nyaya Philosophy (NNP) to be used extensively for it.
• This should help achieve better WSD, as NNP provides an effective conceptual analysis capability.

Natural Language Processing Group
Computer Sc. & Engg. Department, Jadavpur University, Kolkata – 700 032, India
Professor Sivaji Bandyopadhyay
sivaji_ju@vsnl.com

Cross-lingual Information Management
• Multilingual and cross-lingual IR
• A Cross Language Database (CLDB) system in Bengali and Hindi has been developed
• The natural language query is analyzed using a Template Grammar and Knowledge Bases to produce the corresponding SQL statement
• Cooperative response in the query language
• Anaphora / coreference in CLDB studied
• Database updates and elliptical queries also supported
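The step from a natural language query to SQL can be illustrated with a minimal sketch. It is not the CLDB implementation: the templates are written for English rather than Bengali or Hindi, and the template set, the `KNOWLEDGE_BASE` mapping, the table and column names and the `to_sql` function are all assumptions made for illustration.

```python
import re

# Hypothetical template grammar: each entry pairs a query pattern
# with an SQL skeleton whose slots are filled from the match.
TEMPLATES = [
    (re.compile(r"^list (?P<column>\w+) of (?P<table>\w+) "
                r"where (?P<field>\w+) is (?P<value>\w+)$"),
     "SELECT {column} FROM {table} WHERE {field} = ?"),
    (re.compile(r"^list (?P<column>\w+) of (?P<table>\w+)$"),
     "SELECT {column} FROM {table}"),
]

# Hypothetical knowledge base mapping query words to schema names.
KNOWLEDGE_BASE = {"names": "name", "students": "student", "city": "city"}


def to_sql(query):
    """Return (sql, params) for the first matching template, or None."""
    text = query.lower().strip().rstrip("?.")
    for pattern, skeleton in TEMPLATES:
        match = pattern.match(text)
        if match is None:
            continue
        slots = match.groupdict()
        # The literal value is passed as a bound parameter, not inlined.
        value = slots.pop("value", None)
        slots = {k: KNOWLEDGE_BASE.get(v, v) for k, v in slots.items()}
        params = (value,) if value is not None else ()
        return skeleton.format(**slots), params
    return None
```

For instance, `to_sql("List names of students where city is kolkata")` yields the statement `SELECT name FROM student WHERE city = ?` together with the parameter tuple `("kolkata",)`. A query that matches no template returns `None`, which is where a cooperative response could be generated.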

Cross-lingual Information Management
• Open-domain question answering
• Work is being done for English
• Currently building a set of question templates (Qtargets) and the corresponding answer patterns with relative weights
• The input question is analyzed to produce the corresponding question template
• The appropriate answer pattern is retrieved
• The answer is generated using the input document and the synthesis rules of the language
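The pipeline above (question template, weighted answer patterns, answer from the document) can be sketched roughly as follows. The single `BIRTH_YEAR` Qtarget, its patterns, the weights and the `answer` function are assumptions made for illustration; the sketch stops at extracting the answer string and does not model the synthesis rules.

```python
import re

# Hypothetical Qtarget table: one question template and its answer
# patterns, each with a relative weight.
QTARGETS = {
    "BIRTH_YEAR": {
        "question": re.compile(r"^when was (?P<entity>.+) born$"),
        "answers": [
            (1.0, r"{entity} was born in (?P<answer>\d{{4}})"),
            (0.6, r"{entity} \((?P<answer>\d{{4}})"),
        ],
    },
}


def answer(question, document):
    """Return (weight, answer, qtarget) for the best match, or None."""
    text = question.lower().strip().rstrip("?")
    best = None
    for name, target in QTARGETS.items():
        match = target["question"].match(text)
        if not match:
            continue
        # Escape the entity so it is matched literally in the document.
        entity = re.escape(match.group("entity"))
        for weight, template in target["answers"]:
            pattern = re.compile(template.format(entity=entity),
                                 re.IGNORECASE)
            found = pattern.search(document)
            if found and (best is None or weight > best[0]):
                best = (weight, found.group("answer"), name)
    return best
```

Given the document "Rabindranath Tagore was born in 1861 in Calcutta." and the question "When was Rabindranath Tagore born?", the higher-weighted pattern fires and the extracted answer is "1861".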

Search and Information Extraction Lab, IIIT Hyderabad
The Search and Information Extraction Lab (SIEL) focuses on building technologies for personalized, customizable and highly relevant information retrieval and extraction systems. Vertical (domain-specific) search, when combined with personalization, is expected to substantially improve the quality of search results.

Current work includes building search engines that are vertical portals in nature, that is, specific to a chosen domain and aimed at producing high-quality results (with high recall and precision). It has been realized in the recent past that it is very difficult to build a generic search engine that can be used for all kinds of documents and domains and still produce high-quality results. The tasks involved in building domain-specific search engines include representing the domain in the form of an ontology or taxonomy, and the ability to “deeply understand” the documents belonging to that domain using techniques like natural language processing, semantic representation and context modeling. Another area of immediate interest for English is summarization of documents. Work is also under way on text categorization and clustering.

The development makes use of the basic technology already developed for English as well as for Indian languages: word analyzers, sentential parsers, dictionaries, statistical techniques, keyword extraction, etc. These have been woven into a novel architecture for information extraction. Knowledge-based approaches are being experimented with. The emphasis is on combining automatic processing with handcrafting of knowledge. Applications that match information extracted from documents against given specifications are being looked at. For example, a given job requirement could be matched with resumes (say, after information is extracted from them). A number of sponsored projects from industry and government are running at the Center in this area. A major knowledge management initiative in the area of e-governance is also being planned.
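The job-requirement example can be made concrete with a small sketch. It assumes the extraction step has already produced structured records; the field names (`skills`, `years`, `min_years`, `name`) and the scoring rule are illustrative assumptions, not the lab's method.

```python
def match_score(requirement, resume):
    """Fraction of required skills the resume covers (0.0 if too junior)."""
    required = {s.lower() for s in requirement["skills"]}
    offered = {s.lower() for s in resume["skills"]}
    if not required:
        return 0.0
    coverage = len(required & offered) / len(required)
    meets_experience = resume["years"] >= requirement["min_years"]
    return coverage if meets_experience else 0.0


def rank(requirement, resumes):
    """Return (score, name) pairs with a non-zero score, best first."""
    scored = [(match_score(requirement, r), r["name"]) for r in resumes]
    return sorted((s for s in scored if s[0] > 0), reverse=True)
```

A resume that covers every required skill and meets the experience threshold scores 1.0; one that covers half the skills scores 0.5; one below the experience threshold is dropped from the ranking.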

We are building search engines and named entity extraction tools specifically for the Indian context. As a test bed, we are building an experimental system codenamed PSearch (http://nlp.iiit.net/~psearch). SIEL is also actively developing proper name gazetteers to cover the names of people, places, organizations, etc. commonly used in the Indian news media, for various languages. These resources will help in information extraction, categorization, and machine translation activities. For further information and details, please email vv@iiit.net.
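One common way to use a proper name gazetteer is longest-match lookup over the token stream. The sketch below shows that general technique under stated assumptions: the gazetteer entries, the labels and the `tag_entities` function are hypothetical, and nothing here describes how PSearch or the SIEL tools actually work.

```python
# Hypothetical gazetteer: lower-cased token tuples mapped to a label.
GAZETTEER = {
    ("new", "delhi"): "PLACE",
    ("delhi",): "PLACE",
    ("jadavpur", "university"): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(text):
    """Return (surface form, label) pairs, preferring the longest match."""
    tokens = text.split()
    keys = [t.strip(".,;:").lower() for t in tokens]
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(keys[i:i + n]))
            if label:
                surface = " ".join(tokens[i:i + n]).strip(".,;:")
                found.append((surface, label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found
```

Trying the longest span first means "New Delhi" is tagged as one place rather than leaving "New" untagged and matching "Delhi" alone.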

Efforts in Language & Speech Technology
Natural Language Processing Lab, Centre for Development of Advanced Computing (Ministry of Communications & Information Technology)
‘Anusandhan Bhawan’, C 56/1 Sector 62, Noida – 201 307, India
karunesharora@cdacnoida.com

Gyan Nidhi: Parallel Corpus
‘GyanNidhi’, which stands for ‘Knowledge Resource’, is a parallel corpus in 12 languages (English and 11 Indian languages), a project sponsored by TDIL, DIT, MC&IT, Govt. of India.

Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
What is it? A multilingual parallel text corpus contains the same text translated into more than one language.
What does Gyan Nidhi contain? The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitize 1 million pages altogether, with at least 50,000 pages in each Indian language and English.
Sources for the parallel corpus:
• National Book Trust India
• Sahitya Akademi
• Navjivan Publishing House
• Publications Division
• SABDA, Pondicherry

GyanNidhi Block Diagram

Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
Platform: Windows
Data Encoding: XML, Unicode
Portability of Data: data in XML format supports various platforms
Applications of GyanNidhi:
• Automatic dictionary extraction
• Creation of translation memory
• Example Based Machine Translation (EBMT)
• Language research study and analysis
• Language modeling

Tools: Prabandhika: Corpus Manager
• Categorisation of corpus data into various user-defined domains
• Addition / deletion / modification of any Indian language data files in HTML / RTF / TXT / XML format
• Selection of languages for viewing the parallel corpus, with data aligned up to paragraph level
• Automatic selection and viewing of parallel paragraphs in multiple languages
• Abstract and metadata
• Printing and saving parallel data in Unicode format
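Selecting languages and viewing paragraph-aligned text can be sketched against a small XML sample. The element and attribute names (`corpus`, `para`, `text`, `lang`, `id`), the sample sentences and the `aligned_paragraphs` function are invented for illustration; the actual GyanNidhi schema is not described in these slides.

```python
import xml.etree.ElementTree as ET

# Made-up sample in a hypothetical schema: one <para> per aligned
# paragraph, one <text> child per language.
SAMPLE = """<corpus>
  <para id="1">
    <text lang="en">Knowledge is a resource.</text>
    <text lang="hi">ज्ञान एक संसाधन है।</text>
  </para>
  <para id="2">
    <text lang="en">Books preserve it.</text>
  </para>
</corpus>"""


def aligned_paragraphs(xml_text, languages):
    """Return (para id, texts) for paragraphs present in every language."""
    root = ET.fromstring(xml_text)
    rows = []
    for para in root.findall("para"):
        by_lang = {t.get("lang"): (t.text or "").strip()
                   for t in para.findall("text")}
        if all(lang in by_lang for lang in languages):
            rows.append((para.get("id"),
                         tuple(by_lang[lang] for lang in languages)))
    return rows
```

Calling `aligned_paragraphs(SAMPLE, ["en", "hi"])` returns only paragraph 1, since paragraph 2 has no Hindi text in the sample; asking for `["en"]` alone returns both paragraphs.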

Sample Screen Shot : Prabandhika

Tools: Vishleshika: Statistical Text Analyzer
• Vishleshika is a tool for statistical text analysis of Hindi, extensible to text in other Indian languages
• It examines input text and generates various statistics, e.g. sentence statistics, word statistics and character statistics
• The analyzer presents its analysis in textual as well as graphical form
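The three kinds of statistics can be approximated in a few lines. This is a minimal sketch, not Vishleshika itself: the `text_statistics` function and its output keys are assumptions, words are split on whitespace only, and characters are counted per Unicode code point, so Devanagari vowel signs are counted separately from their consonants.

```python
import re
from collections import Counter


def text_statistics(text):
    """Compute simple sentence, word and character statistics."""
    # Split on the Devanagari danda as well as Latin sentence punctuation.
    sentences = [s.strip() for s in re.split(r"[।.!?]+", text) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    chars = Counter(ch for w in words for ch in w)
    average = len(words) / len(sentences) if sentences else 0.0
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_words_per_sentence": average,
        "word_frequency": Counter(words),
        "char_frequency": chars,
    }
```

The `char_frequency` counter is the kind of data a character-distribution graph would be drawn from; `chars.most_common(6)` gives the six most frequent characters in the input.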

Sample output: Character statistics
[Graph: character distribution in the Hindi and Nepali sample text]
The graph shows that the distribution is almost equal in Hindi and Nepali in the sample text.
[Panels: most frequent consonants in Hindi; most frequent consonants in Nepali]
The results also show that these six consonants constitute more than 50% of consonant usage.

Vishleshika: Word and sentence Statistics

AU-KBC Research Centre
Knowledge Management: Information Retrieval / Information Extraction

IE in Partially Structured Data
Information extraction on partially structured, domain-dependent data is done for IB. The sample data was from the criminal domain. This is a rule-based system with hand-crafted rules. The rules use various dictionaries for places, events and basic verbs. The dictionaries can be updated dynamically. The template is pre-defined.

Example:
Event: An exchange of fire took place between the police and CPML-PW extremists ( 2 ) at Basheera ( Kamarpally mandal/district Nizamabad/January 9 ) resulting in the death of a DCM of the outfit . The police also recovered wireless sets ( 2 ) , hand-grenade ( 1 ) and revolver ( 1 ) from the site .
Participant 1 = police
Participant 2 = CPML-PW_extremists
No of Participant 2 = ( 2 )
Material = revolver
Date = January 9 2002
Police Station = Nizamabad
Mandal = Kamarpally
District = Nizamabad
Event = exchange of fire
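How hand-crafted rules and dictionaries might fill such a template can be sketched on the example text. The rules, the `EVENTS` and `MATERIALS` dictionaries and the `extract` function are assumptions written for this one sentence shape, not the system's actual rule set. The sketch also differs from the output above: it lists every recovered material rather than one, and it cannot supply the year in the date, which does not occur in the text.

```python
import re

# Hypothetical dictionaries standing in for the event and material lists.
EVENTS = ["exchange of fire", "ambush", "raid"]
MATERIALS = ["wireless sets", "hand-grenade", "revolver"]

SAMPLE = ("An exchange of fire took place between the police and CPML-PW "
          "extremists ( 2 ) at Basheera ( Kamarpally mandal/district "
          "Nizamabad/January 9 ) resulting in the death of a DCM of the "
          "outfit . The police also recovered wireless sets ( 2 ) , "
          "hand-grenade ( 1 ) and revolver ( 1 ) from the site .")


def extract(text):
    """Fill a fixed template using dictionary lookups and two rules."""
    record = {}
    lowered = text.lower()
    record["Event"] = next((e for e in EVENTS if e in lowered), None)
    # Rule 1: "between the X and Y Z" names the two participants.
    m = re.search(r"between the (\w+) and ([\w-]+ \w+)", text)
    if m:
        record["Participant 1"] = m.group(1)
        record["Participant 2"] = m.group(2).replace(" ", "_")
    # Rule 2: "( <mandal> mandal/district <district>/<date> )".
    m = re.search(r"\(\s*(\w+) mandal/district (\w+)/(\w+ \d+)\s*\)", text)
    if m:
        record["Mandal"], record["District"], record["Date"] = m.groups()
    record["Material"] = [x for x in MATERIALS if x in lowered]
    return record
```

Running `extract(SAMPLE)` fills Event, both participants, Mandal, District, Date and the material list from the sentence; fields the rules do not cover, such as Police Station, are simply absent from the record.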

IE in Unstructured Data
Information extraction on unstructured, domain-dependent data is done in the online matrimonial domain. The sample data was taken from The Hindu online matrimonials. This is a rule-based system with hand-crafted rules. Linguistic rules as well as heuristic rules play a major role. The system uses various dictionaries for caste, religion, language, etc. The template to be filled is static and pre-defined.
