Design of an Electronic Sanskrit Reader

SALA XXI, Konstanz, October 2001 • Gérard Huet • INRIA

History • 1994 - personal lexicon in TeX • 1996 - available on Internet • 1998 - 10000 entries - invariants design • 1999 - reverse engineering • 2000 - Hypertext version on the Web, sandhi processor, grammatical engine • 2001 - Segmenter, tagger

Processing chains • [Diagram; labels only] Data base source • Abstract structure • Index (a h ...) • CGI bin (OCaml) • Web site (HTML) on the Internet • devnag, tex, dvi, ps, pdf

Close-up view of functionalities • [Diagram; labels only] Dico DB • Grind • Entries (a h ...) • Index • Index Engine • Flex • Flexed Index • Grammar Engine • CGI • Web site (HTML) • Dico.ps • Dico.pdf

Tools used • Printable document: Knuth TeX, Metafont, LaTeX2e; Velthuis devnag font & ligature comp; Adobe PostScript, PDF, Acrobat • Hypertext document: W3C HTTP, HTML, CSS; Unicode UTF-8; Chris Fynn Indic Times Font • Processing & Search: INRIA Objective Caml

Each entry is a (typed) tree • type entry = [ .... ] • [Diagram; labels only] Entry | Crossref; syntax, usage, opt cogs • N.B. Syntax is really morphology; usage is part-of-speech roles plus meanings

Grammatical information • type gender = [ Mas | Neu | Fem | Any ]; • type number = [ Singular | Dual | Plural ]; • type case = [ Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc ];
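
The declarations above appear to use the revised (Camlp4) syntax. As a minimal sketch, here are the same three types in standard OCaml syntax, together with an enumeration of the (number, case) slots of a declension table. The names `numbers`, `cases` and `slots` are illustrative assumptions, not taken from the original system.

```ocaml
(* The three grammatical types, transcribed into standard OCaml syntax. *)
type gender = Mas | Neu | Fem | Any
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

(* Hypothetical helper lists, in the order the slide gives the constructors. *)
let numbers = [ Singular; Dual; Plural ]
let cases = [ Nom; Acc; Ins; Dat; Abl; Gen; Loc; Voc ]

(* Every (number, case) slot of a declension table, grouped by number. *)
let slots : (number * case) list =
  List.concat (List.map (fun n -> List.map (fun c -> (n, c)) cases) numbers)

let () = Printf.printf "%d slots per table\n" (List.length slots)
```

With 3 numbers and 8 cases, `slots` has 24 elements, one per ending in a declension table.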

Governance templates (grammatical valence) • \word{ga.n} ... \sem{imputer qqc. <acc.> à qqn. <loc.>} (to impute sth. <acc.> to s.o. <loc.>) • \ca{chandayati} ... \sem{gratifier qqn. <acc.> de <i.>} (to gratify s.o. <acc.> with <ins.>) • \word{niyuj} ... \sem{confier qqc. <acc.> à qqn. <loc.>} (to entrust sth. <acc.> to s.o. <loc.>) • \root{krii} ... \sem{acheter (qqc. <acc.> à qqn. <g. abl.>)} (to buy sth. <acc.> from s.o. <gen. or abl.>) • Other specific notations for synonyms, antonyms, cross-references.

Key points • Each entry is a structured piece of data on which one may compute • Consistency and completeness checks: • every reference is defined exactly once; there is no dangling reference • etymological origins, when known, are systematically listed • lexicographic ordering at every level is mechanically enforced • Specialised views are easily extracted • Search engines are easily programmable • Maintenance and transfer to new technologies are ensured • Independence from input format, diacritics conventions, etc. • The technology scales to much larger corpora
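
The consistency checks listed above can be sketched as ordinary functions over the entry list. This is a minimal illustration under stated assumptions: the record type `entry` with fields `headword` and `refs` is a hypothetical simplification, and plain string comparison stands in for Sanskrit alphabetical order.

```ocaml
(* Hypothetical, simplified entry: a headword and the headwords it refers to. *)
type entry = { headword : string; refs : string list }

(* Headwords defined more than once (each reported once). *)
let duplicates entries =
  let sorted = List.sort compare (List.map (fun e -> e.headword) entries) in
  let rec go acc = function
    | a :: (b :: _ as rest) ->
        go (if a = b && not (List.mem a acc) then a :: acc else acc) rest
    | _ -> List.rev acc
  in
  go [] sorted

(* Dangling references, as (referring headword, missing target) pairs. *)
let dangling entries =
  let defined = List.map (fun e -> e.headword) entries in
  let missing_in e =
    e.refs
    |> List.filter (fun r -> not (List.mem r defined))
    |> List.map (fun r -> (e.headword, r))
  in
  List.concat (List.map missing_in entries)

(* True when headwords appear in strictly increasing order.
   Assumption: byte-wise string order, not the Sanskrit alphabet. *)
let rec is_sorted = function
  | a :: (b :: _ as rest) -> compare a.headword b.headword < 0 && is_sorted rest
  | _ -> true

(* Toy data, invented for the example. *)
let toy =
  [ { headword = "deva"; refs = [ "div" ] };
    { headword = "div"; refs = [] };
    { headword = "gopa"; refs = [ "go"; "paa" ] } ]
```

On the toy data, `dangling toy` reports the two references from `gopa` that have no entry, while `duplicates toy` is empty and `is_sorted toy` holds.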

Generic reuse of the technology • The structure of the dictionary separates, as far as possible, three layers: • Sanskrit • French • generic dictionary structure • Thus the French meanings, at the leaves, could be replaced by e.g. English definitions or glosses.
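
One way to picture this separation is a dictionary type that is parametric in the language of its glosses. The sketch below is an assumption for illustration only: the types `entry`, `french`, `english` and the function `relabel` are hypothetical, and the one-word translation table is a toy.

```ocaml
(* Generic layer: an entry is parametric in the type of its glosses. *)
type 'gloss entry = { stem : string; meanings : 'gloss list }

(* Gloss layers, kept distinct by the type checker. *)
type french = Fr of string
type english = En of string

(* Replace the glosses at the leaves; the Sanskrit layer is untouched. *)
let relabel (f : 'a -> 'b) (e : 'a entry) : 'b entry =
  { stem = e.stem; meanings = List.map f e.meanings }

let krii_fr : french entry = { stem = "krii"; meanings = [ Fr "acheter" ] }

(* Toy translation table; real glosses would be written by a lexicographer. *)
let to_english (Fr s) =
  match s with
  | "acheter" -> En "to buy"
  | other -> En ("[untranslated] " ^ other)

let krii_en : english entry = relabel to_english krii_fr
```

Here `krii_en` keeps the stem `krii` and carries the gloss `En "to buy"`.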

Morphological analysis, sandhi • Sanskrit is pronounced as written • … and thus is written as pronounced • Phonetic smoothing at word and morpheme junctions (sandhi) is therefore rendered in writing • The sentence is formed of words joined by external sandhi • Compound words are also formed by external sandhi • Whereas flexion, prefixing and suffixing use internal sandhi • External sandhi is local; internal sandhi is less so • Sandhi analysis is non-deterministic and sometimes involves semantics

Grammatical engine • In Sanskrit, declension is determined by stem and gender • Sanskrit is very regular, since the classical language was frozen by Pânini (4th century BC), who invented context-free notation • But it spans about 35 centuries, and thus there are many exceptions • Substantive (adjective, pronoun, numeral) declension may be arranged in 84 tables of 24 endings (3 numbers * 8 cases) • Internal sandhi is then applied to join a stem and an ending • Two applications: • online declension of words given with gender (cgi-bin) • offline computation of flexed forms (2000 pages of double-column fine print)

Interactions lexicon-grammar • When the index engine is given a string that is not a stem defined in one of the lexicon entries, it looks the string up in the persistent database of flexed forms and, if found there, proposes the corresponding lexicon entry or entries • From within the lexicon, the grammatical engine may be called online as a CGI that lists the declensions of a given stem. It is directly accessible from the gender declarations, because of an important scoping invariant: • every substantive stem is within the scope of one or more genders • every gender declaration is within the scope of a unique substantive stem
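
The two-step lookup performed by the index engine can be sketched as follows. This is a minimal illustration, assuming in-memory lists in place of the lexicon and of the persistent flexed-forms database; the names `lexicon_stems`, `flexed_forms`, `answer` and `lookup` are hypothetical.

```ocaml
(* Toy stand-ins for the lexicon index and the flexed-forms database. *)
let lexicon_stems = [ "kusuma"; "gopa"; "k.r.s.na" ]

let flexed_forms =
  [ ("kusumam", [ "kusuma" ]);
    ("gopiibhyas", [ "gopa" ]);
    ("k.r.s.nas", [ "k.r.s.na" ]) ]

type answer =
  | Stem of string                  (* the string is itself a lexicon stem *)
  | Via_flexed_form of string list  (* entries whose flexed forms include it *)
  | Unknown

(* First try the lexicon, then fall back to the flexed forms. *)
let lookup s =
  if List.mem s lexicon_stems then Stem s
  else
    match List.assoc_opt s flexed_forms with
    | Some stems -> Via_flexed_form stems
    | None -> Unknown
```

For example, `lookup "gopa"` answers with the stem itself, while `lookup "gopiibhyas"` falls back to the flexed forms and proposes the entry `gopa`.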

Inverting external sandhi • External sandhi rules are of a finite-state nature • The index of the flexed-forms lexicon is a tree, which may be seen as the graph of a deterministic finite automaton recognizing its words • This tree may be uniformly decorated with the relevant sandhi rules, seen as non-deterministic choice points • The resulting structure may be evaluated as a finite-state transducer graph that segments an input text into words joined by sandhi
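
The view of a lexicon index as a deterministic automaton can be illustrated with a small trie. This is a sketch under assumptions, not the original data structure: the type `trie` and the functions `insert`, `of_words` and `accepts` are hypothetical names, and the sandhi decoration is left out.

```ocaml
(* A trie node: whether a word ends here, plus labelled arcs to children. *)
type trie = Node of bool * (char * trie) list

let empty = Node (false, [])

(* Insert the suffix of [word] starting at index [i]. *)
let rec insert word i (Node (final, arcs)) =
  if i = String.length word then Node (true, arcs)
  else
    let c = word.[i] in
    let child =
      match List.assoc_opt c arcs with Some t -> t | None -> empty
    in
    Node (final, (c, insert word (i + 1) child) :: List.remove_assoc c arcs)

let of_words ws = List.fold_left (fun t w -> insert w 0 t) empty ws

(* Run the automaton: follow one arc per character, accept on a final node. *)
let rec accepts (Node (final, arcs)) word i =
  if i = String.length word then final
  else
    match List.assoc_opt word.[i] arcs with
    | Some child -> accepts child word (i + 1)
    | None -> false

(* Words in Velthuis transliteration, taken from the segmentation examples. *)
let index = of_words [ "om"; "namas"; "\"sivaaya" ]
```

Each node has at most one arc per character, so recognition is deterministic: `accepts index "namas" 0` holds, while the proper prefix `"nama"` is rejected.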

Examples of segmentation • Chunk: o.mnama.h"sivaaya • may be segmented as: • [ om with sandhi m|n -> .mn ] • [ namas with sandhi s|"s -> .h"s ] • [ "sivaaya with no sandhi ] • Chunk: kusuma.mgopiibhya.hk.r.s.nodadati • may be segmented as: • [ kusumam with sandhi m|g -> .mg ] • [ gopiibhyas with sandhi s|k -> .hk ] • [ k.r.s.nas with sandhi as|d -> od ] • [ dadati with no sandhi ]
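
The two examples can be reproduced with a naive backtracking segmenter. This is only a sketch: it uses a flat word list and exhaustive search in place of the finite-state transducer described earlier, it knows only the five sandhi rules shown in the examples, and it allows "no sandhi" only at the end of a chunk. All names (`rules`, `lexicon`, `segment`, ...) are assumptions made for the illustration.

```ocaml
(* External sandhi rules from the examples:
   (end of left word, start of right word, written result). *)
let rules =
  [ ("m", "n", ".mn"); ("s", "\"s", ".h\"s"); ("m", "g", ".mg");
    ("s", "k", ".hk"); ("as", "d", "od") ]

(* Toy list of flexed forms, in Velthuis transliteration. *)
let lexicon =
  [ "om"; "namas"; "\"sivaaya"; "kusumam"; "gopiibhyas"; "k.r.s.nas"; "dadati" ]

let starts_with prefix s =
  String.length prefix <= String.length s
  && String.sub s 0 (String.length prefix) = prefix

let ends_with suffix s =
  let ls = String.length suffix and l = String.length s in
  ls <= l && String.sub s (l - ls) ls = suffix

let drop n s = String.sub s n (String.length s - n)

type segment = { word : string; sandhi : (string * string * string) option }

(* All segmentations of [text]; each one is a list of segments. *)
let rec segment text : segment list list =
  List.concat
    (List.map (fun w -> last_word w text @ joined_word w text) lexicon)

(* [w] is the final word of the chunk: no sandhi applies. *)
and last_word w text =
  if w = text then [ [ { word = w; sandhi = None } ] ] else []

(* [w] is followed by another word, joined by one of the rules. *)
and joined_word w text =
  List.concat
    (List.map
       (fun ((u, v, out) as rule) ->
         if not (ends_with u w) then []
         else
           let stem = String.sub w 0 (String.length w - String.length u) in
           let surface = stem ^ out in
           if not (starts_with surface text) then []
           else
             (* Undo the rule: put [v] back at the head of what remains. *)
             let rest = v ^ drop (String.length surface) text in
             List.map
               (fun tail -> { word = w; sandhi = Some rule } :: tail)
               (segment rest))
       rules)
```

With this toy lexicon, `segment "o.mnama.h\"sivaaya"` yields a single segmentation into `om`, `namas` and `"sivaaya`, and the second chunk splits into `kusumam`, `gopiibhyas`, `k.r.s.nas` and `dadati`. A larger lexicon would generally produce several candidate segmentations, which is where the non-determinism mentioned earlier shows up.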

From segments to tagged lemmas • Chunk: kusuma.mgopiibhya.hk.r.s.nodadati • may be lemmatized with tags as: • [ kusumam < acc. sg. n. of kusuma • | nom. sg. n. of kusuma • | voc. sg. n. of kusuma > with sandhi m|g -> .mg ] • [ gopiibhyas < abl. pl. f. of gopa • | dat. pl. f. of gopa > with sandhi s|k -> .hk ] • [ k.r.s.nas < nom. sg. m. of k.r.s.na > with sandhi as|d -> od ] • [ dadati <…> with no sandhi ]
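
The tagging step can be sketched as a table lookup from each segmented word to its possible analyses. The table below holds only the tags listed in the example; the record type `tag` and the functions `lemmatize` and `readings` are hypothetical names chosen for this illustration.

```ocaml
type gender = Mas | Neu | Fem
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

(* One morphological analysis of a flexed form. *)
type tag = { case : case; number : number; gender : gender; stem : string }

let tag case number gender stem = { case; number; gender; stem }

(* Toy flexed-forms table holding only the tags shown in the example. *)
let tags_of =
  [ ("kusumam",
     [ tag Acc Singular Neu "kusuma"; tag Nom Singular Neu "kusuma";
       tag Voc Singular Neu "kusuma" ]);
    ("gopiibhyas",
     [ tag Abl Plural Fem "gopa"; tag Dat Plural Fem "gopa" ]);
    ("k.r.s.nas", [ tag Nom Singular Mas "k.r.s.na" ]) ]

(* Attach to each word its list of tags (empty when none is known). *)
let lemmatize words =
  List.map
    (fun w ->
      match List.assoc_opt w tags_of with
      | Some tags -> (w, tags)
      | None -> (w, []))
    words

(* Number of tag combinations for the sentence; an untagged word counts once. *)
let readings tagged =
  List.fold_left (fun n (_, tags) -> n * max 1 (List.length tags)) 1 tagged

let tagged = lemmatize [ "kusumam"; "gopiibhyas"; "k.r.s.nas"; "dadati" ]
```

For the example chunk, `readings tagged` is 3 * 2 * 1 * 1 = 6 combinations, which is the ambiguity that concord and valency constraints would have to reduce.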

Future work • Verb conjugation tables preparation - full flexed forms database • Fixing sandhi analysis for bahuvrihi compounds • Choice of taggings from concord and valency constraints • Semantic guidance from ontology classification • We shall then be able to index corpora semi-automatically, towards: • computer-aided concordance of corpora • computer-aided preparation of critical editions • statistical analysis of corpora (co-occurrence, style, etc.) • computer-aided accretion of the lexicon • fully indexed citations • extraction of corpus-specific lexicons • diachrony control of lexical information
