1. Building digital libraries in Indian languages: case studies with Hindi and Kannada
2. Table of Contents
Introduction to Multilingual Digital Libraries
Different Character Sets and Encodings
Statement of the problem
Objectives
Need for the project
Methodology
Implementation
System description
Observations
Limitations
Conclusion
Future developments
3. Multilingual Digital Library
Library
Digital library
Monolingual digital library
Multilingual digital library
4. Definition of MDL
According to Ana M. B. Pavani:
“A multilingual digital library is a digital library that has all functions implemented simultaneously in as many languages as desired and whose search & retrieve functions are language independent”.
5. Terms related to multilingualism
i18n (internationalization)
Localization
Multilingual digital library
Multilingual documents
Cross-language Retrieval
6. Issues of MDL
Multiple language recognition, manipulation and display
Multilingual or cross-language search and retrieval
7. Character sets and encodings
Character set (charset): a collection of characters, in the way a human reader would understand them.
Ex: ಅ, ಆ, ಇ, ಈ and so on belong to the Kannada character set
अ, आ, इ, ई and so on belong to the Hindi (Devanagari) character set
A, B, C, D and so on belong to the Latin (English) character set
Character encoding: a way of storing characters on a computer as bits.
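The distinction between a character set and a character encoding can be sketched in a few lines of Python. This is only an illustration: the helper `can_encode` is a hypothetical name introduced here, and the Kannada letter is an arbitrary sample.

```python
# One abstract character, several possible byte representations.
ch = "\u0c95"  # KANNADA LETTER KA, chosen as an arbitrary sample

# In the character set, the character is identified by a number.
code_point = ord(ch)

# An encoding turns that number into bytes; different encodings give different bytes.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` can be stored in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(hex(code_point))
print(utf8_bytes.hex(), utf16_bytes.hex())
print(can_encode(ch, "ascii"), can_encode("A", "ascii"))
```

The last line shows why ASCII alone cannot carry Kannada or Hindi text: the character simply has no representation in that encoding.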
8. Different character sets and encodings
ASCII
ISO-8859 series
Windows series
User defined
ISO 10646
UTF-8
UTF-16
UTF-32
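As a rough comparison of the three Unicode encoding forms listed above, the following sketch counts how many bytes each one needs for a Latin, a Devanagari and a Kannada letter. The sample characters and the name `encoded_sizes` are assumptions made for illustration; explicit little-endian variants are used so that no byte-order mark is counted.

```python
# Sample characters: Latin A, Devanagari KA, Kannada KA.
SAMPLES = {"Latin A": "A", "Devanagari KA": "\u0915", "Kannada KA": "\u0c95"}
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_sizes(ch: str) -> dict:
    """Map each encoding name to the number of bytes it uses for `ch`."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in SAMPLES.items():
    print(label, encoded_sizes(ch))
```

UTF-8 is compact for Latin text but uses three bytes per character for these Indian scripts, while UTF-32 always uses four.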
9. Unicode
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Developed by the Unicode Consortium
Several versions have been released; 3.2.0 is the current one
Accommodates more than 65,000 characters
Synchronized with the corresponding versions of ISO/IEC 10646
10. Unicode
Standards incorporated under Unicode
ISO 6937, ISO 8859 series
ISCII, KS C 5601, JIS X 0208, JIS X 0212, GB 2312, CNS 11643, etc.
Scripts and Characters
European alphabetic scripts
Middle Eastern right-to-left scripts
Scripts of Asia
Indian scripts: Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam
Punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc.
11. Assigning Character Codes
Each code element is assigned a unique number, called a code point.
Code points are written as hexadecimal numbers with the prefix "U+"; for example, U+0041 is the code point of the letter "A".
Unicode groups characters together by script in code blocks.
Code blocks vary in size, depending on the size of the script.
Code elements are grouped logically throughout the range of code points, called the codespace.
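The U+ notation and the idea of code blocks can be tried out directly. In this sketch the `BLOCKS` table lists only three blocks (Basic Latin, Devanagari, Kannada) with their standard ranges; the function names are hypothetical helpers, not part of any library.

```python
# A tiny excerpt of the Unicode block table: name -> (first, last) code point.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Devanagari": (0x0900, 0x097F),
    "Kannada": (0x0C80, 0x0CFF),
}


def u_notation(ch: str) -> str:
    """Format a character's code point in the conventional U+XXXX form."""
    return f"U+{ord(ch):04X}"


def block_of(ch: str) -> str:
    """Return the name of the block containing `ch`, or 'other' if not listed."""
    cp = ord(ch)
    for name, (first, last) in BLOCKS.items():
        if first <= cp <= last:
            return name
    return "other"


for ch in ["A", "\u0915", "\u0c95"]:
    print(u_notation(ch), block_of(ch))
```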
12. Text handling
Computer text handling involves processing and encoding.
The Unicode Standard directly addresses only encoding; processing is carried out by software.
It does not define glyph images; the display software retrieves the glyphs from fonts.
The Unicode Standard does not specify the size, shape, or orientation of on-screen characters.
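The separation between encoded characters and displayed glyphs is easy to see with an Indic conjunct. The sketch below uses a Kannada sequence that a suitable font typically draws as a single shape, yet it is stored as three code points; `describe` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata

# KA + VIRAMA + SSA: usually rendered as one conjunct glyph by the font and
# rendering engine, but encoded as three separate code points.
conjunct = "\u0c95\u0ccd\u0cb7"


def describe(text: str) -> list:
    """List (U+XXXX, Unicode character name) for each code point in `text`."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


for code, name in describe(conjunct):
    print(code, name)
print("code points stored:", len(conjunct))
```

How the three code points are combined on screen is decided by the font and the rendering software, not by the encoding.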
13. Objectives
To assess the suitability of the Greenstone Digital Library software (GSDL) for developing digital library collections in Indian languages (Hindi and Kannada)
To create search and browse interfaces for GSDL in Hindi and Kannada
14. Need
A vast body of literature exists in many languages
E-publishing in Indian languages
E-governance in India
E-learning
Digital libraries for rural populations
15. Greenstone Digital Library Software
Open source
Developed by the Department of Computer Science, University of Waikato, New Zealand
http://greenstone.org
Can handle different file formats
Works on different platforms
Supports many languages through Unicode
16. Multilingual support
Interface part
Content part
17. Methodology
Software
Windows XP operating system
GSDL
Macromedia Fireworks
Nudi
Baraha
Internet Explorer 6.0
Hardware
Pentium III with 128 MB RAM
18. Hindi and Kannada Interface
Separate .dm files were created for each language
_textimagehome_ {Home Page}
_textimagehome_ [l=kn]{कि सुच }
Creating tabs for Hindi & Kannada
Hindi Tabs
Macromedia Fireworks
Baraha transliteration software
Kannada Tabs
Macromedia Fireworks
Nudi transliteration software
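The two `_textimagehome_` lines on this slide show one macro with a default text and a language-specific text. How such entries might be selected can be sketched as follows. This is not Greenstone's own parser: the regular expression, the function names and the fallback rule (use the `[l=xx]` entry when present, otherwise the default) are assumptions for illustration, and the Kannada string is a hypothetical translation.

```python
import re

# Matches lines of the form: _name_ [l=xx] {text}   (the [l=xx] part is optional)
MACRO_LINE = re.compile(r"^(_\w+_)\s*(?:\[l=(\w+)\])?\s*\{(.*)\}\s*$")

KN_HOME = "ಮುಖಪುಟ"  # hypothetical Kannada text for "Home Page"

SAMPLE_LINES = [
    "_textimagehome_ {Home Page}",
    f"_textimagehome_ [l=kn] {{{KN_HOME}}}",
]


def parse_macros(lines):
    """Build a dict keyed by (macro name, language code or None)."""
    macros = {}
    for line in lines:
        match = MACRO_LINE.match(line.strip())
        if match:
            name, lang, text = match.groups()
            macros[(name, lang)] = text.strip()
    return macros


def lookup(macros, name, lang):
    """Prefer the language-specific text; fall back to the default entry."""
    return macros.get((name, lang), macros.get((name, None)))


macros = parse_macros(SAMPLE_LINES)
print(lookup(macros, "_textimagehome_", "kn"))
print(lookup(macros, "_textimagehome_", "fr"))
```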
19. Collection building
Hindi collection: downloaded from http://manaskriti.com/kaavyaalaya/
Kannada collection: downloaded from http://udayavani.com
Hindi Unicode collection
Kannada Unicode collection
20. System description
Hindi collection/Kannada collection (non-Unicode):
Susha/Shree-Kan-0850 → Font folder
Interface language → Hindi/Kannada
Preference encoding → Latin based
Browser encoding → Latin based or User defined
Hindi/Kannada Unicode collection:
Mangal/Tunga for Hindi/Kannada → Font folder
Interface language → Hindi/Kannada
Preference encoding → UTF-8
Browser encoding → UTF-8
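The settings above matter because the same bytes are read differently under different encodings. The sketch below, which uses the Kannada word for "Kannada" as an arbitrary sample, decodes UTF-8 bytes once correctly and once with a Latin encoding, roughly what happens when the browser encoding does not match the collection.

```python
text = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"  # the word "Kannada" in Kannada script

# Bytes as they would be served for a UTF-8 page.
page_bytes = text.encode("utf-8")

# Browser encoding set to UTF-8: the original text comes back.
correct = page_bytes.decode("utf-8")

# Browser encoding set to a Latin encoding: every byte becomes its own character.
garbled = page_bytes.decode("latin-1")

print(repr(correct))
print(repr(garbled))
print(len(text), "characters became", len(garbled))
```

The damage is reversible only if the wrong decoding is undone exactly, which is why the font, preference encoding and browser encoding have to be set consistently for each collection.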
21. Observations
Interfaces can be provided in many languages.
Collections can be built in many languages, including in encodings other than Unicode.
Non-Unicode collections offer only the browse feature.
Titles of the non-Unicode collections were in English.
Unicode collections offer both search and browse features.
All collections can be accessed over a network.
22. Observations (continued)
Uses the MG (Managing Gigabytes) compression technique.
Users can browse lists of authors, titles, dates and so on.
Can handle very large collections.
New data can be added to an existing collection at any time.
Open-source software: anybody can develop it, and it can be adapted to local requirements.
23. Limitations
Fails to display Unicode HTML files in Hindi/Kannada.
It does not support truncated searching for Indian scripts.
The case-differences option cannot be disabled on the preferences page.
At present, the search feature works only on Windows XP.
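To make the truncated-search and case-difference points concrete, here is a minimal sketch of prefix matching over Unicode terms. It is not how Greenstone or MG implements searching; the term list and function names are assumptions, and matching is done on raw code points after NFC normalization, so a prefix may cut through the middle of a syllable.

```python
import unicodedata

TERMS = [
    "Library",
    "librarian",
    "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1",  # Kannada word "Kannada"
    "\u0c95\u0cb5\u0cbf",              # Kannada word "kavi" (poet)
]


def normalize(s: str) -> str:
    """Canonical composition plus case folding (a no-op for caseless scripts)."""
    return unicodedata.normalize("NFC", s).casefold()


def truncated_search(prefix: str, terms: list) -> list:
    """Return the terms that start with `prefix`, ignoring case differences."""
    p = normalize(prefix)
    return [t for t in terms if normalize(t).startswith(p)]


print(truncated_search("libr", TERMS))
print(truncated_search("\u0c95\u0ca8", TERMS))
```

Case folding changes nothing for Kannada or Devanagari text, since these scripts have no upper and lower case.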
24. Conclusion
Multilingual digital libraries will be ubiquitous in the future and will provide the basis for a very broad set of distributed activities, including computer-supported co-operative work and distance learning. Developing countries like India, where many languages are in use, could utilize comprehensive software such as Greenstone, since, being open-source software, it is readily extensible to meet the needs of multilingualism.
25. Future developments
The work can be extended to other Indian languages that Unicode supports.
The display problem with HTML files can be solved for Indian languages by creating model mappings in the UTF-8 charset.
Collections can be tested with different file formats such as PDF, RTF and e-mail for other Indian languages.
It can be tested on other operating systems such as UNIX and Linux, and with browsers such as Netscape and Opera, to assess compatibility.
Stemming algorithms for Indian languages can be developed and incorporated into GSDL.
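As a rough idea of what a stemming algorithm for an Indian language could look like, the sketch below strips a few Hindi plural endings. The suffix list is a tiny illustrative sample, not a linguistically complete one, and `strip_suffix` is a hypothetical helper rather than anything shipped with GSDL.

```python
# Illustrative Hindi endings only (longest match is tried first):
# "iyaan" plural, "on" oblique plural, "en" feminine plural.
SUFFIXES = ["\u093f\u092f\u093e\u0901", "\u094b\u0902", "\u0947\u0902"]

BASE = "\u0915\u093f\u0924\u093e\u092c"  # Hindi word "kitaab" (book)


def strip_suffix(word: str, min_stem: int = 2) -> str:
    """Remove the longest listed suffix, keeping at least `min_stem` code points."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


plural = BASE + "\u0947\u0902"   # "kitaaben" (books)
oblique = BASE + "\u094b\u0902"  # "kitaabon" (books, oblique case)
print(strip_suffix(plural) == strip_suffix(oblique) == BASE)
```

With such a step in the indexing pipeline, a query for one inflected form could retrieve documents containing the others.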
27. Any questions?
28. Thank you