Multilingual Issues in Information Retrieval and Resource Description Overview

Yuri Demchenko, TERENA, demchenko@terena.nl

Presentation Transcript


  1. Multilingual Issues in Information Retrieval and Resource Description: Overview
     Yuri Demchenko, TERENA, demchenko@terena.nl

  2. In this presentation
     • Multilingual Issues in TERENA Technical Programme
     • Multilinguality: trends and developments
     • Technical Issues/Background
     • Data presentation and resource description format
     • Standards Overview
     • Metadata and Cataloging
     • Recent Developments in Subject Gateways and Search Engines
     • Cross-language Information Retrieval
     • REIS/TAP Initiatives Multilinguality Framework

  3. TERENA Multilingual Community and TERENA Technical Programme
     • TERENA has 43 members from 34 countries speaking 30 languages
     • Multilingual issues have always been within the scope of the TERENA Technical Programme
     • WG-i18n - WG on Internationalisation issues
     • C3 Project on messaging transliteration tools
     • MAITS - initiated by WG-i18n
     • Multilingual E-Mail Agent Testing
     • Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook
     • Multilingual Support in Internet/IT Applications. Information page - http://www.terena.nl/projects/multiling/
     • Liaison with standards bodies
       • CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html
       • IETF

  4. Multilinguality: trends and developments
     • Storing, processing, presentation and exchange of information in many languages
       • Interactive (protocol based/negotiated) applications and non-interactive (resource description and information presentation)
     • Multilingual Search and Retrieval
       • Multilingual Subject Gateways and Search Engines
       • CLIR testing at TREC
     • Data Resource Model and Multilinguality
       • One or Multiple languages
       • Data format
       • Metadata (not part of Data but part of Resource)
       • References, links
       • Professional Thesauri (Resource Context) - base for multiple languages and language unification

  5. Internet Applications
     • Non-interactive application: Electronic Mail
       • Correct Message Composition and Rendering
     • Interactive applications
       • WWW: HTTP/HTML
         • Content-Type: text/html; charset=euc-jp
         • <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
       • Content Negotiation Protocol
         • Media features, attributes
         • Direct and hop-by-hop communication
     • Operational Applications
       • (Internationalised) DNS
       • LDAP and X.500 (Language Support ?)

  6. I18n and ML issues at IETF and other standards bodies
     • IETF Architectural Model of Multilingual support in Internet Applications - RFC 2130
       • Language and Charset/Encoding tagging
     • Content negotiation framework (IETF/W3C)
       • Point-to-point vs hop-by-hop
       • Message based vs Interactive vs Streaming
     • Internationalised DNS (IDN) - Internationalised Domain Names
       • vs E-Mail (SMTP, IMAP)
       • vs Routing (Routing Policy Specification Language (RPSL))
       • vs Network Management (SNMP textual presentation)
       • vs Network Security (TLS and IPSec)
     • Content Encoding normalisation (IETF/Unicode)
     • LSD-2 - Large Scale Services Deployment
     • IMAP language extension

  7. IETF Architectural Model of Multilingual support in Internet Applications
     [Diagram labels: Resolution Service / Directory (content MD); Resource; Presentation, Culture, Locale, Language (on each side); Content Transfer Agent (on each side); Communication Protocol; Communication/Network]
     • User Interface
       • Presentation
       • Culture
       • Locale
       • Language
     • On-the-wire
       • Coded Character Set - Repertoire of ISO-10646
       • Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO8859-1
       • Transfer Encoding Scheme (Base64, QP)
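
The three on-the-wire layers named on this slide can be illustrated with a minimal Python sketch. The function name `encode_layers` and the sample string are illustrative assumptions, not part of the presentation; the sketch only shows how one text passes from code points (coded character set) through UTF-8 (character encoding scheme) to Base64 or quoted-printable (transfer encoding scheme).

```python
import base64
import quopri


def encode_layers(text):
    """Return the same text at each on-the-wire layer (hypothetical helper)."""
    # Coded Character Set: each character maps to an ISO 10646 code point.
    code_points = [ord(ch) for ch in text]
    # Character Encoding Scheme: code points serialised as UTF-8 octets.
    utf8_bytes = text.encode("utf-8")
    # Transfer Encoding Scheme: octets made safe for 7-bit transports.
    return {
        "code_points": code_points,
        "utf8": utf8_bytes,
        "base64": base64.b64encode(utf8_bytes),
        "quoted_printable": quopri.encodestring(utf8_bytes),
    }


layers = encode_layers("Bemüh'n")
```

Decoding simply reverses the layers: undo the transfer encoding first, then decode the octets with the labelled charset.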

  8. Content Negotiation Framework (IETF/W3C)
     • Content Negotiation covers three elements
       • Expressing the capabilities of the sender and the data resource to be transmitted
       • Expressing the capabilities of a receiver
       • A protocol by which capabilities are exchanged
     • Abstract framework for content negotiation
       • (Content) (Transmit.data) (Data document)
       • [Author]----->-----[Sender]----->-----[Receiver]----->-----[User]
     • Transparent Content Negotiation in HTTP - RFC 2295
     • Protocol-independent Content Negotiation Framework - RFC 2703
       • Non-message resource transfer
       • End-to-end vs hop-by-hop negotiation
       • Use of directory and resolution services
     • CC/PP exchange protocol based on HTTP Extension Framework (W3C)
       • Composite Capability/Preference Profile: A user side framework for content negotiation
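
As a rough illustration of the "capabilities of a receiver" element, the sketch below matches a receiver's language preferences against the variants a sender can offer. It uses the HTTP/1.1 `Accept-Language` header with q-values, which is the simpler server-driven form of negotiation rather than the RFC 2295 transparent negotiation protocol itself. The function names and the fallback rule (exact tag first, then primary language) are assumptions made for this example.

```python
def parse_accept_language(header):
    """Parse 'de-CH, fr;q=0.8' into [(tag, q), ...], best preference first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        q = 1.0  # a missing q-value means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((tag, q))
    # sorted() is stable, so equal q-values keep the order the receiver gave.
    return sorted(prefs, key=lambda pref: -pref[1])


def negotiate(header, available, default):
    """Pick the best available language variant for the receiver."""
    by_tag = {lang.lower(): lang for lang in available}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in by_tag:
            return by_tag[tag]
        primary = tag.split("-")[0]  # fall back from 'de-CH' to 'de'
        if primary in by_tag:
            return by_tag[primary]
    return default


chosen = negotiate("de-CH, fr;q=0.8, en;q=0.5", ["en", "fr", "de"], "en")
```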

  9. Charset and Language tagging
     • MIME types (RFC 2045-2049)
       • text, image, audio, video
     • Charset = Character Set + Character Encoding Scheme
     • Transfer Encoding Scheme
       • base64
       • quoted-printable
     • Other media attributes and features (e.g., resolution, color, language, etc.)
     • Language
       • RFC 1766
       • ISO639-2
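
A minimal sketch of how these labels appear on a real MIME message, using Python's standard `email` package. The helper name `build_tagged_message` and the sample text are assumptions for illustration; the `Content-Language` header carries an RFC 1766 style tag.

```python
from email.message import EmailMessage


def build_tagged_message(body, language):
    """Build a text/plain message labelled with charset, transfer encoding and language."""
    msg = EmailMessage()
    # charset = character set + encoding scheme; cte = transfer encoding scheme.
    msg.set_content(body, subtype="plain", charset="utf-8", cte="base64")
    msg["Content-Language"] = language
    return msg


body = "Habe nun, ach! Philosophie, durchaus studiert mit heißem Bemüh'n."
message = build_tagged_message(body, "de")

# Read the labels back, as a receiving mail agent would before rendering.
labels = {
    "type": message.get_content_type(),
    "charset": message.get_content_charset(),
    "transfer_encoding": str(message["Content-Transfer-Encoding"]),
    "language": str(message["Content-Language"]),
}
```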

  10. WWW: HTTP/HTML
      • The HTTP header includes information about the type of the transferred information and the character encoding for text-based information:
        Content-Type: text/html; charset=euc-jp
      • The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:
        Content-Language: se
      • Character encoding information in the META information of the HTML document:
        <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
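
The two labelling mechanisms on this slide can be combined in a small lookup, sketched below under stated assumptions: the function names are hypothetical, the HTTP header is given precedence over the META element, and ISO-8859-1 is used as the fallback (the historical HTTP/1.1 default for text types). Real browsers apply more elaborate sniffing rules.

```python
import re
from email.message import Message

# Matches <META http-equiv="Content-Type" Content="text/html; charset=...">
META_CHARSET = re.compile(
    rb'<meta[^>]+content\s*=\s*["\'][^"\']*charset=([\w.:-]+)',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def detect_charset(content_type, body, default="iso-8859-1"):
    """Prefer the HTTP header, then the META element, then the default."""
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    match = META_CHARSET.search(body[:1024])  # only scan the document head
    if match:
        return match.group(1).decode("ascii").lower()
    return default


page = b'<html><head><META http-equiv="Content-Type" Content="text/html; charset=euc-jp"></head></html>'
from_header = detect_charset("text/html; charset=EUC-JP", b"")
from_meta = detect_charset("text/html", page)
```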

  11. XML: Character Set tagging
      • A character is an atomic unit of text
      • All ISO 10646 characters + TAB, CR, LF
      • The mechanism for encoding characters can vary from entity to entity
      • All XML processors must accept UTF-8 and UTF-16
      • Character Encoding declaration in XML documents or entities (section 4.3.3)
        EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
        <?xml encoding='UTF-8'?>
        <?xml encoding='EUC-JP'?>
      • Default Character Set Encoding - UTF-8 and UTF-16
      • Autodetection of Character Encoding
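
Autodetection can be sketched roughly as follows: look for a byte order mark, then for an encoding declaration, and otherwise assume UTF-8. This is a simplified illustration, not the full procedure from the XML specification; it ignores UTF-32 and EBCDIC, and the function name `sniff_xml_encoding` is an assumption.

```python
import codecs
import re

# Encoding declaration inside the XML or text declaration at the start of an entity.
ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data):
    """Guess the character encoding of an XML entity from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    match = ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: UTF-8 is the default.
    return "utf-8"


declared = sniff_xml_encoding(b'<?xml version="1.0" encoding="EUC-JP"?><a/>')
undeclared = sniff_xml_encoding(b'<?xml version="1.0"?><a/>')
```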

  12. XML: Language tagging
      • Language identification (section 2.12)
      • Labelling language of the whole document, entity or item
      • Tag for identification of languages
        LanguageID ::= Langcode ('-' Subcode)*
        Langcode ::= ISO639Code | IanaCode | UserCode
      • Examples:
        <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
        <p xml:lang="en-GB">What colour is it?</p>
        <p xml:lang="en-US">What color is it?</p>
        <sp who="Faust" desc='leise' xml:lang="de">
          <l>Habe nun, ach! Philosophie,</l>
          <l>Juristerei, und Medizin</l>
          <l>und leider auch Theologie</l>
          <l>durchaus studiert mit heißem Bemüh'n.</l>
        </sp>
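
Because `xml:lang` applies to an element and everything inside it unless overridden, an indexer has to track the inherited value while walking the tree. The sketch below does this with Python's standard `xml.etree.ElementTree`; the function name and the `<doc>` wrapper element are assumptions added so that the slide's examples form one well-formed document.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def texts_by_language(xml_text):
    """Return (language, text) pairs, applying xml:lang inheritance."""
    found = []

    def walk(elem, inherited):
        # An element's own xml:lang overrides the value inherited from its parent.
        lang = elem.get(XML_LANG, inherited)
        if elem.text and elem.text.strip():
            found.append((lang, elem.text.strip()))
        for child in elem:
            walk(child, lang)

    walk(ET.fromstring(xml_text), None)
    return found


sample = (
    '<doc xml:lang="en">'
    "<p>The quick brown fox jumps over the lazy dog.</p>"
    '<p xml:lang="en-GB">What colour is it?</p>'
    '<sp who="Faust" xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>'
    "</doc>"
)
pairs = texts_by_language(sample)
```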

  13. Unicode Technical Reports
      • The Unicode Standard, Version 3.0 - Just published! - http://www.unicode.org/unicode/uni2book/u2.html
      • Unicode 2.0 test page - http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html
      • Multilingual European Subsets of ISO/IEC 10646-1 - http://www.stri.is/TC304/p10_1998_05_30.pdf
      • Unicode Technical Reports
        • UTR #15: Unicode Normalization Forms, Version 18.0 I-D by Martin Duerst
        • UTR #17: Character Encoding Model
        • UTR #16: UTF-EBCDIC
        • UTR #10: Unicode Collation Algorithm
        • UTR #7: Plane 14 Characters for Language Tags
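
The practical effect of the normalization forms in UTR #15 can be shown with Python's standard `unicodedata` module: the same visible text may be stored precomposed or as a base letter plus a combining mark, and only compares equal after normalization. The helper name `same_text` and the sample strings are illustrative assumptions.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "Bem\u00fch'n"     # u-umlaut as one precomposed code point
decomposed = "Bemu\u0308h'n"  # 'u' followed by COMBINING DIAERESIS

raw_equal = composed == decomposed
normalized_equal = same_text(composed, decomposed)
```

An index that normalizes both documents and queries to one form avoids missing matches that differ only in how accents are encoded.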

  14. Language Definition in DC Metadata set - DC.Language Format
      <meta name = "DC.Language" content = "en">
      <meta name = "DC.Language" scheme = "rfc1766" content = "en">
      <meta name = "DC.Language" scheme = "ISO639-2" content = "eng">
      <meta name = "DC.Language" scheme = "rfc1766" content = "en-US">
      <meta name = "DC.Language" content = "zh">
      <meta name = "DC.Language" content = "ja">
      <meta name = "DC.Language" content = "es">
      <meta name = "DC.Language" content = "german">
      <meta name = "DC.Language" lang = "fr" content = "allemand">

  15. Language Definition in DC Metadata set - Field content language labelling/attributing
      • A work in Spanish may be assigned the following metadata:
        <meta name = "DC.Language" scheme = "rfc1766" content = "es">
        <meta name = "DC.Title"
              lang = "es"
              content = "La Mesa Verde y la Silla Roja">
        <meta name = "DC.Title"
              lang = "en"
              content = "The Green Table and the Red Chair">
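
A harvester reading such records needs to keep the resource language (DC.Language) apart from the language of each field value (the `lang` attribute). The following sketch uses Python's standard `html.parser`; the class name `DCMetaParser` and the record layout are assumptions for illustration.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect Dublin Core <meta> elements with their scheme and lang labels."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lower-cased
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.records.append({
                "element": name,
                "scheme": attributes.get("scheme"),
                "lang": attributes.get("lang"),
                "content": attributes.get("content"),
            })


html = (
    '<meta name="DC.Language" scheme="rfc1766" content="es">'
    '<meta name="DC.Title" lang="es" content="La Mesa Verde y la Silla Roja">'
    '<meta name="DC.Title" lang="en" content="The Green Table and the Red Chair">'
)
parser = DCMetaParser()
parser.feed(html)

# Titles keyed by the language of the field value, not of the resource.
titles = {r["lang"]: r["content"] for r in parser.records if r["element"] == "DC.Title"}
```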

  16. DC in Multiple Languages
      • The reference language of the international DC community is English; however, the semantics of DC elements are in principle expressed equally well in any modern language
      • The versions of DC elements in various languages should share a single name space using tokens that look like English words but stand for universal elements - http://purl.org/dc/elements/1.1/
      • DC in Multiple Languages Registry project - http://purl.org/dc/groups/languages.htm
        • Uses RDF schemas to share machine-readable tokens for translation of DC terms in multiple languages (26 languages to date)
        • Linkage to and from central DC namespace server
        • Registry as Dictionary/Thesauri - use Interlinguas to link different translations
        • Formal recognition and standardization procedure

  17. Document Description with Unqualified DC and RDF syntax
      <?xml:namespace ns="http://purl.org/metadata/dublin_core_elements" prefix="DC"?>
      <RDF:RDF>
        <RDF:Description RDF:HREF="http://www.biblio.de/buecher/kleist.html">
          <DC:Title XML:lang="de">Das Erdbeben in Chili</DC:Title>
          <DC:Creator>Heinrich von Kleist</DC:Creator>
        </RDF:Description>
      </RDF:RDF>
      • XML Encoding (Character set) declaration
      • UTF-8/UTF-16 as default encoding
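
The slide uses an early draft namespace syntax. As a hedged sketch of how the same description might be generated today, the code below builds it with `xmlns` style namespaces via Python's standard `xml.etree.ElementTree`. The choice of `rdf:about`, the lower-case element names and the function name `describe` are assumptions of this example, not taken from the presentation; the DC namespace URI is the one quoted on slide 16.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def describe(resource_uri, title, title_lang, creator):
    """Serialise a two-element DC description as UTF-8 encoded RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_uri}
    )
    # The title carries its own language label, independent of the resource.
    title_elem = ET.SubElement(desc, f"{{{DC_NS}}}title", {XML_LANG: title_lang})
    title_elem.text = title
    creator_elem = ET.SubElement(desc, f"{{{DC_NS}}}creator")
    creator_elem.text = creator
    # xml_declaration=True writes the encoding declaration explicitly.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


record = describe(
    "http://www.biblio.de/buecher/kleist.html",
    "Das Erdbeben in Chili",
    "de",
    "Heinrich von Kleist",
)
```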

  18. Recent Developments in Subject Gateways, Indexing, Searching
      • NRENs projects
      • Subject gateways
      • Commercial Search Engines
      • Multilingual Text Retrieval and Processing
      • TUSTEP system - using "fuzzy" multilingual searching
      • Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8 Conferences by NIST

  19. Multilingual Subject Gateway (DESIRE)
      • Developing multilingual subject gateways (SOSIG as example)
        • SOSIG accepts resources in any language, once evaluated for quality
        • Translations should be coherent and checked
        • Different language versions should be equally well maintained
      • SOSIG Cataloguing rules
        • TITLE will be displayed in the first language
        • ALTERNATIVE TITLE in other languages
        • DESCRIPTION will mention the different languages in which the resource is available
        • URI of all language versions
        • Labeling URI language
      • Library standards for multilingual provision
        • NISO Z39.53 Language codes
        • USMARC Language codes

  20. Multilingual provision in popular Internet Search Engines
      • Multilingual SE
        • AltaVista - http://www.altavista.com/ - 28 languages
          • Documents indexed as is
          • Automatic translation - very simple and naive
        • Euroseek - http://www.euroseek.com/ - 30 languages
        • FAST Advanced Search - http://www.alltheweb.com - 31 languages
        • Google - http://www.google.com/ - 11 languages
      • Other sites that have dedicated national sites
        • interface language
        • language resources
        • no special language policy
        • Excite - 11 countries
        • Lycos - 23 countries

  21. TUSTEP - TUebingen System of Text Processing Programs
      • 1. File structure
      • 2. Multilingual capabilities
      • 3. Internal data presentation
      • 4. Database publishing/output data presentation
      • 5. CGI
      • 6. Sample implementation
        • http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
        • Try entries like Smith or Meier or...
        • http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery

  22. Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8
      • TREC - Text REtrieval Conference - http://trec.nist.gov/
      • Cross-Language Information Retrieval (CLIR) technologies
        • Using Intermediary or Interlingual representation
        • Latent Semantic Indexing
        • Generalised Vector Space Model, etc.
        • Computer translation
        • Machine-readable bilingual dictionaries
        • Multilingual Thesauri
      • Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others
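
Of the CLIR techniques listed, the dictionary-based approach is the easiest to sketch: translate each query term through a machine-readable bilingual dictionary, then search the target-language collection with the translated terms. The toy dictionary, documents and function names below are invented for illustration and do not represent any TREC participant's system.

```python
def translate_query(query, dictionary):
    """Replace each query word by its dictionary translations, if any."""
    terms = []
    for word in query.lower().split():
        # Words missing from the dictionary pass through untranslated.
        terms.extend(dictionary.get(word, [word]))
    return terms


def search(query, dictionary, documents):
    """Rank documents by how many translated query terms they contain."""
    terms = set(translate_query(query, dictionary))
    scores = {}
    for doc_id, text in documents.items():
        overlap = terms & set(text.lower().split())
        if overlap:
            scores[doc_id] = len(overlap)
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical English-to-Spanish dictionary and a two-document collection.
en_es = {"green": ["verde"], "table": ["mesa"], "chair": ["silla"], "red": ["roja", "rojo"]}
docs = {
    "d1": "La mesa verde y la silla roja",
    "d2": "El coche rojo",
}
hits = search("green table", en_es, docs)
```

Listing several translations per term, as for "red" here, is a simple way to cope with ambiguity and inflection at the cost of some precision.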

  23. REIS Project/Initiative Multilinguality framework - First attempt
      • Multiple language indexing
        • multiple language documents/indexes
      • Cross-language Searching
        • Automatic Query forwarding based on thesauri or ML dictionary
        • Using "fuzzy" multilingual searching/matching
      • Multilingual information retrieval
        • Automatic translation (if requested)
        • Translation Request Protocol
      • Internal Data/Indexes presentation
        • Language and Character Encoding tagging
        • XML as internal presentation of data and XML language and charset tagging
        • Text/Charset normalisation (Unicode or TUSTEP-like)
      • Metadata and Resource Description
        • DC.Language definition and XML/RDF/DC Language tagging
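
One possible reading of "fuzzy" multilingual matching combined with text normalisation is sketched below: index terms and queries are folded to an accent-free, case-free form, and near matches are found by string similarity. This is an assumption about how such matching could work, using Python's standard `unicodedata` and `difflib`; it is not a description of TUSTEP or of the REIS design.

```python
import difflib
import unicodedata


def fold(text):
    """Strip diacritics and case so that spelling variants compare closer."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def fuzzy_lookup(query, index_terms, cutoff=0.75):
    """Return index terms whose folded form is similar to the folded query."""
    folded = {}
    for term in index_terms:
        folded.setdefault(fold(term), []).append(term)
    close = difflib.get_close_matches(fold(query), list(folded), n=5, cutoff=cutoff)
    return [term for key in close for term in folded[key]]


names = ["Meyer", "Maier", "Smith", "Müller"]
matches = fuzzy_lookup("Meier", names)
```

The cutoff value is a tuning choice: lower values retrieve more spelling variants but also more unrelated terms.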

  24. Multilinguality Framework for Multilingual Indexing/Search Services
      • Yet to be developed
