Multilinguality and cross-language searching

hedwig

Presentation Transcript


  1. Multilinguality and cross-language searching
     Multilingual aspects in Indexing, Searching and Metadata (Resource Description)
     Multilinguality in Indexing, Searching and Metadata

  2. Multilingual aspects in Indexing, Searching and Metadata
     • IETF Model of Multilingual support in Internet Applications
       • Electronic Mail
       • Interactive applications
     • Charset and Language tagging
       • MIME types
       • XML Language and Charset tagging
       • DC language definition
     • Metadata and RDF
       • DC.Language
     • Existing solutions
       • TUSTEP
       • Search Engines and Subject Gateways
     • Multilingual framework for the REIS Project

  3. IETF Model of Multilingual support in Internet Applications
     • Electronic Mail
       • Language
       • Character Encoding Scheme
       • Transfer Encoding Scheme
     • Interactive applications
       • WWW: HTTP/HTML
         • <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
       • XML/DOM
       • LDAP and X.500 (?)
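
The META declaration on this slide can be read back with the Python standard library alone. The following is a minimal sketch, not part of the original presentation; the class name `MetaCharsetParser` and the helper `declared_charset` are hypothetical names chosen for illustration.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Collect the charset declared in <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "content-type":
            # Reuse the MIME header parser for the "type; param=value" syntax.
            header = Message()
            header["Content-Type"] = attr.get("content", "")
            self.charset = header.get_content_charset()


def declared_charset(html_text):
    """Return the declared charset in lower case, or None if absent."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


page = '<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">'
print(declared_charset(page))
```

A real client would normally prefer the charset parameter of the HTTP `Content-Type` response header and fall back to the META element only when the header carries none.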

  4. XML: Language and Charset tagging
     • The character is the atomic unit of text
       • All ISO 10646 characters + TAB, CR, LF
     • The mechanism for encoding characters can vary from entity to entity
     • All XML processors must accept UTF-8 and UTF-16
     • Character Encoding in Entities (XML 4.3.3)
       • EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
       • <?xml encoding='UTF-8'?> <?xml encoding='EUC-JP'?>
     • Autodetection of Character Encoding
     • Language identification (XML 2.12)
       • Tag for identification of languages
       • LanguageID ::= Langcode ('-' Subcode)*
       • Langcode ::= ISO639Code | IanaCode | UserCode
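
The two tagging mechanisms on this slide can be tried out with Python's `xml.etree.ElementTree`. This is a sketch under stated assumptions: the sample document and the helper names `sample_document` and `tagged_titles` are invented for illustration, and the regular expression follows the LanguageID production of the first edition of XML 1.0 (two-letter ISO 639 code, or an `i-`/`x-` prefixed code, followed by optional subcodes).

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved "xml" namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# LanguageID ::= Langcode ('-' Subcode)*, with
# Langcode ::= ISO639Code | IanaCode | UserCode (XML 1.0, first edition).
LANGUAGE_ID = re.compile(r"^(?:[A-Za-z]{2}|[iIxX]-[A-Za-z]+)(?:-[A-Za-z]+)*$")


def sample_document(declared_encoding, codec):
    """Build a hypothetical document and serialise it in the given codec."""
    text = (
        f'<?xml version="1.0" encoding="{declared_encoding}"?>'
        "<titles>"
        '<title xml:lang="es">La Mesa y Silla Roja</title>'
        '<title xml:lang="de">T\u00fcbingen</title>'
        "</titles>"
    )
    return text.encode(codec)


def tagged_titles(xml_bytes):
    """Return (language tag, text) pairs for every <title> element."""
    root = ET.fromstring(xml_bytes)
    return [(title.get(XML_LANG), title.text) for title in root.iter("title")]


# The same content in the two encodings every XML processor must accept.
utf8_titles = tagged_titles(sample_document("UTF-8", "utf-8"))
utf16_titles = tagged_titles(sample_document("UTF-16", "utf-16"))
print(utf8_titles == utf16_titles)

for lang, text in utf8_titles:
    print(lang, bool(LANGUAGE_ID.match(lang)), text)
```

The sketch deliberately sticks to UTF-8 and UTF-16: support for other encodings such as EUC-JP depends on the parser, and Python's bundled expat parser does not handle multi-byte legacy encodings.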

  5. Charset and Language tagging
     • MIME types
       • text, image, audio, video
     • Charset = Character Set + Character Encoding Scheme
     • Transfer Encoding Scheme
       • base64
       • quoted-printable
     • Language
       • RFC 1766
       • ISO 639-2
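
The three layers listed here (charset, transfer encoding, language tag) can be seen together in a single MIME message. The sketch below uses Python's standard `email` package; the sample text and the helper name `round_trip` are assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

TEXT = "Universit\u00e4t T\u00fcbingen"


def round_trip(transfer_encoding):
    """Serialise TEXT as ISO-8859-1 with the given CTE and parse it back."""
    msg = EmailMessage()
    # Charset = character set + character encoding scheme.
    msg.set_content(TEXT, charset="iso-8859-1", cte=transfer_encoding)
    # Set after set_content(), which clears existing Content-* headers.
    msg["Content-Language"] = "de"  # language tag as in RFC 1766
    wire = msg.as_bytes()
    return wire, message_from_bytes(wire, policy=policy.default)


for cte in ("quoted-printable", "base64"):
    wire, parsed = round_trip(cte)
    print(parsed.get_content_type(), parsed.get_content_charset(),
          parsed["Content-Transfer-Encoding"], parsed["Content-Language"])
    print(parsed.get_content().strip())
```

In both cases the bytes on the wire are 7-bit safe, while the decoded content is the same Latin-1 text; only the transfer encoding differs.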

  6. Language Definition in DC Metadata set
     • <meta name="DC.language" scheme="rfc1766" content="es"> (the slide lists "ISO639-2" as an alternative scheme value)
     • <meta name="DC.title" lang="es" content="La Mesa y Silla Roja">
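
An indexer can pick these elements out of a page with a few lines of standard-library Python. This is a minimal sketch; the class name `DublinCoreParser` and the shape of the returned dictionaries are assumptions, not something prescribed by Dublin Core or by the presentation.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*"> elements with their scheme and language."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        name = attr.get("name", "")
        if not name.lower().startswith("dc."):
            return
        self.elements.append({
            "element": name[3:].lower(),          # e.g. "language", "title"
            "scheme": attr.get("scheme") or None,  # e.g. "rfc1766"
            "lang": attr.get("lang") or None,      # language of the value
            "content": attr.get("content", ""),
        })


head = (
    '<meta name="DC.language" scheme="rfc1766" content="es">'
    '<meta name="DC.title" lang="es" content="La Mesa y Silla Roja">'
)
parser = DublinCoreParser()
parser.feed(head)
for element in parser.elements:
    print(element)
```

Keeping `scheme` and `lang` separate reflects the two roles on the slide: `scheme` says how to interpret the value of DC.language, while `lang` states the language in which another element, such as the title, is written.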

  7. Multilingual Subject Gateway
     • Developing multilingual subject gateways (SOSIG as example)
       • SOSIG accepts resources in any language, provided they have been evaluated for quality
       • Translations should be coherent and checked
       • Different language versions should be equally well maintained
     • SOSIG Cataloguing rules
       • TITLE will be displayed in the first language
       • ALTERNATIVE TITLE in other languages
       • DESCRIPTION will mention the different languages in which the resource is available
       • URI of all language versions
       • Labeling URI language
     • Library standards for multilingual provision
       • NISO Z39.53 Language codes
       • USMARC Language codes

  8. Multilingual provision in popular Internet Search Engines
     • AltaVista
       • Search in 25 languages
       • Documents indexed as is
       • Automatic translation - very simple and naive
     • Other sites that have dedicated national sites
       • interface language
       • language resources
       • no special language policy
       • Euroseek
       • Excite
       • Lycos
       • Infoseek

  9. New Developments in Subject Gateways, Indexing, Searching
     • NREN projects
     • Subject gateways
     • Commercial Search Engines
     • Multilingual Text Retrieval and Processing
     • TUSTEP system

  10. NREN projects
     • Social Science Information Gateway - http://sosig.esrc.bris.ac.uk/
     • ROADS Project Software/Documentation Server - http://www.roads.lut.ac.uk/
     • CHIP-Pilot (Clearing House for Internet Projects) - http://www.terena.nl/chip/
     • IMesh - International Collaboration on Internet Subject Gateways - http://www.desire.org/html/subjectgateways/community/imesh/
     • DFN Indexing and Searching projects - http://www.dfn.de/links/suchen.html
     • X.500 Directory E-mail Addresses Search (AMBIX-D) - http://ambix.uni-tuebingen.de:8889
     • TUSTEP Multilingual Textdata Processing and Fuzzy Searching - http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html
     • IKEM Toolkit - http://bikit.rug.ac.be:80/ikem/
     • DRUID Classification Tools, University of Twente - http://twentyone.tpd.tno.nl/druid/

  11. Search Engines news
     • CLEVER project at IBM Almaden Research Center - http://www.almaden.ibm.com/cs/k53/clever.html
     • Cora Search Engine - http://www.cora.justresearch.com/about.html
     • Google Search Engine - http://www.google.com/why_use.html
     • Free AltaVista Search Intranet v2.3A Entry Level Software - http://www.altavista.software.digital.com/search/intranet/free_3k/index.asp
     • Ultraseek Server for Linux Platforms - http://software.infoseek.com/products/ultraseek/linux/ultrareq.htm

  12. TUSTEP: TUebingen System of Text Processing Programs
     1. File structure
     2. Multilingual capabilities
     3. Internal data presentation
     4. Database publishing/output data presentation
     5. CGI
     6. Sample implementation
        • http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
          Try entries like Smith or Meier or...
        • http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery

  13. TUSTEP: File structure
     • TUSTEP can handle basically all kinds of (explicitly or implicitly) structured text files
     • Special support for XML
     • "Databases" (i.e. files with a repeated and regular structure) are only a special case of this
     • Fuzzy search and other retrieval actions can then be used to access the data

  14. TUSTEP: Multilingual capabilities
     • TUSTEP supports the following scripts:
       - Latin
       - Cyrillic
       - Greek (classical and modern)
       - Hebrew (with support for Yiddish)
       - Arabic
       - Estrangelo
       - Coptic
       - Old Church Slavonic
     • More:
       • Phonetics, Egyptian hieroglyphs
       • allows use of combining diacritics
       • Experimental: Indic scripts and Armenian

  15. TUSTEP: Internal data presentation and transformation
     • Internally, TUSTEP uses a script tagging system with transliteration into ASCII, which allows all data to be encoded in a human-readable and easily transmittable form
     • TUSTEP has a module for importing from and exporting into the UCS (UTF-8 and UTF-16)
     • Example: #r+Novij rafiqnij clovnik ykra^ins^bko%:^i movi#r-
     • The transformation module allows use of other tagging systems and other transliteration schemes

  16. TUSTEP: Database publishing
     • TUSTEP's typesetting module offers a high-quality, fast and easy way of publishing all or part of the database in paper (or PDF) form

  17. TUSTEP: CGI
     • Complete control over input and output forms
     • Possibility to configure exactly the kind of search(es), e.g.
       • exact matches only
       • Soundex
       • "intelligent" fuzzy search
       • "brute" fuzzy search that allows a number of different letters
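
Three of the search kinds listed above can be sketched in a few lines of Python. This is not TUSTEP's implementation, which the slides do not describe: the Soundex variant is the common American one, the "brute" search is approximated here with Levenshtein edit distance, the "intelligent" fuzzy search is left out because its rules are not given, and the function names and the sample name list are assumptions.

```python
# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":  # h and w do not separate equal codes
            previous = code
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def search(query, names, mode="exact", max_diff=1):
    """Filter names by exact match, Soundex code or edit distance."""
    q = query.lower()
    if mode == "exact":
        return [n for n in names if n.lower() == q]
    if mode == "soundex":
        return [n for n in names if soundex(n) == soundex(query)]
    if mode == "brute":
        return [n for n in names if edit_distance(n.lower(), q) <= max_diff]
    raise ValueError(f"unknown mode: {mode}")


NAMES = ["Smith", "Smyth", "Meier", "Meyer", "Mayr"]  # hypothetical catalogue
for mode in ("exact", "soundex", "brute"):
    print(mode, search("Meier", NAMES, mode))
```

With these sample names, the Soundex search is the most permissive (it also finds "Mayr"), while the edit-distance search with `max_diff=1` accepts only spellings that differ in a single letter.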

  18. Multilinguality framework of the project
     • Multiple language indexing
       • multiple language documents/indexes
     • Cross-language Searching
       • Multiple language indexes/documents
       • Automatic Query forwarding based on thesauri
       • Automatic translation
       • Multilingual information retrieval
       • Translation Request Protocol
     • Language and Character Encoding tagging
       • XML as internal presentation of data
       • Using XML language and charset tagging
     • Metadata
       • DC.Language definition
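
The idea of automatic query forwarding based on a thesaurus can be illustrated with a toy example. Everything in this sketch is hypothetical: the concept table, the per-language document collections and the function names are invented for illustration and say nothing about how the project implemented the feature.

```python
# Each concept lists its preferred term per language (toy thesaurus).
CONCEPTS = [
    {"en": "table", "es": "mesa", "de": "tisch"},
    {"en": "chair", "es": "silla", "de": "stuhl"},
]

# One small document collection per language (toy indexes).
DOCUMENTS = {
    "es": ["La Mesa y Silla Roja"],
    "en": ["A red table", "The red chair"],
}


def translate_query(term, source_lang):
    """Map a query term to its equivalents in every known language."""
    term = term.lower()
    for concept in CONCEPTS:
        if concept.get(source_lang) == term:
            return dict(concept)
    # Unknown term: search only in the language it was asked in.
    return {source_lang: term}


def cross_language_search(term, source_lang):
    """Forward the translated query to each language's collection."""
    hits = []
    for lang, translated in translate_query(term, source_lang).items():
        for document in DOCUMENTS.get(lang, []):
            if translated in document.lower().split():
                hits.append((lang, document))
    return sorted(hits)


print(cross_language_search("mesa", "es"))
print(cross_language_search("chair", "en"))
```

A production system would also need stemming, multi-word terms and disambiguation of terms that belong to several concepts; the sketch only shows the forwarding step itself.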
