
Multilingual, Multi-script Catalog Requirements (An Arcadia Project)

January 29, 2010


Presentation Transcript


  1. Multilingual, Multi-script Catalog Requirements (An Arcadia Project)
     January 29, 2010

  2. Outline
     • Background about the Arcadia non-Roman script project
     • Introductions
     • Orbis vs. YUFind and systems like YUFind
     • Requirements discussion
     • Wrap-up
     Jan 2010

  3. Project Goals
     • Gap analysis of multilingual, multi-script functionality in Lucene-Solr-Solrmarc discovery applications (e.g., YUFind)
     • Identification of desirable functionality
     • Collaboration opportunities, community interest
     • Recommendations with level-of-effort analysis

  4. Orbis vs. YUFind

  5. vs.
     Chinese example: “中日韩经济合作的新起点”
     N-gram tokens, where N=2: <中日> <日韩> <韩经> <经济> <济合> <合作> <作的> <的新> <新起> <起点>
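The bigram tokens on this slide can be reproduced with a few lines of Python. This is only a minimal sketch of overlapping character n-grams; the function name `char_ngrams` is hypothetical, and it is not a description of how Orbis, YUFind, or any particular Lucene/Solr tokenizer is configured.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L yields L - n + 1 n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


title = "中日韩经济合作的新起点"
bigrams = char_ngrams(title, n=2)

# Print in the same <..> notation used on the slide.
print(" ".join(f"<{gram}>" for gram in bigrams))
```

The 11-character title yields 10 bigrams, matching the tokens listed on the slide.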

  6. Background: Non-Roman (NR) Scripts in Catalog Records

  7. JACKPHY
     • Japanese
     • Arabic
     • Chinese
     • Korean
     • Persian
     • Hebrew
     • Yiddish

  8. One-to-Many (CJK)
     Example: “Mao Zedong”
     • 毛泽东 Simplified
     • 毛澤東 Traditional
     • 毛沢東 Kanji (Modern)

  9. One-to-Many (CJK)
     “Mao Zedong” in simplified Chinese characters retrieves 527 results.

  10. One-to-Many (CJK)
      The same search in traditional Chinese characters yields 154 hits. Also note paired fields.
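One way to picture the one-to-many problem is as a folding step that maps every variant form to a single search key before matching. The sketch below is an assumption for illustration only: the table `VARIANT_TO_TRADITIONAL` is hand-built, covers just the three characters that differ in the slide's example, and is not taken from any catalog system. A real deployment would need a complete, curated variant table.

```python
# Hypothetical, hand-built table: only the characters that differ
# among the three written forms of "Mao Zedong" shown on the slide.
VARIANT_TO_TRADITIONAL = str.maketrans({
    "泽": "澤",  # simplified Chinese form
    "沢": "澤",  # modern Japanese kanji form
    "东": "東",  # simplified Chinese form
})


def fold_variants(text: str) -> str:
    """Map known variant characters to one form so all spellings share a key."""
    return text.translate(VARIANT_TO_TRADITIONAL)


forms = {
    "Simplified": "毛泽东",
    "Traditional": "毛澤東",
    "Kanji (Modern)": "毛沢東",
}
keys = {label: fold_variants(form) for label, form in forms.items()}

for label, key in keys.items():
    print(f"{label}: {forms[label]} -> {key}")
```

If both the indexed text and the query were folded this way, the three spellings would retrieve the same records instead of differing result counts.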

  11. One-to-Many (Digraphs)
      The Yiddish word “Virtshaft” is entered here with two separate vavs (i.e., keystroke ‘u’ in Microsoft’s Hebrew IME): U+05D5 + U+05D5
      ווירטשאפט

  12. One-to-Many (Digraphs)
      N = 49 results

  13. One-to-Many (Digraphs)
      The same word is this time entered as a double-vav digraph, U+05F0 (via the MS Hebrew IME key combination right-alt+u)
      װירטשאפט

  14. One-to-Many (Digraphs)
      N = 11 results
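The two spellings differ only in whether the double vav is one code point (U+05F0) or two (U+05D5 U+05D5). A minimal sketch of folding the digraph into its two-letter spelling follows. The table `DIGRAPH_FOLDS` and the function `fold_digraphs` are hypothetical names, the mapping is written out explicitly rather than relying on any Unicode normalization form, and the word is spelled with escapes (no diacritics assumed) to avoid bidirectional display issues in the code.

```python
# Explicit, hand-written fold: Yiddish double-vav ligature -> vav + vav.
# Other ligatures could be added to this hypothetical table the same way.
DIGRAPH_FOLDS = {
    "\u05F0": "\u05D5\u05D5",
}


def fold_digraphs(text: str) -> str:
    """Replace each listed digraph code point with its letter sequence."""
    for digraph, expansion in DIGRAPH_FOLDS.items():
        text = text.replace(digraph, expansion)
    return text


# Remaining letters of the word: yod, resh, tet, shin, alef, pe, tet.
rest = "\u05D9\u05E8\u05D8\u05E9\u05D0\u05E4\u05D8"
two_vav_form = "\u05D5\u05D5" + rest   # typed as two separate vavs
ligature_form = "\u05F0" + rest        # typed as the double-vav digraph

print(len(two_vav_form), len(ligature_form))           # lengths differ by one
print(fold_digraphs(ligature_form) == two_vav_form)    # same key after folding
```

Applying the same fold at index time and query time would make the two entry methods equivalent for matching.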

  15. NR Spelling Suggestions
      Unhelpful suggestion?

  16. Labels and Facets
      Should script/language of query determine script/language of facets?

  17. Labels and Facets
      OR:
      • Sugimoto, Tsutomu, 1927- (11)
      • Takahashi, Mikio, 1935- (11)
      • Noguchi, Takehiko. (8)
      • Watanabe, Shin’ichirō, 1934- (7)
      Better would be:
      • 杉本つとむ, 1927- (11)
      • 高橋幹夫, 1935- (11)
      • 野口武彦. (8)
      • 渡辺信一郎, 1934- (7)
      But not both mixed together. Let end user decide?

  18. Labels and Facets
      We would like to ask library users which option is best for displaying parallel field data:
      <Original scripts>
      • 江戶 / 田中優子編.
      • Contributors: 田中優子, 1952-
      • Format: Book
      • Language: Japanese
      • Published: 東京 : 作品社, 1998.
      • Series: 日本の名随筆. 03 别卷 ; 94
      <Paired w/OS first>
      • 江戶 / 田中優子編.
      • Edo / Tanaka Yūko hen.
      • Contributors: 田中優子, 1952-
      • Tanaka, Yūko, 1952-
      • Format: Book
      • Language: Japanese
      • Published: 東京 : 作品社, 1998.
      • Tōkyō : Sakuhinsha, 1998.
      • Series: 日本の名随筆. 03 别卷 ; 94
      • Nihon no meizuihitsu. 03 Bekkan ; 94
      We would like to choose our preferred display script here. For example, <Original scripts>:
      • 江戸 By: 野村兼太郎, 1896-1960. Published: 1942. Format: Book, Electronic Resource
      • 江戶 の 翻訳家たち By: 杉本 つとむ, 1927- Published: 1995. Format: Book, Electronic Resource

  19. Language/Script of Interface
      OCLC’s brief record display. Interface easily flipped to one of several languages.

  20. Language/Script of Interface
      OCLC’s detailed record display with Japanese language interface.

  21. Language/Script of Interface
      OCLC WorldCat.org does localization of labels and instructions as well as localization of mapped facet values. Examples here in Chinese.

  22. Language/Script of Interface

  23. Language/Script of Interface & Text Directionality

  24. Sorting of Results
      • 江戸文学俗信辞典 Edo bungaku zokushin jiten
      • 江戸文学地名辞典 Edo bungaku chimei jiten
      • 江戸文学辞典 Edo bungaku jiten
      • 江戸文様辞典 Edo mon’yo jiten

  25. Sorting of Results
      Also note bi-directional text.

  26. Sorting within Result Sets: Options to Consider
      For multiple languages sharing a script, e.g. Chinese ideographs, Arabic, Hebrew, or Latin, how would users prefer to see the result sets sorted? We consider here the Chinese and Arabic cases…

  27. Sorting within Result Sets: Options to Consider
      Sorting of results returned in Chinese script. Three sort strategies:
      (a) sort by Romanized equivalents;
      (b) sort by pronunciation; or
      (c) sort by radical-stroke?
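As a rough illustration of strategy (a), the sketch below sorts the four paired titles from the earlier "Sorting of Results" slide (Japanese titles written in Chinese characters, used here as stand-in data) two ways: by the raw code points of the original script, and by the Romanized parallel field. The record layout as `(original, romanized)` tuples is an assumption for the example. Strategies (b) and (c) would need pronunciation or radical-stroke data that is not shown here.

```python
# (original script, Romanized parallel field), deliberately shuffled.
records = [
    ("江戸文様辞典", "Edo mon’yo jiten"),
    ("江戸文学辞典", "Edo bungaku jiten"),
    ("江戸文学俗信辞典", "Edo bungaku zokushin jiten"),
    ("江戸文学地名辞典", "Edo bungaku chimei jiten"),
]

# Default string comparison orders by Unicode code point.
by_code_point = sorted(records, key=lambda rec: rec[0])

# Strategy (a): order by the Romanized equivalent, ignoring case.
by_romanized = sorted(records, key=lambda rec: rec[1].casefold())

print("By code point of original script:")
for original, romanized in by_code_point:
    print(f"  {original}  {romanized}")

print("By Romanized equivalent:")
for original, romanized in by_romanized:
    print(f"  {original}  {romanized}")
```

The code-point order reproduces the order shown on that slide (zokushin, chimei, jiten, mon’yo), while the Romanized order is chimei, jiten, zokushin, mon’yo, which shows how much the choice of sort key changes what users see.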

  28. Sorting within Result Sets: Arabic Script
      How to handle additional Arabic-script characters in use for languages such as Persian, Kurdish, and/or Urdu?
      • ڤ (vah, derived from ﻑ, fah)
      • پ (pah)
      • ﭺ (chah, derived from ج, ǧim)
      • گ (gaf)
      • ژ (zāī, derived from ز, zayin)

  29. Discussion: User Needs and Expectations
