
Destination Japan: Internationalization of the Lycos Search Engine

Presented by:

Jeff Vander Clute of Lycos, Inc. &

Tina Lieu of Basis Technology Corp.



Lycos is...

A new generation Web company

- 4 top 20 Web properties in Network

- Lycos, Tripod, Angelfire, HotBot

A “hub”

Search Engine & Navigation

- Patented search & directory technology

Community & Communication

E-commerce, Content Aggregation, Etc.



The Search Technology

Created by CMU professor (Fuzzy Mauldin) & students in 1994/95.

1. “Intelligent” spidering methods (now patented), but not internationalized. Spiders crawl the web retrieving documents for indexing.

2. Back-end database of webpages, or catalog, plus relevancy algorithms for ordering search results.



First Stop: Europe

  • Lycos search technology initially for ASCII only. In-house work to make data paths 8-bit clean, to accommodate European languages.

  • Otherwise relatively straightforward. Components such as ad servers, Web servers, etc., require little if any changes.

  • Euro service came online in May 1997.



What’s Unicode? Where’s Japan?

  • The more interesting problem.

  • Business reasons to introduce Japanese search.

  • But not a lot of international(ization) experience within Lycos at the time.

  • We needed assistance and chose Basis Technology.



Goals

  • Quick deployment of Japanese search

    • 1995 to 1997, Japanese Internet more than doubling each year

    • Marketing need to launch in Japan ASAP

  • Economical and efficient solution

    • Produce reusable internationalized code

    • Poise Lycos for even quicker deployment into other languages

    • Get "more bang for the buck"



Two Main Functions of a Search Engine

  • Building a catalog: Compiling an indexed catalog of webpages from the Internet

  • Performing a query: Delivering a list of webpages matching certain keywords and parameters input by the user
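
The two functions above can be sketched with a tiny inverted index. This is only an illustration, not the Lycos implementation: the function names, the sample pages, and the whitespace-based word splitting are all assumptions made for the example.

```python
from collections import defaultdict

def build_catalog(pages):
    """Build an index mapping each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def perform_query(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample pages, for illustration only.
pages = {
    "http://example.com/a": "sushi restaurants in Tokyo",
    "http://example.com/b": "Tokyo cherry blossom viewing",
}
catalog = build_catalog(pages)
print(sorted(perform_query(catalog, "tokyo sushi")))
```

Splitting on whitespace works for English text only, which is why Japanese needs a dedicated word breaker.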



Japanese Issues for Catalog

  • Double-Byte: Japanese characters are double-byte.

  • Multiple encodings: Japanese webpages use 3 encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.

  • Options: Multiple vs. Single Catalog

    • Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement)

    OR

    • One catalog: all catalog data either in one Japanese encoding or in Unicode



Single Catalog Options

  • A) Convert all data to one Japanese encoding

    • ISO-2022-JP, Shift-JIS, or EUC-JP

  • B) Convert all data to Unicode:

  • The quick and economical choice: Unicode is . . .

    • A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages

    • More easily implemented into existing code originally written for processing single-byte ASCII



The Unicode Plan

  • Use Unicode in catalog & internal processing

    • Because all electronic text on the Web maps cleanly into Unicode

  • Required elements:

    • Character encoding conversions between Unicode and webpage encodings (Shift-JIS, ISO-2022-JP, and EUC-JP)

    • Encoding auto-detection

    • Japanese word breaking



Encoding Conversion

  • Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web)

  • From 寿司 in Shift-JIS you want 寿司 in Unicode

  • Functionality provided by Basis Technology's Rosette, embedded in Lycos code as source

    • Rosette is a cross-platform C++ library for Unicode; http://www.basistech.com/products/

    • Complete set of mapping tables between Unicode and major legacy encodings

    • Conversions performed quickly and economically with minimal impact on performance
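
As a rough illustration of the conversion step, the sketch below uses Python's built-in codecs as a stand-in for a conversion library. It does not show Rosette's API; the helper names `to_unicode` and `from_unicode` are hypothetical.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes in a legacy webpage encoding into a Unicode string."""
    return raw.decode(source_encoding)

def from_unicode(text: str, target_encoding: str) -> bytes:
    """Encode a Unicode string back into a legacy webpage encoding."""
    return text.encode(target_encoding)

word = "寿司"
for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    legacy = from_unicode(word, encoding)
    # The byte sequences differ per encoding, but all decode to the same text.
    assert to_unicode(legacy, encoding) == word
    print(encoding, legacy.hex())
```

Once every page is decoded this way, the rest of the pipeline only ever sees one representation of the text.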



Why Encoding Auto-Detection?

  • In order to convert text to another encoding, you have to know where you’re starting from. Or you could get . . .

  • Ex. Text in EUC-JP when viewed as other encodings.

    EUC-JP: 寿司コンピュータ花見

    Shift-JIS: シハ・ウ・ヤ・蝪シ・ソイヨクォ

    ASCII:
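
The effect can be reproduced with a short sketch that encodes the sample text as EUC-JP and then decodes it under the wrong assumption. Python's built-in codecs are used here as an assumption for illustration; the exact garbled characters depend on the decoder and its error handling, so no particular output is claimed.

```python
text = "寿司コンピュータ花見"
euc_bytes = text.encode("euc_jp")

# Correct guess: the original text comes back unchanged.
print(euc_bytes.decode("euc_jp"))

# Wrong guess: the same bytes read as Shift-JIS. errors="replace" keeps the
# decoder from raising on byte sequences that are invalid in Shift-JIS.
garbled = euc_bytes.decode("shift_jis", errors="replace")
print(garbled)

# 7-bit ASCII cannot represent these bytes at all.
try:
    euc_bytes.decode("ascii")
except UnicodeDecodeError as err:
    print("ASCII decode failed:", err.reason)
```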



Encoding Auto-Detection

  • Purpose: to correctly identify encoding of webpage or query in order to convert properly from one encoding to another.

  • Functionality provided by Basis Technology's Rosette

    • Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings

    • Enhanced tiebreaker functionality to auto-detect very short strings (queries)
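
A minimal sketch of one way to auto-detect among the three encodings follows. It is not Rosette's algorithm, which the slides do not describe: the escape-sequence check, the trial decoding, and the scoring heuristic are assumptions chosen for illustration, and the function names are hypothetical.

```python
def _score(text: str) -> int:
    """Score a decoded string by how plausible it looks as Japanese."""
    score = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
            score += 1  # hiragana, katakana, or common kanji
        elif 0xFF61 <= cp <= 0xFF9F:
            score -= 1  # half-width katakana, common in wrong guesses
    return score

def detect_japanese_encoding(raw: bytes, fallback: str = "shift_jis") -> str:
    """Guess the encoding of raw bytes from a Japanese webpage or query."""
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
    if b"\x1b$" in raw or b"\x1b(" in raw:
        return "iso2022_jp"
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    best, best_score = fallback, None
    for encoding in ("euc_jp", "shift_jis"):
        try:
            score = _score(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # not valid in this encoding
        # On equal scores the earlier candidate is kept.
        if best_score is None or score > best_score:
            best, best_score = encoding, score
    return best  # the fallback is returned only if nothing decoded

sample = "寿司コンピュータ花見"
for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    print(encoding, "->", detect_japanese_encoding(sample.encode(encoding)))
```

Very short inputs such as queries give such a heuristic little evidence to work with, which is the case the tiebreaker bullet above addresses.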



Japanese Word Breaking

  • Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index.

  • Problem: Japanese words are not delimited by spaces

  • Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/)

    • Dictionary-based Japanese word breaking

    • Elimination of stop words (ex. “a”, “the”, etc.)

    • Looks for longest word match
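
The dictionary-based, longest-match idea can be sketched as a greedy segmenter. This is a toy illustration, not the Basis Technology analyzer, which performs real morphological analysis: the tiny dictionary, the stop-word set of Japanese particles, and the name `break_words` are all assumptions.

```python
# Toy dictionary and stop words, assumed for illustration only.
DICTIONARY = {"寿司", "コンピュータ", "花見", "東京", "東京都", "の", "を", "で"}
STOP_WORDS = {"の", "を", "で"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def break_words(text):
    """Greedy longest-match segmentation with stop-word removal."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in DICTIONARY:
                break
        else:
            candidate = text[pos]  # unknown character: emit it on its own
        pos += len(candidate)
        if candidate not in STOP_WORDS:
            words.append(candidate)
    return words

print(break_words("東京都で寿司"))
```

Because the longest match wins, 東京都 is returned as one word rather than 東京 followed by 都.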



Selecting Unicode Representation (1) UCS2 characteristics

  • Depending on the task, different parts of the Lycos search system used either the UCS2 or the UTF8 representation of Unicode

  • Characteristics of UCS2

    • Each coded character element is fixed width, 16 bits

    • Data paths must all accommodate 16 bits

    • Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)



Selecting Unicode Representation (2) UTF8 characteristics

  • Characteristics of UTF8

    • Each coded character is composed of one to six octets (one octet = 8 bits)

    • Data paths need only be "8-bit clean"

    • None of the octets in a multi-byte character is null (i.e., has the value zero)

    • Text in UTF8 is difficult to manipulate or analyze.

  • "8-bit clean" = computer code that treats all 8 bits of a byte as significant. This is true of any code that processes European languages properly, but not necessarily of code that processes only ASCII, which uses just 7 bits per character.
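
The difference between the two representations can be seen by encoding a few characters both ways. In this sketch Python's "utf-16-be" codec stands in for UCS2, which is an assumption that holds only for characters in the Basic Multilingual Plane; the helper name `widths` is hypothetical.

```python
def widths(ch: str):
    """Return (UCS2 byte count, UTF8 byte count) for one BMP character."""
    ucs2 = ch.encode("utf-16-be")  # same bytes as UCS2 for BMP characters
    utf8 = ch.encode("utf-8")
    return len(ucs2), len(utf8)

samples = {
    "ASCII": "A",
    "Latin with diacritical": "\u00e9",  # e with acute accent
    "Japanese": "寿",
}
for label, ch in samples.items():
    print(label, widths(ch))

# UCS2 text contains zero bytes for ASCII characters, which breaks code that
# treats a zero byte as a string terminator; UTF8 text never does.
text = "Lycos 寿司"
print(b"\x00" in text.encode("utf-16-be"))
print(b"\x00" in text.encode("utf-8"))
```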



UCS2, UTF8, ASCII, etc.

[Figure: an ASCII character (7 bits), a Latin character with a diacritical (8 bits), and a double-byte Japanese character (in Shift-JIS, EUC-JP, etc.), each shown as UTF8 and as UCS2 entering an 8-bit clean data pipe. Annotation: 16-bit UCS2 can't fit :(]



Unicode in the Lycos System

  • UCS2: Japanese Morphological Analyzer from Basis Technology

    • Using UCS2 is the quick and economical way to process huge volumes of Japanese text.

  • UTF8: Lycos Catalog

    • Economy of disk space: ASCII is smaller in UTF8. On the Web*: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16%

    • Ease of integration with existing code (a.k.a. transmissibility)

  • *Based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
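
A back-of-the-envelope estimate shows why UTF8 saves disk space for a catalog with this mix. The sketch makes several loud assumptions that are not in the slides: the host percentages are treated as shares of characters, "less than 5%" is rounded up to 5%, Asian characters take 3 bytes in UTF8, and European and other characters are pessimistically counted as 2 bytes each.

```python
# Shares from the slide, treated as character shares (an assumption).
shares = {"ascii": 0.79, "asian": 0.05, "european_other": 0.16}

# Assumed bytes per character in each representation.
utf8_bytes = {"ascii": 1, "asian": 3, "european_other": 2}
ucs2_bytes = {"ascii": 2, "asian": 2, "european_other": 2}

def average_bytes(bytes_per_char):
    """Weighted average of bytes per character across the three groups."""
    return sum(shares[group] * bytes_per_char[group] for group in shares)

utf8_avg = average_bytes(utf8_bytes)
ucs2_avg = average_bytes(ucs2_bytes)
print(round(utf8_avg, 2), round(ucs2_avg, 2))
```

Under these assumptions UTF8 averages about 1.26 bytes per character against 2 for UCS2, even though each Japanese character individually costs more in UTF8.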



Project Complete: Lycos Japan (1)

  • Quick: Prototype of Japanese search is produced in two months. Lycos Japan: http://www.lycos.co.jp

    • Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place*

    • Upon formal launch grabs 2nd place in October 1998*

  • *According to Search Desk, http://www.searchdesk.com



Project Complete: Lycos Japan (2)

  • E-conomical: Today, Lycos has spider, catalog, and query software that can easily be configured to build catalogs in different languages by swapping localized pieces in and out:

    • Settings for target domains

    • Encoding detection and conversion calls

    • Language-specific word breaker (if needed)
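
One way to picture the swappable pieces is a per-language profile that bundles the three items listed above. This is a hypothetical sketch, not Lycos code: `LanguageProfile`, the domain lists, the trial-decoding order, and the per-character breaker (a placeholder for a real Japanese word breaker) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class LanguageProfile:
    target_domains: Tuple[str, ...]           # settings for target domains
    encodings: Tuple[str, ...]                # encodings to try, in order
    break_words: Callable[[str], List[str]]   # language-specific word breaker

def whitespace_breaker(text: str) -> List[str]:
    """Word breaker for languages that delimit words with spaces."""
    return text.split()

def per_character_breaker(text: str) -> List[str]:
    """Placeholder for a real Japanese word breaker: one unit per character."""
    return [ch for ch in text if not ch.isspace()]

PROFILES = {
    "en": LanguageProfile((".com", ".uk"), ("ascii",), whitespace_breaker),
    "ja": LanguageProfile((".jp",), ("iso2022_jp", "euc_jp", "shift_jis"),
                          per_character_breaker),
}

def index_terms(raw: bytes, profile: LanguageProfile) -> List[str]:
    """Decode a page with the first encoding that fits, then break words."""
    for encoding in profile.encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return profile.break_words(text)
    return []  # none of the profile's encodings could decode the page

print(index_terms(b"hello world", PROFILES["en"]))
print(index_terms("寿司".encode("iso2022_jp"), PROFILES["ja"]))
```

Adding a language then means adding one profile entry, while the spider, catalog, and query code stay unchanged.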



Q&A



Questions?

[email protected]

www.basistech.com

[email protected]

www.lycos.com



Thank you!

