Destination Japan: Internationalization of the Lycos Search Engine


Presented by:

Jeff Vander Clute of Lycos, Inc. &

Tina Lieu of Basis Technology Corp.

Lycos is...

A new generation Web company

- 4 top 20 Web properties in Network

- Lycos, Tripod, Angelfire, HotBot

A “hub”

Search Engine & Navigation

- Patented search & directory technology

Community & Communication

E-commerce, Content Aggregation, Etc.

The Search Technology

Created by CMU professor (Fuzzy Mauldin) & students in 1994/95.

1. “Intelligent” spidering methods (now patented), but not internationalized. Spiders crawl the web retrieving documents for indexing.

2. Back-end database of webpages, or catalog, plus relevancy algorithms for ordering search results.

First Stop: Europe
  • Lycos search technology initially for ASCII only. In-house work to make data paths 8-bit clean, to accommodate European languages.
  • Otherwise relatively straightforward. Components such as ad servers, Web servers, etc., require little if any changes.
  • Euro service came online in May 1997.
What’s Unicode? Where’s Japan?
  • The more interesting problem.
  • Business reasons to introduce Japanese search.
  • But not a lot of international(ization) experience within Lycos at the time.
  • We needed assistance and chose Basis Technology.
Goals
  • Quick deployment of Japanese search
    • 1995 to 1997, Japanese Internet more than doubling each year
    • Marketing need to launch in Japan ASAP
  • Economical and efficient solution
    • Produce reusable internationalized code
    • Poise Lycos for even quicker deployment into other languages
    • Get "more bang for the buck"
Two Main Functions of a Search Engine
  • Building a catalog: Compiling an indexed catalog of webpages from the Internet
  • Performing a query: Delivering a list of webpages matching certain keywords and parameters input by the user
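
The two functions can be illustrated with a toy inverted index. This is only a minimal sketch, not the Lycos implementation: the function names, the example URLs, and the whitespace-based word splitting are all assumptions made for illustration.

```python
from collections import defaultdict


def build_catalog(pages):
    """Build the catalog: map each word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def run_query(index, query):
    """Perform a query: return the URLs that contain every keyword."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    results = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, used only to exercise the two functions.
pages = {
    "http://example.com/a": "sushi restaurants in Tokyo",
    "http://example.com/b": "Tokyo travel guide",
}
catalog = build_catalog(pages)
print(sorted(run_query(catalog, "tokyo sushi")))
```

Splitting on whitespace works for English but not for Japanese, which is why word breaking becomes a separate problem later in the deck.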
Japanese Issues for Catalog
  • Double-byte: Japanese characters take two bytes each in the common legacy encodings.
  • Multiple encodings: Japanese webpages use three encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
  • Options: Multiple vs. Single Catalog
    • Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement), OR
    • One catalog: all catalog data either in one Japanese encoding or in Unicode
Single Catalog Options
  • A) Convert all data to one Japanese encoding
    • ISO-2022-JP, Shift-JIS, or EUC-JP
  • B) Convert all data to Unicode
  • The quick and economical choice: Unicode is . . .
    • A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages
    • More easily integrated into existing code originally written for processing single-byte ASCII
The Unicode Plan
  • Use Unicode in catalog & internal processing
    • Because all electronic text on the Web maps cleanly into Unicode
  • Required elements:
    • Character encoding conversions between Unicode and webpage encodings (Shift-JIS, ISO-2022-JP, and EUC-JP)
    • Encoding auto-detection
    • Japanese word breaking
Encoding Conversion
  • Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web)
  • From 寿司 in Shift-JIS you want 寿司 in Unicode
  • Functionality provided by Basis Technology's Rosette embedded in Lycos code as source
    • Rosette is a cross-platform C++ library for Unicode; http://www.basistech.com/products/
    • Complete set of mapping tables between Unicode and major legacy encodings
    • Conversions performed quickly and economically with minimal impact on performance
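
As a rough illustration of the conversion step, the sketch below uses Python's built-in codecs rather than Rosette, whose API is not shown in these slides. The helper names `to_unicode` and `transcode` are hypothetical.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes in a legacy encoding into a Unicode string."""
    return raw.decode(source_encoding)


def transcode(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert between two encodings by going through Unicode."""
    return to_unicode(raw, source_encoding).encode(target_encoding)


# Produce Shift-JIS bytes for the slide's example word, then convert them.
sushi_sjis = "寿司".encode("shift_jis")
print(to_unicode(sushi_sjis, "shift_jis"))
print(transcode(sushi_sjis, "shift_jis", "utf-8"))
```

Routing every conversion through Unicode means each legacy encoding needs only one pair of mapping tables, instead of one table for every pair of encodings.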
Why Encoding Auto-Detection?
  • In order to convert text to another encoding, you have to know where you’re starting from. Or you could get . . .
  • Example: text in EUC-JP when viewed as other encodings.

EUC-JP: 寿司 コンピュータ 花見

Shift-JIS: シハ ・ウ・ヤ・蝪シ・ソ イヨクォ

ASCII:
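
The garbling can be reproduced with Python's built-in codecs. This is a small sketch, assuming `errors="replace"` so that byte sequences invalid in Shift-JIS become replacement characters instead of raising an exception. The exact garbled output depends on the decoder, so it may not match the slide character for character.

```python
original = "寿司 コンピュータ 花見"

# Encode as EUC-JP, which is how the page author saved the text.
euc_bytes = original.encode("euc_jp")

# Wrong guess: interpret the same bytes as Shift-JIS.
misread = euc_bytes.decode("shift_jis", errors="replace")

# Right guess: interpret them as EUC-JP.
correct = euc_bytes.decode("euc_jp")

print("read as Shift-JIS:", misread)
print("read as EUC-JP:   ", correct)
```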

Encoding Auto-Detection
  • Purpose: to correctly identify encoding of webpage or query in order to convert properly from one encoding to another.
  • Functionality provided by Basis Technology's Rosette
    • Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings
    • Enhanced tiebreaker functionality to auto-detect very short strings (queries)
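
One simple way to approximate auto-detection is to try strict decoding in each candidate encoding and keep the first that succeeds. The sketch below does that with Python's built-in codecs. It is not how Rosette works (the slides do not describe its method), and the function name and the candidate order are assumptions.

```python
from typing import Optional

# Tried in this order after the ISO-2022-JP and ASCII checks (an assumption).
CANDIDATES = ("euc_jp", "shift_jis")


def guess_japanese_encoding(data: bytes) -> Optional[str]:
    """Return the first encoding that decodes the bytes without error."""
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
    if b"\x1b" in data:
        try:
            data.decode("iso2022_jp")
            return "iso2022_jp"
        except UnicodeDecodeError:
            pass
    if data.isascii():
        return "ascii"
    for name in CANDIDATES:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


sample = "寿司 コンピュータ 花見"
for encoding in ("iso2022_jp", "euc_jp", "shift_jis"):
    print(encoding, "->", guess_japanese_encoding(sample.encode(encoding)))
```

Some byte sequences are valid in both EUC-JP and Shift-JIS, and the shorter the input, the more likely that is. A first-match approach like this one will then simply return the first candidate, which is the kind of case a tiebreaker for short query strings has to handle.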
Japanese Word Breaking
  • Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index.
  • Problem: Japanese words are not delimited by spaces
  • Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/)
    • Dictionary-based Japanese word breaking
    • Elimination of stop words (e.g., “a”, “the”, etc.)
    • Looks for longest word match
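
Dictionary-based longest-match word breaking can be sketched as a greedy scan: at each position, take the longest dictionary entry that matches, and fall back to a single character if nothing matches. The tiny dictionary and stop-word list below are hypothetical; a real morphological analyzer uses a far larger dictionary and grammatical knowledge.

```python
# Toy dictionary and stop words, invented for this example.
DICTIONARY = {"東京", "東京都", "寿司", "花見", "コンピュータ", "の", "を"}
STOP_WORDS = {"の", "を"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def break_words(text):
    """Greedy longest-match segmentation against DICTIONARY."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: emit a single character
        longest = min(MAX_WORD_LEN, len(text) - i)
        for length in range(longest, 1, -1):
            candidate = text[i:i + length]
            if candidate in DICTIONARY:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words


def indexable_units(text):
    """Break the text into words, then drop stop words."""
    return [word for word in break_words(text) if word not in STOP_WORDS]


print(break_words("東京都の寿司"))
print(indexable_units("東京都の寿司"))
```

Here the longest-match rule picks 東京都 over the shorter entry 東京, and the particle の is then dropped as a stop word.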
Selecting Unicode Representation (1): UCS2 characteristics
  • Different parts of the Lycos search system use either the UCS2 or the UTF8 representation of Unicode, depending on the task
  • Characteristics of UCS2
    • Each coded character element is fixed width, 16 bits
    • Data paths must all accommodate 16 bits
    • Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)
Selecting Unicode Representation (2): UTF8 characteristics
  • Characteristics of UTF8
    • Each coded character is composed of one to six octets under the original UTF8 definition (one octet = 8 bits); characters representable in UCS2 need at most three
    • Data paths need only be "8-bit clean"
    • No octet in a multi-byte character is null (i.e., has the value zero)
    • Text in UTF8 is harder to manipulate or analyze than UCS2, because characters vary in width
  • "8-bit clean" = computer code that treats all 8 bits of a byte as significant. True of any code that processes European languages properly, but not necessarily true of code that processes only ASCII, which uses just 7 bits per character.
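
The properties of both representations can be checked with a few lines of Python. This is a sketch under one assumption: Python has no codec named UCS2, so `utf-16-be` stands in for it, which is equivalent for the characters used here. The `strip_high_bit` helper is hypothetical and mimics code that is not 8-bit clean.

```python
samples = {
    "ASCII letter": "A",
    "Latin letter with diacritical": "\u00e9",  # é
    "Japanese character": "寿",
}

for label, char in samples.items():
    utf8 = char.encode("utf-8")
    ucs2 = char.encode("utf-16-be")  # stand-in for UCS2
    print(f"{label}: UTF8 {len(utf8)} octet(s), UCS2 {len(ucs2)} octets")

# UCS2 puts a zero octet in front of every ASCII character; UTF8 never
# produces a zero octet for a non-null character.
print(b"\x00" in "A".encode("utf-16-be"))
print(b"\x00" in "A\u00e9寿".encode("utf-8"))


def strip_high_bit(data: bytes) -> bytes:
    """Mimic code that is not 8-bit clean by clearing bit 7 of each byte."""
    return bytes(byte & 0x7F for byte in data)


# ASCII survives such code; UTF8 text outside ASCII does not.
print(strip_high_bit(b"sushi") == b"sushi")
print(strip_high_bit("寿".encode("utf-8")) == "寿".encode("utf-8"))
```

The embedded zero octets are what make UCS2 awkward for existing C-style string code, which treats a zero byte as the end of a string.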
UCS2, UTF8, ASCII, etc.

[Figure: an 8-bit clean data pipe. Shown as UTF8 and as UCS2: ASCII (7 bits), a Latin character with diacritical (8 bits), and a Japanese character (double-byte; in Shift-JIS, EUC-JP, etc.). 16-bit UCS2 can’t fit through the 8-bit clean data pipe.]

Unicode in the Lycos System
  • UCS2: Japanese Morphological Analyzer from Basis Technology
    • Using UCS2 is the quick and economical way to process huge volumes of Japanese text.
  • UTF8: Lycos Catalog
    • Economy of disk space: ASCII is smaller in UTF8
    • On the Web: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16%
    • Ease of integration with existing code (a.k.a. transmissibility)
  • The Web percentages are based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
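
The disk-space argument can be made concrete with back-of-the-envelope arithmetic. The sketch below rests on loud assumptions that are not in the slides: the host percentages are treated as if they were shares of characters, every character on European and other pages is counted at two octets (pessimistic, since most of that text is plain ASCII), and the Asian share is taken at its 5% upper bound with three octets per character.

```python
# Shares from the slide, reinterpreted as character shares (an assumption).
shares = {"ascii": 0.79, "european_and_other": 0.16, "double_byte_asian": 0.05}

# Assumed UTF8 cost per character in each group.
utf8_octets = {"ascii": 1, "european_and_other": 2, "double_byte_asian": 3}

# UCS2 always uses two octets per character.
UCS2_OCTETS = 2

avg_utf8 = sum(shares[group] * utf8_octets[group] for group in shares)
avg_ucs2 = float(UCS2_OCTETS)

print(f"average octets per character in UTF8: {avg_utf8:.2f}")
print(f"average octets per character in UCS2: {avg_ucs2:.2f}")
```

Under these assumptions a UTF8 catalog averages about 1.26 octets per character against 2 for UCS2, which is the sense in which UTF8 economizes disk space for a mostly ASCII Web.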
Project Complete: Lycos Japan (1)
  • Quick: Prototype of Japanese search is produced in two months. Lycos Japan: http://www.lycos.co.jp
    • Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place*
    • Upon formal launch grabs 2nd place in October 1998*
  • *According to Search Desk, http://www.searchdesk.com
Project Complete: Lycos Japan (2)
  • E-conomical: Today, Lycos has spider, catalog, and query software that can easily be set up to build catalogs in different languages by swapping localized pieces in and out:
    • Settings for target domains
    • Encoding detection and conversion calls
    • Language-specific word breaker (if needed)
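
One way to picture the swappable pieces is a per-language profile that bundles the three settings listed above. The sketch below is purely illustrative: `LanguageProfile`, `to_index_terms`, and both example profiles are hypothetical names, and the per-character Japanese breaker is a placeholder for a real word breaker.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LanguageProfile:
    target_domains: Tuple[str, ...]           # settings for target domains
    encodings: Tuple[str, ...]                # encodings to try, in order
    word_breaker: Callable[[str], List[str]]  # language-specific word breaker


def whitespace_breaker(text: str) -> List[str]:
    """Word breaker for languages that delimit words with spaces."""
    return text.split()


def per_character_breaker(text: str) -> List[str]:
    """Placeholder for a Japanese word breaker: one unit per character."""
    return list(text)


def to_index_terms(raw: bytes, profile: LanguageProfile) -> List[str]:
    """Decode with the first encoding that fits, then break into words."""
    for encoding in profile.encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        return []
    return profile.word_breaker(text)


GERMAN = LanguageProfile((".de",), ("utf-8", "latin-1"), whitespace_breaker)
JAPANESE = LanguageProfile(
    (".jp",), ("iso2022_jp", "euc_jp", "shift_jis"), per_character_breaker
)

print(to_index_terms("Suche im Netz".encode("latin-1"), GERMAN))
print(to_index_terms("寿司".encode("euc_jp"), JAPANESE))
```

With this shape, the shared spider, catalog, and query code stays the same, and adding a language means supplying one new profile.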
Q&A

Questions?

[email protected]

www.basistech.com

[email protected]

www.lycos.com

Thank you!
