Keys to Building a Multilingual Search Engine

15th International Unicode Conference, August/September 1999
Thierry Sourbier

Client-Side (browser)
• How to make the best use of browsers when dealing with multiple languages

Server-Side
• How to provide efficient multilingual information retrieval

Search Engine Overview (diagram labels): Create index; HTTP; Submit query; Process query; Display results

Overview of the Server-side

Index creation steps:
• Normalization gives the pages a standard format
• Segmentation breaks the pages into units that will be stored in the index
• Index building

Query processing steps:
• Normalization makes sure that the query has the same format as the indexed pages
• Segmentation breaks the query into units that will be looked up in the index
• Index search

Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
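
The two pipelines above share their first two stages, which can be sketched as follows. This is a minimal illustration rather than the presenter's implementation: the function names are hypothetical, and the stand-in normalization (lowercasing, collapsing whitespace) and segmentation (splitting on spaces) are assumptions chosen only to keep the sketch short.

```python
from collections import defaultdict

def normalize(text):
    # Stand-in normalization (assumption): lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def segment(text):
    # Stand-in segmentation (assumption): split on single spaces.
    return text.split(" ") if text else []

def build_index(pages):
    # pages maps a page id to its raw text; the index maps unit -> page ids.
    index = defaultdict(set)
    for page_id, raw in pages.items():
        for unit in segment(normalize(raw)):
            index[unit].add(page_id)
    return index

def search(index, query):
    # The query goes through the same normalization and segmentation as the pages.
    units = segment(normalize(query))
    if not units:
        return set()
    hits = [index.get(unit, set()) for unit in units]
    return set.intersection(*hits)

pages = {"p1": "Unicode   Conference", "p2": "Search engine conference"}
index = build_index(pages)
print(search(index, "CONFERENCE"))          # both pages
print(search(index, "unicode conference"))  # only p1
```

Only `normalize` and `segment` would need to change per language; the index building and lookup code stays the same.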

Multilingual Normalization
• Normalizing the character encoding
  • One size fits all: Unicode
• Removing the unnecessary
  • HTML tags, extra white spaces, etc.
• Character normalization
  • Mapping together characters that have the same meaning
  • Locale dependent
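
The three normalization steps could look roughly like this. It is a sketch under stated assumptions: `normalize_page` is a hypothetical name, the regular expression is a crude tag stripper rather than a real HTML parser, and NFKC plus case folding is a locale-independent approximation of the character normalization step, which the slide notes is locale dependent.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher (assumption, not a full parser)

def normalize_page(raw_bytes, charset):
    # 1. Normalize the character encoding: decode everything to Unicode.
    text = raw_bytes.decode(charset, errors="replace")
    # 2. Remove the unnecessary: HTML tags and character references.
    text = html.unescape(TAG_RE.sub(" ", text))
    # 3. Character normalization: fold compatibility forms and case together.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Collapse the extra white space left behind by the previous steps.
    return " ".join(text.split())

page = "<p>Ｕｎｉｃｏｄｅ  <b>Conference</b></p>".encode("shift_jis")
print(normalize_page(page, "shift_jis"))  # unicode conference
```

Here the full-width letters from a Shift_JIS page and the ASCII letters end up as the same characters, so both would be found by the same query.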

Multilingual Segmentation
• Linguistic features can't be used
  • Too complex and/or costly to implement
• Relying on N-Gram
  • N-Gram = a sequence of N contiguous characters
  • N-Grams may overlap
  • Example with N=4:
    "unicode conference" => "unic", "nico", "icod", "code", "ode ", "de c", "e co", " con", ...
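
Overlapping N-gram segmentation fits in a few lines. The sketch below uses a hypothetical helper named `ngrams` and lists every run of N contiguous characters in order, spaces included.

```python
def ngrams(text, n):
    # Every run of n contiguous characters; consecutive runs overlap by n - 1.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("unicode conference", 4)[:8])
# ['unic', 'nico', 'icod', 'code', 'ode ', 'de c', 'e co', ' con']
```

A text shorter than N yields an empty list, which is why a query needs at least N characters.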

N-Grams Advantages
• Simple to implement
• Increased tolerance for typos
• "Free" morphology
• Language independent

N-Grams Disadvantages
• Index is bigger
• Minimum query length is N characters
  • Shorter queries yield no results
• May introduce "noise"
  • Sometimes the system is too tolerant (e.g. a query for "standing" may return pages containing "understand")
• Not as good as a linguistics-based IR system
  • No explicit word normalization possible
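
The minimum query length and the "noise" effect can both be reproduced with a toy index. This is only a sketch: the page texts are made up, and scoring a page by the fraction of query N-grams it contains is an assumption, not a ranking method described in the talk.

```python
from collections import defaultdict

N = 4

def ngrams(text, n=N):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(pages):
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in ngrams(text):
            index[gram].add(page_id)
    return index

def search(index, query):
    # Score each page by the fraction of the query's N-grams it contains.
    grams = ngrams(query)
    if not grams:
        return {}
    counts = defaultdict(int)
    for gram in grams:
        for page_id in index.get(gram, ()):
            counts[page_id] += 1
    return {page_id: count / len(grams) for page_id, count in counts.items()}

pages = {"a": "standing ovation", "b": "easy to understand"}
index = build_index(pages)
print(search(index, "standing"))  # "a" matches fully; "b" matches partially (noise)
print(search(index, "sta"))       # shorter than N: no N-grams, so no results
```

Page "b" is returned because "understand" shares the N-grams "stan" and "tand" with the query; a score threshold would trade that noise against typo tolerance.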

What value should N have?
• N is language dependent
• Typically we use a value between 1 and 6
• A larger N improves quality, but reduces tolerance and increases the minimum query length
• Some languages may require more than one N-Gram size
  • Japanese example
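
One way to support several N-gram sizes per language is a per-locale table consulted by the segmenter. The sketch below is illustrative only: the table, its name, and especially the sizes in it are hypothetical and are not taken from the talk.

```python
# Hypothetical per-locale settings; the sizes are illustrative, not from the talk.
NGRAM_SIZES = {"en": (4,), "ja": (1, 2)}
DEFAULT_SIZES = (3,)

def segment(text, locale):
    # Emit the N-grams for every size configured for this locale.
    units = []
    for n in NGRAM_SIZES.get(locale, DEFAULT_SIZES):
        units.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return units

print(segment("日本語", "ja"))  # ['日', '本', '語', '日本', '本語']
```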

Client-side
• Must be compatible with most browsers
  • We restrict ourselves to HTML
• We use the "standard" encodings for each language for our pages:
  • many people still use browsers that are not Unicode friendly
  • this makes content editing easier

Using a FORM
• The parameters of the query are passed via the URL to a CGI script
  • e.g. http://www.my_site.com/my_script?query=%22San+Jose%22
• What is the charset of the data sent back from the client?
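
The query string in the example URL is ordinary form encoding of the phrase "San Jose" in double quotes. A short sketch with Python's standard library reproduces it and decodes it again:

```python
from urllib.parse import parse_qs, urlencode

query_string = urlencode({"query": '"San Jose"'})
print(query_string)            # query=%22San+Jose%22
print(parse_qs(query_string))  # {'query': ['"San Jose"']}
```

For ASCII text like this the charset does not matter; the question on the slide only arises once the query contains non-ASCII characters.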

URL Encoding Issues
• Different browsers have different behaviors
• Example: a Japanese query
  • Could be submitted to the server as:
    ...search.pl?Query=%93%FA%96%7B%8C%EA
  • Or by another browser as:
    ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
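
Decoding the two submissions shows that they carry the same text in different forms. In the sketch below, the first value is treated as percent-encoded Shift_JIS bytes and the second as percent-encoded HTML numeric character references; the variable names are illustrative.

```python
import html
from urllib.parse import unquote

raw_a = "%93%FA%96%7B%8C%EA"
raw_b = "%26%2326085%3B%26%2326412%3B%26%2335486%3B"

# First browser: the bytes of the query in Shift_JIS, percent-encoded.
query_a = unquote(raw_a, encoding="shift_jis")

# Second browser: references such as &#26085; that were then percent-encoded.
query_b = html.unescape(unquote(raw_b))

print(query_a, query_b, query_a == query_b)  # 日本語 日本語 True
```

The server cannot tell from the bytes alone which decoding applies, so it needs extra information about the page the form was submitted from.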

FORM and CGI
• The server tells the client which encoding to use at the HTTP level
    <HTML>
    <HEAD>
    <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=...">
    …
    </HEAD>
    …
    </HTML>

FORM and CGI
• The client returns the information to the script using the Private FORM/CGI Protocol
• A "hidden" form field adds a parameter to the query which identifies the locale
    <form>
    ...
    <input type=hidden name=Locale value=ja>
    ...
    </form>
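
On the server, the hidden Locale field can drive how the rest of the query string is decoded. The sketch below is one possible arrangement, not the talk's script: `parse_query` and the locale-to-charset table are hypothetical, and the charset chosen for each locale is an assumption about what the pages for that locale declare.

```python
import html
from urllib.parse import unquote_to_bytes

# Hypothetical mapping from the hidden Locale field to the page charset.
LOCALE_CHARSETS = {"ja": "shift_jis", "en": "iso-8859-1"}

def parse_query(query_string):
    # First pass: keep every value as raw bytes, since the charset is unknown.
    raw = {}
    for pair in query_string.split("&"):
        name, _, value = pair.partition("=")
        raw[name] = unquote_to_bytes(value.replace("+", " "))
    # The locale itself is plain ASCII, so it can be read before anything else.
    locale = raw.get("Locale", b"").decode("ascii", errors="replace")
    charset = LOCALE_CHARSETS.get(locale, "utf-8")
    # Second pass: decode with the locale's charset, then resolve any
    # numeric character references sent by browsers that use them.
    return {name: html.unescape(value.decode(charset, errors="replace"))
            for name, value in raw.items()}

print(parse_query("Query=%93%FA%96%7B%8C%EA&Locale=ja"))
# {'Query': '日本語', 'Locale': 'ja'}
```

Unescaping character references unconditionally also rewrites a query that literally contains text such as `&amp;`; a real script would have to decide whether that trade-off is acceptable.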

Displaying the Results
• Simple if only one code set per page is required
• For multilingual content:
  • use UTF-8
  • use multiple frames
• Unexpected browser behavior
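
For the UTF-8 option, the result page and its declared charset have to agree. A minimal sketch, assuming a hypothetical `render_results` helper that returns the response headers and body for a list of result titles:

```python
import html

def render_results(titles):
    # Escape each title so that markup characters in results are shown literally.
    items = "\n".join(f"<li>{html.escape(title)}</li>" for title in titles)
    page = (
        "<html><head>"
        '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
        f"</head><body><ul>\n{items}\n</ul></body></html>"
    )
    body = page.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body

headers, body = render_results(["日本語", "Français & co"])
print(headers["Content-Type"], len(body))
```

Because every title is encoded to the same code set, results in several languages can share one page without resorting to frames.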

Conclusion
• Solutions exist to provide a robust multilingual search engine
• Code set issues on the client side can be a limitation, but they should soon disappear as more and more people use UTF-8 friendly browsers

Q&A
Thierry Sourbier, Software Developer
tsourbier@research.intl.com
