Keys to Building a Multilingual Search Engine

15th International Unicode Conference August/September, 1999. Keys to Building a Multilingual Search Engine. Thierry Sourbier. Client-Side (browser) How to make the best use of the browsers when dealing with multiple languages. Server-Side


15th International Unicode Conference

August/September, 1999

Keys to Building a Multilingual Search Engine

Thierry Sourbier

Client-Side (browser)

How to make the best use of the browsers when dealing with multiple languages

Server-Side

How to provide efficient multilingual information retrieval

Search Engine Overview

Create index


Submit query

Process query

Display results

Overview of the Server-side

Index creation steps:

Normalization: gives the pages a standard format

Segmentation: breaks the pages into units that will be stored in the index

Index building

Query processing steps:

Normalization: makes sure that the query has the same format as the indexed pages

Segmentation: breaks the query into units that will be looked up in the index

Index search

Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
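
The two pipelines above share their first two steps, which can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names, the in-memory dictionary index, the choice of N=4, and the "all units must match" search rule are assumptions made for the example.

```python
import re
from collections import defaultdict

def normalize(text):
    """Give text a standard format (assumed here: lowercase, single spaces)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def segment(text, n=4):
    """Break text into units (assumed here: overlapping character 4-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(pages):
    """Index building: map each unit to the set of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for unit in segment(normalize(text)):
            index[unit].add(page_id)
    return index

def search(index, query):
    """Index search: the query goes through the same two steps as the pages."""
    units = segment(normalize(query))
    if not units:
        return set()
    result = set(index.get(units[0], set()))
    for unit in units[1:]:
        result &= index.get(unit, set())
    return result

# Hypothetical pages, for illustration only.
pages = {"p1": "Unicode   Conference", "p2": "Search engine"}
index = build_index(pages)
print(search(index, "UNICODE"))
```

Because `normalize` and `segment` are the only functions that look at the text itself, they are the only places where language-specific behavior would need to be added.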

Multilingual Normalization
  • Normalizing the character encoding
    • One size fits all: Unicode
  • Removing the unnecessary
    • HTML tags, extra white spaces, etc.
  • Character normalization
    • Mapping together characters that have the same meaning
    • Locale dependent
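
A rough sketch of the three normalization steps listed above, using only the Python standard library. The function name, the regular-expression tag stripping, and the use of NFKC plus case folding for character normalization are assumptions chosen for illustration; the slides do not say which mappings were used, and a real system would apply locale-specific rules at that step.

```python
import html
import re
import unicodedata

def normalize_page(raw_bytes, charset):
    # 1. Normalize the character encoding: decode everything to Unicode.
    text = raw_bytes.decode(charset, errors="replace")
    # 2. Remove the unnecessary: HTML tags (crude regex, not a real parser),
    #    character references, and extra white space.
    text = re.sub(r"<[^>]*>", " ", text)
    text = html.unescape(text)
    # 3. Character normalization: map together characters with the same
    #    meaning. NFKC and casefold() are locale-independent stand-ins here.
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()

# Full-width letters and extra spaces are mapped to a single plain form.
sample = "<p>Ｕｎｉｃｏｄｅ &amp;  Café</p>".encode("utf-8")
print(normalize_page(sample, "utf-8"))
```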
Multilingual Segmentation
  • Linguistic features can’t be used
    • Too complex and/or costly to implement
  • Relying on N-Gram
    • N-Gram = a sequence of N contiguous characters
    • N-Gram may overlap
      • example with N=4
      • “unicode conference” => “unic”, “nico”, “icod”, “code”, “ode ”, “de c”, “e co”, “ con”, ...
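
The overlapping N-gram segmentation described above fits in a few lines. The function name `ngrams` is an assumption made for this sketch; note that a sliding window of width 4 also produces “ode ” between “code” and “de c”.

```python
def ngrams(text, n):
    """Return every sequence of n contiguous characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = ngrams("unicode conference", 4)
print(grams[:8])
```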
N-Grams Advantages
  • Advantages:
    • Simple to implement
    • Increased tolerance for typos
    • “Free” morphology
    • Language independent
N-Grams Disadvantages
  • Disadvantages:
    • Index is bigger
    • Minimum query length is N characters
      • a shorter query yields no results
    • May introduce “noise”
      • sometimes the system may be too tolerant (e.g., a query for “standing” may return pages containing “understand”)
    • Not as good as a linguistics-based IR system
      • no explicit word normalization possible
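
Two of the disadvantages above can be seen in a small experiment. The overlap score below (the fraction of the query's N-grams found in the document) is an assumption made for illustration; the slides do not describe how matches were ranked.

```python
def ngram_set(text, n=4):
    """Set of overlapping character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap(query, document, n=4):
    """Fraction of the query's n-grams that also occur in the document."""
    q = ngram_set(query, n)
    if not q:
        # Query shorter than n characters: no n-grams, so nothing can match.
        return 0.0
    return len(q & ngram_set(document, n)) / len(q)

# "Noise": the two words share the 4-grams "stan" and "tand".
print(overlap("standing", "understand"))
# Minimum query length: a 3-character query produces no 4-grams at all.
print(overlap("uni", "unicode"))
```

A system that accepts partial overlap gains tolerance for typos at the cost of this kind of false match, which is the trade-off the slide points out.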
What value should N have?
  • N is language dependent
    • Typically we use a value between 1 and 6
  • A larger N improves result quality, but reduces tolerance and increases the minimum query length
  • Some languages may require more than one N-Gram size
    • Japanese example
Client Side
  • Must be compatible with most browsers
    • We restrict ourselves to HTML
    • We use the “standard” encodings for each language for our pages:
      • many people still use browsers that are not Unicode friendly
      • this makes content editing easier
Using a FORM
  • The parameters of the query are passed via the URL to a CGI script
      • e.g:
  • What is the charset of the data sent back from the client?
URL Encoding Issues
  • Different browsers have different behaviors
    • Example: a Japanese query
    • Could be submitted to the server as:
    • Or by another browser as:
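
To illustrate why the charset question matters, the sketch below percent-encodes one Japanese string under three encodings that browsers of the period could plausibly use. The query string and the choice of encodings are assumptions for the example; they are not the values shown on the original slide.

```python
from urllib.parse import quote, unquote_to_bytes

query = "日本語"  # hypothetical query, not taken from the slides

# The same characters become different bytes, and so different URLs.
as_sjis = quote(query, encoding="shift_jis")
as_euc = quote(query, encoding="euc_jp")
as_utf8 = quote(query, encoding="utf-8")
for label, value in [("Shift_JIS", as_sjis), ("EUC-JP", as_euc), ("UTF-8", as_utf8)]:
    print(label, value)

# Decoding with the wrong charset does not give the query back.
wrong = unquote_to_bytes(as_sjis).decode("euc_jp", errors="replace")
print(wrong == query)
```

The URL itself carries no indication of which encoding produced the bytes, so the server needs that information from somewhere else.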
FORM and CGI
  • The server tells the client which encoding to use at the HTTP level
      • <HTML>
      • <HEAD>
      • <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=...">
      • </HEAD>
      • ….
      • </HTML>
FORM and CGI (continued)
  • The client returns the information to the script using the Private FORM/CGI Protocol
    • A “hidden” form field adds a parameter to the query which identifies the locale
      • <form>
      • ...
      • <input type=hidden name=Locale value=ja>
      • ...
      • </form>
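
On the script side, the hidden `Locale` field can then be used to pick the charset for decoding the rest of the query. The sketch below assumes a query parameter named `q` and a hypothetical locale-to-charset table matching the encodings the pages were served in; neither appears in the slides.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical table: the charset each locale's pages are served in.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1"}
DEFAULT_CHARSET = "iso-8859-1"

def decode_query(query_string):
    """Return (locale, query text) from a raw URL query string."""
    # First pass: Latin-1 maps every byte to a character, so the ASCII
    # Locale value can be read before the real charset is known.
    raw = parse_qs(query_string, encoding="latin-1")
    locale = raw.get("Locale", ["en"])[0]
    charset = LOCALE_CHARSETS.get(locale, DEFAULT_CHARSET)
    # Second pass: decode the percent-encoded bytes with the right charset.
    decoded = parse_qs(query_string, encoding=charset, errors="replace")
    return locale, decoded.get("q", [""])[0]

# Simulate a browser submitting the form in the page's own encoding.
qs = urlencode({"q": "日本語", "Locale": "ja"}, encoding="shift_jis")
print(decode_query(qs))
```

This only works if the browser submits the form in the encoding the page was served in, which is the behavior the protocol above relies on.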
Displaying the Results
  • Simple if only one code set per page is required
  • For multilingual content:
    • use UTF-8
    • use multiple frames
  • Unexpected browser behavior
  • Solutions exist to provide a robust multilingual search engine
  • Code set issues on the client side can be a limitation, but this will soon disappear as more and more people use UTF-8-friendly browsers

Thierry Sourbier

Software Developer