15th International Unicode Conference

Keys to Building a Multilingual Search Engine



15th International Unicode Conference, August/September, 1999. Keys to Building a Multilingual Search Engine. Thierry Sourbier.




15th International Unicode Conference

August/September, 1999

Keys to Building a Multilingual Search Engine

Thierry Sourbier


Search Engine Overview

  • Client-Side (browser)

    • How to make the best use of the browsers when dealing with multiple languages

  • Server-Side

    • How to provide efficient multilingual information retrieval

[Diagram labels: Create index, HTTP, Submit query, Process query, Display results]


Overview of the Server-side

  • Index creation steps:

    • Normalization: gives the pages a standard format

    • Segmentation: breaks the pages into units that will be stored in the index

    • Index building

  • Query processing steps:

    • Normalization: makes sure that the query has the same format as the indexed pages

    • Segmentation: breaks the query into units that will be looked up in the index

    • Index search

Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
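
The two pipelines share their first two steps, which a small sketch can make concrete. This is only an illustration, not the speaker's implementation: the function names, the in-memory index, and the choice of overlapping 4-character units for segmentation (the N-Gram approach introduced in the later slides) are all assumptions.

```python
import re
from collections import defaultdict


def normalize(text):
    """Give text a standard format: collapse whitespace, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def segment(text, n=4):
    """Break text into overlapping units of n characters (assumed unit type)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(pages):
    """Index creation: map each unit to the ids of the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for unit in segment(normalize(text)):
            index[unit].add(page_id)
    return index


def search(index, query):
    """Query processing: same normalization and segmentation, then lookup."""
    units = segment(normalize(query))
    if not units:
        return set()
    return set.intersection(*(index.get(unit, set()) for unit in units))


# Hypothetical two-page collection.
pages = {1: "Unicode   Conference", 2: "Search engine"}
index = build_index(pages)
print(search(index, "unicode"))
print(search(index, "ENGINE"))
```

Because the query goes through the same `normalize` and `segment` functions as the pages, only those two functions would need to change to support another language.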


Multilingual Normalization

  • Normalizing the character encoding

    • One size fits all: Unicode

  • Removing the unnecessary

    • HTML tags, extra white spaces, etc.

  • Character normalization

    • Mapping together characters that have the same meaning

    • Locale dependent
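
The three normalization steps can be sketched as follows. This is a rough illustration under several assumptions: the tag stripping is a crude regular expression and not a real HTML parser, Unicode NFKC plus case folding stands in for "mapping together characters that have the same meaning", and `LOCALE_MAPS` is a hypothetical table showing where locale-dependent mappings could plug in.

```python
import html
import re
import unicodedata

# Hypothetical locale-dependent mappings (illustrative only).
LOCALE_MAPS = {
    "de": str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"}),
}


def normalize_page(raw_bytes, charset, locale=None):
    # 1. Normalize the character encoding: everything becomes Unicode text.
    text = raw_bytes.decode(charset, errors="replace")
    # 2. Remove the unnecessary: tags (crudely), then entities.
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    # 3. Character normalization: compatibility forms and case.
    text = unicodedata.normalize("NFKC", text).casefold()
    if locale in LOCALE_MAPS:
        text = text.translate(LOCALE_MAPS[locale])
    # Collapse extra white space last, since earlier steps may add some.
    return re.sub(r"\s+", " ", text).strip()


latin1_page = "<B>Caf\u00e9</B>  &amp;  Th\u00e9".encode("iso-8859-1")
print(normalize_page(latin1_page, "iso-8859-1"))
print(normalize_page("M\u00fcller".encode("utf-8"), "utf-8", locale="de"))
```

The same function would be applied to queries, so that a query and the indexed pages end up in the same form.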


Multilingual Segmentation

  • Linguistic features can’t be used

    • Too complex and/or costly to implement

  • Relying on N-Gram

    • N-Gram = a sequence of N contiguous characters

    • N-Grams may overlap

      • example with N=4

      • “unicode conference” => “unic”, “nico”, “icod”, “code”, “ode ”, “de c”, “e co”, “ con”, ...
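
Overlapping N-Grams can be produced with a sliding window over the text. The sketch below is a minimal version; the function name `ngrams` is an assumption, and the input is taken to be already normalized.

```python
def ngrams(text, n):
    """Return every sequence of n contiguous characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = ngrams("unicode conference", 4)
print(grams[:8])
# ['unic', 'nico', 'icod', 'code', 'ode ', 'de c', 'e co', ' con']
```

A text of length L yields L - N + 1 grams, so a text shorter than N yields none.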


N-Grams Advantages

  • Advantages:

    • Simple to implement

    • Increased tolerance for typos

    • “Free” morphology

    • Language independent
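
The tolerance for typos and the "free" morphology both come from partial overlap between gram sets. One way to see this, as a sketch only, is to measure the fraction of a query's grams found in a candidate word. The scoring function `overlap` and the choice of N=3 are assumptions; the slides do not say how matches are ranked.

```python
def ngram_set(text, n=3):
    """Set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(query, candidate, n=3):
    """Fraction of the query's grams that also occur in the candidate."""
    query_grams = ngram_set(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & ngram_set(candidate, n)) / len(query_grams)


# Morphology for free: every gram of "search" occurs in "searching".
print(overlap("search", "searching"))
# Typo tolerance: the misspelled query still shares 5 of its 7 grams.
print(overlap("confrence", "conference"))
```

No stemming rules or dictionaries are involved, which is why the same code works regardless of language.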


N-Grams Disadvantages

  • Disadvantages:

    • Index is bigger

    • Minimum query length is N characters

      • shorter queries will yield no results

    • May introduce “noise”

      • sometimes the system may be too tolerant (e.g., a query for “standing” may return pages containing “understand”)

    • Not as good as a linguistics-based IR system

      • no explicit word normalization possible
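
Two of these disadvantages are easy to reproduce. The sketch below assumes N=4 and a hypothetical helper `ngram_set`; it shows the grams that “standing” and “understand” have in common, and that a query shorter than N produces no grams at all.

```python
def ngram_set(text, n=4):
    """Set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Noise: unrelated words can share grams.
shared = ngram_set("standing") & ngram_set("understand")
print(sorted(shared))

# Minimum query length: a 3-character query has no 4-grams to look up.
print(ngram_set("san"))
```

Whether shared grams like these actually surface a page depends on how the engine combines and ranks gram matches, which the slides do not specify.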


What value should N have?

  • N is language dependent

    • Typically we use a value between 1 and 6

  • A larger N-Gram size improves result quality, but reduces tolerance and increases the minimum query length

  • Some languages may require more than one N-Gram size

    • Japanese example
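
One possible way to combine several N-Gram sizes is to keep one index per size and pick the largest size that fits the query. This is a sketch under stated assumptions, not the approach from the talk: the slides do not say which sizes are used for Japanese, so `SIZES = (1, 2)`, the selection rule, and the sample pages are all hypothetical.

```python
from collections import defaultdict

SIZES = (1, 2)  # assumed sizes; the slides do not give the actual values


def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(pages, sizes=SIZES):
    """Build one gram -> page ids mapping per N-Gram size."""
    index = {n: defaultdict(set) for n in sizes}
    for page_id, text in pages.items():
        for n in sizes:
            for gram in ngrams(text, n):
                index[n][gram].add(page_id)
    return index


def search(index, query):
    """Use the largest indexed size that is not longer than the query."""
    usable = [n for n in index if n <= len(query)]
    if not usable:
        return set()
    n = max(usable)
    return set.intersection(*(index[n].get(g, set()) for g in ngrams(query, n)))


pages = {1: "日本語の検索", 2: "東京の本"}
index = build_index(pages)
print(search(index, "本"))      # single character: falls back to N=1
print(search(index, "日本語"))  # longer query: uses N=2
```

The smaller size keeps one-character queries answerable, while the larger size limits noise for longer ones, at the cost of a bigger index.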


Client-side

  • Must be compatible with most browsers

    • We restrict ourselves to HTML

    • We use the “standard” encodings for each language for our pages:

      • many people still use browsers that are not Unicode friendly

      • this makes content editing easier


Using a FORM

  • The parameters of the query are passed via the URL to a CGI script

    • e.g.: http://www.my_site.com/my_script?query=%22San+Jose%22

  • What is the charset of the data sent back from the client?


    URL Encoding Issues

    • Different browsers have different behaviors

      • Example: a Japanese query

      • Could be submitted to the server as:

        • ...search.pl?Query=%93%FA%96%7B%8C%EA

      • Or by another browser as:

        • ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
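
Decoding the two query strings shows what each browser actually sent. In the sketch below, the first value is percent-encoded Shift_JIS bytes and the second is percent-encoded HTML numeric character references; both come out as the same three-character Japanese word. Only Python standard library calls are used, and the variable names are arbitrary.

```python
import html
from urllib.parse import unquote, unquote_to_bytes

# Browser 1: raw bytes in the page's legacy encoding (Shift_JIS here).
sjis_query = unquote_to_bytes("%93%FA%96%7B%8C%EA").decode("shift_jis")

# Browser 2: the characters were sent as numeric character references.
ncr_text = unquote("%26%2326085%3B%26%2326412%3B%26%2335486%3B")
ncr_query = html.unescape(ncr_text)

print(sjis_query)  # 日本語
print(ncr_text)    # &#26085;&#26412;&#35486;
print(ncr_query)   # 日本語
```

The server cannot tell from the URL alone which interpretation applies, which is the question the previous slide raises.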


    FORM and CGI

    • The server tells the client which encoding to use at the HTTP level

      • <HTML>

      • <HEAD>

      • <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=...">

      • </HEAD>

      • ….

      • </HTML>
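
On the server, emitting such a page could look like the sketch below. It assumes a hypothetical `LOCALE_CHARSETS` table and function name; the point is only that the declared charset and the bytes actually sent must agree, so that the browser submits form data in a known encoding.

```python
# Hypothetical mapping from locale to the "standard" encoding for its pages.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1"}


def render_search_page(locale, body_text):
    """Return an HTTP header line and the page bytes in the same charset."""
    charset = LOCALE_CHARSETS[locale]
    page = (
        "<HTML>\n<HEAD>\n"
        f'<META HTTP-EQUIV="content-type" CONTENT="text/html; charset={charset}">\n'
        "</HEAD>\n"
        f"<BODY>{body_text}</BODY>\n</HTML>\n"
    )
    header = f"Content-Type: text/html; charset={charset}"
    # Encode with the declared charset so declaration and content match.
    return header, page.encode(charset)


header, body = render_search_page("ja", "検索")
print(header)
```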


    FORM and CGI

    • The client returns the information to the script using the Private FORM/CGI Protocol

      • A “hidden” form field adds a parameter to the query which identifies the locale

        • <form>

        • ...

        • <input type=hidden name=Locale value=ja>

        • ...

        • </form>
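
On the receiving side, the script can read the hidden `Locale` parameter first and use it to choose how to decode the rest of the query. The sketch below is an assumption about how such a script might work, not the original CGI code: `LOCALE_CHARSETS`, `decode_query`, and the fallback to ISO-8859-1 are hypothetical, while `Query` and `Locale` are the parameter names shown in the slides.

```python
from urllib.parse import parse_qs

# Hypothetical mapping from the Locale field to the charset of that form page.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1"}


def decode_query(query_string):
    """Return (locale, query text) decoded using the locale's charset."""
    # First pass: Locale is plain ASCII, so any byte-preserving charset works.
    raw = parse_qs(query_string, encoding="latin-1")
    locale = raw.get("Locale", [""])[0]
    charset = LOCALE_CHARSETS.get(locale, "iso-8859-1")
    # Second pass: decode the percent-encoded bytes with the right charset.
    fields = parse_qs(query_string, encoding=charset, errors="replace")
    return locale, fields.get("Query", [""])[0]


print(decode_query("Query=%93%FA%96%7B%8C%EA&Locale=ja"))
```

Once decoded, the query text can go through the same normalization and segmentation as the indexed pages.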


    Displaying the Results

    • Simple if only one code set per page is required

    • For multilingual content:

      • use UTF-8

      • use multiple frames

    • Unexpected browser behavior


    Conclusion

    • Solutions exist to provide a robust multilingual search engine

    • Code set issues on the client side can be a limitation, but this will soon disappear as more and more people use UTF-8 friendly browsers


    Q&A

    Thierry Sourbier

    Software Developer

    [email protected]

