Keys to Building a Multilingual Search Engine

15th International Unicode Conference August/September, 1999. Keys to Building a Multilingual Search Engine. Thierry Sourbier. Client-Side (browser) How to make the best use of the browsers when dealing with multiple languages. Server-Side

Presentation Transcript


  1. Keys to Building a Multilingual Search Engine • Thierry Sourbier • 15th International Unicode Conference, August/September 1999

  2. Search Engine Overview • Client-Side (browser) • How to make the best use of browsers when dealing with multiple languages • Server-Side • How to provide efficient multilingual information retrieval • Diagram labels: Create index, HTTP, Submit query, Process query, Display results

  3. Overview of the Server-side • Index creation steps: • Normalization gives the pages a standard format • Segmentation breaks the pages into units that will be stored in the index • Index building • Query processing steps: • Normalization makes sure that the query has the same format as the indexed pages • Segmentation breaks the query into units that will be looked up in the index • Index search • Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
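The two pipelines above can be sketched as a toy inverted index. This is a minimal illustration, not the system described in the talk: `normalize`, `segment`, `build_index`, and `search` are hypothetical names, and the lowercase-and-split placeholders stand in for whatever language-dependent logic a real engine would plug in.

```python
from collections import defaultdict


def normalize(text):
    # Placeholder normalization (assumption): lowercase, collapse whitespace.
    return " ".join(text.lower().split())


def segment(text):
    # Placeholder segmentation (assumption): split on single spaces.
    return text.split(" ")


def build_index(pages):
    """Index creation: map each unit to the ids of the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for unit in segment(normalize(text)):
            index[unit].add(page_id)
    return index


def search(index, query):
    """Query processing: same normalization and segmentation, then lookup."""
    units = segment(normalize(query))
    hits = [index.get(unit, set()) for unit in units]
    return set.intersection(*hits) if hits else set()


pages = {1: "Unicode   Conference", 2: "Search engine conference"}
index = build_index(pages)
print(search(index, "CONFERENCE"))
print(search(index, "unicode conference"))
```

Only `normalize` and `segment` would need to change per language; `build_index` and `search` never inspect the text itself, which is the dependency reduction the slide aims for.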

  4. Multilingual Normalization • Normalizing the character encoding • One size fits all: Unicode • Removing the unnecessary • HTML tags, extra white spaces, etc. • Character normalization • Mapping together characters that have the same meaning • Locale dependent
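A rough sketch of the three normalization stages, assuming the page arrives as bytes with a known charset. The function name `normalize_page` is hypothetical, the regex-based tag stripping is a simplification, and NFKC plus `casefold()` is only a locale-independent approximation of the locale-dependent character mapping the slide mentions.

```python
import html
import re
import unicodedata


def normalize_page(raw_bytes, charset):
    # 1. Normalize the character encoding: decode everything to Unicode.
    text = raw_bytes.decode(charset, errors="replace")
    # 2. Remove the unnecessary: tags, entities, extra white space.
    text = re.sub(r"<[^>]*>", " ", text)
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Character normalization: NFKC folds compatibility variants
    #    (e.g. fullwidth letters); casefold() maps case variants together.
    return unicodedata.normalize("NFKC", text).casefold()


page = "<b>Ｕｎｉｃｏｄｅ</b>  検索".encode("shift_jis")
print(normalize_page(page, "shift_jis"))
```

Queries would go through the same function so that they end up in the same form as the indexed pages.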

  5. Multilingual Segmentation • Linguistic features can’t be used • Too complex and/or costly to implement • Relying on N-Grams • N-Gram = a sequence of N contiguous characters • N-Grams may overlap • Example with N=4 • “unicode conference” => “unic”, “nico”, “icod”, “code”, “ode ”, “de c”, “e co”, “ con”, ...
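Overlapping N-Gram segmentation fits in a few lines. The helper name `ngrams` is an assumption for illustration; the slide does not name any function.

```python
def ngrams(text, n):
    """Return the overlapping n-grams of text, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = ngrams("unicode conference", 4)
print(grams[:8])
# ['unic', 'nico', 'icod', 'code', 'ode ', 'de c', 'e co', ' con']
```

Note that spaces are treated like any other character, so N-Grams can span word boundaries.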

  6. N-Grams Advantages • Advantages: • Simple to implement • Increased tolerance for typos • “Free” morphology • Language independent

  7. N-Grams Disadvantages • Disadvantages: • Index is bigger • Minimum query length is N characters • a shorter query yields no results • May introduce “noise” • sometimes the system is too tolerant (e.g., a query for “standing” may return pages containing “understand”) • Not as good as a linguistics-based IR system • no explicit word normalization possible
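The trade-offs on the last two slides can be observed with a simple overlap score. This is a sketch under assumptions: `overlap` is a hypothetical scoring function (the fraction of query N-Grams present in the document), not the ranking the talk's system used.

```python
def ngram_set(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(query, document, n=4):
    """Fraction of the query's n-grams that also occur in the document."""
    q = ngram_set(query, n)
    if not q:
        return 0.0  # query shorter than n characters: nothing to look up
    return len(q & ngram_set(document, n)) / len(q)


# Typo tolerance: a misspelled query still shares n-grams with the target.
print(overlap("confrence", "conference"))
# Noise: "standing" shares "stan" and "tand" with "understand".
print(overlap("standing", "understand"))
# Minimum query length: a 3-character query has no 4-grams at all.
print(overlap("uni", "unicode"))
```

The same mechanism produces both the tolerance and the noise, so a real system would need a score threshold to separate the two.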

  8. What value should N have? • N is language dependent • Typically we use a value between 1 and 6 • A larger N improves quality, but reduces tolerance and increases the minimum query length • Some languages may require more than one N-Gram size • Japanese example
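Indexing with more than one N-Gram size can be sketched as follows. The transcript does not preserve the talk's Japanese example, so the sizes `(1, 2)` and the helper `multi_ngrams` below are purely illustrative assumptions, not the configuration the author used.

```python
def multi_ngrams(text, sizes):
    """Return a dict mapping each size n to the overlapping n-grams of text."""
    return {n: [text[i:i + n] for i in range(len(text) - n + 1)] for n in sizes}


# Hypothetical sizes, chosen only for illustration.
grams = multi_ngrams("日本語", sizes=(1, 2))
print(grams[1])  # ['日', '本', '語']
print(grams[2])  # ['日本', '本語']
```

With several sizes in the index, a query only needs to be as long as the smallest size, at the cost of a larger index.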

  9. Client-side • Must be compatible with most browsers • We restrict ourselves to HTML • We use the “standard” encodings for each language for our pages: • many people still use browsers that are not Unicode friendly • this makes content editing easier

  10. Using a FORM • The parameters of the query are passed via the URL to a CGI script • e.g: http://www.my_site.com/my_script?query=%22San+Jose%22 • What is the charset of the data sent back from the client?
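On the server, the script has to undo the URL encoding before it can use the query. A minimal sketch with Python's standard library (the talk's own examples use a Perl script, so this is a stand-in, not the author's code):

```python
from urllib.parse import parse_qs, urlsplit

url = "http://www.my_site.com/my_script?query=%22San+Jose%22"
params = parse_qs(urlsplit(url).query)
print(params["query"][0])  # "San Jose", including the double quotes
```

For pure ASCII input like this the charset question does not arise; it only matters once the query contains bytes above 0x7F.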

  11. URL Encoding Issues • Different browsers have different behaviors • Example: a Japanese query • Could be submitted to the server as: • ...search.pl?Query=%93%FA%96%7B%8C%EA • Or by another browser as: • ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
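Both submissions above carry the same three characters. The first is the percent-encoded Shift_JIS byte sequence; the second is a percent-encoded string of HTML numeric character references. A short sketch that decodes each form:

```python
import html
from urllib.parse import unquote, unquote_to_bytes

raw_bytes_form = "%93%FA%96%7B%8C%EA"
ncr_form = "%26%2326085%3B%26%2326412%3B%26%2335486%3B"

# Form 1: percent-decode to bytes, then interpret the bytes as Shift_JIS.
from_bytes = unquote_to_bytes(raw_bytes_form).decode("shift_jis")

# Form 2: percent-decode to "&#26085;&#26412;&#35486;", then resolve
# the numeric character references.
from_ncr = html.unescape(unquote(ncr_form))

print(from_bytes, from_ncr)  # 日本語 日本語
```

The server cannot tell from the bytes alone which charset the first form uses, which is why the following slides add an explicit hint to the form.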

  12. FORM and CGI • The server tells the client which encoding to use at the HTTP level • <HTML> • <HEAD> • <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=..."> • … • </HEAD> • …. • </HTML>

  13. FORM and CGI • The client returns the information to the script using the Private FORM/CGI Protocol • A “hidden” form field adds a parameter to the query which identifies the locale • <form> • ... • <input type=hidden name=Locale value=ja> • ... • </form>
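A CGI script can use the hidden `Locale` parameter to pick the charset for decoding the rest of the query. The sketch below is an assumption-laden illustration: the `LOCALE_CHARSETS` table, the `decode_query` helper, and the `Query` parameter name (borrowed from the earlier URL example) are hypothetical, and the charsets listed are simply common legacy choices for those locales.

```python
from urllib.parse import parse_qs

# Hypothetical table: the charset each localized page is served in.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1", "ru": "koi8-r"}


def decode_query(query_string, default_charset="iso-8859-1"):
    # First pass: the Locale value is plain ASCII, so a byte-preserving
    # charset such as ISO-8859-1 is enough to read it.
    probe = parse_qs(query_string, encoding="iso-8859-1")
    locale = probe.get("Locale", [""])[0]
    charset = LOCALE_CHARSETS.get(locale, default_charset)
    # Second pass: decode the parameters with the charset for that locale.
    params = parse_qs(query_string, encoding=charset, errors="replace")
    return locale, charset, params.get("Query", [""])[0]


print(decode_query("Query=%93%FA%96%7B%8C%EA&Locale=ja"))
```

This only works if the browser submits the form in the same charset as the page it was served in, which is the behavior the META declaration on the previous slide relies on.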

  14. Displaying the Results • Simple if only one code set per page is required • For multilingual content: • use UTF-8 • use multiple frames • Unexpected browser behavior
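Why a single legacy code set is not enough for a multilingual results page can be checked directly. The helper `encodable` and the sample string are illustrative assumptions:

```python
def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


mixed = "日本語 / français / русский"
for charset in ("iso-8859-1", "koi8-r", "utf-8"):
    print(charset, encodable(mixed, charset))
```

Each legacy code set covers only some of the languages in the string, while UTF-8 covers all of them; the frames approach sidesteps the problem by giving each frame its own code set.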

  15. Conclusion • Solutions exist for building a robust multilingual search engine • Code set issues on the client side can be a limitation, but they will soon disappear as more and more people use UTF-8 friendly browsers

  16. Q&A • Thierry Sourbier, Software Developer • tsourbier@research.intl.com
