
A search engine for a mixture of European languages


Presentation Transcript


  1. A search engine for a mixture of European languages

  2. Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo

  3. Introduction • Goals • Approach • Project name

  4. Goals • Build a cross-language search engine for a large collection of European web documents • Participate in WebCLEF • Create topics • Submit runs • Do something extra

  5. Main Challenge • Deal with multiple languages • The search engine has to: • Accept queries in multiple languages • Return results in multiple languages

  6. Approach • Feature based • Take existing retrieval engine • Extend it with several features • Document alignment (dictionaries) • Language detection and query translation • Link analysis • User interface + project website

  7. Project Name • MELANGE • Multi European LANGuage Engine

  8. Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo

  9. Approach • Focus on cross-lingual features • Not: building a search engine from scratch • We used an existing search engine to jump-start our development

  10. Terrier • Open-source Java search engine • Modular design • Clean and clear interfaces • Easy to understand • Intended to be extended • Allows new modules to be plugged in • Several ranking models built in

  11. Apache Cocoon • Online document publishing framework • Open-source Java project • XML-based • Highly flexible ‘pipeline’ system

  12. Features • Query language classification • Query translation using dictionaries • Use PageRank to improve scores • Feature-rich user interface 

  13. Interface Use Cases • The user can: • Submit a query • Specify a native language, then submit • Optionally: specify a filter for preferred languages in the results

  14. Use Case Diagram

  15. Melange Architecture • One ‘search’ pipeline in Cocoon • Features are integrated as ‘transformers’ on this pipeline

  16. Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo

  17. About the EuroGOV dataset • Pages of government websites from 27 domains: at be cy cz de dk ee es eu.int fi fr gr hu ie it lt lu lv mt nl pl pt ru se si sk uk • 86 gigabytes of raw data • 11 gigabytes compressed • Includes crude language detection and a list of exact duplicates (by MD5 hash)

  18. EuroGOV domain files • Every domain consists of 1 to 27 compressed files of 6–220 MB each: se/001.gz se/002.gz se/003.gz se/004.gz se/005.gz si/001.gz sk/001.gz sk/002.gz sk/003.gz • Those files contain multiple documents (up to 25,000 each) in pseudo-XML. “This might smell like XML but it will not be XML.”

  19. Bad news: pseudo-XML • Nothing is escaped: & appears literally • Content may contain nested <![CDATA[ or ]]> • Even worse: the url attribute can contain an unescaped " character! A typical record:

    <EuroGOV:doc url="http://www.regeringen.se/" id="Ese-001-35"
        md5="659b462005b40f04bde5946b2beaad71"
        fetchDate="Wed Sep 22 10:57:39 MEST 2004" contentType="text/html">
      <EuroGOV:content>
        <![CDATA[ ... content ... ]]>
      </EuroGOV:content>
    </EuroGOV:doc>
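
Because the records are not well-formed XML, a real XML parser chokes on them; the practical approach is to split on the document tags directly. A minimal sketch in Python, assuming it is safe to split on </EuroGOV:doc> (content containing that string, or deeper CDATA nesting than one level, would still need special handling):

    import gzip
    import re

    # Matches the open tag; the non-greedy url pattern deliberately stops at
    # the quote that precedes the id attribute, so a stray " inside the URL
    # ends up in the captured value instead of breaking the match.
    DOC_RE = re.compile(
        r'<EuroGOV:doc\s+url="(?P<url>.*?)"\s+id="(?P<id>[^"]+)"', re.DOTALL)

    def iter_docs(path):
        with gzip.open(path, "rt", encoding="latin-1", errors="replace") as f:
            data = f.read()
        for chunk in data.split("</EuroGOV:doc>"):
            m = DOC_RE.search(chunk)
            if not m:
                continue
            # Take everything between the first <![CDATA[ and the last ]]>,
            # which survives one level of nested CDATA markers.
            start, end = chunk.find("<![CDATA["), chunk.rfind("]]>")
            content = chunk[start + 9:end] if 0 <= start < end else ""
            yield m.group("id"), m.group("url"), content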

  20. URLs are very unclean • Bad url attribute: url="http://www.micr.cz/scripts/detail.php?id=1410">" should have been url="http://www.micr.cz/scripts/detail.php?id=1410" • And: url="http://www.bmgs.bund.de/deu/txt/service/links/="/deu/txt/index_1766.cfm"" should have been url="http://www.bmgs.bund.de/deu/txt/index_1766.cfm" but became url="http://www.bmgs.bund.de/deu/txt/service/links/"

  21. Python data tools • Terrier’s indexing mechanism was designed to be fast • Parsers can be replaced by your own classes • Designed for sequential term-by-term processing only • Only supports parsing of a document ID and the document content, but not the URL, content type, etc. • Solution: a Python EuroGOV parser module

  22. Python EuroGOV parser design • Class EuroGOVProcessor with overridable methods (a usage sketch follows): • processStart • domainStart: “be” • domainFileCheck: “001.gz” • documentHeaderCheck: check by URL and document ID only • documentProcess • … • domainFileCheck: “002.gz” • … • domainEnd • processEnd
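
A sketch of how such a processor might be used. The event names come from the slide above, but the exact signatures, the Document accessors, and the entry point are assumptions for illustration:

    # Hypothetical subclass of the project's EuroGOVProcessor; only the
    # events of interest are overridden, the rest keep their defaults.
    class TitlePrinter(EuroGOVProcessor):
        def domainStart(self, domain):
            print("entering domain", domain)

        def documentHeaderCheck(self, url, doc_id):
            # Cheap filter on the header alone, before the body is parsed.
            return url.endswith(".html")

        def documentProcess(self, doc):
            # doc is a Document instance (see the next slide).
            print(doc.id, doc.tag("title"))  # hypothetical tag accessor

    TitlePrinter().run("EuroGOV/")  # hypothetical entry point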

  23. documentProcess • Inside the documentProcess event, an instance of the Document class is passed in. • Supports extraction of: • url, id, HTTP content-type header, date, md5 • content format (HTML, PDF, Word) • codepage • URL extraction • HTML -> text • HTML tag extraction (e.g. get the content of the <title> tag)

  24. Tools created • Extracting clean text with unambiguous language detection • Extracting link structure from the dataset • Quickly extracting a single document • Dataset language reclassification • Converting the dataset into something indexable by Terrier TREC classes • Snippet server (getting the raw document in real-time)
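
The snippet server is the one tool that has to work in real time. A minimal sketch of how such a server could look, assuming a precomputed index mapping document IDs to byte ranges in the uncompressed domain files (the index file and its layout are illustrative, not the project's actual format):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical index: {"Ese-001-35": ["EuroGOV/se/001", 123456, 7890], ...}
    with open("offsets.json") as f:
        OFFSETS = json.load(f)

    class SnippetHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            doc_id = self.path.lstrip("/")
            if doc_id not in OFFSETS:
                self.send_error(404, "unknown document id")
                return
            path, offset, length = OFFSETS[doc_id]
            with open(path, "rb") as f:  # uncompressed domain file
                f.seek(offset)
                raw = f.read(length)
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(raw)

    HTTPServer(("", 8080), SnippetHandler).serve_forever()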

  25. Thanks to… • The BeautifulSoup module, very valuable for robust HTML parsing: “You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.” • Psyco, a specializing compiler for Python (like a Java JIT compiler), making unmodified programs run 2–100x faster

  26. Indexing with Terrier • Converted the EuroGOV dataset into the TREC file format (with Unicode support) • The compressed size of the raw text (annotated with language) is 2.5 GB instead of 11 GB • Supposedly Terrier stands for “Terabyte retrieval engine”, but it failed with an OutOfMemory exception after indexing 80% • Increasing the Java VM heap size from 512 MB to 2 GB ‘fixed’ it • The same had to be done for the retrieval engine…

  27. Terrier indexing: metadata • Added support for extra metadata per document: PageRank and language • Requires rewriting classes such as DocumentIndex, DocumentIndexBuilder, etc. • Cannot add variable-length fields • DocumentIndex hides a nasty assumption: documents must be indexed in lexicographic order of the document ID string, otherwise lookup by this string will not work (see the sketch below)
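
The lexicographic-order assumption is easy to satisfy at conversion time. A minimal sketch, assuming the converted documents fit in memory as a DOCNO-to-text mapping (the in-memory dict is an illustrative simplification):

    def write_trec(docs, out_path):
        """Write TREC records sorted by DOCNO, since Terrier's DocumentIndex
        requires lexicographic document-ID order for lookups to work."""
        with open(out_path, "w", encoding="utf-8") as out:
            for docno in sorted(docs):
                out.write("<DOC>\n<DOCNO>%s</DOCNO>\n%s\n</DOC>\n"
                          % (docno, docs[docno]))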

  28. Statistics • Constructing the dataset in TREC format took about a day on 7 staff machines in parallel (about 4 days on a single machine) • The actual indexing with Terrier took 24 hours • Index size: 7.4 GB • Serving snippets requires the uncompressed dataset (86 GB)

  29. Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo

  30. Overview Document Alignment • Introduction • Goal • Approach • Results

  31. Introduction • sentence level • word level • dataset

  32. Goal • A simple, automatic chain for building dictionaries from two texts that are known to be translations of each other • Easy-to-use, self-explanatory dictionaries

  33. Approach

  34. Approach • Step 1: collect data (buildDocs.java) • Step 2: prepare the data for sentence alignment (plain2align.py) • Step 3: sentence alignment (align) [Gale & Church] • Step 4: rewrite the output of Step 3 (align2giza.py) • Step 5: prepare the output of Step 4 for word alignment (plain2snt.out) • Step 6: produce word classes (mkcls) [Och] • Step 7: word alignment (GIZA++v2) [Och] • The whole chain is sketched below
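
A sketch of the chain as a driver script; the tool names are the ones on the slide, but the flags and intermediate file names are illustrative assumptions (in particular, real mkcls and GIZA++ runs take many more options):

    import subprocess

    def run(cmd):
        print(">", " ".join(cmd))
        subprocess.run(cmd, check=True)  # abort the chain on any failure

    run(["java", "buildDocs"])                               # 1. collect data
    run(["python", "plain2align.py", "docs.en", "docs.nl"])  # 2. prepare for alignment
    run(["align", "docs.en.al", "docs.nl.al"])               # 3. Gale & Church sentences
    run(["python", "align2giza.py", "sentences.aligned"])    # 4. rewrite for GIZA++
    run(["plain2snt.out", "corpus.en", "corpus.nl"])         # 5. build .vcb/.snt files
    run(["mkcls", "-pcorpus.en", "-Vcorpus.en.classes"])     # 6. word classes (Och)
    run(["mkcls", "-pcorpus.nl", "-Vcorpus.nl.classes"])
    run(["GIZA++", "-S", "corpus.en.vcb", "-T", "corpus.nl.vcb",
         "-C", "corpus.en_corpus.nl.snt"])                   # 7. word alignment (Och)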

  35. Results • http://student.science.uva.nl/~eigenman/ii/dict.php

  36. Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo

  37. Query Translation - Introduction • Once the query’s language is detected, the query has to be translated • to a subset of the languages the user has selected (all supported EU languages by default) • Reliable offline dictionaries could not be found • Travlang’s ERGANE: vocabularies too limited • FREELANG: can be edited online by anyone, and its data cannot be accessed without a GUI • How about online dictionaries?

  38. Query Translation – Online translating tools • Google’s Translation Tools, BabelFish, WorldLingo, etc. • Slower, limited language support • have to connect to the URL for every query • typically support 8–9 languages within the EU • WorldLingo was the translator of choice • offers a Textbox Translator and a Website Translator • available in 9 major EU languages => MELANGE’s multilingual support was restricted to these from then on • performs some phrase matching • uses HTML forms for input/output

  39. Query Expansion • Use the local dictionaries, built by the Document Alignment team, to expand the query • they contain probabilities of word matches across documents in different languages • they often contain synonyms or related words • Append words with relatively high probability to the query translation (see the sketch below) • Term-based, as opposed to online translators, which are phrase-based
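
A minimal sketch of this term-based expansion, assuming the aligned dictionaries are loaded as a mapping from a source word to (translation, probability) pairs; the dictionary format and the probability threshold are illustrative:

    def expand(query, dictionary, threshold=0.3):
        """Append sufficiently probable translations after each query term."""
        expanded = []
        for term in query.lower().split():
            expanded.append(term)
            for translation, prob in dictionary.get(term, []):
                if prob >= threshold:  # keep only confident matches
                    expanded.append(translation)
        return " ".join(expanded)

    nl_en = {"verkiezingen": [("elections", 0.8), ("vote", 0.2)]}
    print(expand("verkiezingen 2004", nl_en))  # -> verkiezingen elections 2004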

  40. Language Detection • Has been studied in the last few years • Considered a solved problem • Techniques: • Word Lists • Common Words • N-Grams

  41. Basic Process • Standard machine learning process

  42. Reinvent the Wheel? • We decided to implement our own language detector • Reasons: • Weren’t convinced by the performance of freely available tools • Learning Factor! • New Approach

  43. Our Approach • Need to detect the language of short queries accurately • Decided to use character n-grams • Specifically, trigrams

  44. Common N-Gram Techniques • Extract all n-grams from the text and order them by frequency (sketched below)
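
For contrast with the approach on the following slides, a minimal sketch of this classic frequency-profile technique (in the style of Cavnar and Trenkle's rank-ordered profiles; the profile length of 300 is a conventional choice, not taken from the slides):

    from collections import Counter

    def ngram_profile(text, n=3, top=300):
        """Return the `top` most frequent character n-grams, most frequent first."""
        text = " ".join(text.lower().split())  # normalize whitespace
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        return [gram for gram, count in grams.most_common(top)]

    print(ngram_profile("the quick brown fox jumps over the lazy dog")[:5])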

  45. N-Grams Our Way • Inspired by stochastic language models • Uses a probabilistic way to define a syntax • Rules are stored in the form of n-grams: (CONTEXT, WORD, PROBABILITY) • Can be used for generating strings • Can also be used to calculate the probability that a string was generated by a grammar

  46. The Basic Idea • Define a stochastic language model for generating words in each language • Classify by comparing probabilities that a word was generated by a language

  47. Example • Training text: “Test text” • Rules take the form (CONTEXT, CHAR, PROBABILITY) = ( c[n-2] c[n-1], c[n], P( c[n] | c[n-2] c[n-1] ) ) • The probabilities calculated from the training text, with ^ marking a word start and ␣ a word-final space: ( ^^, t, 1.0 ) ( ^t, e, 1.0 ) ( te, s, 0.5 ) ( te, x, 0.5 ) ( es, t, 1.0 ) ( ex, t, 1.0 ) ( st, ␣, 1.0 ) ( xt, ␣, 1.0 )

  48. Classification • The probability that a word w is generated is the product of its trigram probabilities: P(w) = Π P( c[n] | c[n-2] c[n-1] ) over all positions n in the word • For the previous example: P(“test”) = 1.0 × 1.0 × 0.5 × 1.0 × 1.0 = 0.5 • Classification: assign the language L whose model gives the word the highest probability, argmax over L of P(w | L)
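
A minimal sketch of the whole scheme: train one trigram model per language with ^^ word-start markers and a trailing space as end marker (matching the example on slide 47), then classify a word by the model that generates it with the highest probability. Smoothing and log-probabilities, which the issues on the next slide call for, are omitted for clarity; the training strings here are toy examples:

    from collections import Counter

    def train(text):
        """P(c | two preceding characters), estimated per word from `text`."""
        ctx_counts, tri_counts = Counter(), Counter()
        for word in text.lower().split():
            padded = "^^" + word + " "  # ^^ marks word start, space marks end
            for i in range(2, len(padded)):
                ctx, ch = padded[i - 2:i], padded[i]
                ctx_counts[ctx] += 1
                tri_counts[(ctx, ch)] += 1
        return {k: count / ctx_counts[k[0]] for k, count in tri_counts.items()}

    def word_prob(model, word):
        """Probability that the model generates `word` (0.0 on unseen trigrams)."""
        padded = "^^" + word.lower() + " "
        p = 1.0
        for i in range(2, len(padded)):
            p *= model.get((padded[i - 2:i], padded[i]), 0.0)
        return p

    models = {"en": train("test text"), "nl": train("tekst toets")}
    print(word_prob(models["en"], "test"))  # 0.5, as on the previous slide
    print(max(models, key=lambda lang: word_prob(models[lang], "test")))  # en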

  49. Implementation - Issues • Speed • Encoding • Machine Precision • Noisy Data • Zero Frequency
