Linking legal thesaurii to enable semi-automated multilingual searching
Philip Chung, Graham Greenleaf & Andrew Mowbray, Co-Directors, AustLII. Law via the Internet Conference, Jersey, Channel Islands, September 2013.


Outline

  • Cross-lingual searching: issues

  • Document vs query translation

  • Proposed approach to query translation

  • Searching in multiple languages (multi-lingual)

  • Extending SINO using the u16a representation

  • Using SINO’s synonym function

  • Discussion and future work

Cross-lingual searching

  • Cross-lingual searching = Retrieval of documents in a language other than the language of the query

  • Main motivations:

    • Allow monolingual searchers to be aware of the existence of relevant documents in other languages

    • Assist users who are more familiar with one language to find documents in other languages

    • Avoid the need to enter search queries in different languages

  • Key issue: translation

    • Document translation

    • Query translation

Document translation

  • Translate documents into the language of the query

    • eg translating all documents into English

  • Searching can then be done using the language of the query directly

  • Users may also be able to read and use the translated file (assuming a good translation)

  • However, very resource intensive – impossible to translate documents into all languages

  • Unless documents are already translated, this approach is not feasible/practical for free-access LIIs

Query translation

  • Translate the search query into the languages of the documents contained in the system

  • Less text to be translated

  • More flexible (translations may be generated dynamically)

  • However, may not be able to handle complex queries

  • Documents retrieved may then need to be translated into a language that the user understands

  • This approach is more feasible from a free-access LII’s perspective

A possible approach to query translation

  • Creating new bilingual mappings of legal terms is too expensive

  • Use of existing bilingual dictionaries/glossaries is more practical, where they exist

  • For cross-lingual searches across multiple languages, use one language as a ‘link language’

    • eg English, to construct mapping tables

  • Each term in the query is then expanded based on the equivalent entries in the mapping table

  • Search is then conducted over the corpus based on the expanded term(s)
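The expansion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the language codes, the function names and the use of `and` to join expanded groups are all hypothetical and are not taken from SINO or from the authors' implementation. The sample terms are borrowed from the examples elsewhere in these slides.

```python
from typing import Dict, List

# Hypothetical mapping table keyed by the English link term.
# A real table would be loaded from existing bilingual glossaries.
MAPPING_TABLE: Dict[str, Dict[str, List[str]]] = {
    "compensation": {"id": ["kompensasi"], "zh-HK": ["補償"]},
    "bankruptcy": {"id": ["kepailitan", "pailit"], "zh-HK": ["破產"]},
}


def expand_term(term: str, table: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return the term followed by its equivalents in every mapped language."""
    expanded = [term]
    for equivalents in table.get(term.lower(), {}).values():
        for word in equivalents:
            if word not in expanded:
                expanded.append(word)
    return expanded


def expand_query(terms: List[str], table: Dict[str, Dict[str, List[str]]]) -> str:
    """Build a boolean query: alternatives joined by 'or', terms joined by 'and'.

    The 'and' connective is an assumption made for this sketch.
    """
    groups = ["(" + " or ".join(expand_term(t, table)) + ")" for t in terms]
    return " and ".join(groups)


if __name__ == "__main__":
    print(expand_query(["compensation", "bankruptcy"], MAPPING_TABLE))
```

Terms that are not in the table pass through unchanged, so a query containing non-legal words still runs, only without cross-lingual expansion for those words.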

Legal dictionaries relevant to East Asia (in likely order of availability)

  • Hong Kong: Chinese (HK) <-> English

    • Official translation dictionary of Hong Kong government available

  • EuroVoc: 22 European languages <-> English

    • Available for use

  • Indonesia: Bahasa Indonesia <-> English

    • Dictionary of basic legal terms developed by AustLII

  • Japan: Japanese <-> English

    • Japanese Law Translation dictionary (Nagoya project) is available for 3rd party use – various download options available

  • South Korea: Korean <-> English

    • MOLEG and/or KLRI have developed a dictionary

  • Taiwan: Chinese (Tw) <-> English

    • Prof Amy Shee’s group may be developing a dictionary

  • Vietnam: Vietnamese <-> English

    • Law Science Institute (Hanoi) has developed a dictionary, but its availability is uncertain

Example 1: Indonesian <-> English

Bahasa Indonesia







crime against humanity

Example 2: Chinese <-> English

Chinese (HK)







crimes against humanity


  • Mapping table may be extended using EuroVoc

    • The European Union’s official multilingual thesaurus

    • It contains a sub-section for legal terms

  • EuroVoc – Contains terms in 23 EU languages

    • Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish

    • plus Serbian

  • Link languages

    • English can then be the link language between European and some Asian languages in many Asian jurisdictions

    • Portuguese can also be a link language to some Asian jurisdictions and other Asian languages
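As a rough sketch of how a link language joins separate glossaries, the Python below merges several bilingual word lists on their English entry and then looks up equivalents between two non-English languages. The function names and the input layout are assumptions made for illustration; the sample terms come from the EuroVoc entry for "crime against humanity".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Glossaries = Dict[str, List[Tuple[str, str]]]  # lang -> [(foreign term, English term)]


def build_mapping_table(glossaries: Glossaries) -> Dict[str, Dict[str, List[str]]]:
    """Merge bilingual glossaries into one table keyed by the English link term."""
    table: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for lang, pairs in glossaries.items():
        for foreign, english in pairs:
            key = english.strip().lower()
            if foreign not in table[key][lang]:
                table[key][lang].append(foreign)
    return {key: dict(by_lang) for key, by_lang in table.items()}


def translate_via_link(term: str, source: str, target: str,
                       table: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Find target-language equivalents of a source-language term via English."""
    results: List[str] = []
    for by_lang in table.values():
        if term in by_lang.get(source, []):
            for word in by_lang.get(target, []):
                if word not in results:
                    results.append(word)
    return results


if __name__ == "__main__":
    glossaries = {
        "pt": [("crime contra a humanidade", "crime against humanity")],
        "fr": [("crime contre l'humanité", "crime against humanity")],
    }
    table = build_mapping_table(glossaries)
    print(translate_via_link("crime contra a humanidade", "pt", "fr", table))
```

No Portuguese-French dictionary is needed here: each glossary only has to map to the link language, which is why adding a language costs one new glossary instead of one per language pair.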

EuroVoc example

bg престъпление против човечеството

es crimen contra la humanidad

cs trestný čin proti lidskosti

da forbrydelse mod menneskeheden

de Verbrechen gegen die Menschlichkeit

et inimsusvastane kuritegu

el έγκλημα κατά της ανθρωπότητας

en crime against humanity

fr crime contre l'humanité

it crimine contro l'umanità

lv noziegums pret cilvēci

lt nusikaltimas žmogiškumui

hu emberiség elleni bűncselekmény

mt crime against humanity (under translation)

nl misdaad tegen de menselijkheid

pl zbrodnia przeciwko ludzkości

pt crime contra a humanidade

ro crime împotriva umanității

sk zločin proti ľudskosti

sl zločin proti človečnosti

fi rikos ihmisyyttä vastaan

sv brott mot mänskligheten

hr zločin protiv čovječnosti

sr злочин против човечности

Simultaneous searching in multiple languages

  • AustLII’s SINO search engine

  • Open source, free-text search engine

  • Speed, flexibility, portability and reliability

    • build performance: 20GB per hour on commodity hardware

    • search performance: Single word searches return in under 0.050 seconds

  • ‘Size is no object’ – trade-off between disk space and speed of retrieval

    • concordance ratio: 55% approx – relatively large but concordance is easy to read and minimises unnecessary file input/output

  • Used by many LIIs from around the world: BAILII, PacLII, SAFLII, LIIofIndia, HKLII, NZLII, LiberLII, CyLaw, SamLII

Simultaneous searching in multiple languages (2)

  • SINO was initially developed for English and has been extended to other western languages

  • Extending SINO to handle UTF-8 encoding for multilingual searching

SINO’s u16a representation

  • SINO’s u16a representation

    • Any non-ASCII UTF-8 character (eg Chinese, Korean, Thai) can be converted into an alpha-numeric (flat) representation

      • Hexadecimal form – 0 to 9 and A to F

    • Resulting form may be confused with numeric words in western languages

      • ‘春’ is ‘6625’ in hexadecimal form
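The hexadecimal form mentioned above is the character's Unicode code point written in base 16. The short Python sketch below reproduces the ‘春’ example and shows why the bare form is ambiguous; the function name `hex_form` is chosen here for illustration and is not part of SINO.

```python
def hex_form(char: str) -> str:
    """Return the Unicode code point of a single character as uppercase hex."""
    return format(ord(char), "04X")


if __name__ == "__main__":
    code = hex_form("春")
    print(code)            # hexadecimal code point of the character
    print(code.isdigit())  # True when the hex form consists of digits only
```

Because this particular code point contains no letters A to F, an index cannot tell it apart from the ordinary number 6625 appearing in a western-language document.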

SINO’s u16a representation (2)

  • The characters ‘u16a’ are added to any such representation to create a unique string

    • ‘u16a’ is rare to non-existent in natural language

  • These u16a ‘shadow files’ are then used for SINO to search (as a proxy for the original)

    • text in the original language is presented to the user

  • Example: bankrupt* or insolven* or การล้มละลาย or kepailitan or pailit or 破產 or 破产 or Phásản
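A minimal sketch of producing such a shadow text is given below. The slides state only that the characters ‘u16a’ are added to the hexadecimal form; placing the marker as a prefix, using at least four hex digits per character, and the function name `to_shadow` are assumptions of this sketch and may differ from what SINO actually does.

```python
MARKER = "u16a"  # marker named in the slides; its position as a prefix is assumed


def to_shadow(text: str) -> str:
    """Replace every non-ASCII character with MARKER plus its hex code point.

    ASCII characters (including search operators such as '*') are left as-is,
    so the result is a flat, ASCII-only string that can be indexed and searched.
    """
    return "".join(
        ch if ord(ch) < 128 else f"{MARKER}{ord(ch):04X}"
        for ch in text
    )


if __name__ == "__main__":
    print(to_shadow("insolven* or 春"))
```

The same conversion would be applied to both the documents and the query, so that a search term typed in its original script matches the shadow text while the user still sees the original-language document.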

SINO and synonyms

  • Possible implementation of query translation

  • Synonyms can be defined via the .sino_synonyms file

  • Consists of zero or more lines each with a comma separated list of words and/or phrases.

  • For example:

    • unsw, “university of new south wales”

    • small, tiny, little

  • Use of a .sino_synonyms file as a starting point for automating cross-lingual searching
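To make the file format concrete, the sketch below parses lines in the comma-separated layout described above and expands a search term into an `or` query. It is not SINO's own parser: the use of Python's `csv` module, straight double quotes around phrases, lower-casing, and the function names are all assumptions made for illustration.

```python
import csv
import io
from typing import List

# Sample content in the format described above, plus one cross-lingual line.
SAMPLE = (
    'unsw, "university of new south wales"\n'
    "small, tiny, little\n"
    "kompensasi, 補償\n"
)


def load_synonyms(text: str) -> List[List[str]]:
    """Parse synonym lines into groups of lower-cased words and phrases."""
    groups = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        words = [w.strip().lower() for w in row if w.strip()]
        if words:
            groups.append(words)
    return groups


def expand(term: str, groups: List[List[str]]) -> str:
    """Return the term's synonym group joined by 'or', quoting phrases."""
    term = term.lower()
    for group in groups:
        if term in group:
            return " or ".join(f'"{w}"' if " " in w else w for w in group)
    return term


if __name__ == "__main__":
    groups = load_synonyms(SAMPLE)
    print(expand("kompensasi", groups))
    print(expand("unsw", groups))
```

Under this reading, a cross-lingual mapping table is simply a synonyms file in which the entries on each line happen to be in different languages.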

Example: kompensasi or 補償

Discussion and future work

  • What are the criteria for success of cross-lingual searching?

    • What level of false positives is acceptable?

    • What testing would be most useful?

  • Extracting and mapping legal terms from multiple dictionaries

  • Developing an interface to manage the addition of new legal terms

Discussion and future work (2)

  • What if a search contains non-legal terms?

    • Could automated translations supplement dictionaries?

  • Addition of general (non-legal) terms to dictionaries

    • coverage vs performance

  • Possible performance improvement: Expand legal terms at concordance time

    • Rather than simply indexing on the words of the original text

    • Include in the concordance the expanded list of legal terms in multiple languages
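The idea of expanding terms at concordance time instead of at query time can be illustrated with a toy inverted index. Everything below is a simplified assumption for illustration: the whitespace tokenizer, the document identifiers, the sample texts and the function names are hypothetical, and SINO's real concordance format is not shown in the slides.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical groups of equivalent legal terms across languages.
TERM_GROUPS: List[List[str]] = [["compensation", "kompensasi", "補償"]]


def equivalents_map(groups: List[List[str]]) -> Dict[str, List[str]]:
    """Map each term to the other members of its group."""
    mapping: Dict[str, List[str]] = {}
    for group in groups:
        for term in group:
            mapping[term] = [other for other in group if other != term]
    return mapping


def build_concordance(docs: Dict[str, str],
                      groups: List[List[str]]) -> Dict[str, Set[str]]:
    """Index each word and, for legal terms, every equivalent in other languages."""
    equivalents = equivalents_map(groups)
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive whitespace tokenizer
            index[word].add(doc_id)
            for other in equivalents.get(word, ()):
                index[other].add(doc_id)
    return index


if __name__ == "__main__":
    docs = {
        "en-1": "compensation for loss",    # illustrative sample text
        "id-1": "pembayaran kompensasi",    # illustrative sample text
    }
    index = build_concordance(docs, TERM_GROUPS)
    print(sorted(index["compensation"]))
```

The trade-off matches the one noted above: the concordance grows with every added equivalent, but a single-term lookup returns documents in all languages without any query rewriting at search time.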