Second Global Symposium of Intellectual Property Authorities. An approach to process multilingual patents corpora. Dr. Barrou DIALLO Head of Research, European Patent Office, Rijswijk. Geneva, September 17, 2010. EPO R&D Department.

Second Global Symposium of Intellectual Property Authorities

An approach to process multilingual patents corpora

Dr. Barrou DIALLO

Head of Research, European Patent Office, Rijswijk

Geneva, September 17, 2010

About the R&D Department

EPO R&D Department

At the origin of the first machine translation system for patents

Entry point for testing and evaluating available solutions

Portfolio of academic & international collaborations

Strong background in algorithmics and linguistics

Network of active users and testers

Our Mission
  • Providing an instrument to translate user needs into projects

Main tasks:

  • Technical advice
  • Market studies and research
  • Technological forecasting
  • Risk analysis and strategic planning

Resulting in:

Support, on request, for the EPO MT Task Force and/or IP5 MT activities

Management support on ICT issues

  • Performing quantitative analysis
  • Advising decision-makers on technical solutions
  • Providing users with sensible options and recommending courses of action

Coordinating research initiatives across IT

  • Identifying and communicating business opportunities
  • Ensuring a smooth transition from research to development
  • Communicating practices and experiences
  • Formalising research work across all departments

Our Expertise

Current Research Subjects

Machine Translation for Asian Languages

Semantic Search Engines

Graphical Visualization

Our Mission

Our Vision: Turning Technology into an effective IP Process

R&D center as a source of Efficiency:

  • Efficient Reading
  • Accurate Searching
  • Fast Granting
Logical view of a document in an IR system

[Figure: logical view of a document. Labels shown: Doc, structure, accents and spacing, stopwords, noun groups, stemming, manual indexing, full text, index terms. The diagram also shows translated versions of the document: Doc - Translated x1 ... Doc - Translated xn.]

Multiple languages add another dimension to retrieval systems for patents.

Adapted from J. H. Wang, 2008

MT: Setup of an evaluation platform

GOAL: To provide the users with a clear assessment of the quality of MT systems

  • Unix server hosting full-text patent data in the source and target languages
  • "mteval" scoring script for the Open MT Evaluation (http://www.itl.nist.gov/iad/mig//tools/)
  • Case of a small set of Japanese patent documents:
    • 54 JP patents
    • 54 JP priority documents published at the USPTO
    • analysis over the claim section

Which indicators of quality can be considered as valid?

Absolute score computation processing scheme
  • For each document
    • choose candidate sentences (ca. 10 segments)
      • find the corresponding HT
      • compute the BLEU score
      • compute the NIST score
    • compute the document average BLEU score
    • compute the document average NIST score
    • compute the BLEU - NIST correlation
    • compute the BLEU - HT correlation
    • compute the NIST - HT correlation
    • store the IPC class
  • For the collection
    • (per IPC class):
      • compute the average BLEU score
      • compute the average NIST score
    • compute the correlation between scores in each IPC class
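
The aggregation part of this scheme can be sketched as follows. This is a minimal illustration, not the EPO implementation: it assumes the per-segment BLEU and NIST scores have already been produced by a scoring script, and the names `pearson`, `document_scores` and `collection_scores`, the input layout, and the Pearson coefficient as the correlation measure are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns NaN when it is undefined."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def document_scores(segments):
    """segments: list of (bleu, nist) pairs, one per candidate sentence."""
    bleu = [b for b, _ in segments]
    nist = [n for _, n in segments]
    return {
        "bleu": mean(bleu),                     # document average BLEU
        "nist": mean(nist),                     # document average NIST
        "bleu_nist_corr": pearson(bleu, nist),  # BLEU - NIST correlation
    }


def collection_scores(documents):
    """documents: list of {"ipc": str, "segments": [(bleu, nist), ...]}."""
    by_class = defaultdict(list)
    for doc in documents:
        by_class[doc["ipc"]].append(document_scores(doc["segments"]))
    summary = {}
    for ipc, docs in by_class.items():
        bleu = [d["bleu"] for d in docs]
        nist = [d["nist"] for d in docs]
        summary[ipc] = {
            "bleu": mean(bleu),
            "nist": mean(nist),
            "bleu_nist_corr": pearson(bleu, nist),
        }
    return summary


# Hypothetical scores, for illustration only (not taken from the study).
example = [
    {"ipc": "G01", "segments": [(0.1, 4.0), (0.3, 6.0)]},
    {"ipc": "G01", "segments": [(0.0, 2.0), (0.2, 4.0)]},
    {"ipc": "H03", "segments": [(0.3, 5.0)]},
]
per_class = collection_scores(example)
```

A class with a single document yields NaN for the correlation, which makes the lack of data visible instead of hiding it behind a default value.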
Example of raw results
  • High variation of scores
  • High correlation between BLEU and NIST
  • Extreme cases:
    • BLEU 0 at 9th position
    • BLEU 0.30 at 25th position

BLEU score JPO 54 documents

NIST score JPO 54 documents

NIST vs. BLEU correlation

JP system case

Google case

Our findings based on a limited example

Correlation between BLEU scores and human-translated documents:

  • high scores correlate with understandable translations
  • low scores correlate with non-understandable translations

Differences between documents from different IPC classes

Spread of scores is large (cf. std dev.)

BLEU is consistent with human translations, but:

  • Results are absolute: they need to be compared to other systems
  • Bias can be introduced by the origin of data (IPC class, complexity, ...)
JP: BLEU & NIST vs. IPC classes

To address the issue of data origin:

BLEU score

NIST score

MT systems relative score computation scheme

BLEU score JP translation system vs. Google system

BLEU score Google

BLEU score JP

Best, medium and worst case examples

Mean scores for the whole collection:

  • JP translation system:
    • NIST: 4.7962
    • BLEU: 0.1443
  • Google:
    • NIST: 4.5796
    • BLEU: 0.1185

Worst case:

BLEU score = 0 for JP

Medium case:

BLEU score = 0.15 for JP

Best case:

BLEU score = 0.30 for JP
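
Since absolute scores only become meaningful when systems are compared, the mean scores listed above can be turned into a relative difference. The sketch below is one simple way to do that; the helper name `relative_difference` and the choice of the Google scores as the baseline are assumptions made for illustration.

```python
def relative_difference(score, baseline):
    """Relative difference of `score` with respect to `baseline`."""
    return (score - baseline) / baseline


# Mean scores for the whole collection, as reported on the slide.
jp_system = {"NIST": 4.7962, "BLEU": 0.1443}
google = {"NIST": 4.5796, "BLEU": 0.1185}

relative = {
    metric: relative_difference(jp_system[metric], google[metric])
    for metric in jp_system
}

for metric, value in relative.items():
    print(f"{metric}: JP system vs. Google {value:+.1%}")
```

With only 54 documents and a large spread of scores, such a difference should be read as an indication, not as a statistically significant result.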

Worst case example: JP (BLEU=0)

JP Human Translation claim 1:

1. A low-level light detector, comprising: an avalanche photodiode with a bias voltage adjusted to produce a multiplication factor of up to 30; a capacitor connected to the avalanche photodiode for accumulating carriers produced and multiplied in the avalanche photodiode; biasing means of the avalanche photodiode; outputting means of a capacitor voltage change; and control means of the biasing and outputting means; wherein the low-level light detector detects an intensity of light impinging on the avalanche photodiode by periodically reading capacitor voltages and obtaining differences between the voltages.

JP MT Claim 1:

an avalanche photo-diode (APD) which adjusted bias voltage so that a multiplication factor might become 30 or less ] A microscopic weak optical power detector detecting intensity of light irradiated by above APD by connecting a capacitor for generating inside this APD and accumulating a ****(ed) carrier, reading voltage of this capacitor periodically, and taking the difference

JP MT Google claim 1:

01 claim Avalanche adjusted so that the bias voltage multiplication factor of 30 or less (APD), and comprises APD occurs within, connect a capacitor for storing carriers multiplication reads regularly voltage capacitor comprises, by taking the difference above, APD was irradiated characterized by intensity of light to detect Ru, very faint light detector. [2]

  • Remarks:
    • "APD" is not in the human translation
    • "Avalanche photodiode" appears 5 times in the human translation vs. a different number of occurrences in the MT outputs
    • "photo diode" is missing in the Google output
    • Much more information in HT than in MT
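
The term counts behind these remarks can be reproduced with a few lines of code. The sketch below uses plain case-insensitive substring counting on the human translation and the JP MT output quoted above; the helper name `count_term` is an assumption, and substring counting is a deliberate simplification (it would also match a term embedded in a longer word).

```python
def count_term(text, term):
    """Case-insensitive count of non-overlapping occurrences of `term`."""
    return text.lower().count(term.lower())


# Human translation of claim 1, as quoted on the slide.
HUMAN_TRANSLATION = (
    "A low-level light detector, comprising: an avalanche photodiode with a "
    "bias voltage adjusted to produce a multiplication factor of up to 30; "
    "a capacitor connected to the avalanche photodiode for accumulating "
    "carriers produced and multiplied in the avalanche photodiode; biasing "
    "means of the avalanche photodiode; outputting means of a capacitor "
    "voltage change; and control means of the biasing and outputting means; "
    "wherein the low-level light detector detects an intensity of light "
    "impinging on the avalanche photodiode by periodically reading capacitor "
    "voltages and obtaining differences between the voltages."
)

# JP machine translation of claim 1, as quoted on the slide.
JP_MT = (
    "an avalanche photo-diode (APD) which adjusted bias voltage so that a "
    "multiplication factor might become 30 or less ] A microscopic weak "
    "optical power detector detecting intensity of light irradiated by above "
    "APD by connecting a capacitor for generating inside this APD and "
    "accumulating a ****(ed) carrier, reading voltage of this capacitor "
    "periodically, and taking the difference"
)

for term in ("avalanche photodiode", "APD"):
    print(term, count_term(HUMAN_TRANSLATION, term), count_term(JP_MT, term))
```

Because n-gram metrics reward exact matches, a consistent terminology difference such as "avalanche photodiode" vs. "APD" removes many matching n-grams, which helps explain why a score can be very low even when part of the content is conveyed.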
Medium case example: JP (BLEU=0.15)

JP Human Translation:

1. An electronic throttle control device of an internal-combustion engine that controls an engine output by computing a quantity of a throttle opening degree on the basis of a manipulation quantity of an accelerator pedal by a driver by means of a computation portion in an electronic control unit, and by controlling a throttle opening degree using a specific actuator on the basis of a computed command value of the throttle opening degree, wherein the electronic control unit includes: a judgment function portion

JP MT:

[Claims 1]. It has the following and is characterized by choosing a predetermined map from said two or more characteristic conversion factor maps, and calculating a target throttle opening command value corresponding to a judgment result of said judgment function part. being based on the amount of operations of a driver's accelerator by a calculating means of an electronic control unit (ECU) -- a throttle -- an opening -- quantity calculating and,

Google MT:

claims [claim] 1 electronic control unit (ECU) by means of operation, the driver's accelerator operation is calculated caliber throttle opening based on the amount that means actuator given on the command throttle opening by this operation, to control the opening of the throttle control, electronic throttle the internal combustion engine to control engine output apparatus, the electronic control unit, the normal operating conditions and engine systems, engine control unit to determine the abnormality detection capabilities,

Conclusion: Quality is not good enough to understand the content

Best case example: JP (BLEU=0.30)

Human Translation JP:

1. A signal processing circuit comprising: a pulse generation part that generates a pulse signal corresponding to an input signal; an integration part that generates an integrated voltage having a time slope proportional to an input voltage with a duration specified by said pulse signal being set as an integration period; and a hold part that holds and outputs a difference voltage between a start voltage and an end voltage of said integrated voltage in said integration period.

JP Machine Translation:

A signal-processing circuit comprising: A pulse generating means which generates a pulse signal according to an input signal. An integrating means which generates integration voltage which has a time slope which is proportional to input voltage by making into an integration period a period specified with said pulse signal. A hold means which holds and outputs difference voltage of starting potential of said integration voltage and end voltage in said integration period.

Google MT:

01 and pulse generation to generate a pulse signal corresponding to the input signal, the pulsed integration time period as specified in the signal integration means for generating a voltage gradient with an integration time proportional to the input voltage, the voltage difference between voltage and hold the start voltage and end voltage of said integration of said integration period hold, and a signal processing circuit means and output.

Conclusion: Tiny differences between JP MT and HT

Rank-ordered N-gram co-occurrence scores

NIST scores for MT vs. human translations

[Figure: rank-ordered N-gram co-occurrence scores for 6 commercial MT systems and 7 professional translators, with the maximum score for MT marked. (c) NIST N-gram scoring study]

Is NIST 0.4 sufficient for patent professionals?

Manual vs. Automatic evaluation: Result Interpretations
  • Scores have to be carefully interpreted: no statistical significance at the moment.
  • There is a clear correlation between manual scores and automatic scores
  • Both scores NIST and BLEU are complementary and show different aspects
  • Relative scores should be calculated to assess systems against each other
  • Both end-user assessments AND automatic scores are necessary for testing a system
The general problem

Finding documents written in any language using queries expressed in a single language

  • Main strategies for query translation
    • dictionary-based methods
      • Limitations of dictionaries
      • Inflected word forms
      • Phrases and compound words
      • Lexical ambiguity
      • Possible solution: Approximate string matching
    • corpus-based methods
      • frequency analysis (aboutness of the 2 collections should be similar)
    • machine translation
      • use of morphological parser
      • Translates source language texts into target language using:
        • Translation dictionaries
        • Other linguistic resources
        • Syntax analysis
      • Limited availability

Source language: the language that gives access to the required information; the query language

Target language: the language of the content in the database

Usage: patent query translation and/or patent translation from the source language.
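
The approximate string matching mentioned above as a possible way around dictionary limitations (inflected word forms in particular) can be sketched with Python's standard `difflib` module. The toy English-to-German dictionary, the function name `lookup` and the 0.8 similarity cutoff are assumptions made for illustration, not part of the presented system.

```python
from difflib import get_close_matches

# Hypothetical toy dictionary: source-language term -> target-language terms.
TOY_DICTIONARY = {
    "photodiode": ["Photodiode"],
    "capacitor": ["Kondensator"],
    "throttle": ["Drosselklappe"],
}


def lookup(term, dictionary=TOY_DICTIONARY, cutoff=0.8):
    """Return (matched entry, translations), or None if nothing is close."""
    key = term.lower()
    if key in dictionary:
        return key, dictionary[key]
    # Fall back to the closest dictionary entry above the similarity cutoff.
    matches = get_close_matches(key, list(dictionary), n=1, cutoff=cutoff)
    if not matches:
        return None
    return matches[0], dictionary[matches[0]]


print(lookup("photodiodes"))  # inflected form, not in the dictionary as such
print(lookup("gearbox"))      # no sufficiently similar entry
```

A high cutoff keeps false matches rare at the cost of missing heavily inflected or compound forms, so the threshold would have to be tuned per language pair.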

Cross-language Retrieval in a nutshell

Mohsen Jamali, Sharif Univ. of technology

How to start Cross-language Information Retrieval for Patents?

Classic CLIR system tree: which strategy for patent documents?

The main issue of CLIR: Term disambiguation

How to deal with ambiguity?

  • Solution 1:
    • Selecting the most likely translation (1st one offered by a dictionary?), the longest term?
    • Problem: a low probability of success.
  • Solution 2:
    • Use of all possible translations in the query with the OR operator.
    • Problem: it includes the correct translation, but also introduces noise into the query. This can lead to the retrieval of many irrelevant documents
  • Solution 3 (most popular):
    • Term co-occurrences models.
    • A query defines a single concept or an information need, thus the terms in a query are assumed to exhibit relatively strong relationship. Therefore, the correct translation of one query term would be expected to show a strong correlation with other translated query words.
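
Solutions 1 and 2 are simple enough to sketch directly. In the example below, the candidate lists, the function names and the Boolean query syntax (OR within a term, AND between terms) are assumptions made for illustration; a real search engine would have its own query language.

```python
def pick_first(candidates_per_term):
    """Solution 1: keep only the first translation offered for each term."""
    return [candidates[0] for candidates in candidates_per_term]


def build_or_query(candidates_per_term):
    """Solution 2: keep every translation, joined with the OR operator."""
    groups = []
    for candidates in candidates_per_term:
        alternatives = " OR ".join(f'"{c}"' for c in candidates)
        groups.append(f"({alternatives})")
    # Assumption: the translated query terms themselves are combined with AND.
    return " AND ".join(groups)


# Hypothetical translation candidates for a two-term query.
candidates = [["cell", "cubicle"], ["battery", "artillery"]]

print(pick_first(candidates))
print(build_or_query(candidates))
```

The OR query is guaranteed to contain the correct translation if the dictionary does, but every wrong alternative ("cubicle", "artillery") also matches documents, which is the noise described under Solution 2.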
A proposed measure: Mutual Information

Mutual Information (MI) is a technique based on co-occurrence statistics

  • The relationship between query terms can be quantified with a co-occurrence model
  • The Mutual Information measure quantifies the distance between the joint distribution of terms X and Y and the product of their marginal distributions
  • x, y are the translations of two query terms;
  • f(x), f(y), f(x,y) are the frequency with which x appears, the frequency with which y appears and the frequency with which x and y appear together, respectively;
  • N is the size of the corpus
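
A minimal sketch of this measure, assuming the common pointwise form MI(x, y) = log2(N · f(x, y) / (f(x) · f(y))). This form follows from the definitions above (relative frequencies f/N as probability estimates) but is an assumption, not necessarily the presenter's exact formula; the function name and the choice of base-2 logarithm are also assumptions.

```python
import math


def mutual_information(f_x, f_y, f_xy, n):
    """Pointwise MI from raw frequencies.

    MI = log2( (f_xy / n) / ((f_x / n) * (f_y / n)) )
       = log2( n * f_xy / (f_x * f_y) )
    Returns -inf when x and y never co-occur.
    """
    if f_x <= 0 or f_y <= 0 or n <= 0:
        raise ValueError("frequencies and corpus size must be positive")
    if f_xy == 0:
        return float("-inf")
    return math.log2(n * f_xy / (f_x * f_y))


# Hypothetical counts: x and y always appear together in a corpus of 100.
print(mutual_information(f_x=10, f_y=10, f_xy=10, n=100))
# Hypothetical counts: co-occurrence exactly as expected under independence.
print(mutual_information(f_x=10, f_y=20, f_xy=2, n=100))
```

A value of zero means the two translations co-occur exactly as often as chance would predict; positive values indicate the strong relationship expected between correct translations of terms from the same query.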
Translation selection: Total Correlation

Measure:

  • We have a list of translation candidates.
  • The goal is to find the correct translation from the candidate list.
  • The correct translation will be selected using MI

Decision:

  • Total correlation, a generalization of Mutual Information, is proposed to calculate the relationship between the query words:
  • xi are the translations of the query words
  • f(xi) is the frequency with which xi appears in the corpus
  • f(x1,x2,x3,...) is the frequency with which all query words appear together in the corpus
  • N is the size of the corpus

If a set of translated query terms has a high MI value, then this set of translated terms is considered correct
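
The selection step can be sketched as follows, assuming the form C(x1, ..., xn) = log2(N^(n-1) · f(x1, ..., xn) / (f(x1) · ... · f(xn))), which generalizes the two-term case and is an assumption based on the definitions above. Frequencies are counted here as the number of documents containing a term, the corpus is a toy one, and all names are hypothetical.

```python
import math
from itertools import product


def total_correlation(terms, docs):
    """Total correlation of `terms` over `docs` (a list of term sets)."""
    n = len(docs)
    joint = sum(1 for doc in docs if all(t in doc for t in terms))
    if joint == 0:
        return float("-inf")  # the terms never appear together
    marginals = [sum(1 for doc in docs if t in doc) for t in terms]
    return math.log2(joint * n ** (len(terms) - 1) / math.prod(marginals))


def select_translation(candidates_per_term, docs):
    """Pick the combination of candidates with the highest total correlation."""
    return max(
        product(*candidates_per_term),
        key=lambda combo: total_correlation(combo, docs),
    )


# Hypothetical toy corpus: each document is reduced to a set of terms.
corpus = [
    {"cell", "battery", "electrode"},
    {"cell", "battery", "anode"},
    {"cubicle", "office"},
    {"artillery", "shell"},
    {"battery", "charger"},
    {"membrane", "cell"},
]

# Hypothetical translation candidates for a two-term query.
candidates = [["cell", "cubicle"], ["battery", "artillery"]]
best = select_translation(candidates, corpus)
print(best)
```

The exhaustive search over all combinations grows multiplicatively with the number of candidates per term, which is acceptable for short patent queries but would need pruning for long ones.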

Conclusion on Term disambiguation

Mutual Information associated with Total Correlation is proposed as a measure for cross-language patent retrieval

  • MI is a simple measure and not too computer-intensive
  • It performs as well as other co-occurrence approaches (Maeda et al., 2000).
  • Co-occurrence frequencies can be obtained from the document collection.

This approach is compatible with a collaborative view:

  • Make use of or build test collections to evaluate the systems:
    • example of CLEF (Cross Language Evaluation Forum)
    • collect sets of queries (rare items in IP)
    • collect sets of relevance judgments (which documents are relevant to which queries)
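
Once queries and relevance judgments are available, a system can be scored against them. The sketch below computes set-based precision and recall for one query; the document identifiers are hypothetical and the function name is an assumption. It illustrates the role of relevance judgments only and is not a patent-specific metric.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    retrieved: documents returned by the system
    relevant:  documents judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved_docs = ["EP1", "EP2", "EP3", "EP4"]
judged_relevant = {"EP2", "EP4", "EP9"}

precision, recall = precision_recall(retrieved_docs, judged_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Patent searching is usually recall-oriented, since a missed prior-art document can be costly, so recall deserves particular attention when such figures are compared across systems.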

Another solution for Term disambiguation

Visualization and analysis of Patent Queries

Graphical and textual editing of queries

Visual support of different search engines:

Full-text search

Semantic search

Image similarity

Metadata search

Query management functionality:

Storing of queries

Parameterization of queries using variables

Checking and amending queries interactively when necessary increases the chance of good results

Perspective and conclusion

The field of patent processing is still maturing

  • The number of subjects to be addressed is large (MT, IR, SE theory, Scoring and Evaluation, etc.)
  • The difficulty of retrieving patents raises theoretical problems. Testing theories needs a large amount of:
    • clean datasets and queries
    • CPU power
    • feedback from user communities
  • Current implementations do not entirely satisfy users' needs (usability, language independence, etc.)
  • Metrics in place need to be revisited and/or replaced by patent-specific metrics (e.g. PRES/Univ. Dublin)
  • Patents not only represent technical texts, but also a set of environmental attributes which have to be consulted in order to achieve the goals (IPC classes, patent searcher behaviours, legal changes, ...)

Thank you for your attention

Any questions?

Barrou DIALLO bdiallo@epo.org