Applying IR to Patent Search: TREC 2009 Chem Track Experience & A Recent Advancement in IR Theory

Le Zhao and Jamie Callan
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
2010-11-16


Prior Art task 2009

  • Task: Patent as query, Citations as relevant results

  • Our approach

    • Date filtering (Prior) [Aleksandr Belinskiy, mailing list communication]

      • Query patent:

        • Multiple priority dates – use latest priority date

      • Result patent:

        • Multiple dates – use publication date

    • Weighted bag-of-words queries (Relevant Art)

      • Title + Claims

      • Description

        • Too long; used only to weight terms (similar to selecting keywords)


Indri Query Example

  • #filrej( #dateafter(07/07/1994)

    #weight( 0.6 #combine( detergent compositions)

    0.4 #weight(

    16 1 14 bleaching 12 agent 11 composition 11 oxygen 11 7 10 4 8 u 8 2 7 o 7 available 6 claims 5 triazacyclononane 5 silver 5 coating 5 clo 5 organic 5 3 4 mn 4 minutes 3 co 3 mniii 3 0 3 5 3 bispyridylamine 3 description 3 n 3 containing 3 described 3 releasing 3 method 2 mixtures 2 time 2 compound 2 mixture 2 dentate 2 remainder 2 rate 2 mniv 2 source 2 tri 2 making 2 sprayed 2 intimate 2 completely 2 oac 2 cl 2 trimethyl 2 selected 2 premixed 2 bleach 2 dispersing 2 compositions 2 pf 2 released 2 perchlorate 2 oil 2 10 2 di 2 group 2 methyl 2 release 2 non 2 cobalt 2 consisting 2 interval 2 process 2 paraffin 2 particles 2 present 1 claim 1 perhydrate 1 nh 1 salt 1 copper 1 total 1 corrosion 1 bispyridyl 1 chlorate 1 bi 1 8 1 dry 1 measured 1 partially 1 mnivbipy 1 och 1 trisdipyridylamine 1 comprises 1 mnivn 1 isothiocyanato 1 ligands 1 combination 1 triglycerides 1 bis 1 amine 1 6 1 bipy 1 binuclear 1 pyridylamine 1 mniiimniv 1 relasing 1 inorganic 1 mixed 1 precursor 1 iron 1 hydrogenated 1 peroxyacid 1 additional 1 inhibitor 1 tetra 1 tris 1 level 1 derivatives 1 provided 1 diglycerides 1 gluconate 1 mono 1 wholly 1 complexed 1 catalyst ) ) )
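The description-derived #weight part of such a query can be generated automatically. A minimal Python sketch of that idea (the stopword list, tokenization, and 0.6/0.4 split are illustrative assumptions, not the exact TREC system):

    import re
    from collections import Counter

    # Illustrative stopword list; the real system's preprocessing is not specified here.
    STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are",
                 "said", "wherein", "thereof"}

    def build_weighted_query(title_claims, description, priority_date,
                             w_title=0.6, w_desc=0.4):
        """Build an Indri query in the spirit of the example above:
        title+claims in a #combine, description terms in a #weight whose
        weights are raw term counts, all wrapped in a date filter."""
        terms = [t for t in re.findall(r"[a-z0-9]+", description.lower())
                 if t not in STOPWORDS]
        counts = Counter(terms)
        desc_part = " ".join(f"{c} {t}" for t, c in counts.most_common())
        title_part = " ".join(re.findall(r"[a-z0-9]+", title_claims.lower()))
        return (f"#filrej( #dateafter({priority_date}) "
                f"#weight( {w_title} #combine( {title_part} ) "
                f"{w_desc} #weight( {desc_part} ) ) )")

    # Hypothetical usage:
    # build_weighted_query("detergent compositions", description_text, "07/07/1994")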


MAP Performance (2009)

Main Points

  • The optimal term weight (theory)

  • Failures of current IR models (practice)

  • Root cause of the problems

  • The beginning of a solution

[Slide figure omitted; labels: Query Patent Content, Term Weighting, Result Patent Citations, 75%; relational retrieval / citation finding (Ni Lao and William Cohen, ML 2010)]


The Optimal Term Weighting

Main Points

  • Binary Independence Model

    • [Robertson and Spärck Jones 1976]

    • “Relevance Weight”, “Term Relevance”

      • P(t | R) is effectively the only part about relevance.

  • The optimal term weight (theory)

  • Failures of current IR models (practice)

  • Root cause of the problems

  • The beginning of a solution

[Formula on the slide, annotated with: idf (sufficiency); odds of P(t | R)]
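For reference, a standard form of the Robertson/Spärck Jones relevance weight behind this slide (a reconstruction, not copied from the deck), with R̄ denoting non-relevance; estimating P(t | R̄) from collection statistics makes the second term an idf-like quantity, leaving P(t | R) as the only relevance-dependent part:

    w_t = \log \frac{P(t \mid R)\,\bigl(1 - P(t \mid \bar{R})\bigr)}{P(t \mid \bar{R})\,\bigl(1 - P(t \mid R)\bigr)}
        = \underbrace{\log \frac{P(t \mid R)}{1 - P(t \mid R)}}_{\text{odds of } P(t \mid R)}
        \;+\; \underbrace{\log \frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})}}_{\approx\, \mathrm{idf}\ \text{(sufficiency)}}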


Definition of Necessity P(t | Rq)

[Venn diagram: the collection contains the set of relevant documents for q ("Relevant (q)") and the set of documents that contain t; the overlap gives P(t | Rq) = 0.4 in this example]

  • Directly calculated given relevance judgements for q

  • Term Necessity == term recall == 1 − mismatch
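Given relevance judgements, necessity is simply the fraction of relevant documents that contain the term. A minimal sketch, assuming judgements are available as a set of relevant document IDs and documents as term sets (both structures are illustrative):

    def term_necessity(term, relevant_doc_ids, doc_terms):
        """P(t | Rq): fraction of judged-relevant documents for q that contain `term`.
        `doc_terms` maps a document ID to the set of terms it contains."""
        if not relevant_doc_ids:
            return 0.0
        hits = sum(1 for d in relevant_doc_ids if term in doc_terms[d])
        return hits / len(relevant_doc_ids)

    # e.g. if 2 of the 5 relevant documents contain the term, necessity = 0.4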


Without Necessity

  • The emphasis problem for idf-only term weighting

  • Emphasize high idf terms in query

    • “prognosis/viability of a political third party in U.S.” (Topic 206)

  • Affects tf*idf, Okapi BM25, and Language Models: all models that use idf-only term weighting
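A toy illustration of the emphasis problem (the collection size and document frequencies below are made up): with idf-only weighting, the rare but frequently mismatching terms "prognosis" and "viability" dominate the terms that actually define the information need.

    import math

    N = 500_000                                   # hypothetical collection size
    df = {"prognosis": 800, "viability": 1200,    # hypothetical document frequencies
          "political": 60_000, "third": 90_000, "party": 70_000}

    idf = {t: math.log(N / df_t) for t, df_t in df.items()}
    for t, w in sorted(idf.items(), key=lambda kv: -kv[1]):
        print(f"{t:10s} idf = {w:.2f}")
    # "prognosis" and "viability" get the largest weights, so documents that
    # match only those terms can outrank documents about a political third party.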


    Ground Truth (TREC 4 topic 206)

    [Slide figure omitted; labels: Ground Truth, Emphasis]


    Indri Top Results

    1. (ZF32-220-147) Recession concerns lead to a discouraging prognosis for 1991

    2. (AP880317-0017) Politics … party … Robertson's viability as a candidate

    3. (WSJ910703-0174) political parties …

    4. (AP880512-0050) there is no viable opposition …

    5. (WSJ910815-0072) A third of the votes

    6. (WSJ900710-0129) politics, party, two thirds

    7. (AP880729-0250) third ranking political movement…

    8. (AP881111-0059) political parties

    9. (AP880224-0265) prognosis for the Sunday school

    10. (ZF32-051-072) third party provider

    (Google and Bing also have false positives in their top 10; emphasis is a problem for large search engines too!)


    Without Necessity

    • The emphasis problem for idf-only term weighting

    • Emphasize high idf terms in query

      • “prognosis/viability of a political third party in U.S.” (Topic 206)

    • False positives throughout rank list

      • especially detrimental at top rank

    • No term recall hurts precision at all recall levels

    • Affected models: BIM, and also the more advanced tf*idf, Okapi BM25, and language models that use tf.

  • How significant is the emphasis problem?


    Failure Analysis of 44 Topics from TREC 6-8

    Main Points

    • The optimal term weight (theory)

    • Failures of current IR models (practice)

    • Root cause of the problems

    • The beginning of a solution

    [Slide figure omitted; labels: Necessity term weighting, Term expansion, Basis: Term Mismatch Problem, RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)]


    Given True Necessity

    • +100% over BIM (in precision at all recall levels)

      • [Robertson and Spärck Jones 1976]

  • +30-80% over Language Model, BM25 (in MAP)

    • [Zhao and Callan 2010]

  • This is the limit for necessity term weighting alone

  • Solving mismatch would give more gain!
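  Operationally, necessity values simply become query term weights. For example, in Indri (the weights below are made-up illustrations for topic 206, not the judged values):

      #weight( 0.9 party  0.8 political  0.6 third  0.3 viability  0.2 prognosis )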


    The Mismatch Problem Causes the Emphasis Problem

    • Emphasis problem: high mismatch & high idf

    • Solving mismatch solves emphasis problem!


    Expansion for Individual Terms

    Main Points

    • This works great: (prognosis OR viability OR possibility OR impossibility OR future) AND (political third party) (see the Indri sketch after this list)

    • Even just this is better than the original: (prognosis OR viability) AND (political third party)

    • The optimal term weight (theory)

    • Failures of current IR models (practice)

    • Root cause of the problems

    • The beginning of a solution
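    A possible Indri rendering of the expanded query above (the operator choice is an illustration, not the authors' exact formulation; #syn groups the expansion terms as synonyms, and #band could be used instead of #combine for a strict Boolean AND):

        #combine( #syn( prognosis viability possibility impossibility future )
                  political third party )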


    WikiQuery

    • A tool to easily create such complex queries

    • To easily modify and see the results

    • To store high-quality queries

    • To share with others

    • To collaboratively & iteratively build a perfect query

    Summary

    • The optimal term weight (theory) == term mismatch + idf

    • Failures of current IR models (practice): Emphasis (64%) + Mismatch (27%)

    • Root cause of the problems: Mismatch

    • The beginning of a solution: WikiQuery


    Feedback

    • Questions? Comments? Ideas?

    • Want to be Users?

      (Le Zhao: [email protected])

    Le Zhao and Jamie Callan. Term Necessity Prediction. CIKM 2010


    Failure Analysis – False Positives (EP topic 3)

    • Q: Oxygen-releasing (controlled release) bleaching agent, with a non-paraffin oil organic silver coating agent, and additional corrosion inhibitor compound

    • Top ranked results (cited means relevant)

      Each result is shown as Relevance: [Title]: [Summary of invention]

      • NR: Controlled release laundry bleach product (+ 2 more)

      • NR: Bleach activation: improved bleach catalyst for low temperatures

      • NR: Accelerated release laundry bleach product

      • R: Bleach activation: activated by a catalytic amount of a transition metal complex

      • R: Concentrated detergent powder compositions: a surfactant, a detergency builder, enzymes, a peroxygen compound bleach and a manganese complex as effective bleach catalyst


    Learning (false positives)

    • Bag-of-words will fail in many cases

      • Most false positives have reasonably relevant descriptions

    • The most important part of a query patent is its novel part

      • Typically a small part of the document

      • Human intervention needed?


    Failure Analysis – Misses (EP topic 3)

    • Misses

      • Cited in the content:

      • “other catalyst examples include EPxxxx, USxxxx …”

      • about an unimportant area of the patent

      • These mentions also include the returned relevant patents


    Learning (misses)

    • Patents cite related prior patents

      • Many citations are

        • mentions of prior art made by the query patent

          • about an unimportant part of the patent

        • If these are relevant, all false positives can be relevant

      • Will a patent cite other patents that may invalidate itself?

        • Mechanism to ensure that? Increased application fee?

    • For evaluation: what to include as Relevant?

      • Use the whole original reference list?

      • or only use citations added by others?

      • Patents have well marked search reports

        • For EP: categories X and Y; for US: *; judged by the patent offices

        • Of EP topics 1-6, only topic 6 has 4 X/Y citations, but many more applicant citations

          • We need better test sets
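    If one wanted to build such judgements from the search reports only, the filtering step could look like this sketch (the citation record fields are hypothetical):

        def qrels_from_search_reports(citations):
            """Keep only examiner citations marked highly relevant in the search
            report (X/Y for EP, '*' for US); drop applicant-supplied citations."""
            keep = {"X", "Y", "*"}
            return [c["doc_id"] for c in citations
                    if c["source"] == "examiner" and c["category"] in keep]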

