
Applying IR to Patent Search
TREC 2009 Chem Track Experience
& A Recent Advancement in IR Theory

Le Zhao and Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
2010-11-16


Prior Art task 2009

  • Task: Patent as query, Citations as relevant results

  • Our approach

    • Date filtering (Prior) [Aleksandr Belinskiy, mailing-list communication]

      • Query patent:

        • Multiple priority dates – use latest priority date

      • Result patent:

        • Multiple dates – use publication date

    • Weighted bag of word queries (Relevant Art)

      • Title + Claims

      • Description

        • Too long; only used to weight terms (similar to selecting keywords)


Indri Query Example

  • #filrej( #dateafter(07/07/1994)

    #weight( 0.6 #combine( detergent compositions)

    0.4 #weight(

    16 1 14 bleaching 12 agent 11 composition 11 oxygen 11 7 10 4 8 u 8 2 7 o 7 available 6 claims 5 triazacyclononane 5 silver 5 coating 5 clo 5 organic 5 3 4 mn 4 minutes 3 co 3 mniii 3 0 3 5 3 bispyridylamine 3 description 3 n 3 containing 3 described 3 releasing 3 method 2 mixtures 2 time 2 compound 2 mixture 2 dentate 2 remainder 2 rate 2 mniv 2 source 2 tri 2 making 2 sprayed 2 intimate 2 completely 2 oac 2 cl 2 trimethyl 2 selected 2 premixed 2 bleach 2 dispersing 2 compositions 2 pf 2 released 2 perchlorate 2 oil 2 10 2 di 2 group 2 methyl 2 release 2 non 2 cobalt 2 consisting 2 interval 2 process 2 paraffin 2 particles 2 present 1 claim 1 perhydrate 1 nh 1 salt 1 copper 1 total 1 corrosion 1 bispyridyl 1 chlorate 1 bi 1 8 1 dry 1 measured 1 partially 1 mnivbipy 1 och 1 trisdipyridylamine 1 comprises 1 mnivn 1 isothiocyanato 1 ligands 1 combination 1 triglycerides 1 bis 1 amine 1 6 1 bipy 1 binuclear 1 pyridylamine 1 mniiimniv 1 relasing 1 inorganic 1 mixed 1 precursor 1 iron 1 hydrogenated 1 peroxyacid 1 additional 1 inhibitor 1 tetra 1 tris 1 level 1 derivatives 1 provided 1 diglycerides 1 gluconate 1 mono 1 wholly 1 complexed 1 catalyst ) ) )
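
For concreteness, the sketch below (plain Python, not the system's actual code) shows how a query of this shape could be assembled from a query patent: reject documents dated after the latest priority date, give weight 0.6 to the title + claims terms, and weight 0.4 to description terms weighted by their counts. The tokenizer, stopword handling, and count-based weights are illustrative assumptions.

    import re
    from collections import Counter

    def build_prior_art_query(title_claims, description, latest_priority_date,
                              stopwords=frozenset()):
        # Assumption: simple lowercase word tokenization and raw term counts as
        # description weights; the actual system may have differed.
        def tokenize(text):
            return [t for t in re.findall(r"[a-z0-9]+", text.lower())
                    if t not in stopwords]

        core = " ".join(tokenize(title_claims))
        desc_counts = Counter(tokenize(description))
        weighted = " ".join(f"{c} {t}" for t, c in desc_counts.most_common())

        # #filrej + #dateafter rejects documents published after the priority date.
        return (f"#filrej( #dateafter({latest_priority_date}) "
                f"#weight( 0.6 #combine( {core} ) 0.4 #weight( {weighted} ) ) )")

    # Hypothetical usage:
    # print(build_prior_art_query("Detergent compositions",
    #                             "An oxygen releasing bleaching agent ...",
    #                             "07/07/1994"))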


MAP Performance (2009)

Main Points

  • The optimal term weight (theory)

  • Failures of current IR models (practice)

  • Root cause of the problems

  • The beginning of a solution

[Slide figure labels: Query Patent Content; Result Patent Citations; Term Weighting; 75%; relational retrieval/citation finding (Ni Lao and William Cohen, ML 2010)]


The Optimal Term Weighting

Main Points

  • Binary Independence Model

    • [Robertson and Spärck Jones 1976]

    • “Relevance Weight”, “Term Relevance”

      • P(t | R) is effectively the only part about relevance.

  • The optimal term weight (theory)

  • Failures of current IR models (practice)

  • Root cause of the problems

  • The beginning of a solution

The relevance weight combines two factors:

  w(t) = log[ P(t | R) / (1 - P(t | R)) ]  +  log[ (1 - P(t | C)) / P(t | C) ]

         odds of P(t | R)  (necessity)        ~ idf  (sufficiency)
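
As a numeric illustration of this decomposition (a sketch only; P(t | C) stands in for the term's probability in non-relevant documents and is estimated here from document frequency):

    import math

    def rsj_weight(p_t_given_R, df, N):
        # Binary Independence Model relevance weight, split into the two parts above.
        # Assumption: P(t | C) is approximated by document frequency df / N.
        p = p_t_given_R
        q = df / N
        necessity_odds = math.log(p / (1 - p))   # odds of P(t | R)
        idf_part = math.log((1 - q) / q)         # ~ idf (sufficiency)
        return necessity_odds + idf_part

    # A rare term (df = 800 of N = 500,000) appearing in only 40% of relevant docs:
    # rsj_weight(0.4, 800, 500_000)  ->  about -0.41 + 6.44 = 6.03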


Definition of Necessity P(t | Rq)

Directly calculated given relevance judgements for q:

  P(t | Rq) = |relevant docs for q that contain t| / |relevant docs for q|

[Slide diagram: within the Collection, the set Relevant(q) overlaps the set of docs that contain t; in the example shown, P(t | Rq) = 0.4]

Term Necessity == term recall == 1 – mismatch
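
Computing it is straightforward once judgements exist. A minimal sketch (the qrels and per-document term sets are assumed inputs, not part of the talk):

    def term_necessity(term, query_id, qrels, doc_terms):
        # P(t | Rq): fraction of documents judged relevant for query_id that
        # contain `term`.  qrels: query_id -> {doc_id: judgement};
        # doc_terms: doc_id -> set of terms.  Both are assumed data structures.
        relevant = [d for d, judgement in qrels.get(query_id, {}).items() if judgement > 0]
        if not relevant:
            return 0.0
        containing = sum(1 for d in relevant if term in doc_terms.get(d, set()))
        return containing / len(relevant)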


Without Necessity

  • The emphasis problem for idf-only term weighting

  • Emphasize high idf terms in query

    • “prognosis/viability of a political third party in U.S.” (Topic 206)

  • Affects

    • tf*idf,

    • Okapi BM25,

    • Language Models,

      all models that use idf-only term weighting
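
To see why, compare idf weights on the Topic 206 terms. The document frequencies below are made up for illustration, but the pattern is what matters: the rarest, most mismatch-prone terms get the largest weight.

    import math

    N = 500_000                                  # hypothetical collection size
    df = {"prognosis": 800, "viability": 1_200,  # hypothetical document frequencies
          "political": 60_000, "third": 90_000, "party": 80_000}

    for term, d in sorted(df.items(), key=lambda x: x[1]):
        print(f"{term:10s} idf = {math.log(N / d):.2f}")
    # "prognosis" and "viability" dominate the query, yet a relevant document can
    # easily discuss a political third party without using either word.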


    Ground Truth

    TREC 4 topic 206

    [Slide figure illustrating the emphasis problem on this topic]

    Indri Top Results

    1. (ZF32-220-147) Recession concerns lead to a discouraging prognosis for 1991

    2. (AP880317-0017) Politics … party … Robertson's viability as a candidate

    3. (WSJ910703-0174) political parties …

    4. (AP880512-0050) there is no viable opposition …

    5. (WSJ910815-0072) A third of the votes

    6. (WSJ900710-0129) politics, party, two thirds

    7. (AP880729-0250) third ranking political movement…

    8. (AP881111-0059) political parties

    9. (AP880224-0265) prognosis for the Sunday school

    10. (ZF32-051-072) third party provider

    (Google and Bing still have false positives in their top 10 for this query; emphasis is also a problem for large search engines!)


    Without Necessity

    • The emphasis problem for idf-only term weighting

    • Emphasize high idf terms in query

      • “prognosis/viability of a political third party in U.S.” (Topic 206)

    • False positives throughout rank list

      • especially detrimental at top rank

    • Ignoring term recall hurts precision at all recall levels

    • Affected models: BIM, and also the more advanced tf*idf, Okapi BM25, and LM variants that use tf.

  • How significant is the emphasis problem?


    Failure Analysis of 44 Topics from TREC 6-8

    Main Points

    • The optimal term weight (theory)

    • Failures of current IR models (practice)

    • Root cause of the problems

    • The beginning of a solution

    Necessity term weighting

    Term expansion

    Basis: Term Mismatch Problem

    RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)


    Given True Necessity

    • +100% over BIM (in precision at all recall levels)

      • [Robertson and Spärck Jones 1976]

  • +30-80% over Language Model, BM25 (in MAP)

    • [Zhao and Callan 2010]

  • These gains are the upper limit for necessity term weighting alone

  • Solving mismatch would give more gain!
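
One simple way to use necessity estimates (true or predicted) is as per-term weights in an Indri #weight query. The sketch below only illustrates the idea, not the exact weighting used in the cited experiments, and the necessity values are hypothetical.

    def necessity_weighted_query(necessity):
        # necessity: term -> estimated P(t | R); drop terms with a zero estimate.
        parts = " ".join(f"{p:.2f} {t}" for t, p in necessity.items() if p > 0)
        return f"#weight( {parts} )"

    # Hypothetical necessity estimates for Topic 206:
    print(necessity_weighted_query({"prognosis": 0.25, "viability": 0.30,
                                    "political": 0.95, "third": 0.90, "party": 0.95}))
    # -> #weight( 0.25 prognosis 0.30 viability 0.95 political 0.90 third 0.95 party )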


    The Mismatch Problem Causes the Emphasis Problem

    • Emphasis problem: high mismatch & high idf

    • Solving mismatch solves emphasis problem!


    Expansion for Individual Terms

    Main Points

    • This works great: (prognosis OR viability OR possibility OR impossibility OR future) AND (political third party)

    • Even just this is better than the original: (prognosis OR viability) AND (political third party) (an Indri version is sketched after this list)

    • The optimal term weight (theory)

    • Failures of current IR models (practice)

    • Root cause of the problems

    • The beginning of a solution
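
The Boolean queries above can be approximated in Indri with #syn groups inside #combine. These translations are illustrative, not the runs used in the talk.

    # Expanding the high-mismatch term with synonyms, one #syn group per concept:
    expanded_query = ("#combine( #syn( prognosis viability possibility impossibility future ) "
                      "political third party )")

    # Even the minimal two-word expansion helps:
    minimal_query = "#combine( #syn( prognosis viability ) political third party )"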


    WikiQuery

    Summary

    • A tool to easily create such complex queries

    • To easily modify and see the results

    • To store high quality queries

    • To share with others

    • To collaboratively & iteratively build a perfect query

    • The optimal term weight (theory)

    • == term mismatch + idf

    • Failures of current IR models (practice)

    • Emphasis (64%) + Mismatch (27%)

    • Root cause of the problems

    • Mismatch

    • The beginning of a solution: WikiQuery


    Feedback

    • Questions? Comments? Ideas?

    • Want to be Users?

      (Le Zhao: [email protected])

    Le Zhao and Jamie Callan. Term Necessity Prediction. CIKM 2010


    Failure Analysis – False Positives (EP topic 3)

    • Q: Oxygen-releasing (controlled release) bleaching agent, with a non-paraffin oil organic silver coating agent, and additional corrosion inhibitor compound

    • Top ranked results (cited means relevant)

      Each result below is shown as Relevance: [Title]: [Summary of invention]

      • NR: Controlled release laundry bleach product (+ 2 more similar results)

      • NR: Bleach activation: improved bleach catalyst for low temperatures

      • NR: Accelerated release laundry bleach product

      • R: Bleach activation: activated by a catalytic amount of a transition metal complex

      • R: Concentrated detergent powder compositions: a surfactant, a detergency builder, enzymes, a peroxygen compound bleach and a manganese complex as effective bleach catalyst


    Learning (false positives)

    • Bag-of-words will fail in many cases

      • Most false positives have reasonably relevant descriptions

    • The most important part of a query patent is its novel part

      • Typically a small part of the document

      • Human intervention needed?


    Failure Analysis – Misses (EP topic 3)

    • Misses

      • Cited in the content:

      • “other catalyst examples include EPxxxx, USxxxx …”

      • about an unimportant area of the patent

      • These mentions also include the returned relevant patents


    Learning (misses)

    • Patents cite related prior patents

      • Many citations are

        • mentions of prior art made by the query patent

          • about an unimportant part of the patent

        • If these are relevant, all false positives can be relevant

      • Will a patent cite other patents that may invalidate itself?

        • Mechanism to ensure that? Increased application fee?

    • For evaluation: what to include as Relevant?

      • Use the whole original reference list?

      • or only use citations added by others?

      • Patents have well marked search reports

        • For EP: X, Y; for US: *; judged by patent offices

        • EP topics 1-6: only topic 6 has 4 X/Y citations, but a lot more applicant citations

          • We need better test sets

