
Relation Extraction for Academic Collaboration 10-709 Project Proposal

Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang. Jan 26, 2006.


Presentation Transcript


  1. Relation Extraction for Academic Collaboration: 10-709 Project Proposal
     Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang
     Jan 26, 2006

  2. Relation Extraction
     • We want: CollaboratesWith( <x>, <y> ), where <x> and <y> are of type ‘person’
     • Two redundant sources of information for co-training:
       • Extraction Patterns find Relations expressed in surface text or tables on the web
       • A rote learner keeps track of the Relations it is told about, aggregating evidence in the form of confidence scores when a Relation is extracted multiple times from different sources

  3. Sketch of a Co-Training Algorithm
     Let: R = a set of Relations; P = a set of Extraction Patterns
     Initialize: R <- seed Relations, P <- seed Patterns
     Do, until a termination condition is reached:
     • For each p in P, where p is of the form ( “before context”, <x>, “between context”, <y>, “after context” ), query Google using the literal context strings in the Pattern to retrieve text windows from which a set of Relations ( <x>, <y> ) can be extracted.
     • For each new Relation, compute a new confidence score and add it to R, combining evidence if necessary.
     • Weed out any r in R whose confidence is below a threshold or, optionally, any r whose arguments are unlikely to be of type person.
     • For each r in R, where r is of the form ( <x>, <y> ), query Google to retrieve a set of text windows containing the strings <x> and <y>. From these text windows, generalize a set of Patterns ( “before”, <x>, “between”, <y>, “after” ).
     • For each new Pattern, compute a new confidence score and add it to P, combining evidence if necessary.
     • Weed out any p in P whose confidence is below a threshold.
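The loop on this slide can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the proposal: an in-memory list of text windows stands in for the Google queries, a two-capitalized-words regex stands in for a 'person' argument, generalization keeps only the "between" context, the `combine` update is a placeholder, and the loop stops when a round yields no new evidence (the slides leave the termination condition open). All function names are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str]
Pattern = Tuple[str, str, str]  # literal (before, between, after) contexts

# Crude stand-in for a 'person' argument: two capitalized words (assumption).
NAME = r"([A-Z][a-z]+ [A-Z][a-z]+)"


def extract_relations(p: Pattern, windows: List[str]) -> List[Relation]:
    """Apply one pattern to every text window and collect (<x>, <y>) pairs."""
    before, between, after = p
    regex = re.compile(
        re.escape(before) + NAME + re.escape(between) + NAME + re.escape(after)
    )
    return [(m.group(1), m.group(2)) for w in windows for m in regex.finditer(w)]


def generalize_patterns(r: Relation, windows: List[str]) -> List[Pattern]:
    """Simplified generalization: keep only the text between <x> and <y>."""
    x, y = r
    regex = re.compile(re.escape(x) + r"(.{1,40}?)" + re.escape(y))
    return [("", m.group(1), "") for w in windows for m in regex.finditer(w)]


def combine(old: float, evidence: float) -> float:
    """Placeholder update: move `evidence` of the way from `old` toward 1.0."""
    return old + (1.0 - old) * evidence


def co_train(seed_relations: Dict[Relation, float],
             seed_patterns: Dict[Pattern, float],
             windows: List[str],
             threshold: float = 0.3,
             max_rounds: int = 5):
    R = dict(seed_relations)
    P = dict(seed_patterns)
    seen: Set[tuple] = set()  # (kind, source, target) triples already counted
    for _ in range(max_rounds):
        new_evidence = False
        # Patterns -> Relations
        for p, p_conf in list(P.items()):
            for r in extract_relations(p, windows):
                if ("rel", p, r) not in seen:
                    seen.add(("rel", p, r))
                    R[r] = combine(R.get(r, 0.0), p_conf)
                    new_evidence = True
        R = {r: c for r, c in R.items() if c >= threshold}
        # Relations -> Patterns
        for r, r_conf in list(R.items()):
            for p in generalize_patterns(r, windows):
                if ("pat", r, p) not in seen:
                    seen.add(("pat", r, p))
                    P[p] = combine(P.get(p, 0.0), r_conf)
                    new_evidence = True
        P = {p: c for p, c in P.items() if c >= threshold}
        if not new_evidence:  # assumed termination condition
            break
    return R, P
```

Starting from the single seed pattern `("", " in collaboration with ", "")`, a second window that mentions the same pair with "joint work with" lets the sketch learn that pattern, which in turn extracts pairs the seed pattern never saw. The `seen` set keeps the same pattern/relation pairing from being counted as fresh evidence in every round.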

  4. Coverage as a Confidence Measure
     • Confidence for an Extraction Pattern p
       • For each r in R, query Google to see if p can extract r
       • Coverage is the number of relations in R extractable by p, divided by |R|
     • Confidence for a Relation r
       • For each p in P, query Google to see if p can extract r
       • Similarly, coverage is the number of patterns in P that can extract r, divided by |P|
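The two coverage ratios can be written down directly. In this minimal sketch, `can_extract` is a hypothetical callback standing in for the Google query that checks whether pattern p can extract relation r; returning 0.0 for an empty set is also an assumption, since the slide does not cover that case.

```python
from typing import Callable, Iterable, Tuple

Relation = Tuple[str, str]
Pattern = Tuple[str, str, str]  # (before, between, after) contexts

# Hypothetical check: "can pattern p extract relation r?"
CanExtract = Callable[[Pattern, Relation], bool]


def pattern_confidence(p: Pattern, relations: Iterable[Relation],
                       can_extract: CanExtract) -> float:
    """Number of relations in R extractable by p, divided by |R|."""
    relations = list(relations)
    if not relations:
        return 0.0  # assumption: no evidence means zero confidence
    hits = sum(1 for r in relations if can_extract(p, r))
    return hits / len(relations)


def relation_confidence(r: Relation, patterns: Iterable[Pattern],
                        can_extract: CanExtract) -> float:
    """Number of patterns in P that can extract r, divided by |P|."""
    patterns = list(patterns)
    if not patterns:
        return 0.0  # assumption: no evidence means zero confidence
    hits = sum(1 for p in patterns if can_extract(p, r))
    return hits / len(patterns)
```

Each score costs one query per element of the other set, so scoring every pattern against every relation takes on the order of |P| * |R| queries per round.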

  5. Combining Confidence Scores
     • Given a Relation with confidence c
     • The Relation is extracted again; the pattern has confidence p
     • The new extraction has confidence score s (may be < c)
     • One idea: MYCIN Calculus [Shortliffe 76]
       • new confidence = c + ( 1 - c ) * p * s
       • Intuitively, going a fraction p * s of the way from the old confidence c to the maximal confidence 1.0
     • Another idea: new confidence = ( c + p * s ) / ( 1 + c * p * s )
       • Confidences increase monotonically and stay between 0 and 1.0, but never reach 1.0
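Both update rules are short enough to try out numerically. The sketch below transcribes the two formulas from this slide; the function names are made up for illustration, and all inputs are assumed to lie in [0, 1].

```python
def combine_mycin(c: float, p: float, s: float) -> float:
    """MYCIN-style update: go p * s of the way from c toward 1.0."""
    return c + (1.0 - c) * p * s


def combine_ratio(c: float, p: float, s: float) -> float:
    """Alternative update: (c + p * s) / (1 + c * p * s)."""
    e = p * s  # strength of the new evidence
    return (c + e) / (1.0 + c * e)
```

For example, with c = 0.5, p = 0.8 and s = 0.5, the new evidence has strength p * s = 0.4; the MYCIN-style rule gives 0.5 + 0.5 * 0.4 = 0.7, and the alternative gives 0.9 / 1.2 = 0.75. Neither rule ever lowers c, and both stay below 1.0 as long as c and p * s are below 1.0.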

  6. Example Seed Data for Co-Training
     • Extraction Patterns
       • <x> “in collaboration with” <y>
       • <x> “joint work with” <y>
       • Patterns that extract information from tables, lists of citations, etc.
     • Relations
       • CollaboratesWith( mbilotti, ehn )
       • CollaboratesWith( jbetter, teruko )
       • ...

  7. Extraction Pattern Examples
     • Query: “in collaboration with” site:web.mit.edu/biology/www
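A query like the one on this slide can be assembled from a pattern's literal context strings. The helper below is a hypothetical sketch: it quotes each non-empty context as an exact phrase and optionally appends the `site:` restriction shown in the example. How the proposal actually issued its queries is not stated.

```python
from typing import Optional, Sequence


def build_query(contexts: Sequence[str], site: Optional[str] = None) -> str:
    """Quote each non-empty literal context and optionally restrict to a site."""
    parts = ['"%s"' % c.strip() for c in contexts if c.strip()]
    if site:
        parts.append("site:" + site)
    return " ".join(parts)


query = build_query(("", " in collaboration with ", ""),
                    site="web.mit.edu/biology/www")
# query == '"in collaboration with" site:web.mit.edu/biology/www'
```

The <x> and <y> slots are left out of the query on purpose: they are filled in afterwards from the text windows that the search returns.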

  8. Open Questions
     • Additional useful sources of information:
       • Anchor text and link structure: advisor-advisee cross-references, department or lab organization
       • Heuristics or Named Entity Recognition to weed out relation arguments that are not people
     • Confidence metrics for patterns and relations
     • Methods of combining confidence scores
     • Termination condition
