
4. Relationship Extraction

Part 4 of Information Extraction

Sunita Sarawagi

CS 652, Peter Lindes


The Problem

  • Relate extracted entities – unstructured text not partitioned into records

  • Various competitions

    • MUC

    • ACE

    • BioCreAtIvE II Protein-Protein Interaction



Groups of Relationships

  • ACE:

    • Relation types: located-at, near, part, role, and social

    • Entity types: person, organization, facility, location, and geo-political entity

  • Biomedical: gene-disease, protein-protein, subcellular localization

  • NAGA knowledge base: 26 relationships such as: isA, bornInYear, establishedInYear, hasWonPrize, locatedIn, politicianOf, …



Three Problem Levels

  • First case:

    • Entities preidentified in unstructured text

    • Given a pair of entities, find type of relationship

  • Second case:

    • Given relationship type r, entity name e

    • Extract entities with which e has relationship r

  • Third case:

    • Open-ended corpus – the web

    • Given relationship type r, find entity pairs



Given Entity Pair, Find Relationship

  • R: set of relationship types

  • R ∪ {other}: R plus a special member for "other"

  • x: a “snippet” of text (might be a sentence)

  • E1 and E2 in x

  • Identify the relationship between E1 and E2

  • Resources available:

    • Surface Tokens

    • Part of Speech tags

    • Syntactic Parse Tree Structure

    • Dependency Graph

  • Use these clues to classify (x, E1, E2) into one of R ∪ {other}
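
A minimal sketch of the feature-based view of this classification task. The feature names, the whitespace tokenization, and the toy sentence are illustrative assumptions, not taken from the slides:

```python
# Toy feature extractor for a (snippet, E1, E2) triple.
# Real systems would add POS tags, parse-tree paths, and dependency-graph
# features, as listed above; everything here is a simplified assumption.

def extract_features(snippet, e1, e2):
    """Collect simple surface-token clues about the pair (e1, e2)."""
    tokens = snippet.split()                    # naive whitespace tokenization
    i1, i2 = tokens.index(e1), tokens.index(e2)
    lo, hi = min(i1, i2), max(i1, i2)
    between = tokens[lo + 1:hi]                 # words between the two entities
    feats = {f"between={w.lower()}" for w in between}
    feats.add(f"e1_before_e2={i1 < i2}")
    feats.add(f"num_words_between={len(between)}")
    return feats

# Such a feature set, built for each labeled (x, E1, E2), can feed any
# multiclass classifier over R plus the "other" label.
print(extract_features("Google acquired YouTube in 2006", "Google", "YouTube"))
```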



Parse Tree



Dependency Graph



Methods to Extract Relationships

  • Feature-based methods

    • String form, orthographic type, POS tag, etc.

    • Features from Dependency Graph

    • Features from Word Sequence

    • Features from Parse Trees

  • Kernel-based methods

    • Kernel function K(X, X') captures the similarity between two instances (see the sketch after this list)

    • Support Vector Machine (SVM) classifier

  • Rule-based methods
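
As a rough illustration of the kernel-based route referenced above, the sketch below uses a naive word-overlap kernel as a stand-in for the subsequence and tree kernels used in the literature; the kernel choice, snippets, and labels are assumptions made up for the example:

```python
# Kernel-based classification sketch: K(X, X') scores the similarity of two
# snippets, and an SVM is trained directly on the kernel matrix.
# The Jaccard word-overlap kernel is a toy stand-in, not a published
# relation-extraction kernel.
import numpy as np
from sklearn.svm import SVC

def overlap_kernel(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))   # Jaccard word overlap

snippets = ["X acquired Y", "X bought Y last year", "X was born in Y"]
labels = ["Acquired", "Acquired", "other"]

K = np.array([[overlap_kernel(a, b) for b in snippets] for a in snippets])
clf = SVC(kernel="precomputed").fit(K, labels)

# Classify a new snippet from its kernel values against the training snippets.
test = "X acquired W"
K_test = np.array([[overlap_kernel(test, b) for b in snippets]])
print(clf.predict(K_test))
```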



Given Relationship, Find Entity Pairs

  • Given one or more relationship types

  • Find all occurrences in a corpus

  • Open document collection

  • No labeled unstructured training data

  • Instead, seed data for each relationship type is used



Seed Data for Relationship Type r

  • The types of entities that are arguments of r

    • Often specified at a high level, e.g., proper noun, common noun, numeric, etc.

    • Types such as “Person” or “Company” require patterns to recognize them

  • A seed database S of entity pairs related by r

    • May include negative examples

  • A seed set of manually coded patterns

    • Easy for generic relationships, e.g., hypernym or meronym (part-of)
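
To make the three kinds of seed input concrete, here is one possible representation. Every value below, including the seed pairs and the regex, is an illustrative assumption; only the relationship names come from the example-data slide later on:

```python
# One possible representation of the seeding inputs for relationship types;
# all concrete values are made-up placeholders.
import re

# Argument entity types for each relationship (from the example-data slide).
entity_types = {"IsPhDAdvisorOf": ("Person", "Person"),
                "Acquired": ("Company", "Company")}

# Seed database S of entity pairs, optionally including negative examples.
seeds = [("AcmeCorp", "WidgetCo", "Acquired", True),    # positive example
         ("AcmeCorp", "GadgetInc", "Acquired", False)]  # negative example

# Hand-coded patterns are easy to write for generic relationships such as
# hypernymy ("X such as Y", in the spirit of Hearst patterns).
hypernym = re.compile(r"(\w+),? such as (\w+)")
print(hypernym.findall("companies such as AcmeCorp expanded quickly"))
```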



3 Steps for Relationship Extraction

  • Start with the seeding data described above:

    • A corpus D

    • Relationship types r1,…,rk

    • Entity types Tr1, Tr2 for each r

    • A set S of examples (Ei1, Ei2, ri), 1 ≤ i ≤ N

  • 1: Use S to learn extraction patterns M

  • 2: Use a subset of patterns to create candidates

  • 3: Validation: select a subset based on statistical tests
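
Putting the three steps together, here is a toy end-to-end sketch in the spirit of iterative pattern-based systems such as that of Agichtein and Gravano (cited later). Patterns here are just the literal text between the two seed entities, and the "statistical test" is a bare occurrence count; the corpus and seed pairs are chosen only for illustration:

```python
# Toy version of the three steps; patterns are the literal text between the
# seed entities and validation is a bare count threshold. Real systems
# generalize patterns and use corpus-wide statistics (see the later slides).
from collections import Counter
import re

corpus = [
    "Google acquired YouTube in 2006 .",
    "Oracle acquired Sun .",
    "Facebook acquired Instagram for a large sum .",
    "Google and Microsoft compete fiercely .",
]
seeds = {("Google", "YouTube"), ("Oracle", "Sun")}   # seed pairs for r = Acquired

# Step 1: learn extraction patterns M from the seed pairs.
patterns = set()
for e1, e2 in seeds:
    for sent in corpus:
        m = re.search(re.escape(e1) + r"\s+(.+?)\s+" + re.escape(e2), sent)
        if m:
            patterns.add(m.group(1))                 # e.g. "acquired"

# Step 2: apply the patterns to the corpus to create candidate pairs.
candidates = Counter()
for p in patterns:
    for sent in corpus:
        m = re.search(r"(\S+)\s+" + re.escape(p) + r"\s+(\S+)", sent)
        if m:
            candidates[(m.group(1), m.group(2))] += 1

# Step 3: validate, here with a trivial count-based test.
print([pair for pair, n in candidates.items() if n >= 1])
```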



Example Data

  • Relationships: “IsPhDAdvisorOf”, “Acquired”

  • Entity types: (Person, Person) and (Company, Company), respectively



Learn Patterns from Seed Triples

  • Assume only one relationship for each pair

  • Thus each example for r serves as a negative example for every other r' (see the sketch after this list)

  • 1: Find sentences with entity pairs

    • For each (E1, E2, r), query for "E1 NEAR E2"

    • Filter out sentences where E1, E2 do not match types Tr1, Tr2

  • 2: Filter sentences for the relationship

  • 3: Learn patterns from sentences
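
A small sketch of how seed triples become labeled training data under the one-relationship-per-pair assumption above (all entity names are fictional placeholders):

```python
# Build positive and negative training examples from seed triples, assuming
# each entity pair participates in exactly one relationship, so a positive
# example for r doubles as a negative example for every other r'.
# All entity names are fictional.
seed_triples = [("Prof. A", "Student B", "IsPhDAdvisorOf"),
                ("AcmeCorp", "WidgetCo", "Acquired")]
relationship_types = {"IsPhDAdvisorOf", "Acquired"}

training = []
for e1, e2, r in seed_triples:
    training.append(((e1, e2), r, +1))                   # positive for r
    for other in relationship_types - {r}:
        training.append(((e1, e2), other, -1))           # negative for r'

for example in training:
    print(example)
```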



Filtering Sentences

  • Example: (example sentence shown as a figure on the original slide)

  • Banko: a simple heuristic using the length of dependency links between the entities (sketched below)

  • This heuristic fails for the example above
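
A sketch of a dependency-length heuristic of the kind attributed to Banko above: keep a sentence only if the two entities are linked by a short path in its dependency graph. The graph encoding, the threshold, and the example are assumptions for illustration:

```python
# Keep a sentence only if the dependency path between the entities is short.
# The graph, threshold, and example below are illustrative assumptions.
from collections import deque

def dep_path_length(graph, src, dst):
    """Shortest number of dependency edges between two tokens (BFS)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

# Undirected dependency edges for "Google acquired YouTube" (toy graph).
graph = {"acquired": ["Google", "YouTube"],
         "Google": ["acquired"], "YouTube": ["acquired"]}

keep = dep_path_length(graph, "Google", "YouTube") <= 3   # assumed threshold
print(keep)
```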



Learn Patterns from Sentences

  • Formulate as a standard classification problem

  • Two practical problems:

    • No guarantee that the collected sentences are positive examples

      • Bunescu and Mooney: use SVM

    • Many sentences for each pair

      • Bunescu and Mooney: down-weight correlated terms



Extract Candidate Entity Pairs

  • Learned model M: (x,E1,E2) -> r

  • Simple method: sequential scan over D

    • Look for Tr1, Tr2, then apply M

  • Large, indexed corpus: retrieve relevant sentences first (see the sketch after this list)

    • Use keyword search

      • Pattern-based

      • Keyword-based

      • Agichtein and Gravano: iterative solution
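
A sketch of candidate generation over a large corpus: a crude keyword filter retrieves candidate sentences, and the learned model M is applied only to those. The keyword list, the stand-in model, and the sentences are invented for illustration:

```python
# Retrieve-then-classify sketch: filter sentences by keywords, then apply a
# stand-in for the learned model M: (x, E1, E2) -> r. All data is invented.
corpus = [
    "AcmeCorp acquired WidgetCo last spring .",
    "The weather in Paris was mild .",
    "GadgetInc was acquired by MegaCorp .",
]
keywords = {"acquired", "acquisition"}        # assumed pattern keywords

def model_M(sentence):
    """Toy classifier: treat the words around 'acquired' as the entity pair."""
    tokens = sentence.split()
    if "acquired" in tokens:
        i = tokens.index("acquired")
        return (tokens[i - 1], tokens[i + 1], "Acquired")
    return None

retrieved = [s for s in corpus if keywords & set(s.lower().split())]
candidates = [m for m in map(model_M, retrieved) if m is not None]

# The passive sentence yields the bad candidate ('was', 'by', 'Acquired'),
# which is exactly the kind of noise the validation step must remove.
print(candidates)
```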



Validate Extracted Relationships

  • Extraction has high error rates

  • Validation based on corpus-wide statistics

  • Probabilities based on count of occurrences

    • Extract only high-confidence relationships (see the sketch after this list)

  • Rare relationships:

    • Use contextual patterns

    • Alternative: correct entity boundary errors
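
A minimal sketch of count-based validation: keep only triples whose relative support in the corpus clears a confidence threshold. The extractions and the threshold are invented numbers:

```python
# Keep only high-confidence triples, where confidence is the fraction of all
# extractions that support the triple. Data and threshold are illustrative.
from collections import Counter

extractions = [("AcmeCorp", "WidgetCo", "Acquired")] * 5 + \
              [("was", "by", "Acquired")]             # one noisy extraction

counts = Counter(extractions)
total = sum(counts.values())
confident = [(triple, n / total) for triple, n in counts.items()
             if n / total >= 0.2]                     # assumed threshold
print(confident)
```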



Summary

  • Setting 1: entities already marked

    • Feature-based and kernel-based methods

    • Clues from word sequence, parse trees, and dependency graphs

    • Training data with labeled relationships

  • Setting 2: open corpus, given relationship types

    • No labeled unstructured data

    • Seed database of (E1,E2,r) examples

    • Bootstrapping from seed data

    • Filter based on relevancy

  • Accuracy:

    • 50%-70% for closed benchmark datasets

    • Lots of special-case handling is needed for the web



Further Readings

  • This part concentrated on binary relationships

  • Natural extension: records with multi-way relationships

  • Requires cross-sentence analysis:

    • Co-reference resolution

    • Discourse analysis

  • Much literature on this topic

  • Future research: discovering relevant relationship types


