Scalable Information Extraction

Presentation Transcript



Scalable Information Extraction

Eugene Agichtein



Example: Angina treatments

Structured databases (e.g., drug info, WHO drug adverse effects DB, etc)

Medical reference and literature

Web search results



Research Goal

Accurate, intuitive, and efficient access to knowledge in unstructured sources

Approaches:

  • Information Retrieval

    • Retrieve the relevant documents or passages

    • Question answering

  • Human Reading

    • Construct domain-specific “verticals” (MedLine)

  • Machine Reading

    • Extract entities and relationships

    • Network of relationships: Semantic Web


Semantic Relationships “Buried” in Unstructured Text

Message Understanding Conferences (MUC)

RecommendedTreatment

A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris

  • Web, newsgroups, web logs

  • Text databases (PubMed, CiteSeer, etc.)

  • Newspaper Archives

    • Corporate mergers, succession, location

    • Terrorist attacks



What Structured Representation Can Do for You:

  • … allow precise and efficient querying

  • … allow returning answers instead of documents

  • … support powerful query constructs

  • … allow data integration with (structured) RDBMS

  • … provide useful content for Semantic Web

Structured Relation



Challenges in Information Extraction

  • Portability

    • Reduce effort to tune for new domains and tasks

    • MUC systems: experts would take 8-12 weeks to tune

  • Scalability, Efficiency, Access

    • Enable information extraction over large collections

    • 1 sec / document * 5 billion docs = 158 CPU years

  • Approach: learn from data (“Bootstrapping”)

    • Snowball: Partially Supervised Information Extraction

    • Querying Large Text Databases for Efficient Information Extraction



Outline

  • Snowball: partially supervised information extraction (overview and key results)

  • Effective retrieval algorithms for information extraction (in detail)

  • Current: mining user behavior for web search

  • Future work


The Snowball System: Overview



[Snowball pipeline: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → repeat]

Snowball: Getting User Input

ACM DL 2000

  • User input:
    • a handful of example instances
    • integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.




Snowball: Finding Example Occurrences

Can use any full-text search engine

Search Engine

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp…

The Armonk-based IBM introduced a new line… Change of guard at IBM Corporation’s headquarters near Armonk, NY ...




Snowball: Tagging Entities

Named entity taggers can recognize Dates, People, Locations, Organizations, …

MITRE’s Alembic, IBM’s Talent, LingPipe, …

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp…

The Armonk-based IBM introduced a new line… Change of guard at IBM Corporation’s headquarters near Armonk, NY ...



Computer servers at Microsoft’s headquarters in Redmond…

Snowball: Extraction Patterns

  • General extraction pattern model:

    acceptor0, Entity, acceptor1, Entity, acceptor2

  • Acceptor instantiations:

    • String Match (accepts the string “’s headquarters in”)

    • Vector-Space (≈ the vector [(’s, 0.5), (headquarters, 0.5), (in, 0.5)]; see the matching sketch after this list)

    • Classifier (estimate P(T = valid | ’s, headquarters, in))
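The vector-space acceptor reduces matching to a similarity computation between sparse term-weight vectors. A minimal sketch, assuming cosine similarity and illustrative weights (not the system’s actual values):

```python
from math import sqrt

def cosine(v1: dict, v2: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Middle-context pattern vs. the context of a candidate occurrence
pattern = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
context = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}

print(round(cosine(context, pattern), 2))  # 0.87: a strong match
```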


Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.

Example occurrence vectors:
  LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
  LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
  ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
  ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION


Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids.

Example cluster member: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
Resulting pattern (centroid after filtering): ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION
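A sketch of step 3 under simple assumptions: average the context vectors in a cluster, then drop low-weight terms to form the pattern centroid. The threshold is illustrative, and the renormalization the real system performs is omitted:

```python
from collections import defaultdict

def centroid(cluster: list, min_weight: float = 0.4) -> dict:
    """Average sparse term-weight vectors and filter low-weight terms."""
    sums = defaultdict(float)
    for vec in cluster:
        for term, w in vec.items():
            sums[term] += w
    n = len(cluster)
    return {t: s / n for t, s in sums.items() if s / n >= min_weight}

cluster = [
    {"'s": 0.57, "headquarters": 0.57, "near": 0.57},
    {"'s": 0.57, "headquarters": 0.57, "in": 0.57},
]
# 'near' and 'in' average to 0.285 and are filtered out
print(centroid(cluster))  # {"'s": 0.57, 'headquarters': 0.57}
```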


Snowball: Extracting New Tuples

Match tagged text fragments against the patterns.

“Google’s new headquarters in Mountain View are …”

Fragment vector: ORGANIZATION {<’s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}

Candidate patterns and match scores:
  P1: ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION (Match = 0.8)
  P2: LOCATION {<located 0.71>, <in 0.71>} ORGANIZATION (Match = 0.4)
  P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION (Match = 0)


Snowball: Evaluating Patterns

Automatically estimate pattern confidence against the current seed tuples:
  Conf(P4) = Positive / Total = 2/3 = 0.66

P4: ORGANIZATION { <, 1> } LOCATION

  IBM, Armonk, reported…  Positive
  Intel, Santa Clara, introduced...  Positive
  “Bet on Microsoft”, New York-based analyst Jane Smith said...  Negative
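A minimal sketch of this check, assuming Organization acts as a key, so an extraction counts as Positive only when it agrees with a seed tuple (the function and data here are illustrative):

```python
def pattern_confidence(extractions, seeds):
    """Conf(P) = Positive / Total, counting only extractions whose
    Organization appears in the seed set (Organization is a key)."""
    seed_orgs = {org for org, _ in seeds}
    known = [pair for pair in extractions if pair[0] in seed_orgs]
    if not known:
        return 0.0
    positive = sum(1 for pair in known if pair in seeds)
    return positive / len(known)

seeds = {("IBM", "Armonk"), ("Intel", "Santa Clara"), ("Microsoft", "Redmond")}
extracted = [("IBM", "Armonk"), ("Intel", "Santa Clara"), ("Microsoft", "New York")]
print(pattern_confidence(extracted, seeds))  # 2/3 = 0.66...
```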


Snowball: Evaluating Tuples

Automatically evaluate tuple confidence:

  Conf(T) = 1 − ∏ (1 − Conf(Pi) · Match(Ci, Pi))

A tuple has high confidence if it was generated by high-confidence patterns.

Example: T = <3COM, Santa Clara>
  P4: ORGANIZATION { <, 1> } LOCATION, Conf(P4) = 0.66, Match = 0.4
  P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION, Conf(P3) = 0.95, Match = 0.8
  Conf(T) = 1 − (1 − 0.66 · 0.4)(1 − 0.95 · 0.8) ≈ 0.83
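The combination above treats each supporting pattern as independent evidence for the tuple. A small sketch of the computation:

```python
def tuple_confidence(evidence):
    """Conf(T) = 1 - prod(1 - conf_p * match) over supporting patterns."""
    miss = 1.0
    for conf_p, match in evidence:
        miss *= 1.0 - conf_p * match
    return 1.0 - miss

# <3COM, Santa Clara>: P4 (conf 0.66, match 0.4), P3 (conf 0.95, match 0.8)
print(round(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]), 2))
# 0.82 (the slide reports 0.83, presumably from unrounded scores)
```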


Snowball: Evaluating Tuples

Keep only high-confidence tuples for the next iteration.


Snowball: Evaluating Tuples

Start a new iteration with the expanded example set.

Iterate until no new tuples are extracted



Pattern-Tuple Duality

  • A “good” tuple:
    • Extracted by “good” patterns
    • Tuple weight ≈ goodness
  • A “good” pattern:
    • Generated by “good” tuples
    • Extracts “good” new tuples
    • Pattern weight ≈ goodness
  • Edge weight:
    • Match/similarity of tuple context to pattern



How to Set Node Weights

  • Constraint violation (from before)
    • Conf(P) = log(Pos) · Pos / (Pos + Neg)
    • Conf(T) = 1 − ∏ (1 − Conf(Pi) · Match(Ci, Pi))
  • HITS [Hassan et al., EMNLP 2006] (see the sketch after this list)
    • Conf(P) = ∑ Conf(T) over the tuples P extracts
    • Conf(T) = ∑ Conf(P) over the patterns that extract T
  • URNS [Downey et al., IJCAI 2005]
  • EM-Spy [Agichtein, SDM 2006]
    • Unknown tuples = Neg
    • Compute Conf(P), Conf(T)
    • Iterate
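For the HITS-style weighting, a toy iteration over the pattern-tuple bipartite graph might look as follows. The normalization and the tiny graph are illustrative assumptions, not the paper’s exact procedure:

```python
# Pattern-tuple edges: pattern P extracts tuple t
edges = [("P1", "t1"), ("P1", "t2"), ("P2", "t2"), ("P2", "t3")]
patterns = {p for p, _ in edges}
tuples_ = {t for _, t in edges}

conf_p = {p: 1.0 for p in patterns}
conf_t = {t: 1.0 for t in tuples_}

for _ in range(20):
    # Pattern score: sum of the scores of tuples it extracts, then normalize.
    conf_p = {p: sum(conf_t[t] for q, t in edges if q == p) for p in patterns}
    norm = max(conf_p.values())
    conf_p = {p: v / norm for p, v in conf_p.items()}
    # Tuple score: sum of the scores of patterns that extract it, normalized.
    conf_t = {t: sum(conf_p[p] for p, u in edges if u == t) for t in tuples_}
    norm = max(conf_t.values())
    conf_t = {t: v / norm for t, v in conf_t.items()}

print(conf_p, conf_t)  # t2, supported by both patterns, ranks highest
```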



Snowball: EM-based Pattern Evaluation



Evaluating Patterns and Tuples: Expectation Maximization

  • EM-Spy Algorithm

    • “Hide” labels for some seed tuples

    • Iterate EM algorithm to convergence on tuple/pattern confidence values

    • Set the confidence threshold t so that 90% of the (hidden) spy tuples score above t (a small sketch follows this list)

    • Re-initialize Snowball using new seed tuples

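A small sketch of the spy-threshold step, assuming tuple confidences are already available from an EM pass (the values are illustrative):

```python
def spy_threshold(spy_confidences, keep_fraction=0.9):
    """Return threshold t such that keep_fraction of spy tuples score >= t."""
    ranked = sorted(spy_confidences, reverse=True)
    cutoff_index = int(len(ranked) * keep_fraction) - 1
    return ranked[max(cutoff_index, 0)]

# Confidences assigned to the hidden ("spy") seed tuples after EM
spies = [0.95, 0.91, 0.88, 0.82, 0.80, 0.74, 0.70, 0.55, 0.40, 0.12]
t = spy_threshold(spies)
print(t)  # 0.40: tuples scoring below t are treated as negatives
```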



Adapting Snowball for New Relations

  • Large parameter space

    • Initial seed tuples (randomly chosen, multiple runs)

    • Acceptor features: words, stems, n-grams, phrases, punctuation, POS

    • Feature selection techniques: OR, NB, Freq, “support”, combinations

    • Feature weights: TF*IDF, TF, TF*NB, NB

    • Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy

  • Automatically estimate parameter values:

    • Estimate operating parameters based on occurrences of seed tuples

    • Run cross-validation on hold-out sets of seed tuples for optimal performance

    • Seed occurrences that do not have close “neighbors” are discarded



SDM 2006

Example Task 1: DiseaseOutbreaks

Proteus: 0.409

Snowball: 0.415



ISMB 2003

Example Task 2: Bioinformatics, a.k.a. Mining the “Bibliome”

“APO-1, also known as DR6…” “MEK4, also called SEK1…”

  • 100,000+ gene and protein synonyms extracted from 50,000+ journal articles

  • Approximately 40% of the confirmed synonyms were not previously listed in the curated authoritative reference (SWISSPROT)



Snowball Used in Various Domains

  • News: NYT, WSJ, AP [DL’00, SDM’06]

    • CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

  • Medical literature: PDRHealth, Micromedex… [Thesis]

    • AdverseEffects, DrugInteractions, RecommendedTreatments

  • Biological literature: GeneWays corpus [ISMB’03]

    • Gene and Protein Synonyms


Limits of Bootstrapping for Extraction

CIKM 2005

“President George W. Bush’s three-day visit to India”

    • The task is “easy” when context term distributions diverge from the background

    • Quantify this as relative entropy (Kullback-Leibler divergence)

    • After calibration, the metric predicts whether bootstrapping is likely to work


Few Relations Cover Common Questions

SIGIR 2005

25 relations cover > 50% of question types; 5 relations cover > 55% of question instances



    Outline

    • Snowball, a domain-independent, partially supervised information extraction system

    • Retrieval algorithms for scalable information extraction

    • Current: mining user behavior for web search

    • Future work


Extracting a Relation from a Large Text Database

Expensive for large collections

Information Extraction System

    • Brute force approach: feed all docs to information extraction system

    • Only a tiny fraction of documents are often useful

    • Many databases are not crawlable

    • Often a search interface is available, with existing keyword index

    • How to identify “useful” documents?

Structured Relation



    Accessing Text DBs via Search Engines

    Search engines impose limitations

    • Limit on documents retrieved per query

    • Support simple keywords and phrases

    • Ignore “stopwords” (e.g., “a”, “is”)

Information Extraction System

Search Engine

Structured Relation



    QXtract: Querying Text Databases for Robust Scalable Information EXtraction

    User-Provided Seed Tuples

    Query Generation

    Queries

    Promising Documents

    Information Extraction System

    Problem: Learn keyword queries to retrieve “promising” documents

    Extracted Relation



    Learning Queries to Retrieve Promising Documents

    User-Provided Seed Tuples

    • Get document sample with “likely negative” and “likely positive” examples.

    • Label sample documents using the information extraction system as “oracle.”

    • Train classifiers to “recognize” useful documents.

    • Generate queries from classifier model/rules.

    Seed Sampling

    Information Extraction System

    Classifier Training

    Query Generation

    Queries


Training Classifiers to Recognize “Useful” Documents

Document features: words
Labeled sample: D1 (+), D2 (+), D3 (-), D4 (-)

Classifiers: Okapi (IR), SVM, Ripper
  Example Ripper rule: disease AND reported => USEFUL
  Example term features: products, disease, exported, reported, used, epidemic, far, infected, virus


Generating Queries from Classifiers

Classifiers: Ripper, SVM, Okapi (IR)
  Ripper rule: disease AND reported => USEFUL
  Weighted terms: products, disease, exported, reported, epidemic, used, infected, far, virus

Generated queries: [disease AND reported], [epidemic virus], [virus infected]
QCombined: the union of the queries generated by the individual classifiers
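A hypothetical sketch of this step: train a linear classifier on the labeled sample and promote its highest-weight terms to keyword queries. It uses scikit-learn, and the tiny corpus is illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "new disease outbreak reported in the region",   # useful
    "epidemic virus infected hundreds",              # useful
    "products exported far and wide",                # not useful
    "machines used in factories",                    # not useful
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Rank terms by learned weight; top terms become candidate queries.
terms = vec.get_feature_names_out()
weights = clf.coef_[0]
top = sorted(zip(weights, terms), reverse=True)[:3]
print([t for _, t in top])  # e.g., terms like 'disease', 'epidemic', 'virus'
```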



    SIGMOD 2003 Demonstration


Tuples: A Simple Querying Strategy

    • Convert given tuples into queries (e.g., [“Ebola” AND “Zaire”])

    • Retrieve matching documents via the search engine

    • Extract new tuples from the documents with the information extraction system, and iterate



    Comparison of Document Access Methods

    QXtract: 60% of the relation extracted from 10% of the documents in a 135,000-article newspaper database

    Tuples strategy: recall at most 46%



    How to choose the best strategy?

    • Tuples: Simple, no training, but limited recall

    • QXtract: Robust, but has training and query overhead

    • Scan: No overhead, but must process all documents



    WebDB 2003

    Predicting Recall of Tuples Strategy

    [Figure: starting from a seed tuple, querying either reaches most of the relation (SUCCESS!) or quickly dies out (FAILURE)]

    Can we predict if Tuples will succeed?



    Abstract the problem: Querying Graph

    Tuples are linked to the documents they retrieve when issued as queries (e.g., [“Ebola” AND “Zaire”]).

    [Figure: querying graph between tuples t1…t5 and documents d1…d5, mediated by the search engine]

    Note: Only the top K docs are returned for each query. <Violence, U.S.> retrieves many documents that do not contain tuples; searching for an extracted tuple may not retrieve its source document.



    Information ReachabilityGraph

    [Figure: reachability graph over tuples t1…t5, derived from the querying graph over documents d1…d5]

    t1 retrieves document d1, which contains t2.

    t2, t3, and t4 are “reachable” from t1.



    Connected Components

    Core: tuples that retrieve other tuples and themselves

    Out: reachable tuples that do not retrieve tuples in the Core

    In: tuples that retrieve other tuples but are not themselves reachable



    Sizes of Connected Components

    How many tuples are in the largest Core + Out?

    • Conjecture:
      • Degree distribution in reachability graphs follows a “power law.”
      • Then the reachability graph has at most one giant component.

    • Define Reachability as the fraction of tuples in the largest Core + Out

    [Figure: bow-tie structure of the reachability graph: In → Core (strongly connected) → Out]



    NYT Reachability Graph: Outdegree Distribution

    Matches the power-law distribution.

    [Figure: outdegree distributions for MaxResults=10 and MaxResults=50]



    NYT: Component Size Distribution

    [Figure: component size distributions. MaxResults=10: CG / |T| = 0.297; MaxResults=50: CG / |T| = 0.620. Tuples outside the giant component are not “reachable.”]



    Connected Components Visualization

    DiseaseOutbreaks, New York Times 1995



    Estimating Reachability

    In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1.

    • Estimate: Reachability ≈ CG / |T|

    • Depends only on d (the average outdegree)

    Chung and Lu, Annals of Combinatorics, 2002

    * for power-law exponent β < 3.457



    Estimating Reachability Algorithm

    • Pick some random tuples

    • Use the tuples to query the database

    • Extract tuples from matching documents to compute reachability graph edges

    • Estimate the average outdegree

    • Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002

    [Figure: sampled reachability graph over tuples t1…t4 and documents d1…d4, giving average outdegree d = 1.5]
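A minimal sketch of this sampling loop; query_database() and extract_tuples() are hypothetical stand-ins for the search interface and the extraction system:

```python
import random

def estimate_avg_outdegree(known_tuples, query_database, extract_tuples,
                           sample_size=50):
    """Sample tuples, query with them, and average the number of *other*
    tuples found per query (the outdegree in the reachability graph)."""
    sample = random.sample(known_tuples, min(sample_size, len(known_tuples)))
    total_edges = 0
    for t in sample:
        found = set()
        for doc in query_database(t):          # top-K docs for query t
            found.update(extract_tuples(doc))  # tuples contained in doc
        total_edges += len(found - {t})
    return total_edges / len(sample)

# If the average outdegree d > 1 (and the power-law exponent is < 3.457),
# a giant component exists and the Tuples strategy is likely to succeed.
```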



    Estimating Reachability of NYT

    The reachability estimate converges to approximately 0.46 after ~50 queries.

    This can be used to predict the success (or failure) of a Tuples querying strategy.



    To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

    • Information extraction applications extract structured relations from unstructured text

    May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

    Disease Outbreaks in The New York Times

    Information Extraction System (e.g., NYU’s Proteus)



    For the rest of the talk

    An Abstract View of Text-Centric Tasks

    [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

    Text Database

    Extraction System

    Retrieve documents from database

    Process documents

    Extract output tuples



    Executing a Text-Centric Task

    Similar to the relational world, there are two major execution paradigms:

    • Scan-based: Retrieve and process documents sequentially

    • Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results

    → The underlying data distribution dictates which plan is best

    Unlike the relational world:

    • Indexes are only “approximate”: the index is on keywords, not on the tuples of interest

    • The choice of execution plan affects output completeness (not only speed)



    Execution Plan Characteristics

    Question: How do we choose the fastest execution plan for reaching a target recall?


    Execution Plans have two main characteristics:

    • Execution Time

    • Recall (fraction of tuples retrieved)

    “What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?”
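As a toy illustration of the optimizer’s choice, one can plug assumed costs into the time formulas developed in the following slides and pick the cheaper plan. All numbers here are illustrative:

```python
R, P, Q = 0.1, 1.0, 0.5   # per-doc retrieval, per-doc processing, per-query time (s)

def scan_time(docs_needed):
    return docs_needed * (R + P)

def iterative_set_expansion_time(docs_retrieved, queries_sent):
    return docs_retrieved * (R + P) + queries_sent * Q

# Suppose reaching 10% recall needs 20,000 docs by Scan, but only
# 3,000 docs and 500 queries by Iterative Set Expansion:
plans = {
    "Scan": scan_time(20_000),
    "Iterative Set Expansion": iterative_set_expansion_time(3_000, 500),
}
print(min(plans, key=plans.get), plans)  # pick the fastest plan
```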



    Outline

    • Description and analysis of crawl- and query-based plans
      • Scan (crawl-based)
      • Filtered Scan (crawl-based)
      • Iterative Set Expansion (query-based / index-based)
      • Automatic Query Generation (query-based / index-based)

    • Optimization strategy

    • Experimental results and conclusions



    Scan

    • Scan retrieves and processes documents sequentially (until reaching the target recall)

      Execution time = |Retrieved Docs| · (R + P)

      where R = time for retrieving a document and P = time for processing a document

    Question: How many documents does Scan retrieve to reach the target recall?

    Filtered Scan uses a classifier to identify and process only promising documents (details in the paper)


    Estimating Recall of Scan

    Modeling Scan for tuple t (e.g., <SARS, China>, with frequency g(t)):

    • What is the probability of seeing t after retrieving S documents?

    • A “sampling without replacement” process

    • After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution

    • Recall for tuple t is the probability that the frequency of t in the S documents is > 0
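Under this model, per-tuple recall has a closed form: recall(t) = 1 − C(D − g(t), S) / C(D, S) for a database of D documents. A small sketch, with illustrative numbers, using the equivalent product form to avoid huge binomial coefficients:

```python
def recall_of_tuple(D: int, S: int, g: int) -> float:
    """P(t seen) = 1 - C(D - g, S) / C(D, S),
    computed as 1 - prod_{i=0..g-1} (D - S - i) / (D - i)."""
    miss = 1.0
    for i in range(g):
        miss *= max(D - S - i, 0) / (D - i)
    return 1.0 - miss

D, S = 100_000, 20_000          # database size; documents scanned (20%)
frequencies = [1, 2, 5, 50]     # g(t) for a few illustrative tuples

recalls = [recall_of_tuple(D, S, g) for g in frequencies]
print([round(r, 3) for r in recalls])   # rare tuples are likely missed
print("expected recall:", round(sum(recalls) / len(recalls), 3))
```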



    Estimating Recall of Scan

    <SARS, China>

    <Ebola, Zaire>

    Modeling Scan:

    • Multiple “sampling without replacement” processes, one for each tuple

    • Overall recall is average recall across tuples

      → We can compute the number of documents required to reach the target recall

    Execution time = |Retrieved Docs| · (R + P)



    Iterative Set Expansion

    [Pipeline: query the database with seed tuples (e.g., [Ebola AND Zaire]) → process retrieved documents → extract tuples from the documents → augment the seed set with new tuples (e.g., <Malaria, Ethiopia>) → repeat]

    Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q

    where R = time for retrieving a document, P = time for processing a document, and Q = time for answering a query

    Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
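A minimal sketch of the loop; query_database() and extract_tuples() are hypothetical stand-ins for the search interface and the extraction system:

```python
def iterative_set_expansion(seeds, query_database, extract_tuples,
                            target_count):
    extracted = set(seeds)
    frontier = list(seeds)
    while frontier and len(extracted) < target_count:
        t = frontier.pop()
        for doc in query_database(t):            # e.g., [Ebola AND Zaire]
            for new in extract_tuples(doc):      # e.g., <Malaria, Ethiopia>
                if new not in extracted:
                    extracted.add(new)
                    frontier.append(new)         # augment the seed set
    return extracted

# Toy in-memory "database": note <SARS, China> is unreachable from the seed,
# illustrating how reachability limits this strategy's recall.
DOCS = {
    "d1": {("Ebola", "Zaire"), ("Malaria", "Ethiopia")},
    "d2": {("Malaria", "Ethiopia"), ("Cholera", "Sudan")},
    "d3": {("SARS", "China")},
}
query_db = lambda t: [d for d, ts in DOCS.items() if t in ts]
extract = lambda d: DOCS[d]
print(iterative_set_expansion({("Ebola", "Zaire")}, query_db, extract, 10))
```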


Using Querying Graph for Analysis

    We need to compute the:

    • Number of documents retrieved after sending Q tuples as queries (estimates time)

    • Number of tuples that appear in the retrieved documents (estimates recall)

      To estimate these we need to compute the:

    • Degree distribution of the tuples discovered by retrieving documents

    • Degree distribution of the documents retrieved by the tuples

    • (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)

    [Figure: querying graph linking tuples t1…t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) to documents d1…d5]



    Summary of Cost Analysis

    • Our analysis so far:

      • Takes as input a target recall

      • Gives as output the time for each plan to reach the target recall (time = ∞ if a plan cannot reach the target recall)

    • Time and recall depend on task-specific properties of database:

      • Tuple degree distribution

      • Document degree distribution

    • Next, we show how to estimate degree distributions on-the-fly



    Estimating Cost Model Parameters

    tuple and document degree distributions belong to known distribution families

    Can characterize distributions with only a few parameters!



    Parameter Estimation

    • Naïve solution for parameter estimation:

      • Start with a separate “parameter estimation” phase

      • Perform random sampling on database

      • Stop when cross-validation indicates high confidence

    • We can do better than this!

    • No need for separate sampling phase

    • Sampling is equivalent to executing the task:

      → Piggyback parameter estimation onto task execution


    On-the-fly Parameter Estimation

    [Figure: estimated degree distribution converging from an initial default, through updated estimates, toward the correct (but unknown) distribution]

    • Pick most promising execution plan for target recall assuming “default” parameter values

    • Start executing task

    • Update parameter estimates during execution

    • Switch plan if updated statistics indicate so

    Important

    • Only Scan acts as “random sampling”

    • All other execution plans need parameter adjustment (see paper)



    Outline

    • Description and analysis of crawl- and query-based plans

    • Optimization strategy

    • Experimental results and conclusions



    Correctness of Theoretical Analysis

    • Solid lines: Actual time

    • Dotted lines: Predicted time with correct parameters

    Task: Disease Outbreaks

    Snowball IE system

    182,531 documents from NYT

    16,921 tuples



    Experimental Results (Information Extraction)

    • Solid lines: Actual time

    • Green line: Time with optimizer

      (results similar in other experiments – see paper)



    Conclusions

    • Common execution plans for multiple text-centric tasks

    • Analytic models for predicting execution time and recall of various crawl- and query-based plans

    • Techniques for on-the-fly parameter estimation

    • The optimization framework picks on-the-fly the fastest plan for the target recall



    Can we do better?

    • Yes, for some information extraction systems



    Bindings Engine (BE) [Slides: Cafarella 2005]

    • Bindings Engine (BE) is a search engine where:

      • No downloads during query processing

      • Disk seeks constant in corpus size

      • #queries = #phrases

    • BE’s approach:

      • “Variabilized” search query language

      • Pre-processes all documents before query-time

      • Integrates variable/type data with inverted index, minimizing query seeks



    BE Query Support

    cities such as <NounPhrase>

    President Bush <Verb>

    <NounPhrase> is the capital of <NounPhrase>

    reach me at <phone-number>

    • Any sequence of concrete terms and typed variables

    • NEAR is insufficient

    • Functions (e.g., “head(<NounPhrase>)”)



    BE Operation

    • Like a generic search engine, BE:

      • Downloads a corpus of pages

      • Creates an index

      • Uses index to process queries efficiently

    • BE further requires:

      • Set of indexed types (e.g., “NounPhrase”), with a “recognizer” for each

      • String processing functions (e.g., “head()”)

    • A BE system can only process types and functions that its index supports


    [Figure: standard inverted index layout; each term's posting list stores #docs followed by docid0, docid1, …, docid#docs-1]

    Query: "such as"

    Posting list for one term: 104, 21, 150, 322, 2501
    Posting list for the other: 15, 99, 322, 426, 1309

    Merging the lists:

    • Test for equality

    • Advance the smaller pointer

    • Abort when a list is exhausted

    Returned docs: 322
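A sketch of the two-pointer merge described above, over sorted docid lists (the slide’s lists are shown partial and unsorted):

```python
def intersect(list_a: list, list_b: list) -> list:
    """Intersect two sorted docid lists with two pointers."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):  # abort when one is exhausted
        if list_a[i] == list_b[j]:              # test for equality
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:             # advance the smaller pointer
            i += 1
        else:
            j += 1
    return out

such = [21, 104, 150, 322, 2501]
as_ = [15, 99, 322, 426, 1309]
print(intersect(such, as_))  # [322]
```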


    [Figure: posting lists augmented with positions; each docid entry carries #posns followed by pos0, pos1, …, pos#pos-1]

    In phrase queries (e.g., "such as"), match positions as well as docids.


    Neighbor Index

    • At each position in the index, store "neighbor text" that might be useful

    • Let's index <NounPhrase> and <Adj-Term>

    "I love cities such as Atlanta."

    [Figure: left/right neighbors stored per position, e.g., AdjT: "love"]


    Neighbor Index

    • At each position in the index, store "neighbor text" that might be useful

    • Let's index <NounPhrase> and <Adj-Term>

    "I love cities such as Atlanta."

    [Figure: left/right neighbors stored per position, e.g., AdjT: "I", NP: "I", AdjT: "cities", NP: "cities"]


    Neighbor Index

    Query: "cities such as <NounPhrase>"

    "I love cities such as Atlanta."

    [Figure: at the match, the left neighbor is AdjT: "such" and the right neighbors are NP: "Atlanta", AdjT: "Atlanta"]


    "cities such as <NounPhrase>"

    [Figure: each position entry points to a neighbor block: blk_offset, #neighbors, then (neighbor, string) pairs, e.g., AdjT-left: "such", NP-right: "Atlanta"]

    In doc 19, starting at posn 8: "I love cities such as Atlanta."

    • Find phrase query positions, as with phrase queries

    • If a term is adjacent to a variable, extract the typed value
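A toy sketch of BE-style extraction over a neighbor index: each indexed phrase position carries typed neighbor strings, so a query like "cities such as <NounPhrase>" is answered without fetching documents. The data structures and values here are illustrative assumptions:

```python
# (docid, position of the phrase match) -> typed neighbors at that position
neighbor_blocks = {
    (19, 11): {"AdjT_left": "such", "NP_right": "Atlanta"},
}

# posting list of (docid, position) pairs for the phrase "such as"
phrase_postings = [(19, 11)]

def answer_query(postings, variable_side):
    """Return the typed value bound to the variable at each phrase match."""
    results = []
    for docid, pos in postings:
        block = neighbor_blocks.get((docid, pos), {})
        if variable_side in block:
            results.append(block[variable_side])  # no document fetch needed
    return results

print(answer_query(phrase_postings, "NP_right"))  # ['Atlanta']
```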



    Current Research Directions

    • Modeling explicit and implicit network structures
      • Modeling evolution of explicit structure on the web, blogspace, Wikipedia
      • Modeling implicit link structures in text, collections, the web
      • Exploiting implicit & explicit social networks (e.g., for epidemiology)

    • Knowledge Discovery from Biological and Medical Data
      • Automatic sequence annotation → bioinformatics, genetics
      • Actionable knowledge extraction from medical articles

    • Robust information extraction, retrieval, and query processing
      • Integrating information in structured and unstructured sources
      • Robust search/question answering for medical applications
      • Confidence estimation for extraction from text and other sources
      • Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
      • Accuracy (≠ authority) of online sources

    • Information diffusion/propagation in online sources
      • Information propagation on the web
      • In collaborative sources (Wikipedia, MedLine)



    Page Quality: In Search of an Unbiased Web Ranking[Cho, Roy, Adams, SIGMOD 2005]

    • “popular pages tend to get even more popular, while unpopular pages get ignored by an average user”



    Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]



    Modeling Social Networks for Epidemiology, Security, …

    Email exchange mapped onto cubicle locations.



    Some Research Directions

    • Modeling explicit and implicit network structures
      • Modeling evolution of explicit structure on the web, blogspace, Wikipedia
      • Modeling implicit link structures in text, collections, the web
      • Exploiting implicit & explicit social networks (e.g., for epidemiology)

    • Knowledge Discovery from Biological and Medical Data
      • Automatic sequence annotation → bioinformatics, genetics
      • Actionable knowledge extraction from medical articles

    • Robust information extraction, retrieval, and query processing
      • Integrating information in structured and unstructured sources
      • Query processing over unstructured text
      • Robust search/question answering for medical applications
      • Confidence estimation for extraction from text and other sources
      • Detecting reliable signals from (noisy) text data (e.g., medical surveillance)

    • Information diffusion/propagation in online sources
      • Information propagation on the web
      • In collaborative sources (Wikipedia, MedLine)



    Agichtein & Eskin, PSB 2004

    Mining Text and Sequence Data

    [Figure: ROC50 scores for each class and method]



    Some Research Directions

    • Modeling explicit and implicit network structures
      • Modeling evolution of explicit structure on the web, blogspace, Wikipedia
      • Modeling implicit link structures in text, collections, the web
      • Exploiting implicit & explicit social networks (e.g., for epidemiology)

    • Knowledge Discovery from Biological and Medical Data
      • Automatic sequence annotation → bioinformatics, genetics
      • Actionable knowledge extraction from medical articles

    • Robust information extraction, retrieval, and query processing
      • Integrating information in structured and unstructured sources
      • Robust search/question answering for medical applications
      • Confidence estimation for extraction from text and other sources
      • Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
      • Accuracy (≠ authority) of online sources

    • Information diffusion/propagation in online sources
      • Information propagation on the web
      • In collaborative sources (Wikipedia, MedLine)



    Structure and evolution of blogspace[Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]

    Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week.



    Current Research Directions

    • Modeling explicit and implicit network structures
      • Modeling evolution of explicit structure on the web, blogspace, Wikipedia
      • Modeling implicit link structures in text, collections, the web
      • Exploiting implicit & explicit social networks (e.g., for epidemiology)

    • Knowledge Discovery from Biological and Medical Data
      • Automatic sequence annotation → bioinformatics, genetics
      • Actionable knowledge extraction from medical articles

    • Robust information extraction, retrieval, and query processing
      • Integrating information in structured and unstructured sources
      • Robust search/question answering for medical applications
      • Confidence estimation for extraction from text and other sources
      • Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
      • Accuracy (≠ authority) of online sources

    • Information diffusion/propagation in online sources
      • Information propagation on the web, news
      • In collaborative sources (Wikipedia, MedLine)



    Thank You

    • Details:

      http://www.mathcs.emory.edu/~eugene/

