Scalable Information Extraction

Eugene Agichtein

Presentation Transcript
Example: Angina treatments

Structured databases (e.g., drug info, WHO drug adverse effects DB, etc)

Medical reference and literature

Web search results

Research Goal

Accurate, intuitive, and efficient access to knowledge in unstructured sources

Approaches:

  • Information Retrieval
    • Retrieve the relevant documents or passages
    • Question answering
  • Human Reading
    • Construct domain-specific “verticals” (MedLine)
  • Machine Reading
    • Extract entities and relationships
    • Network of relationships: Semantic Web
Semantic Relationships “Buried” in Unstructured Text

Example: the RecommendedTreatment relation (cf. the Message Understanding Conferences):

“A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris.”

  • Web, newsgroups, web logs
  • Text databases (PubMed, CiteSeer, etc.)
  • Newspaper Archives
    • Corporate mergers, succession, location
    • Terrorist attacks
What Structured Representation Can Do for You:
  • … allow precise and efficient querying
  • … allow returning answers instead of documents
  • … support powerful query constructs
  • … allow data integration with (structured) RDBMS
  • … provide useful content for Semantic Web


Challenges in Information Extraction
  • Portability
    • Reduce effort to tune for new domains and tasks
    • MUC systems: experts would take 8-12 weeks to tune
  • Scalability, Efficiency, Access
    • Enable information extraction over large collections
    • 1 sec / document * 5 billion docs = 158 CPU years
  • Approach: learn from data (“Bootstrapping”)
    • Snowball: Partially Supervised Information Extraction
    • Querying Large Text Databases for Efficient Information Extraction
Outline
  • Snowball: partially supervised information extraction (overview and key results)
  • Effective retrieval algorithms for information extraction (in detail)
  • Current: mining user behavior for web search
  • Future work
The Snowball System: Overview

[Diagram: Snowball takes (1) a handful of seed examples and (2) a document collection, and produces (3) an expanded structured relation.]

Snowball: Getting User Input

(ACM DL 2000)

[Pipeline: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples]

  • User input:
    • a handful of example instances
    • integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.
Snowball: Finding Example Occurrences

[Pipeline stage: Find Example Occurrences in Text]

Can use any full-text search engine


“Computer servers at Microsoft’s headquarters in Redmond…”

“In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp…”

“The Armonk-based IBM introduced a new line…”

“Change of guard at IBM Corporation’s headquarters near Armonk, NY…”

Snowball: Tagging Entities

[Pipeline stage: Tag Entities]

Named entity taggers can recognize Dates, People, Locations, Organizations, … (e.g., MITRE’s Alembic, IBM’s Talent, LingPipe, …)

Tagged examples, with ORGANIZATION and LOCATION entities marked:

“Computer servers at <ORG>Microsoft</ORG>’s headquarters in <LOC>Redmond</LOC>…”

“In mid-afternoon trading, shares of <LOC>Redmond, WA</LOC>-based <ORG>Microsoft Corp</ORG>…”

“The <LOC>Armonk</LOC>-based <ORG>IBM</ORG> introduced a new line…”

“Change of guard at <ORG>IBM Corporation</ORG>’s headquarters near <LOC>Armonk, NY</LOC>…”

Snowball: Extraction Patterns

Example occurrence: “Computer servers at Microsoft’s headquarters in Redmond…”
  • General extraction pattern model:

acceptor0, Entity, acceptor1, Entity, acceptor2

  • Acceptor instantiations:
    • String Match (accepts the exact string “’s headquarters in”)
    • Vector-Space (~ vector [(’s, 0.5), (headquarters, 0.5), (in, 0.5)])
    • Classifier (estimate P(T = valid | ’s, headquarters, in))
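A minimal sketch of the vector-space acceptor, assuming a cosine-similarity match between the pattern's term-weight vector and an occurrence's context vector; the 0.5 acceptance threshold is an illustrative assumption:

```python
import math

def cosine(v1: dict, v2: dict) -> float:
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Pattern vector for the middle context between ORGANIZATION and LOCATION
pattern_mid = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}

# Middle context of a candidate occurrence
occurrence_mid = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}

match = cosine(pattern_mid, occurrence_mid)  # ~0.87
accept = match >= 0.5                        # threshold is a free parameter
```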
Snowball: Generating Patterns

[Pipeline stage: Generate Extraction Patterns]

1. Represent occurrences as vectors of tags and terms, e.g.:

   ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
   ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
   LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION

2. Cluster similar occurrences.

Snowball: Generating Patterns (continued)

3. Create patterns as filtered cluster centroids, e.g.:

   ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION
   LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
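A minimal sketch of steps 2-3, assuming a simple single-pass clustering; the similarity threshold and the exact clustering method are illustrative assumptions, not necessarily what Snowball uses internally (reuses `cosine` and `math` from the earlier sketch):

```python
def centroid(vectors):
    """Average the member vectors and renormalize to unit length."""
    acc = {}
    for v in vectors:
        for t, w in v.items():
            acc[t] = acc.get(t, 0.0) + w
    norm = math.sqrt(sum(w * w for w in acc.values()))
    return {t: w / norm for t, w in acc.items()}

def cluster_occurrences(occurrences, sim_threshold=0.7):
    """Group context vectors; each cluster's centroid becomes a candidate pattern."""
    clusters = []
    for occ in occurrences:
        best, best_sim = None, 0.0
        for c in clusters:
            s = cosine(c["centroid"], occ)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= sim_threshold:
            best["members"].append(occ)
            best["centroid"] = centroid(best["members"])
        else:
            clusters.append({"centroid": dict(occ), "members": [occ]})
    return clusters
```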

Snowball: Extracting New Tuples

[Pipeline stage: Extract Tuples]

Match tagged text fragments against the patterns, e.g., for “Google’s new headquarters in Mountain View are…”:

  Occurrence: ORGANIZATION {<’s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}

  P1: ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION   → Match = 0.8
  P2: LOCATION {<located 0.71>, <in 0.71>} ORGANIZATION        → Match = 0.4
  P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION           → Match = 0

Snowball: Evaluating Patterns

[Pipeline stage: Evaluate Patterns]

Automatically estimate pattern confidence against the current seed tuples: Conf(P) = Positive / Total

Example, for P4 = ORGANIZATION {<, 1>} LOCATION:

  “IBM, Armonk, reported…”                                     → Positive
  “Intel, Santa Clara, introduced...”                          → Positive
  “‘Bet on Microsoft’, New York-based analyst Jane Smith said…” → Negative

  Conf(P4) = 2/3 ≈ 0.66
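A minimal sketch of this evaluation step. The convention assumed here: an extraction counts as Positive if it agrees with a known seed tuple and Negative if it contradicts one (same organization, different location); organizations absent from the seed set are ignored for simplicity:

```python
def pattern_confidence(extractions, seeds):
    """extractions: list of (org, loc) produced by one pattern.
    seeds: dict mapping org -> known headquarters location."""
    pos = neg = 0
    for org, loc in extractions:
        if org in seeds:
            if seeds[org] == loc:
                pos += 1
            else:
                neg += 1
    return pos / (pos + neg) if (pos + neg) else 0.0

seeds = {"IBM": "Armonk", "Intel": "Santa Clara", "Microsoft": "Redmond"}
p4_extractions = [("IBM", "Armonk"), ("Intel", "Santa Clara"),
                  ("Microsoft", "New York")]  # the Negative example above
print(pattern_confidence(p4_extractions, seeds))  # 0.666...
```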

Snowball: Evaluating Tuples

[Pipeline stage: Evaluate Tuples]

Automatically evaluate tuple confidence:

Conf(T) = 1 − ∏ᵢ (1 − Conf(Pᵢ) · Match(Cᵢ, Pᵢ))

where the Pᵢ are the patterns that extracted T and Match(Cᵢ, Pᵢ) is the similarity of T’s context Cᵢ to Pᵢ. A tuple has high confidence if generated by multiple high-confidence patterns.

Example, for T = <3COM, Santa Clara>, Conf(T) ≈ 0.83:

  P4 = ORGANIZATION {<, 1>} LOCATION                      Conf(P4) = 0.66, Match = 0.4
  P3 = LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION     Conf(P3) = 0.95, Match = 0.8
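A minimal sketch of this combination rule, checked against the example above:

```python
def tuple_confidence(evidence):
    """evidence: list of (pattern_confidence, match_score) pairs for one tuple."""
    conf = 1.0
    for p_conf, match in evidence:
        conf *= 1.0 - p_conf * match
    return 1.0 - conf

# <3COM, Santa Clara>: extracted by P4 (conf 0.66, match 0.4)
# and P3 (conf 0.95, match 0.8)
print(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]))  # ~0.82, the slide's ~0.83 up to rounding
```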

Snowball: Evaluating Tuples (continued)

Keep only high-confidence tuples for the next iteration. Start the new iteration with the expanded example set, and iterate until no new tuples are extracted.

Pattern-Tuple Duality
  • A “good” tuple:
    • Extracted by “good” patterns
    • Tuple weight ∝ “goodness”
  • A “good” pattern:
    • Generated by “good” tuples
    • Extracts “good” new tuples
    • Pattern weight ∝ “goodness”
  • Edge weight:
    • Match/similarity of the tuple context to the pattern
How to Set Node Weights
  • Constraint violation (from before)
    • Conf(P) = log(Pos) · Pos / (Pos + Neg)
    • Conf(T) = 1 − ∏ᵢ (1 − Conf(Pᵢ) · Match(Cᵢ, Pᵢ))
  • HITS [Hassan et al., EMNLP 2006]
    • Conf(P) = ∑ Conf(T)
    • Conf(T) = ∑ Conf(P)
  • URNS [Downey et al., IJCAI 2005]
  • EM-Spy [Agichtein, SDM 2006]
    • Treat unknown tuples as Negative
    • Compute Conf(P), Conf(T)
    • Iterate
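A minimal sketch of the HITS-style mutual reinforcement on the pattern-tuple graph; the max-normalization step is an assumption added so the iteration stays bounded:

```python
def hits_scores(edges, n_iters=50):
    """edges: list of (pattern_id, tuple_id, match_weight) triples."""
    patterns = {p for p, _, _ in edges}
    tuples_ = {t for _, t, _ in edges}
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples_}
    for _ in range(n_iters):
        # Conf(T) = sum of the confidences of the patterns extracting T
        conf_t = {t: sum(w * conf_p[p] for p, t2, w in edges if t2 == t)
                  for t in tuples_}
        top = max(conf_t.values())
        conf_t = {t: v / top for t, v in conf_t.items()} if top else conf_t
        # Conf(P) = sum of the confidences of the tuples generating P
        conf_p = {p: sum(w * conf_t[t] for p2, t, w in edges if p2 == p)
                  for p in patterns}
        top = max(conf_p.values())
        conf_p = {p: v / top for p, v in conf_p.items()} if top else conf_p
    return conf_p, conf_t
```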
Evaluating Patterns and Tuples: Expectation Maximization
  • EM-Spy Algorithm
    • “Hide” the labels for some seed tuples (the “spies”)
    • Iterate the EM algorithm to convergence on tuple/pattern confidence values
    • Set the confidence threshold t so that more than 90% of the spy tuples score above t
    • Re-initialize Snowball using the new seed tuples
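A minimal sketch of the spy-based threshold-selection step only; the EM inner loop is omitted, and the scores stand in for converged tuple confidences:

```python
def spy_threshold(spy_scores, coverage=0.9):
    """Largest threshold that keeps `coverage` of the spy tuples above it."""
    ranked = sorted(spy_scores, reverse=True)
    k = max(1, round(coverage * len(ranked)))
    return ranked[k - 1]

spies = [0.97, 0.93, 0.90, 0.88, 0.55]  # confidences of the hidden seed tuples
t = spy_threshold(spies)                 # tuples scoring >= t become new seeds
```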

Adapting Snowball for New Relations
  • Large parameter space
    • Initial seed tuples (randomly chosen, multiple runs)
    • Acceptor features: words, stems, n-grams, phrases, punctuation, POS
    • Feature selection techniques: OR, NB, Freq, “support”, combinations
    • Feature weights: TF*IDF, TF, TF*NB, NB
    • Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
  • Automatically estimate parameter values:
    • Estimate operating parameters based on occurrences of the seed tuples
    • Run cross-validation on hold-out sets of seed tuples for optimal performance
    • Discard seed occurrences that do not have close “neighbors”
Example Task 1: DiseaseOutbreaks

(SDM 2006)

Results: Proteus: 0.409; Snowball: 0.415

Example Task 2: Bioinformatics, a.k.a. Mining the “Bibliome”

(ISMB 2003)

“APO-1, also known as DR6…”; “MEK4, also called SEK1…”

  • 100,000+ gene and protein synonyms extracted from 50,000+ journal articles
  • Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)
Snowball Used in Various Domains
  • News: NYT, WSJ, AP [DL’00, SDM’06]
      • CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
  • Medical literature: PDRHealth, Micromedex… [Thesis]
      • AdverseEffects, DrugInteractions, RecommendedTreatments
  • Biological literature: GeneWays corpus [ISMB’03]
      • Gene and Protein Synonyms
Limits of Bootstrapping for Extraction

(CIKM 2005)

Example context: “President George W. Bush’s three-day visit to India”

  • The task is “easy” when context term distributions diverge from the background
  • Quantify the divergence as relative entropy (Kullback-Leibler divergence)
  • After calibration, the metric predicts whether bootstrapping is likely to work
Outline
  • Snowball, a domain-independent, partially supervised information extraction system
  • Retrieval algorithms for scalable information extraction
  • Current: mining user behavior for web search
  • Future work
Extracting a Relation from a Large Text Database

[Diagram: Text Database → Information Extraction System → Structured Relation; processing every document is expensive for large collections.]

  • Brute-force approach: feed all documents to the information extraction system
  • Often only a tiny fraction of the documents is useful
  • Many databases are not crawlable
  • Often a search interface, with an existing keyword index, is available
  • How to identify “useful” documents?

Accessing Text DBs via Search Engines

Search engines impose limitations:

  • Limit on the documents retrieved per query
  • Support only simple keywords and phrases
  • Ignore “stopwords” (e.g., “a”, “is”)

QXtract: Querying Text Databases for Robust Scalable Information EXtraction

[Diagram: user-provided seed tuples → query generation → queries → promising documents → information extraction system → extracted relation]

Problem: learn keyword queries to retrieve “promising” documents.

Learning Queries to Retrieve Promising Documents

  1. Get a document sample with “likely negative” and “likely positive” examples.
  2. Label the sample documents using the information extraction system as an “oracle.”
  3. Train classifiers to “recognize” useful documents.
  4. Generate queries from the classifier model/rules.
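A minimal sketch of steps 3-4 (illustrative, not the QXtract implementation): train a linear classifier over word features, then turn the highest-weighted terms into keyword queries; scikit-learn stands in for the Okapi/SVM/Ripper learners of the next slide:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def learn_queries(docs, labels, n_terms=4):
    """docs: sample documents; labels: 1 if the oracle extracted tuples, else 0."""
    vec = CountVectorizer(binary=True, stop_words="english")
    X = vec.fit_transform(docs)
    clf = LinearSVC().fit(X, labels)
    terms = vec.get_feature_names_out()
    # Terms with the largest positive weights are most indicative of
    # "useful" documents; use them (singly or in pairs) as queries.
    ranked = sorted(zip(clf.coef_[0], terms), reverse=True)
    return [t for _, t in ranked[:n_terms]]
```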

Training Classifiers to Recognize “Useful” Documents

Document features: words.

[Diagram: sample documents labeled by the oracle (D1 +, D2 +, D3 −, D4 −) train three learners: Okapi (IR) and SVM rank terms such as “disease”, “reported”, “epidemic”, “infected”, “virus” above “products”, “exported”, “used”, “far”; Ripper learns rules such as “disease AND reported ⇒ USEFUL”.]
Generating Queries from Classifiers

[Diagram: each classifier’s model is converted into keyword queries. Okapi (IR) term rankings yield queries such as [epidemic virus] and [virus infected]; Ripper rules yield [disease AND reported]; QCombined pools the queries from all classifiers: [disease AND reported], [epidemic virus], ….]

Tuples: A Simple Querying Strategy

  1. Convert the given tuples into queries, e.g., [“Ebola” AND “Zaire”]
  2. Retrieve the matching documents
  3. Extract new tuples from the documents, and iterate

Comparison of Document Access Methods

  • QXtract: extracts 60% of the relation after retrieving only 10% of the 135,000-article newspaper database
  • Tuples strategy: recall of at most 46%

How to choose the best strategy?
  • Tuples: Simple, no training, but limited recall
  • QXtract: Robust, but has training and query overhead
  • Scan: No overhead, but must process all documents
Predicting Recall of the Tuples Strategy

(WebDB 2003)

[Diagram: starting from a seed tuple, querying may keep reaching new tuples (success) or quickly die out (failure).]

Can we predict whether Tuples will succeed?

Abstract the Problem: Querying Graph

[Diagram: a bipartite querying graph between tuples t1…t5 and documents d1…d5; a query such as [“Ebola” AND “Zaire”] sent to the search engine retrieves documents, which in turn contain tuples.]

Note: only the top K documents are returned for each query. <Violence, U.S.> retrieves many documents that do not contain tuples; searching for an extracted tuple may not retrieve its source document.

Information Reachability Graph

[Diagram: the querying graph induces a directed reachability graph among tuples: t1 retrieves document d1, which contains t2, giving the edge t1 → t2; here t2, t3, and t4 are “reachable” from t1, while t5 is not.]

Connected Components

  • Core: tuples that retrieve other tuples and themselves (strongly connected)
  • Out: reachable tuples that do not retrieve tuples in the Core
  • In: tuples that retrieve other tuples but are not themselves reachable

Sizes of Connected Components

How many tuples are in the largest Core + Out?

  • Conjecture:
    • The degree distribution in reachability graphs follows a “power law.”
    • Then the reachability graph has at most one giant component.
  • Define Reachability as the fraction of tuples in the largest Core + Out

[Diagram: bow-tie structure In → Core (strongly connected) → Out; a tuple t0 in In can reach the Core and Out.]

NYT Reachability Graph: Outdegree Distribution

[Plots: the outdegree distributions for MaxResults = 10 and MaxResults = 50 match a power-law distribution.]

NYT: Component Size Distribution

[Plots: component size distributions for MaxResults = 10 (CG / |T| = 0.297) and MaxResults = 50 (CG / |T| = 0.620); tuples outside the giant component are not “reachable.”]

Connected Components Visualization

[Visualization: DiseaseOutbreaks, New York Times 1995]

Estimating Reachability

In a power-law random graph G, a giant component CG emerges* when d (the average outdegree) exceeds 1, and its size can be estimated from d [Chung and Lu, Annals of Combinatorics, 2002].

  • Estimate: Reachability ≈ CG / |T|
  • Depends only on d (the average outdegree)

* For power-law exponent b < 3.457

Estimating Reachability: Algorithm

  1. Pick some random tuples
  2. Use the tuples to query the database
  3. Extract tuples from the matching documents to compute the reachability-graph edges
  4. Estimate the average outdegree d (e.g., d = 1.5 in the sample graph)
  5. Estimate reachability using the results of Chung and Lu [Annals of Combinatorics, 2002]
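A minimal sketch of the sampling procedure in steps 1-4; `search` and `extract` are assumed interfaces to the search engine and the extraction system:

```python
import random

def estimate_outdegree(seed_tuples, search, extract, sample_size=50):
    """Average outdegree of sampled tuples in the reachability graph."""
    sample = random.sample(seed_tuples, sample_size)
    degrees = []
    for t in sample:
        reached = set()
        for doc in search(t):             # query the database with tuple t
            reached.update(extract(doc))  # tuples found in the retrieved docs
        reached.discard(t)
        degrees.append(len(reached))
    return sum(degrees) / len(degrees)

# A giant component (high reachability) is predicted when the estimate d > 1.
```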

Estimating Reachability of NYT

[Plot: the reachability estimate converges to ≈ 0.46 after about 50 queries.]

The estimate can be used to predict the success (or failure) of a Tuples querying strategy.

To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
  • Information extraction applications extract structured relations from unstructured text

Example: Disease Outbreaks in The New York Times, extracted by an information extraction system (e.g., NYU’s Proteus):

“May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…”

An Abstract View of Text-Centric Tasks

[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]; the model for the rest of the talk:

  1. Retrieve documents from the text database
  2. Process the documents with the extraction system
  3. Extract output tuples

Executing a Text-Centric Task

Similar to the relational world, there are two major execution paradigms:

  • Scan-based: retrieve and process documents sequentially
  • Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results

→ The underlying data distribution dictates which is best.

Unlike the relational world:

  • Indexes are only “approximate”: the index is on keywords, not on the tuples of interest
  • The choice of execution plan affects output completeness (not only speed)

Execution Plan Characteristics

Execution plans have two main characteristics:

  • Execution time
  • Recall (fraction of tuples retrieved)

Question: how do we choose the fastest execution plan for reaching a target recall?

“What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?”

Outline
  • Description and analysis of crawl- and query-based plans
    • Scan (crawl-based)
    • Filtered Scan (crawl-based)
    • Iterative Set Expansion (query-based / index-based)
    • Automatic Query Generation (query-based / index-based)
  • Optimization strategy
  • Experimental results and conclusions

Scan

  • Scan retrieves and processes documents sequentially (until reaching the target recall)

Execution time = |Retrieved Docs| · (R + P)

(R = time to retrieve a document, P = time to process a document)

Question: how many documents does Scan retrieve to reach the target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).

Estimating Recall of Scan

Modeling Scan for a tuple t (e.g., <SARS, China>) with frequency g(t):

  • What is the probability of seeing t after retrieving S documents?
  • Retrieval is a “sampling without replacement” process
  • After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution
  • Recall for tuple t is the probability that the frequency of t in the S documents is > 0
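A minimal sketch of this recall model using scipy's hypergeometric distribution, with D the collection size and g(t) and S as defined above; the numbers in the example are illustrative:

```python
from scipy.stats import hypergeom

def recall_of_tuple(D, g_t, S):
    """P(frequency of t in the S retrieved documents > 0)."""
    return 1.0 - hypergeom.pmf(0, D, g_t, S)

def expected_recall(D, tuple_freqs, S):
    """Overall recall: the average per-tuple recall."""
    return sum(recall_of_tuple(D, g, S) for g in tuple_freqs) / len(tuple_freqs)

# e.g., a tuple mentioned in 5 of 100,000 docs, after scanning 20,000 of them:
print(recall_of_tuple(100_000, 5, 20_000))  # ~0.67
```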

Estimating Recall of Scan (continued)

Modeling Scan over all tuples (e.g., <SARS, China>, <Ebola, Zaire>):

  • Multiple “sampling without replacement” processes, one for each tuple
  • Overall recall is the average recall across tuples

→ We can compute the number of documents required to reach the target recall:

Execution time = |Retrieved Docs| · (R + P)

Iterative Set Expansion

  1. Query the database with seed tuples (e.g., <Ebola, Zaire> becomes the query [Ebola AND Zaire])
  2. Process the retrieved documents
  3. Extract tuples from the documents
  4. Augment the seed tuples with the new tuples (e.g., <Malaria, Ethiopia>), and repeat

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q

(R = time to retrieve a document, P = time to process a document, Q = time to answer a query)

Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?

Using the Querying Graph for Analysis

[Diagram: the bipartite querying graph between tuples (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1…d5.]

We need to compute:

  • The number of documents retrieved after sending Q tuples as queries (estimates time)
  • The number of tuples that appear in the retrieved documents (estimates recall)

To estimate these, we need to compute:

  • The degree distribution of the tuples discovered by retrieving documents
  • The degree distribution of the documents retrieved by the tuples
  • (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)

Summary of Cost Analysis
  • Our analysis so far:
    • Takes as input a target recall
    • Gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach it)
  • Time and recall depend on task-specific properties of the database:
    • Tuple degree distribution
    • Document degree distribution
  • Next, we show how to estimate the degree distributions on the fly
Estimating Cost Model Parameters

Tuple and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters.

Parameter Estimation
  • Naïve solution for parameter estimation:
    • Start with a separate “parameter-estimation” phase
    • Perform random sampling on the database
    • Stop when cross-validation indicates high confidence
  • We can do better than this:
    • No need for a separate sampling phase
    • Sampling is equivalent to executing the task

→ Piggyback parameter estimation onto execution

On-the-Fly Parameter Estimation

[Plots: the initial default estimate of the correct (but unknown) degree distribution is refined by successive updated estimates during execution.]

  • Pick the most promising execution plan for the target recall, assuming “default” parameter values
  • Start executing the task
  • Update the parameter estimates during execution
  • Switch plans if the updated statistics indicate so

Important:

  • Only Scan acts as “random sampling”
  • All other execution plans need parameter adjustment (see paper)
Outline
  • Description and analysis of crawl- and query-based plans
  • Optimization strategy
  • Experimental results and conclusions
Correctness of Theoretical Analysis

[Plots: solid lines show actual execution time; dotted lines show predicted time with correct parameters.]

Task: Disease Outbreaks, using the Snowball IE system over 182,531 documents from NYT, yielding 16,921 tuples.

Experimental Results (Information Extraction)

[Plots: solid lines show actual execution time; the green line shows time with the optimizer. Results are similar in other experiments; see the paper.]

Conclusions
  • Common execution plans for multiple text-centric tasks
  • Analytic models for predicting the execution time and recall of various crawl- and query-based plans
  • Techniques for on-the-fly parameter estimation
  • An optimization framework that picks, on the fly, the fastest plan for a target recall
Can we do better?
  • Yes, for some information extraction systems.
Bindings Engine (BE) [Slides: Cafarella 2005]
  • The Bindings Engine (BE) is a search engine where:
    • There are no downloads during query processing
    • Disk seeks are constant in corpus size
    • #queries = #phrases
  • BE’s approach:
    • A “variabilized” search query language
    • Pre-process all documents before query time
    • Integrate variable/type data with the inverted index, minimizing query seeks
BE Query Support

  cities such as <NounPhrase>
  President Bush <Verb>
  <NounPhrase> is the capital of <NounPhrase>
  reach me at <phone-number>

  • Any sequence of concrete terms and typed variables
  • NEAR is insufficient
  • Functions (e.g., “head(<NounPhrase>)”)
BE Operation
  • Like a generic search engine, BE:
    • Downloads a corpus of pages
    • Creates an index
    • Uses index to process queries efficiently
  • BE further requires:
    • Set of indexed types (e.g., “NounPhrase”), with a “recognizer” for each
    • String processing functions (e.g., “head()”)
  • A BE system can only process types and functions that its index supports
[Diagram: a standard inverted index; each term’s posting list stores a document count (#docs) followed by the sorted document IDs docid0 … docid#docs-1.]
Query: such as

[Diagram: intersecting the two sorted posting lists, e.g., …21, 150, 322, 2501… and …15, 99, 322, 426, 1309…]

  • Test for equality
  • Advance the smaller pointer
  • Abort when a list is exhausted

Returned docs: 322
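A minimal sketch of the intersection loop described above (the standard sorted-posting-list merge):

```python
def intersect(list_a, list_b):
    """Docids appearing in both sorted posting lists."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):  # abort when a list is exhausted
        if list_a[i] == list_b[j]:              # test for equality
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:             # advance the smaller pointer
            i += 1
        else:
            j += 1
    return result

print(intersect([21, 104, 150, 322, 2501], [15, 99, 322, 426, 1309]))  # [322]
```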

Phrase query: “such as”

[Diagram: the index augmented with positions; each posting stores (docid, pos) and points to a block of #posns positions pos0 … pos#pos-1.]

In phrase queries, match positions as well.

Neighbor Index

  • At each position in the index, store the “neighbor text” that might be useful
  • Let’s index <NounPhrase> and <Adj-Term>

“I love cities such as Atlanta.”

[Diagram: e.g., the left Adj-Term neighbor at the position of “cities” is AdjT: “love”.]

Neighbor Index (continued)

“I love cities such as Atlanta.”

[Diagram: each position stores its left/right neighbors, e.g., AdjT: “cities” and NP: “cities” at one position, AdjT: “I” and NP: “I” at another.]

Neighbor Index: Processing a Query

Query: “cities such as <NounPhrase>”

“I love cities such as Atlanta.”

[Diagram: at the match, the left neighbor is AdjT: “such” and the right neighbor is NP: “Atlanta”, which binds the variable.]

Query: “cities such as <NounPhrase>”

[Diagram: in doc 19, starting at position 8 of “I love cities such as Atlanta.”, the posting’s neighbor block (blk_offset, #neighbors) stores AdjT-left = “such” and NP-right = “Atlanta”.]

  • Find the phrase query positions, as with phrase queries
  • If a term is adjacent to a variable, extract the typed value
Current Research Directions
  • Modeling explicit and implicit network structures
    • Modeling the evolution of explicit structure on the web, blogspace, Wikipedia
    • Modeling implicit link structures in text, collections, the web
    • Exploiting implicit and explicit social networks (e.g., for epidemiology)
  • Knowledge discovery from biological and medical data
    • Automatic sequence annotation → bioinformatics, genetics
    • Actionable knowledge extraction from medical articles
  • Robust information extraction, retrieval, and query processing
    • Integrating information in structured and unstructured sources
    • Query processing over unstructured text
    • Robust search/question answering for medical applications
    • Confidence estimation for extraction from text and other sources
    • Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
    • Accuracy (≠ authority) of online sources
  • Information diffusion/propagation in online sources
    • Information propagation on the web, news
    • In collaborative sources (Wikipedia, MedLine)
Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]
  • “popular pages tend to get even more popular, while unpopular pages get ignored by an average user”
Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
Modeling Social Networks for Epidemiology, Security, …

[Figure: email exchange mapped onto cubicle locations.]

Mining Text and Sequence Data

(Agichtein & Eskin, PSB 2004)

[Figure: ROC50 scores for each class and method.]

Structure and Evolution of Blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004; KDD 2006]

[Figure: fraction of nodes in components of various sizes within the Flickr and Yahoo! 360 timegraph, by week.]

Thank You
  • Details: http://www.mathcs.emory.edu/~eugene/