towards web scale information extraction l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Towards Web-Scale Information Extraction PowerPoint Presentation
Download Presentation
Towards Web-Scale Information Extraction

Loading in 2 Seconds...

play fullscreen
1 / 45

Towards Web-Scale Information Extraction - PowerPoint PPT Presentation


  • 177 Views
  • Uploaded on

Towards Web-Scale Information Extraction. Eugene Agichtein Mathematics & Computer Science Emory University eugene@mathcs.emory.edu http://www.mathcs.emory.edu/~eugene/. The Value of Text Data. “Unstructured” text data is the primary form of human-generated information

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Towards Web-Scale Information Extraction' - aulii


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
towards web scale information extraction

Towards Web-Scale Information Extraction

Eugene Agichtein

Mathematics & Computer Science

Emory University

eugene@mathcs.emory.edu

http://www.mathcs.emory.edu/~eugene/

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

the value of text data
The Value of Text Data
  • “Unstructured” text data is the primary form of human-generated information
    • Blogs, web pages, news, scientific literature, online reviews, …
    • Semi-structured data (database generated): see Prof. Bing Liu’s KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html
    • The techniques discussed here are complimentary to structured object extraction methods
  • Need to extract structured information to effectively manage, search, and mine the data
  • Information Extraction: mature, but active research area
    • Intersection of Computational Linguistics, Machine Learning, Data mining, Databases, and Information Retrieval
    • Traditional focus on accuracy of extraction

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

example answering queries over text
Example: Answering Queries Over Text

For years, Microsoft CorporationCEOBill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Select Name

From PEOPLE

Where Organization = ‘Microsoft’

PEOPLE

Name Title Organization

Bill GatesCEOMicrosoft

Bill VeghteVPMicrosoft

Richard StallmanFounderFree Soft..

Bill Gates

Bill Veghte

(from William Cohen’s IE tutorial, 2003)

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

outline
Outline
  • Information Extraction Tasks
    • Entity tagging
    • Relation extraction
    • Event extraction
  • Scaling up Information Extraction
    • Focus on scaling up to large collections (where data mining can be most beneficial)
    • Other dimensions of scalability

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

information extraction tasks
Information Extraction Tasks
  • Extracting entities and relations: this talk
    • Entities: named (e.g., Person) and generic (e.g., disease name)
    • Relations: entities related in a predefined way (e.g., Location of a Disease outbreak, or a CEO of a Company)
    • Events: can be composed from multiple relation tuples
  • Common extraction subtasks:
    • Preprocess: sentence chunking, syntactic parsing, morphological analysis
    • Create rules or extraction patterns: hand-coded, machine learning, and hybrid
    • Apply extraction patterns or rules to extract new information
    • Postprocess and integrate information
      • Co-reference resolution, deduplication, disambiguation

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

entity tagging
Entity Tagging
  • Identifying mentions of entities (e.g., person names, locations, companies) in text
    • MUC (1997): Person, Location, Organization, Date/Time/Currency
    • ACE (2005): more than 100 more specific types
  • Hand-coded vs. Machine Learning approaches
  • Best approach depends on entity type and domain:
    • Closed class (e.g., geographical locations, disease names, gene & protein names): hand coded + dictionaries
    • Syntactic (e.g., phone numbers, zip codes): regular expressions
    • Semantic (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc.
    • “Almost solved” for common/typical entity types

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

example extracting entities from text
Useful for data warehousing, data cleaning, web data integration

House number

Zip

State

Building

Road

City

Example: Extracting Entities from Text

Address

4089 Whispering Pines Nobel Drive San Diego CA 92122

1

Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002

Citation

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

hand coded methods

ContactPattern  RegularExpression(Email.body,”can be reached at”)

Hand-Coded Methods
  • Easy to construct in some cases
    • e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
  • Intuitive to debug and maintain
    • Especially if written in a “high-level” language:
    • Can incorporate domain knowledge
  • Scalability issues:
    • Labor-intensive to create
    • Highly domain-specific
    • Often corpus-specific
    • Rule-matches can be expensive

[IBM Avatar]

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

machine learning methods

The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.“

[From AliBaba]

Machine Learning Methods
  • Can work well when lots of training data easy to construct
  • Can capture complex patterns that are hard to encode with hand-crafted rules
    • e.g., determine whether a review is positive or negative
    • extract long complex gene names
    • Non-local dependencies

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

representation models cohen and mccallum 2003
Representation Models [Cohen and McCallum, 2003]

Classify Pre-segmentedCandidates

Lexicons

Sliding Window

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

member?

Classifier

Classifier

Alabama

Alaska

Wisconsin

Wyoming

which class?

which class?

Try alternatewindow sizes:

Context Free Grammars

Boundary Models

Finite State Machines

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

BEGIN

Most likely state sequence?

NNP

NNP

V

V

P

NP

Most likely parse?

Classifier

PP

which class?

VP

NP

VP

BEGIN

END

BEGIN

END

S

…and beyond

Any of these models can be used to capture words, formatting or both.

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

popular machine learning methods
Popular Machine Learning Methods

For details: [Feldman, 2006 and Cohen, 2004]

  • Naive Bayes
  • SRV [Freitag 1998], Inductive Logic Programming
  • Rapier [Califf and Mooney 1997]
  • Hidden Markov Models [Leek 1997]
  • Maximum Entropy Markov Models [McCallum et al. 2000]
  • Conditional Random Fields [Lafferty et al. 2001]
  • Scalability
    • Can be labor intensive to construct training data
    • At run time, complex features can be expensive to construct or process (batch algorithms can help: [Chandel et al. 2006] )

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

some available entity taggers
Some Available Entity Taggers
  • ABNER:
    • http://www.cs.wisc.edu/~bsettles/abner/
    • Linear-chain conditional random fields (CRFs) with orthographic and contextual features.
  • Alias-I LingPipe
    • http://www.alias-i.com/lingpipe/
  • MALLET:
    • http://mallet.cs.umass.edu/index.php/Main_Page
    • Collection of NLP and ML tools, can be trained for name entity tagging
  • MinorThird:
    • http://minorthird.sourceforge.net/
    • Tools for learning to extract entities, categorization, and some visualization
  • Stanford Named Entity Recognizer:
    • http://nlp.stanford.edu/software/CRF-NER.shtml
    • CRF-based entity tagger with non-local features

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

alias i lingpipe http www alias i com lingpipe
Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ )
  • Statistical named entity tagger
    • Generative statistical model
      • Find most likely tags given lexical and linguistic features
      • Accuracy at (or near) state of the art on benchmark tasks
  • Explicitly targets scalability:
    • ~100K tokens/second runtime on single PC
    • Pipelined extraction of entities
    • User-defined mentions, pronouns and stop list
      • Specified in a dictionary, left-to-right, longest match
    • Can be trained/bootstrapped on annotated corpora

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

outline14
Outline
  • Overview of Information Extraction
    • Entity tagging
    • Relation extraction
    • Event extraction
  • Scaling up Information Extraction
    • Focus on scaling up to large collections (where data mining and ML techniques shine)
    • Other dimensions of scalability

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

relation extraction examples

May 19 1995, Atlanta -- The Centers for Disease Control

and Prevention, which is in the front line of the world's

response to the deadly Ebola epidemic in Zaire, is finding

itself hard pressed to cope with the crisis…

interact

complex

CBF-A CBF-C

CBF-B CBF-A-CBF-C complex

associates

Relation Extraction Examples

Disease Outbreaks relation

  • Extract tuples of entities that are related in predefined way

Relation Extraction

„We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“

[From AliBaba]

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

relation extraction approaches
Relation Extraction Approaches

Knowledge engineering

  • Experts develop rules, patterns:
    • Can be defined over lexical items: “<company> located in <location>”
    • Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj <location>))”
  • Sophisticated development/debugging environments:
    • Proteus, GATE

Machine learning

  • Supervised: Train system over manually labeled data
    • Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005, Cardie et al 2006, Mooney et al. 2005, …
  • Partially-supervised: train system by bootstrapping from “seed” examples:
    • Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, …
    • “Open” (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007
  • Hybrid or interactive systems:
    • Experts interact with machine learning algorithms (e.g., active learning family) to iteratively refine/extend rules and patterns
    • Interactions can involve annotating examples, modifying rules, or any combination

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

open information extraction banko et al ijcai 2007
Open Information Extraction [Banko et al., IJCAI 2007]
  • Self-Supervised Learner:
    • All triples in a sample corpus (e1, r, e2) are considered potential “tuples” for relation r
    • Positive examples: candidate triplets generated by a dependency parser
    • Train classifier on lexical features for positive and negative examples
  • Single-Pass Extractor:
    • Classify all pairs of candidate entities for some (undetermined) relation
    • Heuristically generate a relation name from the words between entities
  • Redundancy-Based Assessor:
    • Estimate probability that entities are related from co-occurrence statistics
  • Scalability
    • Extraction/Indexing
      • No tuning or domain knowledge during extraction, relation inclusion determined at query time
      • 0.04 CPU seconds pre sentence, 9M web page corpus in 68 CPU hours
      • Every document retrieved, processed (parsed, indexed, classified) in a single pass
    • Query-time
      • Distributed index for tuples by hashing on the relation name text
  • Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and Feldman 2006], …

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

event extraction
Event Extraction
  • Similar to Relation Extraction, but:
    • Events can be nested
    • Significantly more complex (e.g., more slots) than relations/template elements
    • Often requires coreference resolution, disambiguation, deduplication, and inference
  • Example: an integrated disease outbreak event [Hatunnen et al. 2002]

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

event extraction integration challenges
Event Extraction: Integration Challenges
  • Information spans multiple documents
    • Missing or incorrect values
    • Combining simple tuples into complex events
    • No single key to order or cluster likely duplicates while separating them from similar but different entities.
    • Ambiguity: distinct physical entities with same name (e.g., Kennedy)
  • Duplicate entities, relation tuples extracted
    • Large lists with multiple noisy mentions of the same entity/tuple
    • Need to depend on fuzzy and expensive string similarity functions
    • Cannot afford to compare each mention with every other.
  • See Part II of KDD 2006 Tutorial “Scalable Information Extraction and Integration” -- scaling up integration: http://www.scalability-tutorial.net/

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

other information extraction tutorials
Other Information Extraction Tutorials

See these tutorials for more details:

  • R. Feldman, Information Extraction – Theory and Practice, ICML 2006http://www.cs.biu.ac.il/~feldman/icml_tutorial.html
  • W. Cohen, A. McCallum, Information Extraction and Integration: an Overview, KDD 2003 http://www.cs.cmu.edu/~wcohen/ie-survey.ppt

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

summary accuracy of extraction tasks
Summary: Accuracy of Extraction Tasks
  • Errors cascade (errors in entity tag cause errors in relation extraction)
  • This estimate is optimistic:
    • Primarily for well-established (tuned) tasks
    • Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit lower accuracy
    • Accuracy for all tasks is significantly lower for non-English

[Feldman, ICML 2006 tutorial]

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

multilingual information extraction
Multilingual Information Extraction
  • Active research area, beyond the scope of this talk. Nevertheless, a few (incomplete) pointers are provided.
  • Closely tied to machine translation and cross-language information retrieval efforts.
  • Language-independent named entity tagging and related tasks at CoNLL:
    • 2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/)
    • 2002, 2003 shared tasks: language independent Named Entity Tagging (http://www.cnts.ua.ac.be/conll2003/ner/)
  • Global Autonomous Language Exploitation program (GALE):
    • http://www.darpa.mil/ipto/Programs/gale/concept.htm
  • Interlingual Annotation of Multilingual Text Corpora (IAMTC)
    • Tools and data for building MT and IE systems for six languages
    • http://aitc.aitcnet.org/nsf/iamtc/index.html
  • REFLEX project: NER for 50 languages
    • Exploit for training temporal correlations in weekly aligned corpora
    • http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
  • Cross-Language Information Retrieval (CLEF)
    • http://www.clef-campaign.org/

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

outline23
Outline
  • Overview of Information Extraction
    • Entity tagging
    • Relation extraction
    • Event Extraction
  • Scaling up Information Extraction
    • Focus on scaling up to large collections (where data mining and ML techniques shine)
    • Other dimensions of scalability

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

scaling information extraction to the web
Scaling Information Extraction to the Web
  • Dimensions of Scalability
    • Corpus size:
      • Applying rules/patterns is expensive
      • Need efficient ways to select/filter relevant documents
    • Document accessibility:
      • Deep web: documents only accessible via a search interface
      • Dynamic sources: documents disappear from top page
    • Source heterogeneity:
      • Coding/learning patterns for each source is expensive
      • Requires many rules (expensive to apply)
    • Domain diversity:
      • Extracting information for any domain, entities, relationships
      • Some recent progress (e.g., see slide 17)
      • Not the focus of this talk

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

scaling up information extraction
Scaling Up Information Extraction
  • Scan-based extraction
    • Classification/filtering to avoid processing documents
    • Sharing common tags/annotations
  • General keyword index-based techniques
    • QXtract, KnowItAll
  • Specialized indexes
    • BE/KnowItNow, Linguist’s Search Engine
  • Parallelization/distributed processing
    • IBM WebFountain, UIMA, Google’s Map/Reduce

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

efficient scanning for information extraction

Classifier

Filter documents

Efficient Scanning for Information Extraction

Text Database

Extraction

System

  • 80/20 rule: use few simple rules to capture majority of the instances [Pantel et al. 2004]
  • Train a classifier to discard irrelevant documents without processing [Grishman et al. 2002]
    • (e.g., the Sports section of NYT is unlikely to describe disease outbreaks)
  • Share base annotations (entity tags) for multiple extraction tasks

filtered

Retrieve docs from database

Process documents

Extract output tuples

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

exploiting keyword and phrase indexes
Exploiting Keyword and Phrase Indexes
  • Generate queries to retrieve only relevant documents
  • Data mining problem!
  • Some methods in literature:
    • Traversing Query Graphs [Agichtein et al. 2003]
    • Iteratively refine queries [Agichtein and Gravano 2003]
    • Iteratively partition document space [Etzioni et al., 2004]
  • Case studies: QXtract, KnowItAll

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

simple strategy iterative set expansion
Simple Strategy: Iterative Set Expansion

Text Database

Extraction

System

Query

Generation

Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q

Process retrieved documents

Extract tuplesfrom docs

Augment seed tuples with new tuples

Query database with seed tuples

(e.g., <Malaria, Ethiopia>)

(e.g., [Ebola AND Zaire])

Time for retrieving a document

Time for processing a document

Time for answering a query

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

reachability via querying
Reachability via Querying

[Agichtein et al. 2003b]

ReachabilityGraph

Tuples

Documents

t1

t1

d1

<SARS, China>

t2

t3

d2

t2

<Ebola, Zaire>

t3

d3

t4

t5

<Malaria, Ethiopia>

t4

d4

t1retrieves document d1that contains t2

<Cholera, Sudan>

t5

d5

<H5N1, Vietnam>

Upper recall limit: determined by the size of the biggest connected component

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

reachability graph for diseaseoutbreaks
Reachability Graph for DiseaseOutbreaks

DiseaseOutbreaks, New York Times 1995

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

getting around reachability limits
Getting Around Reachability Limits
  • KnowItAll [Etzioni et al., WWW 2004]:
    • Add keywords to partition documents into retrievable disjoint sets
    • Submit queries with parts of extracted instances
  • QXtract [Agichtein and Gravano 2003a]
    • General queries with many matching documents
    • Assumes many documents retrievable per query

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

qxtract agichtein and gravano 2003

User-Provided Seed Tuples

QXtract [Agichtein and Gravano 2003]

Seed Sampling

  • Get document sample with “likely negative” and “likely positive” examples.
  • Label sample documents usinginformation extraction systemas “oracle.”
  • Train classifiers to “recognize”useful documents.
  • Generate queries from classifiermodel/rules.

Information Extraction

Classifier Training

Query Generation

Queries

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

using generic indexes summary
Using Generic Indexes: Summary
  • Order of magnitude speed up in runtime
  • But: keyword indexes are approximate (so the queries are not precise)
  • Require many documents to retrieve
  • Can we do better?

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

index structures for information extraction
Index Structures for Information Extraction
  • Bindings Engine [Cafarella and Etzioni 2005]
  • Indexing and querying entities: [K. Chakrabarti et al. 2006]
  • IBM Avatar project:
    • http://www.almaden.ibm.com/cs/projects/avatar/
  • Other indexing schemes:
    • Linguist’s search engine (P. Resnik): http://lse.umiacs.umd.edu:8080/
    • FREE: Indexing regular expressions: [Cho and Rajagolapan, ICDE 2002]
    • Indexing and querying linguistic information in XML: [Bird et al., 2006]

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

bindings engine be cafarella and etzioni 2005

<offset>

3

AdjTleft

such

NPright

Philadelphia

Bindings Engine (BE) [Cafarella and Etzioni 2005]
  • Variabilized search query language
  • Integrates variable/type data with inverted index, minimizing query seeks
    • Index <NounPhrase>, <Adj-Term> terms
  • Key idea: neighbor index
    • At each position in the index, store “neighbor text” – both lexemes and tags

Query: “cities such as <NounPhrase>”

#docs

docid0

pos0

docid1

pos1

docid#docs-1

pos#docs-1

19

#posns

#posns

pos0

pos0

neighbor0

pos1

pos1

neighbor1

pos#pos-1

pos#pos-1

12

blk_offset

#neighbors

neighbor0

str0

neighbor1

str1

Result: in document 19: “I love cities such as Philadelphia.”

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

related approach k chakrabarti et al 2006
Related Approach: [K. Chakrabarti et al. 2006]
  • Support “relationship” keyword queries over indexed entities
  • Top-K support for early processing termination

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

workload driven indexing s chakrabarti et al 2006 indexing thousands of entity types
Workload-Driven Indexing [S. Chakrabarti et al. 2006]Indexing Thousands of Entity Types

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

workload driven indexing continued
Workload-Driven Indexing (continued)

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

parallelization adaptive processing
Parallelization/Adaptive Processing
  • Parallelize processing:
    • WebFountain [Gruhl et al. 2004]
    • UIMA architecture
    • Map/Reduce

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

ibm webfountain
IBM WebFountain

[Gruhl et al. 2004]

  • Dedicated share-nothing 256-node cluster
  • Blackboard annotation architecture
  • Data pipelined and streamed past each “augmenter” to add annotations
  • Merge and index annotations
  • Index both tokens and annotations
  • Between 25K-75K entities per second

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

uima ibm research
UIMA (IBM Research)
  • Unstructured Information Management Architecture (UIMA)
    • http://www.research.ibm.com/UIMA/
  • Open component software architecture for development, composition, and deployment of text processing and analysis components.
  • Run-time framework allows to plug in components and applications and run them on different platforms. Supports distributed processing, failure recovery, …
  • Scales to millions of documents – incorporated into IBM OmniFind, grid computing-ready
  • The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components.
  • Framework source code also available on Sourceforge:
    • http://uima-framework.sourceforge.net/

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

map reduce dean ghemawat osdi 2004
Map/Reduce ([Dean & Ghemawat, OSDI 2004])

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

map reduce continued
Map/Reduce (continued)
  • General framework
    • Scales to 1000s of machines
    • Recently implemented in Nutch and other open source efforts
  • “Maps” nicely to information extraction
    • Map phase:
      • Parse individual documents
      • Tag entities
      • Propose candidate relation tuples
    • Reduce phase
      • Merge multiple mentions of same relation tuple
      • Resolve co-references, duplicates

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

summary
Summary
  • Presented a brief overview of information extraction
  • Techniques and ideas to scale up information extraction
    • Scan-based techniques (limited impact)
    • Exploiting general indexes (limited accuracy)
    • Building specialized index structures (most promising)
  • Scalability is a data mining problem
    • Querying graphs  link discovery
    • Workload mining for index optimization
    • Automatically optimizing for specific text mining application
    • Considerations for building integrated data mining systems

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction

references and supplemental materials
References and Supplemental Materials
  • References, slides, and additional information available at:

http://www.mathcs.emory.edu/~eugene/kdd-webinar/

  • Will also post answers to questions

Eugene Agichtein KDD Webinar: Towards Web-Scale Information Extraction