IE approaches

  • Traditional IE (from NLP and CL)

    • Using syntactic and semantic constraints

  • Wrapper (independently developed for WWW)

    • Using delimiter-based extraction patterns

  • This paper

    • Soft patterns + IR (PRF) + summarization (sentence retrieval/ranking, MMR) techniques
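
To make "delimiter-based extraction patterns" concrete, here is a minimal Python sketch of a left/right-delimiter wrapper. The HTML snippet, the delimiters, and the helper name `make_lr_wrapper` are illustrative assumptions, not taken from any system discussed in the slides.

```python
import re
from typing import Callable, List

def make_lr_wrapper(left: str, right: str) -> Callable[[str], List[str]]:
    """Build an extractor that returns every string found between `left` and `right`."""
    # Delimiters are escaped so they are matched literally, not as regex syntax.
    pattern = re.compile(re.escape(left) + r"(.*?)" + re.escape(right), re.DOTALL)

    def extract(page: str) -> List[str]:
        return [match.strip() for match in pattern.findall(page)]

    return extract

# Hypothetical page: each record is a country name in <b> and a code in <i>.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
extract_country = make_lr_wrapper("<b>", "</b>")
extract_code = make_lr_wrapper("<i>", "</i>")
records = list(zip(extract_country(page), extract_code(page)))
print(records)
```

The wrapper relies only on page markup, with no syntactic or semantic analysis, which is the contrast with traditional IE drawn on this slide.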

Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

  • IE from a QA perspective

  • Research question: finding definition sentences for terms or person names

  • Previous approaches:

    • hand-crafted rules (previous paper), or

    • supervised learning

  • Research method:

    • unsupervised soft patterns + IR + summarization

  • External tools needed: commercial POS tagger and syntactic chunker (NP, VP)

Soft Patterns

  • A virtual vector representation (window size 3)

    • <Slot_-w, ..., Slot_-2, Slot_-1, SCH_TERM, Slot_1, Slot_2, ..., Slot_w : Pa>

  • Slot: a vector of tokens with their probabilities of occurrence

    • <(token_i1, weight_i1), (token_i2, weight_i2), ..., (token_im, weight_im) : Slot_i>

  • Token: word, punctuation or syntactic tag (substituted?)
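
As a rough illustration of the representation above, the following Python sketch aggregates token windows around SCH_TERM into per-slot probability vectors. The function name `build_soft_pattern`, the toy instances, and the use of plain relative frequencies as weights are assumptions made for this example; the paper's actual weighting may differ.

```python
from collections import Counter
from typing import Dict, List

SCH_TERM = "SCH_TERM"

def build_soft_pattern(instances: List[List[str]], w: int = 3) -> Dict[int, Dict[str, float]]:
    """Map each slot offset (-w..-1, 1..w) to a {token: relative frequency} vector."""
    counts = {offset: Counter() for offset in range(-w, w + 1) if offset != 0}
    for tokens in instances:
        if SCH_TERM not in tokens:
            continue  # an instance must contain the search-term placeholder
        center = tokens.index(SCH_TERM)
        for offset in counts:
            pos = center + offset
            if 0 <= pos < len(tokens):
                counts[offset][tokens[pos]] += 1
    pattern = {}
    for offset, counter in counts.items():
        total = sum(counter.values())
        pattern[offset] = {tok: c / total for tok, c in counter.items()} if total else {}
    return pattern

# Hypothetical instances after tagging, chunking and substitution
# (noun phrases replaced by the tag "NP").
instances = [
    ["SCH_TERM", ",", "known", "as", "NP"],
    ["SCH_TERM", "is", "known", "as", "NP"],
    ["NP", ",", "SCH_TERM", ",", "NP"],
]
pattern = build_soft_pattern(instances)
print(pattern[1])
```

Each slot keeps every token it has seen together with a weight, so a test sentence can match partially instead of having to satisfy one rigid rule.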

Soft Patterns Emerged from Text

Soft Patterns Matching Process

  • [Flow diagram; labels recovered from the figure]

    • Test sentence

    • Tagging, chunking, substitution

    • Pa instances

    • S instance: <token_-w, ..., token_-2, token_-1, SCH_TERM, token_1, token_2, ..., token_w : S>

    • Probability estimate

    • Soft patterns Pa

  • Matching:

    • 1) bag-of-words similarity using Naive Bayes

    • 2) sequence fidelity using bigram model

    • 3) weighting patterns by their overall weight

Soft Patterns Matching

  • bag-of-words similarity using Naive Bayes

  • sequence fidelity using bigram model

Where is Pa?

Manual Tuning alpha?
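
The formulas for the two scores did not survive in this transcript, so the following Python sketch is only one plausible reading of the two bullets. It assumes slots are independent (the Naive Bayes assumption), uses a geometric mean for length normalisation, floors unseen events at a small constant, and mixes the two scores with a hypothetical interpolation weight `alpha`. None of these choices is confirmed by the slides, and the role `alpha` plays in the paper may be different.

```python
import math
from typing import Dict, List, Tuple

Slots = Dict[int, Dict[str, float]]     # slot offset -> {token: probability}
Bigrams = Dict[Tuple[str, str], float]  # (previous token, token) -> probability

def slot_similarity(slots: Slots, window: Dict[int, str], floor: float = 1e-3) -> float:
    """Bag-of-words score: geometric mean of Pr(token | slot), slots taken as independent."""
    if not window:
        return 0.0
    log_sum = sum(math.log(slots.get(off, {}).get(tok, floor)) for off, tok in window.items())
    return math.exp(log_sum / len(window))

def sequence_fidelity(bigrams: Bigrams, tokens: List[str], floor: float = 1e-3) -> float:
    """Sequence score: geometric mean of bigram probabilities over adjacent tokens."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    log_sum = sum(math.log(bigrams.get(pair, floor)) for pair in pairs)
    return math.exp(log_sum / len(pairs))

def match_score(slots: Slots, bigrams: Bigrams, window: Dict[int, str], alpha: float = 0.5) -> float:
    """Assumed combination: linear interpolation of the two scores."""
    with_term = dict(window)
    with_term[0] = "SCH_TERM"  # put the search-term placeholder back for the bigram chain
    ordered = [with_term[off] for off in sorted(with_term)]
    return alpha * slot_similarity(slots, window) + (1 - alpha) * sequence_fidelity(bigrams, ordered)

# Toy pattern statistics (made up for the example).
slots = {1: {",": 0.6, "is": 0.4}, 2: {"known": 0.5, "NP": 0.5}}
bigrams = {("SCH_TERM", ","): 0.6, (",", "known"): 0.5, ("SCH_TERM", "is"): 0.4}
good = match_score(slots, bigrams, {1: ",", 2: "known"})
bad = match_score(slots, bigrams, {1: "said", 2: "yesterday"})
print(good > bad)
```

The third matching step, weighting by the overall pattern weight, is left out of this sketch.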

System Architecture

  • [Architecture diagram; labels recovered from the figure, in approximate pipeline order]

    • Search term

    • IR, anaphora resolution

    • Input relevant sentences

    • Ranked sentences

    • Top n by PRF

    • SP generation

    • Reranking by pattern matching

    • Matched candidate sentences as definition

    • Redundancy removal: MMR

    • Final sentence selection

  • Pseudo-relevance feedback or assumption?

Centroid Word Selection

  • Which sentences are most likely to contain a definition?

    • Local centroid words (summarization techniques)

    • For each word, compute its mutual information with the search term
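
One simple way to score candidate centroid words is pointwise mutual information over sentence-level co-occurrence. The Python sketch below assumes that reading; the paper's exact statistic may differ, and the function name `pmi_with_term` and the toy sentences are made up for the example.

```python
import math
from typing import Dict, List

def pmi_with_term(sentences: List[List[str]], term: str) -> Dict[str, float]:
    """Pointwise mutual information of each word with `term`, counted per sentence."""
    n = len(sentences)
    term_sf = sum(1 for s in sentences if term in s)  # sentence frequency of the term
    scores: Dict[str, float] = {}
    vocab = {w for s in sentences for w in s if w != term}
    for w in vocab:
        w_sf = sum(1 for s in sentences if w in s)
        co = sum(1 for s in sentences if w in s and term in s)
        if co == 0:
            continue  # never co-occurs with the term: no score assigned
        # log( P(w, term) / (P(w) * P(term)) ) with probabilities as sentence fractions
        scores[w] = math.log((co * n) / (w_sf * term_sf))
    return scores

sentences = [
    ["tsunami", "is", "a", "wave"],
    ["tsunami", "wave", "hit", "coast"],
    ["the", "coast", "is", "long"],
    ["the", "market", "fell"],
]
scores = pmi_with_term(sentences, "tsunami")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Plain PMI favours rare words, so a practical system would likely add a frequency threshold or an idf-style factor before picking centroid words.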

Summary of the techniques employed

  • Core: soft pattern generalization and matching

  • Others:

    • Heavy use of summarization techniques

      • MMR for redundancy removal

      • Sentence Ranking/Retrieval

    • Shallow NLP

      • POS tagging and syntactic chunking
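
For readers unfamiliar with MMR (maximal marginal relevance), a minimal Python sketch of greedy MMR selection follows. Jaccard overlap as the redundancy measure, the trade-off weight `lam`, and the relevance scores are assumptions for this example, not details from the paper.

```python
from typing import List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mmr_select(candidates: List[str], relevance: List[float], k: int, lam: float = 0.7) -> List[str]:
    """Greedily pick k sentences, trading relevance against similarity to earlier picks."""
    token_sets = [set(c.lower().split()) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i: int) -> float:
            # Redundancy is the highest overlap with any sentence already chosen.
            redundancy = max((jaccard(token_sets[i], token_sets[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

candidates = [
    "a tsunami is a large ocean wave",
    "a tsunami is a large sea wave",
    "tsunamis are caused by undersea earthquakes",
]
relevance = [0.9, 0.85, 0.6]  # hypothetical pattern-matching scores
picked = mmr_select(candidates, relevance, k=2, lam=0.5)
print(picked)
```

With `lam=0.5` the near-duplicate second sentence loses to the third one despite its higher relevance, which is the redundancy-removal effect the slide refers to.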

Evaluation for Information Extraction

Evaluation for Definition Extraction

  • Test data:

    • TREC QA corpus

    • Online news (heuristics leaning to news text)

  • Experiment:

    • Comparison to HCR and centroid-based statistical method (baseline)

    • F5-measure
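
"F5" presumably denotes the F-measure with β = 5, which weights recall much more heavily than precision. A small Python sketch of the standard F-beta formula is given below; the precision and recall values in the example are made up, not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing correct was returned
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical scores: low precision barely hurts when beta = 5.
print(round(f_beta(0.2, 0.6), 3))
print(round(f_beta(0.2, 0.6, beta=1.0), 3))
```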

Evaluation for TREC collection

Evaluation for Web Corpus

Questions for this paper

  • Chunker-dependent performance? (NP, VP)

  • Manually tuned parameters (alpha, delta)?

  • Void PRF?

  • Question selection: seed for pattern generation

  • Is it “patterns” or just one pattern after all?

  • Arbitrary window size?

  • Is it really “unsupervised learning”?

    • Part of data used for rule induction

  • Can SP+PRF really beat HCR?


