Efficient Information Extraction from Dynamic Corpora Using Advanced Programs

Optimizing Complex Extraction Programs over Evolving Text Data • Authors: • Fei Chen, University of Wisconsin-Madison, Madison, WI, USA • Byron J. Gao, Texas State University-San Marcos, San Marcos, TX, USA • AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA • Jun Yang, Duke University, Durham, NC, USA • Raghu Ramakrishnan, Yahoo! Research, Santa Clara, CA, USA • Presented by: Yogendra Godbole

Introduction • Motivation • Traditional IE methods: static corpus • Practical conditions: dynamic corpus • DBLife (10,000+ URLs, 120+ MB corpus snapshot) • Enterprise intranet • Problem • How to efficiently extract information from dynamic corpora

Problem Definition • Concepts • Data pages, extractors, mentions • An extractor E: p → R(a1, a2, …, an) extracts mentions of relation R from page p. A mention of R is a tuple (m1, m2, …, mn) such that each mi is either a mention of attribute ai or nil. • Examples • Assumptions • Mentions are extracted from each data page individually

Methods • Concepts • Extractor scope • Let s.start and s.end be the start and end character positions of a string s in a page p. We say an extractor E has scope α iff for any mention m = (m1, …, mn) produced by E, (max_i mi.end − min_i mi.start) < α, where mi.start and mi.end are the start and end character positions of attribute mention mi in page p. • Extractor context • The β-context of mention m in page p is the string p[(m.start − β)..(m.end + β)], i.e., the string of m extended on both sides by β characters. We say extractor E has context β iff for any m and any p′ obtained by perturbing the text of p outside the β-context of m, applying E to p′ still produces m as a mention. • Challenges • Matchers (finding overlapping regions)
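The scope and context definitions above can be illustrated with a minimal Python sketch. The names `Span`, `mention_extent`, `within_scope`, and `beta_context` are hypothetical and not from the paper. The sketch assumes end offsets are exclusive and clips the β-context at the page boundaries; the slide fixes neither convention.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Span:
    """Character positions of one attribute mention (end is exclusive: an assumption)."""
    start: int
    end: int


# A mention is a tuple of attribute spans; None stands in for "nil".
Mention = Sequence[Optional[Span]]


def mention_extent(mention: Mention) -> int:
    """Return max_i m_i.end - min_i m_i.start over the non-nil attributes."""
    present = [s for s in mention if s is not None]
    return max(s.end for s in present) - min(s.start for s in present)


def within_scope(mention: Mention, alpha: int) -> bool:
    """True if this mention is consistent with an extractor of scope alpha."""
    return mention_extent(mention) < alpha


def beta_context(page: str, mention: Mention, beta: int) -> str:
    """Return the mention's text extended by beta characters on both sides."""
    present = [s for s in mention if s is not None]
    start = min(s.start for s in present)
    end = max(s.end for s in present)
    lo = max(0, start - beta)          # clip at the start of the page
    hi = min(len(page), end + beta)    # clip at the end of the page
    return page[lo:hi]
```

For instance, on the page "Talk by Alice Smith at 3pm in room CS 105." a (speaker, time) mention covering "Alice Smith" and "3pm" spans 18 characters, so it is consistent with any scope α > 18.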

Problem Definition (cont) • Let P1, . . . , Pn be consecutive snapshots of a text corpus, ρ be an IE program written in xlog, E1, . . . ,Em be the IE blackboxes (i.e., IE predicates) in ρ, and (α1, β1), . . . , (αm, βm) be the estimated scopes and contexts for the blackboxes, respectively. Develop a solution to execute ρ over corpus snapshot Pn+1 with minimal cost, by reusing extraction results over P1, . . . , Pn.

Solutions • CAPTURING IE RESULTS • Level of Reuse • IE Results to Capture • Storing Captured IE Results • REUSING CAPTURED IE RESULTS • Scope of Mention Reuse • Overall Processing Algorithm • Identifying Reuse with Matchers • SELECTING A GOOD IE PLAN • Searching for Good Plans • Cost Model
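The slide only names the step "Identifying Reuse with Matchers", so the following is a rough sketch of the idea and not the authors' algorithm. It assumes Python's standard `difflib.SequenceMatcher` as a stand-in matcher and a hypothetical helper `reuse_mentions`. An old mention is copied to the new page only when its whole β-context lies inside one region that is identical in both snapshots, so the extractor would see the same surrounding text.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (start, end, text) with an exclusive end offset: an assumed representation.
Mention = Tuple[int, int, str]


def reuse_mentions(old_page: str, new_page: str,
                   old_mentions: List[Mention], beta: int) -> List[Mention]:
    """Copy mentions whose beta-context is unchanged between two snapshots."""
    matcher = SequenceMatcher(None, old_page, new_page, autojunk=False)
    # Each block says old_page[a:a+size] == new_page[b:b+size].
    blocks = [blk for blk in matcher.get_matching_blocks() if blk.size > 0]
    reused = []
    for start, end, text in old_mentions:
        for blk in blocks:
            blk_end = blk.a + blk.size
            # Left side: context inside the block, or the block starts both pages.
            left_ok = blk.a <= start - beta or (blk.a == 0 and blk.b == 0)
            # Right side: context inside the block, or the block ends both pages.
            right_ok = end + beta <= blk_end or (
                blk_end == len(old_page) and blk.b + blk.size == len(new_page))
            if blk.a <= start and end <= blk_end and left_ok and right_ok:
                shift = blk.b - blk.a  # how far the region moved in the new page
                reused.append((start + shift, end + shift, text))
                break
    return reused
```

Mentions that are not reused, and any text of the new page outside the matched regions, would still have to be passed to the extractor. A faster matcher trades missed overlap for lower matching cost, which is presumably what the plan search and cost model on this slide weigh.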

Evaluation (Data Set)

Experimental Results

Sources • ACM: http://portal.acm.org/citation.cfm?doid=1559845.1559881 • Overview of SIGMOD 2009: idke.ruc.edu.cn/seminars/2009/07.04/SIGMOD2009%20Overview.ppt
