This study delves into optimizing information extraction programs over evolving text data by proposing techniques to efficiently extract information from dynamic corpora. The authors present concepts, methods, challenges, and solutions involving extractors, scopes, contexts, matchers, and IE program execution. The research focuses on capturing and reusing IE results, selecting optimal extraction plans, and evaluating cost models. Experimental results are discussed and relevant sources provided.
Optimizing Complex Extraction Programs over Evolving Text Data • Authors: • Fei Chen, University of Wisconsin-Madison, Madison, WI, USA • Byron J. Gao, Texas State University-San Marcos, San Marcos, TX, USA • AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA • Jun Yang, Duke University, Durham, NC, USA • Raghu Ramakrishnan, Yahoo! Research, Santa Clara, CA, USA • Presented by: Yogendra Godbole
Introduction • Motivation • Traditional IE methods: static corpus • Practical conditions: dynamic corpus • DBLife (10,000+ URLs, 120+ MB corpus snapshot) • Enterprise intranet • Problem • How to efficiently extract information from dynamic corpora
Problem Definition • Concepts • Data pages, Extractors, Mentions • An extractor E: p → R(a_1, a_2, …, a_n) extracts mentions of relation R from page p. A mention of R is a tuple (m_1, m_2, …, m_n) such that m_i is either a mention of attribute a_i or nil. • Examples • Assumptions • Extractors extract mentions from each data page in isolation
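The extractor definition above can be pictured with a small Python sketch. Everything specific here is an assumption made for illustration: the relation (a person name and an optional organization), the regular expression, and the names `Span`, `Mention`, and `extract_affiliation` do not come from the paper. An attribute mention is modeled as a character span, and nil is modeled as `None`.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """One attribute mention: the character span [start, end) in a page."""
    start: int
    end: int
    text: str


# A mention of R(a_1, ..., a_n) is a tuple whose i-th entry is a Span or None (nil).
Mention = Tuple[Optional[Span], ...]

# Hypothetical pattern for a toy relation affiliation(name, org).
_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) is a professor"
    r"(?: at (?P<org>[A-Z][A-Za-z ]+))?"
)


def extract_affiliation(page: str) -> List[Mention]:
    """Toy extractor E: page -> mentions of affiliation(name, org)."""
    mentions: List[Mention] = []
    for m in _PATTERN.finditer(page):
        name = Span(m.start("name"), m.end("name"), m.group("name"))
        org = None  # nil when the page does not state an organization
        if m.group("org") is not None:
            org = Span(m.start("org"), m.end("org"), m.group("org"))
        mentions.append((name, org))
    return mentions


page = "Jane Doe is a professor at Duke University. John Roe is a professor."
mentions = extract_affiliation(page)
```

Because the extractor reads only the page it is given, its output for one page never depends on other pages, which matches the single-page assumption on the slide.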
Methods • Concepts • Extractor scope • Let s.start and s.end be the start and end character positions of a string s in a page p. We say an extractor E has scope α iff for any mention m = (m_1, …, m_n) produced by E, (max_i m_i.end − min_i m_i.start) < α, where m_i.start and m_i.end are the start and end character positions of attribute mention m_i in page p. • Extractor context • The β-context of mention m in page p is the string p[(m.start − β)..(m.end + β)], i.e., the string of m extended on both sides by β characters. We say extractor E has context β iff for any m and any p′ obtained by perturbing the text of p outside the β-context of m, applying E to p′ still produces m as a mention. • Challenges • Matchers (find overlapping regions)
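The scope and context definitions above can be sketched in Python as follows. The function names are hypothetical, and two conventions are assumptions rather than part of the definitions: end positions are treated as exclusive (Python slice style), and the β-context is clipped at the page boundaries.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]                # (start, end) of one attribute mention
Mention = Tuple[Optional[Span], ...]  # None stands for nil


def mention_extent(m: Mention) -> Tuple[int, int]:
    """Return (min_i m_i.start, max_i m_i.end) over the non-nil attributes."""
    spans = [s for s in m if s is not None]
    return min(s[0] for s in spans), max(s[1] for s in spans)


def within_scope(m: Mention, alpha: int) -> bool:
    """True iff (max_i m_i.end - min_i m_i.start) < alpha."""
    start, end = mention_extent(m)
    return (end - start) < alpha


def beta_context(page: str, m: Mention, beta: int) -> Tuple[int, int, str]:
    """Return (lo, hi, text) for the beta-context of m, clipped to the page."""
    start, end = mention_extent(m)
    lo = max(0, start - beta)
    hi = min(len(page), end + beta)
    return lo, hi, page[lo:hi]


page = "xx Jane Doe works at Duke yy"
name_start = page.index("Jane Doe")
org_start = page.index("Duke")
m = ((name_start, name_start + len("Jane Doe")), (org_start, org_start + len("Duke")))
```

An extractor has scope α only if `within_scope(m, alpha)` holds for every mention it can produce, so in practice α and β are estimates supplied for each blackbox rather than values computed from a single page.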
Problem Definition (cont.) • Let P_1, …, P_n be consecutive snapshots of a text corpus, ρ be an IE program written in xlog, E_1, …, E_m be the IE blackboxes (i.e., IE predicates) in ρ, and (α_1, β_1), …, (α_m, β_m) be the estimated scopes and contexts for the blackboxes, respectively. Develop a solution to execute ρ over corpus snapshot P_(n+1) with minimal cost, by reusing extraction results over P_1, …, P_n.
Solutions • CAPTURING IE RESULTS • Level of Reuse • IE Results to Capture • Storing Captured IE Results • REUSING CAPTURED IE RESULTS • Scope of Mention Reuse • Overall Processing Algorithm • Identifying Reuse with Matchers • SELECTING A GOOD IE PLAN • Searching for Good Plans • Cost Model
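As a rough illustration of "Identifying Reuse with Matchers", the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in matcher. This is an assumption: the slides do not say which matchers the authors use, and the function name `reuse_mentions` and the sample pages are hypothetical. The idea shown is only that a mention captured on the old page can be copied to the new page when its whole β-context lies inside a region the matcher reports as unchanged.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Extent = Tuple[int, int]  # (start, end) of a whole mention, end exclusive


def reuse_mentions(old_page: str, new_page: str,
                   old_mentions: List[Extent], beta: int) -> List[Extent]:
    """Copy old mentions whose beta-context is unchanged, shifted to new positions."""
    matcher = SequenceMatcher(None, old_page, new_page, autojunk=False)
    # The final block returned by difflib is a zero-size sentinel; skip it.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    reused: List[Extent] = []
    for start, end in old_mentions:
        old_lo = max(0, start - beta)
        old_hi = min(len(old_page), end + beta)
        for b in blocks:
            shift = b.b - b.a
            new_lo = max(0, start + shift - beta)
            new_hi = min(len(new_page), end + shift + beta)
            # Conservative check: the context must sit inside the matched
            # region on both the old page and the new page.
            in_old = b.a <= old_lo and old_hi <= b.a + b.size
            in_new = b.b <= new_lo and new_hi <= b.b + b.size
            if in_old and in_new:
                reused.append((start + shift, end + shift))
                break
    return reused


old = "Intro text. Jane Doe is a professor at Duke. Footer 2008."
new = "Intro text. Jane Doe is a professor at Duke. Footer 2009."
person = (old.index("Jane"), old.index("Duke") + len("Duke"))
year = (old.index("2008"), old.index("2008") + 4)
copied = reuse_mentions(old, new, [person, year], beta=5)
```

Here the person mention is copied unchanged, while the year mention is not, because its context overlaps the edited text and would have to be re-extracted. Checking the context on both pages is a deliberately cautious choice in this sketch: text inserted just before a mention at the very start of the old page also blocks reuse.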
Sources : • ACM : http://portal.acm.org/citation.cfm?doid=1559845.1559881 • Overview of SIGMOD 2009 idke.ruc.edu.cn/seminars/2009/07.04/SIGMOD2009%20Overview.ppt