
Modeling Query-Based Access to Text Databases

Modeling Query-Based Access to Text Databases. Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano. Computer Science Department, Columbia University. Extracting Structured Information “Buried” in Text Documents.



Presentation Transcript


  1. Modeling Query-Based Access to Text Databases. Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano. Computer Science Department, Columbia University

  2. Extracting Structured Information “Buried” in Text Documents • Example text: “May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…” • Information Extraction System (e.g., NYU’s Proteus)

  3. Extracting All “Tuples” of a Relation from a Text Database [Figure: Information Extraction System, Extracted Tuples] • Naïve approach: feed every document to the information extraction system. At 7 secs./document, Proteus takes over 8 days for 100K documents • Only a tiny fraction of documents contains tuples → Processing every document is inefficient • Many databases are not crawlable (scannable), but are available only via a search engine. Search engines can help with both efficiency and accessibility
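The cost of the naïve approach can be checked with quick arithmetic. The sketch below uses only the slide's own figures (7 seconds per document, 100K documents); the variable names are illustrative.

```python
# Figures taken from the slide; names are illustrative.
SECONDS_PER_DOCUMENT = 7
NUM_DOCUMENTS = 100_000
SECONDS_PER_DAY = 24 * 60 * 60

# Total sequential processing time, expressed in days.
total_days = SECONDS_PER_DOCUMENT * NUM_DOCUMENTS / SECONDS_PER_DAY
print(f"{total_days:.1f} days")  # prints "8.1 days"
```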

  4. A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
     0. Start with some seed tuples (e.g., <“May 1995”, “Ebola”, “Zaire”>)
     1. While seed has an unprocessed tuple t:
     2.   Retrieve up to MaxResults documents using a query derived from t
     3.   Extract new tuples te from these documents
     4.   Augment seed with te
     [Figure: seed, t0, t1, t2]
     Potential problem: May run out of tuples (and queries) → incomplete relation!
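A minimal sketch of the loop on this slide, assuming an in-memory stand-in for both the search engine and the extraction system. `TOY_CORPUS`, `toy_search`, `toy_extract`, and all tuples other than the slide's seed are made up for illustration; the real system would query a search engine and run an extractor such as Proteus.

```python
from collections import deque

# Toy corpus (made-up): document id -> tuples an extractor would find in it.
TOY_CORPUS = {
    "d1": [("May 1995", "Ebola", "Zaire")],
    "d2": [("May 1995", "Ebola", "Zaire"), ("dateA", "diseaseA", "Zaire")],
    "d3": [("dateA", "diseaseA", "Zaire"), ("dateB", "diseaseA", "placeB")],
    "d4": [("dateC", "diseaseC", "placeC")],  # shares no attribute with the rest
}

def toy_search(t, max_results):
    """Stand-in for a search engine: documents mentioning any attribute of t."""
    hits = [doc_id for doc_id, tuples in sorted(TOY_CORPUS.items())
            if any(set(t) & set(other) for other in tuples)]
    return hits[:max_results]

def toy_extract(doc_id):
    """Stand-in for the information extraction system."""
    return TOY_CORPUS[doc_id]

def query_based_extraction(seed, search, extract, max_results):
    relation = set(seed)            # step 0: start with the seed tuples
    unprocessed = deque(seed)
    while unprocessed:              # step 1: while seed has an unprocessed tuple t
        t = unprocessed.popleft()
        for doc_id in search(t, max_results):      # step 2: retrieve documents
            for new_tuple in extract(doc_id):      # step 3: extract tuples
                if new_tuple not in relation:      # step 4: augment seed
                    relation.add(new_tuple)
                    unprocessed.append(new_tuple)
    return relation

seed = [("May 1995", "Ebola", "Zaire")]
relation = query_based_extraction(seed, toy_search, toy_extract, max_results=10)
print(len(relation))
```

In this toy corpus the loop stops after finding three tuples: no query derived from them retrieves `d4`, so its tuple is never extracted, which is the "incomplete relation" problem the slide warns about.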

  5. Iterative Methods Sometimes (but not Always) “Succeed” [Figure: two runs starting from seed, one labeled SUCCESS!, one labeled FAIL] → Can we predict if a query-based strategy will succeed?

  6. Model: Querying Graph • Tokens: Tuple attributes, e.g., <“May 1995”, “Ebola”, “Zaire”> • Each token (as query) retrieves documents • Documents contain tokens [Figure: graph linking Tokens t1 to t5 and Documents d1 to d5]
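One way to hold such a querying graph in memory is as two adjacency maps, one per edge direction. This is a sketch under assumptions: the class name `QueryingGraph`, its methods, and the edges added below are hypothetical and are not read off the slide's figure.

```python
from collections import defaultdict

class QueryingGraph:
    """Tokens retrieve documents; documents contain tokens."""

    def __init__(self):
        self.retrieves = defaultdict(set)  # token -> documents it retrieves as a query
        self.contains = defaultdict(set)   # document -> tokens it contains

    def add_retrieval(self, token, doc):
        self.retrieves[token].add(doc)

    def add_containment(self, doc, token):
        self.contains[doc].add(token)

# Made-up edges for illustration.
graph = QueryingGraph()
graph.add_retrieval("t1", "d1")
graph.add_containment("d1", "t1")
graph.add_containment("d1", "t2")
graph.add_retrieval("t2", "d2")
graph.add_containment("d2", "t3")

# Tokens found in the documents that the query "t1" retrieves.
tokens_seen_via_t1 = {tok for doc in graph.retrieves["t1"] for tok in graph.contains[doc]}
print(sorted(tokens_seen_via_t1))
```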

  7. Model: Reachability Graph • t1 retrieves document d1 that contains t2 • t2, t3, and t4 are “reachable” from t1 [Figure: querying graph over Tokens t1 to t5 and Documents d1 to d5, next to the reachability graph over tokens derived from it]
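The step from querying graph to reachability graph can be sketched as follows: add an edge from token a to token b whenever a retrieves a document that contains b. The function names and the toy edges are assumptions; the edges were chosen only so that t2, t3, and t4 come out reachable from t1, as the slide states, and are not the figure's actual edges.

```python
from collections import deque

def build_reachability_graph(retrieves, contains):
    """Edge a -> b if token a retrieves some document that contains token b."""
    edges = {token: set() for token in retrieves}
    for token, docs in retrieves.items():
        for doc in docs:
            for other in contains.get(doc, ()):
                if other != token:
                    edges[token].add(other)
                    edges.setdefault(other, set())
    return edges

def reachable_from(edges, start):
    """Tokens reachable from start by following one or more edges."""
    seen = set()
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Made-up querying graph.
retrieves = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t5": {"d5"}}
contains = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3", "t4"}, "d5": {"t5"}}

edges = build_reachability_graph(retrieves, contains)
print(sorted(reachable_from(edges, "t1")))
```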

  8. Model: Connected Components • Core: strongly connected • Out: tokens not in Core, but reachable from Core • In: tokens not in Core, but from which Core is reachable [Figure: tokens t1 to t4]
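Given a token known to lie in a Core, the three sets can be computed from forward and backward reachability: Core is the intersection, Out is what is only forward-reachable, and In is what is only backward-reachable. This is a sketch with hypothetical function names and made-up edges, not the authors' implementation.

```python
def closure(edges, start):
    """All nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse_graph(edges):
    """Same nodes, every edge flipped."""
    rev = {node: set() for node in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev

def core_in_out(edges, token):
    """Core, In, and Out relative to the strongly connected component of token."""
    forward = closure(edges, token)
    backward = closure(reverse_graph(edges), token)
    core = forward & backward
    return core, backward - core, forward - core

# Made-up reachability graph: t2 and t3 reach each other.
edges = {"t1": {"t2"}, "t2": {"t3"}, "t3": {"t2", "t4"}, "t4": set()}
core, in_set, out_set = core_in_out(edges, "t2")
print(sorted(core), sorted(in_set), sorted(out_set))
```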

  9. Components of Reachability Graph [Figure: reachability graph with several components, each split into In, Core (strongly connected), and Out; seed token t0] How many tokens are in the largest Core + Out?

  10. Model: Power-law Graphs • Conjecture: Degree distribution in the reachability graph follows a power law: #(nodes with degree k) ≈ O(k^(-β)) (i.e., many nodes with small degree, a few nodes with large degree) • Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.

  11. Model: Reachability • Reachability: Fraction of tokens in the largest Core + Out (the power law allows us to ignore small components) [Figure: In, Core (strongly connected), and Out, with seed token t0]
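On a small, fully known graph the metric can be computed directly. The brute-force sketch below finds the largest strongly connected set (the Core) by intersecting forward and backward reachability for every node, then divides the size of everything reachable from that Core (Core + Out) by the number of tokens. The function names, the tie-breaking rule, and the toy edges are assumptions; the talk estimates this quantity by sampling rather than by exact computation.

```python
def closure(edges, start):
    """All nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reachability(edges):
    """Fraction of tokens in the largest Core + Out (quadratic brute force)."""
    nodes = set(edges) | {dst for dsts in edges.values() for dst in dsts}
    if not nodes:
        return 0.0
    reverse = {node: set() for node in nodes}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].add(src)
    best = (0, 0)  # (core size, core + out size); ties broken by the second entry
    for node in sorted(nodes):
        forward = closure(edges, node)               # Core + Out if node is in a Core
        core = forward & closure(reverse, node)      # strongly connected set of node
        best = max(best, (len(core), len(forward)))
    return best[1] / len(nodes)

# Made-up graph: Core = {t2, t3}, In = {t1}, Out = {t4}, and t5 is isolated.
edges = {"t1": {"t2"}, "t2": {"t3"}, "t3": {"t2", "t4"}, "t4": set(), "t5": set()}
score = reachability(edges)
print(score)  # prints 0.6
```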

  12. Estimating Reachability • In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1 • Graph theory results predict the relative size of C_G [Chung and Lu, Annals of Combinatorics, 2002] • Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph

  13. Estimating Reachability Using Sampling (estimate average outdegree) • Choose S random seed tokens • Query the database for the seed tokens • Extract tokens to compute the reachability graph edges for the seed tokens • Estimate d as the average outdegree of the seed tokens • Estimate reachability [Figure: sampled tokens and the documents they retrieve; example with d = 1.5]
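The sampling steps above can be sketched as follows. `neighbors_of` is a hypothetical callback standing in for "query the database for a token and extract the tokens in the retrieved documents"; the toy edges are made up so that the sample mean is 1.5 and are not read off the slide's figure. The final step, mapping d to a reachability value, relies on the Chung and Lu result and is not reproduced here; the sketch only checks the d > 1 condition for a giant component.

```python
import random

def estimate_average_outdegree(tokens, neighbors_of, sample_size, rng):
    """Average outdegree over a random sample of seed tokens."""
    tokens = sorted(tokens)
    if not tokens:
        raise ValueError("need at least one token to sample")
    sample = rng.sample(tokens, min(sample_size, len(tokens)))
    return sum(len(neighbors_of(token)) for token in sample) / len(sample)

# Made-up reachability edges, used in place of real queries and extraction.
TOY_EDGES = {"t1": {"t2", "t3"}, "t2": {"t4"}}

def toy_neighbors(token):
    return TOY_EDGES.get(token, set())

d = estimate_average_outdegree(["t1", "t2"], toy_neighbors, sample_size=2,
                               rng=random.Random(0))
giant_component_expected = d > 1
print(d, giant_component_expected)  # prints 1.5 True
```

Passing the random generator in explicitly keeps the sample reproducible, which helps when comparing estimates across different values of S.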

  14. Experimental Results: Verifying the “Power-law” Conjecture • Task 1: NYT DiseaseOutbreaks(Date, Disease, Location) • New York Times, 1995 • |T| = 8,859, |D| = 137,000 • Follows the power-law distribution

  15. Experimental Results: Estimating Reachability by Sampling • Approximate reachability is estimated with S = 50 tokens • The reachability correctly predicts performance of the query-based information extraction strategy • If the estimated reachability is too low, can switch to a different strategy early

  16. Future Work • What if we have only limited access to the database? • Limit on number of queries • Limit on number of documents retrieved • Not modelled by reachability graph, but can be modelled using properties of querying graph [Figure: graph linking Tokens t1 to t5 and Documents d1 to d5]

  17. Summary • Presented graph model for query-based algorithms: • for Information Extraction • for Constructing Database Content Summaries • Showed that querying and reachability graphs can be used to analyze such algorithms • Presented single reachability metric to predict success of iterative query-based algorithms • Presented and verified conjecture that reachability graphs for these algorithms follow the power law • Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs
