These slides present Flint, a framework for extracting and integrating data from imprecise web sources. Both steps are designed to be unsupervised, automatic, and scalable, and to work with uncertain data without predefined labels, prior knowledge, or a reference corpus. The approach is evaluated on three data-intensive domains (soccer, videogames, and finance), with 100 websites per domain.
Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre (Creative Commons License, see last slide)
Data-intensive websites target Website Template1 Database Template2 Template3
Flint goal Last Min Max StockQuote … Volume 52high Open
System architecture Flint Web Search [WIDM08] Data Extraction Data Integration The Web
Novel contribution Data Extraction Data Integration • Unsupervised • Automatic • Scalable • No knowledge available • Unsupervised • Automatic • Scalable • Uncertain Data • No labels available • No corpus available WebTables [Vldb08] Cimple [Vldb07] MetaQuerier [Cidr05] PayGo [Cidr07] RoadRunner [Vldb01] ExAlg [Sigmod03] TurboWrapper [Vldb07]
Data Extraction AAPL, GOOG, MSFT, INTC, … 128.09, 439.54, 34.89, 112.37, … 127.81, 439.25, 32.13, 111.01, … 132.43, 443.82, 33.67, 114.32, … 0.50%, -0.38%, 1.23%, 3.92%, -1.65%, … Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, … …
Data Extraction HTML fragments taken from two pages belonging to the same website (extraction rule, value on page 1, value on page 2):
/html/body/table/tr[1]/td[2] 1,132,228 , 1,735,857
/html/body/table/tr[2]/td[2] $20.66 , $414.58
/html/body/table/tr[3]/td[2] $11.70 , $247.30
/html/body/table/tr[4]/td[2] $20.72 , $414.06
/html/body/table/tr[5]/td[2] $0.02 , 99,494,200
/html/body/table/tr[6]/td[2] 4,732,600 , null
Extraction error! ?
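The failure on this slide can be reproduced with a minimal sketch. It assumes that the second page simply lacks one optional row, which is one plausible reading of the values shown; the row labels (`Volume`, `Change`, and so on) and the helper names `build_page` and `extract_by_position` are hypothetical and do not come from the slides.

```python
import xml.etree.ElementTree as ET

def build_page(rows):
    """Build a small well-formed page with one <tr> per (label, value) pair."""
    body = "".join(f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows)
    return ET.fromstring(f"<html><body><table>{body}</table></body></html>")

def extract_by_position(page, n_rows):
    """Apply /html/body/table/tr[i]/td[2] for i = 1..n_rows (page is the <html> root)."""
    values = []
    for i in range(1, n_rows + 1):
        cell = page.find(f"./body/table/tr[{i}]/td[2]")
        values.append(cell.text if cell is not None else None)
    return values

# Labels are hypothetical; the values are the ones shown on the slide.
page_a = build_page([("Volume", "1,132,228"), ("Last", "$20.66"), ("Low", "$11.70"),
                     ("Open", "$20.72"), ("Change", "$0.02"), ("Avg Vol", "4,732,600")])
# Assumption: the second page omits the optional "Change" row,
# so every later row shifts up by one position.
page_b = build_page([("Volume", "1,735,857"), ("Last", "$414.58"), ("Low", "$247.30"),
                     ("Open", "$414.06"), ("Avg Vol", "99,494,200")])

values_a = extract_by_position(page_a, 6)
values_b = extract_by_position(page_b, 6)
for i, pair in enumerate(zip(values_a, values_b), start=1):
    print(f"tr[{i}]/td[2]:", pair)
```

Under this assumption, rule `tr[5]` pairs a price change with a volume figure and rule `tr[6]` finds nothing on the second page, which is the misalignment the slide flags.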
Data Integration 10 33 16 4 25 10 AA GO MS (min) (max) (stock)
Data Integration t=0.5 t=0.5 t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock)
Data Integration t=0.5 t=0.5 t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock) 10 33 16 4 25 10 AA GO MS 1.0 1.0 1.0 (min) (max) (stock)
Data Integration t=0.5 t=0.5 t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock) 10 33 16 4 25 10 AA GO MS (min) (max) (stock)
Data Integration t=0.5 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 AA GO MS AA GO MS (max) (max) (min) (min) (stock) (stock) 4 25 10 AA GO MS 6 26 12 0.6 1.0 (min) (stock) (price) 1.0
Data Integration t=0.5 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS ? (max) (max) (min) (min) (price) (stock) (stock) 4 25 10 AA GO MS 1.0 (min) (stock) 1.0
Data Integration t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS (max) (max) (min) (min) (price) (stock) (stock) 4 25 10 AA GO MS 1.0 (min) (stock)
Data Integration t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS (max) (max) (min) (min) (price) (stock) (stock) 4 25 10 AA GO MS 1.0 (min) (stock)
Data Integration t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS (max) (max) (min) (min) (price) (stock) (stock) (stock) 4 25 10 (min)
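The role of the threshold t in these slides can be illustrated with a small sketch. The slides do not define how the similarity scores (1.0, 0.6, ...) are computed, so the `similarity` function below is a stand-in chosen only for illustration, and its scores will not reproduce the numbers on the slides. The function names and the dictionary layout are likewise assumptions.

```python
def similarity(a, b):
    """Hypothetical score in [0, 1] for non-negative values aligned on the
    same instances: 1 minus the mean relative difference."""
    diffs = [abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(a, b)]
    return 1.0 - sum(diffs) / len(diffs)

def matches(candidate, mappings, t):
    """Return the names of the mappings whose similarity to candidate is >= t."""
    return [name for name, values in mappings.items()
            if similarity(candidate, values) >= t]

# Values for the instances AA, GO, MS, as shown on the slides.
mappings = {"min": [4, 25, 10], "max": [10, 33, 16]}
candidate = [4, 25, 10]  # an unlabeled attribute from another source

print(matches(candidate, mappings, t=0.5))  # permissive threshold
print(matches(candidate, mappings, t=0.7))  # stricter threshold
```

With this stand-in measure, the permissive threshold lets the candidate match both mappings, while the stricter one keeps only the exact match. That is the kind of ambiguity that raising t from 0.5 to 0.7 is meant to remove.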
Wrapper Refinement t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS ? ? (max) (max) (min) (min) (min) (price) (stock) (stock) (stock) 0.0 0.0 0.3 (weak) 0.3 (weak) 10 null 10 (min/max)
Wrapper Refinement //td[contains(text(),'Open')]/../td[2] //td[contains(text(),'Open')]/../../tr[5]/td[1] //td[contains(text(),'Open')]/../../tr[5]/td[2] //td[contains(text(),'High')]/../td[2] … matching value nearby template tokens
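A rule anchored on a nearby template token, such as `//td[contains(text(),'Open')]/../td[2]`, keeps working when rows shift. The sketch below emulates that one rule in plain Python, because the standard-library `xml.etree.ElementTree` does not support `contains()`. The helper names and the page contents are hypothetical, and unlike the XPath expression the function returns only the first match.

```python
import xml.etree.ElementTree as ET

def build_page(rows):
    """Build a small well-formed page with one <tr> per (label, value) pair."""
    body = "".join(f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows)
    return ET.fromstring(f"<html><body><table>{body}</table></body></html>")

def extract_by_label(page, label):
    """Rough equivalent of //td[contains(text(),'<label>')]/../td[2]:
    find a row with a cell whose text contains the label and return
    the text of that row's second cell (first match only)."""
    for row in page.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(label in (c.text or "") for c in cells):
            return cells[1].text
    return None

# Hypothetical pages: "Open" sits in row 4 on one page and row 3 on the other.
page_a = build_page([("Volume", "1,132,228"), ("Last", "$20.66"),
                     ("Low", "$11.70"), ("Open", "$20.72")])
page_b = build_page([("Volume", "1,735,857"), ("Low", "$247.30"),
                     ("Open", "$414.06")])

print(extract_by_label(page_a, "Open"))
print(extract_by_label(page_b, "Open"))
```

Because the rule locates the value relative to the label rather than by absolute row position, both pages yield the value next to "Open" even though it sits in a different row on each.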
Wrapper Refinement t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS (max) (max) (min) (min) (min) (price) (stock) (stock) (stock) 1.0 1.0 //td[contains(text(),'Max')]/../td[2] 10 33 16 4 25 10 //td[contains(text(),'Min')]/../td[2] (max) (min) 10 null 10 (min/max)
Wrapper Refinement t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 10 33 16 4 25 10 4 25 10 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS (max) (max) (max) (min) (min) (min) (min) (price) (stock) (stock) (stock) 10 null 10 (min/max)
Experimental Results (100 websites for each domain)
• Soccer domain (45,714 pages), attribute and |m|: Name 90, Birth Date 61, Height 54, Nationality 48, Club 43, Position 43, Weight 34, League 14
• Videogame domain (49,262 pages), attribute and |m|: Title 86, Publisher 59, Developer 45, Genre 28, ESRB rating 40, Release Date 9, Platform 9, # Players 6
• Finance domain (57,623 pages), attribute and |m|: Stock Symbol 84, Price Change 73, % Change 73, Volume 52, Day Low 43, Day High 41, Last Price 29, Open Price 24
Demo • Found Websites • Integrated Data
the end! http://flint.dia.uniroma3.it
License • This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.