Data Extraction and Integration from Imprecise Web Sources

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti. Università degli Studi Roma Tre (Creative Commons License, see last slide).


Presentation Transcript


  1. Data Extraction and Integration from Imprecise Web Sources. Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti. Università degli Studi Roma Tre (Creative Commons License, see last slide)

  2. Data-intensive websites

  3. Data-intensive websites target Website Template1 Database Template2 Template3

  4. Flint goal Last Min Max StockQuote … Volume 52high Open

  5. System architecture Flint Web Search [WIDM08] Data Extraction Data Integration The Web

  6. Novel contribution Data Extraction Data Integration • Unsupervised • Automatic • Scalable • No knowledge available • Unsupervised • Automatic • Scalable • Uncertain Data • No labels available • No corpus available WebTables [Vldb08] Cimple [Vldb07] MetaQuerier [Cidr05] PayGo [Cidr07] RoadRunner [Vldb01] ExAlg [Sigmod03] TurboWrapper [Vldb07]

  7. Data Extraction

  8. Data Extraction

  9. Data Extraction AAPL, GOOG, MSFT, INTC, … 128.09, 439.54, 34.89, 112.37, … 127.81, 439.25, 32.13, 111.01, … 132.43, 443.82, 33.67, 114.32, … 0.50%, -0.38%, 1.23%, 3.92%, -1.65%, … Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, … …

  10. Data Extraction. HTML fragments taken from two pages belonging to the same website. Extraction error! Each positional rule is followed by the values it extracts from the two pages:
      /html/body/table/tr[1]/td[2]: 1,132,228 , 1,735,857
      /html/body/table/tr[2]/td[2]: $20.66 , $414.58
      /html/body/table/tr[3]/td[2]: $11.70 , $247.30
      /html/body/table/tr[4]/td[2]: $20.72 , $414.06
      /html/body/table/tr[5]/td[2]: $0.02 , 99,494,200
      /html/body/table/tr[6]/td[2]: 4,732,600 , null
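
The failure on slide 10 can be reproduced with a small sketch. It assumes two hypothetical pages from the same site, where the second page simply lacks one row; the row labels (`f1` to `f6`), the `make_page` and `extract` helpers, and the choice of which row is missing are invented for illustration, since the slide shows only the extracted values. Python's `xml.etree.ElementTree` stands in for a full XPath engine, so the rules are written relative to the `<html>` root.

```python
import xml.etree.ElementTree as ET

def make_page(rows):
    """Build a tiny HTML-like page with one two-cell table row per (label, value)."""
    body = "".join(f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows)
    return ET.fromstring(f"<html><body><table>{body}</table></body></html>")

def extract(page, row):
    """Apply the positional rule /html/body/table/tr[row]/td[2]; None if nothing matches."""
    cell = page.find(f"./body/table/tr[{row}]/td[2]")
    return None if cell is None else cell.text

# Labels f1..f6 are hypothetical; the values are the ones shown on the slide.
page_a = make_page([("f1", "1,132,228"), ("f2", "$20.66"), ("f3", "$11.70"),
                    ("f4", "$20.72"), ("f5", "$0.02"), ("f6", "4,732,600")])
# Assumption: page B has no f5 row, so the f6 value shifts up into row 5.
page_b = make_page([("f1", "1,735,857"), ("f2", "$414.58"), ("f3", "$247.30"),
                    ("f4", "$414.06"), ("f6", "99,494,200")])

for row in range(1, 7):
    print(f"tr[{row}]/td[2]:", extract(page_a, row), ",", extract(page_b, row))
```

Under these assumptions, rows 1 to 4 line up across the two pages, row 5 pairs a value from one field with a value from another, and row 6 finds nothing on the second page.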

  11. Data Integration 10 33 16 4 25 10 AA GO MS (min) (max) (stock)

  12. Data Integration t=0.5 t=0.5 t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock)

  13. Data Integration t=0.5 t=0.5 t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock) 10 33 16 4 25 10 AA GO MS 1.0 1.0 1.0 (min) (max) (stock)

  14. Data Integration t=0.5 t=0.5 t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock) 10 33 16 4 25 10 AA GO MS (min) (max) (stock)

  15. Data Integration t=0.5 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 AA GO MS AA GO MS (max) (max) (min) (min) (stock) (stock) 4 25 10 AA GO MS 6 26 12 0.6 1.0 (min) (stock) (price) 1.0

  16. Data Integration t=0.5 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS ? (max) (max) (min) (min) (price) (stock) (stock) 4 25 10 AA GO MS 1.0 (min) (stock) 1.0

  17. Data Integration t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS (max) (max) (min) (min) (price) (stock) (stock) 4 25 10 AA GO MS 1.0 (min) (stock)

  18. Data Integration t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS (max) (max) (min) (min) (price) (stock) (stock) 4 25 10 AA GO MS 1.0 (min) (stock)
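
One way to read slides 15 to 18: a candidate attribute joins an existing mapping only when its similarity score reaches that mapping's threshold t, so raising t from 0.5 to 0.7 is enough to keep out a candidate that scored 0.6. The sketch below shows only that thresholding step. It takes the scores as given, because the slides do not define the similarity measure, and the function name, the candidate labels, and the pairing of the 0.6 score with the (price) column are assumptions.

```python
def accepted(scores, t):
    """Return the candidates whose similarity score reaches threshold t, sorted by name."""
    return sorted(name for name, score in scores.items() if score >= t)

# Scores as they appear on slide 15; which column each one belongs to is an assumption.
scores = {"min (new source)": 1.0, "price (new source)": 0.6}

print(accepted(scores, 0.5))  # both candidates reach the threshold
print(accepted(scores, 0.7))  # only the candidate that scored 1.0 remains
```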

  19. Data Integration t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS (max) (max) (min) (min) (price) (stock) (stock) (stock) 4 25 10 (min)

  20. Wrapper Refinement t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS ? ? (max) (max) (min) (min) (min) (price) (stock) (stock) (stock) 0.0 0.0 0.3 (weak) 0.3 (weak) 10 null 10 (min/max)

  21. Wrapper Refinement: matching value nearby template tokens
      //td[contains(text(),'Open')]/../td[2]
      //td[contains(text(),'Open')]/../../tr[5]/td[1]
      //td[contains(text(),'Open')]/../../tr[5]/td[2]
      //td[contains(text(),'High')]/../td[2]
      …
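
The rules on slide 21 locate a value relative to a nearby template token (a label such as Open) instead of by absolute position. A minimal sketch of that idea follows. `xml.etree.ElementTree` does not support the XPath `contains()` function, so the first rule is emulated in plain Python; the helper `extract_by_label`, the two pages, and all cell values are hypothetical.

```python
import xml.etree.ElementTree as ET

def extract_by_label(page, label):
    """Emulate //td[contains(text(), label)]/../td[2]: return the text of the second
    cell of the first row that has a cell whose text contains the label."""
    for row in page.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(label in (cell.text or "") for cell in cells):
            return cells[1].text
    return None

# Hypothetical pages: page B has an extra leading row, which would break a positional rule.
page_a = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Open</td><td>20.10</td></tr>"
    "<tr><td>High</td><td>20.72</td></tr>"
    "</table></body></html>")
page_b = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Notice</td><td>delayed</td></tr>"
    "<tr><td>Open</td><td>414.00</td></tr>"
    "<tr><td>High</td><td>414.58</td></tr>"
    "</table></body></html>")

print(extract_by_label(page_a, "Open"), extract_by_label(page_b, "Open"))
print(extract_by_label(page_a, "High"), extract_by_label(page_b, "High"))
```

Because the rule is anchored on the label text, the extra row in the second page does not shift the extracted value.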

  22. Wrapper Refinement t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 4 25 10 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS (max) (max) (min) (min) (min) (price) (stock) (stock) (stock) 1.0 1.0 //td[contains(text(),'Max')]/../td[2] 10 33 16 4 25 10 //td[contains(text(),'Min')]/../td[2] (max) (min) 10 null 10 (min/max)

  23. Wrapper Refinement t=0.7 t=0.7 t=0.5 t=0.5 10 33 16 10 33 16 10 33 16 4 25 10 4 25 10 4 25 10 4 25 10 6 26 12 AA GO MS AA GO MS AA GO MS (max) (max) (max) (min) (min) (min) (min) (price) (stock) (stock) (stock) 10 null 10 (min/max)

  24. Experimental Results (100 websites for each domain). Each entry gives Attribute and |m|.
      • Soccer domain (45,714 pages): Name 90; Birth Date 61; Height 54; Nationality 48; Club 43; Position 43; Weight 34; League 14
      • Videogame domain (49,262 pages): Title 86; Publisher 59; Developer 45; Genre 28; ESRB rating 40; Release Date 9; Platform 9; # Players 6
      • Finance domain (57,623 pages): Stock Symbol 84; Price Change 73; % Change 73; Volume 52; Day Low 43; Day High 41; Last Price 29; Open Price 24

  25. Demo • Found Websites • Integrated Data

  26. the end! http://flint.dia.uniroma3.it

  27. License • This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
