
Tomasz Kaczmarek The Poznan University of Economics Poland

myPortal: Robust Extraction and Aggregation of Web Content
Marek Kowalkiewicz, Tomasz Kaczmarek, Witold Abramowicz


Presentation Transcript


  1. myPortal: Robust Extraction and Aggregation of Web Content
     Marek Kowalkiewicz, Tomasz Kaczmarek, Witold Abramowicz
     Tomasz Kaczmarek, The Poznan University of Economics, Poland

  2. Background
     • Personalized access to information
     • Dynamic content on web pages
     • Various techniques for content extraction:
       • Based on unique ID
       • Using contextual information
       • Document tree analysis

  3. myPortal vision
     • Ability to extract content blocks from HTML pages
     • Easy aggregation
     • Client-side technology – no server-side investment necessary
     • Emphasis on:
       • Robustness
       • Ease of use

  4. My portal

  5. Extraction technique
     • Extraction based on HTML DOM tree
     • Absolute XPath
     • Relative XPath
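
The difference between the two query styles on this slide can be illustrated with a minimal sketch. It is not the myPortal implementation: the page snapshots, labels, and paths below are made up, and Python's standard `xml.etree.ElementTree` supports only a subset of XPath, so the "relative" query is approximated as a path anchored on a reference element (a `th` with known text) rather than written with full XPath axes.

```python
import xml.etree.ElementTree as ET

# Hypothetical page snapshot (well-formed XHTML so the stdlib parser accepts it).
PAGE_V1 = """<html><body>
<div><p>Banner</p></div>
<div><table>
<tr><th>Weather</th><td>Sunny, 21 C</td></tr>
<tr><th>Stocks</th><td>Index up</td></tr>
</table></div>
</body></html>"""

# Same content after an assumed redesign: a new block appears above the table.
PAGE_V2 = PAGE_V1.replace(
    "<div><p>Banner</p></div>",
    "<div><p>Banner</p></div><div><p>Ad</p></div>",
)

# Absolute query: a fixed route from the document root, by position.
ABSOLUTE = "./body/div[2]/table/tr[1]/td"
# Anchored ("relative") query: find the row whose header is the reference
# element "Weather", then take the cell next to it.
RELATIVE = ".//tr[th='Weather']/td"


def extract(page, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "absolute:", extract(page, ABSOLUTE), "| relative:", extract(page, RELATIVE))
```

In this toy case the positional path stops matching once a block is inserted above the table, while the anchored path still finds the cell because it depends only on the reference element and its immediate surroundings.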

  6. Visual query specification
     • Reference element
     • Extracted content

  7. Aggregation of content

  8. Functionality
     Done:
     • Extract content from any HTML page
     • Record POST and GET parameters and cookies
     • Access search results (via GET or POST) from search engines – subscription-like service
     Work in progress:
     • Deal with multi-stage login or query mechanisms – like obtaining bank account info
     • Deal with information from multiple DOM tree branches in a single query
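
Recording GET or POST parameters and cookies so that a query can be replayed later might look roughly like the sketch below. The `RecordedQuery` class and its field names are hypothetical and do not come from the presentation; the sketch only builds a standard-library `urllib.request.Request` and does not send it.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class RecordedQuery:
    """Hypothetical record of what is needed to re-issue a page request."""
    url: str
    method: str = "GET"
    params: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)

    def to_request(self) -> Request:
        encoded = urlencode(self.params)
        if self.method.upper() == "POST":
            # POST: parameters travel in the request body.
            req = Request(self.url, data=encoded.encode("ascii"), method="POST")
        else:
            # GET: parameters are appended to the query string.
            sep = "&" if "?" in self.url else "?"
            full_url = self.url + (sep + encoded if encoded else "")
            req = Request(full_url, method="GET")
        if self.cookies:
            cookie_header = "; ".join(f"{k}={v}" for k, v in self.cookies.items())
            req.add_header("Cookie", cookie_header)
        return req


# Example with made-up values: a recorded search-engine query.
query = RecordedQuery("http://example.com/search", "GET", {"q": "xpath"}, {"sid": "abc"})
request = query.to_request()
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would replay the query; a multi-stage login would need several such records chained together, which this sketch does not attempt.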

  9. Other (technical) problems
     • HTML code quality – HTML Tidy
     • WYSIWYG for aggregation
     • Robustness
       • Multiple occurrences of reference element
       • Document structure changes between reference and extracted elements
       • Deletion / change in the reference element
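
Two of the robustness problems listed above, a reference element that occurs more than once and one that has been deleted or changed, can be detected before extraction is attempted. The following is a sketch under assumptions: the helper name, the choice of `h2` as the reference tag, and the sample pages are all illustrative, and the input is assumed to be well-formed (for example, after cleaning with HTML Tidy).

```python
import xml.etree.ElementTree as ET


def reference_status(page, tag, label):
    """Classify how many elements with the given tag carry exactly the label text."""
    root = ET.fromstring(page)
    matches = [el for el in root.iter(tag) if (el.text or "").strip() == label]
    if not matches:
        return "missing"    # reference element deleted or changed
    if len(matches) > 1:
        return "ambiguous"  # multiple occurrences of the reference element
    return "unique"


# Made-up pages for illustration.
ONE = "<html><body><div><h2>Weather</h2><p>Sunny</p></div></body></html>"
TWO = ("<html><body><div><h2>Weather</h2><p>Sunny</p></div>"
       "<div><h2>Weather</h2><p>Archive</p></div></body></html>")
NONE = "<html><body><div><h2>Forecast</h2><p>Sunny</p></div></body></html>"

for page in (ONE, TWO, NONE):
    print(reference_status(page, "h2", "Weather"))
```

A check like this cannot tell whether the structure between the reference and the extracted element has changed; that case only shows up when the extracted content itself is inspected.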

  10. Research on robustness • Purpose: to check whether relative XPath expressions are more robust than absolute XPath expressions

  11. Research method
      • Empirical tests on multiple portals
      • Manual query preparation for absolute and relative queries
      • Comparison of results in three categories:
        • Accurate extraction
        • Lack of result
        • Inaccurate extraction
      • Based on historical versions of portal sites obtained from Web Archive
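
The three-category comparison described on this slide can be sketched as a small test harness. Everything below is illustrative: the snapshots are invented stand-ins for historical page versions, the expected values are assigned by hand, and the resulting counts are not the study's results. As elsewhere, the "relative" query is approximated with the XPath subset available in Python's standard library.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def snapshot(blocks):
    """Build a made-up page from (label, text) pairs, one div per block."""
    body = "".join(f"<div><h2>{label}</h2><p>{text}</p></div>" for label, text in blocks)
    return f"<html><body>{body}</body></html>"


# Hypothetical historical versions, each paired with the content
# a human judge would expect the query to return.
SNAPSHOTS = [
    (snapshot([("News", "A"), ("Weather", "Sunny")]), "Sunny"),
    (snapshot([("Ad", "Buy"), ("News", "B"), ("Weather", "Rain")]), "Rain"),
    (snapshot([("Weather", "Snow")]), "Snow"),
]

ABSOLUTE = "./body/div[2]/p"           # positional path prepared on the first snapshot
RELATIVE = ".//div[h2='Weather']/p"    # path anchored on the reference element


def classify(page, path, expected):
    """Assign one of the three result categories used in the comparison."""
    node = ET.fromstring(page).find(path)
    if node is None:
        return "lack of result"
    text = (node.text or "").strip()
    return "accurate" if text == expected else "inaccurate"


def tally(path):
    """Count categories for one query over all snapshots."""
    return Counter(classify(page, path, expected) for page, expected in SNAPSHOTS)


print("absolute:", dict(tally(ABSOLUTE)))
print("relative:", dict(tally(RELATIVE)))
```

Separating "lack of result" from "inaccurate extraction" matters in practice: an empty result can be detected automatically, whereas a wrong block returned silently cannot.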

  12. Robustness comparison

  13. Average robustness

  14. Thank you! t.kaczmarek@kie.ae.poznan.pl
