Tomasz Kaczmarek, The Poznan University of Economics, Poland

myPortal: Robust Extraction and Aggregation of Web Content
Marek Kowalkiewicz, Tomasz Kaczmarek, Witold Abramowicz

Background
• Personalized access to information
• Dynamic content on web pages
• Various techniques for content extraction:
  • Based on unique ID
  • Using contextual information
  • Document tree analysis

myPortal vision
• Ability to extract content blocks from HTML pages
• Easy aggregation
• Client-side technology: no server-side investment necessary
• Stress on:
  • Robustness
  • Ease of use

My portal

Extraction technique
• Extraction based on HTML DOM tree
• Absolute XPath
• Relative XPath
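
The difference between the two query styles can be sketched as follows. This is a minimal illustration, not the myPortal implementation: the sample pages, the `extract` helper and both path expressions are made up for this example, the pages are assumed to be well-formed (XHTML-like) markup, and Python's `xml.etree.ElementTree` supports only a subset of XPath. The absolute path counts positions from the document root, while the relative path first locates a reference element (here the heading "Weather") and navigates from there.

```python
import xml.etree.ElementTree as ET

# Hypothetical page at the time the query was defined.
OLD_PAGE = """
<html><body>
  <div><h2>News</h2><p>Headline A</p></div>
  <div><h2>Weather</h2><p>Sunny, 21 C</p></div>
</body></html>
"""

# The same hypothetical page after a new block was inserted at the top.
NEW_PAGE = """
<html><body>
  <div><h2>Ads</h2><p>Buy now</p></div>
  <div><h2>News</h2><p>Headline B</p></div>
  <div><h2>Weather</h2><p>Rainy, 15 C</p></div>
</body></html>
"""

# Absolute query: fixed positions counted from the root element.
ABSOLUTE = "./body/div[2]/p"
# Relative query: find the reference element, then step to the content.
RELATIVE = ".//h2[.='Weather']/../p"


def extract(page, path):
    """Return the text of the first node matching path, or None."""
    root = ET.fromstring(page.strip())
    node = root.find(path)
    return None if node is None else node.text


for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(name, "absolute:", extract(page, ABSOLUTE))
    print(name, "relative:", extract(page, RELATIVE))
```

On the changed page the absolute query still returns something, but from the wrong block, whereas the relative query keeps following the reference element.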

Visual query specification
• Reference element
• Extracted content

Aggregation of content

Functionality
Done:
• Extract content from any HTML page
• Record POST, GET parameters, cookies
• Access search results (via GET or POST) from search engines, as a subscription-like service
Work in progress:
• Deal with multi-stage login or query mechanisms, such as obtaining bank account info
• Deal with information from multiple DOM tree branches in a single query
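
Recording GET/POST parameters and cookies so that a query can be replayed later might look like the sketch below. Everything here is an assumption made for illustration: the `RecordedQuery` class, the `build_request` helper and the example URL are hypothetical and do not come from myPortal. The code only builds the request object with the standard library and does not send it.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class RecordedQuery:
    """Hypothetical record of one page request captured from the user."""
    url: str
    method: str = "GET"
    params: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


def build_request(q: RecordedQuery) -> Request:
    """Rebuild an HTTP request from a recorded query (nothing is sent)."""
    encoded = urlencode(q.params)
    headers = {}
    if q.cookies:
        # Cookies are replayed as a single Cookie header.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in q.cookies.items())
    if q.method.upper() == "POST":
        # POST parameters travel in the request body.
        return Request(q.url, data=encoded.encode("ascii"),
                       headers=headers, method="POST")
    # GET parameters travel in the query string.
    url = f"{q.url}?{encoded}" if encoded else q.url
    return Request(url, headers=headers, method="GET")


search = RecordedQuery("http://example.org/search", "POST",
                       {"q": "xpath", "page": "1"}, {"session": "abc"})
req = build_request(search)
print(req.get_method(), req.full_url, req.data)
```

Passing the resulting object to `urllib.request.urlopen` would replay the query; the returned page could then be fed to the extraction step.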

Other (technical) problems
• HTML code quality: HTML Tidy
• WYSIWYG for aggregation
• Robustness
• Multiple occurrences of reference element
• Document structure changes between reference and extracted elements
• Deletion / change in the reference element

Research on robustness
• Purpose: to check if relative XPath expressions are more robust than absolute XPath expressions

Research method
• Empirical tests on multiple portals
• Manual query preparation for absolute and relative queries
• Comparison of results in three categories:
  • Accurate extraction
  • Lack of result
  • Inaccurate extraction
• Based on historical versions of portal sites obtained from Web Archive
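
The three-way comparison described above can be sketched as a small tally over page snapshots. This is an illustration under stated assumptions, not the procedure used in the study: the snapshots, expected values, query strings and the `classify` and `tally` helpers are invented for this example, and the counts it produces say nothing about the actual results.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical historical snapshots, each paired with the content a
# human would consider correct for that version of the page.
SNAPSHOTS = [
    ("<html><body><div><h2>Weather</h2><p>Sunny</p></div></body></html>",
     "Sunny"),
    ("<html><body><div><h2>News</h2><p>Strike</p></div>"
     "<div><h2>Weather</h2><p>Rain</p></div></body></html>",
     "Rain"),
    ("<html><body><section><div><h2>Weather</h2><p>Fog</p></div>"
     "</section></body></html>",
     "Fog"),
]

ABSOLUTE = "./body/div[1]/p"           # positions counted from the root
RELATIVE = ".//h2[.='Weather']/../p"   # anchored on a reference element


def classify(snapshot, path, expected):
    """Place one extraction attempt into one of the three categories."""
    node = ET.fromstring(snapshot).find(path)
    if node is None:
        return "lack of result"
    text = (node.text or "").strip()
    return "accurate" if text == expected else "inaccurate"


def tally(path):
    """Count category outcomes for one query over all snapshots."""
    return Counter(classify(s, path, exp) for s, exp in SNAPSHOTS)


print("absolute:", dict(tally(ABSOLUTE)))
print("relative:", dict(tally(RELATIVE)))
```

Separating "lack of result" from "inaccurate extraction" matters because an empty result can be detected automatically, while a wrong but plausible result cannot.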

Robustness comparison

Average robustness

Thank you! t.kaczmarek@kie.ae.poznan.pl
