Automatically Extracting Data Records from Web Pages Presenter: Dheerendranath Mundluru dnm8925@cacs.louisiana.edu http://www.ucs.louisiana.edu/~dnm8925 Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi. Laboratory for Internet Computing, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, LA
Agenda • Introduction • Proposed Solution: Path-based Information Extractor • Experiments • Conclusions and Future Work
Introduction World Wide Web: The largest known repository of documents, containing diverse content used by people from diverse backgrounds. A few characteristics of the Web include: • Huge size • Easily accessible • Hyperlinked • Dynamic • Diverse coverage – science, politics, education, etc. • Growing at a tremendous rate • Noisy – advertisements, mirror sites, etc.
Web Mining: Leverage the Value of the Web • Web mining aims to discover useful knowledge from the Web • Characteristics of the Web such as heterogeneity, increasing size, and noise make Web mining a challenging task • Web mining can be classified into [Kosala 00, Liu 04]: • Web content mining: Extracting and discovering useful information or knowledge from Web page contents • Web structure mining: Discovering useful knowledge from the structure of hyperlinks, e.g., used by Google • Web usage mining: Discovering useful knowledge from user access log files, e.g., used by Amazon.com • Web mining is a multidisciplinary field: • Data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.
Structured Data Extraction • Structured data extraction deals with extracting information displayed in a regular structure, as such information is perceived to represent the essential content of a Web page, e.g., the list of products in an e-commerce Web page [Liu 04]. • A few example applications: • Online comparative shopping engines (e.g., nextag.com) • Metasearch engines (e.g., dogpile.com) • Modern Business Intelligence systems (e.g., intelliseek.com)
Path-based Information Extractor (PIE) • PIE is an automatic data extraction system whose goal is to automatically extract data records present in Web search response pages. [Mundluru 05a, Mundluru 05b] • PIE also eliminates any “noisy” content such as advertisements, navigation links, etc.
A Few Observations Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03] Observation 2: A group of similar data records belonging to a particular region is always present under the same parent node in the tag tree. [Liu 03] Observation 3: Every record present in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed in the form of a hyperlink, which points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
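The three observations can be illustrated with a small Python sketch. This is not the PIE algorithm, whose details are not given on these slides; it is only a simplified heuristic built on assumptions. The names Node, TreeBuilder, signature, candidate_regions, find_regions and the SAMPLE page are hypothetical, and "similar HTML tags" is approximated crudely by comparing a node's tag and the tags of its direct children.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One node of a simplified tag tree (text content is ignored)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.has_link = False  # True if this node is, or contains, an <a href=...>


class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            # Observation 3: mark the link and every ancestor as link-bearing.
            n = node
            while n is not None:
                n.has_link = True
                n = n.parent
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        n = self.current
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.current = n.parent


def signature(node):
    # Assumed similarity test: same tag and same sequence of child tags.
    return (node.tag, tuple(child.tag for child in node.children))


def candidate_regions(node, min_records=2):
    """Return (parent, records) pairs: contiguous, similar, link-bearing siblings."""
    regions, run = [], []
    for child in node.children:
        if child.has_link and run and signature(child) == signature(run[-1]):
            run.append(child)  # Observation 1: contiguous and similarly formatted
        else:
            if len(run) >= min_records:
                regions.append((node, run))  # Observation 2: one shared parent
            run = [child] if child.has_link else []
    if len(run) >= min_records:
        regions.append((node, run))
    for child in node.children:
        regions.extend(candidate_regions(child, min_records))
    return regions


def find_regions(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return candidate_regions(builder.root)


# Hypothetical search response page: a navigation link, three records, a footer.
SAMPLE = """
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<ul>
<li><a href="/doc1">Doc 1</a><span>snippet</span></li>
<li><a href="/doc2">Doc 2</a><span>snippet</span></li>
<li><a href="/doc3">Doc 3</a><span>snippet</span></li>
</ul>
<p>Footer without links</p>
</body></html>
"""

for parent, records in find_regions(SAMPLE):
    print(parent.tag, len(records))
```

On the sample page, the single navigation link is not reported because it has no similar contiguous sibling, which hints at how such observations help separate records from "noisy" content. A real system would need a more tolerant similarity test than exact signature equality.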
Experiments Experiment Setup: • Evaluated the proposed system by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05] • All three systems were tested on a total of 60 Web pages (having 873 data records) taken from 60 Web sources • The 60 Web sources include: • general-purpose search engines e.g., Google, Yahoo • e-commerce sites e.g., drugstore.com, clevershoppers.com • other special-purpose search engines e.g., mit.edu, breastcancer.org • PIE was developed in Java
Experiments • Evaluation Measures Used: • Recall = (Total number of target data records correctly extracted) / (Total number of target data records) • Precision = (Total number of target data records correctly extracted) / (Total number of data records extracted) • Results:
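As an illustration of the two evaluation measures only, the following Python sketch computes recall and precision for one page. The function name evaluate and the record identifiers are hypothetical, and the counts are made-up toy values, not results from the experiments reported in this talk.

```python
def evaluate(target_records, extracted_records):
    """Return (recall, precision) for one page.

    target_records: set of records that should have been extracted.
    extracted_records: set of records the system actually extracted.
    """
    correct = len(target_records & extracted_records)
    recall = correct / len(target_records) if target_records else 0.0
    precision = correct / len(extracted_records) if extracted_records else 0.0
    return recall, precision


# Toy example (hypothetical): 5 target records; the system extracts 6 items,
# of which 4 are target records and 2 are noise such as advertisements.
target = {"r1", "r2", "r3", "r4", "r5"}
extracted = {"r1", "r2", "r3", "r4", "ad1", "ad2"}

recall, precision = evaluate(target, extracted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here recall is 4/5 because one target record is missed, and precision is 4/6 because two of the six extracted items are not target records.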
Conclusions & Future Work Conclusions: • Automatic data extraction is extremely important for systems such as online comparative shopping engines, metasearch engines, business intelligence solutions, etc. • A very effective system called PIE has been proposed for automatically extracting data records from Web pages. • Experiments showed that PIE outperformed MDR and ViNTs, two state-of-the-art record extraction systems that are in use at two software companies. Future Work: • Improving the effectiveness in extracting records • Extracting attributes in each data record, e.g., product name, price, etc. • Performing large-scale experiments • Building applications such as online comparative shopping engines, metasearch engines, etc.
References [Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005. [Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005. [Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2(1), 1-15, 2000. [Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004. [Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003. [Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.