Automatically Extracting Data Records from Web Pages
Presenter: Dheerendranath Mundluru
dnm8925@cacs.louisiana.edu
http://www.ucs.louisiana.edu/~dnm8925
Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi
Laboratory for Internet Computing
Center for Advanced Computer Studies
University of Louisiana at Lafayette, Lafayette, LA
World Wide Web: Largest known repository of documents containing diverse content used by people from diverse backgrounds.
A few characteristics of the Web include:
Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03]
Observation 2: A group of similar data records belonging to a particular region is always present under the same parent node in the tag tree. [Liu 03]
Observation 3: Every record present in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
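The three observations can be illustrated with a small sketch. This is not the authors' extraction algorithm; it is a minimal example, assuming reasonably well-formed HTML, that builds a tag tree with Python's standard html.parser module and then looks for contiguous sibling nodes under one parent that share the same tag layout and each contain at least one hyperlink. The names TreeBuilder, signature, first_link, and candidate_regions are hypothetical, and exact equality of tag layouts stands in for the looser notion of "similar" formatting.

```python
from html.parser import HTMLParser

# Elements that never have children (illustrative subset).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None, attrs=()):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs)
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal tag tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    # Simplified stand-in for "formatted using similar HTML tags" (Observation 1).
    return (node.tag, tuple(child.tag for child in node.children))


def first_link(node):
    # Returns the first href found in the subtree, or None (Observation 3).
    if node.tag == "a" and node.attrs.get("href") is not None:
        return node.attrs["href"]
    for child in node.children:
        href = first_link(child)
        if href is not None:
            return href
    return None


def candidate_regions(root, min_records=2):
    """Finds runs of contiguous, same-signature siblings that all contain a link."""
    regions = []
    stack = [root]
    while stack:
        parent = stack.pop()  # siblings share this parent (Observation 2)
        stack.extend(parent.children)
        run = []
        for child in parent.children + [None]:  # None flushes the last run
            linked = child is not None and first_link(child) is not None
            if linked and (not run or signature(child) == signature(run[0])):
                run.append(child)
            else:
                if len(run) >= min_records:
                    regions.append(run)
                run = [child] if linked else []
    return regions


page = """
<div id="results">
  <h2>Results</h2>
  <div class="r"><a href="/doc1">First</a><p>snippet</p></div>
  <div class="r"><a href="/doc2">Second</a><p>snippet</p></div>
  <div class="r"><a href="/doc3">Third</a><p>snippet</p></div>
  <p>footer without link</p>
</div>
"""

builder = TreeBuilder()
builder.feed(page)
regions = candidate_regions(builder.root)
record_links = [first_link(record) for record in regions[0]]
```

On this made-up page, the three result blocks form one candidate region and record_links holds the first hyperlink of each block, which plays the role of the record link. Real response pages would need a tolerant similarity measure rather than exact signature matching.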
Total number of target data records
Total number of data records extracted
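The two totals above read like the denominators of recall and precision. Assuming the standard definitions, with the number of correctly extracted records as the numerator (that numerator is an assumption, since it does not survive in the text), the measures could be computed as in the sketch below. The counts in the example are invented for illustration and are not results from this work.

```python
def recall(correct_extracted, total_target):
    # Fraction of the target data records that were correctly extracted.
    return correct_extracted / total_target if total_target else 0.0


def precision(correct_extracted, total_extracted):
    # Fraction of the extracted data records that are correct.
    return correct_extracted / total_extracted if total_extracted else 0.0


# Hypothetical counts, for illustration only.
correct = 45          # assumed numerator: correctly extracted records
total_target = 50     # total number of target data records
total_extracted = 48  # total number of data records extracted

r = recall(correct, total_target)
p = precision(correct, total_extracted)
print(f"recall={r:.4f} precision={p:.4f}")
```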
[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.
[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.
[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1), 1-15, 2000.
[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.
[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.
[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.