Automatically Extracting Data Records from Web Pages

Presenter: Dheerendranath Mundluru

dnm8925@cacs.louisiana.edu

http://www.ucs.louisiana.edu/~dnm8925

Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu

Jayasimha R. Katukuri, Saygin Celebi

Laboratory for Internet Computing

Center for Advanced Computer Studies

University of Louisiana at Lafayette, Lafayette, LA

Agenda
  • Introduction
  • Proposed Solution: Path-based Information Extractor
  • Experiments
  • Conclusions and Future Work
Introduction

World Wide Web: Largest known repository of documents containing diverse content used by people from diverse backgrounds.

A few characteristics of the Web include:

  • Huge size
  • Easily accessible
  • Hyperlinked
  • Dynamic
  • Diverse coverage – science, politics, education, etc.
  • Increasing at a tremendous rate
  • Noisy - advertisements, mirror sites, etc.
Web Mining: Leverage the Value of the Web
  • Web mining aims to discover useful knowledge from the Web
  • Characteristics of the Web such as heterogeneity, increasing size, and noise make Web mining a challenging task
  • Web mining can be classified into [Kosala 00, Liu 04]:
    • Web content mining: Extracting and discovering useful information or knowledge from Web page contents
    • Web structure mining: Discovering useful knowledge from the structure of hyperlinks (e.g., as used by Google)
    • Web usage mining: Discovering useful knowledge from user access log files (e.g., as used by Amazon.com)
  • Web mining is a multidisciplinary field:
    • Data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.
Structured Data Extraction
  • Structured data extraction deals with extracting information displayed in a regular structure, as such information is perceived to represent the essential content of a Web page, e.g., a list of products in an e-commerce Web page. [Liu 04]
  • A few example applications:
    • Online comparative shopping engines (e.g., nextag.com)
    • Metasearch engines (e.g., dogpile.com)
    • Modern Business Intelligence systems (e.g., intelliseek.com)
Path-based Information Extractor (PIE)
  • PIE is a data extraction system that automatically extracts the data records present in Web search response pages. [Mundluru 05a, Mundluru 05b]
  • PIE also eliminates any “noisy” content such as advertisements, navigation links, etc.
A Few Observations

Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03]

Observation 2: A group of similar data records belonging to a particular region are always present under the same parent node in the tag tree. [Liu 03]

Observation 3: Every record present in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
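
The three observations suggest a simple heuristic: hyperlinks that share the same path of tags from the root of the tag tree are likely to be record links of similar records. The following is a minimal sketch of that idea only, not PIE's actual algorithm, which is not detailed in these slides. It assumes that an identical tag path is a good enough proxy for "same parent node, similar formatting". The names `LinkPathCollector`, `candidate_record_links`, `min_records`, and `SAMPLE_PAGE` are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Records the root-to-node tag path of every hyperlink in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags, root first
        self.links = defaultdict(list)  # tag path -> hrefs found on that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def candidate_record_links(html, min_records=2):
    """Return {tag path: hrefs} for paths shared by at least min_records links."""
    collector = LinkPathCollector()
    collector.feed(html)
    return {path: hrefs for path, hrefs in collector.links.items()
            if len(hrefs) >= min_records}


# Hypothetical response page: one advertisement link and three result records.
SAMPLE_PAGE = """
<html><body>
  <div><a href="/ads/1">Sponsored</a></div>
  <ul>
    <li><a href="/doc/1">First result</a><p>snippet</p></li>
    <li><a href="/doc/2">Second result</a><p>snippet</p></li>
    <li><a href="/doc/3">Third result</a><p>snippet</p></li>
  </ul>
</body></html>
"""

print(candidate_record_links(SAMPLE_PAGE))
```

On this sample, the three result links share the path `html/body/ul/li/a` and are kept, while the single advertisement link sits alone on its own path and is dropped. A real page would need a stricter notion of "same parent" than a shared tag path, since two separate lists with identical markup would be merged here.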

Experiments

Experiment Setup:

  • Evaluated the proposed system by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05]
  • All three systems were tested on a total of 60 Web pages (having 873 data records) taken from 60 Web sources
  • The 60 Web sources include:
    • general-purpose search engines e.g., Google, Yahoo
    • e-commerce sites e.g., drugstore.com, clevershoppers.com
    • other special-purpose search engines e.g., mit.edu, breastcancer.org
  • PIE was developed in Java
Experiments
  • Evaluation Measures Used:
    • Recall = (Total number of target data records correctly extracted) / (Total number of target data records)
    • Precision = (Total number of target data records correctly extracted) / (Total number of data records extracted)
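
As an illustration of the two measures, the sketch below computes recall and precision from a set of extracted records and a set of target records. The function name `recall_precision` and the record identifiers are hypothetical; the counts are made up for the example and are not results from the experiments.

```python
def recall_precision(extracted, target):
    """Return (recall, precision) for extracted records against target records."""
    extracted, target = set(extracted), set(target)
    correct = len(extracted & target)  # target records correctly extracted
    recall = correct / len(target) if target else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical page: 5 target records; the extractor returns 3 of them plus 1 advertisement.
target = {"r1", "r2", "r3", "r4", "r5"}
extracted = {"r1", "r2", "r3", "ad1"}

recall, precision = recall_precision(extracted, target)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.60 precision=0.75
```

Missing a target record lowers recall only, whereas extracting noise such as an advertisement lowers precision only, which is why both measures are reported.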

  • Results:
Conclusions & Future Work

Conclusions:

  • Automatic data extraction is essential for systems such as online comparative shopping engines, metasearch engines, and business intelligence solutions.
  • PIE, a system for automatically extracting data records from Web pages, has been proposed.
  • Experiments showed that PIE outperformed MDR and ViNTs, two state-of-the-art record extraction systems that are in use at two software companies.

Future Work:

  • Improving the effectiveness in extracting records
  • Extracting attributes in each data record e.g., product name, price, etc.
  • Performing large-scale experiments
  • Building applications such as online comparative shopping engines, metasearch engines, etc.
References

[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.

[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.

[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1), 1-15, 2000.

[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.

[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.

[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.