Dheerendranath Mundluru Dr. Vijay Raghavan Dr. Zonghuan Wu

Automatically Extracting Data Records from Web Pages
Presenter: Dheerendranath Mundluru, dnm8925@cacs.louisiana.edu, http://www.ucs.louisiana.edu/~dnm8925
Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi
Laboratory for Internet Computing, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, LA

Agenda • Introduction • Proposed Solution: Path-based Information Extractor • Experiments • Conclusions and Future Work

Introduction World Wide Web: The largest known repository of documents, containing diverse content used by people from diverse backgrounds. A few characteristics of the Web: • Huge size • Easily accessible • Hyperlinked • Dynamic • Diverse coverage: science, politics, education, etc. • Growing at a tremendous rate • Noisy: advertisements, mirror sites, etc.

Web Mining: Leverage the Value of Web • Web mining aims to discover useful knowledge from the Web • Characteristics of the Web such as heterogeneity, increasing size, and noise make Web mining a challenging task • Web mining can be classified into [Kosala 00, Liu 04]: • Web content mining: Extracting and discovering useful information or knowledge from Web page contents • Web structure mining: Discovering useful knowledge from the structure of hyperlinks, e.g., as used by Google • Web usage mining: Discovering useful knowledge from user access log files, e.g., as used by Amazon.com • Web mining is a multidisciplinary field: • Data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.

Web Mining & Web Content Mining Classification

Structured Data Extraction • Structured data extraction deals with extracting information displayed in a regular structure, as such information is perceived to represent the essential content of a Web page, e.g., the list of products in an e-commerce Web page. [Liu 04] • A few example applications: • Online comparative shopping engines (e.g., nextag.com) • Metasearch engines (e.g., dogpile.com) • Modern Business Intelligence systems (e.g., intelliseek.com)

Sample response page from Google

Sample response page from drugstore.com

Path-based Information Extractor (PIE) • PIE is an automatic data extraction system whose goal is to automatically extract data records present in Web search response pages. [Mundluru 05a, Mundluru 05b] • PIE also eliminates any “noisy” content such as advertisements, navigation links, etc.

Few Observations Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03] Observation 2: A group of similar data records belonging to a particular region is always present under the same parent node in the tag tree. [Liu 03] Observation 3: Every record present in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
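The observations above can be illustrated with a small sketch. This is not the PIE algorithm, which is not spelled out in this transcript. It is a simplified heuristic, assuming that candidate records are sibling nodes with the same tag name under one parent (Observation 2) and that each contains at least one hyperlink (Observation 3). It does not check contiguity or deeper formatting similarity (Observation 1). The names `Node`, `TreeBuilder`, `candidate_record_groups`, `find_groups`, and `SAMPLE_PAGE` are hypothetical and introduced only for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element in a simplified tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def has_link(node):
    """True if the subtree contains an <a> element (a possible record link)."""
    return node.tag == "a" or any(has_link(c) for c in node.children)


def candidate_record_groups(root, min_records=2):
    """Return (parent tag, child tag, count) for groups of same-tag siblings
    that each contain a hyperlink. Assumption: same tag name stands in for
    'similar formatting'."""
    groups = []
    stack = [root]
    while stack:
        parent = stack.pop()
        stack.extend(parent.children)
        by_tag = defaultdict(list)
        for child in parent.children:
            if has_link(child):
                by_tag[child.tag].append(child)
        for tag, nodes in by_tag.items():
            if len(nodes) >= min_records:
                groups.append((parent.tag, tag, len(nodes)))
    return groups


def find_groups(html):
    builder = TreeBuilder()
    builder.feed(html)
    return candidate_record_groups(builder.root)


# A made-up response page: one navigation link plus three result records.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul>
<li><a href="/r1">Result 1</a><p>snippet</p></li>
<li><a href="/r2">Result 2</a><p>snippet</p></li>
<li><a href="/r3">Result 3</a><p>snippet</p></li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    print(find_groups(SAMPLE_PAGE))
```

On the sample page, the navigation `div` is not reported because it is the only link-bearing `div` under its parent, while the three `li` items under the same `ul` form one candidate group. A real extractor would need a stronger similarity test than tag names alone.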

Record Extraction Algorithm

Experiments Experiment Setup: • Evaluated the proposed system by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05] • All three systems were tested on a total of 60 Web pages (having 873 data records) taken from 60 Web sources • The 60 Web sources include: • general-purpose search engines e.g., Google, Yahoo • e-commerce sites e.g., drugstore.com, clevershoppers.com • other special-purpose search engines e.g., mit.edu, breastcancer.org • PIE was developed in Java

Experiments • Evaluation Measures Used: • Recall = (Total number of target data records correctly extracted) / (Total number of target data records) • Precision = (Total number of target data records correctly extracted) / (Total number of data records extracted) • Results:
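The recall and precision definitions above can be made concrete with a small sketch. The counts used here are toy values chosen for illustration and are not the results of the experiments reported in this presentation. The function and parameter names are hypothetical.

```python
def recall(correctly_extracted, target_total):
    """Fraction of the target data records that were correctly extracted."""
    if target_total == 0:
        return 0.0
    return correctly_extracted / target_total


def precision(correctly_extracted, extracted_total):
    """Fraction of the extracted data records that are correct target records."""
    if extracted_total == 0:
        return 0.0
    return correctly_extracted / extracted_total


if __name__ == "__main__":
    # Toy example (assumed values): a page has 10 target records; a system
    # extracts 9 records, of which 8 are correct target records.
    print(f"recall    = {recall(8, 10):.3f}")
    print(f"precision = {precision(8, 9):.3f}")
```

Recall penalizes a system for missing target records, while precision penalizes it for extracting records that are not targets, such as advertisements or navigation links.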

Conclusions & Future Work Conclusions: • Automatic data extraction is extremely important for systems such as online comparative search engines, metasearch engines, business intelligence solutions, etc. • A very effective system called PIE has been proposed for automatically extracting data records from Web pages. • Experiments showed that PIE outperformed MDR and ViNTs, which are two state-of-the-art record extraction systems that are being used in two software companies. Future Work: • Improving the effectiveness in extracting records • Extracting attributes in each data record e.g., product name, price, etc. • Performing large-scale experiments • Building applications such as online comparative shopping engines, metasearch engines, etc.

References
[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.
[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.
[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2(1), 1-15, 2000.
[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.
[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.
[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.
