
Paper 37: Mining Web Pages for Data Records (MDR). Liu, Bing; Grossman, Robert; Yanhong Zhai. University of Illinois at Chicago. IEEE Intelligent Systems, 11, Volume: 19, Issue: 6, Pages: 49-55. 2004/11/1. Professors: 陳彥良, 許秉瑜. Presented by: 狄宇昌. 2006 Data Mining.


Paper 37 Mining Web Pages for Data Records (MDR)

Liu, Bing; Grossman, Robert; Yanhong Zhai

University of Illinois at Chicago

IEEE Intelligent Systems, 11

Volume: 19, Issue: 6, Pages: 49-55. 2004/11/1

Professors: 陳彥良, 許秉瑜

Presented by: 狄宇昌

2006 Data Mining


Outline

  • Introduction

  • Related work (MDR, Omini, IEPAD, Wrapper)

  • Mining data regions

    • Comparing generalized nodes (CombComp)

    • Determining data regions (FindDRs)

  • Identifying data records (FindRecords)

  • Experiment results


Introduction (1/5)

  • Extracting information from Web pages helps provide value-added services, such as:

    • Customizable Web information gathering

    • Comparative shopping

    • Metasearching

  • MDR (mining data records)

    • Exploits the Web page structure

    • Uses a string-matching algorithm

    • Mines both contiguous and noncontiguous data records


Introduction (2/5)

  • Current approach 1: supervised learning

    • Requires substantial human effort

  • Current approach 2: automatic techniques, which perform poorly

  • Existing techniques assume that the relevant items lie in a contiguous segment of the Web page

  • Few researchers have exploited the nested structure of HTML


Introduction (3/5)

  • MDR (mining data records)

    • An automatic technique that finds all data records formed by table-related and form-related HTML tags

    • Such tags include table, form, tr, td, and so on

    • In the reported experiments, MDR outperformed the existing systems it was compared with (Omini and IEPAD)


Introduction (4/5)

  • MDR is based on two observations about Web page layout

  • Observation one:

    • Similar objects appear in a contiguous region of a page

    • These data regions are formatted with similar HTML tags


Introduction (5/5)

  • Observation two:

    • A tag tree represents the nested structure of the HTML tags in a Web page

    • The data records of a specific region sit under one parent node

      • As in figure 1b, each notebook is wrapped in 5 tr nodes under the same parent node, tbody


Related work (1/2)

  • Researchers have developed several approaches for mining data records from Web pages

  • Omini (Object Mining and Extraction system)

    • Uses a set of heuristics and a manually constructed domain ontology


Related work (2/2)

  • IEPAD (Information Extraction based on Pattern Discovery)

    • An automatic method that uses sequence alignment to find patterns representing a set of data records

  • Wrapper induction

    • A wrapper is a program that extracts data from a Web site and puts it in a database

    • Wrapper induction learns extraction rules from manually labeled training examples


MDR Technique

  • Three steps:

    • Step 1: Build an HTML tag tree of the page

    • Step 2: Mine all data regions in the page, using the two observations and an edit-distance string comparison

    • Step 3: Identify data records from each data region
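The first step can be illustrated with a minimal sketch that uses Python's standard `html.parser` module. The names `TagNode`, `TagTreeBuilder`, `build_tag_tree`, and `tag_string`, as well as the short list of void tags, are assumptions made for this illustration and are not taken from the paper.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, assumed for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagNode:
    """One node of the tag tree: a tag name plus its child nodes."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from HTML, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_string(node):
    """Preorder list of the tag names in the subtree rooted at node."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags


page = "<table><tr><td>a<br>b</td></tr><tr><td>c</td></tr></table>"
table = build_tag_tree(page).children[0]
print(tag_string(table))
```

The tag strings produced this way are what a later edit-distance comparison would operate on; real pages with malformed markup would need more careful handling than this sketch provides.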


Mining data regions (1/2)

  • First, mine generalized nodes

  • A sequence of adjacent generalized nodes forms a data region

Node pairs (14, 15), (16, 17), and (18, 19) are generalized nodes of length 2

Nodes 5 and 6 are generalized nodes of length 1

Nodes 8, 9, and 10 are generalized nodes of length 1


Mining data regions (2/2)

  • A data region contains two or more generalized nodes with the following properties:

    • They have the same parent

    • They have the same length (the same number of child nodes in the tag tree)

    • They are adjacent

    • The normalized edit distance between them is less than a fixed threshold
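The last property can be made concrete with a small sketch. It computes the Levenshtein distance between two tag sequences and divides by the length of the longer one. That normalization is an assumption of this sketch (the slides do not state the exact formula), and the function names are hypothetical. The default threshold of 0.3 is the value reported in the experiment slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_similar(a, b, threshold=0.3):
    """True if two tag strings are close enough to be treated as similar."""
    return normalized_edit_distance(a, b) < threshold


row_a = ["tr", "td", "td", "td"]
row_b = ["tr", "td", "td", "a"]
print(normalized_edit_distance(row_a, row_b), is_similar(row_a, row_b))
```

Comparing sequences of tag names rather than raw characters keeps a long tag name from counting as many edits; whether the original system does the same is not stated in the slides.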


Comparing generalized nodes (1/7)

  • The mining algorithm must answer two questions:

    • Q1. Where does the first generalized node of a data region start?

    • Q2. How many tag nodes (components) does each generalized node in a data region contain?


Comparing generalized nodes (2/7)

  • K: the maximum number of tag nodes in a generalized node; K is small (less than 10)

    • Answer 1: try to find a data region starting from each tag node in turn

    • Answer 2: try 1-node, 2-node, …, K-node combinations


Comparing generalized nodes (3/7)

  • The number of comparisons is not large, for two reasons:

    • Only the child nodes of the same parent node are compared. E.g., in figure 2 there is no need to compare node 8 and node 13

    • Some comparisons performed for earlier start nodes are the same as those for later start nodes, so they need not be done twice


Comparing generalized nodes (4/7)

  • Figure 3 shows 10 nodes below a parent node p

  • A generalized node can have at most three components (K = 3)


Comparing generalized nodes (5/7)

  • Starting from node 1, we compute these comparisons:

    • (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)

    • (1-2, 3-4), (3-4, 5-6), (5-6, 7-8), (7-8, 9-10)

    • (1-2-3, 4-5-6), (4-5-6, 7-8-9)

  • Starting from node 2, we compute only:

    • (2-3, 4-5), (4-5, 6-7), (6-7, 8-9)

    • (2-3-4, 5-6-7), (5-6-7, 8-9-10)

  • Starting from node 3, we need only one string comparison: (3-4-5, 6-7-8)

  • There is no need to start from any node after node 3, because K = 3
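The enumeration above can be reproduced with a short sketch. It is only an illustration of the loop structure implied by the example: the function name `enumerate_comparisons` and the 1-based (start, end) ranges are assumptions of this sketch, and it lists which comparisons would be made without performing any string comparison.

```python
def enumerate_comparisons(n, K):
    """List the pairs of node ranges compared for n sibling nodes.

    Each range is a 1-based (start, end) tuple. Start node i tries only
    combination sizes j >= i, because smaller sizes were already covered
    by earlier start nodes.
    """
    comparisons = []
    for i in range(1, K + 1):            # start node
        for j in range(i, K + 1):        # combination size
            if i + 2 * j - 1 > n:        # not enough nodes for one pair
                continue
            start = i
            k = i + j
            while k + j - 1 <= n:        # slide along adjacent combinations
                comparisons.append(((start, k - 1), (k, k + j - 1)))
                start = k
                k += j
    return comparisons


pairs = enumerate_comparisons(10, 3)
print(len(pairs))  # 21 comparisons, matching the lists on this slide
for left, right in pairs:
    print(left, right)
```

For 10 nodes and K = 3 this yields 15 comparisons from node 1, 5 from node 2, and 1 from node 3, which agrees with the slide.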


Comparing generalized nodes (6/7)

The algorithm does not search for data regions below Node if the depth of the subtree rooted at Node is 1 or 2
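A minimal sketch of this depth rule follows. The dictionary-based tree, the convention that a leaf has depth 1, and the names `tree_depth` and `nodes_to_search` are assumptions of this illustration.

```python
def tree_depth(node):
    """Depth of the subtree rooted at node; a leaf counts as depth 1 (assumed)."""
    children = node.get("children", [])
    return 1 + max((tree_depth(child) for child in children), default=0)


def nodes_to_search(node, min_depth=3):
    """Tags of the nodes whose children would be compared, in preorder."""
    found = []
    if tree_depth(node) >= min_depth:
        found.append(node["tag"])
        for child in node.get("children", []):
            found.extend(nodes_to_search(child, min_depth))
    return found


td = {"tag": "td"}
tr = {"tag": "tr", "children": [td, dict(td)]}
tbody = {"tag": "tbody", "children": [tr, dict(tr)]}
table = {"tag": "table", "children": [tbody]}
print(nodes_to_search(table))  # tr and td subtrees are too shallow to search
```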


Comparing generalized nodes (7/7)

  • The total number of nodes in the tag tree is N

  • Without considering string comparison, the complexity of CombComp is O(NK)

  • Because K is relatively small, the CombComp algorithm is linear in N


Determining data regions (1/2)

  • Procedure FindDRs reports

    • The entire area as one data region

    • Each row as a generalized node

    • A region that contains eight data records


Determining data regions (2/2)

  • Two main issues affect the final decision

    • If a lower-level data region lies within a higher-level data region, we report the higher-level data region

    • Within a data region, we report only the smallest generalized nodes


Identifying data records (1/5)

  • Data region → Generalized node → Data record (object)

  • A generalized node might contain one or more data records


Identifying data records (2/5)

  • Noncontiguous object description

    • Order in the HTML code: Name 1, Name 2, Description 1, Description 2, Name 3, Name 4, Description 3, Description 4


Identifying data records (3/5)

  • Finding noncontiguous data records

    • Group the corresponding children of nodes 1 and 2

    • Join node 5 and node 7 to form one data record

    • Join node 6 and node 8 to form another
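The joining step can be sketched as pairing the children of two adjacent nodes position by position. The function name `join_noncontiguous` and the use of plain strings in place of tag-tree nodes are assumptions of this illustration.

```python
def join_noncontiguous(first_children, second_children):
    """Pair the i-th child of one node with the i-th child of the next.

    Each pair is one data record whose parts are not adjacent in the HTML.
    """
    if len(first_children) != len(second_children):
        raise ValueError("the two nodes must have the same number of children")
    return list(zip(first_children, second_children))


# One row holds the names and the next row holds the descriptions.
names = ["Name 1", "Name 2"]
descriptions = ["Description 1", "Description 2"]
print(join_noncontiguous(names, descriptions))
```

With the name and description rows from the noncontiguous example, the first record pairs "Name 1" with "Description 1" and the second pairs "Name 2" with "Description 2".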


Identifying data records (4/5)

  • Data records not in any data region

    • Rows 1, 2, and 3 are at the same level; rows 1 and 2 (two generalized nodes) form a data region

    • Object 5 is not covered by that data region


Identifying data records (5/5)

  • Finding Object 5: an odd number of objects in a table and its HTML tag tree

  • Use Object 4 (or any of the four objects) to match the tag string of each child of the sibling nodes of r1 and r2


Experiment results (1/2)

  • Evaluate MDR and compare it with Omini and IEPAD

  • MDR was implemented and debugged using pages from the Amazon, Yahoo, and Hewlett-Packard Web sites

  • The default edit distance threshold is 0.3, with no tuning for new pages or Web sites


Experiment results (2/2)

  • Use the standard precision and recall measures

  • Omini and IEPAD only work well with simple pages

    • Pages with many similar data records and little noise
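For reference, the two measures can be written as a small sketch. The function name and the toy record identifiers are hypothetical and are not results from the paper.

```python
def precision_recall(extracted, correct):
    """Precision and recall of a set of extracted records.

    precision = correctly extracted / all extracted
    recall    = correctly extracted / all records actually on the page
    """
    extracted, correct = set(extracted), set(correct)
    true_positives = len(extracted & correct)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Toy data: 4 records extracted, 3 of them right, 5 records on the page.
print(precision_recall({"r1", "r2", "r3", "x"}, {"r1", "r2", "r3", "r4", "r5"}))
```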


Current & future work

  • Currently, two practical applications

    • Extracting consumer product reviews from online merchant sites

    • A more effective technique for extracting individual data fields from data records

  • Future work

    • Study the problem of extracting information from text documents that are much less structured than HTML

