


Paper 37 Mining Web Pages for Data Records (MDR)

Liu, Bing; Grossman, Robert; Yanhong Zhai

University of Illinois at Chicago

IEEE Intelligent Systems

Volume: 19, Issue: 6, Pages: 49-55. 2004/11/1

Professors: 陳彥良, 許秉瑜

Presented by: 狄宇昌

2006 Data Mining


Outline

  • Introduction

  • Related work (MDR, Omini, IEPAD, Wrapper)

  • Mining data regions

    • Comparing generalized nodes (CombComp)

    • Determining data regions (FindDRs)

  • Identifying data records (FindRecords)

  • Experiment results


  • Extracting information from Web pages helps provide value-added services, such as:

    • Customizable Web information gathering

    • Comparative shopping

    • Metasearching

  • MDR (mining data records) exploits

    • Web page structure

    • A string-matching algorithm

  • It mines both contiguous and noncontiguous data records


  • Current approach 1: supervised learning

    • Requires substantial human effort

  • Current approach 2: automatic techniques, which perform poorly

  • They assume that the relevant items sit in a contiguous segment of the Web page

  • Few researchers have exploited the nested structure of HTML tags


  • MDR (mining data records)

    • An automatic technique that finds all data records formed by table- and form-related HTML tags

    • Such as table, form, tr, and td

    • MDR outperformed other existing systems


  • MDR is based on two observations about Web page layout

  • Observation one:

    • Similar objects appear in a contiguous region of a page

    • Data regions are formatted with similar HTML tags


  • Observation two:

    • A tag tree is the nested structure of HTML tags in a Web page

    • The data records in a specific region sit under one parent node

      • As in figure 1b, each notebook is wrapped in 5 tr nodes under the same parent node, tbody

Related work (1/2)

  • Researchers have developed several approaches for mining data records from Web pages

  • Omini (Object Mining and Extraction system)

    • Uses a set of heuristics and a manually constructed domain ontology

Related work (2/2)

  • IEPAD (Information Extraction based on Pattern Discovery)

    • An automatic method that uses sequence alignment to find patterns representing a set of data records

  • Wrapper induction

    • A wrapper is a program that extracts data from a Web site and puts it in a database

    • Learns extraction rules from manually labeled training examples

MDR Technique

  • Three steps:

  • Build an HTML tag tree of the page

  • Mine all data regions in the page using the two observations and an edit-distance string comparison

  • Identify data records from each data region
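The first step can be illustrated with a minimal sketch that builds a tag tree using Python's standard `html.parser`. The names `TagNode`, `TagTreeBuilder`, `build_tag_tree`, and `tag_string` are hypothetical, and the handling of void tags and stray closing tags is a simplifying assumption, not the authors' implementation.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have children (a partial list, assumed for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class TagNode:
    tag: str
    children: list = field(default_factory=list)


class TagTreeBuilder(HTMLParser):
    """Builds a tree of tag names; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_string(node):
    """Flatten a subtree into its pre-order list of tag names."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags


page = "<table><tbody><tr><td>A</td></tr><tr><td>B</td></tr></tbody></table>"
tbody = build_tag_tree(page).children[0].children[0]
print([child.tag for child in tbody.children])
```

The flattened tag strings are what a later step would compare with edit distance; real pages with malformed HTML would need a more forgiving parser.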

Mining data regions (1/2)

  • First, mine generalized nodes

  • A sequence of adjacent generalized nodes forms a data region (p. 16)

The node pairs (14, 15), (16, 17), and (18, 19) are generalized nodes of length 2

Nodes 5 and 6 are generalized nodes of length 1

Nodes 8, 9, and 10 are generalized nodes of length 1

Mining data regions (2/2)

  • A data region contains two or more generalized nodes with these properties:

    • They have the same parent

    • They have the same length (the same number of component tag nodes in the tag tree)

    • They are adjacent

    • The normalized edit distance between them is less than a fixed threshold
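The last property can be sketched as follows. This is a plain Levenshtein distance over sequences of tag names; dividing by the mean length of the two sequences is an assumption made here for illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the mean length (assumed normalization)."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / ((len(a) + len(b)) / 2)


row_a = ["tr", "td", "td", "td"]
row_b = ["tr", "td", "td"]
print(normalized_edit_distance(row_a, row_b) < 0.3)
```

Two adjacent generalized nodes would count as similar when this value falls below the chosen threshold.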

Comparing generalized nodes (1/7)

  • The mining algorithm must answer two questions:

    • Q1. Where does the first generalized node of a data region start?

    • Q2. How many tag nodes (components) are in the generalized nodes of each data region?

Comparing generalized nodes (2/7)

  • K: the maximum number of tag nodes in a generalized node. K is small (less than 10)

    • Answer 1: find a data region starting from each tag node sequentially

    • Answer 2: try 1-node, 2-node, …, K-node combinations

Comparing generalized nodes (3/7)

  • The number of comparisons is not large, for two reasons:

    • Only child nodes of the same parent node are compared. E.g., in figure 2 there is no need to compare node 8 with node 13

    • Some comparisons performed for earlier nodes are the same as those for later nodes, so they need not be repeated

Comparing generalized nodes (4/7)

  • Figure 3 shows 10 nodes below a parent node p.

  • A generalized node can have a maximum of three components, K=3

Comparing generalized nodes (5/7)

  • Starting from node 1, we compute these comparisons:

    • (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)

    • (1-2, 3-4), (3-4, 5-6), (5-6, 7-8), (7-8, 9-10)

    • (1-2-3, 4-5-6), (4-5-6, 7-8-9)

  • Starting from node 2, we compute only:

    • (2-3, 4-5), (4-5, 6-7), (6-7, 8-9)

    • (2-3-4, 5-6-7), (5-6-7, 8-9-10)

  • Starting from node 3, we only need to compute one string comparison: (3-4-5, 6-7-8).

  • There is no need to start from any node after node 3 because K=3
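The enumeration above can be reproduced with a short sketch. The function name `comb_comp_pairs` and its loop structure are assumptions chosen to match the listed comparisons for 10 nodes and K=3; it only lists which pairs would be compared and performs no string comparison.

```python
def comb_comp_pairs(n, k_max):
    """List the (left, right) groups of 1-based child indices to compare.

    For each start node i (1..k_max) and each combination size j (i..k_max),
    walk along the children in steps of j and pair each group with the next.
    Sizes below i are skipped because an earlier start already covered them.
    """
    pairs = []
    for start in range(1, k_max + 1):
        for size in range(start, k_max + 1):
            left = start
            while left + 2 * size - 1 <= n:
                right = left + size
                pairs.append((tuple(range(left, right)),
                              tuple(range(right, right + size))))
                left = right
    return pairs


pairs = comb_comp_pairs(10, 3)
for left, right in pairs:
    print("-".join(map(str, left)), "vs", "-".join(map(str, right)))
```

For 10 children and K=3 this yields 21 comparisons, matching the slide: 15 from node 1, 5 from node 2, and 1 from node 3.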

Comparing generalized nodes (6/7)

The algorithm does not search for data regions when the depth of the subtree rooted at a node is 1 or 2
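A minimal sketch of this depth check during a top-down traversal follows. The `Node` class, the function names, and the convention that a leaf has depth 1 are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def tree_depth(node):
    """Depth in levels; a node without children has depth 1 (assumed)."""
    return 1 + max((tree_depth(c) for c in node.children), default=0)


def nodes_to_search(node, min_depth=3):
    """Collect nodes whose children would be compared; skip shallow subtrees."""
    found = []
    if tree_depth(node) >= min_depth:
        found.append(node)
        for child in node.children:
            found.extend(nodes_to_search(child, min_depth))
    return found


table = Node("table", [Node("tbody", [Node("tr", [Node("td")]),
                                      Node("tr", [Node("td")])])])
print([n.tag for n in nodes_to_search(table)])
```

Here the tr and td subtrees have depth 2 and 1, so only the children of table and tbody would be compared.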

Comparing generalized nodes (7/7)

  • The total number of nodes in the tag tree is N

  • Without considering string comparison, the complexity of CombComp is O(NK)

  • Because K is relatively small, the CombComp algorithm is linear in N

Determining data regions (1/2)

  • Procedure FindDRs reports

    • the entire area as a data region

    • each row as a generalized node

    • The data region contains eight data records

Determining data regions (2/2)

  • Two main issues affect the final decisions

    • If a lower-level data region lies within a higher-level data region, we report the higher-level data region

    • Within a data region, we report only the smallest generalized nodes

Identifying data records (1/5)

  • Data region → Generalized node → Data record (object)

  • A generalized node might contain one or more data records

Identifying data records (2/5)

  • Noncontiguous object description

    • HTML code: Name 1, Name 2, Description 1, Description 2, Name 3, Name 4, Description 3, Description 4

Identifying data records (3/5)

  • Finding noncontiguous data records

    • Group the corresponding children of nodes 1 and 2

    • Join node 5 and node 7 to form one data record

    • Join node 6 and node 8 to form another
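The grouping step can be pictured with a small sketch. It assumes node 1 has children 5 and 6 (the names) and node 2 has children 7 and 8 (the descriptions), as the joins above suggest; the function name is hypothetical.

```python
def join_corresponding_children(*sibling_children):
    """Pair up the i-th child of each sibling node into one data record.

    Each argument is the ordered child list of one sibling node. All lists
    must have the same length for the positions to correspond.
    """
    if len({len(children) for children in sibling_children}) > 1:
        raise ValueError("sibling nodes must have the same number of children")
    return [tuple(group) for group in zip(*sibling_children)]


node1_children = [5, 6]  # e.g. Name 1, Name 2
node2_children = [7, 8]  # e.g. Description 1, Description 2
print(join_corresponding_children(node1_children, node2_children))
```

This yields the records (5, 7) and (6, 8), so each name is reunited with its own description even though the two are not adjacent in the HTML.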

Identifying data records (4/5)

  • Data records not in any data region

    • Rows 1, 2, and 3 are at the same level; rows 1 and 2 (two generalized nodes) form a data region

    • Object 5 is not covered by any data region

Identifying data records (5/5)

  • Finding object 5: an odd number of objects in a table and its HTML tag tree

  • Use object 4 (or any of the four objects) to match the tag string of each child of the sibling nodes of r1 and r2

Experiment results (1/2)

  • MDR was evaluated and compared with Omini and IEPAD

  • MDR was implemented and debugged using pages from the Amazon, Yahoo, and Hewlett-Packard Web sites

  • The default edit distance threshold is 0.3, with no tuning for new pages or Web sites

Experiment results (2/2)

  • Use standard precision and recall measures

  • Omini and IEPAD only work well with simple pages

    • Pages with many similar data records and little noise
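For reference, the standard precision and recall measures can be computed as in this sketch. The record identifiers and counts are made-up toy values, not results from the paper.

```python
def precision_recall(extracted, actual):
    """Precision = correct / extracted; recall = correct / actual."""
    extracted, actual = set(extracted), set(actual)
    correct = len(extracted & actual)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page: 5 true records; a system extracts 4 items, 3 of them correct.
found = {"r1", "r2", "r3", "noise"}
truth = {"r1", "r2", "r3", "r4", "r5"}
print(precision_recall(found, truth))
```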

Current & future work

  • Currently, two practical applications

    • Extracting consumer product reviews from online merchant sites

    • A more effective technique for extracting individual data fields from data records

  • Future work

    • Study the problem of extracting information from text documents that are much less structured than HTML
