
Paper 37 Mining Web Pages for Data Records (MDR)

Liu, Bing; Grossman, Robert; Yanhong Zhai

University of Illinois at Chicago

IEEE Intelligent Systems, 11

Volume: 19, Issue: 6, Pages: 49-55. 2004/11/1

Professors: 陳彥良, 許秉瑜

Presented by: 狄宇昌

2006 Data Mining

Outline
  • Introduction
  • Related work (MDR, Omini, IEPAD, Wrapper)
  • Mining data regions
    • Comparing generalized nodes (CombComp)
    • Determining data regions (FindDRs)
  • Identifying data records (FindRecords)
  • Experiment results
Introduction(1/5)
  • Extracting information from Web pages helps provide value-added services, such as:
    • Customizable Web information gathering
    • Comparative shopping
    • Metasearching
  • MDR (mining data records) exploits
    • Web page structure
    • A string-matching algorithm
    • to mine both contiguous and noncontiguous data records
Introduction(2/5)
  • Current approach 1: supervised learning
    • Requires substantial human effort
  • Current approach 2: automatic techniques, which perform poorly
  • They assume that the relevant items lie in a contiguous segment of the Web page
  • Few researchers have exploited the nested structure of HTML
Introduction(3/5)
  • MDR (mining data records)
    • An automatic technique that finds all data records formed by table- and form-related HTML tags
    • Such as table, form, tr, td, and so on
    • MDR outperforms other existing systems
Introduction(4/5)
  • MDR is based on two observations about Web page layout
  • Observation one:
    • Similar objects appear in a contiguous region of a page
    • Such data regions are formatted with similar HTML tags
Introduction(5/5)
  • Observation two:
    • A tag tree represents the nested structure of HTML tags in a Web page
    • The data records in a specific region are under one parent node
      • As in figure 1b, each notebook is wrapped in 5 tr nodes under the same parent node, tbody
Related work(1/2)
  • Researchers have developed several approaches for mining data records from Web pages
  • Omini (Object Mining and Extraction system)
    • Uses a set of heuristics and a manually constructed domain ontology
Related work(2/2)
  • IEPAD (Information Extraction based on Pattern Discovery)
    • An automatic method that uses sequence alignment to find patterns representing a set of data records
  • Wrapper induction
    • A wrapper is a program that extracts data from a Web site and puts it in a database
    • Wrapper induction learns extraction rules from manually labeled training examples
MDR Technique
  • Three steps:
  • Build an HTML tag tree of the page
  • Mine all data regions in the page by using the two observations and an edit distance string-comparison algorithm
  • Identify data records from each data region
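
The first step, building the tag tree, can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the authors' implementation: the `Node` class, the `VOID_TAGS` set, and the helper names `build_tag_tree` and `tag_string` are assumptions made for this example, and real pages would need more careful handling of malformed HTML.

```python
from html.parser import HTMLParser

# Assumption: a small set of tags that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One tag in the tag tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_string(node):
    """Flatten a subtree into its preorder sequence of tag names."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags


page = ("<table><tbody>"
        "<tr><td>A</td><td><img src='x'></td></tr>"
        "<tr><td>B</td><td>C</td></tr>"
        "</tbody></table>")
tbody = build_tag_tree(page).children[0].children[0]
rows = [tag_string(tr) for tr in tbody.children]
```

Here `rows` holds one tag sequence per `tr` node under the same `tbody` parent, which is the kind of input the string comparison in the second step works on.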
Mining data regions(1/2)
  • First, mine generalized nodes
  • A sequence of adjacent generalized nodes forms a data region

The node pairs (14, 15), (16, 17), and (18, 19) are generalized nodes of length 2

Nodes 5 and 6 are generalized nodes of length 1

Nodes 8, 9, and 10 are generalized nodes of length 1

Mining data regions(2/2)
  • A data region contains two or more generalized nodes with these properties:
    • They have the same parent
    • They have the same length (each consists of the same number of tag-tree nodes)
    • They are adjacent
    • The normalized edit distance between them is less than a fixed threshold
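
The last property can be illustrated with a small sketch of edit distance over tag sequences. The slides do not say how the distance is normalized, so dividing by the length of the longer sequence is an assumption here, as are the function names. The 0.3 default is the threshold reported in the experiment slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or tag lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    # Assumption: normalize by the longer sequence so the result lies in [0, 1].
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def similar(a, b, threshold=0.3):
    """True when two tag strings are close enough to be treated as similar."""
    return normalized_edit_distance(a, b) < threshold


row_a = ["tr", "td", "img", "td"]
row_b = ["tr", "td", "td"]
distance = normalized_edit_distance(row_a, row_b)  # one deletion over length 4
```

With these two rows the normalized distance is 0.25, so `similar(row_a, row_b)` holds under the 0.3 threshold.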
Comparing generalized nodes (1/7)
  • The mining algorithm must answer two questions:
    • Q1. Where does the first generalized node of a data region start?
    • Q2. How many tag nodes (components) are in the generalized nodes of each data region?
Comparing generalized nodes(2/7)
  • K: the maximum number of tag nodes in a generalized node. K is small (less than 10)
    • Answer 1: try to find a data region starting from each tag node in turn
    • Answer 2: try 1-node, 2-node, …, K-node combinations
Comparing generalized nodes(3/7)
  • The number of comparisons is not large, for two reasons:
    • Only the child nodes of the same parent node are compared. E.g., in figure 2 there is no need to compare node 8 with node 13
    • Some comparisons performed for earlier nodes are the same as those for later nodes, so there is no need to do them twice.
Comparing generalized nodes(4/7)
  • Figure 3 has 10 nodes below a parent node p.
  • A generalized node can have a maximum of three components (K = 3)
Comparing generalized nodes(5/7)
  • Starting from node 1, we compute these comparisons:
    • (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)
    • (1-2, 3-4), (3-4, 5-6), (5-6, 7-8), (7-8, 9-10)
    • (1-2-3, 4-5-6), (4-5-6, 7-8-9)
  • Starting from node 2, we compute only:
    • (2-3, 4-5), (4-5, 6-7), (6-7, 8-9)
    • (2-3-4, 5-6-7), (5-6-7, 8-9-10)
  • Starting from node 3, we need only one string comparison: (3-4-5, 6-7-8).
  • There is no need to start from any node after node 3 because K = 3.
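
The enumeration on this slide can be reproduced with a short sketch. It only lists which combinations would be compared; it does not run any string comparison. The function name `enumerate_comparisons` and the tuple representation are assumptions for illustration.

```python
def enumerate_comparisons(n, k_max):
    """List the comparisons among n sibling nodes (numbered 1..n).

    Each comparison is a pair of adjacent combinations, and each
    combination is a tuple of node numbers.
    """
    comparisons = []
    for start in range(1, k_max + 1):
        # Sizes smaller than `start` were already covered by an earlier
        # start node with the same alignment, so they are skipped.
        for size in range(start, k_max + 1):
            left = start
            # Two adjacent combinations of `size` nodes must both fit.
            while left + 2 * size - 1 <= n:
                first = tuple(range(left, left + size))
                second = tuple(range(left + size, left + 2 * size))
                comparisons.append((first, second))
                left += size
    return comparisons


pairs = enumerate_comparisons(10, 3)
for first, second in pairs:
    print("-".join(map(str, first)), "vs", "-".join(map(str, second)))
```

For 10 nodes and K = 3 this yields the 21 comparisons listed above: 15 starting from node 1, 5 from node 2, and 1 from node 3.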
Comparing generalized nodes(6/7)

The algorithm does not search for data regions below a node if the depth of the subtree rooted at that node is 1 or 2

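
A minimal sketch of this depth check follows. It assumes that a leaf has depth 1 and that the recursion stops at any node that fails the check. The nested-tuple tree format and the function names are hypothetical choices for this example.

```python
def tree_depth(node):
    """Depth of a (tag, children) subtree; a leaf counts as depth 1 (assumption)."""
    _tag, children = node
    return 1 + max((tree_depth(child) for child in children), default=0)


def nodes_to_search(node):
    """Return the tags of nodes whose children would be compared."""
    searched = []
    if tree_depth(node) >= 3:
        searched.append(node[0])  # the children of this node get compared
        for child in node[1]:
            searched.extend(nodes_to_search(child))
    return searched


tree = ("table", [
    ("tbody", [
        ("tr", [("td", []), ("td", [])]),
        ("tr", [("td", [])]),
    ]),
])
searched = nodes_to_search(tree)
```

In this small tree only `table` and `tbody` pass the check; each `tr` subtree has depth 2 and is skipped.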
Comparing generalized nodes(7/7)
  • Let N be the total number of nodes in the tag tree
  • Without counting the cost of string comparison, the complexity of CombComp is O(NK)
  • Because K is relatively small, the CombComp algorithm is linear in N
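
One way to see the O(NK) bound, offered as a sketch and not as the authors' own derivation: for a parent p with n_p children, each combination size j has at most j distinct start nodes, and each start yields fewer than n_p / j comparisons. Writing C(n_p) for the number of comparisons at p (a symbol introduced here):

```latex
\begin{align*}
C(n_p) &\le \sum_{j=1}^{K} j \cdot \frac{n_p}{j} = n_p K, \\
\sum_{p} C(n_p) &\le K \sum_{p} n_p \le K (N - 1) = O(NK).
\end{align*}
```

The second line uses the fact that every node except the root is a child of exactly one parent, so the child counts sum to at most N - 1.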
Determining data region(1/2)
  • Procedure FindDRs reports
    • the entire area as a data region
    • each row as a generalized node
    • eight data records in total
Determining data region(2/2)
  • Two main issues affect the final decisions
    • If a lower-level data region lies within a higher-level data region, we report the higher-level data region.
    • Within a data region, we report only the smallest generalized nodes
Identifying data records(1/5)
  • Data region → Generalized node → Data record (object)
  • A generalized node might contain one or more data records
Identifying data records(2/5)
  • Noncontiguous object description
    • Order in the HTML code: Name 1, Name 2, Description 1, Description 2, Name 3, Name 4, Description 3, Description 4
Identifying data records(3/5)
  • Finding noncontiguous data records
    • Group the corresponding children of nodes 1 and 2
    • Join node 5 and node 7 to form one data record
    • Join node 6 and node 8 to form another
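
The grouping step can be sketched as a positional join of the children of each component. This is an illustration under the assumption that every component of the generalized node has the same number of children. The function name and the string labels are hypothetical.

```python
def join_noncontiguous(component_children):
    """Group the i-th child of every component into one data record.

    component_children: one list of children per component of a
    generalized node, e.g. a row of names and a row of descriptions.
    """
    return [list(group) for group in zip(*component_children)]


names = ["Name 1", "Name 2"]                       # children of the first component
descriptions = ["Description 1", "Description 2"]  # children of the second
records = join_noncontiguous([names, descriptions])
```

Here `records` pairs each name with its description, so the two noncontiguous pieces of each object end up in one record.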
Identifying data records(4/5)
  • A data record may not be in any data region
    • Rows 1, 2, and 3 are at the same level; rows 1 and 2 (two generalized nodes) form a data region
    • Object 5 is not covered by a data region
Identifying data records(5/5)
  • Finding object 5, the leftover object when a table holds an odd number of objects (table and HTML tag tree)
  • Use object 4 (or any of the four objects) to match the tag string of each child of the sibling nodes of r1 and r2
Experiment result (1/2)
  • Evaluate MDR and compare it with Omini and IEPAD
  • MDR was implemented and debugged using pages from the Amazon, Yahoo, and Hewlett-Packard Web sites
  • The default edit distance threshold is 0.3, with no tuning for new pages or Web sites
Experiment result (2/2)
  • Use standard precision and recall measures
  • Omini and IEPAD only work well with simple pages
    • Pages with many similar data records and little noise
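
For reference, the standard precision and recall measures can be computed as in the sketch below. The record identifiers are made up for illustration and are not results from the paper.

```python
def precision_recall(extracted, actual):
    """Precision and recall of extracted records against the true records."""
    extracted, actual = set(extracted), set(actual)
    correct = len(extracted & actual)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page: five true records; the extractor returns four, one of them wrong.
truth = {"r1", "r2", "r3", "r4", "r5"}
found = {"r1", "r2", "r3", "noise"}
precision, recall = precision_recall(found, truth)
```

Precision is the share of extracted records that are correct, and recall is the share of true records that were extracted.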
Current & future work
  • Currently, two practical applications
    • Extracting consumer product reviews from online merchant sites
    • A more effective technique for extracting individual data fields from data records
  • Future work
    • Study the problem of extracting information from text documents, which are much less structured than HTML pages