
Structured Data Extraction From Web Based on Partial Tree Alignment

by Yanhong Zhai and Bing Liu

Introduction
  • A large amount of information on the Web is contained in regularly structured data objects,
    • which are data records retrieved from databases.
  • Such Web data records are important because
    • they often present the essential information of their host pages, e.g., lists of products and services.
  • Applications: integrated and value-added services,
    • e.g., comparative shopping, meta-search and query.

Existing Methods
  • Wrapper programming languages
    • This approach provides specialized languages that facilitate the construction of data extraction programs.
  • Wrapper induction
    • This approach uses machine learning techniques to learn data extraction rules from a set of manually labeled examples.
  • Automatic extraction
    • This approach is based on the idea of automatic pattern discovery.

Proposed Method
  • DEPTA (Data Extraction based on Partial Tree Alignment)
  • This method consists of two steps:
    1) Identifying individual records in a page.
    2) Aligning and extracting data items from the identified records.

Architecture of DEPTA System

Input: a web page → DOM Tree Builder → Data Region Identifier → Data Records Identifier → Data Items Extractor → Output: data tables

Data Record Identification
  • MDR: Mining Data Records
  • Given a single page with multiple data records, MDR extracts the data records but not the data items (step 1).
  • MDR is based on
    • two observations about data records in a Web page and
    • a tree matching algorithm.
  • It considers both
    • contiguous and
    • non-contiguous records.

Two Observations
  • A group of data records is presented
    • in a contiguous region (a data region) of a page and
    • is formatted using similar HTML tags.
  • A set of similar data records is formed by some child subtrees of the same parent node.

DOM tree of the previous page:

[Figure: DOM tree rooted at TABLE with a TBODY child, TR rows, and TD cells; the subtrees that make up data record 1 and data record 2 are marked.]

The Approach
  • Given a page, the approach proceeds in three steps:
    • building the DOM tree based on visual information,
    • mining data regions, and
    • identifying data records.
  • Rendering (or visual) information is very useful in the whole process.

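Building DOM Trees Based on Visual Information

Each HTML element is paired with the boundary coordinates of its rendered box. Note that the first <tr> is never closed, yet the coordinates still show which cells belong to which row:

HTML element        left  right  top  bottom
<table>              100    300  200     400
<tr>                 100    300  200     300
<td>data1</td>       100    200  200     300
<td>data2</td>       200    300  200     300
<tr>                 100    300  300     400
<td>data3</td>       100    200  300     400
<td>data4</td>       200    300  300     400
</tr>
</table>

[Figure: the resulting DOM tree, with table as the root, two tr children, and two td cells under each tr.]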
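
The idea of nesting elements by their rendered boxes can be sketched as follows. This is a minimal illustration, not the DEPTA implementation: the `Box` class, `build_tree`, and `show` are hypothetical names, and the coordinates are assumed values for a 2×2 table laid out like the one on this slide. An element becomes a child of the nearest preceding element whose box contains it, so a missing closing tag does not matter.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rendered element: tag name plus boundary coordinates (assumed values)."""
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: list = field(default_factory=list)

    def contains(self, other: "Box") -> bool:
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def build_tree(boxes):
    """Nest boxes, given in document order, by visual containment."""
    root = boxes[0]
    stack = [root]  # chain of currently open ancestors
    for box in boxes[1:]:
        # Close every ancestor that does not visually contain the new box.
        while len(stack) > 1 and not stack[-1].contains(box):
            stack.pop()
        stack[-1].children.append(box)
        stack.append(box)
    return root


def show(node):
    """Render the tree as tag(child child ...)."""
    if not node.children:
        return node.tag
    return node.tag + "(" + " ".join(show(c) for c in node.children) + ")"


boxes = [
    Box("table", 100, 300, 200, 400),
    Box("tr", 100, 300, 200, 300),
    Box("td", 100, 200, 200, 300),
    Box("td", 200, 300, 200, 300),
    Box("tr", 100, 300, 300, 400),
    Box("td", 100, 200, 300, 400),
    Box("td", 200, 300, 300, 400),
]
tree = build_tree(boxes)
print(show(tree))  # table(tr(td td) tr(td td))
```

One simplification to keep in mind: two consecutive elements with identical boxes would be nested rather than treated as siblings, so a real system would need an extra rule for that case.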

Enhanced Simple Tree Matching

[Figure: pairs of trees T1 and T2, both rooted at p, with child tags a, b, c, g over data items <data1> to <data4>; a wrong alignment and the correct alignment are marked.]

Two trees can have more than one possible match.

Alignment using tags only can produce wrong alignments.
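
For reference, the basic Simple Tree Matching algorithm that this slide builds on can be sketched as below. It matches by tag only, which is exactly the situation where wrong alignments can arise; the enhancements the slide title refers to are not reproduced here. The `(tag, children)` tuple representation and the normalization in `similarity` are assumptions made for this sketch.

```python
def simple_tree_matching(a, b):
    """Size of the maximum matching between two ordered trees.

    A tree is a (tag, children) tuple. Roots must carry the same tag;
    children are matched in order with a dynamic program.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # M[i][j]: best matching between the first i children of a
    # and the first j children of b.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i][j - 1],
                M[i - 1][j],
                M[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return M[m][n] + 1  # +1 for the matched roots


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(c) for c in tree[1])


def similarity(a, b):
    """Match size normalized by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


t1 = ("p", [("a", []), ("b", []), ("e", [])])
t2 = ("p", [("b", []), ("c", []), ("d", []), ("e", [])])
print(simple_tree_matching(t1, t2))  # 3: p, b and e are matched
```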

Mining Data Regions
  • Find every data region with similar data records.

Definition: A generalized node (or a node combination) of length r consists of r (r ≥ 1) nodes in the HTML tag tree with the following two properties:
  1. The nodes all have the same parent.
  2. The nodes are adjacent.

Definition: A data region is a collection of two or more generalized nodes with the following properties:
  1. The generalized nodes all have the same parent.
  2. The generalized nodes are all adjacent.
  3. Adjacent generalized nodes are similar.

Determining Data Regions
  • To find each data region, the algorithm needs to answer two questions:
    1. Where does the first generalized node of the data region start?
      • Try starting from each child node under a parent.
    2. How many tag nodes or components does a generalized node have?
      • Try one-node, two-node, ..., K-node combinations.
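
The two questions above suggest a search over start positions and combination sizes. The following is a simplified greedy sketch under stated assumptions, not the actual MDR procedure: each child subtree is flattened to a tuple of tags, similarity is measured with Python's `difflib.SequenceMatcher` as a stand-in for the real similarity measure, and `find_regions`, `max_k`, and `threshold` are hypothetical names and parameters.

```python
from difflib import SequenceMatcher


def flatten(nodes):
    """Concatenate the tag sequences of the nodes in one generalized node."""
    return [tag for node in nodes for tag in node]


def find_regions(children, max_k=3, threshold=0.9):
    """Return (start, k, count) triples for the children of one parent.

    start: index of the first child in the region
    k:     number of nodes per generalized node
    count: number of adjacent, similar generalized nodes (at least 2)
    """
    regions = []
    start = 0
    while start < len(children):
        best = None
        for k in range(1, max_k + 1):  # one-node, two-node, ..., K-node
            count, pos = 1, start
            while pos + 2 * k <= len(children):
                cur = flatten(children[pos:pos + k])
                nxt = flatten(children[pos + k:pos + 2 * k])
                if SequenceMatcher(None, cur, nxt).ratio() < threshold:
                    break
                count += 1
                pos += k
            # Prefer the combination size that covers the most children.
            if count >= 2 and (best is None or k * count > best[1] * best[2]):
                best = (start, k, count)
        if best:
            regions.append(best)
            start += best[1] * best[2]
        else:
            start += 1
    return regions


# Assumed example: a heading, three records of two rows each, and a rule.
children = [
    ("h1",),
    ("tr", "th"), ("tr", "td", "td", "p"),
    ("tr", "th"), ("tr", "td", "td", "p"),
    ("tr", "th"), ("tr", "td", "td", "p"),
    ("hr",),
]
print(find_regions(children))  # [(1, 2, 3)]
```

Here the region starts at the second child and consists of three generalized nodes of two rows each; single rows are not similar to their neighbours, so the one-node combination is rejected.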

An illustration of generalized nodes and data regions

[Figure: a tag tree with nodes numbered 1 to 19. Shaded nodes are generalized nodes, and they are grouped into three data regions: Region 1, Region 2, and Region 3.]

Identifying Data Records
  • A generalized node may not be a data record.
  • Extra mechanisms are needed to identify true atomic objects.
  • Some highlights: both contiguous and non-contiguous data records are handled.

[Figure: two layouts of four objects. Contiguous: each name is immediately followed by its description (Name1, Description of object 1, Name2, Description of object 2, and so on). Non-contiguous: a row of names is followed by a row of the matching descriptions (Name1, Name2, then Description of object 1, Description of object 2; Name3, Name4, then Description of object 3, Description of object 4).]
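
For the non-contiguous layout, one way to picture the regrouping is to treat each block of rows as a generalized node and read the records column by column. This is only an illustrative sketch: the function name `split_noncontiguous` and the `rows_per_group` parameter are hypothetical, and deciding which layout a page actually uses is not covered.

```python
def split_noncontiguous(rows, rows_per_group=2):
    """Regroup row-wise cells into one list per object.

    Assumes each group of `rows_per_group` rows describes the same
    objects, one object per column.
    """
    records = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        # Column i of every row in the group belongs to the same object.
        records.extend(list(column) for column in zip(*group))
    return records


rows = [
    ["Name1", "Name2"],
    ["Description of object 1", "Description of object 2"],
    ["Name3", "Name4"],
    ["Description of object 3", "Description of object 4"],
]
for record in split_noncontiguous(rows):
    print(record)
```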

DEPTA: Extract Data from Data Records
  • Once a list of data records is identified, we can align and extract the data items in them.
  • Multiple tree alignment:
  • We need multiple alignment because we have multiple data records.
  • Most multiple alignment methods work like hierarchical clustering and require n² pairwise matchings.
    • Too expensive
  • Optimal alignment/matching takes exponential time.
  • A partial tree alignment algorithm is proposed in DEPTA to perform multiple tree alignment.

The Partial Tree Alignment Approach
  • Choose a seed tree: the seed tree, denoted by Ts, is the tree with the maximum number of data items.
  • Tree matching:
  • For each unmatched tree Ti (i ≠ s),
    • Match Ts and Ti.
    • Each pair of matched nodes is linked (aligned).
    • For each unmatched node nj in Ti,
      • expand Ts by inserting nj into Ts if a position for the insertion can be uniquely determined in Ts.
  • The expanded seed tree Ts is then used in subsequent matching.
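
Illustration of partial tree alignment

[Figure: two cases, written as root(children).
Insertion is possible: Ts = p(a, b, e) and Ti = p(b, c, d, e). Nodes c and d lie between the matched nodes b and e, which are adjacent in Ts, so they are inserted and the new Ts is p(a, b, c, d, e).
Insertion is not possible: Ts = p(a, b, e) and Ti = p(a, x, e). Node x could go either between a and b or between b and e, so its position in Ts is not unique.]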

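A complete example

[Figure: partial tree alignment of three trees, written as root(children).
Start: Ts = T1 = p(…, x, b, d), T2 = p(b, n, c, k, g), T3 = p(b, c, d, h, k).
Match T2 against Ts: no node inserted, so Ts stays p(…, x, b, d).
Match T3 against Ts: c, h and k inserted, so the new Ts is p(…, x, b, c, d, h, k).
T2 is matched again: the final Ts is p(…, x, b, n, c, d, h, k, g).]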
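
The example can be replayed with a one-level sketch of partial tree alignment. Several simplifications are assumed here: each tree is reduced to the list of its root's child labels, labels are unique within a tree, matching is done with a longest common subsequence instead of full tree matching, and the seed is passed in explicitly (the elided "…" nodes of T1 are left out). The function names are hypothetical.

```python
def lcs_pairs(seed, tree):
    """Index pairs (i, j) of a longest common subsequence of two label lists."""
    m, n = len(seed), len(tree)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if seed[i] == tree[j]:
                L[i][j] = L[i + 1][j + 1] + 1
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if seed[i] == tree[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def expand_seed(seed, tree):
    """Insert unmatched labels of tree into seed where the position is unique.

    Returns (new_seed, complete); complete is False if some label of tree
    could not be placed.
    """
    # Sentinels make "before the first" and "after the last" ordinary gaps.
    bounds = [(-1, -1)] + lcs_pairs(seed, tree) + [(len(seed), len(tree))]
    new_seed, complete = [], True
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        gap = seed[i0 + 1:i1]  # seed nodes between two matched neighbours
        run = tree[j0 + 1:j1]  # unmatched tree nodes in the same span
        if run and not gap:
            new_seed.extend(run)  # neighbours are adjacent: position is unique
        else:
            new_seed.extend(gap)
            complete = complete and not run
        if i1 < len(seed):
            new_seed.append(seed[i1])
    return new_seed, complete


def partial_tree_alignment(seed, others):
    """Grow the seed; trees with unplaced nodes are matched again later."""
    seed, pending = list(seed), list(others)
    while pending:
        retry, progressed = [], False
        for tree in pending:
            new_seed, complete = expand_seed(seed, tree)
            progressed = progressed or len(new_seed) > len(seed)
            seed = new_seed
            if not complete:
                retry.append(tree)
        if not progressed:
            break
        pending = retry
    return seed


T1 = ["x", "b", "d"]
T2 = ["b", "n", "c", "k", "g"]
T3 = ["b", "c", "d", "h", "k"]
print(partial_tree_alignment(T1, [T2, T3]))
# ['x', 'b', 'n', 'c', 'd', 'h', 'k', 'g']
```

As in the example, T2 contributes nothing on its first pass, T3 adds c, h and k, and T2 is then matched again against the expanded seed.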

Output data table

      …    x    b    n    c    d    h    k    g
T1    …    1    1              1
T2              1    1    1              1    1
T3              1         1    1    1    1

  • The final tree may also be used to match and extract data from other similar pages.
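
Once the final seed tree is known, filling the table amounts to placing each record's items under the matching columns. A minimal sketch, assuming the flat label lists used for the example and a hypothetical `to_table` helper; 1 marks a present item and 0 an empty cell, and the elided "…" column is omitted.

```python
def to_table(columns, records):
    """Map each record name to a row of 0/1 flags, one flag per column."""
    return {
        name: [1 if label in labels else 0 for label in columns]
        for name, labels in records.items()
    }


columns = ["x", "b", "n", "c", "d", "h", "k", "g"]  # final seed tree
records = {
    "T1": ["x", "b", "d"],
    "T2": ["b", "n", "c", "k", "g"],
    "T3": ["b", "c", "d", "h", "k"],
}
table = to_table(columns, records)
for name, row in table.items():
    print(name, row)
```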

Conclusion
  • Existing techniques are either inaccurate or make several assumptions.
  • Our method does not make these assumptions.
  • Our technique consists of two steps:
    • identifying data records, and
    • aligning corresponding data items from multiple data records.
  • Step 1 is based on visual cues.
  • Step 2 is based on partial tree alignment.