
IEPAD: Information Extraction based on Pattern Discovery

IEPAD: Information Extraction based on Pattern Discovery. Chia-Hui Chang National Central University, Taiwan http://www.csie.ncu.edu.tw/~chia. Outline. Introduction Problem definition Related Work System architecture Extraction rule generation Experiments Summary and future work.



  1. IEPAD: Information Extraction based on Pattern Discovery Chia-Hui Chang National Central University, Taiwan http://www.csie.ncu.edu.tw/~chia

  2. Outline • Introduction • Problem definition • Related Work • System architecture • Extraction rule generation • Experiments • Summary and future work

  3. Introduction • Web information integration • multi-search engines, e.g. Metacrawler • shopping agents • etc. • Common tasks • Data collection • Information extraction

  4. Information Extraction • Information Extraction (IE) • Input: HTML pages • Output: A set of records

  5. Related Work • Extractor Generation • Hand-coded wrappers by observation • Machine learning based approach • WIEN (Kushmerick), 1997 • SoftMealy (Hsu), 1998 • STALKER (Muslea), 1999 • Fully automatic approach • Embley et al., 1999 • Chang et al., 2000

  6. System Architecture • Components: Rule Generator, Pattern Viewer, Extractor • [Diagram: HTML page → Rule Generator → patterns → Pattern Viewer (users) → extraction rule → Extractor, applied to HTML pages → extraction results]

  7. Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

  8. The Rule Generator • Translator • PAT tree construction • Pattern validator • Rule composer • [Diagram: HTML page → Token Translator → token string → PAT Tree Constructor → PAT trees and maximal repeats → Validator → advanced patterns → Rule Composer → extraction rules]

  9. 1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
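The two encoding rules above can be sketched in a few lines of Python. This is a minimal illustration, not the IEPAD translator itself: the function name `encode`, the regular expressions, and the choice to ignore whitespace-only text (which matches the slide's example, where the space between `<BR>` and `<B>` produces no TEXT token) are assumptions.

```python
import re

# Hypothetical helper patterns: any "<...>" run is a tag; the tag name is its first word.
TAG_RE = re.compile(r"<[^>]+>")
NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")

def encode(html):
    """Rule 1: one token per tag. Rule 2: one '_' token per run of text between tags."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Assumption: whitespace-only text is not emitted as a TEXT token.
        if html[pos:match.start()].strip():
            tokens.append("_")
        name = NAME_RE.match(match.group(0))
        if name:  # skips comments and declarations, which have no tag name
            tokens.append("<" + name.group(1) + name.group(2).upper() + ">")
        pos = match.end()
    if html[pos:].strip():
        tokens.append("_")
    return tokens

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
tokens = encode("<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>")
print(" ".join(tokens))
```

Attributes are dropped and tag names are upper-cased so that two records with different attribute values still map to the same token sequence.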

  10. Various Encoding Schemes

  11. Example of BL Encoding • Encoding scheme = Block-Level Tags • Rule 1': Only block-level tags are considered; each such tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML example: <dl><dt><b>1.</b> <b><a ...>MGI 2.4 - Mouse <em>Genome</em> … </a> <dd>The Mouse <b>Genome</b> Informatics (MGI) ..<br> <span>URL:www.informatics.jax.org/ </span><br> <a ...> …</a><a ...>…</a><img src=…><a ...>…</a> Facts about:<a> …</a></dl> • Encoded token string: <dl> <dt> _ <dd> _ <br> _ <br> _ </dl> 1 5 9 64 68

  12. 2. PAT Tree Construction • Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) = 000, T(</B>) = 001, T(<I>) = 010, T(</I>) = 011, T(<BR>) = 100, T(_) = 110 • Binary-encoded string: 000110001010110011100 000110001010110011100
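The binary encoding on this slide can be reproduced with a short Python sketch. It covers only the encoding and the enumeration of suffixes; it does not build the Patricia tree. The code table is taken from the slide; the names `to_bits` and `suffixes`, and the assumption that suffixes start only at token boundaries, are illustrative.

```python
# Token codes as listed on the slide.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_bits(tokens):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(CODES[t] for t in tokens)

def suffixes(bits, width=3):
    """Assumption: only suffixes that begin on a token boundary are indexed."""
    return [bits[i:] for i in range(0, len(bits), width)]

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
bits = to_bits(record + record)
print(bits[:21], bits[21:])
print(len(suffixes(bits)), "suffixes would be inserted into the tree")
```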

  13. The Constructed PAT Tree

  14. Definition of Maximal Repeats • Let α occur in S at positions p1, p2, p3, …, pk • α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1] • α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|] • α is a maximal repeat if it is both left maximal and right maximal

  15. Finding Maximal Repeats • Definition: • Call the character S[pi-1] the left character of suffix pi • A node is left diverse if at least two leaves in its subtree have different left characters • Lemma: • The path label of an internal node in a PAT tree is a maximal repeat if and only if the node is left diverse
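The definition of a maximal repeat can be checked directly with a brute-force Python sketch. It deliberately does not use a PAT tree: it enumerates every substring, so it is only suitable for short strings. The function name `maximal_repeats` is hypothetical, and treating the start and end of the string as a distinct boundary character (`None`) is an assumption.

```python
from collections import defaultdict

def maximal_repeats(s):
    """Return {substring (as a tuple): positions} for every maximal repeat of s."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[tuple(s[i:j])].append(i)
    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue  # not a repeat
        # Assumption: the string boundary counts as its own character (None).
        left = {s[p - 1] if p > 0 else None for p in positions}
        right = {s[p + len(sub)] if p + len(sub) < n else None for p in positions}
        if len(left) > 1 and len(right) > 1:  # left maximal and right maximal
            result[sub] = positions
    return result

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[("a", "d", "c")])
```

Because substrings are stored as tuples, the same function accepts a list of tag tokens as well as a plain string.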

  16. 3. Pattern Validator • Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence. • Characteristics of a pattern • Regularity: variance coefficient • Adjacency: density

  17. Pattern Validator (Cont.) • Basic screening: for each maximal repeat α, compute V(α) and D(α) • a) Check the pattern's variance: V(α) < 0.5 • b) Check the pattern's density: 0.25 < D(α) < 1.5 • [Flowchart: pattern α → V(α) < 0.5? (No: discard) → 0.25 < D(α) < 1.5? (No: discard) → pattern α is kept]
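The screening step can be sketched in Python under stated assumptions. The slides give only the thresholds, not the formulas, so the definitions here are assumptions: V is taken as the standard deviation of the gaps between adjacent occurrences divided by their mean, and D as (k-1)·|α| / (pk - p1). All function names are hypothetical.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """Assumed V: population std. dev. of adjacent gaps divided by the mean gap."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, length):
    """Assumed D: (k - 1) * |pattern| / (p_k - p_1)."""
    return (len(positions) - 1) * length / (positions[-1] - positions[0])

def passes_screening(positions, length):
    """Apply the slide's thresholds; positions must be sorted and distinct."""
    if len(positions) < 2:
        return False
    return (variance_coefficient(positions) < 0.5
            and 0.25 < density(positions, length) < 1.5)

# "adc" occurs at 0, 6, 11, 17 in "adcwbdadcxbadcxbdadcb" and has length 3.
print(passes_screening([0, 6, 11, 17], 3))
```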

  18. 4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern • [Flowchart: pattern α and its occurrences → V(α) < 0.5? (No: occurrence partition, then V(α) < 0.1? No: discard) → 0.25 < D(α) < 1.5? (No: discard) → D(α) < 1? (Yes: multiple string alignment, giving α'; No: α)]

  19. Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos, Excite with large regularity • Solution • Clustering of the occurrences of such a pattern • [Flowchart: pattern P → clustering → V(P) < 0.1? (No: discard; Yes: check density)]

  20. Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • A natural generalization of alignment for two strings which can be solved in O(n*m) by dynamic programming where n and m are string lengths.
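The two-string case mentioned above can be sketched as a standard O(n*m) dynamic program. This is a generic global alignment with unit costs for mismatch and gap, offered as an illustration; the cost scheme, the function name `align`, and the tie-breaking order in the traceback are assumptions, not details given in the slides.

```python
def align(x, y, gap="-"):
    """Globally align sequences x and y; return the two gapped rows as lists."""
    n, m = len(x), len(y)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # match or mismatch
                             cost[i - 1][j] + 1,                           # gap in y
                             cost[i][j - 1] + 1)                           # gap in x
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_x.append(x[i - 1]); row_y.append(gap); i -= 1
        else:
            row_x.append(gap); row_y.append(y[j - 1]); j -= 1
    return row_x[::-1], row_y[::-1]

top, bottom = align("adcwbd", "adcxb")
print("".join(top))
print("".join(bottom))
```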

  21. Multiple String Alignment (Cont.) • Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb" • If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd":
        a d c w b d
        a d c x b -
        a d c x b d
      • The extraction pattern can be generalized as "adc[w|x]b[d|-]"
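The generalization step in this example can be sketched by reading the alignment column by column. The function name `generalize` and the convention of listing the gap symbol last inside a bracket are assumptions chosen to reproduce the slide's notation.

```python
def generalize(rows, gap="-"):
    """Merge aligned rows of equal length into one pattern such as adc[w|x]b[d|-]."""
    parts = []
    for column in zip(*rows):
        symbols = []
        for symbol in column:
            if symbol not in symbols:  # keep first-seen order, drop duplicates
                symbols.append(symbol)
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            symbols.sort(key=lambda s: s == gap)  # stable sort: gap symbol goes last
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

pattern = generalize(["adcwbd", "adcxb-", "adcxbd"])
print(pattern)
```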

  22. Pattern Viewer • Java-application based GUI • Web based GUI • http://140.115.155.102/WebIEPAD/

  23. The Extractor • Matching the pattern against the encoding token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • matching the longest pattern • What are extracted? • The whole record
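Matching a generalized rule against the token string can be sketched with Python's `re` module. This swaps in the built-in regular-expression engine in place of the Knuth-Morris-Pratt or Boyer-Moore matchers named on the slide. The rule syntax follows the "adc[w|x]b[d|-]" example from the alignment slide; reading "-" as "no token", assuming one character per token, and the names `rule_to_regex` and `extract` are all assumptions.

```python
import re

def rule_to_regex(rule):
    """Translate a rule such as adc[w|x]b[d|-] into a compiled regular expression."""
    parts = []
    for group, literal in re.findall(r"\[([^\]]*)\]|([^\[\]])", rule):
        if literal:
            parts.append(re.escape(literal))
        else:
            # Assumption: "-" inside a bracket means the token may be absent.
            alternatives = ["" if a == "-" else re.escape(a) for a in group.split("|")]
            # Longest alternative first, so the longest pattern is preferred.
            alternatives.sort(key=len, reverse=True)
            parts.append("(?:" + "|".join(alternatives) + ")")
    return re.compile("".join(parts))

def extract(rule, token_string):
    """Return (start position, matched record) for each non-overlapping match."""
    return [(m.start(), m.group(0)) for m in rule_to_regex(rule).finditer(token_string)]

records = extract("adc[w|x]b[d|-]", "adcwbdadcxbadcxbdadcb")
print(records)
```

The trailing "adcb" in the token string is not extracted because the rule requires a "w" or "x" token after "adc".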

  24. Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control

  25. # of Patterns Discovered Using BlockLevel Encoding • Average 117 maximal repeats in our test Web pages

  26. Translation • Average page length is 22.7KB

  27. Accuracy and Retrieval Rate

  28. Accuracy and Retrieval Rate

  29. Summary • IEPAD: Information Extraction based on Pattern Discovery • Rule generator • The extractor • Pattern viewer • Performance • 97% retrieval rate and 94% accuracy rate

  30. Problems • Guarantees a high retrieval rate rather than a high accuracy rate • A generalized rule can extract more than the desired data • Currently applicable only when a Web page contains several records

  31. Final • Acknowledgement • We would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code to us. • Reference • Chang, C.H. and Lui, S.C. IEPAD: Information Extraction based on Pattern Discovery, WWW10, May 2001, Hong Kong.

  32. Future Work • Interface for choosing a pattern • http://www.csie.ncu.edu.tw/~chia/webiepad/ • Multi-level extraction • From record boundary extraction to attribute value extraction • Extractors in Java and C++

  33. Rule Format (k blocks, t attributes)
        level 1 encoding scheme: rule
        level 2 encoding scheme: rule for block 1
        level 2 encoding scheme: rule for block 2
        ...
        level 2 encoding scheme: rule for block k
        level 1 block 1, level 2 block no for attribute 1
        level 1 block 1, level 2 block no for attribute 2
        ...
        level 1 block 1, level 2 block no for attribute t

  34. Example (cont.)
        Line 0: Blocklevel.h, <DL><DT>String<DD>String<BR>String<BR>String<BR>String</DD></DL>
        Line 1: Alltag.h, rule for block 1
        Line 2: Alltag.h, rule for block 2
        ...
        Line k: Alltag.h, rule for block k
        Line k+1: level 1 block no, level 2 block no for attribute 1
        Line k+2: level 1 block no, level 2 block no for attribute 2
        ...
        Line k+t: level 1 block no, level 2 block no for attribute t
      • Demo • ex: 3, 2 • ex: 5, all • ex: 5, 1 3

  35. Congo Example

  36. Performance Evaluation • Definition: • A pattern is said to enumerate a record if the overlapping percentage between the record and the pattern is greater than a given threshold • Three Measures • Retrieval Rate • Accuracy Rate • Matching Percentage

  37. Illustration • Let Gi,j denote the ordered occurrences pi, pi+1, ..., pj
        S = ∅; i = 1;
        for j = 1 to k-1 do
          if R(Gi,j+1) exceeds the threshold then
            if R(Gi,j) < m then S = S ∪ {Gi,j}; endif
            i = j+1;
          endif
        endfor
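The partition loop on this slide can be sketched in Python under stated assumptions. The slide does not define R or give threshold values, so R is assumed here to be the coefficient of variation of the gaps between adjacent occurrences, and the values 0.5 (split threshold) and 0.1 (the bound m) are placeholders. The names `regularity` and `partition` are hypothetical.

```python
from statistics import mean, pstdev

def regularity(positions):
    """Assumed R: std. dev. of adjacent gaps over the mean gap (0 with fewer than 3 points)."""
    if len(positions) < 3:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def partition(p, split_threshold, m, R=regularity):
    """Follow the slide's loop; i and j are 1-based, so G(i, j) is p[i-1:j]."""
    S = []
    i = 1
    for j in range(1, len(p)):
        if R(p[i - 1:j + 1]) > split_threshold:  # adding p[j+1] breaks regularity
            if R(p[i - 1:j]) < m:                # keep the group only if it is regular
                S.append(p[i - 1:j])
            i = j + 1                            # start a new group at the next occurrence
    return S

groups = partition([0, 5, 10, 15, 100, 105, 110, 115], split_threshold=0.5, m=0.1)
print(groups)
```

As transcribed, the loop never emits the final group G(i, k), so a caller that needs the last block would have to test it separately after the loop.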
