
ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. VLDB 2001.


Presentation Transcript


  1. ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. VLDB 2001

  2. Overview
     • Automatically generates a wrapper from large structured Web pages
     • Supports nested structures
     • Efficient approach to large, complex pages with regular structures

  3. Approach
     • Given a set of example pages
     • Generate a Union-free Regular Expression (UFRE)
     • Find the least upper bound on the RE lattice to generate a wrapper
     • Reduces to finding the least upper bound of two UFREs
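As a rough illustration of what a union-free wrapper looks like in practice, the sketch below writes one as a Python regular expression that uses only concatenation, optionals `(...)?` and iterators `(...)+`, with no `|` alternation. The page text, the field names and the `[^<]+` stand-in for #PCDATA are assumptions made for this example; they are not taken from the slides.

```python
import re

# Hypothetical wrapper: one author field, an optional image, a list of titles.
# [^<]+ stands in for #PCDATA; no "|" alternation is used anywhere.
WRAPPER = re.compile(
    r"<html>Books of:<b>([^<]+)</b>"
    r"(?:<img src='[^']*'/>)?"
    r"<ul>((?:<li><i>Title:</i>[^<]+</li>)+)</ul></html>"
)

def extract(page):
    """Return (author, [titles]) if the page matches the wrapper, else None."""
    m = WRAPPER.fullmatch(page)
    if m is None:
        return None
    titles = re.findall(r"<li><i>Title:</i>([^<]+)</li>", m.group(2))
    return m.group(1), titles

# Made-up sample page with no image and two titles.
page = ("<html>Books of:<b>John Smith</b><ul>"
        "<li><i>Title:</i>DB Primer</li>"
        "<li><i>Title:</i>Comp. Sys.</li></ul></html>")
result = extract(page)
```

The same wrapper also accepts a page that carries the image and a different number of titles, which is the point of generalizing with optionals and iterators.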

  4. Matching/Mismatching
     • Start with the first page and create a RE that defines the wrapper
     • Match each successive sample against the wrapper
     • Mismatches result in generalizations of the regular expression
     • Types of mismatches
         • String mismatches
         • Tag mismatches
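The distinction between string and tag mismatches can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the function names and the sample pages are all assumptions, and a page that simply ends early is reported as a tag mismatch for simplicity.

```python
import re

# A token is either a whole tag or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split an HTML fragment into tag tokens and non-empty string tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]

def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def first_mismatch(wrapper_tokens, sample_tokens):
    """Return (index, kind) of the first differing token, or None.

    kind is "string" when both tokens are text, otherwise "tag".
    """
    for i, (w, s) in enumerate(zip(wrapper_tokens, sample_tokens)):
        if w != s:
            both_text = not is_tag(w) and not is_tag(s)
            return i, "string" if both_text else "tag"
    if len(wrapper_tokens) != len(sample_tokens):
        # Simplification: one input ended early; treat it as a tag mismatch.
        return min(len(wrapper_tokens), len(sample_tokens)), "tag"
    return None

# Made-up pages: the first page serves as the initial wrapper.
wrapper = tokenize("<html>Books of:<b>John Smith</b></html>")
sample_a = tokenize("<html>Books of:<b>Paul Jones</b></html>")
sample_b = tokenize("<html>Books of:<b>John Smith</b><img src='x.gif'/></html>")
```

Here `first_mismatch(wrapper, sample_a)` reports a string mismatch at the author name, while `first_mismatch(wrapper, sample_b)` reports a tag mismatch where the extra image appears.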

  5. Example Pages

  6. Example
     • String mismatches are used to discover fields of the documents
     • Wrapper is generated by replacing “John Smith” with #PCDATA
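The field-discovery step on this slide can be sketched on token lists. The function name and the second author name are assumptions for illustration, and the sketch only handles the case where the two pages have the same tag structure.

```python
def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def generalize_strings(wrapper_tokens, sample_tokens):
    """Replace every string mismatch with #PCDATA.

    Assumes both token lists have the same length and identical tags;
    anything else is a tag mismatch and is out of scope for this sketch.
    """
    if len(wrapper_tokens) != len(sample_tokens):
        raise ValueError("token lists differ in length (tag mismatch)")
    result = []
    for w, s in zip(wrapper_tokens, sample_tokens):
        if w == s:
            result.append(w)
        elif not is_tag(w) and not is_tag(s):
            result.append("#PCDATA")  # two different strings: a data field
        else:
            raise ValueError("tag mismatch at %r vs %r" % (w, s))
    return result

wrapper = ["<b>", "John Smith", "</b>"]
sample = ["<b>", "Paul Jones", "</b>"]   # hypothetical second page
generalized = generalize_strings(wrapper, sample)
```

A token that is already `#PCDATA` in the wrapper stays `#PCDATA` when it meets any further string, so the function can be applied to each successive sample in turn.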

  7. Example (Cont.) Tag Mismatches: Discovering Optionals
     • First check to see if mismatch is caused by an iterator
     • If not, could be an optional field in wrapper or sample
     • Cross search used to determine possible optionals
     • Image field determined to be optional
     • (<img src=…/>)?

  8. Example (Cont.) Tag Mismatches: Discovering Optionals
     • First check to see if mismatch is caused by an iterator
     • If not, could be an optional field in wrapper or sample
     • Cross search used to determine possible optionals
     • Image field determined to be optional
     • (<img src=…/>)?
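One way to picture the cross search is the sketch below: at the mismatch position, look for the wrapper's token further along the sample, and for the sample's token further along the wrapper; whichever side has the extra tokens contributes the optional. This is a simplified illustration under assumed names (`add_optional`, the token lists and the image file name are hypothetical), and it skips the iterator check that the slide says comes first.

```python
def add_optional(wrapper, sample, i):
    """Generalize wrapper with an optional at mismatch position i.

    wrapper and sample are token lists that agree before index i.
    Returns the new wrapper, or None if neither cross search succeeds.
    """
    # Case 1: the sample has extra tokens. Find wrapper[i] later in the sample.
    if wrapper[i] in sample[i + 1:]:
        k = sample.index(wrapper[i], i + 1)
        optional = "(" + "".join(sample[i:k]) + ")?"
        return wrapper[:i] + [optional] + wrapper[i:]
    # Case 2: the wrapper has extra tokens. Find sample[i] later in the wrapper.
    if sample[i] in wrapper[i + 1:]:
        k = wrapper.index(sample[i], i + 1)
        optional = "(" + "".join(wrapper[i:k]) + ")?"
        return wrapper[:i] + [optional] + wrapper[k:]
    return None

# Hypothetical tokens: the sample carries an image the wrapper lacks.
wrapper = ["<b>", "#PCDATA", "</b>", "<ul>"]
sample = ["<b>", "Paul Jones", "</b>", "<img src='p.gif'/>", "<ul>"]
new_wrapper = add_optional(wrapper, sample, 3)
```

In this run the image token ends up wrapped as an optional in front of `<ul>`, mirroring the optional image field on the slide.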

  9. Example (Cont.) Tag Mismatches: Discovering Iterators
     • Assume mismatch is caused by repeated elements in a list
     • Match possible squares against earlier squares
     • Generalize the wrapper by finding all contiguous repeated occurrences
     • (<li><i>Title:</i>#PCDATA</li>)+
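The last step on this slide, collapsing contiguous repeated occurrences of a square into an iterator, can be sketched as follows. The sketch assumes the candidate square has already been identified and that string fields have already been generalized to #PCDATA; locating the square by matching against earlier occurrences is not shown. The function name is hypothetical.

```python
def collapse_repeats(tokens, square):
    """Replace each run of one or more contiguous copies of square with (square)+."""
    n = len(square)
    if n == 0:
        raise ValueError("square must not be empty")
    iterator = "(" + "".join(square) + ")+"
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + n] == square:
            # Skip over every contiguous copy of the square.
            while tokens[i:i + n] == square:
                i += n
            out.append(iterator)
        else:
            out.append(tokens[i])
            i += 1
    return out

square = ["<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>"]
page_with_three = ["<ul>"] + square * 3 + ["</ul>"]
page_with_one = ["<ul>"] + square + ["</ul>"]
collapsed = collapse_repeats(page_with_three, square)
```

Lists with one item and with three items collapse to the same token sequence, which is why a single wrapper with `(<li><i>Title:</i>#PCDATA</li>)+` covers pages with different numbers of titles.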

  10. Extracted Result

  11. Recursive Example

  12. Complexity

  13. Discussion
      • Assumptions
          • Pages are well-structured
          • Want to extract at the level of entire fields
          • Structure can be modeled without disjunctions
      • Search space for explaining mismatches is huge
          • Uses a number of heuristics to prune the space
              • Limited backtracking
              • Limit on the number of choices to explore
              • Patterns cannot be delimited by optionals
          • Will result in pruning possible wrappers

  14. Experimental Result

  15. Comparison with Other Works

  16. X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus can accommodate the required training data; ? means the system has the capability to some degree; * means the extraction pattern itself does not show the capability, but the overall system has it.
