
Grid Based Data Integration with Automatic Wrapper Generation

Xuan Zhang, Gagan Agrawal. Ohio State University.


Presentation Transcript


  1. Grid Based Data Integration with Automatic Wrapper Generation • Xuan Zhang, Gagan Agrawal • Ohio State University

  2. Overall Goal • Tools for data integration driven by: • Data explosion • Data size & number of data sources • New analysis tools and need for workflows • Autonomous resources • Heterogeneous data representation & various interfaces

  3. Motivation (Contd.) • Other Issues: • Frequent updates to data formats • Flat-file datasets • Ad-hoc sharing of data

  4. Current Approaches • Manually written wrappers • Problems: O(N^2) wrappers needed, and O(N) wrappers to rewrite for a single format update; portability of wrappers in a distributed environment • Mediator-based integration systems • Problems: need a common intermediate format; unnecessary data transformation • Integration using web/grid services • Needs all tools to be web services (all data in XML?)
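A rough count behind the O(N^2) claim, as a sketch only. It assumes every ordered pair of formats needs its own hand-written wrapper; the slide gives only the asymptotic bounds. W is the number of wrappers, U the number of wrappers to revise when one format changes, and D the number of layout descriptors under the one-descriptor-per-resource approach (W, U, D are names introduced here for illustration).

```latex
\[
W(N) = N(N-1) = O(N^2), \qquad
U(N) = 2(N-1) = O(N), \qquad
D(N) = N .
\]
% Example with N = 10 formats: W = 90 wrappers, U = 18 wrappers touched
% by one format update, versus D = 10 descriptors (one edited per update).
```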

  5. Our Approach • Automatically generate wrappers • One layout descriptor per resource • Stand-alone wrapper programs • For integrated DBs, (grid) workflow systems • Transform data in files of arbitrary formats • No domain- or format-specific heuristics • Layout information provided by users

  6. Our Approach (Contd.) • Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005) • Particularly attractive for • Data grid environments and workflows • Flat-file datasets • Ad hoc data sharing

  7. Our Approach: Advantages • No need to write wrappers while integrating data or creating workflows • Only one descriptor per resource needed • No unnecessary transformations / storage • New resources can be integrated on-the-fly

  8. Our Approach: Challenges • Description language • Format and logical view of data in flat files • Easy to interpret and write • Wrapper generation and execution • Correspondence between data items • Separating wrapper analysis and execution • Interactive tools for writing layout descriptors • What data mining techniques to use? (DILS 2005, BIBE 2005)

  9. Wrapper Generation System Overview • [Figure: system diagram. Components shown: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]

  10. Suitability for a Grid Environment • Wrapper analysis can be implemented as a grid service • Very low execution costs • Wrapper execution modules are task-independent • Only three modules need to be ported to different systems

  11. Assumptions for the Current Prototype • One dataset is tabular, the other semi-structured • Both datasets are stored record-wise • Order of records is not disturbed • Suitable for bioinformatics • [Figure: semi-structured dataset and tabular dataset]

  12. Layout Description Language • Goal • To describe data in arbitrary flat file format • Easy to interpret and write • Components: • Schema description • Layout description • Example: FASTA

  13. Layout Description Language: Key Observations on Data Layout • Strings of variable length • Delimiters widely used • Data fields divided into variables • Repetitive structures • Key tokens: “constant string”, LINESIZE, [optional], <repeating> • Example data: … >seq1 comment1 \n ASTPGHTIIYEAVCLHNDRTTIP \n >seq2 comment2 \n ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 …

  14. Layout Description Language • Example data: … >seq1 comment1 \n ASTPGHTIIYEAVCLHNDRTTIP \n >seq2 comment2 \n ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 … • Component I: Schema Description

      [FASTA]               // Schema name
      ID = string           // Data type definitions
      DESCRIPTION = string
      SEQ = string
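To make the schema description above concrete, here is a minimal sketch of how such a block might be read into a schema name plus a field-to-type table. It assumes only the syntax visible on the slide (a bracketed schema name, `NAME = type` lines, and `//` comments); the function name `parse_schema` is hypothetical and is not taken from the authors' system.

```python
from typing import Dict, Optional, Tuple


def parse_schema(text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Parse a schema description into (schema_name, {field: type})."""
    name: Optional[str] = None
    fields: Dict[str, str] = {}
    for raw in text.splitlines():
        # Drop a trailing '//' comment, then surrounding whitespace.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]  # e.g. "[FASTA]" -> "FASTA"
        else:
            field, _, type_name = line.partition("=")
            fields[field.strip()] = type_name.strip()
    return name, fields


SCHEMA_TEXT = """
[FASTA]               // Schema name
ID = string           // Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_fields = parse_schema(SCHEMA_TEXT)
```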

  15. Layout Description Language • Example data: … >seq1 comment1 \n ASTPGHTIIYEAVCLHNDRTTIP \n >seq2 comment2 \n ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 … • Component II: Layout Description

      …
      LOOP ENTRY 1:EOF:1 {
        “>” ID “ ” DESCRIPTION
        < “\n” SEQ >
        “\n” | EOF
      }
      …
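The layout description above reads as: each entry starts with ">", followed by ID, a space, DESCRIPTION, then one or more repeated sequence lines, ending at a newline or end of file. The sketch below is a hand-written reader for exactly that layout, meant only to illustrate what a generated wrapper would have to extract; it is not the authors' generated code, and `FastaEntry` and `read_fasta_entries` are hypothetical names.

```python
from typing import Iterator, List, NamedTuple, Optional


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: List[str]  # one item per repeated SEQ line


def read_fasta_entries(text: str) -> Iterator[FastaEntry]:
    """Yield one FastaEntry per '>'-delimited record in text."""
    entry: Optional[FastaEntry] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if entry is not None:
                yield entry  # previous record is complete
            # Header layout: ">" ID " " DESCRIPTION
            ident, _, desc = line[1:].partition(" ")
            entry = FastaEntry(ident, desc, [])
        elif entry is not None:
            entry.seq.append(line)  # repeated "\n" SEQ part
    if entry is not None:
        yield entry  # last record ends at EOF


sample = ">seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n>seq2 comment2\nASQKRPSQ\nNMYKDSHH\n"
entries = list(read_fasta_entries(sample))
```

Keeping the sequence as a list of lines mirrors the `< … >` repetition in the descriptor; joining the lines into one string would be an equally reasonable choice.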

  16. Mapping Cardinality • Source (TRANSFAC): … FA factor1_name … RA reference1.1_authors … RA reference1.2_authors … RA reference1.3_authors … • Target: Reference table • Field kinds: one-to-one data field, one-to-multiple data field
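One way to picture the cardinality issue is sketched below: a single FA value in a TRANSFAC entry is paired with each of the RA values that follow it, giving one reference-table row per RA line. The target row shape `(factor_name, authors)` and the use of `//` as an entry terminator are assumptions made for this illustration; the slide does not specify either, and `transfac_to_reference_rows` is a hypothetical name.

```python
from typing import Iterable, List, Optional, Tuple


def transfac_to_reference_rows(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair the current FA value with every following RA value."""
    rows: List[Tuple[str, str]] = []
    factor: Optional[str] = None
    for line in lines:
        tag, _, value = line.strip().partition(" ")
        value = value.strip()
        if tag == "FA":
            factor = value  # read once, reused for several rows
        elif tag == "RA" and factor is not None:
            rows.append((factor, value))  # one row per reference
        elif tag == "//":
            factor = None  # assumed end-of-entry marker
    return rows


entry_lines = [
    "FA  factor1_name",
    "RA  reference1.1_authors",
    "RA  reference1.2_authors",
    "RA  reference1.3_authors",
    "//",
]
reference_rows = transfac_to_reference_rows(entry_lines)
```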

  17. Analyzing Application • Goals - WRAPINFO • Summarize all application related information necessary for the wrapper • Represent the information in look-up tables and constant parameters • Represent the information in a platform-independent format, XML
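As a sketch of what "look-up tables and constant parameters in XML" could look like, the snippet below serializes two plain dictionaries with Python's standard `xml.etree.ElementTree`. The element and attribute names (`wrapinfo`, `constants`, `param`, `table`, `entry`) and the sample contents are invented for illustration; the actual WRAPINFO schema is not given on the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_wrapinfo(lookup_tables: Dict[str, Dict[str, str]],
                   constants: Dict[str, str]) -> str:
    """Serialize look-up tables and constant parameters to an XML string."""
    root = ET.Element("wrapinfo")  # hypothetical root element
    consts = ET.SubElement(root, "constants")
    for name, value in sorted(constants.items()):
        ET.SubElement(consts, "param", name=name, value=str(value))
    for table_name, mapping in sorted(lookup_tables.items()):
        table = ET.SubElement(root, "table", name=table_name)
        for key, val in sorted(mapping.items()):
            ET.SubElement(table, "entry", key=key, value=val)
    return ET.tostring(root, encoding="unicode")


wrapinfo_xml = build_wrapinfo(
    lookup_tables={"field_map": {"FA": "factor_name", "RA": "authors"}},
    constants={"entry_delimiter": "//"},
)
```

Because the result is plain XML text, an analysis step on one machine could hand it to execution modules on another without sharing code, which is the platform-independence point the slide makes.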

  18. Wrapper Generated • [Figure: wrapper architecture. Labels shown: Input dataset, Output dataset, Dataset buffer, Value buffer (one_to_one_values, one_to_multiple_values; FA, RA), DataReader, DataWriter, Synchronizer (run, load, halt)]

  19. Wrapper Generated • Suitable for data grid • Three general modules • DataReader • Extract one data field value • Write value to the value buffer if useful • DataWriter • Write one data field value • Remove value from list in the value buffer • Synchronizer • Switch between calling DataReader and DataWriter • Manage dataset buffer • Application specific information in WRAPINFO
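The division of labor among the three modules can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles only one-to-one fields, models the source as an already tokenized stream of `(field, value)` pairs, and omits the dataset buffer and WRAPINFO lookups. All class and function signatures here are assumptions.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set, Tuple

ValueBuffer = Dict[str, Deque[str]]


class DataReader:
    """Extracts one field value per call; buffers it only if useful."""

    def __init__(self, fields: Iterable[Tuple[str, str]],
                 wanted: Set[str], buffer: ValueBuffer) -> None:
        self._fields = iter(fields)
        self._wanted = wanted
        self._buffer = buffer

    def read_one(self) -> bool:
        try:
            name, value = next(self._fields)
        except StopIteration:
            return False  # input exhausted
        if name in self._wanted:
            self._buffer.setdefault(name, deque()).append(value)
        return True


class DataWriter:
    """Writes one field value per call and removes it from the buffer."""

    def __init__(self, target_fields: List[str], buffer: ValueBuffer,
                 out: List[Tuple[str, ...]]) -> None:
        self._target = target_fields
        self._buffer = buffer
        self._out = out
        self._row: List[str] = []
        self._pos = 0

    def write_one(self) -> bool:
        queue = self._buffer.get(self._target[self._pos])
        if not queue:
            return False  # needed value has not been read yet
        self._row.append(queue.popleft())
        self._pos += 1
        if self._pos == len(self._target):  # target record complete
            self._out.append(tuple(self._row))
            self._row, self._pos = [], 0
        return True


def synchronize(reader: DataReader, writer: DataWriter) -> None:
    """Switch between writer and reader until neither can proceed."""
    while writer.write_one() or reader.read_one():
        pass


source = [("ID", "seq1"), ("DESCRIPTION", "c1"), ("SEQ", "AST"),
          ("ID", "seq2"), ("DESCRIPTION", "c2"), ("SEQ", "ASQ")]
value_buffer: ValueBuffer = {}
output: List[Tuple[str, ...]] = []
synchronize(DataReader(source, {"ID", "SEQ"}, value_buffer),
            DataWriter(["ID", "SEQ"], value_buffer, output))
```

Nothing in the three classes refers to a particular format; the field names passed in at construction play the role that WRAPINFO plays on the slide.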

  20. Experimental Results • TRANSFAC-to-Reference Problem • [Figure: results plot; axes labeled “(in logarithm)”]

  21. Experimental Results SWISSPROT-to-FASTA Problem

  22. Summary • Automatically generated wrappers can perform well • Wrapper task analysis and wrapper execution can be separated • Key open questions: • How hard is it to write layout descriptors? • Can we make the process semi-automatic? • Data mining techniques seem quite promising (DILS 2005, BIBE 2005)
