1 / 26

A Tool for Supporting Integration Across Multiple Flat-File Datasets

A Tool for Supporting Integration Across Multiple Flat-File Datasets. Xuan Zhang, Gagan Agrawal Ohio State University. Outline. Motivation System Overview System Implementation Languages Query Execution Experiments. Motivation. Biological researches ask for

Download Presentation

A Tool for Supporting Integration Across Multiple Flat-File Datasets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Tool for Supporting Integration Across Multiple Flat-File Datasets Xuan Zhang, Gagan Agrawal Ohio State University

  2. Outline • Motivation • System Overview • System Implementation • Languages • Query Execution • Experiments

  3. Motivation • Biological researches ask for • Accessing multiple heterogeneous data sources • Lack of common data model, data format • Tracking multiple objects • A motivating example: protein sequence analysis

  4. An Example To predict protein function >unknown sequence. ……MCMFGSSVIECPNPRIWFVWPYEFPLFLLPGGDRMEI…… NCBI protein-protein BLAST service List of similar sequences Clustering Analysis (ClustalW, RiPE, etc)

  5. Current Solution • Manual • Copy-and-paste keyword search • Format conversion programs • NCBI link-out • Database • Load data (BLAST output, sequence database) • Parse input; Re-format output

  6. Our Approach • Join request between BLAST output and SWISSPROT (sequence database) • Data maintained in flat files • Query specification and data description are high-level, declarative • Data parsing and query processing are behind the scene

  7. Advantage • Retrieve multiple pieces of information all at once • Data easily available • Declarative languages only • High flexibility • Low over-head

  8. System Overview query Source/target names Query parser Metadata collection Dataset descriptors Descriptor parser mappings Schema & Layout information Application analyzer QUERYINFOR Source data files Target Data file DataReader DataWriter synchronizer

  9. Outline • Motivation • System Overview • System Implementation • Languages • Query Language • Metadata Description Language • System • Query Analysis • Query Execution • Experiments

  10. Query Language • Declarative, SQL-like • Projection, selection, cross product, join queries • Example AUTOWRAP POSTBLAST FROM BLASTP, SWISSPROT BY BLASTP.SP_ID = SWISSPROT.ID WHERE POSTBLAST.QUERY = BLASTP.QUERY POSTBLAST.SP_AC = BLASTP.SP_AC POSTBLAST.SP_ID = BLASTP.SP_ID POSTBLAST.FULL_DESCR = SWISSPROT.DE POSTBLAST.SEQUENCE = SWISSPORT.SQ POSTBLAST.SCORE = BLASTP.SCORE POSTBLAST.E_VALUE = BLASTP.E_VALUE Target dataset Source datasets Join criteria Attribute pairs

  11. Metadata Description Language • One descriptor for each flat file dataset. • Schema • Layout • Re-usable by different queries • Can be learned semi-automatically using data mining techniques • Example: BLAST output

  12. Schema Descriptors • Written in XML DTD format • Example <?xml version='1.0' encoding='UTF-8'?> <!ELEMENT BLASTP (QUERY, SEQUENCE*)> <!ELEMENT QUERY (#PCDATA)> <!ELEMENT SEQUENCE (SP_AC, SP_ID, DESCR, SCORE, E_VALUE)> <!ELEMENT SP_AC (#PCDATA)> <!ELEMENT SP_ID (#PCDATA)> <!ELEMENT DESCR (#PCDATA)> <!ELEMENT SCORE (#PCDATA)> <!ELEMENT E_VALUE (#PCDATA)>

  13. Layout Descriptors • Example DATASET "BLASTP" { DATATYPE {BLASTP} DATASPACELINESIZE = 90 { … … } DATA {data/Blast_htm.txt} } Dataset name Schema name File layout File location

  14. Layout descriptor Actual data file Description of File Layout "BLASTP" VERSION … … "Query=" QUERY "\nDatabase:" DB_NAME < "\nsp|" SP_AC "|" SP_ID " " DESCR " " SCORE " " E_VALUE > "\n\nALIGNMENT" DUMMY BLASTP2.2.11 [Jun-05-2005] Reference: … … RID: … … Query=Random 50 residue protein sequence. Database:Non-redundant SwissProt sequences 175,661 sequences; 64,716,374 total letters Score E Sequences producing significant alignments: (Bits) Value sp|P11884|AL1A1_SHEEPModification methylase MwoI (N-4 cytosin...30.01.5 sp|P00352|AL1A1_HUMANOxygen-independent coproporphyrinogen II...28.1 5.7 sp|P40530|YIE2_YEASTOxygen-independent coproporphyrinogen II...28.1 5.7 ALIGNMENTS >sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosine-specific … …

  15. Query Analysis query Source/target names Query parser Metadata collection Dataset descriptors Descriptor parser mappings Schema & Layout information Application analyzer Application analyzer QUERYINFOR Source data files Target Data file DataReader DataWriter synchronizer

  16. Terminology • DLM-VAR node/pair • a pairing of a delimiter and an attribute value • E.g "Query=" QUERY • Reach-ability • DLM-VAR node r is reachable from node a iff configuration “ar” is allowed by the layout description • Regular v.s Semi-structured Attribute • Regular: fixed number of values per entry • Semi-structured: various number of values per entry • Number v.s Index • Label for layout node v.s schema node • 1 index/number, 1+ number/index

  17. Application Analyzer • Label Schema and Layout Tree • Query analysis • Record layout information • Delimiter look-up table • Draw correspondence between schema and layout • Label look-up table • Collect constant values in query • Pseudo-label look-up table • Calculate reachable nodes • Reachable look-up table • Other information Parameters QUERYINFOR

  18. QUERY-PROC Structure • Three general action modules • DataReader • DataWriter • Synchronizer • One query-specific data module • QUERYINFOR QUERYINFOR Source 1 Target DataReader DataWriter Source 2 Synchronizer

  19. QUERY-PROC Structure (cont.) • One value buffer • Configuration vary from query to query • Accessible to three general modules QUERY SP_AC SP_ID BLASTP SCORE E_VALUE Source 1 Source 2 Regular Semi-structured

  20. QUERY-PROC Action • DataReader • Extract attribute value • Start: Delimiter look-up table • End: Reachable look-up table • Fill value buffer: Label look-up table • DataWriter • Retrieve from value buffer: Label look-up table • Write target file: Delimiter look-up table • Truncate or wrap: Reachable look-up table + label look-up table

  21. QUERY-PROC Action (cont.) • Synchronizer • Set up pseudo-attributes: Pseudo label look-up table • Call DataReader on source 1 and 2, Call DataWriter on target: Parameters • Test join conditions: Parameters • Clean value buffer: Parameters

  22. Outline Outline • Motivation • System Overview • System Implementation • Languages • System • Experiments

  23. Post-BLAST Query • Enhance BLAST output • Join query between BLAST output and SWISSPROT • Results in FASTA format • 2 modes • UNIQUE: halt once a match is found in source 2 • ALL: search all source 2 entries

  24. Chip-supplement Query • Look up microarray genes information • Join query between protein array and yeast genome database • Results in tabular form • 2 queries • Chip-Supplement: array join genome • Chip-Supplement-Sorted: genome join array

  25. OMIM-plus Query • Add reverse links of proteins to disease database • Join query between OMIM database and SWISSPROT database • Results in OMIM form • 86.38 seconds/entry * 12,158 OMIM entry = 291.7 hours

  26. Summary • A data integration tool • Answers query on flat-file datasets • Light-weighted • Modest programming efforts • No DBMS • Various flat file formats supported

More Related