
An XML-Framework to Handle Proprietary Data Records… and beyond


Presentation Transcript


  1. An XML-Framework to Handle Proprietary Data Records… and beyond
     Amarnath Gupta, Richard Marciano

  2. Earlier Collections & Desktop Records
     • E-mail: E-Mail Postings – SDSC → 1 Million records
     • Tiger92: Tiger/Line ’92 – Census → 51K records
     • 104th: 104th Congress Bills – House → 12K files
     • VAD97: 105th Congress Roll Call Votes – House → 1.3K files
     • EAP: Electronic Archive Project – NARA → 12K files
     • Vietnam: Combat Area Casualties Current File (CACCF) – NARA → 58K records
     • Patent: Patent Data – USPTO → 2 Million records
     • AMICO: Image Collection – CDL → 56K records
     • JITC: Joint Interoperability Test Command – Defense → 0.7K files
     • OFFICE AUTOMATION SUITE (Desktop files: MS Word, etc.)
     NARA Presentation 2001

  3. Approach Taken for JITC
     • “BLOBBING” Desktop Data
     • IBM DB2 database integrated with HPSS archival storage system
     • Metadata stored in columns of relational tables
     • Binary objects stored in columns as BLOBs
     • IMPLICATION: Word, PowerPoint, Excel data and others stored in a software-dependent format!
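The column layout described on this slide can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for DB2 and HPSS, and the table name, column names, and sample payload are all hypothetical.

```python
import sqlite3


def store_record(conn, name, fmt, payload):
    """Insert one desktop file: metadata in ordinary columns, raw bytes in a BLOB column."""
    conn.execute(
        "INSERT INTO records (name, format, size_bytes, content) VALUES (?, ?, ?, ?)",
        (name, fmt, len(payload), payload),
    )


# In-memory SQLite database standing in for the real DB2 instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT, format TEXT, size_bytes INTEGER,"
    " content BLOB)"
)

# Hypothetical payload standing in for the bytes of a Word file.
store_record(conn, "memo.doc", "MS Word", b"\xd0\xcf\x11\xe0 opaque binary")
row = conn.execute(
    "SELECT name, format, size_bytes, content FROM records"
).fetchone()
```

The metadata columns can be queried, but the `content` column is opaque to the database: reading it still requires the originating software, which is the implication the slide points out.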

  4. More Recent Collections & Desktop Records
     • Senate: 99 files – 46 MB
     • Y2K: 3,961 files – 337 MB
     • Census40: 20 files – 291 MB
     • Herbicide: 4 files – 8 MB
     • Challenger: 7 files – 37 MB
     • SORTIA: 2 files – 9 MB
     • CACCF-CACCH: 4 files – 21 MB
     • CACTA: 21 files – 760 MB
     • FoodSurvey: 45 files – 2 MB
     • Total: 4,163 files – 3.3 GB
     • 95% of all files = software-dependent desktop files

  5. Towards Software-Independence
     • Using XML as a systematic approach for wrapping information content from proprietary compound formats: examples of Word, Excel, PowerPoint (increasingly, these tools have XML export functions; if not, build an XML wrapper)
     • Towards building customizable Archivist Workbench tools for software-independent records
     • Generalizing the approach to other classes of records

  6. Simple Documents
     • What is a simple document?
       • All elements are text
       • May contain explicit syntactic structure (HTML tags, titles, sections, paragraphs …)
       • May contain hyperlinks
       • Document generating software may produce implicit, but extractable syntactic structure (“meta” tags, font and paragraph style tags, …)
     • Technologies to transform simple documents into meaningful XML structures exist
       • Grammar-based processors (Omnimark, Minerva …)
       • Customizable to produce reasonable syntactic wrappers
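As a rough illustration of a syntactic wrapper for a simple document, the sketch below treats the first text block as a title and the remaining blocks as paragraphs. It is not how Omnimark or Minerva work; the element names (`document`, `title`, `paragraph`) and the blank-line rule are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def wrap_simple_document(text):
    """Wrap plain text in XML: first non-empty block -> <title>, the rest -> <paragraph>."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    doc = ET.Element("document")
    if blocks:
        ET.SubElement(doc, "title").text = blocks[0]
    for block in blocks[1:]:
        # Collapse line wraps inside a paragraph into single spaces.
        ET.SubElement(doc, "paragraph").text = " ".join(block.split())
    return doc


doc = wrap_simple_document(
    "Status Report\n\nFirst paragraph,\nwrapped.\n\nSecond paragraph."
)
xml_text = ET.tostring(doc, encoding="unicode")
```

A real wrapper would be driven by a grammar for the source format rather than by blank lines, but the output has the same shape: explicit structure around content that was only implicitly structured.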

  7. Compound Documents
     • Documents that are primarily text, but may explicitly embed
       • Semi-text objects
         • Comments
         • Collaboration notes
         • Mathematical formulas
       • Media objects in a variety of formats
       • Vendor-specific digital objects
         • Spreadsheets
         • Presentation slides
         • Other documents
     • What is involved in preserving these documents for the long term?

  8. Why XML for Compound Documents
     • Create a Microsoft Word Document
     • Save it as a Web Page
       <html xmlns:v="urn:schemas-microsoft-com:vml"
             xmlns:o="urn:schemas-microsoft-com:office:office"
             xmlns:w="urn:schemas-microsoft-com:office:word"
             xmlns="http://www.w3.org/TR/REC-html40">
       <head>
       …
       <link rel=File-List href="./BIRN_files/filelist.xml">
       …
       <title>INTRODUCTION TO THIS REVISED SUBMISSION</title>
       <!--[if gte mso 9]><xml>
       <o:DocumentProperties>
       <o:Author>MGH-NMR Center</o:Author>
       <o:Template>Normal</o:Template>
       <o:LastAuthor>Amarnath Gupta</o:LastAuthor>
       <o:Revision>2</o:Revision>
       …
       <w:WordDocument>

  9. Why XML for Compound Documents
       <body lang=EN-US style='tab-interval:.5in'>
       <div class=Section1>
       …
       <p class=MsoNormal><b>Sub-Aim X.1</b> <a style='mso-comment-reference:AG_1'>Characterize &amp; correct for spatial distortion in image data</a><![if !supportAnnotations]>
       …
       <p class=MsoCommentText><!--[if supportFields]><ins cite="mailto:Amarnath%20Gupta" datetime="2001-03-28T22:17">
       …
       <a href="#_msoanchor_1" class=msocomoff>[AG1]</a><![endif]></span></ins></span></span><ins cite="mailto:Amarnath%20Gupta" datetime="2001-03-28T22:17"> Mark, do you think this can be used as a service?</ins></p>
     Similarly,
       <p class=MsoNormal style='text-indent:.5in'><b>Aim Y.</b> To develop <ins cite="mailto:Amarnath%20Gupta" datetime="2001-03-28T22:20">XML-based </ins>data formats and specifications to facilitate interoperability of software tools, and sharing of data and computational resources across the research network.</p>
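The `<ins cite=… datetime=…>` elements in the sample above carry who inserted text and when, so they can be pulled out mechanically. The following is a minimal sketch using Python's standard `html.parser` (Word's saved HTML is not well-formed XML, so an XML parser would likely reject it). The class name and the record layout are assumptions for illustration, and the input is a shortened copy of the "Aim Y." paragraph.

```python
from html.parser import HTMLParser


class InsertionCollector(HTMLParser):
    """Collect the text inside <ins> elements with their cite/datetime attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth of currently open <ins> elements
        self.current = None   # record being filled while inside an <ins>
        self.insertions = []

    def handle_starttag(self, tag, attrs):
        if tag == "ins":
            if self.depth == 0:
                a = dict(attrs)
                self.current = {
                    "cite": a.get("cite"),
                    "datetime": a.get("datetime"),
                    "text": "",
                }
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "ins" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.insertions.append(self.current)
                self.current = None

    def handle_data(self, data):
        if self.depth > 0:
            self.current["text"] += data


sample = (
    "<p class=MsoNormal><b>Aim Y.</b> To develop "
    '<ins cite="mailto:Amarnath%20Gupta" datetime="2001-03-28T22:20">XML-based </ins>'
    "data formats</p>"
)
collector = InsertionCollector()
collector.feed(sample)
collector.close()
insertions = collector.insertions
```

Here `insertions` holds one record with the author URI, the timestamp, and the inserted text "XML-based ", which is the kind of collaboration metadata that is hidden inside the binary Word format.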

  10. Why XML for Compound Documents
       <html xmlns:v="urn:schemas-microsoft-com:vml"
             xmlns:o="urn:schemas-microsoft-com:office:office"
             xmlns:x="urn:schemas-microsoft-com:office:excel"
             xmlns="http://www.w3.org/TR/REC-html40">
       …
       <link rel=File-List href="./Page_files/filelist.xml">
       <div id="Sample spreadsheet_23102" align=center x:publishsource="Excel">
       <table x:str border=0 cellpadding=0 cellspacing=0 width=857 style='border-collapse: collapse;table-layout:fixed;width:643pt'>
       …
       <tr height=17 style='height:12.75pt'>
         <td height=17 class=xl1523102 colspan=2 style='height:12.75pt'>Research</td>
         <td class=xl1523102 align=right x:num>5</td>
         <td class=xl1523102></td>
         <td class=xl2423102 align=right x:num="750000">$750,000.00</td>
       </tr>
       <tr height=17 style='height:12.75pt'>
         <td height=17 class=xl2523102 style='height:12.75pt'>Total</td>
         <td class=xl1523102></td>
         <td class=xl1523102 align=right x:num x:fmla="=SUM(C5:C8)">38</td>
         <td class=xl1523102></td>
         <td class=xl2423102 align=right x:num="4275000" x:fmla="=SUM(E5:E8)">$4,275,000.00</td>
       </tr>
       <td><![endif]><img width=451 height=275 src="./Page_files/Sample%20spreadsheet_23102_image001.gif" v:shapes="_x0000_s1025" class=shape v:dpi="96"><![if !vml]></td>
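In the Excel sample above, each `<td>` keeps the displayed text together with `x:num` and `x:fmla` attributes. A small sketch of recovering those from the saved page is shown below, again with Python's standard `html.parser`. The class name and field names are hypothetical, and reading `x:num="…"` as the raw numeric value and `x:fmla` as the cell formula is an interpretation of this sample, not a statement about the full export format.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect displayed text, x:num value, and x:fmla formula for each <td>."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._cell = None  # cell record being filled while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            a = dict(attrs)
            self._cell = {
                "display": "",
                "is_numeric": "x:num" in a,   # attribute may appear without a value
                "raw_value": a.get("x:num"),  # None when x:num has no value
                "formula": a.get("x:fmla"),
            }

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.cells.append(self._cell)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["display"] += data


sample = (
    '<tr><td class=xl1523102 align=right x:num x:fmla="=SUM(C5:C8)">38</td>'
    '<td class=xl2423102 align=right x:num="4275000" x:fmla="=SUM(E5:E8)">'
    "$4,275,000.00</td></tr>"
)
collector = CellCollector()
collector.feed(sample)
collector.close()
cells = collector.cells
```

The second cell shows why this matters for preservation: the formatted string "$4,275,000.00", the underlying value 4275000, and the formula that produced it are all available as text.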

  11. Equivalently
     • A Web (graph) of components
       <p stylename="Normal" align="left" fontsize="24">
         <string fontsize="24">
           <shppict line_space="4"> </>
           <nonshppict />
         </string>
       </p>
     Image_Index_Table
     XML-1
       <picture>
         <Word_picture version="97-2000"/>
         <begin_picture>
         <picture_properties shape_id="1025">
           <shape_property shapeType="75"/>
           …
         </picture_properties>
         <picture scalex="100" scaley="100" …
         <data> </data>
     Image Data
       89504e470d0a1a0a0000000d49484452000000f310000003008030000008f1e628b00000300504c5445c0c0c0ff9900fe9901fd9901
     XML-2
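The hex string shown as image data can be checked directly: its first eight bytes are the standard PNG file signature, followed by a 13-byte `IHDR` chunk header. The sketch below decodes only the first 16 bytes of the string from the slide; the function name is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def inspect_png_prefix(hex_text):
    """Decode hex text and report whether it starts like a PNG, plus the first chunk header."""
    data = bytes.fromhex(hex_text)
    is_png = data.startswith(PNG_SIGNATURE)
    first_chunk = None
    if is_png and len(data) >= 16:
        # After the signature: 4-byte big-endian chunk length, then 4-byte chunk type.
        (length,) = struct.unpack(">I", data[8:12])
        first_chunk = (data[12:16].decode("ascii"), length)
    return is_png, first_chunk


# First 16 bytes of the image data shown on the slide.
prefix = "89504e470d0a1a0a0000000d49484452"
result = inspect_png_prefix(prefix)
```

Because the embedded object is stored as a separately identifiable component in a standard format, it can be validated or rendered without the software that created the enclosing document.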

  12. What XML Buys Us
     • An Abstract Model
       • Consider a compound document to be a tree (or graph) of components
       • Each component has a uniform type
       • At the point in the document where an embedded (child) object branches off, create a dummy object of the same type as the parent document
       • Annotate the edges of the tree with the type and destination of the following node (e.g., the image data object may be separately stored)
       • If multiple objects share the same embedded objects, connect them by reference
       • If needed, create additional indices for specific elements
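One possible reading of this abstract model is sketched below. All class names, field names, and storage paths are hypothetical: every node is a `Component`, a placeholder node marks where an embedded object branches off, the `Edge` annotation records the type and destination of the embedded object, and a second request for the same destination returns the same placeholder so that sharing is by reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Edge:
    """Edge annotation: type and storage destination of the following node."""
    target_type: str
    destination: str


@dataclass
class Component:
    """Uniform node type used for the document, its parts, and placeholders."""
    ident: str
    text: str = ""
    children: List["Component"] = field(default_factory=list)
    edge: Optional[Edge] = None  # set only on placeholder (dummy) nodes


class CompoundDocument:
    def __init__(self, root: Component) -> None:
        self.root = root
        self._shared: Dict[str, Component] = {}  # destination -> placeholder

    def embed(self, parent: Component, target_type: str, destination: str) -> Component:
        """Attach a placeholder at the point where an embedded object branches off."""
        placeholder = self._shared.get(destination)
        if placeholder is None:
            placeholder = Component(
                ident=f"ref:{destination}",
                edge=Edge(target_type, destination),
            )
            self._shared[destination] = placeholder
        # The same placeholder object is reused, so sharing is by reference.
        parent.children.append(placeholder)
        return placeholder

    def index_by_type(self) -> Dict[str, List[str]]:
        """Additional index: embedded-object type -> storage destinations."""
        index: Dict[str, List[str]] = {}
        for placeholder in self._shared.values():
            index.setdefault(placeholder.edge.target_type, []).append(
                placeholder.edge.destination
            )
        return index


root = Component("doc", "report")
sec1 = Component("sec1", "Introduction")
sec2 = Component("sec2", "Results")
root.children += [sec1, sec2]

doc = CompoundDocument(root)
a = doc.embed(sec1, "image/png", "store/image001.bin")
b = doc.embed(sec2, "image/png", "store/image001.bin")  # same image, shared
doc.embed(sec2, "spreadsheet", "store/sheet1.xml")
index = doc.index_by_type()
```

In this sketch `a` and `b` are the same object, and `index` maps each embedded type to the places where its data is stored, which corresponds to the optional indices mentioned in the last bullet.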

  13. Compound Documents vs. Websites
     An Abstract Model for the Roosevelt Presidential Library Web Site
     [Figure 1. Information structure for mapping temporal data to concept spaces.]

  14. Infrastructure Independence: Back to the Basics in Computer Science?
     • An infrastructure-independent representation of a compound document is:
       • A complex-object data model
       • A set of persistent
         • Container data structures
         • Index structures
       • Identification of the components
       • Data extraction methods and rules that need to be persistent
       • Retrieval techniques for expressive and efficient access to compound documents through the data structures above

  15. Enabling Sciences and Education: DLESE, NSDL
     • DLESE is conceived as an information system dedicated to the collection, enhancement, and distribution of materials that facilitate learning about the Earth system at all educational levels.
       • collections of high-quality materials for instruction
       • access to Earth data sets
       • discovery and distribution systems to efficiently find and use materials
       • services to help users most effectively create and use materials
       • communication networks to facilitate interactions and collaborations
