Integration of mzXML and mzData Formats: Reference Implementation of Open-Source MS Data Interchange Conversion Software
Joshua M. Tasman1, Eric W. Deutsch1, James S. Eddes1, David D. Shteynberg1, Patrick G. A. Pedrioli2, Jimmy K. Eng1, Ruedi Aebersold1
1Institute for Systems Biology, Seattle, WA; 2Institute for Molecular Systems Biology (ETH), Zurich, Switzerland
Integration with Existing Proteomics Pipeline
This poster presents work on converting raw data to XML formats. Once these XML files are available, only slight modifications to the existing open-source Trans-proteomic Pipeline tools are necessary: the tools rely on common parsers, RAMP (C++) and JRAP (Java), which can be extended to support the new dataXML file format.
Although the project began as an update of existing C++ code, C# became the language of choice for several reasons. First, C# has stronger built-in support for safety features such as garbage collection and array bounds checking. Second, C# eases the task of working with third-party code: .NET assemblies can be incorporated directly, and for older interfaces such as COM components and native DLLs, tools provided with the Microsoft IDE can auto-generate bridging code that makes them accessible from C#.
The Thermo RAW file format was chosen for the initial implementation simply because of familiarity with its application programming interface. The availability, or lack, of vendor support is the greatest issue facing expansion of the project.
Because API styles differ greatly between vendors, fellow developers have raised questions about the efficiency of the adapter design pattern used in this project.
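The adapter pattern mentioned above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the project's C#, and every name in it (Scan, RawFileAdapter, FakeVendorApi, FakeVendorAdapter, summarize) is hypothetical: the vendor class is a stand-in with an invented call style, not any real vendor library, and the code is not taken from the project.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Scan:
    """Vendor-neutral scan record (hypothetical minimal field set)."""
    number: int
    ms_level: int
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs


class RawFileAdapter(ABC):
    """Common interface that the converter core would program against."""

    @abstractmethod
    def scans(self) -> Iterator[Scan]:
        """Yield every scan in the file in a vendor-neutral form."""


class FakeVendorApi:
    """Stand-in for a vendor library (assumption: not a real API).

    Peaks are returned as one flat list: [mz0, int0, mz1, int1, ...].
    """

    def __init__(self) -> None:
        self._data = {
            1: (1, [400.2, 1500.0, 401.2, 300.0]),
            2: (2, [175.1, 80.0]),
        }

    def scan_count(self) -> int:
        return len(self._data)

    def get_level(self, n: int) -> int:
        return self._data[n][0]

    def get_flat_peaks(self, n: int) -> List[float]:
        return self._data[n][1]


class FakeVendorAdapter(RawFileAdapter):
    """Translates the vendor-specific call style into the common interface."""

    def __init__(self, api: FakeVendorApi) -> None:
        self._api = api

    def scans(self) -> Iterator[Scan]:
        for n in range(1, self._api.scan_count() + 1):
            flat = self._api.get_flat_peaks(n)
            # Pair even-indexed m/z values with odd-indexed intensities.
            peaks = list(zip(flat[0::2], flat[1::2]))
            yield Scan(n, self._api.get_level(n), peaks)


def summarize(adapter: RawFileAdapter) -> List[str]:
    """Format-independent code: it only sees the RawFileAdapter interface."""
    return [
        f"scan {s.number}: MS{s.ms_level}, {len(s.peaks)} peaks"
        for s in adapter.scans()
    ]
```

In this arrangement, supporting another vendor would mean writing one more adapter class while the output writers stay unchanged. The efficiency concern is also visible here: each scan passes through an extra translation step before it reaches the writer.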
We present a prototype open-source framework for converting vendor-specific raw MS/MS data files to open XML formats. Both the mzXML format (developed by SPC/ISB) and the PSI consortium’s dataXML format are target outputs. Currently the converter accepts Thermo's RAW data format, but the project is designed to be extensible to other input formats. The dataXML format is still in flux but is nearing final ratification. Once it is ratified, minor modifications to some supporting programs will allow data converted to dataXML to be supported by the rest of the SPC/ISB's open-source Trans-proteomic Pipeline toolchain.
Existing TPP Pipeline Tools:
Web display/Interaction, Quantitation, etc.
* Implementation of additional input formats
* Additional vendor support: As vendors become more open with their APIs for accessing raw data, implementation of projects like this one can proceed much more easily. Additionally, thorough documentation can allow
* Cross-platform support: If vendors move towards software libraries that operate entirely within the .NET framework, and allow the required libraries to be copied, the code could be executed on Linux and Mac OS X platforms.
Driven in large part by recent rapid advances in proteomics, the need for a vendor-independent means of accurate and robust representation and exchange of mass spectrometry data has become apparent. Two major formats have emerged: mzXML, developed at the Institute for Systems Biology (ISB) and highly integrated into the Trans-proteomic Pipeline (TPP) software tool chain, and mzData, developed by the HUPO Proteomics Standards Initiative (PSI) MS working group. Both the proteomics research community and instrument vendors would clearly benefit from a single standard. Recently, the PSI-MS group, the ISB, and instrument vendors collaborated to produce a draft specification for a unified data format, tentatively titled "dataXML", with the intention of combining the best features of the mzXML and mzData formats. For example, the dataXML format allows additional information not encoded in the XML schema to be included in the file through the use of supplemental controlled vocabularies. Here, we present work towards an open-source reference implementation for converters from raw data to both the mzXML and dataXML formats, which could be extended to other formats as well.
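The controlled-vocabulary mechanism can be sketched as follows. This is an illustration only: the element name cvParam, its attribute names, the spectrum wrapper element, the helper add_cv_param, and the accession EX:0000001 are all assumptions made for this example. They are not taken from the dataXML draft, which was still in flux, and the accession is a placeholder, not a real vocabulary term.

```python
import xml.etree.ElementTree as ET


def add_cv_param(parent, cv_label, accession, name, value=""):
    """Attach one controlled-vocabulary parameter to an element.

    The element and attribute names are hypothetical. The point is that
    the schema fixes only this generic shape, while the meaning of each
    parameter comes from an external vocabulary identified by accession.
    """
    return ET.SubElement(
        parent,
        "cvParam",
        {
            "cvLabel": cv_label,
            "accession": accession,
            "name": name,
            "value": value,
        },
    )


# Hypothetical spectrum element carrying a term the schema does not define.
spectrum = ET.Element("spectrum", {"id": "1"})
add_cv_param(spectrum, "EX", "EX:0000001", "example instrument setting", "42")

xml_text = ET.tostring(spectrum, encoding="unicode")
print(xml_text)
```

A reader that does not recognize a given accession can still parse the file, because the parameter is ordinary well-formed XML; new terms would then require a vocabulary update rather than a schema change.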
Software will be available at
mzXML: A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol. 2004 Nov;22(11):1459-66.
PSI-MS: Mass Spectrometry Standards Working Group http://psidev.info/