SCD Research Data For UCAR Data Management Working Group January 10, 2001 Steven Worley Scientific Computing Division Data Support Section
Four Categories of Data Service • User Profile • Data Content • Data Access
Four Categories of Data Service • Archives directly from the MSS • Accessible to all with NCAR computing accounts • Web accessible online data server • Information interface for all data • Individual requests • Customized on per request basis • Data preparation for large projects • E.g. Reanalyses at ECMWF and NCEP
User profile, online data server • Users based on network address domain, data for 1995-1998 • ~ 20K unique addresses per year
User profile, individual request • Requests excluding CD-ROMS • Based on 1998-1999 data • 28% U.S. Univ. (179 of 638) • 11% Foreign Univ. (69) • 27% Foreign Non-Univ. (171) • 34% U.S. Gov. and Commercial (219) (remarkably, some foreign and government sources find it desirable to acquire their own data from SCD/DSS)
User profile, all users All users by year, excluding online category
User profile, finding the data • Peer and colleague recommendations • Acknowledgements in publications • WWW searches and perusing
Quick look at DSS Information Interface • Website, dss.ucar.edu • Top level information and dataset groupings • Oceanographic datasets by Category
Important improvements for the Information Interface • More top level documents to guide users to the “best” datasets • For improved searches • Carefully worded .html <title> .. </title> • Pages with introductory text that clearly defines the dataset with keywords that promote discovery. • .html <meta tag>..</meta tag>, note, not all search engines boost ranking based on these.
User profile, compliments • Fast service, requests receive prompt action. • Staff with scientific knowledge to offer assistance and guidance. • Flexible system – can adapt to meet users requirements.
What makes this system work • The data records and files remain in simple structures • This way the archive should always be accessible to programs written with low level languages • The data can survive evolutions in OS systems and software, 50-years is not too much. • Programs can be written that allow fast and efficient manipulation of large collections. • Internal checksum keys can be strategically placed to insure data integrity – at any level.
User profile, complaints • All the data is not online – even though this quite impractical – 12+ TB • All the data is not in their favorite format, IDL, HDF, netCDF, GrIB, ASCII, GIS, Binary, .xls , Matlab, etc. • “Can I just get the piece I need?” • “Do you mean I need to know some FORTRAN or C Language?”
User Profile, skill set • Best skill set for our users includes knowing some FORTRAN and/or C. • Trend; more and more people are requesting data in application environment specific formats • Will the next generation scientist know a basic computing language?
Data Content, size and characteristic • Veritable smorgasbord of data. • Overall size, 12+ TB • 500+ distinct datasets • Many historical observations from the atmosphere, and ocean • Many operational analyses and reanalyses • Dataset sizes, < 1 MB to several TB • Many original formats. GrIB is dominate in our analyses and reanalyses datasets
Data Content, metadata management • Primarily, metadata is managed on our online information server. Each dataset has a WWW page. • All dataset WWW pages are automatically formed. • Corrections, addition, and changes are made to text files manipulated under a Unix change and control system. • Advantage: history of all changes and data files associated with the dataset, and the WWW pages are always current.
Data Content, metadata management • Have considerable amounts of hard copy references and metadata. - We are making scanned images of these now.
Data Content, long term archive and security • Small datasets and irreplaceable observations and analyses have two copies on the MSS • Although we cannot guarantee they reside on separate cartridges • Files are write password protected – prevents accidental overwrites. • We have been fortunate to have a very reliable MSS and our success will continue to rely on it in the future.
Data Content, long term archive and security • Areas of concern • We don’t have adequate offsite backups • At least critical observations should be protected from catastrophe at the Mesa Lab • In the event of loss of single copy large datasets we rely on other centers for replacement • This needs to be discussed more nationally • Redistribution may have restrictions or be costly
Data Content, long term archive and security • Areas of concern, continued • Must always remain on guard so important data are not lost due to short sighted policy decisions. • Must participate in national and international projects so that the archive content is continually refreshed with the most scientifically important data, at low cost.
Data Access, aids to access • Maintain FORTRAN code to read all data files • Sometimes for many platforms (Unix, PC) • The MSS file location is defined for all datasets, and is available online. • Staff specialist are assigned and identified for each dataset
Data Access, most frequent • NCEP/NCAR Global Atmospheric Reanalysis, 2.6 TB • How? • MSS • WWW (monthly means) • CDROMS • FTP • Various Tape Media (large capacity)
Data Access, largest barrier • Discovering what is available • Gaining access to the MSS collection (when they don’t have a computing account) • Not having experience with low level languages, e.g. FORTRAN and C/C++
Data Access, product development • Yes we do, and we feel it is very important! • Why? • Can QC the data and identify problems early • Can reorganize into logical collection, or create popular subsets. • Reduce the volume of large collections to manageable size for users • Saves many users extra work
Data Access, improvements for scientific advancement • Minimize the barriers that inhibit discovery – metadata problem. • Supply the data in the users favorite format or provide tools that can convert the data where it is practical and efficient. • Place more data, and valuable higher level data products on line