Research Data Storage WG

Presentation Transcript


  1. Research Data Storage WG (RDSWG): Initial visits to stakeholder groups, March/April 2010

  2. Remit

  3. The question at the highest level: What could be done (in the context of storage and data management) in respect of services, facilities or policies that would help your research be better, more competitive and more successful?

  4. Group members - Peter Clarke (Chair) - Jean Ritchie (Deputy Chair) - Chris Adie (IS, CSE ALD) - Abdul Majhoti (IS, CHSS ALD) - Marshall Dozier (IS, CMVM ALD) - Tony Weir (IS) - Sevi Rodriguez (CHSS) - David Reimer (CHSS) - Colin Higgs (CSE) - Paul Palmer (CSE) - David Perry (CMVM) - Mayank Dutia (CMVM).

  5. Foreseen way of working
     • Capture initial knowledge
       • From WG & documents
       • Capture on PowerPoint
     • Visit initial set of college groups to solicit views
     • Survey peer institutions
     • Other external documents
     • Draft document
     • Visit other groups (to be determined)
     • Produce draft strategy
     • Solicit comments on draft
     • Open forum
     • Produce final strategy

  6. Groups to visit
     • Initial set
       • CHSS: CSC & CCPAG
       • CMVM: IT WG
       • CSE: CC&IT, CCPAG
     • Second set
       • College Research Committees
       • ITPF
       • ECDF users
       • DCC
       • EDINA
       • Roslin
       • Edinburgh College of Art
       • ????

  7. Where are we working in RDSWG?
     • “Backup service”, “Research Dataset curation Service”, ...
     • NFS, AFS, iSCSI, Samba, etc.
     • Disk, tape, terabytes, gigabytes, etc.

  8. “Axes” (not orthogonal or independent or complete yet)
     * Volume of research data to be stored (kilobytes ⇔ terabytes)
     * Cost (cheap commodity ⇔ high quality, more expensive)
     * Lifetime of data (weeks, months, years, forever...)
     * Primary repository (must be robust) ⇔ scratch (can easily reproduce if data lost)
     * Not backed up ⇔ backed up
     * Line-proximity (online disk ⇔ tape)
     * Physical-proximity (defined by bandwidth*latency from storage to compute)
     * Visibility level (public access outside of UOE ⇔ limited to specific users)
     * Security level (must be very secure ⇔ don't really care if it's copied)
     * Access control flexibility (highly central ⇔ fully devolved to groups)
     * How presented to user, e.g. as a Windows folder, as NAS, abstract service ???
     * Value (£) of data, irreplaceability of data

  9. Picked up so far: The following are items picked up so far which we feel are LIKELY to be important for research storage. Some of these have emerged as being popular in the CSE and CMVM visits. They are arranged roughly in order of popularity.

  10. Archive Services
     • To preserve research data and comply with funding agency requirements
     • Preserving work of departing staff
     • When you just don't know ...
     Federated Hierarchical Storage Management (FHSM) system (across School, College and Centre)
     • Federating local storage with central storage in a hierarchical manner
     • Local level, sometimes in groups or schools
     • Secondary level, perhaps at college or central level
     • Tertiary level of archiving
     • Includes backup as part of functionality
     • Local (IT manager, PI?) management of tagging data appropriately
     Agile access control
     • Flexible access control system needed; this cannot and should not be centralised
     • Delegated authority as far as is appropriate, down to research group level

  11. Cost is very important. A recurrent theme is that many users would like to use a more centrally provided file store, but the headline rates currently quoted for the IS-provided SAN services are often considered far too high. This leads to ad-hoc local solutions which appear to be cheaper at the point of use. University policy should mitigate the practicalities which continue to drive ad-hoc and incoherent solutions. Cost models should be developed which recognise these more subjective practicalities.

  12. A Network File Store available from wherever you are (e.g. AFS, NFS4.., Dropbox)
     • Desktop
     • Laptop (including when remote)
     • Home
     • Central machines
     • ECDF
     Multi-platform: must be available to Windows desktops, Unix desktops, laptops (including Macs) and compute clusters.
     Simple/automated synchronisation of laptops when there is a network connection
     • More and more crucial work is done “on the move” from laptops; this will only increase.
     • Research data also includes papers and documents being prepared.
     • Security of important laptop work is therefore very important.
     • Needs synchronisation of files on the laptop with some other system (e.g. university file system).

  13. Policy & Services to make extra-institutional collaboration as easy as possible
     • Probably the majority of high-profile research is conducted in collaborations which span institutions and countries.
     • Accessibility of data to all collaborators, regardless of where it is physically stored, is a high-priority requirement.
     • This means more use of central authentication.
     Networks of knowledge
     • Some groups are obliged to keep data locally, but would benefit from advice and expertise.
     • The combined knowledge of best practice in the University is huge.
     • Need to make this available to all, whether using central storage or building local storage.
     • This should be a first-class service.
     Network of knowledge for use of databases in research
     • Many areas of research depend upon databases.
     • Currently much use is very simplistic and inefficient.
     • Harnessing the University's knowledge here would be very helpful.

  14. Large range of file sizes to be dealt with
     • It is very important to realise that not only are there requirements for very “large” datasets (many tens of gigabytes), but that some research areas also have a requirement at the extreme opposite end of the scale, namely a very large number of very small files (MRI imaging, for example).
     Auditable levels of security and integrity
     • Some data must be subject to a high level of security against unauthorised access.
     • Some grants require formal “audit” of integrity.
     Some data is irreplaceable
     • E.g. large population study medical data used in long-term development studies cannot be regenerated and has great value for a long time.
     Storage for instrument/service generated data
     • Everyone is familiar with user-generated data, but there is a very different category of service/device/instrument generated data.
     • The file sizes can vary, as can their quantity, the scale often being much greater than for user-generated data.
     • It is normal for these devices to be an integral part of the research infrastructure and to be critical for research.

  15. We need to know what is important to your research. Please comment/augment:
     • against the axes listed;
     • on the “picked up so far” items;
     • what to add?
     • priorities?
     • in any other way you think is appropriate?
     Over to you...
